David I. Warton
Eco-Stats: Data Analysis in Ecology
From t-tests to Multivariate Abundances
Methods in Statistical Ecology
Series Editors
Andrew P. Robinson, Melbourne, VIC, Australia
Stephen T. Buckland, University of St. Andrews, St. Andrews, UK
Peter Reich, Dept of Forest Resources, University of Minnesota, St. Paul, USA
Michael McCarthy, School of Botany, University of Melbourne, Parkville, Australia
This new series in statistical ecology is designed to cover diverse topics in emerging
interdisciplinary research. The emphasis is on specific statistical methodologies
utilized in burgeoning areas of quantitative ecology. The series focuses primarily on
monographs from leading researchers.
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland
AG 2022
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
The core idea in this book is to recognise that most statistical methods you use can
be understood under a single framework, as special cases of (generalised) linear
models—including linear regression, t-tests, ANOVA, ANCOVA, logistic regres-
sion, chi-squared tests, and much more. Learning these methods in a systematic way,
instead of as a “cookbook” of different methods, enables a systematic approach to
key steps in analysis (like assumption checking) and an extension to handle more
complex situations (e.g. random factors, multivariate analysis, choosing between a
set of competing models). A few simplifications have been made along the way to
avoid “dead-end” ideas that don’t add to this broader narrative.
This book won’t teach readers everything they need to know about data analysis in
ecology—obviously, no book could—but there is a specific focus on developing the
tools needed to understand modern developments in multivariate analysis. The way
multivariate analysis is approached in ecology has been undergoing a paradigm shift
of late, towards the use of statistical models to answer research questions (a notion
already well established in most fields, but previously difficult to apply in many areas
of ecology). The first half of the book (Parts I–II) provides a general platform that
should be useful to any reader wanting to learn some stats, thereafter (Part III) it
focuses on the challenging problem of analysing multivariate abundances.
This book started out as notes for a couple of intensive short courses for ecologists,
each taught across just one week (or less!). OK, those two courses don’t quite cover
all of the material in this book, and they are super intensive, but the emphasis on
a few key principles makes it possible to cover a lot of ground, and importantly,
modern statistics can be seen as a cohesive approach to analysis rather than being
misunderstood as an eclectic cookbook of different recipes.
The book is punctuated with exercises and examples to help readers check how
they are going and to motivate particular chapters or sections. All correspond to
real people, projects, and datasets. Most come from colleagues doing interesting
work or problems I encountered consulting; some are gems I found in the literature.
Where project outcomes were published, a reference has been included. Solutions to
First I’d like to thank those who permitted the use of their datasets in this text—these
were invaluable in motivating and illustrating the analysis methods described in this
work.
I’d like to thank those who helped me get to the point where I was writing a book on
stats for ecologists, including my mother, for encouraging me to do whatever I liked
for a career; my lecturers at the University of Sydney and Macquarie University
for teaching me stats and ecology; my mentors who have helped me develop as
a researcher—most notably Mark Westoby, William Dunsmuir, Malcolm Hudson,
Matt Wand, Glenda Wardle, and Neville Weber; my workplace supervisors at UNSW,
who have given me a long leash; the Australian Research Council, which provided
financial support throughout much of my career to date; and my collaborators and
research team, the UNSW Eco-Stats research group. This team is my “work family”;
they are a pleasure to work with and I have learnt much from them and with them.
There are a bunch of original ideas in this work, in the later chapters, which the
Eco-Stats group have been integral to developing.
Thanks to everyone who has looked over part of this book and offered advice
on different sections, including Wesley Brooks, Elliot Dovers, Daniel Falster, Rob
Freckleton, Francis Hui, Michelle Lim, Mitch Lyons, Ben Maslen, Sam Mason,
Maeve McGillycuddy, Robert Nguyen, Eve Slavich, Jakub Stoklosa, Sara Taskinen,
Loïc Thibaut, Mark Westoby, and Gordana Popovic. Gordana in particular can be
credited with the idea of adding maths boxes to complement the code boxes.
Finally, this book has been something of a journey, taking over 5 years to write,
which is a long enough period for anyone to go through some ups and downs. Thanks
to my friends and family for helping me to find strength and confidence when it was
needed and supporting me when I couldn’t. And thanks to Bob for the apple crumble.
Part I
Regression Analysis for a Single Response
Variable
Chapter 1
“Stats 101” Revision
No doubt you’ve done some stats before—probably in high school and at university,
even though it might have been some time ago. I’m not expecting you to remember
all of it, and in this Chapter you will find some important lessons to reinforce before
we get cracking.
Key Point
Some definitions.
Response variable y The variable you are most interested in, the one you
are trying to predict. There can be more than one such variable, though (as
in Chap. 11).
Predictor variable x The variable you are using to predict the response vari-
able y. There is often more than one such variable (as in Chap. 3).
Regression model A model for predicting y from x. Pretty much every
method in this book can be understood as a regression model of one type
or another.
Statistical inference The act of making general statements (most commonly about a population) based on just a sample (a smaller number of “representative” cases for which you have data).
Categorical variables break subjects into categories (e.g. colour, species ID)
Quantitative variables are measured on a scale (e.g. biomass, species rich-
ness)
Most methods of data analysis, and all methods covered in this book, can be thought
of as a type of regression model. A regression model is used in any situation where
we are interested in understanding one or more response variables (which we will
call y) which we believe we can describe as a function of one or more predictor
variables (which we will call x).
The predictors might be quantitative variables measured on a scale, categorical
predictors that always fall in one of a set number of predefined categories, e.g.
experimental treatments, or other variables taking one of a set of possible values that
might be predetermined by the sampling design.
A response variable can similarly be quantitative or categorical, but importantly,
its value cannot be predetermined by the sampling design—as in its name, it is a
response, not something that is fixed or constrained by the way in which sampling
was done. Sometimes you might have more than one response variable, and you
can’t see a way around modelling these jointly; this complicates things and will be
considered later (Chap. 11).
How can you tell whether a variable is a predictor (x) or a response (y)? The
response is the variable you are interested in understanding. Usually, if one variable
is thought to be causing changes in another, the variable driving the change is x and
the one responding is y, but this is not always the case. If the goal is to predict one
variable, then it is most naturally treated as the response y, irrespective of which
causes which.
Key Point
The study design is critical—think carefully before you start, and get advice.
There is no regression method that can magically correct for a crappy design.
Garbage in ⇒ Garbage out
So you need to put a lot of thought into how you collect data.
It is always worth consulting with a statistician in the design phase. Even if you don’t
get anything out of the meeting, it’s better to waste an hour (and maybe $300) at the
start than to find out a year later you wasted 300 h of field work (and $20,000) with
a flawed design that can’t effectively answer the question you are interested in.
Statistical consultants absolutely love talking to researchers during the design
phase of a study. They will typically ask a bunch of questions intended to establish
what the primary research question is, to check that the study is designed in such
a way that it would be able to answer that question, and to understand what sort of
sample sizes might be needed to get a clear answer to the question. They would also
think about how to analyse data prior to their collection.
Many universities and research organisations employ statistical consultants to
offer free study design advice to students, and probably to staff, or at least advice
at a reduced rate. If your workplace doesn’t, consider it your mission to get this
situation addressed—any research-intensive organisation will want its researchers to
have access to quality statistics advice!
There may be a “resident expert” in your research group or nearby who is the
go-to person for study design and stats advice but who lacks formal training in
statistics. Some of these people are very good at what they do, actually many of
them are (otherwise why would they be considered the expert?). But as a general
rule, they should not be a substitute for a qualified statistical consultant—because the
consultant does this as their day job and is trained for this job, whereas the resident
expert typically does statistics only incidentally and it is neither the resident expert’s
main job nor something they trained for. I used to be one of these resident experts
as a student, in a couple of biology departments, before I retrained as a statistician. I
can assure you I am a much better consultant now that I have a deeper understanding
of the theory and a greater breadth of experience that comes from formal training in
stats.
Having said this, the resident expert, if one exists in your research area, will be
really good at relating to some of the intricacies of your research area (e.g. difficulties
finding extra field sites, “hot issues” in your literature) and might raise important
discipline-specific issues you need to think about. Because the resident expert might
see different issues not raised by a statistical consultant, and vice versa, perhaps the
best approach is to talk to both your resident expert and a statistical consultant! After
all, when making any big decision it is worth consulting widely and getting a range
of views. This should be no exception to that rule, because most research projects
represent serious commitments that will take a lot of your time.
While you should seek advice from more qualified people, you should not pass
responsibility for design and analysis decisions over to them entirely—as a researcher
you have a responsibility to understand key concepts in design and analysis as
they relate to your research area, to ensure design and analysis decisions are well
informed. I am fortunate to know some really successful ecologists who are at the
top of their field internationally, and one thing they have in common is a high level
of understanding of key issues in study design and a broad understanding of data
analysis methods and issues that can arise in this connection. Something else they
have in common is that they know that they don’t know everything, and occasionally
they seek advice from me or my colleagues on trickier problems, usually in the
design phase.
Before forging ahead with study design (or analysis), you need to clarify the objective
of your research. A good way to do this is to summarise your research project as a
question that you want to answer. Once you have framed the research question, you
can use this as a guide in everything that follows.
Is your research question interesting, something that is important, that lots of
people will be interested in finding out the answer to? If not, go find another research
question! Easier said than done though— asking a good question is typically the
hardest thing to do as a researcher. Finding the answer can be relatively easy, after
you have found a good question.
like a field site. This prompts the following question: Sure you demonstrated this in
your experiment, but will it happen in the real world? Good experimental procedure
either mimics the real world as well as possible or is supplemented by additional
trials mimicking the real world (e.g. repeat the growth chamber experiment out in
the field).
Another thing to think about early in the design of a study is what broader group of
subjects your research question is intended to apply to. This is known as the target
population.
The decision about the target population is a function both of the group of subjects
you are interested in and the group of subjects you might include in the study. For
example, you may be interested in insects in forests around the world, but if you only
have time and funding for field trips a few hundred kilometres from home, you will
need to narrow your focus, e.g. the target population could be all insects in forests
within a few hundred kilometres of home.
Ideally, any narrowing of your focus should be done in a way that does not sacrifice
scientific interest (or at least minimises the sacrifice). Is there a network of forested
national parks within a few hundred kilometres of home? Or could you team up with
others going on field trips further afield? If not, maybe you will need to switch to
an interesting research question for which you can actually access a relevant target
population. . .
Having decided on your target population, you will then work out how to select
a smaller group of subjects (a sample) to actually use in your study. This sample
should be representative of this population; one way to guarantee this is to randomly
sample from the population, as discussed later. How you sample requires careful
thought, and it is well worth your while seeking advice at this step.
Target populations are especially important in observational studies, but they
are also important in experiments—in the absence of arm-waving, the effects of a
treatment on subjects can only be generalised as far as the population from which
the experimental subjects were sampled.
If the study being undertaken is an experiment, then it will need to compare, replicate
and randomise, as discussed in what follows.
Compare across treatments that vary only in the factor of interest. This may
require the use of an experimental or “sham” control. For example, consider a study
of the effect of bird exclusion on plant growth, where birds are excluded by building
a cage around individual plants. A sham control would involve still caging plants,
but leaving off most of the netting so that the birds can still get in—so to the extent
possible, the only thing that differs across treatments is the presence/absence of birds.
The cage structure itself might adversely affect plants, e.g. due to soil disturbance
during its construction or as a result of shading from the cage structure, but using
a sham control means that these effects would be common across treatments and so
would not mess up the experiment.
Replicate the application of the treatment to subjects. This is necessary so we
can generalise about the effects of treatment, which would be impossible if it were
applied once to a large batch. For example, if looking at the effect of a gene through
breeding knock-out fruit flies, the process of gene knock-out should be replicated,
so that we know the effect is due to knock-out and not to something else particular
to a given batch.
Maths Box 1.2: Random Samples Are Identically and Independently Dis-
tributed
A random sample from a population is a sample taken in such a way that all
possible samples have an equal chance of being selected.
In a random sample, all subjects have an equal chance of being the ith in the sample (for each i), which means that all observations Y_i have the same distribution, f_{Y_i}(y) = f_Y(y) (the Y_i are identically distributed).
If the population sampled from is large, knowing which is the first subject in your sample gives no information about which will be the second, since all are equally likely. This means that the conditional distribution f_{Y_2|Y_1=y_1}(y) = f_Y(y) is not a function of y_1, and so Y_1 and Y_2 are independent (similarly, all n observations in the sample are independently distributed).
Hence, we can say that a random sample from a variable, Y_1, . . ., Y_n, is independently and identically distributed, or iid.
The variance of a variable Y with mean μ is the mean squared distance from that mean, σ² = μ_{(Y−μ)²} (Eq. (1.1)). This is the most common measure of the spread of a variable—the more spread out the values are, the larger the values that (Y − μ)² tends to take and, hence, the larger the variance. Variance can be estimated from sample data by averaging the values of (y − ȳ)², although this is usually rescaled slightly to remove small-sample bias.
The square root of the variance, σ, is called the standard deviation. It is often used in place of the variance because it has the same units as Y, so its values make more sense.
The mean and variance, and assumptions made about them in analysis, are
central to regression methods and their performance.
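As a rough illustration (a sketch, not one of the book's code boxes), these quantities can be computed in R for a small vector of made-up counts, including the small rescaling mentioned above:

y <- c(2, 1, 4, 1, 0, 5, 0, 1, 0, 3, 5, 2)   # a small sample of counts
mean((y - mean(y))^2)   # average squared deviation from the sample mean (divides by n)
var(y)                  # var() divides by n - 1 instead, removing the small-sample bias
sd(y)                   # the standard deviation, sqrt(var(y)), in the same units as y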
Key Point
In data analysis, always mind your Ps and Qs—let your decisions on what
to do be guided by the research question and data properties (especially the
properties of your response variable).
When thinking about how to analyse data, there are two key things to think about:
let’s call them the Ps and Qs of data analysis:
P What are the main properties of my data?
Q What is the research question?
And whenever analysing data, we need to mind our Ps and Qs—make sure we check
that the assumed properties align with what we see in our data and that the analysis
procedure aligns with the research question we want to answer.
Consider the data in Table 1.1, which reports the number of ravens at 12 sites before
and after a gun is fired. Consider the problem of visualising the data, the first step in
any analysis. How should we graph the data? It depends on what we want to know.
Table 1.1: Raven counts at 12 different sites before and after the sound of a gunshot,
courtesy of White (2005)
Before 0 0 0 0 0 2 1 0 0 3 5 0
After 2 1 4 1 0 5 0 1 0 3 5 2
Fig. 1.1: The method of analysis depends on the research question. Consider the set of
paired observations from White (2005), as in Table 1.1. Here are three example plots,
all of which are valid, but they answer different research questions—see Exercise 1.2
A few options are given in Fig. 1.1, where each graph corresponds to one of the
following three research questions:
1. Are counts larger after the gunshot sound than before?
2. Are counts at a site after the gunshot related to counts before the gunshot?
3. Do before and after counts measure the same thing?
The right approach to analysis depends on the question!
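For a concrete (if hedged) illustration, here is one possible way to draw three such plots in R using the Table 1.1 counts; these are not necessarily the exact plots shown in Fig. 1.1:

Before <- c(0, 0, 0, 0, 0, 2, 1, 0, 0, 3, 5, 0)
After  <- c(2, 1, 4, 1, 0, 5, 0, 1, 0, 3, 5, 2)
hist(After - Before)                # Q1: are counts larger after the gunshot? Look at the differences
plot(Before, After)                 # Q2: are after counts related to before counts?
plot(Before, After); abline(0, 1)   # Q3: do before and after measure the same thing? Compare to the 1:1 line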
So this is number one on the list of things to think about when analysing your
data—what question are you trying to answer?
There are many different approaches to data analysis, depending on the exact
question of primary interest. Broadly speaking, the main approaches fall into one of
the following categories (which will be discussed later):
• Descriptive statistics (playing with data)
• Hypothesis testing (a priori hypothesis of key interest)
• Estimation/confidence intervals (CIs) (to estimate the key quantity/effect size)
• Predictive modelling (to predict some key response variable)
• Model selection (which theory or model is more consistent with my data? Which
predictor variables are related to the response?)
Sometimes more than one of these is appropriate for a particular problem, e.g. I
would never test a hypothesis without first doing descriptive statistics to visualise
my data—or anything, actually. There are some things you can’t combine, though; in
particular, when doing model selection, usually you cannot do hypothesis testing or
CI estimation on the same data because standard procedures don’t take into account
the fact that you have done model selection beforehand.
A key distinction is between descriptive and inferential statistics.
Descriptive statistics is what you would do if the goal were to describe—
understanding the main features of the data. Common examples are graphing (his-
tograms, scatterplots, barplots, . . .), numerical summaries (mean, median, sd, . . .),
and “pattern-finders” or data-mining tools like cluster analysis and ordination. Graph-
ing data is a key step in any analysis: one analysis goal should always be to find a
visualisation of the data that directly answers the research question.
Inferential statistics is what you would do if the goal were to infer—making
general statements beyond the data at hand. Common examples include hypothesis
tests (t-tests, binomial tests, χ² test of independence, ANOVA F-test, . . .), CIs (for
mean difference, regression slope, some other measure of effect size), predictive
modelling, and variable or model selection.
The distinction between descriptive and inferential statistics is important because
inference tends to involve making stronger statements from the data, and doing so
typically requires stronger assumptions about the data and study design. Thus, it
is important to be aware of when you are moving beyond descriptions and making
inferences and to always check assumptions before making inferences!
When working out how to analyse data, we need to think not just about our research
question, but also about the properties of our data. Some of these properties are
implied by our study design.
You have probably already taken a stats course. In your first stats course you would
have come across a bunch of different techniques—histograms, clustered bar charts,
linear regression, χ² tests of independence, and so forth. Figure 1.2 is a schematic
summarising how these all fit together. Down the rows are different techniques of
analysis that depend on the research question (Qs)—am I doing a hypothesis test,
just making inferences about some key parameter I want to estimate, etc? Across the
columns we have different data properties (Ps)—does the research question involve
one or two variables, and are they categorical or quantitative?
Figure 1.2 isn’t the end of matters when it comes to doing statistics, obviously,
although unfortunately many researchers have no formal training that takes them
beyond these methods. Clearly sometimes you have more than two variables, or one
of them is discrete and takes small values (unlikely to satisfy normality assumptions),
or the response variable is ordinal. Such situations will be covered later in this book.
Fig. 1.2: A schematic of “Stats 101” methods—tools you likely saw in an introductory
statistics course, organised according to the research question they answer (rows)
and the properties the data have (columns). In this text, we will find commonalities
across these methods and extend them to deal with more complex settings
White (2005) wanted to know whether ravens are attracted to the sound of gunshots, perhaps in anticipation of a carcass to scavenge. He went to 12 locations, counted the ravens he saw, then shot his gun, waited 10 min, and counted again. The results are in Table 1.1.
What do the data tell us—one variable or two? Categorical or quantitative?
What does the question tell us—descriptive, estimation, hypothesis testing,
etc?
So how would you analyse the data?
What graph would you use to visualise the data?
When analysing your data you should always mind your Ps and Qs—make sure
the assumptions about data properties are reasonable, and make sure that the analysis
procedure aligns with the research question you want to answer. We will discuss this
in more detail in Sect. 1.5.
Key Point
Statistical inference involves making general statements about population or
“true” quantities (e.g. mean) based on estimates of these quantities from sam-
ples. We need to take into account uncertainty due to the fact that we only have
a sample, which is commonly done using a CI or a hypothesis test.
When we use the term statistical inference, we are talking about the process of gen-
eralising from a sample—making statements about general (“population”) patterns
based on given data (“sample”). In an observational study, there is typically some
statistic that was calculated from a sample, and we would like to use this to say
something about the true (and usually unknown) value of some parameter in the
target population, if we sampled all possible subjects of interest. In an experiment,
we use our data (“sample”) to calculate a treatment effect and want to know what this
says about the true treatment effect if we repeat the experiment on an ever-increasing
set of replicates (“population”).
It is always important to be clear whether you are talking about the sample or
the population—one way to flag what you mean is through notation, to distinguish
between what you know (sample) and what you wish you knew (population).
Notation for some common sample estimators and their corresponding population
parameters are listed in Table 1.2. Sample estimates usually use the regular alphabet,
and population parameters usually use Greek letters (e.g. s vs σ). The Greek letter
used is often chosen to match up with the subject (e.g. σ is the Greek letter s, s for
standard deviation, μ is the Greek letter m, m for mean). An important exception to
the rule of using Greek letters is proportions—the population proportion is usually
denoted by p (because the Greek letter p is π, which we reserve for a special number
arising in circle geometry).
Another common way to write sample estimators is by taking the population
parameter and putting a hat on it (e.g. β̂ vs β). So there really is a bit of a mixture
of terminologies around, which doesn’t make things easy. A way to be consistent
with notation, instead of chopping and changing, is to go with hats on all your
estimators—you can write an estimator of the mean as μ̂ and an estimator of the
standard deviation as σ̂, and most of us will know what you mean.
We usually refer to the number of observations in a sample as the sample size, n.
Table 1.2: Notation for common sample estimators and the corresponding population
parameters being estimated
                     Sample       Population
Mean                 x̄ (or ȳ)     μ (or μ_y)
SD                   s            σ
Proportion           p̂            p
Regression slope     β̂            β
When making inferences, we need to account for sampling error—the fact that we
did not sample all subjects of potential interest, only some of them, so any statistic
we calculate from the sample will likely differ from what the true answer would be
if we took measurements on the whole population.
If we repeated a study, any statistic we calculated might take a different answer.
So statistics vary from one sample to the next. A key idea in statistical inference is
to treat a statistic as a random variable that varies across samples according to its
sampling distribution. We can often use theory (or simulation) to study the sampling
distribution of a statistic to understand its behaviour (as in Maths Boxes 1.4–1.5)
and to design inference procedures, as discussed in what follows.
Consider Exercise 1.4, where the target we want to make inferences about is p, the
(true) sex ratio in the bat colony. There are two main ways to make inferences about
the true value of some parameter, such as p. If you use the prop.test function to
analyse the data, as in Code Box 1.1, the output reports both analyses.
> prop.test(65,109,0.5)
Hypothesis test—there is a specific hypothesis we want to test using data (the null
hypothesis, often written H0 ). In Exercise 1.7, we could test the hypothesis that there
is no gender bias, H0 : p = 0.5. We observed 65 female bats out of a colony of 109,
and we could use probability to work out how likely it would be to get this many
females if there were no gender bias:
P-value = 2 × P(p̂ ≥ 65/109) = 0.055
(See Code Box 1.1 for P-value computation.) A statistic this far from 50:50 is
reasonably unlikely (almost 5%) so there is reasonable evidence against H0 , i.e.
reasonable evidence of gender bias.
Confidence interval—we don’t know the true sex ratio, but we can construct an
interval which we are pretty sure contains the true sex ratio.
e.g. from the CI in Code Box 1.1 we can say that we are 95% confident that the true p is
between 0.498 and 0.688.
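Both quantities can be pulled straight out of the prop.test output; a minimal sketch:

res <- prop.test(65, 109, p = 0.5)   # 65 females out of 109 bats, testing H0: p = 0.5
res$p.value    # the two-sided P-value, about 0.055
res$conf.int   # the 95% CI for the true proportion, about 0.498 to 0.688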
Loosely speaking, a P-value measures how unlikely your data are under H0 . (More
specifically, it is the probability of getting a test statistic as extreme as or more
extreme than the observed one if the null hypothesis is true.) This can be used as a
measure of how much evidence there is against H0 (but not as a measure of evidence
for H0 ).
Often researchers use the 0.05 significance level in testing—declaring results are
statistically significant at the 0.05 level if the P-value is less than 0.05. Using this
hard rule, in Exercise 1.7 we would say results are not statistically significant. Others,
such as myself, would interpret P-values along a scale and think of anything in the
neighbourhood of 0.05 as marginally significant, whether it was larger or smaller
than 0.05. The arguments against this are in terms of objectivity (if not sticking with
0.05, what exactly is your decision rule?) and convention (what is your reason for
departing from 0.05 exactly?). Neither of these issues is insurmountable, but they
are nonetheless issues to consider.
It always helps to look at the CI to help with interpretation; in fact, this is strongly
advised (Nakagawa & Cuthill, 2007, for example). If you have done a hypothesis test
and found a significant effect, the next natural question is how big of an effect it is. So
you find yourself computing a CI for the parameter of interest. Conversely, if there
is no significant effect, the next natural question is whether this means the effect is
small. Or is it plausible that maybe there was a big effect that has gone undetected?
Which leads you back to the CI.
Some argue you should look at CIs and other techniques and not bother with
hypothesis tests at all (see for example the Special Issue in Ecology, introductory
remarks by Ellison et al., 2014). But while hypothesis tests and CIs are different
ways of presenting the same information, they put the focus on different aspects
of it, and both have their place in the toolbox. If there is a specific hypothesis of
primary interest (e.g. is there an effect?), that is what a hypothesis test is for. It
will quantify how likely your data are under the null hypothesis and, hence, to give
you a measure of how much evidence there is against this hypothesis of primary
interest (e.g. evidence for an effect). On the other hand, if the main game is not
to test a specific hypothesis but rather to estimate some population parameter, then
hypothesis testing has no place in analysis, and a CI is your best bet.
While usually we have the luxury of moving interchangeably between hypothesis
tests and CIs, depending on what is of interest, things get tricky for interval estimation
when you have lots of response variables (Chap. 11). In that case it is hard to work
out a single quantity that you want to estimate (instead, there are lots of separate
quantities to jointly estimate), making it difficult to define a target parameter that
you want to construct an interval around. There are, however, no such difficulties for
hypothesis tests. Thus, you might notice in the multivariate literature a little more
reliance on hypothesis testing (although arguably too much).
Hypothesis tests are often abused or misused, to the point where some discourage
their use (Johnson, 1999, for example). The American Statistical Association released
a joint statement on the issue some years ago (Wasserstein & Lazar, 2016) in an
attempt to clarify the key issues. Some of the main issues are discussed in what
follows.
Don’t conclude that H0 is true because P is large: This mistake can be avoided
by also looking at your question in terms of estimation (with CIs)—what is a range
of plausible values for the effect size?
Don’t test hypotheses you didn’t collect the data to test.
Don’t “search for significance”, testing similar hypotheses many times with
slightly different data or methods: This gets you into trouble with multiple testing—
every time you do a test, at the 0.05 significance level, by definition, 5% of the time
you will accidentally conclude there is significant evidence against the null. So if you
trawl through a bunch of response variables, doing a test for each of 20 possible
response variables, on average you would expect one of them to be significant at
the 0.05 level by chance alone, even if there were no relation between these re-
sponses and the predictors in your model. To guard against this, tests should be used
sparingly—only for problems of primary interest—and if you really do have lots of
related hypotheses to test, consider adjusting for multiple testing to reduce your rate
of false positives, as in Sect. 3.2.4.
Don’t test claims you know aren’t true in the first place: I have reviewed papers
that have included P-values concerning whether bigger spiders have bigger legs,
whether taller trees have thicker trunks, etc. These are hypotheses we all know the
answer to; what the researchers were really interested in was, for example, how leg
length scales against body size or how basal diameter scales against tree height. So
there is no need to do the test!
Most sample estimates of parameters ( x̄, p̂, β̂, . . .) are approximately normally dis-
tributed, and most software will report a standard error (standard deviation of the
estimator) as well as its estimate. In such cases, an approximate 95% CI for the true
(“population”) value of the parameter (μ, p, β, . . .) is

estimate ± 2 × standard error     (1.2)

About 95% of the time, such an interval will capture the true value of the parameter
if the assumptions are met.
Equation (1.2) uses two times the standard error, but to three significant figures
this so-called critical value is actually 1.960 when using a normal distribution. And
sometimes a t distribution can be used, which can give a value slightly greater than
2. Your software will probably know what it is doing (best to check yourself if not
sure), but on occasion you might need to compute a CI manually, and unless sample
size is small (say, less than 10), the rule of doubling standard errors in Eq. (1.2)
should do fine.
You can often compute CIs in R using the confint function.
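For example (a minimal sketch; the model and variable names here are made up, not from the book):

fit <- lm(y ~ x, data = dat)   # some fitted regression model
confint(fit)                   # 95% CIs for each coefficient
confint(fit, level = 0.90)     # or another confidence level
# The manual version of Eq. (1.2): estimate plus or minus two standard errors
est <- coef(summary(fit))["x", "Estimate"]
se  <- coef(summary(fit))["x", "Std. Error"]
est + c(-2, 2) * se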
Key Point
There are assumptions when making inferences—you need to know what they
are and to what extent violations of them are important (to the validity and
efficiency of your inferences). Independence, mean, and variance assumptions
tend to be important for validity, distributional assumptions not so much, but
skew and outliers can reduce efficiency.
Recall that when analysing data you should always mind our Ps and Qs—make
sure the assumptions about data properties are reasonable, and make sure that the
analysis procedure aligns with the research question you want to answer. The role
of assumptions and assumption checking is critical when making inferences (e.g.
computing a CI or a P-value)—it is not possible to make big general statements
based on samples without making a set of assumptions, and these assumptions need to be interrogated to have any assurance that you are on the right track.
One key implicit assumption when making inferences is that you are sampling
from the population of interest. If you think about the broader set of subjects or
events you would like to generalise to, is your sample representative of this broader
set? This could be assured in an observational study by randomly sampling from
the population of interest in a survey. In an experiment, replicating the treatment of
interest and randomising the allocation of subjects to these treatments ensures that
we can generalise about any impacts of treatment (at least the impacts of treatment
on this set of subjects). If you did not do one of the aforementioned steps (in an
observational study, it may not be possible), then some amount of arm-waving and
hand-wringing is needed—you will need to convince yourself and others that your
sample is representative of some broader population of interest.
Other assumptions are specific to the analysis method, and how you check as-
sumptions varies to some extent from one context to the next. The bat (Exercise 1.9)
and raven (Exercise 1.10) analyses, and indeed most statistical models, involve an
independence assumption, an important assumption and one that is quite easily vi-
olated. This is not something that is easy to check from the data; it is more about
checking the study design—in an experiment, randomising the allocation of subjects
to treatments guarantees independence is satisfied, and in a survey, random sampling
goes some way towards satisfying the independence assumption. One exception to
this rule is that if you sample randomly in space (or time) but include in your analysis
predictor variables measured at these sites that are correlated across space (or time),
this can induce spatial (temporal) autocorrelation that should be accounted for in
analysis. So, for example, if we were to investigate bat gender ratio across colonies
and whether it was related to temperature, then because temperature is a spatial
variable, we would need to consider whether this induced spatial dependence in the
response.
Often there are additional assumptions, though. The main types of assumptions a
model makes can be broken down into the following categories:
Independence: Most (but not all) methods in this book require observations to be
independent, conditional on the values of any predictors (x).
Mean model: Common regression methods involve assuming a model for the
mean response (the true mean of y, as in Maths Box 1.3) as a function of x; e.g. we can assume that the mean of y is linearly related to x. In the raven example, our
mean model under H0 is that the true mean (of the after–before differences) is zero.
Variance model: Usually we need to make assumptions about the variance, the
most typical assumption being that the variance of y under repeated sampling (at
a given value of x) will be the same no matter what. We made this equal variance
assumption (of the after–before differences) in the raven example.
Distributional assumption: Parametric methods of regression (which are the focus
of this book) involve assuming the response y comes from a particular distribution.
Most commonly we will assume a normal distribution, as in the raven example, but
for something different see Chap. 10 and later chapters.
Strictly speaking, you could consider mean and variance assumptions as special
cases of distributional assumptions, but it is helpful to tease these apart because
mean and variance assumptions often have particular importance.
Some especially useful tools for checking assumptions are normal quantile plots
and residual vs fits plots, although it is also worth thinking about the extent to which
assumptions can be violated before it becomes a problem.
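For models fitted with lm, for example, both plots can be obtained directly (a generic sketch; the variable names are made up):

fit <- lm(y ~ x, data = dat)
plot(fit, which = 1)   # residuals vs fitted values: checks the mean and variance assumptions
plot(fit, which = 2)   # normal quantile plot of residuals: checks the distributional assumption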
Fig. 1.3: Interpreting normal quantile plots. Example histograms (top row) and
normal quantile plots (bottom row) for data simulated from a normal distribution
(left), a right-skewed distribution (centre), or a left-skewed distribution (right). On
the normal quantile plot, data stay fairly close to a straight line if it is close to a
normal distribution, are J-shaped if right-skewed, or r-shaped if left-skewed
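A plain normal quantile plot of the after − before differences can be drawn along the following lines (a minimal sketch: the original code box is not reproduced here, so the use of qqnorm is an assumption), with the Table 1.1 counts stored as Before and After:

Before <- c(0, 0, 0, 0, 0, 2, 1, 0, 0, 3, 5, 0)
After  <- c(2, 1, 4, 1, 0, 5, 0, 1, 0, 3, 5, 2)
qqnorm(After - Before)   # normal quantile plot of the after-minus-before differences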
[Two normal quantile plots of the after − before differences from Table 1.1: the left panel is a plain normal quantile plot; the right panel shows the same data with a simulation envelope added by qqenvelope]
This produces the plot on the left. There are lots of tied values on this plot (several dots
fall along the same horizontal line), and a small sample size makes it hard to see a pattern
(but maybe a slight suggestion of right skew).
The plot on the right was constructed using the qqenvelope function as follows:
library(ecostats)
qqenvelope(After-Before)
This puts a simulation envelope around where data should lie if they were normal. All points
lie inside the envelope, so we have no evidence of a violation of the normality assumption.
Don’t be too fussy about the normality assumption, though—even when some
points lie outside their simulation envelope, this is not necessarily a problem. Vio-
lations of the normality assumption tend not to matter unless one of the following
applies:
• Sample size is small (e.g. n < 10),
• Data are strongly skewed or have big outliers, or
• You are using a really, really small significance level (e.g. only declaring signifi-
cance at 0.001), which might happen for example if doing multiple testing.
What happens if your data aren’t normally distributed? Or the variances aren’t equal?
There is error in any model fitted to data, and one of the main points of statistics is to
try to deal with this error appropriately. If assumptions are not satisfied, this might
not be done successfully. There are two things to consider:
Validity A valid method deals with error appropriately. Although your model
doesn’t give exactly the true answer, does it give estimates centred around it
(unbiased)? Is the estimate of the amount of error reasonable? Is a 95% CI really
going to capture the true parameter 95% of the time? Or, equivalently, is the 0.05
significance level really going to be exceeded only 5% of the time when the null
is true?
Efficiency An efficient method has relatively small error and so is relatively good
at answering your research question. Is the standard error small and CI narrow?
Or, equivalently, is a test going to have good power at detecting an effect?
What happens depends on which assumption we are talking about. But here are
some general rules (we will elaborate on them when we have a specific example to
talk about in Sect. 2.1.2).
Violations of independence assumption: If your data are not independent, but you
have assumed your data are independent, then you are stuffed.¹ When using methods
that assume independence, it is really important to make sure your observations
in each sample are independent of each other—and sometimes it is quite easy to
guarantee this assumption is satisfied through randomisation. For a mathematical
definition of independence and how randomisation can help satisfy this assumption
see Maths Box 1.1.
The most common violation of this assumption is when everything is positively
correlated (e.g. observations close together in space are more similar), and the effect
this has is to make estimates of standard errors smaller than they should be (as in
Maths Box 1.5). If standard errors are too small, you tend to get false confidence—
CIs are too narrow (missing the true value more often than they claim to) and test
statistics are too large (more likely to declare false significance than they ought to).
This is really bad news.
False confidence is really bad news because one of the main points of inferential
statistics is to guard against this (or, more specifically, to control the rate of false
discoveries). If our data violate the independence assumption, and this is not ac-
counted for in our inferential procedure, then it can’t do the job it was designed to.
People have a habit of finding an explanation for anything, even really random things
that make no sense (there is a lot of literature behind this, for example Shermer,
2012, and a word for it, “apophenia”). An important benefit of inferential statistics is
providing a reality check to scientists, so that when they think they see a pattern, the
analysis procedure can tell them if it should be taken seriously. If the independence
assumption is violated, the stats procedure can’t do this job and there is little point
using it, you might as well just eyeball the data. . .
Another common example of where the independence assumption is often vi-
olated is commonly referred to in ecology as pseudo-replication (Hurlbert, 1984),
where there is structure from the sampling design that is not accounted for in mod-
elling, such that some groups of observations are related to each other and cannot be treated as independent.
¹ That is, you are in a world of pain!
So Ȳ is unbiased, being centred around the true mean μ_Y, if all Y_i have mean μ_Y.
Note we did not use assumptions of independence, equal variance, or nor-
mality in the preceding problem—all those assumptions could be violated, and
we could still get an unbiased estimate of the mean. We did assume a simple
type of mean model—that all observations share the same mean μY .
For similar reasons, most regression models in this book give unbiased
estimates of the mean if the mean model is satisfied, irrespective of violations
of independence, equal variance, and distributional assumptions. Violations
of assumptions will, however, often lead to other problems (Maths Box 1.5).
than σ_Y/√n. In this situation, assuming independence and using σ_Y/√n in analyses would be invalid, using too small a value for the standard error, giving
us false confidence in Ȳ . We also assumed in the preceding problem that all
standard deviations were equal (our variance model). Note that normality was
not used; it tends to be less important than other assumptions.
For similar reasons, all regression models in this book that assume inde-
pendence rely heavily on this assumption and risk false confidence when it is
not satisfied. Similarly, beyond the mean and variance models, distributional
assumptions tend to be less critical for validity throughout this book.
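A small simulation (an illustrative sketch, not from the book) makes this concrete: when observations share a common effect, and so are positively correlated, σ_Y/√n badly understates how variable Ȳ really is.

set.seed(1)
n <- 20
sims <- replicate(5000, {
  u <- rnorm(1)        # a shared effect, e.g. all n observations come from the same site
  y <- u + rnorm(n)    # n positively correlated observations, each with variance 2
  mean(y)
})
sd(sims)               # the actual SD of the sample mean: about 1.0
sqrt(2) / sqrt(n)      # sigma_Y / sqrt(n), as claimed under independence: about 0.32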
tailed data. The maths of the proof are surprisingly straightforward, but it does use
moment generating functions (MGFs) and Taylor series (typically taught in first-year
mathematics courses), so it is not for everyone.
We will prove the central limit theorem using the moment generating function (mgf) of Y, written as m_Y(t) and defined as the mean of e^{tY}. The mgf of a variable, when it exists, uniquely defines its distribution. For N(0, σ²), the mgf is e^{σ²t²/2}, so our goal here will be to show that as n → ∞, log(m_{√n(Ȳ−μ)}(t)) → σ²t²/2. But to do this, we will use a few tricks for mgfs.
We will use this Taylor series expansion of m_Y(t):

m_Y(t) = 1 + μ_Y t + (μ_{Y²}/2) t² + (μ_{Y³}/6) t³ + (μ_{Y⁴}/24) t⁴ + · · ·     (1.3)

where μ_{Y³} is a function of skewness, and μ_{Y⁴} is a function of long-tailedness.
Rescaling: If Y = aX, then m_Y(t) = μ_{exp(Yt)} = μ_{exp(aXt)} = μ_{exp(X·at)} = m_X(at).
Adding: If Y = X₁ + X₂, where X₁ and X₂ are independent, then m_Y(t) = μ_{exp((X₁+X₂)t)} = μ_{exp(X₁t) exp(X₂t)} = μ_{exp(X₁t)} μ_{exp(X₂t)} = m_{X₁}(t) m_{X₂}(t).
The adding rule generalises to n independent variables as m_{Σ_i X_i}(t) = Π_i m_{X_i}(t). Independence is critical for the adding rule! Now we prove the central limit theorem.
Using the rescaling and adding rules, and writing √n(Ȳ − μ) = (1/√n) Σ_i (Y_i − μ), we get

log m_{√n(Ȳ−μ)}(t) = n log m_{(Y_i−μ)}(t/√n) = n log(1 + σ²t²/(2n) + · · ·)     (1.4)

Equation (1.4) was obtained by applying Eq. (1.3), noting that the mean of Y_i − μ is 0 and the mean of (Y_i − μ)² is σ² (Eq. (1.1)), and ignoring third- and higher-order terms, because for large n they are all multiplied by (t/√n)³ or higher powers, and these terms still go to zero (even after multiplying by n) as n gets large.
Hence, as n → ∞,

log m_{√n(Ȳ−μ)}(t) → σ²t²/2

which completes the proof.
Maths Box 1.7: How Fast Does the Central Limit Theorem Work?
So sample means are approximately normal when n is large, irrespective of
the distribution of the response variable Y , but how large does n need to be?
There is no simple answer; it depends on the distribution of Y , in particular,
it depends on skewness, which we define as κ₁ = μ_{(Y_i−μ)³}, and kurtosis (or long-tailedness), defined as κ₂ = μ_{(Y_i−μ)⁴}.
The main approximation in the proof was when the third and fourth moments
of Yi − μ (and higher-order terms) in Eq. (1.4) were ignored. If we go back to
this equation and stick these terms back in, we get
log m_{√n(Ȳ−μ)}(t) = n log m_{(Y_i−μ)}(t/√n)
   = n log(1 + (σ²/2)(t/√n)² + (κ₁/6)(t/√n)³ + (κ₂/24)(t/√n)⁴ + · · ·)
   = n log(1 + σ²t²/(2n) + κ₁t³/(6n√n) + κ₂t⁴/(24n²) + · · ·)
   ≈ σ²t²/2 + κ₁t³/(6√n) + κ₂t⁴/(24n) + · · ·
The higher-order terms, having n in the denominator, will always become
ignorable when n is large enough, but the n that is “large enough” is bigger
when data come from a distribution with a larger (in absolute terms) value of
κ1 or κ2 . The third moment κ1 is zero for symmetric distributions, non-zero for
skewed, and larger (in absolute terms) for more strongly skewed distributions.
The fourth moment κ2 is larger when Y has a more long-tailed distribution.
So if averaging a variable that is skewed or that has the occasional large
outlier, the sample size needed for the central limit theorem to take effect
needs to be larger. Simulations suggest that n = 10 is plenty for fairly sym-
metrical, short-tailed distributions, but n = 100 might not be enough for more
pathological long-tailed cases. If data are strongly skewed or if they come with
large outliers, methods that model the mean should not really be used anyway,
without data transformation. Keeping this in mind, n = 30 is usually plenty
for appropriately transformed data.
The central limit theorem applies surprisingly quickly—if taking a simple av-
erage of a symmetric variable without long tails, five observations are usually
plenty for it to kick in and ensure that the sample mean will be close to be-
ing normally distributed. It works more slowly for data from skewed distribu-
tions, or when a statistic is highly influenced by just a few data points (long-
tailed distributions and outliers being a particular issue). Some useful insights
can be found in Miller Jr. (1997), and there are plenty of online applets as well to discover the robustness of the central limit theorem yourself; a good one is at http://onlinestatbook.com/stat_sim/sampling_dist.
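A few lines of R (an illustrative sketch, not from the book) will do the same job as the applets:

n <- 10                                    # try different sample sizes
means <- replicate(10000, mean(rexp(n)))   # sample means of a strongly right-skewed variable
hist(means)                                # already looking fairly bell-shaped?
qqnorm(means); qqline(means)               # how close to a normal distribution is it?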
The convergence to normality of most statistics, in most situations, explains why
the mean model and variance model are the most important things to focus on. The
mean and the variance are the only parameters that matter in a normal distribution.
Roughly speaking, if you get your mean model right, then the mean of any statistic
capturing (mean) trends in the data will be right, and if you get the variance model
right, plus your independence assumption, then the variance (standard error) of your
statistic will be right too. Throw in some central limit theorem and you’re done, you
know the distribution of your test statistic (in large samples) irrespective of whether
your distributional assumptions are right. In which case, if your sample size is large
enough for the central limit theorem to apply (usually, n > 30 is plenty), then any
inferences you make based on this statistic will be valid.
That doesn’t mean that your inferences will be efficient. If you have significant
departures from your distributional assumptions, in particular, ones that add addi-
tional skewness to the distribution, or long-tailedness (propensity for outliers), this
can have substantial effects on the efficiency of your inferences, i.e. how likely you
are to pick up patterns that are there. So if you know your distributional assumptions
are quite wrong (especially in terms of skewness and long-tailedness), you should
try to do something about it, so that your inferences (tests, CIs) are more likely to
pick up any signal in your data. One thing you could do is transform your data, as in
the following section; another option is to use a different analysis technique designed
for data with your properties. A more flexible but more sophisticated option is to try
non-parametric statistics (Corder & Foreman, 2009) or robust statistics (Maronna
et al., 2006)—branches of statistics that have developed a suite of techniques intended
to maintain reasonable efficiency even for long-tailed or skewed data.
Assessing skewness is easy enough—is there a J shape (right-skewed) or an r
shape (left-skewed) on a normal quantile plot of residuals? You can also look at
a histogram of residuals, but this tends to be less sensitive as a diagnostic check.
Assessing long-tailedness is a little problematic because by definition you don’t
see outliers often, so from a small sample you don’t know if you are likely to get
any. It will show up in a larger sample, though, and, if present, can be seen on a
normal quantile plot as departures from a straight line above the line for large values
and below the line for small values (since observed values are more extreme than
expected in this case).
You can also compute sample estimates of skewness and kurtosis (long-tailedness)
on most stats packages, but I wouldn’t worry about these—they are unreliable unless
you have a big enough sample size that the answer is probably screaming at you from
a graph anyway.
So in summary, what you need to worry about depends on sample size, along the
lines of the key point stated below. These rules represent a slight relaxation
of those in Moore et al. (2014); many use tougher criteria than suggested here. I
arrived at the rules stated in this key point using a combination of theory (Maths
Box 1.7) and practice, simulating data (and corresponding normal quantile plots)
and seeing what happened to the validity of the methods.
An important and difficult situation arises as a result of small sample sizes
(n < 10)—in this case, distributional assumptions matter but are hard to check. They
matter because the central limit theorem doesn’t help us too much, and they are hard
to check because the small sample size doesn’t give us a lot of information to detect
violations of our assumptions. This situation is a good case for trying non-parametric
statistics (Corder & Foreman, 2009) or design-based inference (Chap. 9) to ensure
the inference procedure is valid.
Key Point
You don’t need to be too fussy when checking distributional assumptions on
your model fit. Usually, we can use a model fitted to our data to construct a
set of residuals that are supposed to be normally distributed if the model is
correct. How carefully you check these residuals for normality depends on
your sample size:
n < 10 Be a bit fussy about the normality of residuals. They should look
symmetric and not be long-tailed (i.e. on a normal quantile plot, points
aren’t far above the line for large values and aren’t far below the line for
small values).
10 < n < 30 Don’t worry too much; just check you don’t have strong skew
or big outliers/long-tailedness.
n > 30 You are pretty safe unless there is really strong skew or some quite
large outliers.
There are plenty of formal tests around for checking assumptions, including tests
of normality (Anderson-Darling, Shapiro-Wilk, . . . ), or tests for equal variance
(Levene’s test, F-test, . . . ). These tests are not such a good idea; in fact, I actively
advise against using them. Checking a graph is fine (using the rules in the previously
stated key point).
There are two main reasons not to use hypothesis tests to check assumptions.
First, it is a misuse of the technique, and second, it doesn’t work, as explained below.
Hypothesis tests on assumptions are a misuse of the approach because you should
only test hypotheses you collected the data to test! Was the primary purpose of your
study really to test the assumption of normality?
Hypothesis tests on assumptions don’t work well because this approach doesn’t
answer the question we want answered. A hypothesis test would give you a sense
of whether there is evidence in your data against the hypothesis of normality, but
recall we want to know whether violations of normality are sufficiently strong that
we should worry about them (in particular, skewness and long-tailedness). These are
quite different things. Further, hypothesis tests of assumptions behave in the wrong
way as the sample size increases. A larger sample size gives us more information
about the distribution of our data, so when n is large, we are more likely to detect
violations of normality, even if they are small. But we know, from the central limit
theorem, that statistics are more robust to violations of normality when sample size
is larger, so tests of normality are going to make us worry more in situations where
we should be worrying less!
Key Point
Transformation can be useful; in particular, when data are “pushed” up against
a boundary, transformation can remove the boundary and spread the data out
better. The log transformation is a good one because it is easier to interpret.
1.6 Transformations
[Normal quantile plot; sample quantiles shown on the vertical axis]
This plot looks okay, but the previous one (Code Box 1.2) looked reasonably good too.
Another option to consider is to model the data as discrete, as in Chap. 10.
Why transform data? Usually, to change its shape, in particular, to get rid of
strong skew and outliers. (As before, if residuals have strong skew or outliers, most
analysis methods don’t work well.) Figure 1.4 is a schematic to gain some intuition
for why transformation can change the shape of data.
A special type of transformation is the linear transformation, which is used to
change the scale (e.g. from grams to kilograms) rather than the shape of a distribution.
Linear transformations have the form ynew = a + by and are so called because if you
plot ynew against y, you get a straight line. The most common linear transformation
in statistics is when you want to standardise a variable, subtracting the mean and
dividing by the standard deviation. Note that a linear transformation doesn’t change
the shape at all; it only changes the scale. Only non-linear transformations are shape
changers.
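For example, standardising a variable is a linear transformation with b = 1/s and a = −ȳ/s, where s is the standard deviation; a minimal sketch in R (the numbers are made up):

y = c(2.3, 4.1, 5.6, 3.8, 7.2)            # any numeric variable
y_std = (y - mean(y)) / sd(y)             # subtract the mean, divide by the sd
all.equal(as.numeric(scale(y)), y_std)    # scale() does the same thing
# the shape is unchanged: standardised values just relabel the axis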
Fig. 1.4: How transformation changes the shape of data. The log10 function (blue
curve) has a steeper slope at smaller values, so if most observations are small
(yellow vertical band), then after log transformation, these will get spread out over
a much broader range of values (light yellow horizontal band). The histograms in
the margins show the effect this has on the shape of a right-skewed variable. In this
case, population size of first-world countries becomes much more symmetric after
log transformation, and the apparent outlier on untransformed data (USA) looks fine
after transformation
themselves are meaningful, e.g. the log transformation, and to use transformation
with caution if predictions are required on the original scale (e.g. to estimate standing
biomass of a forest).
The most common situation is where your data take positive values only (y > 0)
and are right-skewed. In this situation, the following transformations might make it
more symmetric:
• ynew = √y
• ynew = y^(1/4)
• ynew = log y
These transformations are all “concave down”, so they reduce the length of the right
tail. They are in increasing order of strength—that is, for strongly skewed data,
ynew = log y is more likely to work than ynew = √y (Fig. 1.5).
They are also monotonically increasing—that is, as y gets larger, ynew gets larger.
(A transformation that didn’t have this property would be hard to interpret!)
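To get a feel for the increasing strength of these transformations, here is a small sketch on simulated right-skewed data (not the data of Fig. 1.5):

set.seed(1)
y = rlnorm(200, meanlog = 1, sdlog = 1)   # strongly right-skewed, positive values
par(mfrow = c(2, 2))
hist(y, main = "y")
hist(sqrt(y), main = "sqrt(y)")           # mild transformation
hist(y^(1/4), main = "y^(1/4)")           # stronger
hist(log(y), main = "log(y)")             # strongest; roughly symmetric here
par(mfrow = c(1, 1))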
Any base can be used for the log transformation, ynew = loga y: changing the base only multiplies the result by a constant, so it changes the scale but not the shape.
[Fig. 1.5: Histograms of a right-skewed variable y and of its transformations sqrt(y), y^(1/4), and log(y)]
The log transformation is particularly natural for quantities that grow or change multiplicatively, for example:
• Profit
• Population
By log-transforming such processes, they change from multiplicative to additive.
See, for example, the data on population sizes of first-world countries in Fig. 1.4.
If your data are “pushed up” against a boundary, you should think about transforming
the data to remove the boundary.
For example, a population can be small, but it can’t be negative, so values are
pushed up against zero. A log transformation removes this boundary (because as y
approaches 0, log(y) approaches −∞).
Removing boundaries in this way often makes the data look more symmetric,
but it also prevents some awkward moments. Boundaries that are not removed can
lead to nonsensical predictions from regression models, e.g. a predicted population
of −2 million! A rather infamous example of this is a paper in Nature that analysed
Olympic 100 m sprint times for males and females, without transformation to remove
the boundary at zero seconds. Both male and female plots had a decreasing trend
over time, but the fitted line was steeper for females, so the authors predicted that
by the mid-22nd century, a woman would complete the 100 m sprint faster than the
winning time for the men’s event (Stephens et al., 2004). But there was also the
rather embarrassing problem that by the 27th century the race would be completed
in less than zero seconds (Rice, 2004)!
Proportions: If your data are proportions (strictly between 0 and 1), a logit transformation, ynew = log(y/(1 − y)), is worth trying. (This stretches data over the whole real line, from −∞ to ∞.)
The arcsine transform was often used historically—it is not a good idea because it
does not remove the boundary and is hard to interpret (Warton & Hui, 2011). If your proportions
arise from counts, methods specially developed for discrete data might be worth
a go (Chap. 10). Otherwise, beta regression is also worth considering (Ferrari &
Cribari-Neto, 2004, an extension of methods of Chap. 10).
Right-skewed with zeros: This often happens when data are counts. The problem
is that you can’t take logs because log 0 is undefined. Try
ynew = log(y + 1)
This might work for you, unless you have lots of small counts. If there are lots of
zeros, you will need to use generalised linear models, described in Chap. 10.
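A minimal sketch of this transformation on simulated counts (not data from the text):

set.seed(1)
y = rnbinom(100, mu = 4, size = 0.8)   # right-skewed counts, including zeros
ynew = log(y + 1)                       # defined even when y = 0
par(mfrow = c(1, 2))
hist(y, main = "raw counts")
hist(ynew, main = "log(y + 1)")
par(mfrow = c(1, 1))
# with many zeros the transformed data still pile up at log(1) = 0,
# which is when a GLM (Chap. 10) is the better option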
Left-skewed data: This is less common. But if data are left-skewed and negative,
then −y is right-skewed and positive, in which case the transformations previously
discussed can be applied to −y. Maybe take the negative of them afterwards so they
remain monotonic increasing (if a value is bigger on the untransformed scale, it is
bigger after transformation too).
For example, ynew = − log(−y) takes negative, left-skewed values of y and tries
to make them more symmetric.
library(ecostats)
data(globalPlants)
globalPlants$height
Construct a histogram. Do you see any boundaries that could cause problems
for analysis? What sort of transformation would you suggest? Try out this
transformation on the data and see if you think it has fixed the problem.
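If you want to check your thinking, here is one possible sketch (it assumes, as is the case for plant heights, that all values are positive, so a log transformation is an option—it is not the only reasonable answer):

library(ecostats)
data(globalPlants)
hist(globalPlants$height)          # pushed up against zero, long right tail
hist(log(globalPlants$height))     # boundary removed; much more symmetric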
In this chapter we will revise two of the most commonly used statistical tools—
the two-sample t-test and simple linear regression. Then we will see a remarkable
equivalence—that these are actually exactly the same thing! This is a very important
result; it will give us some intuition for how we can write most of the statistical
techniques you have previously learnt as special cases of the same linear model.
The two-sample t-test is perhaps the most widely used method of making statistical
inferences. I reckon it has had a hand in maybe half of the significant scientific
advances of the twentieth century, maybe more. So it will be quite instructive for us
to revise some of its key properties and some issues that arise in its application.
The two-sample t-statistic is

t = (ȳControl − ȳNicotine) / standard error of (ȳControl − ȳNicotine)
which (if H0 is true) comes from a t distribution with degrees of freedom n − 2,
where n is the total sample size (n = nControl + nNicotine ).
For the guinea pig data, this test statistic works out to be −2.67. We would
like to know whether this statistic is unusually large compared to the sorts of
values you would expect if there were actually no effect of treatment. What do
you think?
The strategy for constructing any t-statistic, to test for evidence against the
claim that a parameter is zero, is to divide the quantity of interest by its stan-
dard error and to see whether this standardised or t-statistic is unusually far
from zero. (As a rough rule of thumb, values beyond 2 or −2 start to get sus-
piciously far from zero.) This type of statistic is commonly used and is known
as a Wald test. (The test is named after Abraham Wald, who in math geneal-
ogy terms http://genealogy.math.ndsu.nodak.edu/ happens to be my great-
great-great-grandfather!)
In a two-sample t-test we are looking for evidence that the true (population)
means differ across two groups. The quantity of primary interest here is the mean
difference, and we want to know if it is non-zero. So as in Exercise 2.1, we construct
a two-sample t-statistic by dividing the sample mean difference by its standard error
and considering whether this statistic is unusually far from zero.
A confidence interval for the true mean difference is returned automatically in
a t-test using R, and it is calculated in the usual way (but typically using a critical
value from the t-distribution, not using 2 as a rough approximation, which would get
things a bit wrong if sample size is small).
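To see what “the usual way” looks like, here is a sketch that computes such an interval by hand; the estimate, standard error, and sample size below are made-up numbers, not the guinea pig results:

est = 8.5                          # sample mean difference (hypothetical)
se  = 3.2                          # its standard error (hypothetical)
n   = 20                           # total sample size across the two groups
tcrit = qt(0.975, df = n - 2)      # t critical value replaces the rough "2"
c(lower = est - tcrit * se, upper = est + tcrit * se)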
Code Box 2.1: A Two-Sample t-Test of the Data from the Guinea Pig
Experiment
> library(ecostats)
> data(guineapig)
> t.test(errors~treatment, data=guineapig, var.equal=TRUE,
alternative="less")
The model assumed by the two-sample t-test is yij ∼ N(μi, σ2): observation j in group i is normally distributed around its group mean μi, with the same variance σ2 in both groups. The sample means for the two treatment groups are:

guineapig$treatment: C
[1] 12.30357
---------------------------------------------------
guineapig$treatment: N
[1] 21.46858

Do you think assumptions are reasonable?
[Normal quantile plots of the Nicotine and Control observations: sample quantiles plotted against theoretical quantiles]
Fig. 2.1: Normal quantile plots of the treatment and control data from the smoking-
during-pregnancy study of Exercise 2.1
What happens if your data aren’t normally distributed? Or the variances aren’t equal?
Recall that there are two main things to consider:
Validity Am I actually going to exceed my 0.05 significance level only 5% of the
time when the null is true? (Or, equivalently, is a 95% confidence interval really
going to capture the true parameter 95% of the time?)
Efficiency Is my test going to have good power at detecting a change in mean?
Or, equivalently, is a confidence interval going to be narrow, to give me a precise
estimate of the true mean difference?
Recall also that most analyses can be broken down into four different types of
assumption: independence, mean model, variance model, and distributional assump-
tions. What does the t-test say about each of these components? And how much do
the different assumptions really matter?
Independence: Observations are assumed independent within and across samples.
As previously, if your data are not independent but you have assumed your data are
independent, then you are stuffed. Your methods will probably not be valid, most
typically with standard errors being too small and false declarations of significance.
In an experiment, randomly assigning subjects to treatment groups guarantees this
assumption will be satisfied.
Distribution assumption: We assume data are normally distributed. As previously,
distributional assumptions (in this case, the assumption that the data are normally
distributed) rarely matter for the validity of our method, thanks to the central limit
theorem. Only if sample size is very small do we need to take this assumption
seriously. However, if the data are strongly skewed or have outliers, then the t-test
will not be very efficient, so we need to check for skew and outliers—see the key
point from Sect. 1.6 for rough guidelines.
Mean model: The only assumption here is that the mean is the same across
observations in each sample. This is satisfied if we randomly sample observations
from the same population, or in an experiment, if we randomly assign subjects to
treatment groups.
Variance model: We assume constant variance across samples, also known as
homoscedasticity. Apart from independence, this is the most important assumption.
How robust the t-test is to violations of this assumption will depend on the relative
sample sizes in the two groups. If your sample sizes in the two treatment groups are
equal, n1 = n2 or nearly equal (on a proportional scale, e.g. they only differ from
each other by 10% or less), then the validity of the two-sample t-test will be quite
robust to violations of the equal variance assumption. If your sample sizes are quite
different, e.g. one is four times the size of the other (n1 = 4n2 ), then violations of
the equal variance assumption will spell trouble. What happens is that the estimate
of the standard error of the mean difference (used on the denominator of a t-statistic
and in the margin of error of a confidence interval for mean difference) is either over-
or underestimated, depending on whether the group with the larger sample size has
a larger (or smaller) variance. This leads to an overly conservative (or overly liberal)
test and confidence intervals that are too wide (or too short).
In practice, you rarely get substantial differences in variance across samples unless
they are also accompanied by skewed data or outliers. Transformation is the first thing
to consider to fix the problem.
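As a sketch of what that might look like for the guinea pig data of Code Box 2.1 (the log(y + 1) transformation here is purely illustrative—whether it is a sensible choice depends on the data):

library(ecostats)
data(guineapig)
# re-run the two-sample t-test on transformed counts of errors
t.test(log(errors + 1) ~ treatment, data = guineapig, var.equal = TRUE)
# compare quantile plots of each group before and after transformation
# to judge whether skewness and any variance difference have been reduced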
2.2 Simple Linear Regression
Linear regression, when there is only one x variable, is often referred to as simple
linear regression. Let’s quickly review what it’s about.
For the water quality example of Exercise 2.2, we study the relationship between
two quantitative variables, so the obvious graphical summary is a scatterplot, as in
Fig. 2.2.
A regression line was added to the plot in Fig. 2.2. This regression line was fitted
using least squares. That is, we found the line that gave the least value of the sum
of squared errors from the line, as measured in the vertical direction. (Why just the
vertical direction? Because this is the best strategy if your goal is to predict y using
the observed x.)
A nice applet illustrating how this works can be found (at the time of writing)
at https://phet.colorado.edu/sims/html/least-squares-regression/latest/least-squares-regression_
en.html
Fig. 2.2: Scatterplot of water quality (IBI) against catchment area (km2 ) for data
of Exercise 2.2. A simple linear regression line was added to the plot. This line
was fitted using least squares, i.e. minimising the sum of squared vertical distance
between each point and the line (as indicated by arrows)
But if you Google “least squares regression applet”, something useful should
come up that you can interact with to get a sense for how regression works.
Simple linear regression is a straight line, taking the algebraic form
μy = β0 + β1 x
where β0 is the y-intercept and β1 is the slope of the line. In R, the summary of a fitted
lm object (as in Code Box 2.3) or something similar gives the estimated values of these
parameters (or “coefficients”), and estimated standard errors will be given in a column
of the table labelled something like “Standard error” or “SE”.
Code Box 2.3: Fitting a Linear Regression to the Water Quality Data
> data(waterQuality)
> fit_qual=lm(quality~logCatchment, data = waterQuality)
> summary(fit_qual)
Call:
lm(formula = quality ~ logCatchment, data = waterQuality)
Residuals:
Min 1Q Median 3Q Max
-7.891 -3.354 -1.406 4.102 11.588
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 74.266 7.071 10.502 1.38e-08 ***
logCatchment -11.042 1.780 -6.204 1.26e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Notice there is a t-statistic and P-value for each coefficient in the foregoing output.
These test the null hypothesis that the true value of the parameter is 0.
Why 0?
For the y-intercept—no reason; usually it would be a stupid idea!
For the slope—this tests for an association between y and x. If you are interested in
testing for an association between y and x, this is the P-value to pay attention to.
Recall that the regression line is for predicting y from x. If there were no as-
sociation between y and x, then the (true) predicted value of y would be the same
irrespective of the value of x. This makes a straight horizontal line, as in Fig. 2.4b.
Fig. 2.4: The slope of the regression line, β1 , is the most important parameter; it tells
us how y relates to x: (a) if β1 is positive, y is expected to increase as x increases;
(b) if β1 is zero, there is no relationship between y and x; (c) if β1 is negative, y is
expected to decrease as x increases
For inference about simple linear regression, we make the following assumptions:
1. The observed y-values are independent, after accounting for x.
This assumption can often be guaranteed to be satisfied—how???
2. The y-values are normally distributed with constant variance.
yi ∼ N (μi, σ 2 )
Normality usually doesn’t matter (due to the central limit theorem) except for small
samples/strongly skewed data/outliers. The discussion in Sect. 2.1.2 applies here.
You can check normality on a normal quantile plot of residuals (as in Code
Box 2.4).
Constant variance can be checked using a residuals vs fits plot to see if there is
any fan-shape pattern—as in the following section.
3. There is a straight-line relationship between mean of y and x
μi = β0 + β1 xi
A residuals vs fits plot (y − μ̂y vs μ̂y as in Fig. 2.5b) can be used to check whether data
are linearly related and whether the variance is constant for different values of x. The
original data in Fig. 2.5a are squished (linearly transformed and, sometimes, flipped
around) in a residual plot so that the linear trend is removed, to focus our attention
on the errors around the line. A residuals vs fits plot typically has a horizontal line
at a residual of zero, which represents the original regression line, since any point
that fell exactly on the regression line would have had a residual of zero. Residuals
with positive values correspond to points above the regression line; residuals with
negative values correspond to points below the regression line.
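This plot is easy to construct directly from a fitted model object; the following sketch does so for the water quality fit of Code Box 2.3 (it is essentially what plot(fit_qual, which = 1) produces, without the added smoother):

library(ecostats)
data(waterQuality)
fit_qual = lm(quality ~ logCatchment, data = waterQuality)
plot(fitted(fit_qual), residuals(fit_qual),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)   # horizontal line representing the original regression line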
There should be no pattern on a residual plot—if there is, then the assumptions
made previously are violated, and we cannot make inferences about the true regres-
sion line using simple linear regression. Which assumption is violated depends on
what sort of pattern you see.
The two most common diagnostic plots used in regression are a normal quantile
plot of residuals and a residuals vs fits plot (check for no pattern), which can be
constructed in R using the plot function on a fitted regression object (Code Box 2.4).
Do you think the regression assumptions seem reasonable for this dataset?
Fig. 2.5: A residuals vs fits plot for water quality data. (a) Original data, with linear
regression line added in red. (b) Residuals (y − μ̂y ) plotted against fitted values ( μ̂y ),
which can be understood as a transformation of the data so that the regression line
is now horizontal and passes through zero. Because fitted values decreased when
catchment area increased, the plot against fitted values is flipped horizontally, with
the leftmost observation from (a) appearing as the rightmost observation in (b). We
can check regression assumptions using this plot by checking for a pattern
Key Point
A simple linear regression assumes independence, normality, constant vari-
ance, and a straight-line relationship.
There are two scary patterns to watch out for in a residuals vs fits plot:
• A U-shaped pattern, suggesting a problem with your straight-line assump-
tion.
• A fan-shaped pattern, suggesting a problem with the constant variance
assumption.
To produce a residuals vs fits plot and a normal quantile plot of residuals, you just take a
fitted regression object (like fit_qual, produced in Code Box 2.3) and apply the plot
function:
> plot(fit_qual, which=1:2)
The which argument lets you choose which plot to construct (1 = residuals vs fits,
2 = normal quantile plot).
[Output of plot(fit_qual, which=1:2): (a) residuals vs fitted values, (b) normal quantile plot of standardized residuals; followed by a second pair of residual diagnostic panels (residuals vs fitted values and residuals vs theoretical quantiles)]
If there is a U-shaped pattern in your residuals vs fits plot, you should not be
fitting a straight line to the data, as in Fig. 2.6. A U shape in the residual plot means
that points tend to start off above the regression line, then they mostly move below
the line, then they finish off above it again. That is, the relationship is non-linear, and
you should not assume it is linear. Your next step is either to consider transforming
your data or to look into how to fit a non-linear model. You could also see this type
of pattern but upside-down (n-shape).

Fig. 2.6: A problem to look out for in residual plots. (a) Scatterplot showing non-
linear function, a curve increasing in slope as x increases. (b) The residuals vs fits
plot suggests a problem by having a U shape, residuals tending to take positive values
for small fitted values, negative values in the middle, and positive residuals for large
fitted values. We wanted no pattern in residuals as we moved from left to right, so
we have a problem. Clearly we should not be fitting a straight line to the data
A common example of where we would expect a non-linear relationship is when
we are looking at some measure of habitat suitability as a function of an environ-
mental variable (Austin, 2002, for example), e.g. species abundance vs temperature.
All species have a range of thermal tolerance and an “optimal” range that they are
most likely to be found in, with suitability dropping (sometimes quite abruptly) as
you move away from the optimum (when it becomes too warm or too cold). Hence,
linear regression is unlikely to do much good when modelling some measure of
species suitability (e.g. abundance) as a function of environmental variables, at least
if measured across a broad enough range of the environment. In this sort of situation
we should know not to try a linear regression in the first place.
Fig. 2.7: Another problem to look out for in residual plots. (a) Scatterplot of data
with variance increasing as predicted value increases. (b) The residuals vs fits plot
with fan shape, residuals tending to take values closer to zero for small fitted values
(left), and residuals tending to take values farther from zero for large fitted values
(right). We wanted no pattern in residuals as we moved from left-to-right, i.e. similar
spread in residuals at all fitted values, so we have a problem. Clearly we should not
be assuming equal variance for these data
The other main type of pattern to watch out for is a fan-shaped pattern, meaning
that the variance changes with x, as in Fig. 2.7. The equal variance assumption
means that we should expect similar amounts of spread around the fitted line at each
point along the line. So you have a problem with the equal variance assumption if
your residuals vs fits plot suggests a smaller amount of spread around the line at
some points compared to others. Typically, this shows up as residuals fanning out as
fitted values increase, from left to right, as in Fig. 2.7. A common example of this is
when studying abundance—the variance of abundance tends to increase as the mean
increases (because for small means, data are squished up against the boundary of
zero). In principle, residuals could fan inwards, but this is less common.
An influential observation is one that has an unusual x-value. These are often easily
seen in residuals vs fits plots as outlying values on the x-axis (outlying fitted values),
although it gets more complicated for multiple linear regression (coming in Chap. 3).
Influential observations are dangerous because they have undue influence on the
fitted line—pretty much the whole fit can come down to the location of one point.
Once detected, a simple way to determine whether the whole story comes down to
these influential values is to remove them and see if this changes anything. If little
changes, there is little to worry about.

Fig. 2.8: Water quality example—testing the effect of the most influential value, the
value with the largest catchment area. Compare (a) the full dataset to (b) the dataset
with the most influential value removed. This had little effect on the regression fit
To check for high-influence points, always plot your x variables before you use
them in analyses—you could use a histogram or boxplot, but if you have many x
variables, try a pairs plot (a scatterplot matrix). You are looking for outliers and
strongly skewed x variables.
High-influence points can often be avoided by transformation of the x variable,
just as outliers can often be avoided by transformation. This doesn’t always work,
though.
Notice in Fig. 2.8 that removal of the most extreme x-value (smallest catchment
area) had little influence on the fitted line. This is largely because, although that point
was quite influential, it lay close to the main trend, so it didn’t really change things
compared to the information gained from the other data points.
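As a sketch of this remove-and-refit check for the water quality data (which point gets dropped is worked out from the data rather than assumed; missing values are assumed absent):

library(ecostats)
data(waterQuality)
fit_all = lm(quality ~ logCatchment, data = waterQuality)
# find the observation with the most extreme value of the predictor
i_extreme = which.max(abs(waterQuality$logCatchment - mean(waterQuality$logCatchment)))
fit_drop = lm(quality ~ logCatchment, data = waterQuality[-i_extreme, ])
coef(fit_all)
coef(fit_drop)   # little change in the coefficients => little to worry about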
A nice way to summarise the strength of regression is the R2 value, the proportion
of variance in the y variable that has been explained by regression against x. In the
water quality example output of Code Box 2.3, there was a line that read
Multiple R-squared: 0.118, Adjusted R-squared: 0.06898
so 11.8% of variance in water quality can be explained by catchment area.
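To make the definition of R² concrete, it can be computed by hand from the fitted model; a sketch using fit_qual from Code Box 2.3:

library(ecostats)
data(waterQuality)
fit_qual = lm(quality ~ logCatchment, data = waterQuality)
ss_res = sum(residuals(fit_qual)^2)                            # unexplained variation
ss_tot = sum((waterQuality$quality - mean(waterQuality$quality))^2)
1 - ss_res / ss_tot              # proportion of variance explained (R-squared)
summary(fit_qual)$r.squared      # same value, straight from summary()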
2.3 Equivalence of t-Test and Linear Regression
This section describes one of the most important results in this book, so it is worth
reading through it carefully. It is also one of the more difficult results. . . .
Consider again the output for the two-sample t-test of the smoking-during-
pregnancy data. The output is repeated in Code Box 2.5, but this time using a
two-sided test.
Key Point
A two-sample t-test can be thought of as a special case of linear regression,
where the predictor variable only takes two values. Thus, you can fit a t-test as
a regression and check the same assumptions in the same way as you would
for linear regression.
A linear regression of errors against treatment can be constructed from the guineapig
dataset in the ecostats package. The results are in Code Box 2.6. Can you see any
similarities with Code Box 2.5?
Residuals:
Min 1Q Median 3Q Max
-21.30 -11.57 -5.35 11.85 44.70
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.400 5.533 4.229 0.000504 ***
treatmentN 20.900 7.825 2.671 0.015581 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Consider estimating a single mean by least squares, i.e. choosing the value a that
minimises the sum of squared errors SS(a) = Σ_{i=1}^{n} (yi − a)². It turns out that this
least-squares estimate is the sample mean, ȳ = (1/n) Σ_{i=1}^{n} yi, as shown below.
Notice that as a gets very large or very small, the sum of squared errors
SS(a) goes to infinity. Thus, the least-squares estimator must be a stationary
point on this function. Differentiating SS(a) with respect to a yields

d/da SS(a) = −2 Σ_{i=1}^{n} (yi − a)

Setting this to zero at the stationary point â:

0 = −2 Σ_{i=1}^{n} (yi − â)
nâ = Σ_{i=1}^{n} yi
â = (1/n) Σ_{i=1}^{n} yi = ȳ

Since â = ȳ is the only stationary point on SS(a), it must be the minimiser, i.e.
the least-squares estimate.
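A quick numerical check of this result (with made-up numbers):

y = c(3.2, 5.1, 4.8, 6.0, 2.9)
SS = function(a) sum((y - a)^2)             # sum of squared errors about a
optimise(SS, interval = range(y))$minimum   # numerical minimiser of SS(a)
mean(y)                                     # ...agrees with the sample mean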
What does this mean for the slope and elevation? In our regression analysis, #
of errors is the y-variable and treatment is the x variable. Now there are only two
values on the x-axis (Control and Nicotine). R by default will put Control values at
x = 0 and Nicotine values at x = 1. As in Fig. 2.9, this means the following:
• The y-intercept is the Control mean.
• The slope is the mean difference between Nicotine and Control.
Further, because the parameter estimator of the slope is a quantity we saw earlier,
the sample mean difference, the standard errors will be the same also, hence t- and
P-values too. This is why key parts of the output match up between Code Boxes 2.5
and 2.6.
We can write any two-sample t-test as a linear regression. Recall that a simple linear
regression model for the mean has the form
μy = β0 + β1 x
Fig. 2.9: Equivalence of a two-sample t-test (as in Exercise 2.1) and linear regression:
a least-squares line will join each mean, so if the difference in x between the two
groups is one, the slope of the least-squares line will be the difference between
sample means (and if one group, such as the Control, has x = 0, the sample mean of
that group becomes the estimated y-intercept)
If we let x = 0 for Control and x = 1 for Nicotine, then the two-sample t-test is
exactly the regression model with

β0 = μControl
β1 = μNicotine − μControl

The two-sample t-test assumes yij ∼ N(μi, σ2); the regression model assumes
yi ∼ N(μi, σ2) with μi = β0 + β1 xi.
These are the same set of assumptions. They can be checked the same way. Well
OK, regression had a third assumption about linearity, but this turns out not to be
important for the two-sample t-test—we only have two points on the x-axis and can
join them however we want, straight line, curve, . . ..
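If you want to see this coding explicitly, a sketch like the following (using the guineapig data from Code Box 2.1) builds the 0/1 predictor by hand; lm does the same thing automatically when treatment is a factor:

library(ecostats)
data(guineapig)
guineapig$x = as.numeric(guineapig$treatment == "N")   # 0 = Control, 1 = Nicotine
fit01 = lm(errors ~ x, data = guineapig)
coef(fit01)   # intercept = Control mean; slope = Nicotine minus Control mean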
2.3.2 So What?
So now you don’t need to worry about two-sample t-tests anymore—just think of them
as linear regression with a binary predictor variable. One less thing to remember.
Distribution of the response is what matters: The equivalence between t-tests and
linear regression helps us realise that the distribution of our predictor variable x
doesn’t really matter when choosing an analysis method—it is really the response
variable y we have to focus our attention on. Recall that in Fig. 1.2 we had the
following rules for when to use two-sample t vs regression:
Two-sample t: when y is quantitative and x is binary
Linear regression: when y is quantitative and x is quantitative
The fact that two-sample t = linear regression means that it doesn’t matter what
type of variable x is. We can use the following rule instead:
If y is quantitative, try a linear regression (“linear model”).
Why don’t we have to worry about the distribution of the predictor when choosing
a method of analysis? Because regression conditions on x. A regression model says:
“If x was this value, what would we expect y to be?” The thing we are treating as
random is y. The x-value has been given in the question; we don’t need to treat it as
random (even if it was when sampling).
In the next chapter we will learn about multiple regression, i.e. regression with
multiple x variables. After that you can think of pretty much any (fixed effects)
sampling design under a single framework—they are all just linear regressions. This
is a major simplification for us because then we can fit most of the analysis methods
we need to use in experimental work using the same function, checking the same
assumptions in the same way, and interpreting results in the same way. Really, once
you understand that t-tests (and ANOVA-type problems) can be understood as types
of linear regression, then you can do almost anything you want as a linear model,
i.e. as a linear regression with multiple x variables (which may be categorical or
quantitative predictors).
Key Point
• Multiple regression is pretty much the same as simple linear regression,
except you have more than one predictor variable. But effects should be
interpreted as conditional not marginal, and multi-collinearity should be
checked (if important).
• ANOVA can be understood as a special case of multiple regression.
In this chapter we will discuss what changes in a linear regression when you have
more than one predictor variable, and we will consider the special case where you
have one predictor, which is categorical and takes more than two values—commonly
known as analysis of variance (ANOVA).
3.1 Multiple Regression
Multiple linear regression is a special name for linear models that have more than
one x variable that we want to use to predict y.
It’s an extension of simple linear regression to two or more x variables.
The multiple regression model for two predictor variables x1 and x2 is
y ∼ N (μy, σ 2 )
μy = β0 + β1 x1 + β2 x2
Geometrically, this fits a plane in three dimensions, not a line (in two dimensions).
The model for the mean can be written in vector notation as
μy = β0 + xT β
where β and x are vectors. This notation is a simple way to write a linear model for
any number of predictor variables; it just means “multiply the corresponding values
in x and β, then add them up”.
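As a tiny illustration of what the vector notation means (the numbers here are made up):

beta0 = 1.2                        # intercept (hypothetical)
beta  = c(0.5, -2.1)               # slopes for two predictors (hypothetical)
x     = c(3.0, 0.7)                # predictor values for one observation
mu_y  = beta0 + sum(x * beta)      # "multiply corresponding values, then add them up"
mu_y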
When making inferences about a multiple regression, we make the following as-
sumptions:
1. The observed y-values are independent, after conditioning on x (that is, all infor-
mation contained in the data about what y you might observe is found in x, and
y-values for other observations wouldn’t help predict the current y-value at all).
2. The y-values after conditioning on x are normally distributed with constant
variance
y ∼ N (μy, σ 2 )
(this is the assumed distribution of y under repeated sampling if all predictors are
kept at a fixed value).
3. There is a straight-line relationship between the mean of y and each x variable
μy = β0 + xT β
Look familiar? Same assumptions as before, checked the same way as before (resid-
uals vs fits plots, normal quantile plots).
It can also be useful to plot residuals against each x variable to check that each
predictor is linearly related to μy .
Key Point
Multiple regression makes the same assumptions we saw earlier—
independence, normality, constant variance (all these assumptions are con-
ditional on x), and linearity.
So we can mind our Ps and Qs the same way as before—normal quantile
plot of residuals (but don’t be fussy about normality, especially if you have a
large sample) and a residuals vs fits plot. Best to also check residuals against
each predictor.
You fit multiple regression models the same way as simple linear regression. The
output looks pretty much the same too. For example, compare the code and output
from simple linear regression (Code Box 3.1) and multiple regression (Code Box 3.2).
If you are interested in seeing the mathematics of how least-squares regressions are
fitted, it isn’t too hard to follow; try out Maths Box 3.1.
Code Box 3.1: Simple Linear Regression of Global Plant Height Data—
Predicting Height as a Function of Latitude Only
> library(ecostats)
> data(globalPlants)
> ft_heightLat=lm(height~lat, data=globalPlants)
> summary(ft_heightLat)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.00815 2.61957 6.493 1.66e-09 ***
lat -0.20759 0.06818 -3.045 0.00282 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Code Box 3.2: Multiple Linear Regression of Global Plant Height Data
on R—Predicting Plant Height as a Function of Annual Precipitation and
Latitude
Note the code is almost the same as for simple linear regression—just add an extra predictor
variable!
> ft_heightRainLat=lm(height~rain+lat, data=globalPlants)
> summary(ft_heightRainLat)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.223135 4.317148 1.210 0.22856
rain 0.005503 0.001637 3.363 0.00102 **
lat -0.052507 0.080197 -0.655 0.51381
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
In matrix notation, least squares chooses β to minimise SS(β) = (y − Xβ)⊤(y − Xβ),
where X is the design matrix of predictor values. Differentiating,

∂SS(β)/∂β = −2X⊤y + 2X⊤Xβ

where differentiation has been applied element-wise. Differentiating vectors
has its subtleties, and there is a bit of thinking behind the above line. From here,
however, it is quite straightforward to find the value β̂ for which ∂SS(β)/∂β = 0:

0 = −2X⊤y + 2X⊤Xβ̂
β̂ = (X⊤X)−1X⊤y

There is only one stationary point, and it will be a global maximum or minimum
of SS(β). A little more work (e.g. looking at the second derivative) shows that
it is a minimum, hence β̂ = (X⊤X)−1X⊤y is the least-squares estimate of β.
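If you'd like to check this formula numerically, the following sketch computes (X⊤X)−1X⊤y directly and compares it with lm; it uses the globalPlants data of Code Box 3.2 and drops rows with missing values so the two calculations use the same observations:

library(ecostats)
data(globalPlants)
dat = na.omit(globalPlants[, c("height", "rain", "lat")])  # guard against missing values
X = model.matrix(~ rain + lat, data = dat)                 # design matrix with intercept
y = dat$height
beta_hat = solve(t(X) %*% X, t(X) %*% y)                   # (X'X)^{-1} X'y
beta_hat
coef(lm(height ~ rain + lat, data = dat))                  # same values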
The maths and the computation are pretty much the same as for simple linear
regression. But there are a few new ideas to look out for when you have multiple
x-values, as below.
Key Point
Regression with multiple predictors is pretty much the same as regression with
one predictor, except for the following new ideas that you should keep in mind:
1. Interpret coefficients as conditional effects (not marginal).
2. You can plot partial residuals to visualise conditional effects.
3. You can test hypotheses about multiple slope parameters at once.
4. Multi-collinearity makes estimates of coefficients inefficient.
[Fig. 3.1 panels: (a) Height plotted against Latitude; (b) Height|rain plotted against Latitude|rain]
Fig. 3.1: (a) Scatterplot of height against latitude, and (b) a partial residual plot, after
controlling for the effect of rain, for the global plant height data of Exercise 3.1. The
partial residual plot helps us visualise the effects of latitude after controlling for the
effects of rainfall. Any problems with assumptions?
2. Partial residual plots: Partial residual plots let us look at the effect of a variable
after controlling for another. (“Added variable plot”, avPlots from car package.)
This is a good way of visualising the conditional effect of one variable given another.
To get a partial residual plot of height against latitude controlling for rain
(Fig. 3.1b), we take the residuals from a regression of height against rain and plot
them against the residuals from a regression of latitude against rain, to remove any
linear trend with rain from either variable. If latitude had an association with height
not explained by rain, then there would still be an association on this partial residual
plot, and the slope of the (least-squares) trend on the partial residual plot would
equal the multiple regression slope.
Partial residual plots can also be handy as an assumption check, just like scat-
terplots. Looking at Fig. 3.1b it is clear that there is a problem, with lots of points
packed in close together below the line and spread out above the line. Height is
right-skewed (because it is being “pushed up” against the boundary of zero). Can
you think of a transformation that might help here?
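The construction just described can be coded in a few lines; this sketch reproduces a partial residual (added variable) plot by hand for the global plant height data (avPlots in the car package automates this):

library(ecostats)
data(globalPlants)
dat = na.omit(globalPlants[, c("height", "rain", "lat")])
res_y = residuals(lm(height ~ rain, data = dat))   # height with rain effect removed
res_x = residuals(lm(lat ~ rain, data = dat))      # latitude with rain effect removed
plot(res_x, res_y, xlab = "Latitude | rain", ylab = "Height | rain")
abline(lm(res_y ~ res_x))
coef(lm(res_y ~ res_x))[2]   # equals the multiple regression slope for lat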
3. Testing of multiple slope parameters: Notice the bottom line of the output in
Code Box 3.2:
F-statistic: 14.43 on 2 and 175 DF, p-value: 1.579e-06
This tests the null hypothesis that y is unrelated to any x variables.
What do you conclude?
What if we wanted to know the answer to the following question:
Does latitude explain the effect of climate on plant height?
We would want a single test of the hypothesis that there was no effect of rain or
temp (i.e. H0 : β temp = β rain = 0), while accounting for the fact that there was a
relationship with latitude. To do this, we would fit both models and compare their
fits. We could do this in R using the anova function (Code Box 3.4). How would you
interpret these results?
> ft_Lat=lm(height~lat,data=globalPlants)
> ft_LatClim=lm(height~lat+rain+temp,data=globalPlants)
> anova(ft_Lat,ft_LatClim)
Analysis of Variance Table
The foregoing approach only works for nested models—when one model includes
all the terms in the other model plus some extra ones.
For instance, you can’t use the anova function to compare two models when latitude
appears only in the second model; latitude would need to appear in the first model as
well for the second model to be nested in the first model.
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.907914 4.607746 1.065 0.289
rain 0.004911 0.003370 1.457 0.148
rain.wetm 0.004686 0.023283 0.201 0.841
lat -0.046927 0.085140 -0.551 0.582
...
Note that standard errors are larger and suddenly everything is non-significant.
The sum of the variances of these predictions can be calculated using the trace
function (tr), which has the special property that tr(AB) = tr(BA):
tr(var(Xβ̂)) = tr(σ2 X(X⊤X)−1X⊤) = σ2 tr(X⊤X(X⊤X)−1) = σ2 tr(I)
where I is known as the identity matrix. If there are p parameters in the model,
then tr(I) = p. The main point to notice is that the sum of the variances
of predictions is not a function of X and, hence, is not affected by multi-
collinearity. More generally, we can conclude (by similar methods) that multi-
collinearity affects inferences about individual coefficients but has little effect
on predictions (unless you extrapolate!).
Code Box 3.6: Computing Variance Inflation Factors to Check for Multi-
Collinearity
> library(car)
> vif(ft_heightRainLat)
rain lat
1.494158 1.494158
> vif(ft_climproblems)
rain rain.wetm lat
6.287835 7.031092 1.671396
Clearly adding rain.wetm to the model has done some damage (but to rain only, not so
much to lat).
Another way to see what is going on is to simply look at the correlation between
predictor variables, as in Code Box 3.7.
Code Box 3.7: Correlations and Pairwise Scatterplots to Look for Multi-
Collinearity
> X = data.frame(globalPlants$lat,globalPlants$rain,globalPlants$rain.wetm)
> cor(X)
globalPlants.lat globalPlants.rain globalPlants.rain.wetm
globalPlants.lat 1.0000000 -0.5750884 -0.633621
globalPlants.rain -0.5750884 1.0000000 0.917008
globalPlants.rain.wetm -0.6336210 0.9170080 1.000000
> pairs(X)
[Output of pairs(X): a 3 × 3 scatterplot matrix of globalPlants.lat, globalPlants.rain, and globalPlants.rain.wetm]
We already looked at this question in Code Box 3.2, but recall that when we
looked at the residual plots (Fig. 3.1), it was clear that the height variable was
strongly right-skewed.
Transform height and rerun your analyses. Note that this changes the results.
Which set of results do you think is more correct, and why do you think results
changed in connection with the data transformation?
3.2 ANOVA
[Plot of epifaunal density (/gram seaweed) against isolation of patch (0, 2, or 10 m)]
Fig. 3.2: Epifauna counts in algal beds with different levels of isolation
Consider Exercise 3.4. The response variable in this case is the density of
epifauna—a quantitative variable, meaning we should be thinking of some vari-
ation on linear regression as a method of analysis. The predictor variable, distance
of isolation of an algal bed, takes one of three different values (0, 2, or 10 m). This
can be understood as a categorical variable with three levels. The analysis method
for this situation, which you have probably already seen, is conventionally referred
to as analysis of variance. But we won’t go into details because it is just another
example of a linear model. . . .
Recall that a two-sample t-test was just linear regression with a binary predictor
variable.
Multiple regression is an extension of simple linear regression, and ANOVA is a
special case of multiple regression (linear models).
Recall that a multiple regression equation with two predictors has the form

μy = β0 + x1 β1 + x2 β2

If we code the three isolation distances using two indicator variables—x1 = 1 for
observations at 2 m (and 0 otherwise), x2 = 1 for observations at 10 m (and 0
otherwise)—then the three group means are

μ0 = β0
μ2 = β0 + β1
μ10 = β0 + β2

y ∼ N(μy, σ2)
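You can see exactly this coding by asking R for the design matrix it would use; a sketch with the seaweed data (assuming Dist has been converted to a factor, as in Code Box 3.8):

library(ecostats)
data(seaweed)
seaweed$Dist = factor(seaweed$Dist)
head(model.matrix(~ Dist, data = seaweed))
# columns: (Intercept), Dist2 (x1: 1 if Dist == 2), Dist10 (x2: 1 if Dist == 10)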
The only difference from multiple regression assumptions is that we don’t have to
worry about any linearity assumption. (Because we only observe data at two points
on each x-axis, you can join them however you want.) That is, unless we have multiple
categorical variables and we want to assume additivity (as in Chap. 4).
How do we check the foregoing assumptions?
You can fit ANOVA the same way as linear regression in R, just make sure your x
variable is a factor before fitting the model, as in Code Box 3.8. Factor is just another
word for a predictor variable that is categorical. Its categories are often referred to
as factor levels.
Code Box 3.8: Analysis of Variance in R for the Seaweed Data of Exer-
cise 3.4 Using lm
> data(seaweed)
> seaweed$Dist = factor(seaweed$Dist)
> ft_seaweed=lm(Total~Dist,data=seaweed)
> anova(ft_seaweed)
Analysis of Variance Table
Response: Total
Df Sum Sq Mean Sq F value Pr(>F)
Dist 2 300.25 150.123 8.5596 0.0005902 ***
Residuals 54 947.08 17.539
Any evidence that density is related to distance of isolation?
One thing that has been done differently here, compared to previous multiple
regression examples, is that we have used the anova function to test for an effect
of Dist instead of using the summary function. The reason for this is that the Dist
term has more than one slope parameter in it (β1 and β2 as described previously),
and when we want to test for a Dist effect, we want to test simultaneously across
these two slope parameters. As discussed previously, the anova function is the way
to test across multiple parameters at once (for nested models).
Key Point
When you have lots of related hypotheses that you want to test, you should
correct for multiple testing, so that you can control the chance of falsely
declaring significance. A common example of this is when doing pairwise
comparisons in ANOVA. (Another example, coming in Chap. 11, is when
there are multiple response variables and you want to do separate tests of each
response variable.)
After using ANOVA to establish that there is an effect of treatment, the next question
is: What is the effect? In particular, which pairs of treatments are different from
which? If we try to answer this question using confint on the fitted linear model,
we don’t get what we want. Firstly, this will not give all pairwise comparisons: it
compares Isolation = 2 with Isolation = 0, and it compares 10 with 0, but it doesn’t
compare 10 with 2. Secondly, it doesn’t correct for the fact that we did more than
one test.
Code Box 3.9: Running confint on the Seaweed Data Doesn’t Give Us
What We Want
> confint(ft_seaweed)
2.5 % 97.5 %
(Intercept) 2.7785049 6.533423
Dist2 2.9211057 8.460686
Dist10 0.4107071 5.720963
This compares each of the “2” and “10” groups to “0”.
But:
• What about “2” vs “10”?
• When doing multiple tests we should correct for this in assessing significance.
Every time you do a hypothesis test and conclude you have significant evidence
against H0 when P < 0.05, you have a 5% chance of accidentally rejecting the null
hypothesis (assuming the null hypothesis is in fact true). This is called a Type I error.
This happens by definition, because the meaning of a P-value of 0.05 is that there
is a 5% chance of observing a test statistic this large by chance alone. If doing three
pairwise comparisons (2–0, 10–0, 10–2), then this would give about a 15% chance
of accidentally declaring significance.1 And the more groups being compared, the
greater the chance—with 10 different treatment groups you are almost guaranteed a
false positive!
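That claim is easy to check with a quick calculation, treating the tests as independent (only an approximation, as the footnote below points out):

n_groups = 10
n_pairs = choose(n_groups, 2)   # 45 pairwise comparisons among 10 groups
1 - 0.95^n_pairs                # chance of at least one false positive: about 0.90
1 - 0.95^3                      # with 3 comparisons: about 0.143, as in the footnote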
Tukey’s “Honestly Significant Differences” are more conservative to account for
this, so that across all comparisons, there is a 5% chance of accidentally declaring
significance. We will use the multcomp package in R because of its flexibility, but
another option is to use the TukeyHSD function.
1 Not exactly 15%, for a couple of reasons, one being that the chance of at least one false
significance from three independent tests is 1 − 0.95³ ≈ 14.3%. Another reason is that these tests
aren’t independent, so we shouldn’t be multiplying probabilities.
Code Box 3.10: Analysis of Variance of the Seaweed Data of Exercise 3.4
with Tukey’s Multiple Comparisons via the multcomp Package
> library(multcomp)
> contDist = mcp(Dist="Tukey") # telling R to compare on the Dist factor
> compDist = glht(ft_seaweed, linfct=contDist) # run multiple comparisons
> summary(compDist) # present a summary of the results
Simultaneous Tests for General Linear Hypotheses
Linear Hypotheses:
Estimate Std. Error t value Pr(>|t|)
2 - 0 == 0 5.691 1.382 4.119 <0.001 ***
10 - 0 == 0 3.066 1.324 2.315 0.0623 .
10 - 2 == 0 -2.625 1.382 -1.900 0.1483
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Adjusted p values reported -- single-step method)
> plot(compDist)
[plot(compDist): simultaneous confidence intervals for the pairwise comparisons 2−0, 10−0, and 10−2, plotted on the scale of the linear function]
Say we took the two-sample data of Sect. 2.1 and tried an ANOVA on it. After
all, we have a quantitative response variable (number of errors) and a categorical
predictor (treatment); it’s just that the predictor has only two possible levels (control
and treatment) rather than three or more. What would happen?
We would get exactly the same P-value as we get from a two-sample t-test
(for a two-sided test). The methods are mathematically equivalent. So this is yet
another reason why you don’t need to worry about two-sample t-tests any more—
alternative methods are available that are more general and, hence, able to be used
for this problem as well as a bunch of other ones you will come across. There’s
nothing wrong with using the t-test; you just don’t need to use it given the equivalent
alternatives that are available.
What sort of model(s) should be fitted to answer this question? Fit the models
and answer this question. Remember to mind your Ps and Qs!
In this chapter we will look at some other common fixed effects designs, all of which
can be understood as special cases of the linear model.
Key Point
There are lots of different methods designed for analysing data with a contin-
uous response that can be assumed to be normally distributed with constant
variance, e.g. factorial ANOVA, multiple regression, analysis of covariance
(ANCOVA), variations for paired data, and blocked designs. All these methods
are special cases of the linear model.
White visited 12 locations, counted the number of ravens he saw, shot his gun,
waited 10 min, and recounted the ravens. The results follow.
Before        0 0 0 0 0 2  1 0 0 3 5 0
After         2 1 4 1 0 5  0 1 0 3 5 2
After−Before  2 1 4 1 0 3 −1 1 0 0 0 2
We cannot use a two-sample t-test on before and after measurements because
we cannot assume observations are independent across samples. But we could
analyse the paired differences and check whether their mean is significantly
different from zero.
What assumptions are made in analysing the paired differences? Do they
seem like reasonable assumptions?
> library(ecostats)
> data(ravens)
> crowGun = ravens[ravens$treatment == 1,]
> t.test(crowGun$Before, crowGun$After, paired=TRUE, alternative=
"less")
Paired t-test
Another way to think of what is going on with paired data is that there is a third
variable—a blocking factor that pairs up the data. In the raven example, the blocking
factor is site. The problem with a two-sample t-test would be that it would ignore
this third variable.
Another way to analyse paired data is via a linear model, where you include the
blocking factor in the model to control for the pairing structure in the data. For the
raven example, we could fit a model to predict the number of ravens as a function of
time (before–after) and site, using a site categorical variable to control for site-to-site
variation in abundance. This is done in Code Box 4.2.
Code Box 4.2: Paired t-Test for Raven Data via Linear Model
> library(reshape2)
> crowLong = melt(crowGun,measure.vars = c("Before","After"),
variable.name="time",value.name="ravens")
> head(crowLong)
delta site treatment few.0..or.many..1..trees time ravens
1 2 pilgrim 1 1 Before 0
2 1 pacific 1 1 Before 0
3 4 uhl hil 1 1 Before 0
4 1 wolff r 1 1 Before 0
5 0 teton p 1 1 Before 0
6 3 glacier 1 1 Before 2
> ravenlm = lm(ravens~site+time,data=crowLong)
> anova(ravenlm)
Analysis of Variance Table
Response: ravens
Df Sum Sq Mean Sq F value Pr(>F)
site 11 55.458 5.0417 4.84 0.007294 **
time 1 7.042 7.0417 6.76 0.024694 *
Residuals 11 11.458 1.0417
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
How do these results compare to those of Code Box 4.1?
Comparing results with the paired t-test output of Code Box 4.1, you might notice
that the P-value from the test of significance of the time term (Code Box 4.2) is
exactly double the value for the paired t-test (Code Box 4.1). It turns out that these
methods are mathematically equivalent, but the ANOVA (Code Box 4.2) did a two-
tailed test while the paired t-test (Code Box 4.1) used a one-tailed test, hence the
P-value was twice as large. But aside from this, these are two alternative ways of
doing exactly the same analysis.
The foregoing result is handy for a few reasons. For example, it gives us a way
forward when we start thinking about more complicated data types for which we
can’t just analyse paired differences (Chaps. 10 and 11). It also gives us a way to
handle more complicated study designs, such as Exercise 4.2.
1. The observed y-values are independent given x—in other words, for Exercise 4.2,
after accounting for site and treatment.
2. The y-values are normally distributed with constant variance
yi j ∼ N (μi j , σ 2 )
μi j = βsitei + βtreatmentj
(“Additive effects”=Linearity)
Same as before, right? Check assumptions the same way.
Note that the independence of y is conditional on x: we don’t assume abundances
at each site and treatment are independent. We do assume that, beyond site and
treatment, there are no further sources of dependence between observations. As
before, this would be satisfied in an experiment that randomly allocated subjects to
treatment groups. For Crow’s data, randomly choosing locations to sample at would
help a lot with this assumption.
Code Box 4.3: A Linear Model for Blocked Design Given by Raven
Counts in Exercise 4.2
To analyse, we first subset to the three treatments of interest (1=gunshot, 2=air horn, 3=whistle):
> crowAfter = ravens[ravens$treatment <=3,]
> ft_crowAfter = lm(After~site+treatment,data=crowAfter)
> anova(ft_crowAfter)
Analysis of Variance Table
Response: After
Df Sum Sq Mean Sq F value Pr(>F)
site 11 28.667 2.6061 0.9269 0.5327
treatment 1 2.667 2.6667 0.9485 0.3402
Residuals 23 64.667 2.8116
Is there evidence of a treatment effect? What assumptions were made here, and how can
they be checked?
Note that when using anova in R, the order matters—to test for an ef-
fect of treatment, after accounting for a blocking effect of site, the formula
for the linear model has to be written ravens~site+treatment rather than
ravens~treatment+site. This can be understood as using Type I Sums of Squares.
For Type II Sums of Squares, i.e. for output that looks at the conditional effect of
each term in the model after adding all other terms, you can use the drop1 function.
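For example, a quick sketch (not one of the book’s numbered code boxes) of applying drop1 to the blocked-design model fitted in Code Box 4.3, so that each term is tested after adjusting for the other:

drop1(ft_crowAfter, test = "F")  # F-tests for site and treatment, each adjusted for the other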
Key Point
Order matters when doing an ANOVA in R using anova—make sure you
specify model terms in the correct order to answer the research question you
are interested in. Alternatively, use drop1 to look at what happens when each
term is left out of the model while keeping all remaining terms.
You can think of Exercise 4.2 as a blocked design, where for each site (“block”)
we get three measurements—one for each treatment. Paired designs can also be
understood as blocked, with a block size of two. Code Box 4.3 shows how to fit
such a model using R—it is pretty straightforward once you know about multiple
regression!
Another term for this sort of design is a randomised block design, where the
randomised part comes about when treatments are randomly allocated to observa-
tions within each block (e.g. the order in which gunshot and air horn treatments are
applied could be randomised). This randomisation would give us a leg up when it
comes to satisfying independence assumptions. For a more conventional example of
a randomised blocks design see
http://www.r-tutor.com/elementary-statistics/analysis-variance/randomized-block-design
The point of blocking is to control for known major sources of variability that are
not of direct interest to the research question (such as site-to-site variation).
By controlling for these extra sources of variation and having less unaccounted-for
sampling error, it is easier to see patterns in the data. Statistically speaking, we can
more efficiently estimate the quantity of interest, e.g. in Exercise 4.2 we can have
more power when testing for an effect of treatment.
The blocking factor is not of interest—so ignore its P-value in output!
This problem is similar to the randomised block design—we have a variable (in
the case of Exercise 4.3, Wmass) that we know is important but not of primary interest,
i.e. we only want to control for this “covariate”. This is like site in the raven example, except that here the covariate is quantitative rather than categorical. But that’s no big deal for us—it can still be handled using a linear model. The code for
analysis doesn’t need to change from what was used in a randomised block design
(see for example Code Box 4.5).
data(seaweed)
seaweed$Dist = factor(seaweed$Dist)
plot(Total~Wmass, data=seaweed, col=Dist,
xlab="Wet Mass [log scale]",ylab="Density (per gram) [log scale]")
legend("topright",levels(seaweed$Dist),col=1:3,pch=1)
[Plot produced by the preceding code: density per gram against wet mass, both on log scales, with points coloured by distance of isolation (0, 2, 10).]
Code Box 4.5: Analysis of Covariance for Seaweed Data of Exercise 4.3
Response: log(Total)
Df Sum Sq Mean Sq F value Pr(>F)
logWmass 1 7.7216 7.7216 35.7165 1.975e-07 ***
Dist 2 2.1415 1.0708 4.9528 0.01067 *
Residuals 53 11.4582 0.2162
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(The log-transform on Total was chosen based on exploratory plots along the lines of Code
Box 4.4. It was also applied to Wmass because total abundance may be proportional to wet
mass, so it should be included in the model with the transformation.)
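The fitting call itself is not reproduced above; a sketch consistent with the ANOVA output, and with the object name lmMassDist used later in Code Box 4.7, would look something like:

lmMassDist = lm(log(Total) ~ log(Wmass) + Dist, data = seaweed)  # covariate first, then the factor
anova(lmMassDist)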
We are still using a linear model, so we still have linear model assumptions:
1. The observed y-values are independent, conditional on x.
2. The y-values are normally distributed with constant variance
y_ij ∼ N(μ_ij, σ²)
3. Linearity—the effect of the covariate on the mean of y is linear, and the effect of
factors is additive. For example, for the seaweed data:
μ_ij = β_Dist_i + x_j β_Wmass
Recall again that order matters when using the anova function in R (except for
perfectly balanced designs). If the categorical predictor from an ANCOVA were
listed first in the model, rather than the quantitative predictor, different results would
be obtained (Code Box 4.6) with a different interpretation.
Response: logTot
Df Sum Sq Mean Sq F value Pr(>F)
Dist 2 4.8786 2.4393 11.283 8.273e-05 ***
logWmass 1 4.9845 4.9845 23.056 1.329e-05 ***
Residuals 53 11.4582 0.2162
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
We have a very significant effect of wet mass, after controlling for the effects of distance of
isolation.
The anova function in R uses Type I sums of squares—it sequentially adds terms to the model and tests whether each added term explains additional variation compared to those already in the model. So, for example, an anova call on a model with formula lm(logTot~Dist+log(Wmass),data=seaweed), as in Code Box 4.6, will fit the
following sequence of models:
1. “Intercept model”, no terms for Dist or log(Wmass).
2. Dist only.
3. Dist and log(Wmass).
And in the anova call:
• The first row tests the first term in the model (model 2 vs 1); is there any effect of
Dist (ignoring log(Wmass))?
• The second row tests the second term in the model (model 3 vs 2); is there any
additional effect of log(Wmass) after controlling for Dist?
The results are different from Code Box 4.5, and they mean something different—
these tests answer different questions from those answered when the terms are entered into the model in the other order. Which way we should enter the
terms depends on the research question.
Exercise 4.5: Order of Terms in Writing Out a Model for Snails and
Seaweed
Recall from Exercise 4.3 that David and Alistair measured the wet mass of algal
beds as well, because it is expected to be an important predictor of epifauna
density. Recall that they want to answer the following question:
Is there an effect of distance of isolation after controlling for wet mass?
You can beat this “problem” of order dependence in R using the drop1 function,
as in Code Box 4.7. I say “problem” because it is not usually a problem if you are
clear about exactly which hypothesis your study was designed to test and you order
your model accordingly. The drop1 function tests for an effect of each term in the
model after including all other terms in the model. Hence, the order in which the
model was specified no longer matters—in Code Box 4.7, we test for an effect of
Dist after controlling for log(Wmass), and we test for an effect of log(Wmass)
after controlling for the effect of Dist.
Code Box 4.7: Type II Sums of Squares for the ANCOVA of Snails and
Seaweed
> drop1(lmMassDist,test="F")
Single term deletions
Model:
log(Total) ~ log(Wmass) + Dist
Df Sum of Sq RSS AIC F value Pr(>F)
<none> 11.458 -83.448
log(Wmass) 1 4.9845 16.443 -64.861 23.0561 1.329e-05 ***
Dist 2 2.1415 13.600 -77.681 4.9528 0.01067 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
4.3.1 Interactions
An interaction between two variables tells us whether the nature of the effect of one
variable changes as the other variable changes.
To answer the question of whether the isolation effect varies with time period, we
need to test for an interaction between Dist and Time.
This interaction is often written Dist × Time, but in R an interaction term is written Dist:Time. In R, Dist*Time is useful as shorthand for “a factorial design with main effects and
interactions”, i.e. Dist + Time + Dist:Time, as in Code Box 4.9. This fits a fac-
torial ANOVA, a special case of linear models (hence, it is fitted using the same
function; it makes the same assumptions, checked in the same way).
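Code Box 4.9 is not reproduced here, but a minimal sketch of the factorial fit, using the object name ft_seaweedFact that appears in the plotting code below, might look like:

seaweed$Time = factor(seaweed$Time)   # treat sampling time as a factor
seaweed$Dist = factor(seaweed$Dist)   # and distance of isolation too
ft_seaweedFact = lm(log(Total) ~ Time * Dist, data = seaweed)  # Time + Dist + Time:Dist
anova(ft_seaweedFact)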
[Fig. 4.1 plot: total density (log scale) against distance of isolation (0, 2, 10), with separate traces for sampling times 5 and 10.]
Fig. 4.1: Interaction plot of effects of distance of isolation and sampling time on
density of epifauna on seaweed, as in Exercise 4.6. This type of display can be useful
for factorial ANOVA
> library(dplyr)
> seaweed$Time = as.factor(seaweed$Time)
> by_DistTime = group_by(seaweed,Dist,Time)
> distTimeMeans = summarise(by_DistTime, logTotal=mean(log(Total)))
> distTimeMeans
> library(ggplot2)
> library(ggthemes) #loads special themes
> ggplot(seaweed, aes(x = factor(Dist), y = Total, colour = Time)) +
geom_point() + geom_line(data = distTimeMeans, aes(y = exp(logTotal),
group = Time)) + theme_few() + xlab("Distance of Isolation") +
ylab("Total abundance [log scale]") +
scale_y_log10(breaks=c(1, 5, 10, 50, 100, 500))
  Dist  Time  logTotal
  <fct> <fct>    <dbl>
1 0     5         1.58
2 0     10        1.31
3 2     5         2.01
4 2     10        2.36
5 10    5         1.68
6 10    10        2.13
[Fig. 4.2 plots: (left) seabird count (log scale) against year, 1984–1996, with the 1989 spill marked; (right) logit(probability of presence) at sites with and without fields, for ground-, building-, scrub-, and foliage-breeding species.]
Fig. 4.2: Interactions are cool. (left) Seabird counts in Prince William Sound follow-
ing the Exxon-Valdez spill of 1989 (dotted vertical line), data from McDonald et al.
(2000). In BACI designs like this, we can test for an environmental impact by testing
for an interaction between time (before–after) and treatment (control–impact). An
interaction is seen here with a rather dramatic relative change in mean abundance
(solid lines) immediately following the spill. (right) A fourth corner model for how
bird response to environment varies with species traits along a rural–urban gradient
in France (Brown et al., 2014). This plot focuses on how bird presence at sites near
fields is associated with breeding habit. There is an environment–trait interaction,
with the likelihood of encountering most bird species increasing as you move to sites
near fields, except for species that breed in buildings
Alternatively, for a simpler plot without the data points on it, try
> interaction.plot(seaweed$Dist, seaweed$Time, ft_seaweedFact$fitted,
xlab="Isolation of patch", ylab="Total density [log]",
trace.label="Time")
Recall from Chap. 3 that ANOVA can be understood as multiple regression, where
the factor in the model is represented as a set of indicator variables (ones and zeros)
in a regression model. The number of indicator variables that are needed to include a
factor in a model, known as its degrees of freedom (df), is # levels − 1. For example,
for David and Alistair’s factorial seaweed experiment (Exercise 4.6), Time has 1 df
(two sampling times, 5 and 10) and Dist has 2 df (three distances, 0, 2, and 10).
For an interaction, the rule is to multiply the df for the corresponding main effects,
e.g. Dist:Time has 2 × 1 = 2 df.
For a quantitative variable x, only one predictor (x) is included in the linear model
so it adds one df.
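The df rule is easy to check by counting the columns of the design (indicator) matrix that R constructs; the snippet below is purely illustrative and assumes Dist and Time have already been converted to factors:

colnames(model.matrix(~ Dist * Time, data = seaweed))
# 1 (intercept) + 2 (Dist) + 1 (Time) + 2 (Dist:Time) = 6 columns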
Why are the df important? Well, they aren’t really! They used to be, back in the
day, when they were used to manually compute stuff (variance estimates). But this is
not really relevant to modern statistics, where the computer can do most of the work
for us, and the computations it does rely less on df than they used to.
Degrees of freedom are mostly useful just as a check to make sure nothing is
wrong with the way you specified a model. In particular, if you accidentally treat a
factor as quantitative, it will have one df no matter how many levels the factor has.
If we forget to turn Dist into a factor, the table will look like Code Box 4.11, and
we will reach the wrong conclusion!
This is what your output might look like if your computer thinks your factors are quantitative
variables—each term has only one df, when Dist should have two (because it is a factor
with three levels). The biggest problem with this is that we don’t necessarily want to assume
the effect of Dist is linear, but that is what is done when Dist is entered into the model as
a quantitative variable rather than as a factor.
> data(seaweed)
> ft_nofactor=lm(log(Total)~Time*Dist,data=seaweed)
> anova(ft_nofactor)
Df Sum Sq Mean Sq F value Pr(>F)
Time 1 0.243 0.2433 0.667 0.4177
Dist 1 0.716 0.7164 1.964 0.1669
Time:Dist 1 1.030 1.0303 2.825 0.0987 .
Residuals 53 19.331 0.3647
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
You can use the multcomp package, as before, for multiple comparisons. Things
get more complicated now, though—if you try the same approach as previously, the
multcomp package will give you a warning (Code Box 4.12).
Code Box 4.12: Tukey’s Comparisons Don’t Work for Main Effects in an
Orthogonal Design
The reason we get a warning is that when you fit a model with interactions such
as Time*Dist, you are saying the effect of Dist varies with Time. So it no longer
makes sense to look at the main effects for Dist; we have to look within each level
of Time to see what the effect is.
We have a few options:
1. Fit a main effects model and then do multiple comparisons on the main effect
of interest (Code Box 4.13). You should not do this if you have a significant
interaction!!
2. Compare all treatment combinations with each other (Code Box 4.14). But this is
wasteful if some treatment combinations are not of interest. For example, David
and Alistair do not want to know if isolation = 10 at time = 5 is significantly
different from isolation = 0 at time = 10!
3. Manually specify the contrasts you are interested in testing (Code Box 4.15). For
example, David and Alistair were interested in the effect of Dist, so they wanted
to specify all pairwise comparisons of levels of Dist at each sampling time. (This
got messy to code because the names for the sampling levels were numbers, which
multcomp gets upset about, so the first step in Code Box 4.15 was to change the
factor levels from numbers to words.)
Code Box 4.13: Tukey’s Comparisons for a Main Effect of Dist for Exer-
cise 4.6, Assuming No Interaction
Quantile = 2.4119
95% family-wise confidence level
Linear Hypotheses:
Estimate lwr upr
2 - 0 == 0 0.72650 0.28760 1.16539
10 - 0 == 0 0.45838 0.03872 0.87805
10 - 2 == 0 -0.26812 -0.70701 0.17078
Or you could use summary on the multiple testing object compDistMain. Note that the in-
tervals don’t cover zero when comparing isolation = 2 to isolation = 0, and when comparing
10 to 0, meaning there is significant evidence of an effect in these instances.
Code Box 4.14: Tukey’s Comparisons for All Possible Treatment Combi-
nations for Exercise 4.6
This approach is wasteful as it compares some pairs we are not interested in (e.g. 2.10 vs
0.5).
> td = interaction(seaweed$Dist,seaweed$Time)
> ft_seaweedInt=lm(logTot~td,data=seaweed) # Time*Dist as a single term
> contInt = mcp(td="Tukey") # so R compares on all Time*Dist levels
> compDistInt = glht(ft_seaweedInt, linfct=contInt)
> summary(compDistInt)
Simultaneous Tests for General Linear Hypotheses
Linear Hypotheses:
Estimate Std. Error t value Pr(>|t|)
2.5 - 0.5 == 0 0.4356 0.2391 1.822 0.46046
10.5 - 0.5 == 0 0.1013 0.2391 0.424 0.99815
0.10 - 0.5 == 0 -0.2643 0.2391 -1.105 0.87659
2.10 - 0.5 == 0 0.7852 0.2635 2.980 0.04761 *
10.10 - 0.5 == 0 0.5512 0.2391 2.305 0.21028
10.5 - 2.5 == 0 -0.3343 0.2391 -1.398 0.72720
0.10 - 2.5 == 0 -0.6999 0.2391 -2.927 0.05417 .
2.10 - 2.5 == 0 0.3496 0.2635 1.327 0.76842
Code Box 4.15: Tukey’s Comparisons for Dist Within Each Sampling
Time, for Exercise 4.6
This is the best approach to use if you think there is an interaction and are primarily interested
in Dist.
> levels(seaweed$Time) = c("five","ten") #mcp needs non-numeric levels
> levels(seaweed$Dist) = c("Zero","Two","Ten")
> td = interaction(seaweed$Dist,seaweed$Time)
> ft_seaweedInt=lm(log(Total)~td,data=seaweed) # Time*Dist as one term
> contDistinTime = mcp(td = c("Two.five - Zero.five = 0",
"Ten.five - Zero.five = 0",
"Ten.five - Two.five = 0",
"Two.ten - Zero.ten = 0",
"Ten.ten - Zero.ten = 0",
"Ten.ten - Two.ten = 0"))
> compDistinTime = glht(ft_seaweedInt, linfct=contDistinTime)
> summary(compDistinTime)
Simultaneous Tests for General Linear Hypotheses
Linear Hypotheses:
Estimate Std. Error t value Pr(>|t|)
Two.five - Zero.five == 0 0.4356 0.2391 1.822 0.31172
Ten.five - Zero.five == 0 0.1013 0.2391 0.424 0.99084
Ten.five - Two.five == 0 -0.3343 0.2391 -1.398 0.57117
Two.ten - Zero.ten == 0 1.0495 0.2635 3.983 0.00124 **
Ten.ten - Zero.ten == 0 0.8155 0.2391 3.411 0.00717 **
Ten.ten - Two.ten == 0 -0.2340 0.2635 -0.888 0.87445
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Adjusted p values reported---single-step method)
At a sampling time of 10 weeks, there are significant differences between isolation = 2 and
isolation = 0, and between 10 and 0, after correcting for multiple testing.
Recall that David and Alistair also measured the wet mass of algal beds (also known as seaweed), because that is expected to be important to epifauna density.
We assumed additivity—that distance of isolation had an additive effect on
log(density). Could there be an interaction between isolation and wet mass?
Graphically, an ANCOVA interaction would mean that the slope of the relationship
between log(density) and Wmass changes as Dist changes, as in Fig. 4.3. The slope
of the line at a distance of isolation of 10 seems steeper than for the other treatment
levels. But is it significantly steeper, or could this be explained away by sampling
variation? We can test this using an interaction term in an ANCOVA, in just the same
way as we test for interactions in factorial ANOVA, as in Code Box 4.16.
[Fig. 4.3 plot: total abundance (log scale) against algal wet mass, with separate fitted lines for Dist = 0, 2, 10.]
Fig. 4.3: An interaction in an ANCOVA means that the slopes of the regression lines
differ across levels of the treatment variable. For David and Alistair’s seaweed data
(Exercise 4.3), the slope of the line at a distance of isolation of 10 seems steeper
than for the other treatment levels. But is it significantly steeper, or could this be
explained away by sampling variation?
> lmMassDistInter=lm(logTot~log(Wmass)*Dist,data=seaweed)
> anova(lmMassDistInter)
Analysis of Variance Table
Response: log(Total)
Df Sum Sq Mean Sq F value Pr(>F)
log(Wmass) 1 7.7216 7.7216 35.3587 2.489e-07 ***
Dist 2 2.1415 1.0708 4.9032 0.01128 *
log(Wmass):Dist 2 0.3208 0.1604 0.7345 0.48475
Residuals 51 11.1374 0.2184
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Is there evidence of an interaction between the effects on density of wet mass and distance
of isolation?
In multiple regression models, you can also have interactions between two continuous
covariates, but it’s a little tricky. Consider, for example, Exercise 4.8, where
Angela would like to know if the associations of height with precipitation and
latitude interact.
[Fig. 4.4 panels, left to right: “Linear function”, “Quadratic function”, “Weird function!”]
Fig. 4.4: Interactions without other quadratic terms look weird: from left to right
we have a linear model, a model with all quadratic terms, and a model with just a
quadratic interaction (x:y) without quadratic main effects (x^2 and y^2). Notice the
last one is a weird saddle-shaped function that doesn’t make a whole lot of sense, so
it shouldn’t really be our default fit!
Code Box 4.17: Using R to Fit a Quadratic Model to the Plant Height Data
of Exercise 3.1
Call:
lm(formula = log(height) ~ poly(rain, lat, degree = 2),
data = globalPlants)
Residuals:
Min 1Q Median 3Q Max
-3.3656 -0.9546 -0.0749 0.9775 3.1311
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.3314 0.2298 5.794 5.25e-08 ***
poly(rain, lat, degree = 2)1.0 7.2939 2.8268 2.580 0.0110 *
poly(rain, lat, degree = 2)2.0 -1.4744 2.6221 -0.562 0.5749
poly(rain, lat, degree = 2)0.1 -5.6766 2.2757 -2.494 0.0139 *
poly(rain, lat, degree = 2)1.1 -9.1362 56.1632 -0.163 0.8710
poly(rain, lat, degree = 2)0.2 -2.5617 2.7153 -0.943 0.3473
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
As we add more variables to our analysis, the model gets more complicated. The
assumptions we are making are the same as before, with the same consequences, but
with more variables in the model there are more ways these assumptions could go
wrong, hence more to check. Assumptions and their importance were first discussed
in general in Sect. 1.5; in what follows, we review them in the context of linear models.
The mean assumption in a linear model is that the mean of y is linearly related to
our predictor variables. If any predictors are quantitative, it is assumed that the mean
of y is a linear function of each of these, and if any of them are categorical, then
the only real assumption imposed here is additivity (the effect of the categorical
variable is the same at all values of other x variables). If you include interactions
between categorical predictors and others, then you no longer make the additivity
assumption.
Our variance assumption in a linear model, as previously, is that the error variance
is constant irrespective of the value of predictors.
Linear models assume a response is normally distributed, but as previously, they are
pretty robust to violations of this assumption, because parameter estimates and fitted
values are approximately normally distributed irrespective of whether or not the data
are normally distributed, thanks to the central limit theorem, often with surprisingly
small sample sizes (Miller Jr., 1997). But as previously, just because a statistical
procedure is valid doesn’t mean that it is efficient. We should be on the lookout for
strongly skewed distributions and outliers, because these make our inferences less
efficient. The simplest course of action in this case is to transform the data, if it makes
sense to. Sometimes that doesn’t work, especially for rare counts or binary data, and
alternative methods designed specially for this type of data have been developed
(Chap. 10).
An alternative regression approach is to use quantile regression (Koenker & Hal-
lock, 2001)—instead of modelling the mean of y as a function of x, model a quantile
(such as the median, or the 90th percentile). Occasionally you may see it argued
that quantile regression is an alternative to linear modelling that is suitable for data
that don’t satisfy the normality assumption. Strictly speaking this is true—quantile
regression makes no normality assumption—but a better reason to consider using
it would be if your research question were actually about quantiles. For example,
Angela could hypothesise that plants were height-limited by climate, so short plants
would be found everywhere, but tall plants were only found in high-rainfall areas.
Quantile regression would be useful here—she could estimate how the 90th per-
centile (say) of height varied with rainfall and look at how much steeper this line
was than that relating the 10th percentile of height to rainfall.
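As a rough sketch of how this might be done (not from the book), the quantreg package fits quantile regressions; the quantiles chosen below are for illustration only:

library(quantreg)
library(ecostats)
data(globalPlants)
ft_q90 = rq(log(height) ~ rain, tau = 0.9, data = globalPlants)  # 90th percentile of log(height)
ft_q10 = rq(log(height) ~ rain, tau = 0.1, data = globalPlants)  # 10th percentile
coef(ft_q90)
coef(ft_q10)  # compare the slopes of the two quantile lines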
Few if any models are perfect, and often factors that we don’t know about or haven’t
measured affect our response. What effect do they have on results?
The first thing to worry about is confounding. If you leave out a potentially important predictor that is correlated with other predictors, then it can change the apparent effects of the predictors that remain in the model. To see why (Maths Box 4.1), suppose the response depends on an included predictor x and a missing predictor z, where z is itself linearly related to x:
y_i = β_0 + x_i β_x + z_i β_z + ε_i
z_i = γ x_i + δ_i
y_i = β_0 + x_i β_x + (γ x_i + δ_i) β_z + ε_i
    = β_0 + x_i (β_x + γ β_z) + ε_i + δ_i β_z
    = β_0 + x_i β* + ε_i*
where
β* = β_x + γ β_z    (4.1)
ε_i* = ε_i + δ_i β_z ∼ N(0, σ² + σ_z² β_z²)    (4.2)
Equations 4.1 and 4.2 show the two key consequences of missing predictors.
Equation 4.1 shows the idea of confounding—if a missing predictor is
correlated with those in the model, the value of the slope is no longer centred
around its true value βx ; instead, it is shifted (biased) by γ βz —an amount that
gets larger as the relationship with the included predictor (γ) gets stronger and
as the relationship with the response (βz ) gets stronger. Note that if the missing
and included predictors are uncorrelated (γ = 0), there is no confounding.
Equation 4.2 illustrates the idea of increased error—if a predictor is missing
from the model, regression error increases from σ 2 by the amount σz2 βz2 , which
is larger if the missing predictor is more variable (σz2 ) or more important to
the response (βz2 ).
Obviously, if a missing predictor is not important to the response, then its
exclusion has no effect on the model. So if βz = 0, there is no bias or increased
error.
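A tiny simulation (illustrative only, not from the book) makes Eq. 4.1 concrete: omitting a predictor z that is correlated with x shifts the estimated slope for x by roughly γ β_z.

set.seed(1)
n = 1000; gamma = 0.8; beta_x = 1; beta_z = 2
x = rnorm(n)
z = gamma * x + rnorm(n)               # the rnorm(n) term plays the role of delta_i
y = beta_x * x + beta_z * z + rnorm(n)
coef(lm(y ~ x))["x"]                   # close to beta_x + gamma * beta_z = 2.6
coef(lm(y ~ x + z))["x"]               # close to beta_x = 1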
Answer these questions using the appropriate linear model. Be sure to mind
your Ps and Qs!
Chapter 5
Model Selection
Often it is not clear which model you should use for the data at hand—maybe
because it is not known ahead of time which combination of variables should be
used to predict the response, or maybe it is not obvious how the response should
be modelled. In this chapter we will take a look at a few strategies for comparing
different models and choosing between them.
Key Point
How do you choose between competing models? A natural approach to this
problem is to choose the model that has the best predictive performance on
new, independent data, whether directly (using training data to fit the model
and separate test data to evaluate it) or indirectly (using information criteria).
A key issue to consider is the level of model complexity the data can
support—not too simple and not too complex! If the model is too simple,
there will be bias because of important features missing from the model. If
the model is too complex, there will be too much variance in predictions,
because the extra parameters will allow the model to chase the data too much.
(Occasionally, it is better to leave out a term even when it is thought to affect the
response if there are insufficient data to do a good job of estimating its effect.)
What does the question tell us—descriptive, hypothesis test, interval esti-
mation, . . . ?
What do the data tell us—one variable/more, what type of variable is the
response?
So what sort of analysis method are you thinking of using?
Consider Exercise 5.1. The key difference in this example, compared to those of
previous chapters, is that we are primarily interested in choosing the best x variables
(which climate variables height relates to most closely). This is a variable selection
or model selection problem—the goal is to select the best (or a set of best) models
for predicting plant height.
The paradigm we will use for model selection is to maximise predictive
capability—if presented with new data, which model would do the best job of
predicting the values of the new responses?
Model selection is a new way of thinking about things compared to what we have
seen before and introduces some new issues we have to consider.
In model selection, as well as trying to choose the right predictors, you are trying
to choose the right number of them, i.e. the right level of model complexity. When
making a decision about model complexity, you are making a decision about how to
trade off bias against variance (Geman et al., 1992). If you make a model too simple,
leaving out important terms, predictions will be systematically wrong (they will be
biased). If you make a model too complex, adding terms that don’t need to be there,
it will “overfit” the data, chasing it too much and moving away from the main trend.
This will increase the variance of predictions, essentially absorbing some of the
error variance into predictions. The outcome of this is usually a J curve in predictive
error, with a steep decrease in predictive error as bias is removed, a gradual increase
in variance with overfitting, and a minimum somewhere in between that optimally
manages this bias–variance trade-off (Maths Box 5.1).
The idea of the bias–variance trade-off applies any time you are choosing between
models of differing complexity—most commonly, when deciding which predictors
and how many predictors to add to a model, but in many other contexts, too.
One example useful for illustrating the idea is when predicting a response as a
non-linear function of a single predictor. Many responses in ecology are thought to vary non-linearly with their predictors, as in the polynomial example of Fig. 5.1.
[Fig. 5.1 plots: training data (circles) and test data (stars) with fitted curves for polynomials of degree 1, 2, 4, and 8.]
Fig. 5.1: Overfitting a model increases the variance in predictions. Data were gen-
erated using a quadratic model (degree = 2), and training data (black circles) were
fitted with a model that is linear (degree = 1), quadratic (degree = 2), quartic (degree
= 4) or a polynomial with degree 8. Test data (green stars) were used to assess model
fit but were not used in model fitting. The straight line is biased; all other models
can capture the true trend. But as the degree increased beyond 2, the extra model
parameters enabled better tracking of the training data, at the cost of pulling the fitted
model away from the true trend and, hence, away from test data. This is especially
apparent for x-values between 3 and 4, where several large y-values in the training
data “dragged” the fit above the test data for the quartic and 8-degree polynomial
models
In practice, we don’t know the μ_i, so we don’t know MSE(μ̂). But theory tells us a bit about how it will behave. First, we can write it in terms of bias and variance:
MSE(μ̂) = (1/n) Σ_{i=1}^n bias(μ̂_i)² + (1/n) Σ_{i=1}^n var(μ̂_i)
Bias—one way to get bias is to include too few predictors in the model,
missing some important ones. For example, in Maths Box 4.1, we studied
the situation where there was one predictor xi in the model and one missing
predictor zi , where μi = β0 + xi βx + zi βz , and we fit μi = β0 + xi β∗ . In this
situation the bias is non-zero whenever β_z ≠ 0, i.e. whenever an important predictor has been left out of the model.
Variance: for a linear model with p parameters fitted by least squares, the average variance of predictions is
(1/n) Σ_{i=1}^n var(μ̂_i) = pσ²/n    (5.1)
If we use too many predictors, p will be too large, so var(μ̂_i) will be too large.
Our goal in analysis is to build a model that is big enough to capture the main
trends (not too much bias), but not excessively large (not too much variance).
For models with an increasing number of terms, MSE typically follows a J
curve—there is a steep initial decrease in MSE as bias is reduced, because there
are fewer missing predictors, then a gradual increase as too many predictors
are added. The increase is gradual because the 1/n in (5.1) keeps the variance
small relative to bias, except in the analysis of a small, noisy (large σ) dataset,
in which case a small model is often best.
The aim is to find the optimal point in the bias–variance trade-off. This point will
be different for different datasets because it depends not just on how much data you
have and the relative complexity of the models being compared but also on how well
each model captures the true underlying process. In Fig. 5.2, the optimum was at
degree = 2, which was the correct answer for this simulation, since a quadratic model
was the true underlying process from which these data were simulated. (Sometimes,
the optimum can be much smaller than the true model, if the true model is too
complex for our data to fit it well.)
[Fig. 5.2 plot: predictive error against degree of polynomial (log scale: 1, 2, 4, 8), for training and test data.]
Fig. 5.2: The bias–variance trade-off for polynomial models in Fig. 5.1. Note that for
the test data (green curve), the biased model (degree = 1) has a high predictive error,
and as the model gets overfitted (degree > 2), the predictive error increases due to an
increased variance of predictions. The predictive error for the training data does not
account for overfitting, so it always decreases as more parameters are added to the
model (black curve). Predictive error was measured here using MSE, defined later
Many use R2 and P-values to decide how well a model fits, but these aren’t good
tools to use for model selection.
R2 makes no attempt to account for the costs of model complexity—it keeps going
up as you add more terms, even useless ones! If you used R2 as a basis for including
potential predictors in a model, you would end up putting all of them in because that
would maximise R2 , irrespective of whether or not each predictor was useful. The
same is true of estimated error variance for the data the model was fitted to (σ̂²),
except this (usually) decreases as you add more terms to the model, as in Fig. 5.2
(black line).
OK, well why not use hypothesis tests? Why not add terms to the model if they are
significant and remove terms if they are not significant? This is commonly done and
for many years was the main strategy for model selection. Much software still uses
this approach as the default, which encourages its use. But there are a few problems
with the technique. From a philosophical perspective, it is not what hypothesis testing
was designed for: there is not really an a priori hypothesis being tested. So it is not
really the right way to think about the problem. From a pragmatic perspective, using
hypothesis testing for model selection has some undesirable properties. In particular,
it is not variable selection consistent, i.e. is not guaranteed to pick the right model
even when given as much data as it wants in order to do so—it overfits, choosing too
complex a model, especially when considering a large number of potential predictors.
Recall that statistical inference is the process of making some general claim about a
population based on a sample. So far we have talked about two types of statistical
inference:
1. Hypothesis testing—to see if data are consistent with a particular hypothesis
2. Confidence interval estimation—constructing a plausible range of values for some
parameter of key interest.
Now we have a third type of statistical inference:
3. Model selection—which model (or which set of predictor variables) best captures
the true underlying process from which the data were generated.
This can be understood as statistical inference because again we are using a sample
to make general claims—this time about models (or combinations of predictors),
and how well they predict, instead of about parameters.
Note that model selection should never be used in combination with hypothesis
testing or confidence interval estimation to look at related questions on the same
dataset – these methods of inference are not compatible. The process of model
selection will only include a term in the model if it is considered important – hence
it is doing something kind of like significance testing already. If you were to perform
model selection to choose key predictors, then do a hypothesis test on one of these
predictors, this is known to lead to high rates of false significance, and similarly,
performing model selection then constructing a confidence interval is known to lead
to intervals that are biased away from zero. It is best to think of model selection and
hypothesis testing/CIs as mutually exclusive: you either use one of these approaches
or the other. Although having said this, there is a growing literature on post-selection
inference (Kuchibhotla et al., 2022) which offers approaches to address this, the
simplest of which is data splitting – splitting your data into two independent sets,
and applying model selection to one part, and inference to the other.
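A minimal sketch of data splitting (the 50/50 split and the candidate predictors below are chosen purely for illustration):

set.seed(1)
n = nrow(globalPlants)
inSelect  = sample(n, round(n/2))
datSelect = globalPlants[inSelect, ]            # half the data for model selection
datInfer  = globalPlants[-inSelect, ]           # the other half for inference
ft_step = step(lm(log(height) ~ temp + rain + rain.seas, data = datSelect), trace = 0)
summary(lm(formula(ft_step), data = datInfer))  # tests/CIs on independent data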
Key Point
Model selection can be thought of as a method of inference, alongside hy-
pothesis testing and confidence interval estimation. However, it should not be
applied to the same data you plan to use for a hypothesis test or confidence in-
terval to answer a related question, because these methods won’t work correctly
in this situation, unless using methods specifically designed for post-selection
inference.
Consider a situation where you have a set of predictor variables and you want to fit
all possible models (“all subsets”). If there are p predictor variables, there are 2^p
possible models—this gets unmanageable very quickly, as in Table 5.1. If you have
200 observations and 10 variables, the use of all subsets means trying to choose
from 1000+ models using just 200 observations. Good luck!
Table 5.1: Variable selection gets ugly quickly—the number of candidate models
increases exponentially with the number of predictor variables, such that it is no
longer feasible to explore all possible models with just 20 or 30 predictors
# variables # models to fit
2 4
3 8
5 32
10 1024
20 1,048,576
100 1.27 × 1030
300 More than the number of electrons in the known universe!
training data). One group even wrote a program for their spam filter and put it on a
website, so if you copy-pasted text into it, it would return the predicted probability
that your e-mail was spam.
At the other extreme, one group put hardly any effort into the assignment—it
seemed like they had forgotten about the assignment and slapped something together
at the last minute, with a handwritten report and a simple linear model with just a
few terms in (for which model assumptions looked like they wouldn’t be satisfied, if
they had thought to check them).
As part of the assessment, I brought five e-mails to class and asked students to
classify them using their filter (a “test” dataset). Most groups did poorly, getting
one or two out of five correct, but one group managed four out of five correct—the
last-minute group with the simple linear model.
The problem was that students weren’t thinking about the costs of model com-
plexity and were assuming that if they could do a really good job of modelling their
training data, their method would work well on new data, too. So they built overly
complex models that overfitted their data, chasing them too much and ending up
with highly variable predictions, a long way past the optimum on the bias–variance
trade-off. The group with the last-minute, simple linear model had the best predic-
tive performance because their model did not overfit the data, so they ended up a lot
closer to the optimum choice for model complexity.
The way to beat this problem is to use model selection tools, as described in the
following sections, to make sure the level of model complexity is appropriate for the
data at hand—not too complex and not too simple. Directly or indirectly, all such
methods work by thinking about how well a model can predict using new data.
5.2 Validation
The simplest way to compare predictive models is to see how well they predict using
new data, validation. In the absence of new data, you can take a test or hold-out
sample from the original data that is kept aside for model evaluation. The remaining
training data are used to fit each candidate model. Overfitted models may look good
on the training data, but they will tend to perform worse on test data, as in Figs. 5.1
and 5.2 or as in the spam filter story of the previous section.
It is critical that the test sample be independent of the training sample; otherwise
this won’t work (see Maths Box 5.2). If all observations are independent (given x),
then a random allocation of observations to test/training will be fine. If you have
spatial data that are not independent of each other (Chap. 7), a common approach
is to break it into coarse spatial blocks and assign these to training and test datasets
(Roberts et al., 2017, for example).
Maths Box 5.2: Validation Data Can Be Used to Estimate Mean Squared
Error
MSE(μ̂) = (1/n) Σ_{i=1}^n (μ̂_i − μ_i)²
But how can we calculate this when we don’t know the true mean, μi ?
We can use new observations, since y_i = μ_i + ε_i. We can compare the new responses to their predicted values by estimating the variance of y_i − μ̂_i. Using the adding rule for standard deviations (from Maths Box 1.5) yields
σ²_{y_i − μ̂_i} = σ²_{μ̂_i − μ_i} + σ² − 2 cov(ε_i, μ̂_i)
The first term σ²_{μ̂_i − μ_i} is another way of writing MSE. The second term is a constant. The third term, the covariance of ε_i and μ̂_i, is zero if ε_i is independent of μ̂_i. This independence condition is satisfied if y_i is a new observation that is independent of those used in fitting the model.
So when using a set of new “test” observations that are independent of those used to fit the model, estimating σ²_{y_i − μ̂_i} will estimate MSE(μ̂), up to a constant.
How should we choose the size of the test sample? Dunno! (There is no single
best answer.)
One well-known argument (Shao, 1993) is that as sample size n increases, the size
of the training sample should increase, but as a proportion of n it should decrease
towards zero. This ensures “variable selection consistency”, guaranteeing the correct
model is chosen for very large n.
An example strategy Shao (1993) suggested (which hence became a bit of a thing)
is to use n^{3/4} observations in the training sample. This can be quite harsh, though,
as in Table 5.2. This rule tends not to be used so much in ecology, but the general
strategy is certainly worth keeping in mind—using a smaller proportion of data in
the training set when analysing a larger dataset, rather than sticking with the same
proportion irrespective of sample size.
Table 5.2: How the suggested number of training observations changes with sample
size if using the n^{3/4} rule mentioned in Shao (1993)
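The body of the table is not reproduced here, but the rule is simple to compute; the sample sizes below are chosen just for illustration:

n = c(50, 100, 500, 1000, 10000)
data.frame(n = n, nTrain = round(n^0.75), propTrain = round(n^0.75 / n, 2))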
How should we measure predictive performance? For linear regression, the obvi-
ous answer is MSE:
MSE_test = (1/n_test) Σ_{i=1}^{n_test} (y_i − μ̂_i)²
where the summation is over test observations, for each of which we compare the
observed y-value, yi , to the value predicted by the model fitted to the training sample,
μ̂i . This criterion was used in Fig. 5.2. Maths Box 5.2 explains how this quantity
estimates the MSE of predictions. It makes sense to use this criterion for models
where we assume equal variance—if not assuming equal variance, then it would
make sense to use a criterion that weighted observations differently according to
their variance. In later chapters, we will learn about models fitted by maximum
likelihood, and in such a situation, it would make sense to maximise the likelihood
on test data rather than minimising MSE.
An example using validation via MSE for model selection is in Code Box 5.1.
Code Box 5.1: Using Validation for Model Selection Using Angela’s Plant
Height Data
Comparing MSE for test data, for models with rain and considering inclusion of rain.seas
(seasonal variation in rainfall)
> library(ecostats)
> data(globalPlants)
> n = dim(globalPlants)[1]
> indTrain = sample(n,n^0.75) #select a training sample of size n^0.75:
> datTrain = globalPlants[indTrain,]
> datTest = globalPlants[-indTrain,]
> ft_r = lm(log(height)~rain,dat=datTrain)
> ft_rs = lm(log(height)~rain+rain.seas,dat=datTrain)
> pr_r = predict(ft_r,newdata=datTest)
> pr_rs = predict(ft_rs,newdata=datTest)
> rss_r = mean( (log(datTest$height)-pr_r)^2 )
> rss_rs = mean( (log(datTest$height)-pr_rs)^2 )
> print( c(rss_r,rss_rs) )
[1] 2.145927 2.154608
So it seems from this training/test split that the smaller model with just rain is slightly
better.
Try this yourself—do you get the same answer? What if you repeat this again multiple
times? Here are my next three sets of results:
> print( c(rss_r,rss_rs) )
[1] 2.244812 2.116212
> print( c(rss_r,rss_rs) )
[1] 2.102593 2.143109
> print( c(rss_r,rss_rs) )
[1] 2.575069 2.471916
The third run supported the initial results, but the second and fourth runs (and most) gave a
different answer—suggesting that including rain.seas as well gave the smaller MSE. But
when the answer switches, it suggests that it is a close run thing and the models are actually
quite similar in performance.
Data, split into five groups: 1 2 3 4 5
First run:  (test) (training) (training) (training) (training)
Second run: (training) (test) (training) (training) (training)
Third run:  (training) (training) (test) (training) (training)
Fourth run: (training) (training) (training) (test) (training)
Fifth run:  (training) (training) (training) (training) (test)
Fig. 5.3: A schematic diagram of five-fold CV. Each observation in the original
dataset is allocated to one of five validation groups, and the model is fitted five times,
leaving each group out (as the test dataset) once. Estimates of predictive performance
are computed for each run by comparing predictions to test observations, then pooled
across runs, for a measure of predictive performance that uses each observation
exactly once
When using a test dataset to estimate predictive performance on new data, clearly
the test/training split matters—it is a random split that introduces randomness to
results. In Code Box 5.1, four different sets of results were obtained, leading to
different conclusions about which model was better. This issue could be handled
by repeating the process many (e.g. 50) times and averaging results (and reporting
standard errors, too). The process of repeating for different test/training splits is
known as cross-validation (CV).
A special case of CV is when you split data into K groups (usually K = 5, 5-fold CV,
or K = 10) and fit K models—using each group as the test data once, as in Fig. 5.3.
Results tend to be less noisy than just using one training/test split, because each
observation is used as a test observation once, so one source of randomness (choice
of test observation) has been removed.
> library(DAAG)
> ft_r = lm(log(height)~rain,dat=datTrain)
> ft_rs = lm(log(height)~rain+rain.seas,dat=datTrain)
> cv_r = cv.lm(data=globalPlants, ft_r, m=5, printit=FALSE) # 5 fold CV
> cv_rs = cv.lm(data=globalPlants, ft_rs, m=5, printit=FALSE) # 5 fold CV
> print( c( attr(cv_r,"ms"),attr(cv_rs,"ms") ), digits=6 )
[1] 2.22541 2.15883
suggesting that the models are very similar, the model with rain.seas predicting slightly
better, but by an amount that is likely to be small compared to sample variation. For example,
repeating analyses with different random splits (controlled through the seed argument):
> cv_r = cv.lm(data=globalPlants, ft_r, m=5, seed=1, printit=FALSE)
> cv_rs = cv.lm(data=globalPlants, ft_rs, m=5, seed=1, printit=FALSE)
> print( c( attr(cv_r,"ms"),attr(cv_rs,"ms") ), digits=6 )
[1] 2.21103 2.16553
> cv_r = cv.lm(data=globalPlants, ft_r, m=5, seed=2, printit=FALSE)
> cv_rs = cv.lm(data=globalPlants, ft_rs, m=5, seed=2, printit=FALSE)
> print( c( attr(cv_r,"ms"),attr(cv_rs,"ms") ), digits=6 )
[1] 2.22425 2.14762
> cv_r = cv.lm(data=globalPlants, ft_r, m=5, seed=3, printit=FALSE)
> cv_rs = cv.lm(data=globalPlants, ft_rs, m=5, seed=3, printit=FALSE)
> print( c( attr(cv_r,"ms"),attr(cv_rs,"ms") ), digits=6 )
[1] 2.2783 2.2373
we are now getting consistent results on different runs, unlike in Code Box 5.1, suggesting
that adding rain.seas to the model improves predictive performance. Also note the answers
are looking much more consistent now than before, with predictive errors within 2–3% of
each other across runs.
Another way to do model selection is to use the whole dataset (no training/test
split) and to penalise more complex models in some way to try to account for the
additional variance they introduce. Such approaches are referred to as information
criteria, largely for historical reasons (specifically, the first such criterion was derived
to minimise an expected Kullback-Leibler information). The two most common
criteria are AIC and BIC, which for linear models can be written
AIC = n log σ̂² + 2p
BIC = n log σ̂² + p log(n)
where p is the number of parameters in the model and σ̂² is the estimated error variance. AIC is not model selection consistent: even if the sample size is very large, it will often favour a model that is larger than
the best-fitting one. BIC, in contrast, is known to be model selection consistent in a
relatively broad range of conditions, i.e. as sample size increases, it will tend towards
selecting the best model all the time.
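As a sketch, the two candidate plant height models from Code Box 5.1 can be compared with R’s AIC and BIC functions (these are computed from the log-likelihood, so they differ from the formulas above by a constant, but comparisons between models fitted to the same data are unaffected):

ft_r  = lm(log(height) ~ rain, data = globalPlants)
ft_rs = lm(log(height) ~ rain + rain.seas, data = globalPlants)
AIC(ft_r, ft_rs)
BIC(ft_r, ft_rs)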
The similarities in the form of the criteria motivate a more general approach,
aptly named the generalised information criterion (GIC) (Nishii, 1984), which takes
the form
GIC = n log σ̂² + λp
where λ is an unknown value to be estimated by some method (preferably from the
data). Usually, we would want to estimate λ in such a way that if sample size (n) were
to increase, λ would get larger and larger (going to infinity) but at a slower rate than
n. One way to ensure this would be to use CV, but with an increasing proportion of
the data in the test sample in larger datasets, as in Table 5.2. This is something of a
hybrid approach between those of the two previous sections, using both information
criteria and CV. The main advantage of this idea is that predictive error is better
estimated, because it is estimated by fitting the model to the whole dataset all at once
(using information criteria), while at the same time, the appropriate level of model
complexity is chosen using CV, to ensure independent data are used in making this
decision.
Information criteria have the advantage that there are no random splits in the data—
you get the same answer every time. This makes them simpler to interpret. (An
exception is when using GIC with λ estimated by CV—in that case, the choice of λ
can vary depending how the data are split into validation groups.)
The disadvantages are that they are slightly less intuitive than CV, derived indi-
rectly as measures of predictive performance on new data, and in the case of AIC
and BIC, their validity relies on model assumptions (essentially, the fitted models
need to be close to the correct model). CV requires only the assumption that the
test/training data are independent—so it can still be used validly when you aren’t
sure about the fitted model.
This is the first example we shall see of the distinction between model-based and
design-based inference; we will see more about this in Chap. 9.
It’s all well and good if you only have a few candidate models to compare, but what
if you have a whole bunch of predictor variables and you just want to find the subset
that is best for predicting y? Common approaches:
• Forward selection—add one variable at a time, adding the best-fitting variable at
each step
• Backward selection—add all variables, then delete one variable at a time, deleting
the worst-fitting variable at each step
• All subsets—search all possible combinations. For p predictors there are 2^p
possible combinations, which is not easy unless there are only a few variables (p
small), as in Code Box 5.4.
There are also hybrid approaches that do a bit of everything, such as the step
function in R, as in Code Box 5.5.
Code Box 5.4: All Subsets Selection for the Plant Height Data of Exer-
cise 5.1
> library(leaps)
> fit_heightallsub<-regsubsets(log(height)~temp+rain+rain.wetm+
temp.seas, data=globalPlants,nbest=2)
The results are most easily accessed using summary, but we will look at two parts of the
summary output side by side (the variables included in models, stored in outmat, and the
BICs of each model, stored in bics):
> cbind(summary(fit_heightallsub)$outmat,summary(fit_heightallsub)$bic)
temp rain rain.wetm temp.seas
1 ( 1 ) " " " " "*" " " "-21.06175277099"
1 ( 2 ) " " "*" " " " " "-19.2868231448677"
2 ( 1 ) "*" "*" " " " " "-24.8920679441895"
2 ( 2 ) "*" " " "*" " " "-23.9315826810965"
3 ( 1 ) "*" "*" " " "*" "-20.9786934545272"
3 ( 2 ) "*" "*" "*" " " "-20.3405400349995"
4 ( 1 ) "*" "*" "*" "*" "-16.4229239023018"
The best single-predictor model has just rain.wetm and has a BIC of about −21; the
next best single-predictor model has just rain. But including both temp and rain does the
best among the models considered here, with a BIC of about −25.
Code Box 5.5: Stepwise Subset Selection for Plant Height Data of Exer-
cise 5.1
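The contents of this code box are not reproduced here; a minimal sketch of stepwise selection with R’s step function, using the same candidate predictors as Code Box 5.4 and a BIC-style penalty (k = log(n)), might look like:

ft_null = lm(log(height) ~ 1, data = globalPlants)
ft_full = lm(log(height) ~ temp + rain + rain.wetm + temp.seas, data = globalPlants)
step(ft_null, scope = formula(ft_full), direction = "both", k = log(nrow(globalPlants)))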
So which method is best? Dunno! (There is no simple answer.) You can explore
this yourself by simulating data and seeing how different methods go—no method is
universally best. Results of a small simulation looking at this question are given in
Fig. 5.4, but the simulation settings could readily be varied so that any of the three
methods was the best performer.
All-subsets selection is more comprehensive but not necessarily better—because
it considers so many possible models, it is more likely that some quirky model will
jump out and beat the model with the important predictors, i.e. it is arguably less
robust. In Fig. 5.4, all-subsets selection seemed to perform best when looking at AIC
on training data (left), but when considering how close predicted values were to their
true means, forward selection performed better (right). This will not necessarily be
true in all simulations. Recall also that all subsets is simply not an option if you have
lots of x variables.
Backward selection is not a good idea when you have many x variables—because
you don’t really want to use all of them, and the model with all predictors (the “full
model”) is probably quite unstable, with some parameters that are poorly estimated.
In this situation, the full model is not the best place to start from and you would
be better off doing forward selection. In Fig. 5.4 (right), backward selection was
the worst performing measure, probably because the sample size (32) was not large
compared to the number of predictors (8), so the full model was not a good starting
point.
Forward selection doesn’t work as well in situations where the final model has
lots of terms in it—because the starting point is a long way from the final answer,
there are more points along the way where things can go wrong. In Fig. 5.4 (right), it
was the best performing method, probably in part because in this simulation only two
of the predictors were associated with the response, so this method started relatively
close to the true answer.
Multi-collinearity (Page 70) can muck up stepwise methods—well it will cause
trouble for any variable selection method, but especially for stepwise methods where
the likelihood of a term entering the model is dramatically reduced by correlation
with a term already in the model, the end result being that the process is a lot more
noisy. In Fig. 5.4, all predictors had a correlation of 0.5, and the true model was
correctly selected about 20% of the time. When the correlation was increased to 0.8,
the correct model was only selected about 5% of the time.
[Fig. 5.4 plot: relative performance of backward, forward, and all-subsets selection in the simulation described in the text.]
A modern and pretty clever way to do subset selection is to use penalised estimation.
Instead of estimating model parameters (β) to minimise least squares
min_β Σ_{i=1}^n (y_i − μ_i)²
we add a penalty as well, which encourages estimates towards zero, such as this one:
min_β Σ_{i=1}^n (y_i − μ_i)² + λ Σ_j |β_j|
This approach is known as the LASSO (Tibshirani, 1996, least absolute shrinkage
and selection operator). This is implemented in a lot of recently developed
statistical tools, so many ecologists will have used the LASSO without realising it,
e.g. in MAXENT software under default settings (Phillips et al., 2006).
This looks a bit like GIC, but the penalty term is a function of the size of the
parameters in the model, rather than just the number of parameters. The effect is to
push parameter estimates towards zero (so their penalty is smaller), especially for
coefficients of variables that aren’t very useful in predicting the response. This biases
parameter estimates in order to reduce their sampling variance.
Penalised estimation is a good thing when
• the main goal is prediction—it tends to improve predictive performance (by
reducing variance);
• you have lots of parameters in your model (or not a large sample size)—in such
cases, reducing the sampling variance is an important issue.
The λ parameter is a nuisance parameter that we need to estimate to fit a LASSO
model. The value of this parameter determines how hard we push the slope parameters
towards zero, i.e. how much we bias estimates, in an effort to reduce variance. So
this parameter is what manages the bias–variance trade-off.
λ is large ⟹ most β_j = 0
λ is small ⟹ few β_j = 0
The parameter λ controls model complexity, determining how many predictors
are included in the model. The full range of model sizes is possible, from having no
predictors included (if λ is large enough) to including all of them (as λ approaches
zero and we approach the least-squares fit). We can choose λ using the same methods
we used to choose model complexity previously—CV is particularly common, but
BIC is known to work well also.
The LASSO can equivalently be thought of as constrained minimisation:
min_β Σ_{i=1}^n (y_i − μ_i)² such that Σ_j |β_j| ≤ t
Code Box 5.6: LASSO for Plant Height Data of Exercise 5.1
> library(glmnet)
> X = cbind(globalPlants$temp, globalPlants$rain,
globalPlants$rain.wetm, globalPlants$temp.seas)
> ft_heightcv=cv.glmnet(X,log(globalPlants$height))
> plot(ft_heightcv)
> ft_lasso=glmnet(X,log(globalPlants$height),
lambda=ft_heightcv$lambda.min)
> ft_lasso$beta
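If it helps to see what the penalty is doing, the whole LASSO coefficient path can be plotted—a minimal sketch, reusing the X matrix and the cross-validated fit ft_heightcv from the code above (plot.glmnet and its xvar argument are part of the glmnet package):

ft_path = glmnet(X, log(globalPlants$height))
plot(ft_path, xvar="lambda", label=TRUE)  # each line traces one coefficient as log(lambda) increases
abline(v=log(ft_heightcv$lambda.min), lty=2)  # the value of lambda chosen by cross-validation

Coefficients get pushed towards zero as λ increases, with the least useful predictors hitting zero first.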
5.7 Variable Importance

Sometimes we are interested not just in which variables best predict a response, but
how important they are relative to each other, as in Exercise 5.2. There are a few
options here in terms of how to approach this sort of problem.
Recall the difference between marginal and conditional effects (Sect. 3.1.2)—
the estimated effect of a predictor will change depending on what other terms
are included in the model, because linear models estimate conditional effects. So
when measuring the relative importance of predictors, we can expect to get different
answers depending on what other terms are included in the model, as indeed happens
in Code Boxes 5.7 and 5.8.
One option is to use forward selection to sequentially enter the predictor that
most reduces the sum of squares at each step (Code Box 5.7). This is one way to
order variables from most important to least, but not the only way (e.g. backward
selection, which often leads to a different ordering). The table in Code Box 5.7 is
intuitive, breaking down the overall model R2 into components due to each predictor,
but it does so in a misleading way. By adding terms sequentially, the R2 for the first
predictor temp estimates the marginal effect of temperature, because it was added
to the model before any other predictors. But by the time temp.seas was added to
the model, all other predictors had been included, so its conditional effect was being
estimated in Code Box 5.7. So we are comparing “apples with oranges”—it would
be better to either include all other predictors in the model or none of them when
quantifying the relative effects of predictors.
> stepAnova
Step Df Deviance Resid. Df Resid. Dev AIC R2
1 NA NA 130 355.9206 130.93585 NA
2 + rain.wetm -1 74.598450 129 281.3221 100.12370 0.209592965
3 + temp -1 16.150298 128 265.1718 92.37867 0.045376129
4 + rain -1 2.586703 127 262.5851 91.09452 0.007267641
5 + temp.seas -1 1.912441 126 260.6727 90.13694 0.005373225
Deviance means the same thing as sum of squares for a linear model, and deviance(ft1)
gets the total sum of squares needed to construct R2 . In the line calling the step function,
setting k=0 ensures no penalty when adding terms, so that all four variables get added, in
decreasing order of importance.
We can see that rain.wetm explains about 21% of variation in plant height, temp adds
another 5%, and the other two variables do very little.
But there are different ways we could add these terms to the model! See Code Box 5.8
for alternatives.
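A minimal sketch of the sort of call that could have produced the stepAnova table above, assuming ft1 is the intercept-only model (so that deviance(ft1) is the total sum of squares); the exact code used in Code Box 5.7 may differ:

library(ecostats)
data(globalPlants)
ft1 = lm(log(height)~1, data=globalPlants)  # intercept-only model
ft_full = lm(log(height)~temp+rain+rain.wetm+temp.seas, data=globalPlants)
stepFit = step(ft1, scope=formula(ft_full), direction="forward", k=0)  # k=0: no penalty, so all terms enter
stepAnova = stepFit$anova
stepAnova$R2 = stepAnova$Deviance/deviance(ft1)  # share of total sum of squares explained at each step
stepAnova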
Code Box 5.8 presents two alternatives: first, estimating the marginal effect of
a predictor, with the variation in height explained by this predictor when included
in the model by itself; or, second, estimating the conditional effect of a predictor,
with the variation in height it explains not being captured by other predictors. Fitting
the full model and looking at standardised coefficients is another and more or less
equivalent way to look at conditional effects (Code Box 5.9). The advantage of using
standardised coefficients is that this method can be readily applied to other types of
models (e.g. using a LASSO).
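One simple way to get standardised coefficients is to scale the response and all predictors to unit variance before fitting—a minimal sketch, reusing globalPlants from earlier code boxes (not necessarily the exact approach of Code Box 5.9):

ft_std = lm(scale(log(height)) ~ scale(temp) + scale(rain) + scale(rain.wetm) + scale(temp.seas),
    data=globalPlants)
coef(ft_std)  # slopes are now in standard deviation units, so their sizes can be compared directly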
Looking at marginal vs conditional effects can give quite different answers (as
in Code Box 5.8), especially when predictors are correlated. Neither of these is a
perfect way to measure what is happening, and both seem to miss some details.
The problem with looking at marginal effects only is that if two predictors are
highly correlated, measuring very similar things, then both can have large marginal
effects, even if one predictor is not directly related to the response. For example, the
marginal effect of temperature seasonality is R2 = 13% (Code Box 5.8, temp.seas),
but there seems to be little effect of temp.seas on plant height after annual temper-
ature (temp) has been added to the model (Code Box 5.7, R2 < 1% for temp.seas).
It seems that the marginal R2 for temp.seas was as high as 13% simply because
it is correlated with temp. The temp predictor on the other hand does seem to be
important, because even after including other predictors in the model, it still explains
about 5% of variation in plant height (Code Box 5.8).
The problem with looking at conditional effects only is that if two predictors are
highly correlated, measuring very similar things, then the conditional effect of each
is small. For example, total precipitation (rain) and rainfall in the wettest month
(rain.wetm) are very highly correlated (Code Box 3.7). Hence the conditional
effect of each, after the other has already been included in the model, is small (Code
Box 5.8) because most of the information captured in rain has already entered the
model via rain.wetm, and vice versa. However, rainfall is clearly important to the
distribution of plant height—it was in all models produced by best-subsets selection
(Code Box 5.4) and actually explains about 8% of variation after temperature has
been added to the model; a leave-one-out approach misses this part of the story
because rainfall enters the model via two variables. We would need to leave both
rainfall variables out to see this—as in Code Box 5.10.
Model 1: log(height) ~ 1
Model 2: log(height) ~ temp + temp.seas
Model 3: log(height) ~ temp + rain + rain.wetm + temp.seas
Res.Df RSS Df Sum of Sq F Pr(>F) R2
1 130 355.92
2 128 289.00 2 66.917 16.173 0.00000056 0.188011
3 126 260.67 2 28.331 6.847 0.00150359 0.079599
Looking at the effect of temperature after rainfall:
> ft_onlyRain = lm(log(height)~rain+rain.wetm,data=globalPlants)
> rainAn=anova(ft_int,ft_onlyRain,ft_clim)
> rainAn$R2 = rainAn$`Sum of Sq`/deviance(ft_int)
> rainAn
Model 1: log(height) ~ 1
Model 2: log(height) ~ rain + rain.wetm
Model 3: log(height) ~ temp + rain + rain.wetm + temp.seas
Res.Df RSS Df Sum of Sq F Pr(>F) R2
1 130 355.92
2 128 279.80 2 76.118 18.3964 0.0000001 0.213863
3 126 260.67 2 19.130 4.6233 0.0115445 0.053747
temperature seems able to explain about 19% of global variation in plant height; then rainfall
can explain about 8% more, whereas over 21% of variation can be explained by rainfall alone.
This idea is visualised in Fig. 5.5.
For Angela’s height data, one solution is to aggregate variables into types (tem-
perature vs rainfall) and look at the importance of these variable types as a unit,
Fig. 5.5: Schematic diagram of relative importance of temperature and rainfall for
Angela’s height data, based on results in Code Box 5.10. Temperature and rainfall
variables jointly explain about 27% of global variation in plant height for Angela’s
data, but temperature and rainfall each on their own explain closer to 20% of variation.
About 5% of variation can be attributed to temperature, about 8% to rainfall, and
the remaining 14% could be explained by either (it is confounded). This sort of plot,
while conceptually helpful, is difficult to generalise to several predictors
as in Code Box 5.10. Each of temperature and rainfall, on its own, seems able to
explain about 20% of variation in plant height, but adding the other variable type
as well explains an additional 5–8% of variation, as visualised in Fig. 5.5. So we
may conclude that temperature and rainfall are both important, separately, rainfall
perhaps slightly more so, but we can do a better job (about 5% better) at explaining
global variation in plant height by looking at temperature as well.
Plenty of alternative approaches could be used here. The simplest is to reduce
multi-collinearity by removing highly correlated predictors—this reduces the over-
lap, so conditional and marginal effects become more comparable. For Angela’s data,
we could have reduced the dataset to one temperature and one rainfall variable—a
model with just temp and rain, for example, which ended up being suggested by
step anyway (Code Box 5.5). Another option is to use structural equation modelling
(Grace, 2006) to explicitly build into the model the idea that while temperature and
rainfall are important, each is measured using multiple predictors. A more contro-
versial option is to use a technique that averages measures of variable importance
across different choices of model, which has been very popular in some parts of
ecology under the name hierarchical partitioning (Chevan & Sutherland, 1991).
The issue with that type of approach is that coefficients have different meanings
depending on what other terms are included in the model—recall linear models
estimate conditional effects, so changing what terms are in the model changes what
we are conditioning on. So it makes little sense to average measures of variable
importance across different models, which condition on different things, meaning
we are measuring different things.
5.8 Summary
Say you have a model selection problem, like Angela’s (Exercise 5.1). We have seen
a suite of different tools that can be used for this purpose. So what should she actually
do? Well, she could try a number of these methods; the important thing is to abide
by a few key principles:
• Model selection is difficult and will be most successful when there are only a
few models to choose between—so it is worth putting careful thought into what
you actually want to compare, and deciding whether you can shortlist just a few
candidate models.
• A key step is choosing model complexity—how many terms should be in the
model? Too few means your model will be biased, too many means its predictions
will be too variable, and we are looking to choose a model somewhere in the
middle. A good way to choose the model complexity for your data is to consider
how well different models predict to new, independent data—directly, typically
using some type of CV, or indirectly, using information criteria.
• If you do have a whole heap of predictors, penalised estimation using methods
like the LASSO is a nice solution that returns an answer quickly, meaning it is
applicable to big data problems. Stepwise methods can also be useful, but it is
worth starting them from a model near where you think the right model will be.
For example, if you have many predictors but only want a few in the final model,
it would be much better to use forward selection starting with a model that has no
predictors in it than to use backward selection from a model with all predictors
in it.
In the analyses for Angela’s paper (Moles et al., 2009), there were 22 potential
predictors, which we shortlisted to 10; then I specially wrote some code to use
all-subsets selection and CV to choose the best-fitting model. With the benefit of
hindsight I don’t think the added complexity implementing an all-subsets algorithm
justified the effort, as compared to, say, forward selection. A decade on, if dealing
with a similar problem, I would definitely still shortlist variables, but then I would
probably recommend a LASSO approach.
Model selection is an active area of research, and the methods used for this
problem have changed a lot over the last couple of decades, so it is entirely possible
that things will change again in the decades to come!
Exercise 5.4: Head Bobs in Lizards—Do Their Displays Change with the
Environment?
Terry recorded displays of 14 male Anolis lizards in the wild (Ord et al., 2016).
These lizards bob their head up and down (and do push-ups) in attempts to
attract the attention of females. Terry measured how fast they bobbed their
heads and wanted to know which environmental features (out of temperature,
light, and noisiness) were related to head bobbing speed. The data, with one
observation for each lizard, can be found in the headbobLizards dataset.
What type of inference method is appropriate here?
What sort of model would you fit?
Load the data and take a look. Would it make sense to transform any of the
variables in the data prior to analysis?
Which environmental variables best predict head bob speed?
Key Point
A random effect is a set of terms in a model that are assumed to come from
a common distribution (usually a normal distribution). This technique is most
commonly used to capture the effects of random factors in a design, i.e. factors
whose levels are sampled at random from a larger population of potential
levels.
Graeme sampled seven estuaries, some modified and some relatively pristine. In each
estuary he deployed settlement plates at four to seven sites, and counted the invertebrates
that established on those plates (Clark et al., 2015).
What factors are there? Fixed or random?
Exercise 6.1 is an example of the most common scenario when a random effect
pops up, nested factors. A factor B is nested in A if each level of B only occurs in one
level of A. In Exercise 6.1, the estuary factor is nested within modification, because
each estuary is classified as either pristine or modified (Fig. 6.1). Nested factors are
not necessarily random, but they probably should be—otherwise you would have a
tough job making inferences at the higher sampling level (“factor A”).
Fig. 6.1: Total invertebrate counts from Graeme’s estuary data (Exercise 6.1), with
replicate measurements at each of seven estuaries that were classified as either mod-
ified or pristine. Because of the multiple levels of sampling (sampling estuaries, then
sites within estuaries), Graeme will need a mixed effects model to make inferences
about the effect of estuary modification on invertebrates
For example, if one is studying an environmental impact on sites, in the first instance
look to increase replication by sampling more sites (rather than by taking more
measurements within sites). But if one is studying the effects of fine-scale (within-site)
environmental variation, replication within sites clearly has a more important role.
Key Point
Factor B is nested within A if each level of B only occurs within one level of A.
In Exercise 6.1, Graeme’s estuaries were nested within modification—each
of the seven estuaries (B) was classified as either modified or pristine (A).
A nested factor, like estuary, is typically sampled as a random factor.
A common tool for fitting mixed models is the R package lme4 (Bates et al., 2015) as
in Code Box 6.1. A brief outline is given here, but a full text is available online that
gets into the gory details (Bates, 2010). Models are fitted using the lmer function,
with random effects specified in brackets.
Code Box 6.1: Fitting a Linear Mixed Model to the Estuary Data of
Exercise 6.1
> library(ecostats)
> data(estuaries)
> library(lme4)
> ft_estu = lmer(Total~Mod+(1|Estuary),data=estuaries)
> summary(ft_estu)
Linear mixed model fit by REML
Formula: Total ~ Mod + (1 | Estuary)
Data: estuaries
AIC BIC logLik deviance REMLdev
322.4 329.3 -157.2 322.4 314.4
Random effects:
Groups Name Variance Std.Dev.
Estuary (Intercept) 10.68 3.268
Residual 123.72 11.123
Number of obs: 42, groups: Estuary, 7
Fixed effects:
Estimate Std. Error t value
(Intercept) 39.053 3.237 12.066
ModPristine -11.243 4.287 -2.623
Any effect of modification on invertebrate abundance? What effect?
6.1.1 (Huh|What)?
Note in Code Box 6.1 the part of the formula (1|Estuary). In R formulas, 1
means fit a y-intercept (because the intercept can be understood as a coefficient of
1, β0 = 1 × β0 ). Using (1|Estuary) means “shift the y-intercept to a different
value for each level of the factor Estuary”. The vertical bar (|) means “given” or
“conditional on”—so we are saying that the value of the y-intercept depends on
Estuary, and it needs to be different for different levels of Estuary. In short, this
introduces a main effect for Estuary that is random.
You can also use this notation to introduce random slopes—(pH|Estuary) would
fit a different slope against pH (as well as intercept) in each estuary.
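As a minimal sketch of the syntax only—the estuaries data have no pH variable, so a hypothetical one is simulated here purely to show how a random slope is specified:

library(ecostats)
library(lme4)
data(estuaries)
set.seed(1)
estuaries$pH = rnorm(nrow(estuaries), mean=8, sd=0.3)  # hypothetical predictor, for illustration only
ft_slopeDemo = lmer(Total ~ Mod + pH + (pH|Estuary), data=estuaries)
summary(ft_slopeDemo)  # reports a variance for the pH slope across estuaries, and its correlation with the random intercept

With only a handful of observations per estuary, a model like this would usually be overkill (don’t be surprised if it warns about a singular fit); the point is just the (pH|Estuary) notation.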
6.2 Linear Mixed Effects Model

The linear mixed effects model for the response of observation i, yi, in its general
form, is as follows:

yi ∼ N(μi, σ²)
μi = β0 + xiᵀβ + ziᵀb
b ∼ N(0, Σ), independently of yi
Basically, it looks just like the standard (fixed effects) linear model, except now there
is a third line—this line says that some of the coefficients in the model are random
(and normal and independent of yi ). Vector notation is used here, and it is worth
noting that x is a list of any fixed number of predictor variables, as is z. These lists
do not need to have the same length—you can have different numbers of fixed vs
random effects in a model. Every predictor should, however, have a value for every
observation (every i); otherwise, you may need to use special methods to deal with
the missing data.
Graeme (Exercise 6.1) had a model with random intercepts for Estuary, i.e. the
zi were indicator variables for each estuary.
The Σ in the foregoing equation is a variance–covariance matrix, an idea that
will be discussed in more detail in Chap. 7. Basically this allows us to introduce
random effects that are correlated with each other instead of being independent.
In this chapter we will assume all random effects are independent with constant
variance (except for Exercise 6.2), so for the jth element of b we could rewrite the
assumption as
b_j ∼ N(0, σ_b²)
The Σ notation allows for situations where there is more than one type of random
effect in the model, and they may be correlated with each other, such as when including
random slopes as well as random intercepts (as needed in Exercise 6.4).
y ∼ N (μy, σ 2 )
As before, the normality of y doesn’t really matter (due to the central limit theo-
rem), except for small samples/strongly skewed data/outliers. You can check this
on a normal quantile plot.
μi = β0 + xTi β + zTi b
(As before, when x variables are factors, this assumption doesn’t really matter.)
This can be checked by looking for no pattern on a residual plot; a U shape is
particularly bad news.
4. The random effects b are independent of y.
This assumption can be guaranteed by sampling in a particular way (how?).
5. The random effects b are normally distributed, sometimes with constant variance.
These assumptions don’t seem to matter much (but not because of the central limit
theorem—for a different reason; see Schielzeth et al., 2020, for example).
Note these are the same old linear model assumptions, checked in the same way,
but with a couple of assumptions on the additional random terms in the model, the
random effects b.
Code Box 6.2: Residual Plots from a Mixed Model for Exercise 6.1
ft_estu = lmer(Total~Mod+(1|Estuary),data=estuaries)
scatter.smooth(residuals(ft_estu)~fitted(ft_estu),
xlab="Fitted values",ylab="Residuals")
abline(h=0,col="red")
Or to plot residuals against “unconditional” predicted values (using the fixed effects term
only):
scatter.smooth(residuals(ft_estu)~predict(ft_estu,re.form=NA),
xlab="Fitted values (no random effects)",ylab="Residuals")
abline(h=0,col="red")
[Output of Code Box 6.2: two residual vs. fitted value plots, “With random effects” and “Without random effects”; x-axes “Fitted values” and “Fitted values (no random effects)”, y-axes “Residuals”.]
There are two types of residual plot that you could produce, depending on whether
or not you include random effects as predictors in the model. Including them some-
times creates artefacts (inducing a correlation between residuals and predicted val-
ues), so it is often a good idea to consider both types of plot, as in Code Box 6.2.
Studying these residuals, there is a suggestion that residuals are actually less variable
in one group (modified) than the other. Perhaps modified estuaries are more homo-
geneous? Although having said that, we are talking about quite small sample sizes
here (seven estuaries sampled in total) and making generalisations about changes in
variability from this sample size is a tough ask. We will consider this in Exercise 6.2.
6.3 Likelihood Functions

Recall that (fixed effects) linear models are fitted by least squares—minimise the
sum of squared errors in predicting y from x. Mixed effects models are fitted by
maximum likelihood or restricted maximum likelihood.
The likelihood is the joint probability of the data, viewed as a function of the model parameters; we usually take the log of this, which tends to be easier to work with.
For example, in a linear model, we assume (conditionally on x) that ob-
servations are independent and come from a normal distribution with mean
μi = β0 + xi β (for observation i) and standard deviation σ. The probability
function of observation i is
f_{Y_i}(y_i) = (1 / (√(2π) σ)) exp( −(y_i − μ_i)² / (2σ²) )

and so the log-likelihood of the linear model with parameters β0, β, σ is

ℓ(β0, β, σ; y) = Σ_{i=1}^n log f_{Y_i}(y_i) = −(n/2) log 2π − n log σ − (1/(2σ²)) Σ_{i=1}^n (y_i − μ_i)²    (6.1)
In practice, we don’t know the values of these parameters and want
to estimate them. We can estimate the parameters by maximum likelihood
estimation—finding the parameter values that make ℓ(β0, β, σ; y), hence
L(β0, β, σ; y), as big as possible.
Notice that to maximise ℓ(β0, β, σ; y) with respect to β0 and β, we need to
minimise Σ_{i=1}^n (y_i − μ_i)², which is what least-squares regression does. That
is, least-squares regression gives the maximum likelihood estimate of a linear
model.
To test whether model M1 (with parameter estimates θ1) has a significantly
better fit than a simpler model M0 (with estimates θ0), we often use a (log-)
likelihood ratio statistic:

−2 log Λ(M0, M1) = 2 ℓ_{M1}(θ1; y) − 2 ℓ_{M0}(θ0; y)    (6.2)
For a linear model, this is a function of the usual F statistic. The multiplier of
−2 looks weird but ends up being convenient theoretically.
We often measure how well a model M fits data, as compared to some
“perfect” model S (which makes predictions θ S as close to data as possible),
using the deviance:
D_M(y) = 2 ℓ_S(θS; y) − 2 ℓ_M(θ; y)    (6.3)
The likelihood function, for a given value of model parameters, is the joint
probability function of your data. The higher the value, the more likely your data.
The key idea of maximum likelihood estimation is to choose as your parameter
estimates the values that make your data most likely, i.e. the values for parameters
that would have given the highest probability of observing the values of the response
variables you actually observed. Maths Box 6.1 gives the likelihood function for
linear models (which ends up simplifying to least-squares estimation), and Maths
Box 6.2 discusses challenges extending this to mixed models.
Under a broad set of conditions that most models satisfy, maximum likelihood
estimators
• are consistent (in large samples, they go to the right answer),
• are asymptotically normal (which makes inference a lot easier),
• are efficient (they have minimum variance among consistent, asymptotically nor-
mal estimators).
Basically, they are awesome, and these properties give most statisticians license to
base their whole world on maximum likelihood1 —provided that you can specify a
plausible statistical model for your data (you need to know the right model so you
are maximising the right likelihood).
Many familiar “classical” statistical procedures can be understood as maximum
likelihood or some related likelihood-based procedure. The sample mean can be
understood as a maximum likelihood estimator, as can sample proportions; ANOVA
is a likelihood-based procedure, as are χ2 tests for contingency tables. . ..
Restricted maximum likelihood (REML) is a slight variation on the theme in the
special case of linear mixed models—it is more or less a cheat fix to ensure all
parameter estimates are exactly unbiased when sampling is balanced (they are only
approximately unbiased for maximum likelihood, not exactly unbiased). This is worth
doing if your sample size is small relative to the number of terms in the model. In a
1 Well, Bayesians multiply the likelihood by a prior and base their whole world around that.
linear model, i.e. a fixed effects model with a normally distributed response variable,
least-squares regression is (exactly) restricted maximum likelihood. REML is the
default method in the lme4 package; for maximum likelihood you use the argument
REML=FALSE as in Code Box 6.3.
For a mixed model, the likelihood is harder to compute, because the values that y might take depend on the random effects b. We know
the conditional distribution of data fYi |B=b (yi ), given random effects, and we
know the probability function of random effects fB (b). But we don’t directly
observe the values of the random effects, so we have to marginalise over all
possible values of b:
f_{Y_i}(y_i) = ∫_b f_{Y_i|B=b}(y_i) f_B(b) db
This integral is our first difficulty—it often makes it hard to calculate the
likelihood, and often we can’t write down a simple expression for the maximum
likelihood estimator (in contrast to Maths Box 6.1, for example). Sometimes
it is possible to solve integrals exactly, but sometimes it is not, and they have
to be approximated. The linear mixed model log-likelihood works out OK (it
has a closed form), but extensions of it may not (e.g. Sects. 10.6.3 and 12.3).
A second difficulty for mixed models is that tools we conventionally use
for inference don’t always work. For example, the likelihood ratio statistic
−2 log Λ of Maths Box 6.1 can be shown to often approximately follow a
known distribution (a χ2 distribution) when the null hypothesis is true, a
result that we often use to make inferences about key model terms (in R, using
the anova function). However, this result does not hold if the null hypothesis
puts a parameter on a boundary of its range of possible values. When testing
whether a fixed effect should be included in the model, we test to see whether
a slope coefficient (β) is zero, which is not on a boundary because slopes
can take negative or positive values. But when testing to determine whether a
random effect should be included in the model, we are testing to see whether
its variance (σb2 ) is zero, which is as small as a variance can get. Hence we have
a boundary problem, and the theory that says −2 log Λ has a χ2 distribution
does not hold when testing random effects.
6.4 Inference from Mixed Effects Models

Inference from mixed effects models is a little complicated, because the likelihood
theory that usually holds sometimes doesn’t when you have random effects (Maths
Box 6.2).
Note in Code Box 6.1 that there are no P-values for the random effects or the
fixed effects—these were deliberately left out because the package authors are a little
apologetic about them; they would only be approximately correct. The t-values give
you a rough idea, though a t-value larger than about 2 in absolute value is probably significant at the 0.05
level. (You could also try the nlme package for slightly more friendly output.)
You can use the anova function as usual to compare models, as in Code Box 6.3.
This uses a likelihood ratio test (comparing the maximised likelihood under the null
and alternative models, as in Maths Box 6.1), and curiously, this often does return P-
values in the output. When comparing models that differ only in fixed effects terms,
these P-values are fairly reliable (although a little dicey for small sample sizes). It is
advised that you use REML=FALSE when fitting the models to be compared (models
should be fitted by maximum likelihood to do a likelihood ratio test).
Code Box 6.3: Using anova to Compare Mixed Effects Models for Estuary
Data
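A minimal sketch of this kind of comparison for Graeme’s data—refitting by maximum likelihood and testing the Mod fixed effect with a likelihood ratio test (the models compared in this code box may differ):

library(ecostats)
library(lme4)
data(estuaries)
ft_estuML = lmer(Total ~ Mod + (1|Estuary), data=estuaries, REML=FALSE)
ft_noModML = lmer(Total ~ 1 + (1|Estuary), data=estuaries, REML=FALSE)
anova(ft_noModML, ft_estuML)  # likelihood ratio test for the modification effect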
You can use anova to similarly test for random effects, but this gets a little
complicated for a couple of reasons:
• It doesn’t work when you have a single random effect (as Graeme does); Zuur
et al. (2009) proposes a workaround using the nlme package and gls.
• P-values are very approximate, and the theory used to derive them breaks down
when testing for random effects (Maths Box 6.2). The P-values tend to be roughly
double what they should be, but even then are still very approximate.
Use the confint function (just like for lm), as in Code Box 6.4.
Code Box 6.4: Confidence Intervals for Parameters from a Mixed Effects
Model for Estuary Data
> confint(ft_estu)
Computing profile confidence intervals ...
2.5 % 97.5 %
.sig01 NA 7.613337
.sigma 8.944066 14.084450
(Intercept) 32.815440 45.246159
ModPristine -19.461059 -2.994392
For particular values of a random effect (e.g. the size of the effect for a given
estuary in Exercise 6.1) the mechanics of inference are messier, but you can get
intervals that can be interpreted more or less as approximate 95% confidence intervals
for the random effects (true values of shift in each estuary), as in Code Box 6.5.
Code Box 6.5: Prediction Intervals for Random Effects Terms in a Mixed
Effects Model
> rft=ranef(ft_estu,condVar=T)
> library(lattice)
> dotplot(rft)
[Dotplot of predicted random intercepts for each estuary (JER, BOT, CLY, JAK, WAG, KEM, HAK), with 95% prediction intervals, on a scale from about −5 to 5.]
This is a dotplot with 95% prediction intervals for random effects from the mixed effects
model fitted to the estuary data, produced by the code in Code Box 6.5. Any evidence of an effect of estuary?
Key Point
Inferences from standard software for mixed models are only approximate
and are quite arm-wavy when testing for a random effect. If you want a
better answer, you should use a simulation-based approach like the parametric
bootstrap—this is advisable if testing for a random effect.
Code Box 6.6: Using the Parametric Bootstrap to Compute the Standard
Error of the Mod Fixed Effect in Exercise 6.1
We will estimate the standard error of the Mod effect using a parametric bootstrap—by
simulating a large number (nBoot) of datasets from the fitted model, re-estimating the Mod
effect for each, then taking the standard deviation.
> nBoot=500
> bStat=rep(NA,nBoot)
> ft_estu = lmer(Total~Mod+(1|Estuary),data=estuaries)
> for(iBoot in 1:nBoot)
+ {
+ estuaries$TotalSim=unlist(simulate(ft_estu))
+ ft_i = lmer(TotalSim~Mod+(1|Estuary),data=estuaries)
+ bStat[iBoot] = fixef(ft_i)[2]
+ }
> sd(bStat) #standard error of Mod effect
[1] 4.294042
How does this compare to the standard error from the original model fit?
We will test for an effect of Estuary using the anovaPB function in the ecostats pack-
age. This computes a likelihood ratio statistic (LRT) to compare a model with an Estuary
effect (ft_estu) to a model without (ft_noestu), then repeats this process a large number
(n.sim) of times on datasets simulated from our fitted model under the null hypothesis
(ft_noestu). The P-value reports the proportion of test statistics for simulated data ex-
ceeding the observed value.
> ft_noestu = lm(Total~Mod,data=estuaries)
> library(ecostats)
> anovaPB(ft_noestu,ft_estu,n.sim=99)
Analysis of Deviance Table
6.6 Design Considerations

When you have random effects, there are now multiple sample sizes to worry about:
Total sample size—The larger it is, the better your estimates of lower-level fixed
effects (and residual variance).
The number of levels of the random factor—The larger it is, the better your
estimate of the random effect variance and any higher-level effects depending on it.
You could have a million observations, but if you only have a few levels of your
random effect, you have little information about the random effect variance and,
hence, about anything it is nested in. For example, if Graeme had taken 100 samples
at each of 3 estuaries, he would probably have less of an idea about the effect of
modification than he does now, with his 4–7 samples at each of 7 estuaries.
There are broad principles but no hard guidelines on precisely how large your
various sample sizes need to be. Some would argue you need to sample at least five
levels of a factor in order to call it random, but this is a fairly arbitrary number; the
idea is that the more levels you sample, the better you can estimate the variance of
the random effect. A good way to decide on your design in any setting is to think
about how accurately you want target quantities to be estimated and to study how
accurately you can estimate these target quantities under different sampling scenarios.
Usually, this requires some pilot data, to get a rough estimate of parameters and their
uncertainty, and some power analyses or margin of error calculations. For mixed
models this may need to be done by simulation, e.g. using the simr package, as in the sketch below.
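For example, a rough power analysis by simulation might look like the following sketch, which uses the simr package and treats Graeme’s fitted model ft_estu (Code Box 6.1) as if it were pilot data; powerSim, fixed and extend are simr functions, and the effect sizes being simulated come straight from the fitted model, so they are assumptions to scrutinise rather than facts:

library(simr)
powerSim(ft_estu, test=fixed("Mod", "lr"), nsim=100)  # power to detect the Mod effect, current design
ft_more = extend(ft_estu, along="Estuary", n=14)      # hypothetical design with 14 estuaries instead of 7
powerSim(ft_more, test=fixed("Mod", "lr"), nsim=100)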
Another issue to consider is what sample size to use within each level of the
random factor, e.g. how many samples Graeme should take within each estuary. It is
usually a good idea to aim for balanced sampling (i.e. the same number of samples
in each treatment combination), but mixed effects models do not require balanced
sampling. Some texts have said otherwise, largely because old methods of fitting
mixed effects models (via sums of squares decompositions) did require balanced
sampling for random effects estimation. But that was last century. (Restricted) maxi-
mum likelihood estimation has no such constraint. It is usually a great idea to aim for
balanced sampling—it is usually better for power and for robustness to assumption
violations (especially the equal variance assumption). But it is not necessary.
6.7 Situations Where Random Effects Are and Aren’t Used

So you have a random factor in your study design. Does it really need to be treated as
a random effect in modelling? Maybe not. Use random effects if both the following
conditions are satisfied:
• You have a random factor (i.e. a large number of possible levels, from which you have a
random sample)
• You want to make general inferences across the random factor itself (across all
its possible levels, not just those that were sampled).
Usually, if the first condition is satisfied, the second will be too, but it need not be.
If you are happy making inferences conditional on your observed set of levels of
the random factor, then there is no harm in treating the effect as fixed and saving
yourself some pain and suffering. (But this is not always possible!)
Using fixed effects models has the advantage that estimation and inference are
much simpler and better understood—indeed some more advanced models (dis-
cussed in later chapters) can handle fixed effects models only. Using random effects
has the advantage that inferences at higher levels in the hierarchy are still permissible
even when there is significant variation at lower levels. For example, in Exercise 6.1,
if Estuary were treated as a fixed effect, then if different estuaries had different
invertebrate abundances, we could not have made any inferences about the effect of
modification. If we knew that different estuaries had different abundances, then by
default mod would have had a different abundance (because different estuaries had
different levels of mod). But treating Estuary as a random effect, we can estimate
the variation in abundance due to estuary and ask if there is a mod effect above and
beyond that due to variation with Estuary.
What if I need to treat my factor as random (to make inferences about higher-level
effects) but I didn’t actually sample levels of this factor at random? Well that’s a bit
naughty. Recall that for a factor to be random, the levels of it used in your study need
to be a random sample from the population of possible values. Wherever possible,
if you want to treat a factor as random in analysis, you should sample the levels of
the factor randomly. (Among other things, this ensures independence assumptions
are satisfied and that you can make valid inferences about higher-level terms in the
model.)
That said, the interpretation of random effects as randomly chosen levels is often
stretched a bit; random effects are sometimes used as a mathematical device that can
• induce correlation within groups of observations (e.g. invertebrate abundances
across samples from the same estuary are correlated). This idea can be
extended to handle spatial or temporal correlation, as in the following chapter;
• stabilise parameter estimates when there are lots of parameters (fixed effects
models want the number of parameters to be small compared to the sample size
n).
In these instances we are using random effects as a mathematical device rather than
using them to reflect (and generalise from) study design. A cost of doing this is that
inference becomes more difficult.
Recall the LASSO—a shrinkage method for improving predictive performance
by reducing the variance in parameter estimates. The LASSO minimises
min Σ_{i=1}^n (y_i − μ_i)² + λ Σ_j |β_j|
where the value in row i and column j tells us the correlation between yi and
y j . This is known as an autoregressive process with lag one, or AR(1).
Data from such a process can be described as having autocorrelation be-
cause values of the response are correlated with themselves. One way to
estimate ρ would be to look at neighbouring values. Since cor(yi, yi−1 ) = ρ,
we could estimate ρ using the sample correlation coefficient of the values
y2, y3, . . . , yk against their “lag 1” values y1, y2, . . . yk−1 . A better estimator,
e.g. a maximum likelihood estimator, would consider other lags too, because
these also contain some information about ρ.
A useful summary of the autocorrelation structure is the sample auto-
correlation function, the sample autocorrelation of yi with yi−h plotted against
lag h. For an AR(1) process we would expect the sample autocorrelation to
(approximately) decay exponentially towards zero as the lag h increases.
For spatial data, we might instead assume that correlation is a function
of how far apart two points are. For phylogenetic data, we might assume
correlation is a function of the phylogenetic distance between the relevant
nodes. The sample autocorrelation functions are then slightly more difficult
to compute and would be plotted against distance rather than lag. We could
similarly use sample autocorrelations to estimate parameters in our assumed
correlation structure for spatially or phylogenetically structured data.
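To see these ideas in action, one can simulate a single AR(1) series and look at its sample autocorrelation function—a minimal sketch in base R:

set.seed(1)
n = 200
rho = 0.8
y = numeric(n)
y[1] = rnorm(1)
for (i in 2:n)
  y[i] = rho*y[i-1] + rnorm(1, sd=sqrt(1-rho^2))  # AR(1) with stationary variance 1
cor(y[-1], y[-n])  # lag-1 sample correlation, a simple estimate of rho
acf(y)             # sample autocorrelation function, decaying roughly like rho^lag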
Key Point
Sometimes data are collected in a way that (potentially) introduces dependence
that is a function of some known variable such as time, space, or phylogeny,
sometimes called a structured correlation. In such cases we can fit a statistical
model that estimates and accounts for this dependence—usually done using
random effects that have a structured correlation.
∼ N (0, Σ)
where yi∗ and xi∗ can be understood as contrasts to y and x, calculated using
the contrast matrix σΣ^{−1/2}, and the vector of errors ε∗ is multivariate normal
with variance–covariance matrix

σ²Σ^{−1/2} Σ Σ^{−1/2} = σ²I

where I is the identity matrix, with all diagonal elements equal to one and
all off-diagonal elements (covariances) zero. This means that the transformed
errors εi∗ are independent with variance σ², and our dependent-data linear
model is equivalent to fitting an ordinary (independent errors) linear model to
contrasts constructed using the assumed correlation matrix of errors.
Accounting for violations of independence is one reason for fitting a model with
a structured correlation; another reason is to model heterogeneity—if your treatment
effect varies across time/space/phylogeny, you will need to incorporate this into your
model in order to quantify it.
In this chapter, the main strategy for introducing a structured correlation into
models will be to use a random effect or an error term that contains the structured
correlation that is desired. The nlme package in R can do this and is good for
observations that are correlated in space or time (although it is more difficult to use
for discrete data, as in Chap. 10).
Another common strategy, not considered here, is autoregressive models, i.e.
models that assume the response at a given location (in space, time, or phylogeny) is
a function of nearby responses. Autoregressive models are typically the best option
for time series data, where you have a single chain of many measurements over time.
Time series analysis will not be considered here, as it requires a suite of additional
tools (Hyndman & Athanasopoulos, 2014) and is arguably less common in ecology
than the repeated measures setting considered below.
7.1 Longitudinal Analysis of Repeated Measures Data

Consider Exercise 7.1. In eight plots, Ingo took measurements of the number of
aphids present on seven sampling occasions following application of a bird exclusion
treatment. The issue that arises in this design is that repeated measures will be
more similar to each other within plots than they will be across plots, and the
autocorrelation may be higher for times that are closer together. Analysis of such
repeated measures data, in a way that can account for these features, is often referred
to as longitudinal data analysis.
As mentioned previously, Ingo actually took measurements in the oat field seven times following
application of the bird exclusion treatment.
This larger dataset is available as the oat object in the aphids dataset in
the ecostats package.
What sort of model is appropriate here?
There are four main approaches for modelling longitudinal data, which we will refer
to as
• ignoring it,
• the random intercept model,
• the random slope model, and
• use of a temporally structured random effect.
Ignoring it: While this approach needs to be used with caution, it is sometimes
OK to ignore correlation across repeated measures and use standard linear modelling
approaches for analysis, assuming repeated measures are independent of each other.
Fixed effects terms could be added to the model to try to account for changes over
time. This approach is appropriate if repeated measures are far enough apart in time
that residuals do not show evidence of autocorrelation, or if the autocorrelation is
effectively captured by predictors that vary over time—this should be checked, either
by constructing a sample autocorrelation plot of residuals or by fitting some of the
following models for comparison (Code Box 7.2). Remember the importance of the
independence assumption in linear models—if data are dependent, but you make
inferences using a linear model (assuming independence), you are stuffed!
Random intercept model: In this approach we include a random effect for the
subject that repeated measures are taken on (in the case of Exercise 7.1, this is the
plot). For example, for the aphid data of Exercise 7.1, we could model the mean of
plot i at time j using
μi j = β0 + xi j β + ui
where ui ∼ N (0, σ 2 ), i.e. we have a random effect in the model that takes different
values for different plots. An example of this is in Code Box 7.2, stored as aphid_int.
This is fine for situations where you have measurements at two time points (as in
Exercise 1.5 or Exercise 6.3), but when there are several, it is less commonly used.
The random intercept term induces correlation across repeated measures of the
subject, but a potential issue is that the correlation is the same strength irrespective
of how close together two sampling points are in time, which doesn’t make much
sense. This type of model is usually better suited to clustered or multi-level data (such
as Graeme’s estuary data, Exercise 6.1) but is simpler than the following model and
so is worth considering.
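A minimal sketch of a random intercept fit of this kind for the aphid data, with a quadratic trend in time as in Code Box 7.3 (the exact formula used in Code Box 7.2 may differ):

library(ecostats)
library(lme4)
data(aphids)
aphid_int = lmer(logcount ~ Treatment*(Time + I(Time^2)) + (1|Plot),
    data=aphids$oat, REML=FALSE)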
Random slope model: A random effect can also be included for the slope of the
relationship between response and time, for each subject. This is sometimes referred
to as a random slope model. For example, for the aphid data:
μi j = β0 + xi j β + u1i + ti j u2i
where ti j is the time of the jth sample from plot i, and (u1i, u2i ) is a bivariate normal
random effect. (That is, each of u1i and u2i is normal, and they are correlated. The
variance of each random effect and the correlation between them are estimated from
the data.) An example of this is in Code Box 7.2, stored as aphid_slope. The
random slope induces a correlation that is stronger between repeated measures that
are closer together in time.
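Continuing the sketch above, the random slope version simply adds Time to the random effects part of the formula:

aphid_slope = lmer(logcount ~ Treatment*(Time + I(Time^2)) + (Time|Plot),
    data=aphids$oat, REML=FALSE)
anova(aphid_int, aphid_slope)  # compare the two dependence structures (both fitted by ML)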
Temporally structured random effect: A temporally structured random effect could
be added to the model that is more highly correlated for points closer together in
time. For example, for the aphid data, we could model the mean of plot i at time j
using
μi j = β0 + xi j β + ui + ui j
where (as previously) ui is a random intercept term across plots (this term is optional
but often helpful), and now we have a multivariate normal random effect ui j that has
the same variance at every sampling time but a correlation that changes as a function
of time between repeat samples. For example, we could assume that the correlation
between sampling times t_ij and t_ij′ is

ρ^{|t_ij − t_ij′|}
which will decay to zero as the difference between sampling times increases. This
function is plotted in Fig. 7.1, and it defines what is known as an autoregressive
process with order one (well, it is a continuous time extension of this process). An
example of this is in Code Box 7.2, stored as aphid_CAR1. There are many different
ways of introducing temporally structured random effects into a model; this is one
of the simplest and most common.
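A minimal sketch of such a fit using the nlme package, with a continuous-time AR(1) correlation (corCAR1) within plots (formula details may differ from those in Code Box 7.2):

library(nlme)
aphid_car1 = lme(logcount ~ Treatment*(Time + I(Time^2)), random = ~1|Plot,
    correlation = corCAR1(form = ~Time|Plot), data = aphids$oat, method = "ML")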
Recall that the introduction to this chapter emphasised that not just any model
for correlation would do; we need to work towards specifying a correct model—
because inferences from a model with structured random effects rely critically on
their dependence assumptions, just as inferences from ordinary linear models rely
critically on their independence assumptions. So the goal, if we wish to make
inferences from a model, is to find the best-fitting model for the dependence in
the data. You could treat this as a model selection problem (Chap. 5), fitting each
longitudinal model and for example selecting the model with the smallest Bayesian
Fig. 7.1: Example of how autocorrelation might decay as time between samples
increases. Correlation between sampling times t_ij and t_ij′ is assumed here to be
ρ^{|t_ij − t_ij′|}, with ρ = 0.8. Note in this case that the correlation is quite high between
points sampled at similar times (separated by a few days) but is almost zero when
points are sampled at least a couple of weeks apart (> 15 days)
information criterion (BIC). In the case of Exercise 7.1, we ended up with a random
intercept model, as indicated in Code Box 7.2.
If you don’t wish to make inferences from the model and are just after predictions,
then there is an argument that you could ignore dependence and fit a standard linear
model, to keep things simple. The main problems from ignoring dependence—
confidence intervals too narrow and P-values too small—aren’t relevant if you are
interested in neither confidence intervals nor P-values. However, you can expect to
lose some predictive power if ignoring dependence, especially if predicting the value
of future repeated measures for existing subjects. I would only ignore dependence if
predicting to new, independent observations, and only if dependence was relatively
weak.
Of the remaining options for longitudinal analysis, the random intercept and ran-
dom slope models have the advantage that they can be implemented on pretty much
any mixed modelling software. The last option, including a temporally structured
random effect, needs more specialised software. It can be fitted in nlme in R for
normally distributed data.
As such, if there are a couple of candidate models for dependence that seem
reasonable, it might be advisable to try each of them to see if the results are robust
to the choice of longitudinal model. In the case of the aphid data of Exercise 7.1, we
found that the marginally significant effect of treatment seen in the random intercept
model (Code Box 7.3) did not remain when fitting a random slope model (Code
Box 7.4).
All of the aforementioned options involve fitting a linear (mixed) model and so
making the usual linear modelling assumptions, which are checked in the usual way.
Additionally, we now consider adding a correlation structure to repeated measures
of a subject. We can use model selection to choose between competing models for
the correlation structure, but it would be helpful to visualise it in some way also.
One common graphical technique for longitudinal data is a spaghetti plot, joining
the dots across repeated measures for each observation in the dataset, which can be
constructed using the interaction.plot function in R as in Code Box 7.1. Note
that the interaction.plot function treats time as a factor, so the sampling times
are distributed evenly along the x-axis, rather than having points closer together if
they are closer in time. This could be addressed by constructing the plot manually
(e.g. using lines to build it up subject by subject).
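A minimal sketch of building the plot manually, so that the x-axis respects the actual sampling times, using the same variable names and colours as Code Box 7.1:

library(ecostats)
data(aphids)
oat = aphids$oat
cols = c(rgb(1,0,0,alpha=0.5), rgb(0,0,1,alpha=0.5))
plot(logcount ~ Time, data=oat, type="n",
    xlab="Time (days since treatment)", ylab="Counts [log(y+1) scale]")
for (pl in unique(oat$Plot)) {
  dat_i = oat[oat$Plot==pl, ]
  dat_i = dat_i[order(dat_i$Time), ]
  lines(dat_i$Time, dat_i$logcount, col=cols[dat_i$Treatment[1]])  # one line per plot, coloured by treatment
}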
Sometimes a spaghetti plot reveals a lot of structure in data—for example, if
the lines cross less often than you would expect by chance, this suggests a positive
correlation across repeated measures within subjects. In Fig. 7.2a, there is a suggestion of a treatment effect,
with the reduction in aphid numbers being steeper in the bird exclusion treatment.
Beyond this pattern (looking within the red lines or within the black lines) there is
a suggestion of correlation in the data, with the lines not crossing each other very
often, considering the number of lines on the plot.
For a large number of subjects, a spaghetti plot can quickly become quite dense
and hard to interpret. One approach that helps with this is using transparent colours
(which can be done in R using the rgb function and alpha argument; see the second
line of Code Box 7.1) to reduce ink density; another option is to plot one (or several)
random samples of observations, at some risk of loss of information.
data(aphids)
cols=c(rgb(1,0,0,alpha=0.5),rgb(0,0,1,alpha=0.5)) #transparent colours
par(mfrow=c(2,1),mar=c(3,3,1.5,1),mgp=c(2,0.5,0),oma=c(0,0,0.5,0))
with(aphids$oat, interaction.plot(Time,Plot,logcount,legend=FALSE,
col=cols[Treatment], lty=1, ylab="Counts [log(y+1) scale]",
xlab="Time (days since treatment)") )
legend("bottomleft",c("Excluded","Present"),col=cols,lty=1)
mtext("(a)",3,adj=0,line=0.5,cex=1.4)
with(aphids$oat, interaction.plot(Time,Treatment,logcount, col=cols,
lty=1, legend=FALSE, ylab="Counts [log(y+1) scale]",
xlab="Time (days since treatment)"))
legend("topright",c("Excluded","Present"),col=cols,lty=1)
mtext("(b)",3,adj=0,line=0.5,cex=1.4)
[Fig. 7.2: two panels, (a) and (b), plotting counts on a log(y+1) scale against time since treatment (3, 10, 14, 18, 23, 30, 38 days), with separate lines for the Excluded and Present treatments.]
Fig. 7.2: Exploratory plots looking at the effect of bird exclusion on aphid abundance.
(a) A “spaghetti” plot of how abundance changes over time within each plot. (b) An
interaction plot looking at how mean (transformed) abundance changes over time
across the two treatments. Notice in (a) that there is relatively little crossing of lines,
suggesting some correlation, specifically a “plot” effect, and notice in (b) that under
bird exclusion, aphid numbers tend to be lower (increasingly so over time since
exclusion)
Another useful diagnostic is the autocorrelation function of residuals: whereas a correlation is usually computed between one variable and another, here we compute a correlation between a variable and itself at
different time lags, an autocorrelation. This can be plotted using the nlme package
in R for evenly spaced sampling times, and while we used this technique to produce
the figure in Code Box 7.3, the uneven intervals between samples were not accounted
for (as with the interaction plot). This could be addressed by manually constructing
such a function, binning pairs of residuals based on differences in sampling time, as
is commonly done for spatial data.
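A minimal sketch of this kind of check, assuming the random intercept model is refitted with nlme (the ACF function in nlme has methods for lme and gls fits, but not for lme4 fits):

library(nlme)
library(ecostats)
data(aphids)
aphid_lme = lme(logcount ~ Treatment*(Time + I(Time^2)), random = ~1|Plot, data=aphids$oat)
plot(ACF(aphid_lme, resType="normalized"), alpha=0.05)  # dashed lines give rough critical values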
For temporally structured data, we would expect the autocorrelation to decrease
smoothly towards zero, approximating something like Fig. 7.1. In Code Box 7.3,
there was no such smooth decay to zero (the correlation quickly jumped to weak
levels), suggesting little need for a temporally structured term in addition to the
random intercept already in the model. This verifies our findings in Code Box 7.2,
where the preferred model appeared to be the random intercept model.
[Sample autocorrelation function of residuals plotted against lag (0–6), with values ranging between about −0.5 and 0.5.]
> print(aphid_int)
...
Random effects:
Groups Name Std.Dev.
Plot (Intercept) 0.202
Residual 0.635
...
The random effect for plot is relatively small compared to residual variation, i.e. compared
to within-plot variation not explained by the temporal trend.
> anova(aphid_int)
Analysis of Variance Table
Df Sum Sq Mean Sq F value
Treatment 1 2.0410 2.0410 5.0619
Time 1 24.7533 24.7533 61.3917
I(Time^2) 1 5.7141 5.7141 14.1717
Treatment:Time 1 1.6408 1.6408 4.0693
Treatment:I(Time^2) 1 0.0413 0.0413 0.1024
Any treatment effect is relatively small compared to the broader temporal variation in aphid
abundance; judging from mean squares, there is almost 10 times more variation across time
compared to across treatments.
> aphid_noTreat = lmer(logcount~Time+I(Time^2)+(1|Plot),
data=aphids$oat, REML=FALSE)
> anova(aphid_noTreat,aphid_int)
Df AIC BIC logLik deviance Chisq Df Pr(>Chisq)
aphid_noTreat 5 130.26 140.39 -60.131 120.26
aphid_int 8 128.34 144.54 -56.170 112.34 7.9224 3 0.04764 *
The treatment effect is marginally significant.
7.2 Spatially Structured Data

Ecological data are typically collected across different spatial locations, and it is often
the case that the data are spatially structured, with observations from closer locations
more likely to be similar in response. As previously, this autocorrelation could have
exogenous (predictors are spatial) or endogenous (response is spatial) sources, or
both. An example is found in Exercise 7.4, where species richness appears to have
spatial clusters of low and high richness.
The Myrtaceae is a plant family that includes the eucalypts, so Ian wanted to know the answer to the following question: How
does Myrtaceae species richness vary from one area to the next, and what
are the main environmental correlates of richness? Ian obtained data on the
number of Myrtaceae species observed in 1000 plots and obtained estimates
of soil type and climate at these sites.
Plotting richness against spatial location, he found spatial clusters of high
or low species richness (Fig. 7.3).
What sort of analysis method should Ian consider using?
Fig. 7.3: Observed Myrtaceae species richness in Blue Mountains World Heritage
Site (with a 100-km buffer) near Sydney, Australia. Note there are patches of high
and low species richness, suggestive of the need for a spatial model
Strategies similar to those used for spatial analysis can be used for longitudinal
data analysis, with one important difference. For longitudinal data, there are clusters
of (independent) subjects on which repeated measures are taken. In Ingo’s case
(Exercise 7.1), repeated measures were taken in each of eight plots, so while there
was temporal dependence among repeated measures within plots, responses were
still assumed independent across plots. There is no such clustering in the spatial
context—we typically assume all observations are spatially correlated with each
other (which is like what happens in a time series model). In effect, we have one big,
fat cluster of correlated observations.
Four main strategies are used for spatial modelling: ignore it, use spatial
smoothers, use autoregressive models, or use spatially structured random effects.
Ignore it: The biggest risk from ignoring a correlation is that uncertainty in
predictions will be underestimated. You can also make estimates that are a lot poorer
if there is a spatial structure and it is not used to improve predictions. This is OK as
a strategy, though, if you cannot see evidence of spatial structure in residuals. One
scenario where this might happen is if samples are sufficiently far apart that spatial
autocorrelation is negligible. But you should use your data to check this!
Spatial smoothers: We could include a bivariate smoother (Chap. 8) for latitude
and longitude in the model in order to approximately account for spatial structure.
This does not always work, though, e.g. in the next chapter we will see an example of
where a temporal structure is still evident in a residual plot, even though a smoother
for time is in the model (Code Box 8.8). The same thing could happen in a spatial
context. The main advantage of using smoothers is that if it works, it is the simplest
method of handling spatial structure.
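As a minimal sketch of this idea (assuming the Myrtaceae data and the logrich variable as set up in Code Box 7.5; this is not a model fitted in the text), a bivariate smoother of the spatial coordinates could be added alongside the environmental predictors:
library(mgcv)
# thin plate spline of the coordinates, to absorb broad-scale spatial structure
ft_richSmooth = gam(logrich ~ soil + poly(TMP_MAX,degree=2) + poly(TMP_MIN,degree=2) +
    poly(RAIN_ANN,degree=2) + s(X,Y), data=Myrtaceae)
Residuals from such a fit could then be checked for any remaining spatial pattern, e.g. using the correlogram discussed later in this section.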
Autoregressive models: This is a commonly used class of models for spatial data,
where we predict the response at a location as a function of the response at nearby
locations (as well as predictor variables). This method won’t be discussed in detail
here because it is a bit of a departure from the techniques we have used elsewhere;
for a quick description see Dormann et al. (2007).
Spatially structured correlation: This is closely analogous to the use of temporally
structured correlation for longitudinal data in Sect. 7.1; it simply involves adding a
random effect (or changing the error term) so that it is spatially structured. This can
be implemented in R using the lme or gls function in the nlme package. Generalised
least squares can be understood as doing a linear model on contrasts that have been
calculated in such a way that they are independent of each other (as in Maths Box 7.2).
There are a few options for correlation structures that could be assumed, and
standard model selection tools could be used to choose between them. In Code
Box 7.5, an exponential autocorrelation is assumed, meaning that the autocorrelation is assumed to decay exponentially towards zero as in Fig. 7.4. Sometimes a response does not vary smoothly in space but seems to have some random ("white") noise on top of the spatial signal. In this situation, we do not want to assume that observations at a distance of zero from each other have a correlation of one, and this assumption can be relaxed by including what is known as a "nugget" in the model, a white noise component. This was done in Code Box 7.5 using corExp(form=~X+Y, nugget=TRUE).
Fig. 7.4: An example of how spatial autocorrelation might decay as distance between samples increases. The correlation between samples at a distance $d$ from each other is assumed here to follow an exponential model, $(1 - n)e^{-d/r}$, where $n$ is the nugget effect and $r$ the range parameter. The curve plotted here uses $n = 0.65$ and $r = 5.7$, as estimated for Ian's richness data in Code Box 7.6. The correlation becomes negligible at a distance of about 15 km
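As a quick check of this curve, it can be reproduced directly from the nugget and range estimates reported later in Code Box 7.6 (a minimal sketch, not code from the text):
# assumed exponential autocorrelation function (1-n)*exp(-d/r), with the
# nugget and range taken from the Code Box 7.6 output
nugget = 0.649786
range_km = 5.691041
d = seq(0, 30, length.out=200)
plot(d, (1-nugget)*exp(-d/range_km), type="l",
    xlab="Distance (km)", ylab="Autocorrelation", ylim=c(0,0.35))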
Code Box 7.5: Model Selection to Choose Predictors, and a Spatial Model,
for Ian’s Richness Data
Quantitative climatic predictors should be added as quadratic rather than linear terms, to
enable an “optimum” climate rather than a function that is always increasing or decreasing
as temperature or rainfall increases. But should this include interactions between climate
predictors or not?
> data(Myrtaceae)
> Myrtaceae$logrich=log(Myrtaceae$richness+1)
> ft_rich = lm(logrich~soil+poly(TMP_MAX,TMP_MIN,RAIN_ANN,degree=2),
data=Myrtaceae)
> ft_richAdd = lm(logrich~soil+poly(TMP_MAX,degree=2)+
poly(TMP_MIN,degree=2)+poly(RAIN_ANN,degree=2), data=Myrtaceae)
> BIC(ft_rich,ft_richAdd)
df BIC
ft_rich 19 1014.686
ft_richAdd 16 1002.806
This suggests that we don’t need interactions between environmental predictors. But fur-
ther study suggested an apparent weak spatial signal in the residuals, as indicated later in
Code Box 7.7. So consider a spatial model with exponential spatial correlation (potentially
including a nugget also):
> library(nlme)
> richForm = logrich~soil+poly(TMP_MAX,degree=2)+poly(TMP_MIN,degree=2)+
    poly(RAIN_ANN,degree=2)
> ft_richExp = gls(richForm,data=Myrtaceae,correlation=corExp(form=~X+Y))
> ft_richNugg = gls(richForm,data=Myrtaceae,
    correlation=corExp(form=~X+Y,nugget=TRUE))
> BIC(ft_richExp,ft_richNugg)
df BIC
ft_richExp 17 1036.2154
ft_richNugg 18 979.5212
These models take several minutes to run!
The model with a nugget in it has a much smaller BIC, suggesting that species richness
does not vary smoothly over space. Note the BIC of this model is also smaller than that of the non-spatial model, suggesting that it is worth including spatial terms.
Spatial models of this form can take a long time to fit! Code Box 7.5 contains 1000
observations, meaning that there are a lot (half a million) of pairwise correlations to
compute, and the model takes several minutes to run. The computational complexity
of these models increases rapidly with sample size, such that it becomes infeasible
to fit this sort of model when there are hundreds of thousands of observations. There
are, however, some nice tricks to handle larger models—some autoregressive models
have fewer parameters to estimate so can be fitted for larger datasets, and there are
also tricks like fixed rank kriging (Cressie & Johannesson, 2008) or using predictive
processes (Banerjee et al., 2008; Finley et al., 2009) that make some simplifying
approximations so the model can be fitted using a smaller number of random terms.
The impact of including spatial terms in a model tends to be to increase the size
of standard errors and reduce the significance of effects in the model. This is seen in
Code Box 7.6, where one of the terms in the model changed from having a P-value
of 0.0003 to being marginally significant!
Code Box 7.6: Inferences from Spatial and Non-Spatial Models for Ian’s
Richness Data
> ft_richNugg
...
Correlation Structure: Exponential spatial correlation
Formula: ~X + Y
Parameter estimate(s):
range nugget
5.691041 0.649786
Degrees of freedom: 1000 total; 985 residual
Residual standard error: 0.3829306
The model includes a nugget effect of 0.65 (meaning that the correlation increases to a
maximum value of 1 − 0.65 = 0.35 as distance approaches zero) and has a range parameter of 5.7 (meaning that each time distance increases by 5.7 km, the correlation between points decreases by a factor of $e^{-1} \approx 0.37$, i.e. to about 37% of its previous value). This is plotted in Fig. 7.4.
> anova(ft_richAdd)
Df Sum Sq Mean Sq F value Pr(>F)
soil 8 15.075 1.88438 12.9887 < 2.2e-16 ***
poly(TMP_MAX, degree = 2) 2 5.551 2.77533 19.1299 7.068e-09 ***
poly(TMP_MIN, degree = 2) 2 5.162 2.58082 17.7892 2.573e-08 ***
poly(RAIN_ANN, degree = 2) 2 2.382 1.19081 8.2081 0.0002915 ***
> anova(ft_richNugg)
numDF F-value p-value
(Intercept) 1 5687.895 <.0001
soil 8 5.926 <.0001
poly(TMP_MAX, degree = 2) 2 10.528 <.0001
poly(TMP_MIN, degree = 2) 2 7.672 0.0005
poly(RAIN_ANN, degree = 2) 2 2.719 0.0664
Notice that the F statistics are a factor of two or three smaller in the spatial model. It is worth
re-running model selection algorithms to see if all these predictors are still worth including,
e.g. the effect of precipitation is now only marginally significant, so inclusion of this term
may no longer lower the BIC.
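For example, a hedged sketch of such a re-check for the rainfall term (the object names below are hypothetical, and the refit is again slow):
# refit the spatial (nugget) model without the rainfall term and compare BICs
richFormNoRain = logrich ~ soil + poly(TMP_MAX,degree=2) + poly(TMP_MIN,degree=2)
ft_richNuggNoRain = gls(richFormNoRain, data=Myrtaceae,
    correlation=corExp(form=~X+Y, nugget=TRUE))
BIC(ft_richNugg, ft_richNuggNoRain)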
The spaghetti plot, which we considered as a diagnostic tool for longitudinal data,
has no equivalent in the spatial world, for a couple of reasons. Most fundamentally,
we do not have independent clusters of observations we can plot as separate lines.
The idea behind a temporal autocorrelation function does, however, translate to the
spatial context.
The spatial correlogram is a tool that works much like the temporal autocorre-
lation function—it estimates the correlation between pairs of observations different
distances apart and plots this autocorrelation as a function of distance between
observations. The spatial autocorrelation might be close to one for neighbouring
observations (but then again it might not be), and correlation typically decays to-
wards zero as distance increases. There are a few techniques for computing spatial
autocorrelation; one of the more common is known as Moran’s I statistic. For a
correlogram, pairs of observations are “binned” into groups depending on how far
apart they are, then Moran’s I is computed within each distance class.
There are many software options for computing correlograms, the correlog function in the pgirmess R package (used in Code Box 7.7) being one example; these typically allow the user to control how pairs of observations are "binned" into distance classes for computation of the statistic.
When fitting a regression model, note the correlogram should be constructed using
the residuals after modelling the response of interest as a function of its predictors,
as in (b) from Code Box 7.7. The response is strongly spatially correlated, as in
(a) from Code Box 7.7, but a lot of this spatial structure is due to the relationship
between richness and climatic predictors, which are themselves strongly spatially
structured and capable of explaining much of the spatial signal. Hence the residuals
seem to have a much weaker spatial autocorrelation, which seems to operate over
shorter distances. In some situations, the predictors may seem to explain all of the
spatial autocorrelation, such that the autocorrelation curve of residuals would be flat
and near zero, in which case there would be no need to fit a spatial model.
A related tool is the variogram, which is like a correlogram but flipped around to
compute variances of differences between pairs of points. This means that as distance
increases, the variance increases. Variograms will not be considered further here in
order to keep the development analogous to what is usually done for longitudinal
data, where the correlogram is more common.
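That said, if a variogram is wanted, the nlme package can compute an empirical variogram of residuals directly from a gls fit; a minimal sketch (using the ft_richNugg fit from Code Box 7.5, though this is not done in the text):
# semi-variogram of normalised residuals against distance between sites
plot(Variogram(ft_richNugg, form=~X+Y, resType="normalized"))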
Code Box 7.7: Spatial Correlogram for Ian’s Species Richness Data
Correlograms will be computed for the richness data (log(y + 1)-transformed) and for
residuals from a linear model as a function of soil type and additive, quadratic climatic
predictors:
library(pgirmess)
corRich = with(Myrtaceae,correlog(cbind(X,Y),logrich))
plot(corRich,xlim=c(0,150),ylim=c(-0.05,0.2))
abline(h=0,col="grey90")
Myrtaceae$resid = residuals(ft_richAdd)
corRichResid = with(Myrtaceae,correlog(cbind(X,Y),resid))
plot(corRichResid,xlim=c(0,150),ylim=c(-0.05,0.2))
abline(h=0,col="grey90")
[Plots: (a) correlogram of richness and (b) correlogram of residuals for richness, each showing Correlation (Moran's I), roughly −0.05 to 0.2, against Distance (km), 0 to 120]
There is still spatial autocorrelation after including climatic predictors, but less so.
Another way for data to have a structured correlation is if the subjects are species
(or some other taxonomic group), with some more closely related than others due
to shared ancestry. The evolutionary relationships between any given set of species
can be mapped out on a phylogenetic tree (Bear et al., 2017), which reconstructs as
best we can the evolutionary path from a common ancestor to all of the species used
in a study (usually based on molecular analysis of DNA sequences). Some pairs of
species will be closer together on the tree than others, which is usually measured
in terms of the number and length of shared branches until you get to a common
ancestor. If a response is measured that is expected to take similar values for closely
related species, there is phylogenetically structured correlation in the data, with
autocorrelation that is a function of phylogenetic distance.
As an example, consider Exercise 7.5. Data were collected across 71 species of
bird, some of which were more closely related than others. More closely related
species tended to be closer in body or egg size (Code Box 7.8). Hence, we expect
phylogenetically structured correlation in the data, and it is unlikely we would be
able to satisfy the independence assumption that we usually need in linear modelling.
[Output of Code Box 7.8: a phylogenetic tree of the 71 shorebird species (from Haematopus finschi through to Catoptrophorus semipalmatus), plotted alongside each species' Egg.Mass, F.Mass and M.Mass values on a common axis from about −1 to 4]
The phylogenetic tree is on the left of this plot, and species connected by a common ancestor
closer to the tips are more closely related. Note that closely related species seem to have
similar sizes (e.g. Haematopus species, oystercatchers, are all large). Note also that branches
have different lengths—because different pairs of species are thought to have diverged from
a common ancestor at different times during their evolutionary history.
A model with phylogenetically structured dependence in the response can be fitted by a GLS approach using the caper package, as in Code Box 7.10. Egg size data were log-transformed, given that this seemed to symmetrise the data (Code Box 7.9).
This type of analysis is sometimes referred to in ecology as a comparative analysis,
and more broadly, the subdiscipline of ecology dealing with the study of patterns
across species is often known as comparative ecology (presumably because this
involves comparing across lineages).
There are many other approaches to comparative analysis, many of which have
connections to the approach described earlier. An early and influential method of
phylogenetic analysis (Felsenstein, 1985) is to assume that branching events along the
tree are independent, so we can construct phylogenetically independent contrasts be-
tween species and between species means. This won’t be considered further because
it can be understood as equivalent to a special case of the generalised least-squares
approach considered here (Blomberg et al., 2012), along the lines of Maths Box 7.2.
[Code Box 7.9 output: a scatterplot matrix of the log-transformed Egg.Mass, F.Mass and M.Mass variables; the displayed correlation between F.Mass and M.Mass is 0.912]
Note all correlations are quite high; in particular, there is multi-collinearity between male and female bird size. This will make it harder to detect an effect of one of these variables
when the other is already in the model. Note also that the correlation of egg size with male
body size is slightly larger than with female body size, which is in keeping with Nerje’s
hypothesis (Exercise 7.5).
> library(caper)
> shorebird = comparative.data(shorebird.tree, shorebird.data,
Species, vcv=TRUE)
> pgls_egg = pgls(log(Egg.Mass) ~ log(F.Mass)+log(M.Mass),
data=shorebird)
> summary(pgls_egg)
Branch length transformations:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.37902 0.23172 -1.6357 0.106520
log(F.Mass) -0.22255 0.22081 -1.0079 0.317077
log(M.Mass) 0.89708 0.22246 4.0325 0.000142 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The phylogenetic regression model used here is a type of linear model that (under a
simplifying assumption that the tree is ultrametric, with constant total branch lengths
from the root of the tree to each species) can be written as follows:
$$y_i \sim N(\mu_i, \sigma^2), \qquad \mu_i = \beta_0 + \mathbf{x}_i^T \boldsymbol{\beta}$$
$$\mathrm{cov}(y_i, y_j) = \sigma^2 \lambda \left( \sum_{k=1}^{K} l_{ij(k)}^{\kappa} \right)^{\delta}$$
where $l_{ij(k)}$ is the length of the $k$th of the $K$ branches shared by species $i$ and $j$, and $\lambda$, $\kappa$ and $\delta$ are parameters controlling the nature of the phylogenetic signal. Values of $\lambda$ near zero suggest no phylogenetic signal, values near one suggest most (co)variation
across species is explained by phylogeny. The other parameters determine the extent
to which correlation across species is determined by recent divergences vs ancestral
ones. By default, the pgls function of Code Box 7.10 assumes all parameters are
one, meaning that covariances are proportional to the total shared branch length
(assuming traits evolve entirely via so-called Brownian motion).
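As an illustration (not code from the text), the model of Code Box 7.10 could be refitted with λ estimated by maximum likelihood rather than fixed at one; the object name below is hypothetical:
# estimate lambda by ML; kappa and delta stay fixed at one by default
pgls_eggML = pgls(log(Egg.Mass) ~ log(F.Mass) + log(M.Mass),
    data=shorebird, lambda="ML")
summary(pgls_eggML)   # reports the estimated lambda (strength of phylogenetic signal)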
So in summary, we have the following assumptions:
1. The observed y-values are phylogenetically dependent in such a way that, after
accounting for predictors, the correlation between values is a fixed function of
shared branch lengths (as earlier).
2. The y-values are normally distributed with constant variance, $y_i \sim N(\mu_i, \sigma^2)$.
for a pair of species with phylogenetic distance d. This function is plotted in Fig. 7.5
for a few different values of δ. If δ = 1, we get a straight line (Brownian motion).
The correlogram in Code Box 7.11 does not seem far from a straight line, so the
Brownian motion assumption (δ = 1) seems reasonable here.
par(mfrow=c(2,2))
plot(pgls_egg)
The first two of these plots check normality (which, as previously, is not critical!) and the
first in the second row is our residual vs fits plot, suggesting no pattern.
For a phylogenetic correlogram of residuals
res.df = data.frame(Species = shorebird.data$Species,
    res = residuals(pgls_egg))
res4d = phylobase::phylo4d(shorebird.tree, res.df)
library(phylosignal)   # phyloCorrelogram comes from the phylosignal package
res.pg = phyloCorrelogram(res4d, trait="res")
plot(res.pg)
Another way to get at the question of whether the phylogenetic dependence model
is adequate is to refit the model using different assumptions:
• The pgls model can easily be refitted estimating some of its phylogenetic pa-
rameters from the data and returning approximate confidence intervals for these
parameters. Of particular interest is whether confidence intervals for λ contain
zero (no phylogenetic signal) or one (all error is phylogenetic). Note, however, that
[Figure: assumed correlation curves for δ = 0.5, 1, 1.5 and 2, with Autocorrelation (0 to 1) on the y-axis and Phylogenetic distance (0 to L) on the x-axis]
Fig. 7.5: Assumed phylogenetic correlograms used in pgls for different values of δ,
assuming κ = 1. In this situation there is a direct relationship between phylogenetic
distance (total shared branch length) and phylogenetic autocorrelation. The value of
λ determines the y-intercept, and we used λ = 0.75 here
Relatively recently there has been greater appreciation of the issue of confounding
when using structured random effects in models—as Hodges and Reich (2010) put
it, a structured random effect has the potential to “mess with the fixed effect you
love”. Mixed models work like multiple regression, in the sense that any terms that
enter into the model have their effects estimated conditionally on everything else that
is in the model. So we can in effect get multi-collinearity if some of our predictors
share a similarly structured correlation to that assumed in the model. This will often
happen—in our spatial example (Exercise 7.4), we used temperature as a predictor,
which will be highly spatial, and in our phylogenetic example (Exercise 7.5), we used
body mass as a predictor, which had a strong phylogenetic signal. In these situations,
inclusion of both the predictor and the structured dependence term with which it is
correlated will weaken each other’s estimated effects, much as collinearity weakens
signals in multiple regression. This is a type of confounding—there are two possible
explanations for a pattern in the data, and it is hard to tell which is the cause (without
experimentation). In Exercise 7.4, is richness varying spatially largely because of its
association with temperature, or is the association with temperature largely because
of spatial autocorrelation? In Exercise 7.5, is the strong phylogenetic signal in egg
size largely because of an association with body mass, or is the association with
body mass largely because of the strong phylogenetic signal?
equivalent to that used in Maths Box 7.2, since sums of normal variables are normal.) We can use contrasts $\Sigma^{1/2}$ to rewrite this in the form of a linear mixed model (Chap. 6):
$$y_i = \beta_0 + \mathbf{x}_i^T \boldsymbol{\beta} + \mathbf{z}_i^T \mathbf{b} + \epsilon_i$$
where $\mathbf{b} = \Sigma^{-1/2} \boldsymbol{\delta}$ are independent random effects, and $\mathbf{z}_i$ is the $i$th row of the matrix of contrasts $\Sigma^{1/2}$ that captures structured dependence in the data. Now that the random effects are independent, we can treat this model as a standard linear mixed model and apply ideas from earlier chapters.
If the predictors x have a similar correlation structure to that assumed for
the response, then x will be correlated with the contrast matrix z that was
designed to capture the structured dependence.
But recall that in a linear model, coefficients estimate the effects of predic-
tors conditional on all other terms in the model. So, for example, in a spatial
model, the coefficient of x will tell us only about the effects of x on responses
that are “non-spatial”, not able to be explained by z. This is a different quantity
to the marginal effect of x, if we had not conditioned on z. The mathematics
of missing predictors (Maths Box 4.1, Eq. 4.1) can tell us the extent to which
fixed effects coefficients change when you include structured correlation in the
model.
The models considered so far in this chapter will behave like multiple regression
and effectively condition on the assumed correlation structure when estimating fixed
effects. That is, a model for species richness will estimate the effect of temperature
after removing the variation in temperature that is spatially structured. A model
for egg size will estimate the effect of body mass after removing the variation in
body mass that is phylogenetically structured. Often dealing with confounding in
this way is appropriate if the question of interest requires studying the effect of a
predictor conditional on other terms in the model (just as multiple regression is often
appropriate, in attempts to remove the effects of confounding variables).
Sometimes this sort of effect is not desirable. We might wish to describe how
richness is associated with climatic variables, estimating the full (marginal) climate
effect rather than attributing part of it to a spatial error term. But in a spatial model,
the climate term represents only the portion of the climate effect that is uncorrelated
with the spatial random effect, so it would be smaller (sometimes surprisingly so)
and it would be harder to detect associations with climate. Hodges and Reich (2010)
proposed an approach that can adjust for spatial dependence but not attribute variation
in response to the structured random effect, unless it is uncorrelated with predictors
in the model.
A well-known debate in the comparative ecology literature (Westoby et al., 1995;
Harvey et al., 1995) can be understood as arising around this same issue but in the
context of phylogenetically structured correlations. By the early 1990s, comparative
methods had been developed for the analysis of data across multiple species (or
other taxa) to account for differing levels of correlation across species due to shared
phylogeny. Westoby et al. (1995) argued that these methods should not necessarily
be used by default in comparative studies, essentially because these methods look
for effects on a species trait conditional on phylogeny, e.g. estimating a cross-species
association between seed size and seed number beyond that which can be explained
by phylogeny. Their argument was based around the idea that ecological determinants
of a trait would have been present over the evolution of the species, rather than just in
the present day, so some of the trait effects that have been attributed to phylogeny are
confounded with ecologically relevant effects. This is visualised nicely in Figure 1 of
Westoby et al. (1995). How this confounded information should be treated is subject
to interpretation, and Westoby et al. (1995) argue it is related to the type of question
one is trying to answer—very much as Hodges and Reich (2010) argued in the context
of spatial models. In response, Harvey et al. (1995) emphasised the importance of
satisfying the independence assumption in analyses (or accounting for dependence
when it is there), but they did not propose a way forward to estimate or address
this “phylogenetic confounding”. Perhaps the methods of Hodges and Reich (2010)
could be adapted to the phylogenetic context for this purpose. In the meantime, many
comparative ecologists analyse their data with and without phylogenetic terms in the
model, to understand the extent to which cross-species patterns they observe are
confounded with phylogeny.
This chapter provided a very brief introduction to dependent data, which is a very
big area. We have focused on the use of GLS and random effects to account for
structured correlation and the use of correlograms to diagnose the type of correlation
we are dealing with. However, other techniques and other data types require different
approaches. For more details, there are entire separate texts on each of spatial,
longitudinal, and comparative analyses. A well-known text on repeated measures in
time is Diggle et al. (2002), and for spatial data a comprehensive but quite technical
text is that by Cressie (2015). For a relatively quick tour of spatial tools for ecology
see Dormann et al. (2007), who fitted a range of different methods to simulated data,
with a quick description of each. For the phylogenetic case, an accessible introduction
is Symonds and Blomberg (2014). For time series data, a good introductory-level
text is Hyndman and Athanasopoulos (2014).
Chapter 8
Wiggly Models
Recall from Sect. 4.4.2 that a “linear model” does not need to be linear. Mathemati-
cally, we say a model is linear if it is a linear function of its coefficients (“something
times β1 plus something times β2 . . .”). But if we include non-linear functions of x
as predictors, we can use this framework to fit a non-linear function of x to data. The
simplest example of this is including quadratic terms in the model, as in Fig. 8.1, or
cubic terms. This approach has a theoretical basis as a Taylor series approximation
(Maths Box 8.1).
Maths Box 8.1: Approximating a smooth function with a Taylor series
A sufficiently smooth function $f(x)$ can be written as a Taylor series expansion, $f(x) = \sum_{i=0}^{\infty} b_i (x - a)^i$, for any $a$ and suitably chosen values of $b_i$. If we drop some of the higher-order terms (e.g. just taking the first three terms from this expansion), we get a local approximation to the function $f(x)$ near $a$. The approximation works better when we use more terms (hence ignore fewer higher-order terms). It is a
local approximation around a because it works better for values of x that are
close to a since x − a is small, so the higher-order terms we ignore are also
small.
Taylor approximations are linear in the coefficients, so they can be fitted
using linear model techniques. When we use just x as a predictor in a linear
model, we get a linear approximation, which we can think of as using just the
first two terms from a Taylor series expansion to approximate f (x) (estimating
Fig. 8.1: Linear models are not necessarily linear functions of explanatory variables
(but they are linear in terms of the parameters). All of the above response surfaces
can be fitted using linear modelling software: from left to right we have a linear model, a model with all quadratic terms, and a saddle-shaped surface
A good way to relax the linearity assumption is often to break down the explanatory
variable into pieces (at knots) and fit a multiple regression against all of these pieces.
A piecewise approach usually works better than using polynomials (e.g. x, x 2 ),
because we are essentially doing a local approximation (as in Maths Box 8.1) for
each piece, instead of trying to apply a single approximation across a broad range of
values for x.
Key Point
Do you have a regression with a potentially non-linear response and data at lots
of different values of your predictor (x)? In that case, you can get fancy and
try fitting a smoother to your data, e.g. using software for generalised additive
models (GAMs).
A simple example of this is piecewise linear model fits (as used in the well-known
MAXENT software, Phillips et al., 2006). A specific example of a piecewise linear
fit is in Fig. 8.2. Piecewise linear fits are a bit old school, at least for functions of one
variable. They don’t look smooth, and in most problems a “kinky” function is not
realistic, e.g. why should the red line in Fig. 8.2 suddenly change slope in the year
2000?
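For concreteness, a piecewise linear fit like the red line in Fig. 8.2 can be produced with an ordinary linear model; a minimal sketch, assuming the annual January data (maunaJan) constructed later in Code Box 8.1:
# one knot at the year 2000: the pmax() term lets the slope change after 2000
ft_piecewise = lm(co2 ~ year + pmax(year-2000, 0), data=maunaJan)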
Fig. 8.2: Non-linear models fitted to the annual Mauna Loa data of Exercise 8.1.
A piecewise linear model is fitted in red with a change in slope at the year 2000.
Note the fit is a bit weird, with a sudden change in slope in the year 2000. A spline
smoother is fitted in blue, which smooths out the kink
The generalised additive model approach was developed by Hastie and Tibshirani (1990) and quickly became a mainstay in applied statistics. It
was popularised in ecology, especially for modelling species’ distributions, by papers
like Yee and Mitchell (1991) and, in particular, Guisan and Zimmerman (2000). An
excellent modern text on the subject is Wood (2017).
[Plots of the piecewise linear basis functions (with knots at 0.2, 0.4, 0.6, 0.8) and of the resulting piecewise linear model fit, for x between 0 and 1]
The prediction function from this model has "kinks" in it at the knots. This is illustrated above (right) for a model with knots at 0.2, 0.4, 0.6, 0.8. The function telling us the slope of this model (the gradient function) is $\beta + \sum_{t_k \le x} \gamma_k$, which is not continuous. The gradient starts with a slope of $\beta$, then at $x = t_1$ the slope suddenly jumps to $\beta + \gamma_1$, then at $x = t_2$ it jumps to $\beta + \gamma_1 + \gamma_2$, and so forth.
We can get a smooth, kink-free fit using piecewise polynomials of order
two or higher; piecewise cubics are the most commonly used. The model is as
follows:
$$\mu_y = \beta_0 + x\beta_1 + x^2\beta_2 + x^3\beta_3 + \sum_{k=1}^{K} b_k(x)^3 \gamma_k, \quad \text{where } b_k(x) = \max(x - t_k, 0)$$
Below are the basis functions $b_k(x)^3$ (with knots at 0.2, 0.4, 0.6, 0.8), and a model fit:
[Plots of the basis functions max((x−0.2)^3, 0), max((x−0.4)^3, 0), max((x−0.6)^3, 0) and max((x−0.8)^3, 0), and of the resulting smooth piecewise cubic fit]
The fit from this model is smooth because its gradient function is continuous
(it has no sudden jumps at any value of x). Specifically, the gradient is
$$\beta_1 + 2x\beta_2 + 3x^2\beta_3 + \sum_{t_k \le x} 3 b_k(x)^2 \gamma_k$$
Why “splines”? The term spline comes from woodworking, especially shipbuild-
ing, and the problem of bending straight sections of wood to form curves, e.g. a ship
hull. This is done by fixing the wood to control points or “knots” as in Fig. 8.3. The
statistician who came up with the term “spline smoothing” clearly spent a lot of time
in his/her garage. . . .
The most common way to fit a model with a spline smoother in R is to use the
mgcv package (using the methods of Wood, 2011). This package is well written and
is quite fast considering what it does. A model with a spline smoother is often called
a generalised additive model (GAM), so the relevant function for fitting a smoother
is called gam. You can add a spline smoother in the formula argument as s(year)
(for a spline for year) as in Code Box 8.1. The s stands for spline.
Fig. 8.3: Splines in woodwork—a piece of wood is bent around “knots” to create
a smooth curve. This is the inspiration for the statistical use of the terms “spline
smoother” and “knots” (drawing by Pearson Scott Foresman)
Code Box 8.1: Fitting a spline smoother to the Mauna Loa annual data of
Exercise 8.1 in R.
> library(mgcv)
> data(maunaloa)
> maunaJan = maunaloa[maunaloa$month==1,]
> ft_maunagam=gam(co2~s(year), data=maunaJan)
> summary(ft_maunagam)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 356.68048 0.05378 6632 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
When fitting a GAM, a decision has to be made as to how many knots to include in
the model. This controls how wiggly the curve is—the more knots, the more wiggles.
Choosing the number of knots is a form of bias–variance trade-off, like choosing
the number of predictors in regression, or choosing λ in a LASSO regression. In
fact, mathematically this is exactly like changing the number of predictors in a linear
model. Too few knots might not fit the data well and might introduce bias; too many
knots might overfit the data and increase the variance of predictions to new values
(because the fit chases the data too much instead of following the broader trends).
In the mgcv package, you can specify an upper limit to the number of knots as
k, an argument to the spline, e.g. using s(lat,k=5) in the formula argument. By
default k is usually somewhere near 10 (but it depends). If you think your fit might
need to be extra wiggly, it is a good idea to try changing k to a larger value (20?
50?) to see if it changes much, as in Code Box 8.3. The gam function by default will
actually estimate the number of knots to use (treating the input k as the upper limit),
and it will often end up using many fewer knots than this. The effective degrees of
freedom (edf in Code Box 8.1) tell you roughly how many knots were used. This is
not always a whole number, because it tries to adjust for the fact that we are using
penalised likelihood (hence shrinking parameters a bit), and it usually reports a value
slightly less than the actual number of knots used. You can also get a P-value out
of some gam software, which is only approximate, because of the use of penalised
likelihood methods in estimation.
$y_i \sim N(\mu_i, \sigma^2)$
The data in Fig. 8.2 are a time series, collected sequentially over time in the same place in the
same way. There are concerns about the independence assumption here, because
CO2 measurements in a given year might be similar to those in the previous year
for reasons not explained by the smoother. We can use the techniques of Chap. 7 to
check for autocorrelation and, if needed, to account for it in modelling.
As previously, normality is rarely important, but we should check that residuals
are not strongly skewed and that there aren’t outliers that substantially affect the fitted
model. The constant variance assumption remains as important as it was previously
in terms of making valid inferences about predictions and about the nature of the
association between y and x.
Residual plots, as usual, are a good idea for assumption checking; unfortu-
nately, these aren’t produced by the plot function, but they are produced by the
plotenvelope function in the ecostats package, as in Code Box 8.2.
Code Box 8.2: Residual plot from a GAM of the annual Mauna Loa data
of Exercise 8.1
The plot function doesn’t give us diagnostic tools for a gam object; it plots the smoothers
instead. But plotenvelope from the ecostats package works fine (although it will take a
while to run on large datasets):
plotenvelope(ft_maunagam)
Models with spline smoothers are often called additive models or GAMs, where
additive means that if there are multiple predictors, they are assumed not to interact,
and so only main effects for each have been included. However, while additivity
is a default assumption in most GAMs, these models don’t actually need to be
additive—you can include interactions, too.
A bivariate spline for x1 and x2 on the same scale can be fitted using s(x1,x2); when the predictors are on quite different scales (like temperature and rainfall), a tensor product smoother, te(x1,x2), is more appropriate, and this is what is used in Code Box 8.4.
This technique is not always suitable because it is quite data hungry (i.e. you need
a big dataset) and can be computationally intensive—the number of required knots
may be much larger for a bivariate smoother. For example, the Mauna Loa dataset
with annual measurements is too small (only 60 observations, but also only one
predictor so no need!). However Ian (Exercise 8.2) has 1000 observations and could
certainly consider predicting richness using a bivariate smoother for temperature
and rainfall, as in Code Box 8.4. You can extend this idea when you have more than
two predictors, but it becomes much harder computationally and quickly becomes
infeasible when using spline smoothers.
A simpler alternative is to try handling interactions in the usual way, with a
quadratic interaction term. Keeping additive smoothers in the model preserves some
capacity to account for violations of the quadratic assumption, as also illustrated
in Code Box 8.4. This approach has the advantage of ensuring only one model
coefficient is devoted to each pairwise interaction between predictors, so it can
handle situations where the dataset isn’t huge or where you have more than two
predictors (but not too many).
Code Box 8.4: Handling interactions in a GAM for Ian’s richness data of
Exercise 8.2.
You could use a bivariate smoother:
data(Myrtaceae)
ft_tmprain=gam(log(richness+1)~te(TMP_MIN,RAIN_ANN),data=Myrtaceae)
vis.gam(ft_tmprain,theta=-135) #rotating the plot to find a nice view
summary(ft_tmprain)$edf
[1] 14.89171
The vis.gam command produced Fig. 8.4 and used about 15 effective degrees of freedom
(but actually required 24 knots).
We could try a quadratic interaction term combined with smoothers, for something less
data hungry:
> ft_tmprain2=gam(log(richness+1)~s(TMP_MIN)+s(RAIN_ANN)+
    TMP_MIN*RAIN_ANN,data=Myrtaceae)
> summary(ft_tmprain2)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.970e-02 3.268e-02 2.133 0.0332 *
TMP_MIN 2.045e-02 7.029e-02 0.291 0.7712
RAIN_ANN 1.333e-03 1.501e-04 8.883 <2e-16 ***
TMP_MIN:RAIN_ANN 1.415e-05 4.060e-05 0.348 0.7276
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Fig. 8.4: A plot of a bivariate smoother for species richness as a function of temperature and rainfall, produced as in Code Box 8.4
A smoother can be used to help diagnose lack of fit in residual plots by drawing a line
through the main trend in the data. In a residuals vs fits plot, we want no relationship
between the residuals and fitted values, i.e. the trend curve for residuals should
track zero pretty closely. Some residual plotting functions, such as the default plot
function as applied to lm in R, will include a smoother by default. This smoother is not
actually fitted using spline terms; it uses a different technique. The plotenvelope
function, on the other hand, uses the gam function from the mgcv package to fit a
smoother. Both of these are illustrated in Code Box 8.5.
The smoother in the residual plot of Fig. 8.5a was fitted using the scatter.smooth function, which by default uses not splines but a different method (local, or kernel, smoothing) to achieve a similar outcome. There are actually
heaps of different ways to fit smooth curves to data. Splines are the most common in
the regression setting because they allow us to stay in the linear modelling framework,
with all its benefits (especially diagnostic and inferential tools).
Fig. 8.5: Residual plots for global height data of Exercise 8.3, with smoothers,
produced in Code Box 8.5
Many things in nature have a cyclical pattern to them—most often in time (e.g.
season, time of day) but also space (e.g. aspect). These are often called circular
variables or “circular statistics” because they are more naturally understood by
mapping them onto a circle rather than a straight line. For example, consider the aspect
of slopes on which sheep were found (in degrees, 0 = 360 = due north) in Fig. 8.6.
Key Point
Do you have a cyclical variable, one that finishes the same place it starts from
(aspect, time of day, season, . . . )? These should enter the model in a cyclical
way, which can be done using sin and cos functions (which can map a variable
onto the unit circle). It is critical, however, to get the period right (so that your
cyclical variable gets back to the start at the right value).
Fig. 8.6: Aspect of slopes on which sheep were found (Warton and Aarts, 2013),
plotted as follows: (a) The raw aspect variable in degrees, with 0 and 360 degrees
at opposite ends of the scale but meaning the same thing (due north); (b) the same
data mapped onto a circle (and jittered), with each point representing the aspect of
the slope where a sheep was observed (with both 0 and 360 at the top). You can see
from (b), more so than from (a), that there is a higher density of points in southerly
aspects (especially SSW)
If you cos- and sin-transform a variable, it gets mapped onto a circle (cos does the
horizontal axis and sin does the vertical axis), as in Maths Box 8.3. It is critical to
get the period of the transformation right, though: you have to scale the variable so that a full cycle corresponds to an angle of 2π, a full rotation (in radians).
Maths Box 8.3: The cos and sin functions map a variable onto a circle
Consider a cyclical variable x with period t—that is, something that ends up
where it started at t. For example, time of year has period t = 12 months. We
want to map this variable onto a circle, so that the cyclical variable ends off
where it starts. We will use a circle of radius one.
We will start on the positive horizontal axis (at January on the following
figure) and work our way around in an anti-clockwise direction, getting back
to the start at the value x = t (the end of the year). The angle of a full rotation
of the circle is 2π radians, and for this to correspond to one cycle of x we
compute angles using
$$\theta = \frac{2\pi x}{t}$$
For example, the start of April is x = 3 months into the year, a quarter of the way through the year. Its angle is $\theta = \frac{2\pi \times 3}{12} = \frac{\pi}{2}$ radians, a quarter of the way around the circle.
[Diagram: the unit circle, with January at the right, April at the top, July at the left and October at the bottom; a point at angle θ has horizontal coordinate cos(θ) and vertical coordinate sin(θ)]
To work out how to map from a value x to a point on the circle, we first form a right-angled triangle with angle $\theta = \frac{2\pi x}{t}$ by dropping from a point on the circle to the horizontal axis. The hypotenuse is the radius of the circle, which has length one. Simple trigonometric rules then tell us that the horizontal coordinate is $\cos\theta$ ("adjacent over hypotenuse") and the vertical coordinate is $\sin\theta$ ("opposite over hypotenuse"). So we can map from a cyclical variable x to points on a circle by transforming to $\left(\cos\frac{2\pi x}{t}, \sin\frac{2\pi x}{t}\right)$. When plotting a cyclical variable x or using it in a model, it is more natural to work with $\left(\cos\frac{2\pi x}{t}, \sin\frac{2\pi x}{t}\right)$, rather than x, to reflect its cyclical nature.
Consider Exercise 8.5. We would like to find a statistical model that characterises the
main features in this dataset. Two things stand out in Fig. 8.7: the increasing trend
and the periodic wiggles (seasonal variation).
For the long-term trend we could try polynomial terms (poly) or a spline smoother (gam).
For the periodic wiggles, we should include cos and sin terms for the time variable in the model, which ends up being month. This adds a sine curve to the model.
Fig. 8.7: Monthly carbon dioxide measurements at Mauna Loa observatory. The data
points have been joined to reflect that this is a time series where each value is always
close to the value in the previous month
[Diagram: a sine curve $A\sin(2\pi x/t + \psi)$, with amplitude A, phase shift ψ and period t, plotted over two cycles (x from 0 to 2t)]
Code Box 8.6: A simple model for the Mauna Loa monthly data with a
cyclical predictor
A smoother (via a GAM) is used to handle non-linearity in the annual trend, since the rate of increase in the concentration of carbon dioxide is not constant over time; it seems to get steeper over time. A sine curve is added to handle the seasonality effect.
data(maunaloa)
library(mgcv)
ft_cyclic=gam(co2~s(DateNum)+sin(month/12*2*pi)+cos(month/12*2*pi),
data=maunaloa)
plot(maunaloa$co2~maunaloa$Date,type="l",
ylab=expression(CO[2]),xlab="Time")
points(predict(ft_cyclic)~maunaloa$Date,type="l",col="red",lwd=0.5)
How do we mind our Ps and Qs? To check we got each component right in
the model of the Mauna Loa monthly data, we need residual plots against each
component. As well as a residual plot against time (to diagnose the smoother), we
need a residual plot against season (to diagnose periodicity), which we construct
manually in Code Box 8.7. Notice in Code Box 8.7 that the lines have been joined
across time points, so we can see how residuals change over time and better diagnose
any problems with the model as a function of time.
Code Box 8.7: Residual plots across time and season for the Mauna Loa
monthly data
In what follows we construct residual plots joining the dots across time to see if there are
any patterns over time and, hence, any problems with the model as a function of time. Two
plots will be constructed—one across years, another within years (using sin of month) to
diagnose the smoother and the periodic component of the model, respectively. Using sin
of month means that December and January will appear in the same part of the Season
predictor (the middle, near zero). cos could equally well be used.
par(mfrow=c(1,2))
plot(residuals(ft_cyclic)~maunaloa$Date,type="l", xlab="Time")
plot(residuals(ft_cyclic)~sin(maunaloa$month/12*2*pi),
type="l",xlab="Season")
[Residual plots from ft_cyclic: residuals against Time (left, 1960 to 2020) and against Season (right, sin of month, from −1 to 1)]
For comparison, here are some plots for simulated data to show you what they might
look like if there were no violation of model assumptions:
As with any residual plot, we are looking for no pattern, but in Code Box 8.7, note
the hump shape on the residual plot against season. This is because a sine curve did
not adequately capture the seasonal trend. You can see this more clearly by zooming
in on the 2000s as in Fig. 8.8.
For quantitative predictors, if a linear trend doesn’t work, a simple thing you can
try is quadratic terms. For circular variables, if a sine curve doesn’t work, a simple
thing you can try is adding sin and cos terms with half the period by multiplying the
variable by two before cos- and sin-transforming, as in Code Box 8.8. This gives the
circular world’s equivalent of quadratic terms (Maths Box 8.5). Or if that doesn’t
work, try multiplying by three as well (like cubic terms)! This idea is related to
Fourier analysis (Maths Box 8.5), and the terms are sometimes called harmonics.
Fig. 8.8: Trace plot of some carbon dioxide measurements (black) against predicted
values (red) from the initial model of Code Box 8.6, zooming in on the 2000s
Code Box 8.8: Another model for the Mauna Loa monthly data, with an
extra sine curve in there to better handle irregularities in the seasonal
effect
ft_cyclic2=gam(co2~s(DateNum)+sin(month/12*2*pi)+cos(month/12*2*pi)+
sin(month/12*4*pi)+cos(month/12*4*pi),data=maunaloa)
par(mfrow=c(1,2))
plot(residuals(ft_cyclic2)~maunaloa$Date,type="l", xlab="Time")
plot(residuals(ft_cyclic2)~sin(maunaloa$month/12*2*pi),
type="l",xlab="Season")
[Residual plots from ft_cyclic2: residuals against Time (left) and Season (right)]
The mean trend seems to have been mostly dealt with in the residual plots of
Code Box 8.8, but there is some residual correlation between variables (since CO2
this month depends on CO2 last month). This is evident in the left-hand plot with
residuals tending to stay near previous values (e.g. a cluster of residuals around 0.5
in 1990, then a drop for several negative residuals around 1993). Residual correlation
is also evident in the right-hand plot, with residuals often running across the page
(left to right), remaining large if the previous value was large or remaining small if
the previous value was small. This correlation (or “time lag”) becomes a problem
for inference—standard errors and P-values will be too small, differences in BIC too
big. . . , and one thing we could do here is try adding temporal autocorrelation via
the gamm function, as in Code Box 8.9. This function combines ideas for modelling
dependent data from Chap. 7 with smoothers as used earlier.
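Code Box 8.9 itself is not reproduced here, but a minimal sketch of what such a fit might look like follows, assuming an AR(1) error structure is added to the model of Code Box 8.8 (object names are hypothetical):
library(mgcv)   # gamm; loading mgcv also attaches nlme, which provides corAR1
ft_cyclicAR1 = gamm(co2 ~ s(DateNum) + sin(month/12*2*pi) + cos(month/12*2*pi) +
    sin(month/12*4*pi) + cos(month/12*4*pi),
    correlation=corAR1(), data=maunaloa)
# gamm returns a list; the correlation structure lives in the $lme component.
# Compare autocorrelation of raw and standardised residuals:
par(mfrow=c(1,2))
acf(residuals(ft_cyclicAR1$lme, type="response"), lag.max=25)
acf(residuals(ft_cyclicAR1$lme, type="normalized"), lag.max=25)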
[Autocorrelation functions of the raw residuals (left) and of the standardised residuals from the AR(1) model (right), for lags up to 25 months]
Note that the first autocorrelation function, of raw residuals, shows a strong autocorrelation
signal, lasting about a year (12 months). But after modelling this (using an AR(1) structure),
standardised residuals no longer have this pattern, suggesting the AR(1) structure does a
much better job of capturing the temporal autocorrelation structure. There is, however, still
a pretty large correlation at a lag of 12 months, and again at 24 months, suggesting that
the model for longer-term trends is not quite adequate, with some year-to-year variation not
being picked up.
Exercise 8.6: Mauna Loa monthly data—an extra term in seasonal trend?
Consider the Mauna Loa monthly data again. We tried mod-
elling the trend by fitting sine curves with two frequencies—
using (sin(month/12*2*pi),cos(month/12*2*pi)), which
gives a sine curve that completes one full cycle per year, and
(sin(month/12*4*pi),cos(month/12*4*pi)), for two cycles per
year. Is this sufficient, or should we add another frequency, too, like
month/12*6*pi (for curves with three cycles a year)?
Use gam and the model selection technique of your choice to consider these
options.
Chapter 9
Design-Based Inference
Design-based methods of inference will be introduced here for linear models, but note they are most commonly used in more complex situations where it is difficult to do model-based inference (so they will feature strongly in Chap. 14, a more complex situation!).
Key Point
Methods of design-based inference enable inference even when the model
used for analysis is not correct—by exploiting independent units in the study
design. This can be handy in situations where you trust your independence
assumptions but not much else. Common methods of design-based inference
include the permutation test, cross-validation, and the bootstrap.
A permutation test is applicable whenever the null hypothesis being tested is “no
effect of anything” (sometimes called an intercept model).
Consider Exercise 9.1. In this experiment, guinea pigs were randomly allocated to
treatment and control groups. If the nicotine treatment had no effect on response, we
would expect this random allocation of subjects to treatment groups to give results
no more unusual than under any other random allocation (where unusualness will be
measured here as size of the test statistic). That is, under the null hypothesis of no
treatment effect, randomly permuting the Control/Treatment labels as in Exercise 9.2
will not affect the distribution of the test statistic.
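To make this logic explicit, here is a minimal sketch of a permutation test coded directly (assuming the guineapig data frame used in Code Box 9.1 is available); it is not the mvabund implementation, just the bare idea:
ft_obs = lm(errors ~ treatment, data=guineapig)
F_obs = anova(ft_obs)$`F value`[1]        # observed test statistic
F_perm = replicate(999, {
  permData = guineapig
  permData$treatment = sample(permData$treatment)   # randomly permute labels
  anova(lm(errors ~ treatment, data=permData))$`F value`[1]
})
# P-value: proportion of statistics (observed one included) at least as large as observed
mean(c(F_obs, F_perm) >= F_obs)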
There are plenty of examples of permutation testing software available for simple
study designs like this one. Code Box 9.1 uses the mvabund package, developed
by myself and collaborators at UNSW Sydney, which reduces the problem to three
familiar-looking lines, or two if you have already loaded the package. The package
name mvabund stands for “multivariate abundance” data—the main type of data it
was designed for (Chap. 14). But it can be used for plenty of other stuff, too. To use
mvabund for a permutation test of all terms in a linear model, you just change the
function from lm to manylm (this helps R work out what to do when you call generic
functions like anova).
Code Box 9.1: A permutation test for the guinea pig data using mvabund
> library(mvabund)
> ft_guinea = manylm(errors~treatment,data=guineapig)
> anova(ft_guinea)
Analysis of Variance Table
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Arguments: P-value calculated using 999 iterations via residual
(without replacement) resampling.
Comparing Fig. 9.1 with Code Box 9.1, you will notice the P-values are quite
different. The P-value for the t-test in Fig. 9.1 was 0.005, whereas mvabund gave a
much larger value (0.015). Why? There are two reasons.
The first reason for the difference is that we are using a random set of permutations
of treatment labels, so results are random, too. If you repeat Code Box 9.1 yourself,
you will typically get different answers on different runs (similarly for the code
behind Fig. 9.1). We can actually use probability to quantify how much random error
(Monte Carlo error) there is in a resampled P-value. It depends on the number
of permutations used, controlled by nBoot in the mvabund package, and is also
a function of the size of the true P-value (specifically, it is a binomial proportion
with parameters nBoot and p where p is the true P-value). The more permutations,
the smaller the error; nBoot=999 (default) is usually good enough, in the sense
that it usually gives an answer within about 0.015 of the correct P-value when it is
marginally significant.
The second reason for the difference is that the mvabund package uses a two-sided
test (is there evidence of a difference with treatment, not just an increase?), so it
usually gives about double the one-sided P-value.
The test statistics are also slightly different: manylm, like lm, reports an (ANOVA) F statistic in an anova call, which is the square of the t statistic ($7.13 = 2.67^2$).
The F-test and two-sided t-test are equivalent, meaning they return the same P-value,
so this difference in test statistics has no implications when assessing significance.
You can use a permutation test for any situation where the null hypothesis says that all
observations come from exactly the same distribution (same mean, same variance,
. . .; if there is any correlation across observations, this has to be the same too). So,
for example, you could use a permutation test in regression, to test whether there is
an association between predictor(s) and response, as in Code Box 9.2.
Code Box 9.2: Permutation test for a relationship between latitude and
plant height
> data(globalPlants)
> ft_height = manylm(height~lat, data=globalPlants)
> anova(ft_height)
Analysis of Variance Table
9.2 Bootstrapping
Another method of design-based inference is the bootstrap. The motivation for the
bootstrap is slightly different, but the end result is very similar. Bootstrapping can
be used not just for hypothesis testing, but also for estimating standard errors (SEs) and CIs; it can even be tweaked for model selection (Efron and Tibshirani, 1997).
If we knew the true distribution of our data, we could compute a P-value by
simulating values directly from the true distribution (under the null hypothesis). But
all we have is our observed data. The idea of the bootstrap is to use our observed
data to estimate the true distribution. We then resample some data—generating a
new sample from our best estimate of the true distribution.
Consider the guinea pig data of Exercise 9.1. If treatment had no effect, all 20
observations could be understood as coming from the same distribution. Without
any further assumptions, our best estimate of the true distribution from the observed
data would be to say it takes the values 11, 19, 15, . . . , 34 with equal probability (1/20).
So in each resample, each observation has a 1/20 probability of taking each value from
the observed dataset. Four example bootstrap datasets, constructed in this way, are
in Table 9.1.
[Plot: a true distribution function F(y) together with empirical distribution functions F*(y) for samples of size n = 5, 50 and 500]
This idea, that bootstrapping can be understood as taking a random sample
from the empirical distribution function F ∗ (y), is fundamental to studying
the properties of the bootstrap. For example, because F*(y) → F(y) as n → ∞,
bootstrapping a statistic approximates its true sampling distribution (under the
random sampling assumption) and does so increasingly well as sample size
increases.
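To make this concrete, here is a minimal base-R sketch (not from the book) in which the true distribution is known because the data are simulated, so F*(y) can be compared to F(y), and a bootstrap resample is drawn from F*(y):

# The empirical distribution function F*(y) estimates F(y), and a bootstrap
# resample is just a random sample drawn from F*(y).
set.seed(1)
y = rexp(50)                          # "observed" data; the true F is exponential here
Fstar = ecdf(y)                       # empirical distribution function F*(y)
curve(pexp(x), from = 0, to = 4, ylab = "F(y)")       # the true F(y)
plot(Fstar, add = TRUE, col = "blue")                 # F*(y), closer to F(y) as n grows
yBoot = sample(y, size = length(y), replace = TRUE)   # one bootstrap resample from F*(y)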
Table 9.1: Guinea pig data and four bootstrap resamples, obtained by resampling the
observed values with replacement
Treatment C C C C C C C C C C N N N N N N N N N N
Obs. data 11 19 15 47 35 10 26 15 36 20 38 26 33 89 66 23 28 63 43 34
Bootstrap 1 38 63 26 43 26 43 19 43 89 38 28 38 15 15 33 38 35 38 15 26
Bootstrap 2 43 26 26 26 26 23 34 43 63 11 19 35 34 89 43 28 19 28 10 47
Bootstrap 3 19 66 28 20 35 38 33 36 26 33 15 43 33 47 23 47 66 89 38 28
Bootstrap 4 15 66 28 89 47 10 28 11 19 11 66 89 36 36 47 15 28 47 63 89
Code Box 9.3: Using the mvabund package for a bootstrap test of guinea
pig data
> library(mvabund)
> ft_guinea = manylm(errors~treatment, data=guineapig)
> anova(ft_guinea, resamp="residual")
Analysis of Variance Table
There are many ways to bootstrap data. The approach in Table 9.1 resampled
the response variable while keeping the explanatory variables fixed—this is a nat-
ural approach to use when testing an “all means equal” hypothesis. This approach
keeps the number of treatment and control replicates fixed, which is desirable in a
planned experiment where these values were in fact fixed. Sometimes (especially in
observational studies) the explanatory variables do not really take fixed values but
214 9 Design-Based Inference
are most naturally considered random. In such cases we could bootstrap whole cases
(most commonly the rows) of our dataset, not just the response variable, commonly
known as “case resampling” (Davison and Hinkley, 1997). This idea is illustrated
for the guinea pig data in Exercise 9.3. Case resampling doesn’t make a whole lot
of sense for the guinea pig data, a designed experiment with 10 replicates in each
treatment group, because it doesn’t guarantee 10 replicates in resampled data. It
would make more sense for Angela's height data or Ian's species richness data, these being observational studies where sites (hence rows of data) were sampled at random.
Case resampling is fine for estimating SEs or CIs but needs to be used with care
in hypothesis testing (Hall and Wilson, 1991). Case resampling can be implemented
using mvabund, with corrections along the lines of Hall and Wilson (1991), by
specifying resamp="case".
Two case-resampled versions of the guinea pig data (Exercise 9.3), each obtained by resampling whole rows with replacement:

Resample 1:  N N N N C N N N N N C C C C N C C C N C
             33 89 34 89 35 43 38 33 23 63 10 26 11 10 43 47 10 19 66 35
Resample 2:  N N C C N N N C N C N C N N N C C C C C
             66 66 19 47 63 43 43 20 33 15 28 26 89 38 43 47 20 20 11 20

Count the number of controls in each case-resampled dataset. Did you get the number you expected to?
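A minimal sketch of case resampling for these data follows (it assumes a data frame guineapig with columns treatment and errors, as used in Code Box 9.3). Because whole rows are resampled with replacement, the number of controls varies from resample to resample:

# Case resampling: resample whole rows of the dataset with replacement.
# The number of control (C) rows is then random, rather than fixed at 10.
set.seed(1)
ids = sample(nrow(guineapig), replace = TRUE)
guineaCase = guineapig[ids, ]
table(guineaCase$treatment)   # rarely exactly 10 of each treatment label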
There are two main differences in resampling procedures between bootstrapping and
permutation testing, which imply different models for the underlying data-generating
mechanism:
• A bootstrap treats the response variable y as random and resamples it, whereas a
permutation test treats the treatment labels as random and permutes them.
• A bootstrap resamples with replacement, whereas a permutation test uses each
response variable exactly once.
These differences may be important conceptually, but in practice they have few implications. In a test of a "no-effect" hypothesis, results should not be expected to differ in any practically meaningful way between a bootstrap and a permutation test. So, from a practical perspective, you could equally well use
either method in that context. One important practical difference, however, is when
moving beyond tests of “no-effect” null hypotheses—the bootstrap can readily be
extended to such contexts, e.g. it can be used to estimate SEs or CIs. The notion
of permutation testing does not extend as naturally to these settings (although it is
possible). Another more subtle difference is that whereas permutation tests are exact
for “no-effect” nulls, no such claim appears to have been proven for the bootstrap.
But experience suggests that if bootstrap tests for no effect are not exact, they are at least very close.
So how do you decide what to do in practice? Well, as previously, it doesn’t really
matter! But it would be prudent to (where possible) use a method of resampling
that reflects the underlying design—this is, after all, the intention of design-based
inference. So, for example, in an experiment testing for an effect of a treatment
applied to a randomised set of subjects (e.g. the guinea pigs), the permutation test
is the most intuitive approach to use because it tests explicitly whether the observed
randomisation led to significantly different results compared to other possible random
assignments of subjects to treatment groups. In other settings, where the response variable is the random quantity (such as the global plant height data, where the goal is to model changes in plant height), bootstrapping the response tends to make more sense, since it resamples the quantity that the design treats as random.
When fitting a regression model to data collected in a fixed design (where the
values taken by explanatory variables were for the most part under the control of the
researcher), I tend to bootstrap, keeping the design fixed. In an observational study
where the explanatory variables and responses were random, I would case resample.
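To illustrate how similarly the two schemes behave for a no-effect null, here is a hand-rolled sketch (not the book's code; it assumes the guineapig data frame of Code Box 9.3, with columns errors and treatment) that computes both a permutation and a bootstrap P-value for the same statistic:

# Permutation vs bootstrap test of "no treatment effect", both using the
# two-sample t-statistic. Under the null, labels can be permuted (permutation
# test) or responses resampled with replacement (bootstrap test).
set.seed(1)
tObs = abs(t.test(errors ~ treatment, data = guineapig)$statistic)
nBoot = 999
tPerm = tBoot = numeric(nBoot)
for (i in 1:nBoot) {
  permDat = bootDat = guineapig
  permDat$treatment = sample(guineapig$treatment)            # permute labels
  bootDat$errors = sample(guineapig$errors, replace = TRUE)  # bootstrap responses
  tPerm[i] = abs(t.test(errors ~ treatment, data = permDat)$statistic)
  tBoot[i] = abs(t.test(errors ~ treatment, data = bootDat)$statistic)
}
mean(c(tPerm, tObs) >= tObs)   # permutation P-value
mean(c(tBoot, tObs) >= tObs)   # bootstrap P-value, typically very similar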
Bootstrapping assumes the units can be resampled at random, that is, it assumes
these units are independently and identically distributed (iid). This is a familiar
assumption from linear models, where we assume errors are iid. In the special case
of a “no-effect” null hypothesis, the iid assumption in linear models applies to the
response variable, which coincides with the iid assumption of the response we made
when bootstrapping the response in Code Box 9.3.
Permutation tests assume observations are exchangeable (can be swapped around
without changing the joint distribution). Technically this is slightly more general
than the iid assumption, where the independence part of the assumption is relaxed
to an assumption of equal correlation. In practice it is quite rare to be able to argue
that the exchangeability assumption is valid but the iid assumption is unreasonable,
so the distinction being made here is of limited utility.
There are no further assumptions when testing no-effect null hypotheses. That’s
the value of design-based inference—we can make inferences using only indepen-
dence assumptions that we can often guarantee are satisfied via our study design.
(How can you guarantee observations are independent?) This sounds fantastic, but
recall that when we did model-based inference on the guinea pig dataset, we weren’t
really worried about the additional assumptions we needed to make (because of
robustness to assumption violations). So in this case, design-based inference added
little value, only freeing us from making assumptions we weren’t worried about mak-
ing. The real power of design-based inference is in more difficult statistical problems,
where model-based inference does not apply, is difficult to derive, or is known not
to work well.
The term “bootstrap” comes from the expression “to pull yourself up by the boot-
straps” from a Baron von Munchausen story, in which the protagonist did precisely
this to get himself out of a sticky situation. The implication is that the bootstrap kind
of lets the data get themselves out of a sticky situation, without external help (without
model-based assumptions). It allows us to make valid, distribution-free inferences
from very small samples where previously it was thought impossible.
The bootstrap is one of the most exciting developments in statistics over the last
half-century; it generated a lot of chatter in the late 1970s and 1980s and is now so
pervasive it is even in some high school syllabuses.
In Chap. 6 we met the parametric bootstrap. The idea there was that we added some
assumptions—we assumed we knew the form of the distribution of our data and
that all we were missing were the true values of parameters. We used our sample
estimates of parameters as if they were the true values, to simulate observations to
use to estimate our null distribution (or SE, for example). On R, you can construct
a parametric bootstrap for many types of model (including lm, lmer, and others we
will meet later) using the simulate function.
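As a concrete illustration (a sketch under assumed set-up, not the book's code), here is a parametric bootstrap test of the latitude effect for the plant height data of Code Box 9.2, using simulate to generate new responses from the fitted null model:

# Parametric bootstrap sketch: simulate responses from the model fitted under
# the null (no latitude effect), refit both models, and compare F-statistics.
library(ecostats)                       # assumed source of the globalPlants data
data(globalPlants)
dat = na.omit(globalPlants[, c("height", "lat")])   # drop missing values, if any
ftNull = lm(height ~ 1, data = dat)
ftAlt = lm(height ~ lat, data = dat)
fObs = anova(ftNull, ftAlt)$F[2]
nBoot = 999
fStar = numeric(nBoot)
simY = simulate(ftNull, nsim = nBoot)   # each column is a simulated response vector
for (i in 1:nBoot) {
  yNew = simY[[i]]
  fStar[i] = anova(lm(yNew ~ 1, data = dat), lm(yNew ~ lat, data = dat))$F[2]
}
mean(c(fStar, fObs) >= fObs)            # parametric bootstrap P-value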
The parametric bootstrap looks and sounds like a design-based method—its name
mentions the bootstrap, it is a simulation-based method, and for complex models
it takes a very long time to run. While these features all sound quite like those of
design-based methods, the parametric bootstrap is in fact a model-based approach—
because it is assuming that the fitted model is correct and exploiting that information
to make general inferences from the sample. A design-based approach, in contrast,
exploits independent units implied by the study design to make inferences, whether
permuting across randomised treatment allocations, resampling responses, or some
other technique.
The practical implication of this distinction is that the parametric bootstrap gen-
erates valid P-values when we are reasonably happy with our model, but may not
necessarily work so well under violations of assumptions (it depends which as-
sumptions are violated, and how badly). Design-based methods tend to make fewer
assumptions and thus tend to be more robust to violations. Because the parametric
bootstrap adds extra assumptions, it tends to work a bit better when those extra
assumptions are satisfied, but it can be misleading when the assumptions are not
satisfied. (This rule about assumptions is a pretty universal one—the “no free lunch”
principle of statistics.)
A good situation for using the parametric bootstrap is when we just don’t know
how to use theory to get good P-values directly from the model, but we are not
excessively worried by violations of model assumptions. The parametric bootstrap
is often used for linear mixed models because we often find ourselves in precisely
this situation.
These resampling methods are all well and good if the goal is to test a hypothesis of
no effect—when we can freely resample observations. But what if we don’t want to
do that? For example, in Exercise 9.4, Angela wants to test for an effect of latitude
after controlling for the effect of rainfall. In this case, Angela doesn’t have an “all
means equal” null hypothesis, so she can’t use the standard permutation and bootstrap
tests.
What model should be fitted under the null hypothesis? Does it include any
predictor variables?
The null hypothesis in Exercise 9.4 is that plant height is explained by rainfall.
We can’t freely permute or bootstrap observed data—this would break up the data
structure under the null (removing the relationships between plant height and rainfall,
or between latitude and rainfall). We want to resample under the null hypothesis that
there is a relationship between rainfall and plant height (and possibly between rainfall
and latitude). We can do this by resampling residuals using the fitted model under
the null hypothesis (as in Edgington, 1995) (Fig. 9.2).
Fig. 9.2: Residual resampling of the first 10 observations from the plant height data. (a) The observed responses (plant heights, y) can be split into fitted values (μ̂) and residuals (ε̂); (b) the residuals are resampled (ε̂*), then added back onto the fitted values (μ̂) to construct resampled responses (y*)
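A hand-rolled version of this scheme might look as follows. This is a sketch only: the rainfall and latitude column names rain and lat are assumptions here (following the globalPlants data of Code Box 9.2), and mvabund's Code Box 9.4 is the recommended route in practice.

# Residual resampling for Exercise 9.4: test the latitude effect after
# controlling for rainfall, by resampling residuals from the null model
# (height ~ rain) and adding them back onto its fitted values.
# Column names 'rain' and 'lat' are assumptions, not necessarily the real names.
set.seed(1)
dat = na.omit(globalPlants[, c("height", "rain", "lat")])
ftNull = lm(height ~ rain, data = dat)
fObs = anova(lm(height ~ rain + lat, data = dat))["lat", "F value"]
nBoot = 999
fStar = numeric(nBoot)
for (i in 1:nBoot) {
  # resample residuals with replacement (a residual bootstrap; permuting them is the other option)
  dat$yStar = fitted(ftNull) + sample(residuals(ftNull), replace = TRUE)
  fStar[i] = anova(lm(yStar ~ rain + lat, data = dat))["lat", "F value"]
}
mean(c(fStar, fObs) >= fObs)   # resampling-based P-value for latitude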
Residual resampling is the default in mvabund, as in Code Box 9.4 for the plant
data of Exercise 9.4. Residual resampling works for any linear model—you could
rerun any of the analyses of Chaps. 2–4 using the same lines of code as in Code
Box 9.4 but with the relevant changes to the linear model specification in the first
line, i.e. you could use residual resampling in factorial ANOVA, in paired or blocked
designs, in ANCOVA, . . .. But a limitation of mvabund, and of residual resampling
more generally, is that while it works for any fixed designs, it can’t handle random
effects, at least not in a natural way. The main difficulty is that there are multiple
sources of randomness in such a model (specifically, some of the parameters are then
treated as random, as well as the response), so resampling would need to happen at
multiple levels. A few resampling approaches have been proposed for mixed effects
models, but the issue is not completely settled in the literature, and a parametric
bootstrap is usually a better option.
Code Box 9.4: Residual resampling using mvabund for Exercise 9.4.
When permuting data under the no-effect hypothesis, permutation testing is exact.
Under residual resampling, permuting and bootstrapping are only approximate.
Two tips for ensuring this approximation is good:
• Make sure you resample residuals from a plausible null model. If your null model
is wrong (e.g. forgot to transform data), your P-values can be wrong and the
whole thing can be invalid. So check assumptions.
• Only estimate the resampling distribution of a standardised (or “pivotal”) statistic,
e.g. t-stat, Z-stat, likelihood ratio stat. Do not estimate the resampling distribution
of an unstandardised statistic (e.g. β̂). In some settings, bootstrap resampling is
known not to help improve validity for unstandardised statistics (as compared to
using standardised statistics, e.g. Z, t), but it will help if the statistic is standardised
(Hall and Wilson, 1991).
These are the assumptions we make when using model-based approaches to infer-
ence, as for example when making an anova call to a lm object.
Key Point
Residual resampling extends the ideas of permutation tests and bootstrap tests
to general fixed effects designs, but at some cost—we need to assume residuals
are exchangeable or iid, which for linear models implies that we still require
the linearity and equal variance assumptions.
Residual resampling relaxes the normality assumption—and that’s it! And you may
recall normality was the least important assumption in the first place! (The CLT
gives us a lot of robustness to violations of this assumption.) So residual resampling
isn’t a solution to all the world’s problems.
It is also a bit of a stretch to call residual resampling design-based inference—a
model is needed to compute residuals! So really this approach sits in a grey area
between model-based and design-based inference—we are relaxing some of the
assumptions of the model but still require key model assumptions to be satisfied,
and we can’t assume our inferences are valid by pointing to independent units in our
study design.
Often you will find that the results you get from design-based inference are similar
to what you might get from model-based inference. For example, the P-value for the
guinea pig data of Code Box 2.5 is 0.016, and a resampling-based P-value using a
large number of resamples (like 9999) settles around 0.012. For the plant height data
of Code Box 3.1, most P-values are similar up to two significant figures.
The reason most P-values work out similarly is that, as previously, the only
assumption that is being relaxed when fitting these models is normality, which the
CLT gave us a lot of protection against anyway. So unless you have quite non-normal
data and quite a small sample, you should not expect appreciable differences in
results. As such, unless you have a small sample (n ≤ 10, say), you don’t really need
to bother with design-based inference at all when analysing a single quantitative
response variable. The main case for using design-based inference, considered in
this text, is when analysing highly multivariate data (Chap. 14).
We have seen that under resampling, the validity of P-values (or SEs or CIs or other
inference procedures) rests primarily on the independence assumption—there is no
longer an assumption about the actual distribution of y, although under residual
resampling we also require some model assumptions (for linear models, we still
need linearity and constant variance).
But we should still check linear model assumptions even when they are not needed for the test to be valid. Why check linear model assumptions that are not needed when resampling?
This is a really important idea to understand—many procedures have been proposed for ecological data that use resampling for inference and claim to be generally applicable because they make no assumptions beyond independence. But
valid procedures can still work very badly—no procedure should be applied blindly
to data! A procedure should only be applied if it is valid and it can be expected to
work well for data like what you have (i.e. should lead to relatively small SEs, or
relatively high power).
Key Point
Just because your inferential procedure is valid doesn’t mean it is efficient.
That is, just because you use design-based inference, and so don’t require
all your model assumptions to be satisfied for valid inference, doesn’t mean
you can ignore model assumptions. Design-based methods will work better
(in terms of having better power, shorter CIs, and so forth) when model
assumptions are closer to being satisfied.
Just because a method is valid doesn’t mean it works well. For example, linear
models are more efficient (more power) when our assumptions are closer to being
satisfied—so try to satisfy them as closely as you can to get the best answer you can!
[Residual vs fitted values plot and normal quantile plot for the model on untransformed data.]
What do you reckon?
Log-transform the number of errors and check assumptions. Does this better
satisfy assumptions than the model on untransformed data?
Repeat the permutation test of Code Box 9.1 on log-transformed data.
How do the results compare to the analysis without a log transformation?
Is this what you expected to happen?
Consider, for example, the height data of Code Box 9.4. If we construct a residual
plot, we see that the linear modelling assumptions are not satisfied (Code Box 9.5)—
in particular, there is something of a fan shape, suggesting residuals do not in fact
have equal variance, and hence they should not be permuted. The main cause of
the problem is that values are “pushing up” against the boundary of zero—a plant
can’t have negative height. The solution in Exercise 9.5 is to log-transform height,
in effect removing the lower boundary and putting height onto a proportional scale.
This ends up doing a much better job of satisfying assumptions, but it also changes
our answer—there ends up being a significant effect of latitude, after controlling for
rainfall. The most likely reason for this change is that the original analysis was not
efficient—by analysing a strongly skewed variable, with heteroscedastic errors, the
linear model test statistic was not able to detect the effect of latitude that was in fact
present.
So in summary, resampling can make your method valid even when assumptions
fail, but to get a method that works well—has good power, small SEs, for example—
you need to use a good model for the data at hand. Only with a good model for your
data can you ensure that you have a good statistic for answering the research question
of interest. You can get completely different results from different analyses even with
resampling (e.g. log-transformed vs untransformed) because of big differences in
how reasonable your fitted model is, hence big differences in how good your test
statistic is.
There are a few types of dependence we often encounter and can deal with in a
design-based way. In particular:
• Clustered data due to pseudo-replication
• Autocorrelation from temporal, spatial, or phylogenetic sources.
Both of these sources of dependence lead to positively correlated observations, and the impact of ignoring them is to make SEs (and hence CIs) too small and P-values too small. That is, too many false positives, which, as we have discussed, is really bad news.
Sometimes we can assume blocks of sites are independent (or approximately so), in
which case we can exploit independence across blocks and resample those as a basis
for inference about terms in the model that vary across blocks.
In mvabund, there is a block argument that can be used in anova.manylm in
order to resample blocks of observations and keep all observations within a block
together in resampling. This is appropriate in a multi-level (hierarchical) design,
where there is an interest in making inferences at the higher sampling level, as in
Graeme’s estuary data. Note, though, that using this function in combination with
residual resampling requires balanced sampling within blocks and assumes that if
there is an order to observations within blocks, values are entered in the same order
within each block. (This is needed so that each resampled residual matches up with
the appropriate fitted value from the same block.) With unbalanced blocks, the block
argument can only be used in combination with case resampling (i.e. when jointly
resampling rows of X and Y ). Note also that because manylm only works with fixed
effects designs, there is no opportunity to include random effects in a fitted model
(you would need to code your own function to do this). An example using manylm
with the block argument to analyse Graeme’s estuary data is at Code Box 9.6.
Code Box 9.6: Block resampling using mvabund for estuary data
To test for an effect of modification, using block resampling of estuaries:
> data(estuaries)
> ft_estLM = manylm(Total~Mod,data=estuaries)
> anova(ft_estLM,resamp="case",block=estuaries$Estuary)
Using block resampling...
Analysis of Variance Table
Table 9.2: Three examples of restricted permutation of the raven counts within sites, compared to the observed data. In a paired design, this involves randomly swapping the before and after values within each pair

Observed Data
Before  0 0 0 0 0 2 1 0 0 3 5 0
After   2 1 4 1 0 5 0 1 0 3 5 2
Permutation 1
Before  2 0 0 1 0 2 0 1 0 3 5 2
After   0 1 4 0 0 5 1 0 0 3 5 0
Permutation 2
Before  2 1 0 1 0 5 1 0 0 3 5 2
After   0 0 4 0 0 2 0 1 0 3 5 0
Permutation 3
Before  2 1 4 1 0 2 1 1 0 3 5 0
After   0 0 0 0 0 5 0 0 0 3 5 2
1 Not to be confused with restricted randomisation, a technique often used to allocate patients to
treatment groups in clinical trials.
Code Box 9.7: Block resampling using permute for raven data
We will start by taking the ravens data for the gunshot treatment only and arranging in long
format:
data(ravens)
crowGun = ravens[ravens$treatment == 1,]
library(reshape2)
crowLong = melt(crowGun,measure.vars = c("Before","After"),
variable.name="time",value.name="ravens")
as constructed in Code Box 4.2. To construct a matrix to use for the permutation of the raven
data, permuting treatment labels within each sampling location:
library(permute)
CTRL = how(blocks=crowLong$site)
permIDs = shuffleSet(24,nset=999,control=CTRL)
Now to use this in mvabund to test for a treatment effect using restricted resampling:
> ravenlm = manylm(ravens~site+time,data=crowLong)
> anova(ravenlm,bootID=permIDs)
Using <int> bootID matrix from input.
Analysis of Variance Table
Fig. 9.3: Moving block bootstrap idea (from Slavich & Warton, in review). (a)
Bootstrap procedure to construct distribution of a statistic, T. (b) Sampling units are
single sites. (c) Sampling units are non-overlapping blocks of sites. (d) Sampling
units are overlapping blocks of sites, with a small block length. (e) Sampling units
are overlapping blocks of sites, with a larger block length
For the resampling methods considered so far, the main assumption being relaxed was that of normality. For block resampling, this is no longer the case, and we can get quite substantial changes in results as compared to model-based inference. Notice
that in Code Box 9.8, moving block bootstrap P-values are larger than what you
would get using a model-based approach, and in Code Box 9.9, the SEs on species
richness predictions also tend to be larger. This is because the assumption we are
relaxing here is the independence assumption, and violations of this key assumption
can have substantial effects on the validity of our method. Ian’s data are spatially
autocorrelated, with observations that are closer together tending to be more similar,
and if we ignore this and pretend our observations are independent, we are pretending
there is much more information in the data than there actually is. The moving block
bootstrap can correct for this effect, leading to less optimistic SEs and P-values.
The main difficulty when using this method is choosing the block size. The size
of the spatial blocks to be resampled needs to be large enough that observations
have little dependence across blocks, but it should be small enough that sufficient
replication is possible for inference. In Code Box 9.8 a block size of 20 km was used,
an educated guess given that the spatial autocorrelation in residuals decayed to low
values by about this distance in Code Box 7.7. Given that the study area was a few
hundred kilometres in diameter, this was also small enough for a decent amount of
replication. You can let the data choose the block size for you by trying out a few
different block sizes and finding the one that minimises some measure of precision.
This can be done using the blockBootApply function; for more details see the
software documentation.
Note that the moving block bootstrap is not always applicable—it is a useful tool
when the spatial scale over which dependence operates is small relative to the size
of the region. If there is large-scale dependence, with observations at opposite ends
of the region still being dependent, then it is simply not possible to construct a block
size such that observations in different blocks have little dependence. In this situation
one is stuck with model-based approaches, which also often have difficulty with this
situation (specifically, robustness to model misspecification).
Code Box 9.8: Moving block bootstrap test for species richness modelling
First we will fit Ian’s model, with quadratic terms for average maximum and minimum daily
temperature and annual precipitation:
> data(Myrtaceae)
> Myrtaceae$logrich=log(Myrtaceae$richness+1)
> mft_richAdd = manylm(logrich~soil+poly(TMP_MAX,degree=2)+
poly(TMP_MIN,degree=2)+poly(RAIN_ANN,degree=2),
data=Myrtaceae)
Now we will use a spatial moving block bootstrap for the species richness data, with a block
size of 20 km, generating 199 bootstrap resamples computed on a 5 km grid, via the
BlockBootID function in the ecostats package. This takes a while!
> BootID = BlockBootID(x = Myrtaceae$X, y = Myrtaceae$Y, block_L = 20,
nBoot = 199, Grid_space = 5)
Now using this to test the significance of the various terms in Ian’s model:
> anova(mft_richAdd, resamp="case", bootID=BootID)
Code Box 9.9: Moving block bootstrap SEs for species richness predictions
Ian was originally interested in species richness predictions, and we can use the moving block
bootstrap to estimate these, adjusting for spatial autocorrelation (using BootID, constructed
in Code Box 9.8):
> ft_richAdd = lm(logrich~soil+poly(TMP_MAX,degree=2)+
poly(TMP_MIN,degree=2)+poly(RAIN_ANN,degree=2),
data=Myrtaceae)
> nBoot=199
> predMat = matrix(NA,length(Myrtaceae$logrich),nBoot)
> for(iBoot in 1:nBoot)
{
ids = BootID[iBoot,]
ft_i = update(ft_richAdd,data=Myrtaceae[ids,])
predMat[ids,iBoot] = predict(ft_i)
}
> bootSEs = apply(predMat,1,sd,na.rm=TRUE)
Compare these to the SEs on predicted values from the lm function:
> lmSEs = predict(ft_richAdd,se.fit=TRUE)$se.fit
> cbind(bootSEs,lmSEs)[1:10,]
bootSEs      lmSEs
[1,] 0.03756633 0.02982530
[2,] 0.03166692 0.02396074
[3,] 0.04107874 0.02853756
[4,] 0.04217585 0.03348368
[5,] 0.06295291 0.06279115
[6,] 0.03707599 0.03413518
[7,] 0.05155717 0.02808283
[8,] 0.09053780 0.08272950
[9,] 0.03058480 0.02759009
[10,] 0.05739949 0.06622908
Is this what you expected to see?
The moving block bootstrap was illustrated above for the spatial context, and the
software used was developed specifically for spatial data. However, the idea is quite
general and could be applied to other types of autocorrelated data, such as phylo-
genetically structured data. Resampling is commonly used in phylogenetic analysis,
but, at the time of writing, the moving block bootstrap has not been. The method
could be useful in a phylogenetic context if the traits being measured have “local”
dependence, i.e. phylogenetic structuring is seen over small phylogenetic distances,
but observations across more distant branches of the tree can be considered to be
largely independent of each other. Some thought would be required to adapt block
bootstrap methods to the phylogenetic context, and existing software could not be
applied to the problem directly.
Chapter 10
Analysing Discrete Data: The Generalised
Linear Model
Key Point
Quantitative data can be continuous or discrete, and if your response is quan-
titative, this has important implications for how the data are analysed.
Continuous: Can take any value in some interval (e.g. any positive value)
=⇒ linear models (LMs)
Discrete: Takes a “countable” number of values (e.g. 0, 1, 2, . . .)
=⇒ generalised linear models (GLMs)
If your data are discrete but never take small values (e.g. always larger than
5), you could think of it as kind of close to continuous and try LMs, but if you
have small counts and zeros, GLMs are needed to handle the mean–variance
relationship.
Why does discreteness matter? The main reason is that it tends to induce a mean–
variance relationship—as the mean changes, the variance changes. This is illustrated
in Figs. 10.1, 10.2, and 10.3 for Exercises 10.1–10.3, respectively.
When you have zeros and small counts, you will have trouble getting rid of the
mean–variance relationship via transformation—in fact it can be shown mathemati-
cally that it is generally impossible to remove a mean-variance relationship for small
counts (Warton, 2018), which leaves us violating the linear model assumption of
constant variance. This happens because small counts are “pushed up” against the
boundary of zero. Transformation can’t fix the problem for small counts because
if many values are zero or one, then transformation can't spread these values out well—the zeros will always stay together under transformation, as will all the ones.
Consider the eel data of Fig. 10.2—no eels were found in northern stations after the
wind farm was constructed, so the variance in this group will always be zero irrespec-
tive of data transformation, which is problematic for the equal-variance assumption
of linear models.
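A two-line illustration of why transformation cannot help here (not from the book):

# A group of observations that are all zero has zero variance on any scale;
# no transformation can change that.
y = c(0, 0, 0, 0)
c(var(y), var(sqrt(y)), var(log(y + 1)))   # all zero, whatever the transformation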
[Fig. 10.1: proportion of crabs present for each Dist × Time combination, and variance vs mean, for the crab data of Exercise 10.1.]
Fig. 10.2: (a) Untransformed and (b) square-root-transformed eel abundances from
Exercise 10.2. Notice in (a) that when mean abundance is higher, the abundances
are also more variable, e.g. south stations before impact have the largest mean and
variance. At the other extreme, no eels were caught at north stations after impact,
giving a mean and variance of zero. Note in (b) that after square-root transformation,
these patterns are still evident. In this situation, it is important to use a GLM to model
abundance
In ecology, this sort of issue arises especially often when studying the distribution
or abundance of a particular species across a range of environments. Most species
have some habitats in which they are rarely found, so when we sample in such
habitats, we get lots of zeros and small counts, as with the eels (Fig. 10.2).
On the other hand, if the response variable is a count across different species (e.g.
total abundance or species richness), counts tend to be larger, and transformation is
sometimes a viable option. The reason for this is that it is not common to sample
in such a way that you don’t see anything at all, hence zeros (and ones) are rarely
encountered when summing across species.

[Fig. 10.3: (a) invertebrate abundances by order (Soleolifera, Coleoptera, Diptera, . . . , Acarina) for control and revegetated plots (Exercise 10.3); (b) variance vs mean on log–log axes, by treatment (Control/Reveg).]

Consider again the wind farm data,
but looking at all fish that were caught rather than just the eels (Fig. 10.4). While
not every species was caught at every sampling station, at least one species was
always caught (and commonly, 5–20 individuals). So while total fish abundance has
a mean–variance relationship (Fig. 10.4a), transformation does a pretty good job of
removing it in this case (Fig. 10.4b).
Fig. 10.4: Wind farm data of Exercise 10.2, but now looking at total fish abundance
across all species (a) untransformed, (b) square-root-transformed. Notice in (a) that
when mean abundance is higher, the abundances are also more variable, e.g. compare
north stations after impact (high mean, high variance) and the wind farm before
impact (low mean, low variance). Notice in (b) that after square-root transformation,
this pattern is largely gone. This is because a look at total abundance across all species
reveals there aren’t many small counts, so transformation to stabilise variances
becomes an option. In this situation we might be able to get away with using a LM
instead of a GLM
A generalised linear model (Nelder and Wedderburn, 1972, GLM) can be understood
as taking a linear model and doing two things to it—changing the equal variance
assumption and changing the linearity assumption.
The most important thing that is done is to change the assumptions on the
variance—rather than assuming equal variance, we assume a mean–variance re-
lationship, which we will write as V(μ) for a mean μ. If the variance is known to
change in a predictable way as a function of the mean, a GLM will incorporate this
information into the analysis model. For presence–absence data, for example, it is
known that the variance is exactly a quadratic function of the mean: V(μ) = μ(1 − μ),
which is the curve drawn on the mean–variance plot in Fig. 10.1. Using a GLM
there is no need to try to transform the data in advance of analysis to remove the
mean–variance relationship—we just include it in the model. This is an important
advance and quite a different way of approaching analysis if you are used to linear
models and data transformation—by avoiding data transformation, interpretation of
the model is simpler, and you keep statisticians happy, since they tend to have a philosophical preference for modelling the mechanism behind the observed data (rather than messing around with the data first to try to match them to a particular type of model).
The other change that is made in a GLM, as compared to a LM, is to the linear-
ity assumption—rather than assuming that the mean response varies linearly with
predictors, we now assume that some known function of the mean varies linearly
with predictors. This function is known as the link function and will be written g(μ).
The main reason the link function is introduced is to ensure that predicted values
will stay within the range of possible values for the response, e.g. presence–absence
data take the values zero (absence) or one (presence), so we use a link function that
ensures that the predicted values will always remain between zero and one (such as
the logit function, introduced later). In contrast, if you were to analyse discrete data
using a linear model, it would be possible to get nonsensical predictions like 120%
survival or a mean abundance of −2!
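A small simulated illustration of this point (a sketch, not the book's code):

# With binary data, a linear model can predict values outside [0, 1];
# a binomial GLM's link function keeps predictions on the probability scale.
set.seed(1)
x = seq(0, 10, length.out = 50)
y = rbinom(50, size = 1, prob = plogis(-4 + x))     # simulated presence-absence
ft_lm = lm(y ~ x)
ft_glm = glm(y ~ x, family = binomial)
newX = data.frame(x = c(-2, 12))
predict(ft_lm, newdata = newX)                      # can fall below 0 or above 1
predict(ft_glm, newdata = newX, type = "response")  # always stays between 0 and 1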
A common misconception is that the link function g(μ) transforms the data—
but it is not doing anything to the data. The link function is a way of introducing
special types of non-linearity to a model for the mean, e.g. assuming the mean is an
exponential function of predictors or a sigmoidal (“logistic”) function, for example,
without messing with the data at all.
In a GLM, the observations yi are assumed to come from some distribution F, and a known function of their means is modelled as linear in the predictors:

g(μi) = xiT β

where, as for linear models, xiT β is vector notation for the linear sum x1 β1 + x2 β2 + . . . + xp βp, and the yi are assumed to be independent across observations (conditional on x).
The distribution F is assumed to come from what is known as the exponential
family of distributions (Maths Box 10.1), which contains as special cases the bi-
nomial, Poisson, and (given the overdispersion parameter φ) the negative binomial.
These are the most natural distributions to use to model discrete data.
A distribution belongs to the exponential family if its log-density can be written in the form

log f(y; μ) = yθ − A(θ) + c(y)

where θ = h(μ) for some function h, and c(y) does not involve the parameter. The key idea in this definition is that the value y and the parameter μ only interact, on the log scale, through a product of y and some function of the parameter, θ = h(μ). The forms of the functions h(μ) and A(θ) end up being important—h(μ) suggests a "default" or canonical link function to use in a GLM, one that has desirable properties, and A(θ) captures information about the mean–variance relationship in the data. Specifically, it turns out that the mean is the derivative of this function, μ = A′(θ), and the mean–variance function is the second derivative, V(μ) = A″(θ) (McCullagh and Nelder, 1989, Section 2.2.2). It also turns out that h′(μi) = V(μi)⁻¹ (McCullagh and Nelder, 1989, Section 2.5).
As an example, for the Poisson distribution with mean μ, f(y; μ) = e^(−μ) μ^y / y!, so

log f(y; μ) = y log μ − μ − log y!

So the obvious or "canonical" link function to use in a GLM of Poisson counts is θ = log(μ). Then A(θ) = e^θ = μ, and differentiating twice we see that the variance of Poisson counts is V(μ) = A″(θ) = e^θ = μ. Or, alternatively, h′(μ) = 1/μ, so V(μ) = (1/μ)⁻¹ = μ.
Some distributions have two parameters but an exponential form, most
notably, the normal, binomial, and gamma distributions. The second “scale”
parameter σ affects the variance, which becomes σ²A″(θ). Some other two-
parameter distributions are related to the exponential family. In particular,
negative binomial and beta distributions have a second “shape” parameter and
will only be members of the exponential family if the value of this shape
parameter is known, making them slightly messier to work with. The binomial
also has a second shape parameter, the number of trials, which typically is
known.
Recall that in linear models, the normality assumption usually ends up not being
that important (thanks to the central limit theorem), the more important part is the
equal variance assumption, which can be understood as being implied by the normal-
ity assumption. In much the same way, the actual distribution F ends up not being
that important in a GLM (thanks to the central limit theorem); the more important
part is the mean–variance assumption that is implied by the chosen distribution.
Thus, the key parts of a GLM to pay attention to are the independence assumption,
the mean–variance relationship function V(μ), and the link function g(μ) (Maths
Box 10.2).
g(μi) = xiT β

Differentiating the log-likelihood ℓ(β) with respect to β gives the score function

dℓ(β)/dβ = Σi (dθi/dβ) ∂ℓ(β)/∂θi = Σi xi [h′(μi)/g′(μi)] (yi − A′(θi)) = Σi xi (yi − μi) / [g′(μi) V(μi)]     (10.4)

where the sums run over i = 1, . . . , n. The last step follows since μi = A′(θi) and h′(μi) = V(μi)⁻¹ (as in Maths Box 10.1). To show that dθi/dβ = xi h′(μi)/g′(μi) requires a little work—using the chain rule again we can say dθi/dβ = (dθi/dμi)(dμi/dβ), and recall that θi = h(μi), so dθi/dμi = h′(μi), while dμi/dβ = xi/g′(μi) because g(μi) = xiT β.

Notice that the distribution of the data only affects the score equation via its mean–variance function V(μi), so this is the key distributional quantity that matters. Notice also that if the link function used in fitting is the canonical link, g(μi) = h(μi), then g′(μi) = h′(μi), so the score equation simplifies to 0 = Σi xi (yi − μi).
Linear model:  y ∼ N(μy, σ²),  μy = β0 + xT β
GLM:  y ∼ F, with variance function V(μy),  g(μy) = β0 + xT β
As previously, the key changes here, as compared to linear models, are the introduc-
tion of a mean–variance relationship V(μ) and a link function g(μ).
Key Point
A GLM extends linear models by adding two features:
• A mean–variance relationship V(μ) (in place of constant variance)
• A link function g(·) used to transform the mean before assuming linearity.
The key function in R is glm, and you choose which type of GLM to fit using the
family argument, with common examples of how this is used in Table 10.1. Some
simple examples using this function are in Code Box 10.1.
data(windFarms)
eels = windFarms$abund[,16]
ft_eels = glm(eels~Station+Year*Zone,family="poisson",
data=windFarms$X)
Station was included in the model to account for the pairing in the data.
To fit a negative binomial regression to Anthony’s worm counts from Exercise 10.3,
using the manyglm function from the mvabund package:
data(reveg)
Haplotaxida=reveg$abund[,12]
library(mvabund)
worms = reveg$abund$Haplotaxida
ft_worms = manyglm(worms~treatment,family="negative.binomial",
data=reveg)
A GLM is fitted using maximum likelihood, as discussed in Sect. 6.3. The main
practical consequence of no longer using least squares to fit models is that when
measuring goodness of fit, we no longer talk about residual sums of squares. Instead
we talk about deviance—basically, twice the difference in (log-)likelihood between
the fitted model and a perfect model (in which predicted values are exactly equal to
observed values).
Remember to set the family argument—if you forget it, glm defaults to fitting a
linear model (family=gaussian).
Not all distributions can be used with generalised linear models, but a few important
ones can, as listed in Table 10.1. A few important things to know about these follow.
The binomial can be used for any response that has two possible outcomes (a binary response), and also for "x-out-of-n" counts across n independent events. Three link functions are commonly used. The logit link, log(μ/(1 − μ)), which gives what is usually called logistic regression, is the most common. The probit, Φ⁻¹(μ), where Φ is the cumulative distribution function of the standard normal, is sometimes used for theoretical reasons. Sometimes the complementary log-log link is used, log(− log(1 − μ)); it is
it currently is. This link function is derived by assuming you had counts following a
Poisson log-linear model (see below) and converted them to presence–absence data
(following an argument first made in Fisher, 1922). The Poisson assumption may
be questionable, but this is the only common link function that was motivated by
thinking of data as arising from some underlying counting process (as they often are,
e.g. in David and Alistair’s crab data of Exercise 10.1).
Table 10.1: Common choices of distribution and suggested link functions g(μ) in
generalised linear models. Each distribution implies a particular mean–variance
assumption V(μ). The required family argument to use each of these in R is also
given
a But does not account for overdispersion (i.e. all individuals being counted need to be independent).
A Poisson model with a log link is often called a log-linear model, or sometimes
Poisson regression. The mean–variance assumption of the Poisson, V (μ) = μ, is quite
restrictive and will be violated if the individuals being counted are not independent
(e.g. they cluster) or if there are potentially important covariates missing from your
model. Note this assumption does not seem to work for Anthony’s revegetation data
(Fig. 10.5, left).
A negative binomial model with a log link is often called negative binomial
regression. It is a safer option for count data than Poisson regression because it
includes an overdispersion parameter (φ) to deal with unaccounted-for variation.
Strictly speaking, this is only a GLM if the overdispersion parameter (φ) is known
in advance, and for that reason it is not included as an option in the standard glm
function. It can, however, be fitted using manyglm in the mvabund package or using
glm.nb from the MASS package. Note this assumption seems to do quite a good job
for Anthony’s revegetation data (Fig. 10.5, right).
A Tweedie model is only a GLM when the power parameter p is known, so p needs
to be specified first to use glm to fit a model. Usually, you would try a few values of
p and check diagnostics or use model selection tools. This has two cool features for
ecologists. Firstly, it has a mass at zero while being continuous for positive values,
so it can do a good job when modelling biomass data (which will be zero if nothing
is observed, otherwise continuous over positive values). However, it doesn’t always
do a good job of modelling the zeros, so assumptions need to be checked carefully.
The second cool feature is that its mean–variance relationship is V(μ) = aμ p , known
as Taylor’s power law, and much ecological data seem to (approximately) follow it
(Taylor, 1961).
In the special case where you assume y is normal and use the identity link
function (i.e. no transformation on μ), GLMs reduce to linear models, as discussed
previously (Chaps. 2–4).
There are more distributions that you could use. I hardly ever do.
[Fig. 10.5: variance vs mean (log–log axes) for Anthony's revegetation data, by treatment (Control/Reveg), with the Poisson mean–variance assumption shown on the left and the negative binomial on the right.]
In principle you can use any link function with any distribution. But in practice, you should almost always use a link function that matches the range of allowable values
(or the “domain”) of the mean, e.g. Poisson means are always positive, so use a link
function that works for positive values only and maps them onto the whole number
line, e.g. the log function.
Note that for counts and biomass, we typically use a log link, which makes
our model multiplicative rather than additive, such that we talk about effects of
treatments as having a c-fold effect rather than an effect of increasing the mean
by c. This is usually a good idea because counts tend to make more sense (as a
first approximation) when thought of as an outcome of multiplicative rather than
additive processes. For example, we talk a lot about rates when studying abundance
(e.g. survival rates, fecundity rates), which suggests that we are thinking about
multiplying things together not adding them.
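For instance, with the Poisson log-linear model for eel counts fitted in Code Box 10.1, exponentiating coefficients gives the multiplicative (c-fold) effects on the mean (a usage sketch, assuming ft_eels from that code box):

# With a log link, effects are multiplicative on the mean scale:
# exp(coefficient) = the fold-change in mean abundance per unit change in that
# predictor (or relative to the baseline level, for factors).
exp(coef(ft_eels))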
Some refer to the linear model as a “general linear model”, GLM for short. This is a
model where we assume the response is normally distributed, with no mean–variance
relationship and no link function. This terminology is bad (and confusing) and we
can blame software packages like SAS for popularising it.
When people talk about GLMs, make sure you are clear what they are talking
about—do they mean a generalised linear model (non-normal response, mean–
variance relationship, and so forth), or are they just using an ordinary linear model?
If they don’t mention a family or link function, they are probably just using a linear
model.
It is sometimes said that a good model will have residual deviance similar in size
to the residual degrees of freedom. For example, in Code Box 10.2, the residual
deviance is not far from the residual degrees of freedom, so we might conclude that
at this stage there does not seem to be a problem with the model fit. But this is a very
rough rule (see section 4.4.3 of McCullagh and Nelder, 1989, for example), and it
won’t work if you have lots of zeros or small counts. We really should dig deeper
and look at some other diagnostic tools.
> data(seaweed)
> seaweed$CrabPres = seaweed$Crab>0
> seaweed$Dist = as.factor(seaweed$Dist)
> ft_crab = glm(CrabPres~Time*Dist, family=binomial("cloglog"),
data=seaweed)
> summary(ft_crab)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.3282 1.5046 -1.547 0.122
Time 0.1656 0.1741 0.952 0.341
Dist2 -1.5921 2.5709 -0.619 0.536
Dist10 -0.5843 2.1097 -0.277 0.782
Time:Dist2 0.1683 0.2899 0.581 0.561
Time:Dist10 0.1169 0.2399 0.487 0.626
Fig. 10.6: Residual plots aren’t as straightforward for a GLM. Here is the default resid-
ual plot from a glm fit in R to David and Alistair’s crab data using plot(ft_crab).
It looks super weird and is uninformative
Figure 10.6 looks really weird for a couple of reasons. The main problem is the
discreteness—there are two lines of points on the plot (for zeros and for ones), which
is a pattern that distracts us from the main game. The problem is with the idea of
a residual—it is not obvious how to define residuals appropriately for GLMs. Most
software (including R) chooses badly.
The right choice of residuals is to define them via the probability integral transform
(Maths Box 10.4), which we will call Dunn-Smyth residuals after Dunn and Smyth
(1996). These residuals have been around for some time, but it was only relatively
recently that their awesomeness was widely recognised.
where this last step, inverting F(·), is only allowed for continuous random
variables (since inversion requires functions to be one-to-one). Now we notice
that this has the form of a cumulative probability for Y , which is the definition
of F(·):

P(Y ≤ F⁻¹(q)) = F(F⁻¹(q)) = q

so Q = F(Y) is standard uniform. In the other direction, for a standard uniform Q,

Y = F⁻¹(Q) ∼ F    (10.6)
meaning that Y has distribution function F(y). So PIT works both ways—we
can use it to transform any variable to standard uniform and to transform
from the standard uniform to any distribution. We use both of these results to
construct Dunn-Smyth residuals (Fig. 10.7).
Φ(r) = u F(y) + (1 − u) F(y⁻)

for some standard uniform u, where y⁻ denotes the previous possible value of y (so F(y⁻) is the previous value of F(y)). The
right-hand side of this equation chooses a value at random between F(y) and
its previous value F(y − ) (Fig. 10.7). This makes the right-hand side standard
uniform when F(y) is the true distribution function of y, hence r is standard
normal if the model is correct (Fig. 10.7).
These residuals are not in most standard packages (yet), but you can use them
via the mvabund package by refitting your model using manyglm. See for example
Code Box 10.3, which makes use of the plotenvelope function to add simulation
envelopes for easier interpretation. You can also construct these residuals for many
common distributions using the qresid function in the statmod package in R (Dunn
and Smyth, 1996) or the DHARMa package (Hartig, 2020), which computes residuals
via simulation (but note it doesn’t transform to the standard normal by default).
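To demystify the construction, here is a small sketch (not the book's code) computing Dunn-Smyth residuals by hand for a simulated Poisson regression, following the recipe of Maths Box 10.4:

# Dunn-Smyth (randomised quantile) residuals "by hand" for a Poisson GLM:
# Phi(r) = u*F(y) + (1-u)*F(y-), with u standard uniform, F the fitted
# Poisson distribution function, and F(y-) its value at the previous count.
set.seed(1)
x = rnorm(100)
y = rpois(100, lambda = exp(0.5 + 0.8 * x))
fit = glm(y ~ x, family = poisson)
mu = fitted(fit)
Fy = ppois(y, lambda = mu)           # F(y)
FyMinus = ppois(y - 1, lambda = mu)  # F(y-); ppois(-1) = 0 handles the zeros
u = runif(length(y))
r = qnorm(u * Fy + (1 - u) * FyMinus)
plot(mu, r)   # should look like standard normal noise if the model is right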
Code Box 10.3: Dunn-Smyth residual plots for the crab data, using the
mvabund package.
> library(mvabund)
> ftMany_crab = manyglm(CrabPres~Time*Dist,family=binomial("cloglog"),
data=seaweed)
> plotenvelope(ftMany_crab, which=1)
This plot has a bit of a curve on it, but notice that it stays well within the simulation
envelope, so it is not doing anything unusual compared to what would happen if model
assumptions were satisfied. So we don’t have evidence that there is anything to worry about
here.
You can interpret a plot of Dunn-Smyth residuals pretty much like a residual plot
for linear models. Recall that for linear regression
• U shape =⇒ violation of straight-line assumption
• Fan shape =⇒ violation of variance assumption
Well, Dunn-Smyth plots work in much the same way:
• U shape =⇒ violation of linearity assumption
• Fan shape =⇒ violation of mean–variance assumption
(although you can get a bit of interaction between these two). As previously, the
best place to see a fan shape is often a scale-location plot, where it shows up as an
increasing trend (as in Code Box 10.4).
Dunn-Smyth residuals deal with the problem of discreteness through random
number generation—basically, jittering points. This means different plots will give
you slightly different residuals, as in Fig. 10.8. Plot Dunn-Smyth residuals more than
once to check if any pattern is “real”.
Fig. 10.7: Diagram illustrating how Dunn-Smyth residuals are calculated, for a
variable that can take values 0, 1, 2, . . .; hence its distribution function F(y) (blue,
left) has steps at these values. Dunn-Smyth residuals find values r that satisfy Φ(r) = uF(y) + (1 − u)F(y⁻) (= q) (Maths Box 10.4). For example, if y = 1, we take a value
at random between F(1) and its previous value, F(0), and use Φ(r) to map across
to a value for r. By the PIT (Maths Box 10.3), if F(y) were the true distribution
function of the data, quantile values q would be standard uniform (green, centre),
and r would be standard normal (red, right)
If data are highly discrete, the jittering introduces a lot of noise, and assumption violations can show up as relatively subtle patterns in these plots. Fitting a smoother to the data
is a good thing to do in this situation; also include a simulation envelope on where
you expect it to be when assumptions are satisfied, as is done by plotenvelope.
Fig. 10.8: Two Dunn-Smyth residual plots of David and Alistair’s crab data. Note
that the residual plot changes slightly due to jittering, as do the smoothers and their
simulation envelopes. If we see a pattern in a plot, it is worth plotting more than
once to check it is signal from the data rather than noise from the jittering!
Code Box 10.4: Assumption checking for ostracod counts of Exercise 10.4.
We will start with a Poisson model, including algal wet mass as a covariate. The mvabund
package will be used for model fitting to facilitate residual plots:
seaweed$logWmass = log(seaweed$Wmass)
ft_countOst=manyglm(Ost~logWmass+Time*Dist,data=seaweed,
family="poisson")
plotenvelope(ft_countOst,which=1:3) # for a scale-location plot as well
[Three diagnostic plots for the Poisson fit: residuals vs fitted values, a normal quantile plot, and a scale–location plot.]
Notice that there is a bit of a fan shape, and the quantile plot suggests that the residuals
are clearly steeper than the one-to-one line. The scale-location plot shows a clear increasing
trend in scale of residuals as fitted values increase beyond the simulation envelope (i.e.
beyond what would be expected if model assumptions were satisfied). These are signs
of overdispersion. Plotting again to double-check will show the same problems. Trying a
negative binomial regression to deal with the overdispersion:
ft_countOstNB=manyglm(Ost~logWmass+Time*Dist,data=seaweed,
family="negative.binomial")
plot(ft_countOstNB,which=1:2)
[Diagnostic plots for the negative binomial fit: residuals vs fitted values, a normal quantile plot, and a scale–location plot.]
And we see the fan shape and the large residuals have largely been dealt with.
Exercise 10.5: Checking the Poisson assumption on the wind farm data.
Recall the eel abundances from Lena’s wind farm survey. We used Poisson
regression as our initial model for eel abundance in Code Box 10.1.
Refit the model using the manyglm function, and hence construct a residual
plot.
Does the Poisson assumption look reasonable?
Exercise 10.6: Checking the Poisson assumption for the worm counts.
Recall the worm counts from Anthony’s revegetation survey. We would like to
know if we could use Poisson regression for worm abundance.
• Refit the model using the manyglm function under the Poisson assumption, and hence construct a residual plot.
• Also fit a negative binomial to the data and construct a residual plot. Can you see any differences between plots? Note it is hard to see differences because there are only two replicates in the control group.
• Compare BIC for the two models using the BIC function. Which model has the better fit to the worm counts?
The same techniques can be used to make inferences from GLMs as were used for LMs, with minor changes to the underlying mathematics. In R, all the same functions as for linear models work:
• Confidence intervals: use the confint function.
• Hypothesis testing: use the summary or anova function. The latter is generally better for hypothesis testing.
• Model selection: use stepAIC, AIC, or BIC, or use the predict function to predict to test data in cross-validation code.
The main hypothesis testing function for GLMs in R is anova. The term anova is a little misleading for GLMs—technically, what we get is an analysis of deviance table (Maths Box 10.5), not an analysis of variance (ANOVA). But in R the main testing function for GLMs is called anova anyway. For GLMs in R we have to tell anova what test statistic to use via the test argument (Code Box 10.5); otherwise it won't use any! There are a few different options for GLM test statistics.
The most common and usually the best way to test hypotheses concerning GLMs is to use a likelihood ratio test. A likelihood ratio statistic is computed by fitting models via maximum likelihood under the null and alternative hypotheses and seeing whether the difference in (log-)likelihoods is unusually large (usually by comparing to a chi-squared distribution to get a P-value). This essentially compares deviances of models under the null and alternative hypotheses, which is where the name “analysis of deviance” comes from, given that the procedure is closely analogous to how sums of squares are compared in order to construct an ANOVA table. The likelihood ratio statistic has some nice properties, including not being overly bothered by data with lots of zeros. In R, add the argument test="Chisq", as in Code Box 10.5. This works just as well as F-tests do for linear models at larger sample sizes, but for small samples it is only approximate, as will be discussed shortly. An F-statistic from an ANOVA, incidentally, is equivalent to a type of likelihood ratio statistic.
Code Box 10.5: R code using the anova function to test the key hypotheses
of interest to David and Alistair in Exercise 10.1.
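A sketch of the sort of call this box contains, for a GLM of the crab data; the response name CrabPres and the choice of family are assumptions here, not taken from the original box:
ft_crab = glm(CrabPres ~ Time * Dist, family="binomial", data=seaweed) # hypothetical names
anova(ft_crab, test="Chisq") # likelihood ratio test for each term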
Another type of test for GLMs is a Wald test, which is based on studying parameter estimates and seeing if they are far from what is expected under the null hypothesis. Specifically, we would usually compare β̂/se(β̂) to a standard normal distribution. This looks a lot like what a t-test does—a t-test is in fact a type of Wald test. The summary function by default uses a Wald test. For GLMs this is less accurate than likelihood
ratio tests, and occasionally, especially for logistic regression, it gives wacky answers,
ignoring a really strong effect when it is there (especially for “separable models”
with a “perfect” fit, for details see Væth, 1985).
Maths Box 10.6: Problems Wald statistics have with zero means
In Lena’s study, recall that no eels were caught in northern sites after wind
farms were constructed (Fig. 10.2), giving an estimated mean abundance of
zero. This is a problem for a Poisson or negative binomial regression, because
the mean model uses a log link, and the log of zero is undefined (−∞). R
usually returns a warning in this situation but didn’t for the following fit:
> data(windFarms)
> eels = windFarms$abund[,16]
> ft_eels = glm(eels~Year*Zone, family="poisson", data=windFarms$X)
> summary(ft_eels)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.4700 0.2582 -1.820 0.0687 .
Year2010 -0.2803 0.3542 -0.791 0.4288
The summary and anova tests are both approximate—they work well on large samples (well, summary can occasionally go a bit weird even on large samples), but both can be quite approximate when the sample size is small (say, less than 30). This is quite different from what happens for linear models—when assumptions are satisfied, F-tests and t-tests based on linear models are exact for any sample size, meaning that if a significance level of 0.05 is used, the actual chance of accidentally rejecting the null hypothesis when it is true (“Type I error rate”) is exactly 0.05. For a GLM,
however, even when all model assumptions are correct, in very small samples you might accidentally reject the null hypothesis twice as often as you think you will, i.e. you can have a Type I error rate of 0.1 when using a significance level of 0.05 (Ives, 2015). We can beat this problem exactly the same way that we beat it for mixed models—by generating P-values by simulation (Warton et al., 2016).
One of the simplest ways to do design-based inference for GLMs is to use the
mvabund package, which uses resampling by default whenever you call summary or
anova for a manyglm object. This sort of approach is really needed for Anthony’s
data, where he has a very small and unbalanced sample (Code Box 10.6), but it isn’t
really needed for David and Alistair’s crabs (Code Box 10.7), where the sample size
is already moderately large, so it is no surprise that their results come back almost
identical to those from the anova call to glm (Code Box 10.5).
Code Box 10.6: Design-based inference for Anthony's worm counts from Exercise 10.3.
> ftmany_Hap=manyglm(Haplotaxida~treatment,family="negative.binomial",
data=reveg)
> anova(ftmany_Hap)
Time elapsed: 0 hr 0 min 0 sec
Analysis of Deviance Table
Multivariate test:
Res.Df Df.diff Dev Pr(>Dev)
(Intercept) 9
treatment 8 1 2.811 0.173
Arguments: P-value calculated using 999 resampling iterations via
PIT-trap resampling.
Any evidence of an effect of revegetation on worms?
The default resampling method in the mvabund package is a new method called
the PIT-trap—a residual resampling method that bootstraps residuals computed
via the probability integral transform (Warton et al., 2017). Basically, it bootstraps
Dunn-Smyth residuals. Previously, the only serious method available for resampling
GLMs was the parametric bootstrap. Recall that we have already met the parametric
bootstrap; we used it in Sect. 6.5 as a technique for making accurate inferences about
linear mixed models. You can use the parametric bootstrap in mvabund by setting
resamp="monte.carlo" in the anova or summary call. Recall that the parametric
bootstrap uses model-based inference, so one might expect it to be more sensitive
to violations of model assumptions than the PIT-trap (which, for instance, is close to exact in an example like Code Box 10.6 irrespective of model assumptions).
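For example, re-running the test of Code Box 10.6 with the parametric bootstrap instead of the PIT-trap is just a matter of changing the resampling argument (a sketch):
anova(ftmany_Hap, resamp="monte.carlo") # parametric bootstrap rather than the PIT-trap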
Code Box 10.7: Design-based inference for David and Alistair’s crab data
using mvabund.
Multivariate test:
Res.Df Df.diff Dev Pr(>Dev)
(Intercept) 56
Time 55 1 6.670 0.011 *
Dist 53 2 1.026 0.615
Time:Dist 51 2 0.401 0.869
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Arguments: P-value calculated using 999 resampling iterations via
PIT-trap resampling.
Any important differences from results David and Alistair previously obtained in Code
Box 10.5?
It is critical to consider overdispersion when fitting a GLM—if your data are overdis-
persed, and you don’t account for this, you can get things spectacularly wrong. Con-
sider for example an analysis of Anthony’s worm data using the Poisson distribution
(Code Box 10.8). Suddenly there is strongly significant evidence of a treatment ef-
fect, which we didn’t get with a negative binomial model (Code Box 10.6). A plot of
the raw data is not consistent with such a small P-value either—there is a suggestion
of a trend, but nothing so strongly significant. The main problem is that the data
are overdispersed compared to the Poisson distribution, meaning that the variance
assumed by the Poisson model underestimates how variable replicates actually are.
Hence standard errors are underestimated and P-values are too small, and we have
false confidence in our results—the key thing we are trying to guard against when
making inferences from data.
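A sketch of the sort of Poisson analysis being discussed, reusing the reveg data and Haplotaxida response of Code Box 10.6 (model-based inference via anova.glm; not necessarily the exact code of Code Box 10.8):
ft_hapPois = glm(Haplotaxida ~ treatment, family="poisson", data=reveg)
anova(ft_hapPois, test="Chisq") # P-value will be far too small if counts are overdispersed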
Resampling can offer some protection against the effects of overdispersion that is not captured by our model. For example, if we had used resampling, we would not have seen
a significant effect in Anthony’s worm data (Code Box 10.8, bottom). However, the
best protection is to mind your Ps and Qs and, hence, try to use the right model
for your data! In the case of Anthony’s worm data, a better model was a negative
binomial regression.
A common cause of overdispersion is missing predictors. Recall that in linear
models, if an important predictor is left out of the model, this increases the error
variance. Well, the same happens in a GLM, with missing predictors increasing the
variance compared to what might otherwise be expected, which has implications for
the mean–variance relationship. Any time you are modelling counts, and your model
for them is imperfect, you can expect overdispersion. This will actually happen most
of the time. The mvabund package was written initially with count data in mind, and
the default family for the manyglm function was chosen to be the negative binomial
precisely because we should expect overdispersion in most models for counts.
The issue of overdispersion can also arise in models assuming a binomial distribu-
tion, but only when measurements are taken as counts across clusters of observations,
e.g. if counting the number of seeds in a Petri dish that germinate. Overdispersion
arises here because of missing predictors across clusters, e.g. sources of variation in
germination rate across Petri dishes that have not been accounted for. This is most
easily handled in a slightly different way—using a mixed model with an observation-
level random effect, i.e. adding a random effect to the model that takes different values
for different observations, as illustrated later in Code Box 10.9. This random effect
plays the role of a missing predictor. The same technique could also be used for
counts, instead of using a negative binomial model. For binary responses, such as
presence–absence data, observations fall exactly along a quadratic mean–variance
relationship (as in Fig. 10.1b), and there is no possibility of overdispersion. This
means that you cannot include an observation-level random effect in a model for
presence–absence; there is not enough information in the data to estimate it.
Code Box 10.9: Using an observation-level random effect to handle overdispersion in seed germination counts.
Note that repeat experiments at the same temperature have different germination rates (probably due to factors not controlled for by the experimenter). We can account for this using an observation-level random effect:
> library(lme4)
> seedsTemp$ID = 1:length(seedsTemp$NumGerm)
> ft_temp = glmer(cbind(NumGerm,NumSown-NumGerm)~poly(Test.Temp,2)+
(1|ID),data=seedsTemp,family="binomial")
> summary(ft_temp)
Random effects:
Groups Name Variance Std.Dev.
ID (Intercept) 2.961 1.721
Number of obs: 29, groups: ID, 29
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.5896 0.3428 -1.720 0.0854 .
poly(Test.Temp, 2)1 2.8770 1.9395 1.483 0.1380
poly(Test.Temp, 2)2 -3.6910 1.9227 -1.920 0.0549 .
Note that the random effect had a standard deviation of 1.7, which is relatively large (on the
logit scale), indicating a lot of trial-to-trial variation in germination rate.
10.5 Don't Standardise Counts, Use Offsets!
Code Box 10.10: Adding an offset to the model for worm counts
Because we are modelling the log of the mean, we add an offset for log(# pitfalls):
> ftmany_hapoffset = manyglm(Haplotaxida~treatment+offset(log(pitfalls)),
family="negative.binomial", data=reveg)
> anova(ftmany_hapoffset)
Time elapsed: 0 hr 0 min 0 sec
Analysis of Deviance Table
Multivariate test:
Res.Df Df.diff Dev Pr(>Dev)
(Intercept) 9
treatment 8 1 2.889 0.15
Arguments: P-value calculated using 999 resampling iterations via
PIT-trap resampling.
Why didn’t we hear about offsets for linear models? Because linear models are
additive, so if you had an offset, you could just subtract it from y before fitting
the model. That is, in a linear model for log(Haplotaxida), to include an offset
for log(pitfalls) you would model log(Haplotaxida)-log(pitfalls) as a
function of treatment. But that trick wouldn’t work for a GLM.
10.6 Extensions
The problem with something called “generalised” is what do you call generalisations
of it? There are a few important additional features and extensions of GLMs worth
knowing about.
Ecological counts often have many zeros. Consider Anthony’s cockroach counts
(Exercise 10.10). One option is to use a zero-inflated model—a model that expects
more zeros than a standard model, e.g. more than expected under a Poisson or
negative binomial regression. These models tend to be fitted by separately studying
the question of when you observe a non-zero count and the question of how large the
count is. For details on how to fit such models, see the VGAM package (Yee, 2010) or
the glmmTMB package (Brooks et al., 2017), both in R.
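As a sketch of what fitting such a model can look like in glmmTMB, the ziformula argument specifies the model for the excess zeros (the data and variable names here are hypothetical, not from the text):
library(glmmTMB)
# y: a count response; x: a predictor; dat: a data frame (all hypothetical)
ft_zinb = glmmTMB(y ~ x, ziformula=~1, family=nbinom2, data=dat)
summary(ft_zinb)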
Zero-inflated models are quite widely used, which may in part be because of a
misunderstanding about when they are actually needed (Warton, 2005). Do you have
reason to believe that the process behind presence–absence patterns is distinct from
the process behind abundance? If so, a zero-inflated regression model is a good way
to approach the problem. If not, and you just have lots of zeros in your dataset, then
that in itself is not a good reason to use a zero-inflated regression model.
The confusion here comes from the fact that we are communicating in English,
not in maths—the term “zero-inflated” sounds like it applies to any dataset with lots
of zeros. But a more common reason for getting lots of zeros is that you are counting
something that is rare! (Its mean is small.) For example, a Poisson distribution
with μ = 0.1 expects 90% of values to be zero. Also recall that regression models
make assumptions about the conditional distribution of the response variable, after
including predictors. These predictors might be able to explain many of the zeros in
the data (as in Exercise 10.10). So observing many zeros is not on its own a good
reason for using a zero-inflated model.
Recall that sometimes a design includes a random factor (e.g. nested design), so
what is needed is a mixed model (Chap. 6). GLMs (the glm and manyglm functions)
only handle fixed effects. If you have random effects and a non-constant assumed
mean–variance relationship, then you want to fit a generalised linear mixed model
(GLMM). A good package for this, as in the linear mixed effects case, is the lme4
package (Bates et al., 2015). The glmmTMB package (Brooks et al., 2017) is also a
good choice—it was written to behave similarly to lme4 but tends to be more stable
for complex models and has more options for the family argument, including some
zero-inflated distributions.
There aren't really any new tricks when fitting a GLMM—just use the glmer function as you would normally use lmer, but be sure to add a family argument. For the negative binomial family, try nbinom2 in glmmTMB.
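A sketch of what this looks like, with hypothetical variable names (count, treatment, site, dat):
library(lme4)
ft_glmm = glmer(count ~ treatment + (1|site), family="poisson", data=dat)
library(glmmTMB) # negative binomial version of the same model
ft_glmmNB = glmmTMB(count ~ treatment + (1|site), family=nbinom2, data=dat)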
Current limitations:
• No nice residual plots. This time, the qresiduals function won't help us out either. However, the DHARMa package (Hartig, 2020) can approximate these residuals via simulation. (Note that DHARMa returns residuals on the standard uniform scale; they need to be mapped to the standard normal, as in the sketch after this list.)
• GLMMs can take much longer to fit, and even then they only give approximate
answers. The mathematics of GLMMs are way harder than anything else we’ve
seen so far and get computationally challenging deceptively quickly (in particular,
when the number of random effects gets large).
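As flagged in the first point above, DHARMa's simulation-based residuals can be put on the normal scale along the following lines (a sketch; ft stands for an assumed GLMM fit, and the same idea is used later in Code Box 11.8):
library(DHARMa)
res_unif = simulateResiduals(ft)$scaledResiduals # residuals on the standard uniform scale
res_norm = qnorm(res_unif) # map to the standard normal scale for residual plots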
Maximum likelihood estimation is commonly used to fit GLMMs, but a difficulty is that the likelihood can't be computed directly in most cases and needs to be approximated (by default, this is done using a Laplace approximation in lme4 and glmmTMB). In lme4, for simple models there is an optional argument nAGQ that you can use to get a better approximation to the likelihood (e.g. nAGQ=4) via adaptive Gauss-Hermite quadrature. This requires more computation time and gives slightly better approximations to the likelihood and standard errors. Feel free to experiment with this to get a sense of the situations where it might make a difference (typically, small sample sizes or many random effects).
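In lme4 this is just an extra argument to glmer, as in this sketch (hypothetical model as before; note that in lme4, values of nAGQ greater than one are only available for models with a single scalar random effect):
ft_agq = glmer(count ~ treatment + (1|site), family="poisson", data=dat, nAGQ=4)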
10.6.4 Summary
Discrete data commonly arise in ecology, and when the data are highly discrete (e.g.
a lot of zeros), there will be a mean–variance relationship that needs to be accounted
for in modelling. The GLM is the simplest way to analyse such data, but there are a
number of other options, some of which were discussed earlier in this chapter. Most
of the remainder of this text will focus on various extensions of GLMs to handle
important problems in ecology, such as multivariate abundances (Chap. 14).
Part II
Regression Analysis for Multiple Response Variables
Chapter 11
More Than One Response Variable:
Multivariate Analysis
Recall that the type of regression model you use is determined mostly by the proper-
ties of the response variable. Well what if you have more than one response variable?
A multivariate analysis is appropriate when your research question implies more
than one response variable—usually, because the quantity you are interested in is
characterised by more than one variable. For example, in Exercise 11.1, Ian is
interested in “leaf economics”, which he quantifies jointly using leaf longevity and
leaf mass per area. In Exercise 11.2, Edgar is interested in flower size/shape, which he
characterised using four length measurements. In Exercise 11.3, Petrus is interested
in a hunting spider community, which he quantified using abundance of organisms in
three different taxonomic groups. These last two examples are quite common places
where multivariate analysis is used in ecology—when studying size and shape or
when studying communities or assemblages as a whole.
What are the response variables? What sort of analysis is appropriate here?
What are the response variables? What sort of analysis is appropriate here?
What are the response variables? What type of analysis is appropriate here?
Key Point
multivariate = many response variables
Multivariate analysis involves simultaneously treating multiple variables as
the response in order to characterise some idea that cannot be captured ef-
fectively using one response variable. Everything is a lot harder in multi-
variate analysis—data visualisation, interpretations, checking assumptions,
analysis—so it is worth thinking carefully if you really need to do a multivari-
ate analysis in the first place!
The first thing you should know is that multivariate analysis is much more challenging
than univariate analysis, for a number of reasons:
Data visualisation Our graphs are limited to using just two (or maybe three) di-
mensions to visualise data. This means we can only look at responses one at a
time, two at a time, or maybe three, and if there are any higher-order patterns in
correlation, we will have a hard time visualising them in a graph.
Interpretation More variables means there is a lot going on, and it is hard work to
get a simple answer, if there is one.
Assumption violations More response variables means more ways model assump-
tions can be violated.
Fig. 11.1: A simulated dataset illustrating a situation where you could miss the
structure in the data by analysing correlated variables one at a time. The scatterplot
shows two samples (in different colours) of two response variables, Y1 and Y2 . A
multivariate analysis would readily find a difference between these two samples,
because at any given value of Y1 , the red points tend to have larger values of Y2 .
However, if analysing the variables one at a time (as in the boxplots in the margins),
it is harder to see a difference between samples, with shifts in each mean response
being small relative to within-sample variation. The specific issue here that makes it
hard to see the treatment effect marginally is that it occurs in a direction perpendicular
to the main axis of variation for the Y variables
y ~ N(μ, Σ)
μ_j = β_{0j} + x^T β_j
where y is a column vector of responses and μ is the vector of its means, whose jth entry is μ_j.
The multivariate linear model can be broken down into the following assumptions:
1. The observed y-values are independent across observations (after conditioning on x).
2. The y-values are multivariate normally distributed with constant variance–covariance matrix:
   y ~ N(μ, Σ)
3. There is a straight-line relationship between the mean of each y_j and each x:
   μ_j = β_{0j} + x^T β_j
Fitting a multivariate linear model using R is pretty easy—the only thing you have
to do differently to the univariate case is to make sure your response variables are
stored together in a matrix (which can be created using cbind). See for example
Code Box 11.2.
As usual, I recommend checking residual plots for no pattern. In R, the plot func-
tion doesn’t work for multivariate linear models, but I have made sure plotenvelope
works for this case, as in Code Box 11.3.
11.2.1 Variance-co-What?
For a linear combination Ay of a response vector y with variance–covariance matrix Σ, the same rule applies as in the univariate case, V(Ay) = AΣA^T, but note that A is a matrix now rather than a constant, and it describes linear functions of the responses y. So this rule tells us what happens when responses are summed as well as rescaled (unless A is a diagonal matrix, which just rescales).
Code Box 11.2: Fitting a multivariate linear model to the leaf economics
data
> library(smatr)
> data(leaflife)
> Yleaf = cbind(leaflife$lma,leaflife$longev)
> ft_leaf = lm(Yleaf~rain*soilp, data=leaflife)
> anova(ft_leaf, test="Wilks")
Analysis of Variance Table
Df Wilks approx F num Df den Df Pr(>F)
(Intercept) 1 0.11107 248.096 2 62 < 2.2e-16 ***
rain 1 0.68723 14.108 2 62 8.917e-06 ***
soilp 1 0.93478 2.163 2 62 0.1236
rain:soilp 1 0.95093 1.600 2 62 0.2102
Residuals 63
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Any evidence of an effect of rainfall or soil nutrients?
With just two response variables, we can use a scatterplot to visualise the data:
plot(leaflife$lma~leaflife$longev, xlab="Leaf longevity (years)",
ylab="Leaf mass per area (mg/mm^2)",
col=interaction(leaflife$rain,leaflife$soilp))
legend("bottomright",legend=c("high rain, high soilp",
"low rain, high soilp", "high rain, low soilp",
"low rain, low soilp"), col=1:4, pch=1)
[Scatterplot of leaf mass per area (mg/mm^2) against leaf longevity (years), with points coloured by rainfall and soil nutrient group as in the legend]
Code Box 11.3: Checking multivariate linear model assumptions for leaf
economics data
Multivariate normality implies assumptions will be satisfied for a linear model of each
response against all other variables (against x and other y), which is a good way to check
assumptions. The plotenvelope function constructs residuals and fitted values from this
full conditional model.
par(mfrow=c(1,3),mar=c(3,3,1.5,0.5),mgp=c(1.75,0.75,0))
library(ecostats)
plotenvelope(ft_leaf,which=1:3)
[Three diagnostic plots from plotenvelope: residuals vs fitted values, normal quantile plot, and scale-location plot]
Maths Box 11.2: Having many responses creates many problems for the
variance–covariance matrix
There is a covariance parameter for every possible pairwise combination of responses, giving p(p − 1)/2 in total (in addition to the p variance parameters). So the number of parameters in Σ grows quickly as p increases, quadratically rather than linearly:
# parameters in:
p μ Σ
2 2 3
4 4 10
10 10 55
20 20 210
100 100 5050
When p gets beyond about 10, there are an awful lot of parameters to estimate
in Σ, which is an important challenge that needs to be addressed in building a
multivariate model. In such situations the methods of this chapter are typically
not appropriate, and we need to make extra assumptions on Σ so it can be
estimated using fewer parameters.
Fig. 11.2: (a) A contour map of the probability density for an example multivariate
normal distribution. Values of y1 and y2 are more likely to be observed when the
probability density is higher (lighter regions). Note that contours of equal probability
density are ellipses. (b) A sample of size 500 from this distribution. Note that there
tend to be more points where the probability density is higher, as expected. Note
also that multivariate normality implies that each response, conditional on all others,
satisfies the assumptions of a linear model
There are different types of test statistics for multivariate linear modelling. The four most common, all available via the anova function, are Wilks' lambda (the likelihood ratio statistic), the Hotelling-Lawley statistic, the Pillai-Bartlett trace, and Roy's greatest root (using test="Wilks", "Hotelling-Lawley", "Pillai", or "Roy", respectively). For details on how these are constructed, see Anderson (2003). Some would argue for using more than one of these statistics to confirm a null result, because each is good for slightly different things and will detect slightly different types of violations of the null hypothesis; however, one must be careful with this approach to avoid “searching for significance” (Sect. 1.4.3). The most commonly used of these statistics is probably Wilks' lambda, which is a likelihood ratio statistic, an approach we saw earlier and that is commonly used in other contexts (e.g. mixed models, generalised linear models). Roy's greatest root is the statistic that behaves most differently from the others (it looks for the single axis along which there is the largest effect), so it would be another one to check if going for multiple test statistics.
In one way or another, all of these statistics can be thought of as generalisations of the classical F-statistic from univariate linear modelling.
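For instance, the anova call of Code Box 11.2 can be re-run with a different statistic simply by changing the test argument (a sketch):
anova(ft_leaf, test="Pillai") # Pillai-Bartlett trace
anova(ft_leaf, test="Roy") # Roy's greatest root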
Code Box 11.2 fits a multivariate linear model to Ian’s leaf economics data and
tests for an effect of environment on leaf economics traits, finding some evidence of
an association with rainfall. We cannot tell, however, from the multivariate output
which responses are associated with rainfall, or how.
It is always important to visualise your data and, in particular, to try to use plots of
raw data to visualise effects seen in a statistical analysis. In Ian’s case, because there
are only two responses and categorical predictors, it is straightforward to visualise
the data in a scatterplot (Code Box 11.2). We can see on the plot that the leaf variables
are positively correlated, and points at high rainfall sites tend to fall slightly below
the main axis along which these variables covary. Thus, at high rainfall sites, leaves
of a given leaf mass per area tend to live longer (or, equivalently, leaves with a given
lifespan tend to have lower leaf mass per area).
The mvabund package can also fit multivariate linear models, via the manylm func-
tion. The main differences:
• It uses (residual) resampling for design-based inference. Rows of residuals are
resampled, which ensures correlation across responses is accounted for in testing
(by preserving it in resampled data).
• It uses a statistic that takes into account correlation across responses, as in the
usual multivariate linear model, if you use the argument cor.type="R".
• The test statistics are called different things—the Hotelling-Lawley statistic corresponds to test="F", and Wilks' lambda to test="LR" (likelihood ratio).
Using cor.type="R", test="LR" is equivalent to Wilks' test (even though the value of the test statistic is different). An example of this is in Code Box 11.4.
Code Box 11.4: A multivariate linear model for the leaf economics data
using mvabund
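A sketch of the sort of call intended here, reusing Yleaf and leaflife from Code Box 11.2 (the exact arguments used in the original box may differ):
library(mvabund)
ft_leafMany = manylm(Yleaf ~ rain * soilp, data=leaflife)
anova(ft_leafMany, cor.type="R", test="LR") # resampling-based test accounting for correlation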
Note that irrespective of whether the lm or manylm function was used, the results
worked out pretty much the same. This is not unexpected; results should be similar
unless something goes wrong with model assumptions. The anova function for
multivariate linear models fitted using lm is model-based and uses large-sample
approximations to get the P-value. This sometimes goes wrong and can be fixed
using design-based inference (as in manylm), in two situations:
• If there are problems with the normality assumption (as usual, this is only really
an issue when sample size is small).
• If you have a decent number of response variables compared to the number of
observations.
The second of these points is particularly important. Testing via lm works badly
(if at all) when there are many response variables, even when data are multivariate
normal, but manylm holds up better in this situation. (But if there are many variables,
try cor.type="shrink" or cor.type="I" for a statistic that has better properties
in this difficult situation, as in Warton (2008).) In the data for Exercise 11.1, analysed
in Code Box 11.2, Ian had 67 observations and only 2 response variables, so there
were no issues with the number of response variables. Any lack of normality also
seemed to make no difference, which is unsurprising since the sample size was
reasonably large (hence the central limit theorem provided robustness to violations
of distributional assumptions).
Consider again Petrus’s spider counts (Exercise 11.3). Notice that the response
variables are counts, so we should be thinking of using some multivariate extension of
generalised linear models (GLMs) rather than of linear models (LMs). It turns out that
multivariate extensions of GLMs are much harder than for LMs. The main difficulty is
that, unlike the normal distribution, discrete distributions (like the Poisson, binomial,
or negative binomial) do not have natural multivariate generalisations. Multivariate
discrete distributions certainly do exist (Johnson et al., 1997), but they are often quite
inflexible, e.g. some can only handle positive correlations (by assuming a common
underlying counting process).
The most common solution is to use a type of hierarchical GLM, specifically,
a GLM that includes in the linear predictor a multivariate normal random effect to
induce correlation across responses. This is similar to the strategy used in Chap. 7 to
introduce structured dependence in discrete data. However, this approach will work
only if you have loads of observations and few responses.
The hierarchical GLM will make the following assumptions:
1. The observed y_ij-values (i for observation/replicate, j for variable) are independent, conditional on their mean m_ij.1
2. (“Data model”) Conditional on their mean m_ij, the y_ij-values come from a known distribution (from the exponential family) with a known mean–variance relationship V(m_ij).
3. (“Process model”) There is a straight-line relationship between some known function of the mean of y_ij and each x_i, with an error term ε_ij:
   g(m_ij) = β_{0j} + x_i^T β_j + ε_ij
4. The process model errors ε_ij are multivariate normal in distribution; that is, for ε_i = (ε_i1, . . . , ε_ip):
   ε_i ~ MVN(0, Σ)
This type of model has been referred to in some parts of ecology as a joint species
distribution model (Clark et al., 2014; Ovaskainen and Abrego, 2020). Statisticians
often refer to this as a hierarchical model (Cressie et al., 2009) because it can be
thought of as a hierarchy with two levels (or sometimes more)—a process model for
g(mi j ), and a data model for yi j conditional on mi j . Random variables are present
in both levels of the model.
1 The mean is written m_ij here, not μ_ij, because in a hierarchical model, it is a random quantity, not a fixed parameter.
The process model errors ε_ij in a hierarchical GLM have two main consequences—changing the marginal distribution of the data (in particular, introducing overdispersion) and inducing correlation, as in Fig. 11.3 or Maths Box 11.4.
Fig. 11.3: In a hierarchical model, correlated terms m_ij in the process model (a) have consequences for the marginal (b, c) and joint (d) distributions of observed data y_ij. The data model takes values of m_ij, such as the blue and red points in (a), and adds noise to them to form the blue and red histograms in (b) and (c) and the blue and red patches in (d). The marginal distribution of y_ij (grey histograms in b and c) is more variable (“overdispersed”) than the distribution specified by the data model (blue or red histograms in b and c), with a potentially different shape, because the m_ij in the process model vary (a). The joint distribution of the y_ij (d) is correlated because the m_ij in the process model are correlated (a). Note that the correlation in the observed data (y_ij) is weaker because of the extra noise introduced by the data model, which maps points in m_ij to patches in y_ij.
This is the marginal distribution for Y_i and can be quite different from the conditional distribution specified in the data model. For example, except for a few special cases, Y_ij is not a member of the exponential family.
In fact, f_{Y_i}(y_i) often does not even have a closed form, in which case it can't be written any more simply than it was above. We would then need to use numerical integration (e.g. the trapezoidal rule, Davis and Rabinowitz, 2007, Chapter 2) to calculate values of it. This makes model fitting more difficult, because it is suddenly a lot harder to calculate the likelihood function, let alone maximise it, or (in a Bayesian approach, Clark, 2007, Chapter 4) use it to estimate a posterior distribution. We typically proceed by approximating f_{Y_i}(y_i) in some way, using techniques like Laplace approximation or Monte Carlo integration (Evans and Swartz, 2000, Chapters 4 and 7, respectively).
where μ_Y(h(Y)) denotes “take the mean of h(Y) with respect to Y”. Sometimes this can be simplified, e.g. if the data model is a Poisson log-linear model, then
μ_{Y_ij} = μ_{M_ij}(e^{M_ij}) = e^{β_{0j} + x_i^T β_j + σ_j^2/2}
Notice the σ_j^2/2 bit, which means that if we ignored the random effect and made predictions using the fixed effects alone, e^{β_{0j} + x_i^T β_j}, we would underestimate the mean of Y_ij. This is reminiscent of the bias that arises when transforming data to satisfy model assumptions (Section 1.6, see also Diggle et al., 2002, Section 7.4). The marginal variance of Y_ij can be shown to satisfy
V(Y_ij) = μ_{Y_ij} + (e^{σ_j^2} − 1) μ_{Y_ij}^2
OK, I skipped a few steps. But note this ends up having the same form as the mean–variance relationship of the negative binomial distribution, where the overdispersion parameter is now e^{σ_j^2} − 1. So when using an observation-level random effect in a model for overdispersed counts, you can often stick with the Poisson and let the process model deal with overdispersion.
The marginal covariance of Y_ij and Y_ij′ can be calculated similarly, and for a Poisson log-linear data model it ends up being
Cov(Y_ij, Y_ij′) = (e^{σ_jj′} − 1) μ_{Y_ij} μ_{Y_ij′}
Compared to the variance expression, this is missing the first term, which can make a big difference to the strength of the correlation induced in the data. For example, if the mean counts are one and the M_ij are perfectly correlated with variance 0.25 (σ_j^2 = σ_j′^2 = σ_jj′ = 0.25), the correlation between data values Y_ij and Y_ij′ works out to be approximately 0.22, even though the correlation between M_ij and M_ij′ is one!
The first consequence of the process model errors ε_ij in a hierarchical model is that they change the marginal distribution of the responses (Fig. 11.3b–c) from what you might expect based on the data model. For example, if the responses are counts assumed to be conditionally Poisson in the data model, they will not be marginally Poisson; they will have what is known as a compound Poisson-lognormal distribution. As well as changing the marginal distribution, the interpretation of parameters also changes, making hierarchical models harder to interpret (unless you are happy conditioning on the errors in the process model).
The most striking change in the marginal distribution is that it is more overdispersed than the conditional distribution that was specified in the data model. Note for example in Fig. 11.3b–c that the grey histograms, representing the marginal distribution, have much larger variances than either of the two conditional distributions (red and blue histograms) generated by the data model around particular values of the process model error (corresponding to the red and blue points in Fig. 11.3a).
Mathematically, a hierarchical model with Poisson conditional counts (for example) no longer has a mean–variance relationship of V(y) = μ. Instead, the marginal mean–variance relationship becomes V(y) = μ + φμ², where φ is a function of the variances of the ε_ij (specifically, φ_j = e^{σ_j^2} − 1, where σ_j^2 is the jth variance parameter in Σ). The Poisson-lognormal distribution behaves a lot like the negative binomial (e.g. it has the same mean–variance relationship), so in a hierarchical model for overdispersed counts you might not need to use the negative binomial; you can often get away with a Poisson and let the ε_ij absorb any overdispersion.
This overdispersion actually becomes a problem when modelling binary responses
(e.g. presence–absence), because there is no information in a binary response that
can be used to estimate overdispersion. In this situation, we need to fix the variances
of random effects in the process models to constants. Typically they are fixed to the
value one, but the precise value doesn’t actually matter, as long as it is larger than
the covariances in the data.
The interpretation of marginal effects is more difficult in hierarchical models
because parameters in the process model quantify effects on the conditional mean,
not effects on the marginal mean, which can be affected in surprising ways. For
example, in a hierarchical Poisson log-linear model, the predicted values of m_ij consistently underestimate the actual (marginal) mean count (by a factor of e^{σ_j^2/2}). Fortunately, slope coefficients can still be interpreted in the usual way in the Poisson log-linear case, e.g. if the slope is two, then the marginal mean changes by a factor of e^2 when the x variable changes by one unit. However, the effects on marginal
means of slope coefficients are not so easily understood in other types of hierarchical
model, e.g. logistic regression, where a given change in the linear predictor can have
quite different implications for the marginal mean (for more details see Diggle et al.,
2002, Section 7.4).
The second and perhaps most important consequence of the ε_ij is that they induce correlation between responses. Note, however, that the correlation between responses is much weaker than the correlation between the ε_ij, because the data model introduces random noise that weakens the signal from the process model. So, for example, the ε_ij in Fig. 11.3a are highly correlated, but the observed data in Fig. 11.3d have a weaker correlation, because the data model (conditional on the ε_ij) assumes counts are independent for each response variable. Thus, points in the process model (Fig. 11.3a) map to patches of potential points in the data (Fig. 11.3d) because of noise from the data model.
Hierarchical GLMs are much more difficult to fit than multivariate linear models
(Maths Box 11.3), with a lot more things that can go wrong. They can be fitted using
conventional mixed modelling software like lme4 or glmmTMB (Brooks et al., 2017,
Code Box 11.6), but not so easily if responses are binary, where the variances in the
process model need to be fixed to constants. As previously, the glmmTMB package is
recommended in preference to lme4; it is noticeably faster and more stable for these
sorts of models. Both require data in “long format” as in Code Box 11.5.
Alternative software, which is capable of fitting hierarchical models to binary
data, is the MCMCglmm package (Hadfield et al., 2010, Code Box 11.7). This package
is so named because it uses a Bayesian approach to model fitting via Markov chain
Monte Carlo (MCMC). It accepts data in so-called short format.
Code Box 11.5: Preparing spider data for analysis on lme4 or glmmTMB
Petrus’s data from Exercise 11.3 are available in the mvabund package, but with abundances
for 12 different species. First we will calculate the abundance of the three most abundant
genera:
> library(mvabund)
> library(reshape2)
> data(spider)
> Alop=apply(spider$abund[,1:3],1,sum)
> Pard=apply(spider$abund[,7:10],1,sum)
> Troc = spider$abund[,11]
> spidGeneraWide = data.frame(rows=1:28,scale(spider$x[,c(1,4)]),
Alop,Pard,Troc)
> head(spidGeneraWide)
rows soil.dry moss Alop Pard Troc
1 1 -0.1720862 0.6289186 35 117 57
2 2 0.7146218 -0.6870394 2 54 65
3 3 0.1062154 0.1916410 37 93 66
4 4 0.2507444 0.1916410 8 131 86
5 5 0.6728333 -1.4299919 21 214 91
The data are in short format, with observations in rows and different responses in different
columns; we need to rearrange them into long format:
> spiderGeneraLong = melt(spidGeneraWide,id=c("rows","soil.dry","moss"))
> names(spiderGeneraLong)[4:5] = c("genus","abundance")
> head(spiderGeneraLong)
rows soil.dry moss genus abundance
1 1 -0.1720862 0.6289186 Alop 35
2 2 0.7146218 -0.6870394 Alop 2
3 3 0.1062154 0.1916410 Alop 37
4 4 0.2507444 0.1916410 Alop 8
5 5 0.6728333 -1.4299919 Alop 21
6 6 1.1247181 0.1916410 Alop 6
spiderGeneraLong is ready for model fitting using lme4 or glmmTMB.
We need to have many more rows than columns when fitting the hierarchical GLM
described previously. One reason, as before, is because the size of the variance–
Code Box 11.6: Fitting a hierarchical GLM to Petrus's spider data using glmmTMB.
> library(glmmTMB)
> spid_glmm = glmmTMB(abundance~genus+soil.dry:genus+moss:genus
+(0+genus|rows), family="poisson",data=spiderGeneraLong)
> summary(spid_glmm)
Random effects:
 Groups Name      Variance Std.Dev. Corr
 rows   genusAlop 1.3576   1.1652
        genusPard 1.5900   1.2610   0.75
        genusTroc 0.7682   0.8765   0.59 0.82
Number of obs: 84, groups: rows, 28
Conditional model:
                    Estimate Std. Error z value Pr(>|z|)
(Intercept)           1.9843     0.2463   8.058 7.76e-16 ***
genusPard             1.0846     0.2100   5.165 2.41e-07 ***
genusTroc             0.6072     0.2335   2.600  0.00932 **
genusAlop:soil.dry   -0.6012     0.3284  -1.831  0.06716 .
genusPard:soil.dry    1.4041     0.3704   3.791  0.00015 ***
genusTroc:soil.dry    1.4108     0.2950   4.782 1.74e-06 ***
genusAlop:moss        0.3435     0.3322   1.034  0.30103
genusPard:moss        0.7361     0.3547   2.075  0.03796 *
genusTroc:moss       -0.2033     0.2648  -0.768  0.44264
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Can you see any differences in the response of different spider genera to environmental
conditions?
Code Box 11.7: Fitting a hierarchical GLM to Petrus's spider data using MCMCglmm.
> library(MCMCglmm)
> set.seed(1)
As previously, we should plot the raw data to try to confirm the patterns we see
in analyses (Code Box 11.6). The main story in the data is that different genera have
different responses to environment (Fig. 11.4). We could also construct scatterplots
of counts (or of Dunn-Smyth residuals, to control for environmental effects) to study
the nature of correlation of abundances.
Fig. 11.4: Scatterplot of abundance against soil dryness for each genus of Petrus’s
spider data. Fitted lines are taken from spid_glmm (Code Box 11.6). Note that Alop
has a flatter association with soil dryness than the other genera, as indicated in the
summary output of Code Box 11.6
2 Note that fix=1 does not mean “fix the variance to one”; it means “fix the variances for all
random effects from the first one onwards”. If a model has three different types of random effects
in it, fix=2 would fix the variance of the second and third random effects but not the first.
Table 11.1: Common research questions you can answer using a hierarchical GLM
V̂(μ̂) = (1/n) Σ_{i=1}^n (y_i − μ̂)² = μ̂(1 − μ̂)² + (1 − μ̂)(0 − μ̂)² = μ̂(1 − μ̂)(1 − μ̂ + μ̂) = μ̂(1 − μ̂)
Recall that we always need to check that our analysis is aligned with the research
question (Qs) and data properties (Ps).
In terms of research questions, a hierarchical GLM is a flexible tool that can be
used for many different things, some of which are summarised in Table 11.1. Like
any regression model, it can be used to test for and estimate associations, for variable selection, or for prediction.3
Code Box 11.8: Diagnostic plots for a hierarchical GLM of Petrus’s spider
data
library(DHARMa)
spidFits = predict(spid_glmm, re.form=NA)
res_spid = qnorm( simulateResiduals(spid_glmm)$scaledResiduals )
plot(spidFits, res_spid, col=spiderGeneraLong$genus,
xlab="Fitted values", ylab="Dunn-Smyth residuals")
abline(h=0, col="olivedrab")
addSmooth(spidFits,res_spid)
qqenvelope(res_spid, col=spiderGeneraLong$genus, main="")
The addSmooth function, provided in the ecostats package, adds a GAM smoother to
the data and confidence bands. Note these confidence bands give pointwise rather than
global coverage (meaning they do not control for multiple testing), unlike the output from
plotenvelope or qqenvelope (below right)
3 Although if the purpose of the study is prediction, additional features that are known to often
help are shrinking parameters (using a LASSO or assuming regression coefficients are drawn from
a common distribution), or in large datasets, using flexible regression tools (like additive models)
to handle non-linearity.
[Two diagnostic plots: Dunn-Smyth residuals against fitted values, and a normal quantile plot of the residuals with simulation envelope]
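The residual correlations between genera mentioned earlier could then be examined with pairwise scatterplots, as in this sketch (it relies on spiderGeneraLong being stacked genus by genus, as produced by melt in Code Box 11.5):
resWide = matrix(res_spid, ncol=3, dimnames=list(NULL, c("Alop","Pard","Troc")))
pairs(resWide) # pairwise plots of residual correlation between genera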
When fitting models using MCMC, an important thing to check is whether the
Markov chain converged. To check MCMC convergence, we can run multiple chains
and check that they have similar properties (Code Box 11.9). This is especially
important for hierarchical models like Petrus’s, which is parameter rich. If there are
too many response variables compared to the number of observations (where too
many is actually a small number, let’s say n/2), it is unlikely that the chain will
converge to the posterior distribution (Exercise 11.7). One option is to change the
priors to be informative, essentially telling the model the approximate values of some
of the parameters, in the event that they are hard to estimate from data. This has
parallels to penalised estimation (Sect. 5.6). A better idea usually is to consider a
different model that has fewer parameters (along the lines of Chap. 12).
Code Box 11.9: Checking convergence of the MCMC chains for Petrus's spider model.
> for (iPlot in whichPlot) { # loop over the unique variance/covariance parameters
plot.default(ft_MCMC$VCV[,iPlot],type="l",lwd=0.3,yaxt="n")
lines(ft_MCMC2$VCV[,iPlot],col=2,lwd=0.3)
lines(ft_MCMC3$VCV[,iPlot],col=3,lwd=0.3)
mtext(colnames(ft_MCMC$VCV)[iPlot])
}
> gelman.diag(mcmc.list(ft_MCMC$VCV[,whichPlot],ft_MCMC2$VCV
[,whichPlot],ft_MCMC3$VCV[,whichPlot]))
Potential scale reduction factors:
Point est. Upper C.I.
traitAlop:traitAlop.units 1.00 1.01
traitPard:traitAlop.units 1.00 1.01
traitTroc:traitAlop.units 1.00 1.00
traitPard:traitPard.units 1.01 1.02
traitTroc:traitPard.units 1.00 1.00
traitTroc:traitTroc.units 1.01 1.04
A trace plot and a Gelman-Rubin statistic are constructed for each parameter in the 3 × 3 covariance matrix (whichPlot removes duplicates of covariance parameters). We want the
colours to be all mixed up on the trace plots, with parameters bouncing around the full range
of values relatively quickly. We want the Gelman-Rubin statistic to be close to one, which
seems to be the case here.
The introduction to multivariate analysis in this chapter has been relatively narrow,
focusing on simple multivariate regression models. There are many other types of
research question for which different types of multivariate analysis are appropriate,
examples of which appear in Table 11.2. Most of these can be answered using
multivariate regression models, with some extra bells and whistles, as we will see
over the coming chapters.
Table 11.2: Some common research questions you can’t answer using the multivariate
regression techniques of this chapter
Chapter 12
Visualising Many Responses
A key step in any analysis is data visualisation. This is a challenging topic for
multivariate data because (if we have more than two responses) it is not possible to
jointly visualise the data in a way that captures correlation across responses or how
they relate to predictors. In this chapter, we will discuss a few key techniques to try.
What sort of graph should Edgar produce to visualise how species differ in
flower size and shape?
What sort of graph should Anthony produce to visualise the effects of bush
regeneration on invertebrate communities?
A good starting point is to plot each response variable separately against predictors.
The ggplot2 package (Wickham, 2016) makes this quite easy to do, or you could
try the plot function in the mvabund package (Code Box 12.1). Note that by
default, plot.mvabund only shows the 12 response variables with the highest total
abundance. This can be changed using n.vars and var.subset to control how
many taxa, and which, are plotted. Even with just a subset of 12 response variables,
the plot looks quite busy—as it should because there is a lot of data and (potentially)
a lot of information in it.
Code Box 12.1: Plotting the Bush Regeneration Data of Exercise 12.2
Using mvabund
library(mvabund)
library(ecostats)
data(reveg)
reveg$abundMV=mvabund(reveg$abund) #to treat data as multivariate
plot(abundMV~treatment, data=reveg)
[Boxplots of abundance (log scale) for Control and Reveg samples, shown for the 12 most abundant taxa: Soleolifera, Coleoptera, Diptera, Hemiptera, Hymenoptera, Araneae, Formicidae, Collembola, Amphipoda, Isopoda, Larvae, Acarina]
Can you see any taxa that seem to be associated with bush regeneration?
You can learn a fair bit about the relationship between a multivariate y and x from
separate plots of y against x for each y variable. For example, in Code Box 12.1 we
see that the two control sites often had lower invertebrate abundances (especially for
Collembola, Acarina, Coleoptera, and Amphipoda), but this effect wasn’t seen in all
orders (e.g. Formicidae). Note, however, that this one-at-a-time approach, plotting
the marginal effect on y, will not detect more complex relationships between x and
combinations of y. As in Fig. 11.1, you can’t see joint effects on y of x without jointly
plotting y variables against x in some way. Another option is to plot responses two
at a time in a scatterplot matrix (e.g. using pairs in R). But too many variables
would be a nightmare—for 100 response variables, there are 4950 possible pairwise
scatterplots! Another option is ordination, discussed below.
12.2 Ordination for Multivariate Normal Data
The intention of ordination is to summarise data as best as can be done on just a few
axes (usually two). This requires some form of data reduction, reducing p response
variables to just a few. There are many approaches to doing this. For (approximately)
multivariate normal data, we use principal component analysis (PCA) or factor
analysis.
PCA is the simplest of ordination tools. It tries to rotate data to identify axes that
(sequentially) explain as much of the variation as possible, based on just the variance–
covariance matrix or the correlation matrix (Maths Box 12.1). In R you can use the
princomp function, as in Code Box 12.2.
PCA will report the rotation of the data that sequentially maximised sample vari-
ation, often referred to as the loadings, as well as the amount of variation explained
by each axis. For example, in Code Box 12.2, the first component was constructed
as 0.521 × Sepal.Length − 0.269 × Sepal.Width + 0.580 × Petal.Length +
0.565 × Petal.Width, which had a standard deviation of 1.71, explaining 73% of
the variation in the data. The direction defined by this linear combination of the four
responses gives the largest possible standard deviation for this dataset. The second
component explains the most variation in the data beyond that explained by the first
component and captures an additional 23% of the variation. Together, the first two
principal components explain 96% of the variation in the data, so most of what
is happening in the data can be captured by just focusing on these two principal
component scores instead of looking at all four responses.
Maths Box 12.1: PCA via eigendecomposition
Any symmetric matrix A can be written in terms of its eigendecomposition
A = P^T Λ P
where Λ is a diagonal matrix of eigenvalues and P is an orthogonal matrix (P^T P = I) whose rows are the corresponding eigenvectors. Applying the eigendecomposition to the sample variance–covariance matrix Σ, we can re-express it in the form
Λ = P Σ P^T
which looks like Eq. 11.1 of Maths Box 11.1. Specifically, P is a matrix of loadings describing a rotation that can be applied to data, z = Py, which will give uncorrelated values (since the variance–covariance matrix of z is Λ, which is diagonal, i.e. it has zeros in all off-diagonal elements). A bit of further work shows that the largest eigenvalue in Λ is the biggest possible variance that could be obtained from a rotation of the data y. The corresponding eigenvector gives us the loadings for this first principal component. The next largest eigenvalue is the largest variance obtainable by a rotation uncorrelated with the first axis, so we use this to construct the second principal component, and so forth.
The trace of a matrix, tr(A), is the sum of its diagonal elements. The trace of Σ is a measure of the total variation in the data. Now a special rule for traces of products is that tr(AB) = tr(BA), meaning that
tr(Λ) = tr(P Σ P^T) = tr(P^T P Σ) = tr(Σ)
so the sum of the variances of the principal component scores Py is the same as the sum of the variances of y, and we can look at the relative sizes of the values in Λ as telling us the proportion of variance in y explained by each principal component.
The sample correlation matrix R can be understood as the variance–covariance matrix of standardised data. So applying an eigendecomposition to the sample correlation matrix gives us a PCA of standardised data. Now the diagonals of R are all one, so tr(R) = p, and the eigenvalues from a PCA on standardised data will sum to p.
An important step in PCA is to look at the loadings and think about what they mean
in order to try to interpret principal components. For example, the loadings for the
first principal component of the Iris data (labelled Comp.1 as in Code Box 12.2) are
large and positive for most responses, so this can be interpreted as a size variable,
a measure of how big flowers are. The second principal component has a very
negative loading for Sepal.Width and so can be understood as (predominantly)
characterising sepal narrowness. Given that 96% of the variation in the data is
explained by these two principal components, it seems that most of what is going
on is that the Iris flowers vary in size, but some flowers of a given size will have
narrower sepals than others.
> data("iris")
> pc = princomp(iris[,1:4],cor=TRUE)
> pc
Call:
princomp(x = iris[, 1:4], cor = TRUE)
Standard deviations:
Comp.1 Comp.2 Comp.3 Comp.4
1.7083611 0.9560494 0.3830886 0.1439265
While a lot of information about how the principal components were constructed
can be seen from looking at their loadings and standard deviations, usually we really
want a plot of the data. We can’t readily plot the original data jointly for all four
responses (might need four-dimensional paper!), but we can plot the scores of each
observation along the first two principal components to visualise the main patterns
in the data. As before, this captures most (96%) of the variation, but it is harder to
interpret than when plotting raw data, because the axes are based, not on measured
variables, but on linear combinations of them. We can make things a little easier to understand by adding axes for the originally measured variables to the plot, based on their loadings; this is known as a biplot (Fig. 12.1).
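A plot of this kind can be produced in base R (a sketch, not necessarily the exact code used for Fig. 12.1):
biplot(pc) # scores as points, loadings as arrows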
Fig. 12.1: PCA biplot of Edgar’s Iris data. Points are scores of individual flowers
on the first two principal components; arrows are the loadings of each variable on
these principal components. The loadings suggest (as in Code Box 12.2) that bigger
flowers are towards the right and flowers with wider sepals are towards the top
(Exercise 12.1) but a bad option for discrete data with lots of zeros like Anthony’s
(Exercise 12.2).
Code Box 12.3: A factor analysis of Edgar's Iris flower data.
Loadings:
ML1 ML2
Sepal.Length 0.997
Sepal.Width -0.115 -0.665
Petal.Length 0.871 0.486
Petal.Width 0.818 0.514
ML1 ML2
SS loadings 2.436 0.942
Proportion Var 0.609 0.236
Cumulative Var 0.609 0.844
How do the results compare to the PCA?
Factor analysis, with K factors, assumes that the ith observation of the jth response
variable comes from the following model:
yij = ziᵀ λj + εij     (12.1)
where the errors εij are assumed to be independent and normal with variance σj²
(which is constant across observations but may differ across responses). The K factor
scores are also assumed to be independently normally distributed with common
variance and are stored in zi. These assumptions on εij and zi imply that yij is
assumed to be multivariate normal. Covariation is introduced across the responses
by assuming they all respond to some relatively small number (K) of common
factors. This covariation is captured by the loadings λ j —if data are highly correlated,
the loadings will be large, otherwise they will be small (although they are often
standardised in output, in which case look for a statement of proportion of variance
explained, as in Code Box 12.3). If a pair of response variables is highly correlated,
then their loadings will be similar; if a pair of response variables is independent,
no pattern will be apparent on comparing their loadings. The other parameters of
interest are the variances of the errors (εij), often known as uniquenesses (the complement
of the communalities), which are smaller when the data are more strongly correlated.
The factor analysis model looks a lot like a multivariate linear regression model,
and it is. The main difference is that now we don’t have measured predictors (no xi )
but instead use the axes of covariation in the data to estimate these factor scores (zi ).
You can think of these factor scores as unmeasured predictors. If you have measured
predictors as well, it is possible to extend the factor analysis model to account for
them.
It seems weird, kind of like voodoo magic, that we can assume there are some
predictors in our model that are important but unmeasured and then fit a model that
goes and estimates them. But it turns out that if there is some underlying axis
that all responses are related to, it can actually be seen in the pattern of responses.
Consider, for example, the bivariate case, with two response variables that are linearly
related. A factor analysis model assumes that the sole cause of this correlation is
that both variables are responding to some underlying factor, which must then be
proportional to scores along some line of best fit through the data (Fig. 12.2). The
errors εij capture the variation in the data around this line of best fit, along each axis.
A factor analysis assumes data are a sample from a multivariate normal variable, but
where the variance–covariance matrix has a special structure that can be summarised
using a small number of factors. We can break down the assumptions as follows:
1. The observed y-values are independent given z.
2. The y-values are normally distributed, and within each response variable they
have constant variance.
3. The mean of each y is a linear function of K factor scores.
Fig. 12.2: An illustration of how factor analysis works. The two response variables
here are (negatively) correlated with each other, and factor analysis assumes this is
because they both respond to some underlying, unmeasured factor. Our estimates of
the scores for this factor fall along a line of best fit, chosen in such a way that errors
from the line are uncorrelated noise, and so that all the covariation is captured by
the factor scores
4. The factor scores z are independent (across samples and factors) and are normally
distributed with constant variance.
These assumptions look a lot like linear model assumptions—because, as previously,
a factor analysis is like a linear model, but with unmeasured predictors. More specif-
ically, these look like mixed model assumptions, and this model can in fact be fitted
as a type of mixed model.
So we have four assumptions. How robust is this model to failure of assumptions?
The independence assumptions are critical when making inferences from a factor
analysis model. Note it is independence across observations that is required, not
across response variables; in fact, the point of factor analysis is to model the depen-
dence across responses. When using data to determine the number of factors to use
in the model, note that this is a type of model selection (an inference technique),
and, as always, independence of observations is needed for model selection methods
to work reliably.
Despite the foregoing discussion, often we use a factor analysis for descriptive
purposes (ordination, trying to identify underlying factors), and we don’t want to
make inferences from it. If we aren’t going to make inferences, then the independence
assumption across observations is not critical. You could even argue that the factors
estimate and account for dependence across observations, indirectly, e.g. if you
collect spatially structured data, that spatial structure will (to some extent) be captured
by the factors. However, if you have a known source of structured dependence in your
data, you could expect better results when modelling it directly, which is possible
using more advanced methods (Thorson et al., 2015; Ovaskainen et al., 2016).
Fig. 12.3: One way to check the assumptions of a factor analysis is by fitting a simple
linear regression for each response variable as a function of factor scores, as done
here for the Iris data. Different species have been labelled with different colours.
Note that assumptions do not appear to be satisfied; in particular, there appears to be
a fan shape in the petal variables
One way to check factor analysis assumptions is to fit a linear model for each
response variable as a function of the factors (Code Box 12.4). If the number of
response variables is small enough to permit it, it might also be a good idea to
construct a scatterplot matrix.
Code Box 12.4: Assumption Checking for a Factor Analysis of Iris Data
To check linearity and equal variance assumptions, we will check the assumptions of a linear
model for each response as a function of each factor (Fig. 12.3):
for(iVar in 1:4)
  plotenvelope(lm(iris[,iVar]~fa.iris$scores), which=1, col=iris$Species)
A scatterplot matrix can be constructed as follows:
plot(iris[,1:4], col=iris$Species)
The assumptions don’t look good, so inferences based on this model would be questionable.
Within each species, linearity and equal variance look reasonable, so separate factor analyses
by species (or one with a species term added to the model) would be more reasonable.
Although the latent variables were assumed to be independent, this doesn’t actu-
ally impose any constraints on the model, because the loadings that z are multiplied
by induce dependence. Similarly, the equal variance assumption on z imposes no
constraints because the loadings handle how each factor scales against each response.
The variance of z is often set to one, without loss of generality.
It is worth keeping in mind that it only makes sense to do a factor analysis if you
expect some of your response variables to be correlated. The reason for this is that
the factors are introduced to capture correlation patterns, so if there are none there,
then the factors will have nothing to explain.
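A quick way to check this before fitting (a small sketch, not from the text) is to look at the sample correlation matrix:
round(cor(iris[,1:4]), 2)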
Often, when seeking a visualisation of the data, we choose to use a two-factor model
so that we can plot one factor against the other. But there is the question of whether
this captures all the main sources of covariation in the data, or if more factors would
be needed to do this. Or maybe fewer factors are needed—maybe there is only really
one underlying dimension that all variables are responding to, or maybe there is no
covariation in the data at all to explain, and the correct number of factors to choose
is zero (in which case we are no longer doing a factor analysis!). So how do you
work out how many factors (K) you should use?
A common way to choose the number of factors is to use a scree plot (Code
Box 12.5)—to plot the variance explained by each factor in the model and choose
a solution where the plot seems to flatten out, with a relatively small amount of
additional variation explained by any extra factors. This can work when the answer
is quite clear, but it is a little arbitrary and can be difficult to use reliably.
Code Box 12.5: Choosing the Number of Factors for the Iris Data
An informal way to choose the number of factors is to look for a kink on a scree plot:
> plot(fa.iris$values,type="l")
which suggests choosing one or two factors (maybe one, since the variance explained is less
than one for the second factor).
A more formal approach is to compare BICs for models with one, two, or three factors.
Unfortunately, these are not given in fa output, so we need to compute them manually. This
can be done inside a loop to avoid the repetition of calculations across the three models:
> nFactors=3 # to compare models with up to 3 factors
> BICs = rep(NA,nFactors) # define the vector that BIC values go in
> names(BICs) = 1:nFactors # name its values according to #factors
> for(iFactors in 1:nFactors) {
  fa.iris <- fa(iris[,1:4], nfactors=iFactors, fm="ml", rotate="varimax")
  BICs[iFactors] = fa.iris$objective - log(fa.iris$nh) * fa.iris$dof
}
> BICs # correct up to a constant, which is ignorable
1 2 3
-9.436629 5.171006 15.031906
How many factors are supported by the data?
A better approach is to exploit the model-based nature of factor analysis and use
model-based approaches to choose K. For example, we could choose the value of
K that minimises BIC, as in Code Box 12.5. For Edgar’s Iris data, this led us to a
model with one factor, representing size. The second factor, which was essentially
petal width, is not a joint factor across multiple responses, so it can readily be
characterised by the error term i j for petal width.
For PCA, we cannot use model-based approaches to choose the number of prin-
cipal components to focus on. The reason is that PCA is basically a transformation
of data, not a model for it, and we are stuck using more arbitrary rules like scree
plots or taking all components that explain more than 1/p of the variation in the data
(which, when performed on a correlation matrix, means choosing all components
with a variance greater than one). For the Iris data, this would again lead us to a
model with just one component, similar to factor analysis.
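For example, the "variance greater than one" rule can be checked directly from the PCA object of Code Box 12.2 (a brief sketch, not from the text):
> pcVars = pc$sdev^2   # component variances (eigenvalues of the correlation matrix)
> pcVars
> pcVars > 1           # only the first component has variance greater than one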
12.2.2.3 Rotations
Factor analysis solutions are invariant under rotation (Maths Box 12.2), which essen-
tially means that results are the same when you look at a biplot from any orientation.
This issue doesn’t actually matter when it comes to interpreting an ordination because
the relative positions of points and arrows are unaffected by the plot’s orientation.
But if you wish to study factor loadings and try to attach an interpretation to axes,
a decision needs to be made as to which orientation of the solution to use. The most
common choice is a varimax rotation, which tends to push loadings towards zero or
one (Maths Box 12.2). This has no biological or statistical justification; it is done to
try to make it easier to interpret factors, because each factor will then tend to be a
function of fewer responses.
So which rotation should we use? Well, I guess it doesn’t really matter! It is the
relative position of points on the ordination that matters, not their orientation.
In fitting the model, often we assume (without loss of generality) that the
matrix of loadings is lower triangular, meaning that the first response has
loadings falling along the first axis, the second response is somewhere on the
plane specified by the first two axes, and so forth. Once the model has been
fitted, the solution can be rotated any way you like to obtain an equivalent
solution.
In viewing the factor analysis solution, a varimax rotation is often used,
intended to simplify interpretation of the fitted model. A varimax rotation
aims to maximise the variance of the squared factor loadings, which tends
to move them either towards zero or one, so each involves fewer variables,
simplifying interpretation.
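As a small sketch (not from the text) of what a varimax rotation does, the built-in factanal and varimax functions can be used to rotate an unrotated maximum likelihood solution for the Iris data:
fa_unrot = factanal(iris[,1:4], factors=2, rotation="none")  # unrotated ML factor analysis
varimax(loadings(fa_unrot))$loadings                          # varimax-rotated loadings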
Recall Anthony’s data are discrete and are not going to satisfy the all-important
linearity and equal variance assumptions. In Exercise 12.4, you may have found
instances where each assumption was violated. You might also have noticed some big
outliers and some predicted abundances that were negative—signs we are modelling
mean response on the wrong scale! What Anthony needs is a technique that combines
the ideas of factor analysis with something like generalised linear models to handle
the mean–variance relationship in his count data and to put mean response on a
more appropriate scale. Such techniques exist and are often called generalised latent
variable models (Skrondal & Rabe-Hesketh, 2004). They were initially developed
in the social sciences for questionnaire analysis, since questionnaire responses are
rarely normally distributed. They have been extended to ecology relatively recently
(Walker & Jackson, 2011; Hui et al., 2015).
A GLVM assumes that the mean of each response is related to a small number of latent variables via a link function:
g(μij) = β0j + ziᵀΛj,    zik ∼ N(0, 1)
Basically, this is a GLM with latent variables instead of measured predictor variables.
As previously, z are factor scores, which can be used as ordination axes, and the
coefficients the latent variables are multiplied by (Λ j ) are often called loadings.
Although the latent variables were assumed to be independent with mean zero
and variance one, this doesn’t actually impose any constraints on the model. This is
because the loadings that z are multiplied by induce dependence and different sized
effects for different responses, and the intercept term β0j sets the overall mean for
each response.
A GLVM is a type of hierarchical model, as in Sect. 11.3. In fact, the factor model
ziᵀΛj can be rewritten as a multivariate normal random intercept εij, which has
a reduced rank variance–covariance matrix, Σ = ΛΛᵀ. This basically means that a
GLVM is like the hierarchical GLM in Sect. 11.3, but with fewer parameters, because
we are assuming a simpler form for Σ.
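To see what "reduced rank" means here, the following small sketch (with invented numbers, not code from the text) builds a covariance matrix from a loadings matrix with two columns, as a two-factor GLVM assumes:
Lambda = matrix(rnorm(24*2), nrow=24, ncol=2)  # loadings for 24 taxa on 2 latent variables
Sigma = Lambda %*% t(Lambda)                   # implied 24 x 24 covariance matrix
qr(Sigma)$rank                                 # its rank is only 2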
When it comes to fitting GLVMs, there are a few different options around. In
R, the fastest (at the time of writing) and one of the easiest to use is gllvm (Code
Box 12.6). This package uses maximum likelihood estimation, but for this sort of
model, the likelihood function has a weird shape and it can be hard to find its
maximum, especially for a dataset with many responses that are mostly zeros. So it
is advisable to run a few times and check that the maximised value for the likelihood
doesn’t change (preferably, using jitter.var=0.2 or something similar, to jitter
starting guesses of ordination scores so as to start estimation from slightly different
places). If it does change, you might need to run it about 10 times and keep the
solution with the highest likelihood. For Anthony’s data, on five runs I got −689.3
on every run, with a small amount of variation on the second decimal place, which
is ignorable.
Assumptions can be checked in the usual way (Fig. 12.4). Residuals vs fitted value
plots are constructed essentially using tools for fitting a GLM as a function of factor
scores, analogously to what is done in Fig. 12.3.
As previously, biplots can be readily constructed by overlaying factor loadings on
top of plots of factor scores (Code Box 12.6). Note from the biplot in Code Box 12.6
that sites 1 and 5 appear towards the left of the plot—these are the two control
310 12 Visualising Many Responses
sites. Notice also that Blattodea scores highly near these sites, while Amphipoda and
Coleoptera have high loadings away from these sites. This suggests that there are
more cockroaches at control sites and more amphipods and beetles away from them,
as an inspection of the data confirms (Table 12.1).
Table 12.1: Counts for key orders from Anthony’s revegetation data, with control sites
indicated in black and revegetated sites in red. Notice how closely these correspond
to the information on the biplot in Code Box 12.6
The gllvm package can handle some family arguments commonly used in ecol-
ogy; for more options see the boral package (Hui, 2016) or Hmsc (Ovaskainen et
al., 2017b; Ovaskainen & Abrego, 2020, Section 7.3). The glmmTMB package was
recently extended to fit latent variable models (version 1.1.2.2 and later), via the rr
correlation structure, which stands for “reduced rank”. This is an exciting develop-
ment because it nests the capacity to fit latent variable models in software that can
flexibly fit mixed models, with a range of different distribution options.
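As a hedged sketch of how such a reduced-rank model might be specified in glmmTMB (the reshaping to long format and the resulting column names are assumptions for illustration, not code from the text):
library(ecostats)
library(glmmTMB)
data(reveg)
# reshape the counts to long format: one row per site-by-taxon combination
revegLong = data.frame(
  count     = unlist(reveg$abund),
  taxon     = factor(rep(colnames(reveg$abund), each=nrow(reveg$abund))),
  site      = factor(rep(rownames(reveg$abund), times=ncol(reveg$abund))),
  treatment = rep(reveg$treatment, times=ncol(reveg$abund))
)
ft_rr = glmmTMB(count ~ taxon + treatment + rr(taxon + 0 | site, d=2),
                family=nbinom2, data=revegLong)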
> library(ecostats)
> data(reveg)
> library(gllvm)
> reveg_LVM = gllvm(reveg$abund, num.lv=2, family="negative.binomial",
trace=TRUE, jitter.var=0.2)
> logLik(reveg_LVM)
'log Lik.' -689.3072 (df=95)
Repeating this several times seems to return an answer within about 0.01 of this value, so
we can be confident this is (close to) the maximum likelihood solution. To get a biplot of
this solution, labelling only the 12 responses with highest loadings (to reduce clutter):
> ordiplot(reveg_LVM, col=as.numeric(reveg$treatment), biplot=TRUE,
ind.spp=12)
You may get a different ordination, which is more or less a rotation or reflection of this one,
which should have a similar interpretation. Checking assumptions:
> par(mfrow=c(1,3))
> plot(reveg_LVM, which=c(1,2,5))
which returns the plot in Fig. 12.4.
Fig. 12.4: Assumption checks of the GLVM fitted to Anthony’s revegetation data in
Code Box 12.6. There is no appreciable pattern in plots that would lead us to worry
about our variance or linearity assumptions, with no fan or U shape in the residuals
vs fits plot. There is a slight trend in the quantile plot, which remained there on
repeat runs. This suggests a possible violation of distributional assumptions, with
large residuals being slightly smaller than they were expected to be. This means
the distribution of counts was not quite as strongly right-skewed as expected by the
model, which is not a critical issue, since the mean–variance trend seems to have
been accounted for adequately
Such issues can be avoided using a GLVM with the appropriate distributional assumptions for your
data.
> library(vegan)
> ord_mds=metaMDS(reveg$abund)
Square-root transformation
Wisconsin double standardisation
Run 0 stress 0.1611237
Run 1 stress 0.1680773
Run 2 stress 0.1934608
... New best solution
... Procrustes: rmse 5.511554e-05 max resid 0.0001003286
*** Solution reached
> plot(ord_mds$points, pch=as.numeric(reveg$treatment),
col=reveg$treatment)
By default, the vegan package uses the Bray-Curtis distances and a bunch of fairly arbitrary
transformations.
The separation of the two control plots from the others suggests to us that there is an
effect of bush regeneration treatment on the invertebrate community, although it doesn’t give
us much sense for what the nature of this effect is.
The data can be found in the mvabund package and can be loaded using:
library(mvabund)
data(tikus)
tikus20 = tikus$abund[1:20,] # for 1981 and 1983 data only
tikusAbund = tikus20[,apply(tikus20,2,sum)>0] # remove zerotons
Construct a MDS plot of the data using the Bray-Curtis distance (default)
and colour-code symbols by year of sampling. Does this plot agree with the
Warwick et al. (1990) interpretation?
Construct another MDS plot using the Euclidean distance on log(y + 1)-
transformed data. Does this plot agree with the Warwick et al. (1990) interpre-
tation?
Use the plot.mvabund function to plot each coral response variable as a
function of time. What is the main pattern you see?
Convert the data into presence–absence as follows:
tikusPA = tikusAbund
tikusPA[tikusPA>1]=1
and use the gllvm package to construct an ordination.
Do the assumptions appear reasonable? How would you interpret this plot?
For example, you might choose an inappropriate dissimilarity (such as either of the options in
Exercise 12.6), or, using a latent variable model, you could assume the wrong mean–variance relationship (mind your Ps and
Qs!). If you see a result in an ordination that you can’t reproduce on the originally
measured data, then you should question whether the pattern is really there and
maybe think harder about your choice of ordination approach.
Code Box 12.8: Studying Each Observation Separately for the Iris Data
For a summary of sample means for each species and each response variable:
> by(iris, iris$Species, function(dat){ apply(dat[,1:4],2,mean) } )
iris$Species: setosa
Sepal.Length Sepal.Width Petal.Length Petal.Width
5.006 3.428 1.462 0.246
------------------------------------------------------------
iris$Species: versicolor
Sepal.Length Sepal.Width Petal.Length Petal.Width
5.936 2.770 4.260 1.326
------------------------------------------------------------
iris$Species: virginica
Sepal.Length Sepal.Width Petal.Length Petal.Width
6.588 2.974 5.552 2.026
Note that I. virginica tends to have the largest flowers (in petal size and sepal length), and I.
setosa tends to have the smallest flowers (but the widest sepals). Using boxplots to visualise
this:
> par(mfrow=c(2,2))
> plot(Sepal.Length~Species,data=iris,xlab="")
> plot(Sepal.Width~Species,data=iris,xlab="")
> plot(Petal.Length~Species,data=iris,xlab="")
> plot(Petal.Width~Species,data=iris,xlab="")
produces Fig. 12.5.
Key Point
Interpretation is always difficult in multivariate analysis, and mistakes are
easily made. The best safeguard against this is to try to visualise key results
using your raw data to complement more abstract tools like ordination.
Fig. 12.5: Boxplots of Iris flower data for each species. Note the species differ in size,
especially in petal variables, although something different seems to be happening
with sepal width, with the widest sepals for the smallest flowers (I. setosa). This
graph is arguably more informative than the biplot, showing the value of looking for
ways to visualise patterns using the originally measured variables
Chapter 13
Allometric Line Fitting
(A crude argument for the “2/3 power law” is that a major function of the
brain is to interpret signals from the skin, so they should scale proportionately
to surface area rather than body mass, Rensch, 1954; Gould, 1966).
How should we analyse the data to answer this research question?
Consider Exercises 13.1–13.2. In both cases, we have two variables, but we do not
have a response and a predictor, so we should not be thinking about univariate meth-
ods of analysis. Instead we have a multivariate problem, well, a bivariate problem
(with two responses), and there is interest in how these responses covary. This looks
like a job for factor analysis or principal component analysis (PCA)—trying to es-
timate some underlying factor that characterises how these responses covary. But
there is an important difference in the way the research question has been phrased,
specifically, the focus here is on making inferences about the slope of the line of best
fit. Factor loadings store this information, but factor analysis output will not tell you
if the loadings on y1 and y2 are significantly different from each other, which is what
we would want to know if trying to determine whether the slope is one (“isometry”),
and it will not help us see if there is a significant departure from the 2/3 power law
of Exercise 13.1 (Fig. 13.1).
Fig. 13.1: Brain size vs body size for different species of vertebrate, with a 2/3
power law overlaid (red line). All points correspond to different mammal species,
except the three outliers, which are dinosaurs. Should brain size go on the y-axis,
the x-axis, or either? Or put another way—why do humans (solid blue point) not
lie on the line—because they have large brain mass for their body size or a small
body for their brain size? This is a matter of perspective, and the ambiguities here
complicate analysis and lead us to think about multivariate techniques rather than
standard linear regression
Because we are talking about fitting a line to relate two variables, it is tempting
to think of this as a linear modelling problem and apply simple linear regression
as in Chap. 2. However, linear regression is fitted using least squares, minimising
errors predicting y. This certainly makes sense if your goal is to predict or explain
y, but that is not our goal in Exercises 13.1–13.2. Rather than having a single
response variable to explain, we have two y variables, and we are trying to estimate
some underlying factor that best explains why our responses covary. Another key
distinguishing feature here is that we are interested in specific values of the slope—in
Exercise 13.1 we want to know if there is evidence that the slope is not 2/3, and in
Exercise 13.2 we want to know if the slope differs from one. Because there is no
obvious choice of predictor and response, we could equally well do the analysis with
either variable as the predictor (Code Box 13.1), but this gives completely different
answers!
> library(MASS)
> data(Animals)
> ftBrainBody=lm(log(brain)~log(body),data=Animals)
> confint(ftBrainBody)
2.5 % 97.5 %
(Intercept) 1.7056829 3.4041133
log(body) 0.3353152 0.6566742
From a linear regression of brain size against body size, a 95% confidence interval (CI) for
the true slope does not quite cover 2/3, so we would conclude that there is some evidence
that the relationship is flatter than 2/3 scaling. But why do things this way around—what if
we regressed body size against brain size?
> ftBodyBrain=lm(log(body)~log(brain),data=Animals)
> confint(ftBodyBrain)
2.5 % 97.5 %
(Intercept) -3.6396580 0.3396307
log(brain) 0.8281789 1.6218881
Reversing axes, we might expect the slopes to be the inverse of what we saw before—so we
are now interested in whether or not the slope is 3/2. Confusingly, this time around the CI
does cover 3/2, so we have no evidence against the 2/3 scaling law.
The reason these lines give different answers is that they do different jobs—one tries to
predict y from x, while the other tries to predict x from y. Neither is what we are really after
when trying to characterise the main axis along which brain mass and body mass covary.
The problem that can arise using regression in allometry is known as regression
to the mean, with regression slopes being flatter than you might naïvely expect.
The term “regression” comes from the idea that y predictions tend to be closer to
their mean (in standardised units) than the values of x from which their predictions
were made, hence the variable seems to “regress” towards its mean. The idea was
first discussed by Galton (1886) in relation to predicting sons’ heights from fathers’
heights, where sons were always predicted to be of more average stature than their
father, with tall fathers having shorter sons and short fathers having taller sons.
Regression to the mean is entirely appropriate when predicting y from x, because
we do expect things to become more average. For example, if you do a test and get
your best score ever, topping the class, do you expect to do just as well on your next
test for that subject? You will probably do well again, but you should not expect to
do quite as well.
In allometry, we often do not want regression to the mean. In Exercise 13.1,
when studying the 2/3 power law, regression to the mean makes the slope of the
line relating brain mass to body mass flatter than we might expect, and if we regress
body mass against brain mass, this line is then steeper than we might expect (hence
the contradictory results of Exercise 13.1). So instead of minimising error predicting
y (or x), maybe we should minimise something in between, like the straight-line
distance of points from the line.
The most common tools used for allometric line fitting are variants on PCA. One
option is to fit a "major axis" (MA), so named because it is the major axis of the
ellipse that best fits the data. It can be understood as the line that minimises the
sum of squared straight line distances of each point from the line. The major axis
turns out to have the same slope as the first principal component fitted to a variance–
covariance matrix and can be interpreted as a PCA of the data—it estimates the axis
or dimension that characterises most of the (co)variation in the data.
If response variables are on different scales, it might make sense to standardise
prior to fitting, but then to rescale to the original axes for interpretation, known as
a “standardised major axis” or “reduced major axis” (hereafter SMA). This can be
understood as a PCA on the correlation matrix, but rescaled to the original axes. The
main difference between these methods is that MA is invariant under rotation (i.e.
if you rotate your data, the fitted line will be rotated by the same amount), whereas
SMA is invariant under changes of scale (i.e. if you change height from metres to
centimetres, the slope will change by a factor of 100). SMA is more commonly
used, but if the estimated slope is near one, then the methods will give quite similar
answers.
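A minimal sketch (not code from the text) of these two slopes, computed directly from the variances and covariances of the log brain and body sizes used in Code Box 13.1; the results can be compared with the sma() output in Code Box 13.2:
library(MASS)
data(Animals)
logBody  = log(Animals$body)
logBrain = log(Animals$brain)
eig = eigen(cov(cbind(logBody, logBrain)))
maSlope  = eig$vectors[2,1] / eig$vectors[1,1]                       # MA slope: first principal component direction
smaSlope = sign(cor(logBody, logBrain)) * sd(logBrain) / sd(logBody) # SMA slope
c(MA=maSlope, SMA=smaSlope)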
(S)MA methods are sometimes referred to as Model II regression (Sokal & Rohlf,
1995) and have some equivalences with models that assume there is measurement
error in the x variable (Carroll & Ruppert, 1996, “measurement error models”).
It is important, however, to recognise that in allometry we typically do not have a
regression problem because we do not seek to predict y from x, rather we want to
study how y1 and y2 covary. Measurement error models are a form of regression
model that estimate how much error there is in measuring x and adjust predictions
for y accordingly. As such it is best not to think too deeply about relations between
allometric line-fitting methods and measurement error models, as their intentions
are quite different. An issue applying measurement error approaches in allometry is
that typically the main form of error is not due to measurement of x but to a lack of
fit of the model (Carroll & Ruppert, 1996, “equation error”).
Methods for estimating and making inferences about MA or SMA slopes can be
found in the smatr package (Warton et al., 2012a). Different types of hypotheses
that are often of interest are illustrated in Fig. 13.2.
Fig. 13.2: Schematic diagram of different types of hypothesis tests in smatr package,
and code to implement them, from Warton et al. (2012a). (a) Testing if the true SMA
slope is B, based on a single sample. (b) Comparing slopes of several samples, or if
slopes can be assumed to be common, comparing the location of samples, looking
either for a change in elevation or a shift along the common axis
For a single sample, as in Exercise 13.1, interest is most commonly in the slope
of the (S)MA. Methods to test the slope of the line have been around for a long
time (Pitman, 1939; Creasy, 1957, Code Box 13.2) and can be understood as testing
for correlation between fitted values and residuals, where residuals are measured in
different directions depending on which line-fitting method is used (hence, the test
statistic in Code Box 13.2 is a correlation coefficient). Confidence intervals for slope
are constructed by inverting this test—finding the range of values for the slope such
that the fitted values and residuals are not significantly correlated. When testing if
the slope is one, which often coincides with a test for “isometry” (Gould, 1966),
tests for MA and SMA are equivalent.
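The following small sketch (not code from the text) illustrates that idea for the SMA: under the null hypothesis that the SMA slope equals b, the "residual" axis y − bx and the "fitted" axis y + bx should be uncorrelated, so the correlation between them can be used as the test statistic (compare with the r reported in Code Box 13.2):
library(MASS)
data(Animals)
logBody  = log(Animals$body)
logBrain = log(Animals$brain)
b = 2/3                             # hypothesised SMA slope
resAxis = logBrain - b * logBody    # residual axis under the null
fitAxis = logBrain + b * logBody    # fitted axis under the null
cor.test(resAxis, fitAxis)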
> library(smatr)
> sma_brainBody = sma(brain~body, data=Animals,log="xy",slope.test=2/3)
> sma_brainBody
Coefficients:
elevation slope
estimate 0.8797718 0.6363038
lower limit 0.4999123 0.4955982
upper limit 1.2596314 0.8169572
------------------------------------------------------------
H0 : slope not different from 0.6666667
Test statistic : r= -0.07424 with 26 degrees of freedom under H0
P-value : 0.70734
What happens if you reverse the axes? Do you get the same answer?
> sma(body~brain, data=Animals,log="xy",slope.test=3/2)
Coefficients:
elevation slope
estimate -1.3826286 1.571576
lower limit -2.2584635 1.224054
upper limit -0.5067936 2.017763
------------------------------------------------------------
H0 : slope not different from 1.5
Test statistic : r= 0.07424 with 26 degrees of freedom under H0
P-value : 0.70734
Is this what you would have expected? Is there evidence against the 2/3 power law?
Code Box 13.3: Comparing Allometric Slopes for Ian’s Data Using smatr
First we will test for a common slope, log-transforming both variables:
> data(leaflife)
> leafSlopes = sma(longev~lma*site, log="xy", data=leaflife)
> summary(leafSlopes)
------------------------------------------------------------
Results of comparing lines among groups.
Group: 1
elevation slope
estimate -4.218236 2.119823
lower limit -5.903527 1.451816
upper limit -2.532946 3.095192
...
> plot(leafSlopes)
Output for groups 2–4 has been excluded due to space considerations. The plot appears in
Fig. 13.3.
The output suggests there is some evidence (P = 0.02) that SMA slopes for the so-called
leaf economics spectrum are different across the four sites that were sampled. Confidence
intervals for SMA slopes for each group are also reported, and it can be seen above that the
interval for the slope does not cover one for group 1 (high rainfall and high soil nutrients). This is also the
case for group 4 (low rainfall and low soil nutrients). So we can conclude that the relationship
is not isometric, with evidence that (at least in some cases) leaf longevity changes by more
than leaf mass per area (on a proportional scale) as you move along the leaf economics
spectrum.
Code Box 13.4: Comparing Elevations of Allometric Lines for Ian’s Low
Soil Nutrient Data Using smatr
We subset data to just sites with low soil nutrients, for which SMA slopes were quite similar
(it makes little sense to compare elevations when slopes are different).
> leaf_low_soilp = subset(leaflife, soilp == "low")
> leafElev = sma(longev~lma+rain, log="xy", data=leaf_low_soilp)
> leafElev
------------------------------------------------------------
Results of comparing lines among groups.
H0 : no difference in elevation.
Wald statistic: 6.566 with 1 degree of freedom
P-value : 0.010393
------------------------------------------------------------
The results suggest some evidence of a difference in elevation of the leaf economics spectrum
between high- and low-rainfall sites. Looking at elevation estimates (using the summary
function), we would see that elevation is higher at high-rainfall sites, meaning that at high-
rainfall sites, leaves could have lower leaf mass per area without a cost in leaf longevity.
Fig. 13.3: Ian’s leaf economics data from Exercise 13.2, with separately fitted lines for
each treatment combination, as produced by plot(leafSlopes) in Code Box 13.3.
There is a suggestion here that some lines differ from each other in slope, and others
may differ in elevation, explored in Code Boxes 13.3–13.4
(Plots of residuals against fitted values (body v brain) and a normal quantile plot of the residuals appear here, for checking the assumptions of the fitted line.)
Being based on variances and covariances, (S)MA methods are quite sensitive to
outliers. Robust extensions have been developed (Taskinen & Warton, 2011, using
Huber’s M estimation) and are available in the smatr package (Code Box 13.6).
These methods are advisable if there are any outliers in the data. Outliers can have
undue influence on fits, substantially reducing efficiency and making it harder to see
the signal in data.
> # robust SMA fit (assumed fitted with robust=TRUE, as in Exercise 13.4)
> sma_brainBodyRobust = sma(brain~body, data=Animals, log="xy", robust=TRUE)
> plot(brain~body,data=Animals,log="xy")
> abline(sma_brainBody, col="red")
> abline(sma_brainBodyRobust, col="blue")
Obvious limitations of the smatr software are that it is designed for problems
with only two size variables and for linear relationships. For problems with more
than two responses, PCA techniques could be used, but inference procedures might
be harder to get a handle on. There is a literature on non-linear approaches to
principal components (starting with Hastie & Stuetzle, 1989) that could be exploited
to develop non-linear extensions of (standardised) major axis techniques.
Exercise 13.4: Robust Allometric Line Fitting for Ian’s Leaf Data
The plot in Fig. 13.3 seems to have an outlying value towards the left, suggest-
ing that maybe we should think about using robust methods of analysis here,
too.
Repeat the analysis of Ian’s leaf economics data, as in Code Boxes 13.3–
13.4, using robust=TRUE. Do the results work out differently?
13.3 Controversies in the Allometry Literature
The literature on allometric line fitting has a history of controversy, making it difficult
for users to navigate. Some key points of contention are summarised below.
While it was argued in this chapter that allometry should be treated as a multivari-
ate problem, with two y variables, there is the question of when to actually do this
or whether one of the variables could serve as a predictor, such that we have a linear
model (as in Chap. 2). This would considerably simplify things and so should always
be considered. Smith (2009) argued that a good way to check if (standardised) major
axis techniques are warranted is to see if a problem makes as much sense if you flip
the axes around—this is a way to check if you have two responses (y1 and y2 , as in
this chapter) or a response and a predictor (y and x, as in Chap. 2). For example, is
the problem described in Code Box 13.1 really an issue, or was one or other of the
regressions fitted there the correct one?
Allometric data, being size measurements, are commonly log-transformed prior
to analysis. Packard (2013, and elsewhere) argues against transformation for a few
reasons, including interpretability, whereas Kerkhoff and Enquist (2009) argue that if
a variable is best understood as being the outcome of a set of multiplicative processes
(as size variables often are), then the log scale is more natural. Xiao et al. (2011)
make the obvious but important point that the decision to transform or not can be
informed by your data, by checking assumptions of the fitted model.
There was a fair bit of discussion in the fishery literature, going back to at least the
1970s (Ricker, 1973; Jolicoeur, 1975), about whether a major axis or standardised
major axis approach should be preferred. This is directly analogous to the question of
whether to do a PCA on a covariance or a correlation matrix, although, interestingly,
the problem has not really courted controversy when phrased this way. Inference
about slope is rarely of interest in PCA, but it is central to allometry, which may
be why the issue of how to estimate slope has taken on greater importance in the
allometric literature. A sensible way forward may, however, be to take advice from the
principal component literature—to standardise if variables are on different scales, or
perhaps if they differ appreciably in their variability. An alternative way to make the
decision is to think about the properties of the different lines, in particular whether
it is more desirable to fit a line that is invariant under changes of scale (SMA) or
under rotation (MA). Most would argue the former. Fortunately, because MA and
SMA tests for isometry are equivalent, in many situations it doesn’t actually matter
which is used.
It has been argued (Hansen & Bartoszek, 2012, for example) that because you
can also derive (S)MA methods from measurement error models, we should instead
estimate how much measurement error there is in y1 and y2 and use this to choose
the method of line fitting used. The problem here is that there are other sources of
error beyond measurement error, most importantly, due to a lack of model fit, and
studying measurement error is uninformative about how to measure lack of fit. For
example, humans lie above the line in Fig. 13.1, and the reason they do so is not
because of inaccuracies measuring body and brain size; it is because humans actually
have large brains for their body size. (Or should we say small bodies for their brain
size?) Looking at how accurately body and brain size have been measured should
not tell us how we allocate error from the line due to lack of fit; it should be treated
and adjusted for as a separate issue (Warton et al., 2006).
A lot of the ambiguity in the literature arises because when finding a line of best
fit there is no single correct answer, with different options available depending on
your objective. Technically, at least in the case of multivariate normal data, if we
wish to attribute error to both y1 and y2 , then a single line characterising the data
is unidentifiable (Moran, 1971) without additional information on the relative errors
in y1 vs y2 . That is, a line cannot be estimated without a decision first being made
concerning how to measure error from the line (e.g. in the vertical direction for linear
regression, perpendicular to the line for a major axis). Some authors have advised
on line-fitting methods using simulation studies, generating data from a particular
line, with errors in y1 and y2 , and looking at how well different methods reproduce
that line. But the unidentifiability issue means that even in simulated data there is no
single best line, hence no single right answer—it is more helpful to think of the data
as coming from some true (bivariate) distribution, with a true variance–covariance
matrix, which could be characterised in a number of different ways (in terms of its
SMA or MA, for example). So one could argue against using simulation studies to
decide which of SMA, MA, or other to use, because they all characterise different
aspects of the same data.
Recall that in this text it has been emphasised that analysis methods are informed
by data properties and research questions, so we always need to mind our Ps and
Qs. A key point to recognise, which can help navigate the conflicting views in the
allometry literature, is that the decision to use MA, SMA, linear regression, or
another type of line-fitting method is more about the Qs than the Ps.
Part III
Regression Analysis for Multivariate Abundances
Chapter 14
Multivariate Abundances—Inference About Environmental Associations
The most common type of multivariate data collected in ecology is also one of
the most challenging types to analyse—when some abundance-related measure (e.g.
counts, presence–absence, biomass) is simultaneously collected for all taxa or species
encountered in a sample, as in Exercises 14.1–14.3. The rest of the book will focus
on the analysis of these multivariate abundances.
What type of response variable(s) does he have? How should Anthony analyse
his data?
This type of data goes by lots of other names—“species by site data”, “community
composition data”, even sometimes “multivariate ecological data”, which sounds a
bit too broad, given that there are other types of multivariate data used in ecology
(such as allometric data, see Chap. 13). The term multivariate abundances is intended
to put the focus on the following key statistical properties.
Multivariate: There are many correlated response variables, sometimes more vari-
ables than there are observations:
– In Exercise 14.1, Anthony has 10 observations and 24 variables.
– In Exercise 14.2, David and Alistair have 57 observations and 16 variables.
– In Exercise 14.3, Lena has 179 observations and 16 variables.
Abundance: Abundance or presence–absence data usually exhibits a strong mean–
variance relationship, as in Fig. 14.1.
You need to account for both properties in your analysis.
Fig. 14.1: Mean–variance relationships for (a) David’s and Alistair’s data of Exercise
14.2 and (b) the revegetation study of Exercise 14.1
Multivariate abundance data are especially common in ecology, probably for two
reasons. Firstly, it is often of interest to say something collectively about a community,
e.g. in environmental impact assessment, we want to know if there is any impact of
some event on the ecological community. Secondly, this sort of data arises naturally
in sampling—even when you’re interested in some target species, others will often
be collected incidentally along the way, e.g. pitfall traps set specifically for ants
will inevitably capture a range of other types of invertebrate also. So even when
they are interested in something else, many ecologists end up with multivariate
abundances and feel like they should do something with them. In this second case
we do not have a good reason to analyse multivariate abundances. Only bother with
multivariate analysis if the primary research question of interest is multivariate, i.e.
if a community or assemblage of species is of primary interest. Don’t go multivariate
just because you have the data.
There are a few different types of questions one might wish to answer using
multivariate abundances. The most common type of question, as in each of Exer-
cises 14.1–14.3, is whether or not the community is associated with some predictor
(or set of predictors) characterising aspects of the environment—whether looking at
the effect on a community of an experimental treatment, testing for environmental
impact (Exercise 14.3), or something else again.
In Chap. 11 some multivariate regression techniques were introduced, and model-
based inference was used to study the effects of predictors on response. If there were
only a few taxa in the community, those methods would be applicable. But (as flagged
in Table 11.2) a key challenge with multivariate abundances is that typically there are
many responses. It’s called biodiversity for a reason! There are lots of different types
of organisms out there. The methods discussed in this chapter are types of high-
dimensional regression, intended for when you have many responses, but if you only
have a few responses, you might be better off back in Chap. 11. High-dimensional
regression is technically difficult and is currently a fast-moving field.
In this chapter we will use design-based inference (as in Chap. 9). Design-based
inference has been common in ecology for this sort of problem for a long time
as a way to handle the multivariate property, and the focus in this chapter will be
on applying design-based inference to models that appropriately account for the
mean–variance relationship in data (to also handle the abundance property). There
are some potential analysis options beyond design-based inference, which we will
discuss later.
Key Point
Multivariate abundance data (also “species by site data”, “community com-
position data”, and so forth) has two key properties: a multivariate property,
that there are many correlated response variables, and an abundance prop-
erty, a strong mean–variance relationship. It is important to account for both
properties in your analysis.
Generalised estimating equations (GEEs, Liang & Zeger, 1986; Zeger & Liang,
1986) are a fast way to fit a model to correlated counts, compared to hierarchical mod-
els (Chaps. 11–12). Design-based inference techniques like the bootstrap (Chap. 9)
tend to be computationally intensive, especially when applied to many correlated
response variables, so GEEs are a better choice when planning to use design-based
inference. Parameters from GEEs are also slightly easier to interpret than those of
a hierarchical model, because they specify marginal rather than conditional models,
so parameters in the mean model have direct implications for mean abundance (see
Maths Box 11.4 for problems with marginal interpretation of hierarchical parame-
ters).
GEEs are ad hoc extensions of equations used to estimate parameters in a GLM,
defined by taking the estimating equations from a GLM, forcing them to be multivari-
ate (Maths Box 14.1), and hoping for the best. An assumption about the correlation
structure of the data is required for GEEs. Independence of responses is commonly
assumed, sometimes called independence estimating equations, which simplifies
estimation to a GLM problem, and then correlation in the data is adjusted for later
(using “sandwich estimators” for standard errors, Hardin & Hilbe, 2002).
(where di = ∂μi/∂β = xi/g′(μi), as in Eq. 10.4). The GEE approach involves taking
these estimating equations and forcing them to be multivariate: the responses at each
observation are treated as a vector yi with working covariance matrix Vi, and β is
estimated by setting
u(β) = Σi Diᵀ Vi⁻¹ (yi − μi)     (14.1)
(where Di collects the derivatives di across responses) to zero.
Notice that whereas the score equations for GLMs are derived as the gradient
of the log-likelihood function, that is not how GEEs are derived. In fact,
unless responses are assumed to be normally distributed, or they are assumed
to be independent of each other, there is no GEE likelihood function. This
complicates inference, because standard likelihood-based tools such as AIC,
BIC, and likelihood ratio tests cannot be used because we cannot calculate a
GEE likelihood.
Some difficulties arise when using GEEs, because of the fact that they are moti-
vated from equations for estimating parameters, rather than from a parametric model
for the data. GEEs define marginal models for data, but (usually) not a joint model,
with the estimating equations no longer corresponding to the derivative of any known
likelihood function. One difficulty that this creates is that we cannot simulate data
under a GEE “model”. A second difficulty is that without a likelihood, a likelihood
ratio statistic can’t be constructed. Instead, another member of the “Holy Trinity
of Statistics” (Rao, 1973) could be used for inference, a Wald or score statistic.
Maybe this should be called the Destiny’s Child of Statistics (Maths Box 14.2),
because while the Wald and score statistics are good performers in their own right,
the likelihood ratio statistic is the main star (the Beyoncé). Wald statistics have been
met previously, with the output from summary for most R objects returning Wald
statistics. These statistics are based on parameter estimates under the alternative
hypothesis, by testing if parameter estimates are significantly different from what is
expected under the null hypothesis. A score statistic (or Rao’s score statistic) is based
on the estimating equations themselves, exploiting the fact that plausible estimates
of parameters should give values of the estimating equations that are close to zero.
Specifically, parameter estimates under the null hypothesis are plugged into the esti-
mating equations under the alternative hypothesis, and a statistic constructed to test
for evidence that the expression on the right-hand side of Eq. 14.1 is significantly
different from zero.
(Diagram: the log-likelihood curve, showing log(L0) at θ0 for M0 and log(L1) at θ̂1 for M1, with −2 log(Λ), the Wald test, and the score test indicated.)
The likelihood ratio test, −2 log Λ(M0, M1) = 2 ℓM1(θ̂1; y) − 2 ℓM0(θ0; y),
focuses on whether the likelihoods of the two models are significantly different
(vertical axis).
The Wald statistic focuses on the parameter (horizontal axis) of M1 and
whether θ̂1 is significantly far from what would be expected under M0, using
(θ̂1 − θ0)/σ̂(θ̂1).
The score statistic focuses on the score equation u(θ), the gradient of the
log-likelihood at M0. The likelihood should be nearly flat for a model that fits
the data well. So if M0 is the correct model, u(θ0) should be near zero, and
we can use u(θ0)/σ̂(u(θ0)) as a test statistic.
In GEEs, u(θ) is defined, hence θ can be estimated, but the likelihood is
not defined (unless assuming all variables are independent). So for correlated
counts we can use GEEs to calculate a Wald or score statistic, but not a
likelihood ratio statistic. Sorry, no Beyoncé!
An offset or a row effect term can be added to account for variation in sampling
intensity, which is useful for diversity partitioning, as discussed later (Sect. 14.3).
A working correlation matrix (R) is needed, specifying how abundances are
associated with each other across taxa. The simplest approach is to use independence
estimating equations, ignoring correlation for the purposes of estimation (assuming
R = I, a diagonal matrix of ones, with all correlations equal to zero), so that the
model simplifies to fitting a GLM separately to each response variable. This is pretty
much the simplest possible model that will account for the abundance property, and
by choosing a simple model, we hope that resampling won’t be computationally
prohibitive.
We need to handle the multivariate property of the data to make valid multivariate
inferences about the effects of predictors (environmental associations), and this can
be done by resampling rows of data. Resampling rows keeps site abundances for all
taxa together in resamples, to preserve the correlation between taxa. This is a form of
block resampling (Sect. 9.7.1). Correlation can also be accounted for in constructing
the test statistic.
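A minimal sketch (not from the text) of what resampling rows means for Anthony's data:
library(ecostats)
data(reveg)
nSites = nrow(reveg$abund)
iBoot = sample(nSites, replace=TRUE)   # resample site labels with replacement
abundBoot = reveg$abund[iBoot, ]       # whole rows are resampled, keeping all taxa from a site together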
The manyglm function in the mvabund package was written to carry out the
preceding operation, and it behaves a lot like glm, so it is relatively easy to use if you
are familiar with the methods of Chap. 10 (Code Box 14.1). It does, however, take
longer to run (for anova or summary), so sometimes you have to be patient. Unlike
the glm function, manyglm defaults to family="negative.binomial". This is
done because the package was designed to analyse multivariate abundances (hence
the name), and these are most commonly available as overdispersed counts.
> library(ecostats)
> library(mvabund)
> data(reveg)
> reveg$abundMV=mvabund(reveg$abund)
> ft_reveg=manyglm(abundMV~treatment+offset(log(pitfalls)),
    family="negative.binomial", data=reveg) # offset included as in Ex. 10.9
> anova(ft_reveg)
Time elapsed: 0 hr 0 min 9 sec
Analysis of Deviance Table
Multivariate test:
Res.Df Df.diff Dev Pr(>Dev)
(Intercept) 9
treatment 8 1 78.25 0.024 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Arguments:
Test statistics calculated assuming uncorrelated response (for faster computation)
P-value calculated using 999 iterations via PIT-trap resampling.
The manyglm function makes the same assumptions as for GLMs, plus a correlation
assumption:
1. The observed yi j -values are independent across observations (across i), after
conditioning on xi .
2. The yi j -values come from a known distribution (from the exponential family)
with known mean–variance relationship V(μi j ).
3. There is a straight-line relationship between some known function of the mean of $y_j$ and $\mathbf{x}_i$:
   $$g(\mu_{ij}) = \beta_{0j} + \mathbf{x}_i^\top \boldsymbol{\beta}_j$$
4. Residuals have a constant correlation matrix across observations.
Code Box 14.2: Checking Assumptions for the Revegetation Model of Code
Box 14.1
par(mfrow=c(1,3))
ft_reveg=manyglm(abundMV~treatment,offset=log(pitfalls),
family="negative.binomial", data=reveg)
plotenvelope(ft_reveg, which=1:3)
(Plots: the three plotenvelope diagnostic plots for the negative binomial fit, which=1:3 giving a residuals vs fitted values plot, a normal quantile plot, and a scale-location plot.)
ft_revegP=manyglm(abundMV~treatment, offset=log(pitfalls),
family="poisson", data=reveg)
par(mfrow=c(1,3))
plotenvelope(ft_revegP, which=1:3, sim.method="stand.norm")
(Plots: Residuals vs Fitted Values, Normal Quantile Plot, and Scale-Location Plot for the Poisson fit.)
Plotting sample variances against sample means for each taxon and treatment:
meanvar.plot(reveg$abundMV~reveg$treatment)
abline(a=0,b=1,col="darkgreen")
It can be worth using a couple of different test statistics, in case the structure in the data is captured by one
of these but not another. This approach would be especially advisable if using Wald
statistics, because they can be insensitive when many predicted values for a taxon
are zero (as in Chap. 10).
> anova(ft_reveg,test="wald",cor.type="shrink")
Time elapsed: 0 hr 0 min 6 sec
Analysis of Variance Table
Multivariate test:
Res.Df Df.diff wald Pr(>wald)
(Intercept) 9
treatment 8 1 8.698 0.039 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Arguments:
Test statistics calculated assuming correlated response via ridge regularization
P-value calculated using 999 iterations via PIT-trap resampling.
You can also use the summary function for manyglm objects, but the results aren’t
quite as trustworthy as for anova. The reason is that resamples are taken under the
alternative hypothesis for summary, where there is a greater chance of fitted values
being zero, especially for rarer taxa (e.g. if there is a treatment combination in which
a taxon is never present). Abundances don’t resample well if their predicted mean is
zero.
One major difference between glm and manyglm is in computation time. Analysing
your data using glm is near instantaneous, unless you have a very large dataset. But in
Code Box 14.1, an anova call to a manyglm object took almost 10 s, on a small dataset
with 24 response variables. Bigger datasets will take minutes, hours, or sometimes
days! The main problem is that resampling is computationally intensive—by default,
this function will fit a glm to each response variable 1000 times, so there are 24,000
GLMs in total. If an individual GLM were to take 1s to fit, then fitting 24,000 of
them would take almost 7 h (fortunately, that would only happen for a pretty large
dataset).
For large datasets, try setting nBoot=49 or 99 to get a faster but less precise
answer. Then scale it up to around 999 when you need a final answer for publication.
14.2 Design-Based Inference Using GEEs 343
You can also use the show.time="all" argument to get updates every 100 bootstrap
samples, e.g. anova(ft_reveg,nBoot=499,show.time="all").
If you are dealing with long computation times, parallel computing is a solution
to this problem—if you have 4 computing cores to run an analysis on, you could
send 250 resamples to each core then combine, to cut computation down four-fold.
If you have access to a computational cluster, you could even send 1000 separate
jobs to 1000 nodes, each consisting of just one resample, and reduce a hard problem
from days to minutes. By default, mvabund will split operations up across however
many nodes are available to it at the time.
Another issue to consider with long computation times is whether some of the taxa
can be removed from the analysis. Most datasets contain many taxa that are observed
very few times (e.g. singletons, seen only once, and doubletons or tripletons), and
these typically provide very little information to the analysis, while slowing compu-
tation times. The slowdown due to rarer taxa can be considerable because they are
more difficult to fit models to. So an obvious approach to consider is removing rarer
taxa from the analysis—this rarely results in loss of signal from the data but removes
a lot of noise, so typically you will get faster (and better) results from removing rare
species (as in Exercise 14.7). It is worth exploring this idea for yourself and seeing
what effect removing rarer taxa has on results. Removing species seen three or fewer
times is usually a pretty safe bet.
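A minimal sketch of this rule of thumb, interpreting "seen" as the number of samples in which a taxon was recorded (as in Exercise 15.3), and assuming the reveg objects of Code Box 14.1 are loaded:
notRare = colSums(reveg$abund>0) > 3 # taxa recorded in more than three samples
abundMVcommon = mvabund(reveg$abund[,notRare]) # analyse only these commoner taxa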
The manyglm function is currently limited to just a few choices of family and link
function to do with count or presence–absence data, focusing on distributions like the
negative binomial, Poisson, and binomial. An extension of it is the manyany function
(Code Box 14.5), which allows you to fit (in principle) any univariate function to
each column of data and use anova to resample rows to compare two competing
models. The cost of this added flexibility is that this function is very slow—manyglm
was coded in C (which is much faster than R) and optimised for speed, but manyany
was not.
Code Box 14.5: Analysing Ordinal Data from Habitat Configuration Study
Using manyany
Regression of ordinal data is not currently available in the manyglm function, but it can be
achieved using manyany:
> habOrd = counts = as.matrix( round(seaweed[,6:21]*seaweed$Wmass))
> habOrd[counts>0 & counts<10] = 1
> habOrd[counts>=10] = 2
> library(ordinal)
> summary(habOrd) # Amphipods are all "2" which would return an error in clm
> habOrd=habOrd[,-1] #remove Amphipods
> manyOrd=manyany(habOrd~Dist*Time*Size,"clm",data=seaweed)
> manyOrdNull=manyany(habOrd~Time*Size,"clm",data=seaweed)
> anova(manyOrdNull, manyOrd)
LR Pr(>LR)
sum-of-LR 101.1 0.12
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
What hypothesis has been tested here? Is there any evidence against it?
In the foregoing analyses, the focus was on modelling mean abundance, but some-
times we wish to focus on relative abundance or composition. The main reason for
wanting to do this is if there are changes in sampling intensity for reasons that can’t
be directly measured. For example, pitfall traps are often set in terrestrial systems to
catch insects, but some will be more effective than others because of factors unrelated
to the abundance of invertebrates, such as how well pitfall traps were placed and
the extent to which ground vegetation impedes movement in the vicinity of the trap
(Greenslade, 1964). A key point here is that some of the variation in abundance mea-
surements is due to changes in the way the sample was taken rather than being due to
changes in the study organisms—variation is explained by the sampling mechanism
as well as by ecological mechanisms. In this situation, only relative abundance across
taxa is of interest, after controlling for variation in sampling intensity. In principle, it
is straightforward to study relative abundances using a model-based approach—we
simply add a term to the model to account for variation in abundance across samples.
So the model for abundance at site i of taxon j becomes
$$g(\mu_{ij}) = \alpha_{0i} + \mathbf{x}_i^\top \boldsymbol{\alpha} + \beta_{0j} + \mathbf{x}_i^\top \boldsymbol{\beta}_j \qquad (14.3)$$
The new term in the model, α0i , accounts for variation across samples in total
abundance, so that remaining terms in the model can focus on change in relative
abundance. The optional term x i α quantifies how much of this variation in total
abundance can be explained by environmental variables. The terms in Eq. 14.3 have
thus been partitioned into those studying total abundance (the α) and those studying
relative abundance (the β). Put another way, the effects of environmental variables
have been split into main effects (the α) and their interactions with taxa (the β, which
take different values for different taxa). The model needs additional constraints for
all the terms to be estimable, which R handles automatically (e.g. by setting α01 = 0).
Key Point
Often the primary research interest is in studying the effects of environmental
variables on community composition or species turnover. This is especially
useful if some variation in abundance is explained by the sampling mechanism,
as well as ecological mechanisms. This can be accounted for in a multivariate
analysis by adding a “row effect” to the model, a term that takes a different value
for each sample according to its total abundance. Thus, all remaining terms
in the model estimate compositional effects (β-diversity), after controlling for
effects on total abundance (α-diversity).
1 He also defined γ-diversity, the richness of species in a region, but this is of less interest to us
here.
> ft_comp=manyglm(abundMV~treatment+offset(log(pitfalls)),
data=reveg, composition=TRUE)
> anova(ft_comp,nBoot=99)
Time elapsed: 0 hr 0 min 21 sec
Model: abundMV ~ cols + treatment + offset(log(pitfalls)) + rows
+ cols:(treatment + offset(log(pitfalls)))
For large datasets it may not be practical to convert data to long format, in which
case the composition=TRUE argument is not a practical option. In this situation a
so-called quick-and-dirty alternative for count data is to calculate the quantity
$$s_i = \log\left(\sum_{j=1}^{p} y_{ij}\right) - \log\left(\sum_{j=1}^{p} \hat{\mu}_{ij}\right) \qquad (14.4)$$
and use this as an offset (Code Box 14.8). The term $\hat{\mu}_{ij}$ refers to the predicted value for $y_{ij}$ from the model that would be fitted if you were to exclude the compositional term. The $s_i$ estimate the row effect for observation $i$ as the difference in log row sums between the data and what would be expected for a model without row effects.
The best way to use this approach would be to calculate a separate offset for each
model being compared, as in Code Box 14.8. If there was already an offset in the
model, it stays there, and we now add a second offset as well (Code Box 14.8).
Multivariate test:
Res.Df Df.diff Dev Pr(>Dev)
ft_row0 9
ft_row 8 1 50.26 0.048 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Arguments:
Test statistics calculated assuming uncorrelated response (for faster computation)
P-value calculated using 999 iterations via PIT-trap resampling.
This was over 10 times quicker than Code Box 14.7 (note that it used 10 times as many
resamples), but the results are slightly different—the test statistic is slightly smaller and the
P-value larger. Why do you think this might be the case?
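A minimal sketch of how offsets like those of Eq. 14.4 might be computed and used follows; this is only a guess at the gist of Code Box 14.8, and apart from ft_row0 and ft_row (which match the output above) the object names are hypothetical:
ft_alt0 = manyglm(abundMV~treatment+offset(log(pitfalls)), data=reveg) # no row-effect term
ft_null0 = manyglm(abundMV~offset(log(pitfalls)), data=reveg)
s_alt = log(rowSums(reveg$abund)) - log(rowSums(fitted(ft_alt0))) # Eq. 14.4, alternative model
s_null = log(rowSums(reveg$abund)) - log(rowSums(fitted(ft_null0))) # Eq. 14.4, null model
ft_row = manyglm(abundMV~treatment+offset(log(pitfalls))+offset(s_alt), data=reveg)
ft_row0 = manyglm(abundMV~offset(log(pitfalls))+offset(s_null), data=reveg)
anova(ft_row0, ft_row)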
The approach of Code Box 14.8 is quick (because it uses short format), and we will
call it quick-and-dirty for two reasons. Firstly, unless counts are Poisson, it does not
use the maximum likelihood estimator of α0i , so in this sense it can be considered
sub-optimal. Secondly, when resampling, the offset is not re-estimated for each
resample when it should be, so P-values become more approximate. Simulations
suggest this approach is conservative, so perhaps the main cost of using the quick-
and-dirty approach is loss of power—test statistics are typically slightly smaller and
less significant, as in Code Box 14.8. Thus the composition argument should be
preferred, where practical.
Key Point
A key challenge in any multivariate analysis is understanding what the main
story is and communicating it in a simple way. Some tools that can help with
this include the following:
• Identifying a short list of indicator taxa that capture most of the effect.
• Visualisation tools—maybe an ordination, but especially looking for ways
to see the key results using the raw data.
In Code Box 14.1, Anthony established that revegetation does affect invertebrate
communities. This means that somewhere in the invertebrate community, there is
evidence that some invertebrates responded to revegetation—maybe all taxa re-
sponded, or maybe just one did. The next step is to think about which of the response
variables most strongly express the revegetation effect. This can be done by adding
a p.uni argument as in Code Box 14.9.
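A call along the following lines produces output like that shown below (a sketch; the stored object name an_reveg matches its later use in Code Box 14.10):
> an_reveg = anova(ft_reveg, p.uni="adjusted")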
Multivariate test:
Res.Df Df.diff Dev Pr(>Dev)
(Intercept) 9
treatment 8 1 78.25 0.022 *
Univariate Tests:
Acarina Amphipoda Araneae Blattodea
Dev Pr(>Dev) Dev Pr(>Dev) Dev Pr(>Dev) Dev
(Intercept)
treatment 8.538 0.208 9.363 0.172 0.493 0.979 10.679
Coleoptera Collembola Dermaptera
Pr(>Dev) Dev Pr(>Dev) Dev Pr(>Dev) Dev
(Intercept)
treatment 0.117 9.741 0.151 6.786 0.307 0.196
...
The p.uni argument allows univariate test statistics to be stored for each response
variable, with P-values from separate tests reported for each response, to identify
taxa in which there is statistical evidence of an association with predictors. The
p.uni="adjusted" argument uses multiple testing, adjusting P-values to control
family-wise Type I error, so that the chance of a false positive is controlled jointly
across all responses (e.g. for each term in the model, if there were no effect of that
term, there would be a 10% chance, at most, of at least one response having a P-
value less than 0.1). The more response variables there are, the bigger the P-value
adjustment and the harder it is to get significant P-values, after adjusting for multiple
testing. It is not uncommon to get global significance but no univariate significance—
e.g. Anthony has good evidence of an effect on invertebrate communities but can’t
point to any individual taxon as being significant. This comes back to one of the
original arguments for why we do multivariate analysis (Chap. 11, introduction)—it
is more efficient statistically than separately analysing each response one at a time.
The size of univariate test statistics can be used as a guide to indicator taxa, those
that contribute most to a significant multivariate result. A test statistic constructed
assuming independence (cor.type="I") is a sum of univariate test statistics for
each response, so it is straightforward to work out what fraction of it is due to any
given subset of taxa. For example, the “top 5” taxa from Anthony’s revegetation
study account for more than half of the treatment effect (Code Box 14.10). This type
of approach offers a short list of taxa to focus on when studying the nature of a
treatment effect, which can be done by studying their coefficients (Code Box 14.10),
plotting the subset, and (especially for smaller datasets) looking at the raw data.
Code Box 14.10: Exploring Indicator Taxa Most Strongly Associated with
Treatment Effect in Anthony’s Revegetation Data
Firstly, sorting univariate test statistics and viewing the top 5:
> sortedRevegStats = sort(an_reveg$uni.test[2,],decreasing=T,
index.return=T)
> sortedRevegStats$x[1:5]
Blattodea Coleoptera Amphipoda Acarina Collembola
10.679374 9.741038 9.362519 8.537903 6.785946
How much of the overall treatment effect is due to these five orders of invertebrates? The
multivariate test statistic across all invertebrates, stored in an_reveg$table[2,3], is 78.25. Thus,
the proportion of the difference in deviance due to the top 5 taxa is
> sum(sortedRevegStats$x[1:5])/an_reveg$table[2,3]
[1] 0.5764636
So about 58% of the change in deviance is due to these five orders.
The model coefficients and corresponding standard errors for these five orders are as
follows:
> coef(ft_reveg)[,sortedRevegStats$ix[1:5]]
Blattodea Coleoptera Amphipoda Acarina Collembola
(Intercept) -0.3566749 -1.609438 -16.42495 1.064711 5.056246
treatmentReveg -3.3068867 5.009950 19.42990 2.518570 2.045361
> ft_reveg$stderr[,sortedRevegStats$ix[1:5]]
Blattodea Coleoptera Amphipoda Acarina Collembola
(Intercept) 0.3779645 1.004969 707.1068 0.5171539 0.4879159
This chapter has focused on design-based inference using GEEs. What other options
are there for making inferences about community–environment associations? A
few alternative frameworks could be used; their key features are summarised in
Table 14.1. Copula models are mentioned in the table and will be discussed in more
detail later (Chap. 17).
Table 14.1: Summary of the main differences in functionality of four frameworks for modelling multivariate abundances
Multivariate analysis in ecology has a history dating back to the 1950s (Bray &
Curtis, 1957, for example), whereas the other techniques mentioned in Table 14.1
are modern advances using technology not available in most of the twentieth century,
and only actually introduced to ecology in the 2010s (Walker and Jackson, 2011;
Wang et al., 2012; Popovic et al., 2019). In the intervening years, ecologists developed
some algorithms to answer research questions using multivariate abundance data,
which were quite clever considering the computational and technological constraints
of the time. These methods are still available and widely used in software like
PRIMER (Anderson et al., 2008), CANOCO (ter Braak & Smilauer, 1998), and free
versions such as in the ade4 (Dray et al., 2007) or vegan (Oksanen et al., 2017)
packages.
The methods in those packages (Clarke, 1993; Anderson, 2001, for example)
tend to be stand-alone algorithms that are not motivated by an underlying statistical
model for abundance,2 in contrast to GEEs and all other methods in this book (so-
called model-based approaches). The algorithmic methods are typically faster than
those using a model-based framework because they were developed a couple of
decades ago to deal with computational constraints that were much more inhibiting
than they are now. However, these computational gains come at potentially high
cost in terms of statistical performance, and algorithmic approaches are difficult to
reconcile conceptually with conventional regression approaches used elsewhere in
ecology (Chaps. 2–11). So while at the time of writing many algorithmic techniques
are still widely used and taught to ecologists, a movement has been gathering pace towards model-based alternatives.
Design-based inference was used in this chapter to make inferences from models
about community–environment associations. As in Chap. 9, design-based inference
is often used in place of model-based inference when the sampling distribution
of a statistic cannot be derived without making assumptions that are considered
unrealistic or when it is not possible to derive the sampling distribution at all. A bit
of both is happening here, with high dimensionality making it difficult to specify
good models for multivariate abundances and to work out the relevant distribution
theory. However, progress is being made on both fronts.
Fig. 14.2: Simulation results showing that while algorithmic approaches may be
valid, they are not necessarily efficient when testing no-effect null hypotheses. In
this simulation there were counts in two independent groups of observations (as in
Anthony’s revegetation study, Exercise 12.2), with identical means for all response
variables except for one, which had a large (10-fold) change in mean. Power (at the
0.05 significance level) is plotted against the variance of this one “effect variable”
when analysing (a) untransformed counts; (b) log(y + 1)-transformed counts using
dissimilarity-based approaches, compared to a model-based approach (“mvabund”).
A good method will have the power to detect a range of types of effects, but the
dissimilarity-based approaches only detect differences when expressed in responses
with high variance
In Chap. 14 the focus was on using design-based inference to test hypotheses about
environment–community associations and estimate model parameters with confi-
dence. But recall that sometimes the primary goal is not inference about the effect
of predictors. Other possible primary objectives are model selection, for example, to
study which type of model best predicts abundances in a community (Exercise 15.1),
or variable importance, studying which environmental variables are most useful for
predicting abundance (Exercise 15.2). In these cases, the problem of interest can be
framed in terms of prediction, and the techniques of Chap. 5 are applicable. However,
there is a lot that can go wrong, so the model selection technique and the predictive
model to be used should be chosen with care.
As usual, a suitable basis for choosing between competing models for data is to use an
information criterion (such as AIC and BIC) or cross-validation, on a suitably chosen
objective function. Like the models being fitted, the model selection method should
be chosen in a way that accounts for the key properties of the data, in particular, the
abundance and multivariate properties.
The abundance property of data can be accounted for using as the objective
function the log-likelihood of a model that specifies an appropriate mean–variance
relationship. If the log-likelihood is used in combination with (cross-)validation,
this is known as predictive likelihood. In predictive likelihood, we estimate model
parameters from training data, then use these to compute the likelihood for test data
(Code Box 15.1). This method is related to AIC, having been shown to estimate the
same quantity (Stone, 1977). The main distinction is that information criteria are a
form of model-based inference, requiring the model to be (close to) correct, whereas
cross-validation is a form of design-based inference, requiring only that training
and test datasets be independent. While mean squared error was used previously
(Sect. 5.2) to compare linear models, this does not account for the mean–variance
relationship of abundance data, which can be very strong. If using mean squared
error on multivariate abundances, undue weight would be put on how well
models predicted abundant (highly variable) taxa, at the expense of rare taxa. (And
mean squared error on the scale of the linear predictor might do the opposite.)
Model selection can account for the multivariate property of data in different
ways, depending on the approach taken. Information criteria, being a form of model-
based inference, require correlation to be accounted for in the model specification.
But correlation is not accounted for in the manyglm function, so while AIC can be
computed in this case, it should only be considered as a very rough guide. Cross-
validation, being a form of design-based inference, can handle the multivariate
property by assigning rows of observations to training and test samples, keeping
correlated observations at a site together in training or test sets. If there is additional
dependence structure in the rows, correlated rows should be kept together also (as in
Code Box 15.1). Cross-validation could be applied to manyglm or (in principle) any
function capable of predicting new values.
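A minimal sketch of row-wise predictive likelihood is below; it is not Code Box 15.1, and the objects abundMV and X, and the columns Station, Year, and Zone, are hypothetical placeholders modelled on the wind farm example:
stn = X$Station # grouping factor for rows
testStn = sample(levels(stn), size=round(nlevels(stn)/3)) # hold out a third of stations
isTest = stn %in% testStn # keeps all rows for a station together
ft_train = manyglm(abundMV[!isTest,]~Year*Zone, data=X[!isTest,], family="poisson")
muTest = predict(ft_train, newdata=X[isTest,], type="response")
sum( dpois(as.matrix(abundMV[isTest,]), muTest, log=TRUE) ) # predictive log-likelihood on test data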
A final important consideration when applying model selection to multivariate
abundances is how to make predictions for rare taxa. There is little information in a
rare taxon, so one should not expect good predictions from a model that separately
estimates parameters for each response variable, as in Eq. 14.2. In fact, model selec-
tion approaches can return some strange answers for rare taxa if they aren’t modelled
carefully. For example, in cross-validation (Code Box 15.1, Exercise 15.3), if a taxon
is completely absent from a training sample (or from a factor level in the training
sample), manyglm would have a predicted mean of zero and near-infinite regression
parameters (as in Maths Box 10.6). If such a taxon is then observed in the test data,
it will have near-infinite predictive likelihood. This problem is a type of overfitting,
and it is common when modelling multivariate abundances, especially with factor
predictors. Information criteria also can behave strangely when modelling many
responses if a model with many parameters is being used, for the reasons raised pre-
viously when discussing model-based inference in a different context (Sect. 14.6.2).
The solution to these issues is to use a simpler model that borrows strength across
taxa, as discussed in what follows.
Exercise 15.3: Cross-Validation for Wind Farm Data and Rare Species
Repeat the analyses of Code Box 15.1, but after removing rare species (ob-
served less than 10 times), using the following code:
notRare=colSums(windMV>0)>10
windMVnotRare=mvabund(windFarms$abund[,notRare])
Did you get a similar answer?
Note that so far we have only considered one test sample, and there is
randomness in the choice of training/test split. Repeat the analyses of Code
Box 15.1 as well as those you have done here, with and without rare species,
multiple times (which is a form of cross-validation).
Which set of results tends to be more reliable (less variable)—the ones with
or without the rare species? Why do you think this happened?
Key Point
Sometimes it is of primary interest to predict communities, or some
community-level property, from multivariate abundance data. A good pre-
dictive model will include the following features:
• The model will account for the multivariate and abundance properties.
• It will borrow strength across taxa to use information in more abundant
taxa to help guide predictions for rare taxa.
• It will be capable of handling non-linear responses of taxa to their environ-
ment.
It is hard to reliably estimate environmental responses for rare taxa because there is
little information in them that can be used for estimation. However, if model coeffi-
cients were estimated in a way that borrowed strength across responses, predictions
for rarer taxa could be guided in part by other taxa, for which more information
might be available.
When predicting multivariate abundances, it is a good idea to borrow strength
across response variables to improve predictions for rare taxa. This is done by
imposing some structure across taxa in some way (e.g. assuming they come from a
common distribution) to shrink parameters for rare taxa towards values commonly
seen for other taxa. Incidentally, a model that imposes structure across taxa will have
fewer parameters in it, making it easier to interpret, and sometimes it can be used to
better understand why different taxa respond to environment differently (Chap. 16).
One way to borrow strength is to shrink parameters towards a common value.
Fitting a mixed model, with a random effect for each environmental coefficient
that takes different values across taxa (Ovaskainen & Soininen, 2011, or see Code
Box 15.2), will shrink environmental coefficients towards each other. Rarer taxa will
tend to have their coefficients shrunk more strongly towards others. A LASSO is an
alternative way to shrink parameters (Code Box 15.3, using glmnet), which will
shrink parameters to exactly zero if they have little effect on response. An interesting
prospect, little studied for multivariate abundance data, is the group LASSO (Yuan &
Lin, 2006, Code Box 15.4, using the grplasso package), which groups parameters
into different types and shrinks the whole group towards zero jointly. This is quite a natural fit for multivariate abundances, because all coefficients for a given model term (across taxa) can be treated as a group, so a term enters or leaves the model for all taxa at once.
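As a concrete illustration of the mixed-model route to shrinkage, random environmental slopes across taxa could be fitted in long format using lme4; this is a sketch of the general idea only (not Code Box 15.2), and the data frame longData and its columns are hypothetical placeholders:
library(lme4)
# longData: one row per site-by-taxon combination, with columns abund (count),
# taxon (factor) and envStd (a standardised environmental predictor)
ft_mm = glmer(abund ~ envStd + (1|taxon) + (0+envStd|taxon), family=poisson, data=longData)
VarCorr(ft_mm) # the slope variance measures how much taxa differ in response to envStd
For overdispersed counts, glmer.nb could be used in place of glmer.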
Fig. 15.1: LASSO analyses of Lena’s wind farm data, including the regularisation
path for (a) a LASSO fit (Code Box 15.3) and (b) a group LASSO (Code Box 15.4),
and predictive likelihood for (c) the LASSO fit and (d) the group LASSO fit. Note
in (a) and (b) that all slope parameters start at zero for large values of the LASSO
penalty (λ) and fan out towards their unpenalised values (towards λ = 0). Different
types of model terms (e.g. for Zone, Year, and their interaction) have been colour-
coded; note that all parameters of the same type of model term enter at the same
time in the group LASSO fit because they are penalised as a group. Note that (c) and
(d) follow a J curve, but the curve is upside down (because we maximise likelihood,
not minimise error) and flipped horizontally (because smaller models have larger λ).
The optimal value for the LASSO penalty is λ ≈ 0.002 and for the group LASSO
λ ≈ 2. These values are not on the same scale so are not directly comparable
Code Box 15.3: Fitting LASSO to Wind Farm Data via glmnet
We will use the data in long format from windComp$data (Code Box 15.2). Note, however,
that the glmnet package currently doesn’t take formula inputs—it requires a design matrix
of predictors and a vector of responses as the first two arguments:
library(glmnet)
X = model.matrix(windMV~Year*Zone*cols,data=windComp$data)
y = windComp$data$windMV
windLasso = glmnet(X,y, family="poisson")
This fits a whole path of models, varying lambda (Fig. 15.1a). But which of these models
has the best fit? We can use validation to look at this using the same test stations as in Code
Box 15.1 (stored in isTestStn):
isTest = windComp$data$Station %in%
levels(windComp$data$Station)[isTestStn]
windLassoTrain = glmnet(X[isTest==FALSE,],y[isTest==FALSE],
family="poisson")
prLassoTest = predict(windLassoTrain,X[isTest,],type="response")
predLLlasso=colSums(dpois(windComp$data$windMV[isTest],prLassoTest,
log=TRUE))
plot(windLassoTrain$lambda,predLLlasso,type="l",log="x")
isBestLambda = which(predLLlasso==max(predLLlasso))
Results are in Fig. 15.1c. The value of λ for the best-fitting model is stored in
windLassoTrain$lambda[isBestLambda], and coefficients for this model are stored
in coef(windLassoTrain)[,isBestLambda]. This model includes 58 non-zero param-
eters, across main effects and interactions for different species.
Code Box 15.5: Reduced Rank Regression for Wind Farm Data
VGAM accepts data in short format. To fit a rank two Poisson reduced rank regression:
> library(VGAM)
> wind_RR2=rrvglm(as.matrix(windFarms$abund)~Year*Zone,
family=poissonff, data=windFarms$X, Rank=2)
We could compare this to a model fitted via manyglm:
> wind_manyglm = manyglm(windMV~Year*Zone, data=windFarms$X,
family=poisson())
> c( BIC(wind_RR2), sum(BIC(wind_manyglm)))
[1] 2626.993 2742.920
Which model fits better, according to BIC?
Linear predictors for each “archetype”, characterising how fish species typically respond,
are obtained using latvar:
zoneyear = interaction(windFarms$X$Zone,windFarms$X$Year)
matplot(as.numeric(zoneyear),latvar(wind_RR2),pch=c(1,19))
(Plot: latent variable/linear predictor values for Archetype 1 and Archetype 2, plotted against the Zone-by-Year combinations.)
Note that this model does not detect a strong effect of wind farms and that some linear
predictors take very high or very low values, suggesting overfitting.
Some studies (Elith et al., 2006, for example) suggest models predict better if they
can handle non-linearity and interactions between environmental variables. Non-
linearity is important to think about because many species have an optimal value for
environmental variables, at which they can be found in high abundance, and they
reduce in abundance as you move away from this optimum in either direction (a
uni-modal response, ter Braak & Prentice, 1988). For example, consider the response of a species to temperature: every organism has a point where it is too hot
and a point where it is too cold. This idea cannot be captured using linear terms
alone, and at the very least a quadratic term would be needed to account for this,
or some other method to handle non-linearity (such as in Chap. 8). One could argue
that, as a rule, quantitative environmental variables should be entered into models for
species response in a manner that can handle non-linear responses, and uni-modal
responses in particular (ter Braak & Prentice, 1988).
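For example (a sketch with hypothetical object names abundMV, envData and temperature), a uni-modal response to a quantitative predictor can be accommodated by entering it as a quadratic term:
ft_quad = manyglm(abundMV ~ poly(temperature,2), family="negative.binomial", data=envData)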
Generalised additive models (Chap. 8) are another possible way to handle non-
linearity, but this method is quite data hungry and doesn’t always hold up well in
comparisons of predictive performance for this sort of data (Norberg et al., 2019).
Alternatives include the mistnet package (Harris, 2015), which can fit a hierarchical
model via artificial neural networks to presence–absence data, and the marge package
(Stoklosa & Warton, 2018), a generalised estimating equation (GEE) extension of
multivariate adaptive regression splines (“MARS”, Friedman, 1991, also see Elith &
Leathwick, 2007). There has been a boom in machine learning methods recently, and
many of these are capable of flexibly handling non-linear responses, such as artificial
neural networks (Olden et al., 2008) and deep learning (Christin et al., 2019, 2021).
Code Box 15.6: Using the LASSO for Petrus’s Spider Data
We will fit a negative binomial regression to Petrus’s spider data, but with a LASSO penalty
to force parameters to zero if they are not related to response. This will be done using the
traitglm function, which coerces data into long format and stores coefficients in a matrix:
library(mvabund)
data(spider)
# fit model:
spid.trait = traitglm(spider$abund,spider$x,method="cv.glm1path")
The method="cv.glm1path" argument meant that the model was fitted using the glm1path
function, a LASSO algorithm written into mvabund that works for negative binomial regres-
sion. Cross-validation was used to choose the value of the LASSO penalty parameter.
To plot coefficients in a heat map where white means zero (Fig. 15.2):
library(lattice)
a = max( abs(spid.trait$fourth.corner) )
colort = colorRampPalette(c("blue","white","red"))
plot.4th = levelplot(t(as.matrix(spid.trait$fourth.corner)),
xlab="Environmental Variables", ylab="Species",
col.regions=colort(100), at=seq(-a, a, length=100),
scales = list( x= list(rot = 45)) )
print(plot.4th)
Which predictors tend to have the strongest effects on community abundance? This
question asks us to quantify variable importance, which can be addressed as pre-
viously (Sect. 5.7), e.g. using a leave-one-out change in deviance (via the drop1
function) or looking at the size of standardised coefficients (as in Code Box 15.6).
Other options are to use a mixed model, with a random effect on environmental coef-
ficients that takes a different value for each taxon, and to look at the size of variance
components for standardised predictors (Code Box 15.7). The larger the variance
component, the more the standardised coefficients vary across taxa, hence the greater the variation in community composition due to that predictor. In Code Box 15.7, this approach would suggest that soil dryness is the most important predictor of changes in hunting spider composition, with a variance component of 1.84, almost three times larger than for any other predictor. Note that predictors need to be standardised prior to this analysis—otherwise the slope coefficients are on different scales for the different predictors, so their variance components are, too, and comparing their relative sizes would not be meaningful.

Fig. 15.2: Heat map of standardised coefficients from negative binomial LASSO regression of Code Box 15.6. Darker coefficients correspond to stronger effects. Which environmental variables seem to have the strongest effect on spider abundances?
Note that all of the aforementioned methods estimate the conditional effects of
variables, and careful thought is required to establish if that is what is really of
interest (Sect. 5.7).
Key Point
To study how taxa differ in environmental response, a useful tool is to classify
them into archetypes based on their environmental response. This can help
characterise the main ways taxa differ in their environmental responses.
A finite mixture model with G components assumes that each observation comes
from one of G distinct component distributions, but we don’t know in advance which
observation comes from which component, and estimate this from the data. We make
some assumptions about the form of the component distributions (e.g. normal, Pois-
son, . . .), but estimate their parameters from the data, as well as estimating the overall
(“prior”) proportion of observations falling in each component and the “posterior”
probability that each observation belongs to any given component. Hence, mixture
models are a form of soft classification, with observations assigned a probability of
belonging to each of the G components, rather than being simply assigned to one
component (hard classification). Those familiar with Bayesian analysis will know
the terms “prior” and “posterior”, but they are being used in a different context
here, and a finite mixture model is not necessarily fitted using Bayesian techniques.
In fact, mixture models are most often fitted by maximum likelihood.
We will use mixture models primarily for classification, where we wish to classify
observations according to how well they fit into each of several component distribu-
tions. In particular, we want to classify taxa based on how well their environmental
response falls into one of several categories of response type.
There are other reasons, beyond classification, you might want to fit mixture
models. In particular, they are a flexible method of fitting distributions to data, capable
of generating weird distributions for weird data. The most familiar example of this
in ecology is zero-inflated distributions (Welsh et al., 1996), count distributions with
extra zeros in them, often fitted as a two-component mixture of a count distribution
and a distribution that is guaranteed to take the value zero (degenerate at zero).
Another important reason for fitting a mixture model is as a simple way to model
heterogeneity, assuming all observations come from one of a few distinct “types”.
An example of this is in capture–recapture modelling (Pledger, 2000), to account for
differences in the capture probability of different individuals (e.g. due to differences
in behaviour). To read more about mixture models and other ways they are used, see
McLachlan and Peel (2000).
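To make the zero-inflation example concrete, a zero-inflated Poisson (a two-component mixture of a Poisson count distribution and a point mass at zero) can be fitted with the pscl package; a sketch only, with a hypothetical data frame dat:
library(pscl)
# count model for abundance as a function of habitat; intercept-only model for the extra zeros
ft_zip = zeroinfl(count ~ habitat | 1, data=dat, dist="poisson")
summary(ft_zip)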
“Mixture model” sounds a lot like “mixed model”, but it is important not to
confuse the two; they are quite different models.1
$$g(m_{ijk}) = \beta_{0j} + \mathbf{x}_i^\top \boldsymbol{\beta}_k$$
for some link function g(·) and k ∈ {1, 2}. Taxa are assumed to be indepen-
dently drawn from component 1 (with probability π) or 2 (with probability
1 − π). If we also assume that abundances are independent (conditional on
group membership), the log-likelihood function is
$$\ell(\boldsymbol{\beta}) = \sum_{j=1}^{p} \log\left\{ \pi \prod_{i=1}^{N} f(y_{ij}; m_{ij1}) + (1 - \pi) \prod_{i=1}^{N} f(y_{ij}; m_{ij2}) \right\}$$
1 Although technically they are actually related, a mixture model is a type of mixed model where
the random effect is not normally distributed; instead it has a multinomial distribution that takes G
different values.
(Plot: the log-likelihood of a two-component mixture as a function of a parameter β, showing more than one stationary point.)
Previously we maximised the likelihood function by finding its stationary point
and solving the subsequent score equations (Maths Box 10.2). But now there is
more than one stationary point, and some may not be maxima. To have a better
chance of finding the global maximum, we should give the search algorithm
thoughtful initial guesses for parameters (starting values) or try multiple runs
with different starting values and keep the solution with the largest likelihood.
Or try a bit of both.
A species archetype model fits a mixture model to classify taxa based on their
environmental response (Dunstan et al., 2011), as in Fig. 16.1. This is a special type
of mixture model for the regression setting, known as a finite mixture of regressions,
where we assume that observations come from one of a few distinct regression lines.
Instead of having to estimate and study a different environmental response for every
taxon, we can focus our efforts on estimating and characterising a smaller number
of so-called archetypal environmental responses.
A species archetype model is a mixture of generalised linear models (GLMs),
where we mix on the regression parameters, assuming the following:
• Each taxon is independently drawn from one of G archetypes. The (prior) proba-
bility for the kth component is πk .
• The observed yi j -values (i for observation/replicate, j for taxon) are independent,
conditional on the mean mi jk for archetype k.
• Conditional on their mean mi jk , the y-values come from a known distribution
(from the exponential family) with known mean–variance relationship V(mi jk ).
• Within each archetype, there is a straight-line relationship between some known
function of the mean of y and each x, with a separate intercept for each taxon β0j :
$$g(m_{ijk}) = \beta_{0j} + \mathbf{x}_i^\top \boldsymbol{\beta}_k$$
Example code for fitting a species archetype model to Petrus’s spider data is in Code
Box 16.1. This uses the species_mix function in the ecomix package, which fits
models by maximum likelihood. The likelihood surface of a mixture model is often
quite bumpy, making it hard to find the maximum. It is a good idea to fit models more
than once from different starting points (as in Sect. 12.3), and if you get different
answers, use the one with highest likelihood. The species_mix function can be used
as for most R regression functions; the only trick to note is that the response needs
to be entered in matrix format. As usual, a family for the component GLMs needs to
be specified via the family argument, which accepts character values only, including
"negative.binomial" for negative binomial regression, "bernoulli" for logistic
regression of presence-absence data, and "tweedie" for Tweedie GLMs of biomass
data. Finally, the number of component archetypes G needs to be specified in advance
(via the nArchetypes argument) and defaults to three. A control argument has
been specified in Code Box 16.1, with parameters as suggested by the package author,
to improve convergence for small and noisy datasets.
Code Box 16.1: Fitting a Species Archetype Model to Petrus’s Spider Data
The species_mix function from the ecomix package will be used to fit a species archetype
model to Petrus’s spider data to classify species by environmental response:
> library(mvabund)
> library(ecomix)
> data(spider)
> SpiderDF=data.frame(spider$x)
> SpiderDF$abund=as.matrix(spider$abund)
> spiderFormula = abund ~ soil.dry + bare.sand + fallen.leaves + moss +
herb.layer + reflection
> ft_Mix = species_mix(spiderFormula, data=SpiderDF,
family="negative.binomial", nArchetypes=2,
control=list(init_method='kmeans',ecm_refit=5, ecm_steps=2) )
SAM modelling
There are 2 archetypes to group the species into.
There are 28 site observations for 12 species.
...
iter 60 value 777.946121
iter 70 value 777.933389
final value 777.933345
converged
The values reported are negative log-likelihood (so the smaller, the better). This function
doesn't always converge to a global maximum, so try several times and stick with the fit that
has the smallest negative log-likelihood (i.e. the largest likelihood). You can ignore errors
earlier in the output if the final model converges.
> coef(ft_Mix)$beta
soil.dry bare.sand fallen.leaves moss herb.layer reflection
Archetype1 1.5627792 -0.05472319 -0.2784647 -0.1450093 0.7081597 -0.4730827
Archetype2 -0.1070417 0.22425755 -0.2854891 0.4187408 0.5896159 0.4760607
Which predictors vary the most across archetypes (hence across taxa)? Is this what you
saw in Code Box 15.7?
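Because the fit does not always converge to the global maximum, it can help to refit from several starting points and keep the best fit, e.g. along these lines (a sketch; it assumes refits use different random initialisations and that logLik has a method for species_mix objects; if not, compare the final negative log-likelihood values printed during fitting):
fts = lapply(1:5, function(i) species_mix(spiderFormula, data=SpiderDF,
    family="negative.binomial", nArchetypes=2))
lls = sapply(fts, logLik) # log-likelihood of each fit
ft_Mix = fts[[which.max(lls)]] # keep the fit with the largest log-likelihood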
Key parts of the output to look at are the model coefficients (available as usual via
the coef function), in particular the regression coefficients for archetypes (stored as
coef(ftMix)$beta for a fitted model ftMix), the posterior probabilities of group
membership (ftMix$tau, which are often mostly zero and one), and the intercept
terms (coef(ftMix)$alpha).
In terms of research questions, species archetype models are best used for exploratory
work, understanding the reasons why environmental response varies across taxa, as
for Exercise 16.1. They can also be used for prediction, given their ability to borrow
strength across taxa to improve predictions for rare taxa (Hui et al., 2013).
Recall that multivariate abundances always have two key data properties, a mul-
tivariate assumption (abundances are correlated across taxa) and an abundance as-
sumption (mean–variance relationship). A GLM approach is used to handle the
abundance assumption, and standard GLM tools can be used to check this assump-
tion, and linearity, as in Code Box 16.2. However, note that species archetype models
lack any correlation in the model and instead assume (conditional) independence of
abundances across taxa. This assumption is rarely reasonable, so community-level
inferences from these models should be treated with caution. But there is no issue if
using the method for exploratory purposes or for prediction of abundances in each
taxon since the prediction of individual response variables is usually little affected
by correlation across responses.
Code Box 16.2: Minding Your Ps and Qs for Petrus’s Species Archetype
Model
Taking the fit from Code Box 16.1 we can just ask for a plot:
> plot(ft_Mix, fitted.scale="log")
Code Box 16.3 fits models with different numbers of archetypes to Petrus’s spider
data and plots BIC as a function of number of archetypes.
Code Box 16.3: Choosing the Number of Archetypes for Petrus’s Spider
Data
nClust=rep(2:6,3)
bics = rep(NA, length(nClust))
for(iClust in 1:length(nClust))
{
  fti_Mix = species_mix(spiderFormula, data=SpiderDF,
      family="negative.binomial", nArchetypes=nClust[iClust],
      control=list(init_method='kmeans',ecm_refit=5, ecm_steps=2))
  bics[iClust] = BIC(fti_Mix) # store BIC for this fit (assumes a BIC method for species_mix)
}
plot(bics~nClust, ylab="BIC", xlab="# archetypes")
Documenting how taxa vary in environmental response is only part of the battle. If
that is all a community ecologist does, Shipley (2010) argued they are behaving like
a “demented accountant”, effectively keeping separate records of what different taxa
do, without really trying to make sense of what is happening across taxa. In science
we want to understand processes, meaning we want to look at why species differ in
their environmental response (as in Exercise 16.4).
Regression is a key tool for answering why questions, because you can introduce
predictors to try to explain why a response varies. Here we need predictors across
taxa, the columns of the multivariate abundance dataset, in order to explain why taxa
have different patterns of environmental response. These are commonly referred to
as species traits or functional traits, and it has been argued that they should have
a much greater role in community ecology (McGill et al., 2006; Shipley, 2010).
Because we are using traits to understand changes in environmental response, we are addressing what is known as the fourth corner problem (Legendre et al., 1997).
Key Point
To study why taxa differ in environmental response, a useful tool is to measure
functional traits for different taxa and predict abundance as a function of
these traits, environmental variables, and their interaction (a fourth corner
model). Of primary interest is the fourth corner, the matrix characterising how
environment and traits interact, which is simply the matrix of environment–
trait interaction coefficients.
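Equation 16.2 is referred to repeatedly below; pieced together from the terms it is described as containing, it has approximately the following form (a reconstruction, with notation assumed):
$$g(\mu_{ij}) = \beta_{0j} + \mathbf{x}_i^\top \boldsymbol{\beta}^{(x)} + \mathbf{z}_j^\top \boldsymbol{\beta}^{(z)} + (\mathbf{x}_i \otimes \mathbf{z}_j)^\top \boldsymbol{\beta}^{(x \times z)} + \mathbf{x}_i^\top \mathbf{b}_j$$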
The term $\mathbf{x} \otimes \mathbf{z}$ denotes the interaction, which captures the idea that environmental response varies across taxa due to their traits. Importantly, the matrix of interaction coefficients $\boldsymbol{\beta}^{(x \times z)}$ can be understood as the fourth corner (Fig. 16.2b). Thus, this
type of regression model could be described as a fourth corner model. In principle,
any regression framework could be used to construct a fourth corner model by
incorporating trait variables and their interaction with other predictors. For example,
a fourth corner generalised latent variable model could be constructed by adding
latent variables to the mean model in Eq. 16.2. To date, software for resampling-
Fig. 16.2: Schematic diagrams illustrating approaches to the fourth corner problem.
(a) The problem as presented in Legendre et al. (1997), as matrices for multivariate
abundance, environmental variables, and functional traits, with the goal being to
estimate a matrix connecting environment and traits (labelled with a question mark).
(b) A regression model for this problem views the multivariate abundances as the response (Y), with environmental variables (X) and species traits (Z) as predictors, and the matrix of interaction coefficients for X × Z as the fourth corner matrix
based testing (Wang et al., 2012) and ordination (Hui, 2016; Ovaskainen et al.,
2017b; Niku et al., 2019) has fourth corner functionality.
Figure 16.3 illustrates conceptually the relationship between fourth corner models
and some other models described in this text that treat abundance as the response
(as in Warton et al., 2015). In particular, the essential distinction from multivariate
regression models introduced in Chap. 14 is the use of predictors on columns of
abundance (traits) as well as on rows (environment).
The main effect term for traits $\mathbf{z}_j^\top \boldsymbol{\beta}^{(z)}$ in Eq. 16.2 can usually be omitted. This
main effect term estimates changes in total abundance across responses due to traits,
but β0j already ensures that responses can differ from each other in total abundance.
In practice, the main effect for traits is usually left out of the model, although it
could remain if a random effect were put on the β0j . If a row effect were included in
the model (as in Sect. 14.3), then the main effect for environment could similarly be
omitted from the model.
In Eq. 16.2, the term $\mathbf{x}_i^\top \mathbf{b}_j$, where the $\mathbf{b}_j$ are typically drawn from a (multivariate) normal distribution, allows taxon $j$ to vary in environmental response for reasons not explained by its traits ($\mathbf{z}_j$). It is important to include this term when making inferences
about environment–trait associations, because it is unrealistic to expect all varia-
tion in environmental response across taxa to be explained by the traits included
in the model. Incorporating this term requires the use of regression software with
mixed model functionality or perhaps penalised likelihood estimation (e.g. using a
LASSO).
Fig. 16.3: Diagram showing the interrelationship between fourth corner models and other methods we have seen. A GLM (like that for David and Alistair's crab data in Chap. 10)
and multivariate regression (as for Anthony’s data in Chap. 14) both relate abundance
to environmental variables. Now we are adding a new type of predictor, Traits, which
acts across columns (Taxa) rather than across rows (Sites). The interaction between
environmental and trait variables explains differences in environmental response
across taxa
A fourth corner model can be fitted using any software suitable for analysing multi-
variate abundance data, capable of including predictors that operate across columns
(taxa) as well as down rows (sites). While some of the original fourth corner models
were fitted using standard mixed modelling software (lme4 was used in Jamil et al.,
2013), it is important to also account for the multivariate property (Chap. 14), that
abundances are correlated across taxa. In the following we will focus on using a
generalised linear latent variable model (as in Chap. 12) for this reason.
Here $g(\mu_{ij})$ specifies a fourth corner model as in Eq. 16.2 (with latent variables added, as in Chap. 12), and as previously $F$ is a member of the exponential family of distributions (Maths Box 10.1).
Code Box 16.4 fits this model using the gllvm package (Niku et al., 2019) in
R; another candidate is the Hmsc package (Ovaskainen et al., 2017b; Ovaskainen
& Abrego, 2020, Section 6.3). Recall that taxon-specific terms are required in the
model to handle variation in environmental response not explained by traits (the
x i b j from Eq. 16.2), done using the randomX argument in the gllvm package.
Fourth corner models can also be fitted using the traitglm function of the
mvabund package, but it is best used as an exploratory tool only. ter Braak (2017)
showed that traitglm has problems making inferences about environment–trait
associations, which arise because taxon-specific terms are not included in traitglm
when a trait argument (Q) has been specified. The reason taxon-specific terms are
omitted is that anova calls in the mvabund package make use of design-based
inference (as in Chap. 14), which has difficulty dealing with random effects.
Code Box 16.4: A Fourth Corner Model for Spider Data Using gllvm
We will fit a fourth corner model using the gllvm package. The randomX argument is
included to capture species-specific environmental responses not explained by other pre-
dictors. Only soil dryness and herb cover are included as predictors to capture the main
environmental trends. Note that with only 12 species, there is not enough information in the
data to estimate random effects across many traits.
library(gllvm)
data(spider)
X = spider$x[,c("soil.dry","herb.layer")]
ft_trait = gllvm(spider$abund, X, spider$trait,
randomX=~soil.dry+herb.layer, family="negative.binomial")
logLik(ft_trait)
The log-likelihood bounces around a little, but for a good fit it should be greater than −721.
Fourth corner coefficients, stored in ft_trait$fourth.corner, capture interactions
between environmental and trait variables. They can be plotted to study patterns in environ-
mental response across species that can be explained by traits.
library(lattice)
a = max( abs(ft_trait$fourth.corner) )
colort = colorRampPalette(c("blue","white","red"))
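The plotting call itself would presumably mirror Code Box 15.6, along these lines (a sketch):
plot.4th = levelplot(t(as.matrix(ft_trait$fourth.corner)), xlab="Environment",
    ylab="Traits", col.regions=colort(100), at=seq(-a, a, length=100))
print(plot.4th)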
Fig. 16.4: Fourth corner coefficients from model fitted in Code Box 16.4, plotted as
follows: (a) heat map of fourth corner; (b) confidence intervals for all coefficients
(Eq. 16.2). Fourth corner coefficients explain how the environmental response of
different taxa varies with their traits. For example, the negative interaction coefficient
between soil dryness and spots suggests that as soil gets drier, spiders with spots on
them are less likely to be found
Two ways to visualise fourth corner coefficients are presented in Fig. 16.4. However,
often it is hard to understand an interaction from the coefficients alone, and it is
advisable to look for ways to plot how abundance varies with key combinations of
environmental variables and traits. A standard tool for this purpose, if at least one of
the predictors is categorical, is to use an interaction plot (Code Box 16.5). Currently
these plots must be produced manually using the predict function. Looking at
this plot, we see clearly that, when moving along a gradient towards increasing soil
dryness, we expect to see fewer spiders with spots and more spiders with stripes.
There were only two spider species in the dataset with spots, and both had negative species-specific slopes (Alopacce and Arctperi; Fig. 16.1a).
If both predictors are quantitative, then an interaction plot can’t be constructed,
unless one of these is binned into categories. An alternative would be to plot a heat
map of predicted values as a function of the trait and environmental variable of
interest (over the range of these values that was jointly observed).
An issue to keep in mind is that if different responses are assumed to have different
intercepts, predictions for different responses will have intercepts that differ from
each other in fairly arbitrary ways. So, for example, in the figure produced by Code
Box 16.5, each line is for a different response, so only the slope of the lines is
meaningful, but the positions of the lines relative to each other in a vertical direction
are not.
Code Box 16.5: A Fourth Corner Interaction Plot for Petrus’s Spider Data
To manually construct a fourth corner interaction plot, to study how the response of spider
abundance to soil dryness varies for spiders with different markings on them:
nVars = dim(spider$abund)[2]
newTraits = spider$trait
# set factors not of interest here to be a constant value
newTraits$length= mean(spider$trait$length) #set length to its mean
newTraits$colour=factor(rep(levels(spider$trait$colour)[1],nVars),
levels=levels(spider$trait$colour)) #set to first level of factor
# set starting rows of 'marks' to take all possible values
nMarks = nlevels(spider$trait$marks)
newTraits$marks[1:nMarks]=levels(spider$trait$marks)
# create a new env dataset where the only thing that varies is soil.dry:
newEnv = spider$x[1:2,c("soil.dry","herb.layer")]
newEnv[,"soil.dry"]=range(scale(spider$x[,"soil.dry"]))
newEnv[,"herb.layer"]=0
#make predictions and plot:
newPreds = predict(ft_trait,newX=newEnv,newTR=newTraits,type="response")
matplot(newEnv[,1], newPreds[,1:nMarks],type="l", log="y")
legend("topright",levels(newTraits$marks),lty=1:nMarks,col=1:nMarks)
Intercept terms are arbitrary (species-specific), and only the slopes on this plot are
meaningful. Spider species with spots tend to decrease in abundance on drier soil, whereas
there seems to be little change in response to soil dryness otherwise.
In Exercise 16.4, Petrus wants to quantify the extent to which traits explain variation in environmental response. This can be done by fitting three models: a main effects model, which captures effects of environmental variables on total abundance; a fourth corner model; and a species-specific model, which lets different species have different environmental responses. Following the ideas in Sect. 14.3, the main effects model can be understood as capturing effects of environmental variables on α-diversity. The additional terms in the species-specific model capture β-diversity along these environmental gradients, and the fourth corner model attempts to capture this β-diversity using traits. A simple way to quantify how effectively it does this is to fit all three models and compare their deviances, e.g. using the anova function in the gllvm package (Code Box 16.6).
The proportion of β-diversity deviance explained, as previously, is a type of R2 measure and, like any such measure, it does not account for model complexity. This means that as more traits are added to the model, the proportion of deviance explained will increase. To account for this, one could use cross-validation to instead compute the proportion of β-diversity deviance explained at new sites.
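The following is a minimal sketch of such a comparison (not Code Box 16.6 itself), reusing X and the spider data from Code Box 16.4; the object names are arbitrary, the trait names (length, colour, marks) are those used in Code Box 16.5, and the formula syntax and the behaviour of anova for gllvm objects should be checked against the gllvm documentation.
ftMain = gllvm(spider$abund, X, spider$trait, family="negative.binomial",
  formula = ~ soil.dry + herb.layer) # environmental effects common to all species
ft4th = gllvm(spider$abund, X, spider$trait, family="negative.binomial",
  formula = ~ (soil.dry + herb.layer) +
    (soil.dry + herb.layer):(length + colour + marks)) # fourth corner terms (randomX omitted for simplicity)
ftSpp = gllvm(spider$abund, X, family="negative.binomial") # species-specific environmental responses
anova(ftMain, ft4th, ftSpp)
# proportion of beta-diversity deviance explained by traits (an R^2-type measure):
as.numeric(logLik(ft4th) - logLik(ftMain)) / as.numeric(logLik(ftSpp) - logLik(ftMain))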
The previous two chapters focused on the regression coefficients β. In this chapter, we will focus instead on the variance–covariance matrix Σ of the multivariate normal random effect across taxa, ε_i ∼ MVN(0, Σ).
The variance–covariance matrix Σ can be understood as capturing co-occurrence
patterns—positive correlations across taxa indicate that they co-occur more often
than expected by chance. Negative correlations across taxa indicate that the taxa
co-occur less often than expected by chance.
There are multiple reasons taxa may co-occur more or less often than expected
by chance. Broadly, this may be due to interactions across taxa (e.g. predation,
facilitation), or it may be because of shared responses to another variable. We will
further break down these potential causes of co-occurrence as follows:
1. The taxa may interact directly.
2. The taxa may both respond to a common mediator taxon.
3. Both taxa may respond to a common environmental variable.
The modelling tools presented in this chapter can isolate some of the effects of (3) or
potentially (2), but only in the case where the environmental variable of interest has
been included in the model and its effect has been estimated correctly. A model fitted
to observational data can never tell the difference between species interaction (1)
and shared response to some unmeasured or imperfectly modelled predictor (3). It is
important to keep this qualification in mind when studying co-occurrence, because
a couple of things we know for sure in ecology are that organisms respond to their
environment and that our models are never quite right. There should always be an
expectation that some taxa will respond to common aspects of their environment in
ways that are not perfectly captured by the model, which induces a form of correlation
across taxa that we cannot distinguish from direct interaction across taxa.
Key Point
There are three reasons taxa may co-occur:
1. They may interact directly.
2. They may both respond to some mediator taxon.
3. They may respond to a common environmental variable.
We can model species correlation to tease apart these three sources, and
quantify the extent to which (2) and (3) drive co-occurrence. However, a
challenge is that models aren’t perfect and there will probably be missing
environmental variables, or predictors whose effect is not estimated correctly
by the model. This means that we can’t use models to fully account for (2)
and (3), so we can never really tell from an observational study whether co-
occurrence is due to species interaction or missing predictors (or other sources
of model misspecification).
If two taxa co-occur because they both respond to the same environmental variable, this can be accounted for by regressing responses directly against the relevant environmental predictors (and is captured in β).
In this chapter, two main modelling tools will be considered—latent variable mod-
els, which can tease apart the effects of environmental variables (3), and graphical
models, which can additionally identify the effects of (2).
Table 14.1 mentioned four main frameworks for analysing multivariate abundances.
Algorithmic distance-based approaches to analysis, used widely in the past, compro-
mise performance in search of short computation times and can no longer be rec-
ommended. Generalised estimating equations (GEEs) are useful for fitting models
quickly, for example, in combination with computationally intensive techniques for
design-based inference (Chap. 14). Hierarchical GLMs (Chap. 11) have been used for
ordination (Sect. 12.3) and fourth corner modelling (Sect. 16.2). They could also be
used here, but we will instead use another modelling framework—copulas—whose
main advantage over hierarchical approaches is in computation time.
A copula model is a type of marginal model, like generalised estimating equations
(GEEs). The word copula comes from the idea that the model couples together a
marginal model for data with a covariance model, as explained below. Compared
to GEEs, a copula has the advantage that it is a fully parametric model, meaning
likelihood-based or Bayesian inference is an option for a copula model. This comes
at some cost in terms of computation time, so copula models tend to be slower to
fit than GEEs, but they are typically much faster than hierarchical models. They are
used in this chapter because flexible tools for modelling co-occurrence have been
proposed in a Gaussian copula framework (Popovic et al., 2018).
The idea of a Gaussian copula model is to map abundances from their marginal
distribution to a standard normal copula variable and then analyse the copula vari-
ables assuming they are multivariate normal. If data were continuous, we would
construct copula values from a marginal model for the jth response at the ith site by
solving
Φ(z_ij) = F(y_ij)    (17.1)
where F(·) is the cumulative distribution function of y_ij, and Φ(·) is the cumulative distribution function of the standard normal distribution. Abundances are not continuous, so we will need to use a discrete copula model, under which the copula value is only known to lie in an interval:
F(y_ij^-) ≤ Φ(z_ij) ≤ F(y_ij)
where y_ij^- is the previous possible value of y_ij (for counts, y_ij − 1). The copula values z_ij are not observed, so they are treated as random effects. Mathematically, this means the likelihood function has an integral in it, which slightly complicates estimation.
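To make the mapping concrete, here is a toy illustration of the continuous case in Eq. 17.1, assuming a known gamma marginal (the numbers are arbitrary); for discrete abundances the same mapping only pins z_ij down to the interval above.
set.seed(1)
y = rgamma(1000, shape=2, rate=1)      # a continuous response with known marginal F
z = qnorm(pgamma(y, shape=2, rate=1))  # copula values: solve Phi(z) = F(y)
qqnorm(z); abline(0, 1)                # z follows a standard normal distribution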
The covariance model tells us the form of Σ to be used in Eq. 17.2. This is like
an error term, capturing co-occurrence patterns that are not explained elsewhere in
the model. In this chapter we will use two covariance modelling approaches: a latent
variable model, which assumes Σ has reduced rank; and a graphical model, which
assumes many of the responses are conditionally independent of each other. Because
copula values have been mapped to the standard normal, with variance one, Σ is
actually a correlation matrix.
A marginal model is needed to determine the form of cumulative distribution
function F(yi j ) to be used in Eq. 17.1. This is the part of the model where envi-
ronmental variables can be introduced, to capture their effects on abundance. In
principle, any marginal model can be used; the approach taken here will be to as-
sume each response follows a generalised linear model (or related method) and to
check these assumptions as appropriate, along the lines of Chaps. 10 and 14. So an example model for the abundance of taxon j at site i is a negative binomial regression, y_ij ∼ NB(μ_ij, φ_j) with log(μ_ij) = β_0j + x_i β_j.
The marginal model is not the main focus of the chapter, but it is important as
always to specify it appropriately in order to make meaningful inferences about
co-occurrence.
It is fairly common to estimate the marginal model for y separately from the
covariance model for z, as a two-step process. This is an approximation, which ig-
nores any interplay between the two parts of the model. However, algorithms that
estimate both models jointly are much more complicated to work with and tend
to perform only slightly better (if at all). In this chapter we will make use of the ecoCopula package, which uses a two-step approach to maximum likelihood estimation but phrases it as a type of Monte Carlo expectation–maximisation (EM) algorithm (Popovic et al., 2018) that could, in principle, be applied to combine any parametric marginal model with any covariance modelling tool designed for multivariate normal data. This algorithm uses Dunn-Smyth residuals to map observed data onto the standard normal distribution, generating multiple sets of Dunn-Smyth residuals (which vary across runs because of jittering) and weighting them according to how well each set fits a multivariate normal distribution.
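As a rough caricature of the mapping step only (not the weighted Monte Carlo EM algorithm that ecoCopula actually uses), one can take a single set of Dunn-Smyth residuals from a marginal fit and look at their correlations across taxa:
library(mvabund)
data(spider)                            # the hunting spider data used throughout
ftMarg = manyglm(mvabund(spider$abund) ~ 1, family="negative.binomial")
zHat = residuals(ftMarg)                # Dunn-Smyth residuals: one (jittered) set of copula values
round(cor(zHat), 2)                     # a crude look at the correlation structure to be modelled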
Each of Eqs. 17.1–17.2 involves making an assumption. Firstly, Eq. 17.1 requires an
assumption that the marginal model for abundances is correct in order to be able to
transform abundances from the marginal model to standard normal copula values.
In a latent variable model (as in Maths Box 17.1) we can write
y_i = β_0 + x_i B + Λ z_i + ε_i
so the implied variance–covariance matrix has the reduced-rank form
Σ_yi = Λ Σ_zi Λ′ + Σ_ε = Λ Λ′ + Σ_ε
library(ecoCopula)
# spider_glmInt is an intercept-only (no predictors) manyglm fit to the spider abundances
ord_spiderInt = cord(spider_glmInt)
plot(ord_spiderInt, biplot=TRUE) #for a biplot
The idea is illustrated in Fig. 17.2a, b for Petrus’s spider data (Exercise 17.1).
Figure 17.2a presents a biplot of factor scores and loadings using a two-factor
Gaussian copula, and Fig. 17.2b is the estimated correlation matrix this implies.
Species at opposite sides of a biplot have negative estimated correlations, and species
close together have positive estimated correlations. The strength of these correlations
increases the further the loadings are from the origin. This close correspondence
between the two plots is expected because the correlations are computed as a function
of factor loadings (Maths Box 17.1).
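The link between loadings and correlations is easy to verify numerically with a hypothetical two-factor loadings matrix (made-up numbers, standing in for Maths Box 17.1):
set.seed(1)
Lambda = matrix(rnorm(12*2, sd=0.6), 12, 2)  # hypothetical loadings: 12 species, 2 factors
Sigma = Lambda %*% t(Lambda) + diag(12)      # covariance implied by the factor model
round(cov2cor(Sigma)[1:4, 1:4], 2)           # correlations of the kind mapped in Fig. 17.2b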
The two-factor Gaussian copula model of Fig. 17.2a, b was constructed using the
cord function of the ecoCopula package (Code Box 17.1). This software makes
explicit its two-step approach to estimation by first requiring a marginal model to be
fitted (e.g. using the mvabund package); then a covariance model is fitted by applying
the cord function to this marginal model object.
The correlations in Σ capture patterns not accounted for by predictors already in-
cluded in the model. The correlation patterns seen in Fig. 17.2a, b are based on a
model with no predictors in it, and some of these correlations are potentially ex-
plained by shared (or contrasting) response to environmental variables. To study the
extent to which measured environmental variables explain these patterns, we can
refit the model with predictors included and compare results, as in Fig. 17.2c, d.
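A sketch of this refit, using soil dryness and herb cover in the marginal model (the predictor set behind Fig. 17.2c, d may differ slightly), could look as follows:
library(mvabund); library(ecoCopula)
spiderX = data.frame(spider$x)
spider_glmX = manyglm(mvabund(spider$abund) ~ soil.dry + herb.layer,
  data=spiderX, family="negative.binomial")
ord_spiderX = cord(spider_glmX)   # covariance model on the "residual" copula values
plot(ord_spiderX, biplot=TRUE)    # compare with the unconstrained biplot of Fig. 17.2a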
Note that correlation patterns seem to be substantially weaker after removing
effects due to environmental variables (Fig. 17.2d), with most correlations being
smaller than they were previously (Fig. 17.2b). In particular, there were strongly
negative correlations between the three species on the bottom rows of Fig. 17.2b
and the remaining nine species, which all disappeared on inclusion of environmen-
tal variables. These three species had a negative response to soil dryness (as in
Fig. 16.1), whereas the remaining nine species had a positive response. These con-
trasting responses to soil dryness are the likely reason for the negative co-occurrence
patterns of these species (Fig. 17.2b).
(Fig. 17.2, panels (a)–(d): ordination biplots with axes "Latent variable 1" and "Latent variable 2", and the corresponding species correlation matrices; see the caption below.)
Fig. 17.2: Exploring co-occurrence in Petrus’s spider data: (a) unconstrained biplot of
a two-factor Gaussian copula model with no predictors; (b) the correlation matrix this
model implies; (c) a “residual” biplot of a two-factor Gaussian copula model after
including environmental variables; (d) the correlation matrix this implies. Notice
that species far apart in a biplot have strongly negative estimated correlations (e.g.
Alopecosa accentuata and Pardosa lugubris in a and b), and species close together
and far from the origin have strongly positive estimated correlations (e.g. Trochosa
terricola and Zora spinimana in a and b). Notice also that after controlling for
environmental variables, factor loadings have changed (c), and correlation estimates
tend to be closer to zero (d)
ter Braak and Prentice (1988) introduced the term unconstrained ordination to
describe an ordination without any predictors in it, as in Fig. 17.2a. ter Braak and
Prentice (1988) further defined a constrained ordination as a plot where the axes
were derived as a linear combination of measured predictors, essentially a reduced
rank regression as in Yee (2006). Figure 17.2c, in contrast, shows co-occurrence
patterns not explained by measured covariates, a type of residual ordination or
partial ordination.
Both the aforementioned measures are calculated in Code Box 17.2, for models
with and without environmental variables. Both measures were roughly halved on
inclusion of environmental predictors, meaning that about half of co-occurrence
patterns can be explained by measured environmental variables. Note the measures
did not return exactly the same answer (45% vs 58%)—as always, different ways of
measuring things can lead to different measurements!
Calculate the sum of squared loadings for latent variable models with and
without fields as a predictor. Notice that these are a lot smaller than in Code
Box 17.2.
What can you conclude about the co-occurrence patterns of these birds and
the extent to which they are explained by presence or absence of fields?
Recall that another possible reason two taxa may co-occur, beyond the scenario
where both taxa respond to the same environmental variable, is that they are both
related to some mediator taxon.
Graphical modelling is a technique designed to tease these two ideas apart: it
identifies pairs of conditionally dependent taxa—taxa that remain correlated with
each other even after accounting for all others in the model. Popovic et al. (2019)
wrote a helpful introduction to the idea of conditional dependence and how to
visualise conditional relationships between a set of response variables.
Recall from Chap. 3 that a key idea in multiple regression is that coefficients are
interpreted conditionally, after controlling for the effects of all others in the model.
Thus, multiple regression is a tool for finding conditionally dependent associations.
So one approach to finding pairs of responses that are conditionally dependent would
be to apply multiple regression to each response, as a function of all others, and look
for slope coefficients that are clearly non-zero. Graphical modelling takes a slightly
different (but related) approach, estimating the inverse of the variance–covariance
matrix and looking for values that are clearly non-zero.
If two variables are conditionally independent, we do not necessarily get a zero in
the variance–covariance matrix (Σ) itself because correlation might be induced by
a third mediator taxon. We would, however, get a zero in the appropriate cell of the
inverse of the variance–covariance matrix (Σ−1 , sometimes called a precision ma-
trix), if data are multivariate normal. Maths Box 17.2 uses multiple regression results
from Chap. 3, and some matrix algebra, to show why this is the case. Figure 17.3
illustrates the idea for a hypothetical variance–covariance matrix, and its precision
matrix, for snakes, mice, and lions. Snakes and lions are conditionally independent
but positively correlated because of the mediator mice, which is positively correlated
with each of them.
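This is easy to check with made-up numbers in the spirit of Fig. 17.3: if the snakes–lions correlation is exactly the product of their correlations with mice, the corresponding entry of the precision matrix is zero.
Sigma = matrix(c(1.00, 0.50, 0.25,
                 0.50, 1.00, 0.50,
                 0.25, 0.50, 1.00), 3, 3,
  dimnames=list(c("snakes","mice","lions"), c("snakes","mice","lions")))
round(solve(Sigma), 2)  # precision matrix: the snakes-lions entry is zero (conditional independence)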
Maths Box 17.2: Why the precision matrix captures conditional dependence
Consider a linear model for one response y_p in terms of the other responses Y_X, μ_p = β_0 + Y_X β_1, where (Y_X, y_p) are multivariate normal with variance–covariance matrix
$$\Sigma = \begin{pmatrix} \Sigma_{XX} & \sigma_{Xp} \\ \sigma_{pX} & \sigma_{pp} \end{pmatrix}$$
If there are more than two responses, Σ_XX is a matrix, whereas σ_Xp is a vector and σ_pp is a scalar. It turns out that the true values of the slope coefficients are
$$\beta_1 = \Sigma_{XX}^{-1} \sigma_{Xp}$$
This expression is essentially the least squares estimator from Maths Box 3.1, but with the relevant (cross-)products replaced with (co-)variances.
If y_p is conditionally independent of the jth variable in Y_X, then the jth regression coefficient is zero (as in Chap. 3), which we will write as [β_1]_j = 0 or [Σ_XX^{-1} σ_Xp]_j = 0. In matrix algebra, [Σ_XX^{-1} σ_Xp]_j = 0 does not imply that [σ_Xp]_j = 0, so conditional independence does not imply zero covariance, nor zero correlation.
But consider φ_Xp, the corresponding term in the precision matrix
$$\Sigma^{-1} = \begin{pmatrix} \Phi_{XX} & \phi_{Xp} \\ \phi_{pX} & \phi_{pp} \end{pmatrix}$$
Because Σ Σ^{-1} = I, the off-diagonal block of this product must be zero:
$$\Sigma_{XX} \phi_{Xp} + \sigma_{Xp} \phi_{pp} = 0, \quad \text{so} \quad \phi_{Xp} = -\Sigma_{XX}^{-1} \sigma_{Xp} \phi_{pp} = -\beta_1 \phi_{pp}$$
Hence [φ_Xp]_j = 0 exactly when [β_1]_j = 0, that is, a zero in the precision matrix corresponds to conditional independence.
Fig. 17.3: A hypothetical variance–covariance matrix (Σ) from abundance data for
snakes, mice, and lions and its corresponding precision matrix (Σ−1 ). While all taxa
are positively correlated (Σ), there is a zero in Σ−1 for snakes and lions, meaning
they are conditionally independent of each other. A graph of conditional dependence
relationships connects mice with each of snakes and lions, but snakes and lions
are not connected because they are conditionally independent and their correlation is
induced by shared positive associations with the mediator taxon, mice
There are some nice computational tools for graphical modelling of multivariate
normal data, most notably the graphical LASSO (Friedman et al., 2008, the glasso
package in R). This method estimates a precision matrix using a LASSO approach, via penalised maximum likelihood—the estimate of Σ−1 maximises
log det(Σ−1) − tr(SΣ−1) − λ||Σ−1||1,1     (17.3)
where S is the sample covariance matrix of the (copula) values, λ is a penalty parameter, and ||Σ−1||1,1 is the element-wise L1 norm, the sum of absolute values of parame-
ters. The key difference from the LASSO as introduced in Chap. 5 is that the penalty
is applied to a precision matrix rather than to a vector of regression coefficients,
so terms in the precision matrix are shrunk towards zero, rather than regression
coefficients. The end result is (usually) a precision matrix with lots of zeros in it,
which considerably simplifies interpretation, by putting our focus on the taxa that
drive key dependence patterns in the data.
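As a minimal sketch of the graphical LASSO itself, applied directly to Dunn-Smyth residuals treated as rough copula values (the penalty value is arbitrary; the cgr function used below handles all of this more carefully for abundance data):
library(glasso); library(mvabund)
data(spider)
S = cor(residuals(manyglm(mvabund(spider$abund) ~ 1, family="negative.binomial")))
gl = glasso(S, rho=0.2)  # rho plays the role of lambda in Eq. 17.3
round(gl$wi, 2)          # estimated precision matrix, typically with many exact zeros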
A difficulty in applying this idea to multivariate abundance data is that the method assumes multivariate normality; this can be overcome by mapping values to the multivariate normal using a Gaussian copula (Popovic et al., 2018). Other
graphical modelling methods have been applied in ecology, using related modelling
frameworks, in the special case of presence–absence data (Harris, 2016; Clark et al.,
2018).
Graphical models can be fitted to multivariate abundance data using the cgr
function of the ecoCopula package (Code Box 17.3). As previously, models can
be fitted with and without environmental variables to study the extent to which
conditional dependence relationships can be explained by environmental variables,
leading to the graphs of Fig. 17.4. As before, most of the co-occurrence patterns in
the data seem to be explainable by response to common environmental variables
because the relationships are weaker in Fig. 17.4b and there are slightly fewer of
them (reduced from 37 to 31). This effect does, however, look less dramatic on
Fig. 17.4 than in the correlation plots of Fig. 17.2, although it seems comparable in
scale (Code Box 17.3 suggests that in absolute terms, correlations reduced by about
60%, as previously).
Results from graphical models are very sensitive to the GLASSO penalty param-
eter, the value of λ in Eq. 17.3. In a regular LASSO regression (Sect. 5.6), when λ
is large enough, all terms are left out of the model, and when it is small enough, all
predictors get included. In much the same way, in GLASSO, if λ is large enough,
then no correlations are included in the model, and if it is small enough, then all of
them are included. When comparing different graphical models to look at the effects
of predictors on co-occurrence patterns (Code Box 17.3), it is a good idea to use the
same value of λ to put the two models on an even footing. When reporting a final
graphical model, it is a good idea to re-estimate λ, but when specifically looking at
the extent to which predictors explain co-occurrence, λ should be fixed to a common
value.
The two species with the most conditional dependence relationships detected
(Fig. 17.4) were the two most abundant species (Pardosa pullata and Trochosa ter-
ricola). The three rarest species, Arctosa lutetiana, Arctosa perita, and Alopecosa
fabrilis, tended to have fewer and weaker conditional dependence relationships. This
pattern suggests that the conditional dependence relationships for a species are in
part a function of how much information we have on that species. We need a lot of
information on occurrence patterns if we are to detect co-occurrence patterns!
Code Box 17.3: A Copula Graphical Model for Petrus’s Spider Data
The cgr function in the ecoCopula package will be used to construct a Gaussian copula
graphical model from negative binomial regressions of each hunting spider species:
# fit an intercept model (for an "unconstrained" graph)
graph_spiderInt = cgr(spider_glmInt)
plot(graph_spiderInt, vary.edge.lwd=TRUE)
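The corresponding "residual" graph comes from applying cgr to a marginal model that includes environmental predictors; a sketch, assuming a manyglm fit with soil dryness and herb cover called spider_glmX (as earlier in this chapter), is:
graph_spiderX = cgr(spider_glmX)
plot(graph_spiderX, vary.edge.lwd=TRUE)
# NB: when comparing this graph with graph_spiderInt, the text below recommends
# fixing the GLASSO penalty to a common value across the two fits.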
Notice that the number of links in the graphical models is relatively small—Fig. 17.4b
was constructed from a precision matrix with 31 non-zero values in it, whereas an
unstructured matrix would have 66 non-zero values. A smaller number of parameters
makes interpretation easier but also often gives a better estimate of Σ by choosing a
more appropriate point on the bias–variance trade-off.
Recall that the number of parameters in an unstructured variance–covariance
matrix increases quadratically with the number of responses (p), and unless p is
small, it is not practical to estimate Σ without assuming some structure to it. Using
a generalised linear latent variable model is one way to impose structure, assuming Σ has
reduced rank (Maths Box 17.1). Graphical modelling is an alternative way to impose structure, assuming many pairs of responses are conditionally independent.
(Fig. 17.4: copula graphical models for Petrus's hunting spider data, (a) without and (b) with environmental predictors in the marginal model; nodes are the 12 spider species, and edges join conditionally dependent pairs.)
There are many other options for ways to model correlation patterns and, hence,
quantify the various potential sources of co-occurrence patterns. One option not
considered above is correlation from phylogenetic sources (Sect. 7.3). Closely related
taxa might co-occur more often than expected—Li and Ives (2017) explain that one
potential reason for this is missing predictors, specifically, missing functional traits
that may explain some co-occurrence. Traits often carry a phylogenetic signal, and
so traits that are omitted from a model might induce phylogenetic correlation. ter
Braak (2019) proposed adding random slopes for traits to a model to specifically
account for this. However, any method that accounts for the multivariate property,
such as those in Chaps. 11 and 14 and this chapter, is capable of accounting for
phylogenetic correlation, among other causes of correlated responses. If the research
question of interest is to quantify the phylogenetic signal in co-occurrence, then we
need to directly quantify it, which can be done using the phyr package (Li et al.,
2020) or using Hmsc (Ovaskainen & Abrego, 2020, Section 6.4).
Another idea not touched on in this chapter is that co-occurrence patterns may
change as the environment changes—there is ample empirical evidence that this is the case.
This book has been a long journey! But this is not the end. Something I’ve learned
over my career so far is that the more I know, the more I realise that I don’t know. I
am regularly finding out about methods and problems I wasn’t previously aware of
or new techniques that have been developed recently.
One way to understand the breadth of statistics is to recall that the way data should
be analysed primarily depends on the Ps and Qs:
P Data properties. In regression, the properties of the response variable are what
matters.
Q The research question guides us in finding a target quantity or technique for
analysis.
When you look across ecology, or any other discipline, there are lots of different ways
people can collect data, and there are lots of different types of research questions to
ask. So there is a need for a whole lot of different analysis methods! And new ones
all the time, as technology brings new methods of data collection, and as people
think of new types of questions they can ask. So one thing we know for sure is that
you will not find in this book all the analysis methods you are going to need over the
course of your career.
So to close things out, let’s first summarise the main lessons in a common
framework. Then we will discuss what to do when you come across a new method
not covered in this book, as you inevitably will!
A framework for data analysis is presented in Fig. 18.1—it’s all about minding your
Ps and Qs. As with the remainder of this book, Fig. 18.1 and the following discussion
primarily focus on problems that can be posed as a regression model—the situation
where we want to relate a response, or set of responses, to predictors.
Key Point
Always mind your Ps and Qs!
P Data properties. In regression, the properties of the response variable are
what matters. The main properties we have considered in this book are
number of responses, response type, dependence across observations, and
form of response to predictors.
Q The research question guides us in finding a target quantity or technique
for analysis. In particular, it tells us if we should focus on exploratory data
analysis only, inferences about key parameters, hypothesis testing, model
selection, or predictive modelling.
For more details see Fig. 18.1.
Fig. 18.1: Schematic diagram of approaches to data analysis. In data analysis you
have to mind your Ps and Qs—the analysis method needs to be aligned with data
properties and the research question. Some key data properties to consider are the
number and type of response variables, potential sources of correlation in response,
and how they respond to predictors. Any combination of these properties could
be encountered in practice. The research question informs what type of inference
procedure (if any) is needed, for a suitably chosen regression model
(The first three of these issues could be handled along the lines of Ovaskainen et al.,
2016; non-linearity would require quadratic terms or smoothers.)
Always remember that while we should try to capture all key data properties, we are
unlikely to do so successfully. The working assumption should always be that your
model is wrong.
Where possible, assumptions should always be checked, whether using residual
plots, model selection tools to compare models with competing assumptions, or
related approaches, to try and detect mistakes that were made in building the model.
But assumption checks are not perfect, and in most cases we can expect missing
predictors or other forms of model misspecification that have not been fixed in the
analysis process. Fortunately, terms can be added to a model to give some robustness
to violations of assumptions, including the following:
• When modelling discrete data, include an overdispersion term. This term is
needed to capture any variation in response not explained by the model and
account for it when making inferences. Linear models have a built-in error term
that performs this role (as in Maths Box 4.1), but Poisson regression does not and
will go badly wrong if there is overdispersion that was not accounted for.
• When modelling multiple responses simultaneously, account for correlation
across responses (e.g. using latent variables, or an unstructured correlation ma-
trix). Even if responses are not expected to directly interact, correlation will be
induced by missing predictors that affect multiple responses.
Taking these steps will not prevent confounding variables from biasing coefficients
(as in Maths Box 4.1), but it will help ensure standard errors and predictions are
reasonable in the presence of missing information.
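As a small illustration of the first point, with made-up data, compare the standard error of a slope from a Poisson fit and from a negative binomial fit to overdispersed counts:
library(MASS)
set.seed(1)
x = rnorm(100)
y = rnbinom(100, mu=exp(1 + x), size=1)  # overdispersed counts
ftPois = glm(y ~ x, family=poisson())    # no overdispersion term
ftNB = glm.nb(y ~ x)                     # negative binomial adds an overdispersion parameter
c(poisson=summary(ftPois)$coefficients["x","Std. Error"],
  negbin=summary(ftNB)$coefficients["x","Std. Error"])  # the Poisson standard error is misleadingly small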
For example, in early drafts of this book, I fitted the hierarchical GLM in Section
11.3 using the lme4 package, and eventually found a solution that didn’t report
convergence issues. What puzzled me was that when I constructed a plot for this fit
corresponding to Fig. 11.4, the lines didn’t really go through the data; some seemed
to miss the data entirely. I eventually tried refitting the model using glmmTMB, which
gave a qualitatively different fit that fitted the data much more closely (Fig. 11.4).
Clearly my earlier lme4 fit had not converged (at least, not to a sensible answer),
and if I hadn’t tried plotting the data and fitting trend lines to it I would never have
noticed the issue.
There are heaps of things we haven’t talked about, some of which you will need to
know at some stage. Further, new methods are being developed all the time, so there
is always the possibility that some new method will come up that you will want to
learn how to use. So let’s talk briefly about the next steps beyond this book.
There are still plenty of other data considerations, and methods of data analysis, that
we have not had time to explore in this book. Let’s run through a few ideas.
• Whereas a regression model involves splitting variables into response and predic-
tor variables, more complex networks are possible. A variable that is a predictor
may itself respond to another predictor, as in a food web. It is possible to build
competing hypothesised networks of causal pathways and test which are more
consistent with the data, using structural equation models (Grace, 2006) or path
analysis (Shipley, 2016), although the development of such methods for non-
normal data is a work in progress. There is a lot of related literature developing
techniques intended for causal inference from observational studies (Morgan &
Winship, 2015), the first step of which is to think through all possible alternate
causes of an association you see in practice, and what can be measured about
each of these to account for them and, hence, estimate the causal effect of primary
interest. This is a really helpful step to work through as you progress towards a
better understanding of your study system, irrespective of intent to make causal
inferences. It may then be possible to make causal statements about the effect of
interest, under the very strong assumption that your model is correct, an assump-
tion that invites some scepticism.
• Predictor variables may not be measured perfectly and may come with measure-
ment error. If the size of this error can be estimated, it can (if desired) be corrected
for in order to estimate the response to the true value of the predictor using measurement error models.
So what steps should you go through in working out how to use some new method?
Before we talk about the steps to work through when using some new method, first
reconsider whether you really need or want to use it at all. Potential costs of using
some new method include the following: a longer start-up time; if the method is not
well known to readers of your research, then it will be harder for them to follow; it
is probably not as tried-and-tested as existing methods and so could have problems
not yet recognised. Something else worth considering is whether you are looking for
a fancy new method for the right reasons—do you need a better analysis or a better
question? Sometimes researchers use fancy new or fashionable methods to dress up
their research and make it look more attractive to good journals. But if the underlying
science is not that interesting, then no amount of bells and whistles (sometimes called
statistical machismo after a blog post by Brian McGill) can compensate, and the hard
truth is that in that situation you may be better off going back to the drawing board.
Having said that, there are plenty of potential gains to using a fancy new method.
New methods:
• Might work better or better answer your research question than existing methods.
They are usually improvements on older methods in one of these two respects.
• Could add novelty to your research. Sometimes you can even write a paper about
the methodology as well as about the results!
• Can be useful as a learning experience.
• Can go over better with reviewers, assessors, and examiners, depending on your
audience. This should mostly be the case when the new methods are better suited
to your research question.
When learning about a new method of analysis, the most important question to answer
is how do you mind your Ps and Qs? Specifically, what are the key data properties
for which this method was intended? What are the main research questions it was
designed to answer? And what procedures can you use to check you did the right
analysis for your data?
Another important consideration is performance—has this method been demon-
strated to do what it claims to, and if there are alternative approaches, has it been
shown that this new method works better than those alternatives? The main tool for
demonstrating that a method works is a simulation experiment—data are generated
repeatedly under a particular, controlled scenario, and we look at how effectively
the new method recovers the correct answer. A paper proposing new software or a
new method should include a simulation experiment, unless the proposed method
• Is there some existing, competing method or software? If so, why use the
proposed methods instead?
• Is there software available with a worked example? See if you can work
through this example by yourself.
Key Point
Some key things to look for in a document describing how to use some new
software:
• What are the Ps—what data properties did developers have in mind when
designing the method? This should be explicitly stated! It should also be
implicit in their motivating examples.
• What are the Qs—what sorts of research questions was the method designed
to answer? This again may not be explicitly stated and again can sometimes
be inferred from the worked examples.
• How do you mind your Ps and Qs—how do you check assumptions? Does
the paper suggest any particular model diagnostics that might be useful?
• Has it been shown (usually by simulation experiment) that this method
actually works?
• What other methods are available, and why is this software proposed in-
stead?
• Is there a worked example you can follow yourself?
Remember it would do no harm at all to see a statistical consultant.
Yeah, so that’s it! I wish you the best for your journey—may your research
questions be rippers.
References
Blanchet, F. G., Cazelles, K., & Gravel, D. (2020). Co-occurrence is not evidence of
ecological interactions. Ecology Letters, 23, 1050–1063.
Blomberg, S. P., Lefevre, J. G., Wells, J. A., & Waterhouse, M. (2012). Independent
contrasts and PGLS regression estimators are equivalent. Systematic Biology, 61,
382–391.
Borchers, D. L., Buckland, S. T., Stephens, W., Zucchini, W., et al. (2002). Estimating
animal abundance: Closed populations (Vol. 13). Springer.
Bray, J. R., & Curtis, J. T. (1957). An ordination of the upland forest communities
of southern Wisconsin. Ecological Monographs, 27, 325–349.
Brooks, M. E., Kristensen, K., van Benthem, K. J., Magnusson, A., Berg, C. W.,
Nielsen, A., Skaug, H. J., Machler, M., & Bolker, B. M. (2017). glmmTMB
balances speed and flexibility among packages for zero-inflated generalized linear
mixed modeling. The R Journal, 9, 378–400.
Brown, A. M., Warton, D. I., Andrew, N. R., Binns, M., Cassis, G., & Gibb, H.
(2014). The fourth-corner solution—using predictive models to understand how
species traits interact with the environment. Methods in Ecology and Evolution,
5, 344–352.
Brown, B. M., & Maritz, J. S. (1982). Distribution-free methods in regression.
Australian Journal of Statistics, 24, 318–331.
Carroll, R. J., & Ruppert, D. (1996). The use and misuse of orthogonal regression
in linear errors-in-variables models. American Statistician, 50, 1–6.
Carroll, R. J., Ruppert, D., Stefanski, L. A., & Crainiceanu, C. M. (2006). Measure-
ment error in nonlinear models: A modern perspective. CRC Press.
Chevan, A., & Sutherland, M. (1991). Hierarchical partitioning. The American Statis-
tician, 45, 90–96.
Christin, S., Hervet, É., & Lecomte, N. (2019). Applications for deep learning in
ecology. Methods in Ecology and Evolution, 10, 1632–1644.
Christin, S., Hervet, É., & Lecomte, N. (2021). Going further with model verification
and deep learning. Methods in Ecology and Evolution, 12, 130–134.
Clark, G. F., Kelaher, B. P., Dafforn, K. A., Coleman, M. A., Knott, N. A., Marzinelli,
E. M., & Johnston, E. L. (2015). What does impacted look like? High diversity
and abundance of epibiota in modified estuaries. Environmental Pollution, 196,
12–20.
Clark, J. S. (2007). Models for ecological data. Princeton, NJ: Princeton University
Press.
Clark, J. S., Gelfand, A. E., Woodall, C. W., & Zhu, K. (2014). More than the
sum of the parts: Forest climate response from joint species distribution models.
Ecological Applications, 24, 990–999.
Clark, N. J., Wells, K., & Lindberg, O. (2018). Unravelling changing interspecific
interactions across environmental gradients using Markov random fields. Ecology,
99, 1277–1283.
Clarke, K. R. (1993). Non-parametric multivariate analyses of changes in community
structure. Australian Journal of Ecology, 18, 117–143.
Cook, D., Lee, E.-K., & Majumder, M. (2016). Data visualization and statistical
graphics in big data analysis. Annual Review of Statistics and Its Application, 3,
133–159.
Cooper, V. S., Bennett, A. F., & Lenski, R. E. (2001). Evolution of thermal depen-
dence of growth rate of Escherichia coli populations during 20,000 generations in
a constant environment. Evolution, 55, 889–896.
Corder, G. W., & Foreman, D. I. (2009). Nonparametric statistics for non-
statisticians. Wiley. ISBN: 9781118165881.
Creasy, M. A. (1957). Confidence limits for the gradient in the linear functional
relationship. Journal of the Royal Statistical Society B, 18, 65–69.
Cressie, N. (2015). Statistics for spatial data. Wiley.
Cressie, N., Calder, C. A., Clark, J. S., Hoef, J. M. V., & Wikle, C. K. (2009).
Accounting for uncertainty in ecological analysis: The strengths and limitations
of hierarchical statistical modeling. Ecological Applications, 19, 553–570.
Cressie, N., & Johannesson, G. (2008). Fixed rank kriging for very large spatial data
sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology),
70, 209–226.
Davis, P. J., & Rabinowitz, P. (2007). Methods of numerical integration. Courier
Corporation.
Davison, A. C., & Hinkley, D. V. (1997). Bootstrap methods and their application.
Cambridge: Cambridge University Press.
Diggle, P. J., Heagerty, P., Liang, K.-Y., & Zeger, S. (2002). Analysis of longitudinal
data (2nd ed.). Oxford University Press.
Dormann, C. F., McPherson, J. M., Araújo, M. B., Bivand, R., Bolliger, J., Carl, G., Davies, R. G., Hirzel, A., Jetz, W., Kissling, W. D., Kühn, I., Ohlemüller, R., Peres-Neto, P. R., Reineking, B., Schröder, B., Schurr, F. M., & Wilson, R.
(2007). Methods to account for spatial autocorrelation in the analysis of species
distributional data: A review. Ecography, 30, 609–628.
Dray, S., Dufour, A.-B., et al. (2007). The ade4 package: Implementing the duality
diagram for ecologists. Journal of Statistical Software, 22, 1–20.
Duan, N. (1983). Smearing estimate: A nonparametric retransformation method.
Journal of the American Statistical Association, 78, 605–610.
Dunn, P., & Smyth, G. (1996). Randomized quantile residuals. Journal of Compu-
tational and Graphical Statistics, 5, 236–244.
Dunstan, P. K., Foster, S. D., & Darnell, R. (2011). Model based grouping of species
across environmental gradients. Ecological Modelling, 222, 955–963.
Edgington, E. S. (1995). Randomization tests. (3rd ed.). New York: Marcel Dekker.
Efron, B. (2004). The estimation of prediction error: Covariance penalties and cross-
validation. Journal of the American Statistical Association, 99, 619–642.
Efron, B., & Tibshirani, R. (1997). Improvements on cross-validation: The 632+
bootstrap method. Journal of the American Statistical Association, 92, 548–560.
Elith, J., Graham, C., Anderson, R., Dudik, M., Ferrier, S., Guisan, A., Hijmans,
R., Huettmann, F., Leathwick, J., Lehmann, A., Li, J., Lohmann, L., Loiselle,
B., Manion, G., Moritz, C., Nakamura, M., Nakazawa, Y., Overton, J., Peterson,
A., Phillips, S., Richardson, K., Scachetti-Pereira, R., Schapire, R., Soberon, J.,
Williams, S., Wisz, M., & Zimmermann, N. (2006). Novel methods improve
prediction of species’ distributions from occurrence data. Ecography, 29, 129–
151.
Elith, J., & Leathwick, J. (2007). Predicting species distributions from museum and
herbarium records using multiresponse models fitted with multivariate adaptive
regression splines. Diversity and Distributions, 13, 265–275.
Ellison, A. M., Gotelli, N. J., Inouye, B. D., & Strong, D. R. (2014). P values,
hypothesis testing, and model selection: It’s déjà vu all over again. Ecology, 95,
609–610.
Evans, M., & Swartz, T. (2000). Approximating integrals via Monte Carlo and
deterministic methods. Oxford statistical science series. OUP Oxford. ISBN:
9780191589874.
Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood
and its oracle properties. Journal of the American Statistical Association, 96,
1348–1360.
Felsenstein, J. (1985). Phylogenies and the comparative method. The American
Naturalist, 125, 1–15.
Ferrari, S., & Cribari-Neto, F. (2004). Beta regression for modelling rates and
proportions. Journal of Applied Statistics, 31, 799–815.
Finley, A. O., Sang, H., Banerjee, S., & Gelfand, A. E. (2009). Improving the
performance of predictive process modeling for large datasets. Computational
Statistics and Data Analysis, 53, 2873–2884.
Fisher, R. A. (1922). On the mathematical foundations of theoretical statistics. Philo-
sophical Transactions of the Royal Society of London. Series A, Containing Papers
of a Mathematical or Physical Character, 222, 309–368.
Flury, B. N. (1984). Common principal components in k groups. Journal of the
American Statistical Association, 79, 892–898.
Foster, S. D., Hill, N. A., & Lyons, M. (2017). Ecological grouping of survey sites
when sampling artefacts are present. Journal of the Royal Statistical Society:
Series C (Applied Statistics), 66, 1031–1047.
Freckleton, R., Harvey, P., & Pagel, M. (2002). Phylogenetic analysis and comparative
data: A test and review of evidence. The American Naturalist, 160, 712–726.
Friedman, J. (1991). Multivariate adaptive regression splines. Annals of Statistics,
19, 1–67.
Friedman, J., Hastie, T., & Tibshirani, R. (2008). Sparse inverse covariance estima-
tion with the graphical lasso. Biostatistics, 9, 432–441.
Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized
linear models via coordinate descent. Journal of Statistical Software, 33, 1–22.
Galton, F. (1886). Regression towards mediocrity in hereditary stature. Journal of
the Anthropological Institute, 15, 246–263.
Gelman, A., Stern, H. S., Carlin, J. B., Dunson, D. B., Vehtari, A., & Rubin, D. B.
(2013). Bayesian data analysis. Chapman and Hall/CRC.
Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the
bias/variance dilemma. Neural Computation, 4, 1–58.
Hodges, J. S., & Reich, B. J. (2010). Adding spatially-correlated errors can mess up
the fixed effect you love. The American Statistician, 64, 325–334.
Hooten, M. B., Johnson, D. S., McClintock, B. T., & Morales, J. M. (2017). Animal
movement: Statistical models for telemetry data. CRC Press.
Hui, F. K. C. (2016). boral—Bayesian ordination and regression analysis of multi-
variate abundance data in R. Methods in Ecology and Evolution, 7, 744–750.
Hui, F. K. C., Taskinen, S., Pledger, S., Foster, S. D., & Warton, D. I. (2015). Model-
based approaches to unconstrained ordination. Methods in Ecology and Evolution,
6, 399–411.
Hui, F. K. C., Warton, D. I., Foster, S., & Dunstan, P. (2013). To mix or not to mix:
Comparing the predictive performance of mixture models versus separate species
distribution models. Ecology, 94, 1913–1919.
Hurlbert, S. H. (1984). Pseudoreplication and the design of ecological field experi-
ments. Ecological Monographs, 54, 187–211.
Hyndman, R. J., & Athanasopoulos, G. (2014). Forecasting: Principles and practice.
OTexts.
Ives, A. R. (2015). For testing the significance of regression coefficients, go ahead
and log-transform count data. Methods in Ecology and Evolution, 6, 828–835.
Jamil, T., Ozinga, W. A., Kleyer, M., & ter Braak, C. J. F. (2013). Selecting traits
that explain species–environment relationships: A generalized linear mixed model
approach. Journal of Vegetation Science, 24, 988–1000.
Johns, J. M., Walters, P. A., & Zimmerman, L. I. (1993). The effects of chronic
prenatal exposure to nicotine on the behavior of guinea pigs (Cavia porcellus).
The Journal of General Psychology, 120, 49–63.
Johnson, D. H. (1999). The insignificance of statistical significance testing. The
Journal of Wildlife Management, 63, 763–772.
Johnson, N., Kotz, S., & Balakrishnan, N. (1997). Discrete multivariate distributions.
Wiley series in probability and statistics. Wiley. ISBN: 9780471128441.
Jolicoeur, P. (1975). Linear regression in fishery research: Some comments. Journal
of the Fisheries Research Board of Canada, 32, 1491–1494.
Jordan, M. I. et al. (2013). On statistics, computation and scalability. Bernoulli, 19,
1378–1390.
Keck, F., Rimet, F., Bouchez, A., & Franc, A. (2016). phylosignal: An R package
to measure, test, and explore the phylogenetic signal. Ecology and Evolution, 6,
2774–2780.
Keribin, C. (2000). Consistent estimation of the order of mixture models. Sankhyā:
The Indian Journal of Statistics, Series A, 62, 49–66.
Kerkhoff, A., & Enquist, B. (2009). Multiplicative by nature: Why logarithmic
transformation is necessary in allometry. Journal of Theoretical Biology, 257,
519–521.
Koenker, R., & Hallock, K. F. (2001). Quantile regression. Journal of Economic
Perspectives, 15, 143–156.
Kruskal, J. B., & Wish, M. (1978). Multidimensional scaling. Beverley Hills: Sage
Publications.
Moles, A. T., Warton, D. I., Warman, L., Swenson, N. G., Laffan, S. W., Zanne,
A. E., Pitman, A., Hemmings, F. A., & Leishman, M. R. (2009). Global patterns
in plant height. Journal of Ecology, 97, 923–932.
Moore, D. S., McCabe, G. P., & Craig, B. A. (2014). Introduction to the practice of
statistics.
Moran, P. A. P. (1971). Estimating structural and functional relationships. Journal
of Multivariate Analysis, 1, 232–255.
Morgan, S. L., & Winship, C. (2015). Counterfactuals and causal inference. Cam-
bridge University Press.
Morrisey, D., Underwood, A., Howitt, L., & Stark, J. (1992). Temporal variation
in soft-sediment benthos. Journal of Experimental Marine Biology and Ecology,
164, 233–245.
Murtaugh, P. A. (2007). Simplicity and complexity in ecological data analysis.
Ecology, 88, 56–62.
Nakagawa, S., & Cuthill, I. C. (2007). Effect size, confidence interval and statistical
significance: A practical guide for biologists. Biological Reviews, 82, 591–605.
Nelder, J. A., & Wedderburn, R. W. M. (1972). Generalized linear models. Journal
of the Royal Statistical Society. Series A (General), 135, 370–384.
Niklas, K. J. (2004). Plant allometry: Is there a grand unifying theory? Biological
Reviews, 79, 871–889.
Niku, J., Hui, F. K. C., Taskinen, S., & Warton, D. I. (2019). gllvm: Fast analysis of
multivariate abundance data with generalized linear latent variable models in R.
Methods in Ecology and Evolution, 10, 2173–2182.
Nishii, R. (1984). Asymptotic properties of criteria for selection of variables in
multiple regression. The Annals of Statistics, 12, 758–765.
Norberg, A., Abrego, N., Blanchet, F. G., Adler, F. R., Anderson, B. J., Anttila, J.,
Araújo, M. B., Dallas, T., Dunson, D., Elith, J., Foster, S. D., Fox, R., Franklin,
J., Godsoe, W., Guisan, A., O’Hara, B., Hill, N. A., Holt, R. D., Hui, F. K. C.,
Husby, M., Kålås, J. A., Lehikoinen, A., Luoto, M., Mod, H. K., Newell, G.,
Renner, I., Roslin, T., Soininen, J., Thuiller, W., Vanhatalo, J., Warton, D., White,
M., Zimmermann, N. E., Gravel, D., & Ovaskainen, O. (2019). A comprehensive
evaluation of predictive performance of 33 species distribution models at species
and community levels. Ecological Monographs, 89, e01370.
Oberdorff, T., & Hughes, R. M. (1992). Modification of an index of biotic integrity
based on fish assemblages to characterize rivers of the seine basin, France. Hy-
drobiologia, 228, 117–130.
Oksanen, J., Blanchet, F. G., Friendly, M., Kindt, R., Legendre, P., McGlinn, D.,
Minchin, P. R., O’Hara, R. B., Simpson, G. L., Solymos, P., Stevens, M. H. H.,
Szoecs, E., & Wagner, H. (2017). vegan: Community ecology package. R package
version 2.4-3.
Olden, J., Lawler, J., & Poff, N. (2008). Machine learning methods without tears: A
primer for ecologists. The Quarterly Review of Biology, 83, 171–193.
Ord, T. J., Charles, G. K., Palmer, M., & Stamps, J. A. (2016). Plasticity in so-
cial communication and its implications for the colonization of novel habitats.
Behavioral Ecology, 27, 341–351.
Ovaskainen, O., & Abrego, N. (2020). Joint species distribution modelling: With
applications in R. Cambridge: Cambridge University Press.
Ovaskainen, O., Roy, D. B., Fox, R., & Anderson, B. J. (2016). Uncovering hid-
den spatial structure in species communities with spatially explicit joint species
distribution models. Methods in Ecology and Evolution, 7, 428–436.
Ovaskainen, O., & Soininen, J. (2011). Making more out of sparse data: Hierarchical
modeling of species communities. Ecology, 92, 289–295.
Ovaskainen, O., Tikhonov, G., Dunson, D., Grótan, V., Engen, S., Sæther, B.-E.,
& Abrego, N. (2017a). How are species interactions structured in species-rich
communities? A new method for analysing time-series data. Proceedings of the
Royal Society B: Biological Sciences, 284, 20170768.
Ovaskainen, O., Tikhonov, G., Norberg, A., Guillaume Blanchet, F., Duan, L., Dun-
son, D., Roslin, T., & Abrego, N. (2017b). How to make more out of community
data? A conceptual framework and its implementation as models and software.
Ecology Letters, 20, 561–576.
Packard, G. C. (2013). Is logarithmic transformation necessary in allometry? Bio-
logical Journal of the Linnean Society, 109, 476–486.
Pagel, M. (1997). Inferring evolutionary processes from phylogenies. Zoologica
Scripta, 26, 331–348.
Pagel, M. (1999). Inferring the historical patterns of biological evolution. Nature,
401, 877–884.
Phillips, S. J., Anderson, R. P., & Schapire, R. E. (2006). Maximum entropy modeling
of species geographic distributions. Ecological Modelling, 190, 231–259.
Pitman, E. T. G. (1939). A note on normal correlation. Biometrika, 31, 9–12.
Pledger, S. (2000). Unified maximum likelihood estimates for closed capture–
recapture models using mixtures. Biometrics, 56, 434–442.
Pledger, S., & Arnold, R. (2014). Multivariate methods using mixtures: Correspon-
dence analysis, scaling and pattern-detection. Computational Statistics & Data
Analysis, 71, 241–261.
Pollock, L. J., Morris, W. K., & Vesk, P. A. (2012). The role of functional traits
in species distributions revealed through a hierarchical model. Ecography, 35,
716–725.
Pollock, L. J., Tingley, R., Morris, W. K., Golding, N., O’Hara, R. B., Parris,
K. M., Vesk, P. A., & McCarthy, M. A. (2014). Understanding co-occurrence
by modelling species simultaneously with a Joint Species Distribution Model
(JSDM). Methods in Ecology and Evolution, 5, 397–406.
Popovic, G. C., Hui, F. K., & Warton, D. I. (2018). A general algorithm for covariance
modeling of discrete data. Journal of Multivariate Analysis, 165, 86–100.
Popovic, G. C., Hui, F. K. C., & Warton, D. I. (2022). Fast model-based ordination
with copulas. Methods in Ecology and Evolution, 13, 194–202.
Popovic, G. C., Warton, D. I., Thomson, F. J., Hui, F. K. C., & Moles, A. T. (2019).
Untangling direct species associations from indirect mediator species effects with
graphical models. Methods in Ecology and Evolution, 10, 1571–1583.
Rao, C. R. (1973). Linear statistical inference and its applications (2nd ed.). New
York: Wiley.
Renner, I. W., Elith, J., Baddeley, A., Fithian, W., Hastie, T., Phillips, S. J., Popovic,
G., & Warton, D. I. (2015). Point process models for presence-only analysis.
Methods in Ecology and Evolution, 6, 366–379.
Renner, I. W., & Warton, D. I. (2013). Equivalence of MAXENT and Poisson point
process models for species distribution modeling in ecology. Biometrics, 69, 274–
281.
Rensch, B. (1954). The relation between the evolution of central nervous functions
and the body size of animals. Evolution as a Process (pp. 181–200).
Rice, K. (2004). Sprint research runs into a credibility gap. Nature, 432, 147.
Ricker, W. E. (1973). Linear regressions in fishery research. Journal of the Fisheries
Research Board of Canada, 30, 409–434.
Robert, C., & Casella, G. (2013). Monte Carlo statistical methods. Springer.
Roberts, D. A., & Poore, A. G. (2006). Habitat configuration affects colonisation of
epifauna in a marine algal bed. Biological Conservation, 127, 18–26.
Roberts, D. R., Bahn, V., Ciuti, S., Boyce, M. S., Elith, J., Guillera-Arroita, G.,
Hauenstein, S., Lahoz-Monfort, J. J., Schröder, B., Thuiller, W., Warton, D. I.,
Wintle, B. A., Hartig, F., & Dormann, C. F. (2017). Cross-validation strategies
for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography,
40, 913–929.
Schaub, M., & Abadi, F. (2011). Integrated population models: A novel analysis
framework for deeper insights into population dynamics. Journal of Ornithology,
152, 227–237.
Schielzeth, H., Dingemanse, N. J., Nakagawa, S., Westneat, D. F., Allegue, H., Teplit-
sky, C., Réale, D., Dochtermann, N. A., Garamszegi, L. Z., & Araya-Ajoy, Y. G.
(2020). Robustness of linear mixed-effects models to violations of distributional
assumptions. Methods in Ecology and Evolution, 11, 1141–1152.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics,
6, 461–464.
Sentinella, A. T., Warton, D. I., Sherwin, W. B., Offord, C. A., & Moles, A. T.
(2020). Tropical plants do not have narrower temperature tolerances, but are more
at risk from warming because they are close to their upper thermal limits. Global
Ecology and Biogeography, 29, 1387–1398.
Shao, J. (1993). Linear model selection by cross-validation. Journal of the American
Statistical Association, 88, 486–494.
Shermer, M. (2012). The believing brain: From spiritual faiths to political
convictions–how we construct beliefs and reinforce them as truths. Hachette UK.
Shipley, B. (2010). From plant traits to vegetation structure: Chance and selection
in the assembly of ecological communities. Cambridge University Press.
Shipley, B. (2016). Cause and correlation in biology: A user’s guide to path analysis,
structural equations and causal inference with R. Cambridge University Press.
Shipley, B., Vile, D., & Garnier, E. (2006). From plant traits to plant communities:
A statistical mechanistic approach to biodiversity. Science, 314, 812–814.
Simpson, G. L. (2016). Permute: Functions for generating restricted permutations
of data. R package version 0.9-4.
Tikhonov, G., Abrego, N., Dunson, D., & Ovaskainen, O. (2017). Using joint species
distribution models for evaluating how species-to-species associations depend on
the environmental context. Methods in Ecology and Evolution, 8, 443–452.
Tylianakis, J. M., & Morris, R. J. (2017). Ecological networks across environmental
gradients. Annual Review of Ecology, Evolution, and Systematics, 48, 25–48.
Væth, M. (1985). On the use of Wald’s test in exponential families. International
Statistical Review, 53, 199–214.
Walker, S. C., & Jackson, D. A. (2011). Random-effects ordination: Describing and
predicting multivariate correlations and co-occurrences. Ecological Monographs,
81, 635–663.
Wang, Y., Naumann, U., Wright, S. T., & Warton, D. I. (2012). mvabund—an R
package for model-based analysis of multivariate abundance data. Methods in
Ecology and Evolution, 3, 471–474.
Warton, D. I. (2005). Many zeros does not mean zero inflation: Comparing the
goodness-of-fit of parametric models to multivariate abundance data. Environ-
metrics, 16, 275–289.
Warton, D. I. (2007). Robustness to failure of assumptions of tests for a common
slope amongst several allometric lines—a simulation study. Biometrical Journal,
49, 286–299.
Warton, D. I. (2008). Penalized normal likelihood and ridge regularization of corre-
lation and covariance matrices. Journal of the American Statistical Association,
103, 340–349.
Warton, D. I. (2011). Regularized sandwich estimators for analysis of high dimen-
sional data using generalized estimating equations. Biometrics, 67, 116–123.
Warton, D. I. (2018). Why you cannot transform your way out of trouble for small
counts. Biometrics, 74, 362–368.
Warton, D. I., & Aarts, G. (2013). Advancing our thinking in presence-only and
used-available analysis. Journal of Animal Ecology, 82, 1125–1134.
Warton, D. I., Duursma, R. A., Falster, D. S., & Taskinen, S. (2012a). smatr 3—an
R package for estimation and inference about allometric lines. Methods in Ecology
and Evolution, 3, 257–259.
Warton, D. I., & Hui, F. K. C. (2011). The arcsine is asinine: The analysis of
proportions in ecology. Ecology, 92, 3–10.
Warton, D. I., & Hui, F. K. C. (2017). The central role of mean-variance relationships
in the analysis of multivariate abundance data: A response to Roberts (2017).
Methods in Ecology and Evolution, 8, 1408–1414.
Warton, D. I., Lyons, M., Stoklosa, J., & Ives, A. R. (2016). Three points to con-
sider when choosing a LM or GLM test for count data. Methods in Ecology and
Evolution, 7, 882–890.
Warton, D. I., Shipley, B., & Hastie, T. (2015). CATS regression—a model-based
approach to studying trait-based community assembly. Methods in Ecology and
Evolution, 6, 389–398.
Warton, D. I., Thibaut, L., & Wang, Y. A. (2017). The PIT-trap—a “model-free”
bootstrap procedure for inference about regression models with discrete, multi-
variate responses. PLoS One, 12, e0181790.
Warton, D. I., & Weber, N. C. (2002). Common slope tests for errors-in-variables
models. Biometrical Journal, 44, 161–174.
Warton, D. I., Wright, I. J., Falster, D. S., & Westoby, M. (2006). Bivariate line-fitting
methods for allometry. Biological Reviews, 81, 259–291.
Warton, D. I., Wright, S. T., & Wang, Y. (2012b). Distance-based multivariate anal-
yses confound location and dispersion effects. Methods in Ecology and Evolution,
3, 89–101.
Warwick, R. M., Clarke, K. R., & Suharsono (1990). A statistical analysis of coral
community responses to the 1982–1983 El Niño in the Thousand Islands, Indone-
sia. Coral Reefs, 8, 171–179.
Wasserstein, R. L., & Lazar, N. A. (2016). The ASA’s statement on p-values: Context,
process, and purpose. The American Statistician, 70, 129–133.
Weisbecker, V., & Warton, D. I. (2006). Evidence at hand: Diversity, functional impli-
cations, and locomotor prediction in intrinsic hand proportions of diprotodontian
marsupials. Journal of Morphology, 267, 1469–1485.
Welsh, A. H., Cunningham, R. B., Donnelly, C. F., & Lindenmayer, D. B. (1996).
Modelling the abundance of rare species: Statistical methods for counts with extra
zeros. Ecological Modelling, 88, 297–308.
Westoby, M., Leishman, M. R., & Lord, J. M. (1995). On misinterpreting the ‘phy-
logenetic correction’. Journal of Ecology, 83, 531–534.
Wheeler, J. A., Cortés, A. J., Sedlacek, J., Karrenberg, S., van Kleunen, M., Wipf,
S., Hoch, G., Bossdorf, O., & Rixen, C. (2016). The snow and the willows: Earlier
spring snowmelt reduces performance in the low-lying alpine shrub Salix herbacea.
Journal of Ecology, 104, 1041–1050.
White, C. (2005). Hunters ring dinner bell for ravens: Experimental evidence of a
unique foraging strategy. Ecology, 86, 1057–1060.
Whittaker, R. H. (1972). Evolution and measurement of species diversity. Taxon, 21,
213–251.
Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. New York:
Springer. ISBN: 978-3-319-24277-4.
Williams, W., & Lambert, J. T. (1966). Multivariate methods in plant ecology: V.
Similarity analyses and information-analysis. The Journal of Ecology, 54, 427–
445.
Wood, S. N. (2011). Fast stable restricted maximum likelihood and marginal like-
lihood estimation of semiparametric generalized linear models. Journal of the
Royal Statistical Society: Series B (Statistical Methodology), 73, 3–36.
Wood, S. N. (2017). Generalized additive models: An introduction with R (2nd ed.).
CRC Press.
Xiao, X., White, E. P., Hooten, M. B., & Durham, S. L. (2011). On the use of
log-transformation vs. nonlinear regression for analyzing biological power laws.
Ecology, 92, 1887–1894.
Yee, T. (2006). Constrained additive ordination. Ecology, 87, 203–213.
Yee, T. W. (2010). The VGAM package for categorical data analysis. Journal of
Statistical Software, 32, 1–34.
Yee, T. W., & Mitchell, N. D. (1991). Generalized additive models in plant ecology.
Journal of Vegetation Science, 2, 587–602.
Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with
grouped variables. Journal of the Royal Statistical Society: Series B (Statistical
Methodology), 68, 49–67.
Zeger, S. L., & Liang, K. Y. (1986). Longitudinal data analysis for discrete and
continuous outcomes. Biometrics, 42, 121–130.
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American
Statistical Association, 101, 1418–1429.
Zuur, A., Ieno, E., Walker, N., Saveliev, A., & Smith, G. (2009). Mixed effects models
and extensions in ecology with R. New York: Springer.
Index
A

Abundance
  relative (see Diversity (β)), 344
Abundance property, 332
  in model selection, 358
Additive
  assumption, 84, 174
Akaike information criterion (AIC), 119
Allometry, 318
All-subsets selection, 121
Analysis of covariance (ANCOVA), 86
  interaction, 99
Analysis of deviance, 253
Analysis of variance (ANOVA), 73
  as a multiple regression, 74
ANCOVA, see Analysis of covariance (ANCOVA)
ANOVA, see Analysis of variance (ANOVA)
Assumptions
  violations of, 26, 28, 29
  for ANCOVA, 88
  for linear models, 102
  for t tests, 46
  See also Mind your Ps and Qs
Autocorrelation, 152
  assumption violation, 154
  is hard for big data, 167
  as linear contrasts, 154
  phylogenetic, 170
  sample autocorrelation function, 152
  spatial, 163
  temporal, 155
Autoregressive model, 165

B

Backward selection, 121
Basis functions, 185
Bayesian information criterion (BIC), 119
Bias
  estimation bias, 28
  unbiased, 27
Bias-variance tradeoff, 108, 109
Big data, 411
Blocking, 85
  increases efficiency, 104
Block resampling, 224, 337
Bootstrap, 212
  vs. permutation test, 214
Borrow strength, 360

C

Categorical variable, 3
Causal models, 6
Causation, 6
Central limit theorem, 30
  how fast does it work, 32
  proof, 31
  See also Greatest theorem in the known universe
Circular variables, 193
Classification, 364, 370
  soft, 370
Collinearity, 70
  mucks up stepwise methods, 123
Community assembly by trait selection (CATS), 379
Community composition data, see Multivariate abundances
N

Nested factor, 134, 135
Nominal variable, 14
Non-linearity, 365
Non-parametric statistics, 33, 34
Normality assumption, 45, 51, 64, 137, 174, 187, 302
  checking, 24, 45
Normal quantile plot, 24

O

Observational study, 6, 11
Offset, 259
Ordinal variable, 14
Ordination
  constrained, 364
  partial, 396
  unconstrained, 395
Overdispersion, 256, 409
  is not a thing for binary data, 258
Overfit, 111

P

Paired data, 81
Paired t-test
  as a main effects ANOVA, 83
Pairwise comparisons, see Multiple testing, in ANOVA
Parameter, 17
  nuisance, 124
Parametric bootstrap, 145, 212
  is not design-based, 216
Partial residuals, 68
Penalised likelihood, 124
Permutation test, 207
Phylogenetic regression, 171
Phylogeny, 170
PIT-trap, 255
Plot the data
  always, 13, 277, 295, 314
  fourth corner interactions, 384

Q

Quantile regression, 103
Quantitative variable, 3

R

R2
  is a bad idea for model selection, 111
  and sampling design, 57
Random effects, 133
  make everything hard, 141
  observation-level, 257
  phylogenetically structured, 171
  spatially structured, 165
  temporally structured, 157
Random intercepts model, 156
Randomised blocks design, see Blocking
Random sampling, 10
  can be difficult, 11
  satisfies independence assumptions, 10, 11, 23, 46, 84, 85, 102
Random slope model, 157
Random variable, 9
Regression, 3, 4
  conditions on x, 61
  high-dimensional, 333
  least squares, 47
  logistic, 241
  to the mean, 319
  multiple, 64
  negative binomial, 242
  Poisson, 242
  reduced rank, 364
  simple linear, 47
Repeated measures, see Autocorrelation, temporal
Representative, 7
Resampling, 207
  is computationally intensive, 342
  rows of data, 337
Research question, 6
  determines analysis procedure, 12
T

Target population, 7, 17

Z

Zero-inflated, 261, 370