Computational Modeling in Cognition
Lewandowsky, Stephan.
Computational modeling in cognition : principles and practice / Stephan Lewandowsky
and Simon Farrell.
p. cm.
Includes bibliographical references and index.
ISBN 978-1-4129-7076-1 (pbk.)
1. Cognition—Mathematical models. I. Farrell, Simon, 1976- II. Title.
BF311.L467 2011
153.01'5118—dc22 2010029246
Contents
Preface
1. Introduction
1.1 Models and Theories in Science
1.2 Why Quantitative Modeling?
1.3 Quantitative Modeling in Cognition
1.3.1 Models and Data
1.3.2 From Ideas to Models
1.3.3 Summary
1.4 The Ideas Underlying Modeling and Its Distinct Applications
1.4.1 Elements of Models
1.4.2 Data Description
1.4.3 Process Characterization
1.4.4 Process Explanation
1.4.5 Classes of Models
1.5 What Can We Expect From Models?
1.5.1 Classification of Phenomena
1.5.2 Emergence of Understanding
1.5.3 Exploration of Implications
1.6 Potential Problems
1.6.1 Scope and Testability
1.6.2 Identification and Truth
References
Author Index
Subject Index
About the Authors
Preface
Pareto principle, we believe that 80% of our readership will be interested in 20% of
the field—and so we focused on making those 20% particularly accessible.
There are several ways in which this book can be used and perused. The order
of our chapters is dictated by logic, and we thus present basic modeling tools
before turning to model selection and so on. However, the chapters can be read in
a number of different orders, depending on one’s background and intentions.
For example, readers with very little background in modeling may wish to begin
by reading Chapters 1, 2, and 3, followed by the first part of Chapter 7 and all of
Chapter 8. Then, you may wish to go back and read Chapters 4, 5, and 6. In con-
trast, if this book is used for formal tuition in a course, then we suggest that the
chapters be assigned in the order in which they are presented; in our experience, this
order follows the most logical progression in which the knowledge is presented.
This project would not have been possible without support and assistance from
many sources. We are particularly grateful to Klaus Oberauer, John Dunn, E.-J.
Wagenmakers, Jeff Rouder, Lael Schooler, and Roger Ratcliff for comments and
clarifications on parts of this book.
—Stephan Lewandowsky
—Simon Farrell
Perth and Bristol,
April 2010
1
Introduction
Figure 1.1 An example of data that defy easy description and explanation without a quan-
titative model.
available, when Copernicus replaced the geocentric Ptolemaic system with a heli-
ocentric model: Today, we know that retrograde motion arises from the fact that
the planets travel at different speeds along their orbits; hence, as Earth “overtakes”
Mars, for example, the red planet will appear to reverse direction as it falls behind
the speeding Earth.
This example permits several conclusions that will be relevant throughout
the remainder of this book. First, the pattern of data shown in Figure 1.1 defies
description and explanation unless one has a model of the underlying process.
It is only with the aid of a model that one can describe and explain planetary
motion, even at a verbal level (readers who doubt this conclusion may wish to
invite friends or colleagues to make sense of the data without knowing their
source).
Second, any model that explains the data is itself unobservable. That is,
although the Copernican model is readily communicated and represented (so
readily, in fact, that we decided to omit the standard figure showing a set of con-
centric circles), it cannot be directly observed. Instead, the model is an abstract
explanatory device that “exists” primarily in the minds of the people who use it to
describe, predict, and explain the data.
Third, there nearly always are several possible models that can explain a given
data set. This point is worth exploring in a bit more detail. The overwhelming
Figure 1.2 The geocentric model of the solar system developed by Ptolemy, showing a
planet carried on an epicycle whose center travels along a deferent around the Earth,
producing retrograde motion. It was the predominant model for some 1,300 years.
success of the heliocentric model often obscures the fact that, at the time of
Copernicus’s discovery, there existed a moderately successful alternative—
namely, the geocentric model of Ptolemy shown in Figure 1.2. The model
explained retrograde motion by postulating that while orbiting around the Earth,
the planets also circle around a point along their orbit. On the additional, arguably
somewhat inelegant, assumption that the Earth is slightly offset from the center of
the planets’ orbit, this model provides a reasonable account of the data, limiting
the positional discrepancies between predicted and actual locations of, say, Mars
to about 1° (Hoyle, 1974). Why, then, did the heliocentric model so rapidly and
thoroughly replace the Ptolemaic system?1
The answer to this question is quite fascinating and requires that we move
toward a quantitative level of modeling.
But what does “better” mean? Surely it means that the Copernican system pre-
dicted the motion of planets with less quantitative error—that is, less than the 1°
error for Mars just mentioned—than its Ptolemaic counterpart? Intriguingly, this
conventional wisdom is only partially correct: Yes, the Copernican model pre-
dicted the planets’ motion in latitude better than the Ptolemaic theory, but this
difference was slight compared to the overall success of both models in predict-
ing motion in longitude (Hoyle, 1974). What gave Copernicus the edge, then,
was not “goodness of fit” alone2 but also the intrinsic elegance and simplicity
of his model—compare the Copernican account by a set of concentric circles
with the complexity of Figure 1.2, which only describes the motion of a single
planet.
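Readers who would like to see that construction at work can reproduce it in a few lines of MATLAB (the environment introduced in Chapter 2). This is only an illustrative sketch: the radii and angular speeds are arbitrary values chosen to make the retrograde loops visible, not Ptolemy's parameters.

```matlab
% Sketch of a deferent-plus-epicycle construction; all values are
% arbitrary illustrative choices, not Ptolemy's parameters.
t    = linspace(0, 4*pi, 1000);    % "time"
Rdef = 5;   wDef = 1;              % radius and angular speed of the deferent
rEpi = 1.5; wEpi = 6;              % radius and angular speed of the epicycle

% The planet rides on the epicycle, whose center rides on the deferent.
x = Rdef .* cos(wDef .* t) + rEpi .* cos(wEpi .* t);
y = Rdef .* sin(wDef .* t) + rEpi .* sin(wEpi .* t);

plot(x, y); axis equal;            % looping path as seen from Earth at the origin
```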
There is an important lesson to be drawn from this fact: The choice among
competing models—and remember, there are always several to choose from—
inevitably involves an intellectual judgment in addition to quantitative examina-
tion. Of course, the quantitative performance of a model is at least as important as
are its intellectual attributes. Copernicus would not be commemorated today had
the predictions of his model been inferior to those of Ptolemy; it was only because
the two competing models were on an essentially equal quantitative footing that
other intellectual judgments, such as a preference for simplicity over complexity,
came into play.
If the Ptolemaic and Copernican models were quantitatively comparable, why
do we use them to illustrate our central thesis that a purely verbal level of
explanation for natural phenomena is insufficient and that all sciences must seek
explanations at a quantitative level? The answer is contained in the crucial mod-
ification to the heliocentric model offered by Johannes Kepler nearly a century
later. Kepler replaced the circular orbits in the Copernican model by ellipses
with differing eccentricities (or “egg-shapedness”) for the various planets. By this
straightforward mathematical modification, Kepler achieved a virtually perfect fit
of the heliocentric model with near-zero quantitative error. There no longer was
any appreciable quantitative discrepancy between the model’s predictions and
the observed paths of planets. Kepler’s model has remained in force essentially
unchanged for more than four centuries.
The acceptance of Kepler’s model permits two related conclusions, one that
is obvious and one that is equally important but perhaps less obvious. First, if
two models are equally simple and elegant (or nearly so), the one that provides
the better quantitative account will be preferred. Second, the predictions of the
Copernican and Keplerian models cannot be differentiated by verbal interpreta-
tion alone. Both models explain retrograde motion by the fact that Earth “over-
takes” some planets during its orbit, and the differentiating feature of the two
models—whether orbits are presumed to be circular or elliptical—does not entail
any differences in predictions that can be appreciated by purely verbal analysis.
That is, although one can talk about circles and ellipses (e.g., “one is round, the
other one egg shaped”), those verbalizations cannot be turned into testable pre-
dictions: Remember, Kepler reduced the error for Mars from 1° to virtually zero,
and we challenge you to achieve this by verbal means alone.
Let us summarize the points we have made so far:
1. Data never speak for themselves but require a model to be understood and
to be explained.
2. Verbal theorizing alone ultimately cannot substitute for quantitative
analysis.
3. There are always several alternative models that vie for explanation of data,
and we must select among them.
4. Model selection rests on both quantitative evaluation and intellectual and
scholarly judgment.
All of these points will be explored in the remainder of this book. We next
turn our attention from the night sky to the inner workings of our mind, first by
showing that the preceding conclusions apply in full force to cognitive scientists
and then by considering an additional issue that is of particular concern to scholars
of the human mind.
[Figure 1.3: scatterplot relating classification “confidence” (x-axis) to recognition
probability (y-axis) for individual faces, each identified by an ID number.]
Each data point in the figure, then, represents those two responses,
averaged across participants, for a given face (identified by ID number, which can
be safely ignored). The correlation between those two measures was found to be
r = .36.
Before we move on, see if you can draw some conclusions from the pattern
in Figure 1.3. Do you think that the two tasks have much to do with each other?
Or would you think that classification and recognition are largely unrelated and
that knowledge of one response would tell you very little about what response to
expect on the other task? After all, if r = .36, then knowledge of one response
reduces uncertainty about the other one by only 13%, leaving a full 87% unex-
plained, right?
Wrong. There is at least one quantitative cognitive model (called the GCM
and described a little later), which can relate those two types of responses with
considerable certainty. This is shown in Figure 1.4, which separates classification
and recognition judgments into two separate panels, each showing the
Figure 1.4 Observed and predicted classification (left panel) and recognition (right panel).
Predictions are provided by the GCM; see text for details. Perfect prediction is represented
by the diagonal lines. Figure reprinted from Nosofsky, R. M. (1991). Tests of an exemplar
model for relating perceptual classification and recognition memory. Journal of Experimen-
tal Psychology: Human Perception and Performance, 17, 3–27. Published by the American
Psychological Association; reprinted with permission.
relationship between observed responses (on the y-axis) and the predictions of
the GCM (x-axis). To clarify, each point in Figure 1.3 is shown twice in Fig-
ure 1.4—once in each panel and in each instance plotted as a function of the
predicted response obtained from the model.
The precision of predictions in each panel is remarkable: If the model’s pre-
dictions were absolutely 100% perfect, then all points would fall on the diagonal.
They do not, but they come close (accounting for 96% and 91% of the variance in
classification and recognition, respectively). The fact that these accurate predic-
tions were provided by the same model tells us that classification and recognition
can be understood and related to each other within a common psychological the-
ory. Thus, notwithstanding the low correlation between the two measures, there
is an underlying model that explains how both tasks are related and permits accu-
rate prediction of one response from knowledge of the other. This model will
be presented in detail later in this chapter (Section 1.4.4); for now, it suffices to
acknowledge that the model relies on the comparison between each test stimulus
and all previously encountered exemplars in memory.
The two figures enforce a compelling conclusion: “The initial scatterplot . . .
revealed little relation between classification and recognition performance. At that
limited level of analysis, one might have concluded that there was little in com-
mon between the fundamental processes of classification and recognition. Under
the guidance of the formal model, however, a unified account of these processes is
achieved” (Nosofsky, 1991, p. 9). Exactly paralleling the developments in 16th-
century astronomy, data in contemporary psychology are ultimately only fully
interpretable with the aid of a quantitative model. We can thus reiterate our first
two conclusions from above and confirm that they apply to cognitive psychology
in full force—namely, that data never speak for themselves but require a model to
be understood and to be explained and that verbal theorizing alone cannot sub-
stitute for quantitative analysis. But what about the remaining earlier conclusions
concerning model selection?
Nosofsky’s (1991) modeling included a comparison between his favored exem-
plar model, whose predictions are shown in Figure 1.4, and an alternative “proto-
type” model. The details of the two models are not relevant here; it suffices to note
that the prototype model compares a test stimulus to the average of all previously
encountered exemplars, whereas the exemplar model performs the comparison
one by one between the test stimulus and each exemplar and sums the result.3
Nosofsky found that the prototype model provided a less satisfactory account of
the data, explaining only 92% and 87% of the classification and recognition vari-
ance, respectively, or about 5% less than the exemplar model. Hence, the earlier
conclusions about model selection apply in this instance as well: There were sev-
eral alternative models, and the choice between them was based on clear quanti-
tative criteria.
that verbal theories may not only be difficult to implement, as shown by Oberauer
and Lewandowsky (2008), but may even turn out to be scientifically untenable.
1.3.3 Summary
We conclude this section by summarizing our main conclusions:
1. Data never speak for themselves but require a model to be understood and
to be explained.
2. Verbal theorizing alone cannot substitute for quantitative analysis.
3. There are always several alternative models that vie for explanation of data,
and we must compare those alternatives.
4. Model comparison rests on both quantitative evaluation and intellectual and
scholarly judgment.
5. Even seemingly intuitive verbal theories can turn out to be incoherent or
ill-specified.
6. Only instantiation in a quantitative model ensures that all assumptions of a
theory have been identified and tested.
If you are interested in expanding on these conclusions and finding out more
about fascinating aspects of modeling, we recommend that you consider the stud-
ies by Estes (1975), Lewandowsky (1993), Lewandowsky and Heit (2006), Norris
(2005), and Ratcliff (1998).
for the set of numbers {2, 3, 4} is their mean—namely, 3. A good model for the
relationship between a society’s happiness and its economic wealth is a nega-
tively accelerated function, such that happiness rises steeply as one moves from
poverty to a modest level of economic security, but further increases in happi-
ness with increasing material wealth get smaller and smaller as one moves to
the richest societies (Inglehart, Foa, Peterson, & Welzel, 2008). Those models are
descriptive in nature, and they are sufficiently important to merit their own section
(Section 1.4.2).
Needless to say, scientists want to do more than describe the data. At the
very least, we want to predict new observations; for example, we might want to
predict how much happiness is likely to increase if we manage to expand the
gross national product by another zillion dollars (if you live in a rich country, the
answer is “not much”). In principle, any type of model permits prediction, and
although prediction is an important part of the scientific endeavor (and probably
the only ability of interest to stockbrokers and investment bankers), it is
not the whole story. For example, imagine that your next-door neighbor, a car
mechanic by trade, were able to predict with uncanny accuracy the outcome of
every conceivable experiment on some aspect of human cognition (a scenario
discussed by K. I. Forster, 1994). Would you be satisfied with this state of affairs?
Would your neighbor be a good model of human cognition? Clearly the answer
is no; in addition to robotic predictions, you also want an explanation for the
phenomena under consideration (Norris, 2005). Why does this particular outcome
obtain in that experiment rather than some other result?
It follows that most cognitive modeling goes beyond mere description and
seeks to permit prediction and explanation of behavior. The latter, explanatory
role is the exclusive domain of models that we refer to as providing a process
characterization and process explanation, respectively.
When models are used as an explanatory device, one other attribute becomes
particularly relevant: Models are intended to be simpler and more abstract ver-
sions of the system—in our case, human cognition—they are trying to explain
(Fum et al., 2007). Models seek to retain the essential features of the system while
discarding unnecessary details. By definition, the complexity of models will thus
never match the complexity of human cognition—nor should it, because there
is no point in replacing one thing we do not understand with another (Norris,
2005).
of Representatives by their average because in this case, there is little doubt that
the mean is the proper “model” of the data (notwithstanding the extra allowances
bestowed upon ministers). Why would we want to “model” the data in this way?
Because we are replacing the data points (N = 150 in this instance) with a single
estimated “parameter.”6 In this instance, the parameter is the sample mean, and
reducing 150 points into one facilitates understanding and efficient communica-
tion of the data.
However, we must not become complacent in light of the apparent ease with
which we can model data by their average. As a case in point, consider U.S. Pres-
ident Bush’s 2003 statement in promotion of his tax cut, that “under this plan,
92 million Americans receive an average tax cut of $1,083.” Although this num-
ber, strictly speaking, was not incorrect, it arguably did not represent the best
model of the proposed tax cut, given that 80% of taxpayers would receive less
than this cut, and nearly half (i.e., some 45 million people) would receive less
than $100 (Verzani, 2004). The distribution of tax cuts was so skewed (bottom
20% of income earners slated to receive $6 compared to $30,127 for the top 1%)
that the median or a trimmed mean would have been the preferable model of the
proposed legislation in this instance.
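The point can be made concrete with a few lines of MATLAB (the environment introduced in Chapter 2). The numbers below are invented solely for illustration (they are not the actual tax figures), but they share the skew of the example just described.

```matlab
% Invented, heavily skewed "tax cuts" for illustration only.
cuts = [repmat(6, 1, 80), repmat(500, 1, 15), repmat(30000, 1, 5)];

sorted  = sort(cuts);
k       = round(0.05 * numel(sorted));       % trim 5% from each tail
trimmed = mean(sorted(k+1 : end-k));

fprintf('mean = %.0f, median = %.0f, trimmed mean = %.0f\n', ...
        mean(cuts), median(cuts), trimmed);
% The mean is pulled far above what most cases receive; the median and
% trimmed mean are closer to the typical case.
```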
Controversies about the proper model with which to describe data also arise
in cognitive science, although fortunately with more transparency and less disin-
genuousness than in the political scene. In fact, data description, by itself, can
have considerable psychological impact. As a case in point, consider the debate on
whether learning of a new skill is best understood as following a “power law” or is
better described by an exponential improvement (Heathcote, Brown, & Mewhort,
2000). There is no doubt that the benefits from practice accrue in a nonlinear
fashion: The first time you try your hands at a new skill (for example, creating
an Ikebana arrangement), things take seemingly forever (and the output may not
be worth writing home about). The second and third time round, you will notice
vast improvements, but eventually, after some dozens of trials, chances are that
all further improvements are small indeed.
What is the exact functional form of this pervasive empirical regularity? For
several decades, the prevailing opinion had been that the effect of practice is best
captured by a power law—that is, by the function (shown here in its simplest
possible form),
RT = N^{-β},  (1.1)
where RT represents the time to perform the task, N represents the number of
learning trials to date, and β is the learning rate. Figure 1.5 shows sample data,
taken from Palmeri’s (1997) Experiment 3, with the appropriate best-fitting power
function superimposed as a dashed line.
[Figure 1.5: response time (ms) plotted against trial number.]
Figure 1.5 Sample power law learning function (dashed line) and alternative exponential
function (solid line) fitted to the same data. Data are represented by dots and are taken
from Palmeri’s (1997) Experiment 3 (Subject 3, Pattern 13). To fit the data, the power and
exponential functions were a bit more complex than described in Equations 1.1 and 1.2
because they also contained an asymptote (A) and a multiplier (B). Hence, the power
function took the form RT = A_P + B_P × (N + 1)^{-β}, and the exponential function was
RT = A_E + B_E × e^{-αN}.
Heathcote et al. (2000) argued that the data are better described by an expo-
nential function given by (again in its simplest possible form)
RT = e^{-αN},  (1.2)
where N is as before and α the learning rate. The best-fitting exponential function
is shown by the solid line in Figure 1.5; you will note that the two competing
descriptions or models do not appear to differ much. The power function captures
the data well, but so does the exponential function, and there is not much to tell
between them: The root mean squared deviation (RMSD), which represents
the average deviation of the data points from the predicted function, was 482.4 for
the power function compared to 526.9 for the exponential. Thus, in this instance,
the power function fits “better” (by providing some 50 ms less error in its pre-
dictions than the exponential), but given that RT's range is from somewhere less
than 1000 ms to 7 seconds, this difference is not particularly striking.
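For readers who want to see how such a comparison is carried out, the following MATLAB sketch fits the two augmented functions from the Figure 1.5 caption and reports each function's RMSD. The response time vector and the starting values are hypothetical placeholders, not Palmeri's (1997) data, so the resulting numbers will differ from those just reported.

```matlab
% Hypothetical practice data (ms); substitute real RTs here.
rt = [6500 5200 4100 3600 3100 2800 2500 2300 2100 2000]';
N  = (1:numel(rt))';

% Augmented functions from the Figure 1.5 caption.
powFun = @(p, N) p(1) + p(2) .* (N + 1).^(-p(3));   % p = [A_P  B_P  beta]
expFun = @(p, N) p(1) + p(2) .* exp(-p(3) .* N);    % p = [A_E  B_E  alpha]

% Root mean squared deviation between data and predictions.
rmsd = @(pred) sqrt(mean((rt - pred).^2));

% Minimize RMSD for each function; starting values are arbitrary guesses.
pPow = fminsearch(@(p) rmsd(powFun(p, N)), [1500 5000 0.5]);
pExp = fminsearch(@(p) rmsd(expFun(p, N)), [1500 5000 0.2]);

fprintf('RMSD: power = %.1f, exponential = %.1f\n', ...
        rmsd(powFun(pPow, N)), rmsd(expFun(pExp, N)));
```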
So, why would this issue be of any import? Granted, we wish to describe the
data by the appropriate model, but surely neither of the models in Figure 1.5 mis-
represents essential features of the data anywhere near as much as U.S. President
Bush did by reporting only the average implication of his proposed tax cut. The
answer is that the choice of the correct descriptive model, in this instance, car-
ries important implications about the psychological nature of learning. As shown
in detail by Heathcote et al. (2000), the mathematical form of the exponential
function necessarily implies that the learning rate, relative to what remains to be
learned, is constant throughout practice. That is, no matter how much practice
you have had, learning continues by enhancing your performance by a constant
fraction. By contrast, the mathematics of the power function imply that the rel-
ative learning rate is slowing down as practice increases. That is, although you
continue to show improvements throughout, the rate of learning decreases with
increasing practice. It follows that the proper characterization of skill acquisition
data by a descriptive model, in and of itself, has considerable psychological impli-
cations (we do not explore those implications here; see Heathcote et al., 2000, for
pointers to the background).
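The contrast between the two learning-rate claims can be made explicit with a little calculus, using the augmented forms from the Figure 1.5 caption; what follows is a sketch of the standard argument rather than a reproduction of Heathcote et al.'s (2000) derivation. For the exponential function, RT = A_E + B_E × e^{-αN}, the amount left to learn after N trials is RT − A_E = B_E × e^{-αN} and the learning rate is −dRT/dN = α × B_E × e^{-αN}; their ratio, the relative learning rate, is therefore α, a constant. For the power function, RT = A_P + B_P × (N + 1)^{-β}, the learning rate is β × B_P × (N + 1)^{-β-1} and the amount left to learn is B_P × (N + 1)^{-β}, so the relative learning rate is β/(N + 1), which shrinks as practice accumulates.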
Just to wrap up this example, Heathcote et al. (2000) concluded after rean-
alyzing a large body of existing data that the exponential function provided a
better description of skill acquisition than the hitherto presumed power law. For
our purposes, their analysis permits the following conclusions: First, quantitative
description of data, by itself, can have considerable psychological implications
because it prescribes crucial features of the learning process. Second, the exam-
ple underscores the importance of model selection that we alluded to earlier; in
this instance, one model was chosen over another on the basis of strict quanti-
tative criteria. We revisit this issue in Chapter 5. Third, the fact that Heathcote
et al.’s model selection considered the data of individual subjects, rather than the
average across participants, identifies a new issue—namely, the most appropriate
way in which to apply a model to the data from more than one individual—that
we consider in Chapter 3.
The selection among competing functions is not limited to the effects of prac-
tice. Debates about the correct descriptive function have also figured prominently
in the study of forgetting. Does the rate of forgetting differ with the extent of learn-
ing? Is the rate of information loss constant over time? Although the complete
pattern of results is fairly complex, two conclusions appear warranted (Wixted,
2004a): First, the degree of learning does not affect the rate of forgetting. Hence,
irrespective of how much you cram for an exam, you will lose the information at
the same rate—but of course this is not an argument against dedicated study; if
you learn more, you will also retain more, irrespective of the fact that the rate of
loss per unit of time remains the same. Second, the rate of forgetting decelerates
over time. That is, whereas you might lose some 30% of the information on the
first day, on the second day, the loss may be down to 20%, then 10%, and so on.
Again, as in the case of practice, two conclusions are relevant here: First, quantita-
tive comparison among competing descriptive models was required to choose the
appropriate function (it is a power function, or something very close to it). Second,
although the shape of the “correct” function has considerable theoretical import
because it may imply that memories are “consolidated” over time after study (see
Wixted, 2004a, 2004b, for a detailed consideration, and see G. D. A. Brown &
Lewandowsky, 2010, for a contrary view), the function itself has no psychologi-
cal content.
The mere description of data can also have psychological implications when
the behavior it describes is contrasted to normative expectations (Luce, 1995).
Normative behavior refers to how people would behave if they conformed to
the rules of logic or probability. For example, consider the following syllogism
involving two premises (P) and a conclusion (C). P1: All polar bears are ani-
mals. P2: Some animals are white. C: Therefore, some polar bears are white. Is
this argument valid? There is a 75% to 80% chance that you might endorse this
conclusion (e.g., Helsabeck, 1975), even though it is logically false (to see why,
replace white with brown in P2 and C). This example shows that people tend to
violate normative expectations even in very simple situations. In this instance,
the only descriptive model that is required to capture people’s behavior—and to
notice the normative violation—is a simple proportion (i.e., .75–.80 of people
commit this logical error). In other, more realistic instances, people’s normatively
irrational behavior is best captured by a rather more complex descriptive model
(e.g., Tversky & Kahneman, 1992).
We have presented several descriptive models and have shown how they can
inform psychological theorizing. Before we move on, it is important to identify
the common threads among those diverse examples. One attribute of descriptive
models is that they are explicitly devoid of psychological content; for example,
although the existence of an exponential practice function constrains possible
learning mechanisms, the function itself has no psychological content. It is merely
concerned with describing the data.
For the remainder of this chapter, we will be considering models that have
increasingly more psychological content. In the next section, we consider models
that characterize cognitive processes at a highly abstract level, thus going beyond
data description, but that do not go so far as to explain those processes in detail.
The final section considers models that go beyond characterization and explain
the cognitive processes.
[Figure 1.6: a processing tree with branches labeled I and 1 − I, and R and 1 − R,
terminating in correct (C) or error (E) outcomes.]
Figure 1.6 A simple multinomial processing tree model proposed by Schweickert (1993)
for recall from short-term memory.
Figure 1.7 The representational assumptions underlying the generalized context model
(GCM). Panel A shows stimuli that differ along one dimension only (line length), and
Panel B shows stimuli that differ along two dimensions (line length and angle). In both
panels, a representative distance (d) between two stimuli is shown by the broken line.
angle. Panel B again shows the distance (d) between two stimuli, which is for-
mally given by the following equation:
d_{ij} = ( Σ_{k=1}^{K} |x_{ik} − x_{jk}|^2 )^{1/2},  (1.3)
where x_{ik} is the value of dimension k for test item i (let's say that's the mid-
dle stimulus in Panel B of Figure 1.7), and x_{jk} is the value of dimension k for
the stored exemplar j (say, the right-most stimulus in the panel). The number of
dimensions that enter into computation of the distance is arbitrary; the cartoon
faces were characterized by four dimensions, but of course we cannot easily show
more than two dimensions at a time. Those dimensions were eye height, eye sep-
aration, nose length, and mouth height.9
An easy way to understand Equation 1.3 is by realizing that it merely restates
the familiar Pythagorean theorem (i.e., d^2 = a^2 + b^2), where a and b are the thin
solid lines in Panel B of Figure 1.7, which are represented by the more general
notation of dimensional differences (i.e., x_{ik} − x_{jk}) in the equation.
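Written out for the two-dimensional case of Panel B (i.e., with K = 2), Equation 1.3 is exactly that theorem: d_{ij} = ( |x_{i1} − x_{j1}|^2 + |x_{i2} − x_{j2}|^2 )^{1/2} = √(a^2 + b^2).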
How, then, does distance relate to similarity? It is intuitively obvious that
greater distances imply lesser similarity, but GCM explicitly postulates an
exponential relationship of the following form:
s_{ij} = exp(−c · d_{ij}),  (1.4)
where c is a parameter and d_{ij} the distance as just defined. Figure 1.8
visualizes this function and shows how the activation of an exemplar (i.e., s_{ij})
declines as a function of the distance (d_{ij}) between that exemplar and the test
Figure 1.8 The effects of distance on activation in the GCM. Activation (i.e., s_{ij}) is shown
as a function of distance (d_{ij}). The parameter c (see Equation 1.4) is set to .5.
stimulus. You may recognize that this function looks much like the famous gen-
eralization gradient that is observed in most situations involving discrimination
(in species ranging from pigeons to humans; Shepard, 1987): This similarity is
no coincidence; rather, it motivates the functional form of the similarity function
in Equation 1.4. This similarity function is central to GCM’s ability to generalize
learned responses (i.e., cartoon faces seen during study) to novel stimuli (never-
before-seen cartoon faces presented at test only).
It turns out that there is little left to do: Having presented a mechanism by
which a test stimulus activates an exemplar according to its proximity in psycho-
logical space, we now compute those activations for all memorized exemplars.
That is, we compute the distance d_{ij} between i and each j ∈ J as given by Equa-
tion 1.3 and derive from that the activation s_{ij} as given by Equation 1.4. The next
step is to convert the entire set of resulting activations into an explicit decision:
Which category does the stimulus belong to? To accomplish this, the activations
are summed separately across exemplars within each of the two categories. The
relative magnitude of those two sums directly translates into response probabili-
ties as follows:
P(R_i = A|i) = Σ_{j∈A} s_{ij} / ( Σ_{j∈A} s_{ij} + Σ_{j∈B} s_{ij} ),  (1.5)
where A and B refer to the two possible categories, and P(R_i = A|i) means “the
probability of classifying stimulus i into category A.” It follows that application
of Equations 1.3 through 1.5 permits us to derive classification predictions from
the GCM. It is those predictions that were plotted on the abscissa (x-axis) in the
left panel of the earlier Figure 1.4, and it is those predictions that were found to
be in such close accord with the data.
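As a concrete illustration, the following MATLAB sketch steps through Equations 1.3 to 1.5 for a toy problem. The exemplar coordinates, category assignments, test stimulus, and the value of c are hypothetical choices made up for this example; they are not Nosofsky's (1991) stimuli or parameter estimates.

```matlab
% Toy GCM example: all values below are hypothetical illustrations.
exemplars = [1 2; 2 1; 2 3;      % three stored exemplars from category A
             6 5; 7 7; 5 6];     % three stored exemplars from category B
category  = [1 1 1 2 2 2]';      % 1 = A, 2 = B
test      = [3 2];               % test stimulus in the same two-dimensional space
c         = 0.5;                 % generalization parameter (Equation 1.4)

% Equation 1.3: distance between the test item and every stored exemplar.
d = sqrt(sum((exemplars - test).^2, 2));

% Equation 1.4: distances are converted into exemplar activations.
s = exp(-c .* d);

% Equation 1.5: summed activations yield the probability of responding "A".
pA = sum(s(category == 1)) / sum(s);
fprintf('P(respond A | test stimulus) = %.3f\n', pA);
```

Repeating the last three steps for every test stimulus yields the full set of predicted classification probabilities of the kind plotted in Figure 1.4.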
If this is your first exposure to quantitative explanatory models, the GCM
may appear daunting at first glance. We therefore wrap up this section by taking a
second tour through the GCM that connects the model more directly to the cartoon
face experiment.
Figure 1.9 shows the stimuli used during training. Each of those faces cor-
responds to a memorized exemplar j that is represented by a set of dimensional
values {x_{j1}, x_{j2}, . . .}, where each x_{jk} is the numeric value associated with dimen-
sion k. For example, if the nose of exemplar j has length 5, then x_{j1} = 5 on the
assumption that the first dimension (arbitrarily) represents the length of the nose.
To obtain predictions from the model, we then present test stimuli (those
shown in Figure 1.9 but also new ones to test the model’s ability to generalize).
Those test stimuli are coded in the same way as training stimuli—namely, by a
set of dimensional values. For each test stimulus i, we first compute the distance
between it and exemplar j (Equation 1.3). We next convert that distance to an
activation of the memorized exemplar j (Equation 1.4) before summing across
exemplars within each category (Equation 1.5) to obtain a predicted response
probability. Do this for each stimulus in turn, and bingo, you have the model’s
complete set of predictions shown in Figure 1.4. How exactly are these computa-
tions performed? A whole range of options exists: If the number of exemplars and
dimensions is small, a simple calculator, paper, and a pencil will do. More than
likely, though, you will be using a computer package (such as a suitable worksheet
in Excel) or a computer program (e.g., written in a language such as MATLAB
or R). Regardless of how we perform these computations, we are assuming that
they represent an analog of the processes used by people. That is, we presume that
people remember exemplars and base their judgments on those memories alone,
without access to rules or other abstractions.
At this point, one can usefully ponder two questions. First, why would we
focus on an experiment that involves rather artificial cartoon faces? Do these
stimuli and the associated data and modeling have any bearing on classification
of “real-life” stimuli? Yes, in several ways. Not only can the GCM handle per-
formance with large and ill-defined perceptual categories (McKinley & Nosof-
sky, 1995), but recent extensions of the model have been successfully applied
to the study of natural concepts, such as fruits and vegetables (Verbeemen, Van-
paemel, Pattyn, Storms, & Verguts, 2007). The GCM thus handles a wide vari-
ety of both artificial and naturalistic categorizations. Second, one might wonder
about the motivation underlying the equations that define the GCM. Why is dis-
tance related to similarity via an exponential function (Equation 1.4)? Why are
responses determined in the manner shown in Equation 1.5? It turns out that for
any good model—and the GCM is a good model—the choice of mathematics is
not at all arbitrary but derived from some deeper theoretical principle. For exam-
ple, the distance-similarity relationship in the GCM incorporates our knowledge
about the “universal law of generalization” (Shepard, 1987), and the choice of
response implements a theoretical approach first developed by Luce (1963).
What do you now know and what is left to do? You have managed to study
your (possibly) first explanatory process model, and you should understand how
the model can predict results for specific stimuli in a very specific experiment.
However, a few obstacles remain to be overcome, most of which relate to the
“how” of applying the model to data. Needless to say, those topics will be covered
in subsequent chapters.
into those that fall within and those that fall outside a model’s scope can be very
informative: “What we hope for primarily from models is that they will bring out
relationships between experiments or sets of data that we would not otherwise
have perceived. The fruit of an interaction between model and data should be a
new categorization of phenomena in which observations are organized in terms
of a rational scheme in contrast to the surface demarcations manifest in data”
(p. 271).
Even if we find that it takes two different models to handle two distinct sub-
classes of phenomena, this need not be at all bad but may in fact crystallize an
interesting question. In physics, for example, for a very long time, light was alter-
nately considered as a wave or a stream of particles. The two models were able to
capture a different subset of phenomena, with no cross-linkage between those sets
of phenomena and the two theories. Although this state was perhaps not entirely
satisfactory, it clearly did not retard progress in physics.
In psychology, we suggest that models have similarly permitted a classifica-
tion of phenomena in categorization. We noted earlier that the GCM is a powerful
model that has had a profound impact on our understanding of how people clas-
sify stimuli. However, there are also clear limits on the applicability of the GCM.
For example, Rouder and Ratcliff (2004) showed that the GCM captures people’s
behavior only when the stimuli are few and highly discriminable. When there is
a large ensemble of confusable stimuli, by contrast, people’s behavior is better
captured by a rule model rather than the GCM’s exemplar representation (more
on this in Chapter 7). Likewise, Little and Lewandowsky (2009) showed that in
a complex probabilistic categorization task, some people will build an exemplar
representation, whereas others will create an ensemble of partial rules; the for-
mer were described well by the GCM, but the latter were best described by a rule
model. Taken together, those studies serve to delineate the applicability of two
competing theoretical approaches—namely, rules versus exemplars—somewhat
akin to the differentiation between wave and particle theories of light.
Seidenberg and McClelland (1989) presented a network that could learn to pro-
nounce both regular (lint) and irregular (pint) words from printed input: It was not
at all clear prior to the modeling being conducted that a uniform architecture could
handle both types of words. Indeed, a “central dogma” (Seidenberg & McClel-
land, 1989, p. 525) of earlier models had been that two processes were required
to accommodate irregular words (via lexical lookup) and regular (non)words (via
pronunciation rules).
As another example, Botvinick and Plaut (2006) recently presented a network
model of short-term memory that was able to learn the highly abstract ability of
“seriation”—namely, the ability to reproduce novel random sequences of stimuli.
Thus, after learning the skill, the model was capable of reproducing short serial
lists. Thus, when presented with “A K P Q B,” the model would reproduce that
sequence after a single presentation with roughly the same accuracy and subject
to the same performance constraints as humans. This might appear like a trivial
feat at first glance, but it is not: It is insufficient to learn pairwise contingencies
such as “A precedes B” because in a random list, A might precede B as frequently
as B precedes A. Likewise, it is insufficient to learn that “A occurs in position
1” because in fact A could occur in any position, and so on for any other specific
arrangements of letters (triplets, quadruplets, etc.). Instead, the model had to learn
the highly abstract ability “whatever I see I will try to reproduce in the same order”
from a small subset of all possible sequences. This abstract ability, once learned,
could then be transferred to novel sequences.
In summary, the point that models can yield unexpected and novel insights
was perhaps best summed up by Fum et al. (2007): “New ways of understanding
may assume several forms. They can derive, for instance, from the discovery of
a single unifying principle that will explain a set of hitherto seemingly unrelated
facts. They can lead to the emergence of complex, holistic forms of behavior from
the specification of simple local rules of interaction. New ways of understanding
can arise from unexpected results that defy the modeler's intuition” (p. 136).
Hinton and Shallice found that virtually any such lesioning of their network,
irrespective of location, led to a persistent co-occurrence of visual (cat read as
mat) and semantic (peach read as apricot) errors. This generality elegantly
explained why this mix of visual and semantic errors is common across a wide range
of patients whose performance deficits differ considerably in other respects.
We can draw two conclusions from this example: First, it clarifies the
in-principle point that one can do things to models that one cannot do to peo-
ple, and that those lesioning experiments can yield valuable knowledge. Second,
the fact that the results in this instance were surprising lends further support to
the point made in the previous section—namely, that models can show emergent
properties that are not at all apparent by verbal analysis alone.
earlier that place limits on the GCM’s applicability (e.g., Little & Lewandowsky,
2009; Rouder & Ratcliff, 2004; Yang & Lewandowsky, 2004).
We are now faced with a conundrum: On the one hand, we want our theo-
ries to explain data. We want powerful theories, such as Kepler’s, that explain
fundamental aspects of our universe. We want powerful theories, such as Dar-
win’s, to explain the diversity of life. On the other hand, we want the theories
to be falsifiable—that is, we want to be assured that there are at least hypotheti-
cal outcomes that, if they are ever observed, would falsify a theory. For example,
Darwin’s theory of evolution predicts a strict sequence in which species evolved;
hence, any observation to the contrary in the fossil record—for example, human
bones co-occurring with dinosaur remains in the same geological strata (e.g.,
Root-Bernstein, 1981)—would seriously challenge the theory. This point is suffi-
ciently important to bear repetition: Even though we are convinced that Darwin’s
theory of evolution, one of the most elegant and powerful achievements of human
thought, is true, we simultaneously also want it to be falsifiable—falsifiable, not
false.12 Likewise, we are committed to the idea that the earth orbits around the
sun, rather than the other way round, but as scientists, we accept that fact only
because it is based on a theory that is falsifiable—again, falsifiable, not false.
Roberts and Pashler (2000) considered the issue of falsifiability and scope
with reference to psychological models and provided an elegant graphical sum-
mary that is reproduced in Figure 1.10. The figure shows four hypothetical out-
come spaces that are formed by two behavioral measures. What those measures
represent is totally arbitrary; they could be trials to a criterion in a memory exper-
iment and a final recognition score or any other pair of measures of interest.
Within each panel, the dotted area represents all possible predictions that are
within the scope of a psychological theory. The top row of panels represents some
hypothetical theory whose predictions are constrained to a narrow range of out-
comes; any outcome outside the dotted sliver would constitute contrary evidence,
and only the narrow range of values within the sliver would constitute support-
ing evidence. Now compare that sliver to the bottom row of panels with its very
generous dotted areas; the theory shown here is compatible with nearly all possi-
ble outcomes. It follows that any observed outcome that falls within a dotted area
would offer greater support for the theory in the top row than the bottom row, sim-
ply because the likelihood of falsification is greater for the former than the latter,
thus rendering the match between data and predictions far less likely—and hence
more informative when it occurs (see Dunn, 2000, for a similar but more formal-
ized view). Ideally, we would want our theories to occupy only a small region of
the outcome space but for all observed outcomes to fall within that region—as
they do for Kepler’s and Darwin’s theories.13
Another important aspect of Figure 1.10 concerns the quality of the data,
which is represented by the columns of panels. The data (shown by the single
Figure 1.10 Four possible hypothetical relationships between theory and data involving
two measures of behavior (A and B). Each panel describes a hypothetical outcome space
permitted by the two measures. The shaded areas represent the predictions of a theory that
differs in predictive scope (narrow and broad in the top and bottom panels, respectively).
The error bars represent the precision of the observed data (represented by the black dot).
See text for details. Figure reprinted from Roberts, S., & Pashler, H. (2000). How per-
suasive is a good fit? A comment on theory testing. Psychological Review, 107, 358–367.
Published by the American Psychological Association; reprinted with permission.
black point bracketed by error bars) exhibit less variability in the left column of
panels than in the right. For now, we note briefly that support for the theory is
thus strongest in the top left panel; beyond that, we defer discussion of the impor-
tant role of data to Chapter 6. That chapter will also provide another in-depth and
more formal look at the issue of testability and falsifiability.
Let us now turn from the abstract representation in Figure 1.10 to a specific
recent instance in which two theories were compared by exploration of an out-
come space. Howard, Jing, Rao, Provyn, and Datey (2009) examined the nature
of associations among list items. Their study was quite complex, but their cen-
tral question of interest can be stated quite simply: Are associations between list
items symmetrical or asymmetrical? That is, given a to-be-memorized list such
Figure 1.11 Outcome space covered by two models examined by Howard, Jing, Rao,
Provyn, and Datey (2009). An index of remote asymmetry is shown as a function of an
index of adjacent asymmetry for a variety of parameter values for two models (referred
to here as “black” and “gray,” corresponding to the color of their plotting symbols). See
text for details. Figure reprinted from Howard, M. W., Jing, B., Rao, V. A., Provyn, J. P.,
& Datey, A. V. (2009). Bridging the gap: Transitive associations between items presented
in similar temporal contexts. Journal of Experimental Psychology: Learning, Memory &
Cognition, 35, 391–407. Published by the American Psychological Association; reprinted
with permission.
possible to choose one model over another, even if (in principle) the chosen model
is equivalent to many unknown others. Simply put, the fact that there are many
good models out there does not prevent us from rejecting the bad ones.
Third, the mere existence of equivalent models does not imply that they have
been—or indeed will be—discovered. In our experience, it is difficult enough to
select a single suitable model, let alone worry about the existence of an infinite
number of equivalent competitors.
Finally, even supposing that we must select from among a number of com-
peting models of equivalent capability (i.e., equal goodness of fit), some fairly
straightforward considerations have been put forward to achieve this (see, e.g.,
Fum et al., 2007). We revisit this issue in detail in Chapter 5.
Now let us turn to the issue concerning the “truth” of a model. Is there such
a thing as one true model? And if not, what are the implications of that? The
answer to the first question is strongly implied by the preceding discussion, and
it was most clearly stated by MacCallum (2003): “Regardless of their form or
function, or the area in which they are used, it is safe to say that these models all
have one thing in common: They are all wrong” (p. 114). Now what?
To answer this question, we again briefly digress into astronomy by noting that
Kepler’s model, being based on Newtonian physics, is—you guessed it—wrong.
We now know that Newtonian physics is “wrong” because it does not capture the
phenomena associated with relativity. Does this mean that the earth is in fact not
orbiting around the sun? No, it does not, because Kepler’s model is nonetheless
useful because within the realm for which it was designed—planetary motion—
Newtonian physics holds to an acceptable degree. Likewise, in psychology, our
wrong models can nonetheless be useful (MacCallum, 2003). We show exactly
how wrong models can still be useful at the end of the next chapter, after we
introduce a few more essential tools and concepts.
Notes
1. Lest one think that the heliocentric and geocentric models exhaust all possible views
of the solar system, it is worth clarifying that there is an infinite number of equivalent mod-
els that can adequately capture planetary motion because relative motion can be described
with respect to any possible vantage point.
2. Goodness of fit is a term for the degree of quantitative error between a model’s predic-
tions and the data; this important term and many others are discussed in detail in Chapter 2.
3. Astute readers may wonder how the two could possibly differ. The answer lies in
the fact that the similarity rule involved in the comparisons by the exemplar model is non-
linear; hence, the summed individual similarities differ from that involving the average.
This nonlinearity turns out to be crucial to the model’s overall power. The fact that subtle
matters of arithmetic can have such drastic consequences further reinforces the notion that
purely verbal theorizing is of limited value.
4. Another lesson that can be drawn from this example is a rejoinder to the popular but
largely misplaced criticism that with enough ingenuity and patience, a modeler can always
get a model to work.
5. Several distinctions between models have been proposed (e.g., Luce, 1995); ours
differs from relevant precedents by being explicitly psychological and being driven entirely
by considerations that are relevant to the cognitive researcher.
6. We will provide a detailed definition of what a parameter is in Chapter 2. For now,
it suffices to think of a parameter as a number that carries important information and that
determines the behavior of the model.
7. Some readers may have noticed that in this instance, there are two parameters (I and
R) and two data points (proportion correct and errors; C and E), which renders the model
nonidentifiable. We ignore this issue here for simplicity of exposition; for a solution, see
Hulme et al. (1997).
8. This model is a connectionist model, and these are discussed further in Chapter 8.
9. For simplicity, we omit discussion of how these psychological distances relate to the
physical measurement (e.g., line length in cm) of the stimuli; these issues are covered in,
for example, Nosofsky (1986).
10. Of course, a cognitive model may leave other levels of explanation unspecified, for
example, the underlying neural circuitry. However, at the level of abstraction within which
the model is formulated, nothing can be left unspecified.
11. Throughout this book, we use the terms falsifiable and testable interchangeably to
denote the same idea—namely, that at least in principle, there are some possible outcome(s)
that are incompatible with the theory’s predictions.
12. Despite its falsifiability, Darwin’s theory has a perfect track record of its predictions
being uniformly confirmed; Coyne (2009) provides an insightful account of the impressive
list of successes.
13. It is important to clarify that, in our view, this argument should apply only with
respect to a particular measurement. That is, for any given measurement, we prefer theo-
ries that could have only predicted a subset of all possible observations over theories that
could have predicted pretty much any outcome. However, it does not follow that we prefer
theories that are so narrow in scope that they only apply to a single experiment; on the
contrary, we prefer theories that apply to a range of different situations.
2
From Words to Models:
Building a Toolkit
Let us turn some of the ideas of the preceding chapter into practice. We begin
by presenting an influential model of working memory that has been stated at
a verbal level (A. D. Baddeley, 1986), and we then take you through the steps
required to instantiate this verbal model in a computer simulation. Along the way,
we introduce you to MATLAB, which is a programming environment that is par-
ticularly suitable for developing computer simulations. We conclude the chapter
with a toolkit of terms and concepts and some further thoughts about modeling
that are essential for your understanding of the remainder of the book.
our working memories predicts higher level cognitive abilities with considerable
precision (e.g., Oberauer, Süß, Wilhelm, & Sander, 2007).
The importance of working memory is reflected in the large number of
models that describe and explain its functioning (e.g., A. D. Baddeley, 1986;
G. D. A. Brown et al., 2007; Burgess & Hitch, 1999; Lewandowsky & Farrell,
2008b; Oberauer & Kliegl, 2006). Here, we focus on the highly influential model
by Baddeley (e.g., 1986; see also A. D. Baddeley & Hitch, 1974). The model
assumes that working memory consists of several interacting components, most
crucially a “visuospatial sketchpad” that is dedicated to the processing of visual
(as opposed to verbal) information and a “phonological loop” that is dedicated
to the retention of verbal information. The model contains a number of addi-
tional components, such as a “central executive” and a (recently added) “episodic
buffer,” but for present purposes, we restrict consideration to the phonological
loop.
Figure 2.1 Memory performance on a five-item list as a function of speech rate for 7-
and 10-year-olds. Long words such as helicopter translate into a low speech rate (because
few long words can be articulated in a second), and shorter words (e.g., bus) translate
into high speech rates. The solid line is the regression line. Data taken from Hulme, C., &
Tordoff, V. (1989). Working memory development: The effects of speech rate, word-length,
and acoustic similarity on serial recall. Journal of Experimental Child Psychology, 47,
72–87.
These data strongly imply that your social graces will be greatly facilitated
if the people you meet at a party are called Bob, Pat, Ben, or Buzz, rather than
Vladimir, Chathuranga, Konstantin, or Phillipa. What was that guy standing next
to Vladimir called again?
Figure 2.2 The evolution of our understanding of the relationship between a model (in this
instance, the phonological loop) and a key piece of supporting data (in this case, the word
length effect or WLE). Panel (a) shows the initially presumed situation in which the model
predicts the WLE and is, in turn, uniquely supported by those data. Panel (b) shows that
the supporting role of the data is weakened by the existence of alternative explanations.
Panel (c) shows that the phonological loop may not even predict the WLE in the first place.
See text for details.
else, given the many variables along which words differ in natural language. It fol-
lows that—notwithstanding its ubiquity—the WLE may be of limited theoretical
diagnosticity because it can arise from processes other than decay (e.g., differ-
ing levels of phonological complexity; for details, see Lewandowsky & Oberauer,
2008).
Second, at a theoretical level, it is known that the data in Figure 2.1 can be
explained by models that do not rely on decay (e.g., G. D. A. Brown & Hulme,
1995; Lewandowsky & Farrell, 2000). The existence of alternative explanations,
then, creates the situation in panel (b) in Figure 2.2: The supporting link between
the data and the model is weakened because the WLE is no longer uniquely and
exclusively compatible with the phonological loop.
In fact, we may take an even more radical step by asking whether the phono-
logical loop model even predicts the very data that constitute its principal sup-
port. Although this question might seem awfully contrived at first glance, you
may remember that we began this book with a chapter devoted to demonstrating
the vagaries of verbal theorizing; might this be another instance in which a verbal
model behaves differently from what is expected at first glance?
Table 2.1 Summary of Decisions Necessary to Implement the Phonological Loop Model in a Simulation (columns: Decision Point, N Alternatives, Our Decision)
items in a localized manner, by the activation of single dedicated units. For sim-
plicity, we represented order among the items by using their subscripts in the
memory array as “positional markers.” Another simplifying assumption was that
all items were assumed to be encoded with equal strength.
The second class of decisions involves more technical issues, such as the exact
mechanics of rehearsal and decay. We highlight these technical issues because
they illustrate the necessary decision-making process particularly well. To provide
a road map for what follows, we summarize our technical decisions in Table 2.1.
The pattern of our decisions (right-most column) describes our preferred instan-
tiation of the phonological loop; that is, if this particular set of decisions is taken,
then, as you will see shortly, the resulting computer simulation will reproduce
the data in Figure 2.1. The table also contains a column that lists the number
of possible alternatives at each choice point.2 Because the choices are for the
most part independent, the total number of distinct models that could have been
constructed to instantiate the phonological loop model is given by their
product—namely, 144.
noted that it is shared by at least one other instantiation of the phonological loop
(Page & Norris, 1998b).3
We must next consider the exact functional form of decay. In a simulation, it
is insufficient to speak of “decay” without specifying a function: That is, for each
second delay, is there a constant loss of information (i.e., a linear decay function),
or is the loss proportional (i.e., exponential)? Most decay models assume that
decay is exponential (e.g., Page & Norris, 1998b); indeed, a linear decay function
makes limited sense because, unless accompanied by a lower bound, it necessarily
predicts that long temporal delays can result in “negative memories.” Nonetheless,
bearing in mind this caveat, we chose a linear decay function for simplicity.
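To make the contrast concrete, here is a minimal sketch (not taken from our simulation code) of how one articulation step of duration tPerWord would diminish a vector of activations under the two functional forms; the numerical values, and the exponential time constant tau, are illustrative assumptions.

% Illustrative sketch only: linear versus exponential decay over one
% articulation step. actVals, dRate, and tPerWord follow the chapter's
% naming conventions; tau is a hypothetical time constant.
actVals  = [1 1 1 1 1];   % five encoded items at an assumed unit strength
dRate    = 0.8;           % linear decay rate (loss per second; cf. Listing 2.1)
tau      = 1.25;          % hypothetical time constant for exponential decay
tPerWord = 0.4;           % assumed articulation duration of one word (seconds)

linearDecay = actVals - dRate .* tPerWord;        % constant loss per second
expDecay    = actVals .* exp(-tPerWord ./ tau);   % proportional loss per second

Note that the linear variant can produce negative values after a long enough delay, which is exactly the caveat raised above.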
Finally, we must decide whether the decay is constant or variable. That is, does
each item for each participant decay at the same rate, or is there some variability?
This decision turns out to be crucial, and we will therefore explore more than
one option below; we begin by assuming that decay is constant across items and
people.
2.2.3 Rehearsal
Did you keep track of how many decisions we have made so far? If you
have tracked our progress through Table 2.1, you will have noticed that we have
journeyed past 5 choice points and that we have selected one particular model
from among 72 options. And we had to do this simply to implement the postulate
of the phonological-loop model that information in short-term memory decays
rapidly—a seemingly straightforward postulate that turns out to be compatible
with not one but more than 70 actual models. We return to the implications of this
many-to-one mapping after we present our simulation results.
Let us now turn to the second postulate of the phonological-loop model;
namely, that decay can be counteracted by articulatory rehearsal. What exactly is
rehearsal? There is considerable evidence that rehearsal can take several different
forms (e.g., Hudjetz & Oberauer, 2007); namely, one that is verbal-articulatory
and another that is non-verbal and “attentional.” Within the phonological-loop
framework, rehearsal is considered to be articulatory and we inherit this assump-
tion here. Moreover, following much precedent (see Tan & Ward, 2008, for a
brief review), we assume that rehearsal is equivalent to recall; that is, recitation of
a list during rehearsal is nothing but repeated recall of that list (albeit subvocally
in most cases). This assumption simplifies our modeling decisions considerably
because, having already specified a recall process, we can readily adapt it for
rehearsal.
In particular, we assume that rehearsal is ordered (i.e., people rehearse in
order, from the beginning of the list to the end), an assumption supported by data
(Tan & Ward, 2008).4 We furthermore assume that rehearsal restores the memory
representations to their original state—in that sense, rehearsal is not only identical
to recall but also isomorphic to a further presentation of the list. As during recall,
only those items are restored that have not been completely forgotten. Likewise,
as in recall, any items not completely forgotten are restored in their correct posi-
tion, and no extra-list intrusions are possible.
This, then, finalizes our decisions about how to implement the phonological-
loop model. We have settled on one of at least 144 possible instantiations of the
verbal model by making the sequence of decisions just outlined. It is important
to note that we make no claim that those were the only possible decisions; on
the contrary, our decisions were guided primarily by the desire to keep things
tractable rather than by the intention to maximize psychological plausibility. Let’s
see whether our decisions still produced a plausible model.
2.3.1 MATLAB
There are numerous ways in which models can be instantiated in a computer
simulation. We rely on the popular MATLAB programming language, and all
examples in this book assume that you have access to MATLAB and that you
have at least some limited knowledge of how to use it.
We cannot teach you MATLAB programming from the ground up in this book.
However, all programming examples that we provide in this book are extensively
commented, and it should require only some limited assistance for you to repro-
duce those examples, even if you have no programming background at all. All of
our programs are available at the supporting webpage, http://www.cogsciwa.com,
which also contains external links to other important and useful sites. One such
external site is “MATLAB Central,” which is a facility maintained by Mathworks
Inc., the company that produces and maintains MATLAB. This site contains a
huge (and growing) archive of numerous MATLAB programs that are contributed
by programmers from around the world; we will occasionally refer to MATLAB
Central in the remaining chapters. You can browse to MATLAB Central via our
supporting webpage.
In addition, there are numerous books available that can assist you in learning
MATLAB. Rosenbaum’s (2007) book is aimed specifically at behavioral scientists
and may therefore be of particular value; we also link to a number of other texts
at our supporting webpage.
Why did we choose MATLAB? Why do we think it is worth your while to
learn it? We chose MATLAB because it provides a vast array of functions that can
perform many of the operations required in computer simulations (e.g., drawing
random numbers from a variety of distributions) with great ease. The existence of
those functions allows programmers to focus on the crucial elements of their mod-
eling without having to worry about nitty-gritty details. We will next see exactly
how that is done.
same participants after they completed the recall phase by averaging the time
required for 10 overt recitations of word triplets at each pronunciation duration.
Those averages were converted to speech rates and are plotted on the abscissa in
Figure 2.1.
The simulation that we now develop instantiates this experimental procedure
with one important exception that we note below. We present this program in
separate listings (each listing is like a figure, except that it contains programming
code rather than a graph). Each listing is followed by an explanation of what this
particular segment of code accomplishes.
Where exactly did the preceding segment come from? It represents the first
handful of lines of a MATLAB program that we typed into the program editor
that forms part of the MATLAB package; as we noted earlier, we cannot help you
with setting up MATLAB and we are assuming that if you want to reproduce our
simulation, you will learn how to use the MATLAB environment (including its
editor) on your own.
The first four lines of code are comments that are not addressed to the computer but to a human reader; they tell you what the program is about. You will find
that all programs in this book are extensively commented. Comments are an indis-
pensable part of a program because even if you have written it yourself, rest
assured that you will forget what it does in a frighteningly short time—hence,
any time invested in commenting is time saved later on when you are trying to
figure out your own brilliant coding (let alone when other scientists want to read
your code).
Lines 7 to 13 declare various parameters that are explained in the accompa-
nying comments. To understand what those variables do, let’s consider the next
listing.
14  rRange = linspace(1.5, 4., 15);
15  tRange = 1./rRange;
16  pCor = zeros(size(rRange));
17
18  i=1;  % index for word lengths
19  for tPerWord=tRange
20
21      for rep = 1:nReps
22          actVals = ones(1, listLength) * initAct;

41          pCor(i) = pCor(i) + (sum(actVals>minAct)./listLength);
42
43      end
44      i = i + 1;
45  end
46
47  scatter(rRange, pCor./nReps, 's', 'filled', ...
48      'MarkerFaceColor', 'k')
49  xlim([0 4.5])
50  ylim([0 1])
51  xlabel('Speech Rate')
52  ylabel('Proportion Correct')
Note how Listing 2.2 is broken into two panels; this is because there is a
lot of code in between the two panels that contains the core components of the
simulations, and that is omitted here (this also explains the discontiguous line
numbers in the two panels). Because those core components only make sense if
one understands the frame into which they are embedded, we will consider lines
14 to 22 and 41 to 52 on their own before looking at the important bits in between.
First, lines 14 to 18 define the principal “experimental manipulation” that we
are interested in—namely, the effects of word length or speech rate. Accord-
ingly, to create different speech rates, we first define the variable rRange by
calling the function linspace(1.5, 4, 15). This function call provides us with 15
equally spaced values between 1.5 and 4 that represent the desired speech rates.
(If this seems mysterious to you, check the MATLAB help pages for linspace.
Things will become instantly obvious.) A brief glance at the abscissa in Figure 2.1
confirms that these values span the experimentally obtained range. You will also
note that we cover the range at a finer level of grain than is possible in a behavioral
experiment; this is one of the advantages of a simulation.
The speech rates are then converted into the time that it takes to pronounce
each word by simple inversion, and the results are stored in the variable tRange.
Note how we exploit the ability of MATLAB to operate on entire vectors of num-
bers all at the same time; we just had to use the “./” notation (instead of “/”) to
ensure that the operation was done on each element in the vector.
Finally, we set aside a vector for the results (pCor), which has a subscript i
to record data for each speech rate.
The following lines contain two nested loops that govern the simulation: The
first one, for tPerWord=tRange, goes through all the speech rates, one at a time,
and the second one, for rep=1:nReps, goes through the replications for each level
of speech rate. Each of those replications can be understood to represent a differ-
ent trial (i.e., a different set of words) or a different participant or indeed both. We
should briefly explain why we used 1,000 replications here, rather than a more
modest number that is comparable to the number of subjects in the experiment.
We did this because to compare the simulation results to the data, we want to
maximize the reliability of the former. Ideally, we want the simulation to yield
predictions with negligible or no error (which can be achieved by cranking the
number of replications up to 1,000,000 or more) so we can see if they fall within
the standard error of the data. In essence, we can consider the simulation to rep-
resent an (error-free) population prediction against which we compare the data.
(Exceptions arise when we also seek to capture individual variation in our results;
we revisit those exceptions in the next few chapters.)
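As a rough illustration of why replications buy reliability, the Monte Carlo standard error of a simulated proportion shrinks with the square root of the number of replications; the value of .6 below is an arbitrary illustrative proportion, not a quantity taken from our simulation.

% Standard error of a simulated proportion for a hypothetical true value of .6
p = .6;
se1000    = sqrt(p*(1-p)/1000)      % roughly .015 with 1,000 replications
se1000000 = sqrt(p*(1-p)/1000000)   % roughly .0005 with 1,000,000 replications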
We next consider what happens within each of the 1,000 replications.
The single statement in line 22 encodes all items by setting them equal to the
value of initAct, which was defined at the outset (Listing 2.1). Having encoded
the list, you may wonder how it can be retrieved from memory. The answer is
given in line 41. This single line of code is more sophisticated than first meets the
eye. The statement determines which items have an activation value that exceeds
minAct, which represents a threshold activation that was defined at the outset
(Listing 2.1). This comparison is achieved by making use of “positional mark-
ers,” because each item in memory is addressed by its position, and its activation
is interrogated to see if it exceeds the threshold; again, we exploit MATLAB’s
ability to operate on entire vectors by reducing this sequential comparison to the
simple expression actVals>minAct. By summing the comparison, we count the
number of items correctly recalled—remember that we do not model transposi-
tions or intrusions; hence, any item whose activation is above threshold is consid-
ered to be retrieved—and we then add that sum (converted to a proportion correct
by dividing by listLength) to the data for that speech rate.
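To see this vectorized counting in isolation, here is a tiny example with made-up activation values (not values produced by the simulation):

% A logical comparison yields a vector of 0s and 1s; summing it counts the
% items whose activation still exceeds the recall threshold.
actVals = [1.2 0.4 0.9 0.1 0.7];   % made-up activations for a five-item list
minAct  = 0.5;                      % illustrative threshold
nRecalled   = sum(actVals > minAct)        % = 3
propCorrect = nRecalled / numel(actVals)   % = 0.6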
Note how the index i is set to 1 initially and is then incremented by 1 at every
end of the speech rate loop; hence, i points to the current value of speech rate in
the vector pCor, which keeps track of the proportion correct.
The final set of lines in the bottom panel of Listing 2.2 simply plots the predic-
tions (the contents of pCor averaged across replications) against the set of speech
rates. Below, we will show some of the results from this simulation.
Before we can report any simulation results, we need to consider the missing
bit of code that was excised from the preceding listing. This code is shown in
Listing 2.3 below. Remember that this code segment is immediately preceded
by encoding (line 22 above) and immediately followed by retrieval (41 above);
hence, the code below instantiates forgetting and rehearsal in our model.
The core of Listing 2.3 is the while cT < delay loop, which is executed as
many times as steps of duration tPerWord fit within delay—in other words, this
loop rehearses as many list items as can be articulated within the total retention
interval. This is the one place where our simulation deviates from the method
of Hulme and Tordoff (1989) that gave rise to the data in Figure 2.1: Whereas
recall in the experiment was immediate (hence the retention interval was zero, or
close to it), we introduced a delay of 5 seconds (see Listing 2.1). The reason for
this deviation is that given our decision not to commence decay until after list
presentation (Decision 1 in Table 2.1), we require some nonzero interval during
which decay and compensatory rehearsal can exert their opposing effects. This is
perfectly fine for present purposes, but we would not want to publish the simula-
tion results with this deviation between actual and simulated methodologies.
Let us resume our discussion of rehearsal in the simulation. The duration of
each articulation is given by tPerWord, which changes with speech rate as shown
earlier in Listing 2.2. The number of rehearsals thus depends on word length,
exactly as it should. Within the while loop, we first locate the next item that is still
accessible (because its activation exceeds minAct) and then restore its activation
to its initial value in line 35. The order in which items are rehearsed is from the
beginning of the list to the end, skipping over items that have been completely
forgotten and wrapping around at the end, for a renewed cycle of rehearsal, if time
permits. This mechanism conforms to our decision earlier that rehearsal should
be ordered (Decision 6, Table 2.1). (If you find it difficult to figure out how this
mechanism operates, we suggest that you study the MATLAB help pages for find
and isempty; those are the two functions used to select items for rehearsal.)
At each rehearsal step, the contents of memory decay in the linear and con-
stant manner decided upon earlier (Decisions 2 and 3 in Table 2.1), using the
statement in line 39. The decay process deserves several comments. First, it does
not occur until after encoding is complete (Decision 1). Second, the extent of
decay varies with word length, with longer words providing more opportunity for
decay than shorter words. Third, decay affects all items, including the one just
rehearsed. That is, the boost in activation resulting from rehearsal is immediately
counteracted by decay, reflecting the fact that decay is a continuous and uninter-
ruptable process. This is another arbitrary decision that was required to instantiate
the model. If you prefer, the just-rehearsed item can be exempted from decay
by changing line 39 to read actVals((1:listLength)~=itemReh) = actVals((1:listLength)~=itemReh) - (dRate.*tPerWord);.
The relationship between rehearsal and decay deserves to be examined more
closely because it reveals an extremely important property of simulations. You
may have noted that the statement that instantiates decay (line 39) follows the
line that instantiates rehearsal (line 35). Does this mean that rehearsal precedes
decay? No; the two processes actually occur at the same (simulated) time. In a
simulation, time always needs to be advanced explicitly, which in our case hap-
pens in line 40. It follows that everything in between increments of the variable
cT takes place at the same (simulated) time; hence, rehearsal and decay occur
simultaneously, notwithstanding the fact that one statement follows the other in
the program. This property of simulated time is true for any simulation involving
a temporal component.
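To summarize the structure just described, the following is a sketch of such a loop. It is our own illustration rather than a reproduction of Listing 2.3, and the parameter values are placeholders; only dRate = .8 and the 5-second delay follow Listing 2.1.

% Sketch of a rehearsal-and-decay loop of the kind described above.
% NOT the book's Listing 2.3; bookkeeping details and values are illustrative.
listLength = 5;  initAct = 1;  minAct = 0.2;  dRate = 0.8;  % placeholder values
delay = 5;  tPerWord = 0.4;                                 % retention interval, articulation time
actVals = ones(1, listLength) * initAct;                    % encoded list

cT = 0;                                     % current (simulated) time
itemReh = 0;                                % most recently rehearsed position
while cT < delay
    intact = find(actVals > minAct);        % items not yet forgotten
    if isempty(intact)
        break                               % nothing left to rehearse
    end
    nextUp = intact(find(intact > itemReh, 1));
    if isempty(nextUp)
        nextUp = intact(1);                 % wrap around to the start of the list
    end
    itemReh = nextUp;
    actVals(itemReh) = initAct;             % rehearsal restores the item ...
    actVals = actVals - dRate.*tPerWord;    % ... while decay acts on all items
    cT = cT + tPerWord;                     % both at the same simulated time
end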
Our journey is almost complete: We have taken a slow but informative
trip through the processes involved in translating a verbal model into a computer
simulation. We have presented and discussed the simulation program, which pro-
vides one of a multitude of possible instantiations of the phonological loop model.
Figure 2.3 Results of a MATLAB simulation of the phonological loop. See text and List-
ings 2.1 through 2.3 for details.
Let us now see how well this model works. Can we reproduce the speech rate
function in Figure 2.1?
Figure 2.4 The implications of the decay function assumed in our simulation. The thick
solid line is the decay function relating activation to time. The broken horizontal line indi-
cates the minimum activation value; any item whose activation falls below that line is
considered to have been forgotten. The horizontal arrows represent hypothetical rehearsal
durations. See text for details.
also at odds with what we might expect from the verbal description of the phono-
logical loop; there is really nothing in the verbal model that would lead one to
believe that the speech rate function should be discontinuous. What went wrong?
In a sense, nothing “went wrong.” Our particular instantiation of the phono-
logical loop is as legitimate—that is, it is as compatible with the verbal theory—as
any of the other 143 we could have created. Hence, rather than asking what went
wrong, it is more appropriate to ask what the reasons are for the discontinuity
in the predicted speech rate function. The answer to this question is provided in
Figure 2.4.
The figure shows the relationship between an item’s level of activation and
time in our simulation by the thick decreasing function. As an exercise, refer
to Listing 2.1 to determine the slope of this function. If you cannot figure it
out, read the footnote.5 The figure also shows the minimum activation level (the
broken horizontal line at minAct) below which an item is considered to have
been forgotten. Once an item is below that line, it can be neither recalled nor
rehearsed.
The joint implications of this threshold and the decay function are illustrated
by the horizontal arrows, which represent several hypothetical rehearsal durations.
As we noted earlier, decay continues throughout rehearsal, so although rehearsal
restores an item’s activation, all items that are not currently being rehearsed con-
tinue to decay along the function shown in the figure.
The unequal lengths of the arrows represent different speech rates; remember
that shorter words can be rehearsed more quickly than longer words. Now, bear-
ing in mind that any above-threshold activation translates into a correct recall (or
successful rehearsal), note how the length of the arrows does not affect recall for
any item whose activation remains above threshold throughout. That is, up to a
point, it does not matter how long it takes to rehearse a given list item because
the next one will nonetheless remain above threshold and will thus be avail-
able for rehearsal (or recall). This explains the plateaus that are obtained with a
range of speech rates. It is only when the additional rehearsal time permits all (or
some) of the remaining items to decay below threshold that an item is irrevocably
lost—and it is this loss of an item that explains the steps in between plateaus in
Figure 2.3.
The preceding analysis is informative in two ways: First, it reveals that our
simulation can be readily understood and that its predictions are not mysterious.
Second, it points to a way in which the simulation can be revised so that its pre-
dictions are more in line with the data.
be expected to “wash out” the discontinuities in the speech rate function across
replications.
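One simple way to introduce that variability is to draw a fresh decay rate for every replication; the sketch below illustrates the idea, with names decRate and decSD following note 6, although the value of decSD here is purely illustrative and the actual change is embodied in Listing 2.4.

% Draw a trial-specific decay rate at the top of each replication loop.
decRate = 0.8;                       % mean decay rate (as in Listing 2.1)
decSD   = 0.1;                       % illustrative standard deviation
dRate   = decRate + randn * decSD;   % this replication's decay rate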
Figure 2.5 confirms that this is exactly what happens. The figure contains the
results of a run of the second version of our simulation shown in Listing 2.4. It is
clear that these results are much more in accord with the data shown at the outset
Figure 2.5 Results of a MATLAB simulation of the phonological loop with variable decay
rates. See text and Listing 2.4 for details.
(Figure 2.1). The results also conform far better to one’s intuitions about what the
phonological loop model ought to predict.
Our technical journey is now complete. We commenced our journey by con-
sidering the empirical speech rate function; we then derived a simulation from a
verbally stated theory. The simulation implemented a process model of the phono-
logical loop and applied it to a speech rate experiment, and finally, we obtained
predictions from a simulation that closely mirrored the original data. We now
analyze the conceptual lessons that can be drawn from our exercise.
Let us explore what options would now lie ahead if our simulations had been
“serious” rather than a pedagogical exercise. What might we do next to explore
our phonological loop instantiation? One obvious avenue for exploration is grad-
ually to relax the design decisions (see Table 2.1) that were made primarily for
simplicity’s sake; for example, the linear decay assumption might be replaced by
a more realistic exponential function.7
A further avenue for exploration is to introduce a mechanism that can give rise
to transpositions and intrusions. This is a matter of central importance; a serious
model of short-term memory must explain why people so frequently transpose
the order of list items. If you are curious about how such a mechanism might be
added to the present model, Neath (2000) presents a relevant precedent involving
the “feature model,” which broadly resembles our simulations.
Two other constraints of the present simulation that ought to be explored relate
to the ordering of items during rehearsal and the encoding strengths. Concerning
the former, it would be worthwhile to examine what happens if list items are
rehearsed in a random order rather than in strict forward sequence. Concerning the
latter, many models assume that early list items are encoded more strongly than
later items (e.g., G. D. A. Brown, Preece, & Hulme, 2000; Page & Norris, 1998b),
and the effects of this so-called primacy gradient should also be examined in
our simulation. Our supporting website, http://www.cogsciwa.com, also contains
those additional explorations.
but they are too involved for the present introductory chapter. Instead, we merely
note that our explorations (not reported in detail here) reveal the second simula-
tion’s predictions to be quite robust; that is, several other features of the model
can be changed without affecting the shape of the predicted speech rate function.
We therefore at least cautiously conclude that the support for our second model
is gratifyingly strong because the model’s predictions do not represent just one
of many possible predicted outcomes; given the known stability of the speech
rate data, our second simulation therefore likely instantiates the top-left panel in
Figure 1.10.
What are the implications of our simulations for the verbal model of the
phonological loop as initially formulated by Baddeley (e.g., 1986)? There are
two reasons to suggest that support for the verbal model is not terribly strong.
First, we established that the verbal model can be instantiated in more than a hun-
dred possible ways. Second, we explored two of those possible instantiations and
found the results to be relatively divergent. Although we cannot anticipate exactly
how many different results can be predicted by the remaining 140 or so possible
instantiations, the fact that a subset of two can create divergent predictions sug-
gests that the overall ensemble will likely generate even greater heterogeneity. It
follows that the state of the verbal model is more likely to be characterized by the
bottom-left panel in Figure 1.10: The data are precise, but the theory would have
been compatible with outcomes other than those that were observed, and hence
support for it is weak.
Is this to be taken as a criticism of verbal models in general or the working
memory model in particular? No, not at all. Verbal models can have enormous
impact and can stimulate much valuable and informative research—indeed, the
working memory model rightfully ranks among the leading theories of memory.
However, to achieve a detailed understanding of the underlying psychological
processes, a verbal model must ultimately be implemented in a computational
model—as indeed has been the case for the working memory model (e.g., Burgess
& Hitch, 1999; Page & Norris, 1998b).
basic architecture must precede any modeling in cognition, irrespective of the par-
ticular model or domain in question. Table 2.2 lists some of these issues together
with pointers to the literature or other places within this book that contain further
discussion.
For now, the important thing to bear in mind is that the technical issues involved
in model building are always embedded in a broader theoretical context. The
links between the broader theoretical context and your decisions about a model
can often be tacit; for example, it may not occur to you that your model of
short-term memory could rely on anything but a token representation. Alterna-
tively, you may have subscribed to a particular theoretical outlook—for example,
connectionism—some time ago, and hence you will naturally approach any fur-
ther modeling from that perspective and may not consider symbolic alternatives.
Because those foundational decisions are often subjective, this initial phase of
model development has been referred to as an “art” rather than science (Shiffrin
& Nobel, 1997). Shiffrin and Nobel (1997) provide an extended discussion of this
foundational phase of model development, and we refer the interested reader to
their paper.
2.5.1 Parameters
Let us briefly reconsider Listing 2.1, which set up some of the variables that
governed the behavior of our simulation (e.g., initAct, dRate, minAct). Those
variables are called parameters. What is the role of parameters? Parameters can
be understood as “tuning knobs” that fine-tune the behavior of a model once its
architecture (i.e., basic principles) has been specified. A good analogy is your car
radio, which has “knobs” (or their high-tech digital equivalent) that determine the
volume and the station; those knobs determine the behavior of your radio without
changing its architecture.
In our model of the phonological loop, dRate was one important parameter: Varying its value affects the overall performance of the model. As dRate decreases, performance increases (because there is less forgetting), and the effect
of speech rate is diminished (because the effects of word length depend on the
existence of decay). In the extreme case, when dRate=0, performance will be
perfect irrespective of speech rate. Changing the decay rate does not change the
architecture of our simulation, but it certainly changes its behavior. Usually, our
models involve more than a single parameter. It is helpful to introduce some nota-
tion to refer to the parameters of a model: For the remainder of the book, we will
use θ (“theta”) to denote the vector of all parameter values of a model.
wouldn’t this be like getting a plane to fly without wings and a fuselage? No, it is
more like listening to a radio that is permanently tuned to one station: Its architec-
ture is still intact, and hence it is able to receive the signal and amplify it and so
on. What is lost is the ability to tune into different stations. But is it a “loss”? No,
if you like listening to that station (and that station only), then nothing has been
lost—and by exact analogy, nothing has been lost if a model has no parameters
but it nonetheless predicts the observed data. Quite to the contrary, if a model han-
dles the data notwithstanding the absence of parameters, then we have amassed
the strongest possible support for a model; by definition, the model’s predictions
cannot be altered (because there are no parameters), and we are therefore firmly
anchored in the top row of panels in Figure 1.10. (Or indeed in a hypothetical row
of panels floating above the figure in which there is a single point corresponding
to both data and predictions.)
Can this Nirvana be attained in practice? Yes, but not very often. One exam-
ple is shown in Figure 2.6, which shows a subset of the results of an experiment
by Lewandowsky, Griffiths, and Kalish (2009). The details of the model and the
study can be largely glossed over; suffice it to say that we asked people to predict
the total duration of the reign of an Egyptian pharaoh in response to a (quasi-
random) cue. For example, given that a pharaoh has been ruling for 5 years, how
long will his total reign be? People responded to multiple such cues, and their dis-
tribution of responses was compared to the distribution predicted by a Bayesian
model without any parameters. This comparison is shown in the figure: the quantiles of the two distributions are very similar (perfect agreement is represented by the diagonal), suggesting that the model captured
people’s performance. So yes, there are occasions in which support for a model
can be extremely strong because its predictions involve no free parameters.
To clarify what a parameter-free model looks like, we reproduce Lewandowsky
et al.’s (2009) Bayesian model in the following equation:
$$p(t_{\text{total}} \mid t) \propto p(t \mid t_{\text{total}})\, p(t_{\text{total}}), \tag{2.1}$$
where $p(t_{\text{total}})$ is the prior distribution of quantities (in this instance, the distribution of the reigns of Egyptian pharaohs, gleaned from a historical database), $p(t \mid t_{\text{total}})$ is the likelihood of encountering any particular probe value $t$ (assumed to be uniformly distributed), and $p(t_{\text{total}} \mid t)$ is the predicted posterior distribution.
The quantiles of that posterior distribution are compared to the data in Figure 2.6.
If this terse description of the model leaves you baffled and mystified, we can
comfort you by noting that what is important here are not the details of the model
but the fact that there are no parameters in Equation 2.1: The predictions result
from the prior distribution, which is independently known, and the assumption
that any value of t from that prior distribution is equally likely to be encoun-
tered (as indeed was the case in the experiment). So, the predictions of the model
[Figure 2.6 image, panel titled "Pharaohs": Bayesian model quantiles (ordinate) plotted against the obtained quantiles (abscissa, labeled "Stationary").]
Figure 2.6 Snapshot of results from an experiment that tested the predictions of a Bayesian
model with no parameters. The ordinate plots quantiles of the predicted distribution, and
the abscissa plots the obtained quantiles. Figure from Lewandowsky, S., Griffiths, T. L.,
& Kalish, M. L. (2009). The wisdom of individuals: Exploring people’s knowledge about
everyday events using iterated learning. Cognitive Science, 33, 969–998. Copyright by the
Cognitive Science Society; reprinted with permission.
were inevitable, arising from its architecture in Equation 2.1 alone. It follows
that the model would have been challenged had the data come out even slightly
differently.8
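To make the mechanics of Equation 2.1 concrete, the following toy computation applies it to a made-up prior; the prior below is not the historical pharaoh database used by Lewandowsky et al. (2009), so the resulting numbers carry no substantive meaning.

% Toy illustration of Equation 2.1 with an invented prior over total durations.
tTotal = 1:100;                          % candidate total durations (years)
prior  = exp(-tTotal/20);                % hypothetical prior shape
prior  = prior / sum(prior);

t    = 5;                                % observed duration so far (the probe)
like = (tTotal >= t) ./ tTotal;          % uniform sampling: p(t|t_total) = 1/t_total
post = like .* prior;
post = post / sum(post);                 % posterior p(t_total|t), Equation 2.1

predMedian = tTotal(find(cumsum(post) >= .5, 1))   % predicted median duration

Because nothing in this computation is adjusted to the data, any predicted quantile follows directly from the prior and the sampling assumption.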
Returning from Nirvana to more common pastures, what about the present
simulations, where altering the value of the decay parameter noticeably affected
the model’s predictions? Does this mean the support for the model is weak? Well,
it is weaker than the support for the Bayesian model just discussed, but it need
not be weak. For example, although the reduction in decay rate reduces the slope
of the predicted speech rate function, it will never reverse it. (We are disallow-
ing the possibility of “antidecay,” or the increase in memory strength with time,
by reducing the decay rate below zero.) In consequence, our simulation model
remains falsifiable because it would be incompatible with a reversed word length
effect. (Lest you think that a reversed word length effect is absurd, this finding is
not altogether unknown; see Lewandowsky & Oberauer, 2008, for a review and
analysis.)
You may have already guessed, then, that the issue of testability and flexibil-
ity is tied to the presence of parameters—and in particular to their effects on a
model’s prediction—but that it is considerably more complex than a simple pos-
tulate such as “parameters prevent testability.” There is an old saying, attributed to
John von Neumann (Dyson, 2004, p. 297), that “with four parameters I can fit an
elephant, and with five I can make him wiggle his trunk.” In actual fact, the truth
could not be more different: Wei (1975) showed that it takes 30 (!) parameters
to fit an elephant. We explore the notion of testability and how it relates to the
number of parameters in detail in Chapter 6.
A more relevant question, then, is how many free parameters it takes to char-
acterize a verbal model. Could we escape the problem of model flexibility by
returning to a verbal model? No, because as we have just seen, a verbal model
leaves open many important issues, and indeed the decision points in Table 2.1
are best considered free parameters.
If parameters are tuning knobs, how are their values set? How do you tune your
car radio to a new station? Exactly: you adjust the frequency knob until the hiss
has been replaced by Moz9 or Mozart. Likewise, there is a class of parameters
in cognitive models that are adjusted until the predictions are in line with the
data to the extent possible. Those parameters are known as free parameters. In our
preceding simulations, the decay rate and its variability were free parameters. The
process by which the parameters are adjusted is known as parameter estimation or,
sometimes, model fitting. The resulting estimates are known as the “best-fitting”
parameter values.
Free parameters are usually estimated from the data that the model seeks to
explain. In Chapter 1, we proposed that the salary of Australian members of par-
liament can be summarized by a single parameter—namely, the mean. We esti-
mated that parameter by simply computing the average of the data. Things are
a little more complicated if we fit a regression model to some bivariate data, in
which case we estimate two parameters—slope and intercept. And things get more
complicated still for psychological process models—sufficiently complicated, in
fact, for us to devote the next two chapters to this issue.
Because the predictions of the model depend on its specific parameter values,
a fair assessment of the model’s adequacy requires that we give it the “best shot”
to account for the data. For that reason, we estimate the free parameters from the
data by finding those values that maximally align the model’s predictions with
the data. Those parameter estimates often (though not necessarily) vary between
different data sets to which the model is applied.
Generally, as we have seen in the previous section, modelers seek to limit
the number of free parameters because the larger their number, the greater the
model’s flexibility—and as we discussed in some detail in Chapter 1, we want to
place bounds on that flexibility. That said, we also want our models to be powerful
and to accommodate many different data sets: It follows that we must satisfy a
delicate trade-off between flexibility and testability in which free parameters play
a crucial role. This trade-off is examined in detail in Chapter 5.
There is another class of parameters, known as fixed, that are not estimated
from the data and hence are invariant across data sets. In our simulations, the
initial encoding strengths (variable initAct) were a fixed parameter. The role of
fixed parameters is primarily to “get the model off the ground” by providing some
meaningful values for its components where necessary. In the radio analogy, the
wattage of the speaker and its resistance are fixed parameters: Both can in prin-
ciple be changed, but equally, keeping them constant does not prevent your radio
from receiving a variety of stations at a volume of your choice. Although parsi-
mony dictates that models should have few fixed parameters, modelers are less
concerned about their number than they are about minimizing the number of free
parameters.
One common measure of the discrepancy between data and predictions is the root mean squared deviation (RMSD),
$$\mathrm{RMSD} = \sqrt{\frac{\sum_{j=1}^{J} (d_j - p_j)^2}{J}}, \tag{2.2}$$
Figure 2.7 Speech rate data (large gray plotting symbols), taken from Figure 2.1, and
simulation predictions (small black squares), produced by the simulation in Listing 2.4 for
the exact speech rates observed in the data.
where J is the number of data points over which the sum is taken, and d and
p represent data and predictions, respectively. For Figure 2.7, the RMSD turns
out to be .082. In other words, the simulation predictions differ from the data by
8 percentage points on average. Note that the “data points” that contributed to
the RMSD were the means in the figure, rather than the underlying individual
observations. This is frequently the case when we fit group averages rather than
individual subjects (which is why we used J in the denominator rather than N ,
which is the notation of choice to refer to the number of observations).10
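Expressed in MATLAB, Equation 2.2 reduces to a single line; the vectors below are placeholder values rather than the actual means and predictions behind Figure 2.7.

% RMSD between data (d) and model predictions (p); illustrative values only.
d = [0.55 0.62 0.70 0.78];     % hypothetical observed means
p = [0.50 0.60 0.72 0.80];     % hypothetical model predictions
rmsd = sqrt(sum((d - p).^2) / numel(d))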
If the data are discrete—for example, when the number of responses is con-
stant, but each response can fall into one of several different categories (e.g.,
whether an item is recalled in its correct position or 1, 2, . . . , positions away)—
then a $\chi^2$ or $G^2$ discrepancy measure is more appropriate (e.g., Lamberts, 2005). The $\chi^2$ is defined as
$$\chi^2 = \sum_{j=1}^{J} \frac{(O_j - N \cdot p_j)^2}{N \cdot p_j}, \tag{2.3}$$
where $J$ refers to the number of response categories, $N$ refers to the total number of observed responses, and $O_j$ refers to the number of observed responses within each category $j$. Note that the sum of all $O_j$s is $N$, and note that the model predictions, $p_j$, are assumed to be probabilities rather than counts, as one would commonly expect from a model (hence the need to multiply each $p_j$ with $N$, in order to place the observed and expected values on an equivalent scale).
The corresponding $G^2$ statistic is given by
$$G^2 = 2 \sum_{j=1}^{J} O_j \log\{O_j/(N \cdot p_j)\}. \tag{2.4}$$
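In MATLAB, both discrepancy measures take only a line each; the observed counts and predicted probabilities below are invented for illustration.

% Chi-square (Equation 2.3) and G-square (Equation 2.4) for categorical data.
O = [30 12 5 3];                 % hypothetical observed counts per category
p = [0.55 0.25 0.12 0.08];       % model-predicted probabilities (sum to 1)
N = sum(O);                      % total number of responses
chiSq = sum((O - N.*p).^2 ./ (N.*p))
gSq   = 2 * sum(O .* log(O ./ (N.*p)))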
[Figure 2.8 image: a diagram with nodes labeled Experimental method, People, Data, Parameters (fixed and free), Model, and Predictions.]
Figure 2.8 The basic idea: We seek to connect model predictions to the data from our
experiment(s). This process involves the observables in the gray area at the bottom of the
figure. The area at the top shows the origin of data and predictions, as well as the auxiliary
role of the model parameters.
to explain those data. Does it follow that the model is also necessary—that is, that
the model provides the sole unique explanation for the data? No, not at all, in the
same way that the fact that you flew from Lagos to Tripoli on your last African
trip does not rule out that you could have taken an overland caravan. This rather
painful—but frequently forgotten—fact deserves to be fleshed out.
It follows that a successful fit of your model to the available data represents,
alas, fairly weak evidence in its favor: Although a successful fit shows that your
model is a possible explanation for the data, it does not identify your model as the
only possible explanation. This is an in-principle problem that has nothing to do
with the quality of the data and the model: The case was stated strongly and elo-
quently by J. R. Anderson (1990), who concluded, “It is just not possible to use
behavioral data to develop a theory of the implementation level in the concrete
and specific terms to which we have aspired” (p. 24). By implication, irrespective
of how good and how large our behavioral database is, Anderson suggested that
there would always be multiple different possible models of the internal processes
that produce those data (models of internal processes are at the “implementation
level,” and this most closely corresponds to our view of “process explanation”
as developed in Section 1.4.4). We agree with Anderson, and we agree that for
any successful model that handles the data, there exist an unknown and unknow-
able number of equally capable alternative models—thus, our seemingly trivial
Figure 2.8 is actually quite complex because the single “Model” node on the right
is hiding an infinity of equally powerful (but unknown) alternatives. It follows
that the data never necessarily imply or identify one and only one model.
What are we to do in light of this indeterminacy, which is often referred to
as the “identifiability problem”? We begin by noting that J. R. Anderson (1990)
proposed two solutions to the identifiability problem. The first one abandoned the
idea of process modeling altogether and replaced it by a “rational analysis” of
behavior that sought to identify the linkages between the demands of the environ-
ment and human adaptation to those demands (e.g., J. R. Anderson & Schooler,
1991). A defining feature of rational analysis is that it explicitly eschews the mod-
eling of cognitive processes and remains at a level that we would consider to be
descriptive (see Sections 1.4.2 and 1.4.3). Anderson’s second solution invoked
the constraints that could be provided by physiological data, which were said to
permit a “one-to-one tracing of the implementation level” (J. R. Anderson, 1990,
p. 25). Recently, J. R. Anderson (2007) has argued that this additional constraint—
in the form of brain imaging data—has now been achieved or is at least near,
thus putting a solution to the identification problem tantalizingly within reach (a
view that resonates with at least some philosophers of science; Bechtel, 2008).
We revisit the role of neuroscientific data in Chapter 8, where we consider some
neurally inspired models in greater detail.
bank” (Meehl, 1990, p. 115). How does one get money in the bank? By “pre-
dicting facts that, absent the theory, would be antecedently improbable” (Meehl,
1990, p. 115). Thus, the more a model has succeeded in making counterintuitive
predictions, the greater its verisimilitude, and hence the more entitled we are to
continue using it even though we know it to be (literally) false.
have nothing to do with their objective properties. In an article that was evoca-
tively titled “Explanation as Orgasm,” Gopnik (1998) highlighted the distinctive
phenomenology (i.e., subjective feeling) associated with explanations; specifi-
cally, she proposed that the gratifying sense that accompanies the discovery of
an explanation (the “aha,” p. 108) may be evolution’s mechanism to ensure the
impetus for continued search and discovery—in the same way that orgasms may
deliver the necessary impetus for reproduction. Although this “cognitive emo-
tion” may deliver benefits to the species as a whole, by ensuring continued explo-
ration of the environment, it does not ensure that people—including people who
are scientists—will necessarily accept the best available explanation. Thus, Trout
(2007) identifies several cognitive factors, such as hindsight bias and overconfi-
dence, that can lead to a false or exaggerated sense of intellectual satisfaction (the
earth actually did not move) when a scientist selects an explanation. Similarly,
people generally have been found to prefer simple explanations to a greater extent
than warranted by the data (Lombrozo, 2007), and they tend to cling to seduc-
tive adaptationist explanations (e.g., that animals have large eyes because they are
better for seeing in the dark; Lombrozo, 2005). Hintzman (1991) even suggested
that people will accept mere acronyms as an explanation for something, even if
the acronym implies that the phenomenon is unexplained (e.g., UFO).
Are there any safeguards against these psychological risks that are associ-
ated with the pursuit of scientific explanations? Yes, and the best safeguard is to
seek explanations that are embodied within the type of models discussed in this
book. This case was argued very eloquently by Hintzman (1991), who listed 10
attributes of human reasoning that are likely contributors to errors in scientific
reasoning and showed how those potential weaknesses can be counteracted by
the use of quantitative models. Quantitative models are therefore preferable to
verbal models not only for the reasons discussed throughout the first two chapters
but also because they provide a “cognitive prosthesis” for our own human insuf-
ficiencies during theory construction itself. Farrell and Lewandowsky (in press)
provide further analysis of how models can serve as a prosthesis to aid in scientific
reasoning.
Notes
1. Readers familiar with the literature on the word length effect may object that only
certain select stimuli give rise to the effect, whereas the majority of words do not (Bireta, Neath, & Surprenant, 2006; Neath, Bireta, & Surprenant, 2003). We agree entirely
(Lewandowsky & Oberauer, 2008). However, our discussion here is limited to the syllabic
word length effect using pure lists (i.e., all words on a given list are either all short or all
long). Under those circumstances, the word length effect is robust and replicable (Bireta
et al., 2006).
2. These numbers are conservative and represent the minimum number of choices avail-
able; for example, there are an infinite number of possible decay functions, and we assume
here that only three are worthy of serious consideration (power, exponential, and linear).
3. Page and Norris (1998b) postulate that decay occurs during list presentation, but their
assumption that all items have been rehearsed (and their activation has thus been restored to
their encoded values) at the end of list presentation is formally identical to the assumption
that decay does not commence until after the list has been presented.
4. This is an over-simplification because ordered rehearsal breaks down at some point.
Nonetheless, for present purposes we retain this simple assumption.
5. The slope is (minus) the decay rate, which is defined to be dRate = .8;. Do not use
the figure to infer the slope because the axes are not labeled—hence, the 45° angle of the
line is entirely arbitrary.
6. The common intercept is the variable initAct, and the slopes have mean (minus)
decRate and standard deviation decSD.
7. In fact, we have explored exponential decay, and it makes no difference to the results
shown in Figures 2.3 and 2.5. Without variability in decay rate, the speech rate function is
discontinuous.
8. Returning to our earlier analogy, the parameter-free radio can also be horrifically
unsatisfying if you consider its instantiation in contemporary elevators.
9. Morrissey, erstwhile member of the band The Smiths.
10. Because the RMSD computes a continuous deviation between predictions and data,
it assumes that the data are measured at least on an interval scale. Use of nominal measures
(e.g., a Likert-type rating scale) is inappropriate because the meaning of a given deviation
varies across the scale (Schunn & Wallach, 2005).
11. Meehl (1990) employed this analogy.
12. Formal analysis of verisimilitude, in particular a principled comparison between the
values of rival theories, has proven to be surprisingly difficult (see, e.g., Gerla, 2007).
3
Basic Parameter
Estimation Techniques
There is little doubt that even before you started reading this book, you had
already fit many models to data. No one who has completed an introductory statis-
tics course can escape learning about linear regression. It turns out that every time
you computed a regression line, you were actually fitting a model—namely, the
regression line with its two parameters, slope and intercept—to the data.
In matrix notation, the regression model can be written as
$$y = Xb + e, \tag{3.1}$$
where the two elements of the vector $b$ are the parameters ($b_0$ and $b_1$) whose
values we wish to obtain and where X is a two-column matrix. The first col-
umn consists of 1s (to represent the constant intercept for all observations) and
the second column of the observed values of our independent variable. You may
remember from your statistics background that the parameters can be computed
by rearranging Equation 3.1:
$$b = (X^{T} X)^{-1} X^{T} y, \tag{3.2}$$
where the superscripts T and −1 refer to the matrix transpose and matrix inverse
operators, respectively.
In MATLAB, Equation 3.2 can be trivially implemented by the single state-
ment b = x\y, where y is a vector of y values, and x is the two-column matrix just
discussed. The “\” operator is shorthand for “matrix left division” and implements
the operations required for a linear regression.
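For instance, assuming x and y are column vectors (filled here with arbitrary synthetic values), the fit could be obtained as follows:

% Least-squares regression via matrix left division; data are illustrative.
x = randn(20,1);                 % predictor values
y = 0.8*x + 0.3*randn(20,1);     % criterion values with some noise
bigX = [ones(size(x)) x];        % design matrix: intercept column plus x
b = bigX \ y                     % b(1) = intercept, b(2) = slope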
Given the availability of this simple solution in MATLAB, why do we devote
an entire chapter to the process of fitting a model to data? The answer is that
unlike linear regression, the parameters for most psychological models cannot be
computed directly, by a single statement or a single equation such as Equation 3.2,
because their complexity prevents a direct algebraic solution. Instead, parameters
must be estimated iteratively. Several parameter estimation techniques exist, and
although they differ in important ways, they also share many features in common.
Let’s begin by establishing those before we turn to the more technical details.
[Figure 3.1 image: a three-dimensional surface with RMSD on the vertical axis and intercept and slope on the two horizontal axes.]
Figure 3.1 An “error surface” for a linear regression model given by y = X b + e. The
discrepancy between data and predictions is shown on the vertical axis (using the root mean
squared deviation [RMSD] as a discrepancy function) as a function of the two parameters
(slope, $b_1$, and intercept, $b_0$). The underlying data consist of observations sampled from
two normal distributions (one for x and one for y) with means 0 and standard deviations
1 and correlation ρ = .8. The contours projected onto the two-dimensional basis space
identify the minimum of the error surface at $b_1 = .74$ and $b_0 = -.11$. See text for details.
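A brute-force way to produce a surface of this kind, though not the code that generated the published figure, is to evaluate the RMSD on a grid of candidate parameter values; the data-generation step below is only a rough stand-in for the sampling scheme described in the caption.

% Sketch: compute an RMSD surface over a grid of intercepts and slopes.
n = 20; rho = .8;
x = randn(n,1);
y = rho*x + sqrt(1-rho^2)*randn(n,1);      % synthetic correlated data

slopes     = linspace(-2, 3, 101);
intercepts = linspace(-2, 2, 101);
rmsd = zeros(numel(intercepts), numel(slopes));
for i = 1:numel(intercepts)
    for j = 1:numel(slopes)
        pred = intercepts(i) + slopes(j).*x;
        rmsd(i,j) = sqrt(mean((y - pred).^2));
    end
end
surf(slopes, intercepts, rmsd)             % cf. Figure 3.1
xlabel('Slope'); ylabel('Intercept'); zlabel('RMSD')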
that there is no guarantee that a model can handle the data to which it is applied:
Even though the error surface will have a minimum (or indeed more than one),
and even though best-fitting parameter values can always be estimated, the min-
imum discrepancy between predictions and data may nonetheless be too great—
and hence its goodness of fit poor—for the model to be of much use. We will con-
tinue to return to this issue of assessing the goodness of fit of a model later, but for
now, we need to ask how exactly the best-fitting estimates of the parameters are
obtained.
One possibility, of course, is to do what we did for Figure 3.1 and to examine
all possible combinations of parameter values (with some degree of granularity
because we cannot explore the infinite number of combinations of continuous
parameter values). By keeping track of the lowest discrepancy, we can then sim-
ply read off the best-fitting parameter estimates when we are done. This procedure
3.1.2 An Example
You may have already guessed that the model underlying our error surface is a
simple linear regression involving two variables that were related by the standard
two-parameter model $y_i = b_0 + b_1 x_i + e_i$. For our example, the data for each
variable were generated by randomly sampling 20 observations from a normal
distribution with mean μ = 0 and standard deviation σ = 1. The correlation
between the two variables was ρ = .8, and the best-fitting regression line for
the data underlying our error surface was $y_i = -.11 + .74\,x_i$. The best-fitting
parameter values were computed by the MATLAB statement mentioned earlier—
namely, b = x\y. The RMSD (see Equation 2.2) between model predictions—
that is, the fitted values $\hat{y}_i$—and the data was .46 for the best-fitting parameter
values; this represents the value on the ordinate at the minimum of the surface in
Figure 3.1.
Consider first the program in Listing 3.1, which spans only a few lines but
accomplishes three major tasks: First, it generates data, then it performs a regres-
sion analysis, and finally it repeats the regression but this time by calling a func-
tion that estimates the parameters using the procedure just described. Let’s go
through those steps in detail.
The first line of interest is line 7, which fills the second column of a rect-
angular matrix (called data) with samples from a random normal distribution.
Those are our values for the independent variable, x. The next line, line 8, does
almost the same thing: It samples random-normal values, but it also ensures that
those values are correlated with the first set (i.e., x) to the extent determined by
the variable rho. The resulting samples are put into the first column of the data
matrix, and they represent our values of y. It is important to realize that those two
lines of code draw random samples from two distributions with known means (μ),
standard deviations (σ), and correlation (ρ); thus, we expect the sample statistics (namely, X̄, s, and r) to be approximately equal to those population values, but it
would be surprising indeed if they turned out to be exactly equal. (In reality, of
course, we would replace these lines with the code required to read our data of
interest into the program. Here we generated synthetic data because we can then
determine their properties and examine how well our fit recovers those known
properties.)
Having thus generated the data, we next generate a “design matrix” for the
regression by adding a column of 1s in front of our values of x. This is done
in line 11 and results in a matrix called bigX, which we need to compute the
regression parameters using the statement already discussed. Note that the column
of 1s represents the regression intercept, which is constant and does not depend
on x—hence the use of a column of constant values. Note the absence of a “;” at
the end of line 13 that computes the actual regression. MATLAB prints out the
end result of any statement that is not terminated by a “;” and this ensures that
you can see the parameter estimates (in the vector b) on the screen.
Now let’s turn to the most interesting and novel part, which commences in
line 16. That line assigns starting values to the two parameters—namely, slope
(b1 ) and intercept (b0 ), which are represented in a single vector in that order.
(The order was determined by the programmer and is arbitrary, but once it’s been
decided, it is important to keep it consistent.) Note that the starting value for the
slope is −1, which is exactly opposite to the true slope value implied by our data
generation in line 8, so these starting values are nowhere near the true result.
Although we chose these starting values mainly for illustrative purposes, the choice also
reflects real life, where we often have no inkling of the true parameter values.
Later, we discuss the issue of finding suitable starting values for parameters.
The final line of the program, line 17, hands over control to another function,
called wrapper4fmin, and passes the parameters and the data as arguments to
that function. The function returns two values that are printed out (if you don’t
know why they are printed, reread the preceding paragraph) and that contain the
final best-fitting parameter estimates and the final value of the discrepancy func-
tion (i.e., its achieved minimum).
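In outline, the hand-over to the estimation routine might look like the following two statements. This is our condensed sketch, not the book's Listing 3.1, so its line numbers will not match those cited in the text; data is the two-column matrix described above, wrapper4fmin is the function discussed next, and the starting value of .2 for the intercept is an arbitrary choice of ours (the text specifies only that the slope starts at −1).

    startParms = [-1, .2];     % starting values: slope first, then intercept (.2 is our arbitrary pick)
    [finalParms, finalDiscrepancy] = wrapper4fmin(startParms, data)   % no ";", so the results are printed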
Seem simple? It is—almost. Just like in the examples in the previous chapter,
we call a function to do most of the work for us. This function appears to take
the data and starting values and then returns the best-fitting parameter estimates.
How does it know what model and what discrepancy function to use? The function
knows because we wrote it; unlike the functions from the previous chapter, this
one is not built into MATLAB.
One of the more powerful features of MATLAB (and of most other program-
ming languages) is that we can write new functions ourselves. Listing 3.2 shows
this function, which is defined in a separate file that must be available in the same
directory on your disk as the preceding program or available in the MATLAB
path. (By the way, the file’s name must be “wrapper4fmin.m” because MATLAB
expects each function to be contained in a file with the same name.)
The listing is rather short because the function accomplishes only two things:
First, in line 3, it calls the MATLAB function fminsearch, which performs the
actual parameter estimation. Note that fminsearch is passed the starting parame-
ter values (in the array pArray) and something called @bof. The latter is of partic-
ular interest, and we will discuss it in a moment—but first, note that fminsearch
returns two variables (x and fVal), which are also the return arguments of the
function wrapper4fmin itself. You can tell they are return arguments because
they appear on the left-hand side of the so-called “function header” in line 1.
So it appears that our function merely takes information, passes it on to another
function, and then returns that other function’s result itself—so what’s the point?
The answer lies in the variable @bof, which is a special variable known as a
function handle. Unlike other variables that contain standard information such as
parameter values or the results of computations, this variable contains the name
of a function. Function handles are identified by a leading “@” symbol, and they
permit MATLAB to call a function by referencing the function handle. The great
advantage of this is that MATLAB need not know about the actual function—all
it needs is the function handle. (We recommend that you consult the MATLAB
documentation at this point if you need a more detailed description of function
Figure 3.2 The relationship between the MATLAB functions used in Listings 3.1
through 3.3. The names in each box refer to the function name, and the arrows refer to
exchanges of information (function calls and returns). Solid arrows represent information
exchange that is entirely managed by the programmer, whereas broken arrows represent
exchanges managed by MATLAB. Shading of a box indicates that the function is provided
by MATLAB and does not require programming. See text for details.
handles.) Thus, the second main task of our function wrapper4fmin is to define
another function, called bof, in lines 7 through 12. This function is called an
“embedded” function (because it is itself wholly contained within a function),
and it is not called directly by any of our own functions—instead, the embedded
function bof is called from within MATLAB’s fminsearch, which knows about
it because we pass fminsearch the function handle @bof.
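A skeletal version of such a wrapper, reconstructed from the description above rather than copied from the book's Listing 3.2, might look as follows. The embedded function bof computes the RMSD (Equation 2.2) between the model predictions returned by getregpred and the data; because bof is nested inside the wrapper, it can use the variable data without receiving it as an argument.

    function [x, fVal] = wrapper4fmin(pArray, data)
    % hand the starting values and the handle of the discrepancy function to fminsearch
    [x, fVal] = fminsearch(@bof, pArray);

        function rmsd = bof(parms)
            % discrepancy for the current parameter values
            preds = getregpred(parms, data);        % model predictions (fitted values)
            sd    = (preds - data(:,1)).^2;         % squared deviations from the data (y in column 1)
            rmsd  = sqrt(sum(sd) / numel(sd));      % root-mean-squared deviation
        end
    end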
Because this calling sequence is somewhat complex, we have illustrated the
overall relationship between the various functions in Figure 3.2. In the figure,
the box labeled “main” refers to the program in Listing 3.1, and the other boxes
refer to the various functions. Note the arrows that connect the box “main” with
“wrapper4fmin” and the latter to “fminsearch.” This indicates that those functions
communicate directly with each other in the manner controlled by us, the pro-
grammers (or you, eventually!). This communication is done by passing values as
function arguments or by accepting return values from functions (or by inheriting
variables from containing functions, as noted below). Note also that “fminsearch”
and “bof” are connected by dotted arrows—this is to indicate that those calls are
under the control of MATLAB rather than the programmer.
Which brings us to the really nifty bit. MATLAB’s fminsearch will
estimate the parameters for any model—all that fminsearch needs is (a) what
starting values to use for the parameters and (b) the handle of some function (in
this instance, bof) that can return the discrepancy between that model’s predic-
tions and the data. Before we move on, it is important to focus your understanding
on the most crucial part: The function bof is never invoked directly, but only indi-
rectly, via the MATLAB built-in function fminsearch. The two listings we have
discussed so far (3.1 and 3.2) set up the scaffolding around bof that permits this
indirect call (hence the dotted arrows in Figure 3.2) to happen.
Let’s have a closer look at bof. Figure 3.2 tells us that bof calls a function
named getregpred; this happens in line 9. Although we haven’t shown you
getregpred yet, we can tell you that it takes the current parameter values and
returns the predictions of the model. In this instance, the model predictions are
the fitted values ( ŷ) provided by the regression line. The two lines following the
function call take those predictions and compare them to the data by computing
the RMSD (see Section 2.5.2). That value of RMSD is the return argument of
bof. Two further points are noteworthy: First, the values of the parameters passed
in parms are constantly changing during calls to bof. Their values will resem-
ble the starting values at the outset, and they will ultimately end up being the
best-fitting estimates, with fminsearch providing the path from the former to
the latter set of values. Second, did you notice that bof used the array data in
line 9? This may have escaped your notice because it is not terribly remarkable
at first glance; however, it is quite important to realize that this is possible only
because bof is embedded within wrapper4fmin and hence inherits all variables
that are known to wrapper4fmin. Thus, by passing data to wrapper4fmin as
an argument, we also make it automatically available to bof—notwithstanding
the fact that we do not pass that variable as an argument. In general, the fact that
an embedded function has access to all variables in the surrounding function pro-
vides a direct path of communication with bof in addition to the indirect calls via
fminsearch.
We have almost completed our discussion of this example. Listing 3.3 shows
the final function, getregpred, which computes predictions from the current
parameter values whenever it is called by bof. The function is simplicity itself,
with line 4 taking the parameters b0 and b1 and the values of x (in the second
column of data; see line 7 in Listing 3.1) to compute the fitted values (ŷ).
The remainder of the function plots the data and the current predictions (i.e.,
the current estimate of the best-fitting regression line) before waiting for a key-
press to proceed. (The pause and the plotting at each step are done for didactic
purposes only; except for this introductory example, we would not slow the pro-
cess down in this fashion.) There is no pressing need to discuss those lines here,
although you may wish to consult them for a number of informative details about
MATLAB’s plotting capabilities (which are considerable; many of the figures in
this book were produced by MATLAB).
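Stripped of the didactic plotting and the pause, a sketch of such a prediction function might be no more than the following (again our reconstruction, not the verbatim Listing 3.3):

    function preds = getregpred(parms, data)
    % return the fitted values of the regression line for the current parameters;
    % parms holds the slope and then the intercept, and x sits in the second column of data
    b1 = parms(1);                      % slope
    b0 = parms(2);                      % intercept
    preds = b0 + b1 .* data(:,2);       % predicted (fitted) values y-hat
    % (the book's version additionally plots the data together with the current
    % prediction line and then waits for a keypress; we omit that here)
    end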
We are done! Figure 3.3 provides two snapshots of what happens when we
run the programs just discussed. The top panel shows the data together with a
regression line during the opening stages of parameter estimation, whereas the
bottom panel shows the same data with another regression line toward the end of
the parameter estimation. Altogether, when we ran the program, 88 such graphs
were produced, each resulting from one call to bof by fminsearch. In other
words, it took 88 steps to descend the error surface from the starting values to the
minimum.
As we noted earlier, the starting values were very different from what we
knew to be the true values: We chose those rather poor values to ensure that the
early snapshot in Figure 3.3 would look spectacular. Had we chosen better starting
values, the optimization would have taken fewer steps—but even the 88 steps here
are a vast improvement over the roughly 1,600 predictions we had to compute
to trace out the entire surface in Figure 3.1. By the way, reassuringly, the final
estimates for b0 and b1 returned by our program were identical to those computed
in the conventional manner at the outset.
Let us recapitulate once more. We first visualized the mechanism by which
parameters are estimated when direct analytical solution is impossible (i.e., in the
vast majority of cases in cognitive modeling). We then provided an instantiation of
parameter estimation in MATLAB. What remains to be clarified is that our MAT-
LAB code was far more powerful than it might appear at first glance. Although we
“only” estimated parameters for a simple regression line, the framework provided
in Listings 3.1 through 3.3 can be extended to far more complex modeling: Just
Figure 3.3 Two snapshots during parameter estimation. Each panel shows the data (plot-
ting symbols) and the current predictions provided by the regression parameters (solid
line). The top panel shows a snapshot early on, and the bottom panel shows a snapshot
toward the end of parameter estimation.
replace line 4 in Listing 3.3 with your favorite cognitive model (which of course
may stretch over dozens if not hundreds of lines of code) and our programs will
estimate that model’s parameters for you. A later chapter of this book (Chapter 7)
contains two examples that do exactly that, but before we can discuss those, we
need to deal with several important technical issues.
3.1.3.1 Simplex
Figure 3.4 Two-dimensional projection of the error surface in Figure 3.1. Values of RMSD
are represented by degree of shading, with lower values of RMSD corresponding to darker
shades of gray. The three large simplexes illustrate possible moves down the error surface.
(a) Reflection accompanied by expansion. (b) Contraction along two dimensions (shrink-
age). (c) Reflection without expansion. Note that the locations of those points are arbitrary
and for illustration only. The tiny simplex at point d represents the final state when the
best-fitting parameter values are returned. See text for details.
instantiation of the phonological loop with variable decay rate in the previous
chapter—will necessarily yield variable predictions under identical parameter val-
ues. This random variation can be thought to reflect trial-to-trial “noise” within
a participant or individual differences between participants or both. The presence
of random variation in the model’s predictions is no trivial matter because it turns
the error surface into a randomly “bubbling goo” in which dimples and peaks
appear and disappear in an instant. It takes little imagination to realize that this
would present a major challenge to Simplex. The bubbling can be reduced by
running numerous replications of the model each time it is called, thus averaging
out random error. (When this is done, it is advantageous to reseed the random
number generator each time Simplex calls your model because this eliminates an
unnecessary source of noise; more on that in Chapter 7.) 6
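In code, the remedy just described might be wrapped into a small helper along the following lines; simulatemodel is a hypothetical stand-in for your stochastic model, and the seed and number of replications are arbitrary placeholder values.

    function meanPreds = noisyModelPreds(parms, nReps)
    % Average a stochastic model's predictions over nReps replications, after
    % reseeding the random number generator so that repeated calls with the same
    % parameters return the same value (this steadies the "bubbling" surface).
    rng(12345);                                   % fixed seed; any constant will do
    onePred  = simulatemodel(parms);              % hypothetical simulation of the model
    allPreds = zeros(nReps, numel(onePred));
    allPreds(1,:) = onePred;
    for rep = 2:nReps
        allPreds(rep,:) = simulatemodel(parms);
    end
    meanPreds = mean(allPreds, 1);                % averaged predictions
    end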
A general limitation of parameter estimation. A final problem, which applies
to all parameter estimation techniques, arises when the error surface has a more
challenging shape. Until now, we have considered an error surface that is smooth
and gradual (Figure 3.1), but there is no guarantee that the surface associated with
our model is equally well behaved. In fact, there is every probability that it is not:
Complex models tend to have surfaces with many dimples, valleys, plateaus, or
ridges. Given what you now know about parameter estimation, the adverse impli-
cations of such surfaces should be clear from a moment’s thought. Specifically,
there is the possibility that Simplex will descend into a local minimum rather
than the global minimum. This problem is readily visualized if you imagine an
empty egg carton that is held at an angle: Although there will be one minimum
that is lowest in absolute terms—namely, the cup whose bottom happens to be
the lowest, depending on which way you point the carton—there are many other
minima (all other cups) that are terribly tempting to a tumbling simplex. Because
the simplex knows nothing about the error landscape other than what it can “see”
in its immediate vicinity, it can be trapped in a local minimum. Being stuck in a
local minimum has serious adverse consequences because it may obscure the true
power of the model. Imagine the egg carton being held at a very steep angle, with
the lowest cup being near zero on the discrepancy function—but you end up in the
top cup whose discrepancy is vast. You would think that the model cannot handle
the data, when in fact it could if you could only find the right parameter values.
Likewise, being stuck in a local minimum compromises any meaningful inter-
pretation of parameter values because they are not the “right” (i.e., best-fitting)
estimates.
The local-minima problem is pervasive and, alas, unsolvable. That is, there is
never any guarantee that your obtained minimum is the global minimum, although
your confidence in it being a global minimum can be enhanced in a number of
ways. First, if the parameter estimation is repeated with a number of different
starting values, and one always ends up with the same estimates, then there is a
good chance that these estimates represent a global minimum. By contrast, if the
estimates differ with each set of starting values, then you may be faced with an egg
carton. In that instance, a second alternative is to abandon Simplex altogether and
to use an alternative parameter estimation technique that can alleviate—though
not eliminate—the local-minimum problem by allowing the procedure to “jump”
out of local minima. This technique is known as simulated annealing (Kirkpatrick,
Gelatt, & Vecchi, 1983).
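The first of these recommendations, repeating the estimation from different starting values, is easy to implement with the regression example from earlier; the ranges for the random starting values below are arbitrary.

    nStarts   = 10;
    estimates = zeros(nStarts, 2);
    fVals     = zeros(nStarts, 1);
    for i = 1:nStarts
        startParms = rand(1,2)*4 - 2;           % random slope and intercept in [-2, 2]
        [estimates(i,:), fVals(i)] = wrapper4fmin(startParms, data);
    end
    % If all rows of "estimates" (and all fVals) agree, a global minimum is more
    % plausible; widely differing solutions suggest an egg-carton-shaped surface.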
where p is a sample from a uniform distribution in [0, 1]. Put simply, the accep-
tance function returns one of two possible outcomes: It either returns the current
parameter vector, in which case the candidate is rejected and the process continues
and another candidate is drawn anew using Equation 3.3, or it returns the candi-
date parameter vector despite the fact that it increases the discrepancy function.
The probability of this seemingly paradoxical uphill movement is determined by
two quantities: the extent to which the discrepancy gets worse (Δf) and the cur-
rent “temperature” of the annealing process (T^(t)).
Let us consider the implications of Equations 3.4 and 3.5. First, note that the
acceptance function is only relevant if the candidate makes things worse (i.e.,
Δf > 0)—otherwise, no decision is to be made, and the improved candidate
vector is accepted. The fact that Δf > 0 whenever the acceptance function is
called implies that the quantity e^(−Δf/T^(t)) in Equation 3.5 is always < 1 and will
tend toward zero as Δf increases. In consequence, large steps up the error surface
are quite unlikely to be accepted, whereas tiny uphill steps have a much greater
acceptance probability. This relationship between step size and acceptance prob-
ability is further modulated by the temperature, T^(t), such that high temperatures
make it more likely for a given step uphill to be accepted than lower temperatures.
Figure 3.5 shows the interplay between those two variables.
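In code, the acceptance rule just described amounts to only a few lines. The sketch below assumes that deltaF (the change in the discrepancy function produced by the candidate) and T (the current temperature) have already been computed; theta and thetaCandidate are the current and candidate parameter vectors.

    if deltaF <= 0
        theta = thetaCandidate;                 % the candidate improves the fit: always accept
    elseif rand < exp(-deltaF / T)
        theta = thetaCandidate;                 % worse fit, accepted with probability e^(-deltaF/T)
    else
        % candidate rejected: theta stays unchanged and a new candidate is drawn
    end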
The figure clarifies that when the temperature is high, even large movements
up the error surface become possible, whereas as things cool down, the probability
of an upward movement decreases. In the limit, when the temperature is very
low, no movement up the error surface, however small, is likely to be accepted.
This makes intuitive sense if one thinks of temperature as a Brownian motion:
The more things heat up, the more erratically everything jumps up and down,
whereas less and less motion occurs as things cool down. By implication, we are
unlikely to get stuck in a local minimum when the temperature is high (because
Figure 3.5 Probability with which a worse fit is accepted during simulated annealing as a
function of the increase in discrepancy (Δf) and the temperature parameter (T). The data
are hypothetical but illustrate the interplay between the two relevant variables. The range
of temperatures follows precedent (Nourani & Andresen, 1998). See text for details.
T^(t) = T_0 α^t (3.6)

or

T^(t) = T_0 − η t, (3.7)

where α and η are fixed parameters, and T_0 represents the initial temperature of
the system. (The choice of T_0 is crucial and depends on the nature of the discrep-
ancy function, which may be ascertained by computing the discrepancies for a
sample of randomly chosen parameter values; Locatelli, 2002.) Whichever cool-
ing schedule is used, Equations 3.6 and 3.7 imply that across iterations, the SA
process gradually moves toward the left in Figure 3.5, and the system becomes
more and more stable until it finally settles and no further uphill movement is
possible. (Quick test: Is the surface in Figure 3.5 an error surface, such as those
discussed earlier in connection with Simplex? If you were tempted to say yes, you
should reread this section—the two surfaces represent quite different concepts.)
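As a quick illustration, the two cooling schedules can be tabulated side by side; T0, alpha, and eta below are arbitrary example values.

    T0 = 10;  alpha = 0.95;  eta = 0.1;    % arbitrary illustrative values
    t    = 0:100;                          % iteration counter
    Texp = T0 * alpha.^t;                  % exponential cooling, Equation 3.6
    Tlin = max(T0 - eta*t, 0);             % linear cooling, Equation 3.7 (floored at zero here)
    plot(t, Texp, t, Tlin);                % both schedules decline toward zero across iterations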
Finally, then, where do the candidates θ_c^(t) come from? Perhaps surprisingly, it
is not uncommon to generate the next candidate by taking a step in a random
direction from the current point:
Simulated-annealing algorithms are not built into MATLAB but can be down-
loaded from MATLAB Central (see Section 2.3.1 for instructions). Once down-
loaded, the MATLAB implementations of SA can be called and used in much
the same way as fminsearch. Alternatively, if you have access to the “Genetic
Algorithm and Direct Search Toolbox” for MATLAB, which can be purchased as
an add-on to the basic MATLAB package, then you will have immediate access
to SA functions as well as the genetic algorithms that we discuss in the next
section.
We briefly consider another class of estimation techniques that are based on evo-
lutionary genetics. These genetic algorithms rival simulated annealing in their
ability to resist local minima and noise (Buckles & Petry, 1992), but they are
based on a completely different approach to “search”—so different, in fact, that
we put “search” in quotation marks.8
We take a first stab at introducing the technique by retaining the (pseudo)
genetic language within which it is commonly couched. At the heart of a genetic
algorithm is the idea of a population of organisms that evolves across generations.
Transmission across generations involves mating, where the choice of potential
mates is tied to their fitness. Reproduction is imperfect, involving the occasional
random mutation as well as more systematic crossovers between pairs of parents.
This process repeats across generations until an organism has evolved whose fit-
ness satisfies some target criterion.
Now let’s translate this into modeling terminology. The population of organ-
isms in a genetic algorithm involves representations of candidate parameter
values. Specifically, each organism represents a θ_c; that is, a complete set of can-
didate values for all parameters that are being estimated. Each organism’s fitness,
therefore, relates to the value of our standard discrepancy function, f , for those
parameter values (because the algorithm relies on maximizing fitness rather than
minimizing discrepancy, we must invert the value of f so it points in the required
direction). At each generational cycle, organisms are selected for mating with
replacement on the basis of their fitness—that is, the fitter an organism is, the
more likely it is to “mate,” and because sampling occurs with replacement, any
organism can contribute more than once. Once mates have been selected, the next
generation (which contains the same number of organisms as before) is derived
from the mating set by one of three processes that are chosen at random and
on the basis of some preset probabilities: (a) An exact copy of an organism is
made, (b) a random mutation occurs during copying, or (c) a crossover occurs in
which two parents exchange part of their genetic material to form two offspring.
The new population, thus derived, replaces the old one, and the process continues
this is that each member of M^(0) is sampled from P^(0) with replacement and on
the basis of its fitness. Specifically, each member m^(0) of M^(0) is set equal to some
x^(0) in P^(0) with probability9

p(m^(0) = x^(0)) = F^(0) / Σ F^(0), (3.9)

where

F^(0) = f(x^(0)), (3.10)

with the sum taken over all organisms in P^(0). This ensures that each organism is
recruited for mating with a probability that is proportional to its fitness.
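Fitness-proportional (“roulette wheel”) selection of this kind takes only a few lines; the vector fitness below is assumed to contain one positive fitness value per organism in the current population.

    nOrgs     = numel(fitness);
    selProb   = fitness ./ sum(fitness);           % probability of being recruited for mating
    cumProb   = cumsum(selProb);
    matingSet = zeros(nOrgs, 1);
    for i = 1:nOrgs
        matingSet(i) = find(rand <= cumProb, 1);   % index of the organism selected on this draw
    end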
Note that the fitness evaluation required for selection takes place at the level
of parameter space, not chromosome space. This means that, implicitly within the
function f , each organism’s chromosome is converted into a vector θc to evaluate
its fitness—in the same way that the evolution of a live organism depends on its
fitness with respect to the environment rather than on direct analysis of its genes.
Once M^(0) has been thus created, the next generation P^(1) is derived from the
mating set as follows.
Crossover. With probability pc, two randomly chosen members of M^(0) are
designated as “parents,” which means that they cross over their alleles from a
randomly determined point (in the range 1–L with uniform probability) to form
two offspring. For example, the chromosomes 11111 and 00000 might form the
offspring 11000 and 00111 after a crossover in the third position.
With probability 1 − pc , the crossover is skipped, and parents generate off-
spring by passing on an exact copy.
Mutation. Each offspring may then undergo a further mutation, by randomly
“flipping” each of the bits in its chromosome with probability pm . The value of
pm is typically very small, in the order of .001 (Mitchell, 1996). Mutation occurs
at the level of each allele, so in some rare instances, more than one bit in a chro-
mosome may be flipped.
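For binary chromosomes stored as vectors of 0s and 1s, crossover and mutation reduce to a handful of statements; the two example chromosomes and the values of pc and pm below are purely illustrative.

    parent1 = [1 1 1 1 1];  parent2 = [0 0 0 0 0];   % example chromosomes (L = 5)
    pc = .7;  pm = .001;                             % illustrative crossover and mutation probabilities
    L  = numel(parent1);
    if rand < pc
        cut = randi(L);                              % crossover point, chosen with uniform probability
        child1 = [parent1(1:cut-1), parent2(cut:end)];
        child2 = [parent2(1:cut-1), parent1(cut:end)];
    else
        child1 = parent1;  child2 = parent2;         % exact copies if no crossover occurs
    end
    flips1 = rand(1,L) < pm;                         % mutation: flip each allele with probability pm
    flips2 = rand(1,L) < pm;
    child1(flips1) = 1 - child1(flips1);
    child2(flips2) = 1 - child2(flips2);

With cut equal to 3, the two example parents yield the offspring 11000 and 00111 mentioned in the text.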
Stopping rule. Once all offspring have been created, the initial population P^(0)
is replaced by P^(1), and the process continues with renewed selection of members
for the mating set M^(1) and so on for generations 1, 2, . . . , k, . . . , N. Across gen-
erations, the fitness of the various organisms will (in all likelihood) improve, and
the process is terminated based on some stopping rule.
A variety of stopping rules have been proposed; for example, Kaelo and Ali
(2007) suggested a rule based on the best and the worst point. That is, if the abso-
lute difference in fitness between the best and the worst organism in a generation
falls below some suitable threshold, evolution terminates because all organisms
have optimally adapted to the environment. Note the resemblance between this
stopping rule and its equivalent in Simplex (when the simplex collapses toward
a single point, all vertices have the same discrepancy). Tsoulos (2008) addition-
ally considered the variance of the best organism across generations; if the fitness
of the best point no longer changes across generations, thus reducing its inter-
generational variance, then the algorithm stops because further improvement is
unlikely.
Why does it work? The most stunning aspect of genetic algorithms is that
they actually work. In fact, they can work extremely well, and sometimes better
than simulated annealing (Thompson & Bilbro, 2000). But why? After all, we
take our parameters, convert them to concatenated binary strings, and then we
(a) randomly flip bits within the strings or (b) randomly exchange portions of
those strings between organisms. While the latter might seem plausible if the
crossover point were always at the boundary between parameters, it is more diffi-
cult to intuit why if a parameter happens to have the values 2.34 and 6.98 across
two organisms, exchanging the “.34” with the “.98” would enhance the fitness
of one or both of them. This acknowledged (potential) mystery has been widely
researched (e.g., Schmitt, 2001; Whitley, 1994), and here we briefly illustrate the
crucial insight—namely, that genetic algorithms implement a hyperplane sam-
pling approach to optimization.
We begin by noting the fact that if our chromosomes are of length L (and
binary, as we assume throughout), then the total search space is an L-dimensional
hyperspace with 2^L vertices or possible values. Our task thus is to find that one
point out of 2^L points that has maximal fitness. Lest you consider this task triv-
ial, bear in mind that realistic applications may have values of L = 70 (Mitchell,
1996), which translates into roughly 1.18 × 10^21
points—we do not recommend that you (or your children, grandchildren, and
great-grandchildren) attempt to search them one by one.
To introduce the idea of hyperspace sampling, Figure 3.6 shows a three-
dimensional search space (i.e., L = 3) in which all possible points are labeled
with their allele code. Hence, each vertex of the cube represents one organism
that may potentially be a member of our population, P^(k), at some generation k.
The figure also introduces the concept of a schema. A schema is a string in which
some values are fixed but others are arbitrary; the latter are represented by aster-
isks and stand for “I do not care.” Hence, in this instance, all points on the gray
frontal plane are represented by the string 0∗∗. Any string that can be created by
replacing an asterisk with either 0 or 1 conforms to that schema; hence, 010 and
011 do, but 111 does not.
Why would we care about schemata? The answer is implied by the fitness
values associated with each point, which are also shown in the figure (in paren-
theses next to each vertex). The bottom-left point on the frontal plane is clearly
the “best” solution because it has the highest fitness value. Hence, to hone in on
this point, we would want organisms that conform to that schema (010 and so on)
Figure 3.6 The three-dimensional search space for L = 3. Each vertex represents one possible organism, labeled with its allele code and (in parentheses) its fitness; the gray frontal plane corresponds to the schema 0∗∗.
of maximal fitness. In reality, we do not know why the selected points have a high
fitness.
The role of crossover, then, is to further narrow down the subspace in which
fitness is maximal. Of course, crossover is frequently going to generate offspring
that fall outside the most promising schema; if that happens, they are unlikely to
be included in future generations, and hence there is no lasting damage. If, by
contrast, crossover generates offspring that fall within a more narrow schema of
equal or greater promise, then those will be more likely to carry their information
into future generations. Effectively, across generations, this process will gradually
replace all asterisks in a schema with 0s or 1s because only those organisms make
it into the next generation whose alleles conform to the most promising schema.
This tendency toward narrowing of the schema is slowed and counteracted by
the random mutation. The role of mutation is therefore similar to the role of the
occasional acceptance of “uphill” steps in simulated annealing; it helps prevent
premature convergence onto a suboptimal solution (Whitley, 1994).
Genetic algorithms are not built into MATLAB but can also be downloaded
from MATLAB Central (see Section 2.3.1 for instructions). Once downloaded, the
genetic algorithm can be called and used in much the same way as fminsearch.
are used widely. One word of caution: The time taken to estimate parameters can
be far greater for simulated annealing and genetic algorithms than for Simplex.
Although this difference may be negligible for simple models that are rapidly
evaluated—such as our introductory regression example—the time difference may
be significant when more complex models are involved. Clearly, it makes a huge
pragmatic difference whether a parameter estimation takes several hours or a
week!10
Figure 3.7 Simulated consequences of averaging of learning curves. The thin solid lines
represent the individual performance of a randomly chosen subset of 100 simulated sub-
jects. Each subject learns linearly, and across subjects, there is a slight variation in learning
rate but considerable variation in the onset of learning. The solid line and filled circles
represent average simulated performance across all 100 subjects.
at chance (50% in this instance) before they commence learning at some point
s at a linear rate of improvement r . We assume that there is considerable varia-
tion across subjects in s (σs = 20) but only small variation across subjects in r
(σr = 1.5). Our assumptions embody the idea that learning is accompanied by
an “aha” experience; that is, a problem may initially appear unsolvable, but at
some point or another, there is a sudden “insight” that kick-starts the then very
rapid learning.
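A sketch of this kind of simulation is given below. The text specifies only the standard deviations of the onset and the rate, so the mean onset (trial 40) and mean rate (1 percentage point per trial) are assumptions of ours, as is the exact way in which chance performance and the ceiling of 100% are handled.

    nSubj = 100;  nTrials = 120;
    s = max(1, round(40 + 20*randn(nSubj,1)));      % onset of learning (sigma_s = 20)
    r = max(.2, 1 + 1.5*randn(nSubj,1));            % learning rate (sigma_r = 1.5), kept positive
    perf = 50*ones(nSubj, nTrials);                 % everyone starts at chance (50%)
    for i = 1:nSubj
        for t = s(i):nTrials
            perf(i,t) = min(100, 50 + r(i)*(t - s(i)));   % linear improvement after onset
        end
    end
    plot(1:nTrials, perf(1:5,:), '-'); hold on      % a few step-like individual curves
    plot(1:nTrials, mean(perf), 'ko-');             % the smooth-looking average curve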
The results of our simulation are shown in Figure 3.7. The figure shows the
individual data for a handful of randomly chosen subjects (each subject is repre-
sented by one of the thin solid lines). The variation among individuals is obvious,
with one subject commencing learning virtually instantaneously, whereas the last
subject in our sample requires 80 trials to get going. However, there is also con-
siderable similarity between subjects: Once learning commences, there is a nearly
constant increment in performance at each trial. Now consider the thick filled cir-
cles: They represent average performance across all 100 simulated subjects. Does
the average adequately characterize the process of learning? No, not at all. The
average seems to suggest that learning commences right from the outset and is
smooth, gradual, and highly nonlinear. Alas, every single one of those attributes
is an artifact of averaging, and not a single subject learns in the manner assumed
by the average learning curve.
One might object that the figure plots simulation results and that “real” data
might behave very differently. Sadly, this objection cannot be sustained: Our sim-
ulation results are almost identical to behavioral results reported by Hayes (1953)
in an experiment involving brightness discrimination learning in rats. (This is
not surprising because we designed the simulation to act just like the rats in
that study.) Moreover, it is not just rats whose behavior can be misrepresented
by the average: You may recall that in Section 1.4.2, we discussed the study by
Heathcote et al. (2000) that compared different functions for capturing people’s
learning performance in skill acquisition experiments. Heathcote et al. concluded
that the hitherto popular “power law” of practice was incorrect and that the data
were best described by an exponential learning function instead. In the present
context, it is particularly relevant that their conclusions were based on examina-
tion of individual performance rather than the average. Heathcote et al. explicitly
ascribed the earlier prominence of the power law to an inopportune reliance on
averaged data. This is no isolated case; the vagaries of aggregating have been
noted repeatedly (e.g., Ashby, Maddox, & Lee, 1994; Curran & Hintzman, 1995;
Estes, 1956).
Where does this leave us, and where do we go from here? First, we must
acknowledge that we have a potentially serious problem to contend with. This
may appear obvious to you now, but much contemporary practice seems oblivi-
ous to it. Second, it is important to recognize that this problem is not (just) the
modeler’s problem but a problem that affects anyone dealing with psychological
data—as evidenced by the decade-long practice to fit power functions to average
learning data. Hence, although as modelers we must be particularly aware of the
problem, we cannot avoid it simply by giving up modeling.
Fortunately, the problems associated with data aggregation are not insurmount-
able. One solution relies on the recognition that although the problem is perva-
sive, it is not ubiquitous. Estes (1956) provides mathematical rules that permit
identification of the circumstances in which averaging of data across individuals
is likely to be problematic. Specifically, if the function characterizing individual
performance is known, then one can readily determine whether that functional
form will look different after averaging. For a variety of functions, averaging
does not present a problem, including nonlinear functions such as y = a log x
and y = a + bx + cx^2 (where x is the independent variable, for example trials,
and a, b, and c are parameters describing the function).13 Then there are other
functions, such as y = a + b e^(−cx), whose shape does change with averaging. Of
course, this mathematical information is of limited use if the function describing
individual behavior is unknown; however, if one is entertaining several candidate
functions and seeks to differentiate between them, then information about their
shape invariance is crucial. For example, the fact that an exponential learning
function does not retain its shape upon averaging should caution against fitting a
(quite similar) power function to average learning data.
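A quick numerical check of this point uses two hypothetical learners whose curves follow y = a + b e^(−cx) and differ only in their rate c:

    x  = 0:50;                               % practice trials
    y1 = 0.9 - 0.4*exp(-0.05*x);             % slow exponential learner
    y2 = 0.9 - 0.4*exp(-0.50*x);             % fast exponential learner
    yAvg = (y1 + y2) / 2;                    % the averaged "group" curve
    % If the average were itself exponential with asymptote .9, log(0.9 - yAvg)
    % would be a straight line in x; its visible curvature shows that it is not.
    plot(x, log(0.9 - yAvg));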
More recently, J. B. Smith and Batchelder (2008) provided statistical methods
for the detection of participant heterogeneity that may prevent averaging. Those
tests can be applied to the data prior to modeling to determine the correct level
at which a model should be applied. When the tests reveal heterogeneity, fitting
at the aggregate level is inadvisable. When the tests fail to detect heterogeneity,
fitting at the aggregate level may be permissible. We now turn to a discussion of
the relative merits of the three principal approaches to fitting data.
3.2.2.1 Advantages
The individual approach has some deep theoretical advantages that go beyond res-
olution of the aggregation problem. For example, capturing the behavior of indi-
viduals within a cognitive model opens a window into cognition that cannot be
obtained by other means. It has long been known that individual differences pro-
vide a “crucible in theory construction” (Underwood, 1975, p. 128), and there is
a plethora of research on individual differences in cognition (e.g., Oberauer, Süß,
Schulze, Wilhelm, & Wittmann, 2000). That research has served to constrain and
inform our understanding of numerous cognitive processes (e.g., Miyake, Fried-
man, Emerson, Witzki, & Howerter, 2000). Traditionally, this research has been
conducted by behavioral means. It is only recently that those investigations have
begun to be accompanied by theoretical analysis. For example, Schmiedek, Ober-
auer, Wilhelm, Süß, and Wittmann (2007) fitted two models to the response time
distributions of a large sample of participants, thus obtaining separate parameter
estimates for each subject for a variety of cognitive tasks (e.g., a speeded deci-
sion about the number of syllables in a word). Analysis of the variation of those
parameter values across individuals revealed that only some model parameters
covaried with people’s “working memory” capacity (WMC).14 Specifically,
Schmiedek et al. found that a parameter reflecting the speed of evidence accumu-
lation in a popular decision-making model (Ratcliff & Rouder, 1998) was highly
correlated with WMC, whereas a parameter reflecting the time for encoding of
a stimulus was not. Thus, by fitting individual data and analyzing the resultant
parameter values across subjects, Schmiedek et al. were able to link a particular
stage of a decision model to a cognitive construct that is a powerful predictor of
reasoning ability.
Another advantage of fitting models to individual data is that it may reveal
boundaries of the applicability of a model. Simply put, a model may fit some sub-
jects but not others, or it may describe behavior in one condition but not another.
Although such heterogeneity may present a source of frustration for a modeler
who is interested in making general statements about human behavior, it can also
be extremely illuminating. A particularly striking case was reported in a visual
signal detection study by Wandell (1977), who used a descriptive model (namely,
signal detection theory) to infer how people interpret spikes of neural activity in a
visual channel to detect a signal. Specifically, when presented with a weak signal
that results in infrequent neural activity, people have a choice between count-
ing the number of spikes that are received in a constant time period (and decid-
ing that a signal is present when the number is sufficiently large) or waiting for
a fixed number of spikes to be detected (and deciding that a signal is present
because that number is reached within a reasonable waiting period). It turns out
that those two distinct mechanisms make differing predictions about the time to
respond and the relationship between “hits” (successful detections of the signal)
and “false alarms” (erroneously reporting a signal when it was not there). Wandell
(1977) showed that people predictably alternated between those decision modes
in response to an experimental manipulation (the details of which are not relevant
here), and he was also able to pinpoint individual departures from the expected
model (see Luce, 1995, p. 11, for further details). What can we learn from this
example? As Luce (1995) put it, “People typically have several qualitatively dif-
ferent ways of coping with a situation. If we . . . elect to ignore . . . them, we are
likely to become confused by the data” (Luce, 1995, p. 12). In the example just
discussed, reliance on a descriptive model permitted Wandell (1977) to avoid such
confusion by identifying (and manipulating) each person’s decision strategy.
3.2.2.2 Disadvantages
At first glance, it might appear that there are no disadvantages associated with
fitting of individuals. This perspective would be overly optimistic because several
potentially troublesome exceptions can be cited.
First, it is quite common for modelers to apply their theories to archival data
(which, in the extreme case, may have been extracted from a published plot). In
those cases, it may simply be impossible to fit the data at the individual level
because that information may no longer be available. Needless to say, in those
cases, the best one can do is to fit the available aggregate data with the appropriate
caution (more on that later).
Second, a related problem arises even with “new” data when the experimental
methodology does not permit multiple observations to be drawn. For example,
mirroring real-life police procedures, eyewitness identification experiments nec-
essarily involve just a single lineup and a single response from each participant.
Clearly, in those circumstances, no modeling at the individual level is possible,
and the data must be considered in the aggregate (i.e., by computing the propor-
tion of participants who picked the culprit or any of the various foils from the
lineup). We will present an in-depth example of a model of eyewitness identifica-
tion in Chapter 7 that illustrates this approach to modeling.
The third problem is similar and involves experiments in which the number of
observations per participant is greater than one but nonetheless small. This situa-
tion may occur in experiments with “special” populations, such as infants or clin-
ical samples, that cannot be tested extensively for ethical or pragmatic reasons. It
may also occur if pragmatic constraints (such as fatigue) prevent repeated testing
of the same person. Although it is technically possible in these circumstances to fit
individuals, is this advisable? How stable would parameter estimates be in those
circumstances, and how much insight can we expect from the modeling? Cohen,
Sanborn, and Shiffrin (2008) conducted a massive study—involving nearly half
a million simulated experiments—that sought answers to those questions using
the model recovery technique discussed later in Chapter 6 (see Figure 6.2). That
is, Cohen et al. used a number of competing models to generate data for simu-
lated subjects and then tested whether the correct model was recovered, varying
the number of (simulated) observations per subject and the number of (simulated)
subjects. Although Cohen, Sanborn, and Shiffrin’s results are so complex that it
is risky to condense them into a simple recommendation, in our view, it appears
inadvisable to fit individuals whenever the number of observations per partici-
pant falls below 10. Cohen et al. showed that in many—but not all—cases with
fewer than 10 observations per subject, the model that generated the data was
more likely to be correctly recovered from the aggregate data than from fits to
individuals.15
A related problem arises when the number of observations per participant is
moderate (i.e., > 10) but not large (≪ 500; Farrell & Ludwig, 2008). Although
fits to individual participants may be quite successful under those circumstances,
the distribution of estimated parameter values across individuals may be
overdispersed—that is, the variance among parameter estimates is greater than
the true variance among those parameters (Farrell & Ludwig, 2008; Rouder &
Lu, 2005). This overdispersion problem arises because each individual estimate
generated by a single source (i.e., a single participant), and each to-be-fitted obser-
vation is formed by averaging (or equivalently, summing) the underlying data
across subjects. For example, if we are fitting data from a categorization experi-
ment, in which each subject classifies a stimulus as belonging to category A or B,
we may sum response frequencies across participants and fit the resulting cell fre-
quencies (or proportions of “A” responses). Similarly, when modeling response
latencies, we may choose to estimate a single set of parameters to capture the
average latency across trials in some skill acquisition experiment.
An alternative approach to aggregation goes beyond simple averaging and
seeks to retain information about the underlying structure of each participant’s
responses. This is best illustrated by considering cases in which responses are
represented in distributions. One case where this approach is often used is in
analyzing and modeling response times (RTs). RT distributions have played an
important role in cognitive psychology, and we will introduce them in greater
detail in Chapter 4. For now, it suffices to know that a person’s response times
can be “binned,” by linear interpolation, into quantiles (e.g., the latency values
cutting off 10%, 30%, 50%, 70%, and 90% of the distribution below; Ratcliff
& Smith, 2004), and we can then average the observations within each quantile
across participants to obtain average quantiles whose location, relative to each
other, preserves the shape of the distribution of each individual (e.g., Andrews &
Heathcote, 2001; Jiang, Rouder, & Speckman, 2004; Ratcliff, 1979).
This procedure, known as “Vincentizing” (Ratcliff, 1979), is a particularly
useful aggregating tool: Like simple averaging, it yields a single set of scores that
we can fit, but unlike averaging, those scores retain information about the individ-
ual RT distributions that would be lost if all observations were lumped together.
That is, in the same way that the average learning curve in Figure 3.7 does not
represent any of the underlying individual curves, the distribution of all observed
RTs across subjects is unlikely to resemble any of the underlying individual distri-
butions. If those distributions are Vincentized before being averaged, their shape
is retained. We recommend Vincentizing techniques for any situation involving
distributions or functions whose shape is of interest and should be preserved dur-
ing aggregation.16 Van Zandt (2000) provides an extremely detailed treatise of
Vincentizing and other ways in which RT distributions can be fit.
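A bare-bones sketch of Vincentizing is shown below; it is our illustration, in which rts is assumed to be a cell array containing one vector of response times per participant, and the interpolation scheme is one of several common conventions.

    qs    = [.1 .3 .5 .7 .9];                       % quantiles cited above (Ratcliff & Smith, 2004)
    nSubj = numel(rts);
    subjQ = zeros(nSubj, numel(qs));
    for i = 1:nSubj
        t = sort(rts{i});
        n = numel(t);
        % linear interpolation between the ordered observations to obtain each quantile
        subjQ(i,:) = interp1((1:n)/n, t, qs, 'linear', 'extrap');
    end
    groupQ = mean(subjQ, 1);                        % average each quantile across participants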
time among statisticians. In this approach, parameters are estimated for each
subject, but those parameters are simultaneously constrained to be drawn from
a distribution that defines the variation across individuals (see, e.g., Farrell &
Ludwig, 2008; Lee & Webb, 2005; Morey, Pratte, & Rouder, 2008; Rouder, Lu,
Speckman, Sun, & Jiang, 2005).
This type of modeling is known as “multilevel” because the parameters are
estimated at two levels: first at the level of individual subjects (base level) and
second at a superordinate level that determines the relationship (e.g., variance)
among base-level parameters. Those techniques have considerable promise but
are beyond the scope of the present volume.
3.2.6 Recommendations
We have surveyed the three major options open to modelers and have found that
each approach comes with its own set of advantages and difficulties. Is it possi-
ble to condense this discussion into a clear set of recommendations? Specifically,
under what conditions should we fit a model to individual participants’ data, and
when should we fit aggregate or average data and estimate a single set of parame-
ters? Although we can offer a set of decision guidelines based on current practice,
the rapidly evolving nature of the field and the heterogeneity of prevailing opin-
ions render our advice suggestive rather than conclusive.
Notes
1. The rolling marble is not a perfect analogy because it continuously rolls down the
error surface, whereas parameter estimation typically proceeds in discrete steps. A more
accurate analogy might therefore involve a blind parachutist who is dropped onto a moun-
tain behind enemy lines on a secret mission and must reach the bottom of the valley by
making successive downward steps.
2. By implication, the same parameter estimation techniques can also be applied to
psychological models that are analytically tractable, just like regression.
3. To satisfy your curiosity, a four-dimensional simplex is called a pentachoron, and a
five-dimensional simplex is a hexateron.
4. In actual fact, the simplex will never be a point, but it will have a very small diameter.
The size of that diameter is determined by the convergence tolerance, which can be set in
MATLAB via a call to the function optimset; see the MATLAB documentation for
details. Lagarias, Reeds, Wright, and Wright (1998) provide a rigorous examination of the
convergence properties of Simplex.
5. For brevity, from here on we will refer to the algorithm by capitalizing its name
(“Simplex”) while reserving the lowercase (“simplex”) to refer to the geometrical figure.
6. Brief mention must be made of an alternative technique, known as Subplex (Rowan,
1990), which was developed as an alternative to Simplex for situations involving large
numbers of parameters, noisy predictions (i.e., models involving random sampling), and
the frequent need to dismiss certain combinations of parameter values as unacceptable
(Rowan, 1990). As implied by the name, Subplex divides the parameter space into sub-
spaces, each of which is then independently (and partially) optimized by standard Simplex.
7. In order to focus the discussion and keep it simple, we considered only very sim-
ple cooling schedules and a trivial candidate function. More sophisticated alternatives are
discussed by Locatelli (2002) and Nourani and Andresen (1998).
8. At a mathematical level, one can in fact draw a plausible connection between genetic
algorithms and simulated annealing because both involve the exploration of randomly cho-
sen points. We do not consider this connection here and focus on the overriding conceptual
differences between the two.
9. Because the parameter values at this stage have been converted to chromosome
strings, we refer to the organisms by the notation x (or the vector x) rather than θ.
10. In this context, it is also worth noting that MATLAB may evaluate models more
slowly than other, “low-level” languages such as C. If execution time presents a problem,
it may therefore be worthwhile to rewrite your model in another language, such as C,
provided you save more in execution time than you spend on the rewrite.
11. If you are really keen to sink your teeth into a “real” model, you can proceed from
the end of this chapter directly to Chapter 7; Section 7.1 does not require knowledge of the
material in the intervening Chapters 4 and 5.
12. The table contains only a snapshot of the situation, and hence the totals will not
yield the exact percentages reported by Bickel et al. (1975).
13. Estes (1956) offers the following heuristic to identify this class of functions: “What
they all have in common is that each parameter in the function appears either alone or as
a coefficient multiplying a quantity which depends only on the independent variable x”
(p. 136).
14. Working memory is closely related to the concept of short-term memory. However,
unlike tests of short-term memory that rely on memorization and recall alone, examinations
of working memory typically involve some additional cognitive processes. For example, in
one favored working memory task, study items may alternate with, say, mental arithmetic
problems (Turner & Engle, 1989). People might process a sequence such as 2 + 3 = 5?,
A, 5 + 1 = 7?, B, . . . , where the equations have to be judged for correctness and the letters
must be memorized for immediate serial recall after the sequence has been completed.
The capacity of working memory, as measured by performance in the complex span task,
accounts for a whopping half of the variance among individuals in measures of general
fluid abilities (i.e., intelligence; see, e.g., Kane, Hambrick, & Conway, 2005).
15. Cohen et al. (2008) identified a situation in which fitting of the aggregate data was
always more likely to recover the correct model than fits to individuals, irrespective of
the number of observations per participant and the overall number of participants. This
situation involved forgetting data, in which individual subjects (simulated or real) often
scored 0% recall; those extreme scores uniquely favored one of the models under consider-
ation (even if it had not generated the data), thus enabling that model (whose details are not
relevant here) to do particularly well when applied to individual data. Upon aggregation,
the individual 0s were absorbed into the group average, thus leveling the playing field and
enabling better identification of the correct model.
16. As an exercise, you may wish to figure out a way in which the learning curves in
Figure 3.7 can be aggregated so that they retain their shape. If you find this difficult, consult
Addis and Kahana (2004).
4
Maximum Likelihood Estimation
independent (as defined using conditional probabilities above), then their joint
probability is computed simply by multiplying their individual probabilities:

P(a, b) = P(a) P(b).

More generally, if a and b are not independent, and if we know the conditional
relationship a|b between a and b, the joint probability is given by

P(a, b) = P(a|b) P(b).
Figure 4.1 An example probability mass function: the probability of correctly recalling
N out of eight items, where the probability of correctly recalling any particular item is
p_correct = .7.
a statistical model called the binomial function; we will examine this model in
more detail later. For the moment, we can see in Figure 4.1 that the binomial
model assigns a probability to each possible outcome from the experiment. The
only assumption that has been made here is that the probability of correctly recall-
ing each of the eight items, which we label p_correct, is equal to .7; as we will see
later, p_correct is a parameter of the binomial model.1 The distribution of proba-
bilities across different possible values of N reflects the assumed variability in
responding: Although on average a person is predicted to get a proportion of .7 of
items correct, by chance she may get only three correct, or maybe all eight items
correct. Note that all the probabilities in Figure 4.1 add to 1, consistent with the
second axiom of probability theory. This is because we have examined the entire
sample space for N : An individual could recall a minimum of zero and a maxi-
mum of eight items correctly from an eight-item list, and all intermediate values
of N are shown in Figure 4.1.
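The probabilities in Figure 4.1 can be reproduced with a few lines of base MATLAB (binopdf from the Statistics Toolbox would accomplish the same thing):

    p = .7;  nItems = 8;                   % parameter and list length, as in Figure 4.1
    N = 0:nItems;                          % every possible number of correctly recalled items
    probs = arrayfun(@(k) nchoosek(nItems,k) * p^k * (1-p)^(nItems-k), N);
    bar(N, probs);                         % reproduces the shape of Figure 4.1
    sum(probs)                             % equals 1: the entire sample space is covered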
We were able to plot probability values in Figure 4.1 because there are a
finite number of discrete outcomes, each with an associated probability of occur-
rence. What about the case where variables are continuous rather than discrete?
Continuous variables in psychology include direct measures such as response
latencies (e.g., Luce, 1986), galvanic skin response (e.g., Bartels & Zeki, 2000),
and neural firing rates (e.g., Hanes & Schall, 1996), as well as indirect measures
such as latent variables from structural equation models (Schmiedek et al., 2007).
Accuracies are also often treated as continuous variables when a large number
of observations have been collected or when we have calculated mean accuracy.
One property of a continuous variable is that, as long as we do not round our
observations (and thus turn it into a discrete variable), the probability of observ-
ing a specific value is effectively 0. That is, although we might record a latency
of 784.5 ms, for a fully continuous variable, it would always in theory be pos-
sible to examine this latency to another decimal place (784.52 ms), and another
(784.524 ms), and another (784.5244 ms). Accordingly, we need some way of
representing information about probabilities even though we cannot meaningfully
refer to the probabilities of individual outcomes.
There are two useful ways of representing probability distributions for con-
tinuous variables. The first is the cumulative distribution function (CDF; also called
the cumulative probability function and, confusingly, the probability distribution
function). An example CDF is shown in Figure 4.2, which gives a CDF predicted
by a popular model of response times called the ex-Gaussian. We will come back
to this model later and look at its insides; for the moment, we will treat the model
as a black box and simply note that when we feed a certain set of parameters into
the model, the predicted CDF shown in Figure 4.2 is produced. To give this some
context, let’s imagine we are modeling the time taken to make a decision, such as
deciding which of two faces on the computer screen is more attractive (e.g., Shi-
mojo, Simion, Shimojo, & Scheier, 2003). This decision latency t is measured as
the time in seconds between the appearance of the pair of faces and the keypress
indicating which face is judged more attractive. The abscissa gives our continuous
variable of time t; along the ordinate axis, we have the probability that a decision
latency x will fall below (or be equal to) time t; formally,
f (t) = P(x ≤ t). (4.3)
Note that the ordinate is a probability and so is constrained to lie between 0 and 1
(or equal to 0 or 1), consistent with our first axiom of probability.
Another representation of probability for continuous variables, and one crit-
ical to the likelihood framework, is the probability density function (PDF), or
simply probability density. Figure 4.3 plots the probability density function for
latencies predicted by the ex-Gaussian, using the same parameters as were used
to generate the CDF in Figure 4.2. The exact form of this density is not impor-
tant for the moment, except that it shows the positively skewed shape typically
associated with latencies in many tasks (see, e.g., Luce, 1986; Wixted & Rohrer,
1994). What is important is what we can read off from this function. Although it
[Figure 4.2 shows cumulative probability (0 to 1) as a function of decision latency, 0 to 10 s.]
Figure 4.2 An example cumulative distribution function (CDF). For a particular value
along the x-axis, the function gives the probability of observing a latency less than or
equal to that value.
might be tempting to try and interpret the y-axis directly as a probability (as in
Figure 4.1), we cannot: Because we are treating latency as a continuous dimen-
sion, there are effectively an infinite number of precise latency values along that
dimension, which consequently means that the probability of a particular latency
value is vanishingly small. Nonetheless, the height of the PDF can be interpreted
as the relative probability of observing each possible latency. Putting these two
things together, we can see why the function is called a probability density func-
tion. Although a particular point along the time dimension itself has no “width,”
we can calculate a probability by looking across a range of time values. That is, it
is meaningful to ask what the probability is of observing a latency between, say,
2 and 3 seconds. We can do this by calculating the area under the curve between
those two values. This gives the probability density function its name: It provides
a value for the height (density) of the function along the entire dimension of the
variable (in this case, time), and this density can then be turned into an area, and
thus a probability, by specifying the range for which the area should be calcu-
lated. As a real-world example, think about a cake. Although making a single cut
in a cake does not actually have any dimensionality (the cut cannot be eaten), if
we make two cuts, we can remove the area of cake sliced out by the cuts and
devour it. The height of the curve corresponds to the height of a cake: A taller
cake will give us more cake if we make two cuts spaced a specific distance
apart (e.g., 3 cm).

[Figure 4.3 shows probability density as a function of decision latency, 0 to 10 s.]
Figure 4.3 An example probability density function (PDF). See text for details.
Formally, the PDF is the derivative of the CDF (taken with respect to the
dependent variable, in this case t); that is, it gives the rate of change in the cumu-
lative probability as we move along the horizontal axis in Figure 4.2. To make
this more concrete, imagine Figure 4.2 plots the total distance covered in a 100-m
sprint (instead of probability) as a function of time; as time passes, the sprinter
will have covered more and more distance from the beginning of the sprint. In this
case, the PDF in Figure 4.3 would give the instantaneous velocity of the sprinter
at any point in time in the race. According to Figure 4.3, this would mean the
sprinter started off slowly, sped up to some peak velocity at around 1.5 seconds,
and then slowed down again. We can also flip this around: The CDF is obtained by
integrating (i.e., adding up) the PDF from the minimum possible value to the cur-
rent value. For example, the value of the CDF at a value of 3 seconds is obtained
by integrating the PDF from 0 to 3 seconds or, equivalently, working out the area
under the PDF between 0 and 3 seconds, which in turn gives us the probability of
observing a latency between 0 and 3 seconds; this is the area shaded in light gray
in Figure 4.3.
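As a hedged illustration of the relationship just described (not taken from the text), the following sketch checks numerically that the area under a density between two points matches the corresponding difference in the CDF. An exponential density from the Statistics Toolbox stands in for the ex-Gaussian, and the parameter values are arbitrary.

t = linspace(0, 3, 1000);                 % grid of latencies from 0 to 3 seconds
f = exppdf(t, 1);                         % density values on the grid (mean = 1 s)
areaUnderPDF = trapz(t, f);               % numerical integration of the PDF
probViaCDF = expcdf(3, 1) - expcdf(0, 1); % the same probability via the CDF
% areaUnderPDF and probViaCDF agree (up to numerical error): both give the
% probability of observing a latency between 0 and 3 seconds.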
Because the probability of an observation on a continuous variable is effec-
tively 0, the scale of the ordinate in the PDF is in some sense arbitrary. However,
an important constraint, needed to preserve the relationship between the CDF and
the PDF, is that the area under the PDF is equal to 1, just as probabilities are con-
strained to add up to 1. This means that if the scale of the measurement is changed
(e.g., we measure latencies in milliseconds rather than seconds), the values on the
ordinate of the PDF will also change, even if the function itself does not. Again,
this means that the scale in Figure 4.3 cannot be interpreted directly as a proba-
bility, but it does preserve relative relationships, such that more likely outcomes
will have higher values. We can also talk about the probability of recording a
particular observation with some error ε, such that the probability of recording
a latency of 784.52 ms is equal to the probability that a latency will fall in the
window 784.52 ms ± ε (Pawitan, 2001). This equates to measuring the area under
the density function that is cut off by the lower limit of 784.52 ms − ε and the
upper limit of 784.52 ms + ε.
Before moving on, let us reiterate what is shown in Figures 4.1 to 4.3. Each of
these figures shows the predictions of a model given a particular set of parameter
values. Because of the variability inherent in the model and in the sampling pro-
cess (i.e., the process of sampling participants from a population and data from
each participant; we tease these sources of variability apart later in this chapter),
the models’ predictions are spread across a range of possible outcomes: number of
items correct, or latency in seconds for the examples in Figure 4.1 and Figures 4.2
to 4.3, respectively. What the model does is to assign a probability (in the case of
discrete outcomes) or probability density (in the case of continuous outcomes) to
each possible outcome. This means that although the model effectively predicts
a number of different outcomes, it predicts that some outcomes are more likely
than others, which, as we will see next, will be critical when relating the model to
data we have actually observed.
distribution or density function given (a) a model and (b) a specific set of parame-
ters for that model. For a single data point y, the model M, and a vector of parame-
ter values θ , we will therefore refer to the probability or probability density for an
observed data point given the model and parameter values as f (y|θ , M), where f
is the probability mass function or probability density function.2 We will assume
for the rest of the chapter that we are reasoning with respect to a particular model
and will leave M out of the following equations, although you should read any
of those equations as being implicitly conditional on M. We will return to M in
Chapter 5, where we will look at comparing different mathematical or computa-
tional models on their account for a set of data.
Rather than considering all possible values of y, as in Figure 4.3, we are now
interested in the probability (discrete) or probability density (continuous variable)
for the data y we have actually observed. To illustrate, Figure 4.4 shows some
obtained data points, represented by stars, for the examples we have looked at so
far. In the top panel, we see a single data point, five out of eight items correct,
from a single participant in our serial recall experiment, along with a depiction
of reading off the probability of getting five items correct according to the model
(which is equal to .25). In practice, we do not determine this value graphically but
will feed our data y and parameters θ into the function f (y|θ , M) and obtain a
probability.
In the bottom panel of Figure 4.4, we see the case where we have obtained
six latencies in the attractiveness decision experiment considered earlier. Again,
a graphical depiction of the relationship between one of the data points and its
probability density is shown in this panel. When we have a number of data points
(as we usually will in psychology experiments), we can obtain a joint probability
or probability density for the data in the vector y by multiplying together the indi-
vidual probabilities or probability densities, under the assumption that the obser-
vations in y are independent:
f(y) = ∏_k f(yk|θ). (4.4)
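As a minimal sketch of Equation 4.4 (not from the text), the joint density of a set of independent observations is simply the product of their individual densities; here normpdf from the Statistics Toolbox stands in for an arbitrary density f, and the data and parameter values are hypothetical.

y = [1.2 0.8 1.5 1.1];        % hypothetical observed latencies (s)
f = normpdf(y, 1, 0.5);       % individual densities f(y_k | theta)
jointDensity = prod(f);       % joint density of the whole data vector y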
[Figure 4.4 has two panels: probability as a function of N correct (top) and probability density as a function of decision latency in seconds (bottom).]
Figure 4.4 Reading off the probability of discrete data (top panel) or the probability den-
sity for continuous data (bottom panel). The stars in each panel show example data points,
and the dashed lines describe what we are doing when we calculate a probability or prob-
ability density for some data.
keep the data and the model fixed and observe changes in likelihood values as
the parameter values change. That is, we get some measure of how likely each
possible parameter value is given our observed data. This is of obvious utility
when modeling data: We usually have collected some data and now wish to know
what the values of the parameters are.
To understand the relationship between probabilities and likelihoods and how
they differ, let’s look at the latencies in the attractiveness choice example again.
As discussed above, latency probability densities tend to be positively skewed.
One simple model of latencies that nicely captures the general shape of latency
distributions is the ex-Gaussian distribution. This model assumes that each latency
produced by a participant in any simple choice experiment can be broken down
into two independent components (see Figure 4.5). The first of these is the time
taken to make the decision, which is assumed to be distributed according to an
exponential function (left panel of Figure 4.5). The assumption of an exponen-
tial is made to capture the extended tail of latency distributions and because it
is naturally interpreted as reflecting the time of information processing in many
areas of psychology (e.g., Andrews & Heathcote, 2001; Balota, Yap, Cortese, &
Watson, 2008; Hohle, 1965; Luce, 1986). The second component is assumed to
be processes supplementary to the information processing of interest, including
the time to encode a stimulus and initiate motor movement (Luce, 1986). This
second component (middle panel of Figure 4.5) is assumed to be Gaussian (i.e., a
normal distribution) for convenience and because it follows from the assumption
that a number of different processes and stages contribute to this residual time (the
Gaussian shape then following from the central limit theorem). The ex-Gaussian
has three parameters: the mean μ and standard deviation σ of the Gaussian dis-
tribution and the parameter τ governing the rate of drop-off of the exponential
function.
Let’s consider the case where we have collected a single latency from a par-
ticipant and where we know the values of σ and τ but μ is unknown. This is just
for demonstration purposes; usually all the parameters will have unknown val-
ues, and we will certainly want to estimate those parameters from more than a
single observation. (Just like multiple regression, it is important that we have a
reasonable number of data points per free parameter in our model.)
The top panel of Figure 4.6 plots the probability density f (y|μ) as a function
of the single data point y and the single parameter μ; both axes are expressed
in units of seconds. Each contour in the figure is a probability density function,
plotting out the probability density function for a particular value of μ. We’ve only
plotted some of the infinite number of possible probability density functions (keep
in mind μ is a continuous parameter). As an illustration, a particular probability
density function f (y|μ = 2) is marked out as a gray line on the surface in the
[Figure 4.5 shows three probability density panels plotted against latency (s), 0 to 10.]
Figure 4.5 Building up the ex-Gaussian model. When independent samples are taken
from a Gaussian PDF (left panel) and an exponential PDF (middle panel) and summed, the
resulting values are distributed according to the ex-Gaussian PDF (right panel).
[Figure 4.6 has three panels: (a) a surface of f(y|μ) over μ (parameter, in seconds) and y (data, in seconds); (b) the density f(y|μ) as a function of y for μ = 2 seconds; (c) the likelihood L(μ|y) as a function of μ for y = 3 seconds.]
Figure 4.6 Distinguishing between probabilities and likelihoods. The top panel plots the
probability density as a function of the single ex-Gaussian model parameter μ and a single
data point y. Also shown are cross sections corresponding to a probability density (gray
line) and likelihood function (dark line), which are respectively shown in profile in the
middle and bottom panels. See text for further details.
[Figure 4.7 shows a surface of p(N correct | pcorrect) over pcorrect (0 to 1) and N correct (0 to 8).]
Figure 4.7 The probability of a data point under the binomial model, as a function of the
model parameter pcorrect and the data point N correct. The solid line shows the probability
mass function for a particular value of pcorrect, while each of the strips represents a
continuous likelihood function.
Each cut of this surface perpendicular to the pcorrect axis in Figure 4.7 gives
us the binomial probability mass function f(N correct | pcorrect) for a particular
value of pcorrect. As an example, the heavy line on the surface traces out the
probability mass function for pcorrect = 0.7; this function is identical to the one
in Figure 4.1. In contrast, each ribbon in Figure 4.7 traces out the likelihood func-
tion for each value of N correct, L(pcorrect | N correct); that is, N correct is fixed
and pcorrect is allowed to vary. The difference between the likelihood and the
probability functions is evident from their different characteristics; each probabil-
ity mass function in Figure 4.7 is composed of a series of steps (being a discrete
function), while the likelihood functions are smoothly varying (being continuous).
this shortly (in Section 4.4). For the moment, we first need to talk about a critical
step along the way: actually specifying a probability function.
model is being fit or compared and (b) the probability distribution predicted by
the model. The nature of the dependent variable tells us whether a probability
mass function or a PDF will be more appropriate, which depends on whether our
dependent variable is discrete or continuous. Second, the probability distribution
predicted by our model, together with the nature of the data, tells us whether we
can apply the model to the data directly or whether we need to introduce some
intermediate probability function first.
f(yk | μ, σ, τ) = (1/τ) exp( (μ − yk)/τ + σ²/(2τ²) ) Φ( (yk − μ)/σ − σ/τ ), (4.6)

where μ, σ, and τ are the three parameters of the probability density function; yk
is a single data point (response time); and Φ is the Gaussian cumulative distribu-
tion function.
Implementing this function in MATLAB is straightforward, as shown in List-
ing 4.1. The last factor in the listing (beginning with the multiplication by .5)
computes the cumulative normal from scratch by relying on the relation
Φ(x) = .5 [ 1 + erf( x/√2 ) ], (4.7)
where erf refers to the “error function” that is built into MATLAB.
To remind yourself what the ex-Gaussian PDF looks like, refer back to
Figure 4.6. Using Equation 4.6, we can calculate a probability density for each
of the data points in the data vector y directly from the values for μ, σ , and τ . In
this case, the probability density function is itself the model of behavior, and no
further assumptions are needed to relate the model to the data.
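The following is a minimal sketch of how Equation 4.6 can be implemented in MATLAB; it is a reconstruction rather than a verbatim copy of Listing 4.1, written to be consistent with the way exGaussPDF is called later in Listing 4.5, and it uses Equation 4.7 to obtain the Gaussian CDF from the built-in erf function.

function f = exGaussPDF(y, mu, sigma, tau)
% ex-Gaussian probability density (Equation 4.6) for the data in vector y
Phi = .5 .* (1 + erf((((y - mu)./sigma) - (sigma./tau)) ./ sqrt(2))); % Equation 4.7
f = (1./tau) .* exp(((mu - y)./tau) + (sigma.^2)./(2.*tau.^2)) .* Phi;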
where c is a scaling parameter. Based on the ηs given by Equation 4.8, the proba-
bility of recalling item j given the temporal probe i is given by
p(j|i) = ηij / Σ_k ηik, (4.9)
where k is an index for all possible recall candidates (i.e., list items, in the present
instance). You may recognize the preceding equations from our discussion of
GCM in Chapter 1. SIMPLE can be thought of as an extension of GCM to phe-
nomena in memory and perception, where the dimensions of the category space
are replaced by the temporal dimension.
Based on Equation 4.9, the probability of correctly recalling an item is obtained
by setting j equal to i; since the similarity of an item to itself, ηii, is equal to 1,4
this is given by

pcorrect(i) = 1 / Σ_k ηik. (4.10)
Listing 4.2 Code for the Basic SIMPLE Model of Serial Recall

function pcor = SIMPLEserial(c, presTime, recTime, J)
% c is the single parameter of SIMPLE
% presTime and recTime are the effective temporal
% separation of items at input and output
% J is the length of the list

pcor = zeros(1, J);
Ti = cumsum(repmat(presTime, 1, J));
Tr = Ti(end) + cumsum(repmat(recTime, 1, J));

for i = 1:J  % i indexes output + probe position
    M = log(Tr(i) - Ti);
    eta = exp(-c * abs(M(i) - M));
    pcor(i) = 1./sum(eta);
end
correct at position i using Equation 4.10, the vector pcor then being returned as
the output of the function.
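As a brief usage sketch (the parameter values are hypothetical, not from the text), the function can be called to produce a predicted serial position curve:

% predicted accuracy for a 6-item list with items 1 s apart at input and
% output and the distinctiveness parameter c set to 10
pcor = SIMPLEserial(10, 1, 1, 6);
plot(1:6, pcor, 'o-');
xlabel('Serial position'); ylabel('Predicted proportion correct');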
The output of the code in Listing 4.2 presents us with a problem when mod-
eling the accuracy serial position function: We have a predicted probability of
correct recall for the item at a particular serial position i, but we have not speci-
fied a full probability function. This is because SIMPLE predicts a single number,
a single probability correct, for each set of parameter values. We need some way
of assigning a probability (or probability density) to all possible outcomes; in this
case, this means all possible values of proportion correct will have some prob-
ability of occurrence predicted by SIMPLE. To solve this issue, it helps to step
back and consider why SIMPLE does not produce a range of possible outcomes,
whereas we know that if we ran this experiment on people that we would get a
range of different outcomes. The answer is that the variability in our data is due to
sampling variability: Although on average a person may recall half of the items
correctly, by chance that same person may remember 60% of items correctly on
one trial and only 40% on the next. An even clearer case is a fair coin: A coin is
not subject to fluctuations in motivation, blood sugar levels, and so on, but if we
repeatedly toss the coin a fixed number of times, we are not surprised to see that it
does not always give the same number of tails each time.
Does this sound familiar? It should. Look back to Figure 4.1 and the discus-
sion around that figure, where we discussed the same issue. The implication is that
it is possible for a model like SIMPLE to predict a specific probability of correct
recall, but for the observed proportion correct to vary from trial to trial because of
sampling variability, as on some trials a participant will correctly recall the item at
a particular serial position, and at other times will fail to recall the corresponding
item. If we have N trials, we will have NC(i) correct recalls and NF(i) failures to
recall at a particular serial position i, where NC(i) + NF(i) = N. (This assumes
there are no missing data; if there were, we would need to specify N for each
individual serial position, such that NC(i) + NF(i) = N(i).) As we discussed
in the context of Figure 4.1, this situation is formally identical to the case where
we flip a weighted coin N times and record the number of heads (correct recalls)
and tails (failures to recall). Given the coin has a probability pheads of coming up
heads, the probability distribution across all possible numbers of heads (out of N )
is given by the following binomial distribution:
p(k | pheads, N) = C(N, k) pheads^k (1 − pheads)^(N−k), (4.11)

where p(k) is the probability of observing exactly k heads out of N coin tosses,
and C(N, k) is the combinatorial function "N choose k," giving the total
number of ways in which k out of N tosses could come up heads (if this is unfa-
miliar, permutations and combinations are covered in most introductory books on
probability). Listing 4.3 gives MATLAB code corresponding to Equation 4.11.
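A minimal sketch of such a function is given below; the name binomPMF and its argument order are assumed from the call binomPMF(Nc(i), N, pcor) in Listing 4.4, so this may differ in detail from Listing 4.3 itself.

function p = binomPMF(k, N, pheads)
% binomial probability of observing exactly k successes in N trials (Equation 4.11)
p = nchoosek(N, k) .* pheads.^k .* (1 - pheads).^(N - k);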
Replacing the variables in Equation 4.11 with those from our serial recall
experiment, we get
p(NC(i) | pcorrect(i), N) = C(N, NC(i)) pcorrect(i)^NC(i) (1 − pcorrect(i))^(N−NC(i)). (4.12)
Given the probability of correctly recalling item i, pcorrect(i), and that the per-
son completed N trials, this gives us the probability of correctly recalling item
i on NC trials, out of a maximum of N. This function is plotted in Figure 4.1
for pcorrect(i) = .7 and N = 8. This means that we can take predicted propor-
tion correct pcorrect(i) from any model (in this case, SIMPLE) and obtain a full
probability mass function based only on the number of trials! We therefore turn
Listing 4.4 Code for Obtaining Predicted Probability Masses From SIMPLE

function pmf = SIMPLEserialBinoPMF(c, presTime, recTime, J, Nc, N)
% c is the parameter of SIMPLE
% presTime and recTime are the effective temporal
% separation of items at input and output
% J is the length of the list
% Nc (a vector) is the number of items correctly recalled at each position
% N is the number of trials at each position

pmf = zeros(1, J);
Ti = cumsum(repmat(presTime, 1, J));
Tr = Ti(end) + cumsum(repmat(recTime, 1, J));

for i = 1:J  % i indexes output + probe position
    M = log(Tr(i) - Ti);
    eta = exp(-c * abs(M(i) - M));
    pcor = 1./sum(eta);
    pmf(i) = binomPMF(Nc(i), N, pcor);
end
Note that we have two levels of function parameters in the above example. On
one hand, we have the model parameters of SIMPLE; in this basic version, that’s
the single parameter c in Equation 4.8.5 These are used to calculate pcorrect(i)
using the equations for SIMPLE presented above. Each pcorrect(i) is then used as
a parameter for the binomial distribution function in Equation 4.12. We are really
only interested in c as the model parameter, as it fully determines the binomial
density function via Equations 4.8 and 4.10, but we should be aware that we are
1. The probability pcorrect(i), the probability of correct recall predicted by
SIMPLE
2. The probability of correct recall in the data, obtained by dividing NC by N
[Figure 4.8 is a schematic with two routes: a model (e.g., the ex-Gaussian) leading directly to a predicted probability mass function or probability density function, and a model (e.g., SIMPLE) leading first to a data model and from there to the predicted probability mass function or probability density function.]
Figure 4.8 Different ways of predicting a probability function, depending on the nature
of the model and the dependent variable. On the left, the model parameters and the model
are together sufficient to predict a full probability function. This usually applies to the case
where the dependent variable is continuous and the model is explicitly developed to predict
probability functions (e.g., response time models). On the right, the model parameters and
the model predict some intermediate value(s), such as proportion correct. Together with
other assumptions about the sampling process, these intermediate values are used to specify
a full probability function via the data model.
Whenever working with models like this, it is important not to get these different
types of probabilities confused. To keep these probabilities distinct in your head,
it might help to think about where these different probabilities slot in Figure 4.8
(going from model parameters to a full probability function) and Figure 2.8 (relat-
ing the model to the data).
p(N | p, NT) = [NT! / (N1! N2! ... NJ!)] p1^N1 p2^N2 ... pJ^NJ, (4.13)
p(N | p, NT) = [NT! / (N!(NT − N)!)] p^N (1 − p)^(NT−N), (4.14)
where the variables and parameters from Equation 4.11 have been replaced with
N , p, and NT . You’ll notice that Equations 4.13 and 4.14 are similar in form;
in fact, the binomial distribution is simply the multinomial distribution obtained
when we have only two categories (e.g., heads vs. tails or correct vs. incorrect).
Equation 4.14 simplifies Equation 4.13 by taking advantage of the constraint that
the probabilities of the two outcomes must add up to 1: If we are correct with
probability p, then we must necessarily be incorrect with probability 1 − p.
Let’s consider how we might use the multinomial model to apply SIMPLE
to serial recall data in more detail. As above, for each output position i, we may
ask the probability of recalling each item j, 1 ≤ j ≤ J , where J is the length
of the list (i.e., we assume that participants only recall list items). The predicted
probability of making each response, p( j|i), is given by Equation 4.9, which in
turn depends on the values of the SIMPLE parameters via Equation 4.8. For a
particular probe (i.e., output position) i, we can then rewrite Equation 4.13 in the
language of SIMPLE:
p(N(i) | p(i), NT) = [NT! / (N(i)1! N(i)2! ... N(i)J!)] p(i)1^N(i)1 p(i)2^N(i)2 ... p(i)J^N(i)J, (4.15)
where NT is the number of trials (i.e., the total number of responses made at each
output position), N(i) is the vector containing the number of times each response
j was produced at the ith output position, and p(i) is the vector of probabilities
with which SIMPLE predicts each response j will be produced at the ith output
position. The end product is a single number, the probability p(N(i) | p(i), NT) of
observing the frequencies in N(i) given the predicted probabilities p(i) and the
total number of responses NT. We will defer providing code for this function until later in the
chapter, where we will see how we can simplify Equation 4.15 and make this job
easier.6
When should we use the multinomial distribution in preference to its simpler
sibling, the binomial distribution? This depends on the question we are asking.
If we are simply interested in fitting accuracies (e.g., the accuracy serial position
curve), then we can define our events of interest as correct and its complement
incorrect; in this case, the binomial will be sufficient to produce a likelihood.
except for that introduced by sampling variability within each participant. As dis-
cussed in Chapter 3, we would usually prefer to fit the data from individual partic-
ipants unless circumstances prohibited us from doing so. We return to maximum
likelihood estimation for multiple participants in Section 4.5.
Not only will we usually have multiple data points, but we will also usually
have multiple parameters. This doesn’t affect our likelihood calculations but does
mean that we should be clear about our conceptualization of such models. In
the case where we have multiple parameters, Figures 4.6 and 4.7 will incorpo-
rate a separate dimension for each parameter. As an example, let’s return to the
ex-Gaussian model that we covered earlier in the chapter and, in particular, panel
c in Figure 4.6; as a reminder, this plots out the likelihood of the ex-Gaussian
parameter μ for a fixed, single observation y = 3. Figure 4.9 develops this further
by plotting the joint likelihood for the data vector y = [3 4 4 4 4 5 5 6 6 7 8 9]
(all are response times in seconds from a single participant) as a function of two
parameters of the ex-Gaussian model, μ and τ ; the other ex-Gaussian parameter
σ is fixed to the value of 0.1. (We have no specific reason for fixing the value of σ
here except that adding it as an additional dimension to Figure 4.9 would give us a
four-dimensional figure, which is not easily conceptualized on the printed page!)
Together, μ and τ make the parameter vector θ, such that the value plotted along
the vertical axis is the joint likelihood L(θ|y), calculated using Equation 4.16.
Note the size of the units along this axis; because we have multiplied together a
number of likelihood values (as per Equation 4.16), we end up with very small
numbers, as they are in units of 10^−10. This surface is called a likelihood surface
(as is panel c in Figure 4.6) and plots the joint likelihood L(θ |y) as a function of
the model parameters. To show how this figure was generated, the code used to
generate Figure 4.9 (and an accompanying log-likelihood surface; see below) is
presented in Listing 4.5.
Listing 4.5 MATLAB Script for Generating Likelihood and Log-Likelihood Surfaces for
the μ and τ Parameters of the Ex-Gaussian Model

mMu = 5; mTau = 5; muN = 50; tauN = 50; % range and resolution of points along each dimension
mu = linspace(0, mMu, muN);
tau = linspace(0, mTau, tauN);

rt = [3 4 4 4 4 5 5 6 6 7 8 9];

i = 1;
lsurf = zeros(tauN, muN);
% nested loops across mu and tau
% calculate a joint likelihood for each parameter combination
for muloop = mu
    j = 1;
    for tauloop = tau
        lsurf(j, i) = prod(exGaussPDF(rt, muloop, .1, tauloop));
        lnLsurf(j, i) = sum(log(exGaussPDF(rt, muloop, .1, tauloop)));
        j = j + 1;
    end
    i = i + 1;
end

% likelihood surface
colormap(gray(1) + .1)
mesh(tau, mu, lsurf);
xlabel('\tau (s)');
ylabel('\mu (s)');
zlabel('L(y|\theta)');
xlim([0 mTau]);
ylim([0 mMu]);

figure
% log-likelihood surface
colormap(gray(1) + .1)
mesh(tau, mu, lnLsurf);
xlabel('\tau (s)');
ylabel('\mu (s)');
zlabel('ln L(y|\theta)');
xlim([0 mTau]);
ylim([0 mMu]);
[Figure 4.9 shows a surface of the joint likelihood L(θ|y), in units of 10^−10, over μ (s) and τ (s).]
Figure 4.9 The joint likelihood of the parameters of the ex-Gaussian given the data in
the vector y. The likelihood is shown as a function of two of the ex-Gaussian parameters,
μ and τ .
the highest point on the surface (e.g., Eliason, 1993). However, this would be an
exhaustive and inefficient strategy and would certainly be impractical when more
than a few free parameters need to be estimated. As discussed in Chapter 3, a
more practical method is to use an algorithm such as the Simplex algorithm of
Nelder and Mead (1965) to search the parameter space for the best-fitting param-
eters. Indeed, all the methods discussed in Chapter 3 apply directly to maximum
likelihood estimation.
One caveat on using the routines discussed in Chapter 3 is that they are geared
toward minimization, meaning that we will need to reverse the sign on the like-
lihood when returning that value to the optimization function. In fact, there are
a few other changes we can make to the likelihoods to follow convention and
to make our job of fitting the data easier. One convention usually adopted is to
measure log likelihoods rather than straightforward likelihoods, by taking the
natural log, ln, of the likelihood (e.g., the log function in MATLAB). There
are a number of reasons this makes estimation and communication easier. The
first is that many analytic models are exponential in nature. That is, many of
the probability densities we would wish to specify in psychology come from the
exponential family of probability distributions. These include probability mass
functions such as the binomial, the multinomial, and the Poisson and probability
density functions such as the exponential, the normal/Gaussian, the gamma, and
the Weibull. The log and the exponential have a special relationship, in that they
are inverse functions. That is, the log and the exponential cancel out each other:
ln (exp (x)) = exp (ln (x)) = x. One consequence is that any parts of a proba-
bility function that are encapsulated in an exponential function are unpacked; this
makes them easier to read and understand and can also have the pleasant result of
revealing a polynomial relationship between parameters of interest and the log-
likelihood, making minimization easier. The natural log is also useful for turning
products into sums:

ln( ∏_{k=1}^{K} f(k) ) = Σ_{k=1}^{K} ln( f(k) ). (4.17)
Similarly, the log turns division into subtraction. As well as being useful for sim-
plifying likelihood functions (as we’ll see shortly), this deals with a nasty prob-
lem: The likelihood for a large number of observations can sometimes go outside
the range of possible values that can be represented on a modern computer since
each extra observation multiplies the likelihood by a value usually much greater
or smaller than 1! The log acts to compress the values and keep them in reason-
able ranges. The log also makes combining information across observations or
participants easier since we can simply add the log-likelihoods from independent
observations or participants to obtain a joint log-likelihood (cf. Equation 4.16):
ln(L(θ|y)) = Σ_{k=1}^{K} ln(L(θ|yk)), (4.18)
where k might index observations (in order to obtain a sum across observations for
a single participant) or participants (in order to obtain a joint—that is, summed—
log-likelihood for all participants).
As an example of several of these advantages of dealing with log-likelihoods,
consider the normal distribution, the familiar bell-shaped probability density usu-
ally assumed as the distribution of responses in psychology:
p(y|μ, σ) = (1/√(2πσ²)) exp( −(y − μ)²/(2σ²) ). (4.19)
Taking this as our likelihood function L(μ, σ |y), we can obtain the following
log-likelihood function:
ln L(μ, σ|y) = ln(1) − ln(√(2πσ²)) − (y − μ)²/(2σ²). (4.20)
Whether attempting to solve this analytically or using an algorithm such as the
Simplex algorithm discussed in Chapter 3, expressing things in this manner makes
it easier to read the equation and see cases where we could cancel out unneeded
calculations. For example, the first term, ln(1), actually works out to be 0 and so
can be discarded. In addition, if we were not concerned with estimating σ and only
with estimating μ, the second term ln(√(2πσ²)) could also be removed. This is because
this term does not depend on μ and therefore acts as a constant in the equation.
If we knew the value of σ , this would make μ very easy to estimate since only
the third and final term would remain, where the log-likelihood is related to μ by
a simple quadratic function. This means that the value of μ that is best for the
final term alone is also best for the entire equation, because the first and second
terms act only as constants.
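A small numerical check of this point (hypothetical data, not from the text; normpdf requires the Statistics Toolbox): with σ treated as known, the full normal log-likelihood and the reduced term containing only −(y − μ)²/(2σ²) are maximized at the same value of μ.

y = [2.1 2.5 1.9 2.8 2.2];            % hypothetical observations
sigma = 0.5;                          % assume sigma is known
mu = linspace(1, 4, 301);             % candidate values of mu
full = zeros(size(mu)); reduced = zeros(size(mu));
for i = 1:numel(mu)
    full(i)    = sum(log(normpdf(y, mu(i), sigma)));  % full log-likelihood
    reduced(i) = sum(-(y - mu(i)).^2 ./ (2*sigma^2)); % final term only
end
[maxFull, iFull] = max(full); [maxRed, iRed] = max(reduced);
mu(iFull) == mu(iRed)                 % both peak at the same value of mu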
Similarly, taking the probability mass function for the multinomial (Equa-
tion 4.13) and turning it into a log-likelihood, we get
ln L(p|N) = ln(NT!) − Σ_{j=1}^{J} ln(Nj!) + Σ_{j=1}^{J} Nj ln(pj), (4.21)
where J refers to the number of categories into which the responses can fall. In
this case, the first two terms can be discarded. The term ln(NT!) depends only on
the number of observations and therefore acts as a constant for the log-likelihood.
Similarly, Σ_j ln(Nj!) depends only on our observed data (the number of obser-
vations falling into each category) and can also be treated as a constant (remem-
ber, the data are fixed and the parameters vary when talking about likelihoods).
Only the final term Σ_j Nj ln(pj) depends on the model parameters (via p) and
is therefore important for estimating the parameters. Listing 4.6 shows how this
simplified multinomial log-likelihood function can be incorporated into the SIM-
PLE code we’ve presented earlier. Note that the code produces a log-likelihood
value for each output position; to obtain a single joint log-likelihood, we would
need to sum the values returned in the lnL vector.
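As a minimal sketch of the reduced multinomial log-likelihood in Equation 4.21, keeping only the term that depends on the model parameters (an illustration, not necessarily identical to Listing 4.6):

function lnL = multinomLnL(N, p)
% N: observed response frequencies in each of the J categories
% p: predicted response probabilities for those categories (summing to 1)
% returns the parameter-dependent part of Equation 4.21
lnL = sum(N .* log(p));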
In this case, note that the log-likelihood values will be negative because p < 1,
meaning that log( p) < 0. Because we wish to maximize the log-likelihood, that
means we want the number to get closer to 0 (i.e., less negative), even though
that means its absolute value will thereby get smaller. Note, however, that it is
possible for values from a probability density function to lie above 1 depending
on the scale of the variable along the abscissa, meaning that our log-likelihood
values are also positive. Consider, for example, a uniform distribution with lower
and upper bounds of 0 and 0.1, respectively. For the area under the function to
be 1 (i.e., for the probability density to integrate to 1), we need the value of the
probability density to be a constant value of 10 across that range.
Notes
1. We make the simplifying assumption that accuracy of recall is independent; in prac-
tice, the recall of one item will modify the probabilities for next recall of all items (e.g.,
Schweickert, Chen, & Poirier, 1999).
2. f could also represent a CDF, but we will rarely refer to the CDF when working with
likelihoods.
3. We are using P to denote probabilities generally. In many cases, the functions will
be continuous probability densities, which we would usually denote f following the ter-
minology adopted in this book.
4. To see why, try setting the distance to 0 in Equation 4.8 and working out the
similarity.
5. The full version of SIMPLE also allows for the case where people may fail to recall
any item from the list, which is captured in SIMPLE by introducing a threshold function
similar to that we used in our model of the phonological loop in Chapter 2. We’ve omitted
discussion of that threshold here to simplify presentation of the model.
6. Alternatively, if you are desperate, the multinomial PMF is available as the function
mnpdf in the Statistics Toolbox.
7. This property of additivity does extend to the χ² statistic, which is closely related to
the deviance (−2 ln L) introduced earlier.
5
Parameter Uncertainty and Model Comparison
further problem of quantifying our confidence in that model being the best one.
(By “best,” we mean the model that best characterizes the psychological pro-
cesses that generated the data.) It turns out that maximized log-likelihood can be
used to quantify uncertainty about models and that this in turn leads to a natural
mechanism for making statistical and theoretical inferences from models and to
differentiate between multiple candidates.
This chapter gives an introduction to model selection and model inference in
the likelihood framework. We will discuss the uncertainty surrounding parame-
ter estimates and outline methods for quantifying this uncertainty and using it to
make inferences from parameter estimates. We will then step through the theoret-
ical basis of model selection1 and discuss its practical application in psychology.
By the end of this chapter, you should be equipped to interpret reports of parame-
ter uncertainty and model selection in psychology (including the use of informa-
tion criteria and information weights) and to apply this knowledge by following
the methods that we introduce.
contrast, did not differ significantly from 0. Thus, Ludwig et al. (2009) were able
to use the variability around the mean parameter estimates—which represented
the differences between conditions—to determine which process was modulated
by the recent history of saccades.
In general, we can submit individual ML parameter estimates to inferential
tests using standard statistical procedures such as the analysis of variance
(ANOVA). Consider again the ex-Gaussian model, which we discussed in
Chapter 4 as a model of response latency distributions. Balota et al. (2008) applied
the ex-Gaussian model to latency distributions involving semantic priming in
tasks such as lexical decision and word naming. Briefly, semantic priming refers
to the fact that processing of a word (e.g., doctor) is facilitated if it is preceded
by a related item (e.g., nurse) as compared to an unrelated item (e.g., bread).
Priming effects are highly diagnostic and can help reveal the structure of seman-
tic knowledge (e.g., Hutchison, 2003) and episodic memory (e.g., Lewandowsky,
1986).
Balota et al. (2008) were particularly interested in the effect of prime-target
relatedness on the time to name target words and the interaction of prime-target
relatedness with variables such as stimulus onset asynchrony (SOA; the time
between onset of the prime and onset of the target) and target degradation (rapidly
alternating the target word with a nonsense string of the same length; e.g.,
@#$&%). For each cell in their factorial experimental designs, Balota et al. (2008)
estimated values of μ, σ , and τ for individual participants using MLE. The μ, σ ,
and τ parameter estimates were then treated as dependent variables and entered
into an ANOVA with, for example, prime-target relatedness and target degrada-
tion as factors. Balota et al. found that, sometimes, the semantic priming effect
was linked to μ, indicating that primes gave related words a “head start,” leading
to a constant shift in the naming latency distribution. However, when the targets
were degraded, an additional pattern of results emerged because τ (and thus the
rate of information processing) then also varied across prime-target relatedness.
Thus, depending on the visual quality of the stimuli, priming either provided a
“head start” to processing or affected the information accumulation itself. Without
a descriptive model and interpretation of its parameter values, those conclusions
could not have been derived by analysis of the behavioral data alone.
Similar approaches have been used to make inferences from models of sig-
nal detection theory (Grider & Malmberg, 2008) and to make inferences from
computational models of decision-making deficits in patients with neuropsycho-
logical disorders such as Huntington’s and Parkinson’s disease (Busemeyer &
Stout, 2002; Yechiam, Busemeyer, Stout, & Bechara, 2005). As well as sim-
ply calculating the standard error around a mean parameter estimate, it can also
be instructive to plot a histogram of parameter estimates. For example, Farrell
[Figure 5.1 shows ln L as a function of τ (approximately 2.5 to 6).]
Figure 5.1 Two log-likelihood surfaces from the ex-Gaussian density, displayed with
respect to the τ parameter. Both surfaces have the same maximum log-likelihood at the
same value of τ (approximately 3.8); the surfaces differ in their curvature (the extent to
which they are peaked) at that maximum. Note that this is an artificially constructed exam-
ple: For a specific model, if we collect more data, both the extent of peaking and the
maximum log-likelihood will change.
A key step in using the likelihood function for inference lies in recognizing
that a more peaked likelihood function reflects greater confidence in the maximum
likelihood parameter estimate. A greater curvature indicates that as we move away
from the maximum likelihood estimate (i.e., away from the peak of the likelihood
function), the likelihood of those other parameter values given the data falls off
faster. For the purposes of inference, we can quantify this curvature using the
∂ ln L(θ|y)/∂μ,
[Figure 5.2 shows three columns of panels (for μ, σ, and τ) with the log-likelihood, its first derivative, and its second derivative in successive rows.]
Figure 5.2 Log-likelihood (top row) surfaces for the μ (left column), σ (middle column),
and τ (right column) parameters of the ex-Gaussian model. In the middle row, the first
derivatives of the log-likelihood functions are plotted (also called score functions), and in
the bottom panel, the second partial derivatives are plotted from the diagonal of the Hessian
matrix (see text for details).
(which is itself a function) with respect to τ , we end up with the second partial
derivative:
∂² ln L(θ|y)/∂τ∂μ.
The second partial derivatives are represented by a matrix called the Hessian
matrix, where each element (i, j) in the matrix gives the partial derivative,
with respect to i, of the partial derivative function of the log-likelihood with
respect to j:
∂² ln L(θ|y)/∂i∂j,
where i and j point to particular parameters in the parameter vector θ (read that
sentence a few times to get it clear in your head). One property of second partial
derivatives is that they are generally invariant to the order of application of the
derivatives: Taking the derivative of function f with respect to parameter x and
then taking the derivative of that derivative function with respect to y gives the
same result as taking the derivative of f with respect to y and then taking the
derivative of that function with respect to x. That is,
∂²f/∂i∂j = ∂²f/∂j∂i,
which means that the Hessian matrix is symmetric around the main diagonal. In
fact, the main diagonal is arguably the most interesting part of the Hessian matrix:
It contains the second partial derivatives obtained by taking the derivative of the
log-likelihood function twice with respect to the same parameter. Accordingly,
these entries in the Hessian matrix tell us how quickly the likelihood surface is
changing as each parameter is changing: the curvature of the log-likelihood func-
tions. The bottom row of Figure 5.2 plots these second partial derivatives for the
parameters of the ex-Gaussian model. Notice that the function values are con-
sistently negative and indicate a consistent downward accelerative force on the
log-likelihood functions in the top row. This means that the log-likelihood func-
tion is concave (has an upside-down U shape), which in turn means that we will
be able to find a maximum for this function.
log-likelihood function along that variable or pair of variables. In fact, Fisher sug-
gested the curvature as a measure of the information in the parameter estimate;
a more peaked function means that the maximum likelihood parameter estimate
gives us more information about where the “true” parameter value actually lies. In
likelihood theory, Fisher information measures the variance of the score (the first
derivative) and turns out to be equal to the negative of the Hessian matrix of the
log-likelihood. The observed Fisher information matrix (or simply the observed
information matrix) is the information matrix calculated at the maximum likeli-
hood parameter estimate (e.g., Edwards, 1992; Pawitan, 2001).
The Hessian and information matrices may seem like fairly abstract concepts,
but they are really just tools to get what we really want: an estimate of variabil-
ity around the parameters. From the Hessian matrix, we can calculate a standard
error, and thus a confidence interval, on our parameter estimate. The fundamen-
tal requirement is that our log-likelihood surface is approximately quadratic in
shape around the maximum likelihood estimate. Why this seemingly arbitrary
assumption? This is because the log-likelihood function for the mean of the nor-
mal distribution is exactly quadratic in shape. To confirm this, have a look back at
Equation 4.20 in Chapter 4, which gives the log-likelihood function for the nor-
mal distribution. Recall that the first two terms in this equation act as constants.
This leaves the third term
(y − μ)2
,
2σ 2
which gives a quadratic relationship between μ and the log-likelihood. If we were
to take the exponential of this function (to “undo” the log and revert to the raw
likelihood), we would obtain something like Equation 4.19, which is the formula
for the probability density function (PDF) of the normal distribution.2 Further-
more, if we take the derivative of the normal log-likelihood twice with respect
to μ, we find this is equal to −N/σ² (e.g., Eliason, 1993). Since we know that
the standard error of the mean of a sample is equal to σ/√N and thus that the
variance in our estimate of the mean (the square of the standard error) is equal to
σ 2 /N , then there is a nice, simple relationship between the variance in the esti-
mate of μ and the second derivative: Each can be obtained by taking the inverse
of the negative of the other!
In practice, this means that once we have our Hessian matrix for any model
(with an approximately quadratic log-likelihood function), we can obtain a covari-
ance matrix by simply taking the inverse of the Hessian matrix if we have mini-
mized the negative log-likelihood or by taking the inverse of the negative
Hessian matrix if we have maximized the log-likelihood (in other words, watch
the sign!).3 We can then read the values of the diagonal of the covariance matrix to
get the variances around individual parameters and take the square root of these to
obtain standard errors. Note that when we have estimated the value of more than
one parameter (i.e., the Hessian really is a matrix and not just a single number),
we need to take the matrix inverse and not simply take the inverse of individual
elements in the matrix.
How do we know if our log-likelihood function is quadratic? It turns out that
one property of likelihoods is that the likelihood function is asymptotically nor-
mal in shape (i.e., with a large sample; Pawitan, 2001). This means that if we have
a reasonable number of data points, we can make the assumption that our log-
likelihood function is approximately quadratic and obtain reasonable estimates of
parameter variability. Conversely, in cases where the log-likelihood function is
not quadratic in shape, obtaining a covariance matrix from the Hessian may give
unreliable estimates (see, e.g., Pawitan, 2001; Riefer & Batchelder, 1988; Visser,
Raijmakers, & Molenaar, 2000). What sample size counts as being sufficient for
assuming asymptotic normality will vary between models and applications. Also
note that it is sufficient to assume that our likelihood function is approximately
normal, which will then give us approximate confidence intervals. Figure 5.2
shows a case where the quadratic assumption does not exactly hold; although
the log-likelihood functions in the top row look fairly close to being quadratic, the
second derivatives in the bottom panel are not constant, as they would be if the
quadratic assumption was exactly met. Nonetheless, the sample size is probably
sufficient. For a discussion of the role of sample size in the context of multino-
mial tree models and an empirical demonstration about the consequences of small
sample sizes, see Riefer and Batchelder (1988).
∂² ln L(θ|y)/∂i∂j ≈ C/(4δ²), (5.1)

where

C = ln L(θ + ei + ej | y) − ln L(θ + ei − ej | y)
  − ln L(θ − ei + ej | y) + ln L(θ − ei − ej | y). (5.2)
In the equations, ei and ej are vectors of the same size as the vector θ, with the
ith and jth elements respectively set to δ and the remaining elements set to 0 (e.g.,
Abramowitz & Stegun, 1972; Huber, 2006). The scalar δ controls the step size
used to calculate the approximation and should be set to some small value (e.g.,
10^−3). Alternatively, the step size may be adjusted to scale with the size of each
element of θ , a technique used by the function mlecov in MATLAB’s Statistics
Toolbox, which directly returns the covariance matrix.
Some MATLAB code implementing Equations 5.1 and 5.2 is given in
Listing 5.2. For MATLAB code implementing a more refined set of calculations,
see Morgan (2000).
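As a minimal sketch of the finite-difference approximation in Equations 5.1 and 5.2 (an illustration rather than a verbatim copy of Listing 5.2), with the signature hessian(fhandle, theta, delta, data) assumed from the call in Listing 5.3:

function H = hessian(fhandle, theta, delta, data)
% fhandle returns a (negative) log-likelihood given a parameter vector and
% the data; theta is the parameter vector at which the Hessian is evaluated
np = numel(theta);
H = zeros(np);
for i = 1:np
    for j = 1:np
        ei = zeros(size(theta)); ei(i) = delta;  % step along parameter i
        ej = zeros(size(theta)); ej(j) = delta;  % step along parameter j
        C = fhandle(theta + ei + ej, data) - fhandle(theta + ei - ej, data) ...
            - fhandle(theta - ei + ej, data) + fhandle(theta - ei - ej, data);
        H(i, j) = C ./ (4 * delta.^2);           % Equation 5.1
    end
end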
(e.g., Ratcliff & Murdock, 1976; Rohrer, 2002), the ROUSE model of short-term
priming (Huber, 2006), hidden Markov models (Visser et al., 2000), and multi-
nomial tree models (e.g., Bishara & Payne, 2008; Riefer & Batchelder, 1988).
To see the use of the Hessian in practice, let’s continue with our old friend, the
ex-Gaussian model. MATLAB code for this example is shown in Listing 5.3; the
line numbers in the following refer to that listing. Imagine that we have collected
a sample of latency observations from a single participant in a single condition in
a word-naming task (e.g., Balota et al., 2008; Spieler, Balota, & Faust, 2000). The
top panel of Figure 5.3 shows histograms for two representative sets of data; here,
we have simply generated 100 (top plot) or 500 (bottom plot of top panel) latency
values from an ex-Gaussian distribution with known parameter values μ = 500,
σ = 65, and τ = 100 (in units of milliseconds), thus simulating two hypotheti-
cal subjects who participated in 100 or 500 trials, respectively (line 7). Using the
methods discussed in Chapters 3 and 4, we minimize − ln L (note again the sign;
we maximize the log-likelihood by minimizing its negative) and obtain the max-
imum likelihood parameter estimates of μ = 503.99, σ = 56.59, and τ = 95.73
for N = 100 and μ = 496.50, σ = 61.11, and τ = 107.95 for N = 500 (this
is line 13 in Listing 5.3). The question now is: What is the variability on these
estimates? To determine this, we first obtain the Hessian matrix for the maximum
likelihood parameter vector, using Equations 5.1 and 5.2 (line 16):
         μ         σ         τ
μ   [  0.0128   −0.006    0.0064 ]
σ   [ −0.006     0.0152   0.0020 ]
τ   [  0.0064    0.0020   0.0087 ]
The positive numbers along the diagonal tell us that the surface is curving upwards
as we move away from the ML parameter estimates along each of the param-
eter dimensions, confirming that we have indeed minimized the negative log-
likelihood (note that this is different from Figure 5.2, which shows the plots
for log-likelihood, not negative log-likelihood). We now take the inverse of this
matrix (e.g., using the inv command in MATLAB; line 17 in Listing 5.3)4 to
obtain the covariance matrix:
         μ         σ         τ
μ   [  216.18    103.32   −184.2  ]
σ   [  103.32    116.95   −103.3  ]
τ   [ −184.2    −103.3     275.80 ]
The off-diagonal elements tell us about the covariance between each of the param-
eters. We can see that μ and σ are positively related (covariance = 103.32) and
that μ and τ are negatively correlated (covariance = −184.2). This accords with
162 Computational Modeling in Cognition
the simulation results of Schmiedek et al. (2007), who found that MLEs of μ and
σ tended to be positively correlated and those for μ and τ tend to be negatively
correlated. As noted by Schmiedek et al., this set of correlations is partly due to
trade-offs between parameters: μ and τ both affect the mean of the distribution
and will not be independent, and since σ and τ both affect the variance, an indirect
relationship is introduced between μ and σ .
The cells along the main diagonal of the covariance matrix tell us about the
variance of the parameter estimates and can be used to obtain standard errors
(ignoring the parameter covariance) by taking the square root of these numbers:
This gives us standard errors on μ, σ , and τ of 14.70, 10.81, and 16.61, respec-
tively. If we multiply these standard errors by 1.96 (the .975 quantile of the nor-
mal distribution, to give a 95% confidence limit at either end; a brief review
of 95% confidence intervals is in order if this doesn’t make sense to you), this
gives us confidence limits on μ, σ , and τ ; these are plotted in Figure 5.3 for
the two sample sizes, with the parameter values used to generate the data shown
as horizontal lines. As we’d expect, having more data means our standard
errors are smaller, and we have less uncertainty about the “true” values of the
parameters.
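To make the arithmetic concrete, the following fragment (a sketch, assuming the function in Listing 5.3 below has been run and has returned the ML estimates in x and the covariance matrix in cov) computes the standard errors and the corresponding 95% confidence limits:

    [x, fVal, hess, cov] = myHessianExample;   % from Listing 5.3
    se = sqrt(diag(cov))';                     % standard errors for mu, sigma, and tau
    ci = [x - 1.96.*se; x + 1.96.*se];         % lower and upper 95% confidence limits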
Listing 5.3 An Example of Estimating Parameters and Calling the Hessian Function to
Obtain a Hessian Matrix and From That a Covariance Matrix
1   function [x, fVal, hess, cov] = myHessianExample
2
3   rand('seed', 151513);
4   randn('seed', 151513);
5
6   N = 100;
7   y = normrnd(500, 65, [N 1]) + exprnd(100, [N 1]);
8
9   % find MLEs for ex-Gaussian parameters
10  % use inline specification of function to be minimized
11  % so we can pass parameters and data
12
13  [x, fVal] = fminsearch(@(x) exGausslnL(x, y), [500 65 100])
14
15  % find Hessian for MLEs
16  hess = hessian(@exGausslnL, x, 10^-3, y);
17  cov = inv(hess);
18
19  end
20
21  function fval = exGausslnL(theta, y)
22
23  mu = theta(1);
24  sigma = theta(2);
25  tau = theta(3);
26
27  fval = log(1./tau) + ...
28      (((mu-y)./tau) + ((sigma.^2)./(2.*tau.^2))) ...
29      + log(.5) ...
30      + log(1 + erf((((y-mu)./sigma) - (sigma./tau))./sqrt(2)));
31  fval = -sum(fval);  % turn into a summed negative log-likelihood for minimization
32  end
5.1.3 Bootstrapping
A final method to obtain confidence limits is by using resampling procedures such
as the bootstrap (e.g., Efron & Gong, 1983; Efron & Tibshirani, 1994). Bootstrap-
ping allows us to construct a sampling distribution for our statistic of interest (in
our case, model parameters) by repeatedly sampling from the model or from the
data. For the purposes of constructing confidence limits on model parameters, a
favored method is parametric resampling, where samples are repeatedly drawn
from the model. Specifically, we generate T samples by running T simulations
from the model using the ML parameter estimates. Each generated sample should
contain N data points, where N is the number of data points in the original sam-
ple. We then fit the model to each of the T generated samples. The variability
across the T samples in the parameter estimates then gives us some idea about the
variability in the parameters. The process for generating the bootstrap parameter
estimates is depicted graphically in Figure 5.4.
Let’s go through an example of using bootstrapping to construct confidence
limits on parameter estimates. We will take a break from the ex-Gaussian and
return to a model we examined in Chapter 4, the SIMPLE model (G. D. A. Brown
et al., 2007). We will examine SIMPLE’s account of people’s performance in
the free recall task, a standard episodic memory task in which participants are
asked to recall the items (usually words) from a list in any order they choose (in
contrast to the serial recall we examined in the last chapter in which output order
was prescribed). A common way of examining free recall performance is to plot
proportion correct for each item in the list according to its serial position. An
example of such a serial position function is shown in Figure 5.5, which plots
(using crosses) the proportion correct by serial position for a single participant in
one of the conditions from an experiment conducted by Murdock (1962).5 Also
shown in the figure is the ML fit of SIMPLE to the data. The model has been
adapted slightly from the earlier application to serial recall in order to reflect
[Figure 5.3 here: panel (a), frequency histograms of latency (ms) for the two samples; panel (b), ML estimates of μ, σ, and τ plotted against number of observations.]
Figure 5.3 Top panel: Histograms for two samples from the ex-Gaussian function. The
top histogram summarizes a smaller sample (N = 100), and the bottom histogram plots
frequencies for a larger sample (N = 500). Bottom panel: ML parameter estimates for the
ex-Gaussian parameters μ, σ , and τ for the two samples, along with their 95% confidence
intervals calculated from the log-likelihood curvature. The solid line in each plot shows the
true parameter value (i.e., the value used to generate the data).
Figure 5.4 The process of obtaining parameter estimates for bootstrap samples. Working
from top to bottom, we first fit our model to the original data (in vector y) to obtain maximum
likelihood parameter estimates (in vector θ̂). These parameter estimates are then fed
into the model to simulate new bootstrap data sets (y_k^b) of the same size as y. The model
is then fit to each of these bootstrapped data sets to obtain bootstrap parameter estimates
θ̂_k^b, such that each bootstrapped data set provides a vector of parameter estimates.
Figure 5.5 Proportion correct by serial position for a single participant from a free recall
experiment of Murdock (1962). Crosses show the participant’s data; the line shows the ML
predictions of the SIMPLE model.
Figure 5.6 Histograms of parameter estimates obtained by the bootstrap procedure, where
data are generated from the model and the model is fit to the generated bootstrap samples.
From left to right, the histograms correspond to the c, t, and s parameters from the SIMPLE
model. Also shown are the 95% confidence limits obtained by calculating the .025 and .975
quantiles of the plotted distributions (dashed lines).
The resulting distributions of bootstrapped parameter estimates are shown in Figure 5.6. To find our confidence limits on each parameter, we need
to find those values of the parameters that cut off 2.5% of the scores at either end
of each sampling distribution. That is, we need to find the .025 and .975 quantiles
of the data, to give a total of .05 cut off at the ends corresponding to our α level
of .05. To do this, we can either use the quantile function in the MATLAB
Statistics Toolbox or determine these manually by ordering the values for each
parameter and finding the score that cuts off the bottom 25 scores (obtained by
averaging the 25th and 26th scores) and the score that cuts off the top 25 scores.
When we do so, we find 95% confidence limits around c = 20.39 of 16.73 and
24.50; for t = 0.64, the limits are 0.57 and 0.72; and for s = 10.25, the limits
are 9.35 and 11.78. These quantiles are marked off in Figure 5.6 by the dashed
vertical lines.
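For concreteness, the following lines sketch both routes to the quantiles; they assume that the bootstrapped estimates for one parameter are held in a vector bootEst of length 1,000 (the number of bootstrap samples assumed here).

    sorted = sort(bootEst);          % order the bootstrap estimates
    lower  = mean(sorted(25:26));    % score cutting off the bottom 25 values (.025 quantile)
    upper  = mean(sorted(975:976));  % score cutting off the top 25 values (.975 quantile)
    % alternatively, with the Statistics Toolbox:
    % limits = quantile(bootEst, [.025 .975]);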
Of course, this procedure isn’t specific to the SIMPLE model. If we wanted to
generate new samples from the ex-Gaussian, for example, this would be straight-
forward: As per the assumptions of the ex-Gaussian, each observation would be a
sample from a normal distribution with parameters μ and σ , plus an independent
sample from an exponential distribution with parameter τ . As mentioned earlier,
Schmiedek et al. (2007) adopted such an approach to examine the correlations
between ex-Gaussian parameter estimates, although they did not use MLEs from
fits to participants’ data to generate the bootstrap samples.
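A minimal sketch of such a parametric bootstrap is shown below; it assumes that the ML estimates are held in a vector xMLE = [mu sigma tau], that exGausslnL from Listing 5.3 is available, and that T = 1,000 bootstrap samples of N = 100 observations are generated (both values are arbitrary choices for illustration).

    T = 1000; N = 100;
    bootPars = zeros(T, 3);
    for k = 1:T
        % generate a bootstrap sample from the ex-Gaussian at the ML estimates
        yb = normrnd(xMLE(1), xMLE(2), [N 1]) + exprnd(xMLE(3), [N 1]);
        % refit the model to the bootstrap sample
        bootPars(k,:) = fminsearch(@(x) exGausslnL(x, yb), xMLE);
    end
    bootCI = prctile(bootPars, [2.5 97.5]);   % 95% confidence limits for mu, sigma, and tau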
One thing you might be thinking right now (besides “gosh I’m exhausted!”) is
that there doesn’t seem to be anything about the bootstrap procedure that dictates
using it in the ML framework. This is indeed the case. The bootstrap procedure
just outlined can equally be used using any measure of discrepancy, including
RMSD and χ 2 (see Chapter 2). Indeed, the fact that very few assumptions need to
be made when carrying out the bootstrap procedure means it is a popular method
for obtaining sampling distributions and confidence limits from models across the
sciences.
Listing 5.5 Function to Fit SIMPLE to Free Recall Data and Obtain Bootstrapped Confi-
dence Intervals
Interpretation of the standard errors and confidence limits presented in this section
depends on the method we use and the data entering into the likelihood function.
The methods give us some estimate of variability on a parameter estimate: If we
ran the same experiment again, where might we expect the parameter estimate
for that new data set to lie? However, we must be careful in our definition of
“the same experiment.” If we calculate a standard error by looking at the variabil-
ity in the estimated parameter across participants, we are asking what reasonable
range of parameter values we would expect if the experiment were run on a dif-
ferent set of participants; this is the variability we are usually interested in, as
we will be concerned with making inferences from a sample of participants to
the population. However, calculating standard errors using the Hessian matrix or
bootstrapping does not necessarily provide us with this information. Specifically,
if we calculate a standard error on the basis of a single participant’s data, we
cannot estimate the variability between participants that is integral to inferential
statistics. In this case, the standard errors only tell us where we would expect the
parameter value to lie if we ran the experiment on the same (one) participant.
If the bootstrapping or likelihood curvature procedure is performed on the data
aggregated across participants, then we can legitimately use the standard error as
an inferential tool.
One caution is that producing confidence limits as well as parameter esti-
mates can lead us into a false sense of understanding. Not only do we have “best”
estimates, we also have some idea about where the true parameters might reason-
ably lie—isn’t that great? Remember, though, that these estimates and confidence
intervals (CIs) are conditional on the specific model being examined and don’t
provide any guarantees about the appropriateness of our model. Very wrong mod-
els can produce convincing-looking ML parameter estimates and CIs, although
they fundamentally miss important features of the data. In the case of the ex-
Gaussian above, it would be straightforward to fit a normal distribution to data
generated from an ex-Gaussian process with a large τ and obtain estimates and
CIs for μ and σ , although the normal distribution would miss the large positive
skew characterizing models like the ex-Gaussian. The remainder of this chapter
addresses this very issue: How confident are we that Model X is the best model
for our data?
So far, all of our inferences have been conditional on the observed data and the specific model whose parameters we are estimating
(see Section 4.2). This would be fine if we had absolute certainty that our model
of choice really is a close approximation to the actual process that generated the
data of interest. Although the proponents of a particular model may be more than
enthusiastic about its merits, there is yet another level of uncertainty in our reason-
ing with models—namely, uncertainty about the models themselves. This uncer-
tainty lies at the central question we often ask as theorists in psychology: Which of
a number of given candidate models lies closest to the true underlying processes
that actually generated the observed data?
In some cases, the superiority of one model over the other is apparent even
from visual inspection of the predictions of the models. For example, Figure 5.7
shows the fits of the Generalized Context Model (GCM) and another model of cat-
egorization, General Recognition Theory (GRT; e.g., Ashby & Townsend, 1986),
to the categorization responses from six participants examined by Rouder and
Ratcliff (2004). In each panel, observed categorization probabilities from four
different conditions (labeled stimulus in the panels) are shown along with the pre-
dictions of GCM (dashed lines) and GRT (solid lines). Although the evidence is
a little ambiguous to the eye in some cases, for three of the participants (1, 4, and
5), the superior fit of GCM is clear from visual inspection.
An even clearer example of model comparison by eye comes from the paper
by Lewandowsky et al. (2009) discussed in Chapter 2. Recall that Lewandowsky
et al. compared the predictions from a Bayesian model without any parameters
to the data from participants who had to estimate variables such as the length of
reign of a pharaoh. Figure 2.6 shows that the model does a very convincing job of
predicting the data. Actually, Lewandowsky et al. also tested another model, the
MinK model, which assumes that participants make these judgments not by inte-
grating across the entire distribution of possibilities (as in the Bayesian approach)
but on the basis of only a few instances in memory. Although these models sound
like they might be difficult to clearly distinguish (as a few samples from a dis-
tribution already provide a fair bit of information about the entire distribution,
particularly the average of that distribution), Lewandowsky et al. found that the
MinK model did a poor job of accounting for their data. Figure 5.8 shows that the MinK
model deviates substantially from the data, in a way that is apparent just from
visual inspection.
Although the predictions of different models often diverge visibly, in other
cases, the distinction may not be so clear. Consider Participants 2, 3, and 6 in
Figure 5.7: Although it looks like GCM more accurately predicts the data, it has
only a slight edge over the GRT model. Would we be confident in these cases in
saying that these fits provide evidence for the GCM? For another example, take
a look back at Figure 1.5. Recall that the data purport to show the ubiquitous
“power law of practice.” However, as can be seen in that figure, there is actually a competing exponential function that describes those data nearly as well, making it difficult to choose between the two candidates by visual inspection alone.
[Figure 5.7 here: six panels (Participants 1–6) plotting response proportion for Category A against stimulus conditions I–IV.]
Figure 5.7 Fits of the GCM (dashed lines) and the GRT (solid lines) to data from four
probability conditions in Experiment 3 of Rouder and Ratcliff (2004). Figure reprinted
from Rouder, J. N., & Ratcliff, R. (2004). Comparing categorization models. Journal of
Experimental Psychology: General, 133, 63–82. Published by the American Psychological
Association; reprinted with permission.
[Figure 5.8 here: MinK model quantiles (ordinate) plotted against obtained quantiles (abscissa) for the pharaohs question.]
Figure 5.8 Snapshot of results from the MinK model, for comparison with Figure 2.6. The
ordinate plots quantiles of the predicted distribution, and the abscissa plots the obtained
quantiles. The different levels of gray correspond to different values for K : light gray,
K = 2; medium gray, K = 5; black, K = 10. Figure from Lewandowsky, S., Griffiths,
T. L., & Kalish, M. L. (2009). The wisdom of individuals: Exploring people’s knowledge
about everyday events using iterated learning. Cognitive Science, 33, 969–998. Copyright
by the Cognitive Science Society; reprinted with permission.
Considerable research has examined whether successive responses in a task are independent of one another or whether they exhibit sequential dependencies (e.g., Fecteau & Munoz, 2003; Gilden, 2001; M. Jones, Love, & Maddox, 2006; Lewandowsky & Oberauer, 2009; Wagenmakers, Farrell, & Ratcliff, 2004). One
question we might ask in this vein is whether a fast response tends to be followed
by another fast response (Laming, 1979). Specifically, we will assume that there
is some dependency between trials in a time estimation task (e.g., Gilden, 2001;
Laming, 1979; Wagenmakers, Farrell, & Ratcliff, 2004) and ask more specifically
whether this form of memory (or multiple forms; e.g., Wagenmakers, Farrell, &
Ratcliff, 2004) extends beyond the previous trial.
For the time estimation task (repeatedly estimating 1-second intervals by press-
ing a key when 1 second has passed), there is no variation in the stimulus across
trials (or indeed there is no stimulus at all; Gilden, 2001). We are asking a very
simple question: How far back does participants’ memory for their own responses
go, such that those memories have effects on time estimation on the current trial?
To answer this question about the range of dependence in this task, we will
look at two ARMA (Auto-Regressive Moving Average) models. ARMA models combine an autoregressive (AR) component, in which each observation depends on a decayed copy of the preceding observation(s), with a moving average (MA) component, in which each observation also reflects the random noise added on the preceding trial(s); ARMA(p,q) denotes a model with p autoregressive and q moving average terms.
[Figure 5.9 here: three panels (Data, ARMA(1,1), ARMA(2,2)), each plotting latency (ms) against trial number.]
Figure 5.9 Three time series of estimation times in an experiment requiring participants
to repeatedly press a key after 1 second has passed (Wagenmakers et al., 2004). Top panel:
A series of 750 observations from a single participant. Middle panel: The trial-by-trial
predictions of the ARMA(1,1) model under ML parameter estimates. Bottom panel: The
predictions of the more complex ARMA(2,2) model.
Figure 5.10 Autocorrelation functions for the data (top panel) and the two ARMA models
(bottom panel) displayed in Figure 5.9. Each function shows the correlation between trials
separated by the number of trials along the Lag axis; Lag = 0 corresponds to correlating
the series with an unshifted version of itself and therefore gives a correlation of 1. The
top panel also shows estimated confidence limits around the null autocorrelation value of
0 (corresponding to an uncorrelated series).
You might find it reassuring to know that the general principles of model
selection aren’t so different from those you’ve probably already encountered in
statistics. Researchers (and graduate students in statistics classes) often confront
the problem of deciding which combination of predictors to include in a multiple
regression model. In the case of psychological models—such as those we’ve cov-
ered in the last few chapters—the models will be more heterogeneous than differ-
ent multiple regression models, but the general problem remains the same. Indeed,
the methods we will discuss below apply equally to models developed for the pur-
pose of data description, process characterization, and process explanation (see
Chapter 1). Moreover, many statistical methods (e.g., logistic regression, multi-
level regression, and generalized linear modeling) rely on the likelihood tools we
develop in the following.
For two nested models, the test statistic is
$$
\chi^2 = -2 \ln L_{specific} - \left(-2 \ln L_{general}\right),
$$
where general refers to the general version of the model and specific to the restricted version with some parameters fixed, and K is the number of parameters by which the two versions differ. We can then compare this obtained χ² to the critical value on the χ² distribution with K degrees of freedom given our α
level (which will usually be .05). This is called the likelihood ratio test, as we
are examining whether the increased likelihood for the more complex model (i.e.,
the smaller −2 ln L; due to the relationship between the logarithm and the expo-
nential, a ratio in likelihoods translates into a difference in log-likelihoods) is
merited by its extra parameters. You might recognize this test from Chapter 2,
where we presented the G 2 statistic as a measure of discrepancy between a
model and a set of discrete data and noted that it too is asymptotically distributed
as χ 2 .
As an example of the application of the likelihood ratio test (LRT), the −2 ln L
for the ARMA(1,1) and ARMA(2,2) models for the data shown in Figure 5.9 is
10872.08 and 10863.2, respectively. The difference of 8.88 is significant when
compared to the critical χ 2 of 5.99 on 2 degrees of freedom,7 where those extra
degrees of freedom relate to the parameters for two-back prediction in both the
autoregressive and moving average components of the ARMA model. Although
the models appear to behave quite similarly in Figures 5.9 and 5.10, the
ARMA(2,2) model provides a significantly better fit to the data: The addition
of those two extra parameters is warranted by the increase in quantitative fit. We
can then infer that the memory process that carries information between trials
to produce the dependencies in time estimation performance operates over larger
ranges than simply between successive trials (see, e.g., Thornton & Gilden, 2005;
Wagenmakers, Farrell, & Ratcliff, 2004, 2005, for more about the implications of
these types of models).
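The calculation itself is brief; here is a sketch using the values just reported (chi2inv requires the Statistics Toolbox):

    dev_simple  = 10872.08;            % -2 ln L for ARMA(1,1)
    dev_general = 10863.20;            % -2 ln L for ARMA(2,2)
    lrt  = dev_simple - dev_general;   % 8.88
    crit = chi2inv(.95, 2);            % critical value, 5.99 on 2 degrees of freedom
    lrt > crit                         % true: the extra parameters are warranted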
This example again emphasizes the importance of modeling to inform and
test our intuitions about a situation. For example, you may be puzzled by the fact
that models that operate only over a window of one or two trials can generate the
extended ACFs seen in Figure 5.10, with correlations remaining between obser-
vations lagged by up to 25 trials. The answer is that the ARMA model works in
part by incorporating the history of previous trials; specifically, the autoregressive
component of these models states that an observation is obtained by adding noise
onto a decayed version of the previous observation (or previous several observa-
tions). Accordingly, although the current observation is not based directly on the
data from 20 trials back, the effects of the history are carried through intervening
trials (trial t is dependent on trial t − 1, which in turn is dependent on trial t − 2,
which in turn is dependent on trial t − 3 . . . ).
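This carry-through of history is easy to demonstrate. The following sketch simulates a simple first-order autoregressive series (not the ARMA model fitted in the text; the coefficient 0.8 is an arbitrary illustrative value) and shows that observations separated by 20 trials remain correlated even though each observation directly depends only on its immediate predecessor.

    phi = 0.8; n = 1000;
    y = zeros(n, 1);
    for t = 2:n
        y(t) = phi * y(t-1) + randn;   % decayed copy of the previous trial plus noise
    end
    c   = corrcoef(y(21:end), y(1:end-20));
    r20 = c(1, 2)                      % positive autocorrelation at lag 20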
So far, we have implicitly assumed that the model under consideration (e.g., the ex-Gaussian model) matches the “true” process that we as scientists are really
attempting to model. From here on, we call the model that we are concerned with
the “known” model and compare it against the unknown state of reality, which we
call the “true” model or reality.
The Kullback-Leibler (K-L) distance is a measure of how much information
is lost when we use one model to approximate another model. Our interest lies in
the case where we use a known model to approximate the “true” model, or reality.
The Kullback-Leibler distance for continuous data is given by
$$
KL = \int R(x) \log \frac{R(x)}{p(x \mid \theta)}\, dx, \qquad (5.6)
$$
where R(x) is the probability density function for the true model, and p(x|θ ) is
the probability density function for our known model and parameters that we are
using to approximate reality. In the case of discrete variables, the K-L distance is
obtained by
$$
KL = \sum_{i=1}^{I} p_i \log \frac{p_i}{\pi_i}, \qquad (5.7)
$$
where i indexes the I categories of our discrete variable, and pi and πi are,
respectively, the “true” probabilities and the probabilities predicted by the known
model.
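To make Equation 5.7 concrete, here is a minimal sketch in which the probabilities are invented purely for illustration:

    p     = [.5 .3 .2];                 % hypothetical "true" category probabilities
    pihat = [.4 .35 .25];               % probabilities predicted by the model
    KL    = sum(p .* log(p ./ pihat));  % Equation 5.7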
The K-L distance shown in Equations 5.6 and 5.7 measures how much the
predicted probabilities or probability densities deviate from the “truth.”8 The use
of this quantity as a measure of information becomes clearer when we rewrite
Equation 5.6 as follows:
$$
KL = \int R(x) \log R(x)\, dx - \int R(x) \log p(x \mid \theta)\, dx. \qquad (5.8)
$$
The first term in Equation 5.8 tells us the total amount of information there is in the
“true” model. This information is actually a measure of the entropy or uncertainty
in reality. The more variability there is in reality, the more information can be
provided by an observation, in that it is harder to predict the next value of x
we might observe. The second term in Equation 5.8 quantifies the uncertainty in
reality that is captured by the model. The difference between these tells us about
the uncertainty that is left over after we have used our model to approximate
reality. In the limit, where our model is a perfect match to reality, the two terms
will be identical, and there will be no uncertainty in reality that is not reflected in
our model: The K-L distance will be 0. As our model gives a poorer and poorer
approximation of reality, the K-L distance increases.
One thing to note about Equation 5.8 is that the first term, the information
in the “true” model, is insensitive to our choice of approximating model. As a consequence, when we compare several candidate models that approximate the same reality, this first term is a constant, and only the second term distinguishes between the models.
Figure 5.11 K-L distance is a function of models and their parameters. Left panel: Three
models and their directed K-L distance to reality, R. Right panel: Change in K-L distance
as a function of a parameter θ in one of the models shown in the left panel. The point
closest to reality here is the ML estimate of the parameter, θ̂.
The AIC corrects the maximized log-likelihood with a penalty for the number of free parameters K: AIC = −2 ln L(θ|y, M) + 2K. Adding free parameters will typically improve the fit but will also increase the size of the penalty term. In the AIC,
we then have a computational instantiation of the principle of parsimony: to find
the best and simplest model.
Before moving on to another well-used information criterion, we mention that
a number of statisticians have noted that the AIC does not perform very well when
models have a large number of parameters given the number of data points being
fit (e.g., Hurvich & Tsai, 1989; Sugiura, 1978). A correction to the AIC has been
suggested in the case where regression and autoregression models are fit to small
samples; this corrected AIC, called AICc , is given by
$$
AIC_c = -2 \ln L(\theta \mid y, M) + 2K \frac{N}{N - K - 1}, \qquad (5.10)
$$
where N is the number of data points. Burnham and Anderson (2002) recommend
using this statistic whenever modeling the behavior of small samples (i.e., when
the number of data points per parameter is smaller than 40).
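Because the code in Listing 5.6 later in this chapter computes only the uncorrected AIC, the corrected version would have to be added along the following lines (a sketch, with nlnL the minimized negative log-likelihood, K the number of free parameters, and N the number of observations):

    AICc = 2*nlnL + 2*K * (N / (N - K - 1));   % Equation 5.10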
The BIC takes a Bayesian perspective on model selection, in which we ask for the posterior probability of a model M given the data y. By Bayes' theorem,
$$
p(M \mid y) = \frac{p(y \mid M)\, p(M)}{p(y)}, \qquad (5.11)
$$
with p(M) being the prior probability of the model. One complication is that
the probability of the data given the model, p(y|M), depends on the selection of
the parameter values. We can remove that dependence by working out the prob-
ability of the data under all possible parameter values weighted by their prior
probabilities:
$$
p(y \mid M) = \int p(y \mid \theta, M)\, p(\theta \mid M)\, d\theta, \qquad (5.12)
$$
where p(θ ) is the prior probability distribution across the parameter(s). This is
a complicated procedure for most models, and we can appreciate why Schwarz
(1978) and others wanted a quick and simple approximation! Under some rea-
sonable default assumptions about the prior distribution on the model param-
eters (see, e.g., Kass & Raftery, 1995; Kuha, 2004; Schwarz, 1978), we can
approximate full-scale Bayesian model selection by correcting our maximized
log-likelihood and obtaining the Bayesian Information Criterion. The BIC is cal-
culated as
$$
BIC = -2 \ln L(\theta \mid y, M) + K \ln N, \qquad (5.13)
$$
N being the number of data points on which the likelihood calculation is based.
The general form of the AIC and BIC is similar, but you can see that the BIC
will provide a greater punishment term whenever ln N > 2—that is, whenever
N > 7—which will usually be the case. This weighted punishment of complexity
means that the BIC has a greater preference for simplicity (e.g., Wagenmakers &
Farrell, 2004).
The BIC does not receive the same interpretation of K-L distance as the AIC.
Unlike the AIC, a single BIC doesn’t have any useful interpretation. However, a
comparison of BICs for different models can tell us which model has the highest
posterior probability given the data, and we will shortly see how BIC can be used
to obtain posterior probabilities for models in a set of candidate models.
The most obvious next step is to pick the “winning” model as the model with
the smallest AIC or BIC. In the case of the AIC, this model is the model with
the smallest estimate of the expected K-L distance from the true generating pro-
cess. In the case of the BIC, the model with the smallest BIC is the model with
the highest posterior probability given the data, assuming the models have equal
priors (i.e., we a priori consider each model to be equally likely). The winning
model can be made more apparent by forming the difference between each AIC
(BIC) value and the smallest AIC (BIC) value in our set of models (Burnham &
Anderson, 2002). Although not necessary, this can aid in the readability of infor-
mation criteria, as sometimes these can reach quite large values (on the order
of tens of thousands). Looking at AIC (BIC) differences also accounts for the
ratio scaling of these information criteria due to the logarithmic transform of the
likelihood. This ratio scale means that any differences between AIC values are
actually ratios between the original likelihoods. Hence, a difference between AIC
values of 2 and 4 is as large as the difference between AIC values of 2042 and
2044, which is made more explicit by calculating differences. The differences
indicate how well the best model (the model with the smallest AIC or BIC) per-
forms compared to the other models in the set. In fact, we can calculate an AIC or
BIC difference between any two models and thus quantify their relative corrected
goodness of fit.
What do these model differences tell us for the individual criteria? In the
case of the AIC, we obtain an estimate of the additional loss in approximation
of the “true” model that obtains when we take a particular model, rather than
the best model, as the approximating model in the AIC. Burnham and Ander-
son (2002, p. 70) present a heuristic table for interpreting AIC differences (ΔAIC) as strength of evidence: ΔAIC = 0–2 indicates that there is little to distinguish between the models, ΔAIC = 4–7 indicates “considerably less” support for the model with the larger AIC, and ΔAIC > 10 indicates essentially no support for the model with the larger AIC and a great deal of support for the model with the smaller AIC.
The BIC differences can be given a more principled interpretation as the log-
arithm of the Bayes factor for two models. The Bayes factor is the ratio of the posterior probabilities of two models, M1 and M2,
$$
B = \frac{p(M_1 \mid y)}{p(M_2 \mid y)}, \qquad (5.14)
$$
and therefore gives a measure of relative strength of evidence for the models in
terms of relative probability. On the basis of this relationship, guidelines for inter-
preting BIC differences have been suggested along similar lines as those for the
AIC (Jeffreys, 1961; Kass & Raftery, 1995; Wasserman, 2000).
A particularly useful transformation of the BIC differences is into Bayesian model weights, where the weight for model M is given by
$$
w_M = \frac{\exp\left(-\frac{1}{2}\, \Delta BIC_M\right)}{\sum_i \exp\left(-\frac{1}{2}\, \Delta BIC_i\right)}, \qquad (5.16)
$$
where ΔBIC_M is the BIC difference between the best model and model M, and each ΔBIC_i is the difference between a specific model in our set and the best
model. These model weights add up to 1 and tell us the posterior probability of
each model given the data, assuming the models in our set are the only candidates
for explaining the data. These model weights are usually used to predict future
observations from all models at once, by averaging the predictions of all models,
but with these predictions weighted by each model’s posterior probability (e.g.,
Kass & Raftery, 1995).
We can give similar interpretations to the AIC, although these are more heuris-
tic and arguably less principled. We can turn the AIC values into model likeli-
hoods: Given an AIC difference ΔAIC_i between a particular model and the best
model in a set of models, we obtain a likelihood as (Burnham & Anderson, 2002)
$$
L_i \propto \exp\left(-\frac{1}{2}\, \Delta AIC_i\right). \qquad (5.17)
$$
Because Equation 5.17 is based on the AIC, which in turn corrects for the free
parameters in our model, we can treat Equation 5.17 as the likelihood of the model
given the data, L(M|y) (Burnham & Anderson, 2004). The likelihood is only
proportional to the expression in Equation 5.17 because it is expressed relative
to the other models in the set; changing the models in the set will change the
value obtained from the equation even though the K-L distance for the specific
model i is fixed. This isn’t a problem, as our real interest in Equation 5.17 is
in determining the relative strength of evidence in favor of each model in the
set. To this end, Equation 5.17 gives us a likelihood ratio: the ratio between the
likelihood for the best model and the likelihood for model i. Just as for the model
likelihood ratios discussed in the context of nested models in Section 5.3, this
model likelihood ratio tells us about the relative evidence for two models. More
generally, we can also calculate Akaike weights as an analog to Bayesian model
weights. This is accomplished by using Equation 5.16 and replacing ΔBIC with ΔAIC; that is,
$$
w_M = \frac{\exp(-0.5\, \Delta AIC_M)}{\sum_i \exp(-0.5\, \Delta AIC_i)}. \qquad (5.18)
$$
As for Bayesian model weights, the Akaike weights are useful for model presenta-
tion and model inference because they quantify the relative success of each model
in explaining the data. Burnham and Anderson (2002) suggest a specific interpre-
tation of Akaike weights as the weight of evidence in favor of each model being
the best model in the set (“best” meaning that it has the smallest expected K-L
distance to reality). Note that regardless of whether we obtain Akaike weights or
Bayesian weights, our inferences are relative with respect to the specific set of
models we have fit to the data and are now comparing. This means that we should
not compare AIC and BIC values or related statistics for different data sets and
that the likelihood ratios, Bayes factors, and model weights are also specific to the
set of models that is being compared.
Listing 5.6 provides MATLAB code to calculate AIC, BIC, and their differ-
ences and model weights. The information needed (passed as arguments) consists of three
vectors containing, respectively, the minimized negative log-likelihoods, the num-
ber of free parameters in each model, and the number of observations involved in
calculating the log-likelihood.
Listing 5.6 MATLAB Code to Calculate Information Criteria Statistics From Minimized
Negative Log-Likelihoods

function [AIC, BIC, AICd, BICd, AICw, BICw] = infoCriteria(nlnLs, Npar, N)
% Calculate information criteria (AIC; BIC),
% IC differences from best model (AICd; BICd),
% and model weights (AICw, BICw)
% from a vector of negative lnLs
% Each cell in the vectors corresponds to a model
% Npar is a vector indicating the
% number of parameters in each model
% N is the number of observations on which
% the log-likelihoods were calculated

AIC = 2.*nlnLs + 2.*Npar;
BIC = 2.*nlnLs + Npar.*log(N);

AICd = AIC - min(AIC);
BICd = BIC - min(BIC);

AICw = exp(-.5.*AICd) ./ sum(exp(-.5.*AICd));
BICw = exp(-.5.*BICd) ./ sum(exp(-.5.*BICd));

The AIC and BIC have been used in many areas of psychology to compare quantitative models of cognition and behavior (Hélie, 2006; Wagenmakers & Farrell, 2004). These include applications to models in the following areas:
• Categorization (e.g., Farrell et al., 2006; Maddox & Ashby, 1993; Nosofsky & Bergert, 2007)
• Memory (e.g., Farrell & Lewandowsky, 2008; Jang, Wixted, & Huber,
2009; Lewandowsky & Farrell, 2008a)
• Spike trains from single-cell recording (e.g., Bayer, Lau, & Glimcher, 2007;
S. Brown & Heathcote, 2003)
• Developmental patterns (Kail & Ferrer, 2007)
• Perception (Kubovy & van den Berg, 2008; Macho, 2007; Ploeger, Maas,
& Hartelman, 2002)
• Decision and response latencies (e.g., Ratcliff & Smith, 2004), including
time-accuracy functions (Liu & Smith, 2009)
• Serial correlations and sequential transitions (e.g., Torre, Delignières, &
Lemoine, 2007; Visser, Raijmakers, & Molenaar, 2002; Wagenmakers, Far-
rell, & Ratcliff, 2004)
Table 5.1 AIC Values and Associated Quantities for the Models in Table 1 of Ratcliff and
Smith (2004)
On the basis of the AIC values in Table 5.1, we cannot discount the Wiener model. The other models in the set are far beyond
these two models in their estimated expected K-L distance. Nevertheless, com-
parisons within these relatively inferior models may still be informative about
the underlying psychological processes. For example, we can take the AIC dif-
ference between the rectangular and geometric Poisson models and calculate a
heuristic likelihood ratio as exp (.5 (8216.89 − 8125.69)), which gives a very
large value (> 10^19). This tells us that although both versions of the Poisson
counter model give a relatively poor account of the data, the data greatly favor the
version of the model in which incoming evidence is sampled from a geometric
distribution.
Table 5.2 summarizes model comparisons using the BIC. The BIC values
were calculated with the number of observations equal to 2304 (i.e., ln N = 7.62);
this N was obtained by determining the mean number of trials per participant
that were used to calculate response proportions and latency quantiles. The major
change in the pattern of results from Table 5.1 is that the BIC almost exclusively
favors the Wiener diffusion model (Ratcliff & Smith, 2004). The Bayes factors
(B) show that the next best model, the exponential accumulator model, is much
less likely than the Wiener model to have generated the data (B=.007). We can
reexpress this Bayes factor by putting the Wiener model in the numerator and
the exponential accumulator in the denominator of Equation 5.14; this works out
as exp(−.5(8162.1 − 8172)), approximately equal to 141. That is, the posterior
probability of the Wiener model given the data is around 141 times that of the
exponential accumulator model. The Bayesian weights in the final column, calcu-
lated using Equation 5.16, confirm that the Wiener process stands out as having
the highest posterior probability in this set of models.
Table 5.2 BIC Values and Associated Quantities for the Models in Table 1 of Ratcliff and
Smith (2004)
When two nested models differ by a single parameter (i.e., the more general model has one additional free parameter), the maximum possible difference in AIC in favor of the simpler model is 2. This is because the more general model will fit at least as well as the simpler model; hence at worst (for the general model), the −2 ln L values are identical for both models, and the maximum possible AIC difference in favor of the simpler model is then just the difference in penalty terms, 2 × K, which is 2 for a single extra parameter (K = 1).
This means we can never find strong evidence in favor of the simpler model using
the AIC.
In contrast, the punishment given to more complex models by the BIC scales
with the (log of the) number of observations, meaning that we can find strong
evidence for the simpler model with large N . In addition, given that the BIC
may tend to conservatism, any evidence against the simpler model in a nested
comparison provides good grounds for concluding that the data favor the more
complex model (Raftery, 1999). In this respect, BIC appears to be the preferable,
though by no means necessary, alternative.
5.5 Conclusion
Putting this all together, model comparison proceeds something like the follow-
ing. First, we specify a probability density function or probability mass function
for each model, perhaps involving the specification of a data model to link the
model and the data. Second, we fit the models to the data using maximum likeli-
hood parameter estimation. Third, we use AIC or BIC to point to one or more pre-
ferred models, in two ways. The first is to give some indication of which model,
out of our set, gives the best account of the data in that it has the smallest estimated
expected K-L distance to the data (AIC) or the highest posterior probability given
the data (BIC). The second is to quantify model uncertainty; that is, how much
does the winning model win by, and how competitive are the remaining models?
Another way of looking at this second point is in terms of strength of evidence,
which is captured in relative terms by a variety of statistics, such as Bayes fac-
tors and model weights. Finally, for the preferred models (and any other models
of interest in our set of candidates), we calculate standard errors and confidence
intervals on the parameter estimates, to give some idea about the uncertainty in
those estimates, especially in those cases where the estimated parameter values
are to receive some theoretically meaningful interpretation.
You now have the tools to carry out all of those steps. Of course, there is
always more to learn, and there are many papers and tutorials in the literature that
will complement or refine what you have learned here. For interesting discussions
of the use of likelihoods as measures of uncertainty and evidence, see Edwards
(1992), Royall (1997), and Pawitan (2001). For further reading on model selection
and model comparison, see Burnham and Anderson (2002), Kass and Raftery
(1995), Hélie (2006), Myung and Pitt (2002), and two recent special issues of the
Journal of Mathematical Psychology devoted to model selection (Myung, Forster,
& Browne, 2000; Wagenmakers & Waldorp, 2006).
Before we close this chapter, there are two final and important issues to dis-
cuss. The first relates to the notion of model complexity. So far we have defined
model complexity very generally as the number of free parameters a model uses
to fit a specific set of data. The AIC and BIC both rely on this definition, as well
as general assumptions about the free parameters, to correct the log-likelihood for
model complexity. Recent work has shown that this is only a rough metric and that
even models with the same number of free parameters may differ in their com-
plexity, and thus their flexibility in accounting for the data, due to their functional
form (Pitt & Myung, 2002).
Computational modelers are becoming increasingly aware of the need to
assess model flexibility when comparing different models on their account of data
(for a review, see Pitt & Myung, 2002). There are several methods available for
surveying and quantifying the flexibility of one or several models, including the
extent to which these models are able to mimic each other. These methods include
minimum description length (e.g., Myung et al., 2000; Pitt & Myung, 2002), land-
scaping (e.g., Navarro, Pitt, & Myung, 2004), parameter space partitioning (Pitt,
Kim, Navarro, & Myung, 2006), and parametric bootstrapping (Wagenmakers,
Ratcliff, Gomez, & Iverson, 2004).
Minimum description length (MDL) is intriguing here because it makes
explicit a necessary trade-off between parameter uncertainty and model uncer-
tainty. MDL has a similar form to AIC and BIC but measures model complexity
as the sensitivity of model fit to changes in the parameters based on the Hessian
matrix for the log-likelihood function (Myung & Pitt, 2002). The more peaked
the log-likelihood function at the ML parameter estimates, the more sensitive the
fit is to our specific selection of parameter values, and thus the more complex the
functional form of the model. As we saw earlier, a more peaked log-likelihood
function (quantified using the Hessian matrix) tells us that our parameter esti-
mates are more accurate. This means that a more peaked log-likelihood function
makes us more confident in our parameter estimates for the model but less confi-
dent that the model is not just overly flexible and fitting the noise in our data. We
return to this issue in the next chapter.
A second, related point is that goodness of fit, whether it is corrected for com-
plexity or not, is only partially informative. Papers and book chapters on this
topic can often make model selection sound like placing gladiators in an arena
and forcing them to fight to the death. After many tortuous simulations and cal-
culations, a single model arises victorious from the gore of accumulation rates and
activation functions, triumphantly waving its maxed likelihood in the air! Although
there is a temptation to halt proceedings at that point, it is critical that we trans-
late this model success into some theoretically meaningful statement. Although
goodness of fit may sometimes be a legitimate end point for a model-fitting exer-
cise, in many cases a complete focus on goodness of fit may disguise a plethora of
issues, including a simple failure of the authors to understand the behavior of their
model. In addition, as models become more complex in order to handle increasing
numbers of data sets, or to account for increasingly complex data, these models
are likely to move into territory where they are less useful as theories because
they are not comprehensible to humans, even highly educated scientists (see our
discussion of Bonini’s paradox in Chapter 2).
As experimental psychologists, we are primarily interested in why this model
won. What features of this model were critical for its success in explaining the
data? What aspects of the data did the model handle particularly well? What
about the failing models? Why did those models perform so poorly? Is there some
common characteristic that led to their downfall? Are there models that failed to
capture the individual data points but nonetheless capture the general trend in the
data (Shiffrin & Nobel, 1997)? As we will discuss in the next few chapters, these
are important questions to ask and resolve.
Notes
1. We will use the terms model selection and model comparison interchangeably.
Although the term model selection is most commonly used in the literature, it can cre-
ate the false impression that our sole aim is simply to pick a model as being the best.
Regardless of which term we use, we are always referring to the procedure of comparing
models and using their relative performance to tell us useful information about underlying
principles or mechanisms.
2. Interested readers may be intrigued to know that this relation can also be obtained
through Taylor series expansion around the log-likelihood function; see, for example,
Chapter 2 of Pawitan (2001).
3. If we have minimized the deviance (−2 ln L), then we need to multiply the Hessian
matrix by .5 before inverting it.
4. Remember, if we had maximized the log-likelihood, we need to multiply the Hessian
matrix by −1 before taking the matrix inverse.
5. Specifically, this is the 20-1 condition, in which participants were presented with 80
lists of 20 words, each word being presented for 1 second. The data were obtained from
the Computational Memory Lab website: http://memory.psych.upenn.edu/.
6. Specifically, this series shows the last 750 observations from Participant 2 in the EL
condition of Wagenmakers et al. (2004).
7. The critical χ 2 can be obtained using the chi2inv function in MATLAB’s Statistics
Toolbox or by consulting a χ 2 table available in the back of most behavioral sciences
statistics textbooks.
8. The K-L distance is not symmetric: The distance between the model and reality is
not necessarily equal to the distance between reality and the model. For this reason, some
authors prefer to refer to this quantity as the K-L discrepancy (see, e.g., Burnham & Ander-
son, 2002).
9. Ratcliff and Smith (2004) did not present −2 ln L values, numbers of parameters,
or number of data points. We have estimated these based on their Tables 1–4 (i.e., from
the BIC values, differences in numbers of parameters implied by the df in their Table 1,
and apparent numbers of parameters in Tables 1–4) and description in the text, with some
clarifying details provided by Phil Smith and Roger Ratcliff, for which we are grateful.
10. Ratcliff and Smith (2004) actually looked at two versions of the OU model that
differed in their fixed decay parameter β. We examine only a single version of the OU
model here, for ease of exposition, and because of one debatable issue. The issue is this:
Because Ratcliff and Smith (2004) examined two different versions of the OU model,
each with a fixed decay parameter, they effectively treated the decay parameter as a free
parameter in calculating their BIC value for that model. Because the decay parameter was
not allowed to be entirely free, the BIC for that model would have been quite approximate,
as the derivation of the BIC assumes the likelihood is the fully maximized (using the K
free parameters) likelihood for that model. This is not troublesome for Ratcliff and Smith’s
conclusions: They found that the OU model could only give a competitive fit by allowing
the decay parameter to be 0, which turned the model into the Wiener diffusion model, a
restricted version of the OU model.
11. This is not the same as the likelihood ratio statistic that we looked at in Section 5.3
earlier. Remember that the likelihood ratios in Table 5.1 are calculated using the AIC as a
corrected measure of the log-likelihood. We present them here as they are an intermediate
step on the way to calculating model weights.
6 Not Everything That Fits Is Gold: Interpreting the Modeling
We have shown you how to fit models to data and how to select the “best” one
from a set of candidates. With those techniques firmly under our belt, we can
now take up three issues that are important during interpretation of our modeling
results.
First, we expand on the notion of “goodness of fit” and how it relates to the
properties of our data. What exactly does a good fit tell us? Is a good fit necessarily
good? What is a “good fit,” anyhow?
Second, suppose that our model fits the data well; we need to ask whether
it could have been bad—that is, can we be sure that our model was falsifiable?
We already touched on the issue of falsifiability in Chapters 1 and 2, and we
now provide formal criteria for establishing a model’s testability and falsifiability.
Does it make sense to even consider models that cannot be falsified? What do we
do if we cannot identify a model’s parameters?
Finally, we discuss the conclusions and lessons one can draw from model-
ing. We show how exploration of a model, by examining its behavior in response
to manipulations of the parameters, can yield valuable theoretical insights. We
then expand on the notions of sufficiency and necessity that were introduced in
Chapter 2.
6.1.1 Overfitting
How serious is the overfitting problem, and what can we do about it? Pitt and
Myung (2002) reported a simulation study that illustrated the magnitude of the
problem. Their basic approach, known as model recovery, is sufficiently important
and general to warrant a bit of explanation.
Figure 6.1 The effects of speech rate on recall for 7- and 10-year-olds. The data are iden-
tical across panels and are represented by gray circles, whereas the solid lines are various
statistical models fit to the data (a regression line on the left, a third-order polynomial in
the middle, and a fifth-order polynomial on the right). Data taken from Hulme, C., & Tord-
off, V. (1989). Working memory development: The effects of speech rate, word-length, and
acoustic similarity on serial recall. Journal of Experimental Child Psychology, 47, 72–87.
See text for details.
Figure 6.2 explains the process underlying all recovery techniques. A known
model (called Mx in the figure) is used to generate a sample of data. For example,
in Pitt and Myung’s case, data were generated from a one-parameter model y = (1 + t)^(−a) (where a = 0.4) by adding some sampling noise to 100 values of y
obtained for a range of ts. Note that these “data” are not data in the strict sense
(i.e., obtained from participants in an experiment) but are simulated data with a
known origin (namely, the model Mx ). We already used this technique earlier, in
Section 3.1, when generating data for our introductory linear regression example;
see Listing 3.1.
The second stage of the recovery procedure, shown in the bottom of
Figure 6.2, consists of fitting a number of candidate models (M y , Mz , . . . ) to the
simulated data. Only one of those models actually generated the data, M x , and
the question is whether it fits (its own) data better than the other competitors. Per-
haps surprisingly, Pitt and Myung found in their simulations that in 100% of all cases, a more complex competitor model fit the data better than the simpler model that had actually generated them.
Figure 6.2 Overview of the model recovery procedure. We start with a known model
(Mx ) and use it to generate data at random (sometimes we may additionally “contaminate”
the data with suitably aberrant observations). We then fit a variety of candidate models
(M y , Mz , etc.) to those generated data and examine which one fits best. Ideally, the best-
fitting model should always be Mx , the model that generated the data. However, there are
instances in which other models may fit even better; see text for details.
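The logic can be sketched in a few lines; here the generating model is the one-parameter function used by Pitt and Myung, whereas the two-parameter competitor is purely hypothetical, included only to illustrate how a more flexible model can out-fit the true one.

    t     = linspace(0, 10, 100);
    yTrue = (1 + t).^(-0.4);                      % generating model, a = 0.4
    yObs  = yTrue + 0.05.*randn(size(t));         % add sampling noise (assumed SD)
    rmsd  = @(pred) sqrt(mean((pred - yObs).^2));
    % fit the generating model and a hypothetical, more flexible competitor
    a1 = fminsearch(@(p) rmsd((1 + t).^(-p)), 0.5);
    p2 = fminsearch(@(p) rmsd(p(2).*(1 + t).^(-p(1))), [0.5 1]);
    fit1 = rmsd((1 + t).^(-a1));
    fit2 = rmsd(p2(2).*(1 + t).^(-p2(1)));
    % fit2 will be at least as small as fit1, even though the competitor did not generate the data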
We can resolve this conundrum by redefining the problem “as one of assessing
how well a model’s fit to one data sample generalizes to future samples generated
by that same process” (Pitt & Myung, 2002, p. 422). In other words, if we fit a
model to one set of noisy data, our fears about possible overfitting can be allayed
if the same parameter estimates successfully accommodate another data set within
the purview of our model. Generalizing to another data set alleviates the problem
because the new data set is guaranteed to be “contaminated” by a different set
of “noise” from the first one; hence, if both can be accommodated by the same
model, then this cannot be because the model fit any micro-variation arising from
error.
There are two main approaches to establishing generalizability of a model.
The first is statistical and relies on some of the techniques already mentioned
(i.e., in Chapters 4 and 5). Although we did not dwell on this at the time, several
model selection techniques have statistical properties that provide at least some
safeguard against overfitting. This safeguard is greatest for minimum-description
length (MDL) approaches (Grünwald, 2007), which are, however, beyond our
scope.
The second approach is more empirical and involves a variety of
techniques that are broadly known as “cross-validation” (e.g., Browne, 2000;
Busemeyer & Wang, 2000; M. R. Forster, 2000). Cross-validation is sufficiently
important to warrant further comment. The basic idea underlying cross-validation
is simple and has already been stated: Fit a model to one data set and see how
well it predicts the next one. Does this mean we must replicate each experiment
before we can fit a model to the data? No; in its simplest form, cross-validation
requires that the existing data set be split in half (at random) and that the model
be fit to one half (called the calibration sample) and its best-fitting predictions
be compared to the data in the other half (the validation sample). This technique
is illustrated in Figure 6.3 using the data from the study by Hulme and Tordoff
(1989).
We arbitrarily split those observations into a calibration sample (N = 3) and a
validation sample (N = 3). Following standard cross-validation techniques (see,
e.g., Browne, 2000; Stone, 1974), we fit three competing models (polynomials
of first, fourth, and eighth order, respectively) to the calibration sample; the solid
lines in the panels show the resulting predictions.
There is an obvious lesson that can be drawn from the figure: Although the
most complex model (an eighth-order polynomial) fits the calibration sample per-
fectly, it fails quite miserably at fitting the validation sample—quite unlike the
linear regression, which does a moderately good job of accommodating both
samples. Had these been real data, the cross-validation would have revealed a
severe case of overfitting for the most complex candidate model (and also for
the intermediate model), and the modeler would have likely chosen the simplest model.
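In code, split-half cross-validation of this kind requires only a few lines. The sketch below assumes that the predictor (speech rate) and the observations are held in vectors x and y and that, as in Figure 6.3, polynomials of order 1, 4, and 8 are the candidate models.

    n      = numel(y);
    ord    = randperm(n);
    calib  = ord(1:floor(n/2));  valid = ord(floor(n/2)+1:end);   % random split
    orders = [1 4 8];
    rmsd   = zeros(size(orders));
    for m = 1:numel(orders)
        p       = polyfit(x(calib), y(calib), orders(m));   % fit to the calibration sample
        pred    = polyval(p, x(valid));                     % predict the validation sample
        rmsd(m) = sqrt(mean((pred - y(valid)).^2));         % generalization error
    end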
Figure 6.3 The effects of speech rate on recall for 7- and 10-year-olds. The data are iden-
tical across panels and are taken from Hulme, C., & Tordoff, V. (1989). Working memory
development: The effects of speech rate, word-length, and acoustic similarity on serial
recall. Journal of Experimental Child Psychology, 47, 72–87. The data in each panel are
arbitrarily split into two samples: a calibration sample represented by gray circles and a
validation sample represented by open circles. The solid lines in each panel represent var-
ious statistical models that are fit to the calibration sample only (a regression line on the
left, a fourth-order polynomial in the middle, and an eighth-order polynomial on the right).
See text for details.
6.2.1 Identifiability
In Chapter 1, we touched on the issue of model identifiability, which refers to the
question of whether behavioral data are ever sufficiently constraining to permit a
unique process model to be identified from among a set of infinitely many other
candidates. Here we address a slightly different question, which is whether for a
given model, one can uniquely identify its parameter values.
Suppose you are shown the letters K L Z, one at a time, and a short while later
you are probed with another letter, and you must decide whether or not it was
part of the initial set. So, if you are shown Z, you respond with “yes” (usually
by pressing one of two response keys), and if you are shown X, you respond
“no.” How might this simple recognition memory task be modeled? Numerous
proposals exist, but here we focus on an early and elegant model proposed by
Sternberg (e.g., 1975). According to Sternberg’s model, performance in this task
is characterized by three psychological stages: First, there is an encoding stage
that detects, perceives, and encodes the probe item. Encoding is followed by a
comparison stage during which the probe is compared, one by one, to all items in
the memorized set. Finally, there is a decision-and-output stage that is responsi-
ble for response selection and output.2 This model can be characterized by three
temporal parameters: the duration of the encoding process (parameter a), the com-
parison time per memorized item (b), and the time to select and output a response
(c). A crucial aspect of this model is the assumption that the probe is compared
to all memorized items, irrespective of whether or not a match arises during the
scan.
The model makes some clear and testable predictions: First, the model pre-
dicts that the time taken to respond should increase with set size in a linear fash-
ion; specifically, each additional item in memory should add an amount b to the
total response time (RT). Second, owing to the exhaustive nature of the scan,
the set size effect must be equal for old (Z) and new (X) probes, and hence the
slopes relating set size to RT must be parallel for both probe types. In a nutshell,
the model would be challenged if RT were not a linear function of set size or if
the slope of the set size function were different for old and new probes.3 As it
turns out, the data often conform to the model’s expectations when considered at
the level of mean RT. Performance is typically characterized by the descriptive
regression function:
RT = t_op + b × s, (6.1)
where s refers to the number of memorized items, and b represents the comparison
time parameter just discussed. Across a wide range of experiments, estimates for
b converge on a value of approximately 35 to 40 ms (Sternberg, 1975), and the
estimates are indistinguishable for old and new items. The intercept term, t_op,
varies more widely with experimental conditions and tends to range from 380 to
500 ms.
You may have discovered the problem already: We describe the data using two
parameters, b and t_op, whereas the psychological model has three parameters (a,
b, and c). The value of parameter b is given by the data, but all we can say about
a and c is that their sum is equal to t_op. Alas, beyond that constraint, they are not
identifiable because there are infinitely many values of a and c that are compatible
with a given estimate of t_op. Hence, the relative contributions of encoding and
decision times to the total RT remain unknown.
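The non-identifiability of a and c is easy to demonstrate numerically. In the following MATLAB fragment (our own, with made-up values), two different allocations of encoding and decision time produce exactly the same predicted response times, because only their sum, t_op, is constrained by the data:

s = 1:6;                            % set sizes
b = 0.038;                          % comparison time per item (in seconds)
rt1 = 0.300 + b*s + 0.150;          % a = 300 ms, c = 150 ms
rt2 = 0.100 + b*s + 0.350;          % a = 100 ms, c = 350 ms
max(abs(rt1 - rt2))                 % returns 0: the predictions are identical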
This example illustrates a few important points: First, the model is clearly
testable because it makes some quite specific predictions that could, in principle,
be readily invalidated by contrary outcomes. Second, even though the data are in
accord with the predictions, they are insufficient to identify the values of all the
model’s parameters. Several questions immediately spring to mind: What are the
implications of the lack of identifiability? How can we respond if a model turns
out not to be identifiable? Can we ascertain identifiability of a model ahead of
time?
in uncertainty. For example, given a parameter with range [0, 1], if one assumes that the distribution of its possible values is uniform before the data are collected, the variance of this distribution is 1/12. [Because for a uniform distribution, σ² = (b − a)²/12, where a and b are the limits of the range; so for a unit interval, σ² = 1/12.] Chechile provides an example of a multinomial tree model (ε, β, γ; see Figure 1.6 for the general form of such a model) whose posterior
distribution—computed in light of the data by Bayesian means—has variances
1/162, 1/97, and 1/865. Thus, notwithstanding the nonidentifiability of parame-
ters, uncertainty about their value has been reduced by a factor of up to 72. More-
over, the mean of the parameters’ posterior distribution can be taken as point
estimates of their values, thus providing quasi-identifiability in some situations in
which conventional identifiability is absent. The quasi-identifiability of parame-
ters is illustrated in Figure 6.4, which shows the posterior probability density for
two parameters, β and ε, for a multinomial tree model (for simplicity, we omit
the third parameter). The solid horizontal line in the figure represents the prior
probability density of parameter values; it is obvious how much more is known
about the likely parameter values in light of the data (the lines labeled β and ε)
than is known a priori, where any possible parameter value is equally likely.
Putting aside those exceptions, however, nonidentifiability is often a serious
handicap that imperils a model’s applicability. Fortunately, if a model turns out to
be unidentifiable, there are several ways in which identifiability can be restored.
Figure 6.4 Prior probability (solid horizontal line) and posterior probabilities (lines
labeled β and ε) for two parameters in a multinomial tree model that are “posterior prob-
abilistically identified.” Figure adapted from Chechile, R. A. (1977). Likelihood and pos-
terior identification: Implications for mathematical psychology. British Journal of Mathe-
matical and Statistical Psychology, 30, 177–184.
For example, one or more parameters can be set to a fixed value, as in the case of Luce’s (1959) choice
model (see Bamber & van Santen, 2000, for a discussion of the identifiability of
Luce’s model).
An alternative to reparameterization involves the elimination of parameters
by experimentally induced constraints (Wickens, 1982). For example, suppose
a model contains a parameter that represents the preexperimental familiarity of
the material in a memory experiment. That parameter can be eliminated (e.g.,
by setting it to 0) if an experiment is conducted in which the to-be-remembered
material is entirely novel (e.g., nonsense syllables or random shapes) and thus
cannot have any preexperimental familiarity. Upon elimination of one parameter,
the model may now be identifiable within the context of that experiment.
Relatedly, identification of a model may be achievable by collecting “richer”
data (Wickens, 1982). For example, a model that is not identifiable at the level
Thus far, we have tacitly assumed that the identifiability of models (or lack thereof)
is either obvious or known. Indeed, identifiability was obvious upon a moment’s
reflection for the earlier model of the Sternberg recognition task. However, the
identifiability of a model need not be so obvious, which gives rise to the rather per-
nicious possibility that we continue to work with a model and rely on its parameter
estimates even though it may be unidentifiable (e.g., Wickens, 1982). How can we
defend ourselves against that possibility? How can we ascertain identifiability?
One approach relies on examination of the standard results from our model
fitting. For example, analysis of the covariance matrix of the parameters (we dis-
cussed its computation in Section 5.1.2) after the model has been fit to a single
data set can reveal problems of identifiability. In particular, identifiability prob-
lems are indicated if the covariances between parameters are high relative to the
variances. Li et al. (1996) explore the implications of this technique. Similarly, if
multiple runs of the model—from different starting values—yield the same final
value of the discrepancy function but with very different parameter estimates, then
this can be an indication of a lack of identifiability (Wickens, 1982). Of course,
this outcome can also arise if there are multiple local minima, and it is not always
easy to differentiate between the two scenarios. Finally, an alternative approach
that is not tied to the vagaries of estimating parameters establishes a model’s iden-
tifiability by formal means through analysis of its “prediction function.”
Bamber and van Santen (1985, 2000) and P. L. Smith (1998) showed that
identifiability of a model can be established by analyzing the Jacobian matrix of
the model’s prediction function. For this analysis, we first note that any model
can be considered a vector-valued function, call it f(θ), that maps a parameter
vector, θ, into an outcome vector, r. That is, unlike a conventional scalar function,
a model produces not a scalar output but an entire vector—namely, its predictions,
expressed as point values.
It turns out that the properties of the model, including its identifiability, can
be inferred from the properties of the Jacobian matrix, Jθ , of that prediction func-
tion. Briefly, the Jacobian matrix describes the orientation of a tangent plane to a
vector-valued function at a given point. Thus, whereas scalar-valued functions are
characterized by a gradient, which is a vector pointing in the direction of steepest ascent, vector-valued functions, by extension, are analogously characterized
by the Jacobian matrix. This is analogous to the vector of first partial derivatives
that we discussed in Chapter 5, where we were talking about changes in the log-
likelihood rather than (predicted) data points themselves. In the present context, it
is important to note that each column of the Jacobian contains a vector of partial
derivatives with respect to one of the model’s parameters (and each row refers
to a different predicted point). P. L. Smith (1998) showed that if the rank of the
Jacobian matrix (i.e., the number of its columns that are linearly independent) is
equal to the number of parameters, then the model is identifiable. In other words,
if the partial derivatives with respect to the various parameters are all linearly
independent, then all parameters can be identified.4 Conversely, if the rank of the
Jacobian is less than the number of parameters, the model is not identifiable.
How, then, does one compute a model’s Jacobian matrix? It turns out that
this computation is fairly straightforward. Listing 6.1 shows a function, called
quickJacobian, that relies on a quick (albeit not maximally accurate) algorithm
to compute the Jacobian matrix. The function is of considerable generality and
requires a minimum of three arguments: a vector of parameter values (x), the pre-
dictions of the model at that point (y), and the name of the function that computes
the predictions of the model. Any additional arguments, when present, are passed
on to the function that computes the model predictions. The only requirements of
the function that computes the model predictions are (1) that the first (though not
necessarily only) argument is a vector of parameter values and (2) that it returns a
vector of predictions. Beyond that, the function can require any number of further
arguments, and its design and workings are completely arbitrary and of no interest
to quickJacobian.
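For readers who want to experiment right away, a bare-bones function of this kind can be written using forward differences. The sketch below is ours; the step size delta and the exact numerical scheme are assumptions, and the listing in this chapter may use a more accurate algorithm:

function J = quickJacobian(x, y, modelFun, varargin)
% Forward-difference approximation to the Jacobian of a model's prediction
% function. x: vector of parameter values; y: model predictions at x;
% modelFun: handle to a function whose first argument is the parameter
% vector and which returns a vector of predictions. Any further arguments
% are passed on to modelFun untouched.
delta = 1e-6;                                  % step size (an assumption)
J = zeros(numel(y), numel(x));
for i = 1:numel(x)
    xStep = x;
    xStep(i) = xStep(i) + delta;               % perturb one parameter at a time
    yStep = modelFun(xStep, varargin{:});
    J(:, i) = (yStep(:) - y(:)) ./ delta;      % column of partial derivatives
end
end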
To put the use of this function into context, consider the very first modeling
example that we presented in Chapter 3. You may recall that we used fminsearch
to compute the parameters for a simple linear regression, using the program in
Listing 3.1. If you run this program and get the best-fitting parameter estimates,
you can get information about the model’s Jacobian by entering the following
three commands into the MATLAB command window:5
y=getregpred(finalParms,data);
This command calls the model to get predictions, using the function shown in
Listing 3.3, and stores those predictions in the variable y.
J=quickJacobian(finalParms,y,@getregpred,data)
This command calls the function just presented (Listing 6.1), passing as argu-
ments (1) the final parameter estimates (the listings in Chapter 3 explain how the
variable finalParms was generated), (2) the vector of predictions created by
the immediately preceding command, (3) the name of the function that generates
predictions (note the “@,” which denotes the parameter as a function handle), and
(4) the original data. Note that the last argument is optional from the point of view
of the function quickJacobian, but it is required by the function getregpred,
and thus it is passed on to the latter function by quickJacobian.
When the command has been executed, the Jacobian matrix is in the variable
J, and so all that is needed is to type
rank(J)
at the command prompt, and the single value that is returned is the rank of the
model’s Jacobian matrix computed at the point of the best-fitting parameter esti-
mates. Of course, in the case of the regression model, the rank is 2, which is
equal to the number of parameters—hence, as we would expect, the model is
identifiable.6
It turns out that the Jacobian matrix is also intimately involved in determining
a model’s testability; hence, computing the Jacobian helps us in more ways than
one.
6.2.2 Testability
We provided a fairly thorough conceptual treatment of the issue of testability
(or, equivalently, falsifiability) in Chapter 1. In particular, we have already estab-
lished that a model is testable if “there exists a conceivable experimental outcome
with which the model is not consistent” (Bamber & van Santen, 2000, p. 25).7
Unfortunately, however useful this in-principle definition of testability might be,
it begs the question of whether this particular model that I am currently design-
ing will turn out to be testable. Given the paramount emphasis on testability in
most circumstances, advance knowledge of whether one’s model is testable is an
important consideration when deciding whether to invest further effort into model
development.
[Identifiability rule from the flowchart: compare r and m; if r = m, the model is identifiable, and if r < m, it is unidentifiable.]
Figure 6.5 Flowchart to determine a model’s identifiability and testability based on analy-
sis of the Jacobian matrix, Jθ , of its prediction function. Here r denotes the maximum rank
of Jθ , m denotes the number of independent free parameters, and n denotes the number of
independent data points that are being fitted. If a model is determined to be “nontestable,”
weaker forms of testability may nonetheless persist; see text for details. Adapted from
Bamber, D., & van Santen, J. P. H. (2000). How to assess a model’s testability and identi-
fiability. Journal of Mathematical Psychology, 44, 20–40. Reprinted with permission from
Elsevier.
being “conjectural” rather than lawful (see Bamber & van Santen, 1985, 2000, for further details). Second, we must clarify the implications of a model being
identified as “nontestable” by those rules. Principally, the implication is that the
model cannot be tested at a rigorous quantitative level; however, the possibility
remains that even a “nontestable” model can be found to be at odds with some
outcome, albeit not at a quantitative level. For example, even though the model
may be (quantitatively) nontestable, it may nonetheless predict inequalities among
data points that can be falsified by the data (see Bamber & van Santen, 1985, 2000,
for further details).
domain. The answer is that models that are not testable can nonetheless be quite
useful—in fact, they are quite common.
This becomes obvious if we consider a “model” consisting of the sample
mean; we can clearly always identify its single parameter—namely, the sample
mean—but we will never be able to “falsify” it. Lest you think that this is a trivial
example with no practical relevance, consider the common practice of computing
a measure of sensitivity (d′) and a criterion (e.g., β) from hits and false alarm rates
in a detection or a recognition memory experiment (e.g., Hicks & Marsh, 2000).
Those two parameters are always identifiable; that is, any imaginable experimen-
tal outcome will always yield one and only one set of values for d′ and β. It is
common to interpret those parameters as reflecting, respectively, a bias-free mea-
sure of the “strength” of evidence underlying people’s judgments and the nature
of people’s response bias. It is perhaps less common to realize that this interpre-
tation is tied to acceptance of a very specific underlying model of the decision
process—namely, that the evidence distributions (noise and signal-plus-noise) are
Gaussian and have equal variance.8 Interpretation of the parameters (in particular,
the presumed independence of d′ and β) is therefore model bound—and therein
lies a problem because the model is not falsifiable in the situation just described.
That is, there exists no combination of a hit rate and a corresponding false alarm
rate that would be incompatible with the signal detection model. Thus, rather
than being able to confirm the adequacy of a model before interpreting its param-
eters, computation of d′ and β from a single set of hits and false alarms does the
opposite—we presume the adequacy of the model and interpret the parameters
in light of that model. Is this necessarily inappropriate? No, not at all. In other
disciplines, such as physics, it is not uncommon to presume the applicability of
a model (e.g., Ohm’s law; see Bamber & van Santen, 2000, for an example) and
to identify parameters on its basis without being concerned about a lack of testa-
bility. In psychology, somewhat greater caution is advised, and interpretation of
parameter values must consider the full intricacies of the situation if a model is
not testable (see, e.g., Pastore et al., 2003; Wixted & Stretch, 2000).
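For concreteness, here is how d′ and β are obtained from a single pair of hit and false alarm rates under the equal-variance Gaussian model. The MATLAB fragment is ours, the rates are illustrative, and norminv and normpdf require the Statistics Toolbox:

H = 0.80;  F = 0.10;                                  % illustrative hit and false alarm rates
dprime = norminv(H) - norminv(F);                     % sensitivity
beta   = normpdf(norminv(H)) / normpdf(norminv(F));   % likelihood-ratio criterion
% Every possible (H, F) pair yields some value of d' and beta, which is
% why a single pair of rates can never falsify the model.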
That said, in many situations, lack of testability can present a problem. This
problem is particularly pernicious when the exact role of a model has become
blurred, for example, when it is no longer totally clear to readers (or even writers,
for that matter) whether the model under consideration is being tested, whether
support for its assumptions is being adduced, or whether it is presumed to be
true in order to imbue the parameter estimates with psychological validity. We
illustrate this problem with an identifiable but untestable model that experienced
a rash of popularity some time ago.
There has been much fascination with the finding that amnesic patients, who
by definition suffer from extremely poor memories, do not appear to differ from
normal control participants on indirect tests of memory. For example, when asked
to recall a word from a previous list given the cue WIN___, amnesic patients per-
form much more poorly than control subjects. However, when given the same
word stem cue with the instruction to complete it “with the first word that comes
to mind,” patients show the same increased tendency to report a list item (rather
than another nonstudied option) as do normal control subjects. This dissociation in
performance between direct and indirect memory tests (also known as “explicit”
and “implicit” tests, respectively) has been at the focus of intense research activ-
ity for several decades (for a comprehensive review, see Roediger & McDermott,
1993), in particular because performance on indirect tests is often said to involve
an “unconscious” form of memory that does not involve recollective awareness
(e.g., Toth, 2000). The latter claim is beset with the obvious difficulty of ensuring
that people are in fact unaware of relying on their memories when completing a
word stem. How do we know that a person who completed the stem with a list
item did not think of its presence on the list at that very moment? Jacoby (1991)
proposed an elegant and simple solution to this “conscious-contamination” prob-
lem. The solution is known as the “process dissociation procedure” (PDP), and it
permits an empirical estimate of the conscious and unconscious contributions to
memory performance. For the remainder of this discussion, we illustrate the pro-
cedure within the domain of recognition memory, where the distinction between
“conscious” and “unconscious” forms of memory has been adopted by the family
of dual-process models (for a review, see Yonelinas, 2002).
At the heart of the PDP is the notion that conscious recollection (R),
but not unconscious activation (U ), is under the subject’s control. In recogni-
tion memory experiments that build on this notion, two distinct lists—which we
call M and F—are presented for study. Following study of both lists, people are
tested first on one and then on the other list. For simplicity, we focus here on
the test of List M, where people are asked to respond yes to items from M but
not to those from List F. For test items from List M, these instructions are cre-
ating what is known as an “inclusion” condition because conscious and uncon-
scious forms of memory can cooperatively drive a response (namely, “yes”). For
test items from List F, by contrast, these instructions create an “exclusion” con-
dition in which conscious recollection is brought into opposition to unconscious
activation. That is, on one hand, successful recollection of a test item will
mandate a “no” response (because F items are to be excluded). On the other hand,
if recollection fails, then unconscious activation arising from the prior study will
mistakenly result in a “yes” response. The resultant differences between inclusion
and exclusion conditions can be exploited to estimate the relative contribution of
the two memorial processes using the logic shown in Figure 6.6.
For the List M items, R and U act in concert, and the probability of a yes
response is thus given by P(Inclusion) = P(R) + P(U) × [1 − P(R)]: Either
a person recollects the presence of the item on List M and says yes on that basis
[Figure 6.6 depicts two processing trees: for Inclusion items, P(yes) = R + (1 − R) × U; for Exclusion items, P(yes) = (1 − R) × U; hence R = Inclusion − Exclusion.]
Figure 6.6 Logic underlying the process dissociation procedure. After study of Lists M
and F, people are asked to respond “yes” to recognition test items from List M but not
those from List F. Thus, items from List M are processed under “inclusion” conditions and
those from List F under “exclusion” conditions. It follows that a yes response to M items
arises from the independent contributions of R (conscious recollection) and U (uncon-
scious activation). For F items, by contrast, any conscious recollection yields a no response;
hence, the tendency to endorse an item from List F as old relies on a failure of R. An esti-
mate of R can thus be obtained as the simple difference between the two conditions. See
text for details.
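To see the arithmetic at work, consider the short-list values shown in Table 6.1 below. Assuming that Equations 6.2 and 6.3 take the form just described, the estimates follow in two lines of MATLAB (ours, for illustration only):

inclusion = .78;  exclusion = .22;     % short-list values from Table 6.1
R = inclusion - exclusion;             % .56, estimated recollection
U = exclusion / (1 - R);               % .50, estimated unconscious activation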
Table 6.1   Observed Response Proportions and Process Dissociation Estimates From Experiment 1 of Yonelinas (1994)

Condition/Estimate of
Memory Component              Short List      Long List

Condition
  Inclusion^a                     .78             .70
  Exclusion^b                     .22             .30
  New test item^c                 .09             .14
Estimate
  Recollection (R)                .56             .40
  Activation (U)                  .50             .50

Note: In the experiment, people were tested on both lists, and the data in the table are for both tests.
For ease of exposition, our description focuses on a single test only, for the list called M.
a. Correct yes responses to items from List M.
b. Erroneous yes responses to items from List F.
c. Erroneous yes responses (i.e., false alarms) to new items.
the notion that an item’s presentation boosts its activation by a constant amount
irrespective of other factors such as list length. Conversely, recollection-driven
responding (R) declines with list length, consonant with the notion that items
become harder to retrieve as list length increases.
Similarly, when the PDP is applied to experiments that compare explicit and
implicit memory tests, the empirical estimates of R and U are typically in line
with theoretical expectations. For example, when a conventional levels-of-
processing manipulation (semantic encoding tasks vs. nonsemantic encoding) is
applied to the inclusion and exclusion conditions, P(R) varies considerably
between encoding conditions, whereas P(U ) remains invariant (e.g., Toth, Rein-
gold, & Jacoby, 1994).9
Can we therefore conclude that the PDP yields pristine estimates of the con-
tributions of conscious and unconscious forms of memory? Does the PDP solve
the “conscious-contamination” problem? To answer this question, we must begin
by considering the assumptions underlying the procedure. First, the PDP assumes
that the two memory components are independent of each other; this assump-
tion is crucial because Equation 6.3 relies on statistical independence:10 If other
assumptions are made about the relationship between R and U , a very different
equation is implied (Buchner, Erdfelder, & Vaterrodt-Plünnecke, 1995; Joordens
& Merikle, 1993). Second, the PDP assumes that R cannot be in error (i.e., one
cannot erroneously recollect an extra-list item or an item from List M to have
been on List F; Ratcliff, Van Zandt, & McKoon, 1995). Third, as already noted,
the PDP assumes that the operation of U is unaffected by changes in context or
instruction.
Next, at the risk of stating the obvious, we must realize that the PDP is a
model rather than just “an ingenious methodology to obtain separate estimates
of familiarity and intentional recollection within a single paradigm” (Light &
La Voie, 1993, p. 223). This realization immediately gives rise to two important
issues: Are the model’s parameters identifiable, and is the model testable? For
this simple model, we can resolve those issues without analysis of the Jacobian.
Concerning identifiability, brief inspection of Equations 6.2 and 6.3 reveals that
there is a one-to-one mapping between any possible outcome and the parameter
values—namely, the estimates of R and U .11 Hence the model’s parameters are
identifiable. Is the model testable? Clearly, it is not: The model necessarily yields
a perfect fit for any possible set of data with the parameters that are computed
from those data. That is, in the same way that one can always compute d′ and
β within a two-parameter signal detection framework, there will always be esti-
mates of R and U that one can obtain from any given set of data. The implications
are clear and were noted earlier in the context of signal detection theory: Interpre-
tation of the parameter values presumes the validity of the model. That is, because
the two-parameter signal detection model always fits perfectly, it cannot be inval-
idated on the basis of its failure to fit data—if this sounds like tautology, it is, but
it clarifies the point. It is useful to explore the implications of this fact.
Ratcliff et al. (1995) used a variant of the model recovery technique out-
lined earlier (see Figure 6.2) to test the PDP. Recall that the true origin of the
data is known when the model recovery technique is applied, and emphasis is on
whether the model under examination fits its own data better than those generated
from competing different models. Ratcliff et al. thus used a single-process model
(SAM; Raaijmakers & Shiffrin, 1981) and a dual-process model other than the
PDP (Atkinson & Juola, 1973) to generate data to which the PDP was applied.
SAM was used to simulate the list length results of Experiment 1 of Yonelinas
(1994) that are shown in Table 6.1, and Atkinson and Juola’s model was used
to generate data for a further experiment involving another manipulation (the
details are not relevant here). Ratcliff et al. (1995) found that the PDP yielded
interpretable parameter estimates in both cases. This is not surprising because we
have already seen that any outcome, whether real data or simulation results, yields
meaningful estimates of R and U . The difficulty, however, is that in one case, the
data were generated by a single-process model that removes any meaning of the R
and U parameters—they are simply not psychologically relevant to the situation.
In the other case, even though the data were generated by a dual-process model,
the PDP recovered estimates of R and U that were very different from the true
probabilities with which the two processes contributed to data generation. In nei-
ther case did the PDP provide any indication that the data did not conform to its
assumptions—quite unlike the SAM model, which failed to account for the data
will almost certainly require a well-crafted general discussion that converts the
results of the modeling into some clear statements about psychological theory.
We will now discuss a number of ways in which researchers have used formal
models to make arguments about psychological processes. In terms of Figure 2.8,
we are moving from the connection between data and predictions (gray area at the
bottom) to the purpose of modeling: relating people to models (unshaded area at
the top of Figure 2.8).
One reason for the utility of parameters is that they permit further analysis of
the model’s behavior. For example, manipulation of a parameter can isolate the
contribution of a particular process to the model’s predictions.
One example comes from Lewandowsky’s (1999) dynamic distributed model
of serial recall. The model is a connectionist model in which items are stored
in an auto-associative network and compete for recall based on their encoding
strength and their overlap with a positional recall cue. One assumption made
in Lewandowsky’s simulations and shared with other models (e.g., Farrell &
Lewandowsky, 2002; Henson, 1998; Page & Norris, 1998b) was that recall of
items was followed by response suppression to limit their further recall. In
Lewandowsky’s model, this was accomplished by partially unlearning an item
once it had been recalled. Following Lewandowsky and Li (1994), Lewandowsky
(1999) claimed that response suppression in his model was tied to the recency
effect in serial recall, the increase in recall accuracy for the last one or two items
on a list. Intuitively, this sounds reasonable: By the time the last few items are
[Figure 6.7 plots proportion correct (0 to 1.0) against output position (1 to 6), with separate curves labeled 0.90η, 0.75η, and 0.50η.]
Figure 6.7 The effect of the response suppression parameter η in Lewandowsky’s (1999)
connectionist model of serial recall. Figure taken from Lewandowsky, S. (1999). Redinte-
gration and response suppression in serial recall: A dynamic network model. International
Journal of Psychology, 34, 434–446. Reprinted with permission of the publisher, Taylor &
Francis Ltd.
being recalled, most other recall competitors (the other list items) have been
removed from recall competition, thereby lending an advantage to those last few
list items. To confirm this, Lewandowsky (1999) ran a simulation varying the
extent of response suppression. The results, reproduced in Figure 6.7, reinforce
the link between response suppression and recency in his model: As the extent of
response suppression is reduced, so is recall accuracy for the last few items.
Parameter values may also lead to interesting predictions, simply because their
particular best-fitting estimates are suggestive of a specific outcome. Suppose you
present people in an experiment with a long list of unrelated pairs of words for
study. The pair green–table might be followed by book–city and so on. Every so
often, you test people’s memory by presenting them with a probe pair (differenti-
ated from study pairs by being bracketed by “?”s). The probe pairs are either intact
(green–table) or rearranged (book–table), and people must discriminate between
them. Surely you would observe more forgetting the longer you delay the test of
an intact pair? That is, if we re-present green–table after only one or two inter-
vening other pairs, surely people are more likely to recognize it as intact than if
10 or even 20 intervening items have been presented?
The (surprising) empirical answer to this question was stimulated by a param-
eter estimate. Murdock (1989) conducted a simulation of paired-associate exper-
iments, such as the one just described, and found that the estimate of the model’s
forgetting parameter, α, was close to 1, where α = 1 means no forgetting. That
is, seemingly unreasonably, the model suggested that paired-associate lists were
subject to very little forgetting over time.
This counterintuitive expectation was confirmed in a series of experiments by
Hockley (1992). Using the method just described, Hockley showed that perfor-
mance in associative recognition declined very little, even when people studied up
to 20 intervening new pairs. (And this was not a methodological “bug” because at
the same time, memory for single items declined considerably—hence there was
something special about the resilience of associations.)
were named faster than low-frequency words), regularity effects (words with reg-
ular pronunciation, such as MINT, are named faster than irregular words such as
PINT), and the interaction between regularity and frequency (the regularity effect
is larger for low-frequency words). The model was also shown to account for
a number of other aspects of the data, including neighborhood size effects and
developmental trends, including developmental dyslexia.
What does this actually tell us? Arguably, the most interesting and important
claim to arise from Seidenberg and McClelland’s (1989) results is that a model
that is not programmed with any rules can nonetheless produce rule-like behav-
ior. It has often been assumed that regularity effects reflect the difference between
application of a rule in the case of regular words (e.g., INT is pronounced as
in MINT) and the use of lexical knowledge to name irregular exceptions (the
INT in PINT). Seidenberg and McClelland’s simulations show that a single pro-
cess is sufficient for naming regular and irregular words. Similar claims have
been made in other areas of development, where it has been shown that apparent
stage-like behavior (of the type shown in Figure 3.7) can follow from continuous
changes in nonlinear connectionist models (Munakata & McClelland, 2003). Crit-
ics have highlighted issues with Seidenberg and McClelland’s account of reading
more generally (e.g., Coltheart, Curtis, Atkins, & Haller, 1993), arguing that the
model does not provide a sufficient account of nonword reading or different types
of dyslexia. Issues have also been raised regarding the falsifiability of connec-
tionist models of the type represented by Seidenberg and McClelland’s model
(e.g., Massaro, 1988). Nonetheless, the model constitutes a useful foil to dual-
route models of reading.
Our second example comes from the work on the remember-know distinction
in recognition memory. Recognition memory looks at our ability to recognize
whether we have seen or otherwise encountered an object before, usually in some
experimental context. The remember-know distinction taps into an assumed dis-
sociation between two processes or types of information that underlie recognition
memory: a process of conscious recollection and a feeling of “knowing” that the
object has been encountered before but without any attendant episodic details of
that experience (Gardiner, 1988). (Note how this dichotomy resembles the two-
process views of recognition considered earlier in this chapter; e.g., Yonelinas,
1994.) The remember-know distinction can be operationalized by asking partic-
ipants after each recognition response (“Yes, I saw the item on the list,” or “No,
I did not”) whether they “remember” (i.e., have a conscious recollection of) see-
ing the item or whether they simply “know” that the item has been encountered
previously. Evidence for a dissociation comes from the finding that certain vari-
ables independently affect the frequency of remember and know responses (e.g.,
amnesiacs vs. controls: Schacter, Verfaellie, & Anes, 1997; item modality: Gregg
& Gardiner, 1994) or can have opposite effects on remember and know responses
(Gardiner & Java, 1990).
[Figure 6.8, described below, plots probability density against familiarity for the old- and new-item distributions, with the know (K) and remember (R) response criteria marked along the familiarity axis.]
Although it seems intuitively plausible that the empirical
remember-know distinction must be tapping some underlying distinction between
recollective and non-recollective processing, it has been demonstrated that a sim-
ple signal detection theory (SDT) model of recognition memory can also account
for remember-know judgments.
The basics of this model (Donaldson, 1996) are illustrated in Figure 6.8. It is
assumed that each time a recognition probe is presented, a measure of familiarity
of that probe is calculated. The amount of familiarity depends on whether the item
is “old” (was seen in the study phase) or “new.” Generally, items previously seen
are more familiar. Nonetheless, there is some variability in familiarity, such that
there are overlapping distributions (probability density functions or PDFs) for the
old and new items. A further assumption is that recognition decisions are made on
the basis of criteria, corresponding to remember (R) and know (K) judgments. If
the familiarity falls above the R criterion, the participant produces a “remember”
response. If the familiarity does not exceed the R criterion but still surpasses the
K criterion, then a “know” response is given. Finally, if the familiarity falls below
the K criterion, then the item is considered to be new.
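Assuming, as is common, equal-variance Gaussian familiarity distributions, the response probabilities implied by this criterion scheme are easy to write down. The following MATLAB sketch is ours, the parameter values are illustrative, and normcdf requires the Statistics Toolbox:

dprime = 1.5;   cK = 0.2;   cR = 1.0;                     % memory strength and the two criteria
pR_old = 1 - normcdf(cR - dprime);                        % P("remember" | old item)
pK_old = normcdf(cR - dprime) - normcdf(cK - dprime);     % P("know" | old item)
pR_new = 1 - normcdf(cR);                                 % P("remember" | new item)
pK_new = normcdf(cR) - normcdf(cK);                       % P("know" | new item)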
The critical feature of the model in Figure 6.8 is that it is a unidimensional
model: There is only one dimension of familiarity along which items vary and
only one process giving rise to remember-know responses. Several researchers
have shown that this simple unidimensional account does surprisingly well at
fitting data from the remember-know paradigm (Donaldson, 1996; Dunn, 2004;
Rotello & Macmillan, 2006; Wixted & Stretch, 2004). In particular, Dunn (2004)
showed that the model depicted in Figure 6.8 could account for cases where
manipulations had selective or opposing effects on remember and know responses
by changes in both the separation of the densities in Figure 6.8 and in one or both
of the response criteria. For example, Schacter et al. (1997) showed that amnesiacs
give fewer “remember” responses to old items than controls but that both groups
are roughly equal in their frequency of “know” responses. Dunn’s fit of the model
showed it to give an excellent quantitative fit to these results, and examination of
model parameters showed how the model could account for the hitherto challeng-
ing data: The fit to the control data produced a larger d′ and larger estimates of
the K and R criteria, with a larger change in the K criterion to account for the
lack of change in the frequency of K responses. Although this somewhat begs
the question of why these groups should differ in their criteria, an argument can
be made that the criteria should scale with the quality of an individual’s memory
(as measured by the difference between the new and old curves in Figure 6.8:
Hirshman, 1995).
This is a clear example of a model’s success: A model that intuitively might
not appear able to account for the data in fact does. One consequence of such
successes has been a more careful consideration of what remember and know
responses might actually correspond to: Dunn (2004) has presented analyses of
the relation of remember and know responses to other possible responses avail-
able in the recognition task, and Wixted (2004a) demonstrated the compatibility
between the criterion treatment of remember and know responses illustrated in
Figure 6.8 and other responses that effectively recruit multiple criteria, such as
confidence ratings.
Hirshman and Master (1997) and, later, Hirshman, Lanning, Master, and Hen-
zler (2002) argued that the signal detection account also served a valuable role
even if the weight of evidence falls against the theory. Hirshman et al. (2002)
were responding to the claim that since some empirical results were clearly prob-
lematic for the SDT account of remember-know judgments, the theory should
be abandoned. The critics of the SDT account of remember-know judgments
hold that the model does not capture the critical qualitative distinction between
“remember” and “know” responses (including neurophysiological dissociations),
the argued differential extent to which these responses reflect access to conscious
awareness and relate to the self (e.g., Conway, Dewhurst, Pearson, & Sapute,
2001; Gardiner & Gregg, 1997). A statement that summarizes the view of SDT
from this perspective and that has been subsequently quoted by several studies
comes from Gardiner and Conway (1999), cited in Hirshman et al. (2002): “[As]
trace strength models will always fall far short of providing anything approaching
a full understanding of states of memory awareness, it seems to us that it makes
sense to turn to other approaches” (p. 153).
cases, this can still provide strong evidence in favor of a particular assumption.
As an example, Table 6.2 shows model selection results from Liu and Smith’s
(2009) examination of time accuracy functions obtained from a response signal
task. In this task, participants were given a stimulus about which they needed
to make a decision (e.g., whether a Gabor patch is oriented to the left or right;
Carrasco & McElree, 2001) and were signaled to respond at various time points
after onset of the stimulus. The resulting time accuracy functions have a character-
istic shape that resembles an exponential function. Accordingly, common practice
is to fit an exponential function to the data, as a purely descriptive model, and esti-
mate values for three parameters: the intercept of the function (the signal delay
time where accuracy exceeds chance levels), the asymptote of the function, and
the rate at which asymptote is approached. The rate of the exponential is of great-
est theoretical interest, as it is taken as a measure of the rate of visual information
processing.
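One common parameterization of this descriptive function is a shifted exponential; the exact form varies somewhat across studies, so the following MATLAB fragment (ours) should be read as one representative choice rather than the definitive equation used by Liu and Smith (2009):

% lambda: asymptote; beta: rate; delta: intercept; t: processing time (s)
taf = @(t, lambda, beta, delta) lambda .* (1 - exp(-beta .* max(t - delta, 0)));
acc = taf(0.2:0.2:2.0, 2, 8, 0.35);     % accuracy at a range of response lags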
In this framework, we can ask whether a specific manipulation affects the
rate, asymptote, intercept, or some combination of these variables. For example,
Carrasco and McElree (2001) gave participants a visual search task, in which
observers were to determine whether a stimulus composed of a pattern of lines—
the target—was tilted to the left or right; other distractors on the screen were com-
posed of similar line patterns but were vertically oriented. Carrasco and McElree
(2001) examined whether pre-cueing the location of the target (by placing a cue
at the target location before target onset) had effects on rate, asymptote, or inter-
cept (or some combination of these) by fitting a time accuracy function to their
data. They found that the pre-cue affected both the discriminability of the tar-
get (as captured by the asymptote of the exponential) and the rate of approach to
asymptote (i.e., the rate of information extraction). To illustrate the procedure of
model selection with this family of models, Liu and Smith (2009) fit eight models
to some hypothetical response signal data. The eight models all represented the
same exponential model for the time accuracy function but varied in whether var-
ious aspects of the exponential were affected by the pre-cue. Specifically, Liu and
Smith (2009) factorially varied whether each parameter (rate, asymptote, or inter-
cept) was fixed between cued and uncued locations or was allowed to vary. With
a few exceptions, these models are not nested within each other. Accordingly, the
AIC or BIC is an appropriate model comparison tool to use; following Liu and
Smith (2009), we consider the BIC here.
Table 6.2 shows the results of their model fitting. The first column labels
the models A, R, and I, respectively, which stand for asymptote, rate, and inter-
cept, and 1 versus 2 indicates whether a single parameter was shared between cued
and uncued locations (1) or whether that parameter was allowed to freely vary
Table 6.2 BIC Differences and BIC Weights for the Eight Time Accuracy Models of Liu
and Smith (2009)
Model        ΔBIC        wBIC
between the two (2). The next column shows BIC differences—from the small-
est value of BIC—calculated from Liu and Smith’s (2009) Table 2, and the final
column shows the BIC weights calculated from the BIC differences. Although
there is one model that stands out (2A-1R-2I), the evidence for that model is not
strong. Several other models have nontrivial BIC weights: the 2A-1R-1I model,
the 2A-2R-1I model, and the 2A-2R-2I model. All these models have in com-
mon the assumption that the asymptote differs between cued and uncued loca-
tions, although they differ in whether they additionally allow changes in rate or
intercept or both. Notwithstanding their heterogeneity, when considered jointly,
the models provide convincing evidence for a change in asymptote. The higher
weight for the 2A-1R-2I model provides some additional evidence about changes
in the other two parameters, but this aspect of the evidence is less strong than the
clearly identified changes in asymptote.
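As a reminder of how the final column is obtained, BIC weights follow from the BIC differences by exponentiation and normalization. The MATLAB fragment below uses placeholder values of our own rather than the entries of Table 6.2:

deltaBIC = [0 1.8 3.2 4.5 7.0 9.3 11.1 15.8];              % placeholder BIC differences
wBIC = exp(-deltaBIC ./ 2) ./ sum(exp(-deltaBIC ./ 2));    % weights sum to 1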
Of course, sometimes the evidence is less clear for a particular assumption or
set of assumptions, or some further investigation might be needed. McKinley and
Nosofsky (1995) compared a number of models of people’s ability to categorize
stimuli where the categories were ill-defined (see their Figure 6 for an illustration
of their complicated category structure). One model examined by McKinley and
Nosofsky was a deterministic version of the Generalized Context Model (GCM)
(Nosofsky, 1986). This is the same GCM model we discussed in Chapter 1,
and we’ll talk more about the deterministic version of that model in Chapter 7.
McKinley and Nosofsky also examined four versions of decision-bound theory,
in which participants are assumed to carve up the stimulus space into categories
using decision boundaries.
Table 6.3 shows BIC weights for individual participants and the group aver-
age for the five models fit to McKinley and Nosofsky’s (1995) Experiment 2;
these values were calculated from McKinley and Nosofsky’s Experiment 2,
taking into account the number of trials modeled (300). The mean BIC weights
Table 6.3 BIC Weights for Five Categorization Models From McKinley and Nosofsky
(1995)
models to the data of those participants reflected the greater flexibility of those
models.
6.3.4 Summary
In the end, your goal as a modeler is not only to fit one or more models to some
data, and perhaps perform some model selection, but also to communicate the
implications of these results to others. Explaining to your reader why a particu-
lar model was able to account for the data—and why other models failed to do
so—will be essential if you want to convince that reader of the role of a partic-
ular mechanism, process, or representation; we assume that’s why you want to
do modeling, right? When doing this, it is especially important to keep in mind
that your reader will not necessarily be familiar with the model(s) you are dis-
cussing or even with modeling generally. Even if the technical details of your
work are opaque to such readers, in an area such as psychology, it is impera-
tive that the implications of your work are apparent to modelers and nonmodelers
alike; remember that the latter group will likely constitute the majority of your
readership.
Above all, we encourage you to avoid the overly simplistic “winner-takes-all”
perspective (“my model fits better than yours, and therefore the assumptions I
make are more appropriate”) and to adopt a more sophisticated stance (“my
model fits better than yours for these reasons, and this is what that tells us about
cognition”).
Notes
1. It turns out that the “leaving one out at a time” approach is asymptotically identical
to model selection based on the Akaike Information Criterion (AIC) that we discussed in
Chapter 5 (provided some specific conditions are met; see Stone, 1977, for details).
2. For simplicity, the latter stage lumps together response selection and response output
even though those two processes could quite possibly be considered as separate stages.
3. To illustrate, suppose the comparison process stopped whenever a match was found
between the probe and a memorized item. In that case, on the assumption that items from
all list positions are tested equally often, the slope for old items would be half of the slope
for new items (because on average, only half of the list items would have to be scanned to
find a match when the probe is old, whereas a new probe could only be rejected after all
items had been scanned).
4. Identifiability follows because a function is invertible if its derivative is invertible—
and for that to be the case, the Jacobian has to be full rank; for details, see Bamber and
van Santen (1985) and P. L. Smith (1998). Fortunately, for most models, this analysis of
the Jacobian can be conducted at any arbitrarily chosen θ and then holds for the model
overall; for details of when those generalizability conditions hold, see Bamber and van
Santen (1985, p. 458).
5. You could also put those commands into an m-file and then execute that program.
This achieves the same result, but doing it interactively is more straightforward for simple
groups of commands like this.
6. As an exercise, we suggest that you repeat this analysis with three parameters. Specif-
ically, similar to the case of the Sternberg model discussed at the outset, we suggest you
replace the intercept, parms(2) in Listing 3.3, with two values, such that line 4 in that
listing now reads as follows: preds = (data(:,2) .* b) + a2 + a;, where a2 is a third free
parameter (in parms(3)). This will compute seemingly sensible parameter estimates (e.g.,
we obtained .52, −.43, and .48 in one of our runs, thus yielding an intercept of around
zero), but the rank of the Jacobian will be one less (2) than the number of parameters (3),
thus confirming the obvious lack of identifiability.
7. The issue of testability can be further subdivided into “qualitative” versus “quantita-
tive” testability (Bamber & van Santen, 1985, 2000). We do not flesh out this distinction
here other than to note that quantitative testability is a more stringent criterion and involves
models that make exact predictions—as virtually all models considered in this book do. We
therefore implicitly refer to quantitative testability, as defined by Bamber and van Santen
(1985, 2000), throughout our discussion.
8. Strictly speaking, other models can be presumed, but for the present discussion, we
assume the equal-variance Gaussian model and focus on β as the chosen descriptive mea-
sure of the criterion. We also assume that only a single set of hit and false alarm rates is
used to compute the model parameters; the situation is very different when multiple such
rates exist and computation of receiver operating characteristic (ROC) curves becomes
possible (see, e.g., Pastore, Crawley, Berens, & Skelly, 2003, for details).
9. The principal support for the PDP consists of the broad range of such findings, where
R and U behaved as expected on the basis of prior theory, a pattern known as “selective
influence.” Hirshman (1998) pointed to the flaws in that logic, and Curran and Hintzman
(1995) additionally showed that those patterns can be an artifact of data aggregation; how-
ever, more recently, Rouder, Lu, Morey, Sun, and Speckman (2008) provided a hierarchical
model of the PDP that can circumvent the aggregation problems.
10. Recall that we defined and discussed independence in probability theory in
Chapter 4.
11. This statement is a slight oversimplification: If recollection is perfect, P(Exclusion) goes toward 0, in which case R → 1 and hence U is undefined.
7
Drawing It All Together:
Two Examples
This chapter draws together the entire material presented so far in two detailed
examples. The first example involves the WITNESS model of eyewitness identifi-
cation (Clark, 2003) and in particular its application to the “verbal overshadowing
effect” reported by Clare and Lewandowsky (2004). The second example involves
a head-to-head comparison of some models of categorization, thereby illustrating
the concepts of model selection developed in Chapter 5.
These two examples illustrate several important contrasts: First, WITNESS is
based on a stochastic simulation involving a large number of replications, whereas
the categorization models are based on analytic solutions and hence provide pre-
dictions that are not subject to sampling variability. Second, WITNESS considers
the data at the aggregate level because each subject in an eyewitness identification
experiment provides only a single response, and response proportions are thus
only available in the aggregate, whereas the categorization models can be fit to
the data from individuals. Third, the WITNESS example is based on a descriptive
approach relying on a least squares criterion (i.e., root mean squared deviation
[RMSD]; see Chapter 3), whereas the comparison of the categorization models
involves maximum likelihood estimation and model comparison (see Chapters 4
and 5).
fact that the defendant had been positively identified by seven (7!) independent
eyewitnesses. After the prosecution had presented its case, the actual perpetrator,
a Mr. Ronald Clouser, came forward and confessed to the crimes after having
been identified and located by a former FBI agent. The charges against Father
Pagano were dropped (see Searleman & Herrmann, 1994, for a recounting of this
intriguing case). How could so many eyewitnesses have mistakenly identified an
innocent person? Lest one dismiss this as an isolated case, Wells et al. (1998)
presented a sample of 40 cases in which wrongfully convicted defendants were
exonerated on the basis of newly available DNA evidence. In 90% of those cases,
eyewitness identification evidence was involved—in one case, a person was erro-
neously identified by five different witnesses.
Not surprisingly, therefore, the study of eyewitness behavior has attracted
considerable attention during the past few decades. In eyewitness identification
experiments, subjects typically watch a (staged) crime and are then presented with
a lineup consisting of a number of photographs of people (for an overview, see
Wells & Seelau, 1995). The lineup typically contains one individual who commit-
ted the (staged) crime, known as the perpetrator, and a number of others who had
nothing to do with the crime, known as foils. Occasionally, “blank” lineups may
be presented that consist only of foils. There is little doubt that people’s perfor-
mance in these experiments is far from accurate, with false identification rates—
that is, identification of a foil—in excess of 70% (e.g., Clare & Lewandowsky,
2004; Wells, 1993). The laboratory results thus confirm and underscore the known
problems with real-world identification evidence.
The first computational model of eyewitness identification, appropriately
called WITNESS, was proposed by Clark (2003). WITNESS provided the first
detailed description of the behavior of eyewitnesses when confronted with a police
lineup or its laboratory equivalent, and it has successfully accounted for several
diagnostic results (Clark, 2003, 2008).
(2) Encoding into memory is assumed to be imperfect, such that only a propor-
tion s (0 < s < 1) of features are veridically copied into a memory vector (called
M) when the perpetrator is witnessed during commission of the crime. The value
of s is typically estimated from the data as a free parameter.1 The remaining 1 − s
features are stored incorrectly and hence are replaced in memory by another sam-
ple from the uniform distribution.
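In code, this encoding assumption amounts to copying each feature with probability s and replacing it otherwise. The function below is our own sketch (with a hypothetical name), not Clark's implementation:

function M = encodeFace(perp, s)
% Imperfect encoding: each feature of the perpetrator vector perp is stored
% veridically with probability s; otherwise it is replaced by a fresh
% random feature drawn from a uniform distribution on [-.5, .5].
M = rand(size(perp)) - 0.5;           % start with random replacement features
veridical = rand(size(perp)) < s;     % features that are encoded correctly
M(veridical) = perp(veridical);
end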
(3) At the heart of WITNESS’s explanatory power is its ability to handle
rather complex similarity relationships between the perpetrator and the foils in
a lineup, which vary with the way foils are selected (see Clark, 2003, for details).
For the present example, we simplified this structure to be captured by a single
parameter, sim, which determined the proportion of features (0 < sim < 1) that
were identical between any two vectors, with the remainder (1 − sim) being ran-
domly chosen from the same uniform distribution (range −.5 to +.5). Thus, all
foils in the lineup resembled the perpetrator to the extent determined by the sim
parameter.
(4) At retrieval, all faces in the lineup are compared to memory by computing
the dot product between the vector representing each face and M. The dot product,
d, is a measure of similarity between two vectors and is computed as
d = \sum_{i=1}^{N} g_i M_i ,   (7.1)
where N is the number of features in each vector and i the subscript running over
those features. The greater the dot product, the greater the similarity between the
two vectors. In WITNESS, the recognition decision relies on evaluating the set of
dot products between the faces in the lineup and M.
(5) The complete WITNESS model differentiates between three response
types: an identification of a lineup member (“it’s him”), a rejection of the entire
lineup (“the perpetrator is not present”), and a “don’t know” response that was
made when there was insufficient evidence for an identification (Clark, 2003).
For the present example, we simplified this decision rule by eliminating the “don’t
know” category in accordance with the experimental method to which we applied
the model. This simplified decision rule reduces to a single comparison: If the
best match between a lineup member and memory exceeds a criterion, c_rec, the
model chooses that best match as its response. If all matches fall below c_rec, the
model rejects the lineup and records a “not present” response.
Because each participant in an eyewitness identification experiment provides
only a single response, the data are best considered in the aggregate. In particular,
the data consist of the proportion of participants in a given condition who identify
the perpetrator or one of the foils, or say “not present.” The model predictions are
likewise generated by aggregating across numerous replications, each of which
involves a newly sampled perpetrator, lineup, and memory representation.
[Figure 7.1 depicts boxes labeled Main, storevec, getvec, getsimvec, bof, and decision.]
Figure 7.1 The relationship between the MATLAB functions used in the WITNESS simu-
lation. The names in each box refer to the function name(s) and file names. Boxes within a
box represent embedded functions. Arrows refer to exchanges of information (via function
calls and returns or global variables). Solid arrows represent information exchanges that
are managed by the programmer, whereas broken arrows represent exchanges managed by
MATLAB. Shading of a box indicates that the function is provided by MATLAB and does
not need to be programmed. See text for details.
The parameters c_rec(H) and c_rec(F) represent, respectively, the value of the response criterion after holistic and fea-
tural verbalization. Aside from altering the setting of the criterion, verbalization
has no further effect within this version of WITNESS.3
We now present and discuss the various programs, beginning with the function
that implements WITNESS and that is called from within bof, the function that
computes badness of fit. We present the main program and the Wwrapper4fmin
function later.
We begin by presenting in Listing 7.1 the core of the function that implements
WITNESS. The function takes a single input argument that contains the current
parameter values, and it returns the associated predictions in a single array. In
addition, the function uses a “global” variable—defined immediately below the
function header—to communicate with other MATLAB programs and functions
that are part of the simulation package. Variables that are declared global in MAT-
LAB are accessible from within all functions in which they appear in a global
statement; this provides another convenient avenue of communication between
parts of a large program that does not require input or output arguments.
26 for i = 2:consts.lSize
27     paLineup(i,:) = getsimvec(paSim, perp);
28     ppLineup(i,:) = getsimvec(ppSim, perp);
29 end
30
31 %eyewitness inspects lineup
32 for i = 1:consts.lSize
33     paMatch(i) = dot(paLineup(i,:), m);
34     ppMatch(i) = dot(ppLineup(i,:), m);
35 end
36
37 %witness responds
38 for iLineup = 1:consts.nCond
39     if any(iLineup==consts.fChoice)
40         criterion = 0;
41     else
42         criterion = parms(consts.ptToCrit(iLineup));
43     end
44     if any(iLineup==consts.paLineup)
45         useMatch = paMatch;
46     else
47         useMatch = ppMatch;
48     end
49     resp = decision(useMatch, criterion);
50     predictions(iLineup, resp) = predictions(iLineup, resp) + 1;
51 end
52 end %rep loop
Preliminaries. First consider the statement in line 7, which calls the random-
number generator with two arguments that reset its state to the value provided
by the variable consts.seed.4 Note that this usage of consts.seed identifies
the global variable consts to be a “structure.” A structure is a very useful con-
struct in MATLAB because it allows you to refer to many variables (known as
“structure members”) at the same time, in much the same way that checking in a
single suitcase is far preferable to carrying numerous socks and shirts to the air-
port. Structure members are identified by appending their names to the structure’s
name, separated by a period (“.”).
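To make this concrete, here is a minimal illustration of how a structure is built and
accessed; the member names mirror entries of Table 7.3, but the values shown here are
purely illustrative.

consts.lSize = 6;            % number of faces in each lineup (illustrative value)
consts.seed  = 12345;        % seed for the random-number generator (illustrative value)
nFoils = consts.lSize - 1;   % members are read with the same dot notation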
Because the consts structure contains several important variables that govern
the simulation, we summarize all its members and their values in Table 7.3. We
will show later how those structure members are initialized; for now, we can take
their values for granted as shown in the table. The structure members that are
identified by an asterisk will be explained further in the following; the others are
self-explanatory.
The reseeding of the random generator ensures that the sequence of ran-
dom numbers provided by MATLAB’s rand function is identical every time the
witness function is called. In consequence, there are no uncontrolled random
variations across calls of the function during parameter estimation, thus minimiz-
ing the disturbance of the error surface that is associated with stochastic simula-
tion models (see Chapter 3).5
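The reseeding call itself is not visible in the fragment reproduced above; as a sketch,
it takes a form such as the following.

rand('state', consts.seed);   % legacy two-argument syntax for resetting the generator
% rng(consts.seed);           % equivalent call in more recent MATLAB releases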
The subsequent lines (9–13) assign the first two entries in the parameter vec-
tor to more mnemonic variable names. Thus, we create a variable s that contains
the current value of the encoding parameter, s, and a variable sim for the param-
eter of the same name. (In case you are wondering how we know that those two
parameters take the first two slots in the parameter vector, the answer is that the
order of parameters in the vector corresponds to their order of presentation in
Table 7.2; this order was determined by us, the programmers.) We then assign
the same value of sim to three other variables that represent the specific simi-
larities within the simulation—namely, between the perpetrator and an innocent
suspect (i.e., the person that takes the perpetrator’s place on perpetrator-absent
lineups; variable ssp) and between the perpetrator and all foils on the perpetrator-
present (ppSim) and perpetrator-absent (paSim) lineups. The reason we use dif-
ferent variable names here even though all are assigned the same value is to leave
open the possibility that in future simulations, the different similarities may take
on distinct values.
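In outline (a sketch only, not the verbatim listing), the assignments in lines 9 through 13
take a form like this, following the parameter order of Table 7.2:

s     = parms(1);    % encoding parameter s
sim   = parms(2);    % similarity parameter sim
ssp   = sim;         % similarity of perpetrator to innocent suspect
ppSim = sim;         % perpetrator-to-foil similarity, perpetrator-present lineup
paSim = sim;         % perpetrator-to-foil similarity, perpetrator-absent lineup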
The core components. The core of the function begins in line 16 with a loop
that accumulates predictions across the multiple replications. Within the loop,
each replication first involves a holdup (or some other heinous crime), which is
modeled in line 19 by storing in memory an image of the perpetrator (generated
in the line immediately above). The two functions getvec and storevec form
part of the WITNESS simulation package (as foreshadowed in Figure 7.1) and
are presented later; for now, it suffices to know that they generate a random vector
and store a vector in memory, respectively.
The holdup is immediately followed by lineup construction. First, an inno-
cent suspect is obtained for the perpetrator-absent lineup using another embedded
function—namely, getsimvec (Line 22). The two lineup types are then created
by placing the perpetrator or innocent suspect in the first lineup position (lines 24
and 25) and the foils in the remaining positions (lines 26–29). At this point, your
training in experimental design should kick in, and you should balk at the idea
that the first lineup position is always taken up by the perpetrator (when present).
Surely this important variable must be randomized or counterbalanced? Yes, if
this were a behavioral experiment involving human subjects whose decisions are
subject to all sorts of biases, then it would be imperative to determine the posi-
tion of the perpetrator at random. Fortunately, however, WITNESS does not suffer
from such biases and considers all lineup positions exactly equally. For that rea-
son, we can fix the perpetrator’s position, which turns out greatly to facilitate
scoring.
You might also wonder why we created two separate lineups, rather than just
a single set of foils that is presented either with the perpetrator or an innocent
suspect in the first position. Does the use of different foils for the perpetrator-
present and perpetrator-absent lineups not introduce an unnecessary source of
variation? The answer is that in reality, foils are selected by police officers in
order to match the apprehended person of interest—who may or may not be the
perpetrator (for details, see Clark & Tunnicliff, 2001). It follows that in order
to be realistic, the foils should differ between the two lineup types. However,
because ppSim and paSim are set to the same value, there should be no systematic
differences between foils in the two lineup types.
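For orientation, the encoding and lineup-construction steps just described (lines 18
through 25 of the listing, which are not visible in the fragment above) plausibly look
something like the following sketch; the member name consts.n for the number of
features is an assumption.

perp = getvec(consts.n);          % generate the perpetrator's face (assumed member name)
m    = storevec(s, perp);         % imperfect encoding of the perpetrator into memory

innocent = getsimvec(ssp, perp);  % innocent suspect for the perpetrator-absent lineup

ppLineup(1,:) = perp;             % perpetrator heads the perpetrator-present lineup
paLineup(1,:) = innocent;         % innocent suspect heads the perpetrator-absent lineup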
Once the lineups are created, the witness inspects them, and a match between memory
and each lineup member is computed (lines 32–35). In contrast to human wit-
nesses, who can only be assigned to one lineup type or the other—but not both—
the model can consider two completely different conditions without any bias or
carryover effects. This means that although it looks like we are testing the same
witness under different conditions, we are actually testing different simulated par-
ticipants in line with the experimental design. The match is computed by calling
the MATLAB function dot , which returns the dot product between two vectors.
Response selection and scoring. WITNESS then selects a response in the man-
ner dictated by the experimental methodology and in line with the criterion expla-
nation. This crucial segment, which implements the experimental methodology
underlying the data in Table 7.1, involves the loop beginning in line 38.
At this point things get to be somewhat intricate, and to facilitate understand-
ing, we present the mapping between experimental conditions and the program
variables in Figure 7.2.
Experiment                 1        1        1         1        1        1         2        2        2
Lineup                     PP       PP       PP        PA       PA       PA        PP       PP       PP
Condition               Control Holistic Featural  Control Holistic Featural  Control Holistic Featural
iLineup                    1        2        3         4        5        6         7        8        9
consts.ptToCrit            3        4        5         3        4        5        n/a      n/a      n/a
criterion               parms(3) parms(4) parms(5)  parms(3) parms(4) parms(5)     0        0        0
any(iLineup==
  consts.paLineup)         0        0        0         1        1        1         0        0        0
any(iLineup==
  consts.fChoice)          0        0        0         0        0        0         1        1        1
Figure 7.2 Mapping between experimental conditions (shaded part at the top) and pro-
gram parameters in our simulation (bottom part). PP = perpetrator-present lineup; PA =
perpetrator-absent lineup. See text for details.
The shaded cells at the top of the figure summarize the data that we want
to simulate: There are two experiments, there are two lineup types, and there are
three conditions for each lineup type in each experiment. Now consider the bottom
(unshaded) panel of the figure, which lists the values of the program variables that,
in lines 38 through 51, instantiate this experimental setup, thus ensuring that WIT-
NESS selects responses in the manner appropriate for the experiment, condition,
and lineup type being modeled. Turning first to Experiment 2, the response crite-
rion is disabled by setting it to zero (line 40) whenever the loop index iLineup
matches one of the elements of consts.fChoice, which is an array that points
to the conditions that comprise Experiment 2 (see Table 7.3 and Figure 7.2). Note
that the criterion explanation cannot differentiate between the three conditions
within a forced-choice methodology, which implies that to the extent that there are
differences between the conditions in Experiment 2, this will necessarily increase
the misfit of the model.
For the remaining optional-choice conditions from Experiment 1, two choices
must be made: The appropriate criterion must be selected from the parameter vec-
tor, and the appropriate lineup must be chosen. The criterion is chosen in line 42
using the array consts.ptToCrit. The first three rows in the bottom panel of
Figure 7.2 clarify the mapping between the loop index (iLineup) and the deci-
sion criterion that is being selected from the parameter vector. The lineup type is
determined in line 44, using the array consts.paLineup. The second-last row
in Figure 7.2 shows the outcome of the expression in line 44, which is equal to
1 (i.e., true) whenever the value of the loop index (iLineup) matches one of the
perpetrator-absent conditions.
Finally, once the criterion and lineup type have been selected, a response is
returned by the decision function, which we will discuss in a moment. The
returned responses are counted in the predictions array (line 50), which keeps
track of the number of occurrences of each response type for each condition.
To summarize, within each replication, WITNESS encounters a perpetrator
and then selects a response under all conditions being modeled—that is, two
lineup types (perpetrator present or absent), two decision types (optional choice
or forced choice), and three experimental conditions (control, holistic, and fea-
tural). Once all replications are completed, the counted responses are converted
to predicted proportions (not shown in listing) and are returned as the model’s
predictions.
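The conversion from counts to proportions, which is not shown in the listing, amounts to
a single division such as the one below; nReps is a hypothetical name for the number of
replications.

predictions = predictions ./ nReps;   % turn response counts into predicted proportions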
You may have noted that a seemingly disproportionate amount of program
(and description) space was devoted to vaguely annoying matters of bookkeep-
ing, such as identification of the appropriate parameters and lineups for the var-
ious conditions. Comparatively little space seemed to be devoted to doing the
actual simulation, for example, the encoding of the perpetrator’s face. This is not
at all unusual: In our experience, the majority of programming effort tends to
be devoted to instantiating important details of the experimental method and to
keeping track of simulation results.
Embedded functions. Let us now turn to the various embedded functions that
are required to make WITNESS work. Listing 7.2 shows the remaining segment
of the witness function; as indicated by the line numbers, this listing involves
the same file shown above in Listing 7.1.
Listing 7.2 Embedded Functions for WITNESS
54
55
56 %------ miscellaneous embedded functions
57 %get random vector
58 function rv = getvec(n)
59     rv = (rand(1,n) - 0.5);
60 end
61
62 %take a vector and return one of specified similarity
63 function outVec = getsimvec(s, inVec)
64     a = rand(1, length(inVec)) < s;
65     outVec = a .* inVec + ~a .* getvec(length(inVec));
66 end
67
68 %encode a vector in memory
69 function m = storevec(s, inVec)
70     m = getsimvec(s, inVec);
71 end
72
73 %implement the decision rules
74 function resp = decision(matchValues, cRec)
75     %if all lineup members fall below cRec, then reject
76     if max(matchValues) < cRec
77         resp = 3;
78     else
79         [c, j] = max(matchValues);
80         if j == 1   %suspect or perp always first
81             resp = 1;
82         else
83             resp = 2;
84         end
85     end
86 end
87 end
The first embedded function in line 58, getvec, is simplicity itself: It creates
a vector of uniform random numbers that are centered on zero and range from
−.5 to +.5. All stimulus vectors in this simulation are ultimately generated by
this function.
The next function, getsimvec in line 63, also returns a random vector, but
in this instance, the new vector has a specified similarity to another vector that is
provided by the input argument inVec. Specifically, each feature of the output is
copied from inVec with probability s, so that on average a proportion s of features
are identical to inVec, and the remaining features are sampled at random.
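To illustrate (treating getvec and getsimvec as stand-alone functions for the moment,
and with purely illustrative numbers):

probe = getvec(100);             % a random 100-feature vector
twin  = getsimvec(.8, probe);    % on average, 80% of features copied from probe
mean(twin == probe)              % typically close to .8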
To encode the perpetrator in memory, we use the function storevec in line 69:
As it turns out, this function simply calls getsimvec, thus instantiating WIT-
NESS’s assumption that only part of the perpetrator image is stored correctly,
whereas the remaining encoded features are sampled at random. (In fact, we could
have omitted this function altogether and used getsimvec to do the encoding.
However, by using a separate function, we leave open the door for possible future
modifications of the encoding process.)
Finally, we must examine how WITNESS selects a response. This selection
is made by the function decision, which is defined in lines 74 through 86.
The function receives an array of dot products that represent the match between
memory and the lineup members (input argument matchValues) together with
a response criterion (cRec). If all matches fall below the criterion (line 76), then
a response type “3” is returned. Alternatively, the lineup member with the largest
match is returned as the response. If that largest match is in Position 1 (line 80),
then we know that we have identified the perpetrator (or innocent suspect, in
perpetrator-absent lineups), and the function returns a response type “1.” (Remem-
ber how we said earlier that placing the perpetrator in Position 1 facilitates
scoring—now you know why.) Alternatively, if the largest match is in any other
position, we return response type “2,” which means that a foil has been mistak-
enly identified. To summarize, the decision function returns a single variable
that can take on values 1, 2, or 3 and classifies the response, respectively, as
(1) an identification of the perpetrator, (2) an identification of a foil, or (3) the
rejection of the lineup. Recall that those response types are counted separately
across replications (refer back to line 50 in Listing 7.1).
This, then, completes the presentation and discussion of the central part of
the simulation—namely, the witness function and all its embedded auxiliary
functions. Within the structure in Figure 7.1, we have discussed the box on the
left. We next turn to the main program shown in the box at the top of that figure.
The compact main program is presented in Listing 7.3 and is explained quite
readily. Lines 6 through 13 initialize the consts structure with the values shown
earlier in Table 7.3. Note how those values were available inside the witness
function because consts was declared to be global both here and inside the
function.
The next few lines (17–25) initialize a matrix with the to-be-fitted data that
were shown earlier in Table 7.1. You will note that all numbers in the table are
also shown here, albeit in a slightly different arrangement (e.g., the columns repre-
sent response types rather than conditions) that simplifies the programming. Note
also that the order of conditions, from top to bottom, is the same as their order,
from left to right, in Figure 7.2. This is no coincidence because it means that the
predictions returned by function witness share the layout of the data.
We next set the starting values for the parameters (line 34) and determine their
maximum values (line 35) to ensure that they do not go out of bounds during esti-
mation (e.g., sim must not exceed 1 because the similarity between two vectors
cannot be greater than identity). Finally, Wwrapper4fminBnd is called to esti-
mate the parameters (line 37). This part of the code is virtually identical to the
example presented earlier (in Section 3.1.2) and does not require much comment.
We therefore now turn to the remaining function, represented by the central box
in Figure 7.1, which coordinates the parameter estimation.
The test for boundary conditions in line 9 is noteworthy: If any of the param-
eters are out of bounds, the function bof returns the maximum number that
MATLAB can represent (realmax), thus signaling Simplex not to go anywhere
near those values. Only if the current parameter values are within bounds are the
predictions of WITNESS computed and compared to the data via the standard
RMSD (Equation 2.2). We used RMSD as the discrepancy function for
comparability with Clare and Lewandowsky (2004); because the data consist of
counts (i.e., number of subjects who make a certain response), we could equally
have used a χ 2 discrepancy function (which was employed by Clark, 2003).
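In skeleton form, such a discrepancy function does something along the following lines;
the names lb, ub, and data, and the use of global variables to pass them, are assumptions
made for this sketch.

function rmsd = bof(parms)

global consts data lb ub                      % assumed globals for this sketch

if any(parms < lb) || any(parms > ub)         % out-of-bounds parameters
    rmsd = realmax;                           % return the largest representable number
    return;
end

preds = witness(parms);                       % model predictions
rmsd  = sqrt(mean((preds(:) - data(:)).^2));  % root mean squared deviation
end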
Improving boundary checks. One limitation of the boundary check in List-
ing 7.4 is that it creates a “step” function, such that any legitimate parameter
value, no matter how close to the boundary, is left unpenalized, whereas any out-
of-bounds value, no matter how small the transgression, is given an equally large
penalty. These problems can be avoided by using the fminsearchbnd function,
which is not part of a standard MATLAB installation but can be readily down-
loaded from MATLAB Central.
Listing 7.5 shows an alternative parameter estimation function, called
Wwrapper4fminBnd, which uses fminsearchbnd and passes the lower and
upper bounds of the parameters as additional arguments (lines 4 and 5). Owing to
the use of fminsearchbnd, the code has also become more compact because the
boundary check did not have to be programmed explicitly.
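A wrapper along those lines might look as follows; the argument names are placeholders
rather than the values used for the reported fits.

function [finalParms, finalRMSD] = Wwrapper4fminBnd(startParms, lowerBounds, upperBounds)

% fminsearchbnd takes the same arguments as fminsearch,
% plus vectors of lower and upper bounds on the parameters
[finalParms, finalRMSD] = fminsearchbnd(@bof, startParms, lowerBounds, upperBounds);
end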
The simulation as just described was run three times, with a different set of start-
ing values for the parameters on each run. The best-fitting estimates reported in
Table 7.2 are based on the run that yielded the smallest RMSD (.0498), suggesting
that the average deviation between model predictions and data was on the order
of 5%.
Can we be certain that this solution reflects a global rather than a local mini-
mum? There can be no complete assurance that the observed minimum is indeed
global, but several facts raise our level of confidence in the solution. First, the
different runs converged on very similar RMSDs—namely, .0498, .0511, and
Figure 7.3 Final parameter estimates as a function of their starting values for three fits of
the WITNESS model to the data of Clare and Lewandowsky (2004). From left to right, the
panels show the values of s, sim, and the recognition criteria, respectively. In the rightmost
panel, circles, squares, and triangles refer to C_rec(C), C_rec(H), and C_rec(F), respectively.
.0509, respectively. Second, the final parameter estimates were remarkably sim-
ilar, notwithstanding considerable variation in their starting values. This is illus-
trated in Figure 7.3, which shows the final parameter estimates as a function of
their starting values. The figure shows that there is very little variability along
the ordinate (final estimates) even though the values differ quite noticeably along
the abscissa (starting values). The fact that the three simulation runs converged
onto nearly indistinguishable best-fitting estimates bolsters our confidence that
we have reached a global minimum.
How well does the model capture the data? Figures 7.4 and 7.5 show the data
from Experiments 1 and 2, respectively, together with the model predictions.
It is clear from the figures that the model handles the main trends in the data
and provides a good quantitative fit of both experiments. By varying the deci-
sion criterion, WITNESS captures the effects of verbalization on identification
performance for both optional-choice and forced-choice lineups and for
perpetrator-absent as well as perpetrator-present lineups. This result lends consid-
erable support to the criterion explanation of verbal overshadowing. What remains
to be seen is whether a memory-based alternative explanation may also handle the
results.
Clare and Lewandowsky (2004) also examined whether WITNESS might handle
the results by assuming that memory is modified during verbalization. Specif-
ically, to instantiate a memory explanation, Clare and Lewandowsky assumed
that verbalization partially overwrites the perpetrator’s image in memory. This
was modeled by a reduction in the encoding parameter s, because reducing s is
equivalent to replacing accurately encoded features of the perpetrator with random
values. The resulting fit of this memory explanation is shown in Figure 7.6.
[Figure 7.4 comprises six panels (top row: Hits, False IDs, Misses; bottom row: Suspect IDs, Foil IDs, Correct Rejections), each plotting Response Proportion for the Control, Holistic, and Featural conditions.]
Figure 7.4 Data (bars) and predictions of the criterion explanation within WITNESS
(points and lines) for Experiment 1 (optional-choice lineups) of Clare and Lewandowsky
(2004). The top row of panels represents the perpetrator-present lineup and the bottom
row the perpetrator-absent lineup. Data from Clare, J., & Lewandowsky, S. (2004). Verbal-
izing facial memory: Criterion effects in verbal overshadowing. Journal of Experimental
Psychology: Learning, Memory, & Cognition, 30, 739–755. Published by the American
Psychological Association; adapted with permission.
Figure 7.5 Data (bars) and predictions of the criterion explanation within WITNESS
(points and lines) for Experiment 2 (forced-choice lineup) of Clare and Lewandowsky
(2004). Data from Clare, J., & Lewandowsky, S. (2004). Verbalizing facial memory: Cri-
terion effects in verbal overshadowing. Journal of Experimental Psychology: Learning,
Memory, & Cognition, 30, 739–755. Published by the American Psychological Associa-
tion; adapted with permission.
What conclusions can we draw from the modeling involving WITNESS? First, the
modeling shows that the verbal overshadowing effect arguably reflects a criterion
adjustment rather than an impairment of memory after verbalization. When this
[Figure 7.6 comprises six panels (top row: Hits, False IDs, Misses; bottom row: Suspect IDs, Foil IDs, Correct Rejections), each plotting Response Proportion for the Control, Holistic, and Featural conditions.]
Figure 7.6 Data (bars) and predictions of the memory explanation within WITNESS
(points and lines) for Experiment 1 (optional-choice lineups) of Clare and Lewandowsky
(2004). The top row of panels represents the perpetrator-present lineup and the bottom
row the perpetrator-absent lineup. Data from Clare, J., & Lewandowsky, S. (2004). Verbal-
izing facial memory: Criterion effects in verbal overshadowing. Journal of Experimental
Psychology: Learning, Memory, & Cognition, 30, 739–755. Published by the American
Psychological Association; adapted with permission.
criterion explanation is instantiated in WITNESS, the model can predict both the
presence and absence of verbal overshadowing (or indeed a beneficial effect for
perpetrator-absent lineups) depending on the type of decision that is expected of
participants. Second, the modeling showed that the data are not readily modeled
by an explanation based on alteration or overwriting of memory during verbaliza-
tion. Although this second finding does not rule out a memory-based explanation
because we did not explore all possible instantiations of that alternative, the fact
that the criterion explanation handles the data quite parsimoniously makes it an
attractive account of the phenomenon.
In the context of verbal overshadowing, Clark (2008) drew attention to the fact
that the mere label for the effect—namely, “verbal overshadowing”—is not theory
neutral but, simply by its name, “implies the memory impairment explanation that
has been the dominant explanation . . . since the original results were reported”
(pp. 809–810). Thus, merely giving an effect a name can create the appearance
of an explanation: Far from being an advantage, this erroneous appearance may
bias subsequent theorizing and may retard corrective action (cf. Hintzman, 1991).
One of the uses of computational modeling is to provide substantive explanations
that necessarily go beyond labeling of an effect. Thus, although we refer to the
“criterion explanation” by name, this is a name not for a phenomenon but for a
fully instantiated process explanation.
One limitation of the example just presented is that we only considered a sin-
gle model (albeit in two different variants). In general, stronger inferences about
cognitive processes are possible if several competing models are compared in
their ability to handle the data. The next example illustrates this situation.
You’ve already been presented with the basics of GCM in Chapter 1. Because
that was a while ago, and so that you can easily refer back to this material when
reading about the application of GCM below, we’ll briefly summarize the model
again. GCM’s main claim is that whenever we have an experience about an object6
and its category, we store a localist, unitary representation of that object and its
category: an exemplar. For the example we are looking at here, the experimental
stimuli only vary along a single dimension, and so only a single feature is relevant.
When we come across an object in the world and wish to determine its cate-
gory (e.g., edible vs. inedible), the GCM postulates that we match that object to
all exemplars in memory and categorize the object according to the best match to
existing exemplars. This matching process relies on a calculation of the distance
between the object i and all the stored exemplars j ∈ J:
d_{ij} = |x_i - x_j| .   (7.2)
Note the simple form of this equation compared to the earlier Equation 1.3. We’ve
been able to simplify this because here we only have a single stimulus dimension.
This means we do not need to sum across several dimensions, and it also means
that we can leave out the generalization to different types of difference metric:
For a single dimension, |a − b| = \sqrt{(a − b)^2}. This distance is then mapped into a
measure of similarity (i.e., match) between the new stimulus and each exemplar:

s_{ij} = \exp(-c \cdot d_{ij}) .   (7.3)
Remember that the c in Equation 7.3 scales the drop-off in similarity with increas-
ing distance. When c is small, a stimulus will match a wide range of exemplars
(i.e., a slow drop-off with distance), and when c is large, the stimulus will only
match exemplars within a very narrow range (i.e., a fast drop-off with distance).
Finally, the similarity values are added up separately for exemplars in each cate-
gory and used to determine the categorization probability for the new stimulus:
P(R_i = A|i) = \frac{\sum_{j \in A} s_{ij}}{\sum_{j \in A} s_{ij} + \sum_{j \in B} s_{ij}} .   (7.4)
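To make Equations 7.2 through 7.4 concrete, the following lines compute the GCM
prediction for a single probe; the exemplar values, category labels, and value of c are
invented for illustration and are not taken from the experiment discussed below.

c = 5;                                  % similarity gradient (illustrative)
exemplars = [.1 .2 .3 .6 .7 .8];        % stored exemplar values (illustrative)
category  = [ 1  1  1  2  2  2];        % 1 = Category A, 2 = Category B
probe = .25;                            % new stimulus to be categorized

s  = exp(-c .* abs(probe - exemplars)); % Equations 7.2 and 7.3
pA = sum(s(category==1)) / sum(s);      % Equation 7.4: probability of an 'A' response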
The fundamental assumption of the GCM is that all past experiences with cate-
gory members are stored as exemplars in memory. This assumption permits the
model to explain the relationship between categorization and recognition mem-
ory as discussed in Chapter 1. General recognition theory (GRT; Ashby, 1992a;
Ashby & Gott, 1988), by contrast, takes a very different tack. The GRT assumes
that what is represented in memory is an abstraction of the category structure
rather than the exemplars themselves. Specifically, GRT assumes that the stim-
ulus space is carved into partitions by decision boundaries. All that needs to be
stored in memory in order to classify new exemplars is some specification of the
placement of the boundaries. In multidimensional space, this can get quite com-
plicated to conceptualize, as the boundaries can have any form, although they are
usually assumed to be specified by linear or quadratic equations.7 In the case of
stimuli that vary along only a single dimension, things are very easy: A category
boundary is a single point along the stimulus dimension.
Categorization errors are assumed to follow from trial-by-trial variability in
the perception of the stimulus (Alfonso-Reese, 2006). This means that on one
trial, a stimulus may appear to fall on one side of the boundary, whereas on another
trial, it appears to fall on the other side. GRT assumes that this variability takes the
form of a normal probability density function (PDF) centered on the true value of
the stimulus and with standard deviation σ . An example is shown in the top panel
of Figure 7.7: The circle shows the actual stimulus value, and around that is drawn
the normal PDF.
Given the normal density around the stimulus and the placement of the bound-
aries, what is the probability of categorizing the stimulus as belonging to a partic-
ular category? This can be worked out using concepts we discussed in Chapter 4
with reference to probability functions. Recall that in the case of a PDF, the proba-
bility that an observation will fall within a specific range involves finding the area
under the curve within that range. That is, we need to integrate the PDF between
the minimum and maximum values defining that range. Take a look again at the
top panel of Figure 7.7. The portion of the PDF shaded in gray shows the area
under the curve to the left of the boundary; this area under the PDF is the prob-
ability that the perceived stimulus value falls below the boundary and therefore
corresponds to the probability that the stimulus i will be categorized as an ‘A,’
[Figure 7.7 comprises two panels plotting Probability Density against the Stimulus Dimension; the top panel shows a single boundary β, and the bottom panel shows two boundaries, β1 and β2, with the shaded regions labeled a1 and a2.]
Figure 7.7 Depiction of one-dimensional categorization in GRT. Both panels show a nor-
mal PDF reflecting the variability in perception of the stimulus; the stimulus itself has a
value of 4 and is shown as a circle. The vertical lines in each panel (labeled β) show the
category boundaries. In the top panel, only a single boundary exists, and the probability of
categorizing the stimulus as being in Category A is the area under the normal density to
the left of the boundary. The bottom panel shows a more complicated example with two
boundaries; in this case, the probability of categorizing the stimulus as an ‘A’ is the area
under the curve to the left of the left-most boundary (β1 ) plus the area to the right of the
right-most boundary (β2 ).
P(Ri = A|i). In the top panel, we need to find the integral from the minimum
possible stimulus value up to the boundary value (β). Let’s assume that the stim-
ulus dimension is unbounded, so that the minimum possible value is −∞. This
means we need to calculate
P(R_i = A|i) = \int_{-\infty}^{\beta} N(x_i, \sigma) ,   (7.5)

where N is the normal PDF with mean x_i and standard deviation σ. To calculate
this integral, we can use the normal cumulative distribution function (CDF). As
we discussed in Chapter 4, the CDF is the integral of a PDF and can therefore
be used to integrate across segments of the PDF. Denoting the normal
CDF by Φ, we can rewrite Equation 7.5 as

P(R_i = A|i) = \Phi\left(\frac{\beta - x_i}{\sigma}\right) ,   (7.6)
where the integration is assumed to be taken from −∞. The argument passed to
the normal CDF, (β − x_i)/σ, expresses the boundary as a z score of the normal
density around the stimulus because the normal CDF function assumes a mean of
0 and a standard deviation of 1.
We can also use this method to obtain predicted probabilities from GRT for
more complicated examples. The bottom panel of Figure 7.7 shows a case where
there are two boundaries, β1 and β2 . Stimuli below β1 and above β2 belong to
Category A, whereas stimuli between the two boundaries belong to Category B;
such a category structure might arise when determining whether some milk is safe
to feed to an infant given its temperature. The predicted probability of categoriz-
ing a stimulus i as being an ‘A’ is then the probability that either the perception
of the stimulus falls below β1 or that it falls above β2 . These two probabilities
are mutually exclusive (unless it is a quantum event, the stimulus cannot simulta-
neously fall both below β1 and above β2), so following the rules of probability in
Chapter 4, the probability of either event happening can be obtained by adding up
the individual probabilities; that is, we sum the gray areas in Figure 7.7. The first
area is obtained in a similar manner to the top panel, by integrating from −∞ to
β1 :
P(R_i = a_1|i) = \Phi\left(\frac{\beta_1 - x_i}{\sigma}\right) .   (7.7)
The second region, to the right of β2 , requires only a little more thinking. The
CDF gives the integral from −∞ up to some value, so we can use it to obtain the
integral up to β2 . To work out the area above β2 , we can calculate the integral up
to β2 and subtract it from 1: Remember that probabilities must add up to 1, and
if the perceived stimulus doesn’t fall to the left of β2 , it must fall to the right (we
are assuming that the perceived stimulus cannot fall directly on the boundary).
Written as an equation, it looks like this:
P(R_i = a_2|i) = 1 - \Phi\left(\frac{\beta_2 - x_i}{\sigma}\right) .   (7.8)
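As a small numerical illustration of Equations 7.7 and 7.8, using the relationship between
Φ and the error function given later in Equation 7.10 (the boundary values, stimulus
value, and σ below are invented for the example):

beta1 = 3; beta2 = 6; sigma = 1; x = 4;   % illustrative values only

Phi = @(z) 0.5 * (1 + erf(z / sqrt(2)));  % standard normal CDF

a1 = Phi((beta1 - x) / sigma);            % area below beta1 (Equation 7.7)
a2 = 1 - Phi((beta2 - x) / sigma);        % area above beta2 (Equation 7.8)
pA = a1 + a2;                             % probability of an 'A' response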
Much research has been directed at discriminating between GCM and GRT as
competing explanations for human (and animal) categorization (e.g., Farrell et al.,
2006; Maddox & Ashby, 1998; McKinley & Nosofsky, 1995; Nosofsky, 1998;
Rouder & Ratcliff, 2004). Ashby and Maddox (1993) and Maddox and Ashby
(1993) noted that one difficulty with comparing GCM and GRT directly is that
such a comparison confounds the representations involved in categorizing a stim-
ulus and the processes required to turn the resulting information into an overt
categorization response. That is, not only do GCM and GRT obviously differ in
the way category information is represented, but they also differ in the way in
which responses are generated. In GCM, responses are probabilistic (by virtue
of the use of the Luce choice rule, Equation 7.4), whereas GRT’s responses are
deterministic: If a stimulus is perceived to fall to the left of the boundary in the
top panel of Figure 7.7, it is always categorized as an ‘A.’
To partially deconfound these factors, Ashby and Maddox (1993) presented
a deterministic exemplar model (DEM). This model is identical to the standard
GCM model, with the exception that the response rule is deterministic. DEM
replaces Equation 7.4 with a modified version of the Luce choice rule:
P(R_i = A|i) = \frac{\left(\sum_{j \in A} s_{ij}\right)^{\gamma}}{\left(\sum_{j \in A} s_{ij}\right)^{\gamma} + \left(\sum_{j \in B} s_{ij}\right)^{\gamma}} .   (7.9)
The modification is that the summed similarities are raised to the power of a free
parameter γ . This parameter controls the extent to which responding is determin-
istic. If γ = 1, the model is identical to GCM. As γ gets closer to 0, responding
becomes more and more random, to the point where γ = 0 and Equation 7.9 works
out as 1/(1 + 1): Responding is at chance and isn’t sensitive to the actual stimulus
presented. As γ increases above 1, responding gets more and more deterministic.
This is because γ acts nonlinearly on the summed similarities, and if (\sum_{j \in A} s_{ij})^{\gamma}
is greater than (\sum_{j \in B} s_{ij})^{\gamma}, then increasing γ will increase (\sum_{j \in A} s_{ij})^{\gamma} more than it
increases (\sum_{j \in B} s_{ij})^{\gamma}. For a very large value of γ, the model will respond deter-
ministically: If \sum_{j \in A} s_{ij} \gg \sum_{j \in B} s_{ij}, the model will always produce an ‘A’
response.
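A few lines of MATLAB make the effect of γ in Equation 7.9 apparent; the summed-similarity
values are arbitrary.

sumA = 2; sumB = 1;                       % arbitrary summed similarities
for gamma = [0 1 5 50]
    pA = sumA^gamma / (sumA^gamma + sumB^gamma);
    fprintf('gamma = %4.0f  ->  P(A) = %.3f\n', gamma, pA);
end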
[Figure 7.8 comprises three panels: (A) the feedback probability P(A|luminance) for stimuli 1 through 8; (B) representative GRT predictions, with boundaries β0 and β1 marked; and (C) representative exemplar (DEM) predictions. In panels B and C, Response Probability is plotted against Luminance.]
Figure 7.8 Experimental structure and model predictions for Rouder and Ratcliff’s (2004)
probabilistic categorization experiment. The top panel shows the proportion of times each
stimulus value was designated as belonging to Category A by experimental feedback; the
integers above the plot number the stimuli from 1 to 8. The middle panel shows repre-
sentative predictions from GRT under the assumption that a boundary is placed wherever
the membership probability crosses 0.5. The bottom panel shows representative exemplar
model predictions, from DEM. Figure reprinted from Rouder, J. N., & Ratcliff, R. (2004).
Comparing categorization models. Journal of Experimental Psychology: General, 133,
63–82. Published by the American Psychological Association; reprinted with permission.
increasing as one moves away from the dip in either direction. In contrast, the
bottom panel shows that an exemplar model (DEM) will predict a more compli-
cated function. Although the same dip occurs for stimuli 5 and 6, there is a second
dip at the bottom end of the stimulus space.
Why do the two models make different predictions? Exemplar theory predicts
that response probabilities should track the feedback probabilities. As we move
down the stimulus dimension in the lower half of the space (i.e., from stimulus
4 down to stimulus 1), the proportion of exemplars in memory that belong to
Category A drops from 1.0 to 0.6. This means that the summed similarity for Cat-
egory A will also drop, and since the summed similarity feeds directly into the
predicted responses via Equation 7.4 or Equation 7.9, the response proportions
are also expected to drop. The extent of this drop will depend on the parameter
settings and on the particular model: GCM will show a strong tendency to track
the feedback probabilities, while DEM, with its ability to respond deterministi-
cally, may show responding that is more deterministic (Farrell et al., 2006). GRT,
by contrast, predicts that the farther a stimulus falls from a category boundary, the more
likely it is to be classified as belonging to the category on that side of the boundary. Although
the membership probability decreases as we move to the left of the top panel of
Figure 7.8, all GRT is concerned with is the placement of the boundaries; as we
move to the left, we move away from the boundary, and the predicted probabil-
ity of classifying the stimuli as ‘A’s necessarily increases in a monotonic fashion.
A defining characteristic of GRT is that it cannot predict non-monotonicity in
response proportions as the absolute distance of a stimulus from the boundary
increases.
Figure 7.9 shows the results from the last three sessions for the six participants
from Rouder and Ratcliff’s (2004) experiment, along with the feedback probabil-
ities reproduced from Figure 7.8 (in gray). The participants in the left and middle
columns of Figure 7.9 (SB, SEH, BG, and NC) qualitatively match the predic-
tions of GRT, particularly in the monotonic shape of their response probability
functions in the left-hand side of the stimulus space. Participant VB (top right
panel) behaves more or less as predicted by GRT, but with a small downturn for
the smallest luminances as predicted by exemplar theory. Finally, participant LT
(bottom right panel) behaves as predicted by exemplar theory, with a very clear
downturn in ‘A’ responses.
Following Rouder and Ratcliff (2004), we next apply exemplar and decision-
bound models to the data from the six participants shown in Figure 7.9. We will
use maximum likelihood estimation (MLE) to fit the models to the data and obtain
standard errors on the parameter estimates for each participant. Finally, we will
use AIC and BIC to compare the models on their account of the data.
[Figure 7.9 comprises six panels, one per participant (SB, SEH, VB, BG, NC, LT), each plotting P(A) against Luminance.]
Figure 7.9 Proportion of trials on which stimuli were categorized as belonging to Cat-
egory A, for six participants (separate panels). Feedback probabilities from the training
session are shown in gray. Figure adapted from Rouder, J. N., & Ratcliff, R. (2004). Com-
paring categorization models. Journal of Experimental Psychology: General, 133, 63–82.
Published by the American Psychological Association; reprinted with permission.
Listing 7.6 provides the code for calculation of the negative log-likelihood for
GCM and DEM. Recall that GCM is simply the DEM model with γ set to 1.
Accordingly, we can use the DEM code to run both GCM and DEM, as long as
we remember to set γ equal to 1 when running GCM. The first several lines are
comments to tell us what the input arguments are. The first input argument is the
parameter vector theta, which contains c as the first element (line 9) and γ as
the second element (line 10). The second argument x is a vector containing the
luminance values for the eight possible stimuli. The third argument feedback
specifies how many times each stimulus was designated to be an ‘A.’ Specifically,
feedback has two columns, one giving the number of times A was given as
feedback for each stimulus and a second column giving the number of times B
was designated as feedback. The fourth argument data contains the frequencies
of A responses from the participant for each stimulus, and the vector N contains
the total number of trials per stimulus in a vector.
The GCM/DEM calculations are contained in lines 12 to 17. The loop goes
through each of the stimuli, i, and for each stimulus calculates the predicted pro-
portion of ‘A’ responses, predP(i). Line 13 instantiates Equation 7.3, using the
simple absolute distance measure in Equation 7.2. Line 13 is in a vectorized form:
It calculates the similarity between each x_j and x_i in one go and returns a vector of
similarities as a result. Lines 14 and 15 calculate summed similarity values. Rather
than considering each exemplar individually, lines 14 and 15 take advantage of the
design of the experiment in which there are only eight possible stimulus values.
Since identical stimulus values will have identical similarity to x_i, we can cal-
culate the summed similarity between x_i and all the exemplars with a particular
value x_j by simply multiplying s_j by the number of exemplars with that particular
value of x_j. Again, we don’t refer explicitly to s_j in the code because the code is
vectorized and processes all js at once. Line 16 feeds the summed similarity val-
ues into the Luce choice rule with γ -scaling to give a predicted proportion correct
(Equation 7.9).
The final line in Listing 7.6 calculates the negative log-likelihood based on
the predicted proportions and observed frequencies. Do you recognize the form
of line 19? If not, turn to Section 5.4. Otherwise, right! It’s the binomial log-
likelihood function we discussed in Chapter 4 and that we used to obtain probabil-
ities from SIMPLE in Chapter 5. The function takes the observed frequencies, the
total number of trials, and the predicted proportions to calculate a log-likelihood
value for each stimulus (in vectorized form). We then sum the ln L values and
take the negative so we can minimize the negative log-likelihood.
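In skeleton form, and consistent with both the description above and the way the function
is called in Listing 7.8 (returning the negative log-likelihood and, optionally, the predicted
proportions), the function looks something like the following sketch; the internal variable
names are assumptions.

function [lnL, predP] = DEMlnL(theta, x, feedback, data, N)
% theta(1) = c, theta(2) = gamma; x = stimulus values;
% feedback = 2 x nStim matrix of 'A' and 'B' feedback counts;
% data = observed 'A' frequencies; N = trials per stimulus

c = theta(1);
gamma = theta(2);

predP = zeros(size(x));
for i = 1:length(x)
    s = exp(-c .* abs(x - x(i)));                 % Equations 7.2 and 7.3 (vectorized)
    sumA = sum(feedback(1,:) .* s);               % summed similarity, Category A
    sumB = sum(feedback(2,:) .* s);               % summed similarity, Category B
    predP(i) = sumA^gamma / (sumA^gamma + sumB^gamma);   % Equation 7.9
end

% binomial log-likelihood (simplified form), summed and negated
lnL = -sum(data .* log(predP) + (N - data) .* log(1 - predP));
end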
One issue that deserves mention is that we are using the simplified version of
the binomial log-likelihood function and have omitted some of the terms from the
full function (see Section 4.4). This has implications for calculation of AIC and
BIC, as discussed in Chapter 5. This won’t be a problem here, as we will use the
same binomial data model for all three categorization models and will therefore
be subtracting the same constant from AIC and BIC values for all three models.
Finally, to avoid confusion, it should be noted that Rouder and Ratcliff (2004)
fit DEM to their data but called it GCM. This is because a number of authors treat
DEM not as a model separate from GCM but as a version of GCM with response
scaling. The differentiation between GCM and DEM is maintained here because
of the possible importance of the decision process in the probabilistic nature of
this experiment; in many circumstances, the nature of the response process will
be a side issue separate from the main theoretical issue of concern.
7.2.3.2 GRT
The normal CDF, Φ, can be obtained from MATLAB’s built-in error function, erf, via

\Phi(x) = \frac{1}{2}\left[1 + \mathrm{erf}\left(\frac{x}{\sqrt{2}}\right)\right] .   (7.10)
This relationship is used to obtain the normal CDF in the embedded function
normalCDF and was also used in Chapter 5 for the ex-Gaussian model, which
also contains the normal CDF.8 Line 21 adds the two areas a1 and a2 together
to obtain predicted proportions, and line 23 implements the same binomial data
function as was used in the GCM code. Notice that lines 19 to 21 are written in a
vectorized format: At each step, the result is a vector whose elements correspond
to specific stimulus values.
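The GRT likelihood function itself is not visible in the material above; a sketch that is
consistent with the surrounding description, with the three parameters (β1, β2, σ) implied
by the starting values and bounds in Listing 7.8, and with Equation 7.10, might look like
this. The variable names are assumptions.

function [lnL, predP] = GRTlnL(theta, x, data, N)
% theta = [beta1 beta2 sigma]; x = stimulus values;
% data = observed 'A' frequencies; N = trials per stimulus

beta1 = theta(1);
beta2 = theta(2);
sigma = theta(3);

a1 = normalCDF((beta1 - x) ./ sigma);       % area below beta1 (Equation 7.7)
a2 = 1 - normalCDF((beta2 - x) ./ sigma);   % area above beta2 (Equation 7.8)
predP = a1 + a2;                            % predicted P('A') for each stimulus

% binomial log-likelihood (simplified form), summed and negated
lnL = -sum(data .* log(predP) + (N - data) .* log(1 - predP));

    % normal CDF obtained from the error function (Equation 7.10)
    function p = normalCDF(z)
        p = 0.5 .* (1 + erf(z ./ sqrt(2)));
    end
end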
Each of these likelihood functions thus returns the negative log-likelihood of
the data and the predicted proportion correct for each stimulus value given the
parameters. This is sufficient to carry out parameter estimation and model selec-
tion. Listing 7.8 provides a script to fit each of the models and provides statistics
for model comparison and further interpretation.
7.2.4.1 Setting Up
The first thing the script does is to assign the proportion of ‘A’ responses from
the six participants to the variable dataP. The data are structured so that the rows
correspond to participants (in the same order as in Figure 7.9), and each column
corresponds to a stimulus i.
Listing 7.8 Code for MLE and Model Selection for the Categorization Models
 1 % script catModels
 2 clear all
 3 close all
 4
 5 dataP = [0.75, 0.67, 0.54, 0.4, 0.4, 0.37, 0.58, 0.71;
 6          0.92, 0.81, 0.53, 0.28, 0.14, 0.22, 0.45, 0.81;
 7          0.91, 0.97, 0.93, 0.64, 0.28, 0.09, 0.12, 0.7;
 8          0.98, 0.94, 0.85, 0.62, 0.2, 0.037, 0.078, 0.71;
 9          0.97, 0.94, 0.8, 0.58, 0.4, 0.45, 0.81, 0.97;
10          0.29, 0.66, 0.85, 0.71, 0.33, 0.1, 0.32, 0.77];
11
12 % number sessions x 10 blocks x 96 trials / (n stimuli)
13 Ntrain = ((5*10*96)/8);
14 pfeedback = [.6 .6 1 1 0 0 .6 .6];
15 Afeedback = pfeedback .* Ntrain;
16 feedback = [Afeedback; Ntrain-Afeedback];
17
18 Ntest = ((3*10*96)/8);
19 N = repmat(Ntest, 1, 8);
20
21 dataF = ceil(Ntest .* (dataP));
22
23 stimval = linspace(.0625, .9375, 8);
24
25 %% Maximum likelihood estimation
26 for modelToFit = {'GCM', 'GRT', 'DEM'};
27
28     for ppt = 1:6
29         switch char(modelToFit)
30             case 'GCM'
31                 f = @(pars) DEMlnL([pars 1], stimval, feedback, dataF(ppt,:), N);
32                 [theta(ppt,:), lnL(ppt), exitflag(ppt)] ...
33                     = fminbnd(f, 0, 100);
34             case 'GRT'
35                 f = @(pars) GRTlnL(pars, stimval, dataF(ppt,:), N);
36                 [theta(ppt,:), lnL(ppt), exitflag(ppt)] ...
37                     = fminsearchbnd(f, [.3 .7 .1], [-1 -1 eps], [2 2 10]);
38             case 'DEM'
39                 f = @(pars) DEMlnL(pars, stimval, feedback, dataF(ppt,:), N);
40                 [theta(ppt,:), lnL(ppt), exitflag(ppt)] ...
41                     = fminsearchbnd(f, [5 1], [0 0], [Inf Inf]);
42             otherwise
43                 error('Unknown model');
44         end
45
46         [junk, predP(ppt,:)] = f(theta(ppt,:));
47
48         hess = hessian(f, theta(ppt,:), 10^-3);
49         cov = inv(hess);
50         thetaSE(ppt,:) = sqrt(diag(cov));
51     end
52
53     figure
54     pptLab = {'SB', 'SEH', 'VB', 'BG', 'NV', 'LT'};
55
56     for ppt = 1:6
57         subplot(2, 3, ppt);
58         plot(stimval, dataP(ppt,:), '-+');
59         hold all
60         plot(stimval, predP(ppt,:), '-.*');
61         ylim([0 1]);
62         xlabel('Luminance');
63         ylabel('P(A)');
64         title(char(pptLab{ppt}));
65     end
66     set(gcf, 'Name', char(modelToFit));
67
68     t.theta = theta;
69     t.thetaSE = thetaSE;
70     t.nlnL = lnL;
71     eval([char(modelToFit) ' = t;']);
72     clear theta thetaSE
73 end
74
75
76 for ppt = 1:6
77     [AIC(ppt,:), BIC(ppt,:), AICd(ppt,:), BICd(ppt,:), AICw(ppt,:), BICw(ppt,:)] = ...
78         infoCriteria([GCM.nlnL(ppt) GRT.nlnL(ppt) DEM.nlnL(ppt)], [1 3 2], repmat(Ntest*8, 1, 3));
79 end
mapping physical luminance to perceived luminance and found all gave compa-
rable results, including a function assuming a linear relationship between actual
and perceived luminance. In providing the physical luminance values as perceived
luminance, we are assuming a deterministic (i.e., noise-free) linear relationship
between physical and perceived luminance.
The loop beginning at line 26 and finishing at line 73 does the hard graft of fit-
ting each model to the data from each participant. We start off by setting up a
loop across the models; each time we pass through the loop, we will be fitting
a different model whose name will be contained in the variable modelToFit
as a string. Within that loop is another loop beginning at line 28, which loops
across participants, as we are fitting each participant’s data individually, indexed
by the loop variable ppt. The next line looks at modelToFit to work out which
model’s code to use in the fitting. This uses a switch statement, which matches
char(modelToFit) to a number of possible cases ('GCM', 'GRT', 'DEM'), and
otherwise reverts to a default catch statement, which here returns an error mes-
sage. The switch statement uses char(modelToFit) rather than modelToFit
directly because the set of values for modelToFit is provided in a cell array
(being enclosed in curly braces), so the current value of modelToFit must be
turned into a string using the function char before it can be matched to the string
in each case statement.
Within each possible case, there are two lines. The first line (e.g., line 31)
constructs an anonymous function (see the MATLAB documentation for further
details). Anonymous functions are functions that are adaptively created on the fly
and do not require their own function files in MATLAB. We are using anonymous
functions here because, as we will see, we end up calling the same function several
times. By using the anonymous function, we will simplify our code and make it
easier to read. The main purpose of line 31 is to create a new on-the-fly function
called f that takes only a single argument, a parameter vector called pars. Inside
this function, all that happens is that we call DEMlnL and pass it pars, along with
the other information needed by DEMlnL. Remember that GCM is DEM with γ
fixed at 1; the first argument to DEM is therefore a vector joining together the free
parameter c and the fixed value of 1 for γ . Going from left to right, we next have
stimval, the luminance of each stimulus; feedback, a
matrix containing one row for the number of times ‘A’ was given as feedback
and a second row containing the number of times ‘B’ was given as feedback; the
frequency of ‘A’ responses for the participant of current interest; and N, a vector
specifying how many trials in total were tested for each stimulus. No feedback
information is provided to the anonymous function for GRT because GRT does
274 Computational Modeling in Cognition
not require this argument. All this information from stimval onwards doesn’t
change within the participant loop; by using the anonymous function, we can
specify it the one time and ignore it for the rest of the code, focusing instead on
the parameter vector, which will need to change during the parameter estimation.
The second line within each case passes the anonymous function f to a
minimization routine to find the ML parameter estimates (remember, we are min-
imizing the negative log-likelihood). GCM has only a single parameter to be esti-
mated (c); accordingly, we use the built-in MATLAB function fminbnd, which
performs unidimensional function minimization. This function does not need a
starting point and only requires the minimum and maximum possible values for
the parameter; we’ve set these to 0 and 100, respectively. For GRT and DEM, the
models incorporating at least two free parameters, we use the fminsearchbnd
function that was introduced in Section 7.1. This function requires a starting
vector and vectors of lower and upper bounds on the parameters. These can be
inspected in lines 37 and 41.
The result of what we’ve discussed so far is that for a given model (indexed
by modelToFit), the code loops across participants, and for each participant,
the anonymous function is reconstructed and passed to code that minimizes the
appropriate negative log-likelihood function (contained in f) using a minimization
routine. When each minimization attempt is finished, the ML parameter estimates,
the minimized − ln L, and the exit flag (giving information about whether or not
the minimization was successful) are respectively placed in variables theta, lnL,
and exitflag.
Having found the ML parameter estimates, line 46 runs the model a final time
to obtain its predictions under the MLEs; these predicted proportions of ‘A’
responses are stored in the matrix predP.
After ML estimation has been performed, the next section of code (lines 48–50)
finds the standard errors on the parameter estimates. Because we rewrote each
log-likelihood function into the anonymous function f, we can use the same code
for all three models. Line 48 passes the anonymous function and MLEs to the
hessian function provided in Chapter 5, along with the δ parameter required by
that function. The following two lines convert the resulting Hessian matrix hess
into standard errors by taking the matrix inverse of the Hessian matrix to obtain a
covariance matrix and then taking the square root of the values along the diagonal
of this matrix to obtain the standard error on the ML parameter estimates.
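In code, these three steps might look like the following sketch; the argument order of hessian and the value of δ are assumptions to be checked against the Chapter 5 listing.

% Sketch of lines 48-50: Hessian of -lnL at the MLEs -> covariance -> SEs.
delta  = 1e-3;                     % step size for the numerical Hessian (assumed value)
hess   = hessian(f, theta, delta); % numerical Hessian of the negative log-likelihood
covMat = inv(hess);                % inverse Hessian approximates the covariance matrix
se     = sqrt(diag(covMat));       % standard errors of the ML parameter estimates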
The ML parameter estimates and their estimated standard errors are given in
Tables 7.4, 7.5, and 7.6 for GCM, GRT, and DEM, respectively. We will refer
back to these after discussing the model comparison results.
Table 7.4 ML Parameter Estimates and Associated Standard Errors for the GCM Fits to
the Data in Figure 7.9
Participant    c    SE(c)
SB 2.29 0.36
SEH 5.29 0.32
VB 9.32 0.38
BG 9.43 0.38
NV 5.91 0.36
LT 10.28 0.41
Table 7.5 ML Parameter Estimates and Associated Standard Errors for the GRT Fits to
the Data in Figure 7.9
Table 7.6 ML Parameter Estimates and Associated Standard Errors for the DEM Fits to
the Data in Figure 7.9
Figure 7.10 Proportion of ‘A’ responses predicted by GCM under ML parameter estimates, shown in six panels (participants SB, SEH, VB, BG, NV, and LT) that plot P(A) against luminance. The data are plotted as crosses connected by solid lines, and the model predictions are plotted as asterisks connected by dashed lines.
The GCM predictions in Figure 7.10 capture the data poorly for most participants; most strikingly, the model misses the data of VB. The one participant for whom GCM appears to give a rea-
sonable fit is participant LT in the bottom-right panel, who showed a very clear
drop in ‘A’ responses with lower luminance values (stimuli 1 and 2). The predic-
tions of GRT, shown in Figure 7.11, are much more in line with the data. The
one exception is participant LT, for whom GRT predicts a nearly flat function that
does little to capture the large changes in responses across luminance values for
that participant. Finally, the predictions of DEM are shown in Figure 7.12. These
predictions are similar to those of GCM (Figure 7.10), with DEM giving visu-
ally better fits in some cases (VB and BG). DEM generally appears to be inferior
to GRT, with two exceptions. For participant VB, GRT and DEM appear to give
equally good fits, and for participant LT, GRT is clearly inferior to DEM.
Figure 7.11 Proportion of ‘A’ responses predicted by GRT under ML parameter estimates, shown in six panels (participants SB, SEH, VB, BG, NV, and LT) that plot P(A) against luminance.
Of course, the apparently superior fit of GRT may simply follow from the fact that it has more free parameters than both GCM
and DEM. Similarly, DEM appears to give a better account of the data than GCM,
but to what extent is this due to its extra parameter γ picking up nonsystematic
residual variance?
Lines 68 to 72 prepare the results of the model fitting for model comparison
by storing the results of each model’s fit in a separate structure. Line 71 uses the
eval function to assign the modeling results, collected in the temporary structure t
in the preceding few lines, to a structure named either GCM, GRT, or DEM. Line 72
deletes a few variables because they differ in size between the different models
and would return an error otherwise.
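A minimal sketch of this bookkeeping step is shown below; the field names of the temporary structure t are illustrative rather than those of Listing 7.8.

% Collect this model's results in a temporary structure and then rename it.
t.theta = theta;                       % ML parameter estimates (illustrative fields)
t.lnL   = lnL;                         % minimized negative log-likelihoods
t.predP = predP;                       % predictions under the MLEs
eval([char(modelToFit) ' = t;']);      % creates a structure named GCM, GRT, or DEM
clear t theta lnL predP                % remove variables that differ in size across models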
The last part of Listing 7.8 calls the function infoCriteria from Chapter 5
to obtain AIC and BIC values, AIC and BIC differences, and model weights for
each participant. The AIC and BIC differences calculated by this code are shown
in Table 7.7, and the corresponding model weights are shown in Table 7.8. For
participants SB, SEH, BG, and NV, the model differences and model weights
bear out the superior fit of the GRT suggested in the figures, with the model
Figure 7.12 Proportion of ‘A’ responses predicted by DEM under ML parameter estimates, shown in six panels (participants SB, SEH, VB, BG, NV, and LT) that plot P(A) against luminance.
weights for GRT indistinguishable from 1 for these participants. The information
criterion results for VB are more informative. Recall that the fits for GRT and
DEM looked similar in quality for VB. Tables 7.7 and 7.8 show that GRT nonethe-
less gives a statistically superior fit to VB’s data, even when the extra parameter
in the GRT is accounted for via the correction term in the AIC and BIC. Finally,
Tables 7.7 and 7.8 show that LT’s data are better fit by an exemplar model, par-
ticularly the DEM. Not only does DEM provide a superior fit to GRT in this
case, but the information criteria point to DEM also being superior to GCM in
its account of the data, indicating that the extra parameter γ is important for the
exemplar model’s account for this participant. Inspection of the ML parameter
estimates in Table 7.6 is instructive for interpreting DEM’s success. The row for
LT shows that the estimated c was very large compared to the other participants
and that the γ parameter was below 1. The large c for DEM (and, indeed, GCM, as
shown in Table 7.4) means that stimuli only really match exemplars with the same
luminance in memory, leading the model to strongly track the feedback probabil-
ities. However, the γ < 1 means that the model undermatches those feedback
Table 7.7 AIC and BIC Differences Between the Three Models GCM, GRT, and DEM
AIC BIC
GCM GRT DEM GCM GRT DEM
Table 7.8 AIC and BIC Weights for the Three Models GCM, GRT, and DEM
AIC BIC
GCM GRT DEM GCM GRT DEM
probabilities, which pulls the predicted proportion of ‘A’ responses back to the
baseline of 0.5.
What do these results tell us about the processes used to perform this task? The most obvious conclusion based on the mod-
eling results is that LT’s performance is supported by matching to stored exem-
plars. However, that conclusion may be premature. Although the results corre-
spond more to the ML predictions of GCM and DEM, notice that although LT’s
observed probability of giving an ‘A’ response changes dramatically between
stimuli 1 and 2 (and, indeed, between 7 and 8), GCM and DEM’s predictions
change little between the same stimuli. Arguably, GCM and DEM are still failing
to capture an important aspect of these data. Rouder and Ratcliff (2004) noted
another possible explanation for these results: that LT was relying on boundaries
(as in GRT) to categorize the stimuli but had inserted a third boundary somewhere
around the lowest luminance stimulus despite this being a nonoptimal approach.
Rouder and Ratcliff (2004) fit GRT with three boundaries to the data of LT and
found it to give the best fit of all models and that it was able to produce the large
change between stimuli 1 and 2 seen in LT’s data. We leave implementation of that
model as an exercise for you, but note that this reinforces the overall consistency
between GRT and the data.
What does it mean to say that GRT is the “best” model? The main message
is that under conditions of uncertainty, and with stimuli varying along a single
dimension, people partition the stimulus space using boundaries and use those
boundaries to categorize stimuli observed later on. We know it isn’t simply that
GRT assumes deterministic responding: The DEM model, an exemplar-based
model with deterministic responding, failed to compete with the GRT in most
cases.
One interesting note to sign off on is that hybrid models have become increas-
ingly popular in accounting for categorization. For example, the RULEX model of
Nosofsky and Palmeri (1998) assumes that people use rules to categorize stimuli
but store category exceptions as exemplars. The COVIS model of Ashby
et al. (1998) assumes that participants can use both explicit rules and implicit
representations learned procedurally to categorize stimuli. Rouder and Ratcliff
(2004) found that in some circumstances, their participants appeared to use exem-
plars to perform similar probabilistic categorization tasks and shifted between
using boundaries and exemplars depending on the discriminability of the stimuli.
This fits well with another hybrid model, ATRIUM (Erickson & Kruschke, 1998),
in which rules and exemplars compete to categorize and learn each new object.
ATRIUM has been shown to account for a variety of categorization experiments,
particularly ones where participants appear to switch between strategies or mech-
anisms (e.g., Yang & Lewandowsky, 2004). Nonhybrid models of the types con-
sidered here are still useful for determining what type of model is predominantly
in use in a particular task; nonetheless, a comprehensive model of categorization
will need to explain how participants appear to be using quite different
representations depending on the nature of the task and stimuli (e.g., Ashby &
O’Brien, 2005; Erickson & Kruschke, 1998).
7.3 Conclusion
We have presented two detailed examples that build on the material from the
preceding chapters. If you were able to follow those examples and understand
what we did and why, then you have now gathered a very solid foundation in the
techniques of computational and mathematical modeling.
Concerning future options for exploring the implementation and fitting of
models, two directions for further study immediately come to mind: First, we sug-
gest you explore Bayesian approaches to modeling and model selection. Bayesian
techniques have recently become prominent in the field, and the material in this
book forms a natural and solid foundation for further study in that area. Second,
we suggest you investigate hierarchical (or multilevel) techniques. We touched on
those techniques in Chapter 3 but were unable to explore them further in this vol-
ume. Many good introductions to both issues exist, and one particularly concise
summary can be found in Shiffrin, Lee, Kim, and Wagenmakers (2008).
We next present a final chapter that moves away from the techniques of model-
ing and considers several general frameworks that have been adopted by modelers
to explain psychological processes.
Notes
1. Note that our notation and symbology correspond to that used by Clare and
Lewandowsky (2004) rather than by Clark (2003).
2. The fact that each replication involves different stimulus vectors (in addition to encod-
ing of a different subset of features of the perpetrator) also implies that each replication
involves a different set of faces in the lineup. This is desirable because it generalizes the
simulation results across possible stimuli, but it does not reflect the procedure of most
experiments in which a single lineup is used for all participants.
3. There are some minor differences between the simulations reported by Clare and
Lewandowsky (2004) and those developed here, which were introduced to make the present
example simple and transparent. The parameter estimates and model predictions reported
here thus differ slightly (though not substantially) from those reported by Clare and
Lewandowsky (2004).
4. In fact, we also select a method by which random numbers are selected, but we ignore
this subtlety here for brevity.
5. If the number of replications is small, reseeding the random generator carries the risk
that the model may capitalize on some chance idiosyncrasy in the random sequence.
8
Modeling in a Broader Context
Earlier in this book, in Section 2.6, we considered what it actually means to have
an “explanation” for something. We discovered that explanations exist not in a
vacuum but within a psychological context; that is, they must be understood in
order to have any value (if this sounds mysterious, you may wish to refer back
to Section 2.6.3 before proceeding). We then introduced material in the next five
chapters that is required for the construction of satisfactory psychological expla-
nations. Most of our discussion, including the detailed examples in the preceding
chapter, involved models that were formulated to provide a process explanation
(see Section 1.4.4).
The purpose of this final chapter is to present several additional avenues to
modeling that inherit most of the techniques we have presented but that formu-
late explanations at different levels. We begin by covering Bayesian approaches,
which assume that human cognition is a mirror of the environment. We then turn
to neural networks (also known as connectionist models), which provide a pro-
cess explanation of cognition with the added claim of neural plausibility. We next
briefly touch on theorizing that is explicitly “neuroscientific” and that strongly
appeals to the brain’s architecture or indicators of its functioning for validation.
In a final section, we introduce cognitive architectures: large-scale models that
are best understood as a broad organizing framework within which more detailed
models of cognition can be developed.
Each of the sections provides a thumbnail sketch of the approach and, where
possible, links to material presented in the earlier chapters. However, we cannot
possibly do justice to the complexity and richness of each approach in a single
chapter, and we therefore conclude each section with pointers to the relevant
literature for follow-up research.
p(ttotal|t) ∝ p(t|ttotal) p(ttotal).     (8.1)

In this equation, which has been reproduced from Chapter 2 for convenience,
p(ttotal ) is the actual prior distribution of total times, as, for example, observable
from a historical database that records the reigns of Egyptian pharaohs. Supposing
that people have access to this prior distribution from their lifelong experience
with the world, in the absence of any other information, it can be used to make a
fairly good guess about how long an event from the same family (e.g., runtime of
a movie or reign of a pharaoh) would be expected to last.
However, in the example from Lewandowsky et al. (2009), participants had
some additional information—namely, the amount of time that the event in ques-
tion had already been running. Lewandowsky et al. assumed that participants treat
this information as the likelihood p(t|ttotal )—that is, the probability of observing
an event after t time units given that it runs for ttotal units altogether. This likeli-
hood was assumed to follow a uniform distribution between 0 and ttotal , reflecting
the assumption that there is a constant probability with which one encounters an
event during its runtime.
As briefly discussed in Chapter 4, the magic of Bayes’ theorem is that it allows
us (as statisticians and as organisms behaving in the world) to reverse the direction
of a conditional. In Lewandowsky et al.’s (2009) case, it provides the machinery
to turn the probability of t given ttotal , provided by the stimulus value (t) in the
experiment, into the probability of ttotal given t, the prediction that participants
were asked to make.
The conceptual framework usually adopted in Bayesian theory is that the
prior distribution is updated by the likelihood [the incoming evidence, in this
case p(t|ttotal )] to obtain a posterior distribution that represents our expectations
about the environment in light of the newly obtained information. Be clear that
p(ttotal |t) is a posterior distribution: A person’s expectations about the future
are represented by an assignment of probability to all possible states of the
environment—or, in the case of a continuous variable such as time, by a prob-
ability density function.
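As a numerical illustration of this prior-times-likelihood logic, the sketch below evaluates the posterior on a grid of candidate ttotal values. The power-law prior is an arbitrary stand-in for the empirical prior, and the uniform likelihood gives p(t|ttotal) = 1/ttotal whenever ttotal ≥ t.

% Grid-based sketch of the posterior p(ttotal|t); the prior here is an
% illustrative assumption, not the empirical distribution used in the studies.
tGrid = 1:200;                         % candidate values of ttotal
prior = tGrid.^(-1.5);                 % assumed power-law prior p(ttotal)
prior = prior / sum(prior);
t     = 30;                            % the event has already run for t units
like  = (tGrid >= t) ./ tGrid;         % uniform likelihood: 1/ttotal for ttotal >= t, else 0
post  = prior .* like;                 % Bayes' theorem, up to a normalizing constant
post  = post / sum(post);              % posterior p(ttotal|t)
predictedTotal = sum(tGrid .* post)    % e.g., the posterior mean as a prediction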
Using Bayes’ theorem as per Equation 8.1, Lewandowsky et al. (2009) were
able to obtain people’s posterior distribution p(ttotal |t). Moreover, using a proce-
dure called iterated learning, in which a person’s responses are fed back to him
or her later in the experiment as stimuli, Lewandowsky et al. were able to exploit
the fact that after many such iterated learning trials, people’s final posterior distri-
bution converges onto their prior distribution. That is, Griffiths and Kalish (2007)
showed mathematically that in an iterated learning procedure, people’s estimates
ultimately converge onto their prior distribution—that is, their preexperimental
knowledge of the quantities under consideration. In a nutshell, this arises because
each prediction results from a combination of the prior probability and a likeli-
hood, where the likelihood is obtained from the presented stimulus. If responses
are then re-presented as stimuli later in the experiment, the posterior distribution
comes to resemble the prior distribution more and more because the impact of
specific stimuli “washes out.” As illustrated in Figure 2.6, Lewandowsky et al.
(2009) found that the actual prior probability of quantities in the environment
(on the ordinate) closely matched the prior distributions apparently used by peo-
ple (the “Stationary” distribution on the abscissa), which were obtained using the
iterated learning procedure just described.
One concept often associated with Bayesian theory is that of optimality: Indi-
viduals’ perception (e.g., Alais & Burr, 2004; Weiss, Simoncelli, & Adelson,
2002; Yuille & Kersten, 2006), cognition (Lewandowsky et al., 2009; Norris,
2006; Tenenbaum, Griffiths, & Kemp, 2006), and action (e.g., Körding & Wolpert,
2006) are adapted to their environment, and their predictions or expectations about
the environment are in some sense optimal. Bayesian theory does indeed provide
an unambiguous definition of optimality; if the prior distribution and the likeli-
hood are both known, and if the posterior distribution matches the predictions of
people in that same environment, then there is a firm basis for concluding that
those participants are acting optimally. In some areas, particularly perception, the
Bayesian model serves as an “ideal observer” model that describes the best possi-
ble performance on a task in light of the inherent uncertainty in the environment,
or in the “noisiness” of perception. In the example by Lewandowsky et al. (2009)
just discussed, the prior distribution was independently known (e.g., by examin-
ing historical data about Egyptian pharaohs), and hence there is little ambiguity
in interpreting the outcome as showing that “whatever people do is equivalent to
use of the correct prior and sampling from the posterior” (p. 991).
In other cases, however, the model describing the likelihood function might
only be assumed, or the prior distribution might be specified by the researcher
without necessarily measuring the environment. In these cases, any claims of opti-
mality are less grounded: Correspondence of the model’s predictions with human
behavior may be down to the specification of the prior or the likelihood, which
are effectively free parameters in the model. An additional implication is that opti-
mality is only expressed with respect to a given prior distribution and likelihood
function, meaning that our definition of optimality may change depending on how
we conceptualize a task or environment (Courville, Daw, & Touretzky, 2006). For
example, one feature of many Bayesian theories is that they assume the world
to be static, meaning that each previous observation is given equal weighting in
making future predictions. However, this model will not be optimal if the world
is changing (as in the J. R. Anderson and Schooler example above), and the opti-
mal model will instead be one that explicitly recognizes the temporal uncertainty
in the environment. One outstanding issue is whether Bayesian theories provide
enough detail to describe theories at a process level (see Section 1.4.4).
Perhaps a better way to think about Bayesian theory is that it describes cogni-
tion at a process characterization level (see Section 1.4.3), by identifying general
lawful properties of human behavior without describing, say, the memory pro-
cesses that represent the prior distribution. Having said that, Bayesian concepts
can be incorporated into models that provide a process explanation for various
cognitive phenomena (e.g., J. R. Anderson, 1996; Dennis & Humphreys, 2001;
J. McClelland & Chappell, 1998; Norris, 2006; Shiffrin & Steyvers, 1997).
Although Bayesian theories are relatively new in psychology, there is no ques-
tion that they are gaining in popularity and that they are here to stay. Our brief dis-
cussion of the Bayesian approach barely scratches the surface of the considerable
promise of the approach. To gain further information about these models, good
places to start are the recent special issues of Trends in Cognitive Sciences (Chater,
Tenenbaum, & Yuille, 2006c) and Developmental Science (e.g., Shultz, 2007).
Figure 8.1 An example of a neural network model, with a layer of input units connected to a layer of output units. Each circle is a processing unit (“neuron”), and the arrowhead lines between them are the weighted connections. Also schematically depicted in the figure are the activation values of the units: The more activated a unit, the more it is filled with dark shading.
Δw_ij = η f_i g_j ,     (8.2)

where w_ij is the weight between input unit f_i and output unit g_j, and the Δ
indicates that we are updating the weight with whatever is on the right-hand side
of the equation. This turns out simply to be the product of the activations of the
input unit and output unit, scaled by a parameter η that controls the strength of
learning.
It should be clear how this is analogous to Hebb’s verbal proposal: If fi and
g j are both large and positive, then the weight between them will increase pro-
portionally to both units’ activation.
By thus using the weights to track the co-firing of input and output units,
Hebbian learning provides a mechanism for learning associations between whole
items. If an activation pattern is presented across the input units (vector f, repre-
sented by elements fi in the above equation) and a separate activation pattern pre-
sented across the output units (g), then adjusting the weights amounts to learning
an association between those two activation patterns. In other words, the mem-
ory in the network is in the weights! Critically, the weights by themselves do not
mean or do anything: Memory in a neural network is the effect of past experience
on future performance via the weights, which determine how activation is passed
from the input units to the output units. In addition, and in departure from alterna-
tive viewpoints such as the cognitive architectures discussed below, there are no
separate memories: All associations are stored in the one weight matrix, such that
memories are superimposed in this “composite” storage. That is, the learning rule
in Equation 8.2 applies irrespective of the identity of items f and g, and the same
weight wi j may be updated multiple times by different pairs of items.
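In MATLAB, Equation 8.2 amounts to adding an outer product to the weight matrix, as in the following sketch; the vector lengths and the value of η are arbitrary illustrations.

% Hebbian learning as an outer product: every w_ij is incremented by eta*f_i*g_j.
nIn = 10; nOut = 10;
W   = zeros(nOut, nIn);                % weights from input units to output units
eta = 0.5;                             % learning-rate parameter
f   = randn(nIn, 1);  f = f / norm(f); % activation pattern across the input units
g   = randn(nOut, 1); g = g / norm(g); % activation pattern across the output units
W   = W + eta * (g * f');              % Equation 8.2 applied to all weights at once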
The effects of changing the weights can be seen by considering how a Hebbian
network retrieves information. Let us imagine that the network has learned several
input-output pairs and that we now present some input pattern as a cue and ask
the network for a response. That is, we ask the network, What output pattern was
paired with this input pattern? To do so, we present a cue, c, and activate the input
units accordingly. The activation of output unit j in this test situation, r j , is then
simply the sum of each input activation multiplied by the weight between that
input unit and the output unit j:
r_j = Σ_i c_i w_ij .     (8.3)
The effects of learning become apparent when we substitute the weight change
resulting from the association between a particular input-output pair f-g from
Equation 8.2 into the activation rule in Equation 8.3:
r_j = η Σ_i c_i f_i g_j .     (8.4)
There are two things to note about this equation. The first is that the bigger the
input activations (ci ) or the larger the weight values overall, the larger the output
activation values will be. The second, less trivial feature of Equation 8.4 is that
the learned output g j will be elicited as the activation for the output unit j to
the extent that the activation pattern c matches the learned input f. In the case of
an extreme match, where c is the same as f, the product following η is given by
Σ_i f_i^2 g_j , which, when applied for all output units, will produce the best match to g.
In practice, there will usually be a number of associations stored in the weights,
and each output pattern g(k) will be elicited to the extent that the cue matches the
corresponding input pattern f(k), where the match is measured by the dot product
(see Equation 7.1). As a consequence, if the cue matches more than one input
pattern, the output of the network will be a blend—a weighted sum of the learned
outputs.
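The sketch below stores two associations in a single weight matrix and then probes it with a noisy version of the first input; the output is dominated by the first learned output but is, strictly, a blend of both. All sizes and the noise level are illustrative.

% Retrieval as in Equation 8.3: r = W*c, where both stored pairs contribute.
nIn = 50; nOut = 50; eta = 1;
f1 = randn(nIn,1); f1 = f1/norm(f1); g1 = randn(nOut,1); g1 = g1/norm(g1);
f2 = randn(nIn,1); f2 = f2/norm(f2); g2 = randn(nOut,1); g2 = g2/norm(g2);
W  = eta*(g1*f1') + eta*(g2*f2');            % both associations in the same weight matrix
c  = f1 + 0.3*randn(nIn,1); c = c/norm(c);   % noisy cue resembling the first input
r  = W * c;                                  % r_j = sum_i c_i w_ij for every output unit
[r'*g1, r'*g2]                               % the output matches g1 far better than g2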
Hebbian models are well suited to modeling memory. For example, they are
able to generalize to new cues based on previous knowledge: Several models of
serial recall (see Chapter 4) assume that people remember the order of elements
in a list by forming a Hebbian association between each list item and some rep-
resentation of its timing (G. D. A. Brown et al., 2000) or position (Lewandowsky
& Farrell, 2008a). Critically, the vectors representing neighboring positions (or
times) are assumed to overlap. This means that cueing for an item at a partic-
ular position (time) will elicit a blend of list items, with the target item highly
weighted, and neighboring items activated to a lesser extent. This similarity-
based generalization during retrieval explains the common fact in serial recall
that when items are recalled out of order, they are nonetheless recalled in approx-
imately the correct position (rather than in a random position; e.g., Henson et al.,
1996).
Hebbian learning also plays an important role in autoassociative models, in
which recurrent weights projecting from a single layer of units feed back into that
same layer. By implication, updating weights using Hebbian learning means that
an item is associated with itself—hence the name autoassociator. This probably
sounds ridiculous—what good can such a mechanism be in practice? It sounds
as though the associative component is redundant because all it supplies is the
item that has to be available as a cue in the first place! However, autoassociative
models come into their own when we consider the situation of having only partial
information about an item. If I have some knowledge about an object (e.g., I can
see the lower half of a person’s face, and I need to identify the person), I can
present this information to the autoassociative network and let it fill in the gaps.
This works because the weights, which connect together different units in the
same single layer, effectively store information about the correlations between
units. If two units tend to fire together, and I know that one of the units is on, a
justifiable inference (from the network’s perspective) is that the other unit should
also be on. Hence, an autoassociator provides a powerful tool for a process known
as “redintegration,” whereby a complete item can be reconstructed from partial
and/or degraded information in the cue.
In reality, it turns out that a single pass through the network rarely suffices for
complete redintegration because what is produced at output will be a blend of the
items that have been learned by the network. Accordingly, most autoassociative
models use an iterative procedure, in which the output activations from one itera-
tion serve as the input for the next iteration (e.g., J. A. Anderson, Silverstein, Ritz,
& Jones, 1977; Lewandowsky & Farrell, 2008b; O’Toole, Deffenbacher, Valentin,
& Abdi, 1994). Over iterations, the strongest item in the blend will come to domi-
nate; since, according to Equation 8.4, items will be more strongly represented in
the blend at output if they more strongly match the input, and since the input on
one iteration is simply the output from the previous iteration, this process serves as
an amplification device that will eventually converge on a single representation.
For this reason, autoassociative networks have been used as “clean-up” mecha-
nisms that take the blended output from a standard associator and disambiguate
this to obtain a pure representation of a single item (e.g., Chappell & Humphreys,
1994; Lewandowsky, 1999; Lewandowsky & Farrell, 2008b). They have also been
applied to such areas as perceptual learning (J. A. Anderson et al., 1977) and face
recognition (O’Toole et al., 1994).
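The following sketch illustrates iterative redintegration in a toy autoassociator: two ±1 patterns are stored Hebbianly, a cue containing only half of one pattern is presented, and repeated passes through the weights restore the missing half. The normalization and the number of iterations are arbitrary choices.

% Toy autoassociator: complete a partial cue by repeatedly feeding the output
% back in as the next input.
n  = 100;
p1 = sign(randn(n,1)); p2 = sign(randn(n,1));  % two stored +/-1 patterns
W  = (p1*p1' + p2*p2') / n;                    % Hebbian autoassociative weights
cue = p1; cue(51:end) = 0;                     % only the first half of p1 is known
x = cue;
for iter = 1:10
    x = W * x;                                 % one pass through the recurrent weights
    x = x / max(abs(x));                       % keep activations bounded
end
mean(sign(x) == p1)                            % proportion of units matching p1 (close to 1)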
A key feature of backpropagation models is that they learn via error correction.
That is, weights are updated not simply on the basis of the output values—as in
Hebbian learning—but rather based on the difference between the “correct” out-
put value and the output actually produced when the input pattern is used as a
cue. In fact, the backpropagation learning algorithm works to minimize the error
between the desired and obtained output, E = Σ_j (r_j − g_j)^2 . The algorithm min-
imizes this error using a gradient descent algorithm that, for a particular learning
episode, adjusts the weights so as to reduce E as much as possible. This is pre-
cisely analogous to the error minimization routines discussed in Chapter 3, except
that the “parameters” being optimized are the weights in the network rather than
conventional model parameters.
Although this might seem to imply that backpropagation models are thus
incredibly flexible, it is important to note that the weights are optimized with
respect to the training sequence (stimuli and feedback) rather than with respect
to the data; that is, the network is trying best to account for the environment, not
people’s behavior in that environment.
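Backpropagation proper requires propagating the error gradient through the hidden layer; as a simplified sketch of the same error-minimization idea, the fragment below applies gradient descent on E for a single-layer linear network (the delta rule). All sizes and the learning rate are arbitrary.

% Error-driven learning for a single-layer linear network: gradient descent
% on E = sum_j (r_j - g_j)^2 for one input-output pair.
nIn = 20; nOut = 10; eta = 0.05;
W = zeros(nOut, nIn);
f = randn(nIn,1); f = f/norm(f);       % input pattern
g = randn(nOut,1);                     % desired output pattern
for epoch = 1:200
    r   = W * f;                       % obtained output
    err = r - g;                       % mismatch with the desired output
    W   = W - eta * (err * f');        % step the weights down the error gradient
end
E = sum((W*f - g).^2)                  % the error is now very close to zero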
A classic example of a backpropagation model is Hinton’s (1990) demonstra-
tion of how connectionist networks can learn family trees (see also Hinton, 1986).
In Hinton’s network, the name of a person and a relationship (e.g., brother, grand-
mother) were coded and fed in as inputs, with the network tasked to produce
a pattern of activation across the output units corresponding to the appropriate
response. For example, if Colin and Charlotte are siblings, and “Colin + sister” is
given as the input, then the target output is “Charlotte.” The network, which used
localist representations of people and relationships, was trained on two family
trees, one English and one Italian, each involving 12 people spread across three
generations (see top panel of Figure 8.2).
Not only did the network learn to produce the outputs appropriate to specific
cues, but it also developed some “emergent” internal representations. Those emer-
gent representations become apparent when the weights from the input to hidden
units are analyzed. As can be seen in the bottom panel of Figure 8.2, the first unit2
“represents” the difference between English and Italian, in that the weights to this
unit from English people were excitatory, whereas those from Italian people were
inhibitory. Note that the network was given no separate information about the lan-
guage of the people involved; this discrimination emerged by virtue of the fact that
only people from the same tree were ever presented to the network as members
of a relationship. Figure 8.2 also shows other forms of representation; for exam-
ple, the second unit appears to code for generation, receiving excitatory weights
from grandparents, inhibitory weights from grandchildren, and small positive or
negative weights from the middle stratum of the family tree, parents. Again, no
information was given about this higher order information, but the model used
the statistics of the input (which people are paired with each other and as part of
which specific relationship) to extract these higher order relationships.
Figure 8.2 Hinton’s (1986, 1990) application of a backpropagation network to the learning
of familial relations. Panel A shows one of the tree structures used to generate the Person
1–relationship–Person 2 sets that were presented for learning. Panel B shows the size and
sign of the weights projecting from each localist unit coding a person to each of six hid-
den units. Each large box corresponds to a hidden unit, and the top and bottom rows in
each box correspond to English and Italian names, respectively. The small squares indicate
whether a weight was excitatory (white) or inhibitory (black), with the size of the square
indicating the magnitude of the weight. From Hinton, G. E. (1990). Mapping part-whole
hierarchies into connectionist networks. Artificial Intelligence, 46, 47–75. Published by
Elsevier. Reproduced with permission.
A final notable feature of this example is that it highlights the role of the differ-
ent layers of units: The weights between the input and hidden layers effectively
learn to transform or remap the input into new representations that are (more
easily) learnable. The hidden-to-output connections then store the associations
between the transformed inputs and the target outputs. It is this ability to trans-
form input space into emergent representations that differentiates backpropaga-
tion networks from Hebbian networks, and that has contributed to much of the
excitement about connectionism in general.
One interesting implication of the error-driven learning inherent in backpropa-
gation is that it gives greater weighting to more recent information. In the extreme
case, backpropagation may produce catastrophic interference, whereby learning
one set of associations grossly impairs the network’s memory for another, pre-
viously learned set of associations (McCloskey & Cohen, 1989; Ratcliff, 1990).
Although this problem is largely quantitative, in that modifying various assump-
tions about the representations and learning algorithm brings the interference
more in line with that seen in human forgetting (e.g., French, 1992; Lewandowsky,
1991), the catastrophic-interference effect usefully illustrates the dynamics of
learning in these models. To see why catastrophic interference might arise, con-
sider the top panel of Figure 8.3, which schematically depicts the progress of
learning as the movement of weights through “weight space,” analogous to the
movement of a parameter vector through parameter space in parameter estimation
(see Figure 3.4).
If the network is first trained on one set of associations, B, the weights will
move from their initial setting (A) and, due to the error-driven learning of the
backpropagation algorithm, will eventually settle on a minimum error Min(X)
representing a set of weights that adequately captures the associations in B. Now,
if a new set of associations (C) is presented, the error surface will change—
because the already stored knowledge is no longer adequate—and will drive the
weights toward the new minimum Min(Y). In doing so, the weights will move
away from Min(X), producing forgetting of the old set of associations, B.
One way to prevent catastrophic interference is to interleave both sets of infor-
mation to be learned, such that no one set of information is more recent than the
other. As depicted in the bottom of Figure 8.3, the state of the weights will con-
stantly change direction as one or the other set of information is learned but will
ultimately converge on a minimum Min(X &Y ) that does a reasonable job of
accounting for both sets of associations. Supported by simulations, Farrell and
Lewandowsky (2000) used this reasoning to explain why automating tasks (for
example, in an aircraft cockpit) leads operators to be poor at detecting occasional
failures of the automation. Farrell and Lewandowsky (2000) suggested that when
a task is automated, operators learn not to respond to a task, which catastrophi-
cally interferes with their previously learned responses (e.g., to push a button in
order to change radio frequencies when entering a different air space).
It has also been found that repeatedly returning full control of the task to
an operator for a brief period is sufficient to minimize the deleterious effects of
automation on performance. Farrell and Lewandowsky (2000) suggested that this
Figure 8.3 Schematic depiction of movement through weight space as learning progresses
in a backpropagation network. The top panel corresponds to sequential learning of two sets
of information, and the bottom panel corresponds to interleaved presentation of the two
sets. From Farrell, S., & Lewandowsky, S. (2000). A connectionist model of complacency
and adaptive recovery under automation. Journal of Experimental Psychology: Learning,
Memory, & Cognition, 26, 395–410. Published by the American Psychological Associa-
tion. Reproduced with permission.
8.2.4 Summary
We presented two neural networks that have been influential in cognitive psychol-
ogy during the past few decades. There is a plethora of other networks, and we
cannot do this area justice in a single section. Instead, interested readers may wish
to choose from the large number of relevant texts to learn more about networks.
For example, the book by J. A. Anderson (1995) provides an in-depth treatment of
Hebbian networks and autoassociators, whereas Gurney (1997) and Levine (2000)
provide other broad overviews.
Irrespective of which network you may choose to apply to a particular prob-
lem, the techniques you will use are identical to those outlined in Chapters 2 to 7.
In fact, the WITNESS model discussed in Chapter 7 can be considered to be a
(albeit miniaturesque) neural network: All stimuli were represented by vectors,
thus instantiating a “distributed” representation, and information was retrieved by
interrogating memory via a cue, similar to the way information is elicited in a
Hebbian net.
Thus, from the preceding chapter you already know that just like any other
model, a neural network requires that some parameters (e.g., the learning rate η
in Equation 8.2 or the parameters of WITNESS) be estimated from the data. Just
like any other model, a network will generate predictions that are then compared
to the data—although unlike many other models, a network requires access to
the sequence of training stimuli in order to learn the task before it can generate
predictions.
We do not reject Henson’s conclusions but again provide several notes of cau-
tion that must be borne in mind while collecting or interpreting functional imaging
data. First, there is some debate concerning the reliability of fMRI signals. The
reliability of measurements across repeated trials is central to the entire scientific
enterprise: In the extreme case, if repeated measurements are uncorrelated with
each other, no meaningful interpretation of the data is possible. In the physical
sciences, of course, reliability is high—for example, two consecutive measure-
ments of the length of a fossilized femur are likely to be virtually identical. In
the behavioral sciences, reliability is not perfect but still considerable; for exam-
ple, the reliability of the Raven’s intelligence test across repeated applications is
around .80 (Raven, Raven, & Court, 1998). What is the reliability of functional
imaging measures?
There has been some recent controversy surrounding this issue (Vul, Har-
ris, Winkielman, & Pashler, 2009a, 2009b), with postulated values of reliability
ranging from .7 (Vul et al., 2009a) to .98 (Fernandez, 2003). Bennett and Miller
(2010) provided a particularly detailed examination of this issue and reported a
meta-analysis of existing reliability measures. Reassuringly, Bennett and Miller
found that between-subject variability was generally greater than within-subject
variation—in other words, the fMRI signals of the same individual performing
the same task repeatedly were more similar to each other than the fMRI signals
between different individuals on the same single occasion. (The same relationship
between within-subject and between-subject variation holds for most behavioral
measures; see, e.g., Masson & Loftus, 2003.) On a more sobering note, the aver-
age cluster overlap between repeated tests was found to be 29%. That is, barely a
third of the significantly activated voxels within a cluster can be expected to also
be significant on a subsequent test involving the same person. Other measures of
reliability converge on similar results, leading Bennett and Miller to conclude that
the reliability of fMRI is perhaps lower than that of other scientific measures. This
context must be borne in mind when evaluating fMRI data.
Second, Coltheart (2006) provided a detailed critique of several instances in
which the claimed theoretically constraining implications of fMRI results arguably
turned out to be less incisive than initially claimed. For example, E. E. Smith
and Jonides (1997) reported imaging data that purportedly identified different
working-memory systems for spatial, object, and verbal information based on
the fact that the different types of information appeared to involve activation
of different brain regions. E. E. Smith and Jonides concluded that for “a cog-
nitive model of working memory to be consistent with the neural findings, it must
distinguish the three types of working memory” (p. 39). The model said to be
favored by their findings was the “working memory” model of A. D. Baddeley
(e.g., 1986). Coltheart (2006) rejected those conclusions based on two arguments:
First, E. E. Smith and Jonides did not consider alternatives to the Baddeley model,
thus precluding any conclusions about the necessity of that particular model (see
our earlier discussion about sufficiency and necessity in Section 2.6). Second,
Coltheart argued that it was unclear whether any pattern of imaging results could
have challenged the Baddeley model, given that the model makes no claims about
localization of any of its purported modules. Importantly, Coltheart’s analysis was
not an in-principle critique of the utility of neuroscientific results; rather, it thor-
oughly analyzed existing claims of interpretation and found them to be wanting.
Below, we analyze some further instances that in our view do not suffer from such
shortcomings.
It was only with the aid of a computational model that the individual differences in activation became tractable; the raw data were
insufficient for this purpose.
We next turn to an example involving single-cell recordings. Single-cell record-
ings offer far better temporal resolution than fMRI imaging data, and of course
they are also highly localized (namely, to the neuron being recorded). We focus
on an example that involves recordings from the superior colliculus (SC) of rhe-
sus monkeys. Ratcliff, Hasegawa, Hasegawa, Smith, and Segraves (2007) trained
monkeys to perform a brightness discrimination task by moving their eyes to one
or another target in response to a centrally presented stimulus patch. For example,
a bright patch might require the monkey to fixate on the target to the right, whereas
a dim patch would require a saccade to the left. The use of brightness decoupled
the nature of the stimulus—namely, its luminosity—from the decision process and
its outcome—namely, the direction in which to move the eyes—thus ensuring that
variation in firing rates reflected a decisional component rather than stimulus per-
ception. Only those neurons were recorded that showed no responsiveness to a
central visual stimulus but were preferentially active during the period leading up
to a right or left saccade, respectively.
Ratcliff et al. fitted both behavioral and neural data with a version of the dif-
fusion model (e.g., Ratcliff, 1978), which we considered briefly in Chapter 5. The
diffusion model is related to the LBA, except that its information accumulation
over time is stochastic rather than deterministic—that is, the accumulated infor-
mation during the decision process varies randomly at each time step. Ratcliff
et al. first fit the diffusion model to the behavioral data, consisting of the response
time distributions and accuracies of the monkey’s eye movements. To relate the
model to firing rates, they then assumed that proximity to the decision in the diffu-
sion process was reflected in the firing rates of the neurons in the SC—the nearer
the process was to a decision criterion, the higher the firing rate. The model was
found to capture many—although not all—aspects of the observed firing rates,
suggesting that the superior colliculus is a plausible locus of the decision process
involved in planning saccades.
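To convey the flavor of such a model, the sketch below simulates a simple random-walk approximation to a diffusion process: noisy evidence accumulates toward one of two boundaries, the boundary reached determines the response, and the number of steps determines the decision time. It is a generic illustration with arbitrary parameter values, not the model fitted by Ratcliff et al.

% Random-walk approximation to a diffusion decision process (illustrative only).
drift = 0.1; noiseSD = 1; bound = 20; dt = 1;
nTrials = 1000;
resp = zeros(nTrials,1); decTime = zeros(nTrials,1);
for trial = 1:nTrials
    x = 0; t = 0;
    while abs(x) < bound
        x = x + drift*dt + noiseSD*sqrt(dt)*randn;  % stochastic accumulation of evidence
        t = t + dt;
    end
    resp(trial)    = sign(x);          % +1 = upper boundary, -1 = lower boundary
    decTime(trial) = t;                % steps taken to reach a boundary
end
[mean(resp == 1), mean(decTime(resp == 1))]  % choice probability and mean decision time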
The examples just discussed confirm that neuroimaging results and single-
cell recordings can inform theorizing and can provide valuable constraints that
go beyond those available in behavioral data. However, crucially, it is neither the
behavioral data nor the imaging results on their own, and neither is it the com-
bination of the two, that suffice to advance theorizing: It is only by integrating
all sources of empirical results within a computational and mathematical model
that this becomes possible. When this is done, the potential payoff in terms of
additional constraints on models is considerable.
The same conclusion applies to the multiple memory systems view in catego-
rization. Although we noted some potential limitations with the approach at the
beginning of this section, those concerns were largely confined to the verbal theo-
rizing that often surrounds the MMS. Ashby, Paul, and Maddox (in press) recently
reported a computational instantiation of the MMS view that circumvented at least
some of those problems. Ashby et al. applied their model to a single experiment
with the aid of more than a dozen parameters; the generality and power of their
model thus remain to be ascertained, but the fact that it exists in a computational
instantiation is noteworthy. The potential now exists for this computational model
to be applied to numerous other phenomena.
As in the earlier case of neural networks, computational neuroscientific mod-
eling as just described involves all the techniques presented in this book. The
only difference is that the to-be-modeled dependent measures also include mea-
sures such as brain activation or neural activity. Further information about those
techniques is available in a number of sources, for example, the book by Dayan
and Abbott (2001). Specific programming techniques for MATLAB can be found
in Wallisch et al. (2009).
IF the goal is to solve an equation and a variable has been read and there are no
arguments
THEN store it as the first argument.
At the next step, when people consider the + sign, the relevant production is
as follows:
IF the goal is to solve an equation and an operator has been read and there is no
operator stored
THEN store it as the operator.
[Figure 8.4 shows the production system in the basal ganglia, with production matching in the striatum, selection in the pallidum, and execution via the thalamus, connected to the other modules and to the external world.]
Figure 8.4 The overall architecture of ACT-R. The modules communicate with the pro-
duction system via buffers: The contents of the buffers are compared to the conditions
of productions, and the actions of productions change the contents of the buffers. Where
possible, the presumed localization of components in the brain is identified. DLPFC =
dorsolateral prefrontal cortex; VLPFC = ventrolateral prefrontal cortex. Figure taken from
Anderson, J. R., Bothell, D., Byrne, M. D., Douglass, S., Lebiere, C., & Qin, Y. (2004).
An integrated theory of mind. Psychological Review, 111, 1036–1060. Published by the
American Psychological Association. Reproduced with permission.
as its principal engine—to develop simulated fighter jets that “flew” on their own
and interacted with other pilots and air traffic control using a (limited) natural-
language vocabulary. In a large-scale military simulation, some 100 simulated
aircraft were piloted by SOAR with notable success. At the heart of this appli-
cation of SOAR were more than 8,000 production rules that imbued the simu-
lated aircraft (or their simulated pilots) with the knowledge necessary to “fly” in
a realistic manner. Those productions, in turn, were derived from subject matter
experts (such as military pilots) and “textbook knowledge” of military doctrine
and tactics. Although this application of SOAR is impressive by the standards
of expert systems and artificial intelligence, it also illustrates one of the princi-
pal problems of psychological applications of cognitive architectures—namely,
their reliance on external input that provides the knowledge required to make the
architecture work.
Specifically, in the same way that for the jet fighter simulation much intel-
lectual capital was expended on extracting knowledge from domain experts and
textbooks, and to convert that knowledge into a form that was palatable to SOAR,
in many psychological applications, much of the architecture’s work is done by
the productions that define the particular task being modeled (Taatgen & Ander-
son, 2008). For example, the ACT-R productions shown above that were required
to solve equations are typically provided to the system by the theoretician, and
their mere presence thus does not represent an achievement of the architecture—
instead, they serve the role of free parameters in other more conventional models
(J. R. Anderson et al., 2004).
This, then, points to the fundamental problem of cognitive architectures—
namely, that they “make broad assumptions, yet leave many details open” (Sun et
al., 2005, p. 616). This problem is well known, and two broad classes of solutions
have been explored: The first approach seeks to find ways by which the archi-
tecture can acquire its own productions—for example, by processing instructions
presented in (quasi-natural) language. Thus, like a human subject, the architecture
is presented with task instructions and processes those instructions to develop a
set of productions required for the task (J. R. Anderson et al., 2004). When suc-
cessful, this obviates the need for a task analysis or other interventions by the the-
oretician to select suitable productions, thus effectively eliminating a large source
of (tacit) free parameters in the model. The second approach seeks to connect the
cognitive architecture to its presumed neural underpinnings.
(J. R. Anderson, 2005, p. 323). Why, then, do we reproduce this bedazzling figure
here? The answer is that brain imaging data provide the constraints necessary to
confirm some of the assumptions made within ACT-R and the interplay between
its various models.
Table 8.1 presents the detailed presumed mapping between ACT-R modules
and brain regions, obtained from a number of imaging studies (e.g., J. R. Ander-
son, 2005, 2007). Under some fairly straightforward assumptions about the
temporal characteristics of the blood-oxygen-level-dependent (BOLD) response
measured by fMRI imaging, the mapping in the table permits prediction of brain
activations from the simulated activations of the modules shown in Figure 8.5.
Table 8.1 Mapping Between Modules in ACT-R and Brain Regions Identified by Imaging
Data
ACT-R Module Associated Brain Region
J. R. Anderson (2005) showed that once the architecture was fit to the behav-
ioral data (i.e., the learning curves for the various problem types), the simulated
pattern of activations of the various modules correlated highly with the activa-
tions observed during brain imaging. Those correlations, then, arguably provided
the independent verification necessary to justify the complexity of Figure 8.5 (see
page 312).
Figure 8.5 Simulated module activity in ACT-R during solution of a two-step equation on Day 1
(a) with a two-step equation on Day 5 (b). In both cases, the equation is 7 ∗ x + 3 = 38. See text for
details. Figure taken from Anderson, J. R. (2005). Human symbol manipulation within an integrated
cognitive architecture. Cognitive Science, 29, 313–341. Copyright by the Cognitive Science Society;
reprinted with permission.
8.5 Conclusion
This concludes our exploration of the elements of computational and mathemat-
ical modeling in cognition. We began by considering some very general concep-
tual issues regarding the use of models, and we finished by drawing a wide arc
to encompass a variety of approaches that are united by their attempt to explain
human cognition by quantitative and computational means, thus avoiding the pit-
falls that beset purely verbal theorizing.
Where does this leave you, and what future options exist for further study?
At the very least, you should now be in a position to apply existing models
to your own data. This may initially require some further “coaching” from the
model’s author(s) or other experts, but we expect you to have acquired the relevant
skills by working through the material in this book. We encourage you to practice
widely and to use your skills wisely; please remember the conceptual lessons of
Chapters 1, 2, and 6.
There are many ways in which you can apply the skills acquired in this book.
The approaches we have discussed in this chapter cover only some of the
many models that have been applied in cognitive psychology, and the best way
to broaden your understanding of computational models of cognition is to read
about these models. To give you a head start, in the remaining sections of this
chapter, we present pointers to a small sample of models from a number of areas.
Each of those models has made a significant contribution to the literature, and you
cannot go wrong by exploring any one of them in greater depth.
On our webpage (http://www.cogsciwa.com), you will find an extended ver-
sion of this listing that we endeavor to keep current and that also includes direct
links to code for the models if available. Whatever area of cognition you are spe-
cializing in, you will find an attractive model in this list that is worthy of further
examination. Begin by working with one of those models; after you have gathered
some expertise applying an existing model, there will be little to stop you from
designing your own. We wish you the best of luck!
8.5.1 Memory
Retrieving Effectively from Memory (REM; Shiffrin & Steyvers, 1997): a
Bayesian model of recognition memory, in which the likelihood of partial
episodic traces is calculated in parallel given a presented probe.
8.5.2 Language
Bayesian Reader (Norris, 2006): a Bayesian model of word recognition, lexical
decision, and semantic categorization.
The Bayesian Ventriloquist (Alais & Burr, 2004): a Bayesian model of multi-
modal integration that explains ventriloquism as precise (in terms of spatial
location) visual information capturing imprecise auditory information.
Relative Judgment Model (RJM; Stewart, Brown, & Chater, 2005): a model
of absolute identification in which identification is based on comparisons
between the current and previous stimuli, as well as feedback from the
previous trial.
Notes
1. The backpropagation algorithm also assumes a contribution from bias units that pro-
vide a constant input to each hidden and output unit. For conceptual clarity, we do not
discuss these units here.
2. There is no meaning to the spatial arrangement of units within a layer (i.e., the units
are unordered), and so the numbering of units in Figure 8.2 is completely arbitrary.
References
Anderson, J. R., Bothell, D., Byrne, M. D., Douglass, S., Lebiere, C., & Qin, Y.
(2004). An integrated theory of the mind. Psychological Review, 111,
1036–1060.
Anderson, J. R., Fincham, J. M., & Douglass, S. (1999). Practice and retention:
A unifying analysis. Journal of Experimental Psychology: Learning,
Memory, and Cognition, 25, 1120–1136.
Anderson, J. R., & Schooler, L. J. (1991). Reflections of the environment in
memory. Psychological Science, 2, 396–408.
Andrews, S., & Heathcote, A. (2001). Distinguishing common and task-specific
processes in word identification: A matter of some moment? Journal of
Experimental Psychology: Learning, Memory, and Cognition, 27, 514–544.
Ashby, F. G. (1992a). Multidimensional models of categorization. In F. G. Ashby
(Ed.), Multidimensional models of perception and cognition (pp. 449–483).
Hillsdale, NJ: Erlbaum.
Ashby, F. G. (1992b). Multidimensional models of perception and cognition.
Hillsdale, NJ: Erlbaum.
Ashby, F. G., Alfonso-Reese, L. A., Turken, A. U., & Waldron, E. M. (1998). A
neuropsychological theory of multiple systems in category learning. Psy-
chological Review, 105, 442–481.
Ashby, F. G., & Gott, R. E. (1988). Decision rules in the perception and catego-
rization of multidimensional stimuli. Journal of Experimental Psychology:
Learning, Memory, and Cognition, 14, 33–53.
Ashby, F. G., & Maddox, W. T. (1993). Relations between prototype, exemplar,
and decision bound models of categorization. Journal of Mathematical
Psychology, 37, 372–400.
Ashby, F. G., Maddox, W. T., & Lee, W. W. (1994). On the dangers of averaging
across subjects when using multidimensional scaling or the similarity-choice
model. Psychological Science, 5, 144–151.
Ashby, F. G., & O’Brien, J. B. (2005). Category learning and multiple memory
systems. Trends in Cognitive Sciences, 9, 83–89.
Ashby, F. G., Paul, E. J., & Maddox, W. T. (in press). COVIS. In E. M. Pothos &
A. J. Wills (Eds.), Formal approaches in categorization. New York:
Cambridge University Press.
Ashby, F. G., & Townsend, J. T. (1986). Varieties of perceptual independence.
Psychological Review, 93, 154–179.
Atkinson, R. C., & Juola, J. F. (1973). Factors influencing speed and accuracy of
word recognition. In S. Kornblum (Ed.), Attention and performance IV (pp.
583–612). New York: Academic Press.
Attewell, D., & Baddeley, R. J. (2007). The distribution of reflectances within the
visual environment. Vision Research, 47, 548–554.
Baddeley, A. D. (1986). Working memory. New York: Oxford University Press.
Baddeley, A. D. (2003). Working memory: Looking backward and looking for-
ward. Nature Reviews Neuroscience, 4, 829–839.
Baddeley, A. D., & Hitch, G. (1974). Working memory. In G. Bower (Ed.), The
psychology of learning and motivation: Advances in research and theory
(Vol. 8, pp. 47–89). New York: Academic Press.
Baddeley, A. D., Thomson, N., & Buchanan, M. (1975). Word length and struc-
ture of short-term memory. Journal of Verbal Learning and Verbal Behavior,
14, 575–589.
Baddeley, R., & Attewell, D. (2009). The relationship between language and the
environment: Information theory shows why we have only three lightness
terms. Psychological Science, 20, 1100–1107.
Balota, D. A., Yap, M. J., Cortese, M. J., & Watson, J. M. (2008). Beyond mean
response latency: Response time distributional analyses of semantic priming.
Journal of Memory and Language, 59, 495–523.
Bamber, D., & van Santen, J. P. (1985). How many parameters can a model have
and still be testable? Journal of Mathematical Psychology, 29, 443–473.
Bamber, D., & van Santen, J. P. (2000). How to assess a model’s testability and
identifiability. Journal of Mathematical Psychology, 44, 20–40.
Bartels, A., & Zeki, S. (2000). The neural basis of romantic love. Neuroreport, 11,
3829–3834.
Batchelder, W., & Riefer, D. (1999). Theoretical and empirical review of multi-
nomial process tree modeling. Psychonomic Bulletin & Review, 6, 57–86.
Bayer, H. M., Lau, B., & Glimcher, P. W. (2007). Statistics of midbrain dopamine
neuron spike trains in the awake primate. Journal of Neurophysiology, 98,
1428–1439.
Bechtel, W. (2008). Mechanisms in cognitive psychology: What are the opera-
tions? Philosophy of Science, 75, 983–994.
Bennett, C. M., & Miller, M. B. (2010). How reliable are the results from func-
tional magnetic resonance imaging? Annals of the New York Academy of
Sciences, 1191, 133–155.
Bickel, P. J., Hammel, E. A., & O’Connell, J. W. (1975). Sex bias in graduate
admissions: Data from Berkeley. Science, 187, 398–404.
Bireta, T. J., Neath, I., & Surprenant, A. M. (2006). The syllable-based word
length effect and stimulus set specificity. Psychonomic Bulletin & Review,
13, 434–438.
Bishara, A. J., & Payne, B. K. (2008). Multinomial process tree models of con-
trol and automaticity in weapon misidentification. Journal of Experimental
Social Psychology, 45, 524–534.
Botvinick, M. M., & Plaut, D. C. (2004). Doing without schema hierarchies: A
recurrent connectionist approach to normal and impaired routine sequential
action. Psychological Review, 111, 395–429.
Botvinick, M. M., & Plaut, D. C. (2006). Short-term memory for serial order: A
recurrent neural network model. Psychological Review, 113, 201–233.
Box, M. J. (1966). A comparison of several current optimization methods, and the
use of transformations in constrained problems. Computer Journal, 9, 67–77.
Hood, B. M. (1995). Gravity rules for 2- to 4-year-olds? Cognitive Development,
10, 577–598.
Hooge, I. T. C., & Frens, M. A. (2000). Inhibition of saccade return (ISR):
Spatio-temporal properties of saccade programming. Vision Research, 40,
3415–3426.
Hopkins, R. O., Myers, C. E., Shohamy, D., Grossman, S., & Gluck, M. (2004).
Impaired probabilistic category learning in hypoxic subjects with hip-
pocampal damage. Neuropsychologia, 42, 524–535.
Howard, M. W., Jing, B., Rao, V. A., Provyn, J. P., & Datey, A. V. (2009). Bridg-
ing the gap: Transitive associations between items presented in similar tem-
poral contexts. Journal of Experimental Psychology: Learning, Memory, and
Cognition, 35, 391–407.
Howard, M. W., & Kahana, M. J. (2002). A distributed representation of temporal
context. Journal of Mathematical Psychology, 46, 269–299.
Howell, D. C. (2006). Statistical methods for psychology. Belmont, CA:
Wadsworth.
Hoyle, F. (1974). The work of Nicolaus Copernicus. Proceedings of the Royal
Society, Series A, 336, 105–114.
Huber, D. E. (2006). Computer simulations of the ROUSE model: An analytic
simulation technique and a comparison between the error variance-
covariance and bootstrap methods for estimating parameter confidence.
Behavior Research Methods, 38, 557–568.
Hudjetz, A., & Oberauer, K. (2007). The effects of processing time and
processing rate on forgetting in working memory: Testing four models of the
complex span paradigm. Memory & Cognition, 35, 1675–1684.
Hughes, C., Russell, J., & Robbins, T. W. (1994). Evidence for executive dys-
function in autism. Neuropsychologia, 32, 477–492.
Hulme, C., Roodenrys, S., Schweickert, R., Brown, G. D. A., Martin, S., & Stuart,
G. (1997). Word-frequency effects on short-term memory tasks: Evidence for
a redintegration process in immediate serial recall. Journal of Experimental
Psychology: Learning, Memory, and Cognition, 23, 1217–1232.
Hulme, C., & Tordoff, V. (1989). Working memory development: The effects of
speech rate, word-length, and acoustic similarity on serial recall. Journal of
Experimental Child Psychology, 47, 72–87.
Hunt, E. (2007). The mathematics of behavior. Cambridge, UK: Cambridge
University Press.
Hurvich, C. M., & Tsai, C. L. (1989). Regression and time series model
selection in small samples. Biometrika, 76, 297–307.
Hutchison, K. A. (2003). Is semantic priming due to association strength or
feature overlap? Psychonomic Bulletin & Review, 10, 785–813.
Inglehart, R., Foa, R., Peterson, C., & Welzel, C. (2008). Development,
freedom, and rising happiness. Perspectives on Psychological Science, 3,
264–285.
Pastore, R. E., Crawley, E. J., Berens, M. S., & Skelly, M. A. (2003). “Nonpara-
metric” A’ and other modern misconceptions about signal detection theory.
Psychonomic Bulletin & Review, 10, 556–569.
Pawitan, Y. (2001). In all likelihood: Statistical modelling and inference using
likelihood. Oxford: Oxford University Press.
Pike, R. (1973). Response latency models for signal detection. Psychological
Review, 80, 53–68.
Pitt, M. A., Kim, W., Navarro, D. J., & Myung, J. I. (2006). Global model analysis
by parameter space partitioning. Psychological Review, 113, 57–83.
Pitt, M. A., & Myung, I. J. (2002). When a good fit can be bad. Trends in
Cognitive Sciences, 6, 421–425.
Ploeger, A., Maas, H. L. J. van der, & Hartelman, P. A. I. (2002). Stochastic
catastrophe analysis of switches in the perception of apparent motion.
Psychonomic Bulletin & Review, 9, 26–42.
Poldrack, R. A., & Foerde, K. (2008). Category learning and the memory systems
debate. Neuroscience and Biobehavioral Reviews, 32, 197–205.
Popper, K. R. (1963). Conjectures and refutations. London: Routledge.
Qin, Y., Anderson, J. R., Silk, E., Stenger, V. A., & Carter, C. S. (2004). The
change of the brain activation patterns along with the children’s practice in
algebra equation solving. Proceedings of the National Academy of Sciences,
101, 5686–5691.
Raaijmakers, J. G. W., & Shiffrin, R. M. (1980). SAM: A theory of probabilistic
search of associative memory. In G. H. Bower (Ed.), The psychology of
learning and motivation (Vol. 14, pp. 207–262). New York: Academic Press.
Raaijmakers, J. G. W., & Shiffrin, R. M. (1981). Search of associative memory.
Psychological Review, 88, 93–134.
Raftery, A. E. (1999). Bayes factors and BIC: Comment on “A critique of the
Bayesian Information Criterion for model selection.” Sociological Methods
& Research, 27, 411–427.
Ratcliff, R. (1978). A theory of memory retrieval. Psychological Review, 85,
59–108.
Ratcliff, R. (1979). Group reaction time distributions and an analysis of distribu-
tion statistics. Psychological Bulletin, 86, 446–461.
Ratcliff, R. (1990). Connectionist models of recognition memory: Constraints
imposed by learning and forgetting functions. Psychological Review,
97, 285–308.
Ratcliff, R. (1998). The role of mathematical psychology in experimental psy-
chology. Australian Journal of Psychology, 50, 129–130.
Ratcliff, R., Hasegawa, Y. T., Hasegawa, Y. P., Smith, P. L., & Segraves, M. A.
(2007). Dual diffusion model for single-cell recording data from the superior
colliculus in a brightness-discrimination task. Journal of Neurophysiology,
97, 1756–1774.
Ratcliff, R., & Murdock, B. B. (1976). Retrieval processes in recognition
memory. Psychological Review, 83, 190–214.
Ratcliff, R., & Rouder, J. N. (1998). Modeling response times for two-choice
decisions. Psychological Science, 9, 347–356.
Ratcliff, R., & Smith, P. L. (2004). A comparison of sequential sampling models
for two-choice reaction time. Psychological Review, 111, 333–367.
Ratcliff, R., Van Zandt, T., & McKoon, G. (1995). Process dissociation, single-
process theories, and recognition memory. Journal of Experimental Psy-
chology: General, 124, 352–374.
Ratcliff, R., Van Zandt, T., & McKoon, G. (1999). Connectionist and diffusion
models of reaction time. Psychological Review, 106, 261–300.
Raven, J., Raven, J. C., & Court, J. H. (1998). Section 2: Coloured Progressive
Matrices (1998 edition). Introducing the parallel version of the test, Manual
for the Raven’s progressive matrices and vocabulary scales. Oxford, UK:
Oxford Psychologist Press.
Reitman, W. B. (1965). Cognition and thought. New York: Wiley.
Rickard, T. (1997). Bending the power law: A CMPL theory of strategy shifts and
the automatization of cognitive skills. Journal of Experimental Psychology:
General, 126, 288–311.
Riefer, D., & Batchelder, W. (1988). Multinomial modeling and the measurement
of cognitive processes. Psychological Review, 95, 318–339.
Rips, L. (2002). Circular reasoning. Cognitive Science, 26, 767–795.
Roberts, S., & Pashler, H. (2000). How persuasive is a good fit? A comment on
theory testing. Psychological Review, 107, 358–367.
Roediger, H. L., & McDermott, K. B. (1993). Implicit memory in normal human
subjects. In F. Boller & J. Grafman (Eds.), Handbook of neuropsychology
(Vol. 8, pp. 63–131). Amsterdam: Elsevier.
Roelofs, A. (1997). The WEAVER model of word-form encoding in speech
production. Cognition, 64, 249–284.
Rohrer, D. (2002). The breadth of memory search. Memory, 10, 291–301.
Root-Bernstein, R. (1981). Views on evolution, theory, and science. Science, 212,
1446–1449.
Rosenbaum, D. A. (2007). MATLAB for behavioral scientists. Mahwah, NJ:
Lawrence Erlbaum.
Rotello, C. M., & Macmillan, N. A. (2006). Remember-know models as decision
strategies in two experimental paradigms. Journal of Memory and Language,
55, 479–494.
Rouder, J. N., & Lu, J. (2005). An introduction to Bayesian hierarchical models
with an application in the theory of signal detection. Psychonomic Bulletin &
Review, 12, 573–604.
Rouder, J. N., Lu, J., Morey, R. D., Sun, S., & Speckman, P. L. (2008). A hier-
archical process-dissociation model. Journal of Experimental Psychology:
General, 137, 370–389.
Rouder, J. N., Lu, J., Speckman, P., Sun, D., & Jiang, Y. (2005). A hierarchical
model for estimating response time distributions. Psychonomic Bulletin &
Review, 12, 195–223.
Shiffrin, R. M., & Steyvers, M. (1997). A model for recognition memory: REM—
retrieving effectively from memory. Psychonomic Bulletin & Review, 4,
145–166.
Shimojo, S., Simion, C., Shimojo, E., & Scheier, C. (2003). Gaze bias both
reflects and influences preference. Nature Neuroscience, 1317–1322.
Shimp, C. P., Long, K. A., & Fremouw, T. (1996). Intuitive statistical inference:
Categorization of binomial samples depends on sampling context. Animal
Learning & Behavior, 24, 82–91.
Shultz, T. R. (2007). The Bayesian revolution approaches psychological devel-
opment. Developmental Science, 10, 357–364.
Smith, E. E., & Jonides, J. (1997). Working memory: A view from neuroimaging.
Cognitive Psychology, 33, 5–42.
Smith, J. B., & Batchelder, W. H. (2008). Assessing individual differences in
categorical data. Psychonomic Bulletin & Review, 15, 713–731.
Smith, P. L. (1998). Attention and luminance detection: A quantitative analysis.
Journal of Experimental Psychology: Human Perception and Performance,
24, 105–133.
Smith, P. L., & Vickers, D. (1988). The accumulator model of two-choice dis-
crimination. Journal of Mathematical Psychology, 32, 135–168.
Spanos, A. (1999). Probability theory and statistical inference. Cambridge:
Cambridge University Press.
Spieler, D. H., Balota, D. A., & Faust, M. E. (2000). Levels of selective attention
revealed through analyses of response time distributions. Journal of Exper-
imental Psychology: Human Perception and Performance, 26, 506–526.
Sternberg, S. (1975). Memory scanning: New findings and current controversies.
Quarterly Journal of Experimental Psychology, 27, 1–32.
Stewart, N., Brown, G. D. A., & Chater, N. (2002). Sequence effects in catego-
rization of simple perceptual stimuli. Journal of Experimental Psychology:
Learning, Memory, and Cognition, 28, 3–11.
Stewart, N., Brown, G. D. A., & Chater, N. (2005). Absolute identification by
relative judgement. Psychological Review, 112, 881–911.
Stone, M. (1974). Cross-validatory choice and assessment of statistical predic-
tions. Journal of the Royal Statistical Society, 36B, 111–147.
Stone, M. (1977). An asymptotic equivalence of choice of model by cross-
validation and Akaike’s criterion. Journal of the Royal Statistical Society,
39B, 44–47.
Sugiura, N. (1978). Further analysis of the data by Akaike’s information criterion
and the finite corrections. Communications in Statistics: Theory and
Methods, 7, 13–26.
Sun, R., Coward, A., & Zenzen, M. J. (2005). On levels of cognitive modeling.
Philosophical Psychology, 18, 613–637.
Sun, R., Slusarz, P., & Terry, C. (2005). The interaction of the explicit and the
implicit in skill learning: A dual-process approach. Psychological Review,
112, 159–192.
Taatgen, N. A., & Anderson, J. R. (2002). Why do children learn to say “broke”?
A model of learning the past tense without feedback. Cognition, 86, 123–155.
Taatgen, N. A., & Anderson, J. R. (2008). Constraints in cognitive architectures.
In R. Sun (Ed.), The Cambridge handbook of computational psychology (pp.
170–185). Cambridge: Cambridge University Press.
Taatgen, N. A., & Anderson, J. R. (2009). The past, present, and future of cogni-
tive architectures. Topics in Cognitive Science, 1–12.
Tan, L., & Ward, G. (2008). Rehearsal in immediate serial recall. Psychonomic
Bulletin & Review, 15, 535–542.
Tenenbaum, J. B., Griffiths, T. L., & Kemp, C. (2006). Theory-based Bayesian
models of inductive learning and reasoning. Trends in Cognitive Sciences, 10,
309–318.
Thompson, D. R., & Bilbro, G. L. (2000). Comparison of a genetic algorithm
with a simulated annealing algorithm for the design of an ATM network.
IEEE Communications Letters, 4, 267–269.
Thornton, T. L., & Gilden, D. L. (2005). Provenance of correlations in psycho-
logical data. Psychonomic Bulletin & Review, 12, 409–441.
Torre, K., Delignières, D., & Lemoine, L. (2007). Detection of long-range depen-
dence and estimation of fractal exponents through ARFIMA modelling.
British Journal of Mathematical and Statistical Psychology, 60, 85–106.
Toth, J. P. (2000). Nonconscious forms of human memory. In E. Tulving &
F. I. M. Craik (Eds.), The Oxford handbook of memory (pp. 245–261).
Oxford: Oxford University Press.
Toth, J. P., Reingold, E. M., & Jacoby, L. L. (1994). Toward a redefinition of
implicit memory: Process dissociations following elaborative processing and
self-generation. Journal of Experimental Psychology: Learning, Memory,
and Cognition, 20, 290–303.
Trout, J. D. (2007). The psychology of scientific explanation. Philosophy Com-
pass, 2/3, 564–591.
Tsoulos, I. G. (2008). Modifications of real code genetic algorithm for global
optimization. Applied Mathematics and Computation, 203, 598–607.
Turner, M., & Engle, R. (1989). Is working memory capacity task dependent?
Journal of Memory and Language, 49, 446–468.
Tversky, A., & Kahneman, D. (1992). Advances in prospect theory: Cumulative
representation of uncertainty. Journal of Risk and Uncertainty, 5, 297–323.
Underwood, B. J. (1975). Individual differences as a crucible in theory construc-
tion. American Psychologist, 30, 128–134.
Usher, M., & McClelland, J. L. (2001). The time course of perceptual choice: The
leaky, competing accumulator model. Psychological Review, 108, 550–592.
Vanderbilt, D., & Louie, S. G. (1984). A Monte Carlo simulated annealing
approach to optimization over continuous variables. Journal of Compu-
tational Physics, 56, 259–271.
van Santen, J. P., & Bamber, D. (1981). Finite and infinite state confusion models.
Journal of Mathematical Psychology, 24, 101–111.
Yechiam, E., Busemeyer, J. R., Stout, J. C., & Bechara, A. (2005). Using
cognitive models to map relations between neuropsychological disorders and
human decision-making deficits. Psychological Science, 16, 973–978.
Yonelinas, A. (1994). Receiver-operating characteristics in recognition memory:
Evidence for a dual-process model. Journal of Experimental Psychology:
Learning, Memory, and Cognition, 20, 1341–1354.
Yonelinas, A. (2002). The nature of recollection and familiarity: A review of 30
years of research. Journal of Memory and Language, 46, 441–517.
Yuille, A., & Kersten, D. (2006). Vision as Bayesian inference: Analysis by
synthesis? Trends in Cognitive Sciences, 10, 301–308.
Zucchini, W. (2000). An introduction to model selection. Journal of Mathemati-
cal Psychology, 44, 41–61.
Author Index
Subject Index
About the Authors
Simon Farrell completed a PhD at the University of Western Australia and then
worked as a post-doctoral fellow at Northwestern University. He moved to the
University of Bristol in 2003, where he is now a Reader in Cognitive Psychology.
Simon’s work focuses on the application of models in cognitive psychology, par-
ticularly in the areas of memory and choice behaviour. He is currently an Associate
Editor for Journal of Memory and Language, and was recently awarded the
Bertelson Award by the European Society for Cognitive Psychology for his out-
standing early-career contribution to European Cognitive Psychology.