Observing Interaction: An Introduction to Sequential Analysis
Second edition
ROGER BAKEMAN
Georgia State University
JOHN M. GOTTMAN
University of Washington
CAMBRIDGE UNIVERSITY PRESS
Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore, São Paulo
Published in the United States of America by Cambridge University Press, New York
www.cambridge.org
Information on this title: www.cambridge.org/9780521450089
© Cambridge University Press 1997
A catalogue record for this publication is available from the British Library
1. Introduction 1
1.1. Interaction and behavior sequences 1
1.2. Alternatives to systematic observation 2
1.3. Systematic observation defined 3
1.4. A nonsequential example: Parten's study of children's play 4
1.5. Social process and sequential analysis 6
1.6. Another nonsequential example: Smith's study of parallel play 7
1.7. A sequential example: Bakeman and Brownlee's study of
parallel play 8
1.8. Hypothesis-generating research 12
1.9. Summary: Systematic is not always sequential 13
Appendix 194
A Pascal program to compute kappa and weighted kappa 194
References 198
Index 205
Preface to the second edition
Since the first edition of Observing Interaction appeared in 1986, the tech-
nology supporting recording and systematic coding of behavior has become
less expensive, more reliable, and considerably less exotic (Bakeman, in
press; revised chapter 3, this volume). Cumbersome videotape and cas-
sette recorders have given way to video camcorders. Visual time codes
are routinely recorded as part of the picture, and equipment to write and
read machine-readable time codes is readily available at reasonable cost.
Increasingly, computers assist coding, making routine what once was labor
intensive and time-consuming. Even physiological recording devices can
be added to the computer's net (Gottman & Bakeman, in press). Thus
an increasing circle of investigators can avail themselves of the methods
detailed in this book without mortgaging their careers, their lives, or the
lives of their associates.
At the same time, the way we think about sequential data has devel-
oped. This is reflected in a standard format for sequential data (Bakeman
& Quera, 1992, 1995a; revised chapter 5, this volume). SDIS - the Se-
quential Data Interchange Standard - has greatly facilitated the analysis of
sequential data. Again, an enterprise that formerly was time-consuming
and cumbersome has yielded to appropriately designed computer tools, as
described in my and Quera's Analyzing Interaction (1995), which should
be regarded as a companion to this volume. This revised version of Observ-
ing Interaction still explains how to conceptualize, code, record, organize,
and analyze sequential data, but now Analyzing Interaction provides the
tools to do so easily.
Another area of considerable development, and one responsible for many
of the differences between the first and second editions of Observing In-
teraction, concerns techniques for analyzing sequential data (chapters 7-9,
this volume; these chapters are extensively modified versions of chapters
7-8 from the first edition). Formerly many of the analytic techniques
proposed for sequential analysis were somewhat piecemeal and post hoc,
yet, waiting in the wings, log-linear analysis promises a coherent analytic
view for sequential phenomena (Bakeman & Quera, 1995b). This revised
the collaboration with John Gottman that has resulted in this book, but
also I have benefited greatly from my collaboration, first with Josephine
Brown, and more recently with Lauren Adamson. Their contribution to
this book is, I hope, made evident by how often their names appear in the
references. I would also like to thank my department and its chair, Duane
M. Rumbaugh, for the support I have received over the years, as well as
the NIMH, the NIH, and the NSF for grants supporting my research with
J. V. Brown and with L. B. Adamson.
Finally, I would like to acknowledge the seminal work of Mildred Parten.
Her research at the University of Minnesota's Institute of Child Develop-
ment in the 1920s has served as a model for generations of researchers
and is still a paradigmatic application of observational methods, as our first
chapter indicates. How much of a model she was, she probably never knew.
She did her work and then disappeared. In spite of the best efforts of Bill
Hartup, who until recently was the director of the Institute, her subsequent
history remains unknown.
Parten left an important legacy, however - one to which I hope readers
of this book will contribute.
ROGER BAKEMAN
JOHN M. GOTTMAN
1
Introduction
satisfaction, we would need to examine more closely just how the couple
related to each other - and, in order to describe and ultimately attempt
to understand the dynamics of how they relate to each other, a sequential
view is essential. Our hope is that readers of this book not only take such
a sequential view, but also will learn here how to describe effectively the
sequential nature of whatever interaction they observe.
Each child was observed for 1 minute each day. The order of observation
was determined beforehand and was varied systematically so that the 1-
minute samples for any one child would be distributed more or less evenly
throughout the hour-long free-play period. On the average, children were
observed about 70 different times, and each time they were observed, their
degree of social participation was characterized using one of the six codes
defined above.
Florence Goodenough (1928) called this the method of repeated short
samples. Today it is often called "time sampling," but its purpose remains
the same. A number of relatively brief, nonsuccessive time intervals are
categorized, and the percentage of time intervals assigned a particular code
is used to estimate the proportion of time an individual devotes to that
kind of activity. For example, one 3-year-old child in Parten's study was
observed 100 times. None of the 1-minute time samples was coded Unoc-
cupied, 18 were coded Solitary, 5 Onlooking, 51 Parallel, 18 Associative,
and 8 Cooperative. It seems reasonable to assume that had Parten observed
this child continuously, hour after hour and day after day, about 51% of
that child's time would have been spent in parallel play.
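The arithmetic is simple enough to sketch. The following fragment (in Python; the counts are the hypothetical values just given for this 3-year-old) tallies the time samples and converts each tally to a proportion, which serves as the estimate of time devoted to that kind of activity.

from collections import Counter

# Hypothetical 1-minute time samples for one child, as tallied in the text:
# 0 Unoccupied, 18 Solitary, 5 Onlooker, 51 Parallel, 18 Associative, 8 Cooperative.
samples = (["Solitary"] * 18 + ["Onlooker"] * 5 + ["Parallel"] * 51
           + ["Associative"] * 18 + ["Cooperative"] * 8)

counts = Counter(samples)
total = len(samples)  # 100 samples in this example

# The proportion of samples assigned a code estimates the proportion of
# time the child devotes to that kind of activity.
for code in ["Unoccupied", "Solitary", "Onlooker", "Parallel", "Associative", "Cooperative"]:
    print(f"{code:12s} {counts[code] / total:.0%}")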
The method of repeated short samples, or time sampling, is a way of
recording data, but it is only one of several different ways that could be
used in an observational study. What makes Parten's study an example of
systematic observation is not the recording strategy she used but the coding
scheme she developed, along with her concern that observers apply that
scheme reliably.
Parten was primarily concerned with describing the level of social par-
ticipation among children of different ages, and with how the level of social
participation was affected by children's age, IQ, and family composition.
For such purposes, her coding scheme and her method of data collection
were completely satisfactory. After all, for each child she could compute
six percentages representing amount of time devoted to each of her six lev-
els of social participation. Further, she could have assigned, and did assign,
weights to each code (-3 to Unoccupied, -2 to Solitary, -1 to Onlooker,
1 to Parallel, 2 to Associative, and 3 to Cooperative), multiplied a child's
percent scores by the corresponding weights, and summed the resulting
products, which yielded a single composite social participation score for
each child - scores that were then correlated with the child's age and IQ.
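To make the computation concrete, here is a minimal sketch (in Python, using the hypothetical percent scores given earlier for one child) of the composite social participation score: each percent score is multiplied by its weight, and the products are summed.

# Weights Parten assigned to each code (from the text).
weights = {"Unoccupied": -3, "Solitary": -2, "Onlooker": -1,
           "Parallel": 1, "Associative": 2, "Cooperative": 3}

# Percent scores for the 3-year-old described earlier (hypothetical values).
percents = {"Unoccupied": 0, "Solitary": 18, "Onlooker": 5,
            "Parallel": 51, "Associative": 18, "Cooperative": 8}

# Composite social participation score: sum of weight * percent over the six codes.
composite = sum(weights[code] * percents[code] for code in weights)
print(composite)  # -3*0 + -2*18 + -1*5 + 1*51 + 2*18 + 3*8 = 70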
Knowing that older children are likely to spend a greater amount of time
in associative and cooperative play than younger ones, however, does not
tell us much about moment-by-moment social process or how Parten's par-
ticipation codes might be sequenced in the stream of behavior. This is not
because her codes are inadequate to the task, but because her way of record-
ing data did not capture behavior sequences. There is no reason, of course,
why she should have collected sequential data - her research questions
did not require examining how behavior is sequenced on a moment-by-
moment basis. However, there are interesting questions to ask about the
sort of children's behavior Parten observed that do require a sequential
view. An example of such a question is presented below.
group players. This idea found its way into textbooks but was not tested
empirically until Peter Smith did so in the late 1970s (Smith, 1978). In the
present context, Smith's study is interesting for at least three reasons: for
what he found out, for the way he both made use of and modified Parten's
coding scheme, and for his method, which only appears sequential, as we
define the term.
For simplicity, Smith reduced Parten's six categories to three:
1. Alone, which lumped together Parten's Unoccupied, Onlooker, and
Solitary
2. Parallel, as defined by Parten
3. Group, which lumped together Parten's Associative and Coopera-
tive
After all, because he wanted to test the notion that Parallel play character-
izes an intermediate stage of social development, finer distinctions within
Alone and within Group play were not necessary. Smith then used these
codes and a time-sampling recording strategy to develop time-budget in-
formation for each of the 48 children in his study. However, Smith did not
compute percent scores for the entire period of the study, as Parten did, but
instead computed them separately for each of six successive 5-week peri-
ods (the entire study took 9 months). These percent scores were then used
to code the 5-week periods: Whichever of the three participation categories
occurred most frequently became the category assigned to a time period.
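Smith's data-reduction rule is easy to state in code. The sketch below (in Python; the sample data are invented for illustration) assigns each 5-week period whichever participation category occurred most often among its time samples.

from collections import Counter

def code_period(samples):
    """Assign a 5-week period the participation category that occurred
    most often among its time samples (Smith's data-reduction rule)."""
    return Counter(samples).most_common(1)[0][0]

# Hypothetical time samples for one child in two of the six periods.
period_1 = ["Alone", "Alone", "Parallel", "Alone", "Group"]
period_2 = ["Parallel", "Group", "Group", "Group", "Alone"]

print(code_period(period_1))  # Alone
print(code_period(period_2))  # Group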
Smith's method is interesting, in part because it forces us to define exactly
what we mean by a sequential approach. Certainly his method has in com-
mon with sequential approaches that successive "units" (in his case, 5-week
periods) are categorized, that is, are matched up with one of the codes from
the coding scheme. However, what Smith did does not satisfy our sense of
what we usually mean by "sequential." It is only a matter of definition, of
course, but for the purpose of this book we would prefer to reserve the word
"sequential" for those approaches that examine the way discrete sequences
of behavior occur. Normally this means that sequential approaches are
concerned with the way behavior unfolds in time, as a sequence of rela-
tively discrete events, usually on a moment-by-moment or event-by-event
basis. In contrast, Smith's 5-week periods are not at all discrete, and thus
his approach is not sequential - as we use the term here - but is a reasonable
data reduction technique, given the question he sought to answer.
[Table 1.1. Parten's, Smith's, and Bakeman and Brownlee's social participation codes: Parten's Unoccupied, Onlooker, Solitary, Parallel, Associative, and Cooperative; Smith's Alone, Parallel, and Group; Bakeman and Brownlee's Unoccupied, Together, Solitary, Parallel, and Group.]
did, without an intervening period during which Parallel play was most
frequent. He concluded that a period during development characterized
by parallel play may be optional, a stage that children may or may not go
through, instead of obligatory, as Parten seems to have suggested. Still,
Smith's children engaged in parallel play about a quarter of the time, on the
average, and therefore, although it was seldom the most frequent mode of
play, it was nonetheless a common occurrence. This caused Bakeman and
Brownlee (1980) to think that perhaps parallel play might be more fruitfully
regarded, not as the hallmark of a developmental stage, but as a type of play
important because of the way it is positioned in the stream of children's
play behavior. Thus Bakeman and Brownlee raised a uniquely sequential
question about parallel play, one quite different from the question Parten
and Smith pursued.
Like Smith, Bakeman and Brownlee modified Parten's coding scheme
somewhat (see Table 1.1). They defined five codes as follows:
different utterance categories. The data derived from such studies can
then be used to ask the usual questions regarding how different groups of
individuals vary, or how individuals change with age.
Sequential techniques, added to systematic observation, allow a whole
new set of questions to be addressed. In particular, sequential techniques
can be used to answer questions as to how behavior is sequenced in time,
which in turn should help us understand how behavior functions moment
to moment. In fact, for purposes of this book, we use the word "sequen-
tial" to refer to relatively momentary phenomena, not for developmental
phenomena, which are expressed over months or years.
The purpose of this introductory chapter has been to suggest, both in
words and by example, what sequential analysis is, why it is useful, and
what it can do. In the following chapters, we discuss the various compo-
nents required of a study invoking sequential analysis.
Developing a coding scheme
2.1 Introduction
The first step in observational research is developing a coding scheme. It
is a step that deserves a good deal of time and attention. Put simply, the
success of observational studies depends on those distinctions that early
on become enshrined in the coding scheme. Later on, it will be the job of
observers to note when the behaviors defined in the code catalog occur in
the stream of behavior. What the investigator is saying, in effect, is: This is
what I think important; this is what I want extracted from the passing stream.
Yet sometimes the development of coding schemes is approached almost
casually, and so we sometimes hear people ask: Do you have a coding
scheme I can borrow? This seems to us a little like wearing someone else's
underwear. Developing a coding scheme is very much a theoretical act,
one that should begin in the privacy of one's own study, and the coding
scheme itself represents an hypothesis, even if it is rarely treated as such.
After all, it embodies the behaviors and distinctions that the investigator
thinks important for exploring the problem at hand. It is, very simply, the
lens with which he or she has chosen to view the world.
Now if that lens is thoughtfully constructed and well formed (and aimed
in the right direction), a clearer view of the world should emerge. But
if not, no amount of corrective action will bring things into focus later.
That is, no amount of technical virtuosity, no mathematical geniuses or
statistical saviors, can wrest understanding from ill-conceived or wrong-
headed coding schemes.
How does one go about constructing a well-formed coding scheme?
This may be a little like asking how one formulates a good research ques-
tion, and although no mechanical prescriptions guaranteeing success are
possible, either for coding schemes or for research questions, still some
general guidelines may prove helpful. The rest of this chapter discusses
various issues that need to be considered when coding schemes are being
developed.
With such a simple coding scheme, the progression from data collection
to analysis to interpretation would be simple and straightforward. We might
find, for example, that over the course of the first several months of life the
number of separation episodes gradually increased and then decreased. At
first, there would be few separations because the infant is almost always
clinging to the mother, whereas later on there might be few because the
infant is almost always off the mother, but in between there would be
more movement from and to the mother. Further, the data could show
that when the number of separations is first increasing, the mother initiated
considerably more of them than her infant, whereas later, when the number
of separations begins to decrease, the infant initiated more, leading us to
conclude that it is the mothers who first push their presumably reluctant
infants out into the world.
The point is, developing a coding scheme is theoretical, not mechanical
work. In order to work well, a coding scheme has to fit a researcher's ideas
and questions. As a result, only rarely can a coding scheme be borrowed
from someone else. However, when research questions are clearly stated, it
is a much easier matter to determine which distinctions the coding scheme
should make. Without clear questions, code development is an unguided
affair.
by socially based schemes - schemes that deal with behavior whose very
classification depends far more on ideas in the mind of the investigator (and
others) than on mechanisms in the body. We have called these "socially
based," not because they necessarily deal with social behaviors - even
though they typically do - but because both their specification and their
use depend on social processes. Instead of following quite clearly from
physical features or physical mechanisms in a way that causes almost no
disagreement, socially based schemes follow from cultural tradition or
simply negotiation among people as to a meaningful way to view and
categorize the behavior under discussion. Moreover, their use typically
requires the observer to make some inference about the individual observed.
For example, some people are paid to determine the sex of baby chicks.
The "coding scheme" in this case is simple and obvious: male or female.
This is not an easy discrimination to make, and chicken sexers require a fair
amount of training, but few people would suggest that the categories exist
mainly as ideas in the observers' heads. Their connection with something
"seeable," even if difficult to see, is obvious.
Other people (therapists and students influenced by Eric Berne) go about
detecting, counting, and giving "strokes" - statements of encouragement
or support offered in the course of interaction. In effect, their "coding
scheme" categorizes responses made to others as strokes or nonstrokes. For
some purposes, therapeutic and otherwise, this may turn out to be a useful
construct, but few would argue that "strokes" are a feature of the natural
world. Instead, they are a product of the social world and "exist" among
those who find the construct useful. Moreover, coding a given behavior as
a "stroke" requires making an inference about another's intentions.
Other examples of physically and socially based coding schemes could
be drawn from the study of emotion. For example, Ekman and Friesen's
(1978) Facial Action Coding System scores facial movement in terms of
visible changes in the face brought about by the motion of specific muscle
groups (called action units or "AUs"). The muscles that raise the inner
corner of the eyebrows receive the code "AU1." The muscles that draw the
brows down and together receive the code "AU4." When these muscles
act together, they result in a particular configuration of the brow called
"AU1+4." The brows wrinkle in specific ways for each of these three
action units (see Figure 2.1).
The brow configuration AU1+4 has been called "Darwin's grief muscle"
as a result of Darwin's 1873 book on the expression of emotion. This
leads us to the point that AU1 and AU1+4 are both brow configurations
typically associated with distress and sadness. In fact, there is not always
a one-to-one correspondence between sadness or distress and these brow
configurations. For example, Woody Allen uses AU1+4 as an underliner
Figure 2.1. Examples of action units from Ekman and Friesen's (1978) Facial
Action Coding System.
after he tells a joke. But in most social interaction there are additional cues
of sadness or distress.
In a physically based coding scheme we would be coding such things
as specific brow configuration, but in a socially based coding scheme we
would be coding such things as sadness. The socially based coding scheme
requires considerably more inference, and probably requires sensitive ob-
servers. However, it is not necessarily less "real" than the system that
records brow action. It is simply a different level of description.
Not all researchers would agree with this last statement. Some, espe-
cially those trained in animal behavior and ethology, might argue that if
the problem is analyzed properly, then any socially based scheme can and
should be replaced with a physically based one. We disagree. In fact, we
think there are often very good reasons for using socially based coding
schemes.
First, it is often the case that physically based schemes, like Ekman and
Friesen's mentioned above, are time consuming to learn and to apply, and
therefore, as a practical matter, it may be much easier to use a socially based
alternative. Even if it is not, a socially based scheme may more faithfully
reflect the level of description appropriate for a given research issue. In any
given case, of course, investigators' decisions are influenced by the problem
at hand and by the audience they want to address, but it seems worth asking,
before embarking on an ambitious observational study, whether something
simpler, and perhaps more in tune with the research question, would not
do as well. Some people will be unhappy with any coding scheme that is
not clearly grounded in the physical world, but others, ourselves included,
will be tolerant of almost any kind of coding scheme, even one that is quite
and concrete features, but we are not willing to let this one consideration
override all others, especially if the meaningfulness of the data collected
or the ability of those data to answer the question at hand might suffer. It
is possible, for example, to record accurately how many times an infant
approaches within 1 meter of his or her mother and what proportion of time
was spent in contact with the mother, within 1 meter of the mother, and
looking at the mother, and still not be able to gauge validly the quality of
the mother-infant relationship. That task might be accomplished better, as
Mary Ainsworth, Alan Sroufe, and others have argued, by asking observers
to rate, on 7-point scales, how much the infant seeks to maintain contact
with the mother, resists contact with the mother, etc., and then assessing the
relationship on the basis of the pattern of the rating scales (see Ainsworth,
Blehar, Waters, & Wall, 1978; Sroufe & Waters, 1977).
Still, when one is developing coding schemes (or rating scales, for that
matter), it is a very useful exercise to describe each behavior (or points on
the rating scale) in as specific a way as possible. For example, Bakeman
and Brownlee (1980), in their parallel play study, required observers to
distinguish between children who were playing alone and children who were playing
in parallel with others. First, Bakeman and Brownlee viewed videotapes
of 3-year-olds in a free play situation. They continually asked each other,
is this child in solitary or parallel play? - and other questions, too - and
even when there was consensus, they tried to state in as specific terms as
possible what cues prompted them to make the judgments they did. They
were thus able to make a list of features that distinguished parallel from
solitary play; when engaged in "prototypic" parallel play, children glanced
at others at least once every 30 seconds, were engaged in similar activities,
were within 1 meter of each other, and were no more than 90 degrees away
from facing each other directly.
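These cues could be written down as an explicit checklist. The sketch below (in Python; the field names are ours, not Bakeman and Brownlee's) simply tests whether all four prototypic cues are met; as the next paragraph explains, observers were not in fact asked to apply the cues this mechanically.

from dataclasses import dataclass

@dataclass
class PlayObservation:
    # Hypothetical fields summarizing the cues in the code catalog.
    seconds_between_glances: float   # longest gap between glances at other children
    similar_activity: bool           # engaged in an activity similar to nearby children
    distance_m: float                # distance to the nearest child, in meters
    facing_offset_deg: float         # degrees away from facing the other child directly

def prototypic_parallel(obs: PlayObservation) -> bool:
    """True when all of the prototypic parallel play cues are met.
    In practice observers treated these as guidelines, not rigid rules."""
    return (obs.seconds_between_glances <= 30
            and obs.similar_activity
            and obs.distance_m <= 1.0
            and obs.facing_offset_deg <= 90)

print(prototypic_parallel(PlayObservation(20, True, 0.8, 45)))   # True
print(prototypic_parallel(PlayObservation(60, True, 0.8, 45)))   # False: glances too rare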
These features were all described in the code catalog (the written de-
scription of the coding scheme), but the observers were instructed to treat
these defining features as somewhat flexible guidelines and not as rigid
mandates. Once they had thoroughly discussed with their observers what
most writers mean by parallel play, and had described parallel play in their
code catalog, they were willing to let observers decide individual cases
on the basis of "family resemblance" to parallel play. We agree with this
procedure and would argue as follows: By not insisting that observers
slavishly adhere to the letter of the rules, we then make use of, instead of
denying, their human inferential abilities. However, those abilities need to
be disciplined by discussion, by training, and - perhaps most importantly -
by convincing documentation of observer agreement. The result should be
more accurate data and data that can "speak" to the complex questions that
often arise when social behavior and social development are being studied.
the observers, but paradoxically we think that this strategy increases the
chances of collecting reliable data. Just as it is often easier to remember
three elements, say, instead of one, if those three are structured in some way,
so too observers are often more likely to see and record events accurately
when those events are broken up into a number of more specific pieces (as
long as that number is not too great, of course). This seems to provide the
passing stream of behavior with more "hooks" for the observers to grab.
Further, when data are collected at a somewhat more detailed level than
required, we are in a position to justify empirically our later lumping. Given
the coding scheme presented in the previous paragraph, for example, a
critic might object that the different kinds of vocalization we coded are so
different that they should not be dealt with as a single class. Yet if we can
show that the frequency of gesture use was not different for the different
kinds of vocalizations in the different age groups, then there would be no
reason, for these purposes, to use anything other than the lumped category.
Moreover, the decision to lump would then be based on something more
than our initial hunches.
Finally, and this is the third reason, more detailed data may reveal some-
thing of interest to others whose concerns may differ from ours, and at the
same time may suggest something unanticipated to us. For example, given
the coding scheme described above, we might find out that how gestures
and vocalizations are coordinated depends on whether the other person
involved is a peer or an adult, even if initially we had not been much in-
terested in, nor had even expected, effects associated with the interactive
partner.
We should note that the level of analysis need not be an either/or
choice between coding data at a detailed or a global level. We may have
a set of research questions that call for more than one coding system.
For example, Gottman and Levenson are currently employing a socially
based coding scheme to describe emotional moments as angry, sad, etc.
Observers also note if there was facial movement during each emotional
moment. These facial movements are then coded with a detailed, physically
based coding system, Ekman and Friesen's Facial Action Coding System
(FACS). Gottman and Levenson collected psychophysiological data while
married couples interacted. One research question concerns whether there
are specific physiological profiles for specific categories of facial expres-
sions. The FACS coding is needed to address this question, but in the
Gottman and Levenson study a decision had to be made about sampling
moments for FACS coding because detailed FACS coding is so costly. The
socially based system is thus used as an aid to employing a more detailed,
physically based coding system. Coding schemes at different levels of
analysis can thus be used in tandem within the same study.
Next they noted what FICS behavior initiated and maintained these bursts
and identified a set of sequences characteristic of the bursts. Using the
larger interaction units, they discovered evidence for what they called an
"interactional ripple effect," by which they meant the increased likelihood
of the initiating event of the chain occurring once a chain has been run off.
There are many consequences of employing a coding system in a se-
ries of studies. New codes may appear, new distinctions may be made, or
6. Food Call
7. Huddling Call
Maternal behaviors
Body movements
8. Undetermined Move
9. Egg Turn
10. Resettle
Head movements
11. Peck
12. Beak Clap
Vocalizations
13. Cluck
14. Intermediate Call
15. Food Call
16. Mild Alarm Call
To those familiar with the behavior of chickens, these codes appear
"natural" and discrete. Trained observers apparently have no trouble dis-
criminating, for example, between a Phioo, a Soft Peep, a Peep, and a
Screech, each of which in fact appears somewhat different on a spectro-
graphic recording. Thus "physical reality" may undergird these codes, but
human observers are still asked to make the determinations.
These codes are also clearly organized. There are three levels to this
particular hierarchy. On the first level, embryonic and maternal behavior
are distinguished; on the second, different kinds of embryonic (distress
and pleasure calls) and different kinds of maternal (body movements, head
movements, vocalizations) behavior are differentiated; and on the third, the
codes themselves are defined. Within each "second-level" category, codes
are mutually exclusive, but codes across different second-level categories
can cooccur. Indeed, cooccurrence of certain kinds of behavior, like em-
bryonic distress calls and maternal body movements, was very much of
interest to the investigators.
There are at least three reasons why we think organizing codes in this
hierarchical fashion is often desirable. First, it both ensures and reflects a
certain amount of conceptual analysis. Second, it makes the codes easier to
explain to others and easier for observers to memorize. Third, it facilitates
analysis. For example, for some analyses all embryonic distress calls and
all maternal vocalizations were lumped together, which is an example of a
practice we recommended earlier - analyzing on a more molar level than
that used for data collection.
Unlike the codes for chicken behavior described above, it is hard to
claim that any physical reality undergirds Gottman's content codes. For that
very reason, he took considerable pains to demonstrate observer reliability
for his codes - which Tuculescu and Griswold did not. Gottman also made
finer distinctions when defining his codes than he found useful for later
analyses - which is a natural tendency when the cleavage between codes is
not all that clear. Still, like Tuculescu and Griswold if for slightly different
reasons, he found it useful to lump codes for analysis. In fact, the initial 42
content codes were reduced to 20 for his analyses of friendship formation.
The three codes derived from the 16 initial codes listed above were (a) Weak
demands - numbers 2, 3, 5, 6, 7, 8, and 10 above; (b) Strong Demands -
numbers 1, 4, 9, 11, and 12 above; and (c) Demands for the Pair - numbers
13, 14, 15, and 16 above.
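Lumping of this sort amounts to a lookup table that maps each detailed code to its superordinate category. A minimal sketch (in Python; the example sequence is invented) follows.

# Groupings given in the text: initial code number -> lumped category.
lumping = {}
for n in (2, 3, 5, 6, 7, 8, 10):
    lumping[n] = "Weak demands"
for n in (1, 4, 9, 11, 12):
    lumping[n] = "Strong demands"
for n in (13, 14, 15, 16):
    lumping[n] = "Demands for the pair"

# A hypothetical sequence of detailed codes is recoded at the lumped level.
detailed_sequence = [2, 9, 13, 5, 16, 4]
lumped_sequence = [lumping[code] for code in detailed_sequence]
print(lumped_sequence)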
Using the result of sequential analysis from a detailed coding system,
Gottman devised a "macro" coding system. The macro system was de-
signed so that it would be faster to use (2 hours per hour of tape instead of
30) and would code for the sequences previously identified as important. In
the process of building the macro system, new codes were added because
in moving to a larger interaction unit, he noticed new phenomena that had
never been noticed before. For example, the codes escalation and deesca-
lation of a common-ground activity were created. Gottman (1983) wrote:
Escalation and deescalation of common-ground activity were included as cate-
gories because it appeared that the children often initially established a relatively
simple common-ground activity (such as coloring side by side) that made low
demands of each child for social responsiveness. For example, in coloring side
by side, each child would narrate his or her own activity (e.g., "I'm coloring mine
green"). This involved extensive use of the ME codes. Piaget (1930) described
this as collective monologue, though such conversation is clearly an acceptable
form of dialogue. However, in the present investigation the common-ground
activity was usually escalated after a while. This anecdotal observation is consis-
tent with Bakeman and Brownlee's (1980) recent report that parallel play among
preschool children is usually the temporal precursor of group play. However, the
extent of this process of escalation was far greater than Bakeman and Brownlee
(1980) imagined. An example of this escalation is the following: Both children
begin narrating their own activity; then one child may introduce INX codes (nar-
ration of the other child's activity - e.g., "You're coloring in the lines"); next, a
child may begin giving suggestions or other commands to the other child (e.g.,
"Use blue. That'd be nice"). The activity escalates in each case in terms of the
responsiveness demand it places on the children. A joint activity is then suggested
and the complexity of this activity will be escalated from time to time.
This escalation process was sometimes smooth, but sometimes it introduced
conflict. When it did introduce conflict, the children often deescalated that ac-
tivity, either returning to a previous activity that they had been able to maintain
or moving to information exchange. While many investigators have called at-
tention to individual differences in the complexity of children's dialogue during
play (e.g., Garvey, 1974; Garvey & Berndt, 1977), the anecdotal observation here
is that a dyad will escalate the complexity of the play (with complexity defined
in terms of the responsiveness demand) and manage this complexity as the play
proceeds. I had not noticed this complex process until I designed this coding
system. However, I do not mean to suggest that these processes are subtle or hard
to notice, but only that they have until now been overlooked. An example will
help clarify this point. D, the host, is 4-0; and J, the guest, is 4-2. They begin
playing in parallel, but note that their dialogue is connected.
20. J: You got white Play-Doh and this color and that color.
21. D: Every color. That's the colors we got.
They continue playing, escalating the responsiveness demand by using strong
forms of demands.
29. D: I'm putting pink in the blue.
30. J: Mix pink.
31. D: Pass the blue.
32. J: I think I'll pass the blue.
They next move toward doing the same thing together (common-ground activity).
35. D: And you make those for after we get it together, OK?
36. J: 'Kay.
37. D: Have to make these.
38. J: Pretend like those little roll cookies, too, OK?
39. D: And make, um, make a, um, pancake, too.
40. J: Oh rats. This is a little pancake.
41. D: OK. Make, make me, um, make 2 flat cookies. Cause I'm, I'm cutting
any, I'm cutting this. My snake.
The next escalation includes offers.
54. J: You want all my blue?
55. D: Yes. To make cookies. Just to make cookies, but we can't mess the
cookies all up.
56. J: Nope.
They then introduce a joint activity and begin using "we" terms in describing
what the activity is:
57. D: Put this the right way, OK? We're making supper, huh?
58. J: We're making supper. Maybe we could use, if you get white, we could
use that too, maybe.
59. D: I don't have any white. Yes, we, yes I do.
60. J: If you got some white, we could have some, y'know.
As they continue the play, they employ occasional contextual reminders that this
is a joint activity:
72. D: Oh, we've got to have our dinner. Trying to make some.
D then tries to escalate the play by introducing some fantasy. This escalation is
not successful. J is first allocated a low-status role (baby), then a higher-status
role (sister), then a higher-status (but still not an equal-status) role (big sister).
76. D: I'm the mommy.
77. J: Who am I?
78. D: Um, the baby.
79. J: Daddy.
80. D: Sister.
81. J: I wanna be the daddy.
82. D: You're the sister.
83. J: Daddy.
84. D: You're the big sister!
85. J: Don't play house. I don't want to play house.
The escalation failure leads to a deescalation.
87. J: Just play eat-eat. We can play eat-eat. We have to play that way.
However, in this case, the successful deescalation was not accomplished without
some conflict:
89. J: Look hungry!
90. D: Huh?
91. J: I said look hungry!
92. D: Look hungry? This is dumb.
93. J: Look hungry!
94. D: No!
The children then successfully returned to the previous level of common ground
activity, preparing a meal together. Common ground activity is thus viewed in
this coding system as a hierarchy in terms of the responsiveness it demands of
each child and in terms of the fun it promises. (pp. 55-57)
categories for the other three schemes represent different modes or kinds of
behavior or different questions about a particular behavioral event and are
clearly not mutually exclusive (with the exception of embryonic distress
and pleasure calls).
The Landesman-Dwyer and the Bakeman and Brownlee schemes are
formally identical. Both consist of several sets of mutually exclusive and
exhaustive codes. This is a useful structure for codes because it ensures that
a number of different questions will be answered: What is the
baby doing with his eyes? With his mouth? Did the taker have prior posses-
sion? They differ, however, in when the questions are asked. Landesman-
Dwyer's scheme is used to characterize each successive moment in time,
whereas Bakeman and Brownlee's scheme is used to characterize a partic-
ular event and is "activated" only when the event of interest occurs.
12. Play and Sex (any type of play and/or sexual posturing, exclusive
of locomotion)
13. Aggression (vigorous and/or prolonged biting, hair pulling, clasp-
ing, accompanied by one or more of threat, barking, piloerection,
or strutting)
14. Social Contact (contact and/or proximity with another subject, ex-
clusive of Mother-Infant Ventral, Ventral Cling, Aggression, or
Play and Sex)
Except for Passive (which can cooccur with Self-Clasp, Self-Mouth, and
Vocalization), these codes appear to be mutually exclusive. In some cases,
activities that could cooccur have been made mutually exclusive by defini-
tion. For example, if an activity involves both Stereotypy and Locomotion,
then Stereotypy is coded. Similarly, if what appears to be Social Contact
involves a more specific activity for which a code is defined (like Play
and Sex), then the specific code takes precedence. Defining such rules of
precedence is, in fact, a common way to make a set of codes mutually
exclusive.
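A precedence rule can be expressed as an ordered list of codes: whenever more than one code describes the current activity, only the highest-ranking one is recorded. The sketch below (in Python) illustrates the idea; only the two orderings mentioned in the text come from the scheme itself, and the rest of the order is invented for illustration.

# Higher in the list = higher precedence. Only two orderings are taken from
# the text (Stereotypy over Locomotion, Play and Sex over Social Contact);
# the remainder of the order is illustrative.
PRECEDENCE = ["Aggression", "Play and Sex", "Stereotypy",
              "Locomotion", "Social Contact", "Passive"]

def resolve(candidate_codes):
    """Given all codes that describe the current activity, record only the
    highest-precedence one, making the recorded codes mutually exclusive."""
    for code in PRECEDENCE:
        if code in candidate_codes:
            return code
    return None

print(resolve({"Stereotypy", "Locomotion"}))        # Stereotypy
print(resolve({"Social Contact", "Play and Sex"}))  # Play and Sex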
2.15 Summary
No other single element is as important to the success of an observational
study as the coding scheme. Yet developing an appropriate scheme (or
schemes) is often an arduous task. It should be assumed that it will in-
volve a fair number of hours of informal observation (either "live" or
using videotapes), a fair amount of occasionally heated discussion, and
several successively refined versions of the coding scheme. Throughout
this process, participants should continually ask themselves, exactly what
questions do we want to answer, and how will this way of coding behavior
help us answer those questions?
There is no reason to expect this process to be easy. After all, quite apart
from our current research traditions, developing "coding schemes" (making
distinctions, categorizing, developing taxonomies) is an ancient, perhaps
even fundamental, intellectual activity. It seems reasonable to view one
product of this activity, the coding scheme, as an informal hypothesis, and
the entire study in which the coding scheme is embedded as a "test" of that
hypothesis. If the "hypothesis" has merit, if it is clearly focused and makes
proper distinctions, then sensible and interpretable results should emerge.
When results seem confused and inconclusive, however, this state of affairs
should not automatically be blamed on a lack of proper data-analytic tools
for observational data. First we should ask, are questions clearly stated,
and do the coding schemes fit the questions? We hope that a consideration
of the various issues raised in this chapter will make affirmative answers
to these questions more likely.
In this chapter, we have confined our discussion to coding schemes. Five
examples of developed schemes were presented, and an additional four are
detailed in the companion volume (Bakeman and Quera, 1995a, chapter
2). We have stressed in particular how coding schemes can be organized
or structured and have left for the next chapter a discussion of how coding
schemes are put into use. This separation is somewhat artificial. How
behavioral sequences are to be recorded can and often does affect how codes
are defined and organized in the first place. This is especially evident when
the Landesman-Dwyer and the Bakeman and Brownlee schemes discussed
earlier are compared. Still, a scheme like Landesman-Dwyer's could be
recorded in two quite different ways, as we discuss in the next chapter. It
is the task of that chapter to describe the different ways behavior can be
recorded, once behavioral codes have been defined.
Recording behavioral
sequences
long they last. At other times, duration - the mean amount of time a par-
ticular kind of event lasts or the proportion of time devoted to a particular
kind of event - is very much of concern. As a result, many writers have
found it convenient to distinguish between "momentary events" (or fre-
quency behaviors) on the one hand, and "behavioral states" (or duration
behaviors) on the other (J. Altmann, 1974; Sackett, 1978). The distinction
is not absolute, of course, but examples of relatively brief and discrete, mo-
mentary events could include baby burps, dog yelps, child points, or any of
Gottman's thought unit codes described in section 2.11, whereas examples
of duration events could include baby asleep, dog hunting, child engaged in
parallel play, or any of Landesman-Dwyer's baby behavior codes described
in section 2.12.
One particular way of conceptualizing duration events is both so common
and so useful it deserves comment. Often researchers view the events they
code as "behavioral states." Typically, the assumption is that the behav-
ior observers see reflects some underlying "organization," and that at any
given time the infant, animal, dyad, etc., will be "in" a particular state. The
observers' task then is to segment the stream of behavior into mutually ex-
clusive and exhaustive behavioral states, such as the arousal states often de-
scribed for young infants (REM sleep, quiet alert, fussy, etc.; Wolff, 1966).
The distinction between momentary and duration events (or between dis-
crete events and behavioral states) seems worth making to us, partly because
of the implications it may have for how data are recorded. When the investi-
gator wants to know only the order of events (for example, Gottman's study
of friendship formation) or how behavioral states are sequenced (Bakeman
and Brownlee's study of parallel play), then the recording system need not
preserve time. However, if the investigator wants also to report proportion
of time devoted to the different behavioral states, then time information
of course needs to be recorded. In general, when duration matters, the
recording system must somehow preserve elapsed time for each of the
coded events. Moreover, when occurrences of different kinds of events
are to be related, beginning times for these events need to be preserved as
well. (Examples are provided by Landesman-Dwyer and by Tuculescu and
Griswold; see section 3.5, Recording onset and offset times.)
moving roll of paper and its pens, ready to record events by their deflection
(see Figure 3.1). It is a rather cumbersome device, rarely used in observa-
tional research. Almost always, researchers prefer pencil, paper, and some
sort of clock, or else (increasingly) an electronic recording device. Still,
the phrase "continuous recording" seems appropriate for the strategies we
describe here, not because the paper rolls on, but because the observers are
continuously alert, paying attention, ready to record whenever an event of
interest occurs, whenever a behavioral state changes, or whenever a specific
time interval elapses.
Given that this is a book about sequential analysis in particular, and
not just systematic observation in general, the emphasis on continuous
recording is understandable. After all, for sequential analysis to make
much sense, the record of the passing stream of behavior captured by the
coding/recording system needs to be essentially continuous, free of gaps.
However, we do discuss intermittent recording in section 3.9 (Nonsequen-
tial considerations: time sampling).
The purpose of the following sections is to describe different ways of col-
lecting observational data, including recording strategies that code events
and ones that code intervals. For each way of collecting data, we note
what sort of time information is preserved, as well as other advantages and
disadvantages.
to record a particular code. When the events of interest, not a time interval
running out, are what stir an observer into action, we would say that an event
coding strategy is being used to record observational data. The simplest
example of event coding occurs when observers are asked just to code
events, making no note of time. For example, an investigator might be
interested in how often preschool children try to hit each other, how often
they quarrel, and how often they ask for an adult's aid. The observer's task
then is simply to make a tally whenever one of these codable events occurs.
Such data are often collected with a "checklist." The behavioral codes are
written across a page, at the top of columns. Then when a codable event
occurs, a tally mark is made in the appropriate column. No doubt our
readers are already quite aware of this simple way of collecting data. Still,
it is useful whenever investigators want only to know how often events of
interest occur (frequency information) or at what rate they occur (relative
frequency information) (see Figure 3.2).
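From such tallies, frequency and rate follow directly: the frequency of a code is its tally, and its rate is the tally divided by the time observed. A minimal sketch (in Python, with invented events) follows.

from collections import Counter

# Tallies from a hypothetical checklist: each entry is one codable event.
events = ["hit", "quarrel", "hit", "seek adult aid", "quarrel", "hit"]
minutes_observed = 60.0

counts = Counter(events)                 # frequency information
for code, count in counts.items():
    rate = count / minutes_observed      # relative frequency: events per minute
    print(f"{code:15s} frequency={count}  rate={rate:.2f}/min")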
Such data can be important. However, of more immediate concern to
us, given the focus of this book, are event coding strategies that result
in sequential data. For example, Gottman segmented the stream of talk
into successive thought units. Each of these events was then coded, pro-
viding a continuous record of how different kinds of thought units were
sequenced in the conversations Gottman tape-recorded. Similarly, Bake-
man and Brownlee (who actually used an interval coding strategy) could
have asked observers to note instead whenever the play state of the child
they were observing changed. Each new play state would then have been
coded, resulting in a record of how different kinds of play states were
sequenced during free play (see Figure 3.3).
[Figure 3.3. An event-coded record: the sequence of play-state codes recorded by an observer.]
In these two examples, the basic requirement for sequential data - con-
tinuity between successive coded units - is assured because the stream of
talk or the stream of behavior is segmented into successive events (units) in
a way that leaves no gaps. However, sequential data may also result when
observers simply report that this happened, then that happened, then that
happened next, recording the order of codable events. Whether such data
are regarded as sequential or not depends on how plausible the assumption
of continuity between successive events is, which in turn depends on the
coding scheme and the local circumstances surrounding the observation.
However, rather than become involved in questions of plausibility, we think
it better if codes are defined so as to be mutually exclusive and exhaustive in
the first place. Then it is easy to argue that the data consist of a continuous
record of successive events or behavioral states.
Whether behavior is observed "live" or viewed on videotape does not
matter. For example, observers could be instructed to sit in front of a cage
from 10 to 11 a.m. on Monday, 3 to 4 p.m. on Tuesday, etc., and record
whenever an infant monkey changed his activity, or observers could be
instructed to watch several segments of videotape and to record whenever
the "play state" of the "focal child" changed, perhaps using Smith's social
participation coding scheme (Alone, Parallel, Group). In both cases, ob-
servers would record the number and sequence of codable events. An obvi-
ous advantage of working from videotapes is that events can be played and
replayed until observers feel sure about how to code a particular sequence.
Still, both result in a complete record of the codable events that occurred
during some specified time period.
that does not. First, imagine that observers using the Landesman-Dwyer
baby's eyes code were equipped with electronic recording devices. They
would learn to push the correct keys by touch; thus they could observe
the baby continuously, entering codes to indicate when behavior changed.
(For example, an 11 might be used for Closed, a 12 for Slow Roll, a 16
for Bright, etc.) Times would be recorded automatically by a clock in the
device (see Figure 3.4). Later, the observer's record would be "dumped" to
a computer, and time budget (i.e., percentage scores for different behavioral
codes) and other information would be computed by a computer program.
(In fact, Landesman-Dwyer used such devices but a different recording
strategy, as described in the next section.)
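The time-budget computation itself is straightforward. In the sketch below (in Python; the record, the session length, and the assumption that each code remains in effect until the next one is entered are all illustrative), each state's duration runs from its onset to the next onset, and percentages are computed from the session total.

# A hypothetical record from an electronic recording device:
# (onset time in seconds, eyes code). 11 = Closed, 12 = Slow Roll, 16 = Bright.
record = [(0, 11), (75, 16), (210, 12), (240, 16)]
session_end = 300  # seconds; assumed known

# Each state lasts from its onset until the next onset (or the end of the session).
durations = {}
for (onset, code), (next_onset, _) in zip(record, record[1:] + [(session_end, None)]):
    durations[code] = durations.get(code, 0) + (next_onset - onset)

total = session_end
for code, seconds in sorted(durations.items()):
    print(f"code {code}: {seconds:4d} s  ({seconds / total:.0%} of session)")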
The second application we want to describe involves pencil and paper
recording, but couples that with "time-stamped" videotapes. For a study
of preverbal communication development, Bakeman and Adamson (1984)
videotaped infants playing with mothers and with peers at different ages.
At the same time that the picture and sound were recorded, a time code
was also placed on the tape. Later, observers were instructed to segment
the tapes into different "engagement states" as defined by the investigators'
code catalog. In practice, observers would play and replay the tapes until
they felt certain about the point where the engagement state changed. They
would then record the code and onset time for the new engagement state.
In summary, one way to preserve a complete record of how behavior
unfolds in time (when such is desired) is to record onset (and if necessary
offset) times for all codable events. This is easy to do when one is working
from time-stamped videotapes. When coding live, this is probably best
done with electronic recording devices (either special purpose devices like
the one shown in Figure 3.4 or general-purpose handheld or notebook
computers, programmed appropriately, which increasingly are replacing
special purpose devices). When such devices are not available, the same
information can be obtained with pencil, paper, and some sort of clock,
but this complicates the observers' task. In such cases, investigators might
want to consider the approximate methods described in section 3.7, on
Coding intervals.
On the surface of it, this seems like more work than necessary. After all, why
require observers to enter codes for external stimulation, eyes, head, and
body when there has been no change just because there has been a change
in facial behavior? In fact, observers who use this approach to recording
data report that, once they are trained, it does not seem like extra work at
all. Moreover, because the status of all five groups is noted whenever any
change occurs, investigators feel confident that changes are seldom missed,
which they might not be if observers were responsible for monitoring five
different kinds of behavior separately.
Paradoxically, then, more may sometimes be less, meaning that more
structure - that is, always entering a structured 5-digit code - may seem
like less work, less to remember. This is the approach that Landesman-
Dwyer actually uses for her Baby Behavior Code. We should add, however,
that her observers use an electronic recording device so that time is automat-
ically recorded everytime a 5-digit code is entered. In fact, we suspect that
the Baby Behavior Code would be next to impossible to use without such
a device, no matter which recording strategy (timing onsets or timing pat-
tern changes) were employed. Given proper instrumentation, however, the
same information (frequencies, mean duration, percents, cooccurrences,
etc.) would be available from data recorded using either of these strate-
gies. When a researcher's coding scheme is structured appropriately, then,
which strategy should be used? It probably depends in part on observer
preference and investigator taste, but we think that recording the timing of
pattern changes is a strategy worth considering.
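To illustrate what such records make possible, the sketch below (in Python) assumes, purely for illustration, that each digit of the 5-digit code gives the current state of one of the five groups. Summing the time between successive pattern changes then yields the time spent in each state of each group, from which frequencies, percentages, and cooccurrences follow.

from collections import defaultdict

# Assumed, for illustration only: each digit of the 5-digit code gives the
# current state of one group; a full code is entered at every pattern change.
GROUPS = ["stimulation", "eyes", "head", "body", "face"]
record = [(0, "11213"), (40, "11223"), (90, "16223"), (150, "16224")]  # (seconds, code)
session_end = 200  # seconds; assumed known

time_in_state = defaultdict(float)
for (onset, code), (next_onset, _) in zip(record, record[1:] + [(session_end, "")]):
    duration = next_onset - onset
    for group, digit in zip(GROUPS, code):
        time_in_state[(group, digit)] += duration

for (group, digit), seconds in sorted(time_in_state.items()):
    print(f"{group:12s} state {digit}: {seconds:5.0f} s")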
current possessor resisted the take attempt or not, and (c) whether the child
attempting to take the object was successful or not. Note that the possible
answers to each of these three questions (yes/no) are mutually exclusive
and exhaustive and that the three questions have a natural temporal order.
The cross-classification of events is a recording strategy with many ad-
vantages. For one thing, techniques for analyzing cross-classified data
(contingency tables) have received a good deal of attention, both histori-
cally and currently, and they are relatively well worked out (see chapter
10). Also, clear and simple descriptive data typically result. For example,
in another study, Brownlee and Bakeman (1981) were interested in what
"hitting" might mean to very young children. They defined three kinds of
hits (Open, Hard, and Novelty) and then asked observers to record when-
ever one occurred and to note the consequence (classified as No Further
Interaction, Ensuing Negative Interaction, or Ensuing Positive Interaction).
They were able to report that open hits were followed by no further interac-
tion and novelty hits by ensuing positive interaction more often than chance
would suggest, but only for one of the age groups studied, whereas hard
hits were associated with negative consequences in all age groups.
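Cross-classified events of this kind reduce naturally to a contingency table of counts. A minimal sketch (in Python, with invented events) follows; chapter 10 takes up the analysis of such tables.

from collections import Counter

# Hypothetical cross-classified events: (hit type, consequence).
events = [("Open", "No Further Interaction"),
          ("Hard", "Ensuing Negative Interaction"),
          ("Novelty", "Ensuing Positive Interaction"),
          ("Open", "No Further Interaction"),
          ("Hard", "Ensuing Negative Interaction")]

table = Counter(events)  # cell counts of the hit-type by consequence table
for (hit, consequence), count in sorted(table.items()):
    print(f"{hit:8s} x {consequence:30s}: {count}")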
A further advantage is that a coding scheme appropriate for cross-
classifying events (temporally ordered superordinate categories, mutually
exclusive and exhaustive codes within each superordinate category) im-
plies a certain amount of conceptual analysis and forethought. In general,
we think that this is desirable, but in some circumstances it could be a
liability. When cross-classifying events, observers do impose a certain
amount of structure on the passing stream of behavior, which could mean
that interesting sequences not accounted for in the coding scheme might
pass by unseen, like ships in the night. Certainly, cross-classifying events
is a useful and powerful way of recording data about behavioral sequences
when investigators have fairly specific questions in mind. It may be a less
useful strategy for more exploratory work.
view tapes, slowing down the speed and rewinding and reviewing as neces-
sary. The times when events occurred are read from the time displayed on
the screen. This time may be written down, using pencil and paper, or keyed
along with its accompanying code directly into a computer. An improve-
ment on this strategy involves recording time information on the videotape
in some machine-readable format and connecting a computer to the video
player. Then observers need only depress keys corresponding to particular
behaviors; the computer both reads the current time and stores it along
with the appropriate code. Moreover, the video player can be controlled
3.12 Summary
Table 3.1 is a summary of the four major conceptual recording schemes
we have discussed in this chapter, together with their advantages and dis-
advantages. The particular recording scheme chosen clearly depends on
the research question. However, in general, we find event recording (with
or without timing of onsets and offsets or timing of pattern changes) and
cross-classifying events to be more useful for sequential analyses than ei-
ther interval recording or time sampling.
Assessing observer agreement
Accuracy
The major conceptual reason for assessing interobserver agreement, then,
is to convince others as to the "accuracy" of the recorded data. The as-
sumption is, if two naive observers independently make essentially similar
codings of the same events, then the data they collect should reflect some-
thing more than a desire to please the "boss" by seeing what the boss wants,
and something more than one individual's unique and perhaps strange way
of seeing the world.
Some small-scale studies may require only one observer, but this does
not obviate the need for demonstrating agreement. For example, in one
study Brownlee and Bakeman (1981) were concerned with communicative
aspects of hitting in 1-, 2-, and 3-year-old children. After repeated viewings
of 9 hours of videotape collected in one day-care center, they developed
some hypotheses about hitting and a coding scheme they thought useful for
children of those ages. The next step was to have a single observer collect
data "live" in another day-care center. There were two reasons for this.
First, given a well-worked-out coding scheme, they thought observing live
would be more efficient (no videotapes to code later), and second, nursery
school personnel were concerned about the disruption to their program
that multiple observers and/or video equipment might entail. Two or more
observers, each observing at different times, could have been used, but
Brownlee and Bakeman thought that using one observer for the entire study
would result in more consistent data. Further, the amount of observation
required could easily be handled by one person. Nonetheless, two observers
were trained, and agreement between them was checked before the "main"
observer began collecting data. This was done so that the investigators,
and others, would be convinced that this observer did not have a unique
personal vision and that, on a few occasions at least, he and another person
independently reported seeing essentially the same events.
Calibration
Just as assuring accuracy is the major conceptual reason, so calibrating ob-
servers is probably the major practical reason for establishing interobserver
agreement. A study may involve a large number of separate observations
and/or extend over several months or years. Whatever the reason, when
different observers are used to collect the same kind of data, we need to
assure ourselves that the data collected do not vary as a function of the
observer. This means that we need to calibrate observers with each other
or, better yet, calibrate all observers against some standard protocol.
Reliability decay
Not only do we need to assure ourselves that different observers are coding
similar events in similar ways, we also need to be sure that an individual ob-
server's coding is consistent over time. Taplin and Reid (1973) conducted a
study of interobserver reliability as a function of observer's awareness that
their coding was being checked by an independent observer. There were
three groups - a group that was told that their work would not be checked, a
group that was told that their work would be spot-checked at regular inter-
vals, and a group that was told that their work would be randomly checked.
Actually the work of all three groups was checked for all seven sessions.
All groups showed a gradual decay in reliability from the 80% training
level. The no-check group showed the largest decay. The spot-check
group's reliability increased during sessions 3 and 6, when they thought
they were being checked. The random-check group performed the best over
all sessions, though lower than the spot-check group on sessions 3 and 6.
Reliability decay can be a serious problem when the coding process takes
a long time, which is often the case in a large study that employs a complex
coding scheme. One solution to the problem was reported by Gottman
(1979a). Gottman obtained a significant increment in reliability over time
by employing the following procedure in coding videotapes of marital
interaction. One employee was designated the "reliability checker"; the
reliability checker coded a random sample of every coder's work. A folder
was kept for each coder to assess consistent confusion in coding, so that
retraining could be conducted during the coder's periodic meetings with
the reliability checker. To test for the possibility that the checker changed
coding style for each coder, two procedures were employed. First, in one
study the checker did not know who had been assigned to any particular
tape until after it was coded. This procedure did not alter reliabilities.
Second, coders occasionally served as reliability checkers for one another
in another study. This procedure also did not alter reliabilities. Gottman
also conducted a few studies that varied the amount of interaction that the
checker coded. The reliabilities were essentially unaffected by sampling
larger segments, with one exception: The reliabilities of infrequent codes
are greatly affected by sampling smaller segments. It is thus necessary for
each coding system to determine the amount that the checker codes as a
function of the frequency of the least frequent codes.
What should be clear from the above is that investigators need to be
concerned not just with inter-, but also with intraobserver reliability. An
investigator who has dealt with the problems of inter- and intraobserver
agreement especially well is Gerald Patterson, currently of the Oregon
Social Learning Center. Over the past several years, Patterson and his co-
workers have trained a number of observers to use their coding schemes.
Reliability vs. agreement
two seconds. They disagree when one records a distress call and the other
does not, or when they agree that a distress call occurred but disagree
as to what kind it is. (Following an old classification system for sins,
some writers call these disagreements "omission errors" and "commission
errors," respectively.)
Once agreements and disagreements have been identified and tallied, the
percentage of agreement can be computed. For example, if two observers
both recorded eight Phioos at essentially the same time, but disagreed three
times (each observer recorded one Phioo that the other did not, and once one
observer recorded a Phioo that the other called a Soft Peep), the percentage
of agreement would be 73 (8 divided by 8 + 3 times 100). Percentage
agreement could also be reported, not just for Phioos in particular, but
for embryonic distress calls in general. For example, if the two observers
agreed as to type of distress call 35 times but disagreed 8 times, then the
percentage of agreement would be 81 (35 divided by 35 + 8 times 100).
Given a reasonable definition for agreement and for disagreement, the
percentage of agreement is easy enough to compute. However, it is not at
all clear what the number means. It is commonly thought that agreement
percentages are "good" if they are in the 90s, but there is no rational basis
for this belief. The problem is that too many factors can affect the per-
centage of agreement - including the number of codes in the code catalog
- so that comparability across studies is lost. One person's 91% can be
someone else's 78%.
Perhaps the most telling argument against agreement percentage scores
is this: Given a particular coding scheme and a particular recording strategy,
some agreement would occur just by chance alone, even with blindfolded
observers, and agreement percentage scores do not correct for this. This be-
comes most clear when an interval coding strategy is coupled with a simple
mutually exclusive and exhaustive scheme, as in the study of parallel play
described in section 1.7. Recall that for this study, Bakeman and Brownlee
had observers code each successive 15-second interval as either Unoccu-
pied, Solitary, Together, Parallel, or Group. If two observers had each
coded the same 100 intervals, the pattern of agreement might have been as
depicted in Figure 4.1. In this case, the percentage of agreement would be
87 (87 divided by 87 + 13 times 100). However, as we show in the next sec-
tion, an agreement of 22.5% would be expected, in this case, just by chance
alone. The problem with agreement percentages is that they do not take
into account the part of the observed agreement that is due just to chance.
Figure 4.1 is sometimes called a "confusion matrix," and it is useful
for monitoring areas of disagreement that are systematic or unsystematic.
After computing the frequencies of entries in the confusion matrix, the
reliability checker should scan for clusters off the diagonal. These indicate pairs of codes that the observers systematically confuse with each other.
[Figure 4.1: an agreement (confusion) matrix in which two observers' codings of the same 100 intervals are cross-tabulated. One observer's totals for Unoccupied, Solitary, Together, Parallel, and Group are 8, 25, 23, 29, and 15; the other's are 9, 25, 21, 28, and 17; and the diagonal (agreement) cells contain 7, 24, 17, 25, and 14 tallies, or 87 agreements in all.]
The observed proportion of agreement, P_obs, is the sum of the diagonal tallies divided by the total number of tallies. Symbolically:

P_obs = (x_11 + x_22 + ... + x_kk) / N

where k is the number of codes (i.e., the order of the agreement matrix), x_ii is the number of tallies for the ith row and column (i.e., the diagonal cells), and N is the total number of tallies for the matrix. For the agreement portrayed in Figure 4.1, this is:

P_obs = (7 + 24 + 17 + 25 + 14) / 100 = .87
P_exp is computed by summing up the chance agreement probabilities for
each category. For example, given the data in Figure 4.1, the probability
that an interval would be coded Unoccupied was .08 for the first observer
and .09 for the second. From basic probability theory, the probability of
two events occurring jointly (in this case, both observers coding an interval
Unoccupied), just due to chance, is the product of their simple probabilities.
Thus the probability that both observers would code an interval Unoccupied
just by chance is .0072 (.08 x .09). Similarly, the chance probability that
both would code an interval Solitary is .0625 (.25 x .25), Together is .0483
(.21 x .23), Parallel is .0812 (.28 x .29), and Group is .0255 (.17 x .15).
Summing the chance probabilities for each category gives the overall pro-
portion of agreement expected by chance (P_exp), which in this case is .2247.
A bit of algebraic manipulation suggests a somewhat simpler way to compute P_exp: multiply the first column total by the first row total, add this to the second column total multiplied by the second row total, etc., and then divide the resulting sum of the column-row products by the total number of tallies squared. Symbolically:

P_exp = (x_+1 × x_1+ + x_+2 × x_2+ + ... + x_+k × x_k+) / N²

where x_+i and x_i+ are the sums for the ith column and row, respectively (thus one row by column sum cross-product is computed for each diagonal cell). For the agreement given in Figure 4.1, this is:

P_exp = (9 × 8 + 25 × 25 + 21 × 23 + 28 × 29 + 17 × 15) / (100 × 100) = .2247
[Figure 4.2: a 2 × 2 agreement matrix with 7 and 1 tallies in its first row and 2 and 90 in its second, so that the row totals are 8 and 92, the column totals 9 and 91, and N = 100.]
Figure 4.2. An agreement matrix using a coding scheme with two codes. Although
there is 97% agreement, 84% would be expected just by chance alone.
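Readers who want to verify such computations by machine can do so with a few lines of code. The sketch below is in Python rather than the Pascal of the Appendix; kappa itself is the chance-corrected proportion of agreement, (P_obs − P_exp) divided by (1 − P_exp), and the 2 × 2 matrix entered here is the one implied by the marginals and percentage agreement reported for Figure 4.2.

# A minimal sketch for computing Cohen's kappa from an agreement matrix.
# The matrix below is the 2 x 2 table implied by Figure 4.2 (rows: one
# observer, columns: the other); any k x k matrix of tallies could be used.

def kappa(matrix):
    n = sum(sum(row) for row in matrix)              # total tallies, N
    k = len(matrix)                                  # number of codes
    row_totals = [sum(row) for row in matrix]        # x_i+
    col_totals = [sum(col) for col in zip(*matrix)]  # x_+j
    p_obs = sum(matrix[i][i] for i in range(k)) / n
    p_exp = sum(row_totals[i] * col_totals[i] for i in range(k)) / n ** 2
    return p_obs, p_exp, (p_obs - p_exp) / (1 - p_exp)

p_obs, p_exp, k_hat = kappa([[7, 1], [2, 90]])
print(p_obs, p_exp, k_hat)   # .97, .8444, and a kappa of about .81

Any agreement matrix could be substituted for the example; entered with the matrix behind Figure 4.4, for instance, a script like this should reproduce the kappa of .904 reported in that caption.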
from the sample data. Then the value of kappa estimated from the sample
data is divided by the square root of the estimated variance and the result
compared to the normal distribution. If the result were 2.58 or bigger, for
example, we would claim that kappa differed significantly from zero at the
.01 level or better.
In this paragraph, we show how to compute the estimated variance for kappa, first defining the procedure generally and then illustrating it, using the data from Figure 4.2. The formula incorporates the number of tallies (N, in this case 100), the probability of chance agreement (P_exp, in this case .8444), and the row and column marginals: p_i+ is the probability that a tally will fall in the ith row, whereas p_+j is the probability that a tally will fall in the jth column. In the present case, p_1+ = .08, p_2+ = .92, p_+1 = .09, and p_+2 = .91. To estimate the variance of kappa, first compute
Σ_{i=1}^{k} p_i+ × p_+i × [1 − (p_+i + p_i+)]²

In the present case, this is:

.08 × .09 × [1 − (.09 + .08)]² + .92 × .91 × [1 − (.91 + .92)]² = .00496 + .57675 = .5817
Then add to it this sum:

Σ_{i=1}^{k} Σ_{j≠i} p_i+ × p_+j × (p_+i + p_j+)²
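The remaining steps, following Fleiss (1981), are to subtract P_exp² from the sum of these two terms, divide by N(1 − P_exp)², and take the square root; the result is the estimated standard error of kappa under the null hypothesis, and kappa divided by it is the z score referred to the normal distribution. A minimal sketch of the whole test, again in Python and again using the 2 × 2 matrix implied by Figure 4.2:

# A sketch of the z test for kappa, using Fleiss's (1981) formula for the
# standard error of kappa under the null hypothesis of chance agreement.
# The example matrix is the 2 x 2 table implied by Figure 4.2.

from math import sqrt

def kappa_z(matrix):
    n = sum(sum(row) for row in matrix)
    k = len(matrix)
    p_row = [sum(row) / n for row in matrix]          # p_i+
    p_col = [sum(col) / n for col in zip(*matrix)]    # p_+j
    p_obs = sum(matrix[i][i] for i in range(k)) / n
    p_exp = sum(p_row[i] * p_col[i] for i in range(k))
    kap = (p_obs - p_exp) / (1 - p_exp)
    term1 = sum(p_row[i] * p_col[i] * (1 - (p_col[i] + p_row[i])) ** 2
                for i in range(k))
    term2 = sum(p_row[i] * p_col[j] * (p_col[i] + p_row[j]) ** 2
                for i in range(k) for j in range(k) if i != j)
    se = sqrt((term1 + term2 - p_exp ** 2) / (n * (1 - p_exp) ** 2))
    return kap, se, kap / se

print(kappa_z([[7, 1], [2, 90]]))   # kappa about .81, SE about .10, z about 8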
For many investigators, this will not be stringent enough. Just as corre-
lation coefficients that account for little variance in absolute terms are often
significant, so too, quite low values of kappa often turn out to be signifi-
cant. This means only that the pattern of agreement observed is greater than
would be expected if the observers were guessing and not looking. This can
be unsatisfactory, however. We want from our observers not just better than
chance agreement; we want good agreement. Our own inclination, based
on using kappa with a number of different coding schemes, is to regard
kappas less than .7, even when significant, with some concern, but this is
only an informal rule of thumb. Fleiss (1981), for example, characterizes
kappas of .40 to .60 as fair, .60 to .75 as good, and over .75 as excellent.
The computation of kappa can be refined in at least three ways. The first
is fairly technical. Multiple observers may be used - not just two - and
investigators may want a generalized method for computing kappa across
the different pairs. In such cases, readers should consult Uebersax (1982);
a BASIC program that computes Uebersax's generalized kappa coefficient
has been written by Oud and Sattler (1984).
The second refinement is often useful, especially when codes are roughly
ordinal. Investigators may regard some disagreements (confusing Unoccu-
pied with Group Play, for example) as more serious than others (confusing
Together with Parallel Play, for example). Cohen (1968) has specified a
way of weighting different disagreements differently. Three k × k matrices
are involved: one for observed frequencies, one for expected frequencies,
and one for weights. Let x_ij, m_ij, and w_ij represent elements from these
three matrices, respectively; then m_ij = (x_+j × x_i+) ÷ N, and the w_ij indicate
how seriously we choose to regard various disagreements. Usually the
diagonal elements of the weight matrix are 0, indicating agreement (i.e.,
w_ii = 0 for i = 1 through k); cells just off the diagonal are 1, indicating
some disagreement; cells farther off the diagonal are 2, indicating more
serious disagreement; etc. For the present example, we might enter 4 in
cells x_15 and x_51, indicating that confusions between Unoccupied and Group
are given more weight than other disagreements. Then weighted kappa is
computed as follows:

κ_w = 1 − (Σ Σ w_ij × x_ij) / (Σ Σ w_ij × m_ij)

where both double sums run over all k × k cells of the matrices.
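A sketch of the weighted computation follows, again in Python; because the 2 × 2 example of Figure 4.2 offers nothing interesting to weight, both the observed counts and the weight matrix here are simply hypothetical.

# A sketch of Cohen's (1968) weighted kappa. The weight matrix assigns 0 to
# agreements and larger numbers to disagreements regarded as more serious.
# Both matrices below are hypothetical, for illustration only.

def weighted_kappa(x, w):
    n = sum(sum(row) for row in x)
    k = len(x)
    row_tot = [sum(row) for row in x]
    col_tot = [sum(col) for col in zip(*x)]
    m = [[row_tot[i] * col_tot[j] / n for j in range(k)] for i in range(k)]
    num = sum(w[i][j] * x[i][j] for i in range(k) for j in range(k))
    den = sum(w[i][j] * m[i][j] for i in range(k) for j in range(k))
    return 1 - num / den

x = [[20, 3, 1],
     [2, 25, 4],
     [0, 5, 40]]
w = [[0, 1, 2],        # disagreements two steps off the diagonal
     [1, 0, 1],        # count twice as heavily as those one step off
     [2, 1, 0]]
print(weighted_kappa(x, w))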
clear implications for observer training. In the first case (code confusion),
further training would attempt to establish a consensual definition for the
two codes; and in the second (different sensitivities), further training would
attempt to establish consensual thresholds for all codes.
be the word gap. Each gap would contribute one tally to a 2 x 2 kappa
table: (a) Both observers agree that this gap is a thought unit boundary;
(b) the first observer thinks it is, but the second does not; (c) the second
thinks it is, but the first disagrees; or (d) both agree that it is not a thought
unit boundary. Although the chance agreement would likely be high (two
coding categories, skewed distribution, assuming that most gaps would not
be boundaries), still the kappa computed would correct for that level of
chance agreement.
A better and more general procedure for determining agreement with re-
spect to unitizing (i.e., identifying homogeneous stretches of talk or partic-
ular kinds of episodes) requires that onset and offset times for the episodes
be available. Then the tallying unit for the kappa table becomes the unit
used for recording time. For example, imagine that times are recorded to
the nearest second and that observers are asked to identify conflict episodes,
recording their onset and offset times. Then kappa is computed on the ba-
sis of a simple 2 x 2 table like that shown in Figure 4.2, except that now
rows and columns are labeled yes/no, indicating whether or not a second
was coded for conflict. One further refinement is possible. When tally-
ing seconds, we could place a tally in the agreement (i.e., yes/yes) cell if
one observer claimed conflict for the second and the other observer claimed
conflict either for that second or an adjacent one, thereby counting 1-second
disagreements as agreements, as often seems reasonable.
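One way to implement this second-by-second tallying is sketched below (Python; the episode onset and offset times are hypothetical). Each observer's episodes are expanded into the set of seconds coded yes, and a second on which only one observer coded conflict is still tallied as an agreement if the other observer coded conflict within the stated tolerance.

# A sketch of second-by-second agreement about unitizing. Episode (onset,
# offset) times in seconds are hypothetical; each second is judged yes
# (within a conflict episode) or no. A 1-second discrepancy is counted as
# agreement, as suggested in the text.

def seconds(episodes):
    """Expand (onset, offset) pairs into the set of seconds coded yes."""
    return {s for on, off in episodes for s in range(on, off + 1)}

def tally(obs1, obs2, session_length, tolerance=1):
    s1, s2 = seconds(obs1), seconds(obs2)
    near2 = s2 | {s + d for s in s2 for d in (-tolerance, tolerance)}
    near1 = s1 | {s + d for s in s1 for d in (-tolerance, tolerance)}
    table = {("yes", "yes"): 0, ("yes", "no"): 0,
             ("no", "yes"): 0, ("no", "no"): 0}
    for s in range(session_length):
        a, b = s in s1, s in s2
        if (a and s in near2) or (b and s in near1):
            table[("yes", "yes")] += 1
        else:
            table[("yes" if a else "no", "yes" if b else "no")] += 1
    return table

print(tally([(10, 25), (40, 52)], [(11, 26), (41, 50)], session_length=120))

The resulting 2 × 2 table can then be entered into the kappa computations described earlier in this chapter.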
In the previous chapter, we described five general strategies for recording
observational data: (a) coding events, (b) timing onsets and offsets, (c) tim-
ing pattern changes, (d) coding intervals, and (e) cross-classifying events.
Now we would like to mention each in turn, describing the particular prob-
lems each strategy presents for determining agreement about unitizing.
Coding events, without any time information, is in some ways the most
problematic. If observers work from transcripts, marking event (thought
unit) boundaries, then the procedures outlined in the preceding paragraphs
can be applied. If observers note only the sequence of events, which
means that the recorded data consist of a string of numbers or symbols,
each representing a particular event or behavioral state, then determining
agreement as to unit boundaries is more difficult. The two protocols would
need to be aligned, which is relatively easy when agreement is high, and
much more difficult when it is not, and which requires some judgment in
any case. An example is presented in Figure 4.3.
When onset and offset or pattern-change times are recorded, however, the
matter is easier. Imagine, for example, that times are recorded to the nearest
second. Then the second can be the unit used for computing agreement both
for unitizing (identifying homogeneous stretches or episodes) and for the
individual codes themselves. Because second boundaries are determined
[Two observers' event sequences (codes U, S, T, P, G) written side by side, with each aligned pair of entries marked "a" for agreement or "d" for disagreement under each of the two methods.]
Figure 4.3. Two methods for determining agreements ("a") and disagreements
("d") when two observers have independently coded the same sequence of events.
Method A ignores errors of omission. Method B counts both errors of commission
and errors of omission as disagreements.
Figure 4.4. An agreement matrix for coding of chicken embryonic distress calls.
Each tally represents a 1-second interval. The reader may want to verify that the
percentage of agreement observed is 94.2%, the percentage expected by chance
is 39.4%, the value of kappa is .904, its standard error is .0231, and the z score
comparing kappa to its standard error is 39.2.
Figure 4.5. When time intervals are tallied, and a reasonable time interval is used,
the value of kappa is not changed by halving the time interval (which doubles the
tallies).
[Figure 4.6: for each of five persons, the number of times Observer 1 and Observer 2 recorded code A (person 5, for example, was scored 120 times by Observer 1 and 84 times by Observer 2), together with the mean squares used to compute the reliability coefficient of equation 4.1, which for these data is .93.]
This intraclass correlation is defined as the ratio of between-persons variance to total variance (equation 4.2). It
assumes that data will be interpreted within what Suen (1988) terms a norm-
referenced (i.e., values are meaningful only relatively; rank-order statistics
like correlation coefficients are emphasized) as opposed to a criterion-
referenced framework (i.e., interpretation of values references an absolute
external standard; statistics like unstandardized regression coefficients are
emphasized). Equation 4.1 is based on recommendations made by Hart-
mann (1982) and Wiggins (1973, p. 290). For other possible intraclass
correlation coefficients (generalizability coefficients), based on other as-
sumptions, see Fleiss (1986, chapter 1) and Suen (1988), although, as a
practical matter, values may not be greatly different. For example, val-
ues for a criterion-referenced fixed-effect and random-effect model per
Fleiss (1986) were .926 and .921, respectively, compared to the .931 of
Figure 4.6. In contrast, assuming that observers were item scores and we
wished to know the reliability of total scores based on these items, Cron-
bach's internal-consistency alpha (which is MS_p − MS_r divided by MS_p;
see Wiggins, 1973, p. 291) was .964.
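For readers who want to see the arithmetic, the sketch below computes the persons and residual mean squares from a small persons-by-observers table of counts (the counts are hypothetical, not those of Figure 4.6) and then Cronbach's alpha as just defined; an intraclass coefficient computed from the same mean squares is included only for comparison, since the exact form of equation 4.1 depends on the model assumed.

# A sketch of the mean-square computations behind the reliability
# coefficients discussed in this section. The counts are hypothetical.

data = [
    [12, 15],    # person 1: count of code A from Observer 1, Observer 2
    [45, 40],
    [27, 20],
    [5, 9],
    [110, 90],
]

n_p = len(data)          # persons
n_o = len(data[0])       # observers
grand = sum(sum(row) for row in data) / (n_p * n_o)
person_means = [sum(row) / n_o for row in data]
observer_means = [sum(col) / n_p for col in zip(*data)]

ss_p = n_o * sum((m - grand) ** 2 for m in person_means)
ss_o = n_p * sum((m - grand) ** 2 for m in observer_means)
ss_total = sum((x - grand) ** 2 for row in data for x in row)
ss_r = ss_total - ss_p - ss_o                      # person x observer residual

ms_p = ss_p / (n_p - 1)
ms_r = ss_r / ((n_p - 1) * (n_o - 1))

alpha = (ms_p - ms_r) / ms_p                       # Cronbach's alpha, as in text
icc = (ms_p - ms_r) / (ms_p + ms_r)                # one common intraclass form;
print(alpha, icc)                                  # equation 4.1 may differ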
This way of thinking has profound consequences. It means that reli-
ability can be high even if interobserver agreement is moderate, or even
low. How can this be? Suppose that for person #5 in Figure 4.6, Observer
1 detected code A 120 times, as shown but only 30 of these overlapped
in time with Observer 2's 84 entries. Then the interobserver agreement
would be only 30/120 = .25. Nonetheless, the generalizability coefficient
of equation 4.1 is .93. The reliability is high because either observer's data
distinguishes equally well between persons. The agreement within person
need not be high. The measure does the work it was intended to do, and
either observer's data will do this work. This is an entirely different notion
of reliability than the one we have been discussing.
Note that the generalizability or reliability coefficient estimated by equa-
tion 4.1 is a specific measure of the relative variance accounted for by an
interesting facet of the design (subjects) compared to an uninteresting one
(coders). This is an explicit and specific proportion, but it does not tell us
how large a number is acceptable, any more than does the proportion of
variance accounted for in a dependent variable by an independent variable.
The judgment must be made by the investigator. It is not automatic, just
as a judgment of an adequate size for kappa is not an automatic procedure.
Note also that what makes the reliability high in the table in Figure 4.6
is having a wide range of people in the data, ranging widely with respect to
code A. Jones, Reid, and Patterson (1975) presented the first application of
Cronbach and colleagues' (1972) theory of measurement to observational
data.
The reliability analysis just presented, although appropriate when scores
are interval-scaled (e.g., number of events coded A by an observer), is
inadequate for sequential analysis. The agreement required for sequential
analysis cannot be collapsed over time, but must match point for point, as
exemplified by the kappa tables presented in the previous section. Such
matching is much more consistent with "classical" notions of reliability,
i.e., before Cronbach et al. (1972).
Still, agreement point-for-point could be assessed in the same manner as
in Figure 4.6. The two columns in Figure 4.6 would be replaced by sums
from the confusion matrix. Specifically, the sums on the diagonal would
replace Observer l's scores, and the sums of diagonal plus off-diagonal
cells (i.e., the row marginals) would replace Observer 2's scores. If the
agreement point-for-point were perfect, all entries for code A in the confu-
sion matrix would be on the diagonal and the two column entries would be
the same. There would then be no variation across "observers," and alpha
would be high. In this case, the "persons" of Figure 4.6 become codes,
"Observer 1" becomes agreements and "Observer 2" becomes agreements
plus disagreements.
This criterion is certainly sufficient for sequential analysis. However, it
is quite stringent. Gottman (1980a) proposed the following: If independent
observers produce similar indexes of sequential connection between codes
in the generalizability sense, then reliability is established. For example,
if two observers produced the data in Figure 4.7 (designed so that they
are off by one time unit, but see the same sequences), their interobserver
agreement would be low but indexes of sequential connection would be very
similar across observers. Some investigators handle this simple problem
by having a larger time window within which to calculate the entries in the
confusion matrix. However, that is not a general solution because more
complex configurations than that of Figure 4.7 are possible, in which both
observers detect similar sequential structure in the codes but point-for-point
agreement is low. Cronbach et al.'s (1972) theory implies that all we need
to demonstrate is that observers are essentially interchangeable in doing
the work that our measures need to do.
Unreliability as a research variable
Observer 1:  A B B A B A B A B A B
Observer 2:  A A B B A B A B A B A

                        Observer 1
                         A     B
      Observer 2    A    1     5
                    B    4     1
Figure 4.7. A confusion matrix when observers "see" the same sequence but one
observer lags the other. In such cases, a "point-for-point" agreement approach
may be too stringent.
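Gottman's point is easy to check directly: compute each observer's transitional probabilities separately and compare them. A minimal sketch, using the two sequences of Figure 4.7:

# A sketch comparing two observers' transitional probabilities when one
# observer's record lags the other by one event (the sequences of Figure 4.7).
# Point-for-point agreement is poor, but the sequential structure each
# observer reports is very similar.

from collections import Counter

def transitional_probs(seq):
    pairs = Counter(zip(seq, seq[1:]))
    given_totals = Counter(seq[:-1])
    return {(g, t): pairs[(g, t)] / given_totals[g] for (g, t) in pairs}

obs1 = list("ABBABABABAB")
obs2 = list("AABBABABABA")

print(transitional_probs(obs1))   # A-to-B and B-to-A transitions dominate
print(transitional_probs(obs2))   # for both observers

Both observers report that A and B essentially alternate, even though their point-for-point agreement is only 2 of the 11 aligned events.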
was strongly supported; reliability was lower for more complex segments.
This could partly be due to the increased task demands on the coder, but it
could also be partly a property of the interaction itself, if in more complex
interactions people sent more probe messages. The point of this section is
to suggest that reliability can itself become a research variable of interest.
4.9 Summary
There are at least three major reasons for examining agreement among
observers. The first is to assure ourselves and others that our observers are
accurate and that our procedures are replicable. The second is to calibrate
multiple observers with each other or with some assumed standard. This
is important when the coding task is too large for one observer or when it
requires more than a few weeks to complete. The third reason is to provide
feedback when observers are being trained.
Depending on which reason is paramount, computation of agreement
statistics may proceed in different ways. One general guiding principle is
that agreement need be demonstrated only at the level of whatever scores
are finally analyzed. Thus if conditional probabilities are analyzed, it is
sufficient to show that data derived from two observers independently cod-
ing the same stream of behavior yielded similar conditional probabilities.
Such an approach may not be adequate, however, when training of ob-
servers is the primary consideration. Then, point-by-point agreement may
be demanded. Point-by-point agreement is also necessary when data de-
rived from different observers making multiple coding passes through a
videotape are to be merged later.
When an investigator has sequential concerns in mind, then, point-by-
point agreement is necessary for observer training and is required for at
least some uses of the data. Moreover, if point-by-point agreement is
established, it can generally be assumed that scores derived from the raw
sequential data (like conditional probabilities) will also agree. If agreement
at a lower level is demonstrated, agreement at a higher level can be assumed.
For these reasons, we have stressed Cohen's kappa in this chapter because
it is a statistic that can be used to demonstrate point-by-point agreement.
A Pascal program that computes kappa and weighted kappa is given in the
Appendix. Kappa is also computed by Bakeman and Quera's Generalized
Sequential Querier or GSEQ program (Bakeman & Quera, 1995a). At the
same time, we have also mentioned other approaches to observer reliability
(in section 4.7). This hardly exhausts what is a complex and far-ranging
topic. Interested readers may want to consult, among others, Hollenbeck
(1978) and Hartmann (1977, 1982).
Representing observational data
The first four forms are defined by Bakeman and Quera's (1992, 1995a)
Sequential Data Interchange Standard (SDIS), which defines a standard
form for event, state, timed-event, and interval sequences, respectively.
Sequential data represented by any of these forms can be analyzed with
the Generalized Sequential Querier (GSEQ; Bakeman & Quera, 1995a).
The fifth form is an application of the standard cases by variables rectan-
gular matrix and is useful for analyzing cross-classified events, including
contingency table data produced by the GSEQ program.
to a single stream of coded events (which are thus mutually exclusive and
exhaustive by definition), and when information about time (such as the
duration of events) is not of interest. Event sequences are both simple and
limited. Yet applying techniques described in chapter 7 to event sequences,
Bakeman and Brownlee were able to conclude that parallel play often
preceded group play among 3-year-old children.
exclusive and exhaustive (ME&E) coded states (or events) captures infor-
mation of interest. It would be used instead of (or in addition to) state
sequences when information concerning proportion of time devoted to a
behavior (e.g., percentage of time spent in parallel play) or other timing
information (e.g., average bout length for group play) is desired. Addition-
ally, it is possible to define multiple streams of ME&E states; for details see
Bakeman and Quera (1995a). In sum, both simple event and state sequences
are useful for identifying sequential patterns, given a simple ME&E cod-
ing scheme. But they are not useful for identifying concurrent patterns
(unless multiple streams of states are defined) or for answering relatively
specific questions, given more complex coding schemes. In such cases, the
timed-event sequences described in the next section may be more useful.
are simply listed as they occur and interval boundaries are represented by
commas. For example:
Interval = 5;
, , Ivoc, Ivoc, Aofr, Rain Aofr Ivoc Ismi, Rain, . . .
1 1 1
2 1 2
would code two events. In the first, the child attempting to take the object
had had prior possession (1), his take attempt was resisted (1), but he
succeeded in taking the object (1). In the second, the attempted taker had
not had prior possession (2), he was likewise resisted (1), and in this case,
the other child retained possession (2).
Unlike the SDIS data formats discussed in the previous several sections,
data files that contain cross-classified event data are no different from the
usual cases by variables rectangular data files analyzed by the standard sta-
tistical packages such as SPSS and SAS. Typically, cross-classified data are
next subjected to log-linear analyses. When events have been detected and
cross-classified in the first place, the data file could be passed, more or less
unaltered, to the log-linear routines within SPSS or other standard pack-
ages, or to a program like ILOG (Bakeman & Robinson, 1994) designed
specifically for log-linear analysis. Moreover, often analyses of SDIS data
result in contingency table data, which likewise can be subjected to log-
linear analysis with any of the standard log-linear programs. In fact, GSEQ
is designed to examine sequential data and produce contingency table sum-
maries in a highly flexible way; consequently it allows for the export of
such data into files that can subsequently be read by SPSS, ILOG, or other
programs.
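As a small illustration of this workflow, the sketch below builds a contingency table from cases-by-variables records like those just described and computes a Pearson chi-square for one two-way association; the records are hypothetical, Python stands in for SPSS or ILOG, and a full log-linear analysis of all three dimensions would proceed from the same kind of table.

# A sketch of moving from cross-classified events to a contingency table.
# Each record is one take attempt coded on three dimensions, as in the text:
# prior possession (1 = yes, 2 = no), resistance (1 = yes, 2 = no), and
# outcome (1 = took object, 2 = other child kept it). The records are
# hypothetical; a real data file would simply be read instead.

from collections import Counter

events = [(1, 1, 1), (2, 1, 2), (1, 2, 1), (1, 1, 1), (2, 1, 2),
          (2, 2, 1), (1, 1, 2), (2, 1, 2), (1, 2, 1), (2, 2, 2)]

# Collapse over resistance to get the prior-possession by outcome table.
table = Counter((prior, outcome) for prior, _, outcome in events)
levels = (1, 2)
counts = [[table[(p, o)] for o in levels] for p in levels]

# Pearson chi-square for the two-way table.
n = sum(map(sum, counts))
row = [sum(r) for r in counts]
col = [sum(c) for c in zip(*counts)]
x2 = sum((counts[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
         for i in range(2) for j in range(2))
print(counts, x2)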
interval, but any "width" interval could have been used) was categorized
as follows: (a) interval contains neither mother nor infant "communicative
act" codes, (b) interval contains some mother but no infant codes, (c)
interval contains some infant but no mother codes, and (d) interval contains
both mother and infant codes. This scheme was inspired by others who
had investigated adult talk (Jaffe & Feldstein, 1970) and infant vocalization
and gaze (Stern, 1974). With it, Bakeman and Brown were able to show, in
a subsequent study, differences between mothers interacting with preterm
and full-term infants (Brown & Bakeman, 1980). But the point for now is
to raise the possibility, and suggest the usefulness, of extracting more than
one representation from the data as originally recorded.
A second example of data transformation is provided by the Bakeman
and Brownlee (1980) study of parallel play described in chapter 1. There an
interval recording strategy was used. Each successive 15-second interval
was coded for predominant play state as follows: (a) Unoccupied, (b)
Solitary, (c) Together, (d) Parallel, or (e) Group play. Thus the data as
collected were already in interval sequential form and were analyzed in
this form in order to determine percentage of intervals assigned to the
different play states.
However, to determine if these play states were sequenced in any system-
atic way, Bakeman and Brownlee transformed the interval sequence data
into event sequences, arguing that they were concerned with which play
states followed other states, not with how long the preceding or follow-
ing states lasted. But once again, the moral is that for different questions,
different representations of the data are appropriate.
Throughout this book, the emphasis has been on nominal-scale measure-
ment or categorization. Our usual assumption is that some entity-an event
or a time interval - is assigned some code defined by the investigator's
coding scheme. But quantitative measurement can also be useful, although
such data call for analytic techniques not discussed here. (Such techniques
are discussed in Gottman, 1981.) The purpose of this final example of data
transformation is to show how categorical data can be transformed into
quantitative time-series data.
Tronick, Brazelton, and their co-workers have been interested in the
rhythmic and apparently reciprocal way in which periods of attention and
nonattention, of quiet and excitation, seem to mesh and merge with each
other in the face-to-face interaction of mothers with their young infants
(e.g., Tronick, Als, & Brazelton, 1977). They videotaped mothers and in-
fants interacting and then subjected those tapes to painstaking coding, using
an interval coding strategy. Several major categories were defined, each
containing a number of different codes. The major categories included,
for example, vocalizations, facial expressions, gaze directions, and body
movement for both mother and infant. The tapes were viewed repeatedly,
often in slow motion. After each second of real time, the observers would
decide on the appropriate code for each of the major categories. The end
result was interval sequential data, with each interval representing 1 second
and each containing a specific code for each major category.
Next, each code within each of the major categories was assigned a num-
ber, or weight, reflecting the amount of involvement (negative or positive)
Tronick thought that code represented. In effect, the codes within each ma-
jor category were ordered and scaled. Then, the weights for each category
were summed for each second. This was done separately for mother and in-
fant codes so that the final result was two parallel strings of numbers, or two
time series, in which each number represented either the mother's or in-
fant's degree of involvement for that second. Now analyzing two time series
for mutual influence is a fairly classic problem, more so in astronomy and
economics than psychology, but transforming observational data in this way
allowed Gottman and Ringland (1981) to test directly and quantitatively
the notion that mother and infant were mutually influencing each other.
5.8 Summary
Five standard forms for representing observational data are presented here.
The first, event sequences, consists simply of codes for the events, or-
dered as they occurred. The second, state sequences, adds onset times
so that information such as proportions of time devoted to different codes
and average bout durations can be computed. The third, timed-event se-
quences, allows for events to cooccur and is more open-ended; momen-
tary and duration behaviors are indicated along with their onset and offset
times, as required. The fourth, interval sequences, provides a convenient
way to represent interval recorded data. And the fifth form is for cross-
classified events.
An important point to keep in mind is that data as collected can be rep-
resented in various ways, depending on the needs of a particular analysis.
Several examples of this were presented in the last section. The final exam-
ple, in fact, suggested a sixth data representation form: time series. Ways to
analyze all six forms of data are discussed in the next chapters, although the
emphasis is on the first five. (For time-series analyses, see Gottman, 1981.)
Some of the analyses can be done by hand, but most are facilitated by using
computers. An advantage of casting data into these standard forms is that
such standardization facilitates the development and sharing of computer
software to do the sorts of sequential analyses described throughout the rest
of this book. Indeed, GSEQ (Bakeman & Quera, 1995a) was developed to
analyze sequential data represented according to SDIS conventions.
Analyzing sequential data: First steps
25% Solitary, 21% Together, 28% Parallel, and 17% Group. Thus, for ex-
ample, although 24% of the events were coded Parallel play, Parallel play
occupied 28% of the time. (These are estimates, of course. Recording on-
set times for behavioral state changes would have resulted in more accurate
time-budget information than the interval-recording strategy actually used;
see section 3.7.) A second example could be provided by the Adamson and
Bakeman study of affective displays. They recorded not just the occurrence
of affective displays, but their onset and offset times as well. Thus we were
able to compute that affective displays occurred, on the average, 4.4% of
the time during observation sessions. Put another way, the probability that
the infant would display affect in any given moment was .044.
Event-based (e.g., proportion of events coded Solitary) and time-based
(e.g., proportion of time coded Solitary) probabilities or percentages pro-
vide different and independent information; there is no necessary correla-
tion between the two. Which then should be reported? The answer is, it
depends. Whether one or both are reported, investigators should always
defend their choice, justifying the statistics reported in terms of the research
questions posed.
Data: B C A A A B B C B C A C

                   Lag 1
              A    B    C
   Lag 0  A   2    1    1     4
          B   0    1    3     4
          C   2    1    0     3
                             11

Data: A B A B C B A C B C A C

                   Lag 1
              A    B    C
   Lag 0  A   0    2    2     4
          B   2    0    2     4
          C   1    2    0     3
                             11

Figure 6.1. Examples of transitional frequency matrices and state transition diagrams.
For example, the probability of code C occurring, given that code B just occurred, is p(C+1|B0) = t_BC = x_BC ÷ x_B+ = 3/4 = .75. This means that 75% of the time, code C followed code B.
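Tallies and transitional probabilities like these are easily produced by machine; a minimal sketch, using the first data sequence of Figure 6.1:

# A sketch that tallies lag-1 transitional frequencies and converts them to
# transitional probabilities, using the first data sequence of Figure 6.1.

from collections import Counter

seq = list("BCAAABBCBCAC")
codes = sorted(set(seq))

freq = Counter(zip(seq, seq[1:]))                    # lag 0 -> lag 1 tallies
row_totals = Counter(seq[:-1])

for given in codes:
    probs = [freq[(given, target)] / row_totals[given] for target in codes]
    print(given, [round(p, 2) for p in probs])       # e.g., row B: .00 .25 .75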
Transitional probabilities are often presented graphically, as state transi-
tion diagrams (for examples, see Bakeman & Brown, 1977; Stern, 1974).
Such diagrams have the merit of rendering quite visible just how events (or
time intervals) were sequenced in time. Circles represent the codes, and
arrows represent the transitional probabilities among them. Examples are
given in Figure 6.1, and the reader may want to verify that they were drawn
and labeled correctly.
Figure 6.1 contains a second data sequence, along with its associated
transitional frequency matrix and state transition diagram. For both, the
simple probabilities are the same, that is, for both sequences each different
code occurred four times. The point of presenting these two examples
is to show that, even when simple probabilities indicate no differences,
events may nonetheless be sequenced quite differently. And when they
are, transitional probabilities and their associated state transition diagrams
can reveal those differences in a clear and informative way.
One final point: The discussion here treats transitional probabilities as
simple descriptive statistics, and state transition diagrams as simple de-
scriptive devices. Others (e.g., Kemeny, Snell, & Thompson, 1974) dis-
cuss transitional probabilities and state transition diagrams from a formal,
mathematical point of view, as parameters of models. Interesting questions
for them are, given certain models and model parameters, what sorts of out-
comes would be generated? This is a formal, not an empirical exercise, one
in which data are generated, not collected. For the scientist, on the other
hand, the usual question is, first, can I accurately describe my data, and
second, does a particular model I would like to support generate data that
fit fairly closely with the data I actually got? The material in this chapter,
as noted in its first section, is concerned mainly with the first enterprise -
accurate description.
6.6 Summary
This "first steps" chapter discusses four very simple, very basic, but very
useful, statistics for describing sequential observational data: rates (or
frequencies), simple probabilities (or percentages), mean event durations,
and transitional probabilities.
Rates indicate how often a particular event of interest occurred. They
are probably most useful for relatively momentary and relatively infrequent
events like the affective displays Adamson and Bakeman (1985) described.
Simple probabilities indicate what proportion of all events were of a par-
ticular kind (event based) or what proportion of time was devoted to a
particular kind of event (time based). Time-based simple probabilities are
especially useful when the events being coded are conceptualized as behav-
ioral states (see section 3.2). Indeed, describing how much time individuals
(dyads, etc.) spent in various activities (or behavioral states) is a frequent
goal of observational research.
With the knowledge of how often a particular kind of event occurred,
and what percentage of time was devoted to it, it is possible to get a sense
of how long episodes of the event lasted. Mean event durations can be
computed directly, of course. But because these three statistics provide
redundant information, the investigator should choose whichever two are
most informative, given the behavior under investigation.
Finally, transitional probabilities capture sequential aspects of observa-
tional data in a simple and straightforward way. They can be presented
graphically in state transition diagrams. Such diagrams have the merit of
rendering visible just how events are sequenced in time.
7. Analyzing event sequences
Determining significance of particular chains
same); 80 (or 5 × 4²) different kinds of three-event sequences (not 5³); 320 (or 5 × 4³) different kinds of four-event sequences (not 5⁴); etc.
Determining how often particular two-event, three-event, etc., sequences
occurred in one's data is what we mean by "basic methods." This involves
nothing more than counting. The investigator simply defines particular
sequences, or all possible sequences of some specified length, and then
tallies how often they appear in the data. For example, Bakeman and
Brownlee were particularly interested in transitions from Parallel to Group
play. For one child in their study, 127 two-event sequences were observed,
10 of which were from Parallel to Group. Thus they could report that, for
that child, f(PG) = 10 and p(PG) = .079 (10 divided by 127). (Note
that p(PG) is not a transitional probability. It is the simple or zero-order
probability for the two-event sequence, Parallel to Group.) In sum, the
most basic thing to do with event-sequence data is to define particular
sequences, count them, and then report frequencies and/or probabilities for
those sequences.
X² = Σ (observed − expected)² / expected
   = (10 − 6.35)²/6.35 + (117 − 120.65)²/120.65
   = 2.10 + 0.11 = 2.21
which, with one degree of freedom, is not significant (we use a Roman X
to represent the computed chi-square statistic to distinguish it from a Greek
chi, which represents the theoretical distribution).
Alternatively, we might make use of what we already know about how
often the five different codes occurred. This "first-order" model assumes
that codes occurred as often as they in fact did (and were not equiprobable),
but that the way codes were ordered was determined randomly. For the
child whose data we are examining, 143 behavioral states were coded; 34
were coded Parallel and 30 Group. (Because 127 two-event sequences
were tallied, and 143 states were coded, there must have been 15 breaks
in the sequence.) Now, if codes were indeed ordered randomly, then we
would expect that the probability for the joint event of Parallel followed
by Group would be equal to the simple probability for Parallel multiplied
by the simple probability for Group (this is just basic probability theory).
Symbolically,
p(PG)exp=p(P)xp(G)
The p(P) is .238 (34/143, the frequency for Parallel divided by the total, N).
In this case, however, the p(G) is not the f(G) divided by N. Because
a Parallel state cannot follow a Parallel state, the probability of group
(following Parallel) is computed by dividing the frequency for Group, not
by the total number of states coded, but by the number that could occur
after Parallel - that is, the total number of states coded, less the number of
Parallel codes. Symbolically,

p(G) = f(G) / (N − f(P))
when adjacent codes must be different and when we are interested in the
expected probability for Group following Parallel. Now we can compute
the expected probability for the joint event of a Parallel to Group transition.
It is:
p(PG)_exp = f(P)/N × f(G)/(N − f(P)) = 34/143 × 30/109 = .0654
The expected frequency, then, is 8.31 (.0654, the expected probability for
this particular two-event sequence, times 127, the number of two-event
sequences coded).
The chi-square statistic for this modified expected frequency is
X² = (10 − 8.31)²/8.31 + (117 − 118.69)²/118.69 = .37
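The arithmetic of this first-order model is easily scripted; a minimal sketch, using the counts just given for the Parallel-to-Group transition:

# A sketch of the first-order expected frequency and chi-square for one
# two-event chain (Parallel followed by Group) when adjacent codes cannot
# repeat, using the counts given in the text.

f_P, f_G = 34, 30        # times Parallel and Group were coded
n_states = 143           # behavioral states coded
n_pairs = 127            # two-event sequences tallied
observed = 10            # observed Parallel-to-Group transitions

p_exp = (f_P / n_states) * (f_G / (n_states - f_P))
expected = p_exp * n_pairs
x2 = ((observed - expected) ** 2 / expected
      + ((n_pairs - observed) - (n_pairs - expected)) ** 2 / (n_pairs - expected))
print(round(p_exp, 4), round(expected, 2), round(x2, 2))   # .0654, 8.31, .37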
Transitional probabilities revisited
              Unoccupied  Solitary  Together  Parallel  Group  Totals
Unoccupied        —           6         5         2       2      15
Solitary          5           —         6         7       5      23
Together          5           6         —        12      10      33
Parallel          2           7        11         —      10      30
Group             2           4        11         9       —      26
Totals           14          23        33        30      27     127
p(PU), was .0157. This certainly conveys the information that Group was
more common after Parallel than Unoccupied. But somehow it seems both
clearer and descriptively more informative to say that the probability of
Group, given a previous Parallel, p(G|P) or t_PG, was .333, whereas the
probability of Unoccupied, given a previous Parallel, p(U|P) or t_PU, was
.067. Immediately we know that 33.3% of the events after Parallel were
Group, whereas only 6.7% were Unoccupied.
We just considered transitions from the same behavioral state (Parallel)
to different successor states (Group and Unoccupied), but the descriptive
value of transitional probabilities is portrayed even more dramatically when
transitions from different behavior states to the same successor state are
compared. For example, the simple probabilities for the Unoccupied to
Solitary, p(US), and for the Together to Solitary, p(TS), transitions are
both .0472. Yet the probability of Solitary, given a previous Unoccupied,
p(S|U) or t_US, is .400, whereas the probability of Solitary, given a previous
Together, p(S|T) or t_TS, is .182. The transitional probabilities "correct"
for differences in base rates for the "given" behavioral states and, there-
fore, clearly reveal that in this case, Solitary was relatively common after
Unoccupied, and considerably less so after Together, even though the Un-
occupied to Solitary and the Together to Solitary transitions appeared the
same number of times in the data.
Moreover, as already noted in section 6.5, transitional probabilities form
the basis for state transition diagrams, which, at least on the descriptive
level, are a particularly clear and graphic way to summarize sequential
information. The only problem is that, even with as few as five states, the
number of possible arrows in the diagram can produce far more confusion
than clarity. The solution is to limit the number of transitions depicted
in some way. In this case, for example, we could decide to depict only
transitional probabilities that are .3 or greater, which is what we have done
in Figure 7.1. This reduces the number of arrows in the diagram from a
possible 20, if all transitions were depicted, to a more manageable 9.
The nine transitions shown in Figure 7.1 are not necessarily the most
frequent transitions; this information is provided by the simple probabilities
for two-event sequences (see Table 7.2). Nor are the transitions necessarily
significantly different from expected; to determine this we would need to
compute and evaluate a z score for each transition (see next section). What
the state transition diagram does show are the most likely transitions, taking
the base rate for previous states into account. In other words, it shows the
most likely ways of "moving" from one state to another. For this one child,
Figure 7.1 suggests frequent movement from Unoccupied to both Solitary
and Together, from Solitary to Parallel, and reciprocal movement among
Together, Parallel, and Group.
One final point: Transitional probabilities can be used to describe re-
lationships between two nonadjacent events as well. Not only can we
compute, for example, the probability of Group in the lag 1 position given
an immediately previous Parallel: p(G+1|P0), but we can also compute the
Figure 7.1. A state transition diagram. Only transitional probabilities greater than
0.3 are shown. U = unoccupied, S = solitary, T = together, P = parallel, and
G = group.
which is almost but not quite the same as Equation 7.1 because it is based
on simple frequencies and probabilities for given and target behaviors, not
on values from two-dimensional tables as is Equation 7.1. Equation 7.3 is
based on the normal approximation for the binomial test (equation 7.5).
The reader of this book will already be familiar with the basic elements
of the lag sequential method. As an example, assume that our code catalog
defines several events, five of which are:
1. Infant Active
2. Mother Touch
3. Mother Nurse
4. Mother Groom
5. Infant Explore
(These codes are suggested by Sackett's work with macaque monkeys. The
example here is based on one given in Sackett, 1974.) Assume further that
successive events have been coded so that, as throughout this chapter, we
are analyzing event-sequence data. Finally, assume that we are particularly
interested in what happens after times when the infant is active, that is, we
want to know whether there is anything systematic about the sequencing
of events beginning with Infant Active episodes.
To begin with, the investigator selects one code to serve as the "criterion"
or "given" event. In this case, that code would be Infant Active. Next, an-
other code is selected as the "target." For example, we might select Mother
Touch as our first target code. Then, a series of transitional probabilities are
computed: for the target immediately after the criterion (lag 1), after one
intervening event (lag 2), after two intervening events (lag 3), etc. Symbol-
ically, we would write these lagged transitional probabilities as p(T1|G0),
p(T2|G0), p(T3|G0), etc. (remember, if we just write p(T|G), target at lag
1 and given at lag 0 are assumed). The result is a series of transitional
probabilities, each of which can then be tested for significance. For exam-
ple, given Infant Active at lag "position" 0, if we had computed transitional
probabilities for Mother Touch at lags 1 through 6, but only the lag 1 tran-
sitional probability significantly exceeded its expected value, we would
conclude that Mother Touch was likely to occur just after Infant Active,
but was not especially likely in the other lag "positions" investigated.
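The bookkeeping is straightforward: for a chosen criterion (given) code and target code, tally the target's occurrences at each lag position after the criterion and divide by the number of opportunities. A minimal sketch follows; the event sequence is hypothetical and deliberately regular so that the expected pattern is easy to see.

# A sketch of lagged transitional probabilities, p(target at lag L | given at
# lag 0), for lags 1 through 6. The event sequence is hypothetical; codes are
# A = Infant Active, T = Mother Touch, N = Mother Nurse, G = Mother Groom,
# E = Infant Explore.

def lagged_probs(seq, given, target, max_lag=6):
    probs = {}
    for lag in range(1, max_lag + 1):
        opportunities = [i for i, c in enumerate(seq[:-lag]) if c == given]
        if opportunities:
            hits = sum(seq[i + lag] == target for i in opportunities)
            probs[lag] = hits / len(opportunities)
    return probs

seq = list("ATNGEATNGEEATNGA")
print(lagged_probs(seq, given="A", target="T"))   # highest at lag 1
print(lagged_probs(seq, given="A", target="G"))   # highest at lag 3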
If we stopped now, we would have examined transitional probabilities for
one particular target code, at different lags after a particular criterion code.
This is not likely to tell us much about multievent sequences. The next step
is to compute other series of transitional probabilities (and determine their
significance), using the same criterion code but selecting different target
codes. For example, given a criterion of Infant Active at lag 0, we could
compute lag 1 through 6 transitional probabilities for Mother Nurse, Mother
Groom, and Infant Explore. Imagine that the transitional probabilities for
Mother Nurse at lag 2, for Mother Groom at lag 3, and for Infant Explore
at lags 4 and 5 were significant. Such a pattern of results could suggest the
four-event sequence: Infant Active, Mother Touch, Mother Nurse, Mother
Groom (we shall return to Infant Explore in a moment), even though the
lagged transitional probabilities examined only two codes at a time.
As stated before, the lag-sequential method is a probabilistic, not an absolute,
approach. To confirm the putative Active, Touch, Nurse, Groom sequence,
we should do the following. First, compute lagged transitional probabilities
with Mother Touch as the criterion and Mother Nurse and Mother Groom
as targets, then with Mother Nurse as the criterion and Mother Groom as
the target. If the transitional probabilities for Mother Nurse at lag 1 and
Mother Groom at lag 2, with Mother Touch as the lag 0 criterion, and
for Mother Groom at lag 1 with Mother Nurse as the lag 0 criterion, are
all significant, then we would certainly be justified in claiming that the
Active, Touch, Nurse, Groom sequence was especially characteristic of the
monkeys observed (see Table 7.6).
Recall, however, that Infant Explore was significant at lags 4 and 5 after
Infant Active. Does this mean that we are dealing with a six-event instead
of a four-event sequence? The answer is, not necessarily. For example,
if the transitional probabilities for Infant Explore at lags 3 and 4 given
Mother Touch as the criterion, at lags 2 and 3 given Mother Nurse, and at
lags 1 and 2 given Mother Groom were not significant, then there would
be no reasons to claim that Infant Explore followed the Active, Touch,
Nurse, Groom sequence already identified. Instead, if the results were as
suggested here, we would conclude that after a time when the infant was
active, next we would likely see either the Touch, Nurse, Groom sequence
or else three more or less random events followed by Infant Explore in the
fourth or fifth position.
m_GT = x_G × (x_T − x_GT) / (N − x_G)     (7.8)

where m_GT represents the expected frequency for the target behavior at lag L when preceded by the given behavior at lag 0, x_G and x_T the total frequencies for the given and target codes, N the total number of events, and x_GT the observed frequency for the target at lag L − 1 preceded by the given behavior at lag 0.
Sackett (1979) reasoned that when adjacent codes cannot repeat, the
expected probability for a particular target code at lag L (assuming a par-
ticular given code at lag 0) is the frequency for that target code diminished
by the number of times it appears in the lag L — 1 position (because then
it could not appear in the L position, after itself) divided by the number of
events that may occur at lag L (which is the sum of the lag L minus the lag
L — 1 frequencies summed across all K target codes). Simply put, this sum
is the number of all events less the number of events assigned the given
code. As with Equation 7.7, Equation 7.8 assumes overlapped sampling;
and again like Equation 7.7, marginals for expected frequencies based on
Equation 7.8 do not match the observed marginals.
Nonetheless, when consecutive codes cannot repeat, traditional lag
sequential analysis (Sackett, 1979; the first edition of this book) has
Figure 7.3. A lagged probability profile for distressed couples. As in Figure 7.2,
triangles represent transitional probabilities for Husband Complaint at the lag
specified, given Husband Complaint at lag 0. Squares, however, represent transi-
tional probabilities for Wife Complaint, given Husband Complaint at lag 0. Again
asterisks (*) indicate that the corresponding z score is significant.
Omnibus tests
Used in an exploratory way, traditional lag-sequential analysis invites type
I error (although among statistical techniques it is hardly unique in this
respect). When 10 codes are defined, for example, the lag 0 by lag 1 table
contains 100 cells when consecutive codes may repeat and 90 when not
(K² and K[K − 1] generally, where K is the number of codes defined).
Assuming the usual .05 value for alpha, if there were no association be-
tween lag 0 and lag 1 behavior, approximately 5 of 100 transitions would
be identified, on average and incorrectly, as statistically significant. One
solution is to take an omnibus or whole-table view. Absent specific pre-
dictions that one or just a few transitions will be significant, individual cell
statistics should be examined for significance only when a tablewise statis-
tic, such as the Pearson or likelihood-ratio chi-square (symbolized as X²
and G², respectively), is large, just as post hoc tests are pursued in analysis
of variance only when the omnibus F ratio is significant (Bakeman, 1992;
Bakeman & Quera, 1995b).
Applying this test, we would not have examined the data presented in
Table 7.1 further. For these data,
X²(11, N = 127) = Σ_G Σ_T (x_GT − m_GT)² / m_GT = 11.0     (7.9)

and

G²(11, N = 127) = 2 Σ_G Σ_T x_GT log(x_GT / m_GT) = 10.3     (7.10)
where log represents the natural logarithm (i.e., the logarithm to the base
e); X² and G² both estimate chi-square, although usually G² is used in
log-linear analyses (for a discussion of the differences between them, see
Wickens, 1989). These estimates fall short of the .05 critical value of
19.7. Moreover, only 1 of 20 adjusted residuals exceeded 1.96 absolute
(Unoccupied to Solitary; see Table 7.5), which suggests it was simply a
chance finding, unlikely to replicate.
In log-linear terms, we ask whether expected frequencies generated by
the model of independence (i.e., Equation 7.1), which is symbolized [0][1]
and indicates the independence of the lag 0 and lag 1 dimensions, are similar
to the observed frequencies. If they are, then the chi-square statistic will
not exceed its .05 critical value, as here (i.e., observed frequencies fit those
expected tolerably well). However, if the computed chi-square statistic is
large, exceeding its .05 critical value, then we reject the model of indepen-
dence and conclude that the dimensions of the table are in fact related and
not independent.
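As a rough illustration of this omnibus check (Python with the numpy and scipy libraries is assumed here, which the book itself does not use, and the tally table is purely illustrative), the tablewise statistics can be computed directly from a lag 0 × lag 1 table and its independence-model expected frequencies:

    import numpy as np
    from scipy.stats import chi2

    # Illustrative lag 0 x lag 1 tally table (rows = given code, columns = target code)
    x = np.array([[20., 10., 12.],
                  [8., 15., 9.],
                  [14., 7., 18.]])
    # Expected frequencies under the model of independence, [0][1]
    m = x.sum(axis=1, keepdims=True) * x.sum(axis=0, keepdims=True) / x.sum()

    pearson_x2 = np.sum((x - m) ** 2 / m)
    g2 = 2 * np.sum(np.where(x > 0, x * np.log(x / m), 0.0))
    df = (x.shape[0] - 1) * (x.shape[1] - 1)   # for a table in which codes may repeat
    critical = chi2.ppf(0.95, df)
    # Only if the tablewise statistic exceeds the critical value would we go on
    # to examine individual adjusted residuals.
    print(round(pearson_x2, 1), round(g2, 1), round(critical, 2))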
Table 7.7. Observed frequencies (left) and adjusted residuals (right) for 100 two-event chains coded A, B, or C

                  Lag 1                          Lag 1
Lag 0        A      B      C   Totals       A        B        C
  A         23      5     15     43       2.02    -1.82    -0.56
  B         11      1      7     19       1.56    -1.78    -0.12
  C          8     14     16     38      -3.32     3.30     0.66
Totals      42     20     38    100
Note: This example was also used in Bakeman and Quera (1995b).
Winnowing results
As a second example, consider the data given in Table 7.7 for which K is
3; the codes are labeled A, B, and C; and consecutive codes may repeat.
For these data, X²(4, N = 100) is 15.7 and G²(4, N = 100) is 16.4. Both
exceed the .05 critical value of 9.49, which suggests that lag 0 and lag
1 are associated. Moreover, three of nine adjusted residuals exceed 1.96
absolute (again, see Table 7.7). But now we confront a different dilemma.
Adjusted residuals in a table form an interrelated web. If some are large,
others necessarily must be small, and so, rather than attempting to interpret
each one (thereby courting type I error), now we need to determine which
one or ones should be emphasized.
The initial set of all statistically significant transitions in a table can be
winnowed using methods for incomplete tables (i.e., tables with structural
zeros). Assume the C-B chain, whose adjusted residual is 3.30, is of
primary theoretical interest. In order to test its importance, we declare the
C-B cell structurally zero, use an iterative procedure to compute expected
frequencies (e.g., using Bakeman & Robinson's, 1994, ILOG program),
and note that now the [0][1] model fits the remaining data (G²[3, N =
86] = 5.79; .05 critical value for 3 df = 7.81; df = 3 because one is lost to the
structurally zero cell). We conclude that interpretation should emphasize
the C-B chain as the other two effects (decreased occurrences for C-A,
increased occurrences for A-A) disappear when the C-B chain is removed.
Had the model of independence not fit the reduced data, we would have
declared another structural zero and tested the data now reduced by two
cells. Proceeding stepwise (but letting theoretical considerations, not raw
empiricism, determine the next chain to delete, else one risks capitalizing on
chance, as with backward elimination in multiple regression, and compromising
type I error control), we would identify those chains that prevent
the [0][1] model from fitting. (A logically similar suggestion, not in a
log-linear context, is made by Rechten & Fernald, 1978; see also Wick-
ens, 1989, pp. 251-253.) The ability to winnow results in this way is one
advantage of the log-linear view over traditional lag-sequential analysis.
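To make the winnowing step concrete, here is a minimal sketch (assuming numpy; the iterative fit below is ordinary iterative proportional fitting, standing in for a program such as ILOG) that declares the C-B cell of Table 7.7 structurally zero and refits the independence model to the remaining cells:

    import numpy as np

    def fit_quasi_independence(observed, mask, n_iter=200):
        # Iterative proportional fitting of the [0][1] model, with cells
        # flagged in mask treated as structural zeros (quasi-independence).
        expected = np.where(mask, 0.0, 1.0)
        for _ in range(n_iter):
            expected *= observed.sum(1, keepdims=True) / expected.sum(1, keepdims=True)
            expected *= observed.sum(0, keepdims=True) / expected.sum(0, keepdims=True)
        return expected

    x = np.array([[23., 5., 15.],      # Table 7.7 observed frequencies
                  [11., 1., 7.],
                  [8., 14., 16.]])
    mask = np.zeros(x.shape, dtype=bool)
    mask[2, 1] = True                   # declare the C-B cell structurally zero
    x_reduced = np.where(mask, 0.0, x)  # its 14 tallies are set aside (N = 86)

    m = fit_quasi_independence(x_reduced, mask)
    keep = ~mask & (x_reduced > 0)
    g2 = 2 * np.sum(x_reduced[keep] * np.log(x_reduced[keep] / m[keep]))
    print(round(g2, 2))   # about 5.8, close to the G2[3, N = 86] = 5.79 reported above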
               Lag 1
Lag 0       A     B     C
  A        21    25    49
  B        23    26    21
  C        50    19    15
Figure 7.4. Observed frequencies for 249 two-event chains derived from a se-
quence of 250 events.
(thus K = 3) that may repeat and an initial interest in lag 1. Then occur-
rences of each of the nine possible two-event chains (AA, AB, etc.) would
be tallied in one of the cells of a 3² table. For example, we (Bakeman &
Quera, 1995b) generated a sequence of 250 coded events and tallied the 249
overlapped two-event chains; the results are shown in Figure 7.4. For this
two-dimensional table, the [0][1] model (implying independence of rows
and columns) fails to fit the data (G²[4, N = 249] = 35.2) and so we would
conclude that events at lag 0 and lag 1 are associated and not independent.
This much seems easy and, apart from the preliminary omnibus test, not
much different from traditional lag-sequential methods.
Next, assume that our interest expands from lag 1 to lag 2. Still assuming
K = 3 and consecutive codes that may repeat, then each of the 27 possible
3-event chains (AAA, AAB, etc.) would be tallied in one of the cells of
a 3³ table. Tallies for the 248 overlapped three-event chains derived from
the same sequence used earlier are shown in Figure 7.5.
Cells for the 3³ table shown in Figure 7.5 are symbolized x_ijk, where i, j,
and k represent the lag 0, lag 1, and lag 2 dimensions, respectively. Traditional
lag-sequential analysis would test for lag 2 effects in the collapsed
02 table, that is, the table whose elements are x_i+k = Σ_j x_ijk, where the
plus in the subscript indicates summation over the lag 1 dimension.
This table is shown in Figure 7.6. For this two-dimensional table, the [0][2]
model (implying independence of rows and columns) fails to fit the data
(G²[4, N = 248] = 10.70) and so, traditionally, we would conclude that
events at lag 0 and lag 2 are associated and not independent. But this fails
to take into account events at lag 1.
A hierarchic log-linear analysis of the 3³ table shown in Figure 7.5
provides more information, and in this case leads to a different conclusion,
than a traditional lag-sequential analysis of the collapsed table shown in
Figure 7.6. The complete or saturated model for a three-dimensional table
                      Lag 2
Lag 0    Lag 1      A     B     C
  A        A        7     6     8
           B       10     9     6
           C       31     8    10
  B        A        3     6    14
           B        6    11     9
           C       12     4     4
  C        A       11    13    26
           B        7     6     6
           C        7     7     1
Figure 7.5. Observed frequencies for 248 three-event chains derived from the
same sequence of 250 events used for Figure 7.4.
               Lag 2
Lag 0       A     B     C
  A        48    23    24
  B        21    21    27
  C        25    26    33
Figure 7.6. Observed frequencies for the collapsed lag 0 × lag 2 table derived from
the observed frequencies for three-event chains shown in Figure 7.5.
is represented as [012] and includes seven terms: 012, 01, 12, 02, 0, 1, and
2. The saturated model is not symbolized as [012][01][12][02][0][1][2]
because the three two-way and three one-way terms are implied by (we
could say, nested hierarchically within) the three-way term, and so it is
neither necessary nor conventional to write them explicitly.
Typically, a hierarchic log-linear analysis proceeds by deleting terms,
seeking the simplest model that nonetheless fits the data tolerably well
(Bakeman & Robinson, 1994). Results for the data shown in Figure 7.4
are given in Table 7.8. The best-fitting model is [01][12]; the term that
represents lag 0-lag 2 association (i.e., the 02 term) is not required. Thus
the log-linear analysis reveals that, when events at lag 1 are taken into
account, events at lag 0 and lag 2 are not associated, as suggested by the
analysis of the 02 table, but are in fact independent. Such conditional
independence - that is, the independence of lag 0 and lag 2 conditional
on lag 1 - is symbolized 0⊥2|1 by Wickens (1989; see also Bakeman &
Quera, 1995b), and the ability to detect such circumstances represents an
advantage of log-linear over traditional lag-sequential methods. Readers
who wish to pursue the matter of conditional independence further should
read Wickens (1989, especially chapter 3).
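A minimal sketch of the key comparison (numpy and scipy are assumed; closed-form expected frequencies are used here rather than a general hierarchic log-linear program) applied to the Figure 7.5 counts:

    import numpy as np
    from scipy.stats import chi2

    # Figure 7.5 counts; axes are (lag 0, lag 1, lag 2), codes A, B, C
    x = np.array([[[ 7,  6,  8], [10,  9,  6], [31,  8, 10]],
                  [[ 3,  6, 14], [ 6, 11,  9], [12,  4,  4]],
                  [[11, 13, 26], [ 7,  6,  6], [ 7,  7,  1]]], dtype=float)

    # Under [01][12] (lag 0 and lag 2 conditionally independent given lag 1),
    # expected frequencies have the closed form m_ijk = x_ij+ * x_+jk / x_+j+.
    x_ij = x.sum(axis=2)
    x_jk = x.sum(axis=0)
    x_j = x.sum(axis=(0, 2))
    m = x_ij[:, :, None] * x_jk[None, :, :] / x_j[None, :, None]

    g2 = 2 * np.sum(np.where(x > 0, x * np.log(x / m), 0.0))
    df = 12   # 27 cells less the 15 parameters of [01][12] when K = 3
    print(round(g2, 1), round(chi2.ppf(0.95, df), 1))
    # G2 falls well short of the critical value, so lag 0 and lag 2 are
    # conditionally independent given lag 1, as the text concludes.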
As just described, when consecutive codes may repeat log-linear but not
traditional lag-sequential methods detect conditional independence. Ad-
ditional advantages accrue when consecutive codes cannot repeat because
log-linear methods handle structural zeros routinely and do not require ad
hoc and problematic formulas such as Equations 7.7 and 7.8. As an ex-
ample, we (Bakeman & Quera, 1995b) generated a sequence of 122 coded
events and tallied the 120 overlapped three-event chains. Tallies for the 12
permitted sequences are given in Figure 7.7. Cells containing structural
zeros are also indicated; when consecutive codes cannot repeat, the 012
table will always contain the pattern of structural zeros shown.
A summary of the log-linear analysis for the data given in Figure 7.7 is
shown in Table 7.9. When K is 3, and only when K is 3, the [01][12][02]
model is completely determined; its degrees of freedom are 0 and expected
frequencies duplicate the observed ones (as in the [012] model, when con-
secutive codes may repeat). Unlike in the previous analysis, for these data
the model of conditional independence - [01] [12] - fails to fit the data
(G²[3, N = 120] = 10.83, p < .05). Thus we accept the [01][12][02]
model and conclude that events at lag 0 and lag 2 are associated (and both
are associated with lag 1).
                      Lag 2
Lag 0    Lag 1      A     B     C
  A        A        -     -     -
           B       15     -     6
           C       10     9     -
  B        A        -    12     8
           B        -     -     -
           C        9    12     -
  C        A        -     8    11
           B        5     -    15
           C        -     -     -
Figure 7.7. Observed frequencies for the 12 possible three-event chains derived
from a sequence of 122 events for which consecutive codes cannot repeat. Struc-
tural zeros are indicated with a dash.
might be responsible for the failure of the [01][12] model to fit the observed
data. Because the CBC chain is of primary interest, we replaced
the x_CBC cell (which contained a tally of 15) with a structural zero. As
shown in Table 7.9, the model of conditional independence now fit the data
(G²[2, N = 105] = 1.64, NS), and so we conclude that the CBC chain can
account for the lag 2 effect detected by the omnibus analysis described in the
previous paragraph. This can be tested directly with a hierarchic test, as indicated
in Table 7.9. The difference between two hierarchically related G²s
is distributed approximately as chi-square with degrees of freedom equal to
the difference between the degrees of freedom for the two G²s; in this case,
ΔG²(1) = 9.19, p < .01. (Replacing a different chain with a structural
zero might also result in a fitting model, which is why it is so important that
selection of the chain to consider first be guided by theoretic concerns.)
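The hierarchic difference test itself is a one-line computation (scipy assumed; the G² values are the ones just quoted):

    from scipy.stats import chi2

    g2_full, df_full = 10.83, 3        # [01][12] fit to all 12 permitted chains
    g2_reduced, df_reduced = 1.64, 2   # same model with the CBC cell structurally zero
    delta_g2 = g2_full - g2_reduced
    delta_df = df_full - df_reduced
    print(round(delta_g2, 2), round(chi2.sf(delta_g2, delta_df), 4))   # 9.19, p < .01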
although of course we would test for lag 1 effects. Thus the search begins
by looking for complex lag 2 effects in the 012 table. At each lag (L > 1),
the complex effects we seek first implicate lags 0 and L with lag L — 1.
If present, collapsing over the L — 1 dimension (which would reduce the
number of cells and so data demands) is unwarranted. For example, if L is
2, then complex effects are present if the simplest model that fits the 012
table includes any of the following:
1. [012] because then lag 0 and lag 2 are associated and interact with
lag 1 (three-way associations), or
2. [01] [12] [02] because then lag 0 and lag 2 are associated with each
other and lag 1 but do not interact with lag 1 (homogeneous asso-
ciations), or
3. [01] [12] because then lag 0 and lag 2 are independent conditional
on lag 1.
If complex effects are found, we would explicate them, as demonstrated
in the previous section. However, if simpler models fit (e.g., [01][2] or
any others not in the list just given), which means no complex effects were
found, then collapsing over the L — 1 dimension is justified (Wickens,
1989, pp. 79-81, 142-143), resulting in the 0L table of traditional
lag-sequential analysis (e.g., the 02 table when L = 2).
Assuming no complex effects are found in the 012 table, after collapsing
we would first test whether 0⊥2 (unconditional independence) holds in the 02
table and, if not, examine residuals in order to explicate the lag 0-lag 2
effect just identified (exactly as we would have done for the 01 table).
Next we would create a new three-dimensional table by adding the lag 3
dimension, tally sequences in this 023 table, and then look for lag 3 effects
in the 023 table exactly as described for the 012 table. This procedure is
repeated for successive lags. In general terms, beginning with lag L, we test
whether the three-dimensional 0(L — 1)L table can be collapsed over the
L — 1 dimension. If so, we collapse to the 0L table, add the L + 1 dimension
thereby creating a new three-dimensional table, increment L, and repeat the
procedure, continuing until we find a table that does not permit collapsing
over the L — 1 dimension. Once such a table is found, we explicate the lag
L effects in this three-dimensional table. If data are sufficient, we might
next analyze the four-dimensional 0(L — 1)L(L +1) table, and so forth, but
further collapsing is unwarranted because of the lag L effects just found.
Nonetheless, this strategy may let us examine lags longer than 2 without
requiring tables larger than K³ when consecutive codes may repeat.
The sequential search strategy described in the previous paragraph applies,
with one modification, when consecutive codes cannot repeat. When
consecutive codes may repeat, and no complex lag L effects are found (i.e.,
each table examined sequentially permits collapsing), then the test series
becomes 0⊥2, then 0⊥3, and so forth (i.e., 0⊥L is tested in the 0L table),
as just described. When consecutive codes cannot repeat, the unconditional
test makes no sense because it fails to reflect the constraints imposed when
consecutive codes cannot repeat. Then, when no complex lag L effects
are found, the analogous series becomes 0⊥2|1, 0⊥3|2, and so forth [i.e.,
0⊥L|L − 1 is tested in the 0(L − 1)L table]. Models associated with these
tests include the (L − 1)L term. The corresponding marginal table has
structural zeros on the diagonal, which reflect the cannot-repeat constraint.
This strategy may let us examine lags longer than 2 without requiring tables
larger than K²(K − 1) when consecutive codes cannot repeat (K[K − 1]²
when L = 2). These matters are discussed further in Bakeman and
Quera (1995b).
where individual cells are labeled a, b, c, and d as shown and represent
cell frequencies.
One of the most common statistics for 2 x 2 tables (perhaps more so
in epidemiology and sociology than psychology) is the odds ratio. As its
name implies, it is estimated by the ratio of a to b divided by the ratio
of c to d,
est. odds ratio = (a/b) / (c/d)    (7.11)
(where a, b, c, and d refer to observed frequencies for the cells of a 2 × 2
table as noted earlier; notation varies, but for definitions in terms of pop-
ulation parameters, see Bishop, Fienberg, & Holland, 1975; and Wickens,
1993). Multiplying numerator and divisor by d/c, this can also be ex-
pressed as
est. odds ratio = ad/bc    (7.12)
Equation 7.12 is more common, although Equation 7.11 reflects the name
and renders the concept more faithfully. Consider the following example:
            B      ~B    Totals
  A        10      10      20
 ~A        20      60      80
Totals     30      70     100
The odds for B after A are 1:1, whereas the odds for B after any other
(non-A) event are 1:3; thus the odds ratio is 3. In other words, the odds for
B occurring after A are three times the odds for B occurring after anything
else. When the odds ratio is greater than 1 (and it can always be made > 1
by swapping rows), it has the merit, lacking in many indices, of a simple
and concrete interpretation.
The odds ratio varies from 0 to infinity and equals 1 when the odds are
the same for both rows (indicating no effect of the row classification). The
natural logarithm (ln) of the odds ratio, which is estimated as

est. log odds ratio = ln(ad/bc)    (7.13)
extends from minus to plus infinity, equals 0 when there is no effect, and
is more useful for inference (Wickens, 1993). However, Equation 7.13
estimates are biased. An estimate with less bias, which is also well defined
when one of the cells is zero (recall that the log of zero is undefined), is
obtained by adding 1/2 to each count,
est. log odds ratio = ln[(a + 1/2)(d + 1/2) / ((b + 1/2)(c + 1/2))]    (7.14)

(Gart & Zweifel, 1967; cited in Wickens, 1993, Equation 8). As Wickens
(1993) notes when recommending that the log odds ratio computed per
Equation 7.14 be analyzed with a parametric t test, this procedure not
only provides protection for a variety of hypotheses against the effects of
intersubject variability when categorical observations are collected from
each member of a group (or groups), it is also easy to describe, calculate,
and present.
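Using the 2 × 2 table above (only Python's math module is assumed), the three estimates are easily computed:

    import math

    a, b = 10, 10    # A followed by B, A followed by anything else
    c, d = 20, 60    # not-A followed by B, not-A followed by anything else

    odds_ratio = (a / b) / (c / d)                     # Equation 7.11; 3.0 here
    log_odds = math.log(a * d / (b * c))               # Equation 7.13
    # Less biased estimate, defined even when a cell is zero (Equation 7.14):
    log_odds_adj = math.log((a + 0.5) * (d + 0.5) / ((b + 0.5) * (c + 0.5)))
    print(odds_ratio, round(log_odds, 3), round(log_odds_adj, 3))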
Yule's Q
Yule's Q is a related index. It is a transformation of the odds ratio designed
to vary, not from zero to infinity with 1 indicating no effect, but from −1
to +1 with zero indicating no effect, just like the familiar Pearson product-
moment correlation. For that reason many investigators find it more
descriptively useful than the odds ratio. First, c/d is subtracted from the
numerator so that Yule's Q is zero when a/b equals c/d. Then, a/b is
added to the denominator so that Yule's Q is +1 when b and/or c is zero
and −1 when a and/or d is zero, as follows:

Yule's Q = (a/b − c/d) / (c/d + a/b) = (ad − bc) / (bc + ad)    (7.15)
Yule's Q can be expressed as a monotonically increasing function of both
the odds and log odds ratio; thus these three indices are equivalent in the
sense of rank ordering subjects the same way (Bakeman, McArthur, &
Quera, 1996).
Phi
Another extremely common index for 2 x 2 tables is the phi coefficient.
This is simply the familiar Pearson product-moment correlation coefficient
computed using binary coded data (Cohen & Cohen, 1983; Hays, 1963).
phi = z/√N    (7.16)

where z is computed for the 2 × 2 table and hence equals √X². Thus phi
can be viewed as a z score corrected for sample size. Like Yule's Q, it
varies from −1 to +1 with zero indicating no association. In terms of the
four cells, phi is defined as

phi = (ad − bc) / √[(a + b)(c + d)(a + c)(b + d)]    (7.17)
Multiplying and rearranging terms this becomes
phi = (ad − bc) / √[(ac + bd + ad + bc)(ab + cd + ad + bc)]    (7.18)
If we now rewrite the expression of Yule's Q, first squaring the denominator
of Equation 7.15 and then taking its square root
Yule's Q = (ad − bc) / √[(ad + bc)(ad + bc)]    (7.19)
the value of Yule's Q is not changed but similarities and differences between
phi and Yule's Q (Equations 7.18 and 7.19) are clarified.
Does it matter which index is used, Yule's Q or phi? The multiplier and
multiplicand in the denominator for Yule's Q (Equation 7.19) consist only
of the sum of ad and bc, whereas multiplier and multiplicand in the phi
denominator (Equation 7.18) add more terms. Consequently, values for
phi are always less than values for Yule's Q (unless b and c, or a and d,
are both zero, in which case both Yule's Q and phi would be +1 and −1,
respectively). Yule's Q and phi differ in another way as well. Yule's Q is
+1 when either b or c is zero and −1 when either a or d is zero (this is called
weak perfect association; Reynolds, 1984), whereas phi is +1 only when
both b and c are zero and −1 only when both a and d are zero (this is called
strict perfect association). Thus phi achieves its maximum value (absolute)
only when row and column marginals are equal (Reynolds, 1984). Some
investigators may regard this as advantageous, some as disadvantageous,
but in most cases it probably matters little which of these two indices is
used (or whether the odds ratio or log odds ratio is used instead). In fact,
after running a number of computer simulations, Bakeman, McArthur, and
Quera (1996) concluded that, when testing for group differences, it does
not matter much whether Yule's Q or phi is used since both rank-order cases
essentially the same. Transformed kappa, a statistic proposed by Wampold
(1989, 1992), however, did not perform as well. For details see Bakeman,
McArthur, and Quera (1996).
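A brief sketch (same illustrative 2 × 2 counts as before; only the math module is assumed) makes the comparison concrete; as expected, phi comes out smaller than Yule's Q for the same table:

    import math

    def yules_q(a, b, c, d):
        # Equation 7.15
        return (a * d - b * c) / (a * d + b * c)

    def phi(a, b, c, d):
        # Equation 7.17
        return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

    print(yules_q(10, 10, 20, 60), round(phi(10, 10, 20, 60), 3))   # 0.5 versus about 0.22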
and a Yule's Q or phi computed for each. These statistics could then be sub-
jected to whatever subsequent analyses the investigator deems appropriate.
In this section we have suggested that sequential associations between
two particular events (e.g., an A to B transition) be assessed with an index
like Yule's Q or phi. These statistics gauge the magnitude of the effect
and, unlike the z score, are unaffected by the number of tallies. Thus
they are reasonable candidates for subsequent analyses such as the familiar
parametric tests routinely used by social scientists to assess individual
differences and effects of various research factors (e.g., t tests, analyses of
variance, and multiple regression). But the events under consideration may
be many in number, leading to many tests and thereby courting type I error.
It goes without saying (which may be why it is so necessary to restate)
that guiding ideas provide the best protection against type I error. Given
K codes and an interest in lag 1 effects, a totally unguided and completely
exploratory investigator might examine occurrences of all possible K² two-
event chains (or K[K − 1] two-event chains when consecutive codes cannot
repeat). In this section, we have suggested that a more justifiable approach
would limit the number of transitions examined to the (K − 1)² degrees of
freedom associated with the table (or [K − 1]² − K degrees of freedom
when consecutive codes cannot repeat) and have demonstrated one way
that this number of 2 x 2 subtables could be extracted from a larger table.
Presumably a Yule's Q or some other statistic would be computed for each
subtable. Positive values would indicate that the pair of events associated
with the upper-left-hand cell is associated more than expected, given the oc-
currences observed for the baseline events associated with the second row
and second column of the 2 x 2 table. The summary statistic for the 2 x 2 ta-
bles, however many are formed, could then be subjected to further analysis.
Investigators are quite free - in fact, encouraged - to investigate a smaller
number of associations (i.e., form a smaller number of 2 x 2 tables). For ex-
ample, a larger table might be collapsed into a smaller one, combining some
codes that seem functionally similar, or only those associations required to
address the investigator's hypotheses might be subjected to analysis in the
first place. Other transitions might be examined later, and those analyses
labeled exploratory instead of confirmatory. For further discussion of this
"less is more" and "least is last" strategy for controlling type I error, see
Cohen and Cohen (1983, pp. 169-172).
7.8 Summary
Investigators often represent their data as sequences of coded events. Some-
times, data are recorded as event sequences in the first place; other times,
Issues in sequential analysis
8.1 Independence
In classical parametric statistics, we assume that our observations are in-
dependent, and this assumption forms part of the basis of our distribution
statistics. In the sequential analysis of observational data, on the other
hand, we want to detect dependence in the observations. To do this we
compare observed frequencies with those we would expect if the observa-
tions were independent. Thus, dependence in the data is not a "problem."
It is what we are trying to study.
The statistical problem of an appropriate test is not difficult to solve. It
was solved in a classic paper in 1957 by Anderson and Goodman (see also
Goodman, 1983, for an update). Their solution is based on the likelihood-
ratio chi-square test.
The likelihood-ratio test applies to the comparison of any two statisti-
cal models if one (the "little" model) is a subcase of the other (the "big"
model). The null-hypothesis model is usually the little model. In our case,
this model is often the assumption that the data are independent (or quasi
independent); i.e., that there is no sequential structure. Compared to this is
the big, interesting model that posits a dependent sequential structure. As
discussed in section 7.6, the difference between the G² for the big model
(e.g., [01]) and the G² for the little model (e.g., [0][1]) is distributed asymp-
totically as chi-square, with degrees of freedom equal to the difference in
the degrees of freedom for the big and little models. "Asymptotic" means
that it becomes increasingly true for large N, where N is the number of
observations.
When the data have "structural zeros," e.g., if a code cannot follow itself
(meaning that the frequency for that sequence is necessarily zero), the
number of degrees of freedom must be reduced (by the number of cells that
are structural zeros). These cells are not used to compute chi-square (see
Goodman, 1983).
We shall now discuss the conditions required to reach asymptote. In
particular, we shall discuss assigning probability values to z scores. We
should note that most observational data are stochastically dependent. They
are called in statistics "m-dependent" processes, which means that the
dependencies are short lived. One implication of this is that there is poor
predictability from one time point to another, as the lag between time
points increases. In time-series analysis, forecasts are notoriously poor
if they exceed one step ahead (see Box & Jenkins, 1970, for a graph of
the confidence intervals around forecasts). It also means that clumping m
observations gives near independence. For most data, m will be quite small
(probably less than 4), and its size relative to n will determine the speed at
which the asymptote is approached.
We conclude that assigning probability values to pairwise z scores (or
tablewise chi-squares) is appropriate when we are asking if the observed
frequency for a particular sequence is significantly different from expected
(or whether lag 0 and L are related and not independent). We admit,
however, that more cautious interpretations are possible, and would quote
a paragraph we wrote earlier (Gottman & Bakeman, 1979, p. 190):
As N increases beyond 25, the binomial distribution approximates a normal dis-
tribution and this approximation is rapidly asymptotic if P is close to 1/2 and
slowly asymptotic when P is near 0 or 1. When P is near 0 or 1, Siegel (1956)
suggested the rule of thumb that NP(1 − P) must be at least 9 to use the normal
approximation. Within these constraints the z-statistic above is approximately
normally distributed with zero mean and unit variance, and hence we may cau-
tiously conclude that if z exceeds ± 1.96 the difference between observed and
expected probabilities has reached the .05 level of significance (see also Sackett,
1978). However, because dyadic states in successive time intervals (or simply
successive dyadic states in the case of event-sequence data) are likely not inde-
pendent in the purest sense, it seems most conservative to treat the resulting z
simply as an index or score and not to assign p-values to it.
As the reader will note, in the chapter just quoted we were concerned
with two issues: the assumption of independence and the number of tallies
required to justify use of the binomial distribution. On reflection, we find
the argument that the categorizations of successive n-event sequences in
event-sequence data are not "independent" less compelling than we did
previously, and so we are no longer quite so hesitant to assign probability
values on this score.
Our lack of hesitancy rests in part on a simulation study Bakeman and
Dorval (1989) performed. No matter the statistic, for the usual sorts of
parametric tests, p values are only accurate when assumptions are met. To
those encountering sequential analysis for the first time, the common (but
not necessarily required) practice of overlapped sampling (tallying first the
e₁e₂ chain, then e₂e₃, e₃e₄, etc.) may seem like a violation of independence.
The two-event chain is constrained to begin with the code that ended the
previous chain (i.e., if a two-event chain ends in B, adding a tally to the 2nd
column, the next must add a tally to the 2nd row), and this violates sampling
independence. A Pearson or likelihood-ratio chi-square could be computed
and would be an index of the extent to which observed frequencies in this
table tend to deviate from their expected ones. But we would probably
have even less confidence than usual that the test statistic is distributed as
X2 and so, quite properly, would be reluctant to apply a p value.
Nonoverlapped sampling (tallying first the e₁e₂ chain, then e₃e₄, e₅e₆,
etc.) does not pose the same threat to sampling independence, although it
requires sequences almost twice as long in order to extract the same num-
ber of two-event chains produced by overlapped sampling. However, the
consequences of overlapped sampling may not be as severe as they at first
seem. Bakeman and Dorval (1989) found that when sequences were gen-
erated randomly, distributions of a test statistic assumed their theoretically
expected form equally for the overlapped and nonoverlapped procedures
and concluded that the apparent violation of sampling independence asso-
ciated with overlapped sampling was not consequential.
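The two sampling schemes are easy to state in code (the random sequence and library calls below are merely illustrative, echoing the simulation logic rather than reproducing Bakeman and Dorval's study):

    import random
    from collections import Counter

    events = [random.choice("ABC") for _ in range(250)]   # no built-in structure

    # Overlapped sampling: e1e2, e2e3, e3e4, ...
    overlapped = Counter(zip(events, events[1:]))
    # Nonoverlapped sampling: e1e2, e3e4, e5e6, ...
    nonoverlapped = Counter(zip(events[0::2], events[1::2]))

    print(sum(overlapped.values()), sum(nonoverlapped.values()))   # 249 versus 125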
8.2 Stationarity
The term "stationarity" means that the sequential structure of the data is
the same independent of where in the sequence we begin. This means that,
for example, we will get approximately the same antecedent/consequent
table for the first half of the data as we get for the second half of the data.
                 1st Half                   2nd Half
              HNice   HNasty            HNice   HNasty
WNice
WNasty
P(IJ, t) be the transition probability for that cell. Let P(IJ) be the pooled
transition probability. Then G², computed as
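As a rough sketch of the comparison just described (a likelihood-ratio test of half-specific against pooled transition probabilities; the counts and the degrees-of-freedom rule below are illustrative assumptions, not the book's worked example, and numpy and scipy are assumed):

    import numpy as np
    from scipy.stats import chi2

    def stationarity_g2(segments):
        # Likelihood-ratio comparison of segment-specific transition tables
        # (e.g., first versus second half) against the pooled table.
        segments = [np.asarray(s, dtype=float) for s in segments]
        pooled = sum(segments)
        pooled_p = pooled / pooled.sum(axis=1, keepdims=True)       # pooled P(I, J)
        g2 = 0.0
        for s in segments:
            p_t = s / s.sum(axis=1, keepdims=True)                  # P(I, J, t)
            g2 += 2 * np.sum(np.where(s > 0, s * np.log(p_t / pooled_p), 0.0))
        return g2

    first = [[30, 10], [12, 18]]      # hypothetical Nice/Nasty counts, 1st half
    second = [[28, 12], [10, 20]]     # and 2nd half
    g2 = stationarity_g2([first, second])
    df = (2 - 1) * 2 * (2 - 1)        # (segments - 1) x K x (K - 1), an illustrative rule
    print(round(g2, 2), round(chi2.sf(g2, df), 3))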
base rates for each of these codes. Or, in a more exploratory vein, they may
want to identify whichever sequences, if any, occur at greater than chance
rates. However, there is another quite different kind of question investiga-
tors can ask, one that concerns not how particular codes in the stream of
behavior are ordered, but how orderly the stream of behavior is overall.
In this section, we shall not describe analyses of overall order in any
detailed way. Instead, we shall suggest some references for the interested
reader, and shall try to give a general sense of what such analyses reveal and
how they proceed. Primarily, we want readers to be aware that it is possible
to ask questions quite different from those discussed earlier in this chapter.
One traditional approach to the analysis of general order is provided
by what is usually called "information theory." A brief explication of
this approach, along with appropriate references and examples, is given by
Gottman and Bakeman (1979). Although the classical reference is Shannon
and Weaver (1949), more useful for psychologists and animal behaviorists
are Attneave (1959) and Miller and Frick (1949). A well-known exam-
ple of information theory applied to the study of social communication
among rhesus monkeys is provided by S. Altmann (1965). A closely re-
lated approach is called Markovian analysis (e.g., Chatfield, 1973). More
recently, problems of gauging general orderliness are increasingly viewed
within a log-linear or contingency-table framework (Bakeman & Quera,
1995b; Bakeman & Robinson, 1994; Bishop, Fienberg, & Holland, 1975;
Castellan, 1979; Upton, 1978).
No matter the technical details of these particular approaches, their goals
are the same: to determine the level of sequential constraint. For example,
Miller and Frick (1949), reanalyzing Hamilton's (1916) data concerning
trial-and-error behavior in rats and 7-year-old girls, found that rats were
affected just by their previous choice whereas girls were affected by their
previous two choices. In other words, if we want to predict a rat's current
choice, our predictions can be improved by taking the previous choice into
account but are not further improved by knowing the choice before the
previous one. With girls, however, we do improve predictions concerning
their current choice if we know not just the previous choice but the one
before that, too.
If data like these had been analyzed with a log-linear (or Markovian)
approach, the analysis might have proceeded as follows: First we would
define a zero-order or null model, one that assumed that all codes occurred
with equal probability and were not in any way sequentially constrained.
Most likely, the data generated by this model would fail to fit the observed
data. Next we would define a model that assumed the observed probabilities
for the codes but no sequential constraints. Again, we would test whether
the data generated by this model fit the observed. If this model failed to
fit, we would next define a model that assumed that codes are constrained
just by the immediately previous code (this is called a first-order Markov
process). In terms of the example given above, this model should generate
data that fit those observed for rats but not for girls. Presumably, a model
that assumes that codes are constrained by the previous two codes should
generate data that pass the "fitness test" for the girls' data.
In any case, the logic of this approach should be clear. A series of models
are defined. Each imposes an additional constraint, for example, that the
data generated by the model need to take into account the previous code,
the previous two codes, etc. The process stops when a particular model
generates data similar to what was actually observed, as determined by a
goodness-of-fit test. The result is knowledge about the level of sequential
constraint, or connectedness, or orderliness of the data, considered as a
whole. (For a worked example, analyzing mother-infant interaction, see
Cohn & Tronick, 1987.)
Thus, even though investigators who pool data over several subjects usu-
ally do so for practical reasons, it has some implications for how results
are interpreted.
How seriously this last limitation is taken seems to vary somewhat by
field. In general, psychologists studying humans seem reluctant to pool
data over subjects, often worrying that some individuals will contribute
more than others, thereby distorting the data. Animal behaviorists, on the
other hand, seem to worry considerably less about pooling data, perhaps
because they regard their subjects more as exemplars for their species and
focus less on individuality. Thus students of animal behavior often seem
comfortable generalizing results from pooled data to other members of the
species studied.
As we see it, there are three options: First, when observations do not de-
rive from different subjects (using "subject" in the general sense of "case"
or "unit"), the investigator is limited to describing frequencies and proba-
bilities for selected sequences. Assuming enough data, these can be tested
for significance. Second, even when observations do derive from different
subjects, but when there are few data per subject, the investigator may opt
to pool data across subjects. As in the first case, sequences derived from
pooled data can be tested for significance, but investigators should keep in
mind the limits on interpretation recognized by their field.
Third, and again when observations derive from different subjects, in-
vestigators may prefer to treat statistics (e.g., Yule's Q's) associated with
different sequences just as scores to be analyzed using standard techniques
like the t test or the analysis of variance (see Wickens, 1993). In such cases,
statistics for the sequences under consideration would be computed sepa-
rately for each subject. However, analyses of these statistics tell us only
whether they are systematically affected by some research factor. They do
not tell us whether the statistics analyzed are themselves significant. In
order to determine that, we could test individual z scores for significance,
assuming enough data, and report for how many participants z scores were
significant, or else compute a single z score from pooled data, assuming
that pooling over units seems justified.
For example, Bakeman and Adamson (1984), for their study of infants'
attention to people and objects, observed infants playing both with their
mothers and with same-age peers. Coders segmented the stream of behavior
into a number of mutually exclusive and exhaustive behavioral states: Two
of those states were "Supported Joint" attention (infant and the other person
were both paying attention to the same object) and "Object" attention (the
infant alone was paying attention to some object). The Supported Joint
state was not especially common when infants played with peers. For
that reason, observations were pooled across infants, but separately for the
"with mother" and "with peer" observations.
The z scores computed from the pooled data for the Supported Joint
to Object transition were large and significant, both when infants were
observed with mother and when with peers. This indicates that, considering
these observations as a whole, the Supported Joint to Object sequence
occurred significantly more often than expected, no matter the partner. In
addition, an analysis of variance of individual scores indicated a significant
partner effect, favoring the mother. Thus, not only was this sequence
significantly more likely than expected with both mothers and peers, the
extent to which it exceeded the expected was significantly higher with
mothers, compared to peers.
In general, we suspect that most of our colleagues (and journal editors)
are uneasy when data are pooled over human participants. Thus it may be
worthwhile to consider how data such as those just described might be an-
alyzed, not only avoiding pooling, but actually emphasizing individuality.
The usual parametric tests analyze group means and so lose individuality.
Better, it might be argued, to report how many subjects actually reflected
a particular pattern, and then determine whether that pattern was observed
in more subjects than one might expect by chance.
For such analyses, the simple sign test suffices. For example, we might
report the number of subjects for whom the Yule's Q associated with the
Supported Joint to Object transition was positive when observed with moth-
ers, and again when observed with peers. If 28 infants were observed, by
chance alone we would expect to see the pattern in 14 (50%) of them, but
if the number of positive Yule's Q's was 19 or greater (p < .05, one-tailed
sign test), we would conclude that the Supported Joint to Object transition
was evidenced by significantly more infants than expected. And if the
Yule's Q when infants were observed with mothers was greater than the
Yule's Q when infants were observed with peers for 19 or more infants, we
would conclude that the association was stronger when with mothers, com-
pared to peers. The advantage of such a sign-test approach is that we learn,
not just what the average pattern was, but exactly how many participants
evidenced that pattern.
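With scipy the one-tailed sign test reduces to an exact binomial test (the call below assumes a reasonably recent scipy; the counts are the ones just described):

    from scipy.stats import binomtest

    # 19 of 28 infants showed a positive Yule's Q for the Supported Joint to
    # Object transition; 14 would be expected by chance alone.
    result = binomtest(19, n=28, p=0.5, alternative='greater')
    print(round(result.pvalue, 3))   # about .04, so p < .05 one-tailed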
The approach presented earlier in section 8.2 can also be applied to
the issue of pooling over subjects. Again, the question we ask is one of
homogeneity, that is, whether the sequential structure is the same across
subjects, or groups of subjects (instead of across time as for stationarity).
The formula to test this possibility is similar to the formula for stationarity
(see Gottman & Roy, 1990, pp. 67 ff.). In the following formula the sum
is across s = 1, 2, ..., S subjects, and P(IJ) represents the pooled joint
The Pearson chi-square for this table is 1.92 (so the z score is its square
root, 1.39) and its Yule's Q is +1. After all, every A (all two of them) was
followed by a B. However, if only one of the As were followed by a B,
            B      ~B    Totals
  A         1       1       2
 ~A        24      24      48
Totals     25      25      50
then all statistics (Pearson chi-square, z score, and Yule's Q) would be zero.
This example demonstrates summary statistics' instability when only a few
instances of a critical code are observed.
To protect ourselves against this source of instability, we compute sum-
mary statistics only when all marginal sums are 5 or greater, and regard
the value of the summary statistics as missing (too few instances to regard
the computed values as accurate) in other cases. This is only an arbi-
trary rule, of course, and hardly confers absolute protection. Investigators
should always be alert to the scanty data problem, and interpret results
cautiously when summary statistics (e.g., z scores, Yule's Q's) are based
on few instances of critical codes.
If stability is the bedrock reason to be concerned about the adequacy
of the data collected, correct inference is a secondary but perhaps more
often mentioned concern. This matters only when investigators wish to
assign a p value to a summary statistic (e.g., an X² or a z score) based on
assumptions that the statistic follows a known distribution (e.g., the chi-
square or normal). Guidelines for the amount of data adequate for inference
have long been addressed in the chi-square and log-linear literature, so it
makes sense to adapt those guidelines - many of which are stated in terms
of expected frequencies - for lag-sequential analysis.
Several considerations play a role, so absolute guidelines are as difficult
to define as they are desired. Summarizing current advice, Wickens (1989,
p. 30) noted that (a) expected frequencies for two-dimensional tables with
1 degree of freedom should exceed 2 or 3 but that with more degrees of
freedom some expected frequencies may be near 1 and with large tables up
to 20% may be less than 1, (b) the total sample should be at least four or
five times the number of cells (more if marginal categories are not equally
likely), and (c) similar rules apply when testing whether a model fits a
three- or larger-dimensional table.
As noted earlier, when lag 1 effects are studied, the number of cells is K²
when consecutive codes may repeat and K(K − 1) when they cannot. Thus,
at a minimum, the total number of tallies should exceed K² or K(K − 1), as
appropriate, times 4 or 5. Additionally, marginals and expected frequencies
should be checked to see whether they also meet the guidelines. When lag
L effects are studied, the number of cells is K^(L+1) when consecutive codes
may repeat and K(K − 1)^L when they cannot. As L increases, the product
or "study wise" alpha level, when k tests are performed, is 1 — .95* (Cohen &
Cohen, 1983). Thus if 20 independent tests are performed, the probability
of type I error is really .64, not .05.
What each study needs is a coherent plan for controlling type I error.
The best way, of course, is to limit drastically the number of tests made.
And even then, it makes sense to apply some technique that will assure a
desired studywise alpha level. For example, if k tests are performed and a
studywise alpha level of .05 is desired, then, using Bonferroni's correction,
the alpha level applied to each test should not be alpha, but rather alpha
divided by k (see Miller, 1966). Thus if 20 tests are performed, the alpha
level for each one should be .0025 (.05/20).
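The arithmetic is simple enough to verify directly (plain Python, no libraries needed):

    k = 20
    alpha = 0.05
    studywise = 1 - (1 - alpha) ** k   # probability of at least one type I error
    per_test = alpha / k               # Bonferroni-corrected alpha for each test
    print(round(studywise, 2), per_test)   # 0.64 and 0.0025, as in the text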
When studies are confirmatory, type I error usually should not be a
major problem. Presumably in such cases the investigator is interested in
(and will test for significance) just a few theoretically relevant sequences.
Exploratory studies are more problematic. Consider the parallel play study
discussed earlier. Only five codes were used, which is not a great number
at all. Yet these generate 20 possible two-event and 80 possible three-
event sequences. This makes us think that unless very few codes are used
(three or four, say) and unless there are compelling reasons to do so, most
exploratory investigations should limit themselves to examining just two-
event sequences, no longer - even if the amount of data is no problem.
Even when attention is confined to two-event sequences, the number of
codes should likewise be limited. For two-event sequences, the number
of possible sequences, and hence the number of tests, increase roughly as
the square of the number of codes. For this reason, we think that coding
schemes with more than 10 codes border on the unwieldy, at least when
the aims of a study are essentially exploratory.
Two ways to control type I error were described in section 7.6 when dis-
cussing log-linear approaches to lag-sequential analysis. First, exploratory
studies should not fish for effects at lag L in the absence of significant lag
L omnibus tests. And second, the set of seemingly significant sequences
at lag L should be winnowed into a smaller subset that can be viewed as
responsible for the model of independence's failure to fit. Still, as empha-
sized in section 7.7 when discussing Yule's Q, guiding ideas provide the
best protection against type I error. Investigators should always be alert
for ways to limit the number of statistical tests in the first place.
8.7 Summary
Several issues important if not necessarily unique to sequential analysis
have been discussed in this chapter. Investigators should always worry
whether summary indices (means, Yule's Q's, etc.) are based on sufficient
data. If not, confidence in computed values and their descriptive value is
seriously compromised. Further, when inferential statistics are used, data
sufficient to support their assumptions are required. Guidelines based on
log-linear analyses were presented here, but the possibility of permutation
tests, which require drastically fewer assumptions, was also mentioned.
Again, investigators should always limit the number of statistical tests in
any study, else they court type I error. Of help here is the discipline provided
by guiding ideas and theories, clearly stated. In contrast, issues of pooling
may arise more in sequential than other sorts of analyses because of data
demands. Pooling data over units such as individuals, dyads, families, etc.,
is rarely recommended, no matter how necessary it seems. When data per
unit are few, a jackknife technique (computing several values for a sum-
mary statistic, each with data for a different unit removed, then examining
the distribution for coherence) is probably better than pooling. Finally,
common to almost all statistical tests is the demand for independence (or
exchangeability; see Good, 1994). When two-event chains are sampled
in an overlapping manner from longer sequences, this requirement might
seem violated, but simulation studies indicate that the apparent violation
in this particular case does not seem consequential.
Analyzing time sequences
using a time unit instead of the event as the basic unit of analysis, values for
transitional probabilities are affected by how long particular events lasted
- which is undesirable if all the investigator wants to do is describe the
typical sequencing of events.
As an example, recall the study of parallel play introduced earlier, which
used an interval recording strategy (intervals were 15 seconds). Let U =
unoccupied, S = solitary, and P = parallel play. Then:
Interval = 15; U, U, U, S, P, P . . .
Interval = 15; U, S, S, S, P, P . . .
Interval = 15; U, U, S, S, P, P . . .
represent three slightly different ways an observational session might have
begun. All three interval sequences clearly represent one event sequence:
Unoccupied to Solitary to Parallel. Yet values for transitional probabilities
and their associated z scores would be quite different for these three interval
sequences. For example, the p(P_{t+1}|S_t) would vary from 1.00 to 0.33 to
0.50 for the three sequences given above. Worse, the z score associated with
0.33 would be negative, whereas the other two z scores would be positive.
No one would actually compute values for sequences of just six intervals, of
course, but if we had analyzed longer interval sequences like these with the
techniques described in chapter 7, it is not clear that the USP pattern would
have been revealed. Very likely, especially if each interval had represented 1
second instead of 15 seconds, we might have discovered only that transitions
from one code to itself were likely, whereas all other transitions were
unlikely. In short, we urge investigators to resist the tyranny of time. Even
when time information has been recorded, it should be "squeezed out"
of the data whenever describing the typical sequencing of events is the
primary concern. In fact, the GSEQ program (Bakeman & Quera, 1995a)
includes a command that removes time information, thereby transforming
state, timed-event, or interval sequences into event sequences when such
is desired.
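A small sketch (the helper names below are hypothetical, and this only mimics in spirit what GSEQ's time-removing command does) shows both the instability of interval-based transitional probabilities and the effect of squeezing time out:

    def transitional_p(seq, given, target):
        # p(target at t+1 | given at t) for a sequence of codes
        pairs = list(zip(seq, seq[1:]))
        n_given = sum(1 for g, _ in pairs if g == given)
        return sum(1 for g, t in pairs if g == given and t == target) / n_given

    def squeeze_time(seq):
        # Collapse runs of repeated codes: interval sequence -> event sequence
        return [c for i, c in enumerate(seq) if i == 0 or c != seq[i - 1]]

    s1, s2, s3 = list("UUUSPP"), list("USSSPP"), list("UUSSPP")
    print([round(transitional_p(s, "S", "P"), 2) for s in (s1, s2, s3)])  # [1.0, 0.33, 0.5]
    print([squeeze_time(s) for s in (s1, s2, s3)])  # each becomes ['U', 'S', 'P']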
interval sequences) does not require learning new techniques, only the
application of old ones.
In fact, one underlying format suffices for event, state, timed-event, and
interval sequences. These four forms are treated separately both here and
in the Sequential Data Interchange Standard (SDIS; see Bakeman & Quera,
1992) because this connects most easily with what investigators actually
do, and have done historically. Thus the four forms facilitate human use
and learning. A general-purpose computer program like GSEQ, however,
is better served by a common underlying format because this allows for
greater generality and hence less specific-purpose computer code. Indeed,
the SDIS program converts SDS files (files containing data that follow
SDIS conventions) into a common format (called MDS or modified SDS
files) that is easily read by GSEQ (Bakeman & Quera, 1995a).
The technical details need not concern users of these computer programs,
but understanding the conceptual unity of the four forms can be useful.
Common to all four is an underlying metric. For event sequences, the
underlying metric is the discrete event itself. For state and timed-event
sequences, the underlying metric is a unit of time, often a second. And for
interval sequences, the underlying metric is a discrete interval, usually (but
not necessarily) defined in terms of time.
The metric can be imagined as cross marks on a time line, where the
space between cross marks is thought of as bins to which codes may be
assigned, each representing the appropriate unit. For event sequences, one
code and one code only is placed in each bin. Sometimes adjacent bins may
be assigned the same code (consecutive codes may repeat), sometimes not
(for logical reasons, consecutive codes cannot repeat). For state sequences,
one (single stream) or more codes (multiple streams) may be placed in each
bin. Depending on the time unit used and the typical duration of a state,
often a stretch of successive bins will contain the same code. For timed-
event sequences, one or more codes or no codes at all may be placed in each
bin. And for interval sequences, again one or more codes or no codes at all
may be placed in each bin. As you can see, the underlying structure of all
forms is alike. Successive bins represent successive units and, depending
on the form, may contain one or more or no codes at all.
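One way to picture the common underlying format (purely illustrative; this is not the actual MDS file layout used by the SDIS and GSEQ programs):

    # Successive bins; each bin holds zero, one, or more codes.
    event_sequence = [{"U"}, {"S"}, {"P"}]                         # exactly one code per event
    state_sequence = [{"U"}, {"U"}, {"S"}, {"S"}, {"P"}]           # one code per time unit
    timed_event_sequence = [{"Vocalize"}, set(), {"Vocalize", "Gaze"}]   # zero or more per time unit
    interval_sequence = [{"A", "X"}, {"B"}, set()]                       # zero or more per interval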
Interval sequences, in particular, can be quite useful, even when data
were not interval recorded in the first place. For example, imagine that
a series of interactive episodes are observed for particular children and
that attributes of each episode are recorded (e.g., the partner involved, the
antecedent circumstance, the type of the interaction, the outcome). Here
the event (or episode) is multidimensional, so the event sequential form
is not adequate. But interval sequences, which permit several codes per
bin, work well, and permit both concurrent (e.g., are certain antecedents
often linked with particular types of interaction) and sequential (e.g., are
consequences of successive episodes linked in any way) analyses. Used
in this way, each episode defines an interval (instead of some period of
elapsed time); in such cases, interval sequences might better be called
multidimensional events. Further examples of the creative and flexible use
of the four forms for representing sequential data are given in Bakeman
and Quera (1995a, especially chapter 10).
Because the underlying form is the same for these four ways of represent-
ing sequential data, computational and analytic techniques are essentially
the same (primarily those described in chapter 7). New techniques need not
be introduced when time is taken into account. Only interpretation varies,
depending on the unit, whether an event or a time unit. For event sequences,
coders make a decision (i.e., decide which code to assign) for each event.
For interval sequences, coders make decisions (i.e., decide which codes
occurred) for each interval. For state and timed-event sequences, there is
no simple one-to-one correspondence between decisions and units. Coders
decide and record when events or states began and ended. They may note
the onset of a particular event, the moment it occurs. But just as Charles
Babbage questioned the literal accuracy of Alfred Lord Tennyson's cou-
plet "Every minute dies a man, / Every minute one is born" (Morrison &
Morrison, 1961), so too we should question the literal accuracy of a claim
that observers record onsets discretely second by second and recognize the
fairly arbitrary nature of time units. For example, we can double tallies by
changing units from 1 to 1/2 second. The connection, or lack of connec-
tion, between coders' decisions and representational units is important to
keep in mind and emphasize when interpreting sequential results.
truck) was an index of how connected and dialogic the children's conver-
sations were. This sequence thus indexed a more macro social process.
Gottman could have proceeded to analyze longer sequences in a statisti-
cal fashion, but with 40 codes the four-element chain matrix will contain
2,560,000 cells! Most of these cells would have been empty, of course, but
the task of even looking at this matrix is overwhelming. Instead, Gottman
designed a macro coding system whose task it was to code for sequences,
to code larger social processes. The macro system used a larger interaction
unit, and it gave fewer data for each conversation (i.e., fewer units of ob-
servation). However, the macro system used a larger interaction unit, and
it gave fewer data for each conversation (i.e., fewer units of observation).
However, the macro system was extremely useful. First, it was far more
rapid to use than the micro system. Second, because a larger unit was now
being used, new kinds of sequences were being discovered. This revealed
an organization of the conversations that Gottman did not notice, even with
the sequential analysis of the micro data (see Gottman & Parker, 1985 in
press).
To summarize, one strategy we recommend for sequential analysis is not
looking for everything that is patterned by employing statistical analysis
of one data set. It is possible to return to the data, armed with a knowledge
of patterns, and to reexamine the data for larger organizational units.
[Two panels plot husband and wife series (y axis) against floor switches (x axis); the first panel is labeled Couple No. 27, Improvisation No. 3, Clinic.]
Figure 9.1. Time series for two clinic couples. From Gottman (1979b, p. 215).
Event time     Interevent interval     2-hour block     Mean interval
7:54 a.m.
               2:12 (2.2)
10:06
               0:42 (0.7)              10-12            1.45
10:48
               1:30 (1.5)
12:18
               0:24 (0.4)              12-2             0.95
12:42
               1:48 (1.8)
2:30
               1:18 (1.3)              2-4              1.55
3:48
Note: Interevent intervals are given both in hours : minutes and in decimal hours.
observations, and then we slide the window forward in time. This option
smooths the data as we use a larger and larger time unit, a useful procedure
for graphical display, but not necessary for many time-series procedures
(particularly those in the frequency domain).
A third option is the univariate scaling of codes. For two different ap-
proaches to this, see Brazelton, Koslowski, and Main (1974), and Gottman
(1979b).
Each option produces a set of time series for each variable created, for
each person in the interacting unit. Analysis proceeds within each in-
teracting unit ("subject"), and statistics of sequential connection are then
extracted for standard analysis of variance or regression. A detailed ex-
ample of the analysis of this kind of data obtained from mother-infant
interaction appears in Gottman, Rose, and Mettetal (1982). A review of
time-series techniques is available in Gottman's (1981) book, together with
10 computer programs (Williams & Gottman, 1981).
other. They described the interactional energy building up, then ebbing and
cycling in synchrony. Part of this analysis had to do with their confidence
in the validity of their time-series variable. They wrote:
In other words, the strength of the dyadic interaction dominates the meaning of
each member's behavior. The behavior of any one member becomes a part of
a cluster of behaviors which interact with a cluster of behaviors from the other
member of the dyad. No single behavior can be separated from the cluster for
analysis without losing its meaning in the sequence. The effect of clustering and
of sequencing takes over in assessing the value of particular behaviors, and in
the same way the dyadic nature of interaction supercedes the importance of an
individual member's clusters and sequences, (p. 56)
determinable relationships between the peaks and troughs of the levels of waves,
which serve to express organized processes with continually changing relation-
ships, (p. 224)
were to plot the value of this ratio for every frequency we guess (remember
frequency equals 1/t), this graph is an estimate of the spectral density
function. Of course, this is the ideal case. In practice, there would be a lot
of noise in the data, so that the zero values would be nonzero, and also the
peak of the spectral density function would not be such a sharp spike. This
latter modification in thinking, in which the amplitudes of the cycles are
themselves random variables, is a major conceptual revolution in thinking
about data over time; it is the contribution of the 20th century to this area.
(For more discussion, see Gottman, 1981.)
Rare events
One of the uses of univariate time-series analysis is in evaluating the effects
of rare events. It is nearly impossible to assess the effect of a rare but
theoretically important event without pooling data across subjects in a
study by the use of sequential analysis of categorical data. However, if we
create a time-series variable that can serve as an index of the interaction,
the problem can be solved by the use of the interrupted time-series quasi-experiment.
What we mean by an "index" variable is one that is a meaningful the-
oretical index of how the interaction is going. Brazelton, Koslowski, and
Main (1974) suggested an index time series that measured a dimension of
engagement and involvement to disengagement. The dimension assessed
the amount of interactive energy and involvement that a mother expressed
toward her baby and that the baby expressed toward the mother. This is an
example of such an index variable. Gottman (1979) created a time-series
variable that was the cumulative positive-minus-negative affect in a mari-
tal interaction for husband and wife. The interactive unit was the two-turn
unit, called a "floor switch." Figure 9.1 illustrates the Gottman index time
series for two couples.
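The following sketch shows one way such an index series could be built; the affect codes are invented for illustration and are not CISS codes. At each floor switch we add +1 for positive affect, -1 for negative, 0 for neutral, and keep a running sum.

program CumulativeAffect;
{ Sketch: build a cumulative positive-minus-negative affect index,
  one value per floor switch.  Codes: +1 positive, -1 negative, 0 neutral. }
const
  NSwitches = 8;            { number of floor switches (illustrative) }
  Affect : array[1..NSwitches] of integer = (1, 1, -1, 0, -1, -1, 1, -1);
var
  t, runningSum : integer;
begin
  runningSum := 0;
  for t := 1 to NSwitches do
  begin
    runningSum := runningSum + Affect[t];
    writeln('Floor switch ', t:2, ': cumulative index = ', runningSum:3);
  end;
end.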
Now suppose that the data in Figure 9.2 represented precisely such a
point graph of a wife's negative affect, and that a rare but interesting event
occurred at time 30, when her husband referred to a previous relationship
he had had. We want to know whether this event had any impact on
the interaction. To answer this question, we can use an interrupted time-
series analysis. There are many ways to do this analysis (see Gottman,
1981). We analyzed these data with the Gottman-Williams program ITSE
(see Williams & Gottman, 1981) and found that there was no significant
change in the slope of the series [t(32) = -0.3], but that there was a
significant effect in the change in level of the series [t(32) = 5.4]. This is
one important use of time-series analysis.
Cyclicity
Univariate time-series analysis can also answer questions about the cyclicity
of the data. Recall that to assess the cyclicity of the data, a function called
the "spectral density function" is computed. This function will have a
significant peak for cycles that account for major amounts of variance
in the series, relative to what we might have expected if there were no
significant cycles in the data (i.e., the data were noise). We used the data
in Figure 9.2 for this analysis, and used the Gottman-Williams program
SPEC. The output of SPEC is relatively easy to interpret. The solid line
in Figure 9.3 is the program's estimate of the spectral density function,
and the dotted line above and below the solid line is the 0.95 confidence
interval. If the entire confidence interval is above the horizontal dashed
line, the cycle is statistically significant. The x axis is a little unfamiliar to
most readers, because it refers to "frequency," which in time-series analysis
means cycles per time unit. It is the reciprocal of the period of oscillation.
The peak cycle is at a frequency of 0.102, which corresponds to a period
of 9.804 time periods. The data are cyclic indeed. It is important to realize
that cyclicity in modern time-series analysis is a statistical concept. What
we mean by this is that the period of oscillation is itself a random variable,
with a distribution. Thus, the data in Figure 9.3 are almost periodic, not
precisely periodic. Most phenomena in nature are actually of this sort.
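For readers who want to see the idea in code, the sketch below (our own illustration, not SPEC itself) evaluates the raw periodogram at the Fourier frequencies j/n for an invented series with a built-in cycle of period 10 and reports the peak frequency and its period, the reciprocal of that frequency; SPEC adds the smoothing and confidence bands that the raw periodogram lacks.

program Periodogram;
{ Bare-bones spectral sketch: evaluate the periodogram at the Fourier
  frequencies j/n and report the peak frequency and its period (1/f). }
const
  N = 40;
var
  X : array[1..N] of real;
  mean, a, b, power, bestPower, bestFreq : real;
  t, j : integer;
begin
  { Illustrative series: a cycle of period 10 plus a small trend. }
  for t := 1 to N do
    X[t] := sin(2.0 * pi * t / 10.0) + 0.01 * t;

  mean := 0.0;
  for t := 1 to N do mean := mean + X[t];
  mean := mean / N;

  bestPower := -1.0;  bestFreq := 0.0;
  for j := 1 to N div 2 do
  begin
    a := 0.0;  b := 0.0;
    for t := 1 to N do
    begin
      a := a + (X[t] - mean) * cos(2.0 * pi * j * t / N);
      b := b + (X[t] - mean) * sin(2.0 * pi * j * t / N);
    end;
    power := (a * a + b * b) / N;      { periodogram ordinate at j/N }
    if power > bestPower then
    begin
      bestPower := power;
      bestFreq := j / N;
    end;
  end;
  writeln('Peak frequency = ', bestFreq:6:3,
          ', period = ', (1.0 / bestFreq):6:2, ' time units');
end.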
Figure 9.3. Plot of density estimates. The density is indicated by a solid line, and
the 0.95 confidence interval by a dotted line. The white-noise spectrum is shown
by a dashed line.
Figure 9.4. Couple mp48 conflict discussion. The solid line indicates the hus-
band's affect; the broken line, the wife's affect.
1        8  8     133.581     44.499
2        1  4     150.232     52.957
3        1  0     162.940     58.803
1 vs 2:  Q = 8.458, df = 11, z = -.542
2 vs 3:  Q = 5.846, df = 4,  z = .653
                   207.735     76.291
                   224.135     81.762
                   298.577    102.410
a Weighted error variance; see Gottman and Ringland (1981), p. 411.
versus model 2 should be nonsignificant, which is the case. The third row
of the table shows the model with all of the wife's cross-regressive terms
dropped. If this model is not significantly different from model 2, then we
cannot conclude that the wife influences the husband; we can see that the
z score (0.653) is not significant. Similar analyses appear in the second
half of the table for determining if the husband influences the wife; the
comparison of models 2 and 3 shows a highly significant z score (this is the
normal approximation to the chi-square, not to be confused with the z score
for sequential connection that we have been discussing). Hence we can
conclude that the husband influences the wife, and this would be classified
as a husband-dominant interaction according to Gottman's (1979) defini-
tion. We note in passing that although these time series are not stationary,
it is still sensible to employ BIVAR.
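The z scores in this table appear to follow the usual normal approximation to the chi-square, z = (Q - df) / sqrt(2 df); the fragment below, our own check rather than BIVAR output, reproduces the two values reported above.

program ChiSquareToZ;
{ Normal approximation to the chi-square used for the model
  comparisons: z = (Q - df) / sqrt(2 * df). }

function ZApprox(Q : real; df : integer) : real;
begin
  ZApprox := (Q - df) / sqrt(2.0 * df);
end;

begin
  writeln('1 vs 2: z = ', ZApprox(8.458, 11):6:3);  { about -0.542 }
  writeln('2 vs 3: z = ', ZApprox(5.846,  4):6:3);  { about  0.653 }
end.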
Interevent Interval
In the study of the heart, in analyzing the electrocardiogram (ECG), there
is a concept called "interbeat interval," or the "IBI." It is the time between
the large R-spikes of the ECG that signal the contraction of the ventricles
of the heart. It is usually measured in milliseconds in humans. So, for
example, if we recorded the ECG of a husband and wife talking to each
other about a major issue in their marriage, and we took the average of the
IBIs for each second (this can be weighted or prorated by how much of
the second each IBI took up), we would get a time series that was a set
of consecutive IBIs (in milliseconds) that looked like this: 650, 750, 635,
700, 600, 625, 704, and so on.
We wish to generalize this idea of IBI and discuss a concept we call
the "interevent interval." This is like an old concept in psychology, the
intertrial interval. When we are recording time, which we get for free with
almost all computer-assisted coding systems, we also can compute the time
between salient events, and these times can become a time series. When
we do this, we are interested in how these interevent times change with
time within an interaction.
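A minimal sketch of this computation appears below; the event times (in seconds from the start of the session) are invented for illustration, and each interevent interval is simply the difference between successive event times.

program InterEventIntervals;
{ Sketch: compute interevent intervals (IEIs) from a list of event
  times, producing one interval per successive pair of events. }
const
  NEvents = 5;
  { Event times in seconds from the start of the session (illustrative). }
  EventTime : array[1..NEvents] of real = (314.0, 327.0, 340.0, 344.0, 371.0);
var
  k : integer;
begin
  for k := 2 to NEvents do
    writeln('IEI ', (k - 1):2, ' = ',
            (EventTime[k] - EventTime[k - 1]):6:1, ' seconds');
end.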
[Figure 9.5 shows two panels, labeled A regulated couple and A nonregulated couple, with cumulated points for husband and wife plotted against turns at speech.]
Figure 9.5. (a) Cumulative point graphs for a regulated couple, for which pos-
itive codes generally exceed negative codes, (b) Cumulative point graphs for a
nonregulated couple, for which negative codes generally exceed positive codes.
Model identification
Using the autocorrelation function and a computer program (see the Williams &
Gottman, 1981, computer programs), an autoregressive model for the time
series can be identified exactly, provided the series is stationary (that is,
it has the same correlation structure throughout and no local or global
trends). A wide variety of patterns can be fit using the
autoregressive models, including time series with one or many cycles.
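As a bare-bones illustration of the idea (not the identification procedure used by the Williams-Gottman programs), the sketch below estimates a first-order autoregressive model by computing the lag-1 autocorrelation of an invented series; for the mean-centered series, x(t) is then predicted as r1 times x(t - 1).

program IdentifyAR1;
{ Sketch: estimate a first-order autoregressive model from the lag-1
  autocorrelation of a (stationary) series. }
const
  N = 12;
  X : array[1..N] of integer = (3, 5, 4, 6, 7, 5, 6, 8, 7, 6, 5, 7);
var
  t : integer;
  mean, c0, c1, r1 : real;
begin
  mean := 0.0;
  for t := 1 to N do mean := mean + X[t];
  mean := mean / N;

  c0 := 0.0;  c1 := 0.0;
  for t := 1 to N do
    c0 := c0 + sqr(X[t] - mean);                     { lag-0 sum of squares }
  for t := 2 to N do
    c1 := c1 + (X[t] - mean) * (X[t - 1] - mean);    { lag-1 cross products }

  r1 := c1 / c0;                                     { lag-1 autocorrelation }
  writeln('Lag-1 autocorrelation (AR(1) coefficient) = ', r1:6:3);
end.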
divorce versus marital stability), we can model the series, and then we can
scan the time series for statistically significant changes in overall level
or slope. This is called an "interrupted time-series experiment," or ITSE
(see Figure 9.6). An ITSE consists of a series of data points before and
after an event generally called the "experiment." The "experiment" can
be some naturally occurring event, in which case it is actually a quasi-
experiment. We then represent the data before the intervention as one
function b1 + m1t + autoregressive term, and the data after the intervention
as b2 + m2t + autoregressive term. We need only supply the data and the
order of the autoregressive terms we select, and the computer program tests
for statistically significant changes in intercept (the b's) and slope (the m's).
The experiment can last 1 second, or just for one time unit, or it can last
longer. One useful procedure is to use the occurrence of individual codes
as the event for the interrupted time-series experiment. Thus, we may ask
questions such as "Does the wife's validation of her husband's upset change
how he rates her?" Then we can do an interrupted time-series experiment
for every occurrence of Wife Validation. For the data in Figure 9.6, there
were 48 points before and 45 points after the validation. The order of the
autoregressive term selected was about one tenth of the preintervention
data, or 5. The t for change in intercept was t(79) = -2.58, p < .01, and
the t for change in slope was t(79) = -1.70, p < .05.
For this example, we used the Williams and Gottman (1981) computer
program ITSE to test the statistical significance of changes in intercept and
slope before and after the experiment; an autoregressive model of any order
can be fit to the data. Recently, Crosbie (1995) developed a powerful new
method for analyzing short time-series experiments. In these analyses only
the first-order autoregressive parameter is used, and the preexperiment data
are fit with one straight line (intercept and slope) and the postexperiment
data are fit with a different straight line (intercept and slope). An omnibus
F test and t tests for changes in level and slope are then computed. This
method can be used to determine which codes in the observational system
have potentially powerful impact on the overall quality of the interaction.
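The sketch below is neither ITSE nor Crosbie's procedure; it simply fits an ordinary least-squares line to invented preintervention data and another to the postintervention data and reports the changes in intercept and slope, omitting the autoregressive term and the significance tests that the real programs provide.

program PrePostLines;
{ Sketch: fit separate least-squares lines to the data before and
  after an intervention and report the change in intercept and slope. }
const
  N = 16;
  Cut = 8;    { last preintervention observation (illustrative) }
  Y : array[1..N] of integer = (4, 5, 5, 6, 5, 6, 7, 6,    { before }
                                2, 3, 2, 3, 2, 1, 2, 1);   { after }

procedure FitLine(first, last : integer; var b, m : real);
var
  t : integer;
  n, sumT, sumY, sumTT, sumTY : real;
begin
  n := last - first + 1;
  sumT := 0.0;  sumY := 0.0;  sumTT := 0.0;  sumTY := 0.0;
  for t := first to last do
  begin
    sumT := sumT + t;
    sumY := sumY + Y[t];
    sumTT := sumTT + t * t;
    sumTY := sumTY + t * Y[t];
  end;
  m := (n * sumTY - sumT * sumY) / (n * sumTT - sumT * sumT);
  b := (sumY - m * sumT) / n;
end;

var
  b1, m1, b2, m2 : real;
begin
  FitLine(1, Cut, b1, m1);
  FitLine(Cut + 1, N, b2, m2);
  writeln('Before: intercept = ', b1:6:2, ', slope = ', m1:6:2);
  writeln('After : intercept = ', b2:6:2, ', slope = ', m2:6:2);
  writeln('Change: intercept = ', (b2 - b1):6:2, ', slope = ', (m2 - m1):6:2);
end.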
Phase-space plots
Another useful way to display time-series data is by using a "phase-space"
plot. In a phase-space plot, which has an x-axis and a y-axis, we plot the
data as a set of pairs of points: (x1, x2), (x2, x3), (x3, x4), .... The x-axis
is x(t), and the y-axis is x(t + 1), where t is time, t = 1, 2, 3, 4, and so
on. Alternatively, if we are studying marital interaction, we can plot the
interevent intervals for both husband and wife separately, so we have both
Figure 9.6. Plot of rating dial in which the husband rated his wife's affect during
their conversation.
a husband and a wife time series. A real example may help clarify how we
might use this idea (Gottman, 1990).
Figure 9.7. Interevent intervals for negative affect in a marital conversation.
plot we connect the successive dots with straight lines (see Figures 9.8 and
9.9). This gives us an idea of the "flow" of the data over time.
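Constructing the pairs is simple; the sketch below (with invented interevent intervals) lists the lagged pairs that would then be plotted and connected.

program PhaseSpacePairs;
{ Sketch: turn a series x(1..N) into the lagged pairs (x(t), x(t+1))
  used for a phase-space plot. }
const
  N = 6;
  X : array[1..N] of integer = (13, 13, 4, 9, 6, 3);  { illustrative IEIs }
var
  t : integer;
begin
  for t := 1 to N - 1 do
    writeln('(', X[t]:3, ', ', X[t + 1]:3, ')');
end.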
What does this figure mean in terms of the actual behavior of the marital
system? It means that, insofar as we can ascertain from these data, we have
a system whose energy balance is not stable, but dissipative; that is, it runs
down. Like the pendulum winding down, this system tends toward what is
called an attractor; in this case the attractor represents an interevent interval
of zero. However, for the consequences of energy balance, this movement
toward an attractor of zero interevent interval between negative affect may
be disastrous. Specifically, this system tends, over time, toward shorter
response times for negative affect. Think of what that means. As the
couple talk, the times between negativity become shorter and shorter. This
interaction is like a tennis match where the ball (negative affect) is getting
hit back and returned faster and faster as the game proceeds. Eventually
the system is tending toward uninterrupted negativity.
We can verify this description of this marital interaction using more
standard analytic tools, in this case by performing the mathematical pro-
cedure we discussed called spectral time-series analysis of these IEIs (see
Gottman, 1981). Recall that a spectral time-series analysis tells us whether
there are specific cyclicities in the data, and, if there are, how much variance
each cycle accounts for. See Figure 9.10.
Data
Time    Code            IEI
5:14    Husb Anger
5:27    Wife Sad        13 secs
5:40    Husb Disgust    13 secs
5:44    Husb Anger       4 secs
(X1, X2) = (13, 13)
(X2, X3) = (13, 4)
Note that the overall spectral analysis of all the data reveals very little.
There seem to be multiple peaks in the data, some representing slower and
some faster cycles. However, if we divide the interaction into parts, we
can see that there is actually a systematic shift in the cyclicities. The cycle
length is 17.5 seconds at first, and then moves to 13.2 seconds, and then
to 11.8 seconds. This means that the time for the system to cycle between
negative affects is getting shorter as the interaction proceeds. This is exactly
what we observed in the state space diagram in which all the points were
connected. Hence, in two separate analyses of these data we have been led
to the conclusion that this system is not regulated, but is moving toward
more and more rapid response times between negative affects. From the
data we have available, this interaction seems very negative, relentless, and
unabated. Of course, there may be a more macro-level regulation that we
Figure 9.9. System is being drawn toward faster and faster response times in the
IEI of negative affect.
do not see that will move the system out toward the base of the cone once
it has moved in, and it may oscillate in this fashion. But we cannot know
this. At the moment it seems fair to conclude that this unhappy marriage
represents a runaway system.
There are lots of other possibilities for what phase-space flow diagrams
might look like. One common example is that the data seem to hover quite
close to one or two of what are called "steady states." This means that the
data really are quite stable, except for minor and fairly random variations.
Another common example is that the data seem to move in something
like a circle or ellipse around a steady state. The circle pattern suggests
one cyclical oscillation. More complex patterns are possible, including
chaos (see Gottman, 1990, for a discussion of chaos theory applied to
families); we should caution the reader that despite the strange romantic
appeal that the chaos theory has enjoyed, chaotic patterns are actually
almost never observed. Gottman (1990) suggested that the cyclical phase-
space plot was like a steadily oscillating pendulum. If a pendulum is
steadily oscillating, like the pendulum of a grandfather clock, energy is
constantly being supplied to drive the pendulum, or it would run down (in
phase space, it would spiral in toward a fixed point).
[Figure 9.10 shows the overall spectrum and the spectra of three segments, plotted against frequency.]
Figure 9.10. Spectrum for three segments showing higher frequencies and shorter
periods.
9.6 Summary
When successive events have been coded (or when a record of successive
events is extracted from more detailed data), event-sequential data result.
When successive intervals have been coded, interval-sequential data result.
And when event or state times have been recorded, the result is timed-event
or state data. At one level (the representation of these data used by the GSEQ
program), these four kinds of data are identical: All consist of successive
bins, where bins are defined by the appropriate unit (event, time, or interval)
and may contain one code, more than one, or none.
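One way to picture this common representation, sketched below with invented codes (an illustration of the idea, not the file format GSEQ actually uses), is as an array of bins, each holding a possibly empty set of codes.

program BinsSketch;
{ Sketch of the common representation: successive bins, each of which
  may contain no codes, one code, or several codes.  The bin unit may
  be an event, a time unit, or an interval. }
type
  Code = (Vocalize, Look, Touch, Smile);   { illustrative codes }
  CodeSet = set of Code;
const
  NBins = 5;
var
  bins : array[1..NBins] of CodeSet;
  b : integer;
begin
  bins[1] := [Vocalize];                 { one code }
  bins[2] := [];                         { no codes }
  bins[3] := [Look, Smile];              { more than one code }
  bins[4] := [Touch];
  bins[5] := [Look];
  for b := 1 to NBins do
    if Vocalize in bins[b] then
      writeln('Bin ', b, ' contains Vocalize');
end.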
In general, all the analytic techniques that apply to event sequences can
be applied to state, timed-event, and interval sequences as well, but there
are some cautions.
Primarily, the connection, or lack of connection, between coders' decisions
and representational units is important to keep in mind, and to emphasize
when interpreting sequential results, because in some cases units represent
coders' decisions and in other cases they are arbitrary time units. It is also
important to keep in mind how different forms of data representation suit
different questions.
Note: Counts for the toddlers are based on 20.7 hours of observations; for the
preschoolers, on 16.6 hours; see Bakeman and Brownlee (1982).
particular, it does not contain any terms that suggest that either dominance,
or prior possession, or their interaction, is related to the amount of resis-
tance encountered.
In analysis-of-variance terms, the [R][DP] model is the "no effects"
model. In a sense, the [DP] term just states the design, whereas the fact
that the response variable, [R], is not combined with any of the explanatory
variables indicates that none affect it. If the [R][DP] model failed to fit
the data, but the [RD][DP] model did, we would conclude that there was a
main effect for dominance - that unless dominance is taken into account,
we fail to make very good predictions for how often resistance will occur.
Similarly, if the [RP][DP] model fit the data, we would conclude that there
was a main effect for prior possession. If the [RD][RP][DP] model fit,
main effects for both dominance and prior possession would be indicated.
Finally, if only the [RDP] model fit the data (the saturated model), we would
conclude that, in order to account for resistance, the interaction between
dominance and prior possession must be taken into account.
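As a sketch of what fitting the no-effects model involves, the program below computes expected counts for [R][DP] (resistance independent of the four dominance-by-prior-possession cells) and the likelihood-ratio chi-square, G2 = 2 Σ obs ln(obs/exp), which has df = 3 for a 2 x 2 x 2 table; the counts are invented and are not Bakeman and Brownlee's.

program NoEffectsModel;
{ Sketch: likelihood-ratio chi-square (G2) for the no-effects model
  [R][DP], i.e., resistance independent of the four dominance-by-
  prior-possession cells.  df = (2 - 1) * (4 - 1) = 3. }
const
  { Rows: resistance yes/no.  Columns: the four D x P cells.
    Counts are invented for illustration. }
  Obs : array[1..2, 1..4] of integer =
    ((12,  7, 15,  6),
     ( 8, 13,  5, 14));
var
  r, c : integer;
  rowTot : array[1..2] of real;
  colTot : array[1..4] of real;
  total, expct, g2 : real;
begin
  total := 0.0;
  for r := 1 to 2 do rowTot[r] := 0.0;
  for c := 1 to 4 do colTot[c] := 0.0;
  for r := 1 to 2 do
    for c := 1 to 4 do
    begin
      rowTot[r] := rowTot[r] + Obs[r, c];
      colTot[c] := colTot[c] + Obs[r, c];
      total := total + Obs[r, c];
    end;

  g2 := 0.0;
  for r := 1 to 2 do
    for c := 1 to 4 do
    begin
      expct := rowTot[r] * colTot[c] / total;   { expected under [R][DP] }
      g2 := g2 + Obs[r, c] * ln(Obs[r, c] / expct);
    end;
  g2 := 2.0 * g2;
  writeln('G2 = ', g2:6:2, ' on 3 df');
end.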
In the present case, the no-effects model failed to fit the observed data.
For both toddlers and preschoolers, the chi-square comparing generated to
observed was large and significant (values were 11.3 and 7.2, df = 3, for
toddlers and preschoolers, respectively; these are likelihood-ratio chi-squares,
computed by Bakeman & Robinson's [1994] ILOG program). However,
for toddlers the [RD][DP] model, and for preschoolers the [RP][DP] model,
generated data quite similar to the observed (chi-squares were 1.9 and 0.8,
df = 2, for toddlers and preschoolers, respectively; these chi-squares are
both nonsignificant, although in both cases the difference between them and
the no-effects model is significant). This is analogous to a main effect for
dominance among toddlers and a main effect for prior possession among
preschoolers. In other words, among toddlers the dominance of the taker
affected whether his or her take attempt would be resisted, whereas among
preschoolers it was whether or not the taker had had prior possession of the
contested object that affected whether he or she would meet with resistance. Thus
the effects noted descriptively in the previous section are indeed statistically
significant. Bakeman and Brownlee (1982) interpreted this as evidence for
shared possession rules, rules that emerge somewhere between 2 and 3
years of age.
10.3 Summary
Sometimes investigators who collect observational data and who are inter-
ested in sequential elements of the behavior observed seem compelled both
to obtain a continuous record of selected aspects of the passing stream of
Soskin and John reported several findings based on their analysis of their
tapes. The structural analysis showed that, in most situations, Jock talked
a lot and could be described as highly gregarious (he talked about 36%
of the total talk time in a four-person group). His longer utterances were
predominantly structones (factual-information exchanges).
The functional analysis of 1850 messages of Roz and Jock's talk showed
that Roz was significantly more expressive (8.6% vs. Jock's 3.8%), less con-
trolling (fewer regones: 11.0% vs. Jock's 13.9%), and less informational
(fewer structones: 24.5% vs. Jock's 31.3%). They concluded:
Roz produced a high percentage of expressive messages whenever the two were
out of the public view and became noticeably more controlled in the presence of
others. Jock's output, on the other hand, was relatively low throughout. (p. 267)
This is not quite consistent with the earlier analysis of Jock as gregarious
and a high-output talker. They then turned to the results of their dynamic
analysis, which they began describing as follows:
The very dimensions by which it was hoped to identify inter- and intrapersonal
changes in the sequential development of an episode proved most difficult to
isolate. (p. 267)
Unfortunately, they could find no consistent patterns in the way Roz and
Jock tried to influence and control one another. They wrote:
The very subtle shifts and variations in the way in which these two people at-
tempted to modify each other's states sequentially throughout this episode obliged
us to question whether summaries of very long segments of a record reflect the
actual sequential dynamics of the behavior in a given episode. (p. 268)
In these results, we can see a struggle with coding systems that are unwieldy,
that are hard to fit together, and that lack a clear purpose or focus. The
research questions are missing.
H: Don't you remember I had that argument with him last week?
W: I forgot.
H: Yeah.
W: So I'm sorry I forgot, all right?
H: So it is a big deal to him.
W: So what do you want me to do, jump up and down?
H: Well, how was your day, honey?
W: Oh brother, here we go again.
H: (pause) You don't have to look at me that way.
W: So what d'ya want me to do, put a paper bag over my head?
Using the Couples Interaction Scoring System (CISS), which codes both
verbal and nonverbal behaviors of both speaker and listener, Gottman and
his associates coded the interaction of couples who were satisfied or dis-
satisfied with their marriages. Among other tasks, couples were studied
attempting to resolve a major area of disagreement in their marriages.
Gottman's results
The major question in this research was "what were the differences between
satisfied and dissatisfied couples in the way they resolve conflict?"
Basically, these differences can be described by using the analogy of
a chess game. A chess game has three phases: the beginning game, the
middle game, and the end game. Each phase has characteristic good and
bad maneuvers and objectives. The objectives can, in fact, be derived
inductively from the maneuvers. The goal of the beginning phase is control
of the center of the chessboard and development of position. The goal of
the middle game is the favorable exchange of pieces. The goal of the end
game is checkmate. Similarly, there are three phases in the discussion of a
marital issue. The first phase is "agenda-building," the objective of which
is to get the issues out as they are viewed by each partner. The second
phase is the "arguing phase," the goal of which is for partners to argue
energetically for their points of view and for each partner to understand the
areas of disagreement between them. The third phase is the "negotiation,"
the goal of which is compromise.
It is possible to discriminate the interaction of satisfied and dissatisfied
couples in each phase. In the agenda-building phase, cross-complaining
sequences characterize dissatisfied couples. A cross-complaining sequence
is one in which a complaint by one person is followed by a countercomplaint
by the other. For example:
W: I'm tired of spending all my time on the housework. You're not doing your
share.
H: If you used your time efficiently you wouldn't be tired.
Validation:
W: I've been home alone all day.
H: Uh-huh.
W: Cooped up with the kids.
H: Yeah, I come home tired.
W: Mmm.
H: And just want to relax.
W: Yeah.
Contracting:
W: We spent all of Christmas at your mother's last year. This time let's spend
Christmas at my mother's.
H: Yeah you're right, that's not fair. How about 50-50 this year?
Procedure ComputeKappa;
var
  i, j : word;
  M : array[1..Max+1, 1..Max+1] of real;
  Kappa, Numer, Denom : real;
begin
  for i := 1 to N do   { Set row & col sums to zero. }
  begin
    M[i, N+1] := 0.0;
    M[N+1, i] := 0.0;
  end;
  M[N+1, N+1] := 0.0;
  { Tally row & col totals. }
  for i := 1 to N do for j := 1 to N do
  begin
    M[N+1, N+1] := M[N+1, N+1] + X[i, j];
  end;
  { Compute exp. frequencies. }
  for i := 1 to N do for j := 1 to N do

begin
  for i := 1 to N do for j := 1 to N do
    if (i = j) then W[i, j] := 0 else W[i, j] := 1;
end;
begin
begin
  write (Message[Kind], ' (Y|N)? ');
  Ch := ReadKey;  writeln (Ch);
  WantsSame := (UpCase(Ch) = 'Y') or (Ch = Enter);
end;
BEGIN
  TextColor (Black);
  TextBackground (LightGray);
ClrScr;
  Writeln ('Compute kappa or wt kappa; (c) Roger Bakeman, GSU');
  repeat
    if N = 0 then AskForOrder
      else if not WantsSame(1) then AskForOrder;
    if not WeightsDefined
      then begin if WantsSame(3) then AskForNumbers(1) end
      else if not WantsSame(4) then AskForNumbers(1);
Adamson, L. B., & Bakeman, R. (1985). Affect and attention: Infants observed with mothers
and peers. Child Development, 56, 582-593.
Ainsworth, M. D. S., Blehar, M. C., Waters, E., & Wall, S. (1978). Patterns of attachment.
Hillsdale, NJ: Lawrence Erlbaum.
Allison, P. D., & Liker, J. K. (1982). Analyzing sequential categorical data on dyadic
interaction: A comment on Gottman. Psychological Bulletin, 91, 393-403.
Anderson, T. W., & Goodman, L. A. (1957). Statistical inference about Markov chains.
Annals of Mathematical Statistics, 28, 89-110.
Altmann, J. (1974). Observational study of behaviour: Sampling methods. Behaviour, 49,
227-267.
Altmann, S. A. (1965). Sociobiology of rhesus monkeys. II. Stochastics of social commu-
nication. Journal of Theoretical Biology, 8, 490-522.
Attneave, F. (1959). Applications of information theory to psychology. New York: Henry
Holt.
Bakeman, R. (1978). Untangling streams of behavior: Sequential analysis of observation
data. In G. P. Sackett (Ed.), Observing behavior (Vol. 2): Data collection and analysis
methods (pp. 63-78). Baltimore: University Park Press.
Bakeman, R. (1983). Computing lag sequential statistics: The ELAG program. Behavior
Research Methods & Instrumentation, 15, 530-535.
Bakeman, R. (1992). Understanding social science statistics: A spreadsheet approach.
Hillsdale, NJ: Erlbaum.
Bakeman, R., & Adamson, L. B. (1984). Coordinating attention to people and objects in
mother-infant and peer-infant interaction. Child Development, 55, 1278-1289.
Bakeman, R., Adamson, L. B., & Strisik, P. (1989). Lags and logs: Statistical approaches to
interaction. In M. H. Bornstein & J. Bruner (Eds.), Interaction in human development
(pp. 241-260). Hillsdale, NJ: Erlbaum.
Bakeman, R., Adamson, L. B., & Strisik, P. (1995). Lags and logs: Statistical approaches to
interaction (SPSS Version). In J. M. Gottman (Ed.), The analysis of change (pp. 279-
308). Hillsdale, NJ: Erlbaum.
Bakeman, R., & Brown, J. V. (1977). Behavioral dialogues: An approach to the assessment
of mother-infant interaction. Child Development, 49, 195-203.
Bakeman, R., & Brownlee, J. R. (1980). The strategic use of parallel play: A sequential
analysis. Child Development, 51, 873-878.
Bakeman, R., & Brownlee, J. R. (1982). Social rules governing object conflicts in toddlers
and preschoolers. In K. H. Rubin & H. S. Ross (Eds.), Peer relationships and social
Chatfield, C. (1973). Statistical inference regarding Markov chain models. Applied Statis-
tics, 22, 7-20.
Cohen, J. A. (1960). Coefficient of agreement for nominal scales. Educational and Psycho-
logical Measurement, 20, 37-46.
Cohen, J. (1968). Weighted kappa: Nominal scale agreement with provision for scaled
disagreement or partial credit. Psychological Bulletin, 70, 213-220.
Cohen, J. (1977). Statistical power analysis for the behavioral sciences. New York: Aca-
demic Press.
Cohen, J., & Cohen, P. (1983). Applied multiple regression/correlation analysis for the
behavioral sciences. Hillsdale, NJ: Lawrence Erlbaum.
Cohn, J. F., & Tronick, E. Z. (1987). Mother-infant face-to-face interaction: The sequence
of dyadic states at 3, 6, and 9 months. Developmental Psychology, 23, 68-77.
Condon, W. S., & Ogston, W. D. (1967). A segmentation of behavior. Journal of Psychiatric
Research, 5, 221-235.
Conger, A. J., & Ward, D. G. (1984). Agreement among 2 x 2 agreement indices. Educa-
tional and Psychological Measurement, 44, 301-314.
Cook, J., Tyson, R., White, J., Rushe, R., Gottman, J., & Murray, J. (1995). The mathematics
of marital conflict: Qualitative dynamic modeling of marital interaction. Journal of
Family Psychology, 9, 110-130.
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability
of behavioral measurements: Theory of generalizability for scores and profiles. New
York: Wiley.
Crosbie, J. (1995). Interrupted time-series analysis with short series: Why it is problematic;
How it can be improved. In J. Gottman (Ed.), The analysis of change (pp. 361-398).
Hillsdale, NJ: Erlbaum.
Dabbs, J. M., Jr., & Swiedler, T. C. (1983). Group AVTA: A microcomputer system for
group voice chronography. Behavior Research Methods & Instrumentation, 15, 79-84.
Duncan, S., Jr., & Fiske, D. W. (1977). Face to face interaction. Hillsdale, NJ: Lawrence
Erlbaum.
Edgington, E. S. (1987). Randomization tests (2nd Ed.). New York: Marcel Dekker.
Ekman, P. W., & Friesen, W. (1978). Manual for the facial action coding system. Palo Alto,
CA: Consulting Psychologist Press.
Fienberg, S. E. (1980). The analysis of cross-classified categorical data. Cambridge, MA:
MIT Press.
Fleiss, J. L. (1981). Statistical methods for rates and proportions. New York: Wiley.
Fleiss, J. L. (1986). The design and analysis of clinical experiments. New York: John Wiley.
Fleiss, J. L., Cohen, J., & Everitt, B. S. (1969). Large sample standard errors of kappa and
weighted kappa. Psychological Bulletin, 72, 323-327.
Gardner, W. (1995). On the reliability of sequential data: Measurement, meaning, and
correction. In J. M. Gottman (Ed.), The analysis of change (pp. 339-359). Hillsdale,
NJ: Erlbaum.
Gart, J. J., & Zweifel, J. R. (1967). On the bias of various estimators of the logit and its
variance with application to quantile bioassay. Biometrika, 54, 181-187.
Garvey, C. (1974). Some properties of social play. Merrill-Palmer Quarterly, 20, 163-180.
Garvey, C., & Berndt, R. (1977). The organization of pretend play. In JSAS Catalog of
Selected Documents in Psychology, 7, 107. (Ms. No. 1589).
Good, P. (1994). Permutation tests: A practical guide to resampling methods for testing
hypotheses. New York: Springer-Verlag.
Goodenough, F. L. (1928). Measuring behavior traits by means of repeated short samples.
Journal of Juvenile Research, 12, 230-235.
Goodman, L. A. (1983). A note on a supposed criticism of an Anderson-Goodman test in
Markov chain analysis. In S. Karlin, T. Amemiya, & L. A. Goodman (Eds.), Stud-
ies in econometrics, time series, and multivariate statistics (pp. 85-92). New York:
Academic Press.
Gottman, J. M. (1979a). Marital interaction: Experimental investigations. New York:
Academic Press.
Gottman, J. M. (1979b). Time-series analysis of continuous data in dyads. In M. E. Lamb,
S. J. Sumoi, & G. R. Stephenson (Eds.), Social interaction analysis: Methodological
issues (pp. 207-229). Madison: University of Wisconsin Press.
Gottman, J. M. (1980a). Analyzing for sequential connection and assessing interobserver
reliability for the sequential analysis of observational data. Behavioral Assessment, 2,
361-368.
Gottman, J. M. (1980b). The consistency of nonverbal affect and affect reciprocity in marital
interaction. Journal of Consulting and Clinical Psychology, 48, 711-717.
Gottman, J. M. (1981). Time-series analysis: A comprehensive introduction for social
scientists. New York: Cambridge University Press.
Gottman, J. M. (1983). How children become friends. Monographs of the Society for Re-
search in Child Development, 48(3, Serial No. 201).
Gottman, J. M. (1990). Chaos and regulated change in family interaction. In P. Cowan and
E. M. Hetherington (Eds.), New directions in family research: transition and change.
Hillsdale, NJ: Erlbaum.
Gottman, J. M., & Bakeman, R. (1979). The sequential analysis of observational data. In
M. E. Lamb, S. J. Suomi, & G. R. Stephenson (Eds.), Social interaction analysis:
Methodological issues (pp. 185-206). Madison: University of Wisconsin Press.
Gottman, J. M., & Levenson, R. W. (1992). Marital processes predictive of later dissolution:
Behavior, physiology, and health. Journal of Personality and Social Psychology, 63,
221-233.
Gottman, J. M., Markman, H., & Notarius, C. (1977). The topography of marital conflict:
A sequential analysis of verbal and nonverbal behavior. Journal of Marriage and the
Family, 39, 461-477.
Gottman, J. M., & Parker, J. (Eds.) (1985). Conversations of friends: Speculations on
affective development. New York: Cambridge University Press.
Gottman, J. M., & Ringland, J. T. (1981). The analysis of dominance and bidirectionality
in social development. Child Development, 52, 393-412.
Gottman, J., Rose, E., & Mettetal, G. (1982). Time-series analysis of social interaction data.
In T. Field & A. Fogel (Eds.), Emotion and interactions (pp. 261-289). Hillsdale, NJ:
Lawrence Erlbaum.
Gottman, J. M., & Roy, A. K. (1990). Sequential analysis: A guide for behavioral research.
New York: Cambridge University Press.
Haberman, S. J. (1977). Log-linear models and frequency tables with small expected cell
counts. Annals of Statistics, 5, 1148-1169.
Haberman, S. J. (1978). Analysis of qualitative data (Vol. 1). New York: Academic Press.
Haberman, S. J. (1979). Analysis of qualitative data (Vol. 2). New York: Academic Press.
Hamilton, G. V. (1916). A study of perseverance reactions in primates and rodents. Behavior
Monographs, 3 (No. 2).
Hartmann, D. P. (1977). Considerations in the choice of interobserver reliability estimates.
Journal of Applied Behavior Analysis, 10, 103-116.
Hartmann, D. P. (1982). Assessing the dependability of observational data. In D. P. Hartmann
(Ed.)., Using observers to study behavior: New directions for methodology of social
and behavioral science (No. 14, pp. 51-65). San Francisco: Jossey-Bass.
Hartup, W. W. (1979). Levels of analysis in the study of social interaction: An historical
perspective. In M. E. Lamb, S. J. Suomi, & G. R. Stephenson (Eds.), Social interaction
analysis: Methodological issues (pp. 11-32). Madison: University of Wisconsin Press.
Hays, W. L. (1963). Statistics (1st ed.). New York: Holt, Rinehart, & Winston.
Hollenbeck, A. R. (1978). Problems of reliability in observational research. In G. P. Sackett
(Ed.), Observing behavior (Vol. 2): Data collection and analysis methods (pp. 79-98).
Baltimore: University Park Press.
Hubert, L. (1977). Kappa revisited. Psychological Bulletin, 84, 289-297.
Jaffe, J., & Feldstein, S. (1970). Rhythms of dialogue. New York: Academic Press.
Johnson, S. M., & Bolstad, O. D. (1973). Methodological issues in naturalistic observation:
Some problems and solutions for field research. In L. A. Hamerlynch, L. C. Handy,
& E. J. Mash (Eds.), Behavior change: Methodology, concepts, and practice (pp. 7 -
67). Champaign, IL: Research Press.
Jones, R. R., Reid, J. B., & Patterson, G. R. (1975). Naturalistic observations in clinical
assessment. In P. McReynolds (Ed.), Advances in psychological assessment (Vol. 3,
pp. 42-95). San Francisco: Jossey-Bass.
Kemeny, J. G., Snell, J. L., & Thompson, G. L. (1974). Introduction to finite mathematics.
Englewood Cliffs, NJ: Prentice-Hall.
Kennedy, J. J. (1983). Analyzing qualitative data: Introductory log-linear analysis for
behavioral research. New York: Praeger.
Kennedy, J. J. (1992). Analyzing qualitative data: Log-linear analysis for behavioral re-
search (2nd ed.). New York: Praeger.
Knoke, D., & Burke, P. J. (1980). Log-linear models. Newbury Park, CA: Sage.
Krokoff, L. (1983). The anatomy of negative affect in blue collar marriages. Unpublished
doctoral dissertation, University of Illinois at Urbana-Champaign.
Landesman-Dwyer, S. (1975). The baby behavior code (BBC): Scoring procedures and
definitions. Unpublished manuscript.
Losada, M., Sanchez, P., & Noble, E. E. (1990). Collaborative technology and group process
feedback: Their impact on interactive sequences in meetings. CSCW Proceedings,
53-64.
Miller, G. A., & Frick, F. C. (1949). Statistical behavioristics and sequences of responses.
Psychological Review, 56, 311-324.
Miller, R. G., Jr. (1966). Simultaneous statistical inference. New York: McGraw-Hill.
Morley, D. D. (1987). Revised lag sequential analysis. In M. L. McLaughlin (Ed.), Com-
munication year book (Vol. 10, pp. 172-182). Beverly Hills, CA: Sage.
Morrison, P., & Morrison, E. (1961). Charles Babbage and his calculating engines. New
York: Dover.
Noller, P. (1984). Nonverbal communication and marital interaction. Oxford: Pergamon
Press.
Oud, J. H., & Sattler, J. M. (1984). Generalized kappa coefficient: A Microsoft BASIC
program. Behavior Research Methods, Instruments, and Computers, 16, 481.
Overall, J. E. (1980). Continuity correction for Fisher's exact probability test. Journal of
Educational Statistics, 5, 177-190.
Parten, M. B. (1932). Social participation among preschool children. Journal of Abnormal
and Social Psychology, 27, 243-269.
Patterson, G. R. (1982). Coercive family process. Eugene, OR: Castalia Press.
Patterson, G. R., & Moore, D. (1979). Interactive patterns as units of behavior. In
M. E. Lamb, S. J. Sumoi, & G. R. Stephenson (Eds.), Social interaction analysis:
Methodological issues (pp. 77-96). Madison: University of Wisconsin Press.
Pedhazur, E. J., & Schmelkin, L. P. (1991). Measurement, design, and analysis: An inte-
grated approach. Hillsdale, NJ: Erlbaum.
Rechten, C., & Fernald, R. D. (1978). A sampled randomization test for examining single
cells of behavioural transition matrices. Behaviour, 69, 217-227.
Reynolds, H. T. (1984). Analysis of nominal data. Beverly Hills, CA: Sage.
Rosenblum, L. (1978). The creation of a behavioral taxonomy. In G. P. Sackett (Ed.), Observ-
ing behavior (Vol. 2): Data collection and analysis methods (pp. 15-24). Baltimore:
University Park Press.
Raush, H. L., Barry, W. A., Hertel, R. K., & Swain, M. A. (1974). Communication, conflict,
and marriage. San Francisco: Jossey-Bass.
Sackett, G. P. (1974). A nonparametric lag sequential analysis for studying dependency
among responses in observational scoring systems. Unpublished manuscript.
Sackett, G. P. (1978). Measurement in observational research. In G. P. Sackett (Ed.), Observ-
ing behavior (Vol. 2): Data collection and analysis methods (pp. 25-43). Baltimore:
University Park Press.
Sackett, G. P. (1979). The lag sequential analysis of contingency and cyclicity in behavioral
interaction research. In J. D. Osofsky (Ed.), Handbook of infant development (pp. 623-
649). New York: Wiley.
Sackett, G. P. (1980). Lag sequential analysis as a data reduction technique in social inter-
action research. In D. B. Sawin, R. C. Hawkins, L. O. Walker, & J. H. Penticuff (Eds.),
Exceptional infant (Vol. 4): Psychosocial risks in infant-environment transactions.
New York: Brunner/Mazel.
Shannon, C. E., & Weaver, W. (1949). The mathematical theory of communication. Urbana:
University of Illinois Press.
Shotter, J. (1978). The cultural context of communication studies: Theoretical and method-
ological issues. In A. Lock (Ed.), Action, gesture, and symbol: The emergence of
language (pp. 43-78). London: Academic Press.
Siegel, S. (1956). Nonparametric statistics for the behavioral sciences. New York:
McGraw-Hill.
Smith, P. K. (1978). A longitudinal study of social participation in preschool children:
Solitary and parallel play reexamined. Developmental Psychology, 14, 517-523.
Smith, P. K., & Connolly, K. J. (1972). Patterns of play and social interaction in preschool
children. In N. Blurton Jones (Ed.), Ethological studies of child behavior (pp. 65-95).
Cambridge: Cambridge University Press.
Soskin, W. F., & John, V. P. (1963). The study of spontaneous talk. In R. G. Barker (Ed.),
The stream of behavior: Explorations of its structure and content (pp. 228-287). New
York: Appleton-Century-Crofts.
Sroufe, L. A., & Waters, E. (1977). Attachment as an organizational construct. Child De-
velopment, 48, 1184-1199.
Stern, D.N. (1974). Mother and infant at play: The dyadic interaction involving facial,
vocal, and gaze behaviors. In M. Lewis & L. A. Rosenblum (Eds.), The effect of the
infant on its caregiver (pp. 187-213). New York: Wiley.
Suen, H. K. (1988). Agreement, reliability, accuracy, and validity: Toward a clarification.
Behavioral Assessment, 10, 343-366.
Suomi, S. J. (1979). Levels of analysis for interactive data collected on monkeys living in
complex social groups. In M. E. Lamb, S. J. Suomi, G. R. Stephenson (Eds.), Social
interaction analysis: Methodological issues (pp. 119-135). Madison: University of
Wisconsin Press.
Suomi, S. J., Mineka, S., & DeLizio, R. D. (1983). Short- and long-term effects of repetitive
mother-infant separations on social development in rhesus monkeys. Developmental
Psychology, 19, 770-786.
Taplin, P. S., & Reid, J. B. (1973). Effects of instructional set and experimenter influence
on observer reliability. Child Development, 44, 547-554.
Tapp, J., & Walden, T. (1993). PROCODER: A professional tape control coding and anal-
ysis system for behavioral research using videotape. Behavior Research Methods,
Instruments, & Computers, 25, 53-56.
Tronick, E. D., Als, H., & Brazelton, T. B. (1977). Mutuality in mother-infant interaction.
Journal of Communication, 27, 74-79.
Tronick, E., Als, H., & Brazelton, T. B. (1980). Monadic phases: A structural descriptive
analysis of infant-mother face to face interaction. Merrill-Palmer Quarterly, 26, 3-24.
Tuculescu, R. A., & Griswold, J. G. (1983). Prehatching interactions in domestic chickens.
Animal Behavior, 31, 1-10.
Uebersax, J. S. (1982). A generalized kappa coefficient. Educational and Psychological
Measurement, 42, 181-183.
Upton, G. J. G. (1978). The analysis of cross-tabulated data. New York: Wiley.
Wallen, D., & Sykes, R. E. (1974). Police IV: A code for the study of police-civilian in-
teraction. (Available from Minnesota Systems Research, 2412 University Ave., Min-
neapolis, MN 55414.)
Wampold, B. E. (1989). Kappa as a measure of pattern in sequential data. Quality and
Quantity, 23, 171-187.
Wampold, B. E. (1992). The intensive examination of social interaction. In T. R. Kratochwill
& J. R. Levin (Eds.), Single-case research design and analysis: New directions for
psychology and education (pp. 93-131). Hillsdale, NJ: Erlbaum.
Wickens, T. D. (1989). Multiway contingency tables analysis for the social sciences. Hills-
dale, NJ: Erlbaum.
Wickens, T. D. (1993). Analysis of contingency tables with between-subjects variability.
Psychological Bulletin, 113, 191-204.
Wiggins, J. S. (1973). Personality and prediction. Reading, MA: Addison-Wesley.
Williams, E., & Gottman, J. (1981). A user's guide to the Gottman-Williams time-series
programs. New York: Cambridge University Press.
Wolff, P. (1966). The causes, controls, and organization of the neonate. Psychological
Issues, 5 (whole No. 17).
Index
adjusted residuals, see z scores
agreement matrix, 61-62, 62f, 64f, 73f
agreement, observer, see observer agreement
Allison-Liker z score, see z score, Allison-Liker's formula
alpha, Cronbach's generalizability, see Cronbach's generalizability alpha
autocorrelation and time series, 163
Barker, Roger, 185
Bernoulli, Daniel, 158
Bonferroni's correction, 148
Brahe, Tycho, 184
Couples Interaction Scoring System (CISS), 189
Cronbach's generalizability alpha, 75-76, 76f
Cross-classified events
  agreement for, 74
  analysis of, 179-182
  data format for, 87, 88f, 90
  describing, 177-179
  recording of, 49-50, 54t
cyclicity in time series, 162
discrete events, see momentary events
duration behaviors, see duration events
duration events, 38-39