Research Article
Altering Context Speech Rate Can Cause
Words to Appear or Disappear
Psychological Science
21(11) 1664–1670
© The Author(s) 2010
Reprints and permission:
sagepub.com/journalsPermissions.nav
DOI: 10.1177/0956797610384743
http://pss.sagepub.com
Laura C. Dilley1,2,3,4 and Mark A. Pitt5
1Department of Communicative Sciences and Disorders and 2Department of Psychology, Michigan State University;
3Department of Psychology and 4Department of Communication Sciences and Disorders, Bowling Green
State University; and 5Department of Psychology, Ohio State University
Abstract
Speech is produced over time, and this makes sensitivity to timing between speech events crucial for understanding language.
Two experiments investigated whether perception of function words (e.g., or, are) is rate dependent in casual speech, which
often contains phonetic segments that are spectrally quite reduced. In Experiment 1, talkers spoke sentences containing a target
function word; slowing talkers’ speech rate around this word caused listeners to perceive sentences as lacking the word (e.g.,
leisure or time was perceived as leisure time). In Experiment 2, talkers spoke matched sentences lacking a function word; speeding
talkers’ speech rate around the region in which the function word had been embedded in Experiment 1 caused listeners to
perceive a function word that was never spoken (e.g., leisure time was perceived as leisure or time). The results suggest that
listeners formed expectancies based on speech rate, and these expectancies influenced the number of words and word
boundaries perceived. These findings may help explain the robustness of speech recognition when speech signals are distorted
(e.g., because of a casual speaking style).
Keywords
spoken-word recognition, casual speech, speech rate, word segmentation
Received 12/14/09; Revision accepted 4/8/10
The perception of spoken words is thought to depend largely
on recovery of phonemic cues from frequency-specific (spectral) information (e.g., Marslen-Wilson & Welsh, 1978). Yet
recognition of spoken words can be remarkably accurate when
spectral cues are missing or severely distorted, as occurs,
for example, in sine-wave speech (Remez, Rubin, Pisoni, &
Carrell, 1981), noise-vocoded speech (Shannon, Zeng, Kamath,
Wygonski, & Ekelid, 1995), or auditory chimeras (Smith,
Delgutte, & Oxenham, 2002). This fact suggests that temporal
information is crucial for accurate spoken-word recognition.
Despite compelling demonstrations, progress has been slow in
understanding the role of speech timing (i.e., speech rate,
speech duration, hierarchical rhythmic structure, etc.) in word
recognition (Davis, Marslen-Wilson, & Gaskell, 2002; Salverda,
Dahan, & McQueen, 2003). In this study, we used natural,
undegraded speech to show that entire words can seem to disappear or appear as a function of their context speech rate (i.e.,
the rate of all speech preceding and following the region of
interest). The results have implications for understanding how
adults and infants recognize spoken words and segment those
words from speech, issues that are long-standing puzzles in
the literature (e.g., Klatt, 1980).
We hypothesized that speech timing plays a decisive role in
perceiving a word when the word’s frequency spectrum shows
substantial overlap or blending with the spectrum of adjacent
words; this pervasive phenomenon is known as coarticulation.
Coarticulation of adjacent words can sometimes be so severe
that spectral information is insufficient to identify whether a
given word is present in the speech stream, let alone where that
word begins. This is especially true for short, high-frequency
words, such as function words like or and and (Bell et al.,
2003; Shockey, 2003).
We reasoned that when coarticulation of words is severe,
the presence of a word could be conveyed by the duration of
the blended phonemes relative to context speech rate. A typical case of heavy coarticulation is shown in Figure 1. The
spectrum for the word or, spoken in its reduced form as er,
blends almost totally with that of the preceding syllable -sure
Corresponding Author:
Laura C. Dilley, Michigan State University, Department of Communicative
Sciences and Disorders and Department of Psychology, 116 Oyer Building,
East Lansing, MI 43403
E-mail: ldilley@msu.edu
[Figure 1: spectrogram. x-axis: Time (s), 0 to 0.76; y-axis: Frequency (kHz), 0 to 5.0.]
Fig. 1. Spectrogram illustrating heavy coarticulation of a function word in the
phrase leisure or time. Phonemic content is shown as time-aligned International
Phonetic Alphabet symbols at the top of the figure. The arrow on the x-axis
indicates the approximate start of the function word or; note the utter absence
of discontinuity marking the start of this word and the lack of clear cues
differentiating the function word spectrally from the preceding syllable.
in the phrase leisure or time. Thus, there is a relatively homogeneous span of spectral material for most of the two syllables:
-(s)ure or. We hypothesized that such a span is heard as two
syllables because it is too long relative to the context speech
rate to contain a single syllable. If this interpretation is correct,
slowing down the context speech rate should make the span
sound relatively shorter, like a single syllable, causing the
function word to disappear and the phrase to be heard as leisure time instead of leisure or time.
One reason for thinking that speech rate might alter perception of a word’s presence is that context speech rate affects the
boundary between spectrally related phonemes (e.g., /p/ vs. /b/;
see Liberman, Delattre, Gerstman, & Cooper, 1956; Miller &
Liberman, 1979) and between singleton and geminate segments (Fujisaki, Nakamura, & Imoto, 1975; Pickett & Decker,
1960). We hypothesized that context speech rate could also
affect the perceived presence of larger morphophonological
units (i.e., words or syllables). This possibility stems from proposals concerning entrainment to temporal sequences; according to such proposals, an auditory event (e.g., a tone or syllable)
of a given duration can be heard as corresponding to different
rhythms (i.e., different numbers of “beats” or onsets). This correspondence would depend on the rate or rhythm of surrounding auditory events (e.g., Large & Jones, 1999; McAuley, 1995;
Port, 2003; Povel & Essens, 1985; Saltzman & Byrd, 2000).
Rate normalization has been proposed as one mechanism
behind speech-rate effects on phoneme boundaries (Miller &
Liberman, 1979; Pisoni, Carrell, & Gans, 1983; Sawusch &
Newman, 2000). Generalizing this account on the basis of
entrainment, we hypothesized that listeners entrain to context
speech and that this process affects the number of morphophonological units (words, syllables, segments) perceived in a
given stretch of speech. According to this generalized-rate-normalization account, the lexical content (number of words)
in a spectrally ambiguous stretch of speech depends on the
duration of that content relative to the surrounding speech
rate, as well as other information, such as grammatical
content. When a coarticulated stretch of speech is long relative
to its surroundings, the listener should perceive a function
word, because doing so is plausible given the rate cues, as
well as higher-level information (e.g., semantic and syntactic
context).
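The reasoning behind this account can be illustrated with a deliberately simple toy calculation. It is only a sketch, not a model proposed in this article: it assumes that listeners count how many context-rate "beats" fit into an ambiguous span, and the function name, the span duration (0.36 s), and the context rate (5 syllables per second) are hypothetical values chosen for illustration. The factor 1.9 is the time-expansion factor reported in the Method section.

```python
import math


def perceived_units(span_duration_s, context_rate_units_per_s):
    """Toy estimate of how many units a listener might hear in a span.

    Assumption (not from the article): the listener rounds the number of
    context-rate beats that fit in the span, and always hears at least one.
    """
    expected_beats = span_duration_s * context_rate_units_per_s
    return max(1, math.floor(expected_beats + 0.5))  # round half up


# Hypothetical values for illustration only.
span = 0.36          # duration of the ambiguous span, in seconds
context_rate = 5.0   # context speech rate, in syllables per second

# Unaltered context: the span is long enough to hold two syllables.
print(perceived_units(span, context_rate))

# Context slowed by the expansion factor of 1.9: the same span now fits
# roughly one beat, so the function word would "disappear".
print(perceived_units(span, context_rate / 1.9))
```

Under these assumed numbers the same span yields two units at the original context rate and one unit when the context is slowed, which is the direction of the effect predicted for Experiment 1.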
In two experiments, we tested whether the number of morphophonological units—here, the number of function words—
is dependent on the duration of a given stretch of speech
relative to the context speech rate, given grammatically viable
contexts. For Experiment 1, we predicted that if context speech
rate is made slow relative to a stretch of speech containing a
function word—either by slowing down the context speech
rate or by speeding up the stretch of speech itself—then that
stretch of speech should be perceived as short and as containing fewer phonological units (i.e., fewer words). For Experiment 2, we predicted that if context events are made fast
relative to a stretch of speech not containing a function word—
either by speeding up the context speech rate or by slowing
down the stretch of speech under analysis—then the stretch of
speech should be perceived as relatively long and thus as containing an additional phonological unit (i.e., a function word
that was never spoken).
Experiment 1
Method
Participants. Participants (N = 41) were young, American
English speakers from the Midwest United States. All reported
that they had normal hearing.
Materials. We constructed 50 sentences containing a critical
function word embedded in a phonetic context expected to
show heavy coarticulation with the function word; Table 1
lists fragments of these sentences used in experimental materials. Each sentence consisted of a grammatically acceptable
beginning, whether or not the critical function word was present; that is, the span from the beginning of the sentence until
just after the critical function word was grammatical, even if
the function word was not present. For example, Deena doesn’t
have any leisure or time is a grammatically acceptable beginning for a sentence, even if the word or is missing.
Recordings of experimental stimuli were elicited from 29
speakers of American English from the Midwest United States.
All but the last word of a sentence was presented on a computer
screen in front of the speaker. The last word was presented
only after the sentence had been erased for 1.5 s, at which point
the speaker had to recite the sentence into a head-mounted
microphone. Instructions emphasized accuracy in repeating
the sentence verbatim; this procedure suggested that the
experiment was investigating memory. Because they were not
given any instructions pertaining to speech clarity, speakers
Table 1. Fragments Constituting the Speech Stimuli in
Experiments 1 and 2
Taylor knew the principal and teacher (are) from Ohio . . .
Conor knew that bread and butter (are) both . . .
Frank thinks that sadness and anger (are) both . . .
Claire said that sour and bitter (are) both . . .
Chris said his mother and father (are) both . . .
Zach knew that there (are) things . . .
George thought my father and brother (are) like [good] . . .
Glenn thought his friend and neighbor (are) like plenty . . .
Ruth saw the maid and butler (are) at the top . . .
Rose knew that there (are) lamps . . .
The company moved to (a) different . . .
Trent might get to (a) certain . . .
Clay thinks that would be (a) good . . .
The Smiths wouldn’t buy (a) Butterball . . .
Anne wanted to see (a) very funny . . .
It makes no sense to obey (a) petty . . .
It takes a lot of work to review (a) personal . . .
It costs a lot to tattoo (a) pink . . .
The boy wanted to glue (a) broken . . .
Dave asked how long it takes to repay (a) large . . .
Aspirin and other painkillers are (our) drugs . . .
The Murrays are (our) favorite . . .
The callers are (our) French contacts . . .
Mom said these are (our) gray gloves . . .
The accountants are (our) wise advisors . . .
Phil and Mary are (our) young cousins . . .
The leaves fell after (her) green . . .
The manager hid the candy before (her) six kids . . .
The sign was replaced after (her) black . . .
The message was clear after (her) blank . . .
Chris was very quick after (her) sharp . . .
The Perrys thought carefully after (her) wise advice . . .
The value went up after (her) rich neighbors . . .
People were offended after (her) rude . . .
The Smiths were shocked after (her) weird . . .
Deena doesn’t have any leisure (or) time . . .
Anyone must be a minor (or) child . . .
Marty gave him a dollar (or) twenty last week . . .
George turned left at the river (or) bank . . .
Sally sold all her silver (or) jewelry last month . . .
Don must see the harbor (or) boats . . .
Fred would rather have a summer (or) lake . . .
Steve pitched the ball to center (or) left . . .
They promised him the future (or) aid . . .
Susan said those are (our) black socks . . .
Jake didn’t vote for the member (or) constituent . . .
Jack reported trouble before (her) two children . . .
These documents are (our) fake . . .
Those tickets are (our) late entries . . .
These houses are (our) best . . .
Note: All fragments were grammatically acceptable beginnings of sentences. Parentheses indicate that the word was present in the Experiment
1 item but not in the Experiment 2 item, and square brackets indicate the
reverse.
spoke naturally after adjusting to the task. An additional 70
sentences served as fillers intended to increase variety (e.g.,
length and structure). Visual presentation and audio recording
of items were controlled by Presentation software (Version
12.1; Neurobehavioral Systems, Albany, CA).
We identified the 12 speakers who produced the fewest
speech errors and disfluencies, as well as the fewest glottal
onsets in critical function words, because continuous formant
transitions across function words were desired; we used
recordings from multiple talkers to increase the generalizability of the results across speakers. A single token recording was
selected for each item using the following criteria: (a) The
function words are, or, our, and her were spoken as [ɚ] and a
was spoken as [ə], (b) the token recording showed continuous
formant transitions across the critical function word plus the
preceding syllable rhyme, and (c) the recording contained no
hesitations or disfluencies.1
Fragments consisting of the grammatically acceptable
beginnings of the sentences were taken from the token recordings and divided into target portion and context portion using
spectrogram and waveform displays. The target corresponded
to the critical function word plus the preceding syllable and
the following phoneme (e.g., -sure or t-). The context corresponded to all speech preceding and following the target (e.g.,
Deena doesn’t have any lei- . . . -ime).
Contiguous target and context regions were spliced out at
zero crossings, where there was no energy in the waveform, and
subjected to time manipulation using the Pitch-Synchronous
Overlap and Add (PSOLA) algorithm in Praat software
(Boersma & Weenink, 2002). Target and context regions were
then recombined to create four conditions (Fig. 2); this method
kept intact the spectral detail of the speech and altered only the
speech rate. In the normal-rate condition, the entire fragment
was presented at the spoken rate. In the slowed-context condition, the context was slowed through time expansion, and the
target was presented at the spoken rate. (The target was thus
acoustically identical to the target in the normal-rate condition.) In the speeded-target condition, the target was speeded
through time compression, and the context was presented at
the spoken rate. In the speeded-target-plus-context condition,
both the target and the context portions were speeded to the
same degree through time compression. The time-compression
factor was 0.6, and the time-expansion factor was 1.9. Each
filler item was likewise speeded, slowed, or left unaltered in
rate; the materials included approximately equal numbers of
fillers at these three rates. All stimuli were then amplitude-normalized to 70 dB sound-pressure level (SPL).
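The four conditions can be summarized by how each one scales the target relative to its context. The sketch below assumes that both factors multiply duration (the Results section states that the compressed fragment was 60% of its original duration); the dictionary and function names are hypothetical.

```python
COMPRESSION = 0.6  # time-compression factor reported in the text
EXPANSION = 1.9    # time-expansion factor reported in the text

# (target duration factor, context duration factor) per condition.
CONDITIONS = {
    "normal rate": (1.0, 1.0),
    "slowed context": (1.0, EXPANSION),
    "speeded target": (COMPRESSION, 1.0),
    "speeded target plus context": (COMPRESSION, COMPRESSION),
}


def relative_target_duration(condition):
    """Target duration relative to context, with the normal rate equal to 1."""
    target_factor, context_factor = CONDITIONS[condition]
    return target_factor / context_factor


for name in CONDITIONS:
    print(f"{name}: {relative_target_duration(name):.2f}")
# normal rate: 1.00
# slowed context: 0.53
# speeded target: 0.60
# speeded target plus context: 1.00
```

On this reading, the slowed-context and speeded-target conditions shorten the target relative to its context by similar amounts (0.53 and 0.60), whereas speeding both portions restores the original ratio.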
Design and procedure. The experiment consisted of 120 trials, presented in a single session. Fifty trials contained experimental fragments of interest, and the remainder contained
filler items. Participants were randomly assigned to one of
four lists, with an approximately equal number of participants
hearing each list. Each list contained 12 items in each of three
[Figure 2: waveforms of the sample fragment Deena doesn’t have any leisure or time . . . in four panels: Normal Rate, Slowed Context, Speeded Target, and Speeded Target and Context.]
Fig. 2. Waveforms of a sample time-altered stimulus across the four conditions of Experiment 1. The sections of the waveform without background shading
correspond to the target region, which consisted of the critical function word (or) plus the preceding syllable and the following phoneme (-sure or t-).
rate conditions and 14 items in the fourth rate condition. The
pairing of items with conditions was counterbalanced across
the four lists. Each participant heard each experimental fragment only once, in one of the four rate conditions.
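One way to obtain lists with this shape is to split the 50 items into blocks of 12, 12, 12, and 14 and rotate the blocks through the conditions. The sketch below is an assumption about how such counterbalancing could be implemented, not a description of the procedure actually used; the condition labels and function name are hypothetical.

```python
from collections import Counter

CONDITIONS = [
    "normal_rate",
    "slowed_context",
    "speeded_target",
    "speeded_target_plus_context",
]
BLOCK_SIZES = [12, 12, 12, 14]  # sums to the 50 experimental items


def build_lists():
    """Return four lists, each mapping item index -> rate condition."""
    # Split item indices 0..49 into consecutive blocks.
    blocks = []
    start = 0
    for size in BLOCK_SIZES:
        blocks.append(list(range(start, start + size)))
        start += size

    # Rotate the block-to-condition pairing once per list.
    lists = []
    for shift in range(len(CONDITIONS)):
        assignment = {}
        for block_index, block in enumerate(blocks):
            condition = CONDITIONS[(block_index + shift) % len(CONDITIONS)]
            for item in block:
                assignment[item] = condition
        lists.append(assignment)
    return lists


lists = build_lists()
for number, assignment in enumerate(lists, start=1):
    print(number, sorted(Counter(assignment.values()).values()))
```

With this rotation, every list has 12 items in three conditions and 14 in the fourth, and each item appears in each condition exactly once across the four lists.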
The experiment began with 20 filler trials. The remaining
trials were presented to all participants in the same random
order. Participants were instructed to listen carefully to each
fragment and to play it back as often as necessary to produce a
veridical transcription of what they heard, typing their response
using a computer keyboard. Stimuli were presented over studio-quality headphones at a comfortable listening level.
Results and discussion
The frequency of transcribing a function word in the target
region was scored. Responses that did not include, at a minimum, a transcription of the target region plus the following
syllable were discarded (6% of trials).2 For remaining trials,
function-word presence was coded as 1, and function-word
absence was coded as 0.
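The scoring rule amounts to averaging the 0/1 codes within each condition after dropping discarded responses. A minimal sketch follows; the trial data are invented for illustration, and the use of None to mark a discarded response is an assumption.

```python
from collections import defaultdict


def percent_reports(trials):
    """Percentage of function-word reports per condition.

    trials: iterable of (condition, code) pairs, where code is 1 (word
    reported), 0 (word not reported), or None (response discarded).
    """
    totals = defaultdict(int)
    reports = defaultdict(int)
    for condition, code in trials:
        if code is None:  # discarded response, excluded from scoring
            continue
        totals[condition] += 1
        reports[condition] += code
    return {c: 100.0 * reports[c] / totals[c] for c in totals}


# Invented example data, not results from the study.
example = [
    ("normal rate", 1), ("normal rate", 1), ("normal rate", 0),
    ("normal rate", None),
    ("slowed context", 0), ("slowed context", 1),
]
print(percent_reports(example))
```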
Figure 3a shows that reports of the critical function word
depended on the relative rate of the target and the context. In
the normal-rate condition, function-word reports were quite
high; the fact that they were not at ceiling is expected given
that the speech was casually spoken. It is critical to note that
a comparison of reports in the normal-rate condition and the
slowed-context condition showed that merely slowing the
context surrounding a function word caused the rate of
function-word reports to drop by more than half, from 79% to
33%, even though the target regions containing the function
word were acoustically identical. An equally dramatic reduction in function-word reports, relative to the normal-rate
condition, was found in the speeded-target condition, in
which the target region was speeded and the context was
unaltered.
Function-word reports in the speeded-target-plus-context
condition rebounded to close to their original levels in the normal-rate condition. That this mean did not reach the mean for the
normal-rate condition is likely due to an overall drop in recognition accuracy associated with the substantial compression factor
(the compressed fragment was 60% of its original duration). A
repeated measures one-way analysis of variance was significant
by subjects, F1(3, 120) = 48.34, p < .001, η2 = .55, and by stimulus items, F2(3, 147) = 54.99, p < .001, η2 = .53. Post hoc two-tailed, paired-sample t tests with Bonferroni correction showed
that all conditions differed significantly from one another in both
by-subjects analyses and by-items analyses (ps < .01), except
that the difference between the speeded-target condition and the
slowed-context condition was not significant.
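The reported degrees of freedom and effect sizes can be checked with a short calculation. This sketch assumes that the reported η2 values are partial eta squared, which the text does not state explicitly; the function names are hypothetical.

```python
def error_df(n_units, n_conditions):
    """Error degrees of freedom for a one-way repeated measures ANOVA."""
    return (n_units - 1) * (n_conditions - 1)


def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F ratio and its df."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# 41 participants and 50 items, each crossed with 4 rate conditions.
print(error_df(41, 4), error_df(50, 4))               # 120 147

print(round(partial_eta_squared(48.34, 3, 120), 2))   # 0.55 (by subjects)
print(round(partial_eta_squared(54.99, 3, 147), 2))   # 0.53 (by items)
```

Both values agree with the reported effect sizes, which is consistent with the partial-eta-squared reading.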
These results support the predictions of the generalized-rate-normalization account: Making a stretch
of speech containing a function word fast relative to its context
reduced the number of morphophonological units perceived.
This perceptual change was accomplished by either slowing
down the context speech rate or speeding up the stretch of
speech containing the target. That listeners could be induced to
hear fewer morphophonological units implies that manipulating context speech rate induced listeners to hear fewer phonemes and fewer word boundaries than were actually spoken,
a finding that has implications for word segmentation.
Experiment 2
A further test of the generalized-rate-normalization account is
whether listeners can be made to hear more morphophonological units than were actually produced. This possibility was
tested using fragments similar to those in Experiment 1, except
for one minor (but crucial) change: The critical function word
was never spoken. On the basis of the generalized-rate-normalization account, we predicted that if context events
were made fast relative to a stretch of speech that does not
contain a function word, either by speeding up context speech
rate or by slowing down the stretch of speech in question,
then the stretch of speech would be perceived as relatively
long and as containing more morphophonological units, even
though those units were never spoken.
[Figure 3: bar graphs in two panels, a and b. y-axis in both panels: Function Words Reported (%), 0 to 100. x-axis, panel a: Condition (Normal Rate, Slowed Context, Speeded Target, Speeded Target + Context). x-axis, panel b: Condition (Normal Rate, Speeded Context, Slowed Target, Slowed Target + Context).]
Fig. 3. Mean percentage of function words that participants reported
hearing in (a) Experiment 1 and (b) Experiment 2, as a function of condition. In
Experiment 1, the function words were spoken. In Experiment 2, the function
words were not spoken. Error bars indicate standard errors of the mean.
Method
Participants. Characteristics of the participants (N = 69) were
the same as in Experiment 1.
Materials. We constructed sentences that had the same grammatical beginnings as in Experiment 1, but lacked the critical
function word (e.g., Deena doesn’t have any leisure time).
Filler items were the same as in Experiment 1.
Recordings of experimental and filler sentences were
obtained from 23 speakers using the elicitation task described
in Experiment 1. From these recordings, we identified the 15
speakers who produced the fewest speech errors and disfluencies, and a single token recording was selected for each sentence.3 Fragments consisting of the grammatically acceptable
beginning of each sentence were spliced out of the recordings;
each fragment was then divided into target and context portions. The target region was bounded by the same phoneme
string as in Experiment 1, but the function word was not present (e.g., -sure t- in Deena doesn’t have any leisure time). The
context portion corresponded to all speech material preceding
and following the target portion.
Four speech-rate conditions were created from the fragments. In the normal-rate condition, the entire fragment was
presented at the spoken rate. In the speeded-context condition,
the context was speeded, but the target was presented at the
normal rate. In the slowed-target condition, the target was
slowed, but the context was presented at the normal rate.
Finally, in the slowed-target-plus-context condition, both the
target and the context portions were slowed to the same degree.
The time-compression factor was 0.6, and the time-expansion
factor was 1.9. After alteration, target and context portions were
concatenated in the proper order, and stimuli were amplitude-normalized to 70 dB SPL. The design and procedure were
identical to those in Experiment 1.
Results and discussion
The frequency of transcribing a function word in the target
region was scored. Responses that did not include, at a minimum, a transcription of the target region plus the following
syllable were discarded from analysis (7% of trials).4
We obtained clear evidence that an alteration of speech rate
can induce listeners to hear a function word (Fig. 3b). In the
normal-rate condition (baseline), participants seldom (3% of
the time) reported a function word in the target region—an
expected finding because critical function words were never
spoken. However, speeding the context surrounding the target
caused an 8-fold increase in the rate of reporting a function
word, even though the target was identical in the two conditions. Slowing the target resulted in a similar, 5-fold increase
in the rate of hearing a function word that was never spoken.
As in Experiment 1, when the context and the target were
time-altered together (in this case, slowed), reports of the
function word returned to the level found in the normal-rate
condition. A repeated measures one-way analysis of variance
with rate condition as the factor was significant by subjects,
F1(3, 204) = 60.82, p < .001, η2 = .47, and by stimulus items,
F2(3, 147) = 25.50, p < .001, η2 = .34. Post hoc tests showed
that all conditions differed significantly from one another in
both by-subjects analyses and by-items analyses (ps < .01),
except for the normal-rate condition compared with the
slowed-target-plus-context condition.
These results provide strong evidence for the generalized-rate-normalization hypothesis that context speech rate affects
whether listeners perceive a word. In this experiment, listeners
were made to perceive a function word that was never spoken.
By implication, context speech rate affected the number of
phonemes and word boundaries that listeners perceived.
Thus, Experiment 2 replicated and extended the findings of
Experiment 1.
General Discussion
The current experiments provide new insight into how timing
information is used in speech perception. In Experiment 1,
sentence fragments containing a critical function word were
heard as having fewer such words when context speech rate
was slowed. In Experiment 2, matched sentences in which the
critical function word was never spoken were heard as containing function words when context speech rate was speeded.
These experiments were based on the generalized-rate-normalization hypothesis, according to which the number of
perceived morphophonological units in a stretch of speech
depends on the duration of that stretch relative to the speech
rate of the context in which it is embedded. These experiments
indicate that listeners used context speech rate to help them
decode spectrally ambiguous portions of the speech stream,
and this process aided listeners in perceiving spoken words
and in segmenting those words from the speech context.5
These studies are the first to show that context speech rate
can modulate whether an entire word is perceived. The duration of a stretch of speech relative to its context speech rate
also modulated the number of phonemes and implied word
onsets perceived as present. These findings have implications
for the important and unsolved problem of how infants and
adults identify word onsets in connected speech (Cutler,
Mehler, Norris, & Segui, 1983; Mattys, White, & Melhorn,
2005; Thiessen, Hill, & Saffran, 2005). Note that our speech-rate manipulations were several phonemes distant from the
variably perceived function word; this contrasts with previous
findings that speech-rate manipulations in the immediate
vicinity of a to-be-perceived phoneme had an effect, but more
distant manipulations typically had no effect (e.g., Sawusch &
Newman, 2000).
These findings suggest that information on relative speech
rate aids in interpreting ambiguous spectral cues and helps
listeners identify and segment spoken words. How words and
word boundaries are so robustly perceived when spectral
cues are unclear remains poorly understood (Ernestus,
Baayen, & Schreuder, 2002; Pitt, 2009). Our experiments
suggest that word recognition depends in part on relative rate
cues provided by speech context, and this study adds to a
growing body of work showing that prosodic properties of
speech context influence lexical recognition and word segmentation (Dilley, Mattys, & Vinke, 2010; Dilley & McAuley,
2008; Gout, Christophe, & Morgan, 2004; Salverda et al.,
2003).
More generally, the results demonstrate the rapid and seamless integration of signal-based cues (spectral, temporal) and
knowledge-based cues (syntactic, semantic) during spoken-word recognition. In this regard, our speech-rate phenomenon,
particularly the results of Experiment 2, can be viewed as a
temporal version of phonemic restoration, in which listeners
readily restore phonemes in words whose acoustic evidence
has been replaced by noise (Samuel, 2001). In phonemic restoration, sentential context biases perception, and such higher-level biases are likely at work in the effect we observed; they
may be a precondition for the effect, for example.
Compared with studying spectral cues, studying how timing
information is used in speech perception has proven challenging. The present results provide one answer to the puzzle of
how reduced, spectrally attenuated syllables and words are recognized and segmented from continuous speech; rate normalization via temporal entrainment to speech rate provides a
possible explanation for these findings. In the absence of clear
spectral information, timing information becomes increasingly
important in conveying the message intended by the speaker.
Acknowledgments
We thank Delphine Dahan, Sven Mattys, Arthur Samuel, and an
anonymous reviewer for useful feedback on the manuscript. Also, we
thank Victoria Hoover, Michael Tat, Andrea Hulme, Chris Heffner, and
Claire Carpenter for help with data acquisition and data analysis.
Declaration of Conflicting Interests
The authors declared that they had no conflicts of interest with
respect to their authorship or the publication of this article.
Funding
This work was supported by Grant DC004330 from the National
Institute on Deafness and Other Communication Disorders to M.A.P.
and by Grant 0847653 from the National Science Foundation
Division of Behavioral and Cognitive Sciences to L.C.D.
Notes
1. To select similar speaking rates across talkers, we first determined
the grand mean duration of the two syllables preceding the critical
function word across all sentences produced by these speakers. After
other selection criteria had been applied, a token recording of a given
sentence was selected from the talker who produced the two syllables
preceding the function word with a duration that was minimally different from the grand mean duration. This resulted in different talkers providing different numbers of experimental items (M = 4.2, range = 0–13,
with 11 talkers represented in the final experimental stimulus set).
2. Results were identical when these responses were included.
3. The selection procedure for controlling speaking rate across sentences
was identical to the procedure used in Experiment 1. This procedure
resulted in different talkers providing different numbers of experimental
items (M = 3.3, range = 0–8, with 13 talkers represented in the final set).
4. Results were identical when these responses were included.
5. A competing hypothesis that speech-rate mismatches per se were
responsible for effects on function-word perception is untenable, for
several reasons. According to one version of this hypothesis, function
words were not perceived when the rate across stimuli mismatched.
However, in Experiment 2, more, not fewer, function words were perceived when the rate mismatched than when it matched. Thus, this
hypothesis cannot account for the data across both experiments.
The results also do not support a weaker version of the hypothesis,
that speech-rate mismatches cause reductions in general intelligibility associated with different illusory lexical percepts, depending on
the veridical grammatical properties of fragments. To test this latter
version of the hypothesis, we conducted additional analyses of transcription accuracy of phonemes in words preceding the critical function word in matching conditions (Experiment 1: normal rate, speeded
target plus context; Experiment 2: normal rate, slowed target plus
context) and mismatching conditions (Experiment 1: slowed context,
speeded target; Experiment 2: speeded context, slowed target).
Results revealed no difference in transcription accuracy between the
matching and mismatching conditions in Experiment 1 (Mmatch =
94%, Mmismatch = 95%), paired-samples t(49) = 1.13, p = .27.
Although a significant difference was found in Experiment 2 (Mmatch =
97%, Mmismatch = 93%), paired-samples t(49) = 6.21, p < .001, the
size of the change (4%) was much smaller than the rise in rates of
function-word reports (13%–21%). This result further suggests that
such a rate effect cannot account for differences in function-word
perception.
References
Bell, A., Jurafsky, D., Fosler-Lussier, E., Girand, C., Gregory, M.,
& Gildea, D. (2003). Effects of disfluencies, predictability, and
utterance position on word form variation in English conversation. Journal of the Acoustical Society of America, 113,
1001–1024.
Boersma, P., & Weenink, D. (2002). Praat: Doing phonetics by computer (Version 4.0.26) [Computer software]. Retrieved August 6,
2010, from http://www.praat.org
Cutler, A., Mehler, J., Norris, D., & Segui, J. (1983). A language-specific comprehension strategy. Nature, 304, 159–160.
Davis, M.H., Marslen-Wilson, W.D., & Gaskell, M.G. (2002). Leading up the lexical garden path: Segmentation and ambiguity in
spoken word recognition. Journal of Experimental Psychology:
Human Perception and Performance, 28, 218–244.
Dilley, L., Mattys, S., & Vinke, L. (2010). Potent prosody: Comparing
the effects of distal prosody, proximal prosody, and semantic context on word segmentation. Journal of Memory and Language,
63, 274–294.
Dilley, L.C., & McAuley, J.D. (2008). Distal prosodic context affects
word segmentation and lexical processing. Journal of Memory
and Language, 59, 294–311.
Ernestus, M., Baayen, H., & Schreuder, R. (2002). The recognition of
reduced word forms. Brain and Language, 81, 162–173.
Fujisaki, H., Nakamura, K., & Imoto, T. (1975). Auditory perception of duration of speech and non-speech stimuli. In G. Fant &
M.A.A. Tatham (Eds.), Auditory analysis and perception of speech
(pp. 197–219). London, England: Academic Press.
Gout, A., Christophe, A., & Morgan, J. (2004). Phonological phrase
boundaries constrain lexical access: II. Infant data. Journal of
Memory and Language, 51, 547–567.
Klatt, D.H. (1980). Speech perception: A model of acoustic-phonetic analysis and lexical access. In R.A. Cole (Ed.), Perception
and production of fluent speech (pp. 243–288). Hillsdale, NJ:
Erlbaum.
Large, E.W., & Jones, M.R. (1999). The dynamics of attending: How
people track time-varying events. Psychological Review, 106,
119–159.
Liberman, A.M., Delattre, P., Gerstman, L., & Cooper, F.S. (1956).
Tempo of frequency change as a cue for distinguishing classes of
speech sounds. Journal of Experimental Psychology, 52, 127–137.
Marslen-Wilson, W.D., & Welsh, A. (1978). Processing interactions
during word recognition in continuous speech. Cognition, 10,
487–509.
Mattys, S.L., White, L., & Melhorn, J.F. (2005). Integration of multiple speech segmentation cues: A hierarchical framework. Journal
of Experimental Psychology: General, 134, 477–500.
McAuley, J.D. (1995). Perception of time as phase: Toward an
adaptive-oscillator model of rhythmic pattern processing. Unpublished doctoral dissertation, Indiana University, Bloomington.
Miller, J.L., & Liberman, A.M. (1979). Some effects of later-occurring
information on the perception of stop consonant and semivowel.
Perception & Psychophysics, 25, 457–465.
Pickett, J.M., & Decker, L.R. (1960). Time factors in perception of a
double consonant. Language and Speech, 3, 11–17.
Pisoni, D.B., Carrell, T.D., & Gans, S.J. (1983). Perception of the
duration of rapid spectrum changes in speech and nonspeech signals. Perception & Psychophysics, 34, 314–322.
Pitt, M.A. (2009). How are pronunciation variants of spoken words
recognized? A test of generalization to newly learned words.
Journal of Memory and Language, 61, 19–36.
Port, R.F. (2003). Meter and speech. Journal of Phonetics, 31, 599–611.
Povel, D.J., & Essens, P. (1985). Perception of temporal patterns.
Music Perception, 2, 411–440.
Remez, R.E., Rubin, P.E., Pisoni, D.B., & Carrell, T.D. (1981).
Speech perception without traditional speech cues. Science, 212,
947–949.
Saltzman, E., & Byrd, D. (2000). Task-dynamics of gestural timing:
Phase windows and multifrequency rhythms. Human Movement
Science, 19, 499–526.
Salverda, A.P., Dahan, D., & McQueen, J. (2003). The role of prosodic boundaries in the resolution of lexical embedding in speech
comprehension. Cognition, 90, 51–89.
Samuel, A.G. (2001). Knowing a word affects the fundamental perception of the sounds within it. Psychological Science, 12, 348–351.
Sawusch, J.R., & Newman, R.S. (2000). Perceptual normalization for
speaking rate II: Effects of signal discontinuities. Perception &
Psychophysics, 62, 285–300.
Shannon, R.V., Zeng, F.G., Kamath, V., Wygonski, J., & Ekelid, M.
(1995). Speech recognition with primarily temporal cues. Science, 270, 303–304.
Shockey, L. (2003). Sound patterns of spoken English. Cambridge,
England: Blackwell.
Smith, Z.M., Delgutte, B., & Oxenham, A. (2002). Chimaeric sounds
reveal dichotomies in auditory perception. Nature, 416, 87–90.
Thiessen, E.D., Hill, E.A., & Saffran, J.R. (2005). Infant-directed
speech facilitates word segmentation. Infancy, 7, 53–71.