The Journal of the Acoustical Society of America, 2017
Vowel duration is determined by a number of factors in American English (e.g., Klatt, 1976), including the tense vs. lax distinction (e.g., /i/-/ɪ/, /u/-/ʊ/) and the relation between duration and vowel height (e.g., Lehiste, 1970). A large body of research has identified segment-internal factors in data averaged over many speakers (e.g., Crystal and House, 1988; Hillenbrand et al., 1995), but few studies have investigated the inherent vowel duration patterns of individuals (cf. House, 1961). Stressed vowel productions (>2 million tokens) were identified by forced alignment in connected speech recordings of 390 speakers (209 female) from the Mixer 6 corpus (Chodroff et al., 2016). For each speaker, mean durations of ten stressed vowels (/i ɪ e ɛ æ a ʌ o ʊ u/) were calculated after outlier exclusion. The resulting duration patterns were strongly correlated across pairs of speakers (Pearson r: mean = 0.936, 95% CI [0.935, 0.936], range [0.559, 0.999]), and PCA identified a single component, plausibly indexin...
Information structure is said to play an important role in determining phrasal prominence and the assignment of nuclear pitch accents in English. Early accounts claim that discourse-new or focused words receive a prominence-lending high/rising pitch accent, while given words are unaccented, with reduced prominence. Empirical findings are varied, but paint a more complex picture of the prosodic encoding of information structure. The present study investigated the phonological and phonetic encoding of information status and contrastive focus in nuclear position in American English, from speech read under neutral and lively affect. Given information was associated with decreased phonological and phonetic prominence, contrastive information with enhanced prominence, while new information corresponded to increased phonological, but not phonetic prominence, as assessed in pitch accent type, duration, intensity, and voice quality. The findings indicate a probabilistic relationship between ...
Corpus phonetics has become an increasingly popular method of research in linguistic analysis. With advances in speech technology and computational power, large scale processing of speech data has become a viable technique. This tutorial introduces the speech scientist and engineer to various automatic speech processing tools. These include acoustic model creation and forced alignment using the Kaldi Automatic Speech Recognition Toolkit (Povey et al., 2011), forced alignment using FAVE-align (Rosenfelder et al., 2014), the Montreal Forced Aligner (McAuliffe et al., 2017), and the Penn Phonetics Lab Forced Aligner (Yuan & Liberman, 2008), as well as stop consonant burst alignment using AutoVOT (Keshet et al., 2014). The tutorial provides a general overview of each program, step-by-step instructions for running the program, as well as several tips and tricks.
The Mixer series of speech corpora were collected over several years, principally to support annual NIST evaluations of speaker recognition (SR) technologies. These evaluations focused on conversational speech over a variety of channels and recording conditions. One of the series, Mixer-6, added a new condition, read speech, to support basic scientific research on speaker characteristics, as well as technology evaluation. With read speech it is possible to make relatively precise measurements of phonetic events and features, which can be correlated with the performance of speaker recognition algorithms, or directly used in phonetic analysis of speaker variability. The read speech, as originally recorded, was adequate for large-scale evaluations (e.g., fixed-text speaker ID algorithms) but only marginally suitable for acoustic-phonetic studies. Numerous errors due largely to speaker behavior remained in the corpus, with no record of their locations or rate of occurrence. We undertook...
Non-native speech production is frequently characterized by its deviation from native pronunciation. Among segments, previous work has largely focused on describing the separation between native and non-native speakers at the level of individual phonetic categories. An additional hallmark of L1 pronunciation is the presence of systematic relationships within and among phonetic categories. For example, mean voice onset times (VOT) strongly covary among aspirated stop consonants across L1 speakers of American English. The present study examined whether L2 English speakers from various L1 backgrounds differ from native speakers in the relationship of VOT among word-initial /ptk/. Despite differences in the overall realization, L2 speakers resembled native English speakers in the degree of VOT covariation between stop-specific means and variances, as well as between /ptk/. These findings have important implications for the perception of accented speech, as listeners could employ structu...
Listeners are highly proficient at adapting to variation in the speech signal. The present study examined short-term generalized adaptation to sibilant fricatives based on preceding speech and non-speech contexts. We explored three theoretically motivated accounts of generalization: a phonetic cue-based adaptation account, a phonetic covariation-based adaptation account, and an auditory contrast-based adaptation account. Under the cue-based adaptation account, listeners adapt to a talker-specific realization for each phonetic dimension (or cue); under the covariation-based account, listeners exploit the empirical covariation of phonetic cues among speech sounds and across talkers; under the contrast-based account, short-term adaptation can be accounted for based on local contrasts in adjacent auditory spectra. The spectral center of gravity, a phonetic cue to fricative identity, was manipulated for several types of context sound: /z/, /v/, and white noise matched in the long-term av...
The Journal of the Acoustical Society of America, 2019
Processes of talker recognition and adaptation rely on a high degree of inter-talker phonetic variability and systematicity, respectively. While superficially in opposition, talker recognition in part depends on adaptation to the talker at hand. In this talk, we present evidence that talker variability is simultaneously extensive and structured within natural classes of speech sounds. In American English, talker mean peak frequencies for [s] span over 3000 Hz, but the variation in [s] is not independent of that in [z]: strong correlations of the talker mean peak frequency, among other phonetic dimensions, are observed between sibilant fricatives. Covariation among speech sounds indicates mutual predictability, such that evidence from one speech sound could be used to refine estimates or make predictions about a second. Listeners indeed demonstrate perceptual knowledge of covariation in generalized adaptation to novel talkers. After exposure to a talker with a relatively high- or low-peak frequency [z], listeners adjusted their [s]-[ʃ] boundary in accordance with the empirical covariation. As talker recognition entails estimation of a talker’s phonetic parameters, prior perceptual knowledge of covariation could be used to refine estimation of multiple speech sounds from minimal exposure, thus accelerating processes of talker adaptation and recognition.
Papers by Eleanor Chodroff