My research is focused on the development and use of computer models to aid in understanding how the shapes, sizes, and movements of both the voice source components and the vocal tract contribute to the sounds of speech.
Journal of the Acoustical Society of America, Oct 1, 2020
A recently developed model of speech production [Story & Bunton, JASA, 146(4), 2522–2528] was used to generate VCVs that were examined with regard to both articulation and identification of the consonant. In this model, an utterance is generated by specifying relative acoustic events along a time axis. These events consist of directional changes of the vocal tract resonance frequencies called resonance deflection patterns (RDPs) that, when associated with a temporal event function, are transformed, via acoustic sensitivity functions, into time-varying modulations of the vocal tract shape. RDPs specifying /b/, /d/, and /g/ would typically be coded as [−1 −1 −1], [−1 1 1], and [−1 1 −1], respectively, indicating, from left to right, the targeted directional shift of the first, second, and third resonances of the vocal tract. In this study, two types of V1CV2 continua were constructed in three vowel contexts (/i, a, u/) by incrementing in small steps (1) the second resonance deflection from −1 to 1, and (2) the third resonance deflection from 1 to −1. The resulting time-varying vocal tract shapes emulate expected articulation patterns for the stop consonants, and a perceptual experiment indicated that listeners identify the consonants based on the polarity of RDP values.
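As a rough illustration of how such a continuum could be parameterized (not the actual Story & Bunton implementation), the RDP endpoints can be interpolated componentwise; the function name and step count below are assumptions for the sketch:

```python
import numpy as np

# Hypothetical sketch: stepping RDP components between two consonant targets
# to build a perceptual continuum, e.g. /d/ = [-1, 1, 1] toward /g/ = [-1, 1, -1]
# by moving the third resonance deflection from 1 to -1 in small increments.
# Step size and linear interpolation are assumptions, not the published model.

def rdp_continuum(start, end, n_steps):
    """Linearly interpolate each RDP component between two endpoint vectors."""
    start, end = np.asarray(start, float), np.asarray(end, float)
    return [tuple(np.round(start + t * (end - start), 3))
            for t in np.linspace(0.0, 1.0, n_steps)]

# continuum (2) from the study: third resonance deflection from 1 to -1
for step in rdp_continuum([-1, 1, 1], [-1, 1, -1], 5):
    print(step)
```

In the model proper, each RDP would then be paired with a temporal event function and mapped through acoustic sensitivity functions to a time-varying area function; only the pattern coding is shown here.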
Journal of the Acoustical Society of America, Mar 1, 2018
During speech development, a child’s vocal tract undergoes changes due to growth of anatomic structures. Such changes typically lower the formant frequencies, reshaping the [F1,F2] vowel space. Much of what is known about vowel space change, however, is based on cross-sectional formant measurements averaged over children in various age groups. The purpose of this study was to characterize changes in the vowel space of four children between the ages of 2 and 6 years. Longitudinally collected audio recordings of four children (2F, 2M) were selected from the Arizona Child Acoustic Database. Each child had been recorded every four months from ages 2-6 years, and produced a variety of words, phrases, vowel-vowel progressions, and occasional spontaneous speech. Formant frequencies (F1 and F2 only) were measured from the recordings using a spectral filtering technique. At each age increment, the formant frequencies for each child were plotted as vowel space density, where the “density” dimension indicates the relative tendency of a talker to produce sound in a particular region of the vowel space. The change in location and shape of the density cloud during this period of development will be demonstrated. [Research supported by NIH R01-DC011275, NSF BCS-1145011.]
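One simple way to realize a "vowel space density" map, sketched below under the assumption that density means a normalized 2-D histogram over measured (F1, F2) pairs; the bin counts and frequency ranges are illustrative choices, and the study's spectral-filtering formant tracker is not reproduced:

```python
import numpy as np

# Hedged sketch: vowel space density as a normalized 2-D histogram of
# (F1, F2) measurements. Each cell holds the relative frequency with which
# the talker's formants fall in that region of the space.

def vowel_space_density(f1_hz, f2_hz, f1_range=(200, 1200),
                        f2_range=(500, 3500), bins=40):
    hist, f1_edges, f2_edges = np.histogram2d(
        f1_hz, f2_hz, bins=bins, range=[f1_range, f2_range])
    density = hist / hist.sum()   # relative tendency; cells sum to 1
    return density, f1_edges, f2_edges

# synthetic formant samples standing in for one child's measurements
rng = np.random.default_rng(0)
f1 = rng.normal(500, 80, 1000)
f2 = rng.normal(1500, 250, 1000)
density, _, _ = vowel_space_density(f1, f2)
print(density.shape, round(float(density.max()), 3))
```

Tracking how the high-density region of this grid shifts across age increments would show the lowering and reshaping of the vowel space described above.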
Journal of the Acoustical Society of America, Oct 1, 2017
A fundamental aspect of teaching, on any topic, is the continual pursuit of telling a story. Although technology and advances in teaching methods may facilitate new and exciting forms of presenting course materials, they do not, by themselves, build the context for the content of a course. Every lecture, activity, homework assignment, project, quiz, and examination can be regarded as chapters that build, over the duration of a course, a compelling and engaging story in which students take part. The aim of this talk is to encourage development of speech science courses that weave together history, theory, technology, visual and auditory experience, assessment, and, importantly, the instructor’s own research to spin a good tale. [Work supported by NIH R01-DC011275 and NSF BCS-1145011.]
Journal of the Acoustical Society of America, Oct 1, 2016
A model of a child-like vocal tract has been developed such that the deformation patterns superimposed on a vowel substrate to generate coarticulated consonants are specified by a time-varying set of directional shifts in the first three resonance frequencies. These deflection patterns are denoted as a combination of three numbers, each of which can vary between -1 and 1; a negative value implies a downward shift in a resonance frequency, whereas an upward shift results for a positive value. For example, a “bilabial” consonant specified as [-1,-1,-1] would be transformed via calculations of acoustic sensitivity functions to a time-varying vocal tract shape that presents the expected constriction at the lips, but also modifies other parts of the vocal tract that may be necessary for producing the appropriate formant transitions into and out of the consonant. Using this model, three sets of 30 VCV utterances were generated in which the values of deflection patterns were set to produce vocal tract shapes that hypothetically produce the stop consonants /b/, /d/, and /g/ embedded in three different vowel-vowel contexts. A perceptual experiment was performed to test their identification by listeners. [Work supported by NIH R01-DC011275 and NSF BCS-1145011.]
Journal of the Acoustical Society of America, Apr 1, 2012
The human singing and speech spectrum includes energy above 5 kHz, but this portion of the spectrum is typically ignored in speech and voice science. Generally it has been assumed that this high-frequency energy (HFE) contributes to only qualitative percepts of singing and speech, but prior work shows HFE contributes to several non-qualitative percepts, including speech intelligibility. To begin an in-depth exploration of HFE, a database of multi-channel anechoic high-fidelity recordings of singers and talkers was created and analyzed. Third-octave band analysis from the long-term average spectra (LTAS) showed that production level (soft vs. normal vs. loud), production mode (singing vs. speech), and phoneme (for voiceless fricatives) all significantly affected HFE characteristics. Female HFE levels were significantly greater than male levels only above 11 kHz. As expected, HFE was found to be highly directional toward the front of the singer/talker. While this information resulted from a study initially focused on singing voice aesthetics, it is pertinent to various areas of acoustics, including vocal tract modeling, voice synthesis, augmentative hearing technology (hearing aids and cochlear implants), cell phone technology, and training/therapy for singing and speech. [Work supported by NIH-NIDCD.]
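The third-octave band analysis mentioned above can be sketched in a few lines, assuming band edges at fc·2^(±1/6) around each center frequency; the toy spectrum and the specific center frequencies below are placeholders, not the study's data:

```python
import numpy as np

# Minimal sketch of third-octave band levels computed from a long-term
# average spectrum (LTAS). Band edges at fc * 2**(+-1/6) are a standard
# convention; the HFE region of interest here spans roughly 5-20 kHz.

def third_octave_levels(freqs_hz, power_spectrum, centers_hz):
    """Sum spectral power within each third-octave band; return dB levels."""
    levels = []
    for fc in centers_hz:
        lo, hi = fc * 2 ** (-1 / 6), fc * 2 ** (1 / 6)
        band = (freqs_hz >= lo) & (freqs_hz < hi)
        power = power_spectrum[band].sum()
        levels.append(10 * np.log10(power + 1e-20))  # guard against log(0)
    return np.array(levels)

freqs = np.linspace(0, 20000, 4001)        # 5 Hz spacing
spectrum = 1.0 / (1.0 + freqs / 1000.0)    # toy LTAS, not measured data
centers = [5000 * 2 ** (k / 3) for k in range(7)]   # ~5 to 20 kHz
levels = third_octave_levels(freqs, spectrum, centers)
print(np.round(levels, 1))
```

Comparing such band levels across production level, mode, and phoneme is the kind of contrast the study reports for the HFE region.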
We agree with Cristina Romani (CR) about reducing confusion and agree that the issues raised in her commentary are central to the study of apraxia of speech (AOS). However, CR critiques our approach from the perspective of basic cognitive neuropsychology. This is confusing and misleading because, contrary to CR’s claim, we did not attempt to inform models of typical speech production. Instead, we relied on such models to study the impairment in the clinical category of AOS (translational cognitive neuropsychology). Thus, the approach along with the underlying assumptions is different. This response aims to clarify these assumptions, broaden the discussion regarding the methodological approach, and address CR’s concerns. We argue that our approach is well-suited to meet the goals of our recent studies and is commensurate with the current state of the science of AOS. Ultimately, a plurality of approaches is needed to understand a phenomenon as complex as AOS.
Journal of the Acoustical Society of America, Mar 1, 2019
Previous research on stop consonant production found that less than 60% of the stops sampled from a connected speech corpus contained a clearly defined hold duration followed by a plosive release [Crystal and House, JASA (1988)]. How listeners perceive reduced, voiced stop consonant variants is not well understood. The purpose of the current study was to investigate whether an acoustic cue called a relative formant deflection pattern was capable of predicting listeners’ perceptions of these approximant-like, voiced stop consonant variants. A new methodology motivated by a computational model of speech production was used to extract relative formant deflection patterns from excised VCV segments from a reduced speech database. Participants listened to a total of 56 excised VCV stimuli containing approximant-like, voiced stop consonant variants and performed a forced-choice test (i.e., /b-d-g/). The agreement between the perceptions predicted by the relative formant deflection patterns and listeners’ behavioral performance was compared. The expected relative formant deflection pattern correctly predicted listeners' primary response for percent /b/ and /g/ identifications, but not for listeners’ percent /d/ identifications. The implications of these results for a possible invariant acoustic correlate of listeners’ perceptions of place-of-articulation information will be discussed.
Journal of the Acoustical Society of America, Oct 1, 2016
Models have long been used to understand the relation of anatomical structure and articulatory movement to the acoustics and perception of speech. Realized as speech synthesizers or artificial talkers, such models simplify and emulate the speech production system. One type of simplification is to view speech production as a set of simultaneously imposed modulations of the airway system. Specifically, the vibratory motion of the vocal folds modulates the glottal airspace, while slower movements of the tongue, jaw, lips, and velum modulate the shape of the pharyngeal and oral cavities, and coupling to the nasal system. The precise timing of these modulations produces an acoustic wave from which listeners extract phonetic and talker-specific information. The first aim of the presentation will be to review two historical models of speech production that exemplify a system in which structure is modulated with movement to produce intelligible speech. The second aim is to describe theoretical aspects of a computational model that allows for simulation of speech based on precise spatio-temporal modulations of an airway structure. The result is a type of artificial talker that can be used to study various aspects of how sound is generated by a speaker and perceived by a listener.
PURPOSE Studies on medical and behavioral interventions for essential vocal tremor (EVT) have shown inconsistent effects on acoustical and perceptual outcome measures across studies and across participants. Remote acoustical and perceptual assessments might facilitate studies with larger samples of participants and repeated measures that could clarify treatment effects and identify optimal treatment candidates. Furthermore, remote acoustical and perceptual assessment might allow clinicians to monitor clients' treatment responses and optimize treatment approaches during telepractice. Thus, the purpose of this study was to evaluate the accuracy of remote signal transmission and recording for acoustical and perceptual assessment of EVT. METHOD Simulations of EVT were produced using a computational model and were recorded using local and remote procedures to represent client- and clinician-end recordings, respectively. Acoustical analyses measured the extent and rate of fundamental frequency (fo) and intensity modulation to represent vocal tremor severity and the cepstral peak prominence (CPPS) to represent voice quality. The data were analyzed using repeated measures analysis of variance (ANOVA) with recording as the within-subjects factor and sex of the computational model as the between-subjects factor. RESULTS There was a significant main effect of recording on the rate of fo modulation and significant interactions of recording and sex for the extent of intensity modulation, rate of intensity modulation, and CPPS. Post hoc pairwise comparisons and analysis of effect size indicated that recording procedures had the largest effect on the extent of intensity modulation for male simulations, the rate of intensity modulation for male and female simulations, and the CPPS for male and female simulations.
Although all known software and computer audio-enhancement options were disabled and stable Ethernet connections were used, remote recordings showed inconsistent attenuation of signal amplitude that was most problematic for samples with a breathy voice quality but also affected samples with typical and pressed voice qualities. CONCLUSIONS Acoustical measures that correlate to perception of vocal tremor and voice quality were altered by remote signal transmission and recording. In particular, signal transmission and recording in Zoom altered time-based estimates of intensity modulation and CPPS with male and female simulations of EVT and magnitude-based estimates of intensity modulation with male simulations of EVT. In contrast, signal transmission and recording in Zoom minimally altered time- and magnitude-based estimates of fo modulation with male and female simulations of EVT. Therefore, acoustical and perceptual assessments of EVT should be performed using audio recordings that are collected locally on the participant- or client-end, particularly when measuring modulation of intensity and CPP or estimating vocal tremor severity and voice quality. Development of procedures for collecting local audio recordings in remote settings may expand data collection for treatment research and enhance telepractice.
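The tremor measures compared above can be illustrated with a simple sketch, assuming (as one plausible definition, not necessarily the study's) that modulation rate is the strongest low-frequency spectral peak of the demeaned fo contour and that extent is half the peak-to-trough excursion in percent of the mean fo:

```python
import numpy as np

# Illustrative estimator of tremor rate (Hz) and extent (%) from a sampled
# fo contour. The 2-15 Hz search band is an assumed typical tremor range.

def tremor_rate_extent(fo_contour, frame_rate_hz):
    fo = np.asarray(fo_contour, float)
    mean_fo = fo.mean()
    spec = np.abs(np.fft.rfft(fo - mean_fo))
    freqs = np.fft.rfftfreq(fo.size, d=1.0 / frame_rate_hz)
    band = (freqs >= 2.0) & (freqs <= 15.0)
    rate_hz = freqs[band][np.argmax(spec[band])]
    extent_pct = 100.0 * (fo.max() - fo.min()) / (2.0 * mean_fo)
    return rate_hz, extent_pct

# synthetic contour: 5 Hz tremor, 4% extent around 200 Hz, 100 frames/s, 4 s
t = np.arange(0, 4, 0.01)
fo = 200.0 * (1.0 + 0.04 * np.sin(2 * np.pi * 5.0 * t))
print(tremor_rate_extent(fo, 100.0))
```

The study's finding is that local and Zoom-transmitted recordings of the same simulation can yield different values from measures like these, especially for intensity modulation and CPPS.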
Journal of the Acoustical Society of America, Sep 1, 2022
The harmonics-to-noise ratio (HNR) and other spectral noise parameters are important in clinical objective voice assessment as they could indicate the presence of nonharmonic phenomena, which are tied to the perception of hoarseness or breathiness. Existing HNR estimators assume the voice signal to be nearly periodic (fixed over a short period), although voice pathology could induce involuntary slow modulation that violates this assumption. This paper proposes the use of a deterministically time-varying harmonic model to improve the HNR measurements. To estimate the time-varying model, a two-stage iterative least squares algorithm is proposed to reduce model overfitting. The efficacy of the proposed HNR estimator is demonstrated with synthetic signals, simulated tremor signals, and recorded acoustic signals. Results indicate that the proposed algorithm can produce consistent HNR measures as the extent and rate of tremor are varied.
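A minimal sketch of the underlying idea, under strong simplifying assumptions: fit harmonics of a known fo whose amplitudes vary as low-order polynomials in time via a single least-squares solve, then take HNR as modeled-to-residual power. The paper's two-stage iterative estimator also refines the model and controls overfitting; none of that is reproduced here.

```python
import numpy as np

# Time-varying harmonic fit: basis columns t**p * cos/sin(2*pi*k*fo*t)
# allow each harmonic's amplitude to drift linearly in time, so a slowly
# modulated voice is not misclassified as noise by a fixed-period model.

def hnr_time_varying(x, fs, fo, n_harm=5, poly_order=1):
    t = np.arange(x.size) / fs
    cols = []
    for k in range(1, n_harm + 1):
        for p in range(poly_order + 1):
            cols.append(t ** p * np.cos(2 * np.pi * k * fo * t))
            cols.append(t ** p * np.sin(2 * np.pi * k * fo * t))
    A = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(A, x, rcond=None)
    harmonic = A @ coef
    noise = x - harmonic
    return 10 * np.log10(harmonic.var() / noise.var())

# test signal: harmonic part with slowly growing amplitude, plus white noise
fs, f0 = 8000, 200.0
t = np.arange(0, 0.5, 1 / fs)
rng = np.random.default_rng(1)
x = (1 + 0.5 * t) * np.sin(2 * np.pi * f0 * t) \
    + 0.05 * rng.standard_normal(t.size)
hnr = hnr_time_varying(x, fs, f0)
print(round(hnr, 1))
```

A strictly periodic model would fold the amplitude drift into the "noise" term and underestimate HNR; letting the amplitudes vary keeps the estimate stable as tremor extent and rate change, which is the consistency property the abstract reports.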
Journal of the Acoustical Society of America, Oct 1, 2019
Laryngeal vocal tremor (VT) is a neurogenic voice disorder characterized by modulation of the fundamental frequency (fo) and intensity. The primary medical treatment for VT is laryngeal botulinum toxin injections, which result in temporarily reduced speaker- and listener-perceived VT severity. These injections also cause temporary breathiness, which is conventionally considered an adverse effect of the treatment. However, previous studies using a computational model of VT revealed that listeners perceived modulated voices as less “shaky” when the vocal quality was breathy, even when the extent of fo modulation was the same. The purpose of the current study is to assess the effect of breathiness on listener perception of VT across a range of modulation extents. A kinematic model of the vocal folds and wave-reflection model of the vocal tract were used to simulate VT with degrees of vocal fold adduction representing a spectrum of normal to breathy voice and with fo modulation extents ranging from 0%–10%. Normal-hearing listeners will be presented with pairs of stimuli differing by degree of vocal fold adduction and will be asked to identify which vowel is “shakier.” The findings of this study could inform selection of treatment targets and candidates for behavioral therapy for VT.
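To make the stimulus dimension concrete, here is a toy parameterization of the modulated fo contour and a phase-integrated tone, a stand-in only; the actual stimuli come from a kinematic vocal fold model driving a wave-reflection vocal tract, which is far more involved:

```python
import numpy as np

# Toy sketch of the fo modulation varied in the study: a sinusoidal tremor
# at a fixed rate with extent stepped from 0% to 10%. The 5 Hz rate below
# is an assumed typical tremor rate, not a value from the abstract.

def modulated_fo(fo_hz, extent_pct, rate_hz, dur_s, fs):
    t = np.arange(0, dur_s, 1 / fs)
    return fo_hz * (1 + extent_pct / 100.0 * np.sin(2 * np.pi * rate_hz * t))

def synth_tone(fo_contour, fs):
    phase = 2 * np.pi * np.cumsum(fo_contour) / fs   # integrate fo -> phase
    return np.sin(phase)

fs = 16000
for extent in (0, 5, 10):                            # percent fo modulation
    fo = modulated_fo(200.0, extent, 5.0, 1.0, fs)
    tone = synth_tone(fo, fs)
    print(extent, round(fo.max(), 1), round(fo.min(), 1))
```

Pairing such stimuli at matched modulation extent but different adduction (breathiness) settings is the comparison listeners make in the "shakier" judgment task.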