
On the linearity of the relationship between the sound pressure level and the negative peak amplitude of the differentiated glottal flow in vowel production

Speech Communication, 1999
Speech Communication 28 (1999) 269–281
www.elsevier.nl/locate/specom

Paavo Alku (a,*), Juha Vintturi (b), Erkki Vilkman (c)

(a) Acoustics Laboratory, Helsinki University of Technology, P.O. Box 3000, Fin-02015 Helsinki, TKK, Finland
(b) Department of Otolaryngology and Phoniatrics, Helsinki University Central Hospital, Helsinki, Finland
(c) Department of Otolaryngology and Phoniatrics, University of Oulu, Finland

Received 20 July 1998; received in revised form 22 February 1999; accepted 14 April 1999

(*) Corresponding author. E-mail: paavo.alku@hut.fi
0167-6393/99/$ – see front matter © 1999 Elsevier Science B.V. All rights reserved. PII: S0167-6393(99)00020-5

Abstract

The negative peak amplitude of the differentiated glottal flow (d_peak) is known to correlate strongly with the sound pressure level (SPL) of speech. Therefore, the function between d_peak and SPL has conventionally been modeled as a single line. In this study, the linearity of the function between d_peak and SPL is revisited by analyzing glottal flows that were inverse filtered from speech sounds of widely different intensities. It is shown that SPL–d_peak graphs can be modeled more accurately by using two linear functions, the first of which models soft phonation, and the second of which models normal and loud speech sounds. For all of the analyzed SPL–d_peak graphs, the slope of the modeling line matching soft phonation was larger than the slope of the line for normal and loud speech. This result suggests that vocal intensity is affected not only by the single amplitude domain value of the voice source, d_peak, but also by the shape of the differentiated glottal flow near the instant of the negative peak. © 1999 Elsevier Science B.V. All rights reserved.

Keywords: Voice source analysis; Inverse filtering; Intensity regulation

1. Introduction

The role of the glottal source in regulating vocal intensity has been studied extensively during the past decades. Research in this area typically begins with estimating the glottal volume velocity waveform by inverse filtering (Holmberg et al., 1988; Gauffin and Sundberg, 1989; Dromey et al., 1992; Titze and Sundberg, 1992; Sundberg et al., 1993; Sulter and Wit, 1996). The resulting estimates of the glottal flow are then parameterized in order to express the most important features of the source waveform in a compressed form. Information on voice production can be extracted both from the estimated glottal flow directly and from its first derivative. The voice source can be parameterized by using, for example, the so-called time-based parameters of the glottal flow (i.e., certain ratios between the closed phase, the opening phase and the closing phase of the glottal cycle) (Holmberg et al., 1988; Vilkman et al., 1997). It is also possible to extract information from the estimated glottal waveforms by measuring their amplitude features (i.e., level of the maximum flow or level of the DC-flow) (Hertegård et al., 1990, 1992; Hillman et al., 1990).
Characteristics of the glottal source can also be quantified in a compressed form with a frequency domain approach by measuring, for instance, the spectral decay of the glottal excitation (Childers and Lee, 1991; Titze and Sundberg, 1992; Alku et al., 1997).

The key factor behind intensity regulation of speech is subglottal pressure (Bouhuys et al., 1968; Titze, 1994). Increasing the intensity of speech implies raising subglottal pressure. However, other factors of speech production, such as glottal adduction and formant frequencies, also contribute to intensity regulation (Sundberg et al., 1993). Increasing vocal intensity by raising subglottal pressure in general amplifies the AC-flow of a glottal pulse, but it also makes the shape of the glottal pulse sharper so that it contains more energy at high frequencies. In the frequency domain this means that the intensity of a soft sound is primarily determined by the fundamental, whereas the intensity of normal and loud speech is greatly affected by the overtones near the first formant (Gramming and Sundberg, 1988; Sundberg et al., 1993).

Increasing subglottal pressure affects not only the shape of a single glottal pulse but also the repetition rate of the glottal pulses, i.e., the fundamental frequency (F0) of the voice. Many studies have been published on the relationship between F0 and loudness (e.g., Gramming et al., 1988; Titze and Sundberg, 1992). Gramming et al. (1988) reported that the mean pitch increased by about half a semitone when intensity was increased by one decibel. According to Gramming et al. (1988), the increased value of F0 can be considered a passive result of raising subglottal pressure in order to produce louder sounds. However, F0 per se can also have an important role in intensity regulation of speech when formant tuning is used (Sundberg, 1990; Titze, 1994). In formant tuning, F0 of the voice is adjusted so that one of its lowest harmonics almost coincides with the first formant (F1).
Consequently, the level of the speech spectrum at the first formant is amplified, which increases the overall intensity. Increasing vocal intensity with formant tuning calls for large values of F0. Hence, formant tuning is a method of intensity regulation that is used mainly in singing or in production of high-pitched voices.

One of the most important parameters of the glottal flow, closely related to vocal intensity, is the negative peak amplitude of the differentiated glottal flow (d_peak in Fig. 1). (This parameter is also called the maximum flow declination rate (Holmberg et al., 1988).) The behavior of d_peak as a function of the SPL of speech was studied by Gauffin and Sundberg (1989). The main finding of their study was that there is a strong linear correlation between d_peak (when expressed in dB units) and SPL (also expressed in dB units). The strong correlation between d_peak and SPL can be explained using Fant's source-filter theory of speech production (Fant, 1960), as follows. According to the source-filter theory, production of voiced speech is modeled by three separate processes: the glottal flow, the vocal tract filtering and the lip radiation effect. Since the lip radiation effect can be approximated as a differentiator (Flanagan, 1972), it is possible to combine the first process, the glottal flow, and the third one, the lip radiation effect. Consequently, voice production can be modeled by two processes: the differentiated glottal flow (also called the effective driving function) and the vocal tract filtering (Wong et al., 1979). A negative impulse-like peak dominates the waveform of the differentiated glottal flow, at least for normal and loud phonations. This peak, occurring at the time instant when the rate of change of the flow reaches its absolute maximum, serves as the main excitation of the vocal tract (Fant, 1993).
Consequently, the level of the peak, d_peak, determines to a large extent the energy of the produced speech signal.

The linearity of the function between SPL and d_peak is further studied in the present study. Even though there are well-known publications on this issue (e.g., Gauffin and Sundberg, 1989), we consider this topic worth revisiting for the following two reasons. First, it is known that the dynamic range of the human voice in terms of SPL can be 70 dB or even more when pitch is allowed to vary over its extreme values (Åkerlund and Gramming, 1994; Titze, 1994). However, in voice source studies where inverse filtering is applied, there seems to be a tendency not to analyze very soft or extremely loud voices. In the study by Holmberg et al. (1988), for example, the difference between the mean SPL-value of loud and soft voices was only 11.0 dB for male speakers and 11.8 dB for females. Therefore, we believe that a further study on the SPL–d_peak function is needed in order to determine whether the linearity of the function holds when voices of widely different SPL-values are analyzed. The second reason for the present study, as shown by two examples depicted in Figs. 4 and 5, is our experimental finding indicating a clear "knee" in the SPL–d_peak graphs when voices are analyzed using a wide SPL-range. According to this finding, SPL–d_peak graphs can be modeled accurately by a linear function, but the slope of the line is different between soft and loud phonations.

Fig. 1. One cycle of the glottal volume velocity waveform (a) and its first derivative (b). f_AC: amplitude of the AC-flow, f_DC: level of the DC-flow, d_peak: negative peak amplitude of the differentiated glottal flow.

2. Material and methods

2.1. Speech material

In order to measure the linearity of the SPL–d_peak function, we designed an experiment where speech data were collected from 11 adult Finnish speakers (five females and six males) with no history of speech, voice or hearing disorders. The ages of the speakers varied between 42 and 54 years for females and between 33 and 66 years for males. Each subject produced a series of the word /pa:p:a/ (Finnish for "grandpa") by gradually increasing loudness. The acoustic speech pressure waveform was recorded using a condenser microphone (Brüel & Kjær 4176) which was placed at a distance of 40 cm from the lips of the speaker. (The distance was carefully monitored during the recording because a constant mouth-to-microphone distance is essential to get reliable results for the amplitude values of the glottal flow with our inverse filtering method.) Recordings were made in an anechoic chamber, and all of the subjects sat while producing the sounds.

The first phonation sample was to be produced as softly as possible without whispering. The output level of the speech signals was controlled by means of a sound level meter (Brüel & Kjær 2225). By following the LED display of the sound level meter, the speakers were able to control the SPL-values of their speech samples. Subjects repeated the word /pa:p:a/, increasing the SPL-value in steps of approximately 5 dB from the softest voice up to the loudest, which had an SPL-value of 105 dB. (Some subjects voluntarily also produced a still louder sound with an SPL-value of 110 dB.) The subjects were given no other restrictions regarding their voice production, which means that pitch and phonation type were decided freely by the speakers during the recording. The average number of voice samples covering the intensity range from the softest to the loudest phonation was 12 per speaker. The total number of speech samples produced was 61 and 71 by female and male speakers, respectively. The acoustic speech pressure waveform was digitized using a sampling frequency of 22,050 Hz and a resolution of 16 bits.
On the computer, the signals were first high-pass filtered with a linear-phase FIR filter whose cut-off frequency was 60 Hz in order to remove any possible low frequency air pressure variations picked up during the recordings. The bandwidth of the signals was 11 kHz.

For computation of SPL-values, we recorded a calibration signal generated by a Brüel & Kjær 4231 calibrator. SPL-values were computed on the dB-scale for all speech signals using the root mean square (RMS) operation and the SPL-value of the calibration tone (94 dB) as follows:

$$\mathrm{SPL}\{\mathrm{speech}\} = 94\ \mathrm{dB} + 20 \log_{10} \frac{\mathrm{RMS}\{\mathrm{speech}\}}{\mathrm{RMS}\{\mathrm{calibration}\}} \qquad (1)$$

It is worth noting that in the present study Eq. (1) yields the SPL-value of a speech signal with linear weighting at a distance of 40 cm from the lips of the speaker.

2.2. Inverse filtering

In order to estimate the glottal volume velocity waveforms, we used an inverse filtering technique similar to the one described by Alku (1992). This inverse filtering method estimates the glottal excitation directly from the acoustic speech pressure waveform recorded in a free field (i.e., no flow mask is required). The method is based on the separated speech production model by Fant (1960). In the present study, we used a modified version of the inverse filtering method described in (Alku, 1992). The modification concerned the modeling of the vocal tract transfer function: the estimation of the vocal tract transfer function was based on a sophisticated all-pole modeling technique called Discrete All-Pole Modeling (DAP) (El-Jaroudi and Makhoul, 1991) instead of the conventional Linear Predictive Coding (LPC) which is used in (Alku, 1992). The difference between DAP and LPC is that the former is based on the Itakura–Saito distortion criterion in determining an optimal all-pole filter, whereas the latter uses the least squares error criterion.
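As a point of reference for the least squares criterion mentioned above, the following is a minimal sketch of conventional LPC using the autocorrelation method. It is not the DAP technique used in this study, and it is not taken from the paper: the function name `lpc_autocorr`, the model order and the two-pole test filter are illustrative assumptions.

```python
import numpy as np

def lpc_autocorr(x, order):
    """Least-squares all-pole fit (autocorrelation method of LPC).

    Returns coefficients a[0..order-1] such that x[n] is predicted by
    a[0]*x[n-1] + a[1]*x[n-2] + ...
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Autocorrelation at lags 0..order
    r = np.array([np.dot(x[: n - k], x[k:]) for k in range(order + 1)])
    # Normal (Yule-Walker) equations R a = r[1:], where R is Toeplitz
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:])

# Hypothetical check signal: impulse response of a known, stable two-pole filter
true_a = np.array([1.6, -0.8])
h = np.zeros(400)
h[0] = 1.0
for i in range(1, len(h)):
    h[i] = true_a[0] * h[i - 1] + (true_a[1] * h[i - 2] if i >= 2 else 0.0)

est_a = lpc_autocorr(h, order=2)
```

On this idealized input the least-squares fit recovers the filter coefficients. Real speech spectra are sampled at the harmonics of F0, which is the situation DAP was designed to handle better.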
As a consequence of this difference in criterion, the formants of the vocal tract, particularly the first formant, can be estimated more accurately by DAP than by LPC (El-Jaroudi and Makhoul, 1991). Accurate modeling of the vocal tract transfer function is very important from the point of view of glottal inverse filtering. As reported in (Alku and Vilkman, 1994), the application of DAP instead of LPC in modeling the vocal tract transfer function decreases the amount of formant ripple in the estimated glottal flows.

The inverse filtering method used in the present study estimates the vocal tract transfer function using the above-mentioned DAP-technique by first canceling the average effect of the glottal source from the speech spectrum using a low-order all-pole filter. By scaling the DC-gain of the digital filter that models the vocal tract to unity, it is possible to estimate the amplitude characteristics of the glottal flow (in arbitrary units), even though no flow mask is used (Alku et al., 1998a). The lip radiation effect is canceled by a first-order all-pole filter with its pole in the z-domain at z = 0.98. In the present study, the glottal flow was estimated using an analysis window of 50 ms. (For some of the low-pitched male voices, the length of the time window was increased to 70 ms in order to cover at least four glottal cycles.) From each of the estimated glottal waveforms, the value of d_peak was determined by computing the mean of the negative peak amplitudes of the differentiated glottal flow (see footnote 1) over four consecutive glottal cycles.

3. Results

An example of an SPL–d_peak graph obtained from phonations of a male subject is shown in Fig. 2. At first sight, the function between SPL and d_peak seems to be linear over the whole SPL-range from the softest phonation up to the loudest.
However, a closer examination of the graph reveals that d_peak seems to follow very closely a linear function of SPL over the softest four phonations but then, in the vicinity of 80 dB, it starts to follow another linear function, the slope of which is clearly smaller than the slope of the first line. In other words, if voices of greatly different SPL-values are analyzed, it seems more plausible to model the SPL–d_peak relationship using (at least) two linear functions instead of just one.

Fig. 2. An example of the SPL–d_peak graph.

Footnote 1: The glottal flows computed in the present study were parameterized by extracting information (i.e., d_peak) from the first derivative of the flow. Computation of the derivative was implemented by filtering the flow waveforms with the model of the lip radiation effect that was used in the inverse filtering stage. The transfer function of the differentiator was $H(z) = 1.0 - 0.98 z^{-1}$.

The reason for the "knee" in the SPL–d_peak graph between soft and normal phonations, as shown by the example depicted in Fig. 2, can be explained by analyzing the waveforms of the corresponding glottal flows and their derivatives. The glottal flow and its derivative are shown for speech sample no. 4 of Fig. 2 (i.e., the strongest of the soft phonations) in Fig. 3(a) and (b), respectively. The flow and the differentiated flow of speech sample no. 5 of Fig. 2 (i.e., the softest of the normal phonations) are shown in Fig. 3(c) and (d), respectively. It can be seen from these graphs that the amplitude of d_peak increases to some extent (by 25%, which equals 1.9 dB) when the intensity of the voice rises. However, there is also a change in the shapes of the differentiated glottal flows: the waveform of Fig. 3(b) is much smoother during the glottal closing phase than the signal shown in Fig. 3(d). In the frequency domain, this implies that the spectral decay of the differentiated glottal flow shown in Fig. 3(d) is less than the decay of the signal in Fig. 3(b).
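The idea of describing a graph such as Fig. 2 with two lines instead of one can be made concrete with a small breakpoint search. The sketch below tries every split of the SPL-sorted points into a softer and a louder group, fits a least-squares line to each group, and keeps the split with the smallest total squared error. It is an illustration under assumptions, not the authors' code: the function name `fit_two_lines`, the `min_points` limit applied to both groups, and the synthetic data are all hypothetical.

```python
import numpy as np

def fit_two_lines(spl, dpeak, min_points=3):
    """Return (split, slope1, slope2) for the best two-line model.

    `split` is the number of softest points assigned to the first line.
    Both inputs are assumed to be expressed in dB.
    """
    order = np.argsort(spl)
    x = np.asarray(spl, dtype=float)[order]
    y = np.asarray(dpeak, dtype=float)[order]
    best = None
    for k in range(min_points, len(x) - min_points + 1):
        fit1 = np.polyfit(x[:k], y[:k], 1)  # (slope, intercept), softer group
        fit2 = np.polyfit(x[k:], y[k:], 1)  # (slope, intercept), louder group
        err = (np.sum((np.polyval(fit1, x[:k]) - y[:k]) ** 2)
               + np.sum((np.polyval(fit2, x[k:]) - y[k:]) ** 2))
        if best is None or err < best[0]:
            best = (err, k, fit1[0], fit2[0])
    _, split, slope1, slope2 = best
    return split, slope1, slope2

# Hypothetical data: 12 phonations in 5 dB steps, with a knee placed
# between 75 and 80 dB by construction (slope 1.0 below, 0.4 above).
spl = np.arange(55.0, 115.0, 5.0)
dpeak = np.where(spl <= 75.0, spl - 15.0, 63.0 + 0.4 * (spl - 80.0))

split, slope1, slope2 = fit_two_lines(spl, dpeak)
```

An exhaustive search is affordable here because each speaker contributes only about a dozen points, so there are very few candidate splits.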
Therefore, the waveform of Fig. 3(d) produces, after being filtered through the vocal tract, a voice signal of larger SPL than the waveform shown in Fig. 3(b). This is explained by the fact that the amplitude of the spectral components near the first formant of the produced speech sound will be stronger if the spectral decay of the voice source decreases. In this example a large increase of SPL occurs even though the glottal flows have only slightly different values of d_peak. When these voices are expressed in the SPL–d_peak graph, they will have almost the same value on the y-axis (i.e., the level of d_peak), but the sound excited by the waveform of Fig. 3(d) will yield a clearly larger value on the x-axis (i.e., the value of SPL). Hence, the accuracy of the classical model for the SPL–d_peak function that consists of a single line is degraded. Therefore, it is justified to analyze how to model the SPL–d_peak relationship more accurately by taking into account that, particularly in the soft-to-normal transitions, the change of the SPL-value of speech is regulated not only by the level of the negative peak of the differentiated glottal flow, d_peak, but also by the shape of the differentiated glottal waveform near the instant of the negative peak.

Fig. 3. Waveforms estimated by inverse filtering speech samples no. 4 and 5 of Fig. 2 (y-axis in arbitrary units). (a) Glottal flow of speech sample no. 4. (b) Differentiated glottal flow of speech sample no. 4. (c) Glottal flow of speech sample no. 5. (d) Differentiated glottal flow of speech sample no. 5.

SPL–d_peak graphs of each speaker were modeled using two optimal linear functions. These lines, denoted by line_opt,1 and line_opt,2, have been drawn in two examples of SPL–d_peak graphs shown in Figs. 4 and 5. The functions of the optimal lines were determined for the phonations of each speaker as follows. The obtained 12 SPL–d_peak values of each subject were divided into two groups, denoted by Group_1 and Group_2. At first, Group_1 consisted of the SPL–d_peak values of the three softest phonations while the remaining nine SPL–d_peak values were in Group_2. An optimal linear function (in terms of the mean square error criterion) was then matched separately over the SPL–d_peak values of each group. Next, the obtained data points were divided into the groups differently by taking the four softest phonations into Group_1 and the eight loudest phonations into Group_2. A new pair of optimal linear functions was obtained by separately matching the SPL–d_peak values of both groups with two lines. This procedure was repeated, searching for the division of the data points into two groups that yielded a minimal mean square error between the original SPL–d_peak values and their linear models. Hence, line_opt,1 and line_opt,2 shown in Figs. 4 and 5 represent linear models for the SPL–d_peak values, including the optimal way to divide the data points into two separate groups to be matched by two lines.

Fig. 4. SPL–d_peak graph, female speaker.

Fig. 5. SPL–d_peak graph, male speaker.

The obtained SPL–d_peak values of all 11 subjects were analyzed in a similar manner as shown in Figs. 4 and 5. The optimal linear functions, line_opt,1 and line_opt,2, were quantified using their slopes, which are denoted by slope_1 and slope_2, respectively. It was found that slope_1 was larger than slope_2 for the phonations of each subject. In other words, when SPL–d_peak graphs were modeled by two linear functions that were determined optimally by minimizing the mean square error, the linear model for the softest phonations was in all cases a more steeply ascending line than the model matching normal and loud phonations. The mean (m) and the standard deviation (s.d.) of the two slopes were as follows for female voices: slope_1: m = 0.82, s.d. = 0.29; slope_2: m = 0.36, s.d. = 0.27. For male phonations, the following values were obtained: slope_1: m = 1.14, s.d. = 0.27; slope_2: m = 0.54, s.d. = 0.34. The difference between slope_1 and slope_2 was statistically tested using the Wilcoxon signed-rank nonparametric test. The difference between the slope values was statistically significant (p = 0.0033). The optimal linear functions modeled the SPL–d_peak graphs accurately: the correlation coefficient averaged over the eleven subjects was 0.94 and 0.93 when SPL–d_peak graphs were modeled with line_opt,1 and line_opt,2, respectively. Values of SPL and d_peak are shown for the loudest speech sample modeled by line_opt,1 and for the softest speech sample modeled by line_opt,2 in Tables 1 and 2 for each of the female and male speakers, respectively.

Table 1
Sound pressure level (SPL_1) and negative peak amplitude of the differentiated glottal flow (d_peak,1) for the loudest speech sample before the knee in the SPL–d_peak graphs. Sound pressure level (SPL_2) and the negative peak amplitude of the differentiated glottal flow (d_peak,2) for the softest speech sample after the knee in the SPL–d_peak graphs. All the values are expressed in dB units. Female speakers.

Speaker   SPL_1   d_peak,1   SPL_2   d_peak,2
HR        72      65         73      63
HS        65      59         71      59
AS        82      69         86      69
EL        87      68         97      74
LM        84      72         88      77

Table 2
Sound pressure level (SPL_1) and negative peak amplitude of the differentiated glottal flow (d_peak,1) for the loudest speech sample before the knee in the SPL–d_peak graphs. Sound pressure level (SPL_2) and the negative peak amplitude of the differentiated glottal flow (d_peak,2) for the softest speech sample after the knee in the SPL–d_peak graphs. All the values are expressed in dB units. Male speakers.

Speaker   SPL_1   d_peak,1   SPL_2   d_peak,2
EV        69      62         74      64
HP        70      63         74      61
PA        68      64         73      61
JK        63      57         70      56
TB        87      70         92      75
JV        66      63         68      62

The subjects of the present study were allowed to produce speech samples freely using the pitch of their own choice. Therefore, the fundamental frequency of the voices increased with intensity, which is in line with previous studies where F0 has been analyzed as a function of vocal loudness (e.g., Gramming et al., 1988; Dromey et al., 1992; Titze and Sundberg, 1992). (The minimum of F0 was 125 and 75 Hz for phonations of female and male subjects, respectively. The maximum of F0 was 500 and 315 Hz for phonations of female and male subjects, respectively.) Hence, it is important to analyze whether the decrease in the slope of the two lines that model the SPL–d_peak graph is caused by the increase of the fundamental frequency when the intensity of speech is raised.

In order to test whether the knee in the SPL–d_peak graphs was caused by F0 or by the sharpening of the flow derivative at the instant of the glottal closure, we made the following computations for all the speech samples. First, one cycle was cut from each of the obtained glottal flow waveforms. This single cycle of the glottal flow was first differentiated. Then it was filtered through the same digital all-pole filter that was used as a model for the vocal tract when the glottal flow from which the period was cut was computed with inverse filtering. In other words, we re-synthesized one period of speech using the analysis results given by inverse filtering (i.e., the glottal flow waveform and the model of the vocal tract filter). Finally, the energy of the speech period obtained, denoted by Energy of the Synthesized Period (ESP), was computed. The rationale for this procedure is as follows. If we assume that the knee in the SPL–d_peak graphs is caused by F0, then the graphs depicting d_peak as a function of ESP should not show a similar knee.
This comes from the fact that ESP is the energy of a hypothetical speech signal that cannot be affected by F0, because the speech signal from which ESP is computed is produced by a single glottal cycle. (It is worth noting that ESP is not the same as the energy computed over one period of the original speech signal, which can be affected by F0, i.e., by fluctuations from previous glottal periods.)

Spectral decay was also measured for each of the glottal waveforms in order to explain the knee in the SPL–dpeak graphs. If the knee is caused by the changes in the shape of the glottal flow (especially during the glottal closing phase), then the spectral decay of the glottal source should decrease when intensity is increased. Spectral decay of the glottal source was quantified with two methods. First, the difference (in dB) between the levels of the first and the second harmonic, denoted by H1−H2, was computed from the spectra of the flow waveforms (Titze and Sundberg, 1992). A large value of H1−H2 implies that the spectrum of the glottal flow decays rapidly, whereas a small value of H1−H2 indicates that the glottal source has more energy at higher frequencies. Second, the parabolic spectral parameter, PSP, was determined from the pitch-synchronously computed glottal spectra (Alku et al., 1997). PSP quantifies the spectral decay of a glottal flow by matching a parabolic function (y(k) = ak² + b, where k denotes the discrete frequency variable) to the pitch-synchronously computed spectrum of the glottal source. The optimal parabolic function is matched to the power spectrum of a glottal pulse by applying the mean square error criterion. In the case of a rapidly decaying glottal spectrum the optimal match yields a large negative value for the parabolic parameter a. In the case of a glottal flow the spectrum of which decays slowly, the value of the parabolic parameter a is closer to zero.
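The parabola matching described above can be illustrated with a short least-squares sketch. This is not the authors' implementation: the function name `fit_parabola`, the choice of frequency bins k = 0, 1, ..., N−1, and the scale of the input spectrum values are all assumptions made for illustration. Since y(k) = ak² + b is linear in the unknowns a and b, the fit reduces to ordinary linear regression of the spectrum values on k².

```python
def fit_parabola(spectrum):
    """Least-squares fit of y(k) = a*k**2 + b to spectrum[k], k = 0..N-1.

    Hypothetical helper: the bin range and the spectrum scale are assumptions,
    not details taken from the paper.
    """
    n = len(spectrum)
    if n < 2:
        raise ValueError("need at least two spectral values")
    x = [k * k for k in range(n)]  # regressor is k squared
    mean_x = sum(x) / n
    mean_y = sum(spectrum) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, spectrum))
    a = sxy / sxx            # parabolic parameter (no normalization applied)
    b = mean_y - a * mean_x  # offset at k = 0
    return a, b


# Synthetic spectra: one decaying quickly, one decaying slowly.
steep = [3.0 - 0.5 * k * k for k in range(8)]
gentle = [3.0 - 0.05 * k * k for k in range(8)]
a_steep, _ = fit_parabola(steep)
a_gentle, _ = fit_parabola(gentle)
```

In this toy case the rapidly decaying spectrum gives the more negative value of a, and the slowly decaying one gives a value closer to zero, which is the qualitative behavior the text attributes to the parabolic parameter.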
It is worth noting that in (Alku et al., 1997) a normalized value of PSP was used, whereas in the present study the PSP computation corresponded to searching for the optimal parabolic parameter a without normalization.

Fig. 6. Scatterplots describing voice source parameters as a function of SPL. Male speaker. (a) Negative peak amplitude of the differentiated glottal flow (dpeak) as a function of SPL. (b) Negative peak amplitude of the differentiated glottal flow (dpeak) as a function of the Energy of the Synthesized Period (ESP). (c) Difference between the levels of the first and the second harmonic (H1−H2) in the voice source spectrum as a function of SPL. (d) Parabolic spectral parameter (PSP) matched to the pitch-synchronously computed voice source spectrum as a function of SPL.

An example depicting the behavior of ESP, H1−H2, and PSP is shown in Fig. 6 together with the corresponding SPL–dpeak graph. In this example it can be clearly seen that the knee occurs between the fourth softest and the fifth softest speech sample both in the SPL–dpeak graph (Fig. 6(a)) and in the ESP–dpeak graph (Fig. 6(b)). It can also be seen from both the H1−H2 graph (Fig. 6(c)) and the PSP graph (Fig. 6(d)) that the spectral decay of the glottal source decreases when intensity is increased. Hence, the graphs of Fig. 6, especially the ESP–dpeak graph, show that the knee in the SPL–dpeak function in the transition between soft and normal phonations is due to the changes in the shape of the glottal flow and cannot be explained by the increase of F0. It can also be seen from Fig. 6(a) and (b) that the dynamic range of SPL is larger than that of ESP. This is explained by the loudest speech sample, the SPL value of which is about 6 dB larger than the SPL value of the second loudest speech sample, whereas the difference in ESP between the corresponding samples is only about 3 dB.
Hence, this speaker seems to have used F0 as a method of intensity regulation mainly in producing the loudest speech sample (which is also the sample with the largest F0).

Similarity between the SPL–dpeak graphs and the ESP–dpeak graphs and the decrease in the spectral decay were statistically analyzed from the phonations of the eleven subjects as follows. The ESP–dpeak graphs were modeled with two linear functions that were optimized in the same way as in the case of the SPL–dpeak graphs described previously. The difference between the slopes of the two lines was tested using the Wilcoxon Signed-Rank nonparametric test. It was found that the difference between the slopes was statistically significant (p = 0.026). In other words, there was a significant difference in the slopes of the optimal linear functions between soft and loud speech samples also after removing the effect of F0 with the ESP computation.

The change in the spectral decay of the glottal flow as a function of SPL was tested as follows. The mean of both H1−H2 and PSP was computed for phonations of each subject over speech samples in two groups. The first group consisted of the speech samples the SPL–dpeak graph of which was modeled by lineopt,1 (i.e., samples left of the knee in the SPL–dpeak graph). The second group consisted of the speech samples the SPL–dpeak graph of which was modeled by lineopt,2 (i.e., samples right of the knee in the SPL–dpeak graph). The decrease of the spectral decay of the glottal flow between the two sample groups occurred for all the subjects and for both of the spectral parameters. (The decrease of the spectral slope was statistically significant with the Wilcoxon Signed-Rank nonparametric test at the significance level of p = 0.0033 for both H1−H2 and PSP.)

All the results reported so far in the present study are based on the glottal flow waveforms estimated by inverse filtering.
Therefore, in order to confirm our results, an additional analysis was made using the spectra of the radiated speech sounds per se (i.e., results given by inverse filtering were not used). By doing this frequency domain analysis we were able to compare our data with the previous results from phonetogram measurements (e.g., Gramming and Sundberg, 1988; Titze, 1992). According to Gramming and Sundberg (1988), the strongest spectral component in soft phonation is usually the fundamental, while in loud phonation the strongest partial is generally an overtone. In their study it was also shown that the level difference between the strongest partial and the overall SPL increases when phonation changes from soft to loud. Following Gramming and Sundberg (1988), we analyzed from the radiated spectra whether the knee in the SPL–dpeak function coincides with the change of the strongest partial from F0 to an overtone near F1.

Fig. 7. Spectra of radiated speech sounds, male speaker. (a) Softest phonation. (b) Phonation just before the knee occurs in the SPL–dpeak graph. (c) Phonation just after the knee occurs in the SPL–dpeak graph.

The following three speech samples of each subject were analyzed: the softest sound (denoted by s0(n)), the speech sample that occurs just before the knee in the SPL–dpeak function (denoted by s1(n)) and the sample that occurs immediately after the knee (denoted by s2(n)). The spectrum was computed using an FFT of 2048 samples with Hamming windowing. Fig. 7 shows spectra of s0(n), s1(n) and s2(n) obtained from voices of one male subject. From these graphs it can be seen that the effect of the fundamental was most important for the intensity of the softest sound. However, the spectra of both s1(n) and s2(n) are characterized by strong partials near F1.
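The spectral analysis step above (a 2048-point FFT with Hamming windowing, followed by locating the strongest partial) can be sketched as follows. This is only an illustrative sketch using the Python standard library: the helper names, the symmetric form of the Hamming window, and the sampling rate used in the example are assumptions, not details given in the paper.

```python
import cmath
import math


def hamming(n):
    """Symmetric Hamming window of length n (window form is an assumption)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def level_spectrum_db(frame):
    """Hamming-windowed level spectrum in dB for bins 0..N/2."""
    n = len(frame)
    if n < 2 or n & (n - 1):
        raise ValueError("frame length must be a power of two")
    window = hamming(n)
    spec = fft([s * w for s, w in zip(frame, window)])
    # A small offset avoids log10(0) for empty bins.
    return [20 * math.log10(abs(c) + 1e-12) for c in spec[: n // 2 + 1]]


def strongest_bin(levels_db):
    """Index of the bin with the highest level."""
    return max(range(len(levels_db)), key=levels_db.__getitem__)


# Synthetic frame: weak 100 Hz fundamental plus a stronger partial at 600 Hz.
FS = 8192.0  # hypothetical sampling rate, chosen so both tones fall on FFT bins
N = 2048     # FFT length stated in the text
frame = [math.sin(2 * math.pi * 100 * i / FS)
         + 3.0 * math.sin(2 * math.pi * 600 * i / FS) for i in range(N)]
levels = level_spectrum_db(frame)
peak_hz = strongest_bin(levels) * FS / N
```

For real speech the partials rarely fall exactly on FFT bins, so the strongest bin only approximates the frequency and level of the strongest partial.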
In order to quantitatively compare the effect of the fundamental on vocal intensity for s0(n), s1(n) and s2(n), we computed for each of these sounds the difference (in dB) between the overall energy and the energy without the fundamental. This difference was clearly largest for s0(n) (mean: 8.51 dB, standard deviation: 2.91 dB) when voices of all the subjects were analyzed. Both s1(n) and s2(n) yielded a value of energy difference that was much smaller, and the value obtained for s1(n) (mean: 0.92 dB, standard deviation: 0.94 dB) was close to that computed from s2(n) (mean: 0.36 dB, standard deviation: 0.32 dB). This confirms that SPL reflects the amplitude of the fundamental only for s0(n), whereas the SPL of both s1(n) and s2(n) is strongly affected by overtones near F1.

Finally, radiated speech spectra were also analyzed in order to test whether SPL of s2(n) was increased by formant tuning (i.e., by adjusting a harmonic closer to F1 in s2(n) than in s1(n)). For this purpose we extracted the center frequency of F1 (footnote 2) from the pitch-synchronously computed spectra of s1(n) and s2(n). The value obtained was then compared to the frequency of the strongest partial near F1 in the pitch-asynchronously computed spectra. This comparison showed that a spectral partial was closer to F1 in s1(n) in 9 of the 11 cases, whereas an overtone was closer to F1 in s2(n) in only 2 of the 11 cases. This finding confirms our previous result according to which the knee in the SPL–dpeak graph does not result from increasing intensity by formant tuning.

Footnote 2: The center frequency of F1 varied between 646 and 851 Hz for female subjects. For male speakers the center frequency of F1 varied between 528 and 635 Hz.

4. Summary and conclusions

In previous studies, in particular in the classical paper by Gauffin and Sundberg (1989), it has been shown that the sound pressure level of speech follows the negative peak amplitude of the differentiated glottal flow, dpeak, in a manner close to linear. The linearity between SPL and dpeak is readdressed in the present study because of our finding indicating a clear knee in SPL–dpeak graphs when voices of greatly different intensities are analyzed. This phenomenon was quantified in the present study by modeling the SPL–dpeak graphs of 11 speakers with two optimal lines that minimize the mean square error between the SPL–dpeak values and their linear models. It was found that the line that models the SPL–dpeak graph for soft phonations had a larger slope than the line that models SPL–dpeak values for loud speech samples.

In production of very soft voices, a speaker typically uses a smooth glottal pulse with a small AC-amplitude. Raising intensity can be achieved by increasing the AC-amplitude and also by affecting the shape of the glottal pulse by, for example, shortening the closing phase of the glottal cycle (Sundberg et al., 1993). Both of these changes in the glottal pulse increase the amplitude of the flow derivative. Results of the present study show that when intensity of speech is increased using minor SPL-steps, it is possible to generate two sounds with different SPL values using practically the same level of dpeak, but the shape of the differentiated glottal flow is affected during the glottal closing phase. Hence, raising SPL can be achieved not only by increasing the amplitude of dpeak, as suggested by the classical SPL–dpeak function reported by Gauffin and Sundberg (1989), but also by decreasing the spectral decay of the differentiated glottal flow by increasing the sharpness of the glottal flow derivative around the time-instant of dpeak.
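The two-line modeling summarized above can be sketched as an exhaustive search over split points. This is a minimal illustration rather than the authors' procedure: it assumes that the two groups are contiguous along the SPL axis, that each group holds at least two points with distinct SPL values, and that the names `fit_line` and `fit_two_lines` are hypothetical. The synthetic data below are not measurements from the study.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line; returns (slope, intercept, squared error).

    Assumes the x values are not all identical.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, sse


def fit_two_lines(spl, dpeak, min_points=2):
    """Split the SPL-sorted points into a soft and a loud group so that the
    summed squared error of the two fitted lines is minimal.

    Returns (split_index, (slope1, intercept1), (slope2, intercept2)).
    """
    pairs = sorted(zip(spl, dpeak))
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    best = None
    for split in range(min_points, len(xs) - min_points + 1):
        line1 = fit_line(xs[:split], ys[:split])
        line2 = fit_line(xs[split:], ys[split:])
        total = line1[2] + line2[2]
        if best is None or total < best[0]:
            best = (total, split, line1, line2)
    if best is None:
        raise ValueError("not enough points for two groups")
    return best[1], best[2][:2], best[3][:2]


# Synthetic example with a knee (values in dB, made up for illustration).
spl_db = [60, 62, 64, 66, 68, 70, 72, 74, 76, 78]
dpeak_db = [50, 52, 54, 56, 58, 59.0, 59.8, 60.6, 61.4, 62.2]
split, (slope1, _), (slope2, _) = fit_two_lines(spl_db, dpeak_db)
```

With dpeak on the y-axis and SPL on the x-axis, as in the graphs of the study, slope1 describes the soft group and slope2 the normal and loud group. Because the total squared error is the per-sample mean square error multiplied by a fixed number of points, minimizing either gives the same split.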
When voice intensity is changed from very soft to loud, speakers tend to make the most distinct change in their SPL–dpeak function when going from "loud soft" to "soft normal". This change can be seen as a decrease in the slope of the line that matches the SPL–dpeak graph.

Finally, we would like to point out that both the study by Gauffin and Sundberg (1989) and the present one can be considered applications of the Liljencrants–Fant (LF) model (Fant et al., 1985). This is due to the fact that dpeak is actually one of the parameters used in the LF-model. (In LF-model terminology, the notation Ee is used for dpeak.) As stated by Fant (1993), dpeak is the most important among the LF-parameters since it sets the levels of the formant amplitudes. When SPL–dpeak graphs are used in analyzing intensity regulation of speech, one is actually applying the LF-model in an extremely compressed form by modeling the differentiated voice source with only one of the four LF-parameters. Further research is needed in order to find out whether both the amplitude and the shape of the differentiated glottal flow at the instant of the glottal closure could be represented with a single numerical value. It could be possible, for example, to apply the second derivative of the glottal flow (Holmes, 1976; Hunt, 1987). It is also possible to combine two or more different voice source parameters into a single one in a similar way as has been done with the LF-model by Fant (1995) and by the present authors in (Alku et al., 1998b).

References

Åkerlund, L., Gramming, P., 1994. Average loudness level, mean fundamental frequency, and subglottal pressure: Comparison between female singers and nonsingers. J. Voice 8, 263–270.
Alku, P., 1992. Glottal wave analysis with Pitch Synchronous Iterative Adaptive Inverse Filtering. Speech Communication 11, 109–118.
Alku, P., Vilkman, E., 1994. Estimation of the glottal pulseform based on discrete all-pole modeling. In: Proc. Internat. Conf. on Spoken Language Processing, Yokohama, Japan, 18–22 September, pp. 1619–1622.
Alku, P., Strik, H., Vilkman, E., 1997. Parabolic spectral parameter – A new method for quantification of the glottal flow. Speech Communication 22, 67–79.
Alku, P., Vilkman, E., Laukkanen, A.-M., 1998a. Estimation of amplitude features of the glottal flow by inverse filtering speech pressure signals. Speech Communication 24, 123–132.
Alku, P., Vilkman, E., Laukkanen, A.-M., 1998b. Parameterization of the voice source by combining spectral decay and amplitude features of the glottal flow. J. Speech Lang. Hear. Res. 41, 990–1002.
Bouhuys, A., Mead, J., Proctor, D., Stevens, K., 1968. Pressure-flow events during singing. Ann. N.Y. Acad. Sci. 155, 165–176.
Childers, D., Lee, C., 1991. Vocal quality factors: Analysis, synthesis, and perception. J. Acoust. Soc. Amer. 90, 2394–2410.
Dromey, C., Stathopoulos, E., Sapienza, C., 1992. Glottal airflow and electroglottographic measures of vocal function at multiple intensities. J. Voice 6, 44–54.
El-Jaroudi, A., Makhoul, J., 1991. Discrete all-pole modeling. IEEE Trans. Signal Process. 39, 411–423.
Fant, G., 1960. Acoustic Theory of Speech Production. Mouton, The Hague.
Fant, G., 1993. Some problems in voice source analysis. Speech Communication 13, 7–22.
Fant, G., 1995. The LF-model revisited. Transformations and frequency domain analysis. Speech Transmission Laboratory, Quarterly Progress and Status Report, Royal Institute of Technology, Stockholm, 2–3, 119–156.
Fant, G., Liljencrants, J., Lin, Q., 1985. A four-parameter model of glottal flow. Speech Transmission Laboratory, Quarterly Progress and Status Report, Royal Institute of Technology, Stockholm, 4, 1–13.
Flanagan, J., 1972. Analysis, Synthesis, and Perception of Speech. Springer, Berlin.
Gauffin, J., Sundberg, J., 1989. Spectral correlates of glottal voice source waveform characteristics. J. Speech Hear. Res. 32, 556–565.
Gramming, P., Sundberg, J., 1988. Spectrum factors relevant to phonetogram measurement. J. Acoust. Soc. Amer. 83, 2352–2360.
Gramming, P., Sundberg, J., Ternström, S., Leanderson, R., Perkins, W., 1988. Relationship between changes in voice pitch and loudness. J. Voice 2, 118–126.
Hertegård, S., Gauffin, J., Sundberg, J., 1990. Open and covered singing as studied by means of fiberoptics, inverse filtering, and spectral analysis. J. Voice 4, 220–230.
Hertegård, S., Gauffin, J., Karlsson, I., 1992. Physiological correlates of the inverse filtered flow waveforms. J. Voice 6, 224–234.
Hillman, R., Holmberg, E., Perkell, J., Walsh, M., Vaughan, C., 1990. Phonatory function associated with hyperfunctionally related vocal fold lesions. J. Voice 4, 52–63.
Holmberg, E., Hillman, R., Perkell, J., 1988. Glottal airflow and transglottal air pressure measurements for male and female speakers in soft, normal, and loud voice. J. Acoust. Soc. Amer. 84, 511–529.
Holmes, J., 1976. Formant excitation before and after glottal closure. In: Proc. IEEE Internat. Conf. on Acoustics, Speech, and Signal Process., pp. 39–42.
Hunt, M., 1987. Studies of glottal excitation using inverse filtering and an electroglottograph. In: Proc. of the 11th Internat. Congress of Phonetic Sciences, Tallinn, Estonia, 1–7 August, Vol. 3, pp. 22–26.
Sulter, A., Wit, H., 1996. Glottal volume velocity waveform characteristics in subjects with and without vocal training, related to gender, sound intensity, fundamental frequency, and age. J. Acoust. Soc. Amer. 100, 3360–3373.
Sundberg, J., 1990. What's so special about singers. J. Voice 4, 107–119.
Sundberg, J., Titze, I., Scherer, R., 1993. Phonatory control in male singing: A study of the effects of subglottal pressure, fundamental frequency, and mode of phonation on the voice source. J. Voice 7, 15–29.
Titze, I., 1992. Acoustic interpretation of the voice range profile (phonetogram). J. Speech Hear. Res. 35, 21–34.
Titze, I., 1994. Principles of Voice Production. Prentice-Hall, Englewood Cliffs, NJ.
Titze, I., Sundberg, J., 1992. Vocal intensity in speakers and singers. J. Acoust. Soc. Amer. 91, 2936–2946.
Vilkman, E., Lauri, E.-R., Alku, P., Sala, E., Sihvo, M., 1997. Loading changes in time-based parameters of glottal flow waveforms in different ergonomic conditions. Folia Phoniatr. Logop. 49, 247–263.
Wong, D., Markel, J., Gray, A., 1979. Least squares glottal inverse filtering from the acoustic speech waveform. IEEE Trans. Acoust. Speech Signal Process. 27, 350–355.