Speech Communication 28 (1999) 269–281
www.elsevier.nl/locate/specom
On the linearity of the relationship between the sound pressure
level and the negative peak amplitude of the differentiated
glottal flow in vowel production
Paavo Alku a,*, Juha Vintturi b, Erkki Vilkman c
a Acoustics Laboratory, Helsinki University of Technology, P.O. Box 3000, Fin-02015 Helsinki, TKK, Finland
b Department of Otolaryngology and Phoniatrics, Helsinki University Central Hospital, Helsinki, Finland
c Department of Otolaryngology and Phoniatrics, University of Oulu, Finland
Received 20 July 1998; received in revised form 22 February 1999; accepted 14 April 1999
Abstract
The negative peak amplitude of the differentiated glottal flow (dpeak) is known to correlate strongly with the sound
pressure level (SPL) of speech. Therefore, the function between dpeak and SPL has conventionally been modeled as a
single line. In this study, the linearity of the function between dpeak and SPL is revisited by analyzing glottal flows that
were inverse filtered from speech sounds of widely different intensities. It is shown that SPL–dpeak graphs can be
modeled more accurately by using two linear functions, the first of which models soft phonation, and the second of
which models normal and loud speech sounds. For all of the analyzed SPL–dpeak graphs, the slope of the modeling line
matching soft phonation was larger than the slope of the line for normal and loud speech. This result suggests that vocal
intensity is affected not only by the single amplitude domain value of the voice source, dpeak, but also by the shape of the
differentiated glottal flow near the instant of the negative peak. © 1999 Elsevier Science B.V. All rights reserved.
* Corresponding author. E-mail: paavo.alku@hut.fi
0167-6393/99/$ – see front matter © 1999 Elsevier Science B.V. All rights reserved.
PII: S0167-6393(99)00020-5
Keywords: Voice source analysis; Inverse filtering; Intensity regulation
1. Introduction
The role of the glottal source in regulating vocal
intensity has been studied extensively during the
past decades. Research in this area typically begins with the estimation of the glottal volume
velocity waveform using inverse filtering (Holmberg et al., 1988; Gauffin and Sundberg, 1989;
Dromey et al., 1992; Titze and Sundberg, 1992;
Sundberg et al., 1993; Sulter and Wit, 1996). The
resultant estimates for the glottal flow are then
parameterized in order to express the most important features of the source waveform in a
compressed form. Information on voice production can be extracted both from the
estimated glottal flow directly and from its first derivative. The voice source can be
parameterized by using, for example, the so-called
time-based parameters of the glottal flow (i.e.,
certain ratios between the closed phase, the opening phase and the closing phase of the glottal cycle)
(Holmberg et al., 1988; Vilkman et al., 1997). It is
also possible to extract information from the estimated glottal waveforms by measuring their amplitude features (i.e., level of the maximum flow or
level of the DC-flow) (Hertegård et al., 1990, 1992;
Hillman et al., 1990). Characteristics of the glottal
source can also be quantified in a compressed form
by using a frequency domain approach, measuring, for instance, the spectral decay of the
glottal excitation (Childers and Lee, 1991; Titze
and Sundberg, 1992; Alku et al., 1997).
The key factor behind intensity regulation of
speech is subglottal pressure (Bouhuys et al., 1968;
Titze, 1994). Increasing intensity of speech implies
raising subglottal pressure. However, other
factors of speech production, such as glottal adduction and formant frequencies, also contribute to
intensity regulation (Sundberg et al., 1993). Increasing vocal intensity by raising subglottal
pressure in general amplifies the AC-flow of a
glottal pulse, but it also makes the shape of the
glottal pulse sharper so that it contains more energy at high frequencies. In the frequency domain
this means that the intensity of a soft sound is primarily determined by the fundamental, whereas
the intensity of normal and loud speech is greatly affected by the overtones near the first formant
(Gramming and Sundberg, 1988; Sundberg et al.,
1993). Increasing subglottal pressure affects not
only the shape of a single glottal pulse but also the
repetition rate of the glottal pulses, i.e., the fundamental frequency (F0) of the voice. Many studies have been published on the relationship
between F0 and loudness (e.g., Gramming et al.,
1988; Titze and Sundberg, 1992). Gramming et al.
(1988) reported that the mean pitch increased by
about half a semitone when intensity was increased
by one decibel. According to Gramming et al.
(1988), the increased value of F0 can be considered
a passive result of raising subglottal pressure in
order to produce louder sounds. However, F0 per
se can also have an important role in intensity
regulation of speech when formant tuning is used
(Sundberg, 1990; Titze, 1994). This implies that F0
of the voice is adjusted so that one of its lowest
harmonics almost coincides with the first formant
(F1). Consequently, the level of the speech
spectrum at the first formant is amplified, which
increases the overall intensity. Increasing
vocal intensity with formant tuning calls for
large values of F0. Hence, formant tuning is a
method of intensity regulation that is used mainly
in singing or in production of high-pitched voices.
One of the most important parameters of the
glottal flow, closely related to vocal intensity, is the
negative peak amplitude of the differentiated
glottal flow (dpeak in Fig. 1). (This parameter is also
called the maximum flow declination rate (Holmberg et al., 1988).) The behavior of dpeak was studied as
a function of SPL of speech by Gauffin and
Sundberg (1989). In their study, the main finding
was that there is a strong linear correlation between dpeak (when expressed in dB units) and SPL
(also expressed in dB units). The strong correlation
between dpeak and SPL can be explained using
Fant's source-filter theory of speech production
(Fant, 1960), as follows. Production of voiced
speech is modeled according to the source-filter
theory by three separate processes: the glottal flow,
the vocal tract filtering and the lip radiation effect.
Since the lip radiation effect can be approximated by a
differentiator (Flanagan, 1972), it is possible to
combine the first process, the glottal flow, and the
third one, the lip radiation effect. Consequently,
voice production can be modeled by two processes:
the differentiated glottal flow (also called the
effective driving function) and the vocal tract filtering (Wong et al., 1979). A negative impulse-like
peak dominates the waveform of the differentiated
glottal flow, at least for normal and loud phonations. This peak, occurring at the time instant
when the rate of change of the flow reaches its
absolute maximum, serves as the main excitation
of the vocal tract (Fant, 1993). Consequently, the
level of the peak, dpeak, determines to a large extent
the energy of the produced speech signal.
The linearity of the function between SPL and
dpeak is further studied in the present study. Even
though there are well-known publications on this
issue (e.g., Gauffin and Sundberg, 1989), we consider this topic worth revisiting for the following two reasons. First, it is known that
the dynamic range of the human voice in terms of SPL can
be 70 dB or even more when pitch is
allowed to vary over its extreme values (Åkerlund
and Gramming, 1994; Titze, 1994). However, in
voice source studies where inverse filtering is applied, there seems to be a tendency not to analyze
very soft or extremely loud voices. In the study by
Holmberg et al. (1988), for example, the difference
between the mean SPL-value of loud and soft
voices was only 11.0 dB for male speakers and
11.8 dB for females. Therefore, we believe that a
further study on the SPL–dpeak function is needed
in order to determine whether the linearity of the
function holds when voices of widely different
SPL-values are analyzed. The second reason for
the present study, as shown by two examples depicted in Figs. 4 and 5, is our experimental finding
indicating a clear "knee" in the SPL–dpeak graphs
when voices are analyzed using a wide SPL-range.
According to this finding, SPL–dpeak graphs can be
modeled accurately by a linear function, but the
slope of the line is different between soft and loud
phonations.
2. Material and methods
2.1. Speech material
Fig. 1. One cycle of the glottal volume velocity waveform (a)
and its first derivative (b). fAC: amplitude of the AC-flow, fDC:
level of the DC-flow, dpeak: negative peak amplitude of the
differentiated glottal flow.
In order to measure the linearity of the SPL–dpeak function, we designed an experiment where
speech data were collected from 11 adult Finnish
speakers (five females and six males) with no history of speech, voice or hearing disorders. The
ages of the speakers varied between 42 and 54 years
for females and between 33 and 66 years for males.
Each subject produced a series of the word /pa:p:a/
(Finnish for "grandpa") by gradually increasing
loudness.
The acoustic speech pressure waveform was
recorded using a condenser microphone
(Brüel & Kjær 4176) which was placed at a distance
of 40 cm from the lips of the speaker. (The distance
was carefully monitored during the recording because a constant mouth-to-microphone distance is
essential to get reliable results for the amplitude
values of the glottal flow with our inverse filtering
method.) Recordings were made in an anechoic
chamber, and all of the subjects sat while producing the sounds.
The first phonation sample was to be produced
as softly as possible without whispering. The output level of the speech signals was controlled by
means of a sound level meter (Brüel & Kjær 2225).
By following the LED light display of the sound
level meter, the speakers were able to control the
SPL-values of their speech samples. Subjects repeated the word /pa:p:a/, increasing the SPL-value in steps of approximately 5 dB from the
softest voice up to the loudest, with an SPL-value
of 105 dB. (Some subjects voluntarily also produced the loudest sound with an SPL-value of
110 dB.) The subjects were given no other restrictions regarding their voice production, which
means that pitch and phonation type were decided
freely by the speakers during the recording. The
average number of voice samples covering the intensity range from the softest to the loudest phonation was 12 per speaker. The total number of
speech samples produced was 61 and 71 by female
and male speakers, respectively.
The acoustical speech pressure waveform was
digitized using a sampling frequency of 22,050 Hz
and a resolution of 16 bits. At the computer,
the signals were first high-pass filtered with a
linear phase FIR-filter whose cut-off frequency was
60 Hz in order to remove any possible low frequency air pressure variations picked up during
the recordings. The bandwidth of the signals was
11 kHz.
For computation of SPL-values, we recorded a
calibration signal generated by a Brüel & Kjær
4231 calibrator. SPL-values were computed on the
dB-scale for all speech signals using the root mean
square-operation (RMS) and the SPL-value of the
calibration tone (94 dB) as follows:
\[
\mathrm{SPL}\{\mathrm{speech}\} = 94\ \mathrm{dB} + 20 \log_{10} \frac{\mathrm{RMS}\{\mathrm{speech}\}}{\mathrm{RMS}\{\mathrm{calibration}\}} \qquad (1)
\]
It is worth noting that in the present study
Eq. (1) yields the SPL-value of a speech signal with
linear weighting at the distance of 40 cm from the
lips of the speaker.
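Eq. (1) can be illustrated with a minimal sketch. The function names and the synthetic signals below are hypothetical and are not data from the study; the sketch only assumes that the speech and calibration signals are available as sequences of samples recorded through the same chain.

```python
import math

def rms(samples):
    """Root mean square of a sequence of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))

def spl_db(speech, calibration, calibration_spl_db=94.0):
    """Eq. (1): SPL of speech relative to a calibration tone of known SPL."""
    return calibration_spl_db + 20.0 * math.log10(rms(speech) / rms(calibration))

# Synthetic illustration (not data from the study): a signal whose RMS is
# ten times that of the calibration tone lies 20 dB above the tone.
calibration = [math.sin(2.0 * math.pi * k / 100.0) for k in range(1000)]
speech = [10.0 * x for x in calibration]
level = spl_db(speech, calibration)
```

Because only the ratio of the two RMS values enters the formula, any constant gain in the recording chain cancels out, provided that it is the same for the speech and the calibration recordings.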
2.2. Inverse ®ltering
In order to estimate the glottal volume velocity
waveforms, we used an inverse filtering technique
similar to the one described by Alku (1992). This
inverse filtering method estimates the glottal excitation directly from the acoustic speech pressure
waveform recorded in a free field (i.e., no flow
mask is required). The method is based on the
separated speech production model by Fant
(1960). In the present study, we used a modified
version of the inverse filtering method described in
(Alku, 1992). The modification concerned modeling of the vocal tract transfer function. In the
present study, the estimation of the vocal tract
transfer function was based on a sophisticated all-pole modeling technique, called Discrete All-Pole
Modeling (DAP) (El-Jaroudi and Makhoul, 1991)
instead of the conventional Linear Predictive
Coding (LPC) which is used in (Alku, 1992). The
difference between DAP and LPC is that the former is based on the Itakura–Saito distortion criterion in determining an optimal all-pole filter,
whereas the latter uses the least squares error
criterion. Consequently, the formants of the vocal
tract, particularly the first formant, can be more
accurately estimated by DAP than by LPC (El-Jaroudi and Makhoul, 1991). An accurate modeling of the vocal tract transfer function is very
important from the point of view of glottal inverse filtering. As reported in (Alku and Vilkman,
1994), the application of DAP instead of LPC in
modeling the vocal tract transfer function decreases the amount of formant ripple in the estimated glottal flows.
The inverse filtering method used in the present
study estimates the vocal tract transfer function
using the above-mentioned DAP-technique by first
canceling the average effect of the glottal source
from the speech spectrum using a low order all-pole filtering. By scaling the DC-gain of the digital
filter that models the vocal tract to unity, it is
possible to estimate the amplitude characteristics
of the glottal flow (with arbitrary units), even
though no flow mask is used (Alku et al., 1998a).
The lip radiation effect is canceled by a first order
all-pole filter with its pole in the z-domain at
z = 0.98. In the present study, the estimation of the
glottal flow was computed using an analysis window of 50 ms. (For some of the low-pitched male
voices, the length of the time window was increased to 70 ms in order to cover at least four
glottal cycles.) From each of the estimated glottal
waveforms, the value of dpeak was determined by
computing the mean of negative peak amplitudes
of the differentiated glottal flow (see footnote 1) over four consecutive glottal cycles.
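The dpeak measurement described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the assumption that the cycle length in samples (`period`) is known in advance, and the synthetic raised-cosine flow are all hypothetical. The differentiator is the first order filter H(z) = 1.0 − 0.98z^(−1) stated in the paper, and dpeak is returned as a positive magnitude.

```python
import math

def differentiate(flow, alpha=0.98):
    """Filter the flow with H(z) = 1 - alpha * z^-1, assuming zero initial state."""
    out, prev = [], 0.0
    for x in flow:
        out.append(x - alpha * prev)
        prev = x
    return out

def d_peak(flow, period, n_cycles=4, alpha=0.98):
    """Mean magnitude of the negative peak of the differentiated flow,
    taken over n_cycles consecutive cycles of `period` samples each."""
    d = differentiate(flow, alpha)
    peaks = [-min(d[c * period:(c + 1) * period]) for c in range(n_cycles)]
    return sum(peaks) / n_cycles

def to_db(value):
    """Express a positive amplitude on the dB scale (arbitrary reference)."""
    return 20.0 * math.log10(value)

# Synthetic raised-cosine pulse train (not data from the study).
period = 100
flow = [0.5 * (1.0 - math.cos(2.0 * math.pi * n / period))
        for n in range(4 * period)]
soft = d_peak(flow, period)
loud = d_peak([2.0 * x for x in flow], period)
```

Since the differentiator is linear, doubling the flow amplitude without changing its shape doubles dpeak, i.e., raises it by about 6 dB.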
3. Results
An example of an SPL–dpeak graph obtained
from phonations of a male subject is shown in
Fig. 2. At first sight, the function between SPL and
dpeak seems to be linear over the whole SPL-range
from the softest phonation up to the loudest.
However, a closer examination of the graph reveals that dpeak seems to follow very closely a linear
function of SPL over the softest four phonations
but then, in the vicinity of 80 dB, it starts to follow
another linear function, the slope of which is
clearly smaller than the slope of the first line. In
other words, if voices of greatly different SPL-values are analyzed, it seems to be more plausible
to model the SPL–dpeak relationship using (at least)
two linear functions instead of just one.
1 The glottal flows computed in the present study were
parameterized by extracting information (i.e., dpeak) from the
first derivative of the flow. Computation of the derivative was
implemented by filtering the flow waveforms with the model of
the lip radiation effect that was used in the inverse filtering
stage. The transfer function of the differentiator was
H(z) = 1.0 − 0.98z^(−1).
Fig. 2. An example of the SPL–dpeak graph.
The reason for the "knee" in the SPL–dpeak graph between soft and normal phonations, as
shown by the example depicted in Fig. 2, can be
explained by analyzing the waveforms of the corresponding glottal flows and their derivatives. The
glottal flow and its derivative are shown for speech
sample no. 4 of Fig. 2 (i.e., the strongest of the soft
phonations) in Fig. 3(a) and (b), respectively. The
flow and the differentiated flow of speech sample
no. 5 of Fig. 2 (i.e., the softest of the normal
phonations) are shown in Fig. 3(c) and (d), respectively. It can be noticed from these graphs that
the amplitude of dpeak increases to some extent (by
25%, which equals 1.9 dB) when the intensity of the
voice rises. However, there is also a change in
the shapes of the differentiated glottal flows: the
waveform of Fig. 3(b) is much smoother during
the glottal closing phase than the signal shown
in Fig. 3(d). In the frequency domain, this implies
that the spectral decay of the differentiated glottal
flow shown in Fig. 3(d) is less than the decay of the
signal in Fig. 3(b). Therefore, the waveform of
Fig. 3(d) produces, after being filtered through the
vocal tract, a voice signal of larger SPL than the
waveform shown in Fig. 3(b). This is explained by
the fact that the amplitude of the spectral components near the first formant of the produced
speech sound will be stronger if the spectral decay
of the voice source decreases. In this example a
large increase of SPL occurs even though the
glottal flows have only slightly different values of
dpeak. When these voices are expressed in the SPL–dpeak
graph, they will have almost the same value
on the y-axis (i.e., the level of dpeak), but the sound
excited by the waveform of Fig. 3(d) will yield a
clearly larger value on the x-axis (i.e., the value of
SPL). Hence, the accuracy of the classical model
for the SPL–dpeak function that consists of a single
line deteriorates. Therefore, it is justified to
analyze how to model the SPL–dpeak relationship
more accurately by taking into account that, particularly in the soft-to-normal transitions, the
change of the SPL-value of speech is regulated not
only by the level of the negative peak of the differentiated glottal flow, dpeak, but also by the shape
of the differentiated glottal waveform near the instant of the negative peak.
Fig. 3. Waveforms estimated by inverse filtering speech samples
no. 4 and 5 of Fig. 2 (y-axis in arbitrary units). (a) Glottal flow
of speech sample no. 4. (b) Differentiated glottal flow of speech
sample no. 4. (c) Glottal flow of speech sample no. 5. (d) Differentiated glottal flow of speech sample no. 5.
Fig. 4. SPL–dpeak graph, female speaker.
Fig. 5. SPL–dpeak graph, male speaker.
SPL–dpeak graphs of each speaker were modeled
using two optimal linear functions. These lines,
denoted by lineopt,1 and lineopt,2, have been drawn
in two examples of SPL–dpeak graphs shown in
Figs. 4 and 5. The functions of the optimal lines
were determined for phonations of each speaker as
follows. The obtained 12 SPL–dpeak values of each
subject were divided into two groups, denoted by
Group1 and Group2. At first, Group1 consisted of
the SPL–dpeak values of the three softest phonations while the remaining nine SPL–dpeak values
were in Group2. An optimal linear function (in
terms of the mean square error criterion) was then
matched separately over the SPL–dpeak values of
both of the groups. Next, the obtained data points
were divided into the groups differently by taking
the four softest phonations into Group1 and the
eight loudest phonations into Group2. A new pair
of optimal linear functions was obtained by
separately matching the SPL–dpeak values of both
groups with two lines. This procedure was repeated by searching for the division of the data
points into two groups yielding a minimal mean
square error between the original SPL–dpeak values
and their linear models. Hence, lineopt,1 and lineopt,2
shown in Figs. 4 and 5 represent linear models for
the SPL–dpeak values, including the optimal way to
divide the data points into two separate groups to
be matched by two lines.
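The search for the optimal pair of lines can be sketched as below. This is an illustrative reading of the procedure, not the authors' code: the function names are hypothetical, the minimum group size of three points for both groups is an assumption (the text only states that the search starts with the three softest phonations in Group1), and the data are synthetic values constructed to contain a knee.

```python
def fit_line(xs, ys):
    """Least-squares line y = slope * x + intercept.
    Returns (slope, intercept, sum of squared errors)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, sse

def fit_two_lines(spl, dpeak, min_group=3):
    """Try every division of the SPL-sorted points into a soft group and a
    loud group; keep the division with the smallest total squared error."""
    pairs = sorted(zip(spl, dpeak))
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    best = None
    for k in range(min_group, len(xs) - min_group + 1):
        line1 = fit_line(xs[:k], ys[:k])   # candidate model for Group1
        line2 = fit_line(xs[k:], ys[k:])   # candidate model for Group2
        total = line1[2] + line2[2]
        if best is None or total < best[0]:
            best = (total, k, line1, line2)
    return best[1], best[2], best[3]

# Synthetic data (not from the study): slope 1.0 for the four softest
# samples, slope 0.5 for the remaining eight.
spl = [60 + 5 * i for i in range(12)]
dpeak = [50 + (s - 60) for s in spl[:4]] + [68 + 0.5 * (s - 80) for s in spl[4:]]
split, soft_line, loud_line = fit_two_lines(spl, dpeak)
```

Minimizing the total squared error is equivalent to minimizing the mean square error here, because the number of data points per speaker is fixed.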
The obtained SPL–dpeak values of all 11
subjects were analyzed in the same manner as
shown in Figs. 4 and 5. The optimal linear functions, lineopt,1 and lineopt,2, were quantified using
their slopes, which are denoted by slope1 and
slope2, respectively. It was found that slope1 was
larger than slope2 for the phonations of each subject. In other words, when SPL–dpeak graphs were
modeled by two linear functions that were determined optimally by minimizing the mean square
error, the linear model for the softest phonations
was a more steeply ascending line than the model
matching normal and loud phonations in all cases.
The mean (m) and the standard deviation (s.d.) of
the two slopes were for female voices as follows:
slope1: m = 0.82, s.d. = 0.29; slope2: m = 0.36,
s.d. = 0.27. For male phonations, the following
values were obtained: slope1: m = 1.14, s.d. = 0.27;
slope2: m = 0.54, s.d. = 0.34. The difference between slope1 and slope2 was statistically tested
using the Wilcoxon Signed-Rank nonparametric
test. The difference between the slope values was
statistically significant at the significance level of
p = 0.0033. The optimal linear functions modeled
the SPL–dpeak graphs accurately: the correlation
coefficient averaged over the eleven subjects was
0.94 and 0.93 when SPL–dpeak graphs were modeled with lineopt,1 and lineopt,2, respectively. Values
of SPL and dpeak are shown for the loudest speech
sample modeled by lineopt,1 and for the softest
speech sample modeled by lineopt,2 in Tables 1
and 2 for each of the female and male speakers,
respectively.
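The reported significance level can be cross-checked with a short calculation. Since slope1 exceeded slope2 for every one of the 11 subjects, all signed ranks have the same sign. The sketch below additionally assumes no tied differences and the normal approximation without continuity correction; these are assumptions of the sketch, as the paper does not state how the p-value was computed. The function name is hypothetical.

```python
import math

def wilcoxon_all_same_sign_p(n):
    """Two-sided p-value of the Wilcoxon signed-rank test under the normal
    approximation (no ties, no continuity correction) when all n paired
    differences have the same sign."""
    w_plus = n * (n + 1) / 2                        # every rank counts as positive
    mean = n * (n + 1) / 4                          # mean of W+ under H0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)  # s.d. of W+ under H0
    z = (w_plus - mean) / sd
    return math.erfc(z / math.sqrt(2.0))            # two-sided tail probability

p_value = wilcoxon_all_same_sign_p(11)
```

Under these assumptions the result is close to the reported p = 0.0033, which suggests (but does not prove) that a normal approximation was used in the original analysis.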
Table 1
Sound pressure level (SPL1) and negative peak amplitude of the
differentiated glottal flow (dpeak,1) for the loudest speech sample
before the knee in the SPL–dpeak graphs. Sound pressure level
(SPL2) and the negative peak amplitude of the differentiated
glottal flow (dpeak,2) for the softest speech sample after the knee
in the SPL–dpeak graphs. All the values are expressed in dB
units. Female speakers

Speaker   SPL1   dpeak,1   SPL2   dpeak,2
HR        72     65        73     63
HS        65     59        71     59
AS        82     69        86     69
EL        87     68        97     74
LM        84     72        88     77

Table 2
Sound pressure level (SPL1) and negative peak amplitude of the
differentiated glottal flow (dpeak,1) for the loudest speech sample
before the knee in the SPL–dpeak graphs. Sound pressure level
(SPL2) and the negative peak amplitude of the differentiated
glottal flow (dpeak,2) for the softest speech sample after the knee
in the SPL–dpeak graphs. All the values are expressed in dB
units. Male speakers

Speaker   SPL1   dpeak,1   SPL2   dpeak,2
EV        69     62        74     64
HP        70     63        74     61
PA        68     64        73     61
JK        63     57        70     56
TB        87     70        92     75
JV        66     63        68     62

The subjects of the present study were allowed
to produce speech samples freely using the pitch of
their own choice. Therefore, the fundamental frequency of the voices increased with intensity,
which is in line with previous studies where F0 has
been analyzed as a function of vocal loudness (e.g.,
Gramming et al., 1988; Dromey et al., 1992; Titze
and Sundberg, 1992). (The minimum of F0 was 125
and 75 Hz for phonations of female and male
subjects, respectively. The maximum of F0 was 500
and 315 Hz for phonations of female and male
subjects, respectively.) Hence, it is important to
analyze whether the decrease in the slope of the
two lines that model the SPL–dpeak graph is caused
by the increase of the fundamental frequency when
intensity of speech is raised.
In order to test whether the knee in the SPL–dpeak
graphs was caused by F0 or by the sharpening
of the flow derivative at the instant of the glottal
closure, we made the following computations for
all the speech samples. First, one cycle was cut
from each of the obtained glottal flow waveforms.
This single cycle of the glottal flow was first differentiated. Then it was filtered through the same
digital all-pole filter that was used as a model for
the vocal tract when the glottal flow from which
the period was cut was computed with inverse filtering. In other words, we re-synthesized one period of speech using the analysis results given by
inverse filtering (i.e., the glottal flow waveform
and the model of the vocal tract filter). Finally, the
energy of the speech period obtained, denoted by
Energy of the Synthesized Period (ESP), was
computed. The rationale for this procedure is as
follows. If we assume that the knee in the SPL–dpeak
graphs is caused by F0, then the graphs depicting dpeak as a function of ESP should not show a
similar knee. This comes from the fact that ESP is
the energy of a hypothetical speech signal that
cannot be affected by F0 because the speech signal
from which ESP is computed is produced by a
single glottal cycle. (It is worth noting that ESP is
not the same as the energy computed over one
period of the original speech signal, which can be
affected by F0, i.e., by fluctuations from previous
glottal periods.)
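The ESP computation can be sketched as follows. This is a hedged illustration only: the function names, the second order all-pole coefficients standing in for the vocal tract model, the synthetic raised-cosine flow cycle and the optional zero-padded `tail` (which lets the filter ring after the cycle ends) are all hypothetical, since the paper does not specify these details.

```python
import math

def differentiate(cycle, alpha=0.98):
    """Filter one flow cycle with H(z) = 1 - alpha * z^-1 (zero initial state)."""
    out, prev = [], 0.0
    for x in cycle:
        out.append(x - alpha * prev)
        prev = x
    return out

def all_pole_filter(x, a):
    """Filter x through 1 / A(z), with A(z) = 1 + a[1] z^-1 + a[2] z^-2 + ..."""
    y = []
    for n, xn in enumerate(x):
        acc = xn
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * y[n - k]
        y.append(acc)
    return y

def esp_db(flow_cycle, vocal_tract_a, tail=0):
    """Energy (dB, arbitrary reference) of one re-synthesized speech period."""
    excitation = differentiate(flow_cycle) + [0.0] * tail
    speech = all_pole_filter(excitation, vocal_tract_a)
    return 10.0 * math.log10(sum(s * s for s in speech))

# Hypothetical single resonance (pole radius 0.9) and synthetic flow cycle.
vocal_tract_a = [1.0, -2.0 * 0.9 * math.cos(0.4), 0.81]
cycle = [0.5 * (1.0 - math.cos(2.0 * math.pi * n / 100.0)) for n in range(100)]
esp_soft = esp_db(cycle, vocal_tract_a)
esp_loud = esp_db([2.0 * x for x in cycle], vocal_tract_a)
```

Because only a single cycle is synthesized, the result does not depend on how often the cycle would be repeated, which is the property that makes ESP insensitive to F0.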
Spectral decay was also measured for each of
the glottal waveforms in order to explain the knee
in the SPL–dpeak graphs. If the knee is caused by
the changes in the shape of the glottal flow (especially during the glottal closing phase) then the
spectral decay of the glottal source should decrease
when intensity is increased. Spectral decay of the
glottal source was quantified with two methods.
First, the difference (in dB) between the levels of
the first and the second harmonic, denoted by
H1 − H2, was computed from the spectra of the
flow waveforms (Titze and Sundberg, 1992). A
large value of H1 − H2 implies that the spectrum
of the glottal flow decays rapidly, whereas a small
value of H1 − H2 indicates that the glottal source
has more energy at higher frequencies. Second, the
parabolic spectral parameter, PSP, was determined
from the pitch-synchronously computed glottal
spectra (Alku et al., 1997). PSP quantifies the
spectral decay of a glottal flow by matching a
parabolic function (y(k) = ak^2 + b, where k denotes the discrete frequency variable) to the pitch-synchronously computed spectrum of the glottal
source. The optimal parabolic function is matched
to the power spectrum of a glottal pulse by applying the mean square error criterion. In the case
of a rapidly decaying glottal spectrum the optimal
match yields a large negative value for the parabolic parameter a. In the case of a glottal flow the
spectrum of which decays slowly the value of the
parabolic parameter a is closer to zero. It is worth
noting that in (Alku et al., 1997) a normalized
value of PSP was used whereas in the present study
the PSP-computation corresponded to searching
for the optimal parabolic parameter a without
normalization.
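The parabolic fit behind PSP can be sketched as a least-squares problem that is linear in k^2. The function name, the number of spectral bins and the synthetic spectra below are hypothetical; in particular, the scale of the spectrum values and the frequency range over which the parabola is matched are not specified in this section and are left to the caller.

```python
def fit_parabola(spectrum):
    """Least-squares fit of y(k) = a * k**2 + b to spectrum[k], k = 0..N-1.
    Returns (a, b)."""
    n = len(spectrum)
    u = [k * k for k in range(n)]   # regress y on u = k^2
    mu = sum(u) / n
    my = sum(spectrum) / n
    suu = sum((ui - mu) ** 2 for ui in u)
    suy = sum((ui - mu) * (yi - my) for ui, yi in zip(u, spectrum))
    a = suy / suu
    b = my - a * mu
    return a, b

# Synthetic spectra (not data from the study): a fast and a slow decay.
fast = [3.0 - 0.5 * k * k for k in range(8)]
slow = [3.0 - 0.1 * k * k for k in range(8)]
a_fast, b_fast = fit_parabola(fast)
a_slow, b_slow = fit_parabola(slow)
```

In line with the description above, the rapidly decaying spectrum yields the more negative value of a, while the slowly decaying spectrum yields a value closer to zero.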
An example depicting the behavior of ESP,
H1 − H2, and PSP is shown in Fig. 6 together with
the corresponding SPL–dpeak graph. In this example it can be clearly seen that the knee occurs between the fourth softest and the fifth softest speech
sample both in the SPL–dpeak graph (Fig. 6(a)) and
in the ESP–dpeak graph (Fig. 6(b)). It can also be
seen from both the H1 − H2 graph (Fig. 6(c)) and
the PSP-graph (Fig. 6(d)) that the spectral decay
of the glottal source decreases when intensity is
increased. Hence, the graphs of Fig. 6, especially
the ESP–dpeak graph, show that the knee in the
SPL–dpeak function in the transition between soft
and normal phonations is due to the changes in the
shape of the glottal flow and it cannot be explained
by the increase of F0. It can also be seen from
Fig. 6(a) and (b) that the dynamic range of SPL is
larger than that of ESP. This is explained by the
loudest speech sample, the SPL-value of which is
about 6 dB larger than the SPL-value of the second loudest speech sample, whereas the difference
in ESP between the corresponding samples is only
about 3 dB. Hence, this speaker seems to have
used F0 as a method of intensity regulation mainly
in producing the loudest speech sample (which is
also the sample with the largest F0).
Fig. 6. Scatterplots describing voice source parameters as a function of SPL. Male speaker. (a) Negative peak amplitude of the differentiated glottal flow (dpeak) as a function of SPL. (b) Negative peak amplitude of the differentiated glottal flow (dpeak) as a function of
the Energy of the Synthesized Period (ESP). (c) Difference between the levels of the first and the second harmonic (H1 − H2) in the
voice source spectrum as a function of SPL. (d) Parabolic spectral parameter (PSP) matched to the pitch-synchronously computed
voice source spectrum as a function of SPL.
Similarity between the SPL–dpeak graphs and
the ESP–dpeak graphs and the decrease in the
spectral decay were statistically analyzed from
the phonations of the eleven subjects as follows.
The ESP–dpeak graphs were modeled with two
linear functions that were optimized in the same
way as in the case of the SPL–dpeak graphs described previously. The difference between the
slopes of the two lines was tested using the Wilcoxon Signed-Rank nonparametric test. It was
found that the difference between the slopes was
statistically significant at the significance level of
p = 0.026. In other words, there was a significant
difference in the slopes of the optimal linear
functions between soft and loud speech samples
also after removing the effect of F0 with the ESP-computation. The change in the spectral decay of
the glottal flow as a function of SPL was tested as follows. The means of both H1 − H2 and PSP were computed for the phonations of each subject over speech samples in two groups. The first group consisted of the speech samples whose SPL–dpeak values were modeled by line_opt,1 (i.e., samples left of the knee in the SPL–dpeak graph). The second group consisted of the speech samples whose SPL–dpeak values were modeled by line_opt,2 (i.e., samples right of the knee in the SPL–dpeak graph). The decrease of the spectral decay of the glottal flow between the two sample groups occurred for all the subjects and for both of the spectral parameters. (The decrease of the spectral slope was statistically significant with the Wilcoxon signed-rank nonparametric test at the significance level of p = 0.0033 for both H1 − H2 and PSP.)
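The significance level reported for the spectral parameters can be checked with a short calculation. The following is a minimal sketch of a Wilcoxon signed-rank test using the large-sample normal approximation, without continuity correction and without tie correction of the variance. The text does not state which variant of the test was used, so that choice is an assumption, and the paired values are made up purely for illustration. With 11 pairs whose differences all share the same sign, as reported for both H1 − H2 and PSP, this variant gives a two-sided p of about 0.0033.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test via the normal approximation.

    Assumptions of this sketch: zero differences are dropped, tied
    absolute differences receive average ranks, and no continuity or
    tie correction is applied to the variance.
    Returns (W, p), where W is the smaller of the two signed-rank sums.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied absolute differences.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return w, p


# Hypothetical per-subject means of H1 - H2 (dB), left and right of the knee.
soft = [14.2, 12.8, 15.1, 11.9, 13.4, 16.0, 12.2, 14.9, 13.1, 15.6, 12.5]
loud = [9.1, 8.7, 10.3, 7.5, 9.9, 11.2, 8.0, 10.8, 9.4, 10.1, 8.3]
w, p = wilcoxon_signed_rank(soft, loud)
print(f"W = {w}, p = {p:.4f}")
```

An exact version of the test would give a smaller p for the same data (the smallest possible two-sided value with 11 pairs is 2/2048), so the agreement with 0.0033 is only suggestive of which variant was applied.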
All the results reported so far in the present study are based on the glottal flow waveforms estimated by inverse filtering. Therefore, in order to confirm our results, an additional analysis was made using the spectra of the radiated speech sounds per se (i.e., results given by inverse filtering were not used). This frequency domain analysis allowed us to compare our data with the previous results from phonetogram measurements (e.g., Gramming and Sundberg, 1988; Titze, 1992). According to Gramming and Sundberg (1988), the strongest spectral component in soft phonation is usually the fundamental, while in loud phonation the strongest partial is generally an overtone. In their study it was also shown that the level difference between the strongest partial and the overall SPL increases when phonation changes from soft to loud. Following the study by Gramming and Sundberg (1988), we were interested in analyzing from the radiated spectra whether the knee in the SPL–dpeak function occurs at the same point at which the strongest partial changes from F0 to an overtone near F1.
Fig. 7. Spectra of radiated speech sounds, male speaker. (a) Softest phonation. (b) Phonation just before the knee occurs in the SPL–dpeak graph. (c) Phonation just after the knee occurs in the SPL–dpeak graph.
The following three speech samples of each subject were analyzed: the softest sound (denoted by s0(n)), the speech sample that occurs just before the knee in the SPL–dpeak function (denoted by s1(n)) and the sample that occurs straight after the knee (denoted by s2(n)). Spectra were computed using an FFT of 2048 samples with Hamming windowing. Fig. 7 shows the spectra of s0(n), s1(n) and s2(n) obtained from the voice of one male subject. From these graphs it can be seen that the effect of the fundamental was most important for the intensity of the softest sound. However, the spectra of both s1(n) and s2(n) are characterized by strong partials near F1.
In order to quantitatively compare the effect of the fundamental on vocal intensity for s0(n), s1(n) and s2(n), we computed for each of these sounds the difference (in dB) between the overall energy and the energy without the fundamental. This difference was clearly largest for s0(n) (mean: 8.51 dB, standard deviation: 2.91 dB) when voices of all the subjects were analyzed. Both s1(n) and s2(n) yielded a value of energy difference that was much smaller, and the value obtained for s1(n) (mean: 0.92 dB, standard deviation: 0.94 dB) was close to that computed from s2(n) (mean: 0.36 dB, standard deviation: 0.32 dB). This confirms that SPL reflects the amplitude of the fundamental only for s0(n). However, the SPL of both s1(n) and s2(n) is strongly affected by overtones near F1.
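The energy-difference measure can be sketched in a few lines. The example below is an illustration under stated assumptions rather than the procedure actually used: it takes a 2048-sample Hamming-windowed FFT as described above, and it interprets "energy without the fundamental" as the spectral energy that remains after discarding all bins within half of F0 on either side of F0. That band definition, the sampling rate, the function names and the two synthetic test signals are all hypothetical.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def fundamental_energy_difference_db(signal, fs, f0, n_fft=2048):
    """Overall energy minus energy without the fundamental, in dB.

    Assumption: "without the fundamental" means dropping every bin
    whose frequency lies within f0 / 2 of f0.
    """
    if len(signal) < n_fft:
        raise ValueError("signal is shorter than the FFT length")
    window = [0.54 - 0.46 * math.cos(2 * math.pi * i / (n_fft - 1))
              for i in range(n_fft)]  # Hamming window
    spectrum = fft([s * w for s, w in zip(signal[:n_fft], window)])
    power = [abs(c) ** 2 for c in spectrum[: n_fft // 2 + 1]]
    bin_hz = fs / n_fft
    total = sum(power)
    without_f0 = sum(p for k, p in enumerate(power)
                     if abs(k * bin_hz - f0) > f0 / 2)
    return 10 * math.log10(total / without_f0)


def two_partials(fs, f0, a_fundamental, a_overtone, n=2048):
    """Synthetic test signal: the fundamental plus its fifth harmonic."""
    return [a_fundamental * math.sin(2 * math.pi * f0 * i / fs)
            + a_overtone * math.sin(2 * math.pi * 5 * f0 * i / fs)
            for i in range(n)]


fs, f0 = 8000.0, 110.0  # hypothetical sampling rate and F0
soft_like = two_partials(fs, f0, 1.0, 0.1)  # fundamental dominates
loud_like = two_partials(fs, f0, 0.1, 1.0)  # overtone dominates
d_soft = fundamental_energy_difference_db(soft_like, fs, f0)
d_loud = fundamental_energy_difference_db(loud_like, fs, f0)
print(f"soft-like: {d_soft:.2f} dB, loud-like: {d_loud:.2f} dB")
```

The measure is large when the fundamental carries most of the energy and close to 0 dB when an overtone dominates, which is the pattern that separates s0(n) from s1(n) and s2(n) in the reported means.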
Finally, radiated speech spectra were also analyzed in order to test whether the SPL of s2(n) was increased by formant tuning (i.e., by adjusting a harmonic closer to F1 in s2(n) than in s1(n)). For this purpose we extracted the center frequency² of F1 from the pitch-synchronously computed spectra of s1(n) and s2(n). The value obtained was then compared to the frequency of the strongest partial near F1 in the pitch-asynchronously computed spectra. This comparison showed that a spectral partial was closer to F1 in s1(n) in 9 of the 11 cases, whereas an overtone was closer to F1 in s2(n) in only 2 of the 11 cases. This finding confirms our previous result according to which the knee in the SPL–dpeak graph does not result from increasing intensity by formant tuning.
² The center frequency of F1 varied between 646 and 851 Hz for female subjects. For male speakers the center frequency of F1 varied between 528 and 635 Hz.
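The formant tuning comparison can be illustrated with a small sketch. It is not the authors' procedure: the width of the search band around F1, the function name and all numeric values are hypothetical, chosen only to show how the distance between F1 and the strongest nearby partial would be compared between s1(n) and s2(n).

```python
def distance_to_f1(partials, f1, search_band_hz=200.0):
    """Distance in Hz between F1 and the strongest partial near it.

    partials: list of (frequency_hz, level_db) pairs read from a spectrum.
    search_band_hz: hypothetical half-width of the region counted as
    "near F1". Returns None if no partial falls inside that region.
    """
    nearby = [(f, lvl) for f, lvl in partials if abs(f - f1) <= search_band_hz]
    if not nearby:
        return None
    strongest_freq, _ = max(nearby, key=lambda item: item[1])
    return abs(strongest_freq - f1)


# Made-up harmonic levels for one speaker (frequency in Hz, level in dB).
s1_partials = [(480.0, -12.0), (600.0, -3.0), (720.0, -10.0)]  # F0 = 120 Hz
s2_partials = [(420.0, -15.0), (560.0, -2.0), (700.0, -6.0)]   # F0 = 140 Hz
d1 = distance_to_f1(s1_partials, f1=600.0)
d2 = distance_to_f1(s2_partials, f1=610.0)

# Formant tuning would show up as a partial lying closer to F1 in s2 than in s1.
tuned_in_s2 = d2 < d1
print(d1, d2, tuned_in_s2)
```

Counting how often `tuned_in_s2` is true across speakers corresponds to the 2-of-11 tally reported above.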
4. Summary and conclusions
In previous studies, in particular in the classical paper by Gauffin and Sundberg (1989), it has been shown that the sound pressure level of speech follows the negative peak amplitude of the differentiated glottal flow, dpeak, in a manner close to linear. The linearity between SPL and dpeak is readdressed in the present study because of our finding indicating a clear knee in SPL–dpeak graphs when voices of greatly different intensities are analyzed. This phenomenon was quantified in the present study by modeling the SPL–dpeak graphs of 11 speakers with two optimal lines that minimize the mean square error between the SPL–dpeak values and their linear models. It was found that the line that models the SPL–dpeak graph for soft phonations had a larger slope than the line that models SPL–dpeak values for loud speech samples.
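The two-line modeling summarized above can be sketched as an exhaustive search over the position of the knee. This is a minimal illustration under assumptions, not a reproduction of the optimization used in the study: each candidate split of the SPL-sorted samples gets its own ordinary least-squares line on either side, the two lines are not forced to meet at the knee, and the split with the smallest total squared error is kept. The function names and the synthetic data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line; returns (slope, intercept, sse)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, sse


def fit_two_lines(spl, dpeak, min_points=2):
    """Split the SPL-sorted samples in two and fit one line per side.

    Returns (k, line1, line2): k is the number of samples assigned to the
    first (soft) line, and each line is a (slope, intercept) pair.
    """
    pairs = sorted(zip(spl, dpeak))
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    best = None
    for k in range(min_points, len(xs) - min_points + 1):
        left = fit_line(xs[:k], ys[:k])
        right = fit_line(xs[k:], ys[k:])
        total_sse = left[2] + right[2]
        if best is None or total_sse < best[0]:
            best = (total_sse, k, left[:2], right[:2])
    _, k, line1, line2 = best
    return k, line1, line2


# Synthetic SPL (dB) and dpeak (dB) values with a knee at 66 dB.
spl = [50.0 + 4.0 * i for i in range(11)]
dpeak = [2.0 * (x - 50.0) if x <= 66.0 else 32.0 + 0.5 * (x - 66.0) for x in spl]
k, line1, line2 = fit_two_lines(spl, dpeak)
print(f"slope for soft samples: {line1[0]:.2f}, for loud samples: {line2[0]:.2f}")
```

With measured data the two slopes would be compared across speakers, for example with the signed-rank test used in the study.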
In the production of very soft voices, a speaker typically uses a smooth glottal pulse with a small AC-amplitude. Raising intensity can be achieved by increasing the AC-amplitude and also by affecting the shape of the glottal pulse by, for example, shortening the closing phase of the glottal cycle (Sundberg et al., 1993). Both of these changes in the glottal pulse increase the amplitude of the flow derivative. Results of the present study show that when the intensity of speech is increased using minor SPL-steps, it is possible to generate two sounds with different SPL-values using practically the same level of dpeak, while the shape of the differentiated glottal flow differs during the glottal closing phase. Hence, a rise in SPL can be achieved not only by increasing the amplitude of dpeak, as suggested by the classical SPL–dpeak function reported by Gauffin and Sundberg (1989), but also by decreasing the spectral decay of the differentiated glottal flow by increasing the sharpness of the glottal flow derivative around the time-instant of dpeak. When voice intensity is changed from very soft to loud, speakers tend to make the most distinct change in their SPL–dpeak function when going from "loud soft" to "soft normal". This change can be seen as a decrease in the slope of the line that matches the SPL–dpeak graph.
Finally, we would like to point out that both the study by Gauffin and Sundberg (1989) and the present one can be considered applications of the Liljencrants–Fant (LF) model (Fant et al., 1985). This is due to the fact that dpeak is actually one of the parameters used in the LF-model. (In LF-model terminology, the notation Ee is used for dpeak.) As stated by Fant (1993), dpeak is the most important among the LF-parameters since it sets the levels of the formant amplitudes. When SPL–dpeak graphs are used in analyzing intensity regulation of speech, one is actually applying the LF-model in an extremely compressed form by modeling the differentiated voice source with only one of the four LF-parameters. Further research is needed in order to find out whether both the amplitude and the shape of the differentiated glottal flow at the instant of the glottal closure could be represented with a single numerical value. It could be possible, for example, to apply the second derivative of the glottal flow (Holmes, 1976; Hunt, 1987). It is also possible to combine two or more different voice source parameters into a single one in a similar way as has been done with the LF-model by Fant (1995) and by the present authors (Alku et al., 1998b).
References
Åkerlund, L., Gramming, P., 1994. Average loudness level, mean fundamental frequency, and subglottal pressure: Comparison between female singers and nonsingers. J. Voice 8, 263–270.
Alku, P., 1992. Glottal wave analysis with Pitch Synchronous Iterative Adaptive Inverse Filtering. Speech Communication 11, 109–118.
Alku, P., Vilkman, E., 1994. Estimation of the glottal pulseform based on discrete all-pole modeling. In: Proc. Internat. Conf. on Spoken Language Processing, Yokohama, Japan, 18–22 September, pp. 1619–1622.
Alku, P., Strik, H., Vilkman, E., 1997. Parabolic spectral parameter – A new method for quantification of the glottal flow. Speech Communication 22, 67–79.
Alku, P., Vilkman, E., Laukkanen, A.-M., 1998a. Estimation of amplitude features of the glottal flow by inverse filtering speech pressure signals. Speech Communication 24, 123–132.
Alku, P., Vilkman, E., Laukkanen, A.-M., 1998b. Parameterization of the voice source by combining spectral decay and amplitude features of the glottal flow. J. Speech Lang. Hear. Res. 41, 990–1002.
Bouhuys, A., Mead, J., Proctor, D., Stevens, K., 1968. Pressure-flow events during singing. Ann. N.Y. Acad. Sci. 155, 165–176.
Childers, D., Lee, C., 1991. Vocal quality factors: Analysis, synthesis, and perception. J. Acoust. Soc. Amer. 90, 2394–2410.
Dromey, C., Stathopoulos, E., Sapienza, C., 1992. Glottal airflow and electroglottographic measures of vocal function at multiple intensities. J. Voice 6, 44–54.
El-Jaroudi, A., Makhoul, J., 1991. Discrete all-pole modeling. IEEE Trans. Signal Process. 39, 411–423.
Fant, G., 1960. Acoustic Theory of Speech Production. Mouton, The Hague.
Fant, G., 1993. Some problems in voice source analysis. Speech Communication 13, 7–22.
Fant, G., 1995. The LF-model revisited. Transformations and frequency domain analysis. Speech Transmission Laboratory, Quarterly Progress and Status Report, Royal Institute of Technology, Stockholm, 2–3, 119–156.
Fant, G., Liljencrants, J., Lin, Q., 1985. A four-parameter model of glottal flow. Speech Transmission Laboratory, Quarterly Progress and Status Report, Royal Institute of Technology, Stockholm, 4, 1–13.
Flanagan, J., 1972. Analysis, Synthesis, and Perception of Speech. Springer, Berlin.
Gauffin, J., Sundberg, J., 1989. Spectral correlates of glottal voice source waveform characteristics. J. Speech Hear. Res. 32, 556–565.
Gramming, P., Sundberg, J., 1988. Spectrum factors relevant to phonetogram measurement. J. Acoust. Soc. Amer. 83, 2352–2360.
Gramming, P., Sundberg, J., Ternström, S., Leanderson, R., Perkins, W., 1988. Relationship between changes in voice pitch and loudness. J. Voice 2, 118–126.
Hertegård, S., Gauffin, J., Sundberg, J., 1990. Open and covered singing as studied by means of fiberoptics, inverse filtering, and spectral analysis. J. Voice 4, 220–230.
Hertegård, S., Gauffin, J., Karlsson, I., 1992. Physiological correlates of the inverse filtered flow waveforms. J. Voice 6, 224–234.
Hillman, R., Holmberg, E., Perkell, J., Walsh, M., Vaughan, C., 1990. Phonatory function associated with hyperfunctionally related vocal fold lesions. J. Voice 4, 52–63.
Holmberg, E., Hillman, R., Perkell, J., 1988. Glottal airflow and transglottal air pressure measurements for male and female speakers in soft, normal, and loud voice. J. Acoust. Soc. Amer. 84, 511–529.
Holmes, J., 1976. Formant excitation before and after glottal closure. In: Proc. IEEE Internat. Conf. on Acoustics, Speech, and Signal Process., pp. 39–42.
Hunt, M., 1987. Studies of glottal excitation using inverse filtering and an electroglottograph. In: Proc. of the 11th Internat. Congress of Phonetic Sciences, Tallinn, Estonia, 1–7 August, Vol. 3, pp. 22–26.
Sulter, A., Wit, H., 1996. Glottal volume velocity waveform characteristics in subjects with and without vocal training, related to gender, sound intensity, fundamental frequency, and age. J. Acoust. Soc. Amer. 100, 3360–3373.
Sundberg, J., 1990. What's so special about singers. J. Voice 4, 107–119.
Sundberg, J., Titze, I., Scherer, R., 1993. Phonatory control in male singing: A study of the effects of subglottal pressure, fundamental frequency, and mode of phonation on the voice source. J. Voice 7, 15–29.
Titze, I., 1992. Acoustic interpretation of the voice range profile (phonetogram). J. Speech Hear. Res. 35, 21–34.
Titze, I., 1994. Principles of Voice Production. Prentice-Hall, Englewood Cliffs, NJ.
Titze, I., Sundberg, J., 1992. Vocal intensity in speakers and singers. J. Acoust. Soc. Amer. 91, 2936–2946.
Vilkman, E., Lauri, E-R., Alku, P., Sala, E., Sihvo, M., 1997. Loading changes in time-based parameters of glottal flow waveforms in different ergonomic conditions. Folia Phoniatr. Logop. 49, 247–263.
Wong, D., Markel, J., Gray, A., 1979. Least squares glottal inverse filtering from the acoustic speech waveform. IEEE Trans. Acoust. Speech Signal Process. 27, 350–355.