McCree MixedExcitationLPCVocoderModel ieeetSAP95
IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 3, NO. 4, JULY 1995
Authorized licensed use limited to: Petros Maragos. Downloaded on June 09,2010 at 14:43:37 UTC from IEEE Xplore. Restrictions apply.
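[Block diagrams. LPC vocoder synthesizer: periodic pulse train or white noise, LPC synthesis filter, synthesized speech. Mixed excitation LPC synthesizer (Fig. 2): periodic pulse train with position jitter and white noise, each passed through a shaping filter set by the voicing strengths, then adaptive spectral enhancement, LPC synthesis filter, and pulse dispersion filter, giving the synthesized speech.]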
A. Mixed Excitation
The most important feature of the model shown in Fig. 2 is
the mixed pulse and noise excitation. Since the most annoying
aspect of the speech output from the basic LPC vocoder is
a strong buzzy quality, LPC vocoders have previously been
proposed with mixtures of pulse and noise excitation [4], [6],
[8]. Mixed excitations are also commonly used in formant
synthesizers [11], [12] and have recently been applied in the
context of sinusoidal coding [13], [14].
We have developed a mixed excitation LPC synthesizer that
can generate an excitation signal with different mixtures of
pulse and noise in each of a number (4-10) of frequency bands
[15]. As shown in Fig. 2, the pulse train and noise sequence
are each passed through time-varying spectral shaping filters
and then added together to give a fullband excitation. For each
frame, the frequency shaping filter coefficients are generated
by a weighted sum of fixed bandpass filters. The pulse filter is
calculated as the sum of each of the bandpass filters weighted
by the voicing strength in that band. The noise filter is
generated by a similar weighted sum, with weights set to keep
the total pulse and noise power constant in each frequency band.
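The weighted-sum construction of the two shaping filters can be sketched as follows. This is only an illustration under stated assumptions: the voicing strength of each band is treated as an amplitude gain v between 0 and 1, the noise gain is taken as sqrt(1 - v^2) so that the pulse and noise powers sum to one in each band (assuming unit-power sources), and the two-tap band filters in the usage note are hypothetical toy filters, not the bandpass filters of the vocoder.

```python
import math


def shaping_filters(band_filters, voicing):
    """Build pulse and noise shaping filters as weighted sums of fixed
    bandpass FIR filters (one coefficient list per band).

    voicing[i] is the assumed amplitude gain of the pulse in band i.
    The noise gain sqrt(1 - v^2) is an assumption chosen so that the
    squared gains add up to one in every band.
    """
    n_taps = len(band_filters[0])
    pulse = [0.0] * n_taps
    noise = [0.0] * n_taps
    for coeffs, v in zip(band_filters, voicing):
        noise_gain = math.sqrt(max(0.0, 1.0 - v * v))
        for i, c in enumerate(coeffs):
            pulse[i] += v * c
            noise[i] += noise_gain * c
    return pulse, noise


# Hypothetical two-band split: a crude lowpass and a crude highpass
# whose sum is a unit impulse (an all-pass, fullband response).
TOY_BANDS = [[0.5, 0.5], [0.5, -0.5]]
```

With `TOY_BANDS`, a fully voiced frame (`voicing=[1.0, 1.0]`) gives a fullband pulse filter and an all-zero noise filter, while `voicing=[1.0, 0.0]` sends the pulse through the low band only and the noise through the high band only.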
less than one, this moves the poles of the LPC synthesis filter
away from the unit circle in the z-plane and weakens the pole
resonances. An α greater than one will sharpen the resonances
or even make the filter unstable. Simply sharpening the LPC
poles with a fixed α slightly greater than one does improve
the waveform match in many places, but it introduces chirpy
sounds to the synthetic speech, presumably due to occasional
quasi-stable LPC filters. A better approach is to sharpen the
LPC filter for the half of the pitch period starting with the
pitch pulse, and weaken it for the other half. Depending upon
the precise values of α used, this can sharpen the overall
response, while avoiding the steady sinusoidal response typical
of a quasi-stable filter. It can also provide a better match to
natural speech waveforms. Unfortunately, the performance of
this pole modulation technique degrades with quantization of
the LPC filter coefficients, since the bandwidths of the poles of
the quantized LPC filter often vary significantly from frame
to frame.
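The effect of the bandwidth expansion factor on the pole radius can be checked numerically. The sketch below assumes the usual construction, in which the k-th coefficient of the LPC polynomial A(z) is scaled by the k-th power of the factor, so that every pole radius is multiplied by that factor. The two-pole resonator and the helper names are illustrative assumptions, not part of the vocoder.

```python
import math


def bandwidth_expand(a, factor):
    """Scale coefficient a[k] of A(z) = sum a[k] z^-k by factor**k.

    This replaces A(z) by A(z / factor), which multiplies the radius
    of every root by `factor`.
    """
    return [c * factor ** k for k, c in enumerate(a)]


def two_pole_resonator(radius, angle):
    """A(z) whose roots are the conjugate pair radius * exp(+-j*angle)."""
    return [1.0, -2.0 * radius * math.cos(angle), radius * radius]


def pole_radius(a):
    """Pole radius of a second-order A(z) with complex-conjugate roots
    (the product of the two roots equals a[2] / a[0])."""
    return math.sqrt(a[2] / a[0])
```

For a resonance with pole radius 0.95, a factor of 0.9 moves the poles in to radius 0.855 (weaker resonance), while a factor of 1.1 pushes them outside the unit circle, which is the unstable case mentioned in the text.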
The adaptive spectral enhancement filter provides a simpler
solution to the problem of matching formant waveforms.
This adaptive pole/zero filter is widely used in CELP coders
[20], [21] since it is intended to reduce quantization noise
in between the formant frequencies. The poles are generated
by a bandwidth expanded version of the LPC synthesis filter,
with α equal to 0.8. Since this all-pole filter introduces a
disturbing lowpass filtering effect by increasing the spectral
tilt, a weaker all-zero filter calculated with α equal to 0.5 is
used to decrease the tilt of the overall filter without reducing
the formant enhancement. In addition, a simple first-order FIR
filter is used to further reduce the lowpass muffling effect.
In the mixed excitation LPC vocoder, reducing quantization
noise is not a concern, but the time-domain properties of
this filter produce an effect similar to pitch-synchronous pole
bandwidth modulation. As shown in Fig. 5, a simple decaying
resonance has a less abrupt time-domain attack when this
enhancement filter is applied. This feature allows the LPC
vocoder speech output to better match the bandpass waveform
properties of natural speech in formant regions, and it increases
the perceived quality of the synthetic speech.
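A minimal sketch of the pole/zero enhancement filter described above, assuming the bandwidth expansion factors 0.8 (poles) and 0.5 (zeros) are applied by scaling the k-th LPC coefficient by the k-th power of the factor. The first-order FIR coefficient is not given in this section, so `tilt` below is a hypothetical parameter, and the filtering routine is a plain direct-form implementation rather than the coder's own code.

```python
def expand(a, factor):
    """Bandwidth-expand A(z): scale a[k] by factor**k."""
    return [c * factor ** k for k, c in enumerate(a)]


def iir_filter(b, a, x):
    """Direct-form filter with a[0] == 1:
    y[n] = sum_k b[k] x[n-k] - sum_{k>=1} a[k] y[n-k]."""
    y = []
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        acc -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y.append(acc)
    return y


def enhance(a, x, pole_factor=0.8, zero_factor=0.5, tilt=0.0):
    """Apply A(z / zero_factor) / A(z / pole_factor) to x, followed by
    the first-order FIR y[n] - tilt * y[n-1] (tilt is hypothetical)."""
    y = iir_filter(expand(a, zero_factor), expand(a, pole_factor), x)
    return [y[n] - tilt * (y[n - 1] if n > 0 else 0.0) for n in range(len(y))]
```

Passing a unit impulse as `x` returns the impulse response of the enhancement filter for the LPC polynomial `a`; setting both factors to the same value cancels poles against zeros and leaves the input unchanged.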
D. Pulse Dispersion Filter
The pulse dispersion filter shown in Fig. 2 improves the
match of bandpass filtered synthetic and natural speech waveforms in frequency bands which do not contain a formant
resonance. At these frequencies, the synthetic speech often
decays to a very small value between the pitch pulses. This
is also true for frequencies near the higher formants, since
these resonances decay significantly between excitation points,
especially for the longer pitch periods of male speakers.
In these cases, the bandpass filtered natural speech has a
smaller peak-to-valley ratio than the synthetic speech. In
natural speech, the excitation may not all be concentrated at the
point in time corresponding to closure of the glottis [22]. This
additional excitation prevents the natural bandpass envelope
from falling as low as the synthetic version. This could be due
to a secondary excitation peak from the opening of the glottis,
aspiration noise resulting from incomplete glottal closure, or
Fig. 5. Natural speech versus decaying resonance waveforms: (a) first formant of natural speech vowel; (b) synthetic exponentially decaying resonance; (c) pole/zero enhancement filter impulse response for this resonance; (d) enhanced decaying resonance.
Fig. 6. Synthetic triangle pulse and FIR filter: (a) triangle waveform; (b) filter coefficients after spectral flattening with length-65 DFT; (c) Fourier transform (DTFT) after spectral flattening.
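The construction named in the Fig. 6 caption can be sketched as follows, assuming that "spectral flattening" means taking a length-65 DFT of the triangle pulse, setting every magnitude to one while keeping the phase, and transforming back. The triangle width of 21 samples and all function names are assumptions made for illustration.

```python
import cmath
import math


def dft(x):
    """Direct O(N^2) discrete Fourier transform (adequate for N = 65)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse of dft()."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)) / n
            for t in range(n)]


def triangle_pulse(length=65, width=21):
    """Symmetric triangle of `width` samples, zero-padded to `length`.
    The width is a hypothetical choice."""
    half = (width - 1) / 2.0
    return [1.0 - abs(t - half) / (half + 1.0) if t < width else 0.0
            for t in range(length)]


def dispersion_filter(pulse):
    """Keep the DFT phase of `pulse`, force unit magnitude, go back to time."""
    spectrum = dft(pulse)
    flat = [v / abs(v) if abs(v) > 1e-12 else 1.0 + 0.0j for v in spectrum]
    return [c.real for c in idft(flat)]
```

Because every DFT magnitude equals one, the resulting 65-tap FIR filter leaves the magnitude at the DFT frequencies untouched and only spreads the excitation energy in time, which is the stated purpose of the pulse dispersion filter.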
TABLE I
2400-b/s MIXED EXCITATION LPC VOCODER BIT ALLOCATION
LPC coefficients (10 LSPs)
gain (2 per frame)
pitch and overall voicing
bandpass voicing
5-1
aperiodic flag
TOTAL: 54 bits
TABLE II
CLEAN INPUT DAM TEST SCORES
Speech Coder            6 Speaker   Male   Female
2400 bps DoD LPC-10e    53.2        54.0   54.7
2400 bps ME LPC         57.7        58.9   59.8
4800 bps DoD CELP       61.5        62.6   63.5
4800 bps ME LPC         61.3        61.6   61.6

used to evaluate various Fourier series magnitude or phase
distortions, and it can also be used to reproduce transmitted
estimates of the magnitude or phase.
An interesting application of the Fourier series expansion
is to reproduce the magnitude or phase of selected harmonics
of the pitch fundamental. This requires the difficult task of
accurately estimating these values. Early versions of this
speech coder used a Fourier series analysis of the LPC residual
signal. This can give good estimates of both magnitude and
phase, but it depends on accurately locating individual pitch
pulses. A more reliable analysis technique is to perform a
DFT of an entire frame of the LPC residual signal, and to
estimate the Fourier series magnitudes by the largest DFT
magnitudes within frequency bands corresponding to each
pitch harmonic. This approach, which is similar to the analysis
used in sinusoidal speech coding [13], provides accurate and
consistent estimates of the Fourier magnitudes since it requires
no timing alignment. Unfortunately, it does not measure the
phase of the excitation signal relative to the pitch pulse since
it contains a large unknown linear phase term.
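The frame-based estimate described above (largest DFT magnitude within the band around each pitch harmonic) can be sketched as follows. The band edges halfway between neighboring harmonics, the DFT size, and the function names are assumptions made for illustration; the direct DFT stands in for an FFT so the sketch needs only the standard library.

```python
import cmath
import math


def magnitude_spectrum(frame, n_fft=512):
    """Magnitudes of the zero-padded DFT of `frame`, bins 0 .. n_fft/2."""
    return [abs(sum(s * cmath.exp(-2j * math.pi * k * t / n_fft)
                    for t, s in enumerate(frame)))
            for k in range(n_fft // 2 + 1)]


def harmonic_magnitudes(residual, pitch_period, n_fft=512, max_harmonics=18):
    """Estimate one magnitude per pitch harmonic as the largest DFT
    magnitude in the band centered on that harmonic.

    pitch_period is in samples. Bands are assumed to extend half a
    harmonic spacing to either side of the harmonic frequency.
    """
    spectrum = magnitude_spectrum(residual, n_fft)
    bins_per_harmonic = n_fft / pitch_period
    top = len(spectrum) - 1
    mags = []
    for h in range(1, max_harmonics + 1):
        lo = int(round((h - 0.5) * bins_per_harmonic))
        hi = min(int(round((h + 0.5) * bins_per_harmonic)), top)
        if lo > top:  # harmonic lies above half the sampling rate
            break
        mags.append(max(spectrum[lo:hi + 1]))
    return mags
```

Only magnitudes are returned: no pitch pulse location is needed, which is why this estimate requires no timing alignment, and also why it says nothing about phase relative to the pitch pulse.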
To evaluate the capabilities of Fourier series modeling, a
4800-b/s speech coder has been developed. This coder is based
on the 2400-b/s mixed excitation LPC vocoder with the four
features described in the previous section but also includes
coding of the Fourier series magnitudes. The magnitudes are
estimated with a 512-point fast Fourier transform (FFT) of
each frame of the LPC residual signal generated by inverse
filtering the input speech with the quantized LPC filter. To
obtain a 4800-b/s bit rate, the magnitudes of the first 18
harmonics are divided by the average over all the harmonics, and
the logarithms of these normalized magnitudes are uniformly
quantized and coded with three bits each. The remaining
harmonic magnitudes are synthesized with a fixed value of
one, and all the phases are set to zero to align the harmonics
into a single pulse per pitch period.
V. PERFORMANCE EVALUATION
The 2400- and 4800-b/s LPC mixed excitation vocoders
have undergone both informal and formal listening tests.
Informal listening on a database of about 20 speakers shows
that these coders can produce high quality speech for both male
and female speakers. In addition, the coders maintain good
performance in acoustic background noise. In a synthetic white
noise environment, the mixed excitation produces natural
sounding speech without obvious artifacts such as buzz or
thumps. In standard military communications environments
such as airplanes, tanks, and helicopters, the new coders still
produce natural sounding speech, although the noise itself
sounds somewhat distorted. The 4800-b/s coder performs
better than the 2400-b/s version due to the additional spectral
information from the Fourier series magnitudes. This appears
to provide improvement in producing nasals, in reproducing
the identity of a particular speaker, and in the quality of vowels
in broad-band acoustic background noise. These are all cases
where the all-pole assumption inherent in the LPC model may
not be accurate, and presumably the better representation of
the Fourier spectrum can compensate for this mismatch.
To determine how well this mixed excitation vocoder performs
in comparison to existing speech coders, formal subjective
listening tests have been conducted. Since the goal of
this work is to produce natural sounding synthetic speech, the
overall user acceptability of the processed speech has been
measured with the diagnostic acceptability measure (DAM)
[26] as performed by Dynastat. This test is widely used to
evaluate low bit rate speech coders, so there is a substantial
literature of DAM scores for various speech coding algorithms
[21], [27]. In addition to the 2400-b/s mixed excitation LPC
and 4800-b/s Fourier series mixed excitation LPC vocoders
described previously, two U.S. government standard speech
coders were also included in the DAM testing: 2400-b/s DoD
LPC-10e v.55 and 4800-b/s DoD CELP release 3.2 [21].
LPC-10e is a significantly enhanced version of the classical LPC
vocoder standard LPC-10, and it includes some of the features
developed by Kang and Everett [7], [8]. This vocoder has
previously been reported to score a 54 on the DAM test, which
represents a significant improvement over the earlier score
of 47 for LPC-10 [21], [27]. The tests were run on a speech
database consisting of 12 sentences from each of three male
speakers and three female speakers, and all processing was
done on a SUN workstation. Additional testing was done with
synthetic white noise added to the same speech input. The
noise was generated by a Gaussian random number generator,
and the SNR over the six-speaker database was about 8 dB.
The DAM test results for both clean and noisy speech are
shown in Tables II and III. The clean speech DAM scores
confirm that the 2400-b/s mixed excitation LPC vocoder produces
speech which is significantly better than the current standard
LPC-10e vocoder. In fact, the DAM score for the mixed
excitation LPC is closer to the higher rate 4800-b/s CELP
standard than to LPC-10e. For the noisy speech, all the scores
are low due to the annoying amount of background noise,
but the speech can still be clearly understood. In this difficult
environment, the new coder is clearly superior to LPC-10e, and
even performs slightly better than the higher rate standard. In
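The Fourier magnitude coding of the 4800-b/s coder described earlier (first 18 harmonics normalized by the average over all harmonics, logarithms uniformly quantized with three bits each, remaining harmonics synthesized as one) can be sketched as follows. The quantizer range `lo`..`hi`, the use of the natural logarithm, and the function names are assumptions, since the text does not specify them.

```python
import math


def encode_magnitudes(mags, n_coded=18, bits=3, lo=-1.0, hi=1.0):
    """Return one quantizer index per coded harmonic.

    Each magnitude is divided by the average over all harmonics, its
    natural log is clamped to [lo, hi] (hypothetical range) and mapped
    to one of 2**bits uniformly spaced levels.
    """
    avg = sum(mags) / len(mags)
    step = (hi - lo) / (2 ** bits - 1)
    indices = []
    for m in mags[:n_coded]:
        value = math.log(max(m, 1e-12) / avg)
        value = min(max(value, lo), hi)
        indices.append(int(round((value - lo) / step)))
    return indices


def decode_magnitudes(indices, n_total, bits=3, lo=-1.0, hi=1.0):
    """Rebuild normalized magnitudes; harmonics beyond the coded ones
    are set to the fixed value one."""
    step = (hi - lo) / (2 ** bits - 1)
    decoded = [math.exp(lo + i * step) for i in indices]
    return decoded + [1.0] * (n_total - len(decoded))
```

With 18 coded harmonics at three bits each, the magnitudes add 54 bits per frame to the parameters of the 2400-b/s coder.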
VI. CONCLUSION
We have presented a mixed excitation LPC vocoder model
that can produce high quality speech at low bit rates. The
model maintains the efficiency of a fully parametric LPC
vocoder model, but it adds more free parameters to the excitation signal so the synthesizer can mimic more characteristics of
natural human speech. In addition, the requirement for a single
binary voicing decision is eliminated, so the vocoder performs
well even in the presence of severe acoustic background noise.
This mixed excitation model is based on the traditional LPC
vocoder with either a periodic impulse train or white noise
exciting an all-pole filter but contains four additional features:
mixed pulse and noise excitation, periodic or aperiodic pulses,
adaptive spectral enhancement, and pulse dispersion filter.
Each of these capabilities is intended to remove a particular
distortion from the synthetic speech. The mixed excitation
eliminates the buzzy quality usually associated with LPC
vocoders by allowing frequency-dependent voicing strength.
A separate aperiodic voiced state is added so the synthesizer can reproduce erratic glottal pulses without introducing
tonal noises. Adaptive spectral enhancement sharpens the formant resonances and improves the bandpass filtered waveform
match between synthetic and natural speech in the frequency
bands including formant frequencies. The pulse dispersion
filter allows the LPC synthesizer to better match waveforms
away from the formant regions by introducing time-domain
spread to the excitation signal.
To verify the performance of the mixed excitation LPC
vocoder model, we have developed and implemented a 2400-b/s LPC vocoder. This speech coder is machine portable since
it is written in the C language, but it runs in real-time on
a special hardware platform using a fast DSP microprocessor.
Informal and formal subjective testing of this coder has shown
that it performs significantly better than the current state of
the art at such a low bit rate. Additional testing on a 4800-b/s
vocoder, which also includes Fourier magnitudes of the true
excitation signal, demonstrates the performance improvement
possible with more accurate spectral representation in the
parametric model.
This work could be extended in a number of ways. The
mixed excitation LPC vocoder model could be further improved, perhaps by better utilizing the available Fourier series
information. The design of the 2400-b/s LPC vocoder could be
tailored for specific applications by addressing issues such as
computational efficiency, channel errors, and input microphone
response. Finally, the mixed excitation LPC model could be
applied to other problems such as very low bit rate speech
coding (800-1200 b/s), speech synthesis, pitch scaling, or time
scale modification.
REFERENCES
[1] B. S. Atal and S. L. Hanauer, "Speech analysis and synthesis by linear
prediction of the speech wave," J. Acoust. Soc. Amer., vol. 50, pp.
637-655, Aug. 1971.
[2] F. Itakura and S. Saito, "Analysis synthesis telephony based on the
maximum likelihood method," in Proc. Rep. 6th Int. Congr. Acoust.,
Aug. 1968, pp. C17-C20.
[3] T. E. Tremain, "The government standard linear predictive coding
algorithm: LPC-10," Speech Technol., pp. 40-49, Apr. 1982.
[4] J. Makhoul, R. Viswanathan, R. Schwartz and A. W. F. Huggins, "A
mixed-source model for speech compression and synthesis," J. Acoust.
Soc. Amer., vol. 64, pp. 1577-1581, Dec. 1978.
[5] O. Fujimura, "An approximation to voice aperiodicity," IEEE Trans.
Audio Electroacoust., vol. AE-16, pp. 68-72, Mar. 1968.
[6] S. Y. Kwon and A. J. Goldberg, "An enhanced LPC vocoder with no
voiced/unvoiced switch," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-32, pp. 851-858, Aug. 1984.
[7] G. S. Kang and S. S. Everett, "Improvement of the narrowband LPC
analysis," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing,
Boston, 1983, pp. 89-92.
[8] ___, "Improvement of the excitation source in the narrowband linear
prediction vocoder," IEEE Trans. Acoust., Speech, Signal Processing,
vol. ASSP-33, pp. 377-386, Apr. 1985.
[9] M. R. Sambur, A. E. Rosenberg, L. R. Rabiner, and C. A. McGonegal,
"On reducing the buzz in LPC synthesis," J. Acoust. Soc. Amer., vol.
63, pp. 918-924, Mar. 1978.
[10] D. Y. Wong, "On understanding the quality problems of LPC speech,"
in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1980, pp.
725-728.
[11] D. H. Klatt, "Review of text-to-speech conversion for English," J.
Acoust. Soc. Amer., vol. 82, pp. 737-793, Sept. 1987.
[12] J. N. Holmes, "The influence of glottal waveform on the naturalness
of speech from a parallel formant synthesizer," IEEE Trans. Audio
Electroacoust., vol. AE-21, pp. 298-305, June 1973.
[13] R. McAulay, T. Parks, T. Quatieri and M. Sabin, "Sine-wave amplitude
coding at low data rates," in Advances in Speech Coding. Norwell,
MA: Kluwer, 1991, pp. 203-214.
[14] M. Brandstein, J. Hardwick, and J. Lim, "The multiband excitation
speech coder," in Advances in Speech Coding. Norwell, MA: Kluwer,
1991, pp. 215-224.
[15] A. V. McCree and T. P. Barnwell III, "Improving the performance
of a mixed excitation LPC vocoder in acoustic noise," in Proc. IEEE
Int. Conf. Acoust., Speech, Signal Processing, San Francisco, 1992, pp.
II137-II140.
[16] A. V. McCree, "A new LPC vocoder model for low bit rate speech
coding," Ph.D. thesis, Georgia Inst. Technol., Atlanta, GA, Aug. 1992.
[17] W. Hess, Pitch Determination of Speech Signals. Vienna, NY:
Springer, 1983.
[18] A. V. McCree and T. P. Barnwell III, "A new mixed excitation LPC
vocoder," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing,
Toronto, 1991, pp. 593-596.
[19] D. L. Thomson and D. P. Prezas, "Selective modeling of the LPC
residual during unvoiced frames: White noise or pulse excitation," in
Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Tokyo, 1986,
pp. 3087-3090.
[20] J. H. Chen and A. Gersho, "Real-time vector APC speech coding at
4800 bps with adaptive postfiltering," in Proc. IEEE Int. Conf. Acoust.,
Speech, Signal Processing, Dallas, 1987, pp. 2185-2188.
[21] J. P. Campbell Jr., T. E. Tremain, and V. C. Welch, "The DoD 4.8
kbps standard (proposed federal standard 1016)," in Advances in Speech
Coding. Norwell, MA: Kluwer, 1991, pp. 121-133.
[22] J. N. Holmes, "Formant excitation before and after glottal closure,"
in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1976, pp.
39-42.
[23] A. E. Rosenberg, "Effect of glottal pulse shape on the quality of natural
vowels," J. Acoust. Soc. Amer., vol. 49, pp. 583-590, 1971.
[24] A. V. McCree and T. P. Barnwell III, "Implementation and evaluation
of a 2400 bps mixed excitation LPC vocoder," in Proc. IEEE Int. Conf.
Acoust., Speech, Signal Processing, Minneapolis, 1993, pp. II159-II162.
[25] B. S. Atal and N. David, "On synthesizing natural-sounding speech by
linear prediction," in Proc. Int. Conf. Acoust., Speech, Signal Processing,
1979, pp. 44-47.
[26] W. D. Voiers, "Diagnostic acceptability measure for speech communications systems," in Proc. IEEE Int. Conf. Acoust., Speech, Signal
Processing, 1977, pp. 204-207.
[27] C. Smith, "Relating the performance of speech processors to the bit
error rate," Speech Technol., pp. 41-53, Sept. 1983.