AM-Demodulation of Speech Spectra and Its Application to Noise Robust Speech Recognition
The acoustic speech signal contains information about both the excitation (source) signal and the vocal tract transfer function (VTTF). There are applications where it is important to accurately estimate the VTTF and to discard variations due to changes in fundamental frequency or pitch. For example, the VTTF is often used in feature extraction for Automatic Speech Recognition (ASR). Linear Prediction Coding (LPC) analysis estimates the VTTF with an all-pole model. However, LPC-based features (for example, LPCCs) are vulnerable to background noise. Similarly, an FFT spectrum, or a smoothed version of it, is sensitive to background noise. In this paper, we introduce a noise robust technique for estimating the envelope of the speech spectrum, which contains information on the VTTF. The technique resembles amplitude demodulation in the frequency domain.

The use of the term "modulation" in this paper is different from that used by others. For example, the "modulation spectrum" [2][3] uses low-pass filters on the time trajectory of the spectrum to remove fast-changing components. In [4], the authors model speech waveforms as amplitude and frequency modulated (AM-FM) signals where formant frequencies are the frequencies of the carriers.

The linear source-filter model [5] of speech production views the speech waveform as the result of convolution between the excitation signal (which is either quasi-periodic, noise-like, or a combination of the two) and the impulse response of the vocal tract transfer function (VTTF). In the frequency domain, the speech spectrum is the result of multiplication of the source (excitation) spectrum and the VTTF, as shown in Figure 1. For voiced signals the excitation spectrum is harmonic.

The speech spectrum can also be viewed as the result of amplitude modulation (AM) in the frequency domain, with the source (excitation) spectrum being the carrier and the VTTF being the modulating signal. Typically, amplitude modulation refers to modulation in the time domain, as shown in Figure 2, where the carrier in this example is a high-frequency sinusoid and the modulating signal is a slowly varying signal. When the carrier is noise, the noise spectral envelope is modulated in a similar way.
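The frequency-domain AM view described above can be illustrated numerically: a voiced spectrum is a harmonic comb "carrier" shaped by a smooth "modulating" envelope. The sketch below uses the paper's sampling rate and FFT size; the fundamental frequency, unit line amplitudes, and Gaussian mock formants are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Frequency axis of a 1024-point rFFT at fs = 12.5 kHz (values from the paper).
fs, nfft = 12500, 1024
f = np.fft.rfftfreq(nfft, 1 / fs)

# "Carrier": idealized harmonic comb of a voiced source with f0 = 125 Hz
# (f0 and the unit line amplitudes are assumptions for illustration).
f0 = 125
bins = np.round(np.arange(f0, fs / 2, f0) / (fs / nfft)).astype(int)
comb = np.zeros_like(f)
comb[bins] = 1.0

# "Modulating signal": a smooth stand-in for the VTTF envelope
# (two Gaussian bumps as mock formants, purely illustrative).
env = np.exp(-((f - 700) / 300) ** 2) + 0.5 * np.exp(-((f - 1800) / 400) ** 2)

# Voiced spectrum = carrier (harmonic comb) modulated by the envelope:
# energy only at harmonic bins, weighted by the envelope there.
spec = comb * env
```

Recovering `env` from `spec` is exactly the demodulation problem the paper addresses.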
\[ S(k) \ast h(k) = \Big[\sum_i S(i)\,\delta(k-i)\Big] \ast h(k) = \sum_i \big[S(i)\,\delta(k-i) \ast h(k)\big] \]
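The identity above restates the linearity of convolution: smoothing the whole harmonic spectrum with the low-pass window h(k) gives the same result as smoothing each spectral line separately and summing. A quick numerical check, using a synthetic sparse spectrum and a stand-in Hann window (both assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
S = np.zeros(64)
S[::8] = rng.uniform(0.5, 1.0, 8)      # sparse "harmonic" lines S(i)
h = np.hanning(9)                       # stand-in low-pass window h(k)

# Left-hand side: convolve the full spectrum with h.
lhs = np.convolve(S, h)

# Right-hand side: convolve each line S(i)*delta(k - i) separately, then sum.
rhs = np.zeros_like(lhs)
for i in np.flatnonzero(S):
    delta = np.zeros_like(S)
    delta[i] = S[i]
    rhs += np.convolve(delta, h)

assert np.allclose(lhs, rhs)            # superposition holds
```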
The noise robustness of the technique can be understood by observing that frequency bands with low energy, such as inter-harmonic frequencies, are more susceptible to noise.
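A detector that follows the harmonic peaks and ignores the noise-prone inter-harmonic valleys can be sketched as follows. This is one plausible reading of a non-linear envelope detector, not the paper's exact NLED algorithm; the function name and the peak-interpolation rule are assumptions.

```python
import numpy as np

def nled_sketch(mag):
    """Hypothetical non-linear envelope detector: interpolate the envelope
    through local spectral maxima (harmonic peaks) only, so the noisy
    inter-harmonic valley bins never influence the estimate."""
    k = np.arange(len(mag))
    # Strict local maxima (plateaus are ignored; fine for peaky spectra).
    peaks = (mag[1:-1] > mag[:-2]) & (mag[1:-1] > mag[2:])
    idx = np.flatnonzero(peaks) + 1
    if idx.size < 2:
        return mag.copy()
    # Linear interpolation through the peak amplitudes only.
    return np.interp(k, idx, mag[idx])
```

Because valley bins are discarded, additive noise that mainly raises the valleys leaves the estimate largely unchanged.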
Figure 4: Envelope detection. (a) A simplified speech spectrum. (b) The response of the low-pass filter for envelope detection. (c) Results of the convolution between every point in (a) and (b). (d) The envelope estimated by linear demodulation (superposition), shown as the solid line in the upper part of the figure, together with the envelope detected using NLED (line with triangles).

3. USING HARMONIC DEMODULATION IN ASR

To determine whether harmonic demodulation can be used as a noise robust feature extraction method in ASR, we used it in computing Mel Frequency Cepstral Coefficients (MFCCs) and performed recognition experiments. MFCCs are the result of performing a DCT on a log spectral estimate obtained with a critical-bandwidth-like non-uniform filter bank. In our evaluations, MFCCs are calculated using the log spectral estimate of the speech signal after harmonic demodulation. A block diagram illustrating how harmonic demodulation is used is shown in Figure 6.

Figure 6: Block diagram of the harmonically demodulated MFCCs: one frame (25 ms) of the speech waveform → 1024-point FFT → harmonic demodulation → MFCC computation (Mel filter bank, logarithm, and DCT) → harmonically demodulated MFCCs.

3.1 Implementing the HDMFCC
The harmonically demodulated MFCCs (HDMFCC) are either
computed using linear demodulation or using the non-linear
demodulation technique introduced in this paper. Speech is
sampled at 12.5 kHz and 25 ms frames, overlapped by 15 ms,
are obtained with a Hamming window. Pre-emphasis is used.
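The front end described above can be sketched as follows. The pre-emphasis coefficient of 0.97 is an assumption (the paper only states that pre-emphasis is used), and the function name is illustrative.

```python
import numpy as np

def frames(x, fs=12500, frame_ms=25, overlap_ms=15, preemph=0.97):
    """Pre-emphasize and split a signal into overlapping Hamming-windowed
    frames, per the front end described above. Assumes len(x) is at
    least one frame long. The 0.97 pre-emphasis coefficient is assumed."""
    x = np.append(x[0], x[1:] - preemph * x[:-1])    # pre-emphasis filter
    flen = int(fs * frame_ms / 1000)                  # 312 samples (~25 ms)
    hop = int(fs * (frame_ms - overlap_ms) / 1000)    # 125 samples (10 ms hop)
    win = np.hamming(flen)
    n = 1 + max(0, (len(x) - flen) // hop)
    return np.stack([x[i * hop:i * hop + flen] * win for i in range(n)])

# Each frame would then be zero-padded to 1024 points for the FFT, e.g.:
# spec = np.abs(np.fft.rfft(frames(x)[0], n=1024))
```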
For each frame, a 1024-point FFT is computed, and only half the points are used because of the FFT symmetry. This corresponds to the frequency range from 0 to 6250 Hz. The characteristic of the low-pass filter used in envelope detection is shown in Figure 7. The width of the filter is 43 points, which corresponds to 525 Hz, and its magnitude is above 0.8 for about 210 Hz. The filter characteristic was optimized to achieve high accuracy in speech recognition experiments.

Figure 7: Magnitude characteristic of the low-pass filter used in envelope detection.
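The stated widths are consistent with the FFT bin spacing: at 12.5 kHz with a 1024-point FFT, each bin covers 12500/1024 ≈ 12.21 Hz, so 43 bins span roughly 525 Hz:

```python
fs, nfft = 12500, 1024
bin_hz = fs / nfft              # ≈ 12.21 Hz per FFT bin
width_hz = 43 * bin_hz          # width of the 43-point filter in Hz
print(round(width_hz))          # → 525
```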
Recognition experiments are performed on a digit recognition task using the TI46 database. For each digit, one HMM with 4 states and 2 mixtures is trained from 160 utterances spoken by 16 talkers (8 males and 8 females). Training includes 2 steps, Maximum Likelihood (ML) and Expectation Maximization (EM), with 4 iterations each. The Viterbi algorithm is used for recognition on 960 different utterances. Training is done with clean signals, and recognition with noisy signals (speech with additive speech-shaped noise) at different SNRs.

The following features were used in the experiments: 1) MFCCs; 2) HDMFCCs with linear demodulation; 3) HDMFCCs with non-linear demodulation; 4) MFCCs with peak-isolation [1] (referred to as MFCCP); 5) MFCCs with non-linear demodulation and envelope reshaping; and 6) HDMFCCs with non-linear demodulation, envelope reshaping, and peak-isolation. Results are shown in Figure 8.

Figure 8: Recognition results with additive speech-shaped noise at different SNRs (curves: MFCC; MFCC, LED; MFCC, NLED; MFCCP; MFCC, NLED, reshaping; MFCCP, NLED, reshaping).
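Mixing noise into clean speech at a prescribed SNR, as in these experiments, amounts to scaling the noise so the power ratio matches the target. A minimal sketch (the paper's speech-shaped noise itself is not reproduced here; the function name is illustrative):

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so that 10*log10(P_speech / P_noise) equals snr_db,
    then add it to the clean speech."""
    p_s = np.mean(speech ** 2)
    p_n = np.mean(noise ** 2)
    gain = np.sqrt(p_s / (p_n * 10 ** (snr_db / 10)))
    return speech + gain * noise
```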
As seen in the figure, as the SNR decreases, demodulation, whether linear or non-linear, improves recognition performance. In addition, NLED and envelope reshaping, together with a process that enhances peaks in the spectrum, dramatically improve recognition performance without a significant increase in computational cost. For example, at an SNR of 3 dB, the recognition accuracy is 38 percent for MFCCs, versus 78 percent for the proposed algorithm with peak-isolation.

4. SUMMARY AND CONCLUSION

In this paper, a novel algorithm that resembles amplitude demodulation in the frequency domain is introduced using a non-linear envelope detection (NLED) technique. The NLED relies on the amplitudes of the harmonics and avoids inter-harmonic valleys. The algorithm differs from linearly smoothing the speech spectrum or deconvolving the source and vocal tract impulse response. The technique is noise robust since envelope detection does not take into account frequency regions of low signal energy. The same principle is used to reshape the envelope after it is detected. The algorithm is then used to construct an ASR feature extraction module. It is shown that this technique achieves superior performance to MFCCs in the presence of background noise. Recognition accuracy is further improved if peak isolation [1] is also performed.

Acknowledgments

Work supported in part by NSF and by funding from HRL through the UC-MICRO program.

5. REFERENCES

1. Strope, B. and Alwan, A., "A model of dynamic auditory perception and its application to robust word recognition", IEEE Trans. on Speech and Audio Processing, vol. 5, no. 5, 1997, pp. 451-464.
2. Kanedera, N., Hermansky, H., and Arai, T., "On properties of modulation spectrum for robust automatic speech recognition", Proc. ICASSP '98, vol. 2, pp. 613-616.
3. Greenberg, S. and Kingsbury, B.E.D., "The modulation spectrogram: in pursuit of an invariant representation of speech", Proc. ICASSP '97, vol. 3, pp. 1647-1650.
4. Potamianos, A. and Maragos, P., "Speech analysis and synthesis using an AM-FM modulation model", Speech Communication, vol. 28, no. 3, 1999, pp. 195-209.
5. Fant, G., "The Acoustic Theory of Speech Production", 's-Gravenhage: Mouton, 1960.
6. Haykin, S., "Communication Systems", New York: Wiley, 1978.