AM-Demodulation of Speech Spectra and Its Application to Noise Robust Speech Recognition
The acoustic speech signal contains information about both the excitation (source) signal and the vocal tract transfer function (VTTF). There are applications where it is important to accurately estimate the VTTF and to discard variations due to changes in fundamental frequency or pitch. For example, the VTTF is often used in feature extraction for Automatic Speech Recognition (ASR). Linear Prediction Coding (LPC) analysis estimates the VTTF with an all-pole model. However, LPC-based features (for example, LPCCs) are vulnerable to background noise. Similarly, an FFT spectrum, or a smoothed version of it, is sensitive to background noise. In this paper, we introduce a noise robust technique for estimating the envelope of the speech spectrum, which contains information on the VTTF. The technique resembles amplitude demodulation in the frequency domain.

The use of the term "modulation" in this paper is different from that used by others. For example, the "modulation spectrum" [2][3] uses low-pass filters on the time trajectory of the spectrum to remove fast-changing components. In [4], the authors model speech waveforms as amplitude and frequency modulated (AM-FM) signals where formant frequencies are the frequencies of the carriers.

The linear source-filter model [5] of speech production views the speech waveform as the result of convolution between the excitation signal (which is either quasi-periodic, noise-like, or a combination of the two) and the impulse response of the vocal tract transfer function (VTTF). In the frequency domain, the speech spectrum is the result of multiplication of the source (excitation) spectrum and the VTTF, as shown in Figure 1. For voiced signals the excitation spectrum is harmonic.

The speech spectrum can also be viewed as the result of amplitude modulation (AM) in the frequency domain, with the source (excitation) spectrum being the carrier and the VTTF being the modulating signal. Typically, amplitude modulation refers to modulation in the time domain, as shown in Figure 2, where the carrier in this example is a high-frequency sinusoid and the modulating signal is a slowly varying signal. When the carrier is noise, the noise spectral envelope is modulated in a similar way.
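The frequency-domain AM view described above can be illustrated numerically: a voiced spectrum is a harmonic comb "carrier" shaped by a smooth "modulating" envelope. The sketch below uses the paper's sampling rate and FFT size; the fundamental frequency, unit line amplitudes, and Gaussian mock formants are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Frequency axis of a 1024-point rFFT at fs = 12.5 kHz (values from the paper).
fs, nfft = 12500, 1024
f = np.fft.rfftfreq(nfft, 1 / fs)

# "Carrier": idealized harmonic comb of a voiced source with f0 = 125 Hz
# (f0 and the unit line amplitudes are assumptions for illustration).
f0 = 125
bins = np.round(np.arange(f0, fs / 2, f0) / (fs / nfft)).astype(int)
comb = np.zeros_like(f)
comb[bins] = 1.0

# "Modulating signal": a smooth stand-in for the VTTF envelope
# (two Gaussian bumps as mock formants, purely illustrative).
env = np.exp(-((f - 700) / 300) ** 2) + 0.5 * np.exp(-((f - 1800) / 400) ** 2)

# Voiced spectrum = carrier (harmonic comb) modulated by the envelope:
# energy only at harmonic bins, weighted by the envelope there.
spec = comb * env
```

Recovering `env` from `spec` is exactly the demodulation problem the paper addresses.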
\[ S(k) \ast h(k) = \Big[\sum_i S(i)\,\delta(k-i)\Big] \ast h(k) = \sum_i \big[S(i)\,\delta(k-i) \ast h(k)\big] \]
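The identity above restates the linearity of convolution: smoothing the whole harmonic spectrum with the low-pass window h(k) gives the same result as smoothing each spectral line separately and summing. A quick numerical check, using a synthetic sparse spectrum and a stand-in Hann window (both assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
S = np.zeros(64)
S[::8] = rng.uniform(0.5, 1.0, 8)      # sparse "harmonic" lines S(i)
h = np.hanning(9)                       # stand-in low-pass window h(k)

# Left-hand side: convolve the full spectrum with h.
lhs = np.convolve(S, h)

# Right-hand side: convolve each line S(i)*delta(k - i) separately, then sum.
rhs = np.zeros_like(lhs)
for i in np.flatnonzero(S):
    delta = np.zeros_like(S)
    delta[i] = S[i]
    rhs += np.convolve(delta, h)

assert np.allclose(lhs, rhs)            # superposition holds
```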
The noise robustness of the technique can be understood by observing that frequency bands with low energy, such as inter-harmonic frequencies, are more susceptible to noise.
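A detector that follows the harmonic peaks and ignores the noise-prone inter-harmonic valleys can be sketched as follows. This is one plausible reading of a non-linear envelope detector, not the paper's exact NLED algorithm; the function name and the peak-interpolation rule are assumptions.

```python
import numpy as np

def nled_sketch(mag):
    """Hypothetical non-linear envelope detector: interpolate the envelope
    through local spectral maxima (harmonic peaks) only, so the noisy
    inter-harmonic valley bins never influence the estimate."""
    k = np.arange(len(mag))
    # Strict local maxima (plateaus are ignored; fine for peaky spectra).
    peaks = (mag[1:-1] > mag[:-2]) & (mag[1:-1] > mag[2:])
    idx = np.flatnonzero(peaks) + 1
    if idx.size < 2:
        return mag.copy()
    # Linear interpolation through the peak amplitudes only.
    return np.interp(k, idx, mag[idx])
```

Because valley bins are discarded, additive noise that mainly raises the valleys leaves the estimate largely unchanged.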
Figure 4: Envelope detection. (a) A simplified speech spectrum. (b) The response of the low-pass filter for envelope detection. (c) Results of the convolution between every point in (a) and (b). (d) The envelope estimated by linear demodulation (superposition), shown as the solid line in the upper part of the figure, together with the envelope detected using NLED (line with triangles).

3. USING HARMONIC DEMODULATION IN ASR

To determine whether harmonic demodulation can be used as a noise robust feature extraction method in ASR, we used it in computing Mel Frequency Cepstral Coefficients (MFCCs) and performed recognition experiments. MFCCs are the result of performing a DCT on a log spectral estimate obtained with a critical-bandwidth-like non-uniform filter bank. In our evaluations, MFCCs are calculated using the log spectral estimate of the speech signal after harmonic demodulation. A block diagram illustrating how harmonic demodulation is used is shown in Figure 6.

Figure 6: Block diagram of the harmonically demodulated MFCCs: one frame (25 ms) of the speech waveform → 1024-point FFT → harmonic demodulation → MFCC computation (Mel filter bank, logarithm, and DCT) → harmonically demodulated MFCCs.

3.1 Implementing the HDMFCC
The harmonically demodulated MFCCs (HDMFCC) are either
computed using linear demodulation or using the non-linear
demodulation technique introduced in this paper. Speech is
sampled at 12.5 kHz and 25 ms frames, overlapped by 15 ms,
are obtained with a Hamming window. Pre-emphasis is used.
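The front end described above can be sketched as follows. The pre-emphasis coefficient of 0.97 is an assumption (the paper only states that pre-emphasis is used), and the function name is illustrative.

```python
import numpy as np

def frames(x, fs=12500, frame_ms=25, overlap_ms=15, preemph=0.97):
    """Pre-emphasize and split a signal into overlapping Hamming-windowed
    frames, per the front end described above. Assumes len(x) is at
    least one frame long. The 0.97 pre-emphasis coefficient is assumed."""
    x = np.append(x[0], x[1:] - preemph * x[:-1])    # pre-emphasis filter
    flen = int(fs * frame_ms / 1000)                  # 312 samples (~25 ms)
    hop = int(fs * (frame_ms - overlap_ms) / 1000)    # 125 samples (10 ms hop)
    win = np.hamming(flen)
    n = 1 + max(0, (len(x) - flen) // hop)
    return np.stack([x[i * hop:i * hop + flen] * win for i in range(n)])

# Each frame would then be zero-padded to 1024 points for the FFT, e.g.:
# spec = np.abs(np.fft.rfft(frames(x)[0], n=1024))
```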
For each frame, a 1024-point FFT is computed, and only half the points are used because of the FFT symmetry. This corresponds to the frequency range from 0 to 6250 Hz. The characteristic of the low-pass filter used in envelope detection is shown in Figure 7. The width of the filter is 43 points, which corresponds to 525 Hz, and its magnitude is above 0.8 for about 210 Hz. The filter characteristic was optimized to achieve high accuracy in speech recognition experiments.

Figure 7: Magnitude characteristic of the low-pass filter used in envelope detection.
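The stated widths are consistent with the FFT bin spacing: at 12.5 kHz with a 1024-point FFT, each bin covers 12500/1024 ≈ 12.21 Hz, so 43 bins span roughly 525 Hz:

```python
fs, nfft = 12500, 1024
bin_hz = fs / nfft              # ≈ 12.21 Hz per FFT bin
width_hz = 43 * bin_hz          # width of the 43-point filter in Hz
print(round(width_hz))          # → 525
```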
Recognition experiments are performed on a digit recognition task using the TI46 database. For each digit, one HMM with 4 states and 2 mixtures is trained from 160 utterances spoken by 16 talkers (8 males and 8 females). Training includes 2 steps, Maximum Likelihood (ML) and Expectation Maximization (EM), with 4 iterations each. The Viterbi algorithm is used for recognition on 960 different utterances. Training is done with clean signals, and recognition with noisy signals (speech with additive speech-shaped noise) at different SNRs.

The following features were used in the experiments: 1) MFCCs; 2) HDMFCCs with linear demodulation; 3) HDMFCCs with non-linear demodulation; 4) MFCCs with peak-isolation [1] (referred to as MFCCP); 5) MFCCs with non-linear demodulation and envelope reshaping; and 6) HDMFCCs with non-linear demodulation, envelope reshaping, and peak-isolation. Results are shown in Figure 8.

Figure 8: Recognition results with additive speech-shaped noise at different SNRs (curves: MFCC; MFCC, LED; MFCC, NLED; MFCCP; MFCC, NLED, reshaping; MFCCP, NLED, reshaping).
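Mixing noise into clean speech at a prescribed SNR, as in these experiments, amounts to scaling the noise so the power ratio matches the target. A minimal sketch (the paper's speech-shaped noise itself is not reproduced here; the function name is illustrative):

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so that 10*log10(P_speech / P_noise) equals snr_db,
    then add it to the clean speech."""
    p_s = np.mean(speech ** 2)
    p_n = np.mean(noise ** 2)
    gain = np.sqrt(p_s / (p_n * 10 ** (snr_db / 10)))
    return speech + gain * noise
```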
As seen in the figure, as the SNR decreases, demodulation, whether linear or non-linear, improves recognition performance. In addition, NLED and envelope reshaping, together with a process that enhances peaks in the spectrum, dramatically improve recognition performance without a significant increase in computational cost. For example, at an SNR of 3 dB, the recognition accuracy is 38 percent for MFCCs, versus 78 percent for the proposed algorithm with peak-isolation.

4. SUMMARY AND CONCLUSION

In this paper, a novel algorithm that resembles amplitude demodulation in the frequency domain is introduced using a non-linear envelope detection (NLED) technique. The NLED relies on the amplitudes of the harmonics and avoids inter-harmonic valleys. The algorithm differs from linearly smoothing the speech spectrum or deconvolving the source and vocal tract impulse response. The technique is noise robust since envelope detection does not take into account frequency regions of low signal energy. The same principle is used to reshape the envelope after it is detected. The algorithm is then used to construct an ASR feature extraction module. It is shown that this technique achieves superior performance to MFCCs in the presence of background noise. Recognition accuracy is further improved if peak isolation [1] is also performed.

Acknowledgments

Work supported in part by NSF and by funding from HRL through the UC-MICRO program.

5. REFERENCES

1. Strope, B. and Alwan, A., "A model of dynamic auditory perception and its application to robust word recognition", IEEE Trans. on Speech and Audio Processing, vol. 5, no. 5, 1997, pp. 451-464.
2. Kanedera, N., Hermansky, H., and Arai, T., "On properties of modulation spectrum for robust automatic speech recognition", Proc. ICASSP '98, vol. 2, pp. 613-616.
3. Greenberg, S. and Kingsbury, B.E.D., "The modulation spectrogram: in pursuit of an invariant representation of speech", Proc. ICASSP '97, vol. 3, pp. 1647-1650.
4. Potamianos, A. and Maragos, P., "Speech analysis and synthesis using an AM-FM modulation model", Speech Communication, vol. 28, no. 3, 1999, pp. 195-209.
5. Fant, G., "The Acoustic Theory of Speech Production", 's-Gravenhage: Mouton, 1960.
6. Haykin, S., "Communication Systems", New York: Wiley, 1978.