Speech Communication 84 (2016) 66–82
Human-inspired modulation frequency features for noise-robust ASR
Sara Ahmadi a,b,∗, Bert Cranen a, Lou Boves a, Louis ten Bosch a, Antal van den Bosch a

a Center for Language Studies, Radboud University, PO Box 9600, NL-6500 HD Nijmegen, The Netherlands
b Speech Processing Research Laboratory, Electrical Engineering Department, Amirkabir University of Technology, Hafez Avenue, Tehran 15914, Iran

∗ Corresponding author at: Center for Language Studies, Radboud University, PO Box 9600, NL-6500 HD Nijmegen, The Netherlands. E-mail addresses: s.ahmadi@let.ru.nl (S. Ahmadi), b.cranen@let.ru.nl (B. Cranen), l.boves@let.ru.nl (L. Boves), l.tenbosch@let.ru.nl (L. ten Bosch), a.vandenbosch@let.ru.nl (A. van den Bosch).

http://dx.doi.org/10.1016/j.specom.2016.09.003
Article info
Article history:
Available online 19 September 2016
Keywords:
Modulation frequency
Auditory model
Noise-robust ASR
Abstract
This paper investigates a computational model that combines a frontend based on an auditory model
with an exemplar-based sparse coding procedure for estimating the posterior probabilities of sub-word
units when processing noisified speech. Envelope modulation spectrogram (EMS) features are extracted
using an auditory model which decomposes the envelopes of the outputs of a bank of gammatone filters
into one lowpass and multiple bandpass components. Through a systematic analysis of the configuration
of the modulation filterbank, we investigate how and why different configurations affect the posterior
probabilities of sub-word units by measuring the recognition accuracy on a semantics-free speech
recognition task. Our main finding is that representing speech signal dynamics by means of multiple
bandpass filters typically improves recognition accuracy. This effect is particularly noticeable in very
noisy conditions. In addition, we find that for maximum noise robustness the bandpass filters should focus on low modulation frequencies. This reinforces our intuition that noise robustness can be increased by exploiting redundancy in those frequency channels whose integration time is long enough not to suffer from envelope modulations that are solely due to noise. The ASR system we design based on these findings behaves more similarly to human recognition of noisified digit strings than conventional ASR systems do. Thanks to the relation between the modulation filterbank and the procedures for computing dynamic acoustic features in conventional ASR systems, these findings can also be used to improve the frontends of such systems.
© 2016 Published by Elsevier B.V.
1. Introduction
Over the past decades a substantial body of neurophysiological
and behavioral knowledge about the human auditory system has
been accumulated. Psycho-acoustic research has provided detailed
information about the frequency and time resolution capabilities
of the human auditory system (e.g. Fletcher, 1940; Zwicker et al.,
1957; Kay and Matthews, 1972; Bacon and Viemeister, 1985; Houtgast, 1989; Houtgast and Steeneken, 1985; Drullman et al., 1994;
Dau et al., 1997a; 1997b; Ewert and Dau, 2000; Chi et al., 2005;
Moore, 2008; Jørgensen and Dau, 2011; Jørgensen et al., 2013). It is
now generally assumed that the rate with which the tonotopic representations in the cochlea change over time, the so-called modulation frequencies, is a crucial aspect of the intelligibility of speech
signals. Drullman et al. (1994) showed that modulation frequencies between 4 Hz and 16 Hz carry the bulk of the information in
Corresponding author at: Center for Language Studies, Radboud University, POBox 9600, NL-6500 HD Nijmegen, the Netherlands.
E-mail addresses: s.ahmadi@let.ru.nl (S. Ahmadi), b.cranen@let.ru.nl
(B. Cranen), l.boves@let.ru.nl (L. Boves), l.tenbosch@let.ru.nl (L. ten Bosch),
a.vandenbosch@let.ru.nl (A. van den Bosch).
∗
http://dx.doi.org/10.1016/j.specom.2016.09.003
0167-6393/© 2016 Published by Elsevier B.V.
speech signals. Modulation frequencies around 4 Hz roughly correspond to the number of syllables per second in normal speech; the
highest modulation frequencies are most likely related to changes
induced by transitions between phones.¹ Despite the fact that several attempts have been made to integrate the concept of modulation frequencies in automatic speech recognition (ASR) (e.g., Hermansky, 1997; Kanedera et al., 1998; Kanedera et al., 1999; Hermansky, 2011; Schädler et al., 2012; Moritz et al., 2015), these investigations have not led to the crucial breakthrough in noise-robust ASR that was hoped for. The performance gap between human speech recognition (HSR) and ASR is still large, especially for
speech corrupted by noise (e.g. Lippmann, 1996; Sroka and Braida,
2005; Meyer et al., 2011; Meyer, 2013).
For meaningful connected speech, part of the advantage of humans is evidently due to semantic predictability, but also in tasks
where there is no semantic advantage, such as in recognizing digit
sequences (Meyer, 2013) or phonemes (Meyer et al., 2011), humans
tend to outperform machines substantially. Therefore, it must be
assumed that acoustic details that are important in human processing are lost in feature extraction or in the computation of posterior probabilities in ASR systems.

¹ Brainstem research indicates that the human brain has access to modulation frequencies up to at least 250 Hz. Such modulation frequencies might allow resolving the fundamental frequency of voiced speech, which would provide interesting perspectives for understanding speech in, for instance, multi-speaker environments. However, we limit ourselves to the modulation frequency range that pertains to articulatory induced changes in the spectrum.
There is convincing evidence that some information is lost if
(noisy) speech signals are merely represented as sequences of
spectral envelopes. Demuynck et al. (2004) showed that it is possible to reconstruct intelligible speech from a sequence of MFCC vectors, but when Meyer et al. (2011) investigated the recognition accuracy of re-synthesized speech in noise by human listeners, they
found that in order to achieve the same phoneme recognition accuracy as with the original speech, the re-synthesized speech required a signal-to-noise ratio (SNR) that was 10 dB higher (3.8 dB
versus −6.2 dB).
In Macho et al. (2002) it was shown that an advanced frontend that implements a dynamic noise reduction prior to the computation of MFCC features reduces the word error rate. Meyer
(2013) showed that advanced features, such as power-normalized
cepstral coefficients (PNCC) (Kim and Stern, 2009) and Gabor filter
features (Schädler et al., 2012) improve recognition accuracy compared to default MFCCs. The advanced frontend, the PNCC and the
Gabor filter features introduce characteristics of the temporal dynamics of the speech signals that go beyond static coefficients enriched by adding deltas and delta-deltas. Therefore, it is quite likely
that both HSR and ASR suffer from the fact that a conventional
frontend that samples the spectral envelope at a rate of 100 times
per second and then adds first and second order time derivatives
yields an impoverished representation of crucial information about
the dynamic changes in noisy speech.
The research reported here is part of a long-term enterprise aimed at understanding human speech comprehension by
means of a computational model that is in conformity with the
(neuro)physiological knowledge. For that purpose we want to build
a simulation that not only makes equally few, but also the same
kind of recognition errors as humans in tasks that do not involve
elusive semantic processing. As a first step in that direction we investigate the performance of ASR systems with frontends inspired
by an auditory model that has proved to predict intelligibility quite
accurately in conditions with additive stationary noise, reverberation, and non-linear processing with spectral subtraction (Elhilali
et al., 2003; Jørgensen and Dau, 2011; 2014; Jørgensen et al., 2013).
In addition, we investigate how an exemplar-based procedure for
estimating the posterior probabilities of sub-word units interacts
with the auditory-based frontends.
Auditory models predict speech intelligibility on the basis of the difference between the long-term average power of the noise and
the speech signals at the output of the peripheral auditory system
(Jørgensen and Dau, 2011). However, it is evident that the long-term power spectrum of a speech signal is not sufficient for speech
recognition. Auditory models are silent about all the processing of
their outputs that is necessary to accomplish speech recognition.
As a consequence, it is not clear whether an auditory model that
performs well in predicting intelligibility for humans based on the
noise envelope power ratio, such as the SNRenv model (Jørgensen and Dau, 2011), is also optimal in an ASR system that most probably processes the output of the auditory model in a different way
than humans do.
The modulation filterbank in the auditory frontend proposed in e.g. Jørgensen and Dau (2011, 2014) and Jørgensen et al. (2013) consists of a lowpass filter (LPF) and a number of bandpass filters
(BPFs) that together cover the modulation frequency band up to
20 Hz. In our work we will vary the cut-off frequency of the
LPF, as well as the number and center frequencies of the BPFs.
In this respect, our experiments are somewhat similar to the experiments reported in Moritz et al. (2015), who aimed to harness knowledge about the human auditory system to improve the
conventional procedure for enriching MFCCs with delta and delta-delta coefficients. In our research the focus is on understanding
how and why resolving specific details in the modulation spectrum improves recognition performance, rather than on obtaining
the highest possible recognition accuracy. The way in which we
use sparse coding for estimating the likelihood of sub-word units
in noise-corrupted speech is very different from the approach pioneered by Gemmeke et al. (2011), who tried to capture the articulatory continuity in speech by using exemplars that spanned
300 ms. In Ahmadi et al. (2014) it was shown that single-frame
samples of the output of a modulation filterbank capture a comparable amount of information about articulatory continuity. In that
paper we designed the modulation filterbank based on knowledge
collected from relevant literature on the impact of different modulation bands on clean speech recognition. Here, we extend that
work substantially by experimenting with conceptually motivated
designs of the filterbank.
All theories of human speech comprehension (e.g. Cutler, 2012)
and all extant ASR systems (e.g. Rabiner and Juang, 1993; Huang
et al., 2001; Holmes and Holmes, 2001) assume that speech recognition hinges on recognizing words in some lexicon, and that these words are represented in the form of a limited number of sub-word units. Recognition after the frontend is assumed to comprise two additional processes, viz. estimating the likelihoods of
sub-word units and finding the sequence of words that is most
likely given the sub-word unit likelihoods. Both computational
models of HSR (e.g. ten Bosch et al., 2013; 2015) and ASR prefer
statistical models or, alternatively, neural network models for estimating sub-word model likelihoods, and some sort of finite state
transducer for finding the best path through the sub-word unit lattice.
Despite the analogy between artificial neural networks and the
operation of the brain, and despite the fact that networks of spiking neurons have been shown to be able to approximate arbitrary
statistical distributions (e.g. Buesing et al., 2011), there is no empirical evidence in support of a claim that human speech processing makes use of statistical models of sub-word units. Therefore,
we decided to explore the possibility that the estimation of the
likelihoods of sub-word units is mediated by an exemplar-based
procedure (Goldinger, 1998). Exemplar-based procedures offer several benefits, compared to GMM-based approaches. An advantage
that is especially beneficial for our work is that exemplar-based
approaches can handle high-dimensional feature vectors, without
the need for dimensionality reduction procedures that are likely to
mix up tonotopic features that are clean and features that are corrupted by some kind of ‘noise’. In addition, exemplar-based representations are compatible with recent findings about the representation of auditory patterns in human cortex (Mesgarani et al.,
2014a; 2014b) and models of memory formation and retrieval (e.g.
Wei et al., 2012; Myers and Wallis, 2013).
De Wachter et al. (2007) have shown that an exemplar-based
approach to automatic speech recognition is feasible when using
MFCCs and GMMs. More recently, Ahmadi et al. (2014); Gemmeke
et al. (2011) have shown that noise-robust ASR systems can be
built using exemplar-based procedures in combination with sparse
coding (e.g. Lee and Seung, 1999; Olshausen and Field, 2004; Ness
et al., 2012). Geiger et al. (2013) have shown that the exemplar-based SC approach can be extended to handle medium-vocabulary noise-robust ASR. In sparse coding procedures a (possibly very large) dictionary of exemplars of speech and noise is used to represent unknown incoming observations as a sparse sum of the exemplars in the dictionary.
The seminal research at Bell Labs by Fletcher (1940, 1953) provides evidence for the hypothesis that speech processing relies on matching incoming signals to stored knowledge in separate frequency bands. That insight has been explored for the purpose of
noise-robust ASR in the form of multi-stream processing (Misra,
2006). We apply the same insight to the frequency bands in the
modulation spectrum: we assume that the high-dimensional modulation spectrum contains enough features that are not affected by
the noise, so that they will dominate the distance measure in a
sparse coding engine. The probability that ‘clean’ bands exist will
depend on the design details of the modulation filter (and on the
noise characteristics).
A sparse coding engine that represents noisy speech in the form
of sparse sums of clean speech and pure noise exemplars can operate in three main ways. If it starts with matching noise exemplars,
the operation is reminiscent of noise suppression and spectral subtraction (e.g. Kolossa and Haeb-Umbach, 2011). If the engine starts
with matching speech exemplars, its operation is reminiscent of
missing data approaches and glimpsing (Cooke, 2006). Combinations of both strategies can also be envisaged. A third possible
strategy, and the strategy used in this paper, is treating the noise
and speech exemplars in the exact same way, leaving it to the
solver whether an unknown exemplar is first matched to speech
or noise exemplars.
To maximize the possibility for comparing our results to previous research, we develop our initial system using the aurora-2
data set. Although one might argue that the aurora-2 task is not
representative of a general speech recognition task, the task does
not limit the generalizability of the insight gained. Actually, the design of aurora-2 is beneficial for our current purpose for two reasons. First, recognizing connected digit strings does not require an
advanced language model; the fact that all sequences of two digits
are equally probable minimizes the interference between the frontend and the backend. This set-up also corresponds to research on
human speech intelligibility, which is often based on short semantically unpredictable (and therefore effectively meaningless) utterances. Second, the literature contains a number of benchmarks to
which the current results can be compared. In our experiments we
will follow the conventional approach to the aurora-2 task which
requires estimating the posterior probabilities of 176 speech and 3
silence states in a hidden Markov model.
Fig. 1. Block diagram of the noise-robust ASR system.

2. System overview
The recognition system used in this work is depicted schematically in Fig. 1. We discern three main processing blocks. In the first
block, acoustic features are extracted every 10 ms from the speech
signal using the same type of signal processing as employed in the
speech-based envelope power spectrum model (sEPSM) proposed
by Jørgensen and Dau (2011) and Jørgensen et al. (2013). The sEPSM
model contains more simplifying assumptions than the auditory
model proposed in Chi et al. (2005), but the models are very similar in spirit. The feature extraction block is described in more detail in Section 2.1. The second block uses the outputs of the modulation filters for estimating the posterior probabilities of the 179
sub-word units (HMM-states) in aurora-2 by means of a sparse
coding (SC) approach (Ahmadi et al., 2014; Gemmeke et al., 2011).
This block is explained in detail in Section 2.2. Finally, the third
block is a conventional Viterbi decoder that finds the most likely
word sequence combining prior and posterior probabilities of the
179 model states. This block is described in Section 2.3.
2.1. Feature extraction

Fig. 2 shows a diagram of the feature extraction module. An auditory filterbank consisting of 15 gammatone filters is applied to the 8 kHz speech signal x(t) and forms a set of sub-band signals X_g(t), g = 1, ..., 15. The center frequencies of the gammatone filters range from F_1 = 125 Hz to F_15 = 3150 Hz, distributed along a log-frequency scale with 1/3rd octave spacing. The gammatone filters were implemented in the time domain. The envelope of each gammatone filter output is then calculated as the magnitude of the analytic signal using the Hilbert transform:

$$E_g(t) = \left| X_g(t) + j \cdot \mathrm{Hilbert}(X_g(t)) \right| . \qquad (1)$$
The model proposed in Chi et al. (2005) uses 24 filters per octave. However, it is widely agreed (e.g. Moore, 2008) that a 1/3rd octave gammatone filterbank captures all detail in auditory signals that is relevant for speech recognition. Therefore, the design of the gammatone filterbank is kept constant in all experiments.
The 15 sub-band envelopes are downsampled to 100 Hz and
then fed into a bank of M + 1 modulation frequency filters, one
lowpass and M bandpass filters. Thus, the output of the modulation filterbank consists of 15 · (M + 1 )-dimensional feature vectors.
In Section 3 we evaluate the impact on recognition performance
when the number of modulation bandpass filters and the way in
which their center frequencies are distributed on the frequency
axis are varied.
In the modulation filterbank we used a first-order Butterworth lowpass filter (downward slope −6 dB/oct) and a set of second-order bandpass filters with quality factor Q = 1 (rising and falling slopes of +6 and −6 dB/oct, respectively), since a filterbank consisting of Q = 1 filters simulated the intelligibility of human listeners best (e.g. Jørgensen and Dau, 2011; 2014; Jørgensen et al., 2013). The modulation filterbanks were also implemented in the time domain.
The operation of the feature extraction module is illustrated
in Fig. 2. The left-hand column shows the operation in the frequency domain. The right-hand column shows two snapshots of
the operation in the time domain. The top panel shows the envelope of the output of the gammatone filter with center frequency
Fg = 315 Hz for an utterance of the digit string “zero-six”. The bottom panel shows the decomposition of this envelope in its modulation frequency components. The all-positive blue curve in the
right-hand bottom panel is the output of the low pass filter; the
other curves in this panel represent the output of the modulation
bandpass filters. The complete output of the modulation filterbank is a set of time signals E_{m,g}(t) which represent the mth modulation frequency component, centered at F_m Hz, of the gth gammatone sub-band envelope at F_g Hz. The envelopes at the outputs of the
gammatone filters can be approximately reconstructed by means of Eq. (2):²

$$\sum_{m=1}^{M+1} E_{m,g}(t) \approx E_g(t), \qquad g = 1, 2, \ldots, 15. \qquad (2)$$
The bottom panel in the left-hand column in Fig. 2 shows the
amplitudes of the outputs of nine modulation frequency filters for
each of the 15 gammatone filters for the utterance “zero-six”. We will refer to this representation as the envelope modulation spectrogram (EMS) in the remainder of the paper. The EMS feature vector is obtained by stacking the decomposed sub-band envelopes.
² Depending on the spacing of the center frequencies of the filters, the approximation of Eq. (2) may be more or less accurate. If a non-uniform resolution over frequency is considered desirable, the resulting sum is a “distorted” version of the original envelope in which the more densely represented frequencies are over-represented/emphasized.
Fig. 2. Block diagram of the feature extraction module. Left column: system operation in frequency domain. Right column: examples of time domain representations.
Because the signal envelopes are downsampled to 100 Hz, we obtain an EMS feature vector every 10 ms (which we will, analogously to customary ASR terminology, refer to as feature frames).
Contrary to conventional Mel filter feature extraction, the vector
elements do not apply to fixed analysis windows of 25 ms that
are shifted with a step size of 10 ms. Instead, the effective time
context spanned by the feature value in a modulation band depends on the duration of the impulse response of the corresponding modulation filter. Ahmadi et al. (2014) found that retaining the
phase information of the modulation frequency components, i.e.,
not compensating for the group delay and refraining from applying
full-wave rectification to the filter outputs, had a beneficial effect
on recognition performance. A similar result was found in Moritz
et al. (2015). Therefore, we refrained from reverting to magnitude
features and any form of group delay compensation.
2.2. Computation of posterior probabilities
The sparse coding procedure needs a dictionary of speech and
noise exemplars.

Fig. 3. Block diagram of the posterior probability computation block. A sample posterior probability matrix is visualized on the right side of the figure. The activation vector (S) and state posterior probability vector (P) of a single time frame of the sample signal are shown in the bottom part of the figure.

In all experiments in this paper we used a dictionary that comprises 17,148 speech exemplars and 13,504 noise exemplars. For each configuration of the modulation filterbank
a new dictionary was constructed. Exemplars consist of a single
feature frame (EMS vector). Given the amplitude response of the
modulation filters with the lowest center frequencies, information
about continuity of spectral changes over time is preserved in the
EMS features. For all configurations of the modulation filterbank
the exact same time frames extracted from the training set in
aurora-2 were used as exemplars.
The speech and noise exemplars were obtained by means of a
semi-random selection procedure. We made sure that we had the
same number of exemplars from female and male speakers, and almost the same number of exemplars associated with the 179 states
in the aurora-2 task. For that purpose we labeled the clean training speech by means of a conventional HMM system using forced
alignment. Most states were represented by 98 exemplars in the
dictionary. The remaining states, which had fewer frames in the
training material, were represented by at least 86 exemplars. To
obtain the noise exemplars the noise signals were reconstructed
by subtracting the clean speech from the noisified speech in the
multi-condition training set. The resulting signals were processed
by the modulation frequency frontend, and the noise exemplars
were randomly selected from these output signals.
As can be seen in Fig. 3, the procedure for estimating posterior probabilities of sub-word units consists of several steps. The
first step involves a normalization of the EMS features (i.e., standard deviation equalization and Euclidean-normalization), the second implements the reconstruction of unknown observations as a
sparse sum of exemplars in a dictionary (sparse coding), and the
third step converts the exemplar activations to posterior probabilities.
Standard deviation equalization and Euclidean-normalization. We
used a Lasso procedure for reconstructing EMS vectors as a sparse
sum of exemplars from the dictionary (Efron et al., 2004). Lasso is
able to handle the positive and negative components in the EMS
vectors. The Lasso procedure minimizes the root mean square of
the difference between an observation and its reconstruction. The
range and variance of the components of the EMS vectors differ
considerably (Ahmadi et al., 2014). To make sure that all gammatone bands can make an effective contribution to the distance
measure, some equalization in the EMS vectors is required. We follow the strategy used in Ahmadi et al. (2014), in which the standard deviations of the samples of the gammatone envelope signals
E_g(t) within each modulation band are equalized using weights obtained from the speech exemplars in the dictionary. Each E_{m,g}(t) is multiplied by an equalization weight w_g:

$$w_g = \frac{1}{M+1} \sum_{m=1}^{M+1} \frac{1}{\sigma_{15\cdot(m-1)+g}} \qquad \text{for } 1 \le g \le 15, \qquad (3)$$

where σ_i (i = 15·(m−1)+g), 1 ≤ i ≤ 15·(M+1), is the standard deviation of the ith element of the speech dictionary exemplars.
With this procedure the standard deviation of these modified features is equalized within each modulation band, while the relative importance of the different modulation bands is retained. The
equalization weights were recomputed for each configuration of
the modulation filterbank.
Algorithms for finding the optimal representation of unknown
observations in the form of a sparse sum of exemplars are sensitive to the (Euclidean) norm of the observations and exemplars.
Therefore, we normalized all exemplars and all unknown feature
vectors to unit Euclidean norm. However, for speech-silence segmentation, information about the absolute magnitude of the filter
outputs is needed. We used the unnormalized EMS vectors for that
purpose.
Sparse coding. Unknown observations $\overrightarrow{\mathrm{EMS}}(t)$ are reconstructed as a sparse linear combination of exemplars from a dictionary A that contains both speech and noise exemplars,

$$\overrightarrow{\mathrm{EMS}}(t) \approx \sum_{n=1}^{N} s_n \mathbf{a}_n = AS, \qquad (4)$$

where S is a sparse weight vector that contains the non-negative exemplar activation scores of the dictionary exemplars that minimize the Euclidean distance between the test vector $\overrightarrow{\mathrm{EMS}}(t)$ and the reconstructed version, subject to a sparsity constraint (controlled by λ):

$$\min_{S} \left\| \overrightarrow{\mathrm{EMS}}(t) - AS \right\|_2 \quad \text{s.t.} \quad \|S\|_1 < \lambda. \qquad (5)$$
From activations to posterior probabilities. The exemplar activation scores must be converted into state posterior probabilities. For that purpose, we use the state labels of the speech exemplars in the dictionary. As the exemplar dictionary A = [A_s, A_n] is the concatenation of a speech and a noise dictionary, the activation vector S in Eq. (5) can be split into two separate parts, S = [S_s; S_n], indicating the weights corresponding to speech and noise exemplars, respectively. Since the noise exemplar activations are irrelevant for estimating the posterior state probabilities, we ignore the noise exemplar activations (S_n). With L_{1×N_{A_s}} the label vector (N_{A_s} = 17,148 is the number of speech exemplars), and the ith element 1 ≤ L_i ≤ 179 representing the label of the ith exemplar in the speech dictionary, we compute a cumulative state activation vector C in which each element C_j, j = 1, 2, ..., 179, is the sum of the activation scores corresponding to dictionary exemplars that have state label number j:

$$C_j = \sum_{\{i \mid L_i = j\}} S_i, \qquad (6)$$

where S_i is the ith element in S_s. The state posterior probability estimate is then computed by normalizing the vector C to L1 norm 1:

$$P = \frac{C}{\sum_{j=1}^{179} C_j}. \qquad (7)$$
As in Gemmeke et al. (2011), it appeared that the procedure
of Eq. (6) systematically underestimates the posterior probability
of the three silence states. This is due to the fact that the normalization of all EMS vectors to unit length effectively equalizes
the overall magnitude, thereby destroying most of the information
that distinguishes silence from speech. Therefore, we implemented
an additional procedure that estimates the probability of a frame
being either speech or silence on the basis of the unnormalized
feature values. In frames that were classified as silence by this procedure the posterior probability of the three silence states was set
to 0.333, and the posterior probability of the 176 speech states was
set to some small floor value.
2.3. Viterbi decoder

The Viterbi decoder finds the most likely word sequence in a 179 (states) by N (frames) matrix by combining prior and posterior probabilities of the 179 states. The implementation allows us to use different word entrance penalties for the eleven digit words and the silence ‘word’. The decoder uses a pre-estimated 179-by-179 state-to-state transition matrix that contains the log probabilities associated with each state-to-state transition. Probabilities of the non-eligible transitions are first floored to a small positive value before the logarithm is applied. This flooring has a negligible effect on the total probability mass (i.e., the posterior probabilities of the 179 states to which a transition is allowed still sum almost to one). The state-to-state transition matrix is fixed across all experiments in this paper. The word-to-word transitions in the language model (LM) are determined by the conditional bigram (word-word) probabilities, which are virtually uniform.

There are two free parameters (i.e. the word and silence entrance penalties) which were tuned on a development test set to adjust the balance between insertions and deletions and to minimize the word error rate. The decoder only provides the best path with the associated accumulated score and the hypothesized words and silences, including a segmentation at the word level.

3. Exploiting modulation frequency domain information

To investigate the impact of the way in which the information about modulation frequencies is represented in the EMS feature vectors, we designed a sequence of experiments. In Study 1 we use a simplified version of the auditory model to investigate several technical and conceptual issues. We also address the correspondence between the LPF and BPFs in the modulation filterbank on the one hand and the static and dynamic features in conventional ASR systems on the other (cf. Moritz et al., 2015). In Study 2 we investigate the performance gain that can be obtained when the cut-off frequency of the LPF is varied and an additional number of modulation bandpass filters is added. We also investigate how recognition performance is affected when the LPF and BPFs cover the same modulation frequency range. Finally, in Study 3 we return to the original auditory model (keeping the cut-off frequency of the LPF fixed at 1 Hz) and investigate the impact of different configurations of the bank of BPFs (varying the number of BPFs and the spacing of the center frequencies, i.e., linear or logarithmic) used for capturing the dynamic information.

3.1. Study 1: exploratory experiments
We started experimenting with a highly simplified auditory-like
model that consists of a LPF in combination with one BPF that
emphasizes modulations in a specific frequency band, i.e., M, the number of BPFs in the modulation filterbank, equals one. One
conceptual issue concerns the cut-off frequency of the LPF. Different instantiations of the auditory model used quite different
LPFs. For example, Moritz et al. (2015) started from the system
described in Dau et al. (1997a), where the LPF has a cut-off frequency of 6 Hz. This corresponds to an integration time of approximately 170 ms, compared to the 1000 ms integration time
of the LPF with a cut-off frequency of 1 Hz in Jørgensen and Dau
(2014) that is used here. One might wonder whether such a long
integration time can at all be used in experiments with isolated
utterances that may have a duration between 0.5 and 3 s. We address the cut-off frequency of the LPF in this study and investigate
it further in the next study in configurations with multiple BPFs. In
our simplified model, we followed two different strategies in defining the LPF cut-off frequency: 1) the LPF cut-off frequency is fixed
at 1 Hz, while the center frequency of the BPF increases; 2) the
LPF cut-off frequency is always 1 Hz below the center frequency
of the BPF, the center frequency of which increases. We compare
the performance of these simplified models with a single LPF covering the same modulation frequency range to evaluate the advantage of emphasizing specific modulation frequencies using the BPF.
The number of feature elements in the simplified auditory model
(LPF+BPF) is twice the number of feature elements obtained using
a single LPF. Moreover, the shape of the effective transfer function of a filterbank consisting of a LPF and one BPF is different from that of a single
Butterworth LPF, as shown in Fig. 4a–c. To disentangle the effect
of these two factors on the performance and also to verify that
the effective transfer function is an important issue to consider in
the design of a modulation filterbank, we compare the accuracy
Fig. 4. Word recognition accuracy on clean speech using feature vectors consisting of lowpass filtered gammatone filter envelopes without (blue) or with additional emphasis
on a specific modulation frequency band. Emphasis is accomplished by modifying the frequency response of a single lowpass filter (magenta open circles) or by adding an
additional bandpass filter. The green curve (diamonds) pertains to a fixed lowpass filter (FLP = 1 Hz) in combination with a bandpass filter with varying center frequency;
the red curve (asterisks) pertains to a lowpass filter of which the cut-off frequency was 1 Hz below the center frequency of the accompanying bandpass filter. The shaded
bands indicate the 95% confidence interval. Sub-figures (a), (b) and (c) show the transfer functions of the composing filters and their sum (in red). (For interpretation of the
references to color in this figure legend, the reader is referred to the web version of this article.)
that can be obtained with a two-filter system and a system with
a single LPF that has the same transfer function as the two-filter
system. A final, also somewhat conceptual issue that we wanted to
explore is to what extent results obtained with a specific configuration for clean speech generalize to the noisified test utterances.
3.1.1. Clean speech
The results of the pilot experiments on clean speech are summarized in Fig. 4. The red curve (asterisks) in Fig. 4d shows the
recognition accuracy obtained with a modulation filterbank that
consists of a LPF with a cut-off frequency that increases from 1 Hz
to 16 Hz, combined with a BPF with a center frequency 1 Hz higher than
the LPF cut-off frequency. Accuracy increases with an increase of
the modulation frequency band that is covered, up to a frequency
of 7 Hz, where ceiling performance is reached. Interestingly, this
‘optimum’ is obtained with the cut-off frequency of the LPF in the
auditory model proposed in Dau et al. (1997a). With 15 gammatone filters and two filters in the modulation filterbank, the EMS feature
vectors contained 30 coefficients.
The purple (open circles) curve in Fig. 4d pertains to a modulation ‘filterbank’ that consisted of a single LPF with a frequency response identical to the two-filter system underlying the red (asterisk) curve. Since the modulation filterbank comprised only a single
filter, the EMS vectors contained 15 features. From this comparison it can be concluded that representing an overall frequency response by means of two filters, resulting in EMS vectors that contain two sets of 15 features is advantageous.
The blue (filled circles) curve shows the recognition accuracy
obtained with a single LPF with increasing cut-off frequency, and
a frequency response that was flat in the pass band. The comparison between this curve and the purple curve shows that an overall
frequency response identical to the two-filter system yields better
accuracy than a flat response when the EMS vectors contain the
same number of features.
The green curve (open diamonds) pertains to the accuracy obtained with a two-filter system in which the cut-off frequency of
the LPF was fixed at 1 Hz, while the center frequency of the BPF
was increased from 2 Hz to 17 Hz. For the BPF center frequency of
2 Hz the configuration is identical to the second configuration in
the red (filled asterisks) curve. When the center frequency of the
BPF is 3 Hz it can already be seen that the performance lags relative to the configuration in which this BPF is combined with
a LPF with a cut-off frequency of 2 Hz (the red curve), despite the
equal number of features in the EMS vectors. For center frequencies of the BPF > 6 Hz the accuracy of this system decreases
with increasing center frequency. The accuracy of this two-filter
system drops below the single LPF system (the purple open circles curve) for BPF center frequencies > 8 Hz. The accuracy even
drops below the single, flat response LPF system for BPF center frequencies > 14 Hz. We attribute this effect to the overall transfer
function of this two-filter filterbank. As can be seen in Fig. 4b, the
frequency response contains a trough around 4 Hz that deepens as
the center frequency of the BPF increases.
From the data in Fig. 4 we can draw several preliminary conclusions. Probably the most important conclusion is that the overall
frequency response of the modulation filterbank has a large impact
on the performance of the system. The frequency response must
cover at least the band up to 7 Hz, and emphasizing a somewhat
narrow band centered around frequencies up to 7 Hz yields higher
accuracy than a flat response. Emphasizing ever higher modulation
frequencies has no beneficial, but also no detrimental effect. The
second conclusion is that the number of coefficients in the EMS
feature vectors is important. With identical frequency responses,
the systems that encode the output of the BPF as an additional
Fig. 5. Word recognition accuracy on noisy speech (four noise types in test set A), using feature vectors consisting of lowpass filtered gammatone filter envelopes together
with an additional bandpass filtered version of the envelope. (a) Word recognition accuracy on four noise types at SNR level of 20 dB. (b) Word recognition accuracy on four
noise types at SNR level of −5 dB. (Note the different scales of the vertical axes.).
set of 15 coefficients always perform much better. This indicates that EMS vectors which distribute the information about the overall frequency response over one set of features corresponding to the flat part of the response and another set corresponding to the emphasized region of the frequency axis are more discriminative.
3.1.2. Noisy speech
Since filterbanks that combine a LPF with increasing cut-off frequency with a BPF with center frequency 1 Hz above the cut-off
appeared to yield the best accuracy, we tested these configurations
on the noisy utterances of test set A. The other (inferior) configurations mentioned above were also tested; results are not shown, because they do not contribute additional information. Fig. 5 shows
the accuracies in the SNR = 20 dB and SNR = −5 dB conditions.
From Fig. 5a it can be seen that the results in the SNR = 20 dB
condition are similar to the results obtained with clean speech.
However, the frequency range at which ceiling performance is
reached differs slightly between the four noise types. Also, the extent to which the accuracy varies on the plateau seems to differ
slightly between the four noise types.
At the SNR = −5 dB level (cf. Fig. 5b), a different pattern of results is visible. Although it is not safe to draw strong conclusions
from very low recognition accuracies, several observations stand
out. First, there is substantial difference between the noise types.
Noise type N2, babble noise, yields the highest accuracies for all
LPF cut-off frequencies. The accuracy with car noise (N3) drops almost to the level of subway noise (N1) with cut-off frequencies
≥ 12 Hz. It can also be seen that all four noise types show a decreasing accuracy when the cut-off frequency of the LPF increases
beyond some maximum. For car noise the fall is deep and steep,
whereas it is quite shallow for subway noise.
An in-depth analysis of the distributions of the EMS vectors
showed that these somewhat surprising results are caused by the
difference (or similarity) between the two-band EMS features of
speech and the corresponding features of the four noise types. In
the lower SNR conditions (and especially with SNR= −5 dB) we see
two different effects. Noise exemplars in the dictionary account for
a substantial proportion of the approximation of the noisy speech
EMS vectors; this results in low –and possibly random– activations
of the speech states. Except for the subway noise, the reduction of
the total activation of speech states becomes worse as the BPF emphasizes higher modulation frequencies, which are less informative
for speech. The overall reduction of the activation of speech states
is combined with an increasing shift of the activations towards
a small number of speech states that happen to have EMS vec-
tors that are somewhat similar to the vectors that characterize the
noises. This results in a digit confusion pattern that strongly favors
the digits that happen to contain these favored states. This effect
is especially clear for N1 (subway) and N4 (exhibition hall), whose
EMS vectors are characterized by high values in the high-frequency
gammatone filters, both in the LPF and BPF. The EMS vectors of
N1 show this effect already at low cut-off frequencies, which explains the fairly flat shape of accuracies as a function of cut-off frequency. Babble noise behaves differently in that it does not favor a
small number of speech states. The especially detrimental effect of
N3 (car) is due to a combination of the two effects: a small number of speech states is favored, while the total activation of the
speech states is small. The large differences between the recognition accuracies with the four noise types at −5 dB SNR suggest
–unsurprisingly– that a two-filter modulation filterbank does not
provide sufficient resolution for coping with different noise types.
3.1.3. The link with delta coefficients in conventional ASR
In addition to commonalities between the acoustic features
used in conventional ASR and the output of an auditory model,
there are also substantial differences. The conventional ASR approach is based on (power) spectra estimated from 100 overlapping windows per second. Such a spectrum can be considered as
equivalent to the EMS features in a LPF with cut-off frequency set
to 50 Hz. Furthermore, the delta coefficients in conventional ASR (i.e., the time derivatives of the static features) can be
viewed as the output of a single modulation frequency bandpass
filter. The transfer function of a differentiator has a rising slope
of +6 dB/octave; therefore, the output of a bandpass filter with
a rising slope of +6 dB/octave can be considered as a low-pass filtered version of a differentiator. The falling slope of the BPF determines to what extent the high frequencies in the differentiated
signal are attenuated. In the Q = 1 filters of our auditory model,
the falling slope is −6 dB/octave. In conventional ASR the center frequency of the bandpass filter, as well as the steepness of
the falling slope, depend on the number of static coefficients involved in the regression function used in computing the deltas.
With DELTAWINDOW=5 and a frame rate of 100 frames per second in HTK (Young et al., 2009) the center frequency of the ‘delta’
filter is approximately 7.5 Hz, while the attenuation at the Nyquist
frequency of 50 Hz is approximately 20 dB.
To obtain a better understanding of the effect of centering the
‘delta’ filter at different frequencies, we carried out an experiment
in which we combined a 16 Hz cut-off frequency LPF with a single BPF with a center frequency that varied between 2 Hz and 16 Hz.

Fig. 6. Relative recognition accuracy (Acc) improvement obtained by adding an additional bandpass filtered version of the envelope to the 16 Hz lowpass filtered one. The subplots show the results on three different SNR levels of noisy speech with four different noise types of test set A.

The recognition accuracies obtained with these configurations were compared to the accuracy obtained with a single LPF
with cut-off frequency 16 Hz. Fig. 6 shows the relative improvement for the four noise types for SNR levels of 20, 5, and −5 dB. A
comparison between the curves for the SNR levels shows that the
gain increases as the SNR level decreases: While the relative improvement is of the order of 20% to 25% in the SNR = 20 dB condition, the performance is improved by 50% up to 130% (noise type
dependent) in the SNR = −5 dB condition. Especially at SNR = −5 dB
the center frequency at which the recognition accuracy increases
most depends on the noise type. This confirms that a single ‘delta’
filter is not sufficient for making the EMS features robust against
different noise types.
3.2. Study 2: multi-resolution representations of modulation
frequencies
It is quite likely that humans pay selective attention to the
spectro-temporal input when understanding speech, and that selective attention becomes more important as the listening conditions grow more adverse. The gammatone filters allow for a sufficient degree of selectivity in the frequency domain. The subsequent modulation filterbank must provide the selectivity in the
modulation frequency domain. In combination with the sparse
coding approach for obtaining the posterior probabilities of the 179
states in the aurora-2 task, a multi-resolution representation, with
its attendant longer feature vectors, might enhance the probability that ‘correct’ clean speech exemplars in the dictionary have a
small Euclidean distance to noisy speech frames, because the energy of the noise is much smaller than the energy of the speech in
some regions of the EMS vectors. If this is indeed the case, a multiresolution representation should enhance the resulting recognition
accuracy.
In Section 3.1 it was concluded that modulation frequencies in
the band up to 16 Hz must be covered and that the largest gain
in performance relative to a configuration with a single LPF is obtained by emphasizing different modulation frequencies for different noise types and different SNR levels. Therefore, it can be expected that a configuration in which multiple BPFs separate the
modulations in different frequency bands would outperform a configuration that contains only a LPF and a single BPF.

Fig. 7. Word recognition accuracy obtained with feature vectors covering the modulation frequency range of 0–16 Hz. The modulation filterbank consisted of a single lowpass filter with variable cut-off frequency and a variable number of additional bandpass filters with center frequencies spaced 1 Hz apart to cover the interval beyond the LPF cut-off frequency. Results for clean (top) and noisy speech (bottom) are shown in separate panels to improve resolution. The shaded bands indicate the 95% confidence interval. The dashed line shows the trajectory of the peak position across SNR levels.

Auditory models do precisely this, by combining a LPF with a bank of BPFs. Such
a filterbank can be configured in two different ways: the BPFs can
cover the frequency range above the cut-off frequency of the LPF,
or the frequency ranges of the BPF and LPF may overlap, so that
the BPFs provide additional resolution in a band that is already
covered. Below, we compare these configurations. By doing so, we
address two questions: (1) In which modulation frequency range is a high resolution most beneficial for noisy speech recognition? (2) To what extent is it beneficial to represent modulation frequencies both in terms of static and dynamic features by choosing overlapping LPF and BPFs?
In the first experiment we employed a modulation filterbank
consisting of a LPF with a variable cut-off frequency (ranging from
1 to 16 Hz), augmented with a bank of BPFs with center frequencies (spaced 1 Hz apart) covering the range from 1 Hz above the
cut-off frequency of the LPF up to 16 Hz. Obviously, the total number of filters in the filterbank (M + 1), and therefore the total number of features in the EMS vectors (15 · (M + 1 )), will increase as
the cut-off frequency of the LPF decreases. The test is performed
on all the clean and noisified data in test set A of aurora-2.
The results of this experiment (averaged over four different noise
types) are summarized in Fig. 7.
The first observation that can be made from the figure is that
the configurations with the largest number of modulation BPFs
do not always yield the best recognition accuracy: the curves for
the highest SNR levels start with a (small) interval in which the
performance is increasing as the cut-off frequency of the LPF increases, corresponding to a decrease of the total number of filters. The cut-off frequency at which the maximum accuracy is obtained is clearly dependent on the SNR level. In the clean condition, the best performance is obtained when the LPF cut-off frequency is 5 Hz and the modulation frequency range of 6 − 16 Hz
is covered by M = 11 linearly spaced BPFs. At lower SNR levels, the
LPF cut-off frequency at which the maximum accuracy is obtained
shifts towards lower frequencies as illustrated by the dashed line
in Fig. 7b: it interpolates the LPF cut-off frequency at which the
best performance is obtained at different SNR levels. Moreover, the
steeper slopes in the curves corresponding to lower SNR levels indicate that increasing the LPF cut-off frequency, and as a result
decreasing the resolution in the lower modulation frequencies, is
more harmful in the presence of high noise energy. Apparently,
separating modulations in the very low frequency bands, which are
not very important for the intelligibility of clean speech, enhances
the capability of the sparse coding engine to match noisy speech
EMS vectors with ‘correct’ clean speech exemplars.
In the second experiment we combined 15 BPFs with center
frequencies linearly spaced between 1 Hz and 15 Hz with a LPF
the cut-off frequency of which was decreased from 15 Hz to 1 Hz.
With lower cut-off frequencies the amount of information about
the modulations that can be said to be represented twice (in the
BPFs and in the LPF) decreases, but all configurations cover the
modulation frequencies up to 16 Hz. Also, the total number of features (15 · 16) was identical in all configurations, because the number of filters was fixed.
It appeared that decreasing the cut-off frequency of the LPF
from 16 Hz to 1 Hz had no effect on the resulting recognition accuracy. The performance was independent of the cut-off frequency
and always equal to the accuracy corresponding to LPF cut-off frequency of 1 Hz in Fig. 7. From this experiment it can be concluded that the 1 Hz cut-off frequency of the LPF in the model of
Jørgensen and Dau (2014) is to be preferred over the 6 Hz cut-off
frequency in the model of Dau et al. (1997a), especially in low SNR
conditions. Apparently, a high resolution in the modulation filterbank is almost always beneficial. The only exception is formed by
the conditions with a very high SNR level, where a high resolution
in the very low modulation frequencies has a small negative effect.
3.3. Study 3: the auditory model revisited
Now that we know that a set of modulation BPFs that cover the
frequency range from 2 to 16 Hz, in combination with a LPF with a
cut-off frequency as low as 1 Hz, can yield promising recognition
accuracies, we can return to the question whether the ‘standard’
configuration in auditory models, i.e., Q = 1 BPFs spaced at one octave intervals, is the optimal configuration for ASR applications. To
address this question we carried out experiments in which the envelopes of the gammatone sub-bands are processed by a number
of different modulation filterbanks. The filterbanks consisted of a
fixed LPF with a cut-off frequency at 1 Hz and a variable number
of BPFs with quality factor Q = 1.
3.3.1. LPF at 1 Hz and BPFs with different distribution patterns
We first compare the recognition performance using filterbanks
with similar frequency coverage, but with different number of BPFs
and distribution patterns of center frequencies. The center frequencies of the BPFs were chosen in three different manners: linearly
spaced at 1 Hz distance, logarithmically spaced at 1/3rd octave and
at full octave distance. The number of BPFs is gradually increased,
adding modulation bands, until they cover the frequency range up
to 25 Hz.³ In Fig. 8a the recognition accuracies for clean speech
of each of these filterbanks are depicted as a function of the center frequency of the last BPF included (red: linear spacing; purple: octave spacing; green: 1/3rd octave spacing). Note that, as a
consequence of the different distribution patterns of the BPFs, the
number of BPFs used for covering the range up to a given modulation frequency was different (14 with linear spacing, 4 with octave
spacing, 12 with 1/3rd octave spacing, and 10 with the first two
filters in the 1/3rd octave spacing left out).
The first observation from this figure is that adding more
BPFs improves recognition accuracy, but a ceiling performance
is reached when the center frequency of the last-added filter is
16 Hz. The highest word recognition accuracy is obtained with the
linear spacing strategy and amounts to 96.13%, an improvement
of approximately 2.2% absolute compared to the best performance
obtained with a combination of a LPF with cut-off frequency 15 Hz
and a single BPF with center frequency 16 Hz (cf. Study 1).
The second observation from Fig. 8a is the consistent and statistically significant advantage of the linearly spaced filterbank (the
red curve) over the logarithmically spaced filters (the purple and
green curves). The difference in number of filters employed cannot
explain this observation: to cover the range up to a modulation
frequency of 10 Hz, the 1/3rd octave spacing and the linear spacing
require ten and nine filters respectively; still, the linearly spaced
filterbank outperforms the 1/3rd octave spacing. Also, despite the
different number of filters, the octave spaced and the 1/3rd octave
spaced filterbank have roughly equal performance. Therefore, the
most plausible explanation lies in the fact that different locations
of the center frequencies of a set of BPFs yield different effective
transfer functions. To illustrate this effect, we plotted the effective
transfer function for the linear, octave and the 1/3rd octave spaced
filterbanks with M = 10 in Fig. 8d. Clearly, the 1/3rd octave spaced
filterbank emphasizes the very low modulation frequencies much
more than the linearly spaced filterbank (peak at 3.0 Hz compared
to 6.25 Hz).⁴
From the perspective of sparse coding this means that information about modulations in a frequency range that exhibits non-negligible variance, but contains little information about the contents of speech signals, may have too strong an impact on the
Euclidean distance measure, giving rise to sub-optimal recognition performance. To test this hypothesis, we removed the first
two BPFs from the 1/3rd octave spaced filterbank (fc = 1.26 Hz
and fc = 1.58 Hz). As a result, the effective transfer function of
the modified filterbank does no longer over-emphasize the lowest
modulation frequencies compared to the linearly spaced filterbank
(cf. Fig. 8d). Consequently, as shown by the distance between the green and the light blue curves in Fig. 8a, the recognition accuracy for clean speech was always higher than the results with the corresponding full 1/3rd octave filterbanks. Note that the outputs of the BPFs with Fc = 1.26 Hz and Fc = 1.58 Hz do contain some useful information, since the performance levels at the two leftmost points of the red curve in Fig. 4d are higher than that at the leftmost point of the blue curve in Fig. 4d (using one static feature only). However,
in combination with more BPFs covering a larger range of modulation frequencies, a modulation frequency range that is sampled too
densely at the low end is harmful for recognition.
In this experiment we compared filterbanks with different numbers of filters. We also created filterbanks with the same number of filters as in the filterbank with linearly spaced BPFs, equally spaced on a logarithmic frequency axis. None of these configurations appeared to provide better performance than the linear spacing of the center frequencies.
3
We increased the modulation frequency range compared to previous experiments. This was done to verify that the logarithmically spaced BPFs (that exhibit
a wider spacing at high frequencies) also yielded ceiling performance above approximately 16 Hz.
4
The frequency response of the filterbank with octave spacing is not shown,
because the center frequency of a substantial number of 10 filters is beyond the
Nyquist frequency.
76
S. Ahmadi et al. / Speech Communication 84 (2016) 66–82
Fig. 8. Word recognition accuracy for (a) clean speech, (b) noisy speech SNR= 20 dB, (c) noisy speech SNR= −5 dB, as a function of the highest center frequency in the
bank of bandpass filters. The shaded areas represent the 95% confidence intervals. The center frequencies of the filters (FBP ) are spaced linearly at 1 Hz intervals (red), or
logarithmically at full octave intervals (purple) or at 1/3 octave intervals (green). The blue curve depicts the results obtained with the same 1/3 octave filterbank, without
the two filters with FBP < 2 Hz. Sub-figure (d) shows the effective transfer functions of the modulation filterbanks with LPF cut-off frequency 1 Hz and 10 (or 8 for the blue
curve) BPFs.
ber of filters as in the filterbank with linearly spaced BPFs, equally
spaced on a logarithmic frequency axis. None of these configurations appeared to provide better performance than the linear spacing of the center frequencies.
Fig. 8b and c show the results obtained with increasing numbers of differently spaced filters for the two extreme noise conditions, i.e., SNR = 20 dB and SNR = −5 dB. In the SNR = 20 dB condition the superiority of the linear spacing, with a (much) larger number of filters, is more apparent than in the clean speech condition. In the SNR = −5 dB condition the filterbanks with octave spacing, and therefore smaller numbers of filters, yield much lower accuracies than the configurations with higher numbers of filters. This suggests that, particularly in noisy conditions, the sampling of the modulation frequency domain needs to be sufficiently fine-grained for the ASR system to reap the maximum possible benefit from the multi-resolution representation. It can also be seen that the recognition accuracy obtained with linearly spaced filters starts decreasing when filters with center frequencies > 16 Hz are added. The modulations in these frequency bands are mainly associated with the noise. This confirms our earlier conclusion that it is counter-productive to dedicate a substantial proportion of the EMS features to modulation frequency bands that do not contain information relevant for speech recognition. From Fig. 8c it can also be seen that a larger number of BPFs is not always beneficial: the configuration with fewer 1/3rd octave spaced filters is clearly competitive.
Fig. 9. Word recognition accuracy for (a) clean speech and (b) noisy speech as a function of the number of logarithmically spaced filters. The black lines show the recognition accuracy averaged over all four noise types (the shaded areas represent the 95% confidence intervals). At SNR = −5 dB, the individual noise results are also plotted as scattered markers: red circle: N1 (Subway), blue hexagram: N2 (Babble), green diamond: N3 (Car), and cyan square: N4 (Exhibition). The red asterisk shows the recognition accuracy obtained with the best performing linear filterbank (1 LPF + 18 BPFs).
3.3.2. LPF at 1 Hz and varying number of BPFs logarithmically
positioned to approximate a given effective transfer function
In most of the previously described experiments there was
an interaction between the total range of modulation frequencies
covered, the number of filters in that range and the distribution
patterns of the center frequencies. The fact that leaving out the
lowest-frequency filters from the 1/3rd octave filterbank improved
the recognition accuracy suggests that the presence of irrelevant
features incurs the risk that the Euclidean distance in the sparse
coding process homes in on exemplars that fit these irrelevant
features, at the cost of the features that do matter. The shape of
the effective transfer function of the filterbanks and the frequency
at which the response is maximal indicate which modulation frequencies will be represented with many features and dominate
the Euclidean distance measure in the Lasso decoder. In Study 1
it was found that the effective transfer function can be used as a
criterion for comparing different filterbank configurations. Therefore, we conducted an experiment in which we used the effective
transfer function of the best-performing linearly spaced filterbank
(i.e. 19 filters: 1 LPF + 18 BPFs) as a target that we try to approximate by means of a variable number of logarithmically spaced
BPFs. In contrast to the previous experiments, however, we allowed
the center frequencies of the first and last filter in the filterbank to
vary. Imposing the additional condition that the resulting configurations would provide at least some resolution in the low modulation frequency range, without pushing the lowest center frequency
below 2 Hz and without pushing the highest one above 36 Hz (so
that the −3 dB point of the falling slope of the BPF does not exceed
the Nyquist frequency), we ended up with configurations with a
minimum number of 10 and a maximum number of 22 filters. The
recognition accuracy results obtained with these configurations are
shown in Fig. 9.
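As an illustration of this design step, the sketch below performs a coarse grid search over the number of log-spaced BPFs and their lowest and highest center frequencies, scoring each candidate by its deviation from a target effective transfer function. The analytic Q = 1 response, the assumed 2-19 Hz center frequencies of the linear target and the grid resolution are illustrative simplifications, not the procedure actually used in our experiments.

import numpy as np

# Hedged sketch: approximate a target effective transfer function with M
# logarithmically spaced BPFs whose end points are allowed to vary.
def bpf_response(f, fc, q=1.0):
    return 1.0 / np.sqrt(1.0 + q ** 2 * (f / fc - fc / f) ** 2)

def effective_transfer(f, centers, q=1.0):
    return np.sqrt(sum(bpf_response(f, fc, q) ** 2 for fc in centers))

f = np.linspace(0.5, 40.0, 400)                       # modulation frequency grid (Hz)
target = effective_transfer(f, np.arange(2.0, 20.0))  # 18 linearly spaced BPFs (assumed)

best = (np.inf, None)
for m in range(10, 23):                               # 10 to 22 BPFs, cf. the text
    for f_lo in np.arange(2.0, 4.01, 0.25):           # lowest center frequency >= 2 Hz
        for f_hi in np.arange(16.0, 36.01, 1.0):      # highest center frequency <= 36 Hz
            centers = np.geomspace(f_lo, f_hi, m)
            err = np.mean((effective_transfer(f, centers) - target) ** 2)
            if err < best[0]:
                best = (err, (m, f_lo, f_hi))
print('best approximation (M, f_lo, f_hi):', best[1])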
As can be seen from Fig. 9a, close to maximum performance
on clean speech can be achieved with any number of log-spaced
BPFs with M ≥ 15. The best performance is achieved with 18 BPFs; the center frequencies of the first and last BPF are 3.26 Hz and 20.24 Hz, respectively. Although the number of filters is equal to that of the target linear filterbank, the achieved recognition accuracy is even slightly (but significantly) higher than with the 18 linearly spaced BPFs (0.4% relative; the red asterisk that indicates the accuracy with linearly spaced filters lies just beyond the 95% confidence interval).
Fig. 9b shows the corresponding results for the noisy test utterances from set A at SNRs ranging from 20 dB down to −5 dB. For
the highest SNR conditions the accuracy does not improve substantially when the number of filters is increased from 10 to 18. For
the lowest three SNR levels increasing the number of BPFs does
improve accuracy. In all cases a larger number of filters results
in a higher resolution in the lowest modulation frequencies. For
SNR=−5 dB, using M = 21 BPFs (rather than M = 18) yielded a 5%
relative improvement. In this configuration the center frequencies
of the lowest and highest BPF were 2.25 Hz and 19.3 Hz. Fig. 9b
breaks out the recognition accuracies obtained with the four noise types in the SNR = −5 dB condition. Increasing the resolution of the modulation filterbank has the smallest effect for the babble noise.
This was to be expected, because it is unlikely that there are many
modulation frequency bands in which babble noise differs substantially from speech.
4. Comparison with other ASR systems and HSR
4.1. ASR
In this research we investigated how different configurations of the modulation filterbank affect recognition performance. To deepen our understanding of the strengths and weaknesses of the combination of EMS features and SC, we compared the performance on test sets A and B in aurora-2 with previously published recognition accuracies of three other systems: the ‘standard’ aurora-2 system trained with the multi-condition data (Hirsch and Pearce, 2000), the multi-condition aurora-2 system that includes the Wiener filter based ETSI advanced frontend (Hirsch and Pearce, 2006), and the SC-based system of Gemmeke et al. (2011). The first two systems use GMMs based on MFCC features to estimate state posterior probabilities, while the third one used Mel-frequency energy spectra as stacks of up to 30 frames and non-negative matrix factorization with the Kullback–Leibler divergence as the solver in the sparse coding engine.
Table 1
The word recognition accuracy (%) obtained using Lin18-EMS features on the aurora-2 test sets. (For explanation see text.)

SNR (dB)        Clean    20      15      10      5       0       −5      Average
Test A
Subway          94.14    94.84   94.38   88.89   86.98   81.98   66.38   86.80
Babble          93.62    93.44   92.93   91.90   87.07   73.52   42.05   82.16
Car             93.56    92.69   92.45   91.65   88.58   80.11   59.68   85.53
Exhibition      93.55    95.53   95.19   94.38   85.71   82.63   73.93   88.70
Average         93.72    94.12   93.74   91.70   87.24   79.56   60.51   85.79
Test B
Restaurant      94.14    89.39   91.56   90.97   84.86   69.11   36.94   79.56
Street          93.62    93.68   92.53   90.60   83.92   63.27   29.53   78.16
Airport         93.56    94.87   94.15   91.02   82.58   63.05   28.78   78.28
Train station   93.55    94.54   91.76   85.00   67.76   38.41   13.24   69.18
Average         93.72    93.10   92.50   89.40   79.78   58.46   27.12   76.29
Fig. 10. Word recognition accuracy per test set as a function of SNR for four different systems. 1- The proposed EMS features (Lin18-EMS). 2- Sparse classification results using Mel-spectra features (Gemmeke et al., 2011). 3- Aurora2 multi-condition recognizer applied to MFCC features (Hirsch and Pearce, 2000). 4- ETSI-AFE multi-condition recognizer applied to MFCC features (Hirsch and Pearce, 2006).
Since there is no configuration of the modulation filterbank that is optimal for all SNR levels and all noise types, we conducted the comparison with the modulation filterbank consisting of the 1 Hz cut-off frequency LPF and M = 18 linearly spaced BPFs (which we refer to as the Lin18-EMS system). The Lin18-EMS system is a good compromise between the highest-possible performance for clean speech and the conditions with the lowest SNR level. The detailed results obtained with the Lin18-EMS system are collected in Table 1.
In Fig. 10, the recognition accuracies of the Lin18-EMS system and the three competing systems are plotted. Fig. 10a shows the test results for the matched noise types in test set A. While the Lin18-EMS system outperforms both MFCC-based multi-condition recognizers at very low SNR levels, its performance at higher SNRs is substantially worse than that of the MFCC-based systems. The single-frame EMS features almost always outperform the 30-frame Mel features.
However, the results of the Lin18-EMS system on test set B, which pertains to the unseen noise type conditions (Fig. 10b), show that our system does not generalize well to unseen noise types, a characteristic that it shares with the other exemplar-based system. The superior performance of the 30-frame Mel features is most probably due to the fact that Gemmeke et al. (2011) included artificially constructed noise exemplars that accounted to some extent for the mismatch between the noise exemplars from test set A and the different noise types in test set B. Our EMS-based system did not include artificially constructed exemplars. In cleaner conditions (down to 10 dB) the EMS-based system performs roughly on par with the other exemplar-based system. In contrast to the behavior for test set A, however, the performance drop at SNRs < 10 dB is much steeper. Averaged over the four noise types of test set B, the recognition accuracy is approximately equal to that of the multi-condition trained GMM system without noise reduction.
A detailed analysis revealed that the performance of the Lin18-EMS system is in fact very similar to that of the system of Gemmeke et al. (2011), except for train station noise (cf. Table 1). In search of the cause of this deviant behavior, we found that omitting the standard deviation equalization step ((3) in Section 2.2) substantially improved recognition performance for utterances corrupted with train station noise at low SNR levels. This is illustrated by the dotted line in Fig. 10b, which shows the average performance on test set B (SNR = 5, 0, −5 dB) when excluding the standard deviation equalization for train station noise. Recall that the main purpose of the standard deviation equalization procedure was to equalize the contribution of all gammatone frequency bands. The equalization weight vector was designed (using the speech exemplars from the dictionary) such that the standard deviation of the coefficients in the EMS vector is on average equal in all 15 gammatone filters, without changing the relative magnitude of the coefficients pertaining to the modulation bands. It appeared that the equalization procedure works well for noisified speech, as long as the pattern of the 15 gammatone coefficients in the modulation bands does not change between bands with low and high modulation frequencies. As long as that is the case, applying a fixed equalization vector will not change the average modulation spectrum of the noises. However, there are two noise types that violate this assumption, viz. car noise in test set A and train station noise in test set B. The detrimental effect of the violations in car noise is limited, because this noise is represented in the dictionary by exemplars taken from the car noise signals. For the train station noise this is not the case. As a result, the match between the modulation spectra of the speech noisified by adding train station noise and the exemplars in the dictionary deteriorates as the SNR level decreases.
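For concreteness, the following sketch shows one way the standard deviation equalization described above could be implemented. The array layout (exemplars × 15 gammatone bands × M modulation bands) and the target value of the average standard deviation are assumptions made for illustration, not a description of our exact implementation.

import numpy as np

# Minimal sketch of the standard deviation equalization discussed above.
def equalization_weights(speech_exemplars):
    # per-coefficient standard deviation over all speech exemplars
    std = speech_exemplars.std(axis=0)            # (n_gammatone, n_modulation)
    # one weight per gammatone band: the inverse of its mean std over the
    # modulation bands, so that the relative magnitudes of the modulation
    # bands within a gammatone band are preserved
    return 1.0 / (std.mean(axis=1) + 1e-12)       # (n_gammatone,)

def apply_equalization(ems, weights):
    # the same weight is applied to every modulation band of a gammatone band
    return ems * weights[:, None]

# toy usage with random stand-in data (200 exemplars, 15 bands, 19 modulation filters)
rng = np.random.default_rng(0)
dictionary = rng.normal(scale=np.linspace(0.5, 3.0, 15)[None, :, None],
                        size=(200, 15, 19))
w = equalization_weights(dictionary)
equalized = apply_equalization(dictionary[0], w)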
4.2. Comparison with HSR
To evaluate the combination of EMS features and sparse coding in terms of human-like performance, we re-use the data about the recognition accuracy of ten human listeners on aurora-2 utterances reported in Meyer (2013). Meyer used three different criteria: the speech reception threshold (SRT), the effect of noise types, and the effect of string lengths. The SRT is the SNR at which listeners achieve 50% accuracy; usually it corresponds to the SNR at which the accuracy as a function of SNR has the largest negative slope. The SRT estimated for HSR in Meyer (2013) is around −10.2 dB, while for the aurora-2 system trained with the multi-condition data (Hirsch and Pearce, 2000) the SRT is −1.5 dB. From Fig. 10a, it can be inferred that the SRT of the EMS-based system is well below −5 dB; although it is dangerous to extrapolate the curves, it is reasonable to assume that the SRT for the two exemplar-based systems is close to the human SRT. As can be seen from Fig. 10b, which represents the noise mismatch case (test set B), our EMS-based system does not generalize well to unseen noise types. We will come back to this issue in Section 5.
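The SRT criterion used above can be illustrated with a few lines of code: the measured accuracy-vs-SNR points are interpolated at the 50% level. The linear interpolation is a simplification of the usual psychometric curve fit; the example numbers are the test set B averages from Table 1.

import numpy as np

# Illustration of the SRT: the SNR at which the accuracy-vs-SNR curve crosses 50%.
def estimate_srt(snrs_db, accuracies, threshold=50.0):
    snrs = np.asarray(snrs_db, dtype=float)
    acc = np.asarray(accuracies, dtype=float)
    order = np.argsort(acc)                     # np.interp needs increasing x values
    return float(np.interp(threshold, acc[order], snrs[order]))

# example input: test set B averages from Table 1
print(estimate_srt([-5, 0, 5, 10, 15, 20], [27.12, 58.46, 79.78, 89.40, 92.50, 93.10]))

With these numbers the interpolated value falls between −5 and 0 dB, whereas Fig. 10a suggests a considerably lower SRT for the matched noise types of test set A.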
According to Meyer (2013), the difficult noises for ASR and HSR are different. At SNR = 0 and −5 dB, the aurora-2 system trained with multi-condition data performs better on babble noise than on car noise, while HSR shows higher performance for car than for babble noise. From Table 1 it can be seen that our EMS+SC system shows the same trend as the human listeners: accuracy with babble noise is lower than with car noise. The same holds for the comparison of airport and train station noise, provided that we solve the equalization issue.
In the human data there is a small but clear drop in accuracy for the longest digit strings, which is probably due to memorization problems. Our EMS-based system does not show this effect. This was to be expected, because an automatic system is not affected by the need to memorize long strings. Our system also does not show the problems with one-digit utterances reported by Meyer (2013) for the ‘standard’ aurora-2 systems with multi-condition training. The raw EMS features that we used for speech-silence segmentation yield quite accurate results. Only in a very small proportion of the utterances did the endpoint estimates differ from the voice onset and offset determined by the forced alignment by more than 16 frames, the minimum number of frames needed to find (or hallucinate) a digit word.
In summary, it can be concluded that the operation of our EMS-plus-SC system for the estimation of sub-word probabilities mimics human speech recognition on a semantics-free task better than more conventional MFCC-plus-GMM systems.
5. General discussion
In this paper we investigated how different configurations of
the modulation filterbank in an auditory frontend affect the degree
to which an exemplar-based engine can provide accurate posterior
probability estimates of sub-word units when recognizing noise-corrupted speech. The auditory model proposed by Jørgensen and Dau (2014), which consists of an LPF with a cut-off frequency of
1 Hz and nine Q = 1 BPFs with center frequencies one octave
apart, served as the point of departure. For estimating the posterior probabilities of the sub-word units, we used sparse coding and
a large dictionary of semi-randomly selected exemplars. We found
that BPFs with center frequencies one octave apart do not provide
sufficient resolution of the modulation frequencies for automatic
(and maybe also for human) speech recognition. We conjecture
that a filterbank with octave spacing between the modulation filters is able to discover noise conditions that will certainly compromise intelligibility, but that this configuration may not accurately
predict specific confusions that would occur in tasks that require
participants to distinguish confusable sounds in the absence of semantic predictability.
From our experiments it appears that there is no unique configuration of the modulation filterbank that is optimal for all SNR
levels and all noise types. However, it is safe to conclude that a filterbank consisting of a LPF with cut-off frequency 1 Hz and about
M = 18 BPFs with center frequencies between 2 Hz and 20 Hz
will provide accuracies close to optimal for most conditions. Center frequencies of the BPFs with equal spacing on a linear or on
a logarithmic frequency axis yielded very similar results. In the SNR = −5 dB condition the best results were obtained with a configuration that comprised M = 21 logarithmically spaced BPFs, with the lowest BPF centered at 2.25 Hz. In all experiments we found that the lowest SNR levels benefited from a high resolution in the lowest modulation frequencies; however, for the highest SNR levels a very high resolution in the modulation frequency band < 6 Hz was somewhat detrimental.
The exemplar-based engine for estimating posterior probabilities of sub-word units was based on a Lasso solver in a sparse
coding procedure. In the Lin18-EMS system we used 17,148 speech
exemplars and 13,504 noise exemplars. These numbers are about
twice as high as the numbers of speech and noise exemplars used
in Gemmeke et al. (2011). The need for large numbers of exemplars in our system is probably related to the combination of features with positive and negative values and the Euclidean distance
measure. In Ahmadi et al. (2014) we found that in a sparse coding
framework it is advantageous to keep the phase information in the
output of the modulation BPFs. The same conclusion was reached
by Moritz et al. (2015). However, Baby and Van hamme (2015), who used EMS-like features for training DNNs, obtained good results when using only the magnitudes of the outputs of the modulation filters.
The fact that our EMS features have positive and negative feature values ruled out the use of sparse coding engines based
on Kullback–Leibler divergence (the preferred distance measure in
non-negative matrix factorization). It is well known that for many
features used in pattern recognition tasks the Euclidean distance
does not represent the conceptual distance (e.g., Choi et al., 2014).
The default solution is to transform the original features to a space
in which Euclidean distance does represent conceptual neighborhood. We counteracted some of the undesirable effects of the Euclidean distance by the equalization and normalization procedures
that we applied to the exemplars and the unknown observations.
Forcing all exemplars and unknown observations to unit length
makes the Euclidean distance equivalent to cosine distance (Choi
et al., 2014). In our equalization procedure the exact same weights
are used for the 15 gammatone bands in all M modulation bands.
As long as the pattern formed by the magnitude of the 15 numbers
in the M modulation bands does not differ substantially between
the modulation bands, using fixed weights is beneficial. However, if
the patterns become different in some modulation bands because
of the different characteristics of the noise, fixed weights can be
detrimental. This appeared to be the case with the train station
noise in test set B.
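The equivalence between Euclidean and cosine distance for length-normalized vectors mentioned above follows directly from expanding the squared norm; for $\|x\| = \|y\| = 1$,

\[ \|x - y\|^2 = \|x\|^2 + \|y\|^2 - 2\,x^{\top} y = 2\,(1 - \cos\theta), \]

so ranking exemplars by Euclidean distance is then identical to ranking them by cosine similarity.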
We preferred an exemplar-based approach over GMMs or neural networks (including DNNs) for estimating the posterior probabilities, because this approach appears to have closer connections
to emerging knowledge about cortical representations of audio signals (Mesgarani et al., 2014a; 2014b) and neural processing. Our
research was based on the assumption that some configurations
of the modulation filterbank would yield EMS vectors in which a
substantial proportion of the features is not affected by the background noise, because the expected values of these features are
different for the noise and the speech signals. If the proportion
of unaffected features is high enough, the sparse coding engine
should be able to match partly damaged EMS vectors to the correct exemplars. That assumption is reinforced by the superior performance of human listeners, especially in tasks where there is little or no help from semantics or world knowledge. The assumption
is also in line with widely accepted theories about human pattern
recognition, which claim that missing data will be reconstructed
(Grossberg and Kazerounian, 2011; Myers and Wallis, 2013; Wei
et al., 2012). In addition, exemplar-based approaches can handle
the very high-dimensional feature vectors produced by the most
elaborate versions of the auditory model.
To verify that it is the information in the EMS features, rather
than the operation of the sparse coding engine, that drives the performance and to verify that the findings about the design of EMS
features are not limited to an SC procedure for estimating posterior probabilities, we repeated many experiments with the KNeighborsClassifier in scikit-learn (Pedregosa et al., 2011). We always used
the exact same speech-plus-noise dictionaries to ‘train’ the kNN
classifier as were used with the SC engine. We saw the same trend
in the results as a function of modulation filterbank configuration
in all SNR conditions. For the higher SNR levels the absolute accuracies obtained with the kNN classifier were very close to what
we obtained with sparse coding. However, in the lowest SNR levels
the SC engine had a clear advantage.
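The kNN control experiment can be summarized in a few lines; the sketch below is a minimal stand-in in which the dictionary exemplars serve as the training set of a scikit-learn KNeighborsClassifier. The dimensionalities, the number of neighbors and the sub-word label inventory are illustrative assumptions, not the settings used in our experiments.

from sklearn.neighbors import KNeighborsClassifier
import numpy as np

# Minimal sketch of the kNN control experiment: the exemplars in the
# speech-plus-noise dictionary act as the 'training' set and each test EMS
# vector receives crude posterior estimates from its nearest exemplars.
rng = np.random.default_rng(0)
dictionary = rng.normal(size=(1000, 300))       # stand-in for EMS exemplars
state_labels = rng.integers(0, 180, size=1000)  # stand-in for sub-word (state) labels

knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn.fit(dictionary, state_labels)

test_ems = rng.normal(size=(10, 300))           # stand-in for test observations
posteriors = knn.predict_proba(test_ems)        # per-state posterior estimates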
We compared the performance of the Lin18-EMS system with the performance of three other systems on the same data set: Mel-spectra features + SC (Gemmeke et al., 2011), the MFCC aurora-2 multi-condition recognizer (Hirsch and Pearce, 2000), and the MFCC ETSI-AFE multi-condition recognizer (Hirsch and Pearce, 2006). In test set A the Lin18-EMS system outperformed the other systems in the lowest SNR conditions. However, the two GMM-based systems outperform the two exemplar-based systems by a wide margin in the high SNR conditions. The fact that both exemplar-based systems suffered in the same conditions, despite using very different features, shows that the problem is not caused by the EMS features. Also, the lower performance of the exemplar-based systems at the highest SNR levels is not due to the interference of the noise exemplars in the dictionary. In-depth analysis of the activations of the exemplars showed that the noise exemplars receive only very small activations in the highest SNR conditions. Decodings with and without the noise exemplars in the dictionary yielded essentially the same accuracy for clean speech and SNR = 20 dB. The exemplar-based systems mainly suffer from confusion errors. Moreover, we encountered the same problem with the kNN classifier. It is left to future research to understand what causes the confusions in exemplar-based systems at the highest SNR levels.
It has been shown that the performance of an ASR system
can be improved by fusing the posterior probabilities obtained
from an exemplar-based system and corresponding estimates from
GMM- or ANN-based systems (e.g. Geiger et al., 2013). In Sun et al.
(2014) it was shown that fusing the posterior probability estimates
of an exemplar-based and a GMM-based system can reduce the
word error rate for clean speech in aurora-2 to less than 0.5%.
However, it is unlikely that humans use a similar procedure to accomplish their superior recognition performance.
We also compared the performance of the Lin18-EMS system to the (admittedly few and incomplete) data about human recognition performance on the aurora-2 task. Using the criteria proposed in Meyer (2013) we found that the performance of our system is more similar to that of humans than some conventional ASR systems. The only discrepancy is that our system did not show the effect that human accuracy decreases with increasing string length. Our system shares this property with all ASR systems and computational models that do not simulate working memory problems.
In the remainder of this section, we will discuss possible ways
to repair some of the weaknesses of the proposed system. First of
all, the EMS features might be improved, for example by adding
the non-linear compression that is present in virtually all auditory
models, but that was left out in the model of Jørgensen and Dau
(2014), because compression was not necessary for the purpose of
predicting intelligibility. Including the static 10th power compression in the version of the model in Dau et al. (1996) did increase the recognition accuracy for clean speech, at the cost of a substantial decrease in the SNR = −5 dB condition (from 68% correct to 34% correct in test set A). We leave the implementation of the full dynamic compression to future research; we expect that it will show the same positive effect for clean speech without the strong negative effect for the lowest SNR conditions.
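The static compression experiment amounts to a single elementwise operation on the (non-negative) gammatone envelopes before the modulation filterbank is applied; a minimal sketch, with the exponent 0.1 corresponding to the 10th power compression mentioned above, is given below. The function name and the array layout are illustrative assumptions.

import numpy as np

# Sketch of static power-law compression of the gammatone envelopes.
def compress_envelopes(envelopes, exponent=0.1):
    # envelopes: non-negative array, e.g. (n_gammatone_bands, n_samples)
    return np.power(np.maximum(envelopes, 0.0), exponent)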
The EMS representation of (noisy) speech signals is reminiscent of the approaches advocated in multi-stream ASR architectures (Bourlard, 1999; Bourlard et al., 1996; Hermansky, 2013; Hermansky and Fousek, 2005; Okawa et al., 1998; Tibrewala and Hermansky, 1997). A representation in terms of multiple modulation frequency bands is likely to contain features that are not heavily affected by the noise. Instead of designing a procedure to optimally fuse the parallel streams at the feature, probability or output level, we investigated whether the undistorted features would dominate the distance measure between clean speech exemplars and noisy observations in the sparse coding engine. The recognition accuracy that we obtained on test set A of the aurora-2 task confirms the viability of this assumption, but the results also show that we are still far from human-like performance in terms of absolute accuracy. The conventional combination of static features, deltas and delta-deltas in ASR corresponds to an auditory model in which the LPF in the modulation filterbank has a cut-off frequency of about 50 Hz. In addition, there is one BPF with a center frequency of approximately 7 Hz and a quality factor Q = 1 and another BPF with a quality factor Q = 2. The fact that conventional ASR systems typically benefit from adding delta-delta coefficients raises the question of whether the Lin18-EMS system can be improved by adding Q = 2 BPFs with cut-off/center frequencies at strategically chosen positions. The results of Moritz et al. (2015) provide evidence in support of this assumption.
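The correspondence between delta coefficients and a modulation BPF can be made explicit by inspecting the frequency response of the regression filter that computes the deltas. The sketch below assumes a window of N = 4 frames on either side and a 100 Hz frame rate, which places the peak of the response near the 7 Hz value mentioned above; shorter windows move the peak to higher modulation frequencies. Both settings are illustrative assumptions, not values taken from this paper.

import numpy as np

# Frequency response of the standard regression-based delta filter,
# interpreted as a bandpass filter along the modulation (frame) axis.
frame_rate = 100.0                                  # frames per second (assumed)
N = 4                                               # regression half-window (assumed)
taps = np.arange(N, -N - 1, -1, dtype=float)        # [4, 3, ..., -3, -4]
taps /= 2.0 * np.sum(np.arange(1, N + 1) ** 2)      # regression normalization

freqs = np.linspace(0.0, frame_rate / 2, 501)
omega = 2 * np.pi * freqs / frame_rate
k = np.arange(-N, N + 1)
response = np.abs(np.array([np.sum(taps * np.exp(-1j * w * k)) for w in omega]))
print('peak modulation frequency: %.1f Hz' % freqs[np.argmax(response)])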
From recent developments in multi-stream ASR (e.g., Hermansky, 2013) it is clear that it is necessary to combine bottom-up fusion (whether at the level of features, probabilities or outputs) with some kind of knowledge about the best, possibly condition-dependent, way of selecting or combining features. The sparse coding procedure that we used for computing the posterior probabilities of the sub-word units does nothing of the kind. We can see at least two ways in which knowledge could be brought into play. First, it is possible to learn the distributions of individual features or groups of features (per gammatone or per modulation filter) in clean speech from the training material. During testing, the likelihood that (groups of) features fit the clean distribution can be estimated, and these estimates can be used as additional weights in computing the Euclidean distances in the Lasso solver (see the sketch after this paragraph).
Second, it is possible to improve the conversion of the exemplar
activations from the sparse coding procedure to posterior probabilities of sub-word units by involving some kind of learning. In
Ahmadi et al. (2014) we argued that we should not aim at the
optimal approximation of unknown observations as sparse sums
of exemplars; rather, we should aim for the optimal classification
of the unknown observations. Research is underway in which
we apply label-consistent discriminative dictionary learning to
replace the semi-random selection of exemplars by a procedure
that learns the exemplars that are optimal for reconstruction and
classification (e.g. Jiang et al., 2013).
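The first suggestion can be sketched as follows: per-feature statistics of clean speech are estimated from the training material and turned into reliability weights for the distance computation. The diagonal-Gaussian model and the mapping from log-likelihood to weight are illustrative choices, not a worked-out design.

import numpy as np

# Sketch of per-feature reliability weighting of the Euclidean distance.
def clean_feature_stats(clean_ems):                 # clean_ems: (n_frames, n_features)
    return clean_ems.mean(axis=0), clean_ems.var(axis=0) + 1e-8

def reliability_weights(observation, mean, var):
    # log-likelihood of each feature under a per-feature Gaussian clean model
    loglik = -0.5 * ((observation - mean) ** 2 / var + np.log(2 * np.pi * var))
    w = np.exp(loglik - loglik.max())               # map to (0, 1]
    return w / w.sum()                               # normalize so the weights sum to one

def weighted_sq_distance(observation, exemplar, weights):
    return float(np.sum(weights * (observation - exemplar) ** 2))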
The need for introducing some kind of learning in the procedure for computing posterior probabilities of sub-word units
is strengthened by recent observations of the representation of
speech signals in the auditory cortex (Mesgarani et al., 2014b;
Pasley et al., 2012). From intra-cranial recordings it can be inferred
that representations in auditory cortex still accurately reflect the
tonotopic representations formed in the peripheral auditory system. This suggests that speech recognition relies on higher-level
processes that operate on the tonotopic representations. These processes can only be successful on the basis of substantial amounts
of learning. The need for higher-level operations, including selective attention, is also discussed by Henry et al. (2013), based on the different behaviors of brain oscillations at frequencies < 5 Hz, which are associated with auditory processing, and oscillations at frequencies > 8 Hz, which are associated with higher-level cognitive processing (Luo and Poeppel, 2007).
6. Conclusion
In this paper we investigated to what extent a model of the
human auditory system that is capable of predicting speech intelligibility in adverse conditions also provides a promising starting
point for designing the frontend of a noise robust ASR system. The
long-term goal of the research is to design a computational model
that shows human-like recognition behavior in terms of performance level and the type of errors. We investigated which details
of the auditory model configuration are most important for maximizing the recognition performance of an exemplar-based system.
We found that a system that combines a frontend based on the
envelope modulation spectrum with a sparse coding engine for
computing posterior probabilities of sub-word units yields competitive performance as long as the modulation spectrum of the background noise is similar to the noise exemplars in the dictionary.
The modulation filterbank must cover the frequency range up to
about 20 Hz, but there is no configuration that is optimal for all noise types and all SNR levels. The lower the SNR, the more important a high resolution in modulation frequencies ≤ 6 Hz becomes.
Although the accuracy of our system is still below human performance, our system behaves more human-like than MFCC-GMM
based ASR systems.
The output of the lowpass filter in the proposed modulation
filterbank can be considered as the static features in a conventional ASR frontend, while the bandpass filter outputs can be considered as delta-features which are lowpass filtered with different cut-off frequencies. Using this insight, our results indicate that
not only our sparse coding based system, but in fact any classical ASR system, would benefit from a frontend in which the static
features, the delta coefficients and the delta-delta coefficients are
all represented in a multi-resolution fashion. The highly redundant EMS feature vectors have proven to be a promising starting
point for noise robust speech recognition. With a more sophisticated distance measure and a built-in ability to learn how to use
this high dimensional acoustic space to discriminate different subword units in different acoustic conditions, an interesting research
area opens up where ASR can interface with auditory and brain
research.
Acknowledgments
This project has received funding from the European Union’s Seventh Framework Programme for research, technological development and demonstration under grant agreement no. FP7-PEOPLE-2011-290000. We express our gratitude towards Torsten Dau; the discussions with him were most helpful during the design phase of the experiments, and his contributions in interpreting the results are greatly appreciated. We are also grateful to Tobias May for his advice during the experiments and for providing part of the software that was used in feature extraction.
References
Ahmadi, S., Ahadi, S.M., Cranen, B., Boves, L., 2014. Sparse coding of the modulation spectrum for noise-robust automatic speech recognition. EURASIP J. Audio
Speech Music Process. 2014 (1), 1–20.
Baby, D., Van hamme, H., 2015. Investigating modulation spectrogram features for
deep neural network-based automatic speech recognition. In: Proceedings INTERSPEECH. Dresden, Germany, pp. 2479–2483.
Bacon, S.P., Viemeister, N.F., 1985. Temporal modulation transfer functions in
normal-hearing and hearing-impaired listeners. Int. J. Audiol. 24 (2), 117–
134.
Bourlard, H., 1999. Non-stationary multi-channel (multi-stream) processing towards
robust and adaptive asr. In: Proceedings ESCA Workshop Robust Methods
Speech Recognition in Adverse Conditions, pp. 1–10.
Bourlard, H., Dupont, S., Hermansky, H., Morgan, N., 1996. Towards subband-based
speech recognition. In: Proceedings of EUSIPCO, pp. 1579–1582.
Buesing, L., Bill, J., Nessler, B., Maass, W., 2011. Neural dynamics as sampling: a
model for stochastic computation in recurrent networks of spiking neurons.
PLoS Comput. Biol. 7 (12).
Chi, T., Ru, P., Shamma, S.A., 2005. Multiresolution spectrotemporal analysis of complex sounds. J. Acoust. Soc. Am. 118 (2), 887–906.
Choi, J., Cho, H., Kwac, J., Davis, L., 2014. Toward sparse coding on cosine
distance. In: 22nd International Conference on Pattern Recognition (ICPR),
pp. 4423–4428.
Cooke, M., 2006. A glimpsing model of speech perception in noise. J. Acoust. Soc.
Am. 119 (3), 1562–1573.
Cutler, A., 2012. Native Listening: Language Experience and the Recognition of Spoken Words. MIT Press.
Dau, T., Kollmeier, B., Kohlrausch, A., 1997a. Modeling auditory processing of amplitude modulation. i. detection and masking with narrow-band carriers. J. Acoust.
Soc. Am. 102 (5), 2892–2905.
Dau, T., Kollmeier, B., Kohlrausch, A., 1997b. Modeling auditory processing of amplitude modulation. ii. spectral and temporal integration. J. Acoust. Soc. Am. 102
(5), 2906–2919.
Dau, T., Püschel, D., Kohlrausch, A., 1996. A quantitative model of the “effective”
signal processing in the auditory system. i. model structure. J. Acoust. Soc. Am.
99 (6), 3615–3622.
De Wachter, M., Matton, M., Demuynck, K., Wambacq, P., Cools, R., Van Compernolle, D., 2007. Template-based continuous speech recognition. IEEE Trans. Audio Speech Lang. Process. 15 (4), 1377–1390.
Demuynck, K., Garcia, O., Van Compernolle, D., 2004. Synthesizing speech from
speech recognition parameters. In: Proceedings of Interspeech, 2, pp. 945–948.
Jeju Island, Korea.
Drullman, R., Festen, J.M., Plomp, R., 1994. Effect of temporal envelope smearing on
speech reception. J. Acoust. Soc. Am. 95, 1053–1064.
Efron, B., Hastie, T., Johnstone, I., Tibshirani, R., et al., 2004. Least angle regression.
Ann. Stat. 32 (2), 407–499.
Elhilali, M., Chi, T., Shamma, S.A., 2003. A spectro-temporal modulation index (stmi)
for assessment of speech intelligibility. Speech Commun. 41 (23), 331–348.
Ewert, S.D., Dau, T., 2000. Characterizing frequency selectivity for envelope fluctuations. J. Acoust. Soc. Am. 108 (3), 1181–1196.
Fletcher, H., 1940. Auditory patterns. Rev. Mod. Phys. 12 (1), 47.
Fletcher, H., 1953. Speech and Hearing in Communication. Krieger, New York.
Geiger, J., Weninger, F., Hurmalainen, A., Gemmeke, J., Wöllmer, M., Schuller, B.,
Rigoll, G., Virtanen, T., 2013. The TUM+ TUT+ KUL approach to the 2nd CHiME
challenge: multi-stream ASR exploiting BLSTM networks and sparse NMF. In:
Proceedings of CHiME, pp. 25–30.
Gemmeke, J.F., Virtanen, T., Hurmalainen, A., 2011. Exemplar-based sparse representations for noise robust automatic speech recognition. IEEE Trans. Audio Speech
Lang. Process. 19 (7), 2067–2080.
Goldinger, S., 1998. Echoes of echoes? an episodic theory of lexical access. Psychol.
Rev. 105 (2), 251–279.
Grossberg, S., Kazerounian, S., 2011. Laminar cortical dynamics of conscious speech
perception: neural model of phonemic restoration using subsequent context in
noise. J. Acoust. Soc. Am. 130 (1), 440–460.
Henry, M.J., Herrmann, B., Obleser, J., 2013. Selective attention to temporal features
on nested time scales. Cereb. Cortex.
Hermansky, H., 1997. The modulation spectrum in the automatic recognition of
speech. In: Proceedings IEEE Workshop on Automatic Speech Recognition and
Understanding. Santa Barbara, pp. 140–147.
Hermansky, H., 2011. Speech recognition from spectral dynamics. Sadhana 36 (5),
729–744.
Hermansky, H., 2013. Multistream recognition of speech: dealing with unknown unknowns. Proc. IEEE 101 (5), 1076–1088.
Hermansky, H., Fousek, P., 2005. Multi-resolution rasta filtering for TANDEM-based
ASR. In: Proc. Int. Conf. Spoken Lang. Process., pp. 361–364.
Hirsch, H., Pearce, D., 2006. Applying the advanced ETSI frontend to the Aurora-2
task. Tech. Report version 1.1. http://dnt.kr.hsnr.de/aurora/download/Aurora2_
afe_v1_1.pdf
Hirsch, H.G., Pearce, D., 2000. The AURORA experimental framework for the performance evaluation of speech recognition systems under noisy conditions. In: Proceedings ISCA Workshop ASR2000, Automatic Speech Recognition: Challenges for the Next Millennium. Paris, France, pp. 29–32.
Holmes, J., Holmes, W., 2001. Speech Synthesis and Recognition, 2 edition Taylor
and Francis, London and New York.
Houtgast, T., 1989. Frequency selectivity in amplitude-modulation detection. J.
Acoust. Soc. Am. 85 (4), 1676–1680.
Houtgast, T., Steeneken, H.J.M., 1985. A review of the mtf concept in room acoustics
and its use for estimating speech intelligibility in auditoria. J. Acoust. Soc. Am.
77, 1069–1077.
Huang, X., Acero, A., Hon, H.-W., 2001. Spoken Language Processing. Prentice Hall,
Upper Saddle River, NJ.
Jiang, Z., Lin, Z., Davis, L.S., 2013. Label consistent K-SVD: Learning a discriminative dictionary for recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35 (11),
2651–2664.
Jørgensen, S., Dau, T., 2011. Predicting speech intelligibility based on the signal–
to-noise envelope power ratio after modulation-frequency selective processing.
J. Acoust. Soc. Am. 130 (3), 1475–1487.
Jørgensen, S., Dau, T., 2014. Modeling speech intelligibility based on the signal–
to-noise envelope power ratio. Technical University of Denmark, Department of
Electrical Engineering Ph.D. thesis. PhD-afhandling.
Jørgensen, S., Ewert, S.D., Dau, T., 2013. A multi-resolution envelope-power based
model for speech intelligibility. J. Acoust. Soc. Am. 134 (1), 436–446.
Kanedera, N., Arai, T., Hermansky, H., Pavel, M., 1999. On the relative importance of
various components of the modulation spectrum for automatic speech recognition. Speech Commun. 28 (1), 43–55.
Kanedera, N., Hermansky, H., Arai, T., 1998. On properties of modulation spectrum
for robust automatic speech recognition. In: Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, 2, pp. 613–616.
Kay, R., Matthews, D., 1972. On the existence in human auditory pathways of channels electively tuned to the modulation present in frequency-modulated tones.
J. Physiol. 225 (3), 657–677.
Kim, C., Stern, R.M., 2009. Feature extraction for robust speech recognition using a
power-law nonlinearity and power-bias subtraction. In: INTERSPEECH. Brighton,
UK, pp. 28–31.
Kolossa, D., Haeb-Umbach, R. (Eds.), 2011, Robust Speech Recognition of Uncertain
or Missing Data — Theory and Applications. Springer.
Lee, D., Seung, H., 1999. Learning the parts of objects by non-negative matrix factorization. Nature 401 (6755), 788–791.
Lippmann, R., 1996. Speech recognition by humans and machines: miles to go before we sleep. Speech Commun. 18 (3), 247–248.
Luo, H., Poeppel, D., 2007. Phase patterns of neuronal responses reliably discriminate speech in human auditory cortex. Neuron 54 (6), 1001–1010.
Macho, D., Mauuary, L., Noé, B., Cheng, Y.M., Ealey, D., Jouvet, D., Kelleher, H.,
Pearce, D., Saadoun, F., 2002. Evaluation of a noise-robust DSR front-end on aurora databases. In: Proceedings INTERSPEECH. Denver, Colorado, USA, pp. 17–20.
Mesgarani, N., Cheung, C., Johnson, K., Chang, E.F., 2014a. Phonetic feature encoding
in human superior temporal gyrus. Science 343 (6174), 1006–1010.
Mesgarani, N., David, S.V., Fritz, J.B., Shamma, S.A., 2014b. Mechanisms of noise robust representation of speech in primary auditory cortex. Proc. Natl. Acad. Sci.
111 (18), 6792–6797. URL http://www.pnas.org/content/111/18/6792.abstract.
Meyer, B.T., 2013. What’s the difference? Comparing humans and machines on the
aurora-2 speech recognition task.. In: INTERSPEECH, pp. 2634–2638.
Meyer, B.T., Brand, T., Kollmeier, B., 2011. Effect of speech-intrinsic variations on human and automatic recognition of spoken phonemes. J. Acoust. Soc. Am. 129
(1), 388–403.
Misra, H., 2006. Multi-stream processing for noise robust speech recognition. École
Polytechnique Fédérale de Lausanne, Lausanne, Switzerland Ph.D. thesis. IDIAP-RR 2006 28.
Moore, B.C.J., 2008. Basic auditory processes involved in the analysis of speech
sounds. Philos. Trans. R. Soc. London 363, 947–963.
Moritz, N., Anemüller, J., Kollmeier, B., 2015. An auditory inspired amplitude modulation filter bank for robust feature extraction in automatic speech recognition.
IEEE/ACM Trans. Audio Speech Lang. Process. 23 (11), 1926–1937.
Myers, N., Wallis, G., 2013. Constraining theories of working memory with biophysical modelling. J. Neurosci. 33 (2), 385–386.
Ness, S.R., Walters, T., Lyon, R.F., 2012. Auditory sparse coding. In: Li, T., Ogihara, M., Tzanetakis, G. (Eds.), Music Data Mining. CRC Press, Boca Raton, FL 33487-2742.
Okawa, S., Bocchieri, E., Potamianos, A., 1998. Multi-band speech recognition in
noisy environments. In: Proceedings Int. Conf. Acoust. Speech Signal Process.,
pp. 641–644.
Olshausen, B.A., Field, D.J., 2004. Sparse coding of sensory inputs. Curr. Opin. Neurobiol. 14 (4), 481–487.
Pasley, B.N., David, S.V., Mesgarani, N., Flinker, A., Shamma, S.A., Crone, N.E.,
Knight, R.T., Chang, E.F., et al., 2012. Reconstructing speech from human auditory cortex. PLoS Biol. 10 (1), 175.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E., 2011. Scikit-learn: machine
learning in Python. J. Mach. Learn. Res. 12, 2825–2830.
Rabiner, L., Juang, B.-H., 1993. Fundamentals of Speech Recognition. Prentice-Hall,
Inc., Upper Saddle River, NJ, USA.
Schädler, M.R., Meyer, B.T., Kollmeier, B., 2012. Spectro-temporal modulation subspace-spanning filter bank features for robust automatic speech recognition. J.
Acoust. Soc. Am. 131 (5), 4134–4151.
Sroka, J.J., Braida, L.D., 2005. Human and machine consonant recognition. Speech
Commun. 45 (4), 401–423.
Sun, Y., Gemmeke, J.F., Cranen, B., ten Bosch, L., Boves, L., 2014. Fusion of parametric and non-parametric approaches to noise-robust ASR. Speech Commun. 56,
49–62.
ten Bosch, L., Boves, L., Ernestus, M., 2013. Towards an end-to-end computational
model of speech comprehension: simulating a lexical decision task. In: Proceedings of Interspeech. Lyon, France.
ten Bosch, L., Boves, L., Tucker, B., Ernestus, M., 2015. DIANA: towards computational
modeling reaction times in lexical decision in North American English. In: Proceedings of Interspeech. Dresden, Germany.
Tibrewala, S., Hermansky, H., 1997. Multi-stream approach in acoustic modeling. In:
Proceedings DARPA Large Vocabulary Cont. Speech Recognit. Hub 5 Workshop,
pp. 1255–1258.
Wei, Z., Wang, X.-J., Wang, D.-H., 2012. From distributed resources to limited slots in
multiple-item working memory: a spiking network model with normalization.
J. Neurosci. 32 (33), 11228–11240.
Young, S., Evermann, G., Gales, M., Hain, T., Kershaw, D., Liu, X.A., Moore, G., Odell, J.,
Ollason, D., Povey, D., Valtchev, V., Woodland, P., 2009. The HTK Book (for HTK
version 3.4). Technical Report. Cambridge University Engineering Department,
Cambridge, UK.
Zwicker, E., Flottorp, G., Stevens, S.S., 1957. Critical band width in loudness summation. J. Acoust. Soc. Am. 29 (5), 548–557.