Speech Communication 84 (2016) 66–82
Human-inspired modulation frequency features for noise-robust ASR
Sara Ahmadi a,b,∗, Bert Cranen a, Lou Boves a, Louis ten Bosch a, Antal van den Bosch a

a Center for Language Studies, Radboud University, PO Box 9600, NL-6500 HD Nijmegen, The Netherlands
b Speech Processing Research Laboratory, Electrical Engineering Department, Amirkabir University of Technology, Hafez Avenue, Tehran 15914, Iran

∗ Corresponding author at: Center for Language Studies, Radboud University, PO Box 9600, NL-6500 HD Nijmegen, The Netherlands. E-mail addresses: s.ahmadi@let.ru.nl (S. Ahmadi), b.cranen@let.ru.nl (B. Cranen), l.boves@let.ru.nl (L. Boves), l.tenbosch@let.ru.nl (L. ten Bosch), a.vandenbosch@let.ru.nl (A. van den Bosch).

http://dx.doi.org/10.1016/j.specom.2016.09.003
Article info
Article history:
Available online 19 September 2016
Keywords:
Modulation frequency
Auditory model
Noise-robust ASR
Abstract
This paper investigates a computational model that combines a frontend based on an auditory model
with an exemplar-based sparse coding procedure for estimating the posterior probabilities of sub-word
units when processing noisified speech. Envelope modulation spectrogram (EMS) features are extracted
using an auditory model which decomposes the envelopes of the outputs of a bank of gammatone filters
into one lowpass and multiple bandpass components. Through a systematic analysis of the configuration
of the modulation filterbank, we investigate how and why different configurations affect the posterior
probabilities of sub-word units by measuring the recognition accuracy on a semantics-free speech
recognition task. Our main finding is that representing speech signal dynamics by means of multiple
bandpass filters typically improves recognition accuracy. This effect is particularly noticeable in very
noisy conditions. In addition, we find that for maximum noise robustness the bandpass filters should focus on low modulation frequencies. This reinforces our intuition that noise robustness can be increased by exploiting redundancy in those frequency channels whose integration time is long enough not to suffer from envelope modulations that are solely due to noise. The ASR system we design based on these findings behaves more similarly to human recognition of noisified digit strings than conventional ASR systems do. Thanks to the relation between the modulation filterbank and the procedures for computing dynamic acoustic features in conventional ASR systems, these findings can also be used to improve the frontends of such systems.
© 2016 Published by Elsevier B.V.
1. Introduction
Over the past decades a substantial body of neurophysiological
and behavioral knowledge about the human auditory system has
been accumulated. Psycho-acoustic research has provided detailed
information about the frequency and time resolution capabilities
of the human auditory system (e.g. Fletcher, 1940; Zwicker et al.,
1957; Kay and Matthews, 1972; Bacon and Viemeister, 1985; Houtgast, 1989; Houtgast and Steeneken, 1985; Drullman et al., 1994;
Dau et al., 1997a; 1997b; Ewert and Dau, 2000; Chi et al., 2005;
Moore, 2008; Jørgensen and Dau, 2011; Jørgensen et al., 2013). It is
now generally assumed that the rate with which the tonotopic representations in the cochlea change over time, the so-called modulation frequencies, is a crucial aspect of the intelligibility of speech
signals. Drullman et al. (1994) showed that modulation frequencies between 4 Hz and 16 Hz carry the bulk of the information in
Corresponding author at: Center for Language Studies, Radboud University, POBox 9600, NL-6500 HD Nijmegen, the Netherlands.
E-mail addresses: s.ahmadi@let.ru.nl (S. Ahmadi), b.cranen@let.ru.nl
(B. Cranen), l.boves@let.ru.nl (L. Boves), l.tenbosch@let.ru.nl (L. ten Bosch),
a.vandenbosch@let.ru.nl (A. van den Bosch).
∗
http://dx.doi.org/10.1016/j.specom.2016.09.003
0167-6393/© 2016 Published by Elsevier B.V.
speech signals. Modulation frequencies around 4 Hz roughly correspond to the number of syllables per second in normal speech; the
highest modulation frequencies are most likely related to changes
induced by transitions between phones.¹ Despite the fact that several attempts have been made to integrate the concept of modulation frequencies in automatic speech recognition (ASR) (e.g., Hermansky, 1997; Kanedera et al., 1998; Kanedera et al., 1999; Hermansky, 2011; Schädler et al., 2012; Moritz et al., 2015), these investigations have not led to the crucial breakthrough in noise-robust ASR that was hoped for. The performance gap between human speech recognition (HSR) and ASR is still large, especially for
speech corrupted by noise (e.g. Lippmann, 1996; Sroka and Braida,
2005; Meyer et al., 2011; Meyer, 2013).
For meaningful connected speech, part of the advantage of humans is evidently due to semantic predictability, but also in tasks
where there is no semantic advantage, such as in recognizing digit
sequences (Meyer, 2013) or phonemes (Meyer et al., 2011), humans
tend to outperform machines substantially. Therefore, it must be
assumed that acoustic details that are important in human processing are lost in feature extraction or in the computation of posterior probabilities in ASR systems.

¹ Brainstem research indicates that the human brain has access to modulation frequencies up to at least 250 Hz. Such modulation frequencies might allow resolving the fundamental frequency of voiced speech, which would provide interesting perspectives for understanding speech in, for instance, multi-speaker environments. However, we limit ourselves to the modulation frequency range that pertains to articulatory induced changes in the spectrum.
There is convincing evidence that some information is lost if
(noisy) speech signals are merely represented as sequences of
spectral envelopes. Demuynck et al. (2004) showed that it is possible to reconstruct intelligible speech from a sequence of MFCC vectors, but when Meyer et al. (2011) investigated the recognition accuracy of re-synthesized speech in noise by human listeners, they
found that in order to achieve the same phoneme recognition accuracy as with the original speech, the re-synthesized speech required a signal-to-noise ratio (SNR) that was 10 dB higher (3.8 dB
versus −6.2 dB).
In Macho et al. (2002) it was shown that an advanced frontend that implements a dynamic noise reduction prior to the computation of MFCC features reduces the word error rate. Meyer
(2013) showed that advanced features, such as power-normalized
cepstral coefficients (PNCC) (Kim and Stern, 2009) and Gabor filter
features (Schädler et al., 2012) improve recognition accuracy compared to default MFCCs. The advanced frontend, the PNCC and the
Gabor filter features introduce characteristics of the temporal dynamics of the speech signals that go beyond static coefficients enriched by adding deltas and delta-deltas. Therefore, it is quite likely
that both HSR and ASR suffer from the fact that a conventional
frontend that samples the spectral envelope at a rate of 100 times
per second and then adds first and second order time derivatives
yields an impoverished representation of crucial information about
the dynamic changes in noisy speech.
The research reported here is part of a long-term enterprise aimed at understanding human speech comprehension by
means of a computational model that is in conformity with the
(neuro)physiological knowledge. For that purpose we want to build
a simulation that not only makes equally few, but also the same
kind of recognition errors as humans in tasks that do not involve
elusive semantic processing. As a first step in that direction we investigate the performance of ASR systems with frontends inspired
by an auditory model that has proved to predict intelligibility quite
accurately in conditions with additive stationary noise, reverberation, and non-linear processing with spectral subtraction (Elhilali
et al., 2003; Jørgensen and Dau, 2011; 2014; Jørgensen et al., 2013).
In addition, we investigate how an exemplar-based procedure for
estimating the posterior probabilities of sub-word units interacts
with the auditory-based frontends.
Auditory models predict speech intelligibility on the basis of the difference between the long-term average power of the noise and
the speech signals at the output of the peripheral auditory system
(Jørgensen and Dau, 2011). However, it is evident that the long-term power spectrum of a speech signal is not sufficient for speech
recognition. Auditory models are silent about all the processing of
their outputs that is necessary to accomplish speech recognition.
As a consequence, it is not clear whether an auditory model that
performs well in predicting intelligibility for humans based on the
noise envelope power ratio, such as the SNRenv model (Jørgensen and Dau, 2011), is also optimal in an ASR system that most probably processes the output of the auditory model in a different way
than humans do.
The modulation filterbank in the auditory frontend proposed in e.g. Jørgensen and Dau (2011, 2014) and Jørgensen et al. (2013) consists of a lowpass filter (LPF) and a number of bandpass filters
(BPFs) that together cover the modulation frequency band up to
20 Hz. In our work we will vary the cut-off frequency of the
LPF, as well as the number and center frequencies of the BPFs.
In this respect, our experiments are somewhat similar to the experiments reported in Moritz et al. (2015), who aimed to harness knowledge about the human auditory system to improve the
conventional procedure for enriching MFCCs with delta and delta-delta coefficients. In our research the focus is on understanding
how and why resolving specific details in the modulation spectrum improves recognition performance, rather than on obtaining
the highest possible recognition accuracy. The way in which we
use sparse coding for estimating the likelihood of sub-word units
in noise-corrupted speech is very different from the approach pioneered by Gemmeke et al. (2011), who tried to capture the articulatory continuity in speech by using exemplars that spanned
300 ms. In Ahmadi et al. (2014) it was shown that single-frame
samples of the output of a modulation filterbank capture a comparable amount of information about articulatory continuity. In that
paper we designed the modulation filterbank based on knowledge
collected from relevant literature on the impact of different modulation bands on clean speech recognition. Here, we extend that
work substantially by experimenting with conceptually motivated
designs of the filterbank.
All theories of human speech comprehension (e.g. Cutler, 2012)
and all extant ASR systems (e.g. Rabiner and Juang, 1993; Huang
et al., 2001; Holmes and Holmes, 2001) assume that speech recognition hinges on recognizing words in some lexicon, and that these words are represented in the form of a limited number of sub-word units. Recognition after the frontend is assumed to comprise two additional processes, viz. estimating the likelihoods of
sub-word units and finding the sequence of words that is most
likely given the sub-word unit likelihoods. Both computational
models of HSR (e.g. ten Bosch et al., 2013; 2015) and ASR prefer
statistical models or, alternatively, neural network models for estimating sub-word model likelihoods, and some sort of finite state
transducer for finding the best path through the sub-word unit lattice.
Despite the analogy between artificial neural networks and the
operation of the brain, and despite the fact that networks of spiking neurons have been shown to be able to approximate arbitrary
statistical distributions (e.g. Buesing et al., 2011), there is no empirical evidence in support of a claim that human speech processing makes use of statistical models of sub-word units. Therefore,
we decided to explore the possibility that the estimation of the
likelihoods of sub-word units is mediated by an exemplar-based
procedure (Goldinger, 1998). Exemplar-based procedures offer several benefits, compared to GMM-based approaches. An advantage
that is especially beneficial for our work is that exemplar-based
approaches can handle high-dimensional feature vectors, without
the need for dimensionality reduction procedures that are likely to
mix up tonotopic features that are clean and features that are corrupted by some kind of ‘noise’. In addition, exemplar-based representations are compatible with recent findings about the representation of auditory patterns in human cortex (Mesgarani et al.,
2014a; 2014b) and models of memory formation and retrieval (e.g.
Wei et al., 2012; Myers and Wallis, 2013).
De Wachter et al. (2007) have shown that an exemplar-based
approach to automatic speech recognition is feasible when using
MFCCs and GMMs. More recently, Ahmadi et al. (2014); Gemmeke
et al. (2011) have shown that noise-robust ASR systems can be
built using exemplar-based procedures in combination with sparse
coding (e.g. Lee and Seung, 1999; Olshausen and Field, 2004; Ness
et al., 2012). Geiger et al. (2013) have shown that the exemplar-based SC approach can be extended to handle medium-vocabulary noise-robust ASR. In sparse coding procedures a (possibly very large) dictionary of exemplars of speech and noise is used to represent unknown incoming observations as a sparse sum of the exemplars in the dictionary.
The seminal research at Bell Labs by Fletcher (1940, 1953) provides evidence for the hypothesis that speech processing relies on matching incoming signals to stored knowledge in separate frequency bands. That insight has been explored for the purpose of
noise-robust ASR in the form of multi-stream processing (Misra,
2006). We apply the same insight to the frequency bands in the
modulation spectrum: we assume that the high-dimensional modulation spectrum contains enough features that are not affected by
the noise, so that they will dominate the distance measure in a
sparse coding engine. The probability that ‘clean’ bands exist will
depend on the design details of the modulation filter (and on the
noise characteristics).
A sparse coding engine that represents noisy speech in the form
of sparse sums of clean speech and pure noise exemplars can operate in three main ways. If it starts with matching noise exemplars,
the operation is reminiscent of noise suppression and spectral subtraction (e.g. Kolossa and Haeb-Umbach, 2011). If the engine starts
with matching speech exemplars, its operation is reminiscent of
missing data approaches and glimpsing (Cooke, 2006). Combinations of both strategies can also be envisaged. A third possible
strategy, and the strategy used in this paper, is treating the noise
and speech exemplars in the exact same way, leaving it to the
solver whether an unknown exemplar is first matched to speech
or noise exemplars.
To maximize the possibility for comparing our results to previous research, we develop our initial system using the aurora-2
data set. Although one might argue that the aurora-2 task is not
representative of a general speech recognition task, the task does
not limit the generalizability of the insight gained. Actually, the design of aurora-2 is beneficial for our current purpose for two reasons. First, recognizing connected digit strings does not require an
advanced language model; the fact that all sequences of two digits
are equally probable minimizes the interference between the frontend and the backend. This set-up also corresponds to research on
human speech intelligibility, which is often based on short semantically unpredictable (and therefore effectively meaningless) utterances. Second, the literature contains a number of benchmarks to
which the current results can be compared. In our experiments we
will follow the conventional approach to the aurora-2 task which
requires estimating the posterior probabilities of 176 speech and 3
silence states in a hidden Markov model.
Fig. 1. Block diagram of the noise-robust ASR system.

2. System overview
The recognition system used in this work is depicted schematically in Fig. 1. We discern three main processing blocks. In the first
block, acoustic features are extracted every 10 ms from the speech
signal using the same type of signal processing as employed in the
speech-based envelope power spectrum model (sEPSM) proposed
by Jørgensen and Dau (2011) and Jørgensen et al. (2013). The sEPSM
model contains more simplifying assumptions than the auditory
model proposed in Chi et al. (2005), but the models are very similar in spirit. The feature extraction block is described in more detail in Section 2.1. The second block uses the outputs of the modulation filters for estimating the posterior probabilities of the 179
sub-word units (HMM-states) in aurora-2 by means of a sparse
coding (SC) approach (Ahmadi et al., 2014; Gemmeke et al., 2011).
This block is explained in detail in Section 2.2. Finally, the third
block is a conventional Viterbi decoder that finds the most likely
word sequence combining prior and posterior probabilities of the
179 model states. This block is described in Section 2.3.
2.1. Feature extraction

Fig. 2 shows a diagram of the feature extraction module. An auditory filterbank consisting of 15 gammatone filters is applied to the 8 kHz speech signal x(t) and forms a set of sub-band signals X_g(t), g = 1, ..., 15. The center frequencies of the gammatone filters range from F_1 = 125 Hz to F_15 = 3150 Hz, distributed along a log-frequency scale with 1/3rd octave spacing. The gammatone filters were implemented in the time domain. The envelope of each gammatone filter output is then calculated as the magnitude of the analytic signal using the Hilbert transform:

$$E_g(t) = \left| X_g(t) + j \cdot \mathrm{Hilbert}(X_g(t)) \right| . \qquad (1)$$
The model proposed in Chi et al. (2005) uses 24 filters per octave. However, it is widely agreed (e.g. Moore, 2008) that a 1/3rd octave gammatone filterbank captures all detail in auditory signals that is relevant for speech recognition. Therefore, the design of the gammatone filterbank is kept constant in all experiments.
The 15 sub-band envelopes are downsampled to 100 Hz and
then fed into a bank of M + 1 modulation frequency filters, one
lowpass and M bandpass filters. Thus, the output of the modulation filterbank consists of 15 · (M + 1 )-dimensional feature vectors.
In Section 3 we evaluate the impact on recognition performance
when the number of modulation bandpass filters and the way in
which their center frequencies are distributed on the frequency
axis are varied.
In the modulation filterbank we used a first-order Butterworth lowpass filter (downward slope −6 dB/oct) and a set of second-order bandpass filters with quality factor Q = 1 (rising and falling slopes of +6 and −6 dB/oct, respectively), since a filterbank consisting of Q = 1 filters simulated the intelligibility of human listeners best (e.g. Jørgensen and Dau, 2011; 2014; Jørgensen et al., 2013). The modulation filterbanks were also implemented in the time domain.
The operation of the feature extraction module is illustrated
in Fig. 2. The left-hand column shows the operation in the frequency domain. The right-hand column shows two snapshots of
the operation in the time domain. The top panel shows the envelope of the output of the gammatone filter with center frequency
Fg = 315 Hz for an utterance of the digit string “zero-six”. The bottom panel shows the decomposition of this envelope in its modulation frequency components. The all-positive blue curve in the
right-hand bottom panel is the output of the low pass filter; the
other curves in this panel represent the output of the modulation
bandpass filters. The complete output of the modulation filterbank is a set of time signals E_{m,g}(t) which represent the mth modulation frequency component, centered at F_m Hz, of the gth gammatone sub-band envelope at F_g Hz. The envelopes at the outputs of the
gammatone filters can be approximately reconstructed by means of Eq. (2):²

$$\sum_{m=1}^{M+1} E_{m,g}(t) \approx E_g(t), \qquad g = 1, 2, \ldots, 15. \qquad (2)$$
The bottom panel in the left-hand column in Fig. 2 shows the
amplitudes of the outputs of nine modulation frequency filters for
each of the 15 gammatone filters for the utterance “zero-six”. We will refer to this representation as the envelope modulation spectrogram (EMS) in the remainder of the paper. The EMS feature vector is obtained by stacking the decomposed sub-band envelopes.
² Depending on the spacing of the center frequencies of the filters, the approximation of Eq. (2) may be more or less accurate. If a non-uniform resolution over frequency is considered desirable, the resulting sum is a “distorted” version of the original envelope in which the more densely represented frequencies are over-represented/emphasized.
Fig. 2. Block diagram of the feature extraction module. Left column: system operation in frequency domain. Right column: examples of time domain representations.
Because the signal envelopes are downsampled to 100 Hz, we obtain an EMS feature vector every 10 ms (which we will, analogously to customary ASR terminology, refer to as feature frames).
Contrary to conventional Mel filter feature extraction, the vector
elements do not apply to fixed analysis windows of 25 ms that
are shifted with a step size of 10 ms. Instead, the effective time
context spanned by the feature value in a modulation band depends on the duration of the impulse response of the corresponding modulation filter. Ahmadi et al. (2014) found that retaining the
phase information of the modulation frequency components, i.e.,
not compensating for the group delay and refraining from applying
full-wave rectification to the filter outputs, had a beneficial effect
on recognition performance. A similar result was found in Moritz
et al. (2015). Therefore, we refrained from reverting to magnitude
features and any form of group delay compensation.
2.2. Computation of posterior probabilities
The sparse coding procedure needs a dictionary of speech and
noise exemplars.

Fig. 3. Block diagram of the posterior probability computation block. A sample posterior probability matrix is visualized on the right side of the figure. The activation vector (S) and state posterior probability vector (P) of a single time frame of the sample signal are shown in the bottom part of the figure.

In all experiments in this paper we used a dictionary that comprises 17,148 speech exemplars and 13,504 noise exemplars. For each configuration of the modulation filterbank
a new dictionary was constructed. Exemplars consist of a single
feature frame (EMS vector). Given the amplitude response of the
modulation filters with the lowest center frequencies, information
about continuity of spectral changes over time is preserved in the
EMS features. For all configurations of the modulation filterbank
the exact same time frames extracted from the training set in
aurora-2 were used as exemplars.
The speech and noise exemplars were obtained by means of a
semi-random selection procedure. We made sure that we had the
same number of exemplars from female and male speakers, and almost the same number of exemplars associated with the 179 states
in the aurora-2 task. For that purpose we labeled the clean training speech by means of a conventional HMM system using forced
alignment. Most states were represented by 98 exemplars in the
dictionary. The remaining states, which had fewer frames in the
training material, were represented by at least 86 exemplars. To
obtain the noise exemplars the noise signals were reconstructed
by subtracting the clean speech from the noisified speech in the
multi-condition training set. The resulting signals were processed
by the modulation frequency frontend, and the noise exemplars
were randomly selected from these output signals.
As can be seen in Fig. 3, the procedure for estimating posterior probabilities of sub-word units consists of several steps. The
first step involves a normalization of the EMS features (i.e., standard deviation equalization and Euclidean-normalization), the second implements the reconstruction of unknown observations as a
sparse sum of exemplars in a dictionary (sparse coding), and the
third step converts the exemplar activations to posterior probabilities.
Standard deviation equalization and Euclidean-normalization. We
used a Lasso procedure for reconstructing EMS vectors as a sparse
sum of exemplars from the dictionary (Efron et al., 2004). Lasso is
able to handle the positive and negative components in the EMS
vectors. The Lasso procedure minimizes the root mean square of
the difference between an observation and its reconstruction. The
range and variance of the components of the EMS vectors differ
considerably (Ahmadi et al., 2014). To make sure that all gammatone bands can make an effective contribution to the distance
measure, some equalization in the EMS vectors is required. We follow the strategy used in Ahmadi et al. (2014), in which the standard deviations of the samples of the gammatone envelope signals
E_g(t) within each modulation band are equalized using weights obtained from the speech exemplars in the dictionary. Each E_{m,g}(t) is multiplied by an equalization weight w_g:

$$w_g = \frac{1}{M+1} \sum_{m=1}^{M+1} \frac{1}{\sigma_{15\cdot(m-1)+g}} \qquad \text{for } 1 \le g \le 15, \qquad (3)$$

where σ_i (i = 15·(m−1)+g), 1 ≤ i ≤ 15·(M+1), is the standard deviation of the ith element of the speech dictionary exemplars.
With this procedure the standard deviation of these modified features is equalized within each modulation band, while the relative importance of the different modulation bands is retained. The
equalization weights were recomputed for each configuration of
the modulation filterbank.
Algorithms for finding the optimal representation of unknown
observations in the form of a sparse sum of exemplars are sensitive to the (Euclidean) norm of the observations and exemplars.
Therefore, we normalized all exemplars and all unknown feature
vectors to unit Euclidean norm. However, for speech-silence segmentation, information about the absolute magnitude of the filter
outputs is needed. We used the unnormalized EMS vectors for that
purpose.
Sparse coding. Unknown observations $\overrightarrow{\mathrm{EMS}}(t)$ are reconstructed as a sparse linear combination of exemplars from a dictionary A that contains both speech and noise exemplars,

$$\overrightarrow{\mathrm{EMS}}(t) \approx \sum_{n=1}^{N} s_n \mathbf{a}_n = AS, \qquad (4)$$

where S is a sparse weight vector that contains the non-negative exemplar activation scores of the dictionary exemplars that minimize the Euclidean distance between the test vector $\overrightarrow{\mathrm{EMS}}(t)$ and the reconstructed version, subject to a sparsity constraint (controlled by λ):

$$\min_{S} \left\| \overrightarrow{\mathrm{EMS}}(t) - AS \right\|_2 \quad \text{s.t.} \quad \|S\|_1 < \lambda. \qquad (5)$$
From activations to posterior probabilities. The exemplar activation scores must be converted into state posterior probabilities. For that purpose, we use the state labels of the speech exemplars in the dictionary. As the exemplar dictionary A = [A_s, A_n] is the concatenation of a speech and a noise dictionary, the activation vector S in Eq. (5) can be split into two separate parts, S = [S_s; S_n], indicating the weights corresponding to speech and noise exemplars, respectively. Since the noise exemplar activations are irrelevant for estimating the posterior state probabilities, we ignore the noise exemplar activations (S_n). With L_{1×N_{A_s}} the label vector (N_{A_s} = 17,148 is the number of speech exemplars), and the ith element 1 ≤ L_i ≤ 179 representing the label of the ith exemplar in the speech dictionary, we compute a cumulative state activation vector C in which each element C_j, j = 1, 2, ..., 179, is the sum of the activation scores corresponding to dictionary exemplars that have state label number j:

$$C_j = \sum_{\{i \mid L_i = j\}} S_i, \qquad (6)$$

where S_i is the ith element in S_s. The state posterior probability estimate is then computed by normalizing the vector C to L1 norm 1:

$$P = \frac{C}{\sum_{j=1}^{179} C_j}. \qquad (7)$$
As in Gemmeke et al. (2011), it appeared that the procedure
of Eq. (6) systematically underestimates the posterior probability
of the three silence states. This is due to the fact that the normalization of all EMS vectors to unit length effectively equalizes
the overall magnitude, thereby destroying most of the information
that distinguishes silence from speech. Therefore, we implemented
an additional procedure that estimates the probability of a frame
being either speech or silence on the basis of the unnormalized
feature values. In frames that were classified as silence by this procedure the posterior probability of the three silence states was set
to 0.333, and the posterior probability of the 176 speech states was
set to some small floor value.
2.3. Viterbi decoder

The Viterbi decoder finds the most likely word sequence in a 179 (states) by N (frames) matrix by combining prior and posterior probabilities of the 179 states. The implementation allows us to use different word entrance penalties for the eleven digit words and the silence ‘word’. The decoder uses a pre-estimated 179-by-179 state-to-state transition matrix that contains the log probabilities associated with each state-to-state transition. Probabilities of the non-eligible transitions are first floored to a small positive value before the logarithm is applied. This flooring has a negligible effect on the total probability mass (i.e., the posterior probabilities of the 179 states to which a transition is allowed still sum almost to one). The state-to-state transition matrix is fixed across all experiments in this paper. The word-to-word transitions in the language model (LM) are determined by the conditional bigram (word-word) probabilities, which are virtually uniform.

There are two free parameters (i.e. the word and silence entrance penalties) which were tuned on a development test set to adjust the balance between insertions and deletions and to minimize the word error rate. The decoder only provides the best path with the associated accumulated score and the hypothesized words and silences, including a segmentation at the word level.

3. Exploiting modulation frequency domain information

To investigate the impact of the way in which the information about modulation frequencies is represented in the EMS feature vectors, we designed a sequence of experiments. In Study 1 we use a simplified version of the auditory model to investigate several technical and conceptual issues. We also address the correspondence between the LPF and BPFs in the modulation filterbank on the one hand and the static and dynamic features in conventional ASR systems on the other (cf. Moritz et al., 2015). In Study 2 we investigate the performance gain that can be obtained when the cut-off frequency of the LPF is varied and an additional number of modulation bandpass filters is added. We also investigate how recognition performance is affected when the LPF and BPFs cover the same modulation frequency range. Finally, in Study 3 we return to the original auditory model (keeping the cut-off frequency of the LPF fixed at 1 Hz) and investigate the impact of different configurations of the bank of BPFs (varying the number of BPFs and the spacing of the center frequencies, i.e., linear or logarithmic) used for capturing the dynamic information.

3.1. Study 1: exploratory experiments
We started experimenting with a highly simplified auditory-like
model that consists of a LPF in combination with one BPF that
emphasizes modulations in a specific frequency band, i.e., M, the number of BPFs in the modulation filterbank, equals one. One
conceptual issue concerns the cut-off frequency of the LPF. Different instantiations of the auditory model used quite different
LPFs. For example, Moritz et al. (2015) started from the system
described in Dau et al. (1997a), where the LPF has a cut-off frequency of 6 Hz. This corresponds to an integration time of approximately 170 ms, compared to the 1000 ms integration time
of the LPF with a cut-off frequency of 1 Hz in Jørgensen and Dau
(2014) that is used here. One might wonder whether such a long
integration time can at all be used in experiments with isolated
utterances that may have a duration between 0.5 and 3 s. We address the cut-off frequency of the LPF in this study and investigate
it further in the next study in configurations with multiple BPFs. In
our simplified model, we followed two different strategies in defining the LPF cut-off frequency: 1) the LPF cut-off frequency is fixed
at 1 Hz, while the center frequency of the BPF increases; 2) the
LPF cut-off frequency is always 1 Hz below the center frequency
of the BPF, the center frequency of which increases. We compare
the performance of these simplified models with a single LPF covering the same modulation frequency range to evaluate the advantage of emphasizing specific modulation frequencies using the BPF.
The number of feature elements in the simplified auditory model
(LPF+BPF) is twice the number of feature elements obtained using
a single LPF. Moreover, the shape of the effective transfer function of a filterbank consisting of a LPF and one BPF is different from that of a single
Butterworth LPF, as shown in Fig. 4a–c. To disentangle the effect
of these two factors on the performance and also to verify that
the effective transfer function is an important issue to consider in
the design of a modulation filterbank, we compare the accuracy
Fig. 4. Word recognition accuracy on clean speech using feature vectors consisting of lowpass filtered gammatone filter envelopes without (blue) or with additional emphasis
on a specific modulation frequency band. Emphasis is accomplished by modifying the frequency response of a single lowpass filter (magenta open circles) or by adding an
additional bandpass filter. The green curve (diamonds) pertains to a fixed lowpass filter (FLP = 1 Hz) in combination with a bandpass filter with varying center frequency;
the red curve (asterisks) pertains to a lowpass filter of which the cut-off frequency was 1 Hz below the center frequency of the accompanying bandpass filter. The shaded
bands indicate the 95% confidence interval. Sub-figures (a), (b) and (c) show the transfer functions of the composing filters and their sum (in red). (For interpretation of the
references to color in this figure legend, the reader is referred to the web version of this article.)
that can be obtained with a two-filter system and a system with
a single LPF that has the same transfer function as the two-filter
system. A final, also somewhat conceptual issue that we wanted to
explore is to what extent results obtained with a specific configuration for clean speech generalize to the noisified test utterances.
3.1.1. Clean speech
The results of the pilot experiments on clean speech are summarized in Fig. 4. The red curve (asterisks) in Fig. 4d shows the
recognition accuracy obtained with a modulation filterbank that
consists of a LPF with a cut-off frequency that increases from 1 Hz
to 16 Hz, combined with a BPF with a center frequency 1 Hz higher than
the LPF cut-off frequency. Accuracy increases with an increase of
the modulation frequency band that is covered, up to a frequency
of 7 Hz, where ceiling performance is reached. Interestingly, this
‘optimum’ is obtained with the cut-off frequency of the LPF in the
auditory model proposed in Dau et al. (1997a). With 15 gammatone filters and two filters in the modulation filterbank, the EMS feature
vectors contained 30 coefficients.
The purple (open circles) curve in Fig. 4d pertains to a modulation ‘filterbank’ that consisted of a single LPF with a frequency response identical to the two-filter system underlying the red (asterisk) curve. Since the modulation filterbank comprised only a single
filter, the EMS vectors contained 15 features. From this comparison it can be concluded that representing an overall frequency response by means of two filters, resulting in EMS vectors that contain two sets of 15 features is advantageous.
The blue (filled circles) curve shows the recognition accuracy
obtained with a single LPF with increasing cut-off frequency, and
a frequency response that was flat in the pass band. The comparison between this curve and the purple curve shows that an overall
frequency response identical to the two-filter system yields better
accuracy than a flat response when the EMS vectors contain the
same number of features.
The green curve (open diamonds) pertains to the accuracy obtained with a two-filter system in which the cut-off frequency of
the LPF was fixed at 1 Hz, while the center frequency of the BPF
was increased from 2 Hz to 17 Hz. For the BPF center frequency of
2 Hz the configuration is identical to the second configuration in
the red (filled asterisks) curve. When the center frequency of the
BPF is 3 Hz it can already be seen that the performance lags relative to the configuration in which this BPF is combined with
a LPF with a cut-off frequency of 2 Hz (the red curve), despite the
equal number of features in the EMS vectors. For center frequencies of the BPF > 6 Hz the accuracy of this system decreases
with increasing center frequency. The accuracy of this two-filter
system drops below the single LPF system (the purple open circles curve) for BPF center frequencies > 8 Hz. The accuracy even
drops below the single, flat response LPF system for BPF center frequencies > 14 Hz. We attribute this effect to the overall transfer
function of this two-filter filterbank. As can be seen in Fig. 4b, the
frequency response contains a trough around 4 Hz that deepens as
the center frequency of the BPF increases.
From the data in Fig. 4 we can draw several preliminary conclusions. Probably the most important conclusion is that the overall
frequency response of the modulation filterbank has a large impact
on the performance of the system. The frequency response must
cover at least the band up to 7 Hz, and emphasizing a somewhat
narrow band centered around frequencies up to 7 Hz yields higher
accuracy than a flat response. Emphasizing ever higher modulation
frequencies has no beneficial, but also no detrimental effect. The
second conclusion is that the number of coefficients in the EMS
feature vectors is important. With identical frequency responses,
the systems that encode the output of the BPF as an additional
Fig. 5. Word recognition accuracy on noisy speech (four noise types in test set A), using feature vectors consisting of lowpass filtered gammatone filter envelopes together
with an additional bandpass filtered version of the envelope. (a) Word recognition accuracy on four noise types at SNR level of 20 dB. (b) Word recognition accuracy on four
noise types at SNR level of −5 dB. (Note the different scales of the vertical axes.).
set of 15 coefficients always perform much better. This indicates that EMS vectors which distribute the information about the overall frequency response over one set of features corresponding to the flat part of the response and another set corresponding to the emphasized region of the frequency axis are more discriminative.
3.1.2. Noisy speech
Since filterbanks that combine a LPF with increasing cut-off frequency with a BPF with center frequency 1 Hz above the cut-off
appeared to yield the best accuracy, we tested these configurations
on the noisy utterances of test set A. The other (inferior) configurations mentioned above were also tested; results are not shown, because they do not contribute additional information. Fig. 5 shows
the accuracies in the SNR = 20 dB and SNR = −5 dB conditions.
From Fig. 5a it can be seen that the results in the SNR = 20 dB
condition are similar to the results obtained with clean speech.
However, the frequency range at which ceiling performance is
reached differs slightly between the four noise types. Also, the extent to which the accuracy varies on the plateau seems to differ
slightly between the four noise types.
At the SNR = −5 dB level (cf. Fig. 5b), a different pattern of results is visible. Although it is not safe to draw strong conclusions
from very low recognition accuracies, several observations stand
out. First, there is substantial difference between the noise types.
Noise type N2, babble noise, yields the highest accuracies for all
LPF cut-off frequencies. The accuracy with car noise (N3) drops almost to the level of subway noise (N1) with cut-off frequencies
≥ 12 Hz. It can also be seen that all four noise types show a decreasing accuracy when the cut-off frequency of the LPF increases
beyond some maximum. For car noise the fall is deep and steep,
whereas it is quite shallow for subway noise.
An in-depth analysis of the distributions of the EMS vectors
showed that these somewhat surprising results are caused by the
difference (or similarity) between the two-band EMS features of
speech and the corresponding features of the four noise types. In
the lower SNR conditions (and especially with SNR= −5 dB) we see
two different effects. Noise exemplars in the dictionary account for
a substantial proportion of the approximation of the noisy speech
EMS vectors; this results in low –and possibly random– activations
of the speech states. Except for the subway noise, the reduction of
the total activation of speech states becomes worse as the BPF emphasizes higher modulation frequencies, which are less informative
for speech. The overall reduction of the activation of speech states
is combined with an increasing shift of the activations towards
a small number of speech states that happen to have EMS vec-
tors that are somewhat similar to the vectors that characterize the
noises. This results in a digit confusion pattern that strongly favors
the digits that happen to contain these favored states. This effect
is especially clear for N1 (subway) and N4 (exhibition hall), whose
EMS vectors are characterized by high values in the high-frequency
gammatone filters, both in the LPF and BPF. The EMS vectors of
N1 show this effect already at low cut-off frequencies, which explains the fairly flat shape of accuracies as a function of cut-off frequency. Babble noise behaves differently in that it does not favor a
small number of speech states. The especially detrimental effect of
N3 (car) is due to a combination of the two effects: a small number of speech states is favored, while the total activation of the
speech states is small. The large differences between the recognition accuracies with the four noise types at −5 dB SNR suggest
–unsurprisingly– that a two-filter modulation filterbank does not
provide sufficient resolution for coping with different noise types.
3.1.3. The link with delta coefficients in conventional ASR
In addition to commonalities between the acoustic features
used in conventional ASR and the output of an auditory model,
there are also substantial differences. The conventional ASR approach is based on (power) spectra estimated from 100 overlapping windows per second. Such a spectrum can be considered as
equivalent to the EMS features in a LPF with cut-off frequency set
to 50 Hz. Furthermore, the delta coefficients in conventional ASR (i.e., the time derivatives of the static features) can be
viewed as the output of a single modulation frequency bandpass
filter. The transfer function of a differentiator has a rising slope
of +6 dB/octave; therefore, the output of a bandpass filter with
a rising slope of +6 dB/octave can be considered as a low-pass filtered version of a differentiator. The falling slope of the BPF determines to what extent the high frequencies in the differentiated
signal are attenuated. In the Q = 1 filters of our auditory model,
the falling slope is −6 dB/octave. In conventional ASR the center frequency of the bandpass filter, as well as the steepness of
the falling slope, depend on the number of static coefficients involved in the regression function used in computing the deltas.
With DELTAWINDOW=5 and a frame rate of 100 frames per second in HTK (Young et al., 2009) the center frequency of the ‘delta’
filter is approximately 7.5 Hz, while the attenuation at the Nyquist
frequency of 50 Hz is approximately 20 dB.
To obtain a better understanding of the effect of centering the
‘delta’ filter at different frequencies, we carried out an experiment
in which we combined a 16 Hz cut-off frequency LPF with a single BPF with a center frequency that varied between 2 Hz and 16 Hz.

Fig. 6. Relative recognition accuracy (Acc) improvement obtained by adding an additional bandpass filtered version of the envelope to the 16 Hz lowpass filtered one. The subplots show the results on three different SNR levels of noisy speech with four different noise types of test set A.

The recognition accuracies obtained with these configurations were compared to the accuracy obtained with a single LPF
with cut-off frequency 16 Hz. Fig. 6 shows the relative improvement for the four noise types for SNR levels of 20, 5, and −5 dB. A
comparison between the curves for the SNR levels shows that the
gain increases as the SNR level decreases: While the relative improvement is of the order of 20% to 25% in the SNR = 20 dB condition, the performance is improved by 50% up to 130% (noise type
dependent) in the SNR = −5 dB condition. Especially at SNR = −5 dB
the center frequency at which the recognition accuracy increases
most depends on the noise type. This confirms that a single ‘delta’
filter is not sufficient for making the EMS features robust against
different noise types.
3.2. Study 2: multi-resolution representations of modulation
frequencies
It is quite likely that humans pay selective attention to the
spectro-temporal input when understanding speech, and that selective attention becomes more important as the listening conditions grow more adverse. The gammatone filters allow for a sufficient degree of selectivity in the frequency domain. The subsequent modulation filterbank must provide the selectivity in the
modulation frequency domain. In combination with the sparse
coding approach for obtaining the posterior probabilities of the 179
states in the aurora-2 task, a multi-resolution representation, with
its attendant longer feature vectors, might enhance the probability that ‘correct’ clean speech exemplars in the dictionary have a
small Euclidean distance to noisy speech frames, because the energy of the noise is much smaller than the energy of the speech in
some regions of the EMS vectors. If this is indeed the case, a multiresolution representation should enhance the resulting recognition
accuracy.
In Section 3.1 it was concluded that modulation frequencies in
the band up to 16 Hz must be covered and that the largest gain
in performance relative to a configuration with a single LPF is obtained by emphasizing different modulation frequencies for different noise types and different SNR levels. Therefore, it can be expected that a configuration in which multiple BPFs separate the
modulations in different frequency bands would outperform a configuration that contains only a LPF and a single BPF.

Fig. 7. Word recognition accuracy obtained with feature vectors covering the modulation frequency range of 0–16 Hz. The modulation filterbank consisted of a single lowpass filter with variable cut-off frequency and a variable number of additional bandpass filters with center frequencies spaced 1 Hz apart to cover the interval beyond the LPF cut-off frequency. Results for clean (top) and noisy speech (bottom) are shown in separate panels to improve resolution. The shaded bands indicate the 95% confidence interval. The dashed line shows the trajectory of the peak position across SNR levels.

Auditory models do precisely this, by combining a LPF with a bank of BPFs. Such
a filterbank can be configured in two different ways: the BPFs can
cover the frequency range above the cut-off frequency of the LPF,
or the frequency ranges of the BPF and LPF may overlap, so that
the BPFs provide additional resolution in a band that is already
covered. Below, we compare these configurations. By doing so, we
address two questions: (1) In which modulation frequency range is a high resolution most beneficial for noisy speech recognition? (2) To what extent is it beneficial to represent modulation frequencies both in terms of static and dynamic features by choosing overlapping LPF and BPFs?
In the first experiment we employed a modulation filterbank
consisting of a LPF with a variable cut-off frequency (ranging from
1 to 16 Hz), augmented with a bank of BPFs with center frequencies (spaced 1 Hz apart) covering the range from 1 Hz above the
cut-off frequency of the LPF up to 16 Hz. Obviously, the total number of filters in the filterbank (M + 1), and therefore the total number of features in the EMS vectors (15 · (M + 1 )), will increase as
the cut-off frequency of the LPF decreases. The test is performed
on all the clean and noisified data in test set A of aurora-2.
The results of this experiment (averaged over four different noise
types) are summarized in Fig. 7.
The first observation that can be made from the figure is that
the configurations with the largest number of modulation BPFs
do not always yield the best recognition accuracy: the curves for
the highest SNR levels start with a (small) interval in which the
performance is increasing as the cut-off frequency of the LPF increases, corresponding to a decrease of the total number of filters. The cut-off frequency at which the maximum accuracy is obtained is clearly dependent on the SNR level. In the clean condition, the best performance is obtained when the LPF cut-off frequency is 5 Hz and the modulation frequency range of 6 − 16 Hz
is covered by M = 11 linearly spaced BPFs. At lower SNR levels, the
LPF cut-off frequency at which the maximum accuracy is obtained
shifts towards lower frequencies as illustrated by the dashed line
in Fig. 7b: it interpolates the LPF cut-off frequency at which the
best performance is obtained at different SNR levels. Moreover, the
steeper slopes in the curves corresponding to lower SNR levels indicate that increasing the LPF cut-off frequency, and as a result
decreasing the resolution in the lower modulation frequencies, is
more harmful in the presence of high noise energy. Apparently,
separating modulations in the very low frequency bands, which are
not very important for the intelligibility of clean speech, enhances
the capability of the sparse coding engine to match noisy speech
EMS vectors with ‘correct’ clean speech exemplars.
In the second experiment we combined 15 BPFs with center
frequencies linearly spaced between 1 Hz and 15 Hz with a LPF
the cut-off frequency of which was decreased from 15 Hz to 1 Hz.
With lower cut-off frequencies the amount of information about
the modulations that can be said to be represented twice (in the
BPFs and in the LPF) decreases, but all configurations cover the
modulation frequencies up to 16 Hz. Also, the total number of features (15 · 16) was identical in all configurations, because the number of filters was fixed.
It appeared that decreasing the cut-off frequency of the LPF
from 16 Hz to 1 Hz had no effect on the resulting recognition accuracy. The performance was independent of the cut-off frequency
and always equal to the accuracy corresponding to LPF cut-off frequency of 1 Hz in Fig. 7. From this experiment it can be concluded that the 1 Hz cut-off frequency of the LPF in the model of
Jørgensen and Dau (2014) is to be preferred over the 6 Hz cut-off
frequency in the model of Dau et al. (1997a), especially in low SNR
conditions. Apparently, a high resolution in the modulation filterbank is almost always beneficial. The only exception is formed by
the conditions with a very high SNR level, where a high resolution
in the very low modulation frequencies has a small negative effect.
3.3. Study 3: the auditory model revisited
Now that we know that a set of modulation BPFs that cover the
frequency range from 2 to 16 Hz, in combination with a LPF with a
cut-off frequency as low as 1 Hz, can yield promising recognition
accuracies, we can return to the question whether the ‘standard’
configuration in auditory models, i.e., Q = 1 BPFs spaced at one octave intervals, is the optimal configuration for ASR applications. To
address this question we carried out experiments in which the envelopes of the gammatone sub-bands are processed by a number
of different modulation filterbanks. The filterbanks consisted of a
fixed LPF with a cut-off frequency at 1 Hz and a variable number
of BPFs with quality factor Q = 1.
3.3.1. LPF at 1 Hz and BPFs with different distribution patterns
We first compare the recognition performance using filterbanks
with similar frequency coverage, but with different number of BPFs
and distribution patterns of center frequencies. The center frequencies of the BPFs were chosen in three different manners: linearly
spaced at 1 Hz distance, logarithmically spaced at 1/3rd octave and
at full octave distance. The number of BPFs is gradually increased,
adding modulation bands, until they cover the frequency range up
to 25 Hz.³ In Fig. 8a the recognition accuracies for clean speech
of each of these filterbanks are depicted as a function of the center frequency of the last BPF included (red: linear spacing; purple: octave spacing; green: 1/3rd octave spacing). Note that, as a
consequence of the different distribution patterns of the BPFs, the
number of BPFs used for covering the range up to a given modulation frequency was different (14 with linear spacing, 4 with octave
spacing, 12 with 1/3rd octave spacing, and 10 with the first two
filters in the 1/3rd octave spacing left out).
The first observation from this figure is that adding more
BPFs improves recognition accuracy, but a ceiling performance
is reached when the center frequency of the last-added filter is
16 Hz. The highest word recognition accuracy is obtained with the
linear spacing strategy and amounts to 96.13%, an improvement
of approximately 2.2% absolute compared to the best performance
obtained with a combination of a LPF with cut-off frequency 15 Hz
and a single BPF with center frequency 16 Hz (cf. Study 1).
The second observation from Fig. 8a is the consistent and statistically significant advantage of the linearly spaced filterbank (the
red curve) over the logarithmically spaced filters (the purple and
green curves). The difference in number of filters employed cannot
explain this observation: to cover the range up to a modulation
frequency of 10 Hz, the 1/3rd octave spacing and the linear spacing
require ten and nine filters respectively; still, the linearly spaced
filterbank outperforms the 1/3rd octave spacing. Also, despite the
different number of filters, the octave spaced and the 1/3rd octave
spaced filterbank have roughly equal performance. Therefore, the
most plausible explanation lies in the fact that different locations
of the center frequencies of a set of BPFs yield different effective
transfer functions. To illustrate this effect, we plotted the effective
transfer function for the linear, octave and the 1/3rd octave spaced
filterbanks with M = 10 in Fig. 8d. Clearly, the 1/3rd octave spaced
filterbank emphasizes the very low modulation frequencies much
more than the linearly spaced filterbank (peak at 3.0 Hz compared
to 6.25 Hz).⁴
From the perspective of sparse coding this means that information about modulations in a frequency range that exhibits non-negligible variance, but contains little information about the contents of speech signals, may have too strong an impact on the
Euclidean distance measure, giving rise to sub-optimal recognition performance. To test this hypothesis, we removed the first
two BPFs from the 1/3rd octave spaced filterbank (fc = 1.26 Hz
and fc = 1.58 Hz). As a result, the effective transfer function of
the modified filterbank does no longer over-emphasize the lowest
modulation frequencies compared to the linearly spaced filterbank
(cf. Fig. 8d). Consequently, as shown by the distance between the green and the light blue curves in Fig. 8a, the recognition accuracy for clean speech was always higher than the results with the corresponding full 1/3rd octave filterbanks. Note that the outputs of the BPFs with Fc = 1.26 Hz and Fc = 1.58 Hz do contain some useful information, since the performance levels at the two leftmost points of the red curve in Fig. 4d are higher than that at the leftmost point of the blue curve in Fig. 4d (using one static feature only). However,
in combination with more BPFs covering a larger range of modulation frequencies, a modulation frequency range that is sampled too
densely at the low end is harmful for recognition.
In this experiment we compared filterbanks with different numbers of filters. We also created filterbanks with the same number of filters as in the filterbank with linearly spaced BPFs, equally spaced on a logarithmic frequency axis. None of these configurations appeared to provide better performance than the linear spacing of the center frequencies.
3
We increased the modulation frequency range compared to previous experiments. This was done to verify that the logarithmically spaced BPFs (that exhibit
a wider spacing at high frequencies) also yielded ceiling performance above approximately 16 Hz.
4
The frequency response of the filterbank with octave spacing is not shown,
because the center frequency of a substantial number of 10 filters is beyond the
Nyquist frequency.
76
S. Ahmadi et al. / Speech Communication 84 (2016) 66–82
Fig. 8. Word recognition accuracy for (a) clean speech, (b) noisy speech SNR= 20 dB, (c) noisy speech SNR= −5 dB, as a function of the highest center frequency in the
bank of bandpass filters. The shaded areas represent the 95% confidence intervals. The center frequencies of the filters (FBP ) are spaced linearly at 1 Hz intervals (red), or
logarithmically at full octave intervals (purple) or at 1/3 octave intervals (green). The blue curve depicts the results obtained with the same 1/3 octave filterbank, without
the two filters with FBP < 2 Hz. Sub-figure (d) shows the effective transfer functions of the modulation filterbanks with LPF cut-off frequency 1 Hz and 10 (or 8 for the blue
curve) BPFs.
ber of filters as in the filterbank with linearly spaced BPFs, equally
spaced on a logarithmic frequency axis. None of these configurations appeared to provide better performance than the linear spacing of the center frequencies.
Fig. 8b and c show the results obtained with increasing numbers of differently spaced filters for the two extreme noise conditions, i.e., SNR = 20 dB and SNR = −5 dB. In the SNR = 20 dB condition the superiority of the linear spacing, with a (much) larger number of filters, is more apparent than in the clean speech condition. In the SNR = −5 dB condition the filterbanks with octave spacing, and therefore smaller numbers of filters, yield much lower accuracies than the configurations with higher numbers of filters. This suggests that, particularly in noisy conditions, the sampling of the modulation frequency domain needs to be sufficiently fine-grained for the ASR system to reap the maximum possible benefit from the multi-resolution representation. It can also be seen that the recognition accuracy obtained with linearly spaced filters starts decreasing when filters with center frequencies > 16 Hz are added. The modulations in these frequency bands are mainly associated with the noise. This confirms our earlier conclusion that it is counter-productive to dedicate a substantial proportion of the EMS features to modulation frequency bands that do not contain information relevant for speech recognition. From Fig. 8c it can also be seen that a larger number of BPFs is not always beneficial: the configuration with fewer 1/3rd octave spaced filters is clearly competitive.
Fig. 9. Word recognition accuracy for (a) clean speech and (b) noisy speech as a function of the number of logarithmically spaced filters. The black lines show the recognition accuracy averaged over all four noise types (the shaded areas represent the 95% confidence intervals). At SNR = −5 dB, the individual noise results are also plotted as scattered markers: red circle: N1 (Subway), blue hexagram: N2 (Babble), green diamond: N3 (Car), and cyan square: N4 (Exhibition). The red asterisk shows the recognition accuracy obtained with the best performing linear filterbank (1 LPF + 18 BPFs).
3.3.2. LPF at 1 Hz and varying number of BPFs logarithmically
positioned to approximate a given effective transfer function
In most of the previously described experiments there was
an interaction between the total range of modulation frequencies
covered, the number of filters in that range and the distribution
patterns of the center frequencies. The fact that leaving out the
lowest-frequency filters from the 1/3rd octave filterbank improved
the recognition accuracy suggests that the presence of irrelevant
features incurs the risk that the Euclidean distance in the sparse
coding process homes in on exemplars that fit these irrelevant
features, at the cost of the features that do matter. The shape of
the effective transfer function of the filterbanks and the frequency
at which the response is maximal indicate which modulation frequencies will be represented with many features and dominate
the Euclidean distance measure in the Lasso decoder. In Study 1
it was found that the effective transfer function can be used as a
criterion for comparing different filterbank configurations. Therefore, we conducted an experiment in which we used the effective
transfer function of the best-performing linearly spaced filterbank
(i.e. 19 filters: 1 LPF + 18 BPFs) as a target that we try to approximate by means of a variable number of logarithmically spaced
BPFs. In contrast to the previous experiments, however, we allowed
the center frequencies of the first and last filter in the filterbank to
vary. Imposing the additional condition that the resulting configurations would provide at least some resolution in the low modulation frequency range, without pushing the lowest center frequency
below 2 Hz and without pushing the highest one above 36 Hz (so
that the −3 dB point of the falling slope of the BPF does not exceed
the Nyquist frequency), we ended up with configurations with a
minimum number of 10 and a maximum number of 22 filters. The
recognition accuracy results obtained with these configurations are
shown in Fig. 9.
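As an illustration of this design step, the sketch below performs a coarse grid search over the number of log-spaced BPFs and their lowest and highest center frequencies, scoring each candidate by its deviation from a target effective transfer function. The analytic Q = 1 response, the assumed 2-19 Hz center frequencies of the linear target and the grid resolution are illustrative simplifications, not the procedure actually used in our experiments.

import numpy as np

# Hedged sketch: approximate a target effective transfer function with M
# logarithmically spaced BPFs whose end points are allowed to vary.
def bpf_response(f, fc, q=1.0):
    return 1.0 / np.sqrt(1.0 + q ** 2 * (f / fc - fc / f) ** 2)

def effective_transfer(f, centers, q=1.0):
    return np.sqrt(sum(bpf_response(f, fc, q) ** 2 for fc in centers))

f = np.linspace(0.5, 40.0, 400)                       # modulation frequency grid (Hz)
target = effective_transfer(f, np.arange(2.0, 20.0))  # 18 linearly spaced BPFs (assumed)

best = (np.inf, None)
for m in range(10, 23):                               # 10 to 22 BPFs, cf. the text
    for f_lo in np.arange(2.0, 4.01, 0.25):           # lowest center frequency >= 2 Hz
        for f_hi in np.arange(16.0, 36.01, 1.0):      # highest center frequency <= 36 Hz
            centers = np.geomspace(f_lo, f_hi, m)
            err = np.mean((effective_transfer(f, centers) - target) ** 2)
            if err < best[0]:
                best = (err, (m, f_lo, f_hi))
print('best approximation (M, f_lo, f_hi):', best[1])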
As can be seen from Fig. 9a, close to maximum performance
on clean speech can be achieved with any number of log-spaced
BPFs with M ≥ 15. The best performance is achieved with 18 BPFs; the center frequencies of the first and last BPF are 3.26 Hz and 20.24 Hz, respectively. Although the number of filters is equal to that of the target linear filterbank, the achieved recognition accuracy is even slightly (but significantly) higher than with the 18 linearly spaced BPFs (0.4% relative; the red asterisk that indicates the accuracy with linearly spaced filters lies just beyond the 95% confidence interval).
Fig. 9b shows the corresponding results for the noisy test utterances from set A at SNRs ranging from 20 dB down to −5 dB. For
the highest SNR conditions the accuracy does not improve substantially when the number of filters is increased from 10 to 18. For
the lowest three SNR levels increasing the number of BPFs does
improve accuracy. In all cases a larger number of filters results
in a higher resolution in the lowest modulation frequencies. For
SNR=−5 dB, using M = 21 BPFs (rather than M = 18) yielded a 5%
relative improvement. In this configuration the center frequencies
of the lowest and highest BPF were 2.25 Hz and 19.3 Hz. Fig. 9b
breaks out the recognition accuracies obtained with the four noise types in the SNR = −5 dB condition. Increasing the resolution of the modulation filterbank has the smallest effect for the babble noise.
This was to be expected, because it is unlikely that there are many
modulation frequency bands in which babble noise differs substantially from speech.
4. Comparison with other ASR systems and HSR
4.1. ASR
In this research we investigated how different configurations of the modulation filterbank affect recognition performance. To deepen our understanding of the strengths and weaknesses of the combination of EMS features and SC, we compared the performance on test sets A and B in aurora-2 with previously published recognition accuracies of three other systems: the ‘standard’ aurora-2 system trained with the multi-condition data (Hirsch and Pearce, 2000), the multi-condition aurora-2 system that includes the Wiener filter based ETSI advanced frontend (Hirsch and Pearce, 2006), and the SC-based system of Gemmeke et al. (2011). The first two systems use GMMs based on MFCC features to estimate state posterior probabilities, while the third one used Mel-frequency energy spectra as stacks of up to 30 frames and non-negative matrix factorization with the Kullback–Leibler divergence as the solver in the sparse coding engine.
Table 1
The word recognition accuracy (%) obtained using Lin18-EMS features on the aurora-2 test sets. (For explanation see text.)

SNR (dB)        Clean    20      15      10      5       0       −5      Average
Test A
Subway          94.14    94.84   94.38   88.89   86.98   81.98   66.38   86.80
Babble          93.62    93.44   92.93   91.90   87.07   73.52   42.05   82.16
Car             93.56    92.69   92.45   91.65   88.58   80.11   59.68   85.53
Exhibition      93.55    95.53   95.19   94.38   85.71   82.63   73.93   88.70
Average         93.72    94.12   93.74   91.70   87.24   79.56   60.51   85.79
Test B
Restaurant      94.14    89.39   91.56   90.97   84.86   69.11   36.94   79.56
Street          93.62    93.68   92.53   90.60   83.92   63.27   29.53   78.16
Airport         93.56    94.87   94.15   91.02   82.58   63.05   28.78   78.28
Train station   93.55    94.54   91.76   85.00   67.76   38.41   13.24   69.18
Average         93.72    93.10   92.50   89.40   79.78   58.46   27.12   76.29
Fig. 10. Word recognition accuracy per test set as a function of SNR for four different systems. 1- The proposed EMS features (Lin18-EMS). 2- Sparse classification results using Mel-spectra features (Gemmeke et al., 2011). 3- Aurora2 multi-condition recognizer applied to MFCC features (Hirsch and Pearce, 2000). 4- ETSI-AFE multi-condition recognizer applied to MFCC features (Hirsch and Pearce, 2006).
Since there is no configuration of the modulation filterbank that is optimal for all SNR levels and all noise types, we conducted the comparison with the modulation filterbank consisting of the 1 Hz cut-off frequency LPF and M = 18 linearly spaced BPFs (which we refer to as the Lin18-EMS system). The Lin18-EMS system is a good compromise between the highest-possible performance for clean speech and the conditions with the lowest SNR level. The detailed results obtained with the Lin18-EMS system are collected in Table 1.
In Fig. 10, the recognition accuracies of the Lin18-EMS system and the three competing systems are plotted. Fig. 10a shows the test results for the matched noise types in test set A. While the Lin18-EMS system outperforms both MFCC-based multi-condition recognizers at very low SNR levels, its performance at higher SNRs is substantially worse than that of the MFCC-based systems. The single-frame EMS features almost always outperform the 30-frame Mel features.
However, the results of the Lin18-EMS system on test set B, which pertains to the unseen noise type conditions (Fig. 10b), show that our system does not generalize well to unseen noise types, a characteristic that it shares with the other exemplar-based system. The superior performance of the 30-frame Mel features is most probably due to the fact that Gemmeke et al. (2011) included artificially constructed noise exemplars that accounted to some extent for the mismatch between the noise exemplars from test set A and the different noise types in test set B. Our EMS-based system did not include artificially constructed exemplars. In cleaner conditions (down to 10 dB) the EMS-based system performs roughly on par with the other exemplar-based system. In contrast to the behavior for test set A, however, the performance drop at SNRs < 10 dB is much steeper. Averaged over the four noise types of test set B, the recognition accuracy is approximately equal to that of the multi-condition trained GMM system without noise reduction.
A detailed analysis revealed that the performance of the Lin18-EMS system is in fact very similar to that of the system of Gemmeke et al. (2011), except for train station noise (cf. Table 1). In search of the cause of this deviant behavior, we found that omitting the standard deviation equalization step ((3) in Section 2.2) substantially improved recognition performance for utterances corrupted with train station noise at low SNR levels. This is illustrated by the dotted line in Fig. 10b, which shows the average performance on test set B (SNR = 5, 0, −5 dB) when excluding the standard deviation equalization for train station noise. Recall that the main purpose of the standard deviation equalization procedure was to equalize the contribution of all gammatone frequency bands. The equalization weight vector was designed (using the speech exemplars from the dictionary) such that the standard deviation of the coefficients in the EMS vector is on average equal in all 15 gammatone filters, without changing the relative magnitude of the coefficients pertaining to the modulation bands. It appeared that the equalization procedure works well for noisified speech, as long as the pattern of the 15 gammatone coefficients in the modulation bands does not change between bands with low and high modulation frequencies. As long as that is the case, applying a fixed equalization vector will not change the average modulation spectrum of the noises. However, there are two noise types that violate this assumption, viz. car noise in test set A and train station noise in test set B. The detrimental effect of the violations in car noise is limited, because this noise is represented in the dictionary by exemplars taken from the car noise signals. For the train station noise this is not the case. As a result, the match between the modulation spectra of the speech noisified by adding train station noise and the exemplars in the dictionary deteriorates as the SNR level decreases.
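For concreteness, the following sketch shows one way the standard deviation equalization described above could be implemented. The array layout (exemplars × 15 gammatone bands × M modulation bands) and the target value of the average standard deviation are assumptions made for illustration, not a description of our exact implementation.

import numpy as np

# Minimal sketch of the standard deviation equalization discussed above.
def equalization_weights(speech_exemplars):
    # per-coefficient standard deviation over all speech exemplars
    std = speech_exemplars.std(axis=0)            # (n_gammatone, n_modulation)
    # one weight per gammatone band: the inverse of its mean std over the
    # modulation bands, so that the relative magnitudes of the modulation
    # bands within a gammatone band are preserved
    return 1.0 / (std.mean(axis=1) + 1e-12)       # (n_gammatone,)

def apply_equalization(ems, weights):
    # the same weight is applied to every modulation band of a gammatone band
    return ems * weights[:, None]

# toy usage with random stand-in data (200 exemplars, 15 bands, 19 modulation filters)
rng = np.random.default_rng(0)
dictionary = rng.normal(scale=np.linspace(0.5, 3.0, 15)[None, :, None],
                        size=(200, 15, 19))
w = equalization_weights(dictionary)
equalized = apply_equalization(dictionary[0], w)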
4.2. Comparison with HSR
To evaluate the combination of EMS features and sparse coding in terms of human-like performance, we re-use the data about the recognition accuracy of ten human listeners on aurora-2 utterances reported in Meyer (2013). Meyer used three different criteria: the speech reception threshold (SRT), the effect of noise types, and the effect of string lengths. The SRT is the SNR at which listeners achieve 50% accuracy; usually it corresponds to the SNR at which the accuracy as a function of SNR has the largest negative slope. The SRT estimated for HSR in Meyer (2013) is around −10.2 dB, while for the aurora-2 system trained with the multi-condition data (Hirsch and Pearce, 2000) the SRT is −1.5 dB. From Fig. 10a, it can be inferred that the SRT of the EMS-based system is well below −5 dB; although it is dangerous to extrapolate the curves, it is reasonable to assume that the SRT for the two exemplar-based systems is close to the human SRT. As can be seen from Fig. 10b, which represents the noise mismatch case (test set B), our EMS-based system does not generalize well to unseen noise types. We will come back to this issue in Section 5.
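The SRT criterion used above can be illustrated with a few lines of code: the measured accuracy-vs-SNR points are interpolated at the 50% level. The linear interpolation is a simplification of the usual psychometric curve fit; the example numbers are the test set B averages from Table 1.

import numpy as np

# Illustration of the SRT: the SNR at which the accuracy-vs-SNR curve crosses 50%.
def estimate_srt(snrs_db, accuracies, threshold=50.0):
    snrs = np.asarray(snrs_db, dtype=float)
    acc = np.asarray(accuracies, dtype=float)
    order = np.argsort(acc)                     # np.interp needs increasing x values
    return float(np.interp(threshold, acc[order], snrs[order]))

# example input: test set B averages from Table 1
print(estimate_srt([-5, 0, 5, 10, 15, 20], [27.12, 58.46, 79.78, 89.40, 92.50, 93.10]))

With these numbers the interpolated value falls between −5 and 0 dB, whereas Fig. 10a suggests a considerably lower SRT for the matched noise types of test set A.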
According to Meyer (2013), the difficult noises for ASR and HSR are different. At SNR = 0 and −5 dB, the aurora-2 system trained with multi-condition data performs better on babble noise than on car noise, while HSR shows higher performance for car than for babble noise. From Table 1 it can be seen that our EMS+SC system shows the same trend as the human listeners: accuracy with babble noise is lower than with car noise. The same holds for the comparison of airport and train station noise, provided that we solve the equalization issue.
In the human data there is a small but clear drop in accuracy for the longest digit strings, which is probably due to memorization problems. Our EMS-based system does not show this effect. This was to be expected, because an automatic system is not affected by the need to memorize long strings. Our system also does not show the problems with one-digit utterances reported by Meyer (2013) for the ‘standard’ aurora-2 systems with multi-condition training. The raw EMS features that we used for speech-silence segmentation yield quite accurate results. Only in a very small proportion of the utterances did the endpoint estimates differ from the voice onset and offset determined by the forced alignment by more than 16 frames, the minimum number of frames needed to find (or hallucinate) a digit word.
In summary, it can be concluded that the operation of our EMS-plus-SC system for the estimation of sub-word probabilities mimics human speech recognition on a semantics-free task better than more conventional MFCC-plus-GMM systems.
5. General discussion
In this paper we investigated how different configurations of
the modulation filterbank in an auditory frontend affect the degree
to which an exemplar-based engine can provide accurate posterior
probability estimates of sub-word units when recognizing noise-corrupted speech. The auditory model proposed by Jørgensen and Dau (2014), which consists of an LPF with a cut-off frequency of
1 Hz and nine Q = 1 BPFs with center frequencies one octave
apart, served as the point of departure. For estimating the posterior probabilities of the sub-word units, we used sparse coding and
a large dictionary of semi-randomly selected exemplars. We found
that BPFs with center frequencies one octave apart do not provide
sufficient resolution of the modulation frequencies for automatic
(and maybe also for human) speech recognition. We conjecture
that a filterbank with octave spacing between the modulation filters is able to discover noise conditions that will certainly compromise intelligibility, but that this configuration may not accurately
predict specific confusions that would occur in tasks that require
participants to distinguish confusable sounds in the absence of semantic predictability.
From our experiments it appears that there is no unique configuration of the modulation filterbank that is optimal for all SNR
levels and all noise types. However, it is safe to conclude that a filterbank consisting of a LPF with cut-off frequency 1 Hz and about
M = 18 BPFs with center frequencies between 2 Hz and 20 Hz
will provide accuracies close to optimal for most conditions. Center frequencies of the BPFs with equal spacing on a linear or on
a logarithmic frequency axis yielded very similar results. In the SNR = −5 dB condition the best results were obtained with a configuration that comprised M = 21 logarithmically spaced BPFs, with the lowest BPF centered at 2.25 Hz. In all experiments we found that the lowest SNR levels benefited from a high resolution in the lowest modulation frequencies; however, for the highest SNR levels a very high resolution in the modulation frequency band < 6 Hz was somewhat detrimental.
The exemplar-based engine for estimating posterior probabilities of sub-word units was based on a Lasso solver in a sparse
coding procedure. In the Lin18-EMS system we used 17,148 speech
exemplars and 13,504 noise exemplars. These numbers are about
twice as high as the numbers of speech and noise exemplars used
in Gemmeke et al. (2011). The need for large numbers of exemplars in our system is probably related to the combination of features with positive and negative values and the Euclidean distance
measure. In Ahmadi et al. (2014) we found that in a sparse coding
framework it is advantageous to keep the phase information in the
output of the modulation BPFs. The same conclusion was reached
by Moritz et al. (2015). However, Baby and Van hamme (2015), who used EMS-like features for training DNNs, obtained good results when using only the magnitudes of the outputs of the modulation filters.
The fact that our EMS features have positive and negative feature values ruled out the use of sparse coding engines based
on Kullback–Leibler divergence (the preferred distance measure in
non-negative matrix factorization). It is well known that for many
features used in pattern recognition tasks the Euclidean distance
does not represent the conceptual distance (e.g., Choi et al., 2014).
The default solution is to transform the original features to a space
in which Euclidean distance does represent conceptual neighborhood. We counteracted some of the undesirable effects of the Euclidean distance by the equalization and normalization procedures
that we applied to the exemplars and the unknown observations.
Forcing all exemplars and unknown observations to unit length
makes the Euclidean distance equivalent to cosine distance (Choi
et al., 2014). In our equalization procedure the exact same weights
are used for the 15 gammatone bands in all M modulation bands.
As long as the pattern formed by the magnitude of the 15 numbers
in the M modulation bands does not differ substantially between
the modulation bands, using fixed weights is beneficial. However, if
the patterns become different in some modulation bands because
of the different characteristics of the noise, fixed weights can be
detrimental. This appeared to be the case with the train station
noise in test set B.
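The equivalence between Euclidean and cosine distance for length-normalized vectors mentioned above follows directly from expanding the squared norm; for $\|x\| = \|y\| = 1$,

\[ \|x - y\|^2 = \|x\|^2 + \|y\|^2 - 2\,x^{\top} y = 2\,(1 - \cos\theta), \]

so ranking exemplars by Euclidean distance is then identical to ranking them by cosine similarity.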
We preferred an exemplar-based approach over GMMs or neural networks (including DNNs) for estimating the posterior probabilities, because this approach appears to have closer connections
to emerging knowledge about cortical representations of audio signals (Mesgarani et al., 2014a; 2014b) and neural processing. Our
research was based on the assumption that some configurations
of the modulation filterbank would yield EMS vectors in which a
substantial proportion of the features is not affected by the background noise, because the expected values of these features are
different for the noise and the speech signals. If the proportion
of unaffected features is high enough, the sparse coding engine
should be able to match partly damaged EMS vectors to the correct exemplars. That assumption is reinforced by the superior performance of human listeners, especially in tasks where there is little or no help from semantics or world knowledge. The assumption
is also in line with widely accepted theories about human pattern
recognition, which claim that missing data will be reconstructed
(Grossberg and Kazerounian, 2011; Myers and Wallis, 2013; Wei
et al., 2012). In addition, exemplar-based approaches can handle
the very high-dimensional feature vectors produced by the most
elaborate versions of the auditory model.
To verify that it is the information in the EMS features, rather
than the operation of the sparse coding engine, that drives the performance and to verify that the findings about the design of EMS
features are not limited to an SC procedure for estimating posterior probabilities, we repeated many experiments with the KNeighborsClassifier in scikit-learn (Pedregosa et al., 2011). We always used
the exact same speech-plus-noise dictionaries to ‘train’ the kNN
classifier as were used with the SC engine. We saw the same trend
in the results as a function of modulation filterbank configuration
in all SNR conditions. For the higher SNR levels the absolute accuracies obtained with the kNN classifier were very close to what
we obtained with sparse coding. However, in the lowest SNR levels
the SC engine had a clear advantage.
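The kNN control experiment can be summarized in a few lines; the sketch below is a minimal stand-in in which the dictionary exemplars serve as the training set of a scikit-learn KNeighborsClassifier. The dimensionalities, the number of neighbors and the sub-word label inventory are illustrative assumptions, not the settings used in our experiments.

from sklearn.neighbors import KNeighborsClassifier
import numpy as np

# Minimal sketch of the kNN control experiment: the exemplars in the
# speech-plus-noise dictionary act as the 'training' set and each test EMS
# vector receives crude posterior estimates from its nearest exemplars.
rng = np.random.default_rng(0)
dictionary = rng.normal(size=(1000, 300))       # stand-in for EMS exemplars
state_labels = rng.integers(0, 180, size=1000)  # stand-in for sub-word (state) labels

knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn.fit(dictionary, state_labels)

test_ems = rng.normal(size=(10, 300))           # stand-in for test observations
posteriors = knn.predict_proba(test_ems)        # per-state posterior estimates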
We compared the performance of the Lin18-EMS system with the performance of three other systems on the same data set: Mel-spectra features + SC (Gemmeke et al., 2011), the MFCC aurora-2 multi-condition recognizer (Hirsch and Pearce, 2000), and the MFCC ETSI-AFE multi-condition recognizer (Hirsch and Pearce, 2006). In test set A the Lin18-EMS system outperformed the other systems in the lowest SNR conditions. However, the two GMM-based systems outperform the two exemplar-based systems by a wide margin in the high SNR conditions. The fact that both exemplar-based systems suffered in the same conditions, despite using very different features, shows that the problem is not caused by the EMS features. Also, the lower performance of the exemplar-based systems at the highest SNR levels is not due to the interference of the noise exemplars in the dictionary. In-depth analysis of the activations of the exemplars showed that the noise exemplars receive only very small activations in the highest SNR conditions. Decodings with and without the noise exemplars in the dictionary yielded essentially the same accuracy for clean speech and SNR = 20 dB. The exemplar-based systems mainly suffer from confusion errors. Moreover, we encountered the same problem with the kNN classifier. It is left to future research to understand what causes the confusions in exemplar-based systems at the highest SNR levels.
It has been shown that the performance of an ASR system
can be improved by fusing the posterior probabilities obtained
from an exemplar-based system and corresponding estimates from
GMM- or ANN-based systems (e.g. Geiger et al., 2013). In Sun et al.
(2014) it was shown that fusing the posterior probability estimates
of an exemplar-based and a GMM-based system can reduce the
word error rate for clean speech in aurora-2 to less than 0.5%.
However, it is unlikely that humans use a similar procedure to accomplish their superior recognition performance.
We also compared the performance of the Lin18-EMS system to the (admittedly few and incomplete) data about human recognition performance on the aurora-2 task. Using the criteria proposed in Meyer (2013) we found that the performance of our system is more similar to that of humans than some conventional ASR systems. The only discrepancy is that our system did not show the effect that human accuracy decreases with increasing string length. Our system shares this property with all ASR systems and computational models that do not simulate working memory problems.
In the remainder of this section, we will discuss possible ways
to repair some of the weaknesses of the proposed system. First of
all, the EMS features might be improved, for example by adding
the non-linear compression that is present in virtually all auditory
models, but that was left out in the model of Jørgensen and Dau
(2014), because compression was not necessary for the purpose of
predicting intelligibility. Including the static 10th power compression in the version of the model in Dau et al. (1996) did increase the recognition accuracy for clean speech, at the cost of a substantial decrease in the SNR = −5 dB condition (from 68% correct to 34% correct in test set A). We leave the implementation of the full dynamic compression to future research; we expect that it will show the same positive effect for clean speech without the strong negative effect for the lowest SNR conditions.
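The static compression experiment amounts to a single elementwise operation on the (non-negative) gammatone envelopes before the modulation filterbank is applied; a minimal sketch, with the exponent 0.1 corresponding to the 10th power compression mentioned above, is given below. The function name and the array layout are illustrative assumptions.

import numpy as np

# Sketch of static power-law compression of the gammatone envelopes.
def compress_envelopes(envelopes, exponent=0.1):
    # envelopes: non-negative array, e.g. (n_gammatone_bands, n_samples)
    return np.power(np.maximum(envelopes, 0.0), exponent)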
The EMS representation of (noisy) speech signals is reminiscent of the approaches advocated in multi-stream ASR architectures (Bourlard, 1999; Bourlard et al., 1996; Hermansky, 2013; Hermansky and Fousek, 2005; Okawa et al., 1998; Tibrewala and Hermansky, 1997). A representation in terms of multiple modulation frequency bands is likely to contain features that are not heavily affected by the noise. Instead of designing a procedure to optimally fuse the parallel streams at the feature, probability or output level, we investigated whether the undistorted features would dominate the distance measure between clean speech exemplars and noisy observations in the sparse coding engine. The recognition accuracy that we obtained on test set A of the aurora-2 task confirms the viability of this assumption, but the results also show that we are still far from human-like performance in terms of absolute accuracy. The conventional combination of static features, deltas and delta-deltas in ASR corresponds to an auditory model in which the LPF in the modulation filterbank has a cut-off frequency of about 50 Hz. In addition, there is one BPF with a center frequency of approximately 7 Hz and a quality factor Q = 1 and another BPF with a quality factor Q = 2. The fact that conventional ASR systems typically benefit from adding delta-delta coefficients raises the question of whether the Lin18-EMS system can be improved by adding Q = 2 BPFs with cut-off/center frequencies at strategically chosen positions. The results of Moritz et al. (2015) provide evidence in support of this assumption.
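The correspondence between delta coefficients and a modulation BPF can be made explicit by inspecting the frequency response of the regression filter that computes the deltas. The sketch below assumes a window of N = 4 frames on either side and a 100 Hz frame rate, which places the peak of the response near the 7 Hz value mentioned above; shorter windows move the peak to higher modulation frequencies. Both settings are illustrative assumptions, not values taken from this paper.

import numpy as np

# Frequency response of the standard regression-based delta filter,
# interpreted as a bandpass filter along the modulation (frame) axis.
frame_rate = 100.0                                  # frames per second (assumed)
N = 4                                               # regression half-window (assumed)
taps = np.arange(N, -N - 1, -1, dtype=float)        # [4, 3, ..., -3, -4]
taps /= 2.0 * np.sum(np.arange(1, N + 1) ** 2)      # regression normalization

freqs = np.linspace(0.0, frame_rate / 2, 501)
omega = 2 * np.pi * freqs / frame_rate
k = np.arange(-N, N + 1)
response = np.abs(np.array([np.sum(taps * np.exp(-1j * w * k)) for w in omega]))
print('peak modulation frequency: %.1f Hz' % freqs[np.argmax(response)])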
From recent developments in multi-stream ASR (e.g., Hermansky, 2013) it is clear that it is necessary to combine bottom-up fusion (whether at the level of features, probabilities or outputs) with some kind of knowledge about the best, possibly condition-dependent, way of selecting or combining features. The sparse coding procedure that we used for computing the posterior probabilities of the sub-word units does nothing of the kind. We can see at least two ways in which knowledge could be brought into play. First, it is possible to learn the distributions of individual features or groups of features (per gammatone or per modulation filter) in clean speech from the training material. During testing, the likelihood that (groups of) features fit the clean distribution can be estimated, and these estimates can be used as additional weights in computing the Euclidean distances in the Lasso solver (see the sketch after this paragraph).
Second, it is possible to improve the conversion of the exemplar
activations from the sparse coding procedure to posterior probabilities of sub-word units by involving some kind of learning. In
Ahmadi et al. (2014) we argued that we should not aim at the
optimal approximation of unknown observations as sparse sums
of exemplars; rather, we should aim for the optimal classification
of the unknown observations. Research is underway in which
we apply label-consistent discriminative dictionary learning to
replace the semi-random selection of exemplars by a procedure
that learns the exemplars that are optimal for reconstruction and
classification (e.g. Jiang et al., 2013).
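The first suggestion can be sketched as follows: per-feature statistics of clean speech are estimated from the training material and turned into reliability weights for the distance computation. The diagonal-Gaussian model and the mapping from log-likelihood to weight are illustrative choices, not a worked-out design.

import numpy as np

# Sketch of per-feature reliability weighting of the Euclidean distance.
def clean_feature_stats(clean_ems):                 # clean_ems: (n_frames, n_features)
    return clean_ems.mean(axis=0), clean_ems.var(axis=0) + 1e-8

def reliability_weights(observation, mean, var):
    # log-likelihood of each feature under a per-feature Gaussian clean model
    loglik = -0.5 * ((observation - mean) ** 2 / var + np.log(2 * np.pi * var))
    w = np.exp(loglik - loglik.max())               # map to (0, 1]
    return w / w.sum()                               # normalize so the weights sum to one

def weighted_sq_distance(observation, exemplar, weights):
    return float(np.sum(weights * (observation - exemplar) ** 2))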
The need for introducing some kind of learning in the procedure for computing posterior probabilities of sub-word units
is strengthened by recent observations of the representation of
speech signals in the auditory cortex (Mesgarani et al., 2014b;
Pasley et al., 2012). From intra-cranial recordings it can be inferred
that representations in auditory cortex still accurately reflect the
tonotopic representations formed in the peripheral auditory system. This suggests that speech recognition relies on higher-level
processes that operate on the tonotopic representations. These processes can only be successful on the basis of substantial amounts
of learning. The need for higher-level operations, including selective attention, is also discussed by Henry et al. (2013), based on the different behaviors of brain oscillations at frequencies < 5 Hz, which are associated with auditory processing, and oscillations at frequencies > 8 Hz, which are associated with higher-level cognitive processing (Luo and Poeppel, 2007).
6. Conclusion
In this paper we investigated to what extent a model of the
human auditory system that is capable of predicting speech intelligibility in adverse conditions also provides a promising starting
point for designing the frontend of a noise robust ASR system. The
long-term goal of the research is to design a computational model
that shows human-like recognition behavior in terms of performance level and the type of errors. We investigated which details
of the auditory model configuration are most important for maximizing the recognition performance of an exemplar-based system.
We found that a system that combines a frontend based on the
envelope modulation spectrum with a sparse coding engine for
computing posterior probabilities of sub-word units yields competitive performance as long as the modulation spectrum of the background noise is similar to the noise exemplars in the dictionary.
The modulation filterbank must cover the frequency range up to
about 20 Hz, but there is no configuration that is optimal for all noise types and all SNR levels. The lower the SNR, the more important a high resolution in modulation frequencies ≤ 6 Hz becomes.
Although the accuracy of our system is still below human performance, our system behaves more human-like than MFCC-GMM
based ASR systems.
The output of the lowpass filter in the proposed modulation
filterbank can be considered as the static features in a conventional ASR frontend, while the bandpass filter outputs can be considered as delta-features which are lowpass filtered with different cut-off frequencies. Using this insight, our results indicate that
not only our sparse coding based system, but in fact any classical ASR system, would benefit from a frontend in which the static
features, the delta coefficients and the delta-delta coefficients are
all represented in a multi-resolution fashion. The highly redundant EMS feature vectors have proven to be a promising starting
point for noise robust speech recognition. With a more sophisticated distance measure and a built-in ability to learn how to use
this high dimensional acoustic space to discriminate different subword units in different acoustic conditions, an interesting research
area opens up where ASR can interface with auditory and brain
research.
Acknowledgments
This project has received funding from the European Union’s Seventh Framework Programme for research, technological development and demonstration under grant agreement no. FP7-PEOPLE-2011-290000. We express our gratitude towards Torsten Dau; the discussions with him were most helpful during the design phase of the experiments, and his contributions in interpreting the results are greatly appreciated. We are also grateful to Tobias May for his advice during the experiments and for providing part of the software that was used in feature extraction.
References
Ahmadi, S., Ahadi, S.M., Cranen, B., Boves, L., 2014. Sparse coding of the modulation spectrum for noise-robust automatic speech recognition. EURASIP J. Audio
Speech Music Process. 2014 (1), 1–20.
Baby, D., Van hamme, H., 2015. Investigating modulation spectrogram features for
deep neural network-based automatic speech recognition. In: Proceedings INTERSPEECH. Dresden, Germany, pp. 2479–2483.
Bacon, S.P., Viemeister, N.F., 1985. Temporal modulation transfer functions in
normal-hearing and hearing-impaired listeners. Int. J. Audiol. 24 (2), 117–
134.
Bourlard, H., 1999. Non-stationary multi-channel (multi-stream) processing towards
robust and adaptive asr. In: Proceedings ESCA Workshop Robust Methods
Speech Recognition in Adverse Conditions, pp. 1–10.
Bourlard, H., Dupont, S., Hermansky, H., Morgan, N., 1996. Towards subband-based
speech recognition. In: Proceedings of EUSIPCO, pp. 1579–1582.
Buesing, L., Bill, J., Nessler, B., Maass, W., 2011. Neural dynamics as sampling: a
model for stochastic computation in recurrent networks of spiking neurons.
PLoS Comput. Biol. 7 (12).
Chi, T., Ru, P., Shamma, S.A., 2005. Multiresolution spectrotemporal analysis of complex sounds. J. Acoust. Soc. Am. 118 (2), 887–906.
Choi, J., Cho, H., Kwac, J., Davis, L., 2014. Toward sparse coding on cosine
distance. In: 22nd International Conference on Pattern Recognition (ICPR),
pp. 4423–4428.
Cooke, M., 2006. A glimpsing model of speech perception in noise. J. Acoust. Soc.
Am. 119 (3), 1562–1573.
Cutler, A., 2012. Native Listening: Language Experience and the Recognition of Spoken Words. MIT Press.
Dau, T., Kollmeier, B., Kohlrausch, A., 1997a. Modeling auditory processing of amplitude modulation. i. detection and masking with narrow-band carriers. J. Acoust.
Soc. Am. 102 (5), 2892–2905.
Dau, T., Kollmeier, B., Kohlrausch, A., 1997b. Modeling auditory processing of amplitude modulation. ii. spectral and temporal integration. J. Acoust. Soc. Am. 102
(5), 2906–2919.
Dau, T., Püschel, D., Kohlrausch, A., 1996. A quantitative model of the “effective”
signal processing in the auditory system. i. model structure. J. Acoust. Soc. Am.
99 (6), 3615–3622.
De Wachter, M., Matton, M., Demuynck, K., Wambacq, P., Cools, R., Van Compernolle, D., 2007. Template-based continuous speech recognition. IEEE Trans. Audio Speech Lang. Process. 15 (4), 1377–1390.
Demuynck, K., Garcia, O., Van Compernolle, D., 2004. Synthesizing speech from
speech recognition parameters. In: Proceedings of Interspeech, 2, pp. 945–948.
Jeju Island, Korea.
Drullman, R., Festen, J.M., Plomp, R., 1994. Effect of temporal envelope smearing on
speech reception. J. Acoust. Soc. Am. 95, 1053–1064.
Efron, B., Hastie, T., Johnstone, I., Tibshirani, R., et al., 2004. Least angle regression.
Ann. Stat. 32 (2), 407–499.
Elhilali, M., Chi, T., Shamma, S.A., 2003. A spectro-temporal modulation index (stmi)
for assessment of speech intelligibility. Speech Commun. 41 (23), 331–348.
Ewert, S.D., Dau, T., 2000. Characterizing frequency selectivity for envelope fluctuations. J. Acoust. Soc. Am. 108 (3), 1181–1196.
Fletcher, H., 1940. Auditory patterns. Rev. Mod. Phys. 12 (1), 47.
Fletcher, H., 1953. Speech and Hearing in Communication. Krieger, New York.
Geiger, J., Weninger, F., Hurmalainen, A., Gemmeke, J., Wöllmer, M., Schuller, B.,
Rigoll, G., Virtanen, T., 2013. The TUM+ TUT+ KUL approach to the 2nd CHiME
challenge: multi-stream ASR exploiting BLSTM networks and sparse NMF. In:
Proceedings of CHiME, pp. 25–30.
Gemmeke, J.F., Virtanen, T., Hurmalainen, A., 2011. Exemplar-based sparse representations for noise robust automatic speech recognition. IEEE Trans. Audio Speech
Lang. Process. 19 (7), 2067–2080.
Goldinger, S., 1998. Echoes of echoes? an episodic theory of lexical access. Psychol.
Rev. 105 (2), 251–279.
Grossberg, S., Kazerounian, S., 2011. Laminar cortical dynamics of conscious speech
perception: neural model of phonemic restoration using subsequent context in
noise. J. Acoust. Soc. Am. 130 (1), 440–460.
Henry, M.J., Herrmann, B., Obleser, J., 2013. Selective attention to temporal features
on nested time scales. Cereb. Cortex.
Hermansky, H., 1997. The modulation spectrum in the automatic recognition of
speech. In: Proceedings IEEE Workshop on Automatic Speech Recognition and
Understanding. Santa Barbara, pp. 140–147.
Hermansky, H., 2011. Speech recognition from spectral dynamics. Sadhana 36 (5),
729–744.
Hermansky, H., 2013. Multistream recognition of speech: dealing with unknown unknowns. Proc. IEEE 101 (5), 1076–1088.
Hermansky, H., Fousek, P., 2005. Multi-resolution rasta filtering for TANDEM-based
ASR. In: Proc. Int. Conf. Spoken Lang. Process., pp. 361–364.
Hirsch, H., Pearce, D., 2006. Applying the advanced ETSI frontend to the Aurora-2
task. Tech. Report version 1.1. http://dnt.kr.hsnr.de/aurora/download/Aurora2_
afe_v1_1.pdf
Hirsch, H.G., Pearce, D., 2000. The AURORA experimental framework for the performance evaluation of speech recognition systems under noisy conditions. In: Proceedings ISCA Workshop ASR2000, Automatic Speech Recognition: Challenges for the Next Millennium. Paris, France, pp. 29–32.
Holmes, J., Holmes, W., 2001. Speech Synthesis and Recognition, 2 edition Taylor
and Francis, London and New York.
Houtgast, T., 1989. Frequency selectivity in amplitude-modulation detection. J.
Acoust. Soc. Am. 85 (4), 1676–1680.
Houtgast, T., Steeneken, H.J.M., 1985. A review of the mtf concept in room acoustics
and its use for estimating speech intelligibility in auditoria. J. Acoust. Soc. Am.
77, 1069–1077.
Huang, X., Acero, A., Hon, H.-W., 2001. Spoken Language Processing. Prentice Hall,
Upper Saddle River, NJ.
Jiang, Z., Lin, Z., Davis, L.S., 2013. Label consistent K-SVD: Learning a discriminative dictionary for recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35 (11),
2651–2664.
Jørgensen, S., Dau, T., 2011. Predicting speech intelligibility based on the signal–
to-noise envelope power ratio after modulation-frequency selective processing.
J. Acoust. Soc. Am. 130 (3), 1475–1487.
Jørgensen, S., Dau, T., 2014. Modeling speech intelligibility based on the signal–
to-noise envelope power ratio. Technical University of Denmark, Department of
Electrical Engineering Ph.D. thesis. PhD-afhandling.
Jørgensen, S., Ewert, S.D., Dau, T., 2013. A multi-resolution envelope-power based
model for speech intelligibility. J. Acoust. Soc. Am. 134 (1), 436–446.
Kanedera, N., Arai, T., Hermansky, H., Pavel, M., 1999. On the relative importance of
various components of the modulation spectrum for automatic speech recognition. Speech Commun. 28 (1), 43–55.
Kanedera, N., Hermansky, H., Arai, T., 1998. On properties of modulation spectrum
for robust automatic speech recognition. In: Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, 2, pp. 613–616.
Kay, R., Matthews, D., 1972. On the existence in human auditory pathways of channels electively tuned to the modulation present in frequency-modulated tones.
J. Physiol. 225 (3), 657–677.
Kim, C., Stern, R.M., 2009. Feature extraction for robust speech recognition using a
power-law nonlinearity and power-bias subtraction. In: INTERSPEECH. Brighton,
UK, pp. 28–31.
Kolossa, D., Haeb-Umbach, R. (Eds.), 2011, Robust Speech Recognition of Uncertain
or Missing Data — Theory and Applications. Springer.
Lee, D., Seung, H., 1999. Learning the parts of objects by non-negative matrix factorization. Nature 401 (6755), 788–791.
Lippmann, R., 1996. Speech recognition by humans and machines: miles to go before we sleep. Speech Commun. 18 (3), 247–248.
Luo, H., Poeppel, D., 2007. Phase patterns of neuronal responses reliably discriminate speech in human auditory cortex. Neuron 54 (6), 1001–1010.
Macho, D., Mauuary, L., Noé, B., Cheng, Y.M., Ealey, D., Jouvet, D., Kelleher, H.,
Pearce, D., Saadoun, F., 2002. Evaluation of a noise-robust DSR front-end on aurora databases. In: Proceedings INTERSPEECH. Denver, Colorado, USA, pp. 17–20.
Mesgarani, N., Cheung, C., Johnson, K., Chang, E.F., 2014a. Phonetic feature encoding
in human superior temporal gyrus. Science 343 (6174), 1006–1010.
Mesgarani, N., David, S.V., Fritz, J.B., Shamma, S.A., 2014b. Mechanisms of noise robust representation of speech in primary auditory cortex. Proc. Natl. Acad. Sci.
111 (18), 6792–6797. URL http://www.pnas.org/content/111/18/6792.abstract.
Meyer, B.T., 2013. What’s the difference? Comparing humans and machines on the
aurora-2 speech recognition task.. In: INTERSPEECH, pp. 2634–2638.
Meyer, B.T., Brand, T., Kollmeier, B., 2011. Effect of speech-intrinsic variations on human and automatic recognition of spoken phonemes. J. Acoust. Soc. Am. 129
(1), 388–403.
Misra, H., 2006. Multi-stream processing for noise robust speech recognition. École
Polytechnique Fédérale de Lausanne, Lausanne, Switzerland Ph.D. thesis. IDIAP-RR 2006 28.
Moore, B.C.J., 2008. Basic auditory processes involved in the analysis of speech
sounds. Philos. Trans. R. Soc. London 363, 947–963.
Moritz, N., Anemüller, J., Kollmeier, B., 2015. An auditory inspired amplitude modulation filter bank for robust feature extraction in automatic speech recognition.
IEEE/ACM Trans. Audio Speech Lang. Process. 23 (11), 1926–1937.
Myers, N., Wallis, G., 2013. Constraining theories of working memory with biophysical modelling. J. Neurosci. 33 (2), 385–386.
Ness, S.R., Walters, T., Lyon, R.F., 2012. Auditory sparse coding. In: Li, T., Ogihara, M., Tzanetakis, G. (Eds.), Music Data Mining. CRC Press, Boca Raton, FL 33487-2742.
Okawa, S., Bocchieri, E., Potamianos, A., 1998. Multi-band speech recognition in
noisy environments. In: Proceedings Int. Conf. Acoust. Speech Signal Process.,
pp. 641–644.
Olshausen, B.A., Field, D.J., 2004. Sparse coding of sensory inputs. Curr. Opin. Neurobiol. 14 (4), 481–487.
Pasley, B.N., David, S.V., Mesgarani, N., Flinker, A., Shamma, S.A., Crone, N.E.,
Knight, R.T., Chang, E.F., et al., 2012. Reconstructing speech from human auditory cortex. PLoS Biol. 10 (1), 175.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E., 2011. Scikit-learn: machine
learning in Python. J. Mach. Learn. Res. 12, 2825–2830.
Rabiner, L., Juang, B.-H., 1993. Fundamentals of Speech Recognition. Prentice-Hall,
Inc., Upper Saddle River, NJ, USA.
Schädler, M.R., Meyer, B.T., Kollmeier, B., 2012. Spectro-temporal modulation subspace-spanning filter bank features for robust automatic speech recognition. J.
Acoust. Soc. Am. 131 (5), 4134–4151.
Sroka, J.J., Braida, L.D., 2005. Human and machine consonant recognition. Speech
Commun. 45 (4), 401–423.
Sun, Y., Gemmeke, J.F., Cranen, B., ten Bosch, L., Boves, L., 2014. Fusion of parametric and non-parametric approaches to noise-robust ASR. Speech Commun. 56,
49–62.
ten Bosch, L., Boves, L., Ernestus, M., 2013. Towards an end-to-end computational
model of speech comprehension: simulating a lexical decision task. In: Proceedings of Interspeech. Lyon, France.
ten Bosch, L., Boves, L., Tucker, B., Ernestus, M., 2015. DIANA: towards computational
modeling reaction times in lexical decision in North American English. In: Proceedings of Interspeech. Dresden, Germany.
Tibrewala, S., Hermansky, H., 1997. Multi-stream approach in acoustic modeling. In:
Proceedings DARPA Large Vocabulary Cont. Speech Recognit. Hub 5 Workshop,
pp. 1255–1258.
Wei, Z., Wang, X.-J., Wang, D.-H., 2012. From distributed resources to limited slots in
multiple-item working memory: a spiking network model with normalization.
J. Neurosci. 32 (33), 11228–11240.
Young, S., Evermann, G., Gales, M., Hain, T., Kershaw, D., Liu, X.A., Moore, G., Odell, J.,
Ollason, D., Povey, D., Valtchev, V., Woodland, P., 2009. The HTK Book (for HTK
version 3.4). Technical Report. Cambridge University Engineering Department,
Cambridge, UK.
Zwicker, E., Flottorp, G., Stevens, S.S., 1957. Critical band width in loudness summation. J. Acoust. Soc. Am. 29 (5), 548–557.