Hynek Hermansky


    ... timer derivative estimation (such as the 10-frame interval applied in language identification system ... between the performances of both systems for models trained with 1 conversation side, which ... In a cheating experiment, the broad-phonetic categories are obtained by a canonical ...
    Features for automatic speech recognition (ASR) are typically sampled at about 100 Hz (10 ms analysis step). Recent experiments indicate that the most efficient components of the modulation spectrum of speech for ASR are up to about 16 Hz [1]. Consequently, RASTA processing ...
    TRAP-based ASR attempts to extract information from rather long (as long as 1 s) and narrow (one critical-band) patches (temporal patterns) from the time-frequency plane. We investigate the effect of combining temporal patterns of logarithmic critical-band energies from several ...
    A new technique is presented which improves the subjective quality of band-limited speech. The approach is based on a linear model of speech production, in which we independently estimate the spectral envelope and excitation function for a broad-bandwidth speech signal to reconstruct missing frequency components in narrow-bandwidth speech.
    The work examines Karhunen-Loeve Transform and Linear Discriminant Analysis as means for designing optimized spectral bases for the projection of the critical-band auditory-like spectrum. 1. INTRODUCTION 1.1. The state-of-art Typical large vocabulary automatic recognition of speech (ASR) consists of three main components: feature extraction, pattern classification, and language modeling. The feature extraction attempts to reduce the information rate of raw speech data by alleviating...
    We provide an analysis of the relative importance of components of the modulation spectrum for speaker verification. The aim is to remove less relevant components and reduce system sensitivity to acoustic disturbances while improving verification accuracy. Spectral components between 0.1 Hz and 10 Hz are found to contain the most useful speaker information. We discuss this result in the context of RASTA processing and cepstral mean subtraction. When
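The 0.1 Hz to 10 Hz finding can be illustrated with a minimal sketch that keeps only that modulation band of a single log-energy trajectory. This is not the filter used in the paper: the brick-wall DFT mask, the 100 Hz frame rate, the function name `band_limit_modulations`, and the toy signal are all assumptions made for illustration.

```python
import cmath
import math


def band_limit_modulations(trajectory, low_hz=0.1, high_hz=10.0, frame_rate_hz=100.0):
    """Zero all modulation components outside [low_hz, high_hz] using a naive DFT.

    Hypothetical helper: a brick-wall mask, assumed here only for illustration.
    """
    n = len(trajectory)
    # Forward DFT of the trajectory (O(n^2), fine for short examples).
    spectrum = [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(trajectory))
        for k in range(n)
    ]
    for k in range(n):
        # Modulation frequency of bin k; bins above n/2 mirror negative frequencies.
        freq_hz = min(k, n - k) * frame_rate_hz / n
        if not (low_hz <= freq_hz <= high_hz):
            spectrum[k] = 0.0
    # Inverse DFT; the mask is symmetric, so the imaginary part is rounding noise.
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


# Toy trajectory (assumed): constant offset + 4 Hz + 30 Hz modulation, 2 s at 100 Hz.
n_frames = 200
slow = [math.sin(2 * math.pi * 4.0 * t / 100.0) for t in range(n_frames)]
trajectory = [
    5.0 + s + 0.5 * math.sin(2 * math.pi * 30.0 * t / 100.0)
    for t, s in enumerate(slow)
]
filtered = band_limit_modulations(trajectory)
```

In this toy case the constant offset (a stand-in for a fixed channel term in the log domain) and the 30 Hz component are removed, while the 4 Hz component passes unchanged.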
    This paper discusses the relevance of non-uniform frequency resolution used by current speech analysis methods like Mel frequency analysis and perceptual linear predictive (PLP) analysis. It is shown that linear discriminant analysis of short-time Fourier spectrum of speech yields spectral basis functions which provide comparatively lower resolution to the high frequency region of spectrum. This is consistent with critical-band resolution and is shown to be caused by the spectral properties of vowel sounds. Further, we show that this non-uniform resolution can be traced to the physiology of speech production mechanism. In ASR experiments, features extracted by the discriminant functions are shown to outperform the conventional features derived by cosine basis functions.
    The work proposes a radically different set of features for ASR where TempoRAl Patterns of spectral energies are used in place of the conventional spectral patterns. The approach has several inherent advantages, among them robustness to stationary or slowly varying disturbances. 1. INTRODUCTION 1.1. Spectral features In 1665 Isaac Newton made the following observation: "The filling of a very deepe flaggon with a constant streame of beere or water
    The means of the long temporal trajectories of logarithmic critical band energies in the vicinity of individual phonemes show distinct patterns (TRAPs, Fig. 1) in each critical band for different phonemes. These temporal patterns were successfully used in Automatic Speech ...
    The paper reviews the OGI submission for the NIST 2002 speaker recognition evaluation. It describes the systems submitted for one- and two-speaker detection tasks and the post-evaluation improvements. In the one-speaker detection system, we present a new design of a data-driven temporal filter. We show that using a few broad phonetic categories improves the performance of the speaker recognition system. In post-evaluation experiments, we show that combinations with complementary features and modeling techniques significantly improve the performance of the GMM-based system. In the two-speaker detection system, we present a structured approach to detecting the speaker in conversations.
    ... EFFECT OF THE COMMUNICATION CHANNEL IN AUDITORY-LIKE ANALYSIS OF SPEECH (RASTA-PLP) Hynek Hermansky, Nelson Morgan, Aruna Bayya, Phil Kohn; US ... It can be shown that if the derivative of step (2) is estimated by a simple first difference, and if the ...
    Deviating from the conventional Hidden Markov Model-Multi-Layer Perceptron (HMM-MLP) hybrid paradigm of using MLP for classification, the proposed discriminative MLP technique uses MLP as a mapping module for feature extraction for conventional HMM-based systems. ...
    This paper examines sources of variability in the speech signal using a new technique that is based on a nested spectral analysis of variance (SANOVA). By constructing an ANOVA in the modulation spectral domain, the technique allows a characterization of unwanted variability in the time sequences of logarithmic energy caused by extraneous sources of variability such as additive noise, convolutional noise, and telephone handset transducer. Very low and moderate to high modulation frequencies are shown to be particularly affected by these sources. Verification results for 500 speakers on Switchboard data from the 1998 NIST speaker recognition evaluation are presented to confirm the conclusions. It is shown that bandpass filtering and downsampling of the time sequences of logarithmic energy, compared to conventional highpass filtering, leads to a 13% relative reduction of the EER in mismatched conditions.
    ... Sachin Kajarekar (1), Narendranath Malayath (1) and Hynek Hermansky (1,2); (1) Oregon Graduate Institute of Science and Technology, Portland, Oregon, USA. ... variability in the speech signal can be attributed to the following sources: (a) Phonetic content, (b) Speaker and Channel, and (c) ...
    ABSTRACT Local frequency and time averaging and differentiating operators, using three neighboring points of the critical-band time-frequency plane, are used to process the plane prior to its use in TRAP-based ASR. In that way, five alternative TRAP-based ASR systems (the original one and the time/frequency integrated/differentiated ones) are created. We show that the frequency differentiating operator improves performance of the TRAP-based ASR. 1. Introduction Unlike features which are based on the full short-term spectrum with its short time context, temporal pattern (TRAP) features are based on a narrow-band spectrum with long time context. By breaking the spectrum into individual critical bands and using each critical band independently in the initial stage of the feature extraction, the TRAP-based features can be inherently less sensitive to changes in relative levels of the individual critical bands. Further, by using longer temporal context, all information about underlying linguistic events, which is spread in time due to coarticulation, may be utilized. Initially, single time trajectories of critical band spectral densities in each critical band were used as input vectors in the frequency-localized TRAP probability estimators (3). Thus, the burden of exploiting the useful information in the temporal pattern and alleviating the irrelevant one was fully left on the estimator. Later, attempts at parametrizing the trajectory vectors were made and the critical band spectral density vectors were projected on bases obtained by Principal Component Analysis (PCA) (6) or Linear Discriminant Analysis (LDA) (5), with the resulting reduction of the size of the input vector to the frequency-localized probability estimator. Recent studies indicate that information extracted from several (up to three) neighboring bands improves performance of the TRAP system (7). Since these studies use PCA of the input vector space, it is possible to investigate the resulting projection basis. Such an inspection reveals that the PCA rotation resembles frequency averaging and frequency differentiating of the neighboring bands with the subsequent projection on cosine transform bases. This observation suggests that a simple pre-processing of a critical-band spectrogram (CRBS) prior to the cosine transformation and the TRAP classification may be beneficial. The current work investigates such modifications of CRBS in the TRAP system and evaluates their individual efficiency as well as their effect in conjunction with the original (i.e. unprocessed) CRBS.
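The five critical-band spectrogram variants described above (the original plus time/frequency averaged and differentiated versions) can be sketched as follows. The abstract only says the operators use three neighboring points; the kernel weights, the edge replication, and the names `apply_three_point`, `AVERAGE`, and `DIFFERENTIATE` are assumptions for illustration, not the paper's exact operators.

```python
def apply_three_point(plane, kernel, axis):
    """Apply a three-point kernel along time or frequency.

    plane: list of critical bands, each a list of frame values.
    axis: "time" (neighboring frames) or "freq" (neighboring bands).
    Edge values are replicated, which is an assumption of this sketch.
    """
    n_bands, n_frames = len(plane), len(plane[0])

    def at(band, frame):
        # Clamp indices so the border points reuse the nearest valid value.
        band = min(max(band, 0), n_bands - 1)
        frame = min(max(frame, 0), n_frames - 1)
        return plane[band][frame]

    out = []
    for b in range(n_bands):
        row = []
        for t in range(n_frames):
            if axis == "time":
                neighbors = (at(b, t - 1), at(b, t), at(b, t + 1))
            else:
                neighbors = (at(b - 1, t), at(b, t), at(b + 1, t))
            row.append(sum(w * v for w, v in zip(kernel, neighbors)))
        out.append(row)
    return out


# Assumed kernel weights; the source does not give them.
AVERAGE = (1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0)
DIFFERENTIATE = (-0.5, 0.0, 0.5)

# Toy critical-band spectrogram: 3 bands x 4 frames of log energies.
plane = [
    [0.0, 1.0, 2.0, 3.0],
    [10.0, 11.0, 12.0, 13.0],
    [20.0, 21.0, 22.0, 23.0],
]

variants = {
    "original": plane,
    "time_averaged": apply_three_point(plane, AVERAGE, "time"),
    "time_differentiated": apply_three_point(plane, DIFFERENTIATE, "time"),
    "freq_averaged": apply_three_point(plane, AVERAGE, "freq"),
    "freq_differentiated": apply_three_point(plane, DIFFERENTIATE, "freq"),
}
```

Each variant would then feed its own TRAP front end; that stage is not sketched here because the abstract does not specify it beyond the cosine projection.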
    Band-independent categories are investigated for feature estimation in ASR. These categories represent distinct speech-events manifested in frequency-localized temporal patterns of the speech signal. A universal, single estimator is proposed for estimating speech-event ...
    We have developed a statistical model of speech that incorporates certain temporal properties of human speech perception. The primary goal of this work is to avoid a number of current constraining assumptions for statistical speech recognition systems, particularly the model of speech as a sequence of stationary segments consisting of uncorrelated acoustic vectors. A focus on perceptual models may in principle allow for statistical modeling of speech components that are more relevant for discrimination between candidate utterances during speech recognition. In particular, we hope to develop systems that have some of the robust properties of human audition for speech collected under adverse conditions. The outline of this new research direction is given here, along with some preliminary theoretical work.
    In this paper we propose a method for enhancement of speech in the presence of additive noise. The objective is to selectively enhance the high SNR regions in the noisy speech in the temporal and spectral domains, without causing significant distortion in the resulting enhanced speech. This is proposed to be done at three different levels: (a) At the gross level, by identifying the regions of speech and noise in the temporal domain, (b) At the finer level, by identifying the regions of high and low SNR portions in the noisy speech, and (c) At the short-time spectrum level, by enhancing the spectral peaks over spectral valleys. Processing of noisy speech for enhancement involves mostly weighting the LP residual samples. The weighted residual samples are used to excite the time-varying LP filter to produce enhanced speech.
    The choice of units, sub-word or whole-word, is generally based on the size of the vocabulary and the amount of training data. In this work, we have introduced new constraints on the units: 1) they should contain sufficient statistics of the features and 2) they should contain sufficient statistics of the vocabulary. This led to minimization of two cost functions, the first based on the confusion between the features and the units and the second based on the confusion between the units and the words. We minimized the first cost function by forming broad phone classes that were less confusing among themselves than the phones. The second cost function was minimized by coding the word-specific phone sequences. On the continuous digit recognition task, the broad classes performed worse than the phones. The word-specific phone sequences however significantly improved the performance over both the phones and the whole-word units. In this paper we discuss the new constraints, our specific implementation of the cost functions, and the corresponding recognition performance.
    To overcome the problems related with the long impulse responses produced by reverberation, we use a long time window (high frequency resolution) analysis during the channel normalization steps of the feature extraction process in automatic speech recognition (ASR). After normalization, a trade between frequency and time resolution is used to increase the rate at which the time information is sampled (short-time domain), yielding an appropriate domain to derive
    ABSTRACT In this paper we use mutual information to study the distribution in time and frequency of information relevant for phonetic classification. A large database of hand-labeled fluent speech is used to (a) compute the mutual information between phoneme labels and a point of ...
    In this paper, we investigate the use of TemPoRal PatternS (TRAPS) classifiers for estimating manner of articulation features on the small-vocabulary Aurora-2002 database. By combining a stream of TRAPS-estimated manner features with a stream of noise-robust MFCC ...
    ... The dissertation "Temporal Processing of Speech in a Time-Feature Space" by Carlos Avendaño has been examined and approved by the following Examination Committee: Hynek Hermansky Associate Professor Thesis Research Adviser Misha Pavel Associate Professor ...
    Feature extraction plays a major role in any form of pattern recognition. Current feature extraction methods used for automatic speech recognition (ASR) and speaker verification rely mainly on properties of speech production (modeled by all-pole filters) and perception (critical band integration simulated by Mel/Bark filter bank). We propose stochastic methods to design feature extraction methods which are trained to alleviate the unwanted variability present in speech signal. In this dissertation we show that such data-driven methods provide significant advantages over the conventional methods for feature extraction. In the first part of the dissertation discriminant methods are introduced for extracting spectral features for ASR. Spectral basis functions which preserve phonetic class separability are derived using linear discriminant analysis (LDA). It is observed that the discriminant basis functions analyze the low frequency part of the spectrum with higher resolution than the high frequency part. This trend is consistent with properties of human hearing which are explained using the notion of critical bandwidth and emulated in the current feature extraction modules by Mel/Bark filter bank. The proposed discriminant features are shown to outperform the conventional features in ASR experiments. The second part of the dissertation introduces data-driven methods for the design of channel normalizing filters for speaker verification. It has been observed that a reasonable verification error can be achieved if the speaker uses the same handset and telephone line for testing. On the other hand if the speaker uses a different telephone handset while testing, the verification error can increase by four to five times. We introduce a data-driven method for designing filters capable of normalizing the variability introduced by different telephone handsets. The design of the filter is based on the estimated second order statistics of handset variability. 
This filter is applied on the logarithmic energy outputs of Mel spaced filter banks. The effectiveness of the proposed channel normalizing filter in improving speaker verification performance in mismatched conditions is also demonstrated.
    Publication in the conference proceedings of EUSIPCO, Lausanne, Switzerland, 2008
    In this letter, a new feature extraction technique based on modulation spectrum derived from syllable-length segments of sub-band temporal envelopes is proposed. These sub-band envelopes are derived from auto-regressive modelling of Hilbert envelopes of the signal in critical bands, processed by both a static (logarithmic) and a dynamic (adaptive loops) compression. These features are then used for machine recognition of phonemes in telephone speech. Without degrading the performance in clean conditions, the proposed features show significant improvements compared to other state-of-the-art speech analysis techniques. In addition to the overall phoneme recognition rates, the performance with broad phonetic classes is reported.
    The temporal trajectories of the spectral energy in auditory critical bands over 250 ms segments are approximated by an all-pole model, the time-domain dual of conventional linear prediction. This quarter-second auditory spectro-temporal pattern is further smoothed by iterative alternation of spectral and temporal all-pole modeling. Just as Perceptual Linear Prediction (PLP) uses an autoregressive model in the frequency domain to estimate peaks in an auditory-like short-term spectral slice, PLP$^2$ uses all-pole modeling in both time and frequency domains to estimate peaks of a two-dimensional spectro-temporal pattern, motivated by considerations of the auditory system.
    This paper proposes modifications to the Multi-resolution RASTA (MRASTA) feature extraction technique for automatic speech recognition (ASR). By emulating asymmetries of the temporal receptive field (TRF) profiles of auditory mid-brain neurons, we obtain more than 13% relative improvement in word error rate on the OGI-Digits database. Experiments on the TIMIT database confirm that the proposed modifications are indeed useful.
    In the framework of hidden Markov models (HMM) or hybrid HMM/Artificial Neural Network (ANN) systems, we present a new approach towards speech recognition. The general idea is to split the whole frequency band (represented in terms of critical bands) into a few sub-bands on which different recognizers are independently applied and then recombined at a certain speech unit level to yield global scores and a global recognition decision. The preliminary results presented in this paper show that such an approach, even using quite simple recombination strategies, can yield at least comparable performance on clean speech while providing significantly better robustness in the case of speech corrupted by narrowband noise.
    In this paper, we investigate the approach of comparing two different parallel streams of phoneme posterior probability estimates for OOV word detection. The first phoneme posterior probability stream is estimated using only the knowledge of the short-term acoustic observation. In our work we refer to this stream as “out-of-context posteriors”. The second posterior probability stream, also referred to as “in-context posteriors”, is estimated using the knowledge of the whole acoustic observation sequence: the acoustic model and the language model of an ASR system. In particular, we focus our study on different types of distance measures, namely KL-divergence and Euclidean distance, to compare the two phoneme posterior probability streams. Our experiments on a large-vocabulary automatic speech recognition task show that using the KL-divergence measure estimated with the in-context posteriors as the reference distribution consistently yields a better OOV word detection system.
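The two distance measures named above can be sketched per frame as follows, with the in-context posteriors as the KL reference distribution. The function names, the `eps` smoothing constant, and the toy three-phoneme posteriors are assumptions for illustration; how per-frame distances are turned into an OOV decision is not given in the abstract and is not sketched.

```python
import math


def kl_divergence(reference, other, eps=1e-12):
    """KL(reference || other) for two discrete posterior vectors.

    eps is an assumed smoothing constant that guards against log(0).
    """
    return sum(
        p * math.log((p + eps) / (q + eps))
        for p, q in zip(reference, other)
        if p > 0.0
    )


def euclidean_distance(p, q):
    """Euclidean distance between two posterior vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def frame_distances(in_context, out_of_context):
    """Per-frame KL divergence with the in-context stream as reference."""
    return [kl_divergence(ic, oc) for ic, oc in zip(in_context, out_of_context)]


# Toy posteriors over three phoneme classes for two frames (assumed values).
in_context = [[0.8, 0.1, 0.1], [0.9, 0.05, 0.05]]
out_of_context = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]

distances = frame_distances(in_context, out_of_context)
```

Here the first frame, where the two streams agree, gets a distance of zero, and the second frame, where they disagree, gets a clearly positive one. Note that KL divergence is asymmetric, so the choice of reference stream changes the value.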
    We present a feature extraction technique based on static and dynamic modulation spectrum derived from long-term envelopes in sub-bands. Estimation of the sub-band temporal envelopes is done using Frequency Domain Linear Prediction (FDLP). These sub-band envelopes are compressed with a static (logarithmic) and dynamic (adaptive loops) compression. The compressed sub-band envelopes are transformed into modulation spectral components which are used as features for speech recognition. Experiments are performed on a phoneme recognition task using a hybrid HMM-ANN phoneme recognition system and an ASR task using the TANDEM speech recognition system. The proposed features provide relative improvements of 3.8 % and 11.5 % in phoneme recognition accuracies for TIMIT and conversational telephone speech (CTS) respectively. Further, these improvements are found to be consistent for ASR tasks on the OGI-Digits database (relative improvement of 13.5 %).
    Spectrotemporal representations of speech have already shown promising results in speech processing technologies. However, many inherent issues of such representations, such as high dimensionality, have limited their use in speech and speaker recognition. Multistream ...
    We propose to incorporate features derived using spectro-temporal receptive fields (STRFs) of neurons in the auditory cortex for phoneme recognition. Each of these STRFs is tuned to different auditory frequencies, scales and modulation rates. We select different sets of STRFs ...
    The paper introduces a mixture of auto-associative neural networks for speaker verification. A new objective function based on posterior probabilities of phoneme classes is used for training the mixture. This objective function allows each component of the mixture to model part of the acoustic space corresponding to a broad phonetic class. This paper also proposes how factor analysis can be applied in this setting. The proposed techniques show promising results on a subset of NIST08 speaker recognition evaluation (SRE) and yield about 10% relative improvement when combined with the state-of-the-art Gaussian Mixture Model i-vector system.
