The performance of automatic speech recognition relies on a vast amount of training speech data, most of it recorded with little or no background noise. Performance degrades significantly in background noise, which increases the mismatch between the training and test environments. Speech enhancement techniques can reduce this mismatch.
At very low SNR with nonstationary noise, the enhanced speech may still contain significant noise, either in noise-only segments or in speech segments. The former masquerades as speech that is not there, and the latter as distorted speech. Both significantly degrade the performance of the automatic speech recognizer. This motivates the use of voice activity detection (VAD) algorithms to determine the regions in which speech is present. To use only the reliable speech features, we must further determine whether the features from a speech region come mainly from speech or from nonstationary noise masking the speech. For more robust speech recognition, this thesis proposes a three-hypothesis VAD consisting of H_0: noise-only region; H_S: speech-dominant speech region; and H_N: noise-dominant speech region.
Spectrum-based VAD uses knowledge of the noise spectrum to detect voice activity using the nonstationary nature of speech. This thesis proposes a method of estimating the instantaneous noise spectrum for VAD. The spectrum-based VAD, however, cannot distinguish speech from nonstationary noise because both appear nonstationary to the VAD, and thus look like speech. A microphone array can determine the noise-corrupted speech region when the nonstationary noise is from a location other than that of the speech source. This thesis proposes a method of distinguishing H_S from H_N based on the steered response power (SRP) method, which estimates power from any location.
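As a rough illustration of the SRP idea, the following Python sketch steers a delay-and-sum beamformer at a single frequency bin toward the talker and toward other candidate locations, then labels a speech frame H_S or H_N by comparing the resulting powers. The uniform linear array, the far-field geometry, the single-bin processing, the comparison rule, and all function names are assumptions made for this example; the abstract does not specify the thesis's actual SRP formulation or decision statistic.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed

def mic_delays(angle_deg, n_mics, spacing_m):
    """Far-field arrival delay (s) at each mic of a uniform linear array."""
    s = math.sin(math.radians(angle_deg))
    return [m * spacing_m * s / SPEED_OF_SOUND for m in range(n_mics)]

def simulate_bin(sources, freq_hz, n_mics, spacing_m):
    """One complex frequency bin per mic for (amplitude, angle_deg) sources."""
    bins = [0j] * n_mics
    for amplitude, angle_deg in sources:
        delays = mic_delays(angle_deg, n_mics, spacing_m)
        for m, tau in enumerate(delays):
            bins[m] += amplitude * cmath.exp(-2j * math.pi * freq_hz * tau)
    return bins

def srp(bins, steer_deg, freq_hz, spacing_m):
    """Delay-and-sum output power with the array steered toward steer_deg."""
    delays = mic_delays(steer_deg, len(bins), spacing_m)
    # Undo the expected delay at each mic, then average across mics.
    aligned = sum(x * cmath.exp(2j * math.pi * freq_hz * tau)
                  for x, tau in zip(bins, delays))
    return abs(aligned / len(bins)) ** 2

def label_speech_frame(bins, talker_deg, other_degs, freq_hz, spacing_m):
    """Return "H_S" if the talker direction carries the most power, else "H_N"."""
    talker_power = srp(bins, talker_deg, freq_hz, spacing_m)
    competing = max(srp(bins, a, freq_hz, spacing_m) for a in other_degs)
    return "H_S" if talker_power >= competing else "H_N"

# Hypothetical setup: 4 mics, 5 cm apart, talker at 0 degrees, noise at 60 degrees.
FREQ, N_MICS, SPACING = 2000.0, 4, 0.05
clean_frame = simulate_bin([(1.0, 0.0), (0.3, 60.0)], FREQ, N_MICS, SPACING)
masked_frame = simulate_bin([(0.3, 0.0), (1.0, 60.0)], FREQ, N_MICS, SPACING)
print(label_speech_frame(clean_frame, 0.0, [60.0], FREQ, SPACING))
print(label_speech_frame(masked_frame, 0.0, [60.0], FREQ, SPACING))
```

In this toy setup the first frame is dominated by the talker direction and the second by the off-axis noise source. A practical system would sum the steered power over many frequency bins and would apply the test only to frames that the spectrum-based VAD has already marked as containing speech.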
Phonemic restoration is a phenomenon in which humans claim to hear missing phonemes that have been replaced by noise. Given strong nonstationary noises occasionally masking the speech region, as well as knowledge of H_S and H_N, this thesis proposes a phoneme restoration approach for automatic speech recognition in the hidden Markov model framework.
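One simple way to picture restoration inside an HMM decoder is to let frames labeled H_N contribute no acoustic score, so that the model's transition structure fills in the masked stretch. The Python sketch below does this with a log-domain Viterbi search over a toy three-phoneme left-to-right model. The ignore-the-frame rule, the model topology, every probability, and the name `viterbi_with_mask` are assumptions chosen for illustration; this is not the restoration method of the thesis, whose details the abstract does not give.

```python
import math

NEG_INF = float("-inf")

def log(p):
    """Natural log that maps zero probability to -inf instead of raising."""
    return math.log(p) if p > 0.0 else NEG_INF

def viterbi_with_mask(observations, reliable, states, start, trans, emit):
    """Most likely state path; frames with reliable[t] False add no acoustic score."""
    def emission(state, t):
        if not reliable[t]:
            return 0.0  # log(1): the frame is ignored for every state
        return log(emit[state].get(observations[t], 0.0))

    score = {s: log(start.get(s, 0.0)) + emission(s, 0) for s in states}
    backpointers = []
    for t in range(1, len(observations)):
        new_score, pointer = {}, {}
        for s in states:
            best_prev = max(states,
                            key=lambda p: score[p] + log(trans[p].get(s, 0.0)))
            new_score[s] = (score[best_prev]
                            + log(trans[best_prev].get(s, 0.0))
                            + emission(s, t))
            pointer[s] = best_prev
        score = new_score
        backpointers.append(pointer)

    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    path.reverse()
    return path

# Toy left-to-right model for the phoneme sequence k, ae, t (all values assumed).
states = ["k", "ae", "t"]
start = {"k": 1.0}
trans = {"k": {"k": 0.6, "ae": 0.4},
         "ae": {"ae": 0.7, "t": 0.3},
         "t": {"t": 1.0}}
emit = {s: {o: (0.8 if o == s else 0.1) for o in states} for s in states}

# Frames 2 and 3 are noise-dominant (H_N); their observations are never scored.
observations = ["k", "k", "?", "?", "t", "t"]
reliable = [True, True, False, False, True, True]
path = viterbi_with_mask(observations, reliable, states, start, trans, emit)
print(path)
```

With these numbers the decoded path passes through "ae" inside the masked stretch even though no reliable acoustic evidence for it was observed, which is the behavior the restoration idea relies on.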
The proposed approach has two steps: speech enhancement as a preprocessor for the noisy speech signals, followed by phoneme restoration, given knowledge of H_S and H_N, for speech recognition that is robust to nonstationary noise.
Index Terms
- Robust speech recognition in a car using a microphone array