LSTM-LM with long-term history for first-pass decoding in conversational speech recognition

X Chen, S Parthasarathy, W Gale, S Chang, M Zeng
arXiv preprint arXiv:2010.11349, 2020
LSTM language models (LSTM-LMs) have proven powerful and yield significant performance improvements over count-based n-gram LMs in modern speech recognition systems. Because of their unbounded history states and computational load, most previous studies apply LSTM-LMs in a second pass for rescoring. Recent work shows that it is feasible and computationally affordable to adopt LSTM-LMs in first-pass decoding within a dynamic (or tree-based) decoder framework. In this work, the LSTM-LM is composed with a WFST decoder on the fly for first-pass decoding. Furthermore, motivated by the long-term history modeling of LSTM-LMs, the use of context beyond the current utterance is explored for first-pass decoding in conversational speech recognition. The context information is captured by the hidden states of the LSTM-LM across utterances and can be used to guide the first-pass search effectively. Experimental results on our internal meeting transcription system show that incorporating this contextual information with LSTM-LMs in first-pass decoding yields significant performance improvements over applying the contextual information in second-pass rescoring.
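The core mechanism described in the abstract is carrying the LSTM-LM's recurrent state across utterance boundaries so that first-pass LM scores are conditioned on earlier utterances in the conversation. The following is a minimal sketch of that idea, not the authors' implementation: the model class, vocabulary size, dimensions, and the session data are illustrative assumptions, and integration with the WFST decoder is only indicated in comments.

```python
# Minimal sketch (assumed, not the paper's code): an LSTM-LM whose hidden state
# is carried across utterances, so scores for the current utterance are
# conditioned on long-term conversational history.
import torch
import torch.nn as nn

class LSTMLM(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, state=None):
        # tokens: (batch, seq_len) word ids; state: (h, c) from prior utterances
        x = self.embed(tokens)
        out, state = self.lstm(x, state)
        return self.proj(out), state

# Cross-utterance scoring: the (h, c) state returned for one utterance is fed
# back in when scoring the next, so earlier utterances act as context.
lm = LSTMLM()
session = [torch.randint(0, 10000, (1, 12)),   # utterance 1 (token ids, dummy data)
           torch.randint(0, 10000, (1, 8))]    # utterance 2
state = None
for utt in session:
    logits, state = lm(utt, state)
    log_probs = torch.log_softmax(logits, dim=-1)
    # In first-pass decoding, these LM scores would be combined on the fly with
    # the acoustic/WFST scores during the search, rather than used for rescoring.
```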