DOI: 10.1109/ICASSP.2016.7472633

Investigations on speaker adaptation of LSTM RNN models for speech recognition

Published: 01 March 2016

Abstract

Recently, Long Short-Term Memory (LSTM) recurrent neural network (RNN) acoustic models have demonstrated superior performance over deep neural network (DNN) models in speech recognition and many other tasks. Although a lot of work has been reported on DNN model adaptation, very little has been done on LSTM model adaptation. In this paper we present our extensive studies of speaker adaptation of LSTM-RNN models for speech recognition. We investigated different adaptation methods combined with KL-divergence based regularization, where and which network components to adapt, supervised versus unsupervised adaptation, and asymptotic analysis. We made a few distinct and important observations. On a large vocabulary speech recognition task, by adapting only 2.5% of the LSTM model parameters using 50 utterances per speaker, we obtained a 12.6% relative word error rate reduction (WERR) on the dev set and a 9.1% WERR on the evaluation set over a strong LSTM baseline model.
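The KL-divergence based regularization mentioned in the abstract (following Yu et al., "KL-divergence regularized deep neural network adaptation", ICASSP 2013) can be sketched as follows. This is a minimal illustration of the general technique, not code from the paper; the function names and the interpolation weight `rho` are assumptions.

```python
import numpy as np

def kld_regularized_targets(hard_labels, si_posteriors, rho=0.5):
    """Interpolate one-hot targets with the speaker-independent (SI) model's
    posteriors. Training the adapted model with cross-entropy against these
    soft targets is equivalent to adding a KL-divergence penalty that keeps
    the adapted model's outputs close to the SI model's."""
    num_classes = si_posteriors.shape[-1]
    one_hot = np.eye(num_classes)[hard_labels]          # (N, C) one-hot rows
    return (1.0 - rho) * one_hot + rho * si_posteriors  # (N, C) soft targets

def cross_entropy(soft_targets, adapted_posteriors, eps=1e-12):
    """Mean per-frame cross-entropy of the adapted model's posteriors
    against the interpolated soft targets."""
    return -np.mean(np.sum(soft_targets * np.log(adapted_posteriors + eps),
                           axis=-1))
```

Here `rho` trades off fidelity to the adaptation labels against staying close to the unadapted model: `rho=0` recovers plain fine-tuning on the hard labels, while `rho=1` ignores the labels entirely.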


Cited By

  • Explainability-Based Mix-Up Approach for Text Data Augmentation. ACM Transactions on Knowledge Discovery from Data 17(1), 1–14, 2022. DOI: 10.1145/3533048
  • Mixed Script Identification Using Automated DNN Hyperparameter Optimization. Computational Intelligence and Neuroscience, 2021. DOI: 10.1155/2021/8415333

Published In

2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6592 pages.

Publisher: IEEE Press


          Qualifiers

          • Research-article
