DOI: 10.1109/ICASSP.2016.7472633

Investigations on speaker adaptation of LSTM RNN models for speech recognition

Published: 01 March 2016

Abstract

Recently, Long Short-Term Memory (LSTM) recurrent neural network (RNN) acoustic models have demonstrated superior performance over deep neural network (DNN) models in speech recognition and many other tasks. Although a lot of work has been reported on DNN model adaptation, very little has been done on LSTM model adaptation. In this paper we present our extensive studies of speaker adaptation of LSTM-RNN models for speech recognition. We investigated different adaptation methods combined with KL-divergence based regularization, where and which network components to adapt, supervised versus unsupervised adaptation, and asymptotic analysis. We made a few distinct and important observations. On a large vocabulary speech recognition task, by adapting only 2.5% of the LSTM model parameters using 50 utterances per speaker, we obtained a 12.6% relative word error rate reduction (WERR) on the dev set and a 9.1% WERR on the evaluation set over a strong LSTM baseline model.
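The KL-divergence based regularization mentioned in the abstract (following Yu et al., "KL-divergence regularized deep neural network adaptation", ICASSP 2013) can be sketched as follows. This is a minimal illustration of the general technique, not code from the paper; the function names and the interpolation weight `rho` are assumptions.

```python
import numpy as np

def kld_regularized_targets(hard_labels, si_posteriors, rho=0.5):
    """Interpolate one-hot targets with the speaker-independent (SI) model's
    posteriors. Training the adapted model with cross-entropy against these
    soft targets is equivalent to adding a KL-divergence penalty that keeps
    the adapted model's outputs close to the SI model's."""
    num_classes = si_posteriors.shape[-1]
    one_hot = np.eye(num_classes)[hard_labels]          # (N, C) one-hot rows
    return (1.0 - rho) * one_hot + rho * si_posteriors  # (N, C) soft targets

def cross_entropy(soft_targets, adapted_posteriors, eps=1e-12):
    """Mean per-frame cross-entropy of the adapted model's posteriors
    against the interpolated soft targets."""
    return -np.mean(np.sum(soft_targets * np.log(adapted_posteriors + eps),
                           axis=-1))
```

Here `rho` trades off fidelity to the adaptation labels against staying close to the unadapted model: `rho=0` recovers plain fine-tuning on the hard labels, while `rho=1` ignores the labels entirely.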


Cited By

  • Explainability-Based Mix-Up Approach for Text Data Augmentation. ACM Transactions on Knowledge Discovery from Data 17(1), 1–14, 2022. DOI: 10.1145/3533048
  • Mixed Script Identification Using Automated DNN Hyperparameter Optimization. Computational Intelligence and Neuroscience, 2021. DOI: 10.1155/2021/8415333

Published In

2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6592 pages.

Publisher: IEEE Press


          Qualifiers

          • Research-article
