Deep Speaker Embedding for Speaker-Targeted Automatic Speech Recognition

Research article · Open access
DOI: 10.1145/3342827.3342847

Published: 28 June 2019

Abstract

In this work, we investigate three types of deep speaker embedding as text-independent features for speaker-targeted speech recognition in cocktail-party environments. The text-independent speaker embedding is extracted from the target speaker's existing speech segment (i-vector and x-vector) or face image (f-vector) and concatenated with the acoustic features of any new speech utterance as the input features. Because the proposed model extracts the target speaker's embedding once in advance, it is computationally more efficient than many prior approaches that estimate the target speaker's characteristics on the fly. Empirical evaluation shows that using speaker embedding together with acoustic features reduces the Word Error Rate of the audio-only model from 65.7% to 29.5%. Among the three types of speaker embedding, x-vector and f-vector are robust against environment variations, while i-vector tends to overfit to the specific speaker and environment condition.
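
A minimal sketch of the conditioning step described above: the speaker embedding is computed once for the target speaker and simply concatenated onto every acoustic frame of a new utterance. The function name, the 40-dimensional filterbank frames, and the 512-dimensional x-vector below are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def build_speaker_conditioned_features(acoustic_feats: np.ndarray,
                                       speaker_embedding: np.ndarray) -> np.ndarray:
    """Concatenate a fixed speaker embedding onto every acoustic frame.

    acoustic_feats:    (num_frames, feat_dim) frames, e.g. filterbanks or MFCCs
    speaker_embedding: (emb_dim,) precomputed i-vector / x-vector / f-vector
    returns:           (num_frames, feat_dim + emb_dim) acoustic-model input
    """
    num_frames = acoustic_feats.shape[0]
    # The embedding is extracted once per target speaker, so the per-utterance
    # cost is just tiling it across time and concatenating.
    tiled = np.tile(speaker_embedding, (num_frames, 1))
    return np.concatenate([acoustic_feats, tiled], axis=1)

# Illustrative shapes only (assumed, not from the paper):
feats = np.random.randn(300, 40).astype(np.float32)   # one utterance, 40-dim filterbanks
xvec = np.random.randn(512).astype(np.float32)        # target speaker's x-vector
net_input = build_speaker_conditioned_features(feats, xvec)
print(net_input.shape)  # (300, 552)
```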

    Published In

    NLPIR '19: Proceedings of the 2019 3rd International Conference on Natural Language Processing and Information Retrieval
    June 2019, 171 pages
    ISBN: 9781450362795
    DOI: 10.1145/3342827

    In-Cooperation

    • Southwest Jiaotong University

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. acoustic modeling
    2. robust speaker embeddings
    3. speaker-targeted speech recognition
