Deep Speaker Embedding for Speaker-Targeted Automatic Speech Recognition

Research article · Open access
DOI: 10.1145/3342827.3342847

Published: 28 June 2019

Abstract

In this work, we investigate three types of deep speaker embedding as text-independent features for speaker-targeted speech recognition in cocktail-party environments. The text-independent speaker embedding is extracted from the target speaker's existing speech segment (i-vector and x-vector) or face image (f-vector) and concatenated with the acoustic features of any new speech utterance as the input features. Because the proposed model extracts the target speaker's embedding once in advance, it is computationally more efficient than many prior approaches that estimate the target speaker's characteristics on the fly. Empirical evaluation shows that using speaker embedding together with acoustic features reduces the Word Error Rate of the audio-only model from 65.7% to 29.5%. Among the three types of speaker embedding, x-vector and f-vector are robust against environment variations, while i-vector tends to overfit to the specific speaker and environment condition.
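
A minimal sketch of the conditioning step described above: the speaker embedding is computed once for the target speaker and simply concatenated onto every acoustic frame of a new utterance. The function name, the 40-dimensional filterbank frames, and the 512-dimensional x-vector below are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def build_speaker_conditioned_features(acoustic_feats: np.ndarray,
                                       speaker_embedding: np.ndarray) -> np.ndarray:
    """Concatenate a fixed speaker embedding onto every acoustic frame.

    acoustic_feats:    (num_frames, feat_dim) frames, e.g. filterbanks or MFCCs
    speaker_embedding: (emb_dim,) precomputed i-vector / x-vector / f-vector
    returns:           (num_frames, feat_dim + emb_dim) acoustic-model input
    """
    num_frames = acoustic_feats.shape[0]
    # The embedding is extracted once per target speaker, so the per-utterance
    # cost is just tiling it across time and concatenating.
    tiled = np.tile(speaker_embedding, (num_frames, 1))
    return np.concatenate([acoustic_feats, tiled], axis=1)

# Illustrative shapes only (assumed, not from the paper):
feats = np.random.randn(300, 40).astype(np.float32)   # one utterance, 40-dim filterbanks
xvec = np.random.randn(512).astype(np.float32)        # target speaker's x-vector
net_input = build_speaker_conditioned_features(feats, xvec)
print(net_input.shape)  # (300, 552)
```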

    Published In

    NLPIR '19: Proceedings of the 2019 3rd International Conference on Natural Language Processing and Information Retrieval
    June 2019, 171 pages
    ISBN: 9781450362795
    DOI: 10.1145/3342827

    In-Cooperation

    • Southwest Jiaotong University

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. acoustic modeling
    2. robust speaker embeddings
    3. speaker-targeted speech recognition
