Task-based Meta Focal Loss for Multilingual Low-resource Speech Recognition

Published: 20 November 2023
Abstract

    Low-resource automatic speech recognition (ASR) is a challenging task due to the lack of labeled training data. To address this issue, multilingual meta-learning learns a better model initialization from many source-language tasks, enabling fast adaptation to unseen target languages. However, the quantity and difficulty of the diverse source languages vary greatly because of their different data scales and phonological systems. These differences lead to task-quantity and task-difficulty imbalance and can thus cause multilingual meta-learning ASR to fail. In this work, we propose a task-based meta focal loss (TMFL) approach to address this challenge. Specifically, we introduce a hard-task moderator and update the meta-parameters using gradients from both the support set and the query set. The proposed approach focuses more on hard tasks and makes full use of their data. Moreover, we analyze the significance of the hard-task moderator and interpret it at the sample level. Experimental results show that TMFL significantly outperforms state-of-the-art multilingual meta-learning on all target languages of the IARPA BABEL and OpenSLR datasets, especially under very-low-resource conditions. In particular, it reduces the character error rate from 72% to 60% when fine-tuning the pre-trained model with about 22 hours of Vietnamese data.
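
    The abstract describes the approach only at a high level, so the following is a minimal, self-contained sketch, not the authors' implementation, of how a MAML-style meta-update with a focal-loss-style hard-task weighting and a support-plus-query outer gradient could look in PyTorch. The tiny linear model, the cross-entropy loss, the `difficulty ** gamma` weighting, and the `alpha` support/query mix are illustrative assumptions standing in for a real ASR encoder, its CTC/attention loss, and the paper's actual hard-task moderator.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.func import functional_call  # requires PyTorch >= 2.0

# Stand-in acoustic model: 40-dim input features, 30 output tokens.
model = nn.Linear(40, 30)
meta_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
inner_lr, gamma, alpha = 1e-2, 2.0, 0.5  # gamma: focusing parameter; alpha: support/query mix (assumed)

def task_loss(params, batch):
    """Per-task loss; cross-entropy stands in for a real CTC/attention ASR loss."""
    feats, labels = batch
    logits = functional_call(model, params, (feats,))
    return F.cross_entropy(logits, labels)

def tmfl_step(tasks):
    """One meta-update over a list of (support_batch, query_batch) pairs,
    one pair per sampled source language."""
    meta_opt.zero_grad()
    per_task_losses = []
    for support, query in tasks:
        params = dict(model.named_parameters())
        # Inner loop: one gradient step on the support set (second-order MAML).
        s_loss = task_loss(params, support)
        grads = torch.autograd.grad(s_loss, list(params.values()), create_graph=True)
        adapted = {k: p - inner_lr * g for (k, p), g in zip(params.items(), grads)}
        # Outer loss mixes support- and query-set losses so both contribute meta-gradients.
        q_loss = task_loss(adapted, query)
        per_task_losses.append(alpha * s_loss + (1.0 - alpha) * q_loss)

    losses = torch.stack(per_task_losses)
    # Hard-task moderator (illustrative): focal-style weights that grow with task loss,
    # so harder tasks dominate the meta-gradient.
    difficulty = (losses / (losses.max() + 1e-8)).detach()  # roughly in (0, 1]
    weights = difficulty ** gamma
    meta_loss = (weights * losses).sum() / weights.sum()
    meta_loss.backward()
    meta_opt.step()
    return meta_loss.item()

# Toy usage: four random "languages", each with a support and a query batch.
make_batch = lambda: (torch.randn(8, 40), torch.randint(0, 30, (8,)))
tasks = [(make_batch(), make_batch()) for _ in range(4)]
print(tmfl_step(tasks))
```

    In this sketch, a task whose loss is close to the largest loss among the sampled languages receives a weight near 1, while easier tasks are down-weighted by the focusing parameter gamma, mirroring how the original focal loss suppresses easy examples at the sample level.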


    Cited By

    • (2023) HKG: A Novel Approach for Low Resource Indic Languages to Automatic Knowledge Graph Construction. ACM Transactions on Asian and Low-Resource Language Information Processing. DOI: 10.1145/3611306. Online publication date: 2 August 2023.


        Published In

        ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 11
        November 2023
        255 pages
        ISSN: 2375-4699
        EISSN: 2375-4702
        DOI: 10.1145/3633309
        Editor: Imed Zitouni

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 20 November 2023
        Online AM: 29 September 2023
        Accepted: 20 September 2023
        Revised: 15 May 2023
        Received: 20 February 2023
        Published in TALLIP Volume 22, Issue 11


        Author Tags

        1. Meta learning
        2. low-resource
        3. speech recognition
        4. focal loss
        5. IARPA-BABEL
        6. OpenSLR

        Qualifiers

        • Research-article

        Funding Sources

        • National Natural Science Foundation of China
        • Natural Science Foundation of Henan Province of China
        • Henan Zhongyuan Science and Technology Innovation Leading Talent
