Short paper · Free access · Just Accepted

Optimizing Uyghur Speech Synthesis by Combining Pretrained Cross-Lingual Model

Online AM: 28 June 2024
Abstract

    End-to-end speech synthesis has advanced considerably for languages with abundant corpus resources, but comparable results have yet to be achieved for languages with limited corpora. This paper presents a strategy that leverages contextual encoding information to improve the naturalness of speech synthesized by FastSpeech2 under resource-scarce conditions. First, we use the cross-lingual model XLM-RoBERTa to extract contextual features, which serve as an auxiliary input to the mel-spectrogram decoder of FastSpeech2. Second, we refine the mel-spectrogram prediction module to mitigate the overfitting that FastSpeech2 suffers on small training sets: Conformer blocks replace the conventional Transformer blocks in both the encoder and decoder, allowing the model to attend to feature information at different levels and granularities. In addition, we introduce a token-average mechanism that equalizes frame-level pitch and energy attributes over each token. Experiments show that pre-training on the LJ Speech dataset, followed by fine-tuning on as little as 10 minutes of paired Uyghur data, yields satisfactory synthesized Uyghur speech. Relative to the baseline framework, the proposed method halves the character error rate and raises the mean opinion score by more than 0.6. Similar results were observed in Mandarin Chinese evaluations.
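    The token-average mechanism described above can be illustrated with a short sketch. This is not the authors' implementation: `token_average` and its inputs are illustrative names, and the per-token durations are assumed to come from a duration predictor such as the one in FastSpeech2.

    ```python
    import numpy as np

    def token_average(frame_values, durations):
        """Average a frame-level contour (pitch or energy) over each token.

        frame_values: 1-D array of per-frame values, length T.
        durations: number of frames assigned to each token; sum(durations) == T.
        Returns one averaged value per token.
        """
        assert sum(durations) == len(frame_values)
        out, start = [], 0
        for d in durations:
            seg = frame_values[start:start + d]
            # A zero-duration token gets a neutral value instead of NaN.
            out.append(float(seg.mean()) if d > 0 else 0.0)
            start += d
        return np.array(out)
    ```

    For example, a pitch contour [1, 2, 3, 4, 5, 6] with token durations [2, 1, 3] averages to [1.5, 3.0, 5.0] — one pitch value per token rather than per frame, which smooths the targets the variance predictors must learn.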



      Published In

      ACM Transactions on Asian and Low-Resource Language Information Processing (Just Accepted)
      ISSN: 2375-4699
      EISSN: 2375-4702

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Online AM: 28 June 2024
      Accepted: 22 June 2024
      Revised: 01 April 2024
      Received: 22 October 2023


      Author Tags

      1. end-to-end
      2. low resource
      3. speech synthesis
      4. cross-language

      Qualifiers

      • Short-paper
