Short paper · Free access · Just Accepted

Optimizing Uyghur Speech Synthesis by Combining Pretrained Cross-Lingual Model

Online AM: 28 June 2024
Abstract

    End-to-end speech synthesis has advanced considerably for languages with abundant corpus resources, but comparable results have yet to be achieved for languages with limited corpora. This paper presents a strategy that leverages contextual encoding information to improve the naturalness of speech synthesized by FastSpeech2 under resource-scarce conditions. First, we use the cross-lingual model XLM-RoBERTa to extract contextual features, which serve as an auxiliary input to the mel-spectrogram decoder of FastSpeech2. Second, we refine the mel-spectrogram prediction module to mitigate the overfitting that FastSpeech2 suffers on small training sets: Conformer blocks replace the conventional Transformer blocks in both the encoder and decoder, allowing the model to attend to feature information at different levels and granularities. In addition, we introduce a token-average mechanism that equalizes frame-level pitch and energy attributes over each token. Experiments show that pre-training on the LJ Speech dataset, followed by fine-tuning on as little as 10 minutes of paired Uyghur data, yields satisfactory synthesized Uyghur speech. Relative to the baseline framework, the proposed method halves the character error rate and raises the mean opinion score by more than 0.6. Similar results were observed in Mandarin Chinese evaluations.
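    The token-average mechanism described above can be illustrated with a short sketch. This is not the authors' implementation: `token_average` and its inputs are illustrative names, and the per-token durations are assumed to come from a duration predictor such as the one in FastSpeech2.

    ```python
    import numpy as np

    def token_average(frame_values, durations):
        """Average a frame-level contour (pitch or energy) over each token.

        frame_values: 1-D array of per-frame values, length T.
        durations: number of frames assigned to each token; sum(durations) == T.
        Returns one averaged value per token.
        """
        assert sum(durations) == len(frame_values)
        out, start = [], 0
        for d in durations:
            seg = frame_values[start:start + d]
            # A zero-duration token gets a neutral value instead of NaN.
            out.append(float(seg.mean()) if d > 0 else 0.0)
            start += d
        return np.array(out)
    ```

    For example, a pitch contour [1, 2, 3, 4, 5, 6] with token durations [2, 1, 3] averages to [1.5, 3.0, 5.0] — one pitch value per token rather than per frame, which smooths the targets the variance predictors must learn.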



      Published In

      ACM Transactions on Asian and Low-Resource Language Information Processing (Just Accepted)
      ISSN: 2375-4699
      EISSN: 2375-4702

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Online AM: 28 June 2024
      Accepted: 22 June 2024
      Revised: 01 April 2024
      Received: 22 October 2023


      Author Tags

      1. end-to-end
      2. low resource
      3. speech synthesis
      4. cross-language

      Qualifiers

      • Short-paper
