Short Paper

Improving Generative Adversarial Network-based Vocoding through Multi-scale Convolution

Published: 22 September 2023

Abstract

Vocoding is a sub-process of the text-to-speech task that generates audio from intermediate representations between text and audio. Several recent works have shown that generative adversarial network (GAN)–based vocoders can generate high-quality audio. While GAN-based neural vocoders synthesize speech faster than autoregressive vocoders, their audio fidelity still cannot match ground-truth samples. One major cause of the degraded audio quality and blurred spectrograms is the average pooling layers in the discriminator. Because the multi-scale discriminator commonly used by recent GAN-based vocoders applies several average pooling layers to capture different frequency bands, we believe it is crucial to prevent high-frequency information from leaking away during the average pooling process. This article proposes MSCGAN, which solves this problem and achieves higher-fidelity speech synthesis. We demonstrate that substituting a multi-scale convolution architecture for the average pooling process effectively retains high-frequency features and thus forces the generator to recover audio details in both the time and frequency domains. Compared with other state-of-the-art GAN-based vocoders, MSCGAN produces competitive audio with higher spectrogram clarity and a higher mean opinion score (MOS) in subjective human evaluation.
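The core intuition of the abstract — that a fixed averaging kernel acts as a low-pass filter and discards high-frequency content, while a learned strided convolution covering the same downsampling factor can preserve it — can be illustrated with a minimal NumPy sketch. This is illustrative toy code under assumed shapes, not the paper's actual discriminator; the kernel values are hypothetical.

```python
import numpy as np

def avg_pool_1d(x, k=4):
    # Average pooling with kernel size k and stride k: the fixed uniform
    # kernel low-pass filters the signal, so high-frequency content is
    # attenuated before reaching the next discriminator scale.
    n = len(x) // k
    return x[: n * k].reshape(n, k).mean(axis=1)

def strided_conv_1d(x, w, stride=4):
    # A strided convolution with a learnable kernel w achieves the same
    # downsampling factor, but w is trained, so the network can keep
    # (or even emphasize) high-frequency components instead of averaging
    # them away.
    k = len(w)
    out = []
    for start in range(0, len(x) - k + 1, stride):
        out.append(float(np.dot(x[start : start + k], w)))
    return np.array(out)

# An alternating +1/-1 signal is pure high-frequency content.
x = np.array([1.0, -1.0] * 8)
print(avg_pool_1d(x, k=4))
# -> [0. 0. 0. 0.]  (averaging destroys the signal entirely)
print(strided_conv_1d(x, np.array([1.0, -1.0, 1.0, -1.0]), stride=4))
# -> [4. 4. 4. 4.]  (a suitable kernel passes it through)
```

In the paper's setting the kernel weights are learned jointly with the rest of the discriminator rather than hand-picked as here, but the contrast is the same: the pooling kernel is frozen at uniform weights, the convolutional one is free to retain high-frequency detail.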


    Published In

    ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 9
    September 2023, 226 pages
    ISSN: 2375-4699
    EISSN: 2375-4702
    DOI: 10.1145/3625383

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 22 September 2023
    Online AM: 16 August 2023
    Accepted: 13 May 2023
    Revised: 16 March 2023
    Received: 23 August 2022
    Published in TALLIP Volume 22, Issue 9


    Author Tags

    1. Speech generation
    2. neural vocoder

    Qualifiers

    • Short-paper

