Short Paper

Improving Generative Adversarial Network-based Vocoding through Multi-scale Convolution

Published: 22 September 2023

Abstract

Vocoding is a sub-process of the text-to-speech task that generates audio from intermediate representations between text and audio. Several recent works have shown that generative adversarial network (GAN)–based vocoders can generate high-quality audio. While GAN-based neural vocoders synthesize speech faster than autoregressive vocoders, their audio fidelity still cannot match ground-truth samples. One major cause of the degraded audio quality and blurred spectrograms is the average pooling layers in the discriminator. Because the multi-scale discriminator commonly used by recent GAN-based vocoders applies several average pooling layers to capture different frequency bands, we believe it is crucial to prevent high-frequency information from leaking away during the average pooling process. This article proposes MSCGAN, which solves this problem and achieves higher-fidelity speech synthesis. We demonstrate that substituting a multi-scale convolution architecture for the average pooling process effectively retains high-frequency features and thus forces the generator to recover audio details in both the time and frequency domains. Compared with other state-of-the-art GAN-based vocoders, MSCGAN produces competitive audio with higher spectrogram clarity and a higher mean opinion score (MOS) in subjective human evaluation.
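The core intuition of the abstract — that a fixed averaging kernel acts as a low-pass filter and discards high-frequency content, while a learned strided convolution covering the same downsampling factor can preserve it — can be illustrated with a minimal NumPy sketch. This is illustrative toy code under assumed shapes, not the paper's actual discriminator; the kernel values are hypothetical.

```python
import numpy as np

def avg_pool_1d(x, k=4):
    # Average pooling with kernel size k and stride k: the fixed uniform
    # kernel low-pass filters the signal, so high-frequency content is
    # attenuated before reaching the next discriminator scale.
    n = len(x) // k
    return x[: n * k].reshape(n, k).mean(axis=1)

def strided_conv_1d(x, w, stride=4):
    # A strided convolution with a learnable kernel w achieves the same
    # downsampling factor, but w is trained, so the network can keep
    # (or even emphasize) high-frequency components instead of averaging
    # them away.
    k = len(w)
    out = []
    for start in range(0, len(x) - k + 1, stride):
        out.append(float(np.dot(x[start : start + k], w)))
    return np.array(out)

# An alternating +1/-1 signal is pure high-frequency content.
x = np.array([1.0, -1.0] * 8)
print(avg_pool_1d(x, k=4))
# -> [0. 0. 0. 0.]  (averaging destroys the signal entirely)
print(strided_conv_1d(x, np.array([1.0, -1.0, 1.0, -1.0]), stride=4))
# -> [4. 4. 4. 4.]  (a suitable kernel passes it through)
```

In the paper's setting the kernel weights are learned jointly with the rest of the discriminator rather than hand-picked as here, but the contrast is the same: the pooling kernel is frozen at uniform weights, the convolutional one is free to retain high-frequency detail.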


    Published In

    ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 9
    September 2023, 226 pages
    ISSN: 2375-4699
    EISSN: 2375-4702
    DOI: 10.1145/3625383

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 22 September 2023
    Online AM: 16 August 2023
    Accepted: 13 May 2023
    Revised: 16 March 2023
    Received: 23 August 2022
    Published in TALLIP Volume 22, Issue 9


    Author Tags

    1. Speech generation
    2. neural vocoder

    Qualifiers

    • Short-paper

