Research Article
DOI: 10.1145/3461615.3491115

Noise Robust Singing Voice Synthesis Using Gaussian Mixture Variational Autoencoder

Published: 17 December 2021
Abstract

Generating high-quality singing voice usually depends on a sizable studio-quality singing corpus, which is difficult and expensive to collect. In contrast, plenty of singing voice data can be found on the Internet. However, such found data may be mixed with accompaniment or contaminated by environmental noise due to recording conditions. In this paper, we propose a noise-robust singing voice synthesizer that incorporates a Gaussian Mixture Variational Autoencoder (GMVAE) as the noise encoder to handle different noise conditions, generating clean singing voice from lyrics for the target speaker. Specifically, the proposed synthesizer learns a multi-modal latent representation of various noise conditions in a continuous space, without requiring an auxiliary noise classifier for noise representation learning or clean reference audio at the inference stage. Experiments show that the proposed synthesizer can generate clean, high-quality singing voice for the target speaker, with a MOS close to that of singing voice reconstructed from the ground-truth mel-spectrogram with the Griffin-Lim vocoder. Experiments also demonstrate the robustness of our approach under complex noise conditions.
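As a concrete illustration of the noise-encoder idea described above, here is a minimal PyTorch sketch of an utterance-level encoder whose latent prior is a learnable mixture of Gaussians, trained with the reparameterization trick and a one-sample Monte Carlo KL term. Everything here (the name NoiseEncoder, the GRU backbone, latent_dim=16, n_components=4) is an illustrative assumption, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.distributions as D


class NoiseEncoder(nn.Module):
    """Encodes a mel-spectrogram into an utterance-level noise latent whose
    prior is a learnable mixture of Gaussians (one mode per noise condition)."""

    def __init__(self, n_mels=80, latent_dim=16, n_components=4):
        super().__init__()
        self.rnn = nn.GRU(n_mels, 128, batch_first=True)
        self.to_mu = nn.Linear(128, latent_dim)
        self.to_logvar = nn.Linear(128, latent_dim)
        # Learnable parameters of the Gaussian-mixture prior p(z).
        self.prior_logits = nn.Parameter(torch.zeros(n_components))
        self.prior_mu = nn.Parameter(torch.randn(n_components, latent_dim))
        self.prior_logvar = nn.Parameter(torch.zeros(n_components, latent_dim))

    def prior(self):
        mixture = D.Categorical(logits=self.prior_logits)
        components = D.Independent(
            D.Normal(self.prior_mu, (0.5 * self.prior_logvar).exp()), 1)
        return D.MixtureSameFamily(mixture, components)

    def forward(self, mel):                  # mel: (batch, frames, n_mels)
        _, h = self.rnn(mel)                 # final hidden state: (1, batch, 128)
        h = h.squeeze(0)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        std = (0.5 * logvar).exp()
        z = mu + std * torch.randn_like(mu)  # reparameterization trick
        q = D.Independent(D.Normal(mu, std), 1)
        # KL(q || GMM prior) has no closed form; use a one-sample estimate.
        kl = (q.log_prob(z) - self.prior().log_prob(z)).mean()
        return z, kl


encoder = NoiseEncoder()
mel = torch.randn(2, 200, 80)                # two 200-frame utterances
z, kl = encoder(mel)                         # z would condition the decoder
print(z.shape, kl.item())                    # torch.Size([2, 16])
```

At inference time, matching the abstract's no-clean-reference setting, one would sample z from the prior component associated with the clean condition rather than encoding a reference recording. For the MOS anchor mentioned above (singing voice reconstructed from the ground-truth mel-spectrogram with the Griffin-Lim vocoder), a rough equivalent can be produced with librosa; the paper does not specify its tooling, so this is only one plausible way to obtain such a reconstruction:

```python
import librosa

y, sr = librosa.load(librosa.ex('trumpet'))  # stand-in for a singing clip
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256)
# Invert the mel-spectrogram back to audio; librosa applies Griffin-Lim internally.
y_rec = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=1024, hop_length=256)
```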




Published In

ICMI '21 Companion: Companion Publication of the 2021 International Conference on Multimodal Interaction
October 2021, 418 pages
ISBN: 9781450384711
DOI: 10.1145/3461615

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

1. Found data
2. Gaussian mixture variational autoencoder
3. Singing voice synthesis

Qualifiers

• Research-article
• Research
• Refereed limited

Conference

ICMI '21: International Conference on Multimodal Interaction
October 18-22, 2021
Montréal, QC, Canada

Acceptance Rates

Overall Acceptance Rate: 453 of 1,080 submissions, 42%

