Research Article
DOI: 10.1145/3461615.3491115

Noise Robust Singing Voice Synthesis Using Gaussian Mixture Variational Autoencoder

Published: 17 December 2021
Abstract

Generating high-quality singing voice usually depends on a sizable studio-quality singing corpus, which is difficult and expensive to collect. In contrast, plenty of singing voice data can be found on the Internet. However, such found data may be mixed with accompaniment or contaminated by environmental noise due to recording conditions. In this paper, we propose a noise-robust singing voice synthesizer that incorporates a Gaussian Mixture Variational Autoencoder (GMVAE) as the noise encoder to handle different noise conditions, generating clean singing voice from lyrics for the target speaker. Specifically, the proposed synthesizer learns a multi-modal latent representation of various noise conditions in a continuous space, without requiring an auxiliary noise classifier for noise representation learning or clean reference audio at the inference stage. Experiments show that the proposed synthesizer can generate clean, high-quality singing voice for the target speaker, with a MOS close to that of singing voice reconstructed from the ground-truth mel-spectrogram with the Griffin-Lim vocoder. Experiments also demonstrate the robustness of our approach under complex noise conditions.
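As a concrete illustration of the noise-encoder idea described above, here is a minimal PyTorch sketch of an utterance-level encoder whose latent prior is a learnable mixture of Gaussians, trained with the reparameterization trick and a one-sample Monte Carlo KL term. Everything here (the name NoiseEncoder, the GRU backbone, latent_dim=16, n_components=4) is an illustrative assumption, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.distributions as D


class NoiseEncoder(nn.Module):
    """Encodes a mel-spectrogram into an utterance-level noise latent whose
    prior is a learnable mixture of Gaussians (one mode per noise condition)."""

    def __init__(self, n_mels=80, latent_dim=16, n_components=4):
        super().__init__()
        self.rnn = nn.GRU(n_mels, 128, batch_first=True)
        self.to_mu = nn.Linear(128, latent_dim)
        self.to_logvar = nn.Linear(128, latent_dim)
        # Learnable parameters of the Gaussian-mixture prior p(z).
        self.prior_logits = nn.Parameter(torch.zeros(n_components))
        self.prior_mu = nn.Parameter(torch.randn(n_components, latent_dim))
        self.prior_logvar = nn.Parameter(torch.zeros(n_components, latent_dim))

    def prior(self):
        mixture = D.Categorical(logits=self.prior_logits)
        components = D.Independent(
            D.Normal(self.prior_mu, (0.5 * self.prior_logvar).exp()), 1)
        return D.MixtureSameFamily(mixture, components)

    def forward(self, mel):                  # mel: (batch, frames, n_mels)
        _, h = self.rnn(mel)                 # final hidden state: (1, batch, 128)
        h = h.squeeze(0)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        std = (0.5 * logvar).exp()
        z = mu + std * torch.randn_like(mu)  # reparameterization trick
        q = D.Independent(D.Normal(mu, std), 1)
        # KL(q || GMM prior) has no closed form; use a one-sample estimate.
        kl = (q.log_prob(z) - self.prior().log_prob(z)).mean()
        return z, kl


encoder = NoiseEncoder()
mel = torch.randn(2, 200, 80)                # two 200-frame utterances
z, kl = encoder(mel)                         # z would condition the decoder
print(z.shape, kl.item())                    # torch.Size([2, 16])
```

At inference time, matching the abstract's no-clean-reference setting, one would sample z from the prior component associated with the clean condition rather than encoding a reference recording. For the MOS anchor mentioned above (singing voice reconstructed from the ground-truth mel-spectrogram with the Griffin-Lim vocoder), a rough equivalent can be produced with librosa; the paper does not specify its tooling, so this is only one plausible way to obtain such a reconstruction:

```python
import librosa

y, sr = librosa.load(librosa.ex('trumpet'))  # stand-in for a singing clip
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256)
# Invert the mel-spectrogram back to audio; librosa applies Griffin-Lim internally.
y_rec = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=1024, hop_length=256)
```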




Published In

ICMI '21 Companion: Companion Publication of the 2021 International Conference on Multimodal Interaction
October 2021, 418 pages
ISBN: 9781450384711
DOI: 10.1145/3461615

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

1. Found data
2. Gaussian mixture variational autoencoder
3. Singing voice synthesis

Qualifiers

• Research-article
• Research
• Refereed limited

Conference

ICMI '21: International Conference on Multimodal Interaction
October 18-22, 2021
Montréal, QC, Canada

Acceptance Rates

Overall Acceptance Rate: 453 of 1,080 submissions, 42%

