UniSinger: Unified End-to-End Singing Voice Synthesis With Cross-Modality Information Matching

Published: 27 October 2023

  Abstract

    Though previous works have shown remarkable achievements in singing voice generation, most existing models focus on a single application, and unified singing voice synthesis models are lacking. Beyond the low relevance among tasks, the differing input modalities are one of the most intractable hindrances: current methods suffer from information confusion and cannot perform precise control. In this work, we propose UniSinger, a unified end-to-end singing voice synthesizer that integrates three abilities related to singing voice generation: singing voice synthesis (SVS), singing voice conversion (SVC), and singing voice editing (SVE), into a single framework. Specifically, we perform representation disentanglement to control different attributes of the singing voice. We further propose a cross-modality information matching method to close the distribution gap between multi-modal inputs and to enable end-to-end training. Experiments conducted on the OpenSinger dataset demonstrate that UniSinger achieves state-of-the-art results in all three applications. Further extensive experiments verify the capability of representation disentanglement and information matching, showing that UniSinger enjoys great superiority in sample quality, timbre similarity, and multi-task compatibility. Audio samples can be found at https://unisinger.github.io/Samples/.
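    The paper itself does not publish implementation code here, but the representation disentanglement the abstract describes is commonly enforced by minimizing a mutual-information upper bound between the disentangled representations, e.g. the CLUB estimator (Cheng et al., ICML 2020), which appears in this line of work. Below is a minimal NumPy sketch of that estimator under a simplifying assumption: the variational posterior q(z|x) is a fixed-variance Gaussian with a linear mean, standing in for the learned network a real system would use. All names and the toy data are illustrative, not UniSinger's actual code.

    ```python
    import numpy as np

    def club_upper_bound(x, z, W, log_var):
        """CLUB estimate of I(x; z) under an assumed Gaussian q(z|x) = N(Wx, exp(log_var)).

        x: (n, dx) samples of one representation (e.g. content)
        z: (n, dz) paired samples of another (e.g. timbre)
        W, log_var: parameters of the stand-in variational posterior
        """
        mu = x @ W                                   # (n, dz) predicted means
        var = np.exp(log_var)                        # (dz,) variances
        # log q(z_j | x_i) for every pair (i, j), summed over z-dimensions
        diff = z[None, :, :] - mu[:, None, :]        # (n, n, dz)
        log_q = -0.5 * (diff ** 2 / var + log_var + np.log(2 * np.pi)).sum(-1)
        positive = np.mean(np.diag(log_q))           # matched pairs (i, i)
        negative = np.mean(log_q)                    # all pairs (i, j)
        return positive - negative                   # upper-bounds I(x; z)

    rng = np.random.default_rng(0)
    x = rng.normal(size=(256, 4))
    z_dep = x @ rng.normal(size=(4, 2)) + 0.1 * rng.normal(size=(256, 2))  # depends on x
    z_ind = rng.normal(size=(256, 2))                                      # independent of x
    W = np.linalg.lstsq(x, z_dep, rcond=None)[0]     # fit the stand-in posterior
    mi_dep = club_upper_bound(x, z_dep, W, np.zeros(2))
    mi_ind = club_upper_bound(x, z_ind, np.zeros((4, 2)), np.zeros(2))
    print(mi_dep, mi_ind)  # the dependent pair yields the larger MI estimate
    ```

    Minimizing such a bound between, say, content and timbre representations during training pushes them toward statistical independence, which is what makes the attribute-level control claimed in the abstract possible.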


        Published In

        MM '23: Proceedings of the 31st ACM International Conference on Multimedia
        October 2023, 9913 pages
        ISBN: 9798400701085
        DOI: 10.1145/3581783

        Publisher

        Association for Computing Machinery, New York, NY, United States

        Author Tags

        1. singing voice conversion
        2. singing voice editing
        3. singing voice synthesis
        4. unified end-to-end model

        Qualifiers

        • Research-article

        Funding Sources

        • National Natural Science Foundation of China
        • National Key R&D Program of China

        Conference

        MM '23: The 31st ACM International Conference on Multimedia
        October 29 - November 3, 2023
        Ottawa, ON, Canada

        Acceptance Rates

        Overall Acceptance Rate 995 of 4,171 submissions, 24%

