A Melody-Unsupervision Model for Singing Voice Synthesis

Choi, Soonbeom; Nam, Juhan

arXiv:2110.06546 (eess)

[Submitted on 13 Oct 2021 (v1), last revised 14 Apr 2022 (this version, v2)]

Abstract: Recent studies in singing voice synthesis have achieved high-quality results by leveraging advances in text-to-speech models based on deep neural networks. One of the main issues in training singing voice synthesis models is that they require melody and lyric labels to be temporally aligned with the audio data. Producing this temporal alignment is time-consuming manual work when preparing the training data. To address the issue, we propose a melody-unsupervision model that requires only audio-and-lyrics pairs without temporal alignment at training time, yet generates singing voice audio from melody and lyrics inputs at inference time. The proposed model is composed of a phoneme classifier and a singing voice generator that are jointly trained in an end-to-end manner. The model can be fine-tuned by adjusting the amount of supervision with temporally aligned melody labels. Through experiments in melody-unsupervision and semi-supervision settings, we compare the audio quality of the synthesized singing voice. We also show that the proposed model can be trained on speech audio with text labels and still generate singing voice at inference time.

Comments:	ICASSP 2022
Subjects:	Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
Cite as:	arXiv:2110.06546 [eess.AS]
	(or arXiv:2110.06546v2 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2110.06546

Submission history

From: Soonbeom Choi
[v1] Wed, 13 Oct 2021 07:42:35 UTC (1,709 KB)
[v2] Thu, 14 Apr 2022 12:48:44 UTC (1,709 KB)
