Building Bilingual and Code-Switched Voice Conversion with Limited Training Data Using Embedding Consistency Loss

Yang, Yaogen; Zhang, Haozhe; Qin, Xiaoyi; Liang, Shanshan; Cui, Huahua; Xu, Mingyang; Li, Ming

arXiv:2104.10832 (eess)

[Submitted on 22 Apr 2021]

Abstract: Building cross-lingual voice conversion (VC) systems for multiple speakers and multiple languages has long been a challenging task. This paper describes a parallel non-autoregressive network that achieves bilingual and code-switched voice conversion for multiple speakers when only monolingual corpora are available for each language. We achieve cross-lingual VC between multi-speaker Mandarin speech and multi-speaker English speech by applying bilingual bottleneck features. To boost voice cloning performance, we use an adversarial speaker classifier with a gradient reversal layer to reduce the source speaker's information in the encoder output. Furthermore, to improve speaker similarity between the reference speech and the converted speech, we adopt an embedding consistency loss between the synthesized speech and its natural reference speech in our network. Experimental results show that the proposed method achieves high-quality converted speech, with a mean opinion score (MOS) of around 4. The conversion system performs well in terms of speaker similarity for both in-set speaker conversion and out-of-set one-shot conversion.
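
The two training signals named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: the embedding consistency loss is written as a cosine distance between two speaker embeddings, the gradient reversal layer is shown as an identity forward pass paired with a sign-flipped, scaled backward pass, and the names `embedding_consistency_loss`, `grl_forward`, `grl_backward`, and `lam` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def embedding_consistency_loss(ref_embedding, converted_embedding):
    """Assumed form of the loss: 1 - cosine similarity.

    Returns 0 when the speaker embedding of the converted speech points
    in the same direction as that of the natural reference speech, and
    grows as the two embeddings diverge.
    """
    return 1.0 - cosine_similarity(ref_embedding, converted_embedding)


def grl_forward(features):
    """Gradient reversal layer, forward pass: features pass unchanged."""
    return list(features)


def grl_backward(grad_from_classifier, lam=1.0):
    """Gradient reversal layer, backward pass.

    The gradient coming from the speaker classifier is negated and
    scaled by `lam` (a hypothetical weight), so the encoder is pushed
    to make speakers harder to classify.
    """
    return [-lam * g for g in grad_from_classifier]


if __name__ == "__main__":
    ref = [0.2, 0.9, 0.1]    # toy embedding of the reference speech
    conv = [0.25, 0.8, 0.2]  # toy embedding of the converted speech
    print("consistency loss:", embedding_consistency_loss(ref, conv))
    print("reversed gradient:", grl_backward([0.5, -2.0], lam=0.5))
```

In a real system both pieces would sit inside an automatic differentiation framework, with the embeddings produced by a speaker encoder; the plain-Python version only shows the arithmetic.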

Comments:	Submitted to Interspeech 2021
Subjects:	Audio and Speech Processing (eess.AS); Sound (cs.SD)
Cite as:	arXiv:2104.10832 [eess.AS]
	(or arXiv:2104.10832v1 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2104.10832

Submission history

From: Yaogen Yang
[v1] Thu, 22 Apr 2021 02:43:26 UTC (1,384 KB)
