
Data augmentation based non-parallel voice conversion with frame-level speaker disentangler

Published: 01 January 2022

Abstract

Non-parallel voice conversion is a popular and challenging research area. The main task is to build acoustic mappings from the source speaker to the target speaker at different units (e.g., frame, phoneme, cluster, sentence). With the help of recent high-quality speech synthesis techniques, it is possible to produce parallel speech directly from non-parallel data. This paper proposes ParaGen, a data-augmentation-based technique for non-parallel voice conversion. The system consists of a text-to-speech model with a speaker disentangler and a simple frame-to-frame spectrogram conversion model. The text-to-speech model takes text and a reference audio as input and produces speech with the target speaker's identity and the time-aligned local speaking style of the reference audio. The spectrogram conversion model then converts the source spectrogram to the target speaker frame by frame. The local speaking style is extracted by an acoustic encoder, while the speaker identity is removed by a conditional convolutional disentangler. The local style encodings are time-aligned with the text encodings through an attention mechanism, and the attention contexts are decoded by a conditional recurrent decoder. Experiments show that the speaker identity of the source speech is converted to the target speaker while the local speaking style (e.g., prosody) is preserved after augmentation. The method is compared with an augmentation model based on typical statistical parametric speech synthesis (SPSS) with pre-aligned phoneme durations. The results show that the converted speech is more natural than that of the SPSS system, while the speaker similarities of the two systems are close.
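
To make the second stage concrete, below is a minimal sketch of a frame-to-frame spectrogram conversion model trained on the augmented, time-aligned pairs. The paper only calls this a "simple network", so the choice of PyTorch, the 80-dimensional mel features, the layer sizes, the L1 loss, and all names here are illustrative assumptions, not the authors' implementation:

import torch
import torch.nn as nn

class FrameConverter(nn.Module):
    """Frame-to-frame spectrogram conversion: each source mel frame is
    mapped independently to the corresponding target frame.
    (Hypothetical sketch; sizes are illustrative.)"""

    def __init__(self, n_mels: int = 80, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_mels, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_mels),
        )

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, frames, n_mels); the linear stack is applied per frame
        return self.net(mel)

# One training step on an augmented pair: the source spectrogram and the
# TTS-generated target spectrogram have the same number of frames.
model = FrameConverter()
optim = torch.optim.Adam(model.parameters(), lr=1e-4)
src = torch.randn(8, 200, 80)  # stand-in batch of source mel spectrograms
tgt = torch.randn(8, 200, 80)  # time-aligned augmented targets
optim.zero_grad()
loss = nn.functional.l1_loss(model(src), tgt)
loss.backward()
optim.step()

Because the augmented target is time-aligned with the source spectrogram by construction, no duration modeling or attention alignment is needed at this stage, which is what makes such a small frame-wise model workable.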

Highlights

We propose a data-augmentation-based technique for non-parallel voice conversion.
It produces time-aligned parallel data with the same frame-level speaking style.
We use a frame-level adversarial loss to remove speaker identity (see the sketch after this list).
We propose two separate speaker embeddings, one before and one after the attention mechanism.
We use stacked 2D CNNs with conditional 1D CNNs to extract the local speaking style.
With the augmented data, a simple network suffices to build the voice conversion model.
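
As referenced in the highlights above, below is a minimal sketch of how a stacked-2D-CNN local style encoder and a frame-level adversarial speaker classifier could fit together. The paper describes a conditional convolutional disentangler; a gradient-reversal layer is one common way to realize a frame-level adversarial loss, and it, along with every dimension, channel size, and name below, is an assumption of this sketch (PyTorch again):

import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips and scales gradients on the
    way back (a common realization of an adversarial loss)."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None

class LocalStyleEncoder(nn.Module):
    """Stacked 2D CNNs over the mel spectrogram, followed by 1D CNNs
    conditioned on a speaker embedding, yielding one style vector per
    frame. (Hypothetical sketch; the paper's exact layout may differ.)"""

    def __init__(self, n_mels=80, spk_dim=64, style_dim=128):
        super().__init__()
        self.conv2d = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Conditioning: the speaker embedding is broadcast over time and
        # concatenated to the channel axis before the 1D convolutions.
        self.conv1d = nn.Sequential(
            nn.Conv1d(32 * n_mels + spk_dim, style_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(style_dim, style_dim, kernel_size=3, padding=1),
        )

    def forward(self, mel, spk_emb):
        # mel: (B, T, n_mels); spk_emb: (B, spk_dim)
        b, t, _ = mel.shape
        h = self.conv2d(mel.unsqueeze(1))             # (B, 32, T, n_mels)
        h = h.permute(0, 2, 1, 3).reshape(b, t, -1)   # (B, T, 32*n_mels)
        cond = spk_emb.unsqueeze(1).expand(-1, t, -1)
        h = torch.cat([h, cond], dim=-1).transpose(1, 2)
        return self.conv1d(h).transpose(1, 2)         # (B, T, style_dim)

class FrameSpeakerClassifier(nn.Module):
    """Per-frame speaker classifier behind the gradient-reversal layer:
    training it to identify the speaker pushes the style encoder to drop
    speaker identity from every frame."""

    def __init__(self, style_dim=128, n_speakers=100, lam=1.0):
        super().__init__()
        self.lam = lam
        self.net = nn.Conv1d(style_dim, n_speakers, kernel_size=1)

    def forward(self, style):
        h = GradReverse.apply(style, self.lam)
        return self.net(h.transpose(1, 2))            # (B, n_speakers, T)

# One adversarial step: cross-entropy against the true speaker at every frame.
enc, cls = LocalStyleEncoder(), FrameSpeakerClassifier()
mel = torch.randn(4, 200, 80)
spk_emb = torch.randn(4, 64)
spk_id = torch.randint(0, 100, (4,))
style = enc(mel, spk_emb)
logits = cls(style)
adv_loss = nn.functional.cross_entropy(logits, spk_id.unsqueeze(1).expand(-1, 200))

Training the classifier to recognize the source speaker at every frame, while the reversed gradients push the encoder the opposite way, is what strips speaker identity from the frame-level style encodings.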



Published In

Speech Communication, Volume 136, Issue C, January 2022, 128 pages

Publisher

Elsevier Science Publishers B.V., Netherlands


          Author Tags

          1. Voice conversion
          2. Data augmentation
          3. Speaker disentanglement
          4. Style extraction
          5. Non-parallel data

          Qualifiers

          • Research-article
