
Data augmentation based non-parallel voice conversion with frame-level speaker disentangler

Published: 01 January 2022

Abstract

Non-parallel voice conversion is a popular and challenging research area. The main task is to build acoustic mappings from the source speaker to the target speaker at different units (e.g., frame, phoneme, cluster, sentence). With the help of recent high-quality speech synthesis techniques, it is possible to produce parallel speech directly from non-parallel data. This paper proposes ParaGen, a data-augmentation-based technique for non-parallel voice conversion. The system consists of a text-to-speech model with a speaker disentangler and a simple frame-to-frame spectrogram conversion model. The text-to-speech model takes text and a reference audio as input and produces speech with the target speaker's identity and the time-aligned local speaking style of the reference audio. The spectrogram conversion model then converts the source spectrogram to the target speaker frame by frame. The local speaking style is extracted by an acoustic encoder, while the speaker identity is removed by a conditional convolutional disentangler. The local style encodings are time-aligned with the text encodings through an attention mechanism, and the attention contexts are decoded by a conditional recurrent decoder. Experiments show that the speaker identity of the source speech is converted to the target speaker while the local speaking style (e.g., prosody) is preserved after augmentation. The method is compared with an augmentation model based on typical statistical parametric speech synthesis (SPSS) with pre-aligned phoneme durations. The results show that the converted speech is more natural than that of the SPSS system, while the speaker similarities of the two systems are close.
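
To make the second stage concrete, below is a minimal sketch of a frame-to-frame spectrogram conversion model trained on the augmented, time-aligned pairs. The paper only calls this a "simple network", so the choice of PyTorch, the 80-dimensional mel features, the layer sizes, the L1 loss, and all names here are illustrative assumptions, not the authors' implementation:

import torch
import torch.nn as nn

class FrameConverter(nn.Module):
    """Frame-to-frame spectrogram conversion: each source mel frame is
    mapped independently to the corresponding target frame.
    (Hypothetical sketch; sizes are illustrative.)"""

    def __init__(self, n_mels: int = 80, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_mels, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_mels),
        )

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, frames, n_mels); the linear stack is applied per frame
        return self.net(mel)

# One training step on an augmented pair: the source spectrogram and the
# TTS-generated target spectrogram have the same number of frames.
model = FrameConverter()
optim = torch.optim.Adam(model.parameters(), lr=1e-4)
src = torch.randn(8, 200, 80)  # stand-in batch of source mel spectrograms
tgt = torch.randn(8, 200, 80)  # time-aligned augmented targets
optim.zero_grad()
loss = nn.functional.l1_loss(model(src), tgt)
loss.backward()
optim.step()

Because the augmented target is time-aligned with the source spectrogram by construction, no duration modeling or attention alignment is needed at this stage, which is what makes such a small frame-wise model workable.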

Highlights

We propose a data-augmentation-based technique for non-parallel voice conversion.
It produces time-aligned parallel data with the same frame-level speaking style.
We use a frame-level adversarial loss to remove speaker identity (see the sketch after this list).
We propose two separate speaker embeddings, one before and one after the attention mechanism.
We use stacked 2D CNNs with conditional 1D CNNs to extract the local speaking style.
With the augmented data, a simple network suffices to build the voice conversion model.
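
As referenced in the highlights above, below is a minimal sketch of how a stacked-2D-CNN local style encoder and a frame-level adversarial speaker classifier could fit together. The paper describes a conditional convolutional disentangler; a gradient-reversal layer is one common way to realize a frame-level adversarial loss, and it, along with every dimension, channel size, and name below, is an assumption of this sketch (PyTorch again):

import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips and scales gradients on the
    way back (a common realization of an adversarial loss)."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None

class LocalStyleEncoder(nn.Module):
    """Stacked 2D CNNs over the mel spectrogram, followed by 1D CNNs
    conditioned on a speaker embedding, yielding one style vector per
    frame. (Hypothetical sketch; the paper's exact layout may differ.)"""

    def __init__(self, n_mels=80, spk_dim=64, style_dim=128):
        super().__init__()
        self.conv2d = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Conditioning: the speaker embedding is broadcast over time and
        # concatenated to the channel axis before the 1D convolutions.
        self.conv1d = nn.Sequential(
            nn.Conv1d(32 * n_mels + spk_dim, style_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(style_dim, style_dim, kernel_size=3, padding=1),
        )

    def forward(self, mel, spk_emb):
        # mel: (B, T, n_mels); spk_emb: (B, spk_dim)
        b, t, _ = mel.shape
        h = self.conv2d(mel.unsqueeze(1))             # (B, 32, T, n_mels)
        h = h.permute(0, 2, 1, 3).reshape(b, t, -1)   # (B, T, 32*n_mels)
        cond = spk_emb.unsqueeze(1).expand(-1, t, -1)
        h = torch.cat([h, cond], dim=-1).transpose(1, 2)
        return self.conv1d(h).transpose(1, 2)         # (B, T, style_dim)

class FrameSpeakerClassifier(nn.Module):
    """Per-frame speaker classifier behind the gradient-reversal layer:
    training it to identify the speaker pushes the style encoder to drop
    speaker identity from every frame."""

    def __init__(self, style_dim=128, n_speakers=100, lam=1.0):
        super().__init__()
        self.lam = lam
        self.net = nn.Conv1d(style_dim, n_speakers, kernel_size=1)

    def forward(self, style):
        h = GradReverse.apply(style, self.lam)
        return self.net(h.transpose(1, 2))            # (B, n_speakers, T)

# One adversarial step: cross-entropy against the true speaker at every frame.
enc, cls = LocalStyleEncoder(), FrameSpeakerClassifier()
mel = torch.randn(4, 200, 80)
spk_emb = torch.randn(4, 64)
spk_id = torch.randint(0, 100, (4,))
style = enc(mel, spk_emb)
logits = cls(style)
adv_loss = nn.functional.cross_entropy(logits, spk_id.unsqueeze(1).expand(-1, 200))

Training the classifier to recognize the source speaker at every frame, while the reversed gradients push the encoder the opposite way, is what strips speaker identity from the frame-level style encodings.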



Published In

Speech Communication, Volume 136, Issue C, January 2022, 128 pages

Publisher

Elsevier Science Publishers B.V., Netherlands


          Author Tags

          1. Voice conversion
          2. Data augmentation
          3. Speaker disentanglement
          4. Style extraction
          5. Non-parallel data

          Qualifiers

          • Research-article
