Abstract
Voice conversion methods have advanced rapidly over the last decade. Studies have shown that speaker characteristics are captured by spectral features as well as by various prosodic features. Most existing conversion methods focus on the spectral features, as they directly represent timbre, while some methods have focused only on the prosodic feature represented by the fundamental frequency. In this paper, a comprehensive framework using deep neural networks (DNNs) to convert both timbre and prosodic features is proposed. The timbre feature is represented by a high-resolution spectral feature, and the prosodic features include F0, intensity and duration. DNNs are well suited to modeling high-dimensional features, and we show that initializing the DNN with our proposed autoencoder pretraining yields good-quality conversion models. This pretraining is tailor-made for voice conversion and leverages an autoencoder to capture the generic spectral shape of the source speech. In addition, our framework uses segmental DNN models to capture the evolution of the prosodic features over time. To reconstruct the converted speech, the spectral feature produced by the DNN model is combined with the three prosodic features produced by the DNN segmental models. Our experimental results show that applying both prosodic and high-resolution spectral features leads to high-quality converted speech, as measured by objective evaluation and subjective listening tests.
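As a rough illustration only (not the authors' implementation), the sketch below shows, in PyTorch, how an autoencoder-pretrained spectral conversion DNN of the kind described above could be wired together: an autoencoder is first trained to reconstruct source spectral frames, and its encoder then initializes the mapping network from source to target spectra. The names and sizes (SPEC_DIM, HID_DIM, SpectralAutoencoder, ConversionDNN, pretrain_autoencoder) and the layer/activation choices are assumptions for illustration, not the configuration used in the paper.

```python
# Hypothetical sketch: autoencoder pretraining followed by a spectral
# conversion DNN, as outlined in the abstract. Dimensions and layers are
# illustrative assumptions, not the authors' settings.
import torch
import torch.nn as nn

SPEC_DIM = 513   # assumed high-resolution spectral frame dimension
HID_DIM = 1024   # assumed hidden layer width


class SpectralAutoencoder(nn.Module):
    """Autoencoder used only for pretraining; learns the generic spectral shape."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(SPEC_DIM, HID_DIM), nn.Sigmoid())
        self.decoder = nn.Linear(HID_DIM, SPEC_DIM)

    def forward(self, x):
        return self.decoder(self.encoder(x))


class ConversionDNN(nn.Module):
    """Source-to-target spectral mapping, initialized from the pretrained encoder."""
    def __init__(self, pretrained: SpectralAutoencoder):
        super().__init__()
        self.net = nn.Sequential(
            pretrained.encoder,                      # reuse pretrained encoder weights
            nn.Linear(HID_DIM, HID_DIM), nn.Sigmoid(),
            nn.Linear(HID_DIM, SPEC_DIM),
        )

    def forward(self, x):
        return self.net(x)


def pretrain_autoencoder(frames, epochs=10, lr=1e-3):
    """Minimize reconstruction error on source spectral frames (N x SPEC_DIM tensor)."""
    ae = SpectralAutoencoder()
    opt = torch.optim.Adam(ae.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(ae(frames), frames)
        loss.backward()
        opt.step()
    return ae
```

After pretraining, the ConversionDNN would be fine-tuned on aligned source-target frame pairs with a regression loss; the segmental models for F0, intensity and duration described in the abstract would be trained analogously on segment-level prosodic trajectories.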
Acknowledgments
This research is supported by the National Research Foundation, Prime Minister's Office, Singapore, under its IDM Futures Funding Initiative, and administered by the Interactive and Digital Media Programme Office.
Cite this article
Nguyen, H.Q., Lee, S.W., Tian, X. et al. High quality voice conversion using prosodic and high-resolution spectral features. Multimed Tools Appl 75, 5265–5285 (2016). https://doi.org/10.1007/s11042-015-3039-x