Abstract
Unit selection speech synthesis creates novel utterances by concatenating segments of speech drawn from a large database. The sequence of segments is chosen using a cost function. In particular, the join cost determines how well consecutive segments fit together: acoustic parameters are extracted from frames of speech on either side of a potential join point, and the distance between them is calculated. Although many different metrics have been proposed, there is very little agreement on what constitutes an appropriate window length, with values in the literature ranging from 5 ms to 30 ms. Clearly, the performance of different metrics cannot be compared meaningfully while the role of such a fundamental parameter as window length remains uninvestigated for real speech signals. Here we address this shortcoming by focusing on one of the most common parameterisations, the mel-frequency cepstral coefficients (MFCCs) [1]. We show experimentally that the choice of window length has a direct impact on the MFCC values calculated, and that the ability of the distance measure to predict discontinuity varies with both the width of the windowing function and whether the sounds are vowels, voiceless fricatives or voiced fricatives.
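The join cost described above can be illustrated with a minimal sketch: compute MFCCs from a single windowed frame on each side of a candidate join point and take the Euclidean distance between the two vectors. This is not the authors' implementation; the function names, the Hamming window, the filterbank size, and the test signals are all illustrative assumptions, and the point is only that the resulting cost depends on the `window_ms` parameter.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale from 0 Hz to sr/2.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        lo, mid, hi = bins[i - 1], bins[i], bins[i + 1]
        for k in range(lo, mid):
            if mid > lo:
                fb[i - 1, k] = (k - lo) / (mid - lo)
        for k in range(mid, hi):
            if hi > mid:
                fb[i - 1, k] = (hi - k) / (hi - mid)
    return fb

def mfcc_frame(frame, sr, n_filters=26, n_ceps=12, n_fft=512):
    # Hamming-window the frame, take the power spectrum, apply the mel
    # filterbank, log-compress, then DCT-II to get cepstral coefficients.
    windowed = frame * np.hamming(len(frame))
    spec = np.abs(np.fft.rfft(windowed, n_fft)) ** 2
    energies = np.log(mel_filterbank(n_filters, n_fft, sr) @ spec + 1e-10)
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / n_filters)
    return dct @ energies

def join_cost(unit_a, unit_b, sr, window_ms):
    # Euclidean distance between MFCCs of the last window of unit A
    # and the first window of unit B; window_ms sets the frame length.
    n = int(sr * window_ms / 1000)
    return float(np.linalg.norm(mfcc_frame(unit_a[-n:], sr)
                                - mfcc_frame(unit_b[:n], sr)))

# Toy example: a periodic "vowel-like" unit joined to a noisy
# "fricative-like" unit, costed at the window lengths from the paper.
sr = 16000
t = np.arange(sr) / sr
vowel_like = np.sin(2 * np.pi * 120 * t) + 0.5 * np.sin(2 * np.pi * 240 * t)
noise_like = 0.3 * np.random.default_rng(0).standard_normal(sr)
for w in (5, 10, 20, 30):
    print(f"{w:2d} ms window -> join cost {join_cost(vowel_like, noise_like, sr, w):.2f}")
```

Running the loop shows the cost changing with window length for the same pair of units, which is precisely why the paper argues the window length must be fixed and investigated before distance metrics can be compared.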
References
Davis, S.B., Mermelstein, P.: Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing 28(4), 357–366 (1980)
Hunt, A.J., Black, A.W.: Unit selection in a concatenative speech synthesis system using a large speech database. In: Proceedings of ICASSP, vol. 1, pp. 373–376 (1996)
O’Shaughnessy, D.: Speech communication: human and machine, p. 150. Addison-Wesley (1987)
Memon, S., Lech, M., Maddage, N., He, L.: Application of the Vector Quantization Methods and the Fused MFCC-IMFCC Features in the GMM based Speaker Recognition. In: Zaher, A.A. (ed.) Recent Advances in Signal Processing. InTech (2009)
Kirkpatrick, B., O’Brien, D., Scaife, R.: A comparison of spectral continuity measures as a join cost in concatenative speech synthesis. In: Proceedings of the IET Irish Signals and Systems Conference, ISSC (2006)
Kelly, A.C.: Join Cost Optimisation for Unit Selection Speech Synthesis. Poster, São Paulo School of Advanced Studies in Speech Dynamics, Brazil (2010), http://www.dinafon.iel.unicamp.br/spsassd_files/posterAmeliaKelly.pdf
Wouters, J., Macon, M.W.: Perceptual evaluation of distance measures for concatenative speech synthesis. In: International Conference on Spoken Language Processing, ICSLP (1998)
Klabbers, E., Veldhuis, R.: On the Reduction of Concatenation Artefacts in Diphone Synthesis. In: Proceedings of ICSLP (1998)
Chen, J.D., Campbell, N.: Objective distance measures for assessing concatenative speech synthesis. In: Proceedings of Eurospeech (1999)
Stylianou, Y., Syrdal, A.: Perceptual and Objective Detection of Discontinuities in Concatenative Speech Synthesis. In: Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (2001)
Vepa, J., King, S., Taylor, P.: New objective distance measures for spectral discontinuities in concatenative speech synthesis. In: Proceedings of the 2002 IEEE Workshop on Speech Synthesis (2002)
Pantazis, Y., Stylianou, Y., Klabbers, E.: Discontinuity Detection in Concatenated Speech Synthesis based on Nonlinear Speech Analysis. In: Interspeech (2005)
Kominek, J., Black, A.: The CMU ARCTIC speech databases for speech synthesis research. Tech. Rep. CMU-LTI-03-177. Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA (2003), http://www.festvox.org/cmuarctic/
© 2011 Springer-Verlag Berlin Heidelberg
Cite this paper
Kelly, A.C., Gobl, C. (2011). The Effects of Windowing on the Calculation of MFCCs for Different Types of Speech Sounds. In: Travieso-González, C.M., Alonso-Hernández, J.B. (eds) Advances in Nonlinear Speech Processing. NOLISP 2011. Lecture Notes in Computer Science(), vol 7015. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25020-0_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-25019-4
Online ISBN: 978-3-642-25020-0