Abstract
Unit selection speech synthesis creates novel utterances by concatenating segments of speech drawn from a large database. The sequence of segments is chosen using a cost function. In particular, the join cost determines how well consecutive segments fit together: acoustic parameters are extracted from frames of speech on either side of a potential join point, and the distance between them is calculated. Although many different metrics have been proposed, there is very little agreement on what constitutes an appropriate window length, with values in the literature ranging from 5 ms to 30 ms. Clearly, the performance of different metrics cannot be compared meaningfully while the role of such a fundamental parameter as window length remains uninvestigated for real speech signals. Here we address this shortcoming by focusing on one of the most common parameterisations, the mel-frequency cepstral coefficients (MFCCs) [1]. We show experimentally that the choice of window length has a direct impact on the MFCC values calculated, and that the ability of the distance measure to predict discontinuity varies with both the width of the windowing function and whether the sounds are vowels, voiceless fricatives or voiced fricatives.
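The join cost described above can be illustrated with a minimal sketch: compute MFCCs from a single windowed frame on each side of a candidate join point and take the Euclidean distance between the two vectors. This is not the authors' implementation; the function names, the Hamming window, the filterbank size, and the test signals are all illustrative assumptions, and the point is only that the resulting cost depends on the `window_ms` parameter.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale from 0 Hz to sr/2.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        lo, mid, hi = bins[i - 1], bins[i], bins[i + 1]
        for k in range(lo, mid):
            if mid > lo:
                fb[i - 1, k] = (k - lo) / (mid - lo)
        for k in range(mid, hi):
            if hi > mid:
                fb[i - 1, k] = (hi - k) / (hi - mid)
    return fb

def mfcc_frame(frame, sr, n_filters=26, n_ceps=12, n_fft=512):
    # Hamming-window the frame, take the power spectrum, apply the mel
    # filterbank, log-compress, then DCT-II to get cepstral coefficients.
    windowed = frame * np.hamming(len(frame))
    spec = np.abs(np.fft.rfft(windowed, n_fft)) ** 2
    energies = np.log(mel_filterbank(n_filters, n_fft, sr) @ spec + 1e-10)
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / n_filters)
    return dct @ energies

def join_cost(unit_a, unit_b, sr, window_ms):
    # Euclidean distance between MFCCs of the last window of unit A
    # and the first window of unit B; window_ms sets the frame length.
    n = int(sr * window_ms / 1000)
    return float(np.linalg.norm(mfcc_frame(unit_a[-n:], sr)
                                - mfcc_frame(unit_b[:n], sr)))

# Toy example: a periodic "vowel-like" unit joined to a noisy
# "fricative-like" unit, costed at the window lengths from the paper.
sr = 16000
t = np.arange(sr) / sr
vowel_like = np.sin(2 * np.pi * 120 * t) + 0.5 * np.sin(2 * np.pi * 240 * t)
noise_like = 0.3 * np.random.default_rng(0).standard_normal(sr)
for w in (5, 10, 20, 30):
    print(f"{w:2d} ms window -> join cost {join_cost(vowel_like, noise_like, sr, w):.2f}")
```

Running the loop shows the cost changing with window length for the same pair of units, which is precisely why the paper argues the window length must be fixed and investigated before distance metrics can be compared.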
References
Davis, S.B., Mermelstein, P.: Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing 28(4), 357–366 (1980)
Hunt, A.J., Black, A.W.: Unit selection in a concatenative speech synthesis system using a large speech database. In: Proceedings of ICASSP, vol. 1, pp. 373–376 (1996)
O’Shaughnessy, D.: Speech communication: human and machine, p. 150. Addison-Wesley (1987)
Memon, S., Lech, M., Maddage, N., He, L.: Application of the Vector Quantization Methods and the Fused MFCC-IMFCC Features in the GMM based Speaker Recognition. In: Zaher, A.A. (ed.) Recent Advances in Signal Processing. InTech (2009)
Kirkpatrick, B., O’Brien, D., Scaife, R.: A comparison of spectral continuity measures as a join cost in concatenative speech synthesis. In: Proceedings of the IET Irish Signals and Systems Conference, ISSC (2006)
Kelly, A.C.: Join Cost Optimisation for Unit Selection Speech Synthesis. Poster, São Paulo School of Advanced Studies in Speech Dynamics, Brazil (2010), http://www.dinafon.iel.unicamp.br/spsassd_files/posterAmeliaKelly.pdf
Wouters, J., Macon, M.W.: Perceptual evaluation of distance measures for concatenative speech synthesis. In: International Conference on Spoken Language Processing, ICSLP (1998)
Klabbers, E., Veldhuis, R.: On the Reduction of Concatenation Artefacts in Diphone Synthesis. In: Proceedings of ICSLP (1998)
Chen, J.D., Campbell, N.: Objective distance measures for assessing concatenative speech synthesis. In: Proceedings of Eurospeech (1999)
Stylianou, Y., Syrdal, A.: Perceptual and Objective Detection of Discontinuities in Concatenative Speech Synthesis. In: Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (2001)
Vepa, J., King, S., Taylor, P.: New objective distance measures for spectral discontinuities in concatenative speech synthesis. In: Proceedings of the 2002 IEEE Workshop on Speech Synthesis (2002)
Pantazis, Y., Stylianou, Y., Klabbers, E.: Discontinuity Detection in Concatenated Speech Synthesis based on Nonlinear Speech Analysis. In: Interspeech (2005)
Kominek, J., Black, A.: The CMU ARCTIC speech databases for speech synthesis research. Tech. Rep. CMU-LTI-03-177. Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA (2003), http://www.festvox.org/cmuarctic/
© 2011 Springer-Verlag Berlin Heidelberg
Cite this paper
Kelly, A.C., Gobl, C. (2011). The Effects of Windowing on the Calculation of MFCCs for Different Types of Speech Sounds. In: Travieso-González, C.M., Alonso-Hernández, J.B. (eds) Advances in Nonlinear Speech Processing. NOLISP 2011. Lecture Notes in Computer Science(), vol 7015. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25020-0_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-25019-4
Online ISBN: 978-3-642-25020-0