A fast method for high-resolution voiced/unvoiced detection and glottal closure/opening instant estimation of speech

Published: 01 February 2016

Abstract

We propose a fast speech analysis method that simultaneously performs high-resolution voiced/unvoiced detection (VUD) and accurate estimation of glottal closure and glottal opening instants (GCIs and GOIs, respectively). The proposed algorithm exploits the structure of the glottal flow derivative to estimate GCIs and GOIs only in voiced speech, using simple time-domain criteria. We compare our method with three well-known GCI/GOI methods: the dynamic programming projected phase-slope algorithm (DYPSA), Yet Another GCI/GOI Algorithm (YAGA), and speech event detection using the residual excitation and a mean-based signal (SEDREAMS). Furthermore, we examine the performance of these methods when combined with two state-of-the-art VUD algorithms: the robust algorithm for pitch tracking (RAPT) and the summation of residual harmonics (SRH). Experiments conducted on the APLAWD and SAM databases show that the proposed algorithm outperforms the state-of-the-art combinations of VUD and GCI/GOI algorithms with respect to almost all evaluation criteria for clean speech. Experiments on speech contaminated with several noise types (white Gaussian, babble, and car-interior) are also presented and discussed. The proposed algorithm outperforms the state-of-the-art combinations in most evaluation criteria for signal-to-noise ratios greater than 10 dB.
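The noisy-speech experiments summarized above rely on contaminating clean speech with noise at a controlled signal-to-noise ratio. The abstract does not say how the mixtures were built, so the following is only a minimal sketch of one common approach: scale the noise so that the ratio of mean speech power to mean noise power matches a target value in dB. The function names `power`, `mix_at_snr`, and `measured_snr_db` are hypothetical, the sinusoid is a toy stand-in for speech, and the authors may have measured SNR differently (for example, over active speech only).

```python
import math
import random


def power(x):
    """Mean squared amplitude of a sequence of samples."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(speech, noise, snr_db):
    """Return (speech + gain * noise, gain) so the mixture has the requested SNR in dB.

    Assumption: SNR is defined on mean power over the whole signal.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    p_speech = power(speech)
    p_noise = power(noise)
    if p_speech == 0 or p_noise == 0:
        raise ValueError("inputs must have non-zero power")
    # SNR_dB = 10 * log10(p_speech / (gain**2 * p_noise)), solved for gain.
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)], gain


def measured_snr_db(speech, noisy):
    """SNR in dB of a noisy signal relative to its clean reference."""
    residual = [y - s for s, y in zip(speech, noisy)]
    return 10 * math.log10(power(speech) / power(residual))


# Toy stand-ins: a 100 Hz sinusoid sampled at 8 kHz and white Gaussian noise.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 100 * n / 8000) for n in range(8000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(8000)]
noisy, gain = mix_at_snr(speech, noise, snr_db=10)
```

Calling `measured_snr_db(speech, noisy)` on the result recovers the requested 10 dB up to floating-point error, which is a quick sanity check before running any detector on the mixture.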

Information & Contributors

Information

Published In

IEEE/ACM Transactions on Audio, Speech and Language Processing  Volume 24, Issue 2
February 2016
185 pages
ISSN:2329-9290
EISSN:2329-9304
  • Editor: Haizhou Li

Publisher

IEEE Press

Publication History

Published: 01 February 2016
Accepted: 22 November 2015
Revised: 28 July 2015
Received: 01 April 2015
Published in TASLP Volume 24, Issue 2

Author Tags

  1. glottal closure instants (GCIs)
  2. glottal opening instants (GOIs)
  3. pitch estimation
  4. speech analysis
  5. voiced/unvoiced detection (VUD)

Qualifiers

  • Article

Cited By

  • (2022) Application of big data language recognition technology and GPU parallel computing in English teaching visualization system. International Journal of Speech Technology 25:3 (667-677). DOI: 10.1007/s10772-021-09904-1. Online publication date: 1-Sep-2022
  • (2022) A Recurrence Network Approach for Characterization and Detection of Dynamical Transitions During Human Speech Production. Circuits, Systems, and Signal Processing 41:12 (6975-6998). DOI: 10.1007/s00034-022-02103-6. Online publication date: 1-Dec-2022
  • (2021) Mel Scale-Based Linear Prediction Approach to Reduce the Prediction Filter Order in CELP Paradigm. Circuits, Systems, and Signal Processing 40:8 (3813-3835). DOI: 10.1007/s00034-021-01647-3. Online publication date: 1-Aug-2021
  • (2021) Toward Improving the Performance of Epoch Extraction from Telephonic Speech. Circuits, Systems, and Signal Processing 40:4 (2050-2064). DOI: 10.1007/s00034-020-01551-2. Online publication date: 1-Apr-2021
  • (2020) Analysis of algorithms to estimate glottal closure instants from speech signals. International Journal of Speech Technology 23:4 (825-849). DOI: 10.1007/s10772-020-09752-5. Online publication date: 1-Dec-2020
  • (2020) Context-Aware XGBoost for Glottal Closure Instant Detection in Speech Signal. Text, Speech, and Dialogue (446-455). DOI: 10.1007/978-3-030-58323-1_48. Online publication date: 8-Sep-2020
  • (2019) Automatic detection of consonant omission in cleft palate speech. International Journal of Speech Technology 22:1 (59-65). DOI: 10.1007/s10772-018-09570-w. Online publication date: 1-Mar-2019
  • (2018) PSFM—A Probabilistic Source Filter Model for Noise Robust Glottal Closure Instant Detection. IEEE/ACM Transactions on Audio, Speech and Language Processing 26:9 (1645-1657). DOI: 10.1109/TASLP.2018.2834733. Online publication date: 1-Sep-2018
  • (2018) Robust Detection of Glottal Activity Using Unwrapped Phase Electroglottographic Signal. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (5584-9). DOI: 10.1109/ICASSP.2018.8461565. Online publication date: 15-Apr-2018
  • (2018) Epoch Estimation from Emotional Speech Signals Using Variational Mode Decomposition. Circuits, Systems, and Signal Processing 37:8 (3245-3274). DOI: 10.1007/s00034-018-0804-x. Online publication date: 1-Aug-2018
