Statistical conversion of silent articulation into audible speech using full-covariance HMM

Published: 01 March 2016

Highlights

  • Conversion of silent articulation captured by ultrasound and video to modal speech.
  • Comparison of GMM and full-covariance phonetic HMM without vocabulary limitation.
  • The HMM-based approach allows the use of linguistic information for regularization.
  • Objective evaluation showed a lower but more fluctuant spectral distortion for the HMM.
  • Perceptual evaluation showed better intelligibility for the HMM on consonants.

Abstract

This article investigates the use of statistical mapping techniques for the conversion of articulatory movements into audible speech with no restriction on the vocabulary, in the context of a silent speech interface driven by ultrasound and video imaging. As a baseline, we first evaluated the GMM-based mapping considering dynamic features proposed by Toda et al. (2007) for voice conversion. We then proposed a 'phonetically informed' version of this technique, based on full-covariance HMM. This approach aims (1) to model explicitly the articulatory timing of each phonetic class, and (2) to exploit linguistic knowledge to regularize the problem of silent speech conversion. Both techniques were compared on continuous speech for two French speakers (one male, one female). For modal speech, the HMM-based technique showed a lower spectral distortion (objective evaluation). However, perceptual tests (transcription and XAB discrimination tests) showed better intelligibility for the GMM-based technique, probably related to its less fluctuant quality. For silent speech, a perceptual identification test revealed better segmental intelligibility for the HMM-based technique on consonants.
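As background for the baseline, the sketch below recalls the standard conditional-Gaussian form of GMM-based feature mapping that underlies the Toda et al. (2007) technique. The notation (x for an articulatory feature vector, y for a spectral feature vector, M mixture components) is generic and chosen for illustration; it is not taken from the paper itself.

% Minimal sketch of GMM-based articulatory-to-acoustic mapping (generic notation,
% not the paper's own). A joint GMM is trained on paired frames z = [x; y]; at
% conversion time, the spectral frame y is predicted from p(y | x).
\begin{align}
  p(y \mid x) &= \sum_{m=1}^{M} p(m \mid x)\,
      \mathcal{N}\!\bigl(y;\, E_m(x),\, D_m\bigr),\\
  E_m(x) &= \mu_m^{(y)} + \Sigma_m^{(yx)}\bigl(\Sigma_m^{(xx)}\bigr)^{-1}\bigl(x - \mu_m^{(x)}\bigr),\\
  D_m   &= \Sigma_m^{(yy)} - \Sigma_m^{(yx)}\bigl(\Sigma_m^{(xx)}\bigr)^{-1}\Sigma_m^{(xy)},
\end{align}
where $\mu_m$ and $\Sigma_m$ are the mean and full covariance of the $m$-th joint component, partitioned into $x$ and $y$ blocks, and $p(m \mid x)$ is the posterior probability of component $m$ given the articulatory frame.

In the trajectory variant used as the baseline, both feature streams are augmented with their dynamic (delta) features and the output sequence is obtained by maximum-likelihood parameter generation under these frame-level distributions, which yields smooth spectral trajectories. As described in the abstract, the proposed variant replaces the unsupervised mixture components with full-covariance, phone-dependent HMM states.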

References

[1]
P. Birkholz, D. Jackèl, B. Kröger, Construction and control of a three-dimensional vocal tract model, in: Proceedings of ICASSP, 2006, pp. 873-876.
[2]
A.L. Buchsbaum, J.P. van Santen, Selecting training inputs via greedy rank covering, in: Proceedings of the 7th Annual ACM-SIAM Symposium on Discrete Algorithms, 1996, pp. 288-295.
[3]
T. Chen, Audiovisual speech processing, Signal Process. Mag. IEEE, 18 (2001) 9-21.
[4]
P. Clarkson, R. Rosenfeld, Statistical language modeling using the CMU-Cambridge toolkit, in: Proceedings of Eurospeech, 1997, pp. 2707-2710.
[5]
P. Combescure, Vingt listes de dix phrases phonétiquement équilibrées, Rev. Acoust., 14 (1981) 34-38.
[6]
B. Denby, Y. Oussar, G. Dreyfus, M. Stone, Prospects for a silent speech interface using ultrasound imaging, in: Proceedings of ICASSP, 2006, pp. 365-368.
[7]
B. Denby, T. Schultz, K. Honda, T. Hueber, J. Gilbert, J. Brumberg, Silent speech interfaces, Speech Commun., 52 (2010) 270-287.
[8]
M.J. Fagan, S.R. Ell, J.M. Gilbert, E. Sarrazin, P.M. Chapman, Development of a (silent) speech recognition system for patients following laryngectomy, Med. Eng. Phys., 30 (2008) 419-425.
[9]
J.M. Gilbert, S.I. Rybchenko, R. Hofe, S.R. Ell, M.J. Fagan, R.K. Moore, P. Green, Isolated word recognition of silent speech using magnetic implants and sensors, Med. Eng. Phys., 32 (2010) 1189-1197.
[10]
S. Hiroya, M. Honda, Estimation of articulatory movements from speech acoustics using an HMM-based speech production model, IEEE Trans. Speech Audio Process., 12 (2004) 175-185.
[11]
T. Hueber, G. Aversano, G. Chollet, B. Denby, G. Dreyfus, Y. Oussar, P. Roussel, M. Stone, Eigentongue feature extraction for an ultrasound-based silent speech interface, in: Proceedings of ICASSP, 2007, pp. 1245-1248.
[12]
T. Hueber, P. Badin, C. Savariaux, C. Vilain, G. Bailly, Differences in articulatory strategies between silent, whispered and normal speech? A pilot study using electromagnetic articulography, in: Proceedings of International Seminar on Speech Production (ISSP), 2010.
[13]
T. Hueber, E.-L. Benaroya, G. Chollet, B. Denby, M. Stone, Development of a silent speech interface driven by ultrasound and optical images of the tongue and lips, Speech Commun., 52 (2010) 288-300.
[14]
T. Hueber, E.-L. Benaroya, B. Denby, G. Chollet, Statistical mapping between articulatory and acoustic data for an ultrasound-based silent speech interface, in: Proceedings of Interspeech, 2011, pp. 593-596.
[15]
T. Hueber, A. Ben Youssef, G. Bailly, P. Badin, F. Elisei, Cross-speaker acoustic-to-articulatory inversion using phone-based trajectory HMM for pronunciation training, in: Proceedings of Interspeech, 2012.
[16]
T. Hueber, G. Chollet, B. Denby, G. Dreyfus, M. Stone, Visuo-phonetic decoding using multi-stream and context-dependent models for an ultrasound-based silent speech interface, in: Proceedings of Interspeech, 2009, pp. 640-643.
[17]
T. Hueber, G. Chollet, B. Denby, M. Stone, Acquisition of ultrasound, video and acoustic speech data for a silent-speech interface application, in: Proceedings of International Seminar on Speech Production, 2008, pp. 365-369.
[18]
S. Imai, K. Sumita, C. Furuichi, Mel log spectrum approximation (MLSA) filter for speech synthesis, Electron. Commun. Jpn. Part I Commun., 66 (1983) 10-18.
[19]
M. Janke, M. Wand, T. Schultz, Impact of lack of acoustic feedback in EMG-based silent speech recognition, in: Proceedings of Interspeech, 2010, pp. 2686-2689.
[20]
C.T. Kello, D.C. Plaut, A neural network model of the articulatory-acoustic forward mapping trained on recordings of articulatory parameters, J. Acoust. Soc. Am., 116 (2004) 2354-2364.
[21]
Z.-H. Ling, K. Richmond, J. Yamagishi, An analysis of HMM-based prediction of articulatory movements, Speech Commun., 52 (2010) 834-846.
[22]
Z.-H. Ling, K. Richmond, J. Yamagishi, R.-H. Wang, Integrating articulatory features into HMM-based parametric speech synthesis, IEEE Trans. Audio Speech Lang. Process., 17 (2009) 1171-1185.
[23]
S. Maeda, Compensatory articulation during speech: evidence from the analysis and synthesis of vocal-tract shapes using an articulatory model, Springer, 1990.
[24]
T. Muramatsu, Y. Ohtani, T. Toda, H. Saruwatari, K. Shikano, Low-delay voice conversion based on maximum likelihood estimation of spectral parameter trajectory, in: Proceedings of Interspeech, 2008, pp. 1076-1079.
[25]
Y. Nakajima, H. Kashioka, N. Campbell, K. Shikano, Non-audible murmur (NAM) recognition, IEICE Trans. Inf. Syst., 89 (2006) 1-8.
[26]
Y. Nakajima, H. Kashioka, K. Shikano, N. Campbell, Non-audible murmur recognition input interface using stethoscopic microphone attached to the skin, in: Proceedings of ICASSP, 2003, pp. 708-711.
[27]
M. Ostendorf, V.V. Digalakis, O.A. Kimball, From HMM's to segment models: a unified view of stochastic modeling for speech recognition, IEEE Trans. Speech Audio Process., 4 (1996) 360-378.
[28]
K. Richmond, A trajectory mixture density network for the acoustic-articulatory inversion mapping, in: Proceedings of Interspeech, 2006, pp. 577-580.
[29]
M. Russell, R. Moore, Explicit modelling of state occupancy in hidden Markov models for automatic speech recognition, in: Proceedings of ICASSP, 1985, pp. 5-8.
[30]
T. Schultz, M. Wand, Modeling coarticulation in EMG-based continuous speech recognition, Speech Commun., 52 (2010) 341-353.
[31]
M. Shannon, W. Byrne, Fast, low-artifact speech synthesis considering global variance, in: Proceedings of ICASSP, 2013, pp. 7869-7873.
[32]
H. Silén, E. Helander, J. Nurminen, M. Gabbouj, Ways to implement global variance in statistical speech synthesis, in: Proceedings of Interspeech, 2012.
[33]
M.M. Sondhi, J. Schroeter, A hybrid time-frequency domain articulatory speech synthesizer, IEEE Trans. Acoust. Speech Signal Process., 35 (1987) 955-967.
[34]
M. Stone, A guide to analysing tongue motion from ultrasound images, Clin. Linguist. Phon., 19 (2005) 455-501.
[35]
T. Toda, A.W. Black, K. Tokuda, Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory, IEEE Trans. Audio Speech Lang. Process., 15 (2007) 2222-2235.
[36]
T. Toda, A.W. Black, K. Tokuda, Statistical mapping between articulatory movements and acoustic spectrum using a Gaussian mixture model, Speech Commun., 50 (2008) 215-227.
[37]
T. Toda, K. Tokuda, A speech parameter generation algorithm considering global variance for HMM-based speech synthesis, IEICE Trans. Inf. Syst., E90-D (2007) 816-824.
[38]
K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, T. Kitamura, Speech parameter generation algorithms for HMM-based speech synthesis, in: Proceedings of ICASSP, 2000, pp. 1315-1318.
[39]
M. Turk, A. Pentland, Eigenfaces for recognition, J. Cogn. Neurosci., 3 (1991) 71-86.
[40]
M. Wand, T. Schultz, Session-independent EMG-based speech recognition, in: Proceedings of Biosignals, 2011, pp. 295-300.
[41]
A. Wrench, J. Scobbie, M. Linden, Evaluation of a helmet to hold an ultrasound probe, in: Presented at the Ultrafest IV, 2007.
[42]
A. Wrench, J.M. Scobbie, Spatio-temporal inaccuracies of video-based ultrasound images of the tongue, in: Proceedings of the International Seminar on Speech Production, 2006, pp. 451-458.
[43]
S. Young, The HTK Book, 2005. http://htk.eng.cam.ac.uk/
[44]
A. Ben Youssef, T. Hueber, P. Badin, G. Bailly, Toward a multi-speaker visual articulatory feedback system, in: Proceedings of Interspeech, 2011, pp. 589-592.
[45]
Y.J. Yu, S.T. Acton, Speckle reducing anisotropic diffusion, IEEE Trans. Image Process., 11 (2002) 1260-1270.
[46]
H. Zen, Y. Nankaku, K. Tokuda, Continuous stochastic feature mapping based on trajectory HMMS, IEEE Trans. Audio Speech Lang. Process., 19 (2011) 417-430.
[47]
L. Zhang, S. Renals, Acoustic-articulatory modelling with the trajectory HMM, IEEE Signal Process. Lett., 15 (2008) 245-248.

Published In

Computer Speech and Language, Volume 36, Issue C, March 2016, 394 pages
Publisher: Academic Press Ltd., United Kingdom

Author Tags

1. Articulatory-acoustic mapping
2. GMM
3. HMM
4. Silent speech interface
5. Ultrasound
