A silent speech system based on permanent magnet articulography and direct synthesis

Published: 01 September 2016

Highlights

• This paper introduces a 'Silent Speech Interface' (SSI) with the potential to restore the power of speech to people who have completely lost their voices.
• Small, unobtrusive magnets are attached to the lips and tongue, and changes in the magnetic field are sensed as the 'speaker' mouths what s/he wants to say.
• The sensor data are transformed to acoustic data by a speaker-dependent transformation learned from parallel acoustic and sensor recordings.
• The machine learning technique used here is a mixture of factor analysers.
• Results are presented for three speakers, demonstrating that the SSI is capable of producing 'speech' which is both intelligible and natural.

Abstract

In this paper we present a silent speech interface (SSI) system aimed at restoring speech communication for individuals who have lost their voice due to laryngectomy or diseases affecting the vocal folds. In the proposed system, articulatory data captured from the lips and tongue using permanent magnet articulography (PMA) are converted into audible speech using a speaker-dependent transformation learned from simultaneous recordings of PMA and audio signals acquired before laryngectomy. The transformation is represented using a mixture of factor analysers, a generative model that allows us to efficiently model non-linear behaviour and perform dimensionality reduction at the same time. The learned transformation is then deployed during normal usage of the SSI to restore the acoustic speech signal associated with the captured PMA data. The proposed system is evaluated using objective quality measures and listening tests on two databases containing PMA and audio recordings for normal speakers. Results show that it is possible to reconstruct speech from articulator movements captured by an unobtrusive technique without an intermediate recognition step. The SSI is capable of producing speech of sufficient intelligibility and naturalness that the speaker is clearly identifiable, but problems remain in scaling up the process to function consistently for phonetically rich vocabularies.
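
The abstract describes a direct, recognition-free mapping from PMA sensor frames to acoustic frames, learned from time-aligned parallel recordings. As a rough illustration of this joint-density conversion idea, the sketch below fits a full-covariance Gaussian mixture on stacked articulatory/acoustic frames and converts new PMA frames via the conditional expectation E[y | x]. This is a minimal stand-in only: it uses a plain Gaussian mixture rather than the paper's mixture of factor analysers, assumes mel-cepstral acoustic features, and all function names, array shapes, and parameters are illustrative assumptions rather than the authors' implementation.

# Illustrative sketch only (assumed names and shapes, not the authors' code):
# articulatory-to-acoustic conversion by conditioning a joint density model.
# A full-covariance Gaussian mixture stands in for the paper's mixture of
# factor analysers; the conditioning step E[y | x] is analogous.

import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def train_joint_model(pma, mcep, n_components=16, seed=0):
    """Fit a joint GMM on stacked [PMA | mel-cepstral] frames.

    pma  : (T, Dx) articulatory (PMA) feature frames
    mcep : (T, Dy) acoustic feature frames, time-aligned with pma
    """
    z = np.hstack([pma, mcep])                      # joint frames, (T, Dx+Dy)
    return GaussianMixture(n_components=n_components,
                           covariance_type='full',
                           random_state=seed).fit(z)

def pma_to_mcep(gmm, pma, dx):
    """Convert PMA frames to acoustic frames via E[y | x] under the joint GMM."""
    mu_x, mu_y = gmm.means_[:, :dx], gmm.means_[:, dx:]
    sxx, sxy = gmm.covariances_[:, :dx, :dx], gmm.covariances_[:, :dx, dx:]

    # Posterior probability of each mixture component given the PMA frame,
    # computed from the marginal distribution over the articulatory part.
    post = np.stack([w * multivariate_normal.pdf(pma, m, s)
                     for w, m, s in zip(gmm.weights_, mu_x, sxx)], axis=1)
    post /= post.sum(axis=1, keepdims=True)

    # Component-wise conditional means, weighted by the posteriors.
    y_hat = np.zeros((pma.shape[0], mu_y.shape[1]))
    for k in range(gmm.n_components):
        gain = np.linalg.solve(sxx[k], sxy[k])      # Sxx^{-1} Sxy
        y_hat += post[:, k:k+1] * (mu_y[k] + (pma - mu_x[k]) @ gain)
    return y_hat

In a full pipeline the predicted acoustic trajectory would typically be smoothed over time and passed to a vocoder to generate the waveform; those steps, and the mixture-of-factor-analysers parameterisation itself, are omitted here for brevity.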


      Published In

      Computer Speech and Language, Volume 39, Issue C (September 2016), 128 pages

      Publisher

      Academic Press Ltd.

      United Kingdom

      Author Tags

      1. Augmentative and alternative communication
      2. Permanent magnet articulography
      3. Silent speech interfaces
      4. Speech rehabilitation
      5. Speech synthesis

      Qualifiers

      • Research-article

      Cited By

      • (2024) Human-inspired computational models for European Portuguese: a review. Language Resources and Evaluation 58:1, 43-72. DOI: 10.1007/s10579-023-09648-1. Online publication date: 1-Mar-2024.
      • (2023) LipLearner: Customizable Silent Speech Interactions on Mobile Devices. Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, 1-21. DOI: 10.1145/3544548.3581465. Online publication date: 19-Apr-2023.
      • (2022) SVoice. Proceedings of the 20th ACM Conference on Embedded Networked Sensor Systems, 622-636. DOI: 10.1145/3560905.3568530. Online publication date: 6-Nov-2022.
      • (2022) EarCommand. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 6:2, 1-28. DOI: 10.1145/3534613. Online publication date: 7-Jul-2022.
      • (2022) SilentSpeller: Towards mobile, hands-free, silent speech text entry using electropalatography. Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, 1-19. DOI: 10.1145/3491102.3502015. Online publication date: 29-Apr-2022.
      • (2020) TieLent. Proceedings of the 2020 International Conference on Advanced Visual Interfaces, 1-8. DOI: 10.1145/3399715.3399852. Online publication date: 28-Sep-2020.
      • (2019) Field study as a method to assess effectiveness of post-laryngectomy communication assistive interfaces. Companion Proceedings of the 24th International Conference on Intelligent User Interfaces, 115-116. DOI: 10.1145/3308557.3308719. Online publication date: 16-Mar-2019.
      • (2019) SottoVoce. Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 1-11. DOI: 10.1145/3290605.3300376. Online publication date: 2-May-2019.
      • (2018) Non-Invasive Silent Phoneme Recognition Using Microwave Signals. IEEE/ACM Transactions on Audio, Speech and Language Processing 26:12, 2404-2411. DOI: 10.1109/TASLP.2018.2865609. Online publication date: 1-Dec-2018.
      • (2017) Direct Speech Reconstruction From Articulatory Sensor Data by Machine Learning. IEEE/ACM Transactions on Audio, Speech and Language Processing 25:12, 2362-2374. DOI: 10.1109/TASLP.2017.2757263. Online publication date: 1-Dec-2017.