Development of a silent speech interface driven by ultrasound and optical images of the tongue and lips

Published: 01 April 2010

Abstract

This article presents a segmental vocoder driven by ultrasound and optical images (standard CCD camera) of the tongue and lips for a "silent speech interface" application, usable either by a laryngectomized patient or for silent communication. The system is built around an audio-visual dictionary which associates visual and acoustic observations for each phonetic class. Visual features are extracted from ultrasound images of the tongue and from video images of the lips using a PCA-based image coding technique. Visual observations of each phonetic class are modeled by continuous HMMs. The system then combines a phone recognition stage with corpus-based synthesis. In the recognition stage, the visual HMMs are used to identify phonetic targets in a sequence of visual features. In the synthesis stage, these phonetic targets constrain the dictionary search for the sequence of diphones that maximizes similarity to the input test data in the visual space, subject to a concatenation cost in the acoustic domain. A prosody template is extracted from the training corpus, and the final speech waveform is generated using "Harmonic plus Noise Model" concatenative synthesis techniques. Experimental results are based on an audiovisual database containing one hour of continuous speech from each of two speakers.
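To make the pipeline concrete, here is a minimal sketch (not the authors' implementation) of its two computational ingredients: PCA-based coding of tongue/lip images, and a Viterbi-style diphone search that trades a visual target cost against an acoustic concatenation cost. The function names (pca_code_images, select_units), the component count, and the weight w are illustrative assumptions; the two cost functions are left abstract.

    import numpy as np

    def pca_code_images(images, n_components=30):
        # Project vectorized frames (n_frames, h*w) onto their principal
        # components -- the "eigentongue"/"eigenlip" coding idea.
        mean = images.mean(axis=0)
        centered = images - mean
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        basis = vt[:n_components]               # principal axes, (n_components, h*w)
        return centered @ basis.T, basis, mean  # features: (n_frames, n_components)

    def select_units(candidates, target_cost, concat_cost, w=1.0):
        # Dynamic-programming search for the unit sequence minimizing
        # target cost (visual similarity) + w * concatenation cost
        # (acoustic smoothness at the joins). candidates[t] lists the
        # dictionary units allowed at slot t.
        best = np.array([target_cost(0, u) for u in candidates[0]])
        backptr = []
        for t in range(1, len(candidates)):
            tc = np.array([target_cost(t, u) for u in candidates[t]])
            cc = np.array([[concat_cost(p, u) for p in candidates[t - 1]]
                           for u in candidates[t]])      # (n_t, n_prev)
            total = best[None, :] + w * cc               # cumulative cost per transition
            backptr.append(total.argmin(axis=1))         # best predecessor per unit
            best = tc + total.min(axis=1)
        path = [int(best.argmin())]                      # backtrace cheapest path
        for bp in reversed(backptr):
            path.append(int(bp[path[-1]]))
        path.reverse()
        return [candidates[t][j] for t, j in enumerate(path)]

In the article's setting, target_cost would score a candidate diphone's visual features against the HMM-decoded input, and concat_cost would penalize acoustic mismatch between consecutive units before HNM-based concatenation; both are passed in here as plain callables.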





    Information

    Published In

    Speech Communication, Volume 52, Issue 4
    April 2010
    113 pages

    Publisher

    Elsevier Science Publishers B.V., Netherlands


    Author Tags

    1. Corpus-based speech synthesis
    2. Silent speech
    3. Ultrasound
    4. Visual phone recognition

    Qualifiers

    • Article



    Cited By
    • (2024) Whispering Wearables: Multimodal Approach to Silent Speech Recognition with Head-Worn Devices. Proceedings of the 26th International Conference on Multimodal Interaction, pp. 214-223. DOI: 10.1145/3678957.3685720. Online publication date: 4-Nov-2024.
    • (2024) Unvoiced: Designing an LLM-assisted Unvoiced User Interface using Earables. Proceedings of the 22nd ACM Conference on Embedded Networked Sensor Systems, pp. 784-798. DOI: 10.1145/3666025.3699374. Online publication date: 4-Nov-2024.
    • (2024) MELDER: The Design and Evaluation of a Real-time Silent Speech Recognizer for Mobile Devices. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, pp. 1-23. DOI: 10.1145/3613904.3642348. Online publication date: 11-May-2024.
    • (2023) Multi-stage Multi-modalities Fusion of Lip, Tongue and Acoustics Information for Speech Recognition. Proceedings of the 2023 6th Artificial Intelligence and Cloud Computing Conference, pp. 226-231. DOI: 10.1145/3639592.3639623. Online publication date: 16-Dec-2023.
    • (2023) LipLearner: Customizable Silent Speech Interactions on Mobile Devices. Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, pp. 1-21. DOI: 10.1145/3544548.3581465. Online publication date: 19-Apr-2023.
    • (2023) Mouth2Audio: intelligible audio synthesis from videos with distinctive vowel articulation. International Journal of Speech Technology 26(2), pp. 459-474. DOI: 10.1007/s10772-023-10030-3. Online publication date: 25-May-2023.
    • (2022) u-HuBERT. Proceedings of the 36th International Conference on Neural Information Processing Systems, pp. 21157-21170. DOI: 10.5555/3600270.3601808. Online publication date: 28-Nov-2022.
    • (2022) Design and Evaluation of a Silent Speech-Based Selection Method for Eye-Gaze Pointing. Proceedings of the ACM on Human-Computer Interaction 6(ISS), pp. 328-353. DOI: 10.1145/3567723. Online publication date: 14-Nov-2022.
    • (2022) MuteIt. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 6(3), pp. 1-26. DOI: 10.1145/3550281. Online publication date: 7-Sep-2022.
    • (2022) SilentSpeller: Towards mobile, hands-free, silent speech text entry using electropalatography. Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, pp. 1-19. DOI: 10.1145/3491102.3502015. Online publication date: 29-Apr-2022.
