DOI: 10.1145/3343031.3350596

Ultrasound-Based Silent Speech Interface using Sequential Convolutional Auto-encoder

Published: 15 October 2019

Abstract

"Silent Speech Interface" (SSI) refers to a system that uses non-audible signals recorded during speech production to perform speech recognition and synthesis tasks. Different approaches have been proposed for SSI systems; in this paper, we focus on an ultrasound-based SSI. The performance of an ultrasound-based SSI system relies heavily on the feature extraction approach. However, most previous attempts are limited to individual-frame analysis and cannot take the context information of the image sequence into account. Inspired by the recent success of recurrent neural networks and convolutional auto-encoders, we explore a novel sequential feature extraction approach for SSI systems. The architecture extracts spatial and temporal features from the image sequence, which can then be deployed for speech recognition and synthesis tasks. In a quantitative comparison of unsupervised feature extraction approaches, the new approach outperforms other methods on the 2010 SSI challenge.
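The paper itself does not include code; as a rough illustration of the pipeline the abstract describes (per-frame convolutional encoding of ultrasound images, recurrent aggregation of the latent sequence, and decoding for reconstruction), a toy NumPy forward pass might look like the sketch below. All dimensions, weights, and function names here are illustrative assumptions, not the authors' actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d_valid(x, k):
    """Naive valid-mode 2-D convolution over a single-channel image."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def encode_frame(frame, kernel, W_proj):
    """Spatial encoder: conv + ReLU, stride-2 pooling, linear projection."""
    fmap = np.maximum(conv2d_valid(frame, kernel), 0.0)
    pooled = fmap[::2, ::2]                      # crude downsampling
    return np.tanh(pooled.ravel() @ W_proj)      # per-frame latent code

def rnn_aggregate(latents, W_h, W_x):
    """Temporal encoder: a plain RNN mixing context into each latent."""
    h = np.zeros(W_h.shape[0])
    outputs = []
    for z in latents:
        h = np.tanh(W_h @ h + W_x @ z)
        outputs.append(h)
    return np.stack(outputs)

# Toy dimensions: 8 frames of 32x32 "ultrasound" images, 16-D latent space.
T, H, W, D = 8, 32, 32, 16
frames = rng.standard_normal((T, H, W))
kernel = rng.standard_normal((3, 3)) * 0.1
W_proj = rng.standard_normal((15 * 15, D)) * 0.05   # 30x30 map pooled to 15x15
W_h = rng.standard_normal((D, D)) * 0.1
W_x = rng.standard_normal((D, D)) * 0.1
W_dec = rng.standard_normal((D, H * W)) * 0.05

latents = np.stack([encode_frame(f, kernel, W_proj) for f in frames])  # (T, D)
seq_feats = rnn_aggregate(latents, W_h, W_x)                           # (T, D)
recon = (seq_feats @ W_dec).reshape(T, H, W)   # decoder: reconstruct frames
```

In a real auto-encoder these weights would be trained to minimize reconstruction error, and the sequential features `seq_feats` would then be handed to the downstream recognition or synthesis model.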



Published In

MM '19: Proceedings of the 27th ACM International Conference on Multimedia
October 2019
2794 pages
ISBN:9781450368896
DOI:10.1145/3343031


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. sequential convolutional auto-encoder
  2. silent speech interface

Qualifiers

  • Demonstration

Funding Sources

  • National Key Research and Development Program

Conference

MM '19

Acceptance Rates

MM '19 Paper Acceptance Rate: 252 of 936 submissions (27%).
Overall Acceptance Rate: 2,145 of 8,556 submissions (25%).


Cited By

  • (2025) "Spatio-temporal masked autoencoder-based phonetic segments classification from ultrasound." Speech Communication. DOI: 10.1016/j.specom.2025.103186. Online publication date: Jan 2025.
  • (2023) "HPSpeech: Silent Speech Interface for Commodity Headphones." Proceedings of the 2023 ACM International Symposium on Wearable Computers, 60–65. DOI: 10.1145/3594738.3611365. Online publication date: 8 Oct 2023.
  • (2023) "DUMask: A Discrete and Unobtrusive Mask-Based Interface for Facial Gestures." Proceedings of the Augmented Humans International Conference 2023, 255–266. DOI: 10.1145/3582700.3582726. Online publication date: 12 Mar 2023.
  • (2023) "EchoSpeech: Continuous Silent Speech Recognition on Minimally-obtrusive Eyewear Powered by Acoustic Sensing." Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, 1–18. DOI: 10.1145/3544548.3580801. Online publication date: 19 Apr 2023.
  • (2023) "Raw Ultrasound-Based Phonetic Segments Classification Via Mask Modeling." ICASSP 2023 – IEEE International Conference on Acoustics, Speech and Signal Processing, 1–5. DOI: 10.1109/ICASSP49357.2023.10095156. Online publication date: 4 Jun 2023.
  • (2022) "Exploring Silent Speech Interfaces Based on Frequency-Modulated Continuous-Wave Radar." Sensors, 22(2), 649. DOI: 10.3390/s22020649. Online publication date: 14 Jan 2022.
  • (2020) "Silent Speech Interfaces for Speech Restoration: A Review." IEEE Access, 8, 177995–178021. DOI: 10.1109/ACCESS.2020.3026579. Online publication date: 2020.
