
UltraSpeech: Speech Enhancement by Interaction between Ultrasound and Speech

Published: 07 September 2022

Abstract

Speech enhancement, which aims to recover clean speech from noisy ambient conditions, benefits many practical voice-based interaction applications. This paper presents a practical design, UltraSpeech, that enhances speech by exploiting the correlation between ultrasound (which profiles articulatory gestures) and speech. UltraSpeech uses a commodity smartphone to emit the ultrasound and collect the composite acoustic signal for analysis. We design a complex masking framework that operates on complex-valued spectrograms, rectifying the magnitude and phase of speech simultaneously. We further introduce an interaction module that shares information between the ultrasound and speech branches, enhancing the discrimination capability of each. Extensive experiments demonstrate that UltraSpeech increases the Scale-Invariant SDR by 12 dB, effectively improves speech intelligibility and quality, and generalizes to unknown speakers.
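The two quantitative ideas in the abstract can be illustrated compactly. The sketch below is not the paper's implementation: it shows the standard Scale-Invariant SDR definition (the metric the abstract reports a 12 dB gain in), and a toy complex ratio mask applied to hypothetical complex spectrogram bins, where a single complex multiplication rescales the magnitude and rotates the phase of each time-frequency bin at once. All array values are made up for illustration.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-Invariant Signal-to-Distortion Ratio in dB (standard
    definition): project the estimate onto the reference to obtain the
    scaled target, then compare target energy to residual energy."""
    estimate = np.asarray(estimate, dtype=float)
    reference = np.asarray(reference, dtype=float)
    # Optimal scaling alpha = <estimate, reference> / ||reference||^2
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    residual = estimate - target
    return 10.0 * np.log10(
        (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    )

# Complex ratio masking on toy spectrogram bins: the ideal complex mask
# M = S / Y maps the noisy bins Y back to the clean bins S, correcting
# magnitude and phase together (hypothetical values).
clean = np.array([1.0 + 1.0j, 2.0 - 1.0j, 0.5 + 0.5j])   # clean STFT bins S
noise = np.array([0.3 - 0.2j, -0.1 + 0.4j, 0.2 + 0.1j])  # additive noise
noisy = clean + noise                                     # observed bins Y
ideal_crm = clean / noisy        # ideal complex ratio mask
recovered = noisy * ideal_crm    # equals clean up to floating-point rounding
```

Because SI-SDR is invariant to rescaling of the estimate, a network optimizing it cannot cheat by simply amplifying its output; this is why the metric is preferred over plain SDR for learned enhancement systems.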




    Published In

    Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, Volume 6, Issue 3
    September 2022, 1612 pages
    EISSN: 2474-9567
    DOI: 10.1145/3563014

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 07 September 2022
    Published in IMWUT Volume 6, Issue 3


    Author Tags

    1. acoustic sensing
    2. multi-modality fusion
    3. speech enhancement

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Article Metrics

    • Downloads (Last 12 months): 147
    • Downloads (Last 6 weeks): 24

    Reflects downloads up to 13 Nov 2024

    Cited By

    • (2024) EarSpeech: Exploring In-Ear Occlusion Effect on Earphones for Data-efficient Airborne Speech Enhancement. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8(3), 1-30. DOI: 10.1145/3678594. Online publication date: 9-Sep-2024.
    • (2024) Conductive Fabric Diaphragm for Noise-Suppressive Headset Microphone. Adjunct Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology, 1-3. DOI: 10.1145/3672539.3686768. Online publication date: 13-Oct-2024.
    • (2024) Lipwatch: Enabling Silent Speech Recognition on Smartwatches using Acoustic Sensing. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8(2), 1-29. DOI: 10.1145/3659614. Online publication date: 15-May-2024.
    • (2024) Sensing to Hear through Memory. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8(2), 1-31. DOI: 10.1145/3659598. Online publication date: 15-May-2024.
    • (2024) AdaStreamLite. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 7(4), 1-29. DOI: 10.1145/3631460. Online publication date: 12-Jan-2024.
    • (2024) EarSE. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 7(4), 1-33. DOI: 10.1145/3631447. Online publication date: 12-Jan-2024.
    • (2024) DECO: Cooperative Order Dispatching for On-Demand Delivery with Real-Time Encounter Detection. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, 4734-4742. DOI: 10.1145/3627673.3680084. Online publication date: 21-Oct-2024.
    • (2024) Room-scale Voice Liveness Detection for Smart Devices. IEEE Transactions on Dependable and Secure Computing, 1-14. DOI: 10.1109/TDSC.2024.3367269. Online publication date: 2024.
    • (2024) RadioVAD: mmWave-Based Noise and Interference-Resilient Voice Activity Detection. IEEE Internet of Things Journal 11(15), 26005-26019. DOI: 10.1109/JIOT.2024.3394353. Online publication date: 1-Aug-2024.
    • (2023) GC-Loc. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 6(4), 1-27. DOI: 10.1145/3569495. Online publication date: 11-Jan-2023.
