
UltraSpeech: Speech Enhancement by Interaction between Ultrasound and Speech

Published: 07 September 2022

Abstract

Speech enhancement, which aims to recover clean speech from noisy ambient conditions, benefits many practical voice-based interaction applications. This paper presents a practical design, UltraSpeech, that enhances speech by exploiting the correlation between ultrasound (which profiles articulatory gestures) and speech. UltraSpeech uses a commodity smartphone to emit the ultrasound and collect the composite acoustic signal for analysis. We design a complex masking framework that operates on complex-valued spectrograms, rectifying the magnitude and phase of speech simultaneously. We further introduce an interaction module that shares information between the ultrasound and speech branches, enhancing the discrimination capability of each. Extensive experiments demonstrate that UltraSpeech increases the Scale-Invariant SDR by 12 dB, effectively improves speech intelligibility and quality, and generalizes to unknown speakers.
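The two quantitative ideas in the abstract can be illustrated compactly. The sketch below is not the paper's implementation: it shows the standard Scale-Invariant SDR definition (the metric the abstract reports a 12 dB gain in), and a toy complex ratio mask applied to hypothetical complex spectrogram bins, where a single complex multiplication rescales the magnitude and rotates the phase of each time-frequency bin at once. All array values are made up for illustration.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-Invariant Signal-to-Distortion Ratio in dB (standard
    definition): project the estimate onto the reference to obtain the
    scaled target, then compare target energy to residual energy."""
    estimate = np.asarray(estimate, dtype=float)
    reference = np.asarray(reference, dtype=float)
    # Optimal scaling alpha = <estimate, reference> / ||reference||^2
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    residual = estimate - target
    return 10.0 * np.log10(
        (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    )

# Complex ratio masking on toy spectrogram bins: the ideal complex mask
# M = S / Y maps the noisy bins Y back to the clean bins S, correcting
# magnitude and phase together (hypothetical values).
clean = np.array([1.0 + 1.0j, 2.0 - 1.0j, 0.5 + 0.5j])   # clean STFT bins S
noise = np.array([0.3 - 0.2j, -0.1 + 0.4j, 0.2 + 0.1j])  # additive noise
noisy = clean + noise                                     # observed bins Y
ideal_crm = clean / noisy        # ideal complex ratio mask
recovered = noisy * ideal_crm    # equals clean up to floating-point rounding
```

Because SI-SDR is invariant to rescaling of the estimate, a network optimizing it cannot cheat by simply amplifying its output; this is why the metric is preferred over plain SDR for learned enhancement systems.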




    Published In

    Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, Volume 6, Issue 3
    September 2022, 1612 pages
    EISSN: 2474-9567
    DOI: 10.1145/3563014

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 07 September 2022
    Published in IMWUT Volume 6, Issue 3


    Author Tags

    1. acoustic sensing
    2. multi-modality fusion
    3. speech enhancement

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Article Metrics

    • Downloads (Last 12 months): 147
    • Downloads (Last 6 weeks): 24

    Reflects downloads up to 13 Nov 2024

    Cited By

    • (2024) EarSpeech: Exploring In-Ear Occlusion Effect on Earphones for Data-efficient Airborne Speech Enhancement. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8(3), 1-30. DOI: 10.1145/3678594. Online publication date: 9-Sep-2024.
    • (2024) Conductive Fabric Diaphragm for Noise-Suppressive Headset Microphone. Adjunct Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology, 1-3. DOI: 10.1145/3672539.3686768. Online publication date: 13-Oct-2024.
    • (2024) Lipwatch: Enabling Silent Speech Recognition on Smartwatches using Acoustic Sensing. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8(2), 1-29. DOI: 10.1145/3659614. Online publication date: 15-May-2024.
    • (2024) Sensing to Hear through Memory. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8(2), 1-31. DOI: 10.1145/3659598. Online publication date: 15-May-2024.
    • (2024) AdaStreamLite. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 7(4), 1-29. DOI: 10.1145/3631460. Online publication date: 12-Jan-2024.
    • (2024) EarSE. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 7(4), 1-33. DOI: 10.1145/3631447. Online publication date: 12-Jan-2024.
    • (2024) DECO: Cooperative Order Dispatching for On-Demand Delivery with Real-Time Encounter Detection. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, 4734-4742. DOI: 10.1145/3627673.3680084. Online publication date: 21-Oct-2024.
    • (2024) Room-scale Voice Liveness Detection for Smart Devices. IEEE Transactions on Dependable and Secure Computing, 1-14. DOI: 10.1109/TDSC.2024.3367269. Online publication date: 2024.
    • (2024) RadioVAD: mmWave-Based Noise and Interference-Resilient Voice Activity Detection. IEEE Internet of Things Journal 11(15), 26005-26019. DOI: 10.1109/JIOT.2024.3394353. Online publication date: 1-Aug-2024.
    • (2023) GC-Loc. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 6(4), 1-27. DOI: 10.1145/3569495. Online publication date: 11-Jan-2023.
