DOI: 10.1145/3447993.3448626
Research article
Open access

UltraSE: single-channel speech enhancement using ultrasound

Published: 09 September 2021
Abstract

    Robust speech enhancement is considered the holy grail of audio processing and a key requirement for human-human and human-machine interaction. Solving this task with single-channel, audio-only methods remains an open challenge, especially for practical scenarios involving a mixture of competing speakers and background noise. In this paper, we propose UltraSE, which uses ultrasound sensing as a complementary modality to separate the desired speaker's voice from interferences and noise. UltraSE uses a commodity mobile device (e.g., smartphone) to emit ultrasound and capture the reflections from the speaker's articulatory gestures. It introduces a multi-modal, multi-domain deep learning framework to fuse the ultrasonic Doppler features and the audible speech spectrogram. Furthermore, it employs an adversarially trained discriminator, based on a cross-modal similarity measurement network, to learn the correlation between the two heterogeneous feature modalities. Our experiments verify that UltraSE simultaneously improves speech intelligibility and quality, and outperforms state-of-the-art solutions by a large margin.
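    The abstract describes two sensing streams: the audible speech spectrogram and an ultrasonic Doppler feature obtained from reflections of a near-ultrasound tone emitted by the phone. The sketch below is an illustrative example only, not the authors' implementation; it shows one plausible way to split a single 48 kHz microphone recording into those two feature streams. The carrier frequency, Doppler bandwidth, and STFT parameters are assumptions chosen for the example.

        # Illustrative sketch, NOT the UltraSE implementation. Assumes a mono 48 kHz
        # recording containing audible speech plus the reflection of a ~20 kHz tone
        # played by the phone's speaker; carrier and band choices are hypothetical.
        import numpy as np
        from scipy.signal import stft

        FS = 48_000          # assumed microphone sampling rate (Hz)
        F_CARRIER = 20_000   # assumed ultrasound carrier frequency (Hz)
        DOPPLER_BW = 500     # assumed band around the carrier holding Doppler shifts (Hz)

        def speech_and_doppler_features(x, n_fft=1024, hop=256):
            """Return (speech_spec, doppler_spec) magnitude spectrograms from mono audio x."""
            f, _, Z = stft(x, fs=FS, nperseg=n_fft, noverlap=n_fft - hop)
            mag = np.abs(Z)
            # Audible speech band (roughly 0-8 kHz).
            speech_spec = mag[f <= 8_000, :]
            # Narrow band around the ultrasonic carrier: articulatory motion shows up
            # as Doppler-shifted energy on either side of F_CARRIER.
            band = (f >= F_CARRIER - DOPPLER_BW) & (f <= F_CARRIER + DOPPLER_BW)
            doppler_spec = mag[band, :]
            return speech_spec, doppler_spec

        if __name__ == "__main__":
            # Synthetic stand-in: broadband noise plus a carrier with a small frequency wobble.
            t = np.arange(FS * 2) / FS
            carrier = np.sin(2 * np.pi * (F_CARRIER + 30 * np.sin(2 * np.pi * 2 * t)) * t)
            x = 0.1 * np.random.randn(t.size) + 0.05 * carrier
            s, d = speech_and_doppler_features(x)
            print("speech spectrogram:", s.shape, "doppler spectrogram:", d.shape)

    In the paper's framework, these two spectrogram streams would then feed the multi-modal fusion network and the cross-modal similarity discriminator; the sketch stops at feature extraction because the page does not give the network details.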






    Published In

    MobiCom '21: Proceedings of the 27th Annual International Conference on Mobile Computing and Networking
    October 2021
    887 pages
    ISBN: 9781450383424
    DOI: 10.1145/3447993
    This work is licensed under a Creative Commons Attribution 4.0 International License.


    Publisher

    Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 09 September 2021


    Author Tags

    1. speech enhancement
    2. ultrasound

    Qualifiers

    • Research-article


    Conference

    ACM MobiCom '21

    Acceptance Rates

    Overall acceptance rate: 440 of 2,972 submissions (15%)




    Article Metrics

    • Downloads (last 12 months): 470
    • Downloads (last 6 weeks): 44
    Reflects downloads up to 09 Aug 2024.



    Cited By

    • Lipwatch: Enabling Silent Speech Recognition on Smartwatches using Acoustic Sensing. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8(2), 1-29 (15 May 2024). DOI: 10.1145/3659614
    • Sensing to Hear through Memory. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8(2), 1-31 (15 May 2024). DOI: 10.1145/3659598
    • UFace. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8(1), 1-27 (6 March 2024). DOI: 10.1145/3643546
    • Exploring the Feasibility of Remote Cardiac Auscultation Using Earphones. Proceedings of the 30th Annual International Conference on Mobile Computing and Networking, 357-372 (29 May 2024). DOI: 10.1145/3636534.3649366
    • EarSE. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 7(4), 1-33 (12 January 2024). DOI: 10.1145/3631447
    • ClearSpeech. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 7(4), 1-25 (12 January 2024). DOI: 10.1145/3631409
    • UltraCLR: Contrastive Representation Learning Framework for Ultrasound-based Sensing. ACM Transactions on Sensor Networks 20(4), 1-23 (11 May 2024). DOI: 10.1145/3597498
    • Wavoice: An mmWave-Assisted Noise-Resistant Speech Recognition System. ACM Transactions on Sensor Networks 20(4), 1-29 (11 May 2024). DOI: 10.1145/3597457
    • Combining IMU With Acoustics for Head Motion Tracking Leveraging Wireless Earphone. IEEE Transactions on Mobile Computing 23(6), 6835-6847 (June 2024). DOI: 10.1109/TMC.2023.3325826
    • HeadTrack: Real-Time Human–Computer Interaction via Wireless Earphones. IEEE Journal on Selected Areas in Communications 42(4), 990-1002 (April 2024). DOI: 10.1109/JSAC.2023.3345381
