DOI: 10.1145/3498361.3538933
Research Article · Open Access

ClearBuds: wireless binaural earbuds for learning-based speech enhancement

Published: 27 June 2022

Abstract

We present ClearBuds, the first hardware and software system that uses a neural network to enhance speech streamed from two wireless earbuds. Real-time speech enhancement for wireless earbuds requires high-quality sound separation and background cancellation that run in real time on a mobile phone. ClearBuds bridges state-of-the-art deep learning for blind audio source separation and in-ear mobile systems by making two key technical contributions: 1) a new wireless earbud design capable of operating as a synchronized, binaural microphone array, and 2) a lightweight dual-channel speech enhancement neural network that runs on a mobile device. Our neural network has a novel cascaded architecture that combines a time-domain convolutional neural network with a spectrogram-based frequency masking neural network to reduce artifacts in the audio output. Results show that our wireless earbuds achieve a synchronization error of less than 64 μs and that our network has a runtime of 21.4 ms on an accompanying mobile phone. In-the-wild evaluation with eight users in previously unseen indoor and outdoor multipath scenarios demonstrates that our neural network generalizes, learning both spatial and acoustic cues to perform noise suppression and background speech removal. In a user study with 37 participants who spent over 15.4 hours rating 1,041 audio samples collected in-the-wild, our system achieves an improved mean opinion score and stronger background noise suppression.
System demo video: https://youtu.be/HYu0ybjcQPA
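
To make the cascaded design concrete, here is a minimal PyTorch sketch of the general idea described above: a time-domain convolutional stage that consumes the two synchronized earbud channels, followed by a spectrogram-domain masking stage that suppresses residual artifacts. All module names, layer sizes, STFT settings, and the 16 kHz sample rate are illustrative assumptions for this sketch, not the ClearBuds network itself; see the paper and demo video for the actual architecture.

```python
# Hedged sketch (PyTorch), not the authors' code: a two-stage cascade in the
# spirit of the abstract -- a time-domain convolutional network on the raw
# binaural waveform, followed by a spectrogram-based frequency mask that
# cleans up residual artifacts. Layer sizes and STFT settings are assumptions.
import torch
import torch.nn as nn


class TimeDomainStage(nn.Module):
    """Stage 1: map the two synchronized earbud channels to one enhanced channel."""

    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(2, hidden, kernel_size=16, stride=8, padding=4),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.ConvTranspose1d(hidden, 1, kernel_size=16, stride=8, padding=4),
        )

    def forward(self, x):      # x: (batch, 2, samples)
        return self.net(x)     # -> (batch, 1, samples)


class SpectralMaskStage(nn.Module):
    """Stage 2: predict a 0..1 magnitude mask in the STFT domain and resynthesize."""

    def __init__(self, n_fft=512, hop=128):
        super().__init__()
        self.n_fft, self.hop = n_fft, hop
        self.mask = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, kernel_size=3, padding=1), nn.Sigmoid(),
        )

    def forward(self, wav):    # wav: (batch, 1, samples)
        window = torch.hann_window(self.n_fft, device=wav.device)
        spec = torch.stft(wav.squeeze(1), self.n_fft, self.hop,
                          window=window, return_complex=True)
        mag = spec.abs().unsqueeze(1)              # (batch, 1, freq, frames)
        masked = spec * self.mask(mag).squeeze(1)  # scale complex bins by the mask
        out = torch.istft(masked, self.n_fft, self.hop,
                          window=window, length=wav.shape[-1])
        return out.unsqueeze(1)


# Cascade the two stages: binaural audio in, enhanced mono speech out.
cascade = nn.Sequential(TimeDomainStage(), SpectralMaskStage())
noisy = torch.randn(1, 2, 16000)   # dummy 1 s binaural clip at an assumed 16 kHz
clean = cascade(noisy)             # -> torch.Size([1, 1, 16000])
```

Note that the reported 21.4 ms runtime implies both stages must finish well inside the audio buffering deadline on the phone; an offline sketch like this one would need causal convolutions and streaming buffers before it approached that budget.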

Supplementary Material

Supplemental videos (videos.zip)



Published In

MobiSys '22: Proceedings of the 20th Annual International Conference on Mobile Systems, Applications and Services
June 2022
668 pages
ISBN: 9781450391856
DOI: 10.1145/3498361
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. audio and speech processing
  2. audio source separation
  3. cascaded neural networks
  4. earable computing
  5. noise cancellation

Funding Sources

  • National Science Foundation
  • UW Reality Lab

Conference

MobiSys '22

Acceptance Rates

Overall Acceptance Rate 274 of 1,679 submissions, 16%

Cited By

  • (2024) TRAMBA: A Hybrid Transformer and Mamba Architecture for Practical Audio and Bone Conduction Speech Super Resolution and Enhancement on Mobile and Wearable Platforms. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 8(4):1-29. DOI: 10.1145/3699757. Online publication date: 21-Nov-2024.
  • (2024) EarSpeech: Exploring In-Ear Occlusion Effect on Earphones for Data-efficient Airborne Speech Enhancement. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 8(3):1-30. DOI: 10.1145/3678594. Online publication date: 9-Sep-2024.
  • (2024) BrushBuds: Toothbrushing Tracking Using Earphone IMUs. Companion of the 2024 ACM International Joint Conference on Pervasive and Ubiquitous Computing, 655-660. DOI: 10.1145/3675094.3680521. Online publication date: 5-Oct-2024.
  • (2024) Conductive Fabric Diaphragm for Noise-Suppressive Headset Microphone. Adjunct Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology, 1-3. DOI: 10.1145/3672539.3686768. Online publication date: 13-Oct-2024.
  • (2024) Piezoelectric Sensing of Mask Surface Waves for Noise-Suppressive Speech Input. Adjunct Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology, 1-3. DOI: 10.1145/3672539.3686331. Online publication date: 13-Oct-2024.
  • (2024) WhisperMask: A Noise Suppressive Mask-Type Microphone for Whisper Speech. Proceedings of the Augmented Humans International Conference 2024, 1-14. DOI: 10.1145/3652920.3652925. Online publication date: 4-Apr-2024.
  • (2024) SoundShift: Exploring Sound Manipulations for Accessible Mixed-Reality Awareness. Proceedings of the 2024 ACM Designing Interactive Systems Conference, 116-132. DOI: 10.1145/3643834.3661556. Online publication date: 1-Jul-2024.
  • (2024) F2Key: Dynamically Converting Your Face into a Private Key Based on COTS Headphones for Reliable Voice Interaction. Proceedings of the 22nd Annual International Conference on Mobile Systems, Applications and Services, 127-140. DOI: 10.1145/3643832.3661860. Online publication date: 3-Jun-2024.
  • (2024) EarSE. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 7(4):1-33. DOI: 10.1145/3631447. Online publication date: 12-Jan-2024.
  • (2024) Thermal Earring. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 7(4):1-28. DOI: 10.1145/3631440. Online publication date: 12-Jan-2024.
