
EarSpeech: Exploring In-Ear Occlusion Effect on Earphones for Data-efficient Airborne Speech Enhancement

Published: 09 September 2024

Abstract

Earphones have become a popular voice input and interaction device. However, airborne speech is susceptible to ambient noise, making it necessary to improve the quality and intelligibility of speech captured by earphones in noisy conditions. As the dual-microphone structure (i.e., outer and in-ear microphones) has been widely adopted in earphones (especially ANC earphones), we design EarSpeech, which exploits in-ear acoustic sensing as a complementary modality to enable airborne speech enhancement. The key idea of EarSpeech is that in-ear speech is less sensitive to ambient noise and correlates with airborne speech. However, due to the occlusion effect, in-ear speech has limited bandwidth, making it challenging to correlate directly with full-band airborne speech. Therefore, we carry out theoretical modeling and quantitative analysis of this cross-channel correlation based on the occlusion effect, and study how to leverage it for speech enhancement. Specifically, we design a series of methodologies, including data augmentation, deep learning-based fusion, and a noise mixture scheme, to improve the generalization, effectiveness, and robustness of EarSpeech, respectively. Lastly, we conduct real-world experiments to evaluate the performance of our system. EarSpeech achieves an average improvement ratio of 27.23% in PESQ and 13.92% in STOI, and improves SI-SDR by 8.91 dB. Benefiting from data augmentation, EarSpeech achieves comparable performance with a small-scale dataset 40 times smaller than the original dataset. In addition, we validate generalization across different users, speech content, and languages, as well as real-world robustness, via comprehensive experiments. An audio demo of EarSpeech is available at https://github.com/EarSpeech/earspeech.github.io/.
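The abstract reports enhancement gains in SI-SDR (scale-invariant signal-to-distortion ratio), a standard objective metric for speech enhancement. As a reference point for how such a number is computed, here is a minimal NumPy sketch of SI-SDR; this is a generic textbook formulation, not the authors' evaluation code, and the function name `si_sdr` is illustrative:

```python
import numpy as np

def si_sdr(estimate: np.ndarray, reference: np.ndarray) -> float:
    """Scale-invariant signal-to-distortion ratio (SI-SDR) in dB.

    Projects the zero-mean estimate onto the zero-mean reference to
    find the best-scaled clean component, then measures the energy
    ratio between that component and the residual distortion.
    """
    estimate = estimate - np.mean(estimate)
    reference = reference - np.mean(reference)
    # Optimal scaling of the reference that best explains the estimate
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference      # scaled clean component
    residual = estimate - target    # remaining noise + distortion
    return 10.0 * np.log10(np.sum(target ** 2) / np.sum(residual ** 2))
```

An "SI-SDR improvement" such as the 8.91 dB quoted above is the difference `si_sdr(enhanced, clean) - si_sdr(noisy, clean)`; the scale invariance means simply amplifying a noisy signal cannot inflate the score.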


Cited By

  • (2024) Exploring Earable-Based Passive User Authentication via Interpretable In-Ear Breathing Biometrics. IEEE Transactions on Mobile Computing 23(12), 15238-15255. DOI: 10.1109/TMC.2024.3453412. Online publication date: Dec-2024.
  • (2024) Indonesian Voice Cloning Text-to-Speech System With Vall-E-Based Model and Speech Enhancement. IEEE Access 12, 193131-193140. DOI: 10.1109/ACCESS.2024.3519870. Online publication date: 2024.


    Published In

    Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, Volume 8, Issue 3
    September 2024, 1782 pages
    EISSN: 2474-9567
    DOI: 10.1145/3695755
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 09 September 2024
    Published in IMWUT Volume 8, Issue 3

    Author Tags

    1. Earphone-based Sensing and Computing
    2. In-ear Acoustic Sensing
    3. Occlusion Effect
    4. Speech Enhancement

    Qualifiers

    • Research-article
    • Research
    • Refereed

