DOI: 10.1145/3581783.3612173

TE-KWS: Text-Informed Speech Enhancement for Noise-Robust Keyword Spotting

Published: 27 October 2023

Abstract

Keyword spotting (KWS) is a formidable challenge, particularly in high-noise environments. Traditional denoising algorithms that rely solely on the speech signal struggle to recover speech that has been severely corrupted by noise. In this work, we develop an adaptive text-informed denoising model to support reliable keyword identification under severe noise degradation. The proposed TE-KWS adopts a tripartite branch structure: the speech branch (SB) takes noisy speech as input and provides the raw acoustic information; the alignment branch (AB) accepts time-aligned text, enabling accurate restoration of the corresponding speech whenever alignment information is available; and the text branch (TB) handles unaligned text, prompting the model to learn the alignment between speech and text on its own. To make the denoising model more beneficial for KWS, the alignment branch (AB) is frozen once the whole model has been trained, and the model is fine-tuned by leveraging its speech-restoration and forced-alignment capabilities. The input to the text branch (TB) is then replaced with the designated keywords, and a heavier denoising penalty is applied over the keyword period, explicitly strengthening the model's ability to restore keyword speech. Finally, Combined Adversarial Domain Adaptation (CADA) is applied to make KWS robust to data both before and after speech enhancement (SE). Experimental results indicate that our approach not only markedly restores highly corrupted speech while achieving state-of-the-art performance on mildly corrupted speech, but also improves the accuracy and generalizability of mainstream KWS models.
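The abstract is specific enough to sketch in code. Below is a minimal, hypothetical PyTorch rendering of the three ideas it names: the tripartite SB/AB/TB structure, the heavier denoising penalty over the keyword period, and a gradient-reversal layer of the kind commonly used for adversarial domain adaptation (the abstract does not spell out CADA's internals). All layer types, dimensions, the additive fusion, and the `alpha` weight are assumptions for illustration, not the authors' architecture.

```python
# Hypothetical sketch only: layer choices, sizes, additive fusion, and
# alpha are illustrative assumptions, not the actual TE-KWS design.
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Gradient reversal (DANN-style), a common building block for
    adversarial domain adaptation such as the abstract's CADA step."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None


class TEKWSEnhancer(nn.Module):
    """Tripartite enhancer: the speech branch (SB) encodes the noisy
    magnitude spectrogram, the alignment branch (AB) injects one text
    label per frame, and the text branch (TB) cross-attends over
    unaligned text so the model learns the alignment itself."""

    def __init__(self, n_bins=257, vocab=64, d=256, heads=4):
        super().__init__()
        self.sb = nn.GRU(n_bins, d, batch_first=True)     # SB: noisy speech
        self.ab = nn.Embedding(vocab, d)                  # AB: aligned text
        self.tb = nn.Embedding(vocab, d)                  # TB: unaligned text
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.mask = nn.Sequential(nn.Linear(d, n_bins), nn.Sigmoid())

    def forward(self, noisy, aligned_ids=None, text_ids=None):
        h, _ = self.sb(noisy)                             # (B, T, d)
        if aligned_ids is not None:                       # frame-synchronous cue
            h = h + self.ab(aligned_ids)
        if text_ids is not None:                          # learned alignment
            t = self.tb(text_ids)                         # (B, N, d)
            a, _ = self.attn(h, t, t)
            h = h + a
        return self.mask(h) * noisy                       # masked magnitude


def keyword_weighted_l1(est, clean, kw_mask, alpha=5.0):
    """L1 loss with a heavier penalty over the keyword period.
    kw_mask is (B, T): 1 inside the keyword span, 0 elsewhere."""
    w = 1.0 + alpha * kw_mask.unsqueeze(-1)               # broadcast over bins
    return (w * (est - clean).abs()).mean()


if __name__ == "__main__":
    B, T, nb, N = 2, 100, 257, 12
    model = TEKWSEnhancer()
    noisy, clean = torch.rand(B, T, nb), torch.rand(B, T, nb)
    est = model(noisy,
                aligned_ids=torch.randint(0, 64, (B, T)),   # AB input
                text_ids=torch.randint(0, 64, (B, N)))      # TB input
    kw = (torch.arange(T) >= 40).float().expand(B, T)       # toy keyword span
    print(keyword_weighted_l1(est, clean, kw).item())
    rev = GradReverse.apply(est, 1.0)  # would feed a domain classifier
```

Additive fusion keeps the sketch small; the paper may well fuse the branches differently, but the division of labor the abstract emphasizes is preserved: frame-synchronous conditioning through AB versus learned cross-attention alignment through TB.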



    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN: 9798400701085
    DOI: 10.1145/3581783


    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. keyword spotting
    2. speech enhancement
    3. text-informed speech enhancement

    Qualifiers

    • Research-article

    Funding Sources

    • Post-graduate Research & Practice Innovation Program of Jiangsu Province
    • Key Project of the National Natural Science Foundation of China
    • National Natural Science Foundation of China
    • Jiangsu Key Research and Development Plan
    • MTRAC Grant for Advanced Computing Technologies

    Conference

    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa, ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
