DOI: 10.1145/3664647.3681550

MSFNet: Multi-Scale Fusion Network for Brain-Controlled Speaker Extraction

Published: 28 October 2024

Abstract

Speaker extraction aims to selectively extract the target speaker from a multi-talker environment under the guidance of an auxiliary reference. Recent studies have shown that the attended speaker's information can be decoded from the listener's brain activity via auditory attention decoding. However, how to more effectively exploit the information about the target speaker shared between electroencephalography (EEG) and speech remains an unresolved problem. In this paper, we propose a multi-scale fusion network (MSFNet) for brain-controlled speaker extraction, which uses EEG recorded from the listener to extract the target speech. To make full use of the speech information, the mixed speech is encoded at multiple time scales to obtain multi-scale embeddings. In addition, to effectively model the non-Euclidean EEG data, graph convolutional networks are used as the EEG encoder. Finally, the multi-scale embeddings are each fused with the EEG features. To facilitate research on auditory attention decoding and to further validate the effectiveness of the proposed method, we also construct AVED, a new EEG-audio dataset. Experimental results on both the public Cocktail Party dataset and the newly proposed AVED dataset show that our MSFNet significantly outperforms the state-of-the-art method on certain objective evaluation metrics.
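The abstract outlines three components: a speech encoder that operates at several time scales, a graph-convolutional EEG encoder, and per-scale fusion of the two feature streams. The PyTorch sketch below is a minimal illustration of that idea, not the authors' implementation; the module names, kernel sizes, stride, and the simple learnable dense-adjacency graph convolution are all assumptions made for clarity.

```python
# Illustrative sketch of the multi-scale fusion idea from the abstract.
# NOT the paper's implementation: kernel sizes, stride, dimensions, and the
# dense learnable-adjacency graph conv are hypothetical choices.
import torch
import torch.nn as nn

class SimpleGraphConv(nn.Module):
    """One graph-conv layer over EEG channels: H' = ReLU(A @ H @ W).
    A is a learnable dense adjacency (a simplification; typical GCNs
    use a normalized graph Laplacian)."""
    def __init__(self, in_dim, out_dim, n_nodes):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)
        self.adj = nn.Parameter(torch.eye(n_nodes) + 0.01 * torch.randn(n_nodes, n_nodes))

    def forward(self, h):                       # h: (batch, nodes, features)
        return torch.relu(self.adj @ self.lin(h))

class MultiScaleFusionSketch(nn.Module):
    def __init__(self, eeg_channels=64, eeg_dim=128, feat_dim=256,
                 kernel_sizes=(20, 80, 160)):   # short/mid/long windows (assumed)
        super().__init__()
        # One 1-D conv speech encoder per time scale (SpEx-style front end).
        self.speech_encoders = nn.ModuleList(
            [nn.Conv1d(1, feat_dim, k, stride=10, padding=k // 2) for k in kernel_sizes]
        )
        self.eeg_encoder = SimpleGraphConv(eeg_dim, feat_dim, eeg_channels)
        # One fusion layer per scale: concat speech + EEG features, project back.
        self.fusers = nn.ModuleList(
            [nn.Conv1d(2 * feat_dim, feat_dim, 1) for _ in kernel_sizes]
        )

    def forward(self, mix, eeg):
        # mix: (batch, samples); eeg: (batch, channels, eeg_dim)
        e = self.eeg_encoder(eeg).mean(dim=1)   # pool nodes -> (batch, feat_dim)
        fused = []
        for enc, fuse in zip(self.speech_encoders, self.fusers):
            s = enc(mix.unsqueeze(1))           # (batch, feat_dim, frames)
            e_t = e.unsqueeze(-1).expand(-1, -1, s.shape[-1])
            fused.append(fuse(torch.cat([s, e_t], dim=1)))
        return fused                            # per-scale fused features

model = MultiScaleFusionSketch()
out = model(torch.randn(2, 16000), torch.randn(2, 64, 128))
print([f.shape for f in out])
```

In a full system these fused features would feed a speaker-extraction backbone, e.g. a mask estimator applied to the mixture encoding; the sketch stops at the per-scale fusion stage that the abstract describes.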



      Published In

      MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
      October 2024
      11719 pages
ISBN: 9798400706868
DOI: 10.1145/3664647
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 28 October 2024


      Author Tags

1. EEG signals
      2. graph convolutional network
      3. multi-modal fusion
      4. multi-talker environment
      5. speaker extraction

      Qualifiers

      • Research-article

      Conference

      MM '24
      Sponsor:
      MM '24: The 32nd ACM International Conference on Multimedia
      October 28 - November 1, 2024
      Melbourne VIC, Australia

      Acceptance Rates

MM '24 Paper Acceptance Rate: 1,150 of 4,385 submissions (26%)
Overall Acceptance Rate: 2,145 of 8,556 submissions (25%)
