DOI: 10.1145/3664647.3681550

MSFNet: Multi-Scale Fusion Network for Brain-Controlled Speaker Extraction

Published: 28 October 2024

Abstract

Speaker extraction aims to selectively extract the target speaker from a multi-talker environment under the guidance of an auxiliary reference. Recent studies have shown that the attended speaker's information can be decoded from the listener's brain activity via auditory attention decoding. However, how to more effectively exploit the information about the target speaker shared between electroencephalography (EEG) and speech remains an unresolved problem. In this paper, we propose a multi-scale fusion network (MSFNet) for brain-controlled speaker extraction, which uses EEG recorded from the listener to extract the target speech. To make full use of the speech information, the mixed speech is encoded at multiple time scales to obtain multi-scale embeddings. In addition, to effectively model the non-Euclidean EEG data, graph convolutional networks are used as the EEG encoder. Finally, the multi-scale embeddings are each fused with the EEG features. To facilitate research on auditory attention decoding and to further validate the effectiveness of the proposed method, we also construct AVED, a new EEG-audio dataset. Experimental results on both the public Cocktail Party dataset and the newly proposed AVED dataset show that our MSFNet significantly outperforms the state-of-the-art method on certain objective evaluation metrics.
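The abstract outlines three components: a speech encoder that operates at several time scales, a graph-convolutional EEG encoder, and per-scale fusion of the two feature streams. The PyTorch sketch below is a minimal illustration of that idea, not the authors' implementation; the module names, kernel sizes, stride, and the simple learnable dense-adjacency graph convolution are all assumptions made for clarity.

```python
# Illustrative sketch of the multi-scale fusion idea from the abstract.
# NOT the paper's implementation: kernel sizes, stride, dimensions, and the
# dense learnable-adjacency graph conv are hypothetical choices.
import torch
import torch.nn as nn

class SimpleGraphConv(nn.Module):
    """One graph-conv layer over EEG channels: H' = ReLU(A @ H @ W).
    A is a learnable dense adjacency (a simplification; typical GCNs
    use a normalized graph Laplacian)."""
    def __init__(self, in_dim, out_dim, n_nodes):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)
        self.adj = nn.Parameter(torch.eye(n_nodes) + 0.01 * torch.randn(n_nodes, n_nodes))

    def forward(self, h):                       # h: (batch, nodes, features)
        return torch.relu(self.adj @ self.lin(h))

class MultiScaleFusionSketch(nn.Module):
    def __init__(self, eeg_channels=64, eeg_dim=128, feat_dim=256,
                 kernel_sizes=(20, 80, 160)):   # short/mid/long windows (assumed)
        super().__init__()
        # One 1-D conv speech encoder per time scale (SpEx-style front end).
        self.speech_encoders = nn.ModuleList(
            [nn.Conv1d(1, feat_dim, k, stride=10, padding=k // 2) for k in kernel_sizes]
        )
        self.eeg_encoder = SimpleGraphConv(eeg_dim, feat_dim, eeg_channels)
        # One fusion layer per scale: concat speech + EEG features, project back.
        self.fusers = nn.ModuleList(
            [nn.Conv1d(2 * feat_dim, feat_dim, 1) for _ in kernel_sizes]
        )

    def forward(self, mix, eeg):
        # mix: (batch, samples); eeg: (batch, channels, eeg_dim)
        e = self.eeg_encoder(eeg).mean(dim=1)   # pool nodes -> (batch, feat_dim)
        fused = []
        for enc, fuse in zip(self.speech_encoders, self.fusers):
            s = enc(mix.unsqueeze(1))           # (batch, feat_dim, frames)
            e_t = e.unsqueeze(-1).expand(-1, -1, s.shape[-1])
            fused.append(fuse(torch.cat([s, e_t], dim=1)))
        return fused                            # per-scale fused features

model = MultiScaleFusionSketch()
out = model(torch.randn(2, 16000), torch.randn(2, 64, 128))
print([f.shape for f in out])
```

In a full system these fused features would feed a speaker-extraction backbone, e.g. a mask estimator applied to the mixture encoding; the sketch stops at the per-scale fusion stage that the abstract describes.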



      Published In

      MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
      October 2024
      11719 pages
ISBN: 9798400706868
DOI: 10.1145/3664647
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 28 October 2024


      Author Tags

1. EEG signals
      2. graph convolutional network
      3. multi-modal fusion
      4. multi-talker environment
      5. speaker extraction

      Qualifiers

      • Research-article

      Conference

      MM '24
      Sponsor:
      MM '24: The 32nd ACM International Conference on Multimedia
      October 28 - November 1, 2024
      Melbourne VIC, Australia

      Acceptance Rates

MM '24 Paper Acceptance Rate: 1,150 of 4,385 submissions (26%)
Overall Acceptance Rate: 2,145 of 8,556 submissions (25%)
