Boosting Unknown-number Speaker Separation with Transformer Decoder-based Attractor

Lee, Younglo; Choi, Shukjae; Kim, Byeong-Yeol; Wang, Zhong-Qiu; Watanabe, Shinji

arXiv:2401.12473 (eess)

[Submitted on 23 Jan 2024]

Abstract: We propose a novel speech separation model designed to separate mixtures with an unknown number of speakers. The proposed model stacks 1) a dual-path processing block that can model spectro-temporal patterns, 2) a transformer decoder-based attractor (TDA) calculation module that can deal with an unknown number of speakers, and 3) triple-path processing blocks that can model inter-speaker relations. Given a fixed, small set of learned speaker queries and the mixture embedding produced by the dual-path blocks, TDA infers the relations of these queries and generates an attractor vector for each speaker. The estimated attractors are then combined with the mixture embedding by feature-wise linear modulation conditioning, creating a speaker dimension. The mixture embedding, conditioned with speaker information produced by TDA, is fed to the final triple-path blocks, which augment the dual-path blocks with an additional pathway dedicated to inter-speaker processing. The proposed approach outperforms the previous best results reported in the literature, achieving 24.0 and 23.7 dB SI-SDR improvement (SI-SDRi) on WSJ0-2mix and WSJ0-3mix, respectively, with a single model trained to separate 2- and 3-speaker mixtures. The proposed model also exhibits strong performance and generalizability at counting sources and separating mixtures with up to 5 speakers.
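The abstract reports its results as SI-SDR improvement (SI-SDRi). As a rough illustration of what that number measures, the sketch below follows the commonly used definition of scale-invariant SDR: project the estimate onto the reference, then compare the energy of the projection with the energy of the residual. This is not code from the paper. The function names `si_sdr` and `si_sdr_improvement`, the `eps` stabilizer, and the mean-removal step are assumptions chosen for illustration, and the toy signals are synthetic sine waves rather than speech.

```python
import math


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB between two equal-length 1-D signals."""
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")
    # Assumed convention: remove the mean so a DC offset does not affect the score.
    mean_e = sum(estimate) / len(estimate)
    mean_r = sum(reference) / len(reference)
    e = [x - mean_e for x in estimate]
    r = [x - mean_r for x in reference]
    # Optimal scaling of the reference: alpha = <e, r> / ||r||^2.
    dot = sum(a * b for a, b in zip(e, r))
    ref_energy = sum(b * b for b in r)
    alpha = dot / (ref_energy + eps)
    # Split the estimate into a scaled-target part and a residual part.
    target = [alpha * b for b in r]
    residual = [a - t for a, t in zip(e, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))


def si_sdr_improvement(estimate, mixture, reference):
    """SI-SDRi: gain of the separated estimate over the unprocessed mixture."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)


# Toy example: two orthogonal sine waves stand in for two "speakers".
N = 1000
speaker = [math.sin(2 * math.pi * 5 * n / N) for n in range(N)]
interferer = [math.sin(2 * math.pi * 9 * n / N) for n in range(N)]
mixture = [s + i for s, i in zip(speaker, interferer)]
# A hypothetical separator output that attenuates the interferer by a factor of 10.
estimate = [s + 0.1 * i for s, i in zip(speaker, interferer)]

print(f"SI-SDRi: {si_sdr_improvement(estimate, mixture, speaker):.1f} dB")
```

In this toy case the interferer's amplitude drops by a factor of 10, so the improvement comes out at about 20 dB. Because the reference is rescaled before the comparison, multiplying the estimate by any nonzero constant leaves the score essentially unchanged, which is the property the "scale-invariant" name refers to.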

Comments:	5 pages, 4 figures, accepted by ICASSP 2024
Subjects:	Audio and Speech Processing (eess.AS); Sound (cs.SD)
Cite as:	arXiv:2401.12473 [eess.AS]
	(or arXiv:2401.12473v1 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2401.12473

Submission history

From: Younglo Lee
[v1] Tue, 23 Jan 2024 03:55:22 UTC (478 KB)
