TF-Locoformer: Transformer with Local Modeling by Convolution for Speech Separation and Enhancement

Saijo, Kohei; Wichern, Gordon; Germain, François G.; Pan, Zexu; Le Roux, Jonathan

arXiv:2408.03440 (eess)

[Submitted on 6 Aug 2024]

Abstract: Time-frequency (TF) domain dual-path models achieve high-fidelity speech separation. While some previous state-of-the-art (SoTA) models rely on RNNs, this reliance means they lack the parallelizability, scalability, and versatility of Transformer blocks. Given the wide-ranging success of pure Transformer-based architectures in other fields, in this work we focus on removing the RNN from TF-domain dual-path models, while maintaining SoTA performance. This work presents TF-Locoformer, a Transformer-based model with LOcal-modeling by COnvolution. The model uses feed-forward networks (FFNs) with convolution layers, instead of linear layers, to capture local information, letting the self-attention focus on capturing global patterns. We place two such FFNs before and after self-attention to enhance the local-modeling capability. We also introduce a novel normalization for TF-domain dual-path models. Experiments on separation and enhancement datasets show that the proposed model meets or exceeds SoTA in multiple benchmarks with an RNN-free architecture.
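The block ordering described in the abstract (a convolutional FFN, then self-attention, then a second convolutional FFN) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names (`conv1d_same`, `conv_ffn`, `self_attention`, `init_ffn`, `locoformer_like_block`), the ReLU activation, the residual connections, the single-head attention without learned projections, and all sizes are assumptions made for illustration. The normalization mentioned in the abstract is omitted because the abstract does not specify it.

```python
import math
import random


def conv1d_same(x, weight, bias):
    """Zero-padded 1-D convolution along the sequence axis.

    x: list of T frames, each a list of C_in floats.
    weight: nested list indexed [c_out][c_in][k]; bias: list of C_out floats.
    """
    T, K = len(x), len(weight[0][0])
    pad = K // 2
    out = []
    for t in range(T):
        frame = []
        for w_out, b in zip(weight, bias):
            acc = b
            for c_in, taps in enumerate(w_out):
                for k, w in enumerate(taps):
                    s = t + k - pad  # source frame index for this tap
                    if 0 <= s < T:
                        acc += w * x[s][c_in]
            frame.append(acc)
        out.append(frame)
    return out


def conv_ffn(x, params):
    """Residual FFN whose two layers are convolutions instead of linear layers."""
    h = conv1d_same(x, params["w_in"], params["b_in"])
    h = [[max(0.0, v) for v in frame] for frame in h]  # ReLU is an assumption
    h = conv1d_same(h, params["w_out"], params["b_out"])
    return [[a + b for a, b in zip(fx, fh)] for fx, fh in zip(x, h)]


def self_attention(x):
    """Residual single-head scaled dot-product attention with q = k = v = x.

    Learned projections are left out to keep the sketch short (an assumption).
    """
    scale = 1.0 / math.sqrt(len(x[0]))
    out = []
    for q in x:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in x]
        m = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        ctx = [sum(e / z * v[c] for e, v in zip(exps, x)) for c in range(len(q))]
        out.append([a + b for a, b in zip(q, ctx)])
    return out


def init_ffn(channels, hidden, kernel, rng):
    """Random weights for one convolutional FFN (hypothetical helper)."""
    def w(c_out, c_in):
        return [[[rng.uniform(-0.1, 0.1) for _ in range(kernel)]
                 for _ in range(c_in)] for _ in range(c_out)]
    return {"w_in": w(hidden, channels), "b_in": [0.0] * hidden,
            "w_out": w(channels, hidden), "b_out": [0.0] * channels}


def locoformer_like_block(x, ffn_before, ffn_after):
    """Conv-FFN -> self-attention -> Conv-FFN along one sequence axis."""
    x = conv_ffn(x, ffn_before)   # local modeling before attention
    x = self_attention(x)         # global modeling
    return conv_ffn(x, ffn_after)  # local modeling after attention


rng = random.Random(0)
x = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(6)]  # 6 frames, 4 channels
y = locoformer_like_block(x, init_ffn(4, 8, 3, rng), init_ffn(4, 8, 3, rng))
print(len(y), len(y[0]))  # sequence length and channel count are preserved
```

In a dual-path setting, a block of this kind would presumably be applied along the frequency axis and along the time axis in turn; the sketch covers a single axis only.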

Comments:	Accepted to IWAENC 2024
Subjects:	Audio and Speech Processing (eess.AS); Sound (cs.SD)
Cite as:	arXiv:2408.03440 [eess.AS]
	(or arXiv:2408.03440v1 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2408.03440

Submission history

From: Jonathan Le Roux
[v1] Tue, 6 Aug 2024 20:30:14 UTC (89 KB)
