Improving Transformer-based Networks With Locality For Automatic Speaker Verification

Sang, Mufan; Zhao, Yong; Liu, Gang; Hansen, John H. L.; Wu, Jian

arXiv:2302.08639 (eess)

[Submitted on 17 Feb 2023 (v1), last revised 28 Feb 2023 (this version, v2)]

Abstract: Recently, Transformer-based architectures have been explored for speaker embedding extraction. Although the Transformer employs the self-attention mechanism to efficiently model the global interaction between token embeddings, it is inadequate for capturing short-range local context, which is essential for the accurate extraction of speaker information. In this study, we improve the Transformer's locality modeling in two directions. First, we propose the Locality-Enhanced Conformer (LE-Conformer) by introducing depth-wise convolution and channel-wise attention into the Conformer blocks. Second, we present the Speaker Swin Transformer (SST) by adapting the Swin Transformer, originally proposed for vision tasks, into a speaker embedding network. We evaluate the proposed approaches on the VoxCeleb datasets and a large-scale Microsoft internal multilingual (MS-internal) dataset. The proposed models achieve 0.75% EER on the VoxCeleb1 test set, outperforming previously proposed Transformer-based models and CNN-based models such as ResNet34 and ECAPA-TDNN. When trained on the MS-internal dataset, the proposed models achieve promising results, with a 14.6% relative reduction in EER over the Res2Net50 model.
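
The abstract reports results as equal error rate (EER) and as a relative reduction in EER. As a rough illustration of how these two quantities are commonly computed, the following minimal Python sketch sweeps a decision threshold over verification scores. The function names `equal_error_rate` and `relative_reduction` and the toy scores are hypothetical, chosen only for illustration; they are not taken from the paper and do not reproduce its evaluation setup.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    Assumption: higher scores mean "more likely the same speaker".
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection: a same-speaker trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a different-speaker trial scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        # Keep the operating point where the two error rates are closest.
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


def relative_reduction(baseline_eer, new_eer):
    """Relative EER reduction, expressed as a fraction of the baseline EER."""
    return (baseline_eer - new_eer) / baseline_eer


# Toy scores (hypothetical, not from the paper).
toy_targets = [0.9, 0.8, 0.7, 0.3]
toy_nontargets = [0.6, 0.4, 0.2, 0.1]
toy_eer = equal_error_rate(toy_targets, toy_nontargets)
```

With these toy scores, one of four same-speaker trials is rejected and one of four different-speaker trials is accepted at a threshold of 0.6, so `toy_eer` is 0.25. A relative reduction is a ratio of EERs, not a difference in percentage points: `relative_reduction(2.0, 1.5)` gives 0.25, that is, a 25% relative reduction.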

Comments:	Accepted to ICASSP 2023
Subjects:	Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
Cite as:	arXiv:2302.08639 [eess.AS]
	(or arXiv:2302.08639v2 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2302.08639

Submission history

From: Mufan Sang
[v1] Fri, 17 Feb 2023 01:04:51 UTC (273 KB)
[v2] Tue, 28 Feb 2023 23:32:08 UTC (273 KB)
