DT-SV: A Transformer-based Time-domain Approach for Speaker Verification

Zhang, Nan; Wang, Jianzong; Hong, Zhenhou; Zhao, Chendong; Qu, Xiaoyang; Xiao, Jing

arXiv:2205.13249 (cs)

[Submitted on 26 May 2022]

Abstract: Speaker verification (SV) aims to determine whether the speaker of a test utterance is the same as the speaker of the reference speech. In the past few years, extracting speaker embeddings with deep neural networks has become the mainstream approach for SV systems. Recently, various attention mechanisms and Transformer networks have been widely explored in the SV field. However, applying the original Transformer directly to SV may waste frame-level information in the output features, which can limit the capacity and discriminability of the speaker embeddings. Therefore, we propose an approach that derives utterance-level speaker embeddings via a Transformer architecture, using a novel loss function named diffluence loss to integrate the feature information of different Transformer layers. The diffluence loss aims to aggregate frame-level features into an utterance-level representation, and it can be integrated into the Transformer easily. In addition, we introduce a learnable mel-fbank energy feature extractor, named the time-domain feature extractor, that computes mel-fbank features more precisely and efficiently than the standard mel-fbank extractor. Combining the diffluence loss and the time-domain feature extractor, we propose a novel Transformer-based time-domain SV model (DT-SV) with faster training speed and higher accuracy. Experiments indicate that the proposed model achieves better performance than other models.
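
To make the verification task in the abstract concrete, the following is a minimal sketch of how a trial might be scored once utterance-level embeddings are available. It assumes cosine scoring against a fixed threshold, which is a common SV back-end but is not stated in the abstract; the function names and the threshold value of 0.5 are hypothetical, and the embeddings would come from a trained model such as DT-SV rather than being hand-written lists.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def same_speaker(test_embedding, reference_embedding, threshold=0.5):
    """Accept the trial when the similarity reaches the threshold.

    The threshold is a hypothetical placeholder; in practice it would be
    tuned on a development set.
    """
    return cosine_similarity(test_embedding, reference_embedding) >= threshold
```

In this sketch, embeddings that point in the same direction are accepted and orthogonal embeddings are rejected; any real system would calibrate the threshold on held-out trials.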

Comments:	Accepted by IJCNN2022 (The 2022 International Joint Conference on Neural Networks)
Subjects:	Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2205.13249 [cs.SD]
	(or arXiv:2205.13249v1 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2205.13249

Submission history

From: Nan Zhang
[v1] Thu, 26 May 2022 09:36:26 UTC (888 KB)
