Optimizing Alignment of Speech and Language Latent Spaces for End-to-End Speech Recognition and Understanding

Wang, Wei; Ren, Shuo; Qian, Yao; Liu, Shujie; Shi, Yu; Qian, Yanmin; Zeng, Michael

arXiv:2110.12138 (cs)

[Submitted on 23 Oct 2021]

Abstract: Advances in attention-based encoder-decoder (AED) networks have brought great progress to end-to-end (E2E) automatic speech recognition (ASR). One way to further improve the performance of AED-based E2E ASR is to introduce an extra text encoder that leverages extensive text data and thus captures more context-aware linguistic information. However, this approach introduces a mismatch between the speech encoder and the text encoder because the two use different modeling units. In this paper, we propose an embedding aligner and modality switch training to better align the speech and text latent spaces. The embedding aligner is a linear projection shared between the text encoder and the speech encoder, trained with a masked language modeling (MLM) loss on the text side and a connectionist temporal classification (CTC) loss on the speech side. Modality switch training randomly swaps speech and text embeddings based on the forced alignment result to learn a joint representation space. Experimental results show that the proposed approach achieves a relative 14% to 19% word error rate (WER) reduction on the LibriSpeech ASR task. We further verify its effectiveness on spoken language understanding (SLU), with an absolute 2.5% to 2.8% F1 score improvement on the SNIPS slot filling task.
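
The modality switch idea in the abstract can be illustrated with a minimal sketch. This is one possible reading of "randomly swaps speech and text embeddings based on the forced alignment result", not the authors' implementation: the function name `modality_switch`, the per-token swap probability `swap_prob`, and the half-open `(start, end)` span format are all assumptions made for illustration, and the abstract does not specify how the swap is sampled or applied.

```python
import random
from typing import List, Sequence, Tuple

Vector = List[float]


def modality_switch(
    speech_frames: Sequence[Vector],
    text_embeddings: Sequence[Vector],
    alignment: Sequence[Tuple[int, int]],
    swap_prob: float,
    rng: random.Random,
) -> List[Vector]:
    """Mix speech and text embeddings along a forced alignment (hypothetical sketch).

    alignment[i] = (start, end) is the half-open range of speech frames that a
    forced aligner assigned to token i. With probability swap_prob, every frame
    in that range is replaced by the text embedding of token i.
    """
    if len(text_embeddings) != len(alignment):
        raise ValueError("need exactly one alignment span per token")
    # Copy so the caller's speech embeddings are left untouched.
    mixed = [list(frame) for frame in speech_frames]
    for token_vec, (start, end) in zip(text_embeddings, alignment):
        if not 0 <= start <= end <= len(mixed):
            raise ValueError(f"span ({start}, {end}) is outside the utterance")
        if rng.random() < swap_prob:
            for t in range(start, end):
                mixed[t] = list(token_vec)
    return mixed


if __name__ == "__main__":
    # Toy data: 4 speech frames, 2 tokens, both in a 2-dimensional space.
    speech = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
    text = [[1.0, 1.0], [2.0, 2.0]]
    spans = [(0, 3), (3, 4)]  # token 0 covers frames 0-2, token 1 covers frame 3
    print(modality_switch(speech, text, spans, swap_prob=0.5, rng=random.Random(0)))
```

The sketch assumes both embedding streams already live in the same dimensionality, which is the role the abstract assigns to the shared linear projection. In a real training setup the mixed sequence would be fed to the shared layers so that they see both modalities in one utterance.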

Comments:	submitted to ICASSP 2022
Subjects:	Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2110.12138 [cs.SD]
	(or arXiv:2110.12138v1 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2110.12138

Submission history

From: Wei Wang
[v1] Sat, 23 Oct 2021 04:45:22 UTC (559 KB)
