Joint Unsupervised and Supervised Training for Multilingual ASR

Bai, Junwen; Li, Bo; Zhang, Yu; Bapna, Ankur; Siddhartha, Nikhil; Sim, Khe Chai; Sainath, Tara N.

arXiv:2111.08137 (cs)

[Submitted on 15 Nov 2021]

Abstract: Self-supervised training has shown promising gains in pretraining models and facilitating downstream finetuning for speech recognition tasks such as multilingual ASR. Most existing methods adopt a 2-stage scheme in which the self-supervised loss is optimized in a first pretraining stage and standard supervised finetuning follows in a second stage. In this paper, we propose an end-to-end (E2E) Joint Unsupervised and Supervised Training (JUST) method that combines the supervised RNN-T loss with the self-supervised contrastive and masked language modeling (MLM) losses. We validate its performance on the public dataset Multilingual LibriSpeech (MLS), which includes 8 languages and is extremely imbalanced. On MLS, we explore (1) JUST trained from scratch, and (2) JUST finetuned from a pretrained checkpoint. Experiments show that JUST consistently outperforms other existing state-of-the-art methods and beats the monolingual baseline by a significant margin, demonstrating JUST's capability of handling low-resource languages in multilingual ASR. Our average WER over all languages outperforms the average monolingual baseline by 33.3%, and the state-of-the-art 2-stage XLSR by 32%. On low-resource languages like Polish, our WER is less than half that of the monolingual baseline and even beats the supervised transfer learning method which uses external supervision.
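
The joint objective described in the abstract can be illustrated with a minimal sketch that folds the three loss terms into one scalar for a single training step. The abstract does not say how the terms are balanced, so the weighted-sum form, the `LossWeights` fields, and the function name `joint_loss` are all assumptions made for illustration, not the authors' formulation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LossWeights:
    # Hypothetical weights: the abstract does not specify how the
    # supervised and self-supervised terms are balanced.
    contrastive: float = 1.0
    mlm: float = 1.0
    unsupervised_scale: float = 1.0


def joint_loss(
    rnnt_loss: float,
    contrastive_loss: float,
    mlm_loss: float,
    weights: Optional[LossWeights] = None,
) -> float:
    """Combine one supervised and two self-supervised losses into one scalar.

    Assumed form: rnnt + scale * (w_c * contrastive + w_m * mlm).
    """
    if weights is None:
        weights = LossWeights()
    unsupervised = (
        weights.contrastive * contrastive_loss + weights.mlm * mlm_loss
    )
    return rnnt_loss + weights.unsupervised_scale * unsupervised


# In joint training, all three terms would be computed on the same batch and
# optimized together in a single stage rather than in two separate stages.
total = joint_loss(rnnt_loss=2.0, contrastive_loss=0.5, mlm_loss=0.25)
```

Under this assumed form, setting `unsupervised_scale` to zero reduces the objective to purely supervised RNN-T training, which makes the contrast with a 2-stage scheme easy to see.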

Subjects:	Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2111.08137 [cs.CL]
	(or arXiv:2111.08137v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2111.08137

Submission history

From: Junwen Bai
[v1] Mon, 15 Nov 2021 23:11:24 UTC (174 KB)
