Zero-Shot End-to-End Spoken Language Understanding via Cross-Modal Selective Self-Training

He, Jianfeng; Salazar, Julian; Yao, Kaisheng; Li, Haoqi; Cai, Jinglun

Electrical Engineering and Systems Science > Audio and Speech Processing

arXiv:2305.12793 (eess)

[Submitted on 22 May 2023 (v1), last revised 3 Feb 2024 (this version, v2)]

Abstract: End-to-end (E2E) spoken language understanding (SLU) is constrained by the cost of collecting speech-semantics pairs, especially when label domains change. Hence, we explore zero-shot E2E SLU, which learns E2E SLU without speech-semantics pairs, instead using only speech-text and text-semantics pairs. Previous work achieved zero-shot E2E SLU by pseudolabeling all speech-text transcripts with a natural language understanding (NLU) model learned on text-semantics corpora. However, this method requires the domains of the speech-text and text-semantics corpora to match, and they often do not because the corpora are collected separately. Furthermore, using the entire collected speech-text corpus, drawn from arbitrary domains, leads to imbalance and noise issues. To address these, we propose cross-modal selective self-training (CMSST). CMSST tackles imbalance by clustering in a joint space of the three modalities (speech, text, and semantics) and handles label noise with a selection network. We also introduce two benchmarks for zero-shot E2E SLU, covering matched and found speech (mismatched) settings. Experiments show that CMSST improves performance in both settings, with significantly reduced sample sizes and training time. Our code and data are released at this https URL.
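
The pseudolabeling idea described in the abstract can be illustrated with a toy sketch. This is not the authors' CMSST implementation: `toy_nlu`, `pseudolabel`, and `select_balanced` are hypothetical names, the keyword rules stand in for an NLU model trained on text-semantics pairs, a fixed confidence threshold stands in for the selection network, and a per-label cap stands in for clustering in a joint space. It only shows how speech-text pairs can become filtered speech-semantics training pairs.

```python
from collections import defaultdict


def toy_nlu(text):
    """Hypothetical stand-in for an NLU model trained on text-semantics pairs.

    Returns a (label, confidence) tuple based on simple keyword rules.
    """
    rules = {"play": "play_music", "weather": "get_weather", "alarm": "set_alarm"}
    for keyword, label in rules.items():
        if keyword in text.lower():
            return label, 0.9
    return "unknown", 0.2


def pseudolabel(speech_text_pairs, nlu):
    """Attach a pseudo semantic label and confidence to each (audio, transcript) pair."""
    return [(audio, text, *nlu(text)) for audio, text in speech_text_pairs]


def select_balanced(labeled, min_conf=0.5, per_label=2):
    """Assumed simplification: drop low-confidence items (noise),
    then cap the number of items kept per label (imbalance)."""
    buckets = defaultdict(list)
    for audio, text, label, conf in labeled:
        if conf >= min_conf:
            buckets[label].append((audio, label))
    selected = []
    for label in sorted(buckets):
        selected.extend(buckets[label][:per_label])
    return selected


# Made-up speech-text pairs; the audio file names are placeholders.
pairs = [
    ("a1.wav", "play some jazz"),
    ("a2.wav", "play the next song"),
    ("a3.wav", "play my playlist"),
    ("a4.wav", "what is the weather today"),
    ("a5.wav", "tell me a story"),
]

# Resulting (audio, pseudo-label) pairs that could train an E2E SLU model.
training_set = select_balanced(pseudolabel(pairs, toy_nlu))
print(training_set)
```

In this sketch the out-of-domain utterance is discarded for low confidence and the over-represented label is capped, which mirrors, in a very reduced form, the noise and imbalance issues the abstract names.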

Comments:	18 pages, 7 figures
Subjects:	Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Multimedia (cs.MM); Sound (cs.SD)
Cite as:	arXiv:2305.12793 [eess.AS]
	(or arXiv:2305.12793v2 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2305.12793

Submission history

From: Jianfeng He
[v1] Mon, 22 May 2023 07:42:52 UTC (398 KB)
[v2] Sat, 3 Feb 2024 03:24:46 UTC (208 KB)
