Industry Scale Semi-Supervised Learning for Natural Language Understanding

Chen, Luoxin; Garcia, Francisco; Kumar, Varun; Xie, He; Lu, Jianhua

arXiv:2103.15871 (cs)

[Submitted on 29 Mar 2021]

Authors:Luoxin Chen, Francisco Garcia, Varun Kumar, He Xie, Jianhua Lu

Abstract: This paper presents a production Semi-Supervised Learning (SSL) pipeline based on the student-teacher framework, which leverages millions of unlabeled examples to improve Natural Language Understanding (NLU) tasks. We investigate two questions related to the use of unlabeled data in a production SSL context: 1) how to select samples from a huge unlabeled data pool that are beneficial for SSL training, and 2) how the selected data affect the performance of different state-of-the-art SSL techniques. We compare four widely used SSL techniques, namely Pseudo-Label (PL), Knowledge Distillation (KD), Virtual Adversarial Training (VAT), and Cross-View Training (CVT), in conjunction with two data selection methods: committee-based selection and submodular-optimization-based selection. We further examine the benefits and drawbacks of these techniques when applied to intent classification (IC) and named entity recognition (NER) tasks, and provide guidelines specifying when each of these methods might be beneficial for improving large-scale NLU systems.
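
As a rough illustration of the terms in the abstract, the sketch below shows how a teacher's outputs could become training targets for a student under Pseudo-Label (a hard label) and Knowledge Distillation (a temperature-softened distribution), and one possible form of committee-based selection that ranks unlabeled samples by vote entropy. The abstract does not give these details, so every function name, the temperature value, and the vote-entropy criterion are assumptions made for illustration, not the authors' pipeline.

```python
import math
from typing import Dict, List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax with an optional temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_label(teacher_logits: Sequence[float]) -> int:
    """Pseudo-Label: use the teacher's argmax class as a hard target."""
    return max(range(len(teacher_logits)), key=lambda i: teacher_logits[i])


def distillation_target(teacher_logits: Sequence[float],
                        temperature: float = 2.0) -> List[float]:
    """Knowledge Distillation: use the teacher's softened distribution.

    The default temperature of 2.0 is an arbitrary illustrative value.
    """
    return softmax(teacher_logits, temperature)


def cross_entropy(target_probs: Sequence[float],
                  student_logits: Sequence[float]) -> float:
    """Cross-entropy between a target distribution and the student's prediction."""
    student = softmax(student_logits)
    return -sum(t * math.log(s) for t, s in zip(target_probs, student) if t > 0)


def committee_disagreement(votes: Sequence[int]) -> float:
    """Vote entropy over the labels predicted by committee members."""
    counts: Dict[int, int] = {}
    for vote in votes:
        counts[vote] = counts.get(vote, 0) + 1
    n = len(votes)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def select_by_committee(pool_votes: Sequence[Sequence[int]], k: int) -> List[int]:
    """Return indices of the k samples the committee disagrees on most.

    Assumption: high disagreement marks a useful sample; a real system
    might instead filter by agreement or by confidence.
    """
    ranked = sorted(range(len(pool_votes)),
                    key=lambda i: committee_disagreement(pool_votes[i]),
                    reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    teacher_logits = [0.1, 2.0, -1.0]   # hypothetical teacher output
    student_logits = [0.0, 0.5, 0.0]    # hypothetical student output
    hard = pseudo_label(teacher_logits)
    one_hot = [1.0 if i == hard else 0.0 for i in range(len(teacher_logits))]
    print("PL loss:", cross_entropy(one_hot, student_logits))
    print("KD loss:", cross_entropy(distillation_target(teacher_logits), student_logits))
    print("selected:", select_by_committee([[0, 0, 0], [0, 1, 2], [0, 0, 1]], k=2))
```

In this sketch the only difference between PL and KD is the target handed to the same loss: a one-hot vector versus a soft distribution. VAT, CVT, and submodular selection are omitted because they need a trainable model or a similarity function that the abstract does not describe.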

Comments:	NAACL 2021 Industry track
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2103.15871 [cs.CL]
	(or arXiv:2103.15871v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2103.15871

Submission history

From: Varun Kumar
[v1] Mon, 29 Mar 2021 18:24:02 UTC (504 KB)
