Generate, Annotate, and Learn: NLP with Synthetic Text

He, Xuanli; Nassar, Islam; Kiros, Jamie; Haffari, Gholamreza; Norouzi, Mohammad


arXiv:2106.06168 (cs)

[Submitted on 11 Jun 2021 (v1), last revised 31 May 2022 (this version, v3)]


Abstract: This paper studies the use of language models as a source of synthetic unlabeled text for NLP. We formulate a general framework called "generate, annotate, and learn" (GAL) to take advantage of synthetic text within knowledge distillation, self-training, and few-shot learning applications. To generate high-quality task-specific text, we either fine-tune LMs on inputs from the task of interest, or prompt large LMs with a few examples. We use the best available classifier to annotate synthetic text with soft pseudo labels for knowledge distillation and self-training, and use LMs to obtain hard labels for few-shot learning. We train new supervised models on the combination of labeled and pseudo-labeled data, which results in significant gains across several applications. We investigate key components of GAL and present theoretical and empirical arguments against the use of class-conditional LMs to generate synthetic labeled text instead of unlabeled text. GAL achieves new state-of-the-art knowledge distillation results for 6-layer transformers on the GLUE leaderboard.
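The three steps named in the abstract can be illustrated with a minimal, self-contained sketch. This is not the authors' implementation: the unigram sampler `generate` is a toy stand-in for a fine-tuned or prompted LM, the naive Bayes classifier stands in for the teacher and student models, and every function name and parameter below is hypothetical. It covers only the self-training variant, in which a teacher trained on labeled data produces soft pseudo labels.

```python
import math
import random
from collections import Counter


def generate(task_texts, n_samples, length=4, seed=0):
    """Toy generator: sample words from the task inputs (stand-in for a fine-tuned LM)."""
    rng = random.Random(seed)
    words = [w for text in task_texts for w in text.split()]
    return [" ".join(rng.choice(words) for _ in range(length))
            for _ in range(n_samples)]


def train(examples, n_classes, smoothing=1.0):
    """Fit a naive Bayes model on (text, class_probabilities) pairs.

    Soft labels enter as fractional counts, so one-hot labeled examples
    and pseudo-labeled examples can be mixed in one training set.
    """
    word_counts = [Counter() for _ in range(n_classes)]
    class_mass = [0.0] * n_classes
    vocab = set()
    for text, probs in examples:
        tokens = text.split()
        vocab.update(tokens)
        for c, p in enumerate(probs):
            class_mass[c] += p
            for w in tokens:
                word_counts[c][w] += p
    return {"word_counts": word_counts, "class_mass": class_mass,
            "vocab": vocab, "smoothing": smoothing}


def predict_proba(model, text):
    """Return a soft label: the class posterior under the naive Bayes model."""
    s = model["smoothing"]
    n_classes = len(model["class_mass"])
    total_mass = sum(model["class_mass"])
    vocab_size = max(len(model["vocab"]), 1)
    log_scores = []
    for c in range(n_classes):
        counts = model["word_counts"][c]
        denom = sum(counts.values()) + s * vocab_size
        score = math.log((model["class_mass"][c] + s) / (total_mass + s * n_classes))
        for w in text.split():
            score += math.log((counts[w] + s) / denom)
        log_scores.append(score)
    top = max(log_scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(v - top) for v in log_scores]
    z = sum(exps)
    return [e / z for e in exps]


def gal(labeled, n_classes, n_synthetic, seed=0):
    """One round of generate, annotate, and learn on (text, label) pairs."""
    one_hot = [(t, [1.0 if c == y else 0.0 for c in range(n_classes)])
               for t, y in labeled]
    teacher = train(one_hot, n_classes)                      # classifier used as annotator
    synthetic = generate([t for t, _ in labeled], n_synthetic, seed=seed)  # generate
    pseudo = [(t, predict_proba(teacher, t)) for t in synthetic]           # annotate
    student = train(one_hot + pseudo, n_classes)             # learn on the combination
    return student, pseudo
```

Calling `gal(labeled, n_classes=2, n_synthetic=20)` on a handful of `(text, label)` pairs returns the student model together with the pseudo-labeled synthetic examples. Keeping the annotations as probability vectors rather than argmax labels mirrors the soft pseudo labels that the abstract describes for knowledge distillation and self-training.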

Comments:	Accepted to TACL 2022
Subjects:	Machine Learning (cs.LG)
Cite as:	arXiv:2106.06168 [cs.LG]
	(or arXiv:2106.06168v3 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2106.06168

Submission history

From: Xuanli He
[v1] Fri, 11 Jun 2021 05:01:24 UTC (1,610 KB)
[v2] Thu, 9 Dec 2021 08:49:22 UTC (1,638 KB)
[v3] Tue, 31 May 2022 15:06:16 UTC (1,677 KB)

