Synthetic Data Generation for Grammatical Error Correction with Tagged Corruption Models

Stahlberg, Felix; Kumar, Shankar

arXiv:2105.13318 (cs)

[Submitted on 27 May 2021]

Abstract: Synthetic data generation is widely known to boost the accuracy of neural grammatical error correction (GEC) systems, but existing methods often lack diversity or are too simplistic to generate the broad range of grammatical errors made by human writers. In this work, we use error type tags from automatic annotation tools such as ERRANT to guide synthetic data generation. We compare several models that can produce an ungrammatical sentence given a clean sentence and an error type tag. We use these models to build a new, large synthetic pre-training data set with error tag frequency distributions matching a given development set. Our synthetic data set yields large and consistent gains, improving the state-of-the-art on the BEA-19 and CoNLL-14 test sets. We also show that our approach is particularly effective in adapting a GEC system, trained on mixed native and non-native English, to a native English test set, even surpassing real training data consisting of high-quality sentence pairs.
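The tag-matching idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`tag_distribution`, `assign_tags`, `format_input`), the `<TAG> sentence` input format, and the example tags are all assumptions made for illustration. The sketch estimates the relative frequency of each ERRANT-style tag in a development set, samples one tag per clean sentence from that distribution, and formats the pair as input for a hypothetical tagged corruption model.

```python
import random
from collections import Counter


def tag_distribution(dev_tags):
    """Relative frequency of each error tag observed in a development set."""
    counts = Counter(dev_tags)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}


def assign_tags(clean_sentences, distribution, seed=0):
    """Sample one error tag per clean sentence, following the given distribution."""
    rng = random.Random(seed)  # seeded so the assignment is reproducible
    tags = list(distribution)
    weights = [distribution[tag] for tag in tags]
    return [
        (rng.choices(tags, weights=weights, k=1)[0], sentence)
        for sentence in clean_sentences
    ]


def format_input(tag, sentence):
    """Prefix the tag to the clean sentence (assumed input format, not from the paper)."""
    return f"<{tag}> {sentence}"
```

In this sketch, a corruption model trained on such inputs would be asked to rewrite each clean sentence with an error of the requested type. Because tags are sampled with development-set frequencies, the tag distribution of the resulting synthetic corpus approaches that of the development set as the corpus grows.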

Comments:	Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications, 2021.
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2105.13318 [cs.CL]
	(or arXiv:2105.13318v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2105.13318

Submission history

From: Felix Stahlberg
[v1] Thu, 27 May 2021 17:17:21 UTC (388 KB)
