Error Norm Truncation: Robust Training in the Presence of Data Noise for Text Generation Models

Li, Tianjian; Xu, Haoran; Koehn, Philipp; Khashabi, Daniel; Murray, Kenton


arXiv:2310.00840 (cs)

[Submitted on 2 Oct 2023 (v1), last revised 18 Mar 2024 (this version, v2)]


Abstract: Text generation models are notoriously vulnerable to errors in the training data. As massive amounts of web-crawled data become widely available, how can we enhance the robustness of models trained on large volumes of noisy web-crawled text? We propose Error Norm Truncation (ENT), a robust enhancement to the standard training objective that truncates noisy data. Compared to methods that use only the negative log-likelihood loss to estimate data quality, our method provides a more accurate estimate by considering the distribution of non-target tokens, which previous work often overlooks. Through comprehensive experiments across language modeling, machine translation, and text summarization, we show that equipping text generation models with ENT improves generation quality over standard training and over previous soft and hard truncation methods. Furthermore, we show that our method improves the robustness of models against two of the most detrimental types of noise in machine translation, yielding an increase of more than 2 BLEU points over the MLE baseline when up to 50% of noise is added to the data.
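
The contrast the abstract draws between a loss-only quality estimate and one that also looks at non-target tokens can be illustrated with a small sketch. The abstract does not give a formula, so the error norm is assumed here to be the L2 distance between the predicted distribution and the one-hot target; the function names, the threshold value, and the toy distributions are all hypothetical and are not the authors' implementation.

```python
import math

def nll(probs, target):
    """Negative log-likelihood of the target token."""
    return -math.log(probs[target])

def error_norm(probs, target):
    """Assumed definition: L2 distance between the predicted
    distribution and the one-hot encoding of the target token."""
    return math.sqrt(sum(
        (p - (1.0 if i == target else 0.0)) ** 2
        for i, p in enumerate(probs)
    ))

def truncated_loss(batch, threshold):
    """Mean NLL over tokens whose error norm is below the threshold.
    Tokens at or above the threshold are dropped (hard truncation)."""
    kept = [nll(p, t) for p, t in batch if error_norm(p, t) < threshold]
    return sum(kept) / len(kept) if kept else 0.0

# Both predictions give the target (index 0) probability 0.4, so their
# NLL is identical, but the non-target mass is distributed differently.
spread = [0.4, 0.2, 0.2, 0.2]   # uncertain among the alternatives
peaked = [0.4, 0.6, 0.0, 0.0]   # confident in a different token

batch = [(spread, 0), (peaked, 0)]
print(nll(spread, 0), nll(peaked, 0))                # same NLL
print(error_norm(spread, 0), error_norm(peaked, 0))  # different norms
print(truncated_loss(batch, threshold=0.8))          # only `spread` is kept
```

In this toy case the NLL cannot tell the two tokens apart, while the assumed error norm is larger for the prediction that confidently prefers another token, which is the kind of example a truncation rule would be more likely to discard.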

Comments:	ICLR 2024
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2310.00840 [cs.CL]
	(or arXiv:2310.00840v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2310.00840

Submission history

From: Tianjian Li
[v1] Mon, 2 Oct 2023 01:30:27 UTC (8,458 KB)
[v2] Mon, 18 Mar 2024 19:28:38 UTC (8,061 KB)
