Synthetic Pre-Training Tasks for Neural Machine Translation

He, Zexue; Blackwood, Graeme; Panda, Rameswar; McAuley, Julian; Feris, Rogerio

arXiv:2212.09864v1 (cs)

[Submitted on 19 Dec 2022 (this version), latest version 31 May 2023 (v2)]

Abstract: Pre-training is an effective technique for ensuring robust performance on a variety of machine learning tasks. It typically depends on large-scale crawled corpora that can result in toxic or biased models. Such data can also be problematic with respect to copyright, attribution, and privacy. Pre-training with synthetic tasks and data is a promising way of alleviating such concerns since no real-world information is ingested by the model. Our goal in this paper is to understand what makes for a good pre-trained model when using synthetic resources. We answer this question in the context of neural machine translation by considering two novel approaches to translation model pre-training. Our first approach studies the effect of pre-training on obfuscated data derived from a parallel corpus by mapping words to a vocabulary of 'nonsense' tokens. Our second approach explores the effect of pre-training on procedurally generated synthetic parallel data that does not depend on any real human language corpus. Our empirical evaluation on multiple language pairs shows that, to a surprising degree, the benefits of pre-training can be realized even with obfuscated or purely synthetic parallel data. In our analysis, we consider the extent to which obfuscated and synthetic pre-training techniques can be used to mitigate the issue of hallucinated model toxicity.
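
The obfuscation idea in the abstract (mapping the words of a parallel corpus to a vocabulary of 'nonsense' tokens) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `build_obfuscation_map` and `obfuscate`, the whitespace tokenization, the random lowercase token format, and the toy sentence pairs are all assumptions made here for illustration. The sketch only shows that a consistent one-to-one word mapping hides the surface vocabulary while leaving sentence lengths and word co-occurrence structure intact.

```python
import random
import string


def build_obfuscation_map(sentences, seed=0, token_len=6):
    """Give every distinct whitespace-separated word its own random token.

    Hypothetical helper: the mapping is one-to-one, so repeated words stay
    repeated after obfuscation, but the original spelling is not kept.
    """
    rng = random.Random(seed)
    mapping = {}
    used = set()
    for sentence in sentences:
        for word in sentence.split():
            if word in mapping:
                continue
            token = "".join(rng.choices(string.ascii_lowercase, k=token_len))
            while token in used:  # redraw on the rare collision
                token = "".join(rng.choices(string.ascii_lowercase, k=token_len))
            used.add(token)
            mapping[word] = token
    return mapping


def obfuscate(sentence, mapping):
    """Replace each word of a sentence by its nonsense token."""
    return " ".join(mapping[word] for word in sentence.split())


# Toy parallel data, invented for this example only.
source_sentences = ["the cat sleeps", "the dog sleeps"]
target_sentences = ["die Katze schläft", "der Hund schläft"]

# Separate maps per side (an assumption), with different seeds.
source_map = build_obfuscation_map(source_sentences, seed=1)
target_map = build_obfuscation_map(target_sentences, seed=2)

obfuscated_pairs = [
    (obfuscate(src, source_map), obfuscate(tgt, target_map))
    for src, tgt in zip(source_sentences, target_sentences)
]

for src, tgt in obfuscated_pairs:
    print(src, "|||", tgt)
```

Because the random generator is seeded, the same corpus always yields the same mapping, which keeps the obfuscated training set reproducible. How the paper itself chooses the nonsense vocabulary is not stated in the abstract.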

Comments:	17 pages including appendix, 3 figures
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2212.09864 [cs.CL]
	(or arXiv:2212.09864v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2212.09864

Submission history

From: Zexue He
[v1] Mon, 19 Dec 2022 21:34:00 UTC (554 KB)
[v2] Wed, 31 May 2023 01:34:54 UTC (8,221 KB)
