Investigating Data Contamination for Pre-training Language Models

Jiang, Minhao; Liu, Ken Ziyu; Zhong, Ming; Schaeffer, Rylan; Ouyang, Siru; Han, Jiawei; Koyejo, Sanmi

arXiv:2401.06059 (cs)

[Submitted on 11 Jan 2024]

Abstract: Language models pre-trained on web-scale corpora demonstrate impressive capabilities on diverse downstream tasks. However, there is increasing concern whether such capabilities might arise from evaluation datasets being included in the pre-training corpus (a phenomenon known as data contamination) in a manner that artificially increases performance. There has been little understanding of how this potential contamination might influence LMs' performance on downstream tasks. In this paper, we explore the impact of data contamination at the pre-training stage by pre-training a series of GPT-2 models from scratch. We highlight the effect of both text contamination (i.e., input text of the evaluation samples) and ground-truth contamination (i.e., the prompts asked on the input and the desired outputs) from evaluation data. We also investigate the effects of repeating contamination for various downstream tasks. Additionally, we examine the prevailing n-gram-based definitions of contamination within current LLM reports, pinpointing their limitations and inadequacy. Our findings offer new insights into data contamination's effects on language model capabilities and underscore the need for independent, comprehensive contamination assessments in LLM studies.
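For readers unfamiliar with the n-gram-based definitions the abstract refers to, the following is a minimal, generic sketch of such a check. It is not the paper's method or any specific LLM report's definition: the function names, whitespace tokenization, n-gram length `n`, and `threshold` are all assumptions chosen for illustration.

```python
def ngrams(tokens, n):
    """Return the set of contiguous n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(eval_text, corpus_texts, n=8):
    """Fraction of the evaluation sample's n-grams that also occur in the corpus.

    Tokenization is a plain lowercase whitespace split (an assumption;
    real pipelines typically use a proper tokenizer).
    """
    eval_grams = ngrams(eval_text.lower().split(), n)
    if not eval_grams:
        # Sample is shorter than n tokens: no n-grams to compare.
        return 0.0
    corpus_grams = set()
    for doc in corpus_texts:
        corpus_grams |= ngrams(doc.lower().split(), n)
    return len(eval_grams & corpus_grams) / len(eval_grams)


def is_contaminated(eval_text, corpus_texts, n=8, threshold=0.7):
    """Flag a sample when its n-gram overlap meets a hypothetical threshold."""
    return overlap_fraction(eval_text, corpus_texts, n) >= threshold
```

Under these assumptions, a verbatim excerpt of a corpus document scores an overlap of 1.0, while a paraphrase that shares no n-gram with the corpus scores 0.0. This is one commonly noted weakness of surface-level matching in general, and not necessarily the specific limitation the paper identifies.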

Comments:	16 pages, 5 figures
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2401.06059 [cs.CL]
	(or arXiv:2401.06059v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2401.06059

Submission history

From: Minhao Jiang
[v1] Thu, 11 Jan 2024 17:24:49 UTC (10,484 KB)
