Data Contamination Calibration for Black-box LLMs

Ye, Wentao; Hu, Jiaqi; Li, Liyao; Wang, Haobo; Chen, Gang; Zhao, Junbo

arXiv:2405.11930 (cs)

[Submitted on 20 May 2024 (v1), last revised 3 Jun 2024 (this version, v2)]

Abstract: The rapid advancement of Large Language Models (LLMs) is tightly associated with the expansion of training data size. However, unchecked ultra-large-scale training sets introduce a series of potential risks such as data contamination, i.e., benchmark data being used for training. In this work, we propose a holistic method named Polarized Augment Calibration (PAC), along with a new to-be-released dataset, to detect contaminated data and diminish the contamination effect. PAC extends the popular Membership Inference Attack (MIA) from the machine learning community by forming a more global target for detecting training data, so as to clarify invisible training data. As a pioneering work, PAC is plug-and-play and can be integrated with most (if not all) current white- and black-box LLMs. In extensive experiments, PAC outperforms existing methods by at least 4.5% on data contamination detection across more than 4 dataset formats and more than 10 base LLMs. In addition, our application in real-world scenarios highlights the prominent presence of contamination and related issues.
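
As a rough illustration of the kind of membership-inference signal the abstract refers to, the sketch below scores a text by its lowest token log-probabilities and then calibrates that score against perturbed copies of the same text. This is a generic baseline written for illustration only and is not the PAC algorithm, whose details are not given on this page. The function names, the `k` fraction, and the threshold are all hypothetical, and the per-token log-probabilities are assumed to have been obtained separately from the model or its API.

```python
from statistics import mean


def min_k_score(token_logprobs, k=0.2):
    """Mean of the lowest k fraction of token log-probabilities.

    Hypothetical helper: text seen in training tends to contain fewer
    very unlikely tokens, so a higher value hints at membership.
    """
    if not token_logprobs:
        raise ValueError("token_logprobs must be non-empty")
    n = max(1, int(len(token_logprobs) * k))  # always keep at least one token
    return mean(sorted(token_logprobs)[:n])


def calibrated_score(original_logprobs, perturbed_logprobs_list, k=0.2):
    """Score of the original text minus the mean score of its perturbed copies.

    Assumption: perturbed copies (e.g. lightly edited versions of the text)
    act as a per-sample reference, so the difference is less sensitive to
    how intrinsically easy or hard the text is.
    """
    base = min_k_score(original_logprobs, k)
    reference = mean(min_k_score(lp, k) for lp in perturbed_logprobs_list)
    return base - reference


def is_flagged(score, threshold=0.0):
    """Flag a sample as possibly contaminated (the threshold is an assumption)."""
    return score > threshold


# Toy log-probabilities, invented purely to exercise the functions.
original = [-1.0, -2.0, -3.0, -4.0]
perturbed = [[-2.0, -4.0, -6.0, -8.0]]
score = calibrated_score(original, perturbed, k=0.5)
print(score, is_flagged(score))
```

In practice the threshold would have to be tuned on samples with known membership; the value used here is only a placeholder.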

Subjects:	Machine Learning (cs.LG)
Cite as:	arXiv:2405.11930 [cs.LG]
	(or arXiv:2405.11930v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2405.11930

Submission history

From: Wentao Ye
[v1] Mon, 20 May 2024 10:12:23 UTC (396 KB)
[v2] Mon, 3 Jun 2024 05:21:54 UTC (396 KB)
