Token-wise Influential Training Data Retrieval for Large Language Models

Lin, Huawei; Long, Jikai; Xu, Zhaozhuo; Zhao, Weijie


arXiv:2405.11724v1 (cs)

[Submitted on 20 May 2024 (this version), latest version 22 Oct 2024 (v2)]


Abstract: Given a Large Language Model (LLM) generation, how can we identify which training data led to this generation? In this paper, we propose RapidIn, a scalable framework adapted to LLMs for estimating the influence of each training sample. The proposed framework consists of two stages: caching and retrieval. First, we compress the gradient vectors by over 200,000x, allowing them to be cached on disk or in GPU/CPU memory. Then, given a generation, RapidIn efficiently traverses the cached gradients to estimate the influence within minutes, achieving a speedup of over 6,326x. Moreover, RapidIn supports multi-GPU parallelization to substantially accelerate caching and retrieval. Our empirical results confirm the efficiency and effectiveness of RapidIn.
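
The two-stage idea in the abstract (cache compressed gradients once, then scan the cache for each query) can be illustrated with a minimal sketch. The abstract does not say how RapidIn compresses gradients, so the random sign projection used here is a stand-in assumption, not the authors' method; the function names (`make_projection`, `compress`, `build_cache`, `retrieve`), the dot-product score, and the toy gradient vectors are likewise hypothetical and chosen only for illustration.

```python
import random


def make_projection(dim, k, seed=0):
    """Build a k x dim random sign matrix scaled by 1/sqrt(k).

    Assumption: a generic random projection, used here only as a
    placeholder for whatever compression RapidIn actually applies.
    """
    rng = random.Random(seed)
    scale = 1.0 / k ** 0.5
    return [[rng.choice((-scale, scale)) for _ in range(dim)] for _ in range(k)]


def compress(grad, proj):
    """Map a full gradient vector to a short vector of len(proj) numbers."""
    return [sum(p * g for p, g in zip(row, grad)) for row in proj]


def build_cache(train_grads, proj):
    """Caching stage: keep one compressed vector per training sample."""
    return {sample_id: compress(g, proj) for sample_id, g in train_grads.items()}


def retrieve(cache, query_grad, proj, top_k=3):
    """Retrieval stage: score every cached sample against the query.

    The score is a plain dot product of compressed vectors (an assumed
    influence proxy); results are sorted from highest to lowest score.
    """
    q = compress(query_grad, proj)
    scores = {sid: sum(a * b for a, b in zip(v, q)) for sid, v in cache.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


if __name__ == "__main__":
    # Toy gradients standing in for per-sample LLM gradients.
    dim, k = 32, 4
    proj = make_projection(dim, k, seed=1)
    train_grads = {
        "sample_a": [1.0] * dim,
        "sample_b": [float(i) for i in range(dim)],
        "sample_c": [-1.0] * dim,
    }
    cache = build_cache(train_grads, proj)
    for sample_id, score in retrieve(cache, train_grads["sample_a"], proj):
        print(sample_id, score)
```

In this sketch the expensive step (computing and compressing per-sample gradients) happens once in `build_cache`, and each query only costs one compression plus a pass over short cached vectors, which is the general shape of the caching/retrieval split the abstract describes.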

Comments:	Accepted to ACL 2024
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Information Retrieval (cs.IR)
Cite as:	arXiv:2405.11724 [cs.CL]
	(or arXiv:2405.11724v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2405.11724

Submission history

From: Huawei Lin
[v1] Mon, 20 May 2024 01:57:34 UTC (218 KB)
[v2] Tue, 22 Oct 2024 19:07:08 UTC (218 KB)
