LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement

Lee, Nicholas; Wattanawong, Thanakul; Kim, Sehoon; Mangalam, Karttikeya; Shen, Sheng; Anumanchipalli, Gopala; Mahoney, Michael W.; Keutzer, Kurt; Gholami, Amir


arXiv:2403.15042 (cs)

[Submitted on 22 Mar 2024 (v1), last revised 13 Jul 2024 (this version, v2)]

Abstract: Pretrained large language models (LLMs) are currently state-of-the-art for solving the vast majority of natural language processing tasks. While many real-world applications still require fine-tuning to reach satisfactory levels of performance, many of them are in the low-data regime, making fine-tuning challenging. To address this, we propose LLM2LLM, a targeted and iterative data augmentation strategy that uses a teacher LLM to enhance a small seed dataset by generating additional data that can be used for fine-tuning on a specific task. LLM2LLM (1) fine-tunes a baseline student LLM on the initial seed data, (2) evaluates and extracts data points that the model gets wrong, and (3) uses a teacher LLM to generate synthetic data based on these incorrect data points, which are then added back into the training data. This approach amplifies the signal from data points that the LLM predicts incorrectly during training and reintegrates them into the dataset to focus on more challenging examples for the LLM. Our results show that LLM2LLM significantly enhances the performance of LLMs in the low-data regime, outperforming both traditional fine-tuning and other data augmentation baselines. LLM2LLM reduces the dependence on labor-intensive data curation and paves the way for more scalable and performant LLM solutions, allowing us to tackle data-constrained domains and tasks. We achieve improvements of up to 24.2% on the GSM8K dataset, 32.6% on CaseHOLD, 32.0% on SNIPS, 52.6% on TREC and 39.8% on SST-2 over regular fine-tuning in the low-data regime using a Llama-2-7B student model. Our code is available at this https URL.
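The three steps in the abstract can be read as a simple loop. The sketch below is only an illustration of that loop under stated assumptions, not the authors' released code: the function names (`llm2llm`, `fine_tune`, `teacher_generate`), the exact-match correctness check, the choice to evaluate only on the seed examples, and retraining on the full augmented set at each iteration are all hypothetical. The toy stand-ins at the bottom replace real student and teacher LLMs so that the example runs on its own.

```python
from collections import Counter
from typing import Callable, List, Tuple

Example = Tuple[str, str]        # (prompt, gold answer)
Student = Callable[[str], str]   # maps a prompt to a predicted answer


def llm2llm(
    seed: List[Example],
    fine_tune: Callable[[List[Example]], Student],
    teacher_generate: Callable[[Example], List[Example]],
    n_iters: int = 3,
) -> Tuple[Student, List[Example]]:
    """Grow the training set from examples the student gets wrong."""
    train = list(seed)
    # Step (1): fine-tune the student on the initial seed data.
    student = fine_tune(train)
    for _ in range(n_iters):
        # Step (2): collect the examples the student answers incorrectly.
        # Assumption: only the original seed examples are evaluated, and
        # correctness is exact string match.
        wrong = [(x, y) for x, y in seed if student(x) != y]
        if not wrong:
            break
        # Step (3): ask the teacher for synthetic examples based on each
        # wrong one, and add them back into the training data.
        for example in wrong:
            train.extend(teacher_generate(example))
        # Assumption: the student is retrained on the full augmented set.
        student = fine_tune(train)
    return student, train


# --- Toy stand-ins (hypothetical, for illustration only) ---

def toy_fine_tune(train: List[Example]) -> Student:
    # Pretend the student only "learns" answers seen at least twice.
    counts = Counter(y for _, y in train)
    known = {x: y for x, y in train if counts[y] >= 2}
    return lambda prompt: known.get(prompt, "")


def toy_teacher(example: Example) -> List[Example]:
    # Pretend the teacher rephrases the prompt and keeps the answer.
    prompt, answer = example
    return [(prompt + " (rephrased)", answer)]


seed = [("2+2", "4"), ("1+3", "4"), ("5+5", "10")]
student, train = llm2llm(seed, toy_fine_tune, toy_teacher)
print(len(train), student("5+5"))
```

In this toy run the student initially fails on the one example whose answer appears only once, the teacher adds a single rephrased variant of it, and the loop stops once no seed example is answered incorrectly. In a real setting, `fine_tune` would wrap an actual training run and `teacher_generate` would prompt a stronger model.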

Comments:	ACL 2024
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2403.15042 [cs.CL]
	(or arXiv:2403.15042v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2403.15042

Submission history

From: Nicholas Lee
[v1] Fri, 22 Mar 2024 08:57:07 UTC (209 KB)
[v2] Sat, 13 Jul 2024 07:36:49 UTC (7,081 KB)
