Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization

Kim, Jeonghoon; Lee, Jung Hyun; Kim, Sungdong; Park, Joonsuk; Yoo, Kang Min; Kwon, Se Jung; Lee, Dongsoo

arXiv:2305.14152 (cs)

[Submitted on 23 May 2023 (v1), last revised 28 Oct 2023 (this version, v2)]

Abstract: Large language models (LLMs) face challenges in fine-tuning and deployment due to their high memory demands and computational costs. While parameter-efficient fine-tuning (PEFT) methods aim to reduce the memory usage of the optimizer state during fine-tuning, the inherent size of pre-trained LLM weights continues to be a pressing concern. Although many quantization techniques have been proposed to ease memory demands and accelerate LLM inference, most of them are geared towards the deployment phase. To bridge this gap, this paper presents Parameter-Efficient and Quantization-aware Adaptation (PEQA), a simple yet effective method that combines the advantages of PEFT with quantized LLMs. By updating solely the quantization scales, PEQA can be directly applied to quantized LLMs, ensuring seamless task transitions. Like existing PEFT methods, PEQA significantly reduces the memory overhead associated with the optimizer state. Furthermore, it leverages the advantages of quantization to substantially reduce model sizes. Even after fine-tuning, the quantization structure of a PEQA-tuned LLM remains intact, allowing for accelerated inference at the deployment stage. We employ PEQA-tuning for task-specific adaptation on LLMs with up to 65 billion parameters. To assess the logical reasoning and language comprehension of PEQA-tuned LLMs, we fine-tune low-bit quantized LLMs using an instruction dataset. Our results show that even when LLMs are quantized to below 4-bit precision, their capabilities in language modeling, few-shot in-context learning, and comprehension can be resiliently restored to (or even improved over) their full-precision original performances with PEQA.
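
The abstract's central idea, keeping the integer weights frozen and updating only the quantization scales, can be illustrated with a small toy sketch. This is not the authors' implementation: the per-row asymmetric quantizer, the squared-error loss, the plain gradient-descent loop, and all names (quantize_row, forward, train_scales, task_scales) are assumptions made for illustration only.

```python
# Toy sketch of scale-only fine-tuning on a quantized weight matrix.
# Assumed, not taken from the paper: per-row asymmetric uniform quantization,
# a squared-error loss, and plain SGD. All function names are hypothetical.

def quantize_row(row, bits):
    """Map one weight row to integers in [0, 2**bits - 1], plus a scale and zero-point."""
    qmax = 2 ** bits - 1
    lo, hi = min(row), max(row)
    scale = (hi - lo) / qmax or 1.0  # fall back to 1.0 for a constant row
    zero = round(-lo / scale)
    q = [min(qmax, max(0, round(w / scale) + zero)) for w in row]
    return q, scale, zero

def forward(q_rows, scales, zeros, x):
    """Dequantize-then-multiply: y_i = s_i * sum_j (q_ij - z_i) * x_j."""
    return [s * sum((q - z) * xj for q, xj in zip(row, x))
            for row, s, z in zip(q_rows, scales, zeros)]

def train_scales(q_rows, scales, zeros, data, lr=0.01, steps=200):
    """Minimize squared error by updating only the per-row scales.

    The integer matrix q_rows and the zero-points are never modified.
    """
    scales = list(scales)
    for _ in range(steps):
        for x, target in data:
            out = forward(q_rows, scales, zeros, x)
            for i, row in enumerate(q_rows):
                # d/ds_i of 0.5 * (out_i - t_i)**2
                #   = (out_i - t_i) * sum_j (q_ij - z_i) * x_j
                basis = sum((q - zeros[i]) * xj for q, xj in zip(row, x))
                scales[i] -= lr * (out[i] - target[i]) * basis
    return scales

# A 2x4 weight matrix quantized to 3 bits (below 4-bit, as in the abstract).
W = [[0.5, -0.3, 0.8, 0.1], [-0.6, 0.2, 0.4, -0.9]]
quantized = [quantize_row(r, bits=3) for r in W]
q_rows = [q for q, _, _ in quantized]
scales = [s for _, s, _ in quantized]
zeros = [z for _, _, z in quantized]

# Toy "task": targets are produced by the same integers under different scales,
# so the task is solvable by adjusting scales alone.
task_scales = [0.2, 0.15]
inputs = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0]]
data = [(x, forward(q_rows, task_scales, zeros, x)) for x in inputs]

tuned = train_scales(q_rows, scales, zeros, data)
print("initial scales:", scales)
print("tuned scales:  ", tuned)
```

In this sketch the only trainable values are one scale per row, so the optimizer state grows with the number of rows rather than the number of weights, and the integer matrix is identical before and after tuning. That mirrors, under the assumptions above, the abstract's claims about reduced optimizer memory and an intact quantization structure.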

Comments:	Published at NeurIPS 2023. Camera-ready version
Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2305.14152 [cs.LG]
	(or arXiv:2305.14152v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2305.14152

Submission history

From: Jeonghoon Kim
[v1] Tue, 23 May 2023 15:20:01 UTC (775 KB)
[v2] Sat, 28 Oct 2023 11:53:52 UTC (782 KB)
