GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM

Kang, Hao; Zhang, Qingru; Kundu, Souvik; Jeong, Geonhwa; Liu, Zaoxing; Krishna, Tushar; Zhao, Tuo

arXiv:2403.05527 (cs)

[Submitted on 8 Mar 2024 (v1), last revised 30 Sep 2024 (this version, v4)]

Abstract: Key-value (KV) caching has become the de facto technique for accelerating generation in large language model (LLM) inference. However, the cache grows with sequence length, which has turned LLM inference into a memory-bound problem and significantly constrains system throughput. Existing methods rely on dropping unimportant tokens or quantizing all entries uniformly. Such methods, however, often incur high approximation errors when representing the compressed matrices. The autoregressive decoding process further compounds the error at each step, resulting in critical deviation in model generation and deterioration of performance. To tackle this challenge, we propose GEAR, an efficient KV cache compression framework that achieves near-lossless high-ratio compression. GEAR first quantizes the majority of entries, which have similar magnitudes, to ultra-low precision. It then employs a low-rank matrix to approximate the quantization error and a sparse matrix to remedy individual errors from outlier entries. By integrating these three techniques, GEAR is able to fully exploit their synergistic potential. Our experiments demonstrate that, compared to alternatives, GEAR achieves near-lossless 4-bit KV cache compression with up to 2.38x throughput improvement, while reducing peak memory size by up to 2.29x. Our code is publicly available at this https URL.
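
The three-part decomposition described in the abstract (quantized backbone, low-rank correction of the quantization error, sparse outlier matrix) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `quantize_uniform` and `gear_like_compress`, the per-tensor min/max quantizer, the magnitude-based outlier selection, the truncated SVD, and the default parameters are all assumptions chosen for illustration only.

```python
import numpy as np

def quantize_uniform(x, bits):
    """Per-tensor uniform quantization; returns the dequantized values.
    (Hypothetical helper, assumed for illustration.)"""
    levels = 2 ** bits - 1
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = np.round((x - lo) / scale)  # integer codes in [0, levels]
    return codes * scale + lo

def gear_like_compress(x, bits=4, rank=2, outlier_frac=0.02):
    """Split x into quantized + low-rank + sparse parts (illustrative sketch)."""
    # 1) Sparse part: keep the largest-magnitude entries in full precision.
    k = max(1, int(outlier_frac * x.size))
    thresh = np.partition(np.abs(x).ravel(), -k)[-k]  # k-th largest magnitude
    mask = np.abs(x) >= thresh
    sparse = np.where(mask, x, 0.0)
    dense = np.where(mask, 0.0, x)

    # 2) Quantized part: low-bit quantization of the remaining entries.
    quant = np.where(mask, 0.0, quantize_uniform(dense, bits))

    # 3) Low-rank part: truncated SVD of the quantization residual.
    resid = dense - quant
    u, s, vt = np.linalg.svd(resid, full_matrices=False)
    low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank]

    return quant, low_rank, sparse
```

Under these assumptions, the approximation of a float matrix `x` is `quant + low_rank + sparse`. In a real system the quantized codes, the two thin low-rank factors, and the sparse entries would be stored separately rather than as dense float arrays; the sketch keeps them dense only so that the reconstruction error is easy to inspect.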

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2403.05527 [cs.LG]
	(or arXiv:2403.05527v4 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2403.05527

Submission history

From: Hao Kang
[v1] Fri, 8 Mar 2024 18:48:30 UTC (2,567 KB)
[v2] Mon, 11 Mar 2024 18:55:40 UTC (2,567 KB)
[v3] Thu, 29 Aug 2024 16:48:58 UTC (11,181 KB)
[v4] Mon, 30 Sep 2024 22:44:58 UTC (11,181 KB)
