In-context KV-Cache Eviction for LLMs
via Attention-Gate
Abstract
The KV-Cache technique has become the standard for the inference of large language models (LLMs). It caches states of self-attention to avoid recomputation. Yet, it is widely criticized that KV-Cache can become a bottleneck of the LLM inference system, especially when confronted with ultra-large models and long-context queries. A natural remedy is to discard the KV-Cache for less important tokens, with StreamingLLM (Xiao et al., 2024) as an example, but the used static eviction strategies cannot flexibly adapt to varying contexts. Remedies like H2O (Zhang et al., 2024) leverage accumulative attention scores to perform dynamic eviction but suffer from the attention bias issue in capturing contextual information. This paper bridges this gap by devising a parameterized KV-Cache eviction mechanism, dubbed as Attention-Gate, which accepts the whole context as input and yields eviction flags for each token to realize in-context eviction. The subsequent self-attention module proceeds according to the flags and only the KV states for the remaining tokens need to be cached. The Attention-Gates can vary among different heads and layers and be trivially plugged into pre-trained LLMs, tuned by cost-effective continual pre-training or supervised fine-tuning objectives to acquire what to discard. The computational and memory overhead introduced by Attention-Gates is minimal. Our method is validated across multiple tasks, demonstrating both efficiency and adaptability. After a highly efficient continual pre-training, it achieves higher average accuracy and evicts more tokens compared to traditional training-free methods. In supervised fine-tuning, it not only evicts many tokens but also outperforms LoRA-finetuned LLMs on some datasets, such as RTE, where it improves accuracy by 13.9% while evicting 62.8% of tokens, showing that effective eviction of redundant tokens can even enhance performance.
1 Introduction
Large language models (LLMs) (Dubey et al., 2024; Team et al., 2024; Chiang et al., 2023) have achieved remarkable success across a wide range of natural language processing tasks. A key technique that has enabled efficient LLM inference is KV-Cache, which stores transient attention keys and values to avoid recomputation. However, as the size of LLMs continues to increase and the demand for handling long-context queries grows, the KV-Cache has emerged as a significant bottleneck. Storing attention states for numerous tokens can lead to considerable memory overhead and data transfer among the memory hierarchy results in substantially increased inference time.
Studies have shown that sparsity is a natural phenomenon in attention mechanisms, with many tokens being redundant for inference (Zhang et al., 2024). This suggests that retaining all tokens in the KV-Cache is unnecessary. Existing works have explored this insight to compress KV-Cache using static strategies or hinging on accumulative attention scores. StreamingLLM (Xiao et al., 2024) is a representative of the former by retaining a fixed window of beginning and recent tokens in the KV-Cache but it struggles to flexibly adapt to specific contexts. E.g., in sentiment analysis, retaining the token “cute” in “a cute cat” is crucial, while in object recognition, the token “cat” would be more important. H2O (Zhang et al., 2024), on the other hand, employs a token-adaptive approach, using local accumulative attention scores to determine which tokens to evict. However, it is criticized that in practice, H2O suffers from the attention bias issue (Oren et al., 2024), with a tendency to over-prioritize either the initial or recent tokens.
To overcome these challenges, we introduce a parameterized KV-Cache eviction mechanism named the Attention-Gate (AG), to perform reliable in-context eviction. AG is positioned before a self-attention layer within the model. It takes a sequence of token features as input and generates eviction flags for the tokens, indicating whether a token should be excluded from subsequent self-attention computations. Tokens that are evicted do not require their KV states to be cached. AGs can be seamlessly integrated into pre-trained LLMs and tuned by minimizing the language modeling loss. Ideally, AGs can automatically learn to discern the most relevant tokens for the current context without manual intervention. In practice, we can implement the AG as a self-attention layer with much fewer heads than the original model (e.g., 4 v.s. 32). This way, the parallel computational capabilities of the hardware can be harnessed to minimize the extra overhead introduced by AGs.
AG is empirically shown to enjoy high training efficiency, e.g., only four NVIDIA 4090 GPUs and a dataset of 5,000 samples are required for continual pre-training when applying AGs to LLaMA2-7B (Touvron et al., 2023). This alleviates concerns about the computational overhead related to trainable eviction strategies (Zhang et al., 2024; Chen et al., 2024) and amplifies the performance superiority of our approach over existing training-free approaches. As illustrated in Figure 1, AG generates different eviction strategies across different layers and attention-heads for different tokens, demonstrating its adaptability to the diverse requirements of each component in the model.
To validate the effectiveness of our method, we conduct extensive experiments across multiple benchmarks. After efficient continual pre-training (CPT), our approach outperforms traditional training-free eviction strategies, such as StreamingLLM and H2O, in both accuracy and token eviction rates. In supervised fine-tuning (SFT), our method not only evicts a significant number of redundant tokens but also maintains or surpasses the performance of LoRA-finetuned LLMs. For example, on the RTE dataset, our approach improves accuracy by 13.9% while evicting 62.8% of tokens, demonstrating that selective token eviction can enhance performance. In summary, the Attention-Gate mechanism provides a scalable and efficient solution for KV-Cache management, addressing the limitations of traditional training-free methods.
2 Related Work
As large language models (LLMs) scale in both size and input sequence length, optimizing their efficiency has become increasingly important, particularly in addressing space and time complexity. A significant bottleneck lies in the attention mechanism, which demands considerable computational and memory resources, especially for long sequences.
Computation and memory challenges. The computational cost of attention is driven by its quadratic complexity, while the memory burden stems from the need to store the KV-Cache during inference to avoid recomputation. Although several works have focused on reducing the computational load of attention mechanisms, many fail to address the memory constraints that remain a critical issue. Techniques like Reformer (Kitaev et al., 2020) and FlashAttention (Dao et al., 2022; Dao, 2023) mitigate memory usage for long sequences but still require substantial cache storage. On the other hand, approaches such as MQA (Shazeer, 2019) and GQA (Ainslie et al., 2023) prioritize memory compression without tackling computational complexity. These solutions tend to focus on either memory or computational efficiency, rarely addressing both simultaneously.
KV-Cache eviction strategies. To address both memory and computational challenges, KV-Cache eviction has emerged as an effective strategy. Existing approaches can be categorized into static and adaptive methods based on accumulative attention scores.
Static strategies, such as those used in Sparse Transformers (Child et al., 2019), employ fixed pruning patterns, such as Strided and Fixed Attention. While effective in some cases, these approaches are not adaptive to specific contexts, often sacrificing accuracy. StreamingLLM (Xiao et al., 2024) tackles the Attention Sink phenomenon, where attention scores concentrate on initial tokens, by retaining these tokens along with a fixed window of recent tokens. While this improves performance, static approaches generally lack the flexibility needed to adapt to different tokens, attention-heads, or layers.
Strategies using accumulative attention scores offer more flexibility by dynamically identifying important tokens. For instance, SpAtten (Wang et al., 2021) employs Accumulative Attention Scores (A2S), which sum the softmax outputs for each token to measure its importance. This approach allows selective token pruning in subsequent layers, effectively reducing computational complexity without the need for retraining. H2O (Zhang et al., 2024) extends this concept to decoder-based models, using local A2S statistics for adaptive eviction in autoregressive generation. However, H2O suffers from the attention bias issue (Oren et al., 2024), particularly in long-context inputs. Several follow-up works have aimed to address this limitation. NACL (Chen et al., 2024) introduces random eviction to mitigate attention bias, while A2SF (Jo & Shin, 2024) incorporates a Forgetting Factor. However, none of these approaches fully resolve the underlying problem.
More Adaptive strategies. Although strategies based on accumulative attention scores provide more flexibility than static methods, they still have notable limitations. For instance, H2O (Zhang et al., 2024) applies the same token eviction ratio across all attention heads, restricting the adaptability of the method. FastGen (Ge et al., 2023), on the other hand, introduces a different approach by hybridizing KV-Cache compression policies and applying adaptive strategies to each attention head. However, it focuses on the decoding stage and neglects the importance of the prefilling stage. Learnable eviction strategies, on the other hand, offer greater flexibility by enabling different layers and attention heads to adopt heterogeneous eviction policies. However, such strategies have been relatively underexplored, likely due to concerns about the computational overhead they may introduce (Zhang et al., 2024; Chen et al., 2024). Nonetheless, task-specific training is essential for optimizing performance across different contexts. For example, a recent approach (Anagnostidis et al., 2024) introduces a learnable mechanism for dropping uninformative tokens, but it faces difficulties in batched generation and does not account for continual pre-training or decoding-only LLMs. Despite these challenges, learnable strategies have strong potential to improve performance across a variety of tasks by allowing models to adapt their eviction strategies to meet task-specific requirements.
3 Method
This section first briefly reveals the basics of multi-head attention and KV-Cache and then describes the proposed Attention-Gate (AG) mechanism for in-context KV-Cache eviction for LLM inference acceleration. An illustrative overview of AG is presented in Figure 2.
3.1 Preliminary
Multi-Head Attention (MHA) (Vaswani et al., 2017) is a core component of the Transformer architecture, widely used in most LLMs. MHA enables the model to capture dependencies across different tokens in a sequence by computing self-attention across multiple attention-heads. Each head attends to different parts of the input sequence independently, allowing the model to capture various aspects of the relationships between tokens. The outputs of these attention heads are then concatenated and projected through a final output matrix.
KV-Cache is employed to store the key and value representations of tokens from previous time steps during inference. This prevents redundant recomputation of these representations for every new token generated, significantly speeding up auto-regressive generation. The inference process in auto-regressive transformers, such as Llama (Touvron et al., 2023), can be divided into two stages: prefilling and decoding. 1. In the prefilling stage, the model processes the entire input sequence, generating key-value pairs for each attention-head in each layer. These key-value pairs are stored in the KV-Cache for future reuse. 2. In the decoding stage, when generating the next token, the model only computes the key and value for that new token and appends them to the cached key-value pairs. This reduces the computational load, as the previously cached key-value pairs do not need to be recomputed. The process continues token by token until the sequence generation is complete.
KV-Cache plays a critical role in improving the efficiency of LLM inference, especially in scenarios where long sequences are processed. However, the size of the KV-Cache grows with the input sequence length, leading to substantial memory overhead. Efficiently managing this cache while maintaining model performance has become a key challenge in scaling LLMs to longer contexts.
3.2 Limitations of Traditional Eviction Strategies
Lack of flexibility. Flexibility in KV-Cache eviction strategies spans across various dimensions, including token-specific, attention-head-specific, layer-specific, task-specific, and model-specific adaptability. Static eviction policies offer no such flexibility, requiring manual intervention to adjust them. While methods based on accumulative attention scores (e.g., H2O (Zhang et al., 2024)) have improved upon static strategies by introducing token-level and head-level adaptability, they are still limited. Specifically, in H2O, the eviction ratio remains uniform across all attention heads, restricting head-level flexibility. Flexibility is crucial in eviction strategies because models process varying types of data across different contexts, requiring the ability to selectively retain important tokens while discarding others. Without this adaptability, the model may inefficiently store redundant information, increasing memory usage and slowing down inference times.
Absence of global statistics. Eviction strategies that rely on accumulative attention scores, such as H2O, base their calculations on local statistics, which often results in attention bias issues (Oren et al., 2024). This local approach may misjudge a token’s importance, especially when the context spans long sequences. While several methods (Chen et al., 2024; Jo & Shin, 2024) have proposed solutions to mitigate this bias, they are still fundamentally rooted in local statistics, addressing the issue only partially. A more effective solution would incorporate global statistics to ensure that the eviction process is based on a more comprehensive understanding of the context, thereby reducing biases and improving the accuracy of token retention.
Inefficiency. Efficiency is another challenge for traditional methods such as H2O and FastGen (Ge et al., 2023). These methods operates in a token-by-token manner for each decoding step, sequentially deciding which key-value pairs to evict from the cache. Additionally, H2O determines which tokens to evict after computing the attention scores, meaning some computation is wasted on tokens that are later discarded. This sequential approach hinders efficiency, especially when handling large batches of data or when processing long sequences.
Our Attention-Gate mechanism is designed to address the aforementioned limitations.
3.3 Attention-Gate
The Attention-Gate (AG) is a lightweight, trainable module designed to determine which tokens in the KV-Cache of each attention-head should be retained or discardedAG is positioned before the Multi-Head Attention (MHA) layer and operates by generating eviction flags for each token, guiding both the computation of attention scores and the management of the KV-Cache.
Input and output. AG takes the hidden states as input and generates binary flags as output, one for each token in every attention head. These flags indicate whether the corresponding token’s key-value pairs should be stored or discarded in the KV-Cache. Specifically, the input is a hidden state matrix , where represents the sequence length, and denotes the dimensionality of the hidden states. The output consists of independent binary decisions for the retention or eviction of each token’s key-value pairs across all attention heads in the MHA module.
Local or global? The structure of AG offers two choices: using either local or global information to guide the eviction policy. 1. Local information: The simplest approach leverages a linear layer where each token only considers its own hidden state without taking other tokens in the sequence into account. This method is computationally efficient and straightforward, making it appealing for its simplicity. 2. Global information: On the other hand, a more comprehensive approach adopts an attention-like structure, where tokens can aggregate information from across the entire sequence. This enables the eviction policy to reflect the global in-context information, making decisions that are informed by the broader context. Naturally, this method is computationally heavier than the local approach, but it offers a more accurate reflection of the sequence’s token importance.
In our design, we opted for the attention-like structure because global context is essential for making effective eviction decisions. As demonstrated in Section 4.4, using an attention-like structure significantly outperforms the linear layer, as the latter struggles to properly evict tokens. This outcome is expected, as determining whether a token is redundant or crucial for inference often requires a global understanding of the sequence. A local approach, like a linear layer focusing on individual tokens, lacks the broader perspective necessary to make accurate decisions about which tokens are important. This finding also underscores the limitations of methods like H2O, which rely on local statistics rather than global ones, as mentioned in Section 3.2.
Softmax or Sigmoid? After gathering information through the attention-like structure, the next step is to compute the eviction probability for each token. Common activation functions such as Softmax and Sigmoid can be used to generate these probabilities.
In our case, we choose the Sigmoid function for a key reason: while Softmax normalizes probabilities across the entire sequence, introducing competition between tokens, Sigmoid treats each token independently. This independence is crucial because we do not want a token’s likelihood of being evicted to depend on the eviction probabilities of other tokens in the sequence.
Eviction flags. After computing the eviction probabilities for each token, we apply a threshold . If the probability of retaining a token exceeds , that token is retained in the KV-Cache; otherwise, it is discarded. In the attention matrix, the columns corresponding to evicted tokens are masked out. However, the diagonal elements, where a token attends to itself, are always preserved. This guarantees that each token continues to interact with its own key.
Prefilling or decoding? AG is primarily applied during the prefilling stage, where the full sequence is available. By making eviction decisions before the MHA layers, AG effectively manages the KV-Cache during this phase. In contrast, AG is not used during the decoding stage to avoid adding complexity to the inference process. Using AG in decoding could slow down inference and increase the cache memory footprint, as additional keys and values from the attention-like structure would need to be stored.
In summary, AG dynamically determines which tokens should be retained, optimizing both KV-Cache usage and attention score computation by utilizing an attention-like structure, Sigmoid activation function, and a threshold-based decision for token eviction.
3.4 Training Implementation
To train the Attention-Gate (AG) effectively, this section outlines the key components of the training process. For more comprehensive details, please refer to Section 4 and Appendix A.
Eviction Loss. To encourage the eviction of unnecessary tokens, we introduce a dedicated loss function called the Eviction Loss. This loss encourages the model to discard as many tokens as possible, which is defined as:
(1) |
where is the number of layers, and represents the average output of the AG module in layer . In this formula, adjusts the intensity of KV-Cache eviction, while ensures that eviction does not become overly aggressive.This loss function works alongside the auto-regressive loss to balance token eviction with maintaining model performance. The loss allows for adaptive eviction across layers and attention-heads. As shown in Figure 1, after training, the model learns to increase token eviction in deeper layers, which is not achievable with training-free methods like H2O (Zhang et al., 2024) or StreamingLLM (Xiao et al., 2024).
Initialization. We initialize the AG parameters using Xavier initialization (Glorot & Bengio, 2010) to provide a stable starting point for learning. Additionally, a small constant is added inside Sigmoid, ensuring that the initial retention probabilities are close to 1. This encourages the model to retain most tokens early in training, allowing it to learn which tokens are important more gradually.
Handling non-differentiability. Directly applying the threshold-based gating mechanism from Section 3.3 would lead to non-differentiable gradients during training due to the hard thresholding’s discrete nature. To resolve this, we employ the Straight-Through Estimator (STE) (Yin et al., 2019), which allows gradients to flow through discrete decisions by approximating them during the backward pass. Specifically, during backpropagation, instead of using the hard 0 or 1 values obtained from comparing against the threshold, we utilize the smooth output of the Sigmoid function. This approach ensures smooth gradients and enables effective training of the AG while preserving its binary behavior during the forward pass.
4 Experiments
This section consists of three main parts. First, we evaluate the performance of AG in two scenarios: continual pre-training (CPT) and supervised fine-tuning (SFT). For CPT, we compare AG with classic methods on performance and eviction rates (Table 2). For SFT, we benchmark AG against vanilla LoRA fine-tuning to highlight its task-specific adaptability (Section 4.2). Second, we provide visualization of selected examples to demonstrate the core characteristics of AG (Section 4.3). Finally, we conduct ablation studies to provide further insights into the effectiveness of AG (Section 4.4).
4.1 Continual Pre-training
Metric | PIQA | ARC-C | ARC-E | RTE | COPA | BoolQ | HellaSwag | Avg. | |
Llama2-7B | Acc. | 76.33 | 37.29 | 51.32 | 51.99 | 62.00 | 69.94 | 68.16 | 59.58 |
Local | Acc. | 69.97 | 31.86 | 48.68 | 51.99 | 60.00 | 57.86 | 37.08 | 51.06 |
StreamingLLM | Acc. | 72.69 | 33.22 | 51.15 | 50.18 | 63.00 | 62.05 | 40.85 | 53.31 |
H2O | Acc. | 75.9 | 33.22 | 52.03 | 52.71 | 47.00 | 67.37 | 66.32 | 56.36 |
Ours | Acc. | 76.17 | 33.90 | 49.03 | 52.35 | 63.00 | 67.52 | 66.33 | 58.33 |
%Eviction | 54.29 | 51.03 | 51.05 | 46.70 | 40.02 | 57.75 | 52.16 | 51.87 |
Metric | PIQA | ARC-C | RTE | COPA | BoolQ | OBQA | Avg. | |
---|---|---|---|---|---|---|---|---|
Fine-tuned Llama2-7B | Acc. | 82.92 | 60.34 | 64.98 | 92.00 | 88.10 | 78.80 | 77.86 |
Ours | Acc. | 82.15 | 59.66 | 64.26 | 93.00 | 86.82 | 78.80 | 77.45 |
%Eviction | 66.16 | 48.31 | 65.47 | 45.40 | 67.46 | 67.17 | 60.00 | |
Ours | Acc. | 81.50 | 57.63 | 74.01 | 95.00 | 87.00 | 79.20 | 79.06 |
%Eviction | 64.96 | 36.45 | 62.80 | 34.77 | 67.31 | 66.65 | 55.49 |
4.1.1 Setup
Models and datasets. We use Llama2-7B (Touvron et al., 2023) as our primary base model due to its strong performance and popularity within the AI community. Additionally, we validate the feasibility of our approach on Mistral-7B (Jiang et al., 2023), with results provided in Table 4. For continual pre-training, we select a subset of the Redpajama dataset (Computer, 2023) containing approximately 5,000 samples 111We sampled 4,997 samples proportionally from each subset of RedPajama.. To assess the effectiveness of our method, we evaluate it on seven widely recognized benchmarks: PIQA (Bisk et al., 2020), ARC-C (Clark et al., 2018), ARC-E (Clark et al., 2018), RTE (Bar-Haim et al., 2006), COPA (Roemmele et al., 2011), BoolQ (Clark et al., 2019), and HellaSwag (Zellers et al., 2019). All evaluations are conducted in a zero-shot setting, with accuracy assessed using OpenCompass (Contributors, 2023).
Training details. During continual pre-training (CPT), only the AG modules are learnable, while the rest of the model’s parameters are frozen. For the Llama2-7B model, the hyperparameters from Section 3.4 are used: , , , and . The model was trained for one epoch.
Metrics and Baselines. The metrics considered include accuracy and the mean eviction ratio for all KV-Cache. The accuracy of vanilla Llama2-7B on these datasets is used as the upper bound, with the goal of minimizing accuracy loss while maximizing the mean eviction ratio. For KV-Cache eviction strategies, we select a local method (retaining only recent tokens) and StreamingLLM (Xiao et al., 2024) as representatives of static strategies, while H2O (Zhang et al., 2024) serves as a representative of methods based accumulative attention scores. For all these methods, eviction ratio is set to 50%.
4.1.2 Results
The results in Table 1 demonstrate the effectiveness of our method in balancing accuracy and KV-Cache eviction. Our method consistently outperforms the baseline strategies in terms of accuracy across most tasks while achieving a higher mean eviction ratio. This indicates that our method is able to evict more tokens on average without significantly compromising accuracy.
In terms of accuracy, our method achieves competitive performance, closely matching the performance of vanilla Llama2-7B in most cases. For instance, on PIQA, COPA, and BoolQ, our method performs comparably to Llama2-7B, demonstrating minimal accuracy degradation. Compared to baselines, our method shows superior performance, particularly in tasks like BoolQ and PIQA, where both accuracy and eviction ratio surpass the baselines. This confirms the advantage of learnable mechanism for more efficient token retention strategies without sacrificing model performance.
Moreover, it is worth highlighting that our method achieves these results with minimal computational overhead, as mentioned in Section 4.1.1. The continual pre-training was conducted on only 5,000 samples and trained for just one epoch, demonstrating the lightweight nature of our approach. This efficiency can be attributed to the fact that our method does not need to learn new knowledge from scratch but rather focuses on learning effective token retention strategies, leveraging the existing capabilities of the pre-trained model.
4.2 Supervised Fine-tuning
4.2.1 Setup
Model and tasks. We utilize Llama2-7B (Touvron et al., 2023) as base model. To evaluate our approach, we select six widely recognized downstream tasks: PIQA (Bisk et al., 2020), ARC-C (Clark et al., 2018), RTE (Bar-Haim et al., 2006), COPA (Roemmele et al., 2011), BoolQ (Clark et al., 2019), and OpenBookQA (Mihaylov et al., 2018). For each task, we fine-tune the model using the respective training set and evaluate its performance on the corresponding test set.
Baselines and Implementation Details. For the baseline, we apply LoRA (Hu et al., 2021) to fine-tune the models on each task, targeting the weights in the self-attention modules. In our method, in addition to applying LoRA to these parameters, the AG modules are also set to be learnable. As mentioned in Section 3.4, the following hyperparameters are used: , , or , and . We employ the AdamW optimizer (Loshchilov & Hutter, 2017) with a learning rate of 5e-5, and train for two epochs on each dataset.
4.2.2 Results
As shown in Table 2, our method demonstrates a strong balance between accuracy and KV-Cache eviction across the six downstream tasks.
With , our method maintains competitive accuracy compared to the fine-tuned Llama2-7B baseline, with minimal drops (e.g., 82.15% vs. 82.92% on PIQA), while achieving a high mean eviction ratio of 60.00%. With , the eviction rate is reduced to 55.49%. But in some tasks like RTE and COPA, our method even surpasses the baseline accuracy. The average accuracy also exceeds the baseline (79.06% vs. 77.86%), suggesting that effectively evicting redundant tokens allows the model to focus on relevant information and improves performance.
Additionally, under the same hyperparameter settings, the performance varies across tasks. For instance, ARC-C is more challenging to evict compared to OpenBookQA, leading to a larger accuracy drop post-eviction. This highlights the importance of task-specific KV-Cache eviction policies.
4.3 Visualization
Metric | PIQA | ARC-C | RTE | COPA | BoolQ | OpenBookQA | Avg. | |
---|---|---|---|---|---|---|---|---|
(1) | Acc. | 82.15 | 59.66 | 64.26 | 93.00 | 86.82 | 78.80 | 77.45 |
%Eviction | 66.16 | 48.31 | 65.47 | 45.40 | 67.46 | 67.17 | 60.00 | |
(2-1) | Acc. | 81.88 | 57.63 | 65.70 | 91.00 | 87.52 | 77.40 | 76.86 |
%Eviction | 63.92 | 36.38 | 62.73 | 24.38 | 65.22 | 63.57 | 52.70 | |
(2-2) | Acc. | 82.15 | 53.90 | 62.45 | 89.00 | 87.31 | 77.40 | 75.37 |
%Eviction | 58.97 | 31.47 | 59.77 | 20.32 | 63.02 | 59.17 | 48.79 | |
(3-1) | Acc. | 81.45 | 53.36 | 58.84 | 88.00 | 86.73 | 78.40 | 74.46 |
%Eviction | 61.75 | 33.55 | 61.34 | 19.24 | 64.59 | 59.59 | 50.01 | |
(3-2) | Acc. | 83.03 | 53.90 | 59.93 | 89.00 | 87.16 | 76.40 | 74.90 |
%Eviction | 58.68 | 24.23 | 32.23 | 12.28 | 59.40 | 55.54 | 40.39 | |
(4-1) | Acc. | 81.66 | 55.25 | 66.06 | 88.00 | 86.85 | 78.00 | 75.97 |
%Eviction | 49.52 | 36.92 | 46.85 | 28.74 | 56.02 | 60.32 | 46.40 | |
(4-2) | Acc. | 82.75 | 55.93 | 79.06 | 82.00 | 86.33 | 78.40 | 77.41 |
%Eviction | 53.31 | 44.38 | 51.20 | 47.95 | 61.98 | 61.73 | 53.43 | |
(5) | Acc. | 82.54 | 54.58 | 57.40 | 81.00 | 87.71 | 74.80 | 73.01 |
%Eviction | 1.06 | 0.46 | 0.81 | 0.26 | 1.38 | 1.16 | 0.86 |
In this section, we present visualizations to highlight key characteristics of our AG mechanism. After fine-tuning on specific tasks, we visualize the model’s MHA and AG weights for selected samples, as shown in Figure 3. Additional visualizations can be found in Section A.2.
4.4 Ablation
In this ablation study, we explore the effects of various configurations on the AG mechanism, focusing on the number of AG heads, AG head dimensions, and eviction strategies. For details of each setting and the corresponding performance, refer to Table 3. Key findings include the impact of reducing the number of heads, as shown in (2-1) and (2-2), and the reduction of head dimensions in (3-1) and (3-2), both of which result in lower accuracy and eviction ratios. In (4-1) and (4-2), we examine eviction strategies where the current layer’s eviction is based on the previous layer’s hidden states and AG module, which introduces parallelism but slightly reduces performance. Finally, replacing the AG with a linear layer in (5) demonstrates the necessity of using an attention-like structure for effective token eviction. For more ablation, please refer to Table 5.
5 Conclusion
In conclusion, the proposed Attention-Gate mechanism offers a flexible and adaptive solution to KV-Cache eviction in large language models. By dynamically identifying and discarding less important tokens, Attention-Gate addresses the limitations of static and attention-score-based strategies, providing efficient context-aware eviction. This mechanism integrates seamlessly with pre-trained models and can be easily tuned, making it a practical and effective method for enhancing both performance and memory efficiency in various tasks.
References
- Ainslie et al. (2023) Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023.
- Anagnostidis et al. (2024) Sotiris Anagnostidis, Dario Pavllo, Luca Biggio, Lorenzo Noci, Aurelien Lucchi, and Thomas Hofmann. Dynamic context pruning for efficient and interpretable autoregressive transformers. Advances in Neural Information Processing Systems, 36, 2024.
- Bar-Haim et al. (2006) Roy Bar-Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, and Idan Szpektor. The second pascal recognising textual entailment challenge. In Proceedings of the second PASCAL challenges workshop on recognising textual entailment, volume 1. Citeseer, 2006.
- Bisk et al. (2020) Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language. In Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020.
- Chen et al. (2024) Yilong Chen, Guoxia Wang, Junyuan Shang, Shiyao Cui, Zhenyu Zhang, Tingwen Liu, Shuohuan Wang, Yu Sun, Dianhai Yu, and Hua Wu. Nacl: A general and effective kv cache eviction framework for llms at inference time. arXiv preprint arXiv:2408.03675, 2024.
- Chiang et al. (2023) Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
- Child et al. (2019) Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.
- Clark et al. (2019) Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of NAACL-HLT 2019, 2019.
- Clark et al. (2018) Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv:1803.05457v1, 2018.
- Computer (2023) Together Computer. Redpajama: An open source recipe to reproduce llama training dataset, April 2023. URL https://github.com/togethercomputer/RedPajama-Data.
- Contributors (2023) OpenCompass Contributors. Opencompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass, 2023.
- Dao (2023) Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023.
- Dao et al. (2022) Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022.
- Dubey et al. (2024) Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- Ge et al. (2023) Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao. Model tells you what to discard: Adaptive kv cache compression for llms. arXiv preprint arXiv:2310.01801, 2023.
- Glorot & Bengio (2010) Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249–256. JMLR Workshop and Conference Proceedings, 2010.
- Hu et al. (2021) Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
- Jiang et al. (2023) Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
- Jo & Shin (2024) Hyun Rae Jo and Dong Kun Shin. A2sf: Accumulative attention scoring with forgetting factor for token pruning in transformer decoder. arXiv preprint arXiv:2407.20485, 2024.
- Kitaev et al. (2020) Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020.
- Loshchilov & Hutter (2017) Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2017.
- Mihaylov et al. (2018) Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In EMNLP, 2018.
- Oren et al. (2024) Matanel Oren, Michael Hassid, Yossi Adi, and Roy Schwartz. Transformers are multi-state rnns. arXiv preprint arXiv:2401.06104, 2024.
- Roemmele et al. (2011) Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series, 2011.
- Shazeer (2019) Noam Shazeer. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150, 2019.
- Team et al. (2024) Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024.
- Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
- Wang et al. (2021) Hanrui Wang, Zhekai Zhang, and Song Han. Spatten: Efficient sparse attention architecture with cascade token and head pruning. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 97–110. IEEE, 2021.
- Xiao et al. (2024) Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks, 2024. URL https://arxiv.org/abs/2309.17453.
- Yin et al. (2019) Penghang Yin, Jiancheng Lyu, Shuai Zhang, Stanley Osher, Yingyong Qi, and Jack Xin. Understanding straight-through estimator in training activation quantized neural nets. arXiv preprint arXiv:1903.05662, 2019.
- Zellers et al. (2019) Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.
- Zhang et al. (2024) Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. H2o: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36, 2024.
Appendix A Appendix
A.1 Results of Continual Pre-training on Mistral
We conducted continual pre-training on Mistral using 5,000 samples from RedPajama (Computer, 2023), and the results are shown in Table 4. Compared to the performance of Llama2-7B presented in Table 1, Mistral’s performance slightly declined. We hypothesize that this may be due to the distribution of RedPajama’s data being less suited to Mistral. Additionally, this raises the question of whether KV-Cache eviction is model-dependent, and whether its effectiveness is related to the model’s expressive power. Although the parameter counts of Mistral-7B and Llama2-7B are similar, Mistral-7B significantly outperforms Llama2-7B. This could suggest that Mistral is utilizing more tokens or scoring them with finer granularity, which results in fewer redundant tokens and thus makes eviction less effective.
Metric | PIQA | ARC-C | ARC-E | RTE | COPA | BoolQ | HellaSwag | Avg. | |
---|---|---|---|---|---|---|---|---|---|
Mistral-7B | Acc. | 80.09 | 42.37 | 63.14 | 48.01 | 76 | 64.22 | 73.02 | 63.84 |
Ours | Acc. | 75.9 | 34.24 | 55.2 | 48.01 | 65 | 62.2 | 67.91 | 58.35 |
Eviction | 37.14 | 39.48 | 37.80 | 40.93 | 45.27 | 44.68 | 50.92 | 42.32 |
A.2 More Visualization
A.3 More Ablation
Metric | PIQA | ARC-C | RTE | COPA | BoolQ | OpenBookQA | Avg. | |
---|---|---|---|---|---|---|---|---|
(1) | Acc. | 82.15 | 59.66 | 64.26 | 93.00 | 86.82 | 78.80 | 77.45 |
%Eviction | 66.16 | 48.31 | 65.47 | 45.40 | 67.46 | 67.17 | 60.00 | |
(6-1) | Acc. | 83.08 | 50.85 | 65.34 | 82.00 | 87.31 | 73.20 | 73.63 |
%Eviction | 65.15 | 40.96 | 64.29 | 21.37 | 67.49 | 63.69 | 53.83 | |
(6-2) | Acc. | 81.61 | 53.56 | 60.29 | 82.00 | 87.37 | 74.20 | 73.17 |
%Eviction | 65.66 | 44.48 | 65.14 | 24.28 | 68.18 | 63.44 | 55.20 |
In this section, we present additional ablation results in Table 5.