Sparse-Tuning: Adapting Vision Transformers with Efficient Fine-tuning and Inference

Ting Liu¹ Xuyang Liu²∗ Liangtao Shi³ Zunnan Xu⁴
Siteng Huang⁵ Yi Xin⁶ Quanjun Yin¹

¹National University of Defense Technology ²Sichuan University ³Hefei University of Technology
⁴Tsinghua University ⁵Zhejiang University ⁶Nanjing University
liuting20@nudt.edu.cn liuxuyang@stu.scu.edu.cn ∗Equal contribution.

Abstract

Parameter-efficient fine-tuning (PEFT) has emerged as a popular approach for adapting pre-trained Vision Transformer (ViT) models to downstream applications. While current PEFT methods achieve parameter efficiency, they overlook GPU memory and time efficiency during both fine-tuning and inference, due to the repeated computation of redundant tokens in the ViT architecture. This falls short of practical requirements for downstream task adaptation. In this paper, we propose Sparse-Tuning, a novel tuning paradigm that substantially enhances both fine-tuning and inference efficiency for pre-trained ViT models. Sparse-Tuning efficiently fine-tunes the pre-trained ViT by sparsely preserving the informative tokens and merging redundant ones, enabling the ViT to focus on the foreground while reducing computational costs on background regions in the images. To accurately distinguish informative tokens from uninformative ones, we introduce a tailored Dense Adapter, which establishes dense connections across different encoder layers in the ViT, thereby enhancing the representational capacity and quality of token sparsification. Empirical results on VTAB-1K, three complete image datasets, and two complete video datasets demonstrate that Sparse-Tuning reduces the GFLOPs to 62%-70% of the original ViT-B while achieving state-of-the-art performance. Source code is available at https://github.com/liuting20/Sparse-Tuning.

Figure 1: Comparisons of Sparse-Tuning with other mainstream PEFT methods on CIFAR-100. Sparse-Tuning enhances performance while remarkably reducing training/inference time, GPU memory consumption, and computational cost, achieving both fine-tuning and inference efficiency.

1 Introduction

Large-scale Vision Transformer (ViT) models [10, 41, 40, 48, 30] have demonstrated strong generalization capabilities across a wide range of downstream vision tasks. The prevalent approach to adapt these models for specific tasks follows the pretrain-then-finetune paradigm, where models are initially pre-trained on large-scale datasets [8, 38, 32] and then fine-tuned for each downstream task. However, as these pre-trained ViT models continue to scale up [57, 7], fully fine-tuning them becomes more computationally intensive. Additionally, there are risks of catastrophic forgetting and overfitting when fine-tuning on limited downstream datasets [59, 24, 9].

Recently, various parameter-efficient fine-tuning (PEFT) methods [21, 34, 22, 25, 5] have been proposed to address the high computational costs and risks associated with fully fine-tuning large models. By updating additional parameters inserted into the model [21, 5] or appended to the input data [34, 25], PEFT methods can achieve similar or even better performance compared to full fine-tuning. Current PEFT methods [29, 26, 12, 44] focus primarily on efficiently fine-tuning a pre-trained ViT for downstream visual tasks. However, despite achieving fine-tuning efficiency, they fall short in inference efficiency due to the repeated computation of redundant tokens in the Multi-Head Attention (MHA) and Feed-Forward Network (FFN) blocks, failing to meet practical requirements.

Inspired by the model acceleration methods for ViT [49, 51, 36, 37], we aim to reduce redundant tokens to enhance efficiency during both the fine-tuning and inference stages. Since most acceleration methods progressively prune tokens across encoder layers in ViT, we consider establishing interactions between shallow encoder layers, which capture local information, and deep encoder layers, which capture global information. By supplementing global information with local details, we aim to accurately discriminate between informative and uninformative tokens. To this end, in this paper, we introduce Sparse-Tuning, a novel tuning paradigm that achieves efficient fine-tuning and inference for ViT adaptation. Sparse-Tuning efficiently fine-tunes the pre-trained ViT by sparsely adapting the tokens to enable the model to focus on informative regions, thereby enhancing both computational and memory efficiency during the fine-tuning and inference stages. Specifically, we design a novel parameter-free Token Sparsification method that allows the pre-trained ViT to progressively preserve informative tokens while integrating uninformative ones into a representative token, thereby reducing redundant computational costs. Moreover, to mitigate information loss from Token Sparsification and efficiently fine-tune the pre-trained ViT, we propose Dense Adapters, which take multiple features from different encoder layers as inputs to establish dense connections between multiple Token Sparsification steps. With these non-trivial designs, Sparse-Tuning enhances performance while reducing training and inference time, GPU memory consumption, and computational cost for ViT adaptation, compared to most PEFT methods [21, 22, 5, 25, 60], as shown in Figure 1.

To fully evaluate the generalization capability, we conduct extensive experiments on the common PEFT benchmark VTAB-1K [58], three complete image datasets: CIFAR-100 [33], SVHN [13], and Food-101 [2], as well as two complete video datasets: Kinetics-400 (K400) [4] and Something-Something V2 (SSv2) [14]. Empirical results on VTAB-1K demonstrate that with only 11.65 GFLOPs, approximately 66% of the computational cost of the original ViT-B, Sparse-Tuning outperforms all state-of-the-art methods in performance, fine-tuning and inference efficiency. Moreover, Sparse-Tuning achieves superior performance in both image and video recognition on complete datasets while significantly improving both fine-tuning and inference efficiency.

2 Related Work

2.1 Parameter-efficient Fine-tuning

With the trend of scaling up Vision Transformers (ViT) [10, 18, 57, 7] for enhanced performance and generalization, adapting the entire model to downstream tasks becomes increasingly computationally expensive. To mitigate this, parameter-efficient fine-tuning (PEFT) [21, 34, 22, 25, 55, 54] has emerged as a strategic approach. PEFT methods update only a small subset of additional parameters while keeping the majority of the pre-trained model frozen, thereby mitigating the risks of catastrophic forgetting and overfitting. Most PEFT methods designed for Transformers [50] can be classified into three types: (1) Partially Tuning [31, 56, 11, 17], which updates only a small subset of inherent parameters while freezing most original parameters. (2) Prompt Tuning [34, 25, 24, 53], which appends a fixed number of learnable tokens (i.e., prompts) to the input data and updates only the prompts during fine-tuning. (3) Adapter Tuning [21, 5, 52, 39], which updates only the additional parameters in modules inserted into the model (i.e., adapters) during fine-tuning.

While most PEFT methods improve parameter efficiency during the fine-tuning stage of ViT, they often introduce new parameters that hinder inference efficiency. Reparameterization methods, such as LoRA [22] and FacT [29], introduce learnable parameters that can be integrated into the original model through reparameterization during inference. Therefore, these methods can maintain the original model’s inference efficiency. However, current PEFT methods fail to enhance inference efficiency while achieving parameter-efficient fine-tuning, which does not meet the practical needs for adapting the large-size ViT (e.g., ViT-L [10]). In this paper, we aim to enhance the efficiency of both the fine-tuning and inference processes for the pre-trained ViT.

2.2 Model Acceleration for ViT

Recently, numerous works have explored accelerating ViT inference [15, 49, 36, 6, 37], with most of them aiming to reduce the token redundancy in ViT to decrease computational complexity. For instance, DynamicViT [49] efficiently sparsifies ViT by pruning less informative tokens identified through prediction modules. DVT [51] enhances computational efficiency by automatically adapting the token count for each input image. SuperViT [37] handles diverse patch sizes with a single model, adaptively adjusting token retention during inference.

Existing ViT acceleration methods generally require either fine-tuning all pre-trained parameters [49, 51], or training models from scratch [36, 37]. Consequently, these approaches necessitate substantial training or fine-tuning time to adapt ViT to downstream vision tasks. Recently, Dynamic Tuning (DyT) [60] keeps the pre-trained ViT parameters frozen, updating only the adapters and token dispatchers to enhance parameter efficiency and reduce redundant computation during inference. Unlike DyT, which directly skips the uninformative tokens, our method consolidates these tokens into a single representative token to retain visual features beneficial for classification. Additionally, unlike DyT, which necessitates computing all tokens to update the parameters of the proposed token dispatchers, our approach does not introduce any additional modules for token sparsification. Our method efficiently fine-tunes the pre-trained ViT by selectively adapting the tokens, thereby improving efficiency during both the fine-tuning and inference stages.

3 Method

In this section, we present our proposed Sparse-Tuning in detail. First, we briefly review the preliminaries of Vision Transformer and Adapter Tuning in Section 3.1. Next, we provide a general introduction to the overall framework of Sparse-Tuning in Section 3.2. Following this, we elaborate on the core techniques of Sparse-Tuning: Token Sparsification and Dense Adapter.

3.1 Preliminaries

Vision Transformers (ViTs) [10] basically consist of a patch embedding layer and a stack of transformer encoder layers. The patch embedding layer first splits and flattens an input image $\mathit{x}\in\mathbb{R}^{H\times W\times 3}$ into a sequence of patches $\mathit{x}_{p}\in\mathbb{R}^{N\times(P^{2}\cdot C)}$ , where $(H,W)$ denotes the size of the input image, $(P,P)$ represents the size of each image patch, $C$ is the number of channels, and $N={H\cdot W}/{P^{2}}$ is the number of image tokens. The patches $\mathit{x}_{p}$ , prepended with a learnable [CLS] token, are fed into a stack of transformer encoder layers, each of which includes a Multi-Head Attention (MHA) block and a Feed-Forward Network (FFN). In MHA, the tokens are linearly projected and packed into three vectors, namely $\bm{Q}$ , $\bm{K}$ , and $\bm{V}$ . The self-attention operation can be written as:

$$
{\rm Attention}(\bm{Q},\bm{K},\bm{V})={\rm Softmax}\left(\frac{\bm{Q}\bm{K}^{\top}}{\sqrt{d}}\right)\bm{V}, \tag{1}
$$

Here, ${\rm Softmax}(\frac{\bm{Q}\bm{K}^{\top}}{\sqrt{d}})$ is the attention map and $d$ is the feature dimension of the queries and keys. The row of $\bm{Q}\bm{K}^{\top}$ corresponding to the [CLS] query gives the attention from the [CLS] token to all tokens and reflects the importance of each token. Subsequently, the output tokens are sent to a Layer Normalization (LayerNorm) [1] and an FFN, which consists of two fully-connected layers with a GELU activation function [20] in between. After the tokens are processed by the stack of encoder layers, the [CLS] token is extracted and used for classification.
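As a minimal sketch of Eq. (1), the NumPy snippet below computes single-head attention on toy inputs and reads off the [CLS] row of the attention map. The sizes, the random inputs, and the helper names `softmax` and `self_attention` are illustrative assumptions and are not taken from the authors' code.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the row maximum for numerical stability.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    """Single-head scaled dot-product attention; returns output and attention map."""
    d = Q.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d))  # (tokens, tokens), each row sums to 1
    return attn @ V, attn

rng = np.random.default_rng(0)
num_tokens, d = 5, 8  # toy sizes: 1 [CLS] token followed by 4 patch tokens
Q, K, V = (rng.standard_normal((num_tokens, d)) for _ in range(3))
out, attn = self_attention(Q, K, V)

# Row 0 is the [CLS] query; its entries over the patch tokens can serve as
# per-token importance scores.
cls_scores = attn[0, 1:]
print(out.shape, cls_scores.round(3))
```

In a multi-head layer the same computation runs once per head, so one such [CLS] row is available for every head.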

Adapter Tuning is a prevalent strategy for efficient fine-tuning of ViT [5, 45, 27], typically involving the insertion of an MLP in parallel with the FFN. The adapters consist of a down-projection layer $\mathbf{W}_{\text{down}}$ , ReLU non-linear activation, and an up-projection layer $\mathbf{W}_{\text{up}}$ in sequence. Given the input feature $x$ , the function of a standard adapter can be formally expressed as:

$$
{\rm Adapter}(x)=x+s\cdot{\rm ReLU}(x\mathbf{W}_{\text{down}})\mathbf{W}_{\text{up}}, \tag{2}
$$

where $s$ denotes the scaling factor. Unlike the standard adapter, in this paper, we introduce the Dense Adapter, which receives multiple adapted features from different encoder layers to establish connections across the encoder layers in ViT.
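The standard adapter of Eq. (2) can be sketched in a few lines of NumPy. The hidden width of 768 (ViT-B), the 197-token input, the bottleneck width of 32, and the zero initialization of the up-projection are assumptions made for illustration; biases are omitted because Eq. (2) does not include them.

```python
import numpy as np

def adapter(x, W_down, W_up, s=1.0):
    """Standard adapter: x + s * ReLU(x W_down) W_up."""
    hidden = np.maximum(x @ W_down, 0.0)  # down-projection followed by ReLU
    return x + s * (hidden @ W_up)        # up-projection, scaling, residual

rng = np.random.default_rng(0)
D, d = 768, 32                            # assumed hidden and bottleneck widths
x = rng.standard_normal((197, D))         # assumed: [CLS] + 196 patch tokens
W_down = rng.standard_normal((D, d)) * 0.01
W_up = np.zeros((d, D))                   # assumed zero init: adapter starts as identity

y = adapter(x, W_down, W_up, s=1.0)
trainable = W_down.size + W_up.size       # weights only, no biases
print(y.shape, trainable)
```

With a zero up-projection the adapter initially returns its input unchanged, so fine-tuning starts from the behavior of the frozen backbone.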

3.2 Sparse-Tuning for Efficient ViT adaptation

Existing works [49, 51, 36] have demonstrated that the final prediction in ViT largely depends on a subset of the most informative tokens. Dynamic Tuning (DyT) [60] keeps the pre-trained parameters frozen and updates the adapters with the proposed token dispatcher to distinguish and discard uninformative tokens. This design can improve the inference speed of ViT but suffers from two main shortcomings: (1) Inefficient fine-tuning, as DyT requires gradients to backpropagate through all tokens to update the parameters of the proposed token dispatcher, thus leading to low efficiency in terms of GPU memory consumption and fine-tuning speed. (2) Information loss, as the token dispatcher directly removes the inactivated tokens, which can lead to a direct loss of information, thereby deteriorating the classification accuracy.

Motivated by the above analysis, we introduce Sparse-Tuning with Dense Adapters, which efficiently fine-tunes a pre-trained ViT by selectively adapting tokens to focus on informative regions, enhancing efficiency during both the fine-tuning and inference stages. As shown in Figure 2, the overall framework includes two parts: (1) a pre-trained ViT-B/16 [10] that consists of a patch embedding layer and 12 transformer encoder layers with our carefully designed Token Sparsification process, and (2) our Dense Adapters. During fine-tuning, we freeze the pre-trained ViT and only update a series of Dense Adapters to facilitate efficient adaptation to downstream tasks. In the 4th, 7th, and 10th encoder layers (see the analysis in Table 6), we implement Token Sparsification (see Figure 3) to enable the ViT to focus more on the informative tokens and reduce redundant computation cost.

Token Sparsification. The main idea of Sparse-Tuning is to decrease the computation load on uninformative tokens, which in turn reduces the computational cost for both ViT and Dense Adapters during fine-tuning and inference, thereby improving overall efficiency and speed. A natural question arises: how can the informative tokens be distinguished from tokens with less information? Previous works [3, 47, 16] have demonstrated the strong relationship between the [CLS] token and the class-specific tokens. In other words, the attention score between the [CLS] token and another token reflects the contribution of that token to the classification. Consequently, tokens that exhibit higher/lower attention scores with the [CLS] token contain more/less semantic information for classification, and thus can be viewed as the attentive/inattentive tokens. Though the inattentive tokens show lower attention scores, they may still influence the classification results in some cases, such as the prediction of large objects which cover large regions of the image. To this end, unlike DyT [60], our Token Sparsification progressively preserves the attentive tokens and merges the inattentive ones into one token during fine-tuning and inference to reduce the computation cost. Specifically, as shown in Figure 2, we calculate the average attention scores of all heads in MHA, preserve the tokens corresponding to the k largest (i.e., top-k) scores (attentive tokens), and fuse the remaining tokens (inattentive tokens) by weighted average into a representative token to supplement the attentive ones. With this design, Sparse-Tuning allows a pre-trained ViT to concentrate on the most informative regions while compressing the uninformative ones, consequently lowering redundant computational costs during both fine-tuning and inference. Furthermore, by integrating the inattentive tokens rather than dropping them, Sparse-Tuning mitigates information loss resulting from Token Sparsification.
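The Token Sparsification step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `token_sparsify`, the rounding rule for k, the use of the [CLS] attention scores as fusion weights, and keeping the attentive tokens in their original order are all choices made here for concreteness.

```python
import numpy as np

def token_sparsify(tokens, attn, keep_rate=0.7):
    """tokens: (1+N, D) with [CLS] first; attn: (heads, 1+N, 1+N) attention maps."""
    # Head-averaged attention from [CLS] (row 0) to each of the N patch tokens.
    cls_scores = attn.mean(axis=0)[0, 1:]
    n = cls_scores.shape[0]
    k = min(n, max(1, round(keep_rate * n)))   # assumed rounding rule
    order = np.argsort(-cls_scores)            # indices sorted by descending score
    keep, drop = np.sort(order[:k]), order[k:]
    patches = tokens[1:]
    parts = [tokens[:1], patches[keep]]        # [CLS] + attentive tokens
    if drop.size > 0:
        w = cls_scores[drop] / cls_scores[drop].sum()
        fused = (w[:, None] * patches[drop]).sum(axis=0, keepdims=True)
        parts.append(fused)                    # one representative token
    return np.concatenate(parts, axis=0), keep

rng = np.random.default_rng(0)
heads, n_patches, dim = 3, 10, 4               # toy sizes
tokens = rng.standard_normal((1 + n_patches, dim))
logits = rng.standard_normal((heads, 1 + n_patches, 1 + n_patches))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

sparse_tokens, kept = token_sparsify(tokens, attn, keep_rate=0.7)
print(sparse_tokens.shape)  # (9, 4): [CLS] + 7 attentive tokens + 1 fused token
```

No learnable parameters are involved: the selection reuses attention scores that the MHA block computes anyway, which is consistent with the parameter-free design described in the text.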

Dense Adapter. To further alleviate the information loss caused by Token Sparsification and efficiently adapt the pre-trained ViT for downstream tasks, we consider utilizing the adapter-tuning method. Most current adapter-tuning methods for ViT [5, 26] follow the basic residual connection approach from ResNet [19], which can only establish connections between two adjacent ViT encoder layers, greatly limiting the propagation of adapted features during fine-tuning. The transition from local features to global features across encoder layers in ViT affects the effectiveness of Token Sparsification. Given that Token Sparsification occurs across encoder layers in ViT, we introduce the Dense Adapter (DA), inspired by DenseNet [23], to establish dense connections across multiple encoder layers. As shown in Figure 2 (right), unlike the standard adapter [21], DA takes multiple features from different encoder layers as inputs to establish interactions between multiple Token Sparsification steps, thereby compensating for the information loss caused by Token Sparsification.

Depending on its position, DA consists of one to three down-projection layers (i.e., $\mathbf{W}_{\text{down}_{1}}$ , $\mathbf{W}_{\text{down}_{2}}$ , $\mathbf{W}_{\text{down}_{3}}$ ), ReLU non-linear activation, and an up-projection layer $\mathbf{W}_{\text{up}}$ . Specifically, we denote the N-th DA as $\text{DA}_{\text{N}}$ . The output of $\text{DA}_{\text{N}}$ can be formulated as:

$$
x_{\text{DA}_{\text{N}}}=\begin{cases}
{\rm ReLU}\left(x_{\text{MHA}_{\text{N}}}\mathbf{W}_{\text{down}_{1}}\right)\mathbf{W}_{\text{up}} & \text{if } \text{N}=1\\
{\rm ReLU}\left(x_{\text{MHA}_{\text{N}}}\mathbf{W}_{\text{down}_{1}}+x_{\text{DA}_{\text{N-1}}}\mathbf{W}_{\text{down}_{2}}\right)\mathbf{W}_{\text{up}} & \text{if } 1<\text{N}\leq 3\\
{\rm ReLU}\left(x_{\text{MHA}_{\text{N}}}\mathbf{W}_{\text{down}_{1}}+x_{\text{DA}_{\text{N-1}}}\mathbf{W}_{\text{down}_{2}}+x_{\text{DA}_{\text{N-3}}}\mathbf{W}_{\text{down}_{3}}\right)\mathbf{W}_{\text{up}} & \text{if } 3<\text{N}\leq 12
\end{cases}, \tag{3}
$$

where $x_{\text{DA}_{\text{N}}}$ and $x_{\text{MHA}_{\text{N}}}$ represent the outputs of the $\text{DA}_{\text{N}}$ and MHA at the N-th encoder layer, respectively. It is noteworthy that when $3<N\leq 12$ , the $x_{\text{DA}_{\text{N-1}}}$ and $x_{\text{DA}_{\text{N-3}}}$ may be sparsified by the corresponding Token Sparsification step to ensure consistency in token length for $x_{\text{DA}_{\text{N}}}$ , $x_{\text{DA}_{\text{N-1}}}$ , and $x_{\text{DA}_{\text{N-3}}}$ . Dense Adapters facilitate multiple interactions between the lower and higher layers of the ViT encoder, thereby enhancing the representational capacity and quality of Token Sparsification.
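The three cases of Eq. (3) can be sketched as a single function whose optional inputs are simply absent in the early layers. The toy sizes, the random stand-ins for the MHA outputs, and the names `dense_adapter` and `init` are assumptions for illustration. The token count is held fixed here, so the alignment of earlier DA outputs after Token Sparsification is not modeled.

```python
import numpy as np

def dense_adapter(x_mha, W_up, W_down1, x_da_prev1=None, W_down2=None,
                  x_da_prev3=None, W_down3=None):
    """Output of one Dense Adapter, following the three cases of Eq. (3)."""
    h = x_mha @ W_down1
    if x_da_prev1 is not None:        # N > 1: add the projected output of DA_{N-1}
        h = h + x_da_prev1 @ W_down2
    if x_da_prev3 is not None:        # N > 3: add the projected output of DA_{N-3}
        h = h + x_da_prev3 @ W_down3
    return np.maximum(h, 0.0) @ W_up  # ReLU in the bottleneck, then up-projection

rng = np.random.default_rng(0)
T, D, d = 6, 16, 4                    # toy sizes: tokens, hidden width, bottleneck

def init(rows, cols):
    return rng.standard_normal((rows, cols)) * 0.1

da_out = {}
for n in range(1, 13):
    x_mha = rng.standard_normal((T, D))  # stand-in for the MHA output at layer n
    # da_out.get returns None for layers that do not exist yet, which selects
    # the matching case of Eq. (3); unused projection matrices are ignored.
    da_out[n] = dense_adapter(
        x_mha, init(d, D), init(D, d),
        x_da_prev1=da_out.get(n - 1), W_down2=init(D, d),
        x_da_prev3=da_out.get(n - 3), W_down3=init(D, d),
    )
print({n: out.shape for n, out in da_out.items()})
```

Because all inputs are summed inside the bottleneck, every layer needs only one up-projection regardless of how many earlier features it receives.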

4 Experiments

4.1 Experimental Setup

Datasets. We compare our Sparse-Tuning with other state-of-the-art methods on the common PEFT benchmark VTAB-1K [58] to evaluate the adaptation performance when the training data is limited. For each downstream classification task, the training data in VTAB-1K [58] is extremely scarce, comprising only 1,000 training samples. Thus, following [5, 60], we conduct experiments on three complete image datasets: CIFAR-100 [33], SVHN [13], and Food-101 [2], as well as two complete video datasets: Kinetics-400 (K400) [4] and Something-Something V2 (SSv2) [14], to further evaluate the adaptation performance and efficiency of our Sparse-Tuning.

Implementation Details. We utilize the ViT-Base (ViT-B/16) model [10] as our backbone, which is pre-trained on the ImageNet21K dataset [8] under full supervision. The bottleneck dimension $d$ of our Dense Adapter is set to 32 by default, and we reduce $d$ to 8 on VTAB-1K, following most existing works [5, 60]. The scaling factor $s$ is set to 1. We set the keeping rate $r$ of attentive tokens to 0.7 by default, unless otherwise specified. We adhere to the same training schedule as reported in [5, 60]. For all the downstream tasks, we employ top-1 accuracy as the primary evaluation metric. We conduct all experiments on an A800 GPU. More details are provided in Appendix C.

Table 1: Comparison to state-of-the-art PEFT methods on VTAB-1K with ViT-B/16. Group Mean: the average accuracy of the three subgroups. Params.: the number of learnable parameters excluding the final classification layer. GFLOPs: the average GFLOPs across all datasets. $r$ denotes the keeping rate of attentive (activated) tokens. We highlight the best and the second-best results.

	Natural							Specialized				Structured
	CIFAR-100	Caltech101	DTD	Flowers102	Pets	SVHN	Sun397	Camelyon	EuroSAT	Resisc45	Retinopathy	Clevr-Count	Clevr-Dist	DMLab	KITTI-Dist	dSpr-Loc	dSpr-Ori	sNORB-Azim	sNORB-Elev	Group Mean	Params. (M) $\downarrow$	GFLOPs $\downarrow$
Traditional fine-tuning
Full fine-tuning	68.9	87.7	64.3	97.2	86.9	87.4	38.8	79.7	95.7	84.2	73.9	56.3	58.6	41.7	65.5	57.5	46.7	25.7	29.1	68.96	85.84	17.58
Linear	63.4	85.0	63.2	97.0	86.3	36.6	51.0	78.5	87.5	68.6	74.0	34.3	30.6	33.2	55.4	12.5	20.0	9.6	19.2	57.64	0	17.58
Parameter-efficient fine-tuning
Adapter [21]	69.2	90.1	68.0	98.8	89.9	82.8	54.3	84.0	94.9	81.9	75.5	80.9	65.3	48.6	78.3	74.8	48.5	29.9	41.6	73.85	0.16	17.61
BitFit [56]	72.8	87.0	59.2	97.5	85.3	59.9	51.4	78.7	91.6	72.9	69.8	61.5	55.6	32.4	55.9	66.6	40.0	15.7	25.1	65.21	0.10	17.58
LoRA [22]	67.1	91.4	69.4	98.8	90.4	85.3	54.0	84.9	95.3	84.4	73.6	82.9	69.2	49.8	78.5	75.7	47.1	31.0	44.0	74.60	0.29	17.58
VPT [25]	78.8	90.8	65.8	98.0	88.3	78.1	49.6	81.8	96.1	83.4	68.4	68.5	60.0	46.5	72.8	73.6	47.9	32.9	37.8	71.96	0.53	18.30
SSF [35]	69.0	92.6	75.1	99.4	91.8	90.2	52.9	87.4	95.9	87.4	75.5	75.9	62.3	53.3	80.6	77.3	54.9	29.5	37.9	75.69	0.20	17.58
NOAH [59]	69.6	92.7	70.2	99.1	90.4	86.1	53.7	84.4	95.4	83.9	75.8	82.8	68.9	49.9	81.7	81.8	48.3	32.8	44.2	75.48	0.36	17.58
ConvPass [27]	72.3	91.2	72.2	99.2	90.9	91.3	54.9	84.2	96.1	85.3	75.6	82.3	67.9	51.3	80.0	85.9	53.1	36.4	44.4	76.56	0.33	17.64
AdaptFormer [5]	70.8	91.2	70.5	99.1	90.9	86.6	54.8	83.0	95.8	84.4	76.3	81.9	64.3	49.3	80.3	76.3	45.7	31.7	41.1	74.75	0.16	17.61
FacT [29]	71.3	89.6	70.7	98.9	91.0	87.8	54.6	85.2	95.5	83.4	75.7	82.0	69.0	49.8	80.0	79.2	48.4	34.2	41.4	75.30	0.04	17.58
Res-Tuning [26]	75.2	92.7	71.9	99.3	91.9	86.7	58.5	86.7	95.6	85.0	74.6	80.2	63.6	50.6	80.2	85.4	55.7	31.9	42.0	76.32	0.51	17.67
DyT $r=0.5$ [60]	70.4	94.2	71.1	99.1	91.7	88.0	51.5	87.1	95.3	84.2	75.8	79.2	61.8	51.0	82.4	79.7	52.3	35.3	44.5	75.73	0.16	12.54
DyT $r=0.7$ [60]	73.9	94.9	72.1	99.4	91.8	88.4	55.5	87.2	95.6	86.2	75.9	80.3	61.8	51.7	83.1	81.6	53.7	35.3	45.2	76.69	0.16	14.92
DyT $r=0.9$ [60]	74.0	95.1	72.9	99.3	91.7	87.6	56.9	87.7	95.7	85.4	76.1	81.6	63.2	50.1	83.0	83.3	52.0	34.5	44.5	76.74	0.16	17.07
The proposed Sparse-Tuning
Sparse-Tuning $r=0.5$	70.6	94.6	71.5	99.3	91.9	88.5	51.9	87.6	95.7	84.7	75.6	79.9	62.3	51.9	82.7	80.1	52.9	35.8	44.7	76.54	0.32	8.94
Sparse-Tuning $r=0.7$	74.2	95.1	72.5	99.6	92.2	90.3	55.8	87.7	96.3	86.7	76.2	81.7	62.6	52.6	83.8	82.3	55.3	36.9	45.8	77.71	0.32	11.65
Sparse-Tuning $r=0.9$	74.8	95.5	73.2	99.4	91.7	88.1	58.7	88.2	96.4	85.8	76.4	82.9	64.7	50.7	83.4	83.9	53.7	35.2	45.2	77.92	0.32	15.62

4.2 Main Results

Comparisons on VTAB-1K. The comparison results with state-of-the-art (SOTA) PEFT methods on VTAB-1K [58] are presented in Table 1, from which we can observe that: (1) Sparse-Tuning outperforms all SOTA PEFT methods. Sparse-Tuning achieves a 1.18% improvement in Group Mean accuracy (77.92 vs. 76.74) compared with the previous best model DyT [60]. (2) Sparse-Tuning largely improves inference efficiency. With only 11.65 GFLOPs, about 66% of the computational cost of the original ViT-B, Sparse-Tuning with keeping rate $r=0.7$ outperforms all state-of-the-art methods in terms of both performance and inference efficiency. (3) Sparse-Tuning performs better as the keeping rate $r$ increases, yet even Sparse-Tuning with $r=0.5$ outperforms recent strong methods such as Res-Tuning [26] and FacT [29], which validates the effectiveness and efficiency of our Sparse-Tuning.

Table 2: Results on complete image and video datasets. Avg.: the mean value derived from the corresponding results across various image and video datasets. Params.: the number of learnable parameters excluding the final classification layer. The GFLOPs are evaluated on CIFAR-100 and K400. DyT ${\dagger}$ $N=4$ represents DyT with four experts.

Method	Params. (M) $\downarrow$	Image GFLOPs $\downarrow$	CIFAR-100	SVHN	Food-101	Image Avg.	Video GFLOPs $\downarrow$	K400	SSv2	Video Avg.
Traditional fine-tuning
Full fine-tuning	85.80	17.58	90.91	97.29	90.69	92.69	142.53	75.48	45.22	60.35
Linear	0	17.58	85.87	56.29	88.07	76.74	142.53	69.04	27.64	48.34
Parameter-efficient fine-tuning
Adapter [21]	1.19	17.81	91.76	96.88	89.91	92.76	144.39	74.72	44.58	59.75
AdaptFormer [5]	1.19	17.81	92.03	97.23	90.84	93.36	144.39	75.53	45.36	60.45
LoRA [22]	1.19	17.58	91.42	97.36	90.48	93.08	142.53	75.48	45.62	60.55
VPT [25]	0.07	18.32	91.64	95.72	90.41	92.59	148.44	73.46	38.17	55.82
DyT [60]	1.19	12.21	91.37	97.08	90.32	92.92	108.31	74.39	45.34	59.87
DyT ${\dagger}$ $N=4$ [60]	4.80	12.29	91.01	96.90	89.77	92.56	105.45	75.00	46.56	60.78
The proposed Sparse-Tuning
Sparse-Tuning	1.10	11.70	92.31	97.47	90.72	93.50	99.8	75.55	46.67	61.11

Comparisons on Complete Datasets. We conduct experiments on complete image and video datasets to evaluate the adaptation performance with abundant training data. The results are shown in Table 2, from which we find that: (1) Sparse-Tuning outperforms all baseline methods on both image and video datasets, demonstrating its strong transferability on complete datasets. (2) Sparse-Tuning demonstrates exceptional inference efficiency on both image and video datasets. Particularly on video datasets, Sparse-Tuning reduces the computational complexity of the original ViT-B by around 30%, highlighting its strong efficiency in video applications. With only 1.10M updated parameters, our Sparse-Tuning achieves superior performance in image and video recognition, while significantly improving inference efficiency.

Table 3: Ablation on different components of Sparse-Tuning. Without any component of Sparse-Tuning, it can be viewed as freezing the pre-trained ViT, and only fine-tuning the final classification layer. Params.: learnable parameters excluding the final classification layer.

#	Token Sparsification	Dense Adapters	Params. (M) $\downarrow$	GFLOPs $\downarrow$	CIFAR-100	SVHN	Food-101	Avg.
(a)			0	17.58	85.87	56.29	88.07	76.74
(b)	✓		0	11.81	76.59	48.81	78.50	67.97
(c)		✓	1.10	17.89	92.66	97.93	91.34	93.98
(d)	✓	✓	1.10	11.70	92.31	97.47	90.72	93.50

4.3 Ablation Studies

In this subsection, we first analyze the effectiveness of Token Sparsification and Dense Adapters. We then provide an in-depth analysis of the feature inputs and their fusion methods in our Dense Adapters. Subsequently, we investigate the impact of different positions of Token Sparsification to achieve optimal performance. Finally, we verify the effectiveness of Sparse-Tuning when the pre-trained ViT is scaled up. We conduct all ablation studies on three complete image datasets.

Components Effectiveness. In Table 3, we report the performance of using different components of Sparse-Tuning to investigate the effectiveness of Token Sparsification and Dense Adapters. We can observe the following: (1) Token Sparsification alone reduces the computational complexity, but it leads to a significant performance degradation, with average accuracy dropping by 8.77% from 76.74% to 67.97% (Table 3 (a,b)). (2) Dense Adapters can significantly improve the performance across three datasets (Table 3 (a,c)), which demonstrates their effectiveness in ViT adaptation. (3) Sparse-Tuning incorporates Token Sparsification and Dense Adapters into the pre-trained ViT, achieving the best trade-off between performance and fine-tuning and inference efficiency (Table 3 (a,b,c,d)). Compared to using only Dense Adapters for efficient ViT adaptation (Table 3 (c)), Sparse-Tuning sacrifices only 0.48% average accuracy while significantly reducing the computational cost from 17.89 GFLOPs to 11.70 GFLOPs, highlighting its strong adaptation performance and efficiency.

Table 4: Comparison of different feature inputs. $\text{MHA}_{\text{N}}$, $\text{DA}_{\text{N-1}}$, and $\text{DA}_{\text{N-3}}$ represent the outputs of the MHA at the N-th encoder layer, of $\text{DA}_{\text{N-1}}$, and of $\text{DA}_{\text{N-3}}$, respectively.

#	$\text{MHA}_{\text{N}}$	$\text{DA}_{\text{N-1}}$	$\text{DA}_{\text{N-3}}$	Params. (M) $\downarrow$	GFLOPs $\downarrow$	CIFAR-100	SVHN	Food-101	Avg.
(a)	✓			0.69	11.19	91.54	96.67	89.94	92.72
(b)	✓	✓		0.95	11.67	91.95	97.03	90.65	93.21
(c)	✓		✓	0.95	11.67	91.61	96.96	90.71	93.09
(d)	✓	✓	✓	1.10	11.70	92.31	97.47	90.72	93.50

Effects of Different Feature Inputs. To investigate the effectiveness of dense connections, we compare different inputs to our Dense Adapters. As shown in Table 4, when feeding multiple features from different encoder layers into the Dense Adapters, the performance increases. This suggests that our Dense Adapters effectively facilitate dense interactions between the lower and higher layers of the ViT to enhance the representational capability, thereby improving performance compared to standard adapter-tuning (Table 4 (a)). It is worth noting that while our Sparse-Tuning introduces more feature interactions requiring computation, the GFLOPs are still reduced compared to adapter-tuning, demonstrating that Token Sparsification also alleviates the computation cost in Dense Adapters.

Table 5: Comparison of different feature fusion methods. The positions "Input", "Inner", and "Output" correspond to (a), (b), and (c) in Figure 4, respectively.

#	Position	Params. (M) $\downarrow$	GFLOPs $\downarrow$	CIFAR-100	SVHN	Food-101	Avg.
(a)	Input	0.69	11.19	91.05	96.55	89.72	92.44
(b)	Inner	1.10	11.70	92.31	97.47	90.72	93.50
(c)	Output	1.68	11.81	91.27	97.14	90.19	92.87

Effects of Different Feature Fusion Methods. Since Dense Adapters take multiple features as inputs, we consider three variants of Dense Adapters that can fuse these multi-level features, as shown in Figure 4. We report the performance of different feature fusion methods in Table 5. Fusing the multi-level features before feeding them into the Dense Adapters (Figure 4 (a)) requires fewer trainable parameters but deteriorates performance. This occurs because this fusion method leads to information loss; features from different layers may contain complementary information, and simple addition may not effectively integrate this information. Fusing the features after feeding them into the Dense Adapters (Figure 4 (c)) also deteriorates performance. This is due to the fact that multi-level features are mapped into different spaces, and directly fusing them may obscure important information, thereby reducing classification performance. Our Dense Adapters first project multi-level features into the same space, then fuse them, and finally up-project the fused features back into their original shape (Figure 4 (b)). This ensures that the dense interaction process occurs within the same feature space, which leads to better performance.
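The three fusion positions can be contrasted in a short NumPy sketch. Since Figure 4 is not reproduced here, the exact form of each variant is an interpretation of the description above: "Input" adds the features before a single adapter, "Inner" adds them inside the bottleneck as in Eq. (3), and "Output" adds the outputs of separate adapters. The function names and toy sizes are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def fuse_input(feats, W_down, W_up):
    # Add the raw features first, then apply one down/up projection pair.
    return relu(sum(feats) @ W_down) @ W_up

def fuse_inner(feats, W_downs, W_up):
    # Project each feature into the shared bottleneck, add, then up-project.
    return relu(sum(x @ W for x, W in zip(feats, W_downs))) @ W_up

def fuse_output(feats, W_downs, W_ups):
    # Run a separate adapter per feature and add the up-projected outputs.
    return sum(relu(x @ Wd) @ Wu for x, Wd, Wu in zip(feats, W_downs, W_ups))

rng = np.random.default_rng(0)
T, D, d = 6, 16, 4                           # toy sizes
feats = [rng.standard_normal((T, D)) for _ in range(3)]
W_downs = [rng.standard_normal((D, d)) * 0.1 for _ in range(3)]
W_ups = [rng.standard_normal((d, D)) * 0.1 for _ in range(3)]

y_input = fuse_input(feats, W_downs[0], W_ups[0])
y_inner = fuse_inner(feats, W_downs, W_ups[0])
y_output = fuse_output(feats, W_downs, W_ups)

# Weight counts per adapter with three inputs: 2Dd, 4Dd, and 6Dd.
n_params = {"Input": 2 * D * d, "Inner": 4 * D * d, "Output": 6 * D * d}
print(n_params)
```

Under this reading the parameter counts grow from "Input" to "Inner" to "Output", which matches the ordering of the Params. column in Table 5, although the absolute numbers there also depend on details not modeled here.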

Effects of Different Positions of Token Sparsification. Since Token Sparsification occurs across different encoder layers in the ViT, we investigate its effects at various positions to achieve the best trade-off between performance and computational cost. As shown in Table 6, the shallower the position of the first Token Sparsification, the fewer encoder layers with full tokens need to be processed, hence the lower the computational cost. However, in the early stages, ViT cannot reliably identify important tokens, so merging tokens based on unreliable attention maps may result in the loss of important information, leading to decreased performance (Table 6 (a, b)). In contrast, when Token Sparsification starts later (Table 6 (d, e)), the shallow-layer tokens are processed late by the Dense Adapters and may already have lost local features, which yields better overall performance than Table 6 (a, b) but is still not optimal. We find that adopting Token Sparsification in the 4th, 7th, and 10th encoder layers yields the best performance. This suggests that performing multiple dense interactions in the relatively middle encoder layers of ViT balances local and global features more effectively during Token Sparsification. Therefore, we select the 4th, 7th, and 10th encoder layers in ViT for Token Sparsification to achieve the best trade-off between performance and computational cost.
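To make the cost argument concrete, the sketch below tracks how many tokens leave each of the 12 encoder layers for a given set of sparsification positions. It assumes 196 patch tokens (a 224×224 input with 16×16 patches), a rounding rule for the number of attentive tokens, and that the fused token is treated like any other patch token at later steps; none of these details is stated in the text, and `token_schedule` is a hypothetical helper.

```python
def token_schedule(num_patches=196, keep_rate=0.7, positions=(4, 7, 10), depth=12):
    """Number of tokens (including [CLS]) leaving each encoder layer."""
    counts = []
    n = num_patches
    for layer in range(1, depth + 1):
        if layer in positions:
            # Keep the attentive tokens and add one fused representative token.
            n = round(keep_rate * n) + 1
        counts.append(n + 1)  # +1 for the [CLS] token
    return counts

default = token_schedule()
print(default)  # [197, 197, 197, 139, 139, 139, 99, 99, 99, 71, 71, 71]

# Earlier positions leave fewer tokens for the remaining layers to process.
for positions in [(2, 5, 8), (4, 7, 10), (6, 9, 12)]:
    print(positions, sum(token_schedule(positions=positions)))
```

The summed token counts increase as the positions move deeper, in line with the GFLOPs trend from row (a) to row (e) of Table 6.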

Scaling up ViT with Sparse-Tuning. We apply Sparse-Tuning to ViT-L [10] to evaluate its performance and efficiency when scaling up the pre-trained model. As shown in Table 7, Sparse-Tuning reduces tunable parameters by 99.03% and decreases GFLOPs by 7.82 to 30.97, depending on the keeping rate $r$, compared to full fine-tuning, while also surpassing its performance. Additionally, at every keeping rate Sparse-Tuning requires fewer parameters and GFLOPs than DyT [60] while matching or exceeding its average accuracy, demonstrating its effectiveness for larger pre-trained models.
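The reductions quoted above follow directly from the Table 7 entries. The short check below only re-derives them; the dictionary layout and variable names are illustrative.

```python
# Values copied from Table 7 (ViT-L).
full_ft = {"params_m": 303.3, "gflops": 61.60}
sparse_tuning = {
    0.5: {"params_m": 2.93, "gflops": 30.63},
    0.7: {"params_m": 2.93, "gflops": 40.08},
    0.9: {"params_m": 2.93, "gflops": 53.78},
}

# Relative reduction in tunable parameters (identical for every keeping rate).
param_reduction_pct = 100 * (1 - sparse_tuning[0.5]["params_m"] / full_ft["params_m"])

# Absolute GFLOPs saved per keeping rate r.
gflops_saved = {r: full_ft["gflops"] - row["gflops"] for r, row in sparse_tuning.items()}

print(f"parameter reduction: {param_reduction_pct:.2f}%")
for r in sorted(gflops_saved):
    print(f"r={r}: saves {gflops_saved[r]:.2f} GFLOPs")
```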

Table 6: Comparison of different positions of Token Sparsification. For instance, "[4, 7, 10]" represents conducting Token Sparsification in the 4th, 7th, and 10th encoder layers of ViT.

#	Position	Params. (M) $\downarrow$	GFLOPs $\downarrow$	CIFAR-100	SVHN	Food-101	Avg.
(a)	[2, 5, 8]	1.10	10.78	89.77	94.65	88.71	91.04
(b)	[3, 6, 9]	1.10	11.35	91.03	96.21	89.76	92.33
(c)	[4, 7, 10]	1.10	11.70	92.31	97.47	90.72	93.50
(d)	[5, 8, 11]	1.10	12.73	92.12	96.69	90.24	92.68
(e)	[6, 9, 12]	1.10	13.68	91.38	96.10	89.93	92.47

Table 7: Comparison when scaling up the model size to ViT-L [10]. $r$ represents the keeping rate of the attentive (activated) tokens.

Method	Params. (M) $\downarrow$	GFLOPs $\downarrow$	CIFAR-100	SVHN	Food-101	Avg.
Full fine-tuning	303.3	61.60	92.05	97.44	90.62	93.04
DyT $r=0.5$ [60]	3.17	43.79	93.49	97.38	91.49	94.12
DyT $r=0.7$ [60]	3.17	51.11	93.28	97.25	91.60	94.04
DyT $r=0.9$ [60]	3.17	60.05	93.44	97.23	91.59	94.09
Sparse-Tuning $r=0.5$	2.93	30.63	93.56	97.31	91.46	94.11
Sparse-Tuning $r=0.7$	2.93	40.08	93.97	98.23	91.98	94.73
Sparse-Tuning $r=0.9$	2.93	53.78	93.45	98.15	92.77	94.79

5 Conclusion

In this work, we aim to enhance efficiency during both fine-tuning and inference stages when adapting the pre-trained ViT. To this end, we propose a novel tuning method called Sparse-Tuning, which selectively adapts tokens to enable the pre-trained ViT to focus more on the foreground and less on background regions during the fine-tuning stage. By gradually preserving informative tokens and merging uninformative ones into one representative token, our Sparse-Tuning significantly reduces redundant computational costs, achieving both fine-tuning and inference efficiency for ViT adaptation. We conduct empirical experiments on the VTAB-1K benchmark, three complete image datasets, and two complete video datasets to ensure the generalizability of our Sparse-Tuning for efficient ViT adaptation. Extensive experimental results demonstrate that our Sparse-Tuning can enhance performance as well as significantly improve fine-tuning and inference efficiency. Since our experiments mainly focus on classification tasks, extending Sparse-Tuning to other vision tasks, such as segmentation and detection, is left for future work.

References

[1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[2] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101–mining discriminative components with random forests. In Proceedings of the European Conference on Computer Vision, pages 446–461. Springer, 2014.
[3] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660, 2021.
[4] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
[5] Shoufa Chen, Chongjian Ge, Zhan Tong, Jiangliu Wang, Yibing Song, Jue Wang, and Ping Luo. Adaptformer: Adapting vision transformers for scalable visual recognition. In Proceedings of the Advances in Neural Information Processing Systems, volume 35, pages 16664–16678, 2022.
[6] Xuanyao Chen, Zhijian Liu, Haotian Tang, Li Yi, Hang Zhao, and Song Han. Sparsevit: Revisiting activation sparsity for efficient high-resolution vision transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2061–2070, 2023.
[7] Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. In Proceedings of the International Conference on Machine Learning, pages 7480–7512. PMLR, 2023.
[8] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 248–255. Ieee, 2009.
[9] Haiwen Diao, Bo Wan, Ying Zhang, Xu Jia, Huchuan Lu, and Long Chen. UniPT: Universal parallel tuning for transfer learning with efficient parameter and memory. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
[10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations, 2020.
[11] Chin-Lun Fu, Zih-Ching Chen, Yun-Ru Lee, and Hung-Yi Lee. Adapterbias: Parameter-efficient token-dependent representation shift for adapters in nlp tasks. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 2608–2621, 2022.
[12] Minghao Fu, Ke Zhu, and Jianxin Wu. Dtl: Disentangled transfer learning for visual recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 12082–12090, 2024.
[13] Ian J Goodfellow, Yaroslav Bulatov, Julian Ibarz, Sacha Arnoud, and Vinay Shet. Multi-digit number recognition from street view imagery using deep convolutional neural networks. arXiv preprint arXiv:1312.6082, 2013.
[14] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5842–5850, 2017.
[15] Yizeng Han, Gao Huang, Shiji Song, Le Yang, Honghui Wang, and Yulin Wang. Dynamic neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(11):7436–7456, 2021.
[16] Joakim Bruslund Haurum, Sergio Escalera, Graham W Taylor, and Thomas B Moeslund. Which tokens to use? investigating token reduction in vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 773–783, 2023.
[17] Haoyu He, Jianfei Cai, Jing Zhang, Dacheng Tao, and Bohan Zhuang. Sensitivity-aware visual parameter-efficient fine-tuning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11825–11835, 2023.
[18] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009, 2022.
[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[20] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016.
[21] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In Proceedings of the International Conference on Machine Learning, pages 2790–2799. PMLR, 2019.
[22] Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. In Proceedings of the International Conference on Learning Representations, 2021.
[23] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4700–4708, 2017.
[24] Siteng Huang, Biao Gong, Yulin Pan, Jianwen Jiang, Yiliang Lv, Yuyuan Li, and Donglin Wang. Vop: Text-video co-operative prompt tuning for cross-modal retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6565–6574, 2023.
[25] Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. In Proceedings of the European Conference on Computer Vision, pages 709–727. Springer, 2022.
[26] Zeyinzi Jiang, Chaojie Mao, Ziyuan Huang, Ao Ma, Yiliang Lv, Yujun Shen, Deli Zhao, and Jingren Zhou. Res-tuning: A flexible and efficient tuning paradigm via unbinding tuner from backbone. In Proceedings of the Advances in Neural Information Processing Systems, volume 36, 2024.
[27] Shibo Jie and Zhi-Hong Deng. Convolutional bypasses are better vision transformer adapters. arXiv preprint arXiv:2207.07039, 2022.
[28] Shibo Jie and Zhi-Hong Deng. Convolutional bypasses are better vision transformer adapters. arXiv preprint arXiv:2207.07039, 2022.
[29] Shibo Jie and Zhi-Hong Deng. Fact: Factor-tuning for lightweight adaptation on vision transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 1060–1068, 2023.
[30] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.
[31] Simon Kornblith, Jonathon Shlens, and Quoc V Le. Do better imagenet models transfer better? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2661–2671, 2019.
[32] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123:32–73, 2017.
[33] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[34] Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 3045–3059, 2021.
[35] Dongze Lian, Daquan Zhou, Jiashi Feng, and Xinchao Wang. Scaling & shifting your features: A new baseline for efficient model tuning. In Proceedings of the Advances in Neural Information Processing Systems, volume 35, pages 109–123, 2022.
[36] Y Liang, C Ge, Z Tong, Y Song, P Xie, et al. Not all patches are what you need: Expediting vision transformers via token reorganizations. In Proceedings of the International Conference on Learning Representations, 2022.
[37] Mingbao Lin, Mengzhao Chen, Yuxin Zhang, Chunhua Shen, Rongrong Ji, and Liujuan Cao. Super vision transformer. International Journal of Computer Vision, 131(12):3136–3151, 2023.
[38] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer Vision, pages 740–755. Springer, 2014.
[39] Ting Liu, Xuyang Liu, Siteng Huang, Honggang Chen, Quanjun Yin, Long Qin, Donglin Wang, and Yue Hu. DARA: Domain- and relation-aware adapters make parameter-efficient tuning for visual grounding. arXiv preprint arXiv:2405.06217, 2024.
[40] Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, et al. Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12009–12019, 2022.
[41] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.
[42] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
[43] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[44] Otniel-Bogdan Mercea, Alexey Gritsenko, Cordelia Schmid, and Anurag Arnab. Time-, memory-and parameter-efficient visual adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
[45] Junting Pan, Ziyi Lin, Xiatian Zhu, Jing Shao, and Hongsheng Li. St-adapter: Parameter-efficient image-to-video transfer learning. In Proceedings of the Advances in Neural Information Processing Systems, pages 26462–26477, 2022.
[46] Junting Pan, Ziyi Lin, Xiatian Zhu, Jing Shao, and Hongsheng Li. St-adapter: Parameter-efficient image-to-video transfer learning. NeurIPS, 35:26462–26477, 2022.
[47] Yao Qiang, Deng Pan, Chengyin Li, Xin Li, Rhongho Jang, and Dongxiao Zhu. Attcat: Explaining transformers via attentive class activation tokens. In Proceedings of the Advances in Neural Information Processing Systems, volume 35, pages 5052–5064, 2022.
[48] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[49] Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. Dynamicvit: Efficient vision transformers with dynamic token sparsification. In Proceedings of the Advances in Neural Information Processing Systems, volume 34, pages 13937–13949, 2021.
[50] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, volume 30, 2017.
[51] Yulin Wang, Rui Huang, Shiji Song, Zeyi Huang, and Gao Huang. Not all images are worth 16x16 words: Dynamic transformers for efficient image recognition. In Proceedings of the Advances in Neural Information Processing Systems, volume 34, pages 11960–11973, 2021.
[52] Yi Xin, Junlong Du, Qiang Wang, Zhiwen Lin, and Ke Yan. VMT-Adapter: Parameter-efficient transfer learning for multi-task dense scene understanding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 16085–16093, 2024.
[53] Yi Xin, Junlong Du, Qiang Wang, Ke Yan, and Shouhong Ding. MmAP: Multi-modal alignment prompt for cross-domain multi-task learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 16076–16084, 2024.
[54] Yi Xin, Siqi Luo, Haodi Zhou, Junlong Du, Xiaohong Liu, Yue Fan, Qing Li, and Yuntao Du. Parameter-efficient fine-tuning for pre-trained vision models: A survey. arXiv preprint arXiv:2402.02242, 2024.
[55] Zunnan Xu, Zhihong Chen, Yong Zhang, et al. Bridging vision and language encoders: Parameter-efficient tuning for referring image segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023.
[56] Elad Ben Zaken, Yoav Goldberg, and Shauli Ravfogel. Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 1–9, 2022.
[57] Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12104–12113, 2022.
[58] Xiaohua Zhai, Joan Puigcerver, Alexander Kolesnikov, Pierre Ruyssen, Carlos Riquelme, Mario Lucic, Josip Djolonga, Andre Susano Pinto, Maxim Neumann, Alexey Dosovitskiy, et al. A large-scale study of representation learning with the visual task adaptation benchmark. arXiv preprint arXiv:1910.04867, 2019.
[59] Yuanhan Zhang, Kaiyang Zhou, and Ziwei Liu. Neural prompt search. arXiv preprint arXiv:2206.04673, 2022.
[60] Wangbo Zhao, Jiasheng Tang, Yizeng Han, Yibing Song, Kai Wang, Gao Huang, Fan Wang, and Yang You. Dynamic tuning towards parameter and inference efficiency for vit adaptation. arXiv preprint arXiv:2403.11808, 2024.

Appendix A Additional Experiments

A.1 Performance and Efficiency on CIFAR-100

Table 8: Comparison with Mainstream PEFT Methods on CIFAR-100. This table replicates the data exactly as shown in Figure 1.

Method	Params. (M) $\downarrow$	Train Memory (GB) $\downarrow$	Inference Memory (GB) $\downarrow$	Train Time (sec/epoch) $\downarrow$	Inference Time (sec/epoch) $\downarrow$	GFLOPs $\downarrow$	Acc.
Full fine-tuning	85.8	18.82	3.43	90	8	17.58	90.91
Adapter [21]	1.19	13.74	2.61	70	8	17.81	91.76
LoRA [22]	1.19	14.07	2.89	75	8	17.58	91.42
AdaptFormer [5]	1.19	13.69	2.61	68	8	17.81	92.03
VPT [25]	0.07	14.60	3.12	77	8	18.32	91.64
DyT [60]	4.80	11.32	2.35	90	7	12.21	91.01
Sparse-Tuning	1.10	8.89	1.53	40	4	11.70	92.31

We present the number of updated parameters during fine-tuning, GPU memory usage during both fine-tuning and inference, time for fine-tuning and inference, GFLOPs, and accuracy of our Sparse-Tuning method compared to other mainstream PEFT methods on the CIFAR-100 dataset [33]. As Table 8 shows, Sparse-Tuning attains the highest accuracy (92.31) while requiring the least GPU memory, time, and GFLOPs among the compared methods during both the fine-tuning and inference stages.
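To put the Table 8 figures in relative terms, the snippet below expresses each Sparse-Tuning cost as a percentage of full fine-tuning. It only re-uses the table entries; the dictionary keys are illustrative names.

```python
# Values copied from Table 8 (CIFAR-100, ViT-B).
full_ft = {"params_m": 85.8, "train_mem_gb": 18.82, "infer_mem_gb": 3.43,
           "train_sec": 90, "infer_sec": 8, "gflops": 17.58}
sparse = {"params_m": 1.10, "train_mem_gb": 8.89, "infer_mem_gb": 1.53,
          "train_sec": 40, "infer_sec": 4, "gflops": 11.70}

# Sparse-Tuning cost as a percentage of the full fine-tuning cost.
relative_pct = {key: round(100 * sparse[key] / full_ft[key], 1) for key in full_ft}

for key, pct in relative_pct.items():
    print(f"{key}: {pct}% of full fine-tuning")
```

The GFLOPs ratio of roughly 66.6% falls inside the 62%-70% range stated in the abstract.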

A.2 Effects of Different Bottleneck Dimensions of Dense Adapter

Table 9: Comparison of different bottleneck dimensions of Dense Adapter. $d$ represents the bottleneck dimension of the Dense Adapter.

#	$d$	Params. (M) $\downarrow$	GFLOPs $\downarrow$	CIFAR-100	SVHN	Food-101	Avg.
(a)	8	0.29	11.60	91.03	96.31	89.55	92.30
(b)	16	0.56	11.63	91.75	97.02	89.92	92.90
(c)	32	1.10	11.70	92.31	97.47	90.72	93.50
(d)	64	2.18	11.83	92.19	96.96	90.25	93.13
(e)	128	4.35	12.09	92.11	96.15	90.31	92.47

We explore the impact of the bottleneck dimension $d$ of our Dense Adapter in Sparse-Tuning to achieve the best trade-off between performance, updated parameters, and computational cost. As reported in Table 9, a higher bottleneck dimension $d$ introduces more parameters and higher GFLOPs. However, with a smaller $d$, the down-projection may lose significant information about the original features, leading to performance degradation. We observe that performance peaks at a bottleneck dimension of 32 and declines thereafter. Therefore, considering the trade-off between trainable parameters, GFLOPs, and performance, we select a bottleneck dimension of 32.
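As a quick sanity check on how the parameter count scales with $d$, the snippet below computes the parameter increase per unit of bottleneck dimension between consecutive rows of Table 9. It uses only the table entries; the variable names are illustrative.

```python
# (bottleneck dimension d, trainable parameters in millions) from Table 9.
rows = [(8, 0.29), (16, 0.56), (32, 1.10), (64, 2.18), (128, 4.35)]

# Parameter increase (in millions) per unit of d between consecutive rows.
slopes = [(p2 - p1) / (d2 - d1) for (d1, p1), (d2, p2) in zip(rows, rows[1:])]

for (d1, _), (d2, _), slope in zip(rows, rows[1:], slopes):
    print(f"d: {d1} -> {d2}, about {slope * 1000:.1f}K parameters per unit of d")
```

The increments are nearly constant (about 34K parameters per unit of $d$), so within the precision of the table the parameter count grows approximately linearly with the bottleneck dimension.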

Appendix B More Visualizations of Token Sparsification

We present more visualization results of Token Sparsification in Figure 5. The results demonstrate that given various images, the Token Sparsification in our Sparse-Tuning can effectively maintain the tokens from the foreground regions.

Appendix C Implementation Details for Each Task

Experimental settings on VTAB-1K.

Following previous works [28, 26], we fine-tune the model for 100 epochs on each dataset in VTAB-1K [58]. We do not use any data augmentation strategy in these experiments. We adopt the AdamW [43] optimizer. The base learning rate is set to 0.01 and gradually decays to 0 based on a cosine schedule [42].

Experimental settings on complete image datasets.

We use the settings in Table 10 to fine-tune the ViT with the proposed Sparse-Tuning. Experiments on other parameter-efficient methods, such as AdaptFormer [5], LoRA [22], and VPT [25], also follow the settings of [60] listed in Table 10.

Table 10: Experimental settings for complete image datasets. We present the hyperparameters in Sparse-Tuning.

Configuration	CIFAR-100 [33], SVHN [13], Food-101 [2]
Optimizer	AdamW [43]
Base learning rate	0.01
Weight decay	0.01
Batch size	128
Training crop size	224
Learning rate schedule	Cosine decay [42]
GPU numbers	1
Warmup epochs	20
Training epochs	100
Augmentation	RandomResizedCrop
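
The learning-rate settings in Table 10 can be sketched as a per-epoch schedule. The table specifies a base learning rate of 0.01, 20 warmup epochs, 100 training epochs, and cosine decay; the linear shape of the warmup, the per-epoch (rather than per-step) granularity, and the function name `learning_rate` are assumptions made for this illustration.

```python
import math


def learning_rate(epoch, base_lr=0.01, warmup_epochs=20, total_epochs=100):
    """Learning rate at the start of `epoch` (0-indexed)."""
    if epoch < warmup_epochs:
        # Assumption: linear warmup from 0 up to the base learning rate.
        return base_lr * epoch / warmup_epochs
    # Cosine decay from base_lr down to 0 over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


for epoch in (0, 10, 20, 60, 100):
    print(epoch, round(learning_rate(epoch), 5))
```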

Experimental settings on video datasets.

We use two video datasets, Kinetics-400 (K400) [4] and Something-Something V2 (SSv2) [14], to evaluate performance as the token count scales up. The experimental settings are shown in Table 11. The number of input frames is set to 8. During testing, we use multi-view, a common practice in video action recognition. Experiments on other PEFT methods also follow these experimental settings.

Table 11: Experimental settings for complete video datasets. We follow most of the settings in [46]. The number of input frames is set to 8 in all experiments.

Configuration	K400 [4]	SSv2 [14]
Optimizer	AdamW [43]	AdamW [43]
Base learning rate	1e-3	1e-3
Weight decay	0.01	0.01
Batch size	128	128
Training epochs		
Training resize	ShortSideJitter	RandomResizedCrop
Training crop size	224	224
Learning rate schedule	Cosine decay [42]	Cosine decay [42]
Num. testing views	1 spatial $\times$ 3 temporal	3 spatial $\times$ 1 temporal

Appendix D Pseudocode of Sparse-Tuning

We present the PyTorch-like pseudocode of our Sparse-Tuning in Algorithm 1 to help readers better understand the whole process.

Algorithm 1 PyTorch-like pseudocode of Sparse-Tuning for a ViT encoder.

# H: number of attention heads
# N: number of input tokens
# C: the dimension of token vector
# k: the token keeping rate
# x: the input tokens with shape [N, C], with the first being the [CLS] token
# fc_q, fc_k, fc_v: linear transforms for query, key, and value of self-attention
# mha_output: multi-head attention output of the current layer
# prev_adapter_output: adapter output of the previous layer
# adapter_output_N_3_layer: adapter output from three layers earlier
# proj: linear projection in self-attention
# norm: layer normalization
# ffn: feed-forward network
# dense_adapter: layer-specific adapter network

# freeze the backbone; only the adapters and the classification head are trained
for name, p in model.named_parameters():
    if "adapter" in name or "head" in name:
        p.requires_grad = True
    else:
        p.requires_grad = False

avg_cls_attn = zeros(N - 1)
x_out = []
x_residual = x
x = norm(x)

# compute self-attention for each attention head
for i in range(0, H):
    # per-head names q_i, k_i, v_i avoid overwriting the keeping rate k
    q_i, k_i, v_i = fc_q[i](x), fc_k[i](x), fc_v[i](x)
    attn = (q_i @ k_i.transpose()) / sqrt(C / H)
    attn = softmax(attn, dim=1)
    x_head = attn @ v_i
    x_out.append(x_head)
    # attention from the [CLS] token to every other token
    cls_attn = attn[0, 1:]
    avg_cls_attn += cls_attn

x = concat(x_out, dim=1)
x = proj(x)  # shape: [N, C]
x = x + x_residual

# rank tokens by head-averaged [CLS] attention, highest first
avg_cls_attn /= H
sorted_cls_attn, idx = sort(avg_cls_attn, descending=True)

# compute the number of attentive tokens, without counting the [CLS] token
K = ceil(k * (N - 1))
topk_attn, topk_idx = sorted_cls_attn[:K], idx[:K]
non_topk_attn, non_topk_idx = sorted_cls_attn[K:], idx[K:]

cls_token = x[0:1]
x_without_cls = x[1:]

# obtain the attentive and inattentive tokens
attentive_tokens = x_without_cls[topk_idx]
inattentive_tokens = x_without_cls[non_topk_idx]

# compute the weighted combination of inattentive tokens
fused_token = non_topk_attn[None, :] @ inattentive_tokens  # shape: [1, C]
x_new = concat([cls_token, attentive_tokens, fused_token], dim=0)

# dense adapter processing
adapter_output = dense_adapter(mha_output, prev_adapter_output, adapter_output_N_3_layer)

x_residual = x_new
x_new = norm(x_new)
x_new = ffn(x_new)
x_new = x_new + x_residual + adapter_output
return x_new
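
The token-selection step of Algorithm 1 can also be run in isolation on toy data. The following standard-library sketch keeps the top-K patch tokens by [CLS] attention and merges the rest into one fused token weighted by their attention scores. The function name `sparsify_tokens` and the toy values are hypothetical, and plain lists stand in for tensors.

```python
import math


def sparsify_tokens(tokens, cls_attn, keep_rate):
    """tokens: [cls, patch_1, ...]; cls_attn: one [CLS] attention score per patch."""
    cls_token, patches = tokens[0], tokens[1:]
    num_keep = math.ceil(keep_rate * len(patches))
    # Rank patch indices by [CLS] attention, highest first.
    order = sorted(range(len(patches)), key=lambda i: cls_attn[i], reverse=True)
    keep_idx, merge_idx = order[:num_keep], order[num_keep:]
    attentive = [patches[i] for i in keep_idx]
    # Attention-weighted sum of the inattentive tokens gives one fused token.
    dim = len(cls_token)
    fused = [sum(cls_attn[i] * patches[i][c] for i in merge_idx) for c in range(dim)]
    return [cls_token] + attentive + [fused]


tokens = [[9.0, 9.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 1.0]]
cls_attn = [0.1, 0.4, 0.2, 0.3]
result = sparsify_tokens(tokens, cls_attn, keep_rate=0.5)
print(result)
```

With four patch tokens and a keeping rate of 0.5, the two most attended patches survive unchanged and the other two are merged, so the sequence shrinks from five tokens to four.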