
Do we really need a large number of visual prompts?

Published: 24 July 2024

Abstract

Due to increasing interest in adapting models on resource-constrained edge devices, parameter-efficient transfer learning has been widely explored. Among various methods, Visual Prompt Tuning (VPT), which prepends learnable prompts to the input space, shows fine-tuning performance competitive with training the full set of network parameters. However, VPT increases the number of input tokens, resulting in additional computational overhead. In this paper, we analyze the impact of the number of prompts on fine-tuning performance and on the self-attention operation in a vision transformer architecture. Through theoretical and empirical analysis, we show that adding more prompts does not lead to a linear performance improvement. Further, we propose a Prompt Condensation (PC) technique that aims to prevent the performance degradation caused by using a small number of prompts. We validate our method on the FGVC and VTAB-1k tasks and show that our approach reduces the number of prompts by ∼70% while maintaining accuracy.
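
The abstract describes VPT as prepending learnable prompt tokens to the input sequence of a vision transformer, which lengthens the token sequence and hence the self-attention cost. Below is a minimal, illustrative sketch of that idea (not the authors' code), assuming a PyTorch-style frozen ViT encoder that maps a (batch, tokens, dim) sequence to the same shape; the names PromptedViT, backbone, and num_prompts are ours.

    # Illustrative sketch of VPT-style shallow prompt tuning (assumptions noted above):
    # learnable prompt tokens are prepended to the patch-token sequence of a frozen
    # vision transformer, so only the prompts and the classification head are trained.
    import torch
    import torch.nn as nn

    class PromptedViT(nn.Module):
        def __init__(self, backbone: nn.Module, embed_dim: int = 768,
                     num_prompts: int = 10, num_classes: int = 100):
            super().__init__()
            self.backbone = backbone                  # frozen ViT encoder blocks, (B, L, D) -> (B, L, D)
            for p in self.backbone.parameters():
                p.requires_grad = False
            # learnable prompt tokens, shared across the batch
            self.prompts = nn.Parameter(torch.zeros(1, num_prompts, embed_dim))
            nn.init.trunc_normal_(self.prompts, std=0.02)
            self.head = nn.Linear(embed_dim, num_classes)

        def forward(self, tokens: torch.Tensor) -> torch.Tensor:
            # tokens: (B, 1 + N_patches, D), i.e. [CLS] token followed by patch embeddings
            b = tokens.shape[0]
            prompts = self.prompts.expand(b, -1, -1)
            # prepend prompts right after the CLS token; the sequence grows by
            # num_prompts, which is where the extra attention cost comes from
            x = torch.cat([tokens[:, :1], prompts, tokens[:, 1:]], dim=1)
            x = self.backbone(x)                      # frozen self-attention blocks
            return self.head(x[:, 0])                 # classify from the CLS token

Because self-attention cost grows with sequence length, num_prompts directly controls the extra compute; this is the overhead that the paper's analysis and Prompt Condensation technique target.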



Published In

Neural Networks, Volume 177, Issue C, September 2024, 298 pages

Publisher

Elsevier Science Ltd., United Kingdom


Author Tags

  1. Memory-efficient neural networks
  2. Vision transformer

Qualifiers

  • Research-article
