
Multi-task prompt tuning with soft context sharing for vision–language models

Published: 28 October 2024

Abstract

Vision–language models have recently shown great potential on many computer vision tasks. Meanwhile, prior work demonstrates that prompt tuning designed for vision–language models can achieve superior performance on few-shot image recognition compared to linear probing, a strong baseline. In practice, many few-shot tasks are inherently correlated, particularly within specialized domains, yet this information has previously been overlooked. Inspired by the fact that modeling task relationships through multi-task learning usually boosts performance, we propose SoftCPT (Soft Context Sharing for Prompt Tuning), a novel method that tunes pre-trained vision–language models on multiple target few-shot tasks jointly. Specifically, we design a task-shared meta network that generates the prompt context for each task, taking the task name together with a learnable task context as input. The parameters of this meta network, as well as the task context, are tuned on the joint training set of all tasks. As such, the prompt contexts of all tasks are shared in a soft manner. Extensive experiments on four multi-task few-shot datasets covering 44 tasks and 1593 categories demonstrate that SoftCPT significantly outperforms single-task prompt tuning methods, highlighting the effectiveness of multi-task learning for vision–language prompt tuning.
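
Read as an architecture, the abstract describes a small meta network that is shared across tasks and maps a task-name embedding, concatenated with a learnable task context, to the soft prompt context of that task. The sketch below is a minimal PyTorch illustration of that idea only; the class and parameter names (PromptMetaNet, task_feat_dim, ctx_len, etc.), the dimensions, and the way the task-name embedding is obtained are all assumptions, not the authors' implementation.

```python
# Minimal sketch of the soft context sharing idea (illustrative assumptions,
# not the SoftCPT reference code).
import torch
import torch.nn as nn


class PromptMetaNet(nn.Module):
    """Task-shared meta network: task-name embedding -> per-task prompt context."""

    def __init__(self, task_feat_dim=512, ctx_len=4, ctx_dim=512, n_prompt=16):
        super().__init__()
        # Learnable task context, shared across all tasks and tuned jointly.
        self.task_ctx = nn.Parameter(0.02 * torch.randn(ctx_len, ctx_dim))
        self.n_prompt, self.ctx_dim = n_prompt, ctx_dim
        self.net = nn.Sequential(
            nn.Linear(task_feat_dim + ctx_len * ctx_dim, 512),
            nn.ReLU(inplace=True),
            nn.Linear(512, n_prompt * ctx_dim),
        )

    def forward(self, task_name_feat):
        # task_name_feat: (task_feat_dim,) embedding of the task name, e.g.
        # obtained by encoding a short task description with the frozen
        # CLIP text encoder (an assumption for this sketch).
        x = torch.cat([task_name_feat, self.task_ctx.flatten()], dim=0)
        return self.net(x).view(self.n_prompt, self.ctx_dim)


meta = PromptMetaNet()

# Dummy task-name embeddings for two hypothetical few-shot tasks.
task_name_feats = {
    "flower species recognition": torch.randn(512),
    "aircraft model recognition": torch.randn(512),
}

# Each task gets its own prompt context, but all of them are produced by the
# same meta network; the contexts would be prepended to class-name token
# embeddings and passed through the frozen text encoder to form classifiers.
prompt_contexts = {task: meta(feat) for task, feat in task_name_feats.items()}
```

Because every task's prompt context is generated by the same meta network and the same learnable task context, gradients from all tasks update one set of parameters during joint training, which is the "soft" sharing the abstract refers to.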


Published In

Neurocomputing  Volume 603, Issue C
Oct 2024
214 pages

Publisher

Elsevier Science Publishers B. V.

Netherlands

Author Tags

  1. Prompt tuning
  2. Vision–language model
  3. Multi-task learning
  4. Few-shot recognition

Qualifiers

  • Research-article
