Abstract
Recently, large-scale pre-trained vision-language models (e.g., CLIP and ALIGN) have demonstrated remarkable effectiveness in learning transferable visual representations. To leverage the knowledge encoded in these models for downstream tasks, several fine-tuning approaches, including prompt tuning and adapter-based methods, have been developed to adapt vision-language models under supervision. However, these methods rely on annotated samples, which are labor-intensive and time-consuming to acquire, limiting their scalability. To address this issue, we design an unsupervised fine-tuning approach for vision-language models called Unsupervised Prototype Adapter (UP-Adapter). Specifically, for an unannotated target dataset, we leverage the image-text alignment capability of CLIP to automatically select the most confident samples for each class. From these selected samples, we generate class prototypes, which serve as the initialization of a learnable prototype model. After fine-tuning, the prototype model's prediction is combined with the original CLIP prediction through a residual connection to perform downstream recognition tasks. Extensive experiments on image recognition and domain generalization show that the proposed unsupervised method outperforms 8-shot CoOp, 8-shot Tip-Adapter, and the state-of-the-art UPL method by large margins.
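The pipeline the abstract describes (pseudo-labeling with CLIP, selecting the most confident samples per class, averaging them into class prototypes, and residually combining prototype and CLIP predictions) can be sketched as follows. This is a minimal illustration under assumed inputs, not the authors' implementation: it takes pre-extracted, L2-normalized CLIP image and text features as NumPy arrays, and the confidence threshold `k` and residual weight `alpha` are hypothetical parameters.

```python
import numpy as np


def l2_normalize(x, axis=-1):
    # Normalize feature vectors; eps guards empty (all-zero) prototypes.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)


def select_confident(image_feats, text_feats, k):
    """Pseudo-label images with zero-shot CLIP and keep the top-k
    most confident samples per class."""
    logits = image_feats @ text_feats.T                      # cosine similarities
    probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
    pseudo_labels = probs.argmax(-1)
    confidence = probs.max(-1)
    selected = {}
    for c in range(text_feats.shape[0]):
        idx = np.where(pseudo_labels == c)[0]
        # Sort this class's candidates by confidence, descending; keep k.
        selected[c] = idx[np.argsort(confidence[idx])[::-1][:k]]
    return selected


def build_prototypes(image_feats, selected, num_classes, dim):
    # Class prototype = mean feature of the selected confident samples.
    protos = np.zeros((num_classes, dim))
    for c, idx in selected.items():
        if len(idx) > 0:
            protos[c] = image_feats[idx].mean(axis=0)
    return l2_normalize(protos)


def predict(image_feats, text_feats, prototypes, alpha=0.5):
    # Residual combination: original CLIP logits plus weighted prototype logits.
    clip_logits = image_feats @ text_feats.T
    proto_logits = image_feats @ prototypes.T
    return clip_logits + alpha * proto_logits
```

In the paper the prototypes further serve as the initialization of a learnable adapter that is fine-tuned; the sketch above only shows the training-free construction and the residual fusion of the two predictions.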
References
Bossard, L., Guillaumin, M., Van Gool, L.: Food-101 – mining discriminative components with random forests. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8694, pp. 446–461. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10599-4_29
Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., Vedaldi, A.: Describing textures in the wild. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3606–3613 (2014)
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009)
Dosovitskiy, A., et al.: An image is worth 16\(\times \)16 words: transformers for image recognition at scale. In: International Conference on Learning Representations (2020)
Du, Y., Wei, F., Zhang, Z., Shi, M., Gao, Y., Li, G.: Learning to prompt for open-vocabulary object detection with vision-language model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14084–14093 (2022)
Duan, J., et al.: Multi-modal alignment using representation codebook. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15651–15660 (2022)
Fei-Fei, L., Fergus, R., Perona, P.: Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 178–178 (2004)
Gao, P., et al.: Clip-adapter: better vision-language models with feature adapters. arXiv preprint arXiv:2110.04544 (2021)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Helber, P., Bischke, B., Dengel, A., Borth, D.: EuroSAT: a novel dataset and deep learning benchmark for land use and land cover classification. IEEE J. Sel. Top. Appl. Earth Observ. Rem. Sens. 12(7), 2217–2226 (2019)
Hendrycks, D., et al.: The many faces of robustness: a critical analysis of out-of-distribution generalization. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8340–8349 (2021)
Hendrycks, D., Zhao, K., Basart, S., Steinhardt, J., Song, D.: Natural adversarial examples. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15262–15271 (2021)
Huang, T., Chu, J., Wei, F.: Unsupervised prompt learning for vision-language models. arXiv preprint arXiv:2204.03649 (2022)
Huang, X., et al.: Idea: increasing text diversity via online multi-label recognition for vision-language pre-training. In: Proceedings of the 30th ACM International Conference on Multimedia, pp. 4573–4583 (2022)
Krause, J., Stark, M., Deng, J., Fei-Fei, L.: 3d object representations for fine-grained categorization. In: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pp. 554–561 (2013)
Li, L.H., et al.: Grounded language-image pre-training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10965–10975 (2022)
Lu, J., Batra, D., Parikh, D., Lee, S.: ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In: Advances in Neural Information Processing Systems, vol. 32 (2019)
Maji, S., Rahtu, E., Kannala, J., Blaschko, M., Vedaldi, A.: Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151 (2013)
Shu, M., et al.: Test-time prompt tuning for zero-shot generalization in vision-language models. In: Advances in Neural Information Processing Systems (2022)
Menon, S., Vondrick, C.: Visual classification via description from large language models. In: International Conference on Learning Representations (2023)
Nilsback, M.E., Zisserman, A.: Automated flower classification over a large number of classes. In: Indian Conference on Computer Vision, Graphics and Image Processing (2008)
Novack, Z., McAuley, J., Lipton, Z.C., Garg, S.: CHiLS: zero-shot image classification with hierarchical label sets. In: International Conference on Machine Learning, pp. 26342–26362. PMLR (2023)
Parkhi, O.M., Vedaldi, A., Zisserman, A., Jawahar, C.: Cats and dogs. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3498–3505 (2012)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning (2021)
Recht, B., Roelofs, R., Schmidt, L., Shankar, V.: Do ImageNet classifiers generalize to ImageNet? In: International Conference on Machine Learning (2019)
Ru, L., Zhan, Y., Yu, B., Du, B.: Learning affinity from attention: end-to-end weakly-supervised semantic segmentation with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16846–16855 (2022)
Scudder, H.: Probability of error of some adaptive pattern-recognition machines. IEEE Trans. Inf. Theory 11(3), 363–371 (1965)
Shi, H., Hayat, M., Wu, Y., Cai, J.: ProposalCLIP: unsupervised open-category object proposal generation via exploiting CLIP cues. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9611–9620 (2022)
Soomro, K., Zamir, A.R., Shah, M.: UCF101: a dataset of 101 human action classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012)
Tang, Y., Guo, Q., He, Z.: Cross-inferential networks for source-free unsupervised domain adaptation. arXiv preprint arXiv:2306.16957 (2023)
Tang, Y., et al.: Neuro-modulated Hebbian learning for fully test-time adaptation. arXiv preprint arXiv:2303.00914 (2023)
Udandarao, V., Gupta, A., Albanie, S.: SUS-X: training-free name-only transfer of vision-language models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2023)
Van Engelen, J.E., Hoos, H.H.: A survey on semi-supervised learning. Mach. Learn. 109(2), 373–440 (2020)
Wang, H., Ge, S., Lipton, Z., Xing, E.P.: Learning robust global representations by penalizing local predictive power. In: Advances in Neural Information Processing Systems, vol. 32 (2019)
Wang, Y., Yao, Q., Kwok, J.T., Ni, L.M.: Generalizing from a few examples: a survey on few-shot learning. ACM Comput. Surv. 53(3), 1–34 (2020)
Wang, Z., Yu, J., Yu, A.W., Dai, Z., Tsvetkov, Y., Cao, Y.: SimVLM: simple visual language model pretraining with weak supervision. In: International Conference on Learning Representations (2022)
Xiao, J., Hays, J., Ehinger, K.A., Oliva, A., Torralba, A.: Sun database: large-scale scene recognition from abbey to zoo. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3485–3492 (2010)
Yao, Y., Zhang, A., Zhang, Z., Liu, Z., Chua, T.S., Sun, M.: CPT: colorful prompt tuning for pre-trained vision-language models. arXiv preprint arXiv:2109.11797 (2021)
Yu, J., Wang, Z., Vasudevan, V., Yeung, L., Seyedhosseini, M., Wu, Y.: CoCa: contrastive captioners are image-text foundation models. Trans. Mach. Learn. Res. (2022)
Zhang, R., et al.: Tip-adapter: training-free adaption of CLIP for few-shot classification. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision. ECCV 2022. LNCS, vol. 13695. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19833-5_29
Zhang, Y., Zhang, C., Tang, Y., He, Z.: Cross-modal concept learning and inference for vision-language models. arXiv preprint arXiv:2307.15460 (2023)
Zhou, K., Liu, Z., Qiao, Y., Xiang, T., Loy, C.C.: Domain generalization: a survey. IEEE Trans. Pattern Anal. Mach. Intell. (2022)
Zhou, K., Yang, J., Loy, C.C., Liu, Z.: Conditional prompt learning for vision-language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16816–16825 (2022)
Zhou, K., Yang, J., Loy, C.C., Liu, Z.: Learning to prompt for vision-language models. Int. J. Comput. Vision 130(9), 2337–2348 (2022)
Zhou, M., Yu, L., Singh, A., Wang, M., Yu, Z., Zhang, N.: Unsupervised vision-and-language pre-training via retrieval-based multi-granular alignment. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16485–16494 (2022)
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Zhang, Y., Zhang, C., Hu, X., He, Z. (2024). Unsupervised Prototype Adapter for Vision-Language Models. In: Liu, Q., et al. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol 14425. Springer, Singapore. https://doi.org/10.1007/978-981-99-8429-9_16
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8428-2
Online ISBN: 978-981-99-8429-9