DOI: 10.1007/978-3-031-72920-1_15
Article

UMG-CLIP: A Unified Multi-granularity Vision Generalist for Open-World Understanding

Published: 01 October 2024

Abstract

Vision-language foundation models, represented by Contrastive Language-Image Pre-training (CLIP), have gained increasing attention for jointly understanding vision and language tasks. However, existing approaches primarily focus on training models to match global image representations with textual descriptions, thereby overlooking the critical alignment between local regions and corresponding text tokens. This paper extends CLIP with multi-granularity alignment. Notably, we deliberately construct a new dataset comprising pseudo annotations at various levels of granularity, encompassing image-level, region-level, and pixel-level captions and tags. Accordingly, we develop a Unified Multi-Granularity learning framework, termed UMG-CLIP, which simultaneously empowers the model with versatile perception abilities across different levels of detail. With parameter-efficient tuning, UMG-CLIP surpasses current widely used CLIP variants and achieves state-of-the-art performance on diverse image understanding benchmarks, including open-world recognition, retrieval, semantic segmentation, and panoptic segmentation tasks. We believe that UMG-CLIP represents a valuable advancement in vision-language foundation models. The code is available at https://github.com/lygsbw/UMG-CLIP.
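The abstract does not spell out the training objective, but a minimal sketch of what multi-granularity contrastive alignment could look like is given below. The function names, loss weighting, and tensor shapes are illustrative assumptions for exposition only, not the authors' implementation; the released code at the repository above is the authoritative reference.

```python
# Hypothetical sketch: contrastive (InfoNCE) alignment applied at image, region,
# and pixel level, in the spirit of the abstract. Not the authors' code.
import torch
import torch.nn.functional as F


def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss between two batches of paired embeddings (N, D)."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                      # (N, N) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)    # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def multi_granularity_loss(img_emb, img_txt, region_emb, region_txt,
                           pixel_emb, pixel_txt, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of image-, region-, and pixel-level alignment terms.

    img_emb / img_txt:       (B, D) global image embeddings vs. caption embeddings
    region_emb / region_txt: (R, D) pooled region embeddings vs. region-caption embeddings
    pixel_emb / pixel_txt:   (P, D) dense (pixel/mask) embeddings vs. tag embeddings
    """
    w_img, w_reg, w_pix = weights
    return (w_img * info_nce(img_emb, img_txt)
            + w_reg * info_nce(region_emb, region_txt)
            + w_pix * info_nce(pixel_emb, pixel_txt))


if __name__ == "__main__":
    D = 512
    loss = multi_granularity_loss(
        torch.randn(8, D), torch.randn(8, D),      # image-level pairs
        torch.randn(32, D), torch.randn(32, D),    # region-level pairs
        torch.randn(64, D), torch.randn(64, D),    # pixel-level pairs
    )
    print(loss.item())
```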


Information

          Published In

          Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part XXXVIII
          Sep 2024
          583 pages
          ISBN: 978-3-031-72919-5
          DOI: 10.1007/978-3-031-72920-1
          Editors: Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, Gül Varol

          Publisher

          Springer-Verlag

          Berlin, Heidelberg

          Publication History

          Published: 01 October 2024

          Author Tags

          1. Foundation model
          2. Open world
          3. Multi-granularity understanding

          Qualifiers

          • Article
