DOI: 10.1007/978-3-031-72920-1_15
Article

UMG-CLIP: A Unified Multi-granularity Vision Generalist for Open-World Understanding

Published: 01 October 2024

Abstract

Vision-language foundation models, represented by Contrastive Language-Image Pre-training (CLIP), have gained increasing attention for jointly understanding vision and language tasks. However, existing approaches primarily focus on training models to match global image representations with textual descriptions, thereby overlooking the critical alignment between local regions and corresponding text tokens. This paper extends CLIP with multi-granularity alignment. Notably, we deliberately construct a new dataset comprising pseudo annotations at various levels of granularity, encompassing image-level, region-level, and pixel-level captions and tags. Accordingly, we develop a Unified Multi-Granularity learning framework, termed UMG-CLIP, which simultaneously empowers the model with versatile perception abilities across different levels of detail. With parameter-efficient tuning, UMG-CLIP surpasses current widely used CLIP variants and achieves state-of-the-art performance on diverse image understanding benchmarks, including open-world recognition, retrieval, semantic segmentation, and panoptic segmentation tasks. We believe that UMG-CLIP represents a valuable advancement in vision-language foundation models. The code is available at https://github.com/lygsbw/UMG-CLIP.
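The abstract does not spell out the training objective, but a minimal sketch of what multi-granularity contrastive alignment could look like is given below. The function names, loss weighting, and tensor shapes are illustrative assumptions for exposition only, not the authors' implementation; the released code at the repository above is the authoritative reference.

```python
# Hypothetical sketch: contrastive (InfoNCE) alignment applied at image, region,
# and pixel level, in the spirit of the abstract. Not the authors' code.
import torch
import torch.nn.functional as F


def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss between two batches of paired embeddings (N, D)."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                      # (N, N) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)    # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def multi_granularity_loss(img_emb, img_txt, region_emb, region_txt,
                           pixel_emb, pixel_txt, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of image-, region-, and pixel-level alignment terms.

    img_emb / img_txt:       (B, D) global image embeddings vs. caption embeddings
    region_emb / region_txt: (R, D) pooled region embeddings vs. region-caption embeddings
    pixel_emb / pixel_txt:   (P, D) dense (pixel/mask) embeddings vs. tag embeddings
    """
    w_img, w_reg, w_pix = weights
    return (w_img * info_nce(img_emb, img_txt)
            + w_reg * info_nce(region_emb, region_txt)
            + w_pix * info_nce(pixel_emb, pixel_txt))


if __name__ == "__main__":
    D = 512
    loss = multi_granularity_loss(
        torch.randn(8, D), torch.randn(8, D),      # image-level pairs
        torch.randn(32, D), torch.randn(32, D),    # region-level pairs
        torch.randn(64, D), torch.randn(64, D),    # pixel-level pairs
    )
    print(loss.item())
```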


Information

          Published In

          Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part XXXVIII
          Sep 2024
          583 pages
          ISBN: 978-3-031-72919-5
          DOI: 10.1007/978-3-031-72920-1
          Editors: Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, Gül Varol

          Publisher

          Springer-Verlag

          Berlin, Heidelberg

          Publication History

          Published: 01 October 2024

          Author Tags

          1. Foundation model
          2. Open world
          3. Multi-granularity understanding

          Qualifiers

          • Article
