Article
DOI: 10.1007/978-3-031-72940-9_9

In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation

Published: 17 November 2024

Abstract

We present Lazy Visual Grounding for open-vocabulary semantic segmentation, which decouples unsupervised object mask discovery from object grounding. Much prior work casts this task as pixel-to-text classification without object-level comprehension, leveraging the image-to-text classification capability of pretrained vision-and-language models. We argue that visual objects can be distinguished without prior text information, since segmentation is essentially a visual understanding task. Lazy visual grounding first discovers object masks covering an image with iterative Normalized cuts, and only later assigns text to the discovered objects in a late-interaction manner. Our model requires no additional training yet performs strongly on five public datasets: Pascal VOC, Pascal Context, COCO-object, COCO-stuff, and ADE20K. In particular, the visually appealing segmentation results demonstrate the model's ability to localize objects precisely.
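To make the mask-discovery stage of the abstract concrete, the sketch below shows a single Normalized-cut bipartition (Shi and Malik, 2000) on a toy patch-affinity matrix. This is an illustration of the underlying technique only, not the authors' implementation: the function name, the eigen-solver choice, and the toy affinity matrix are our own assumptions.

```python
import numpy as np

def ncut_bipartition(W):
    """One Normalized-cut bipartition of a symmetric affinity matrix W.

    Solves (D - W) y = lambda * D * y via the symmetric normalized
    Laplacian and thresholds the second-smallest eigenvector at zero
    to split the nodes (image patches) into two groups.
    """
    d = W.sum(axis=1)                     # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    # Symmetric normalized Laplacian: D^{-1/2} (D - W) D^{-1/2}
    L_sym = D_inv_sqrt @ (np.diag(d) - W) @ D_inv_sqrt
    vals, vecs = np.linalg.eigh(L_sym)    # eigenvalues in ascending order
    # Map the second-smallest eigenvector back through D^{-1/2}
    y = D_inv_sqrt @ vecs[:, 1]
    return y >= 0                         # boolean group assignment

# Toy example: 6 "patches" forming two tight clusters with weak cross-links.
W = np.array([
    [0.0, 0.9, 0.8, 0.1, 0.0, 0.1],
    [0.9, 0.0, 0.9, 0.0, 0.1, 0.0],
    [0.8, 0.9, 0.0, 0.1, 0.0, 0.1],
    [0.1, 0.0, 0.1, 0.0, 0.9, 0.8],
    [0.0, 0.1, 0.0, 0.9, 0.0, 0.9],
    [0.1, 0.0, 0.1, 0.8, 0.9, 0.0],
])
mask = ncut_bipartition(W)
```

In the paper's pipeline this cut is applied iteratively to self-supervised patch features to carve out object masks; only afterwards would each pooled mask be matched against text embeddings (e.g. from CLIP) in a late-interaction manner.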



Published In

Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part XLI
Sep 2024, 585 pages
ISBN: 978-3-031-72939-3
DOI: 10.1007/978-3-031-72940-9
Editors: Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, Gül Varol

Publisher

Springer-Verlag, Berlin, Heidelberg


Author Tags

1. Unsupervised object discovery
2. Training-free
3. Open-vocabulary semantic segmentation
4. CLIP
