Article
DOI: 10.1007/978-3-031-72940-9_9

In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation

Published: 17 November 2024

Abstract

We present Lazy Visual Grounding for open-vocabulary semantic segmentation, which decouples unsupervised object mask discovery from object grounding. Much prior work casts this task as pixel-to-text classification without object-level comprehension, leveraging the image-to-text classification capability of pretrained vision-and-language models. We argue that visual objects can be distinguished without prior text information, since segmentation is essentially a visual understanding task. Lazy visual grounding first discovers object masks covering an image with iterative Normalized cuts, and only later assigns text to the discovered objects in a late-interaction manner. Our model requires no additional training yet performs strongly on five public datasets: Pascal VOC, Pascal Context, COCO-object, COCO-stuff, and ADE20K. In particular, the visually appealing segmentation results demonstrate the model's ability to localize objects precisely.
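To make the mask-discovery stage of the abstract concrete, the sketch below shows a single Normalized-cut bipartition (Shi and Malik, 2000) on a toy patch-affinity matrix. This is an illustration of the underlying technique only, not the authors' implementation: the function name, the eigen-solver choice, and the toy affinity matrix are our own assumptions.

```python
import numpy as np

def ncut_bipartition(W):
    """One Normalized-cut bipartition of a symmetric affinity matrix W.

    Solves (D - W) y = lambda * D * y via the symmetric normalized
    Laplacian and thresholds the second-smallest eigenvector at zero
    to split the nodes (image patches) into two groups.
    """
    d = W.sum(axis=1)                     # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    # Symmetric normalized Laplacian: D^{-1/2} (D - W) D^{-1/2}
    L_sym = D_inv_sqrt @ (np.diag(d) - W) @ D_inv_sqrt
    vals, vecs = np.linalg.eigh(L_sym)    # eigenvalues in ascending order
    # Map the second-smallest eigenvector back through D^{-1/2}
    y = D_inv_sqrt @ vecs[:, 1]
    return y >= 0                         # boolean group assignment

# Toy example: 6 "patches" forming two tight clusters with weak cross-links.
W = np.array([
    [0.0, 0.9, 0.8, 0.1, 0.0, 0.1],
    [0.9, 0.0, 0.9, 0.0, 0.1, 0.0],
    [0.8, 0.9, 0.0, 0.1, 0.0, 0.1],
    [0.1, 0.0, 0.1, 0.0, 0.9, 0.8],
    [0.0, 0.1, 0.0, 0.9, 0.0, 0.9],
    [0.1, 0.0, 0.1, 0.8, 0.9, 0.0],
])
mask = ncut_bipartition(W)
```

In the paper's pipeline this cut is applied iteratively to self-supervised patch features to carve out object masks; only afterwards would each pooled mask be matched against text embeddings (e.g. from CLIP) in a late-interaction manner.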



Published In

Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part XLI
Sep 2024, 585 pages
ISBN: 978-3-031-72939-3
DOI: 10.1007/978-3-031-72940-9
Editors: Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, Gül Varol

Publisher

Springer-Verlag, Berlin, Heidelberg


Author Tags

1. Unsupervised object discovery
2. Training-free
3. Open-vocabulary semantic segmentation
4. CLIP
