Abstract
The open-vocabulary segmentation (OVS) task has gained significant attention because it combines the challenges of segmentation with those of open-vocabulary classification, which involves recognizing arbitrary categories. Recent studies have leveraged pretrained vision-language models (VLMs) as a new paradigm for addressing this problem, leading to notable achievements. However, our analysis reveals that these methods are not yet fully satisfactory. In this paper, we empirically analyze the key challenges in four main categories: segmentation, dataset, reasoning, and recognition. Surprisingly, we observe that current research in OVS focuses primarily on recognition issues, while the others remain relatively unexplored. Motivated by these findings, we propose preliminary approaches to address the top three identified issues by integrating advanced models and adjusting existing segmentation models. Experimental results show that the proposed methods achieve promising performance gains on the OVS benchmark.
Acknowledgements
This work was supported by the Fundamental Research Funds for the Central Universities and the Research Funds of Renmin University of China (21XNLG28), the National Natural Science Foundation of China (No. 62276268), and Kuaishou. We thank the anonymous reviewers for their helpful comments.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Wang, X., Ji, L., Yan, K., Sun, Y., Song, R. (2024). Expanding the Horizons: Exploring Further Steps in Open-Vocabulary Segmentation. In: Liu, Q., et al. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol 14434. Springer, Singapore. https://doi.org/10.1007/978-981-99-8549-4_34
DOI: https://doi.org/10.1007/978-981-99-8549-4_34
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8548-7
Online ISBN: 978-981-99-8549-4
eBook Packages: Computer Science, Computer Science (R0)