
Image Segmentation in Foundation Model Era: A Survey

Tianfei Zhou, Fei Zhang, Boyu Chang, Wenguan Wang, Ye Yuan, Ender Konukoglu, Daniel Cremers

T. Zhou, B. Chang and Y. Yuan are with the Department of Computer Science, Beijing Institute of Technology, China. F. Zhang is with the Department of Electronic Engineering, Shanghai Jiao Tong University, China. W. Wang is with CCAI, Zhejiang University, China. E. Konukoglu is with the Computer Vision Lab, ETH Zurich, Switzerland. D. Cremers is with the Department of Computer Science, Technical University of Munich, Germany.

E-mail: tfzhou@bit.edu.cn (Tianfei Zhou)
Abstract

Image segmentation is a long-standing challenge in computer vision, studied continuously over several decades, as evidenced by seminal algorithms such as N-Cut, FCN, and MaskFormer. With the advent of foundation models (FMs), contemporary segmentation methodologies have embarked on a new epoch by either adapting FMs (e.g., CLIP, Stable Diffusion, DINO) for image segmentation or developing dedicated segmentation foundation models (e.g., SAM). These approaches not only deliver superior segmentation performance, but also herald newfound segmentation capabilities previously unseen in the deep learning context. However, current research in image segmentation lacks a detailed analysis of the distinct characteristics, challenges, and solutions associated with these advancements. This survey seeks to fill this gap by providing a thorough review of cutting-edge research centered around FM-driven image segmentation. We investigate two basic lines of research – generic image segmentation (i.e., semantic segmentation, instance segmentation, panoptic segmentation), and promptable image segmentation (i.e., interactive segmentation, referring segmentation, few-shot segmentation) – by delineating their respective task settings, background concepts, and key challenges. Furthermore, we provide insights into the emergence of segmentation knowledge from FMs like CLIP, Stable Diffusion, and DINO. An exhaustive overview of over 300 segmentation approaches is provided to encapsulate the breadth of current research efforts. Subsequently, we engage in a discussion of open issues and potential avenues for future research. We envisage that this fresh, comprehensive, and systematic survey will catalyze the evolution of advanced image segmentation systems.

Index Terms:
Image Segmentation, Foundation Model, Computer Vision

1 Introduction

Image segmentation has been, and still is, an important and challenging research field in computer vision, which aims to partition pixels into distinct groups. It constitutes an initial step in achieving higher-order goals, including physical scene understanding, reasoning over visual commonsense, and perceiving social affordances, and has widespread applications in domains like autonomous driving, medical image analysis, automated surveillance, and image editing.

The task has garnered extensive attention over decades, resulting in a plethora of algorithms in the literature, ranging from traditional, non-deep learning methods such as thresholding [1, 2], histogram mode seeking [3, 4], region growing and merging [5, 6], spatial clustering [7], energy diffusion [8], superpixels [9], conditional and Markov random fields [10], to more advanced, deep learning methods, e.g., FCN-based [11, 12, 13, 14, 15, 16, 17, 18, 19, 20] and particularly the DeepLab family [17, 18, 19, 20], RNN-based [21], Transformer-based [22, 23, 24, 25, 26, 27, 28], and the R-CNN family [29, 30, 31]. These approaches have shown remarkable performance and robustness across all critical segmentation fields, e.g., semantic, instance, and panoptic segmentation. Yet, the exploration of image segmentation continues beyond these advancements.

Foundation Models (FMs) [32] have emerged as transformative technologies in recent years, reshaping our understanding of core domains in artificial intelligence (AI) including natural language processing [33], computer vision [34], and many other interdisciplinary areas [35, 36, 37]. Notable examples include large language models (LLMs) like GPT-3 [38] and GPT-4 [39], multimodal large language models (MLLMs) like Flamingo [40] and Gemini [41], and diffusion models (DMs) like Sora [42] and Stable Diffusion (SD) [43]. These models, distinguished by their immense scale and complexity, have exhibited emergent capabilities [44, 45] to tackle a wide array of intricate tasks with notable efficacy and efficiency. Meanwhile, they have unlocked new possibilities, such as generating chains of reasoning [46], offering human-like responses in dialogue scenarios [38], creating realistic-looking videos [42], and synthesizing novel programs [47]. The advent of GPT-4 and Sora has sparked considerable excitement within the AI community to fulfill artificial general intelligence (AGI) [48].

In the era dominated by FMs, image segmentation has undergone significant evolution, marked by distinct features uncommon in the preceding research era. To underscore the motivation behind our survey, we highlight several characteristics exemplifying this transformation:

♠ FM technology has led to the emergence of segmentation generalists. Unlike traditional frameworks (e.g., FCN, Mask R-CNN), contemporary segmentation models have become promptable, i.e., they generate a mask (akin to an answer in LLMs) based on a handcrafted prompt specifying what to segment in an image. The LLM-like promptable interface significantly enhances the task generality of segmentors, enabling them to rapidly adapt to various existing and new segmentation tasks, in a zero-shot (e.g., SAM [49], SEEM [50]) or few-shot (e.g., SegGPT [51]) manner. Note that these promptable models markedly differ from earlier universal models [23, 24, 22, 25], which remain limited to a fixed set of predetermined tasks, e.g., joint semantic, instance, and panoptic segmentation, with a closed vocabulary.

♡ Training-free segmentation has recently emerged as a burgeoning research area [52, 53, 54, 55, 56, 57]. It aims to extract segmentation knowledge from pre-trained FMs, marking a departure from established learning paradigms, such as supervised, semi-supervised, weakly supervised, and self-supervised learning. Recent studies highlight that segmentation masks can be derived effortlessly from attention maps or internal representations within models like CLIP, Stable Diffusion or DINO/DINOv2, even though they were not originally designed for segmentation purposes.

♣ There is a notable trend towards integrating LLMs into segmentation systems to harness their reasoning capabilities and world knowledge [58, 59, 60, 61]. The LLM-powered segmentors possess the capacity to read, listen, and even reason to ground real-world, abstract linguistic queries into specific pixel regions. While previous efforts have explored similar capabilities in tasks such as referring segmentation [62], these methods are limited to handling basic queries like “the front-runner”. In contrast, LLM-powered segmentors can adeptly manage more complicated queries like “who will win the race?”. This capability represents a notable advancement towards developing more intelligent vision systems.

♢ Generative models, particularly text-to-image diffusion models, are garnering increasing attention in recent image segmentation research. It has been observed that DMs implicitly learn meaningful object groupings and semantics during the text-to-image generation process [63], functioning as strong unsupervised representation learners. This motivates a stream of works to directly decode the latent code of pre-trained DMs into segmentation masks, in either a label-efficient or completely unsupervised manner [63, 64]. Moreover, some efforts extend the inherent denoising diffusion process in DMs to segmentation, by approaching image segmentation from an image-conditioned mask generation perspective [65, 66, 67].

In light of these features, we found that most existing surveys in the field [68, 69, 70] are now outdated – one of the latest surveys [70] was published in 2021 and focuses only on semantic and instance segmentation. This leaves a notable gap in capturing recent FM-based approaches.

Our Contributions. To fill the gap, we offer an exhaustive and timely overview to examine how foundation models are transforming the field of image segmentation. This survey marks the first comprehensive exploration of recent image segmentation approaches that are built upon famous FMs, such as CLIP [71], Stable Diffusion [43], DINO [56]/DINOv2 [57], SAM [49] and LLMs/MLLMs [72]. It spans the breadth of the field and delves into the nuances of individual methods, thereby providing the reader with a thorough and up-to-date understanding of this topic. Beyond this, we elucidate open questions and potential directions to illuminate the path forward in this key field.

Related Surveys and Differences. In the past decade, many surveys have studied image segmentation from various perspectives. For example, [73] reviews region- and boundary-based segmentation methods in 2015. With the transition to the deep learning era, a series of works [74, 70, 75, 76, 77, 78] has summarized progress in generic segmentation tasks like semantic, instance and panoptic segmentation. A recent study [79] focuses on the specific task of open-vocabulary segmentation. Other studies delve into crucial aspects of image segmentation, such as evaluation protocols [80] or loss functions [81]. In addition, there exist surveys that focus on segmentation techniques in specialized domains, e.g., video [82], medical imaging [83, 84].

Given the accelerated evolution of FMs, there has been a surge of surveys that elucidate the fundamental principles and pioneering efforts in LLMs [33], MLLMs [72], and DMs [85]. However, conspicuously absent from these works is a discussion on the role of FMs in advancing image segmentation. The survey most relevant to ours is [86], which offers an extensive review of recent developments related to SAM [49]. SAM represents a groundbreaking contribution to the segmentation field, making [86] a valuable resource. However, within the broader context of FMs, SAM is just one among many; thus, the scope of [86] is too limited to encompass the entirety of progress in the segmentation field.

Unlike prior surveys, our work stands apart in its exclusive focus on the contributions of FMs to image segmentation, and fills an existing gap in the current research landscape. We document the latest techniques, spotlight major trends, and envision prospective research trajectories, which will help researchers stay abreast of advances in image segmentation and accelerate progress in the field.

Survey Organization. The remainder of this paper is structured as follows. §2 presents essential background on image segmentation and FMs. §3 highlights the emergence of segmentation knowledge from existing FMs. §4 and §5 review the most important FM-based image segmentation methods, mainly from the past three years. §6 raises open issues and future directions. We conclude the paper in §7.


Figure 1: The taxonomy based on image segmentation tasks and foundation models. Typical solutions are shown for each category.

2 Background

In this section, we first present a unified formulation of image segmentation tasks and categorize research directions in §2.1. Then, we provide a concise background overview of prominent FMs in §2.2.

2.1 Image Segmentation

2.1.1 A Unified Formulation

The central goal of the paper is to investigate the contributions of FMs to modern image segmentation technology. To this end, we first introduce a unified mathematical formulation applicable to various segmentation tasks. Concretely, denote $\bm{\mathcal{X}}$ and $\bm{\mathcal{Y}}$ as the input space and output segmentation space, respectively. An image segmentation solution seeks to learn an ideal mapping function $f$:

$$f:~\bm{\mathcal{X}} \mapsto \bm{\mathcal{Y}}, \quad \text{where}~~\bm{\mathcal{X}} = \mathcal{I} \times \mathcal{P}, \quad \bm{\mathcal{Y}} = \mathcal{M} \times \mathcal{C}. \tag{1}$$

Here $f$ is typically instantiated as a neural network. The input space $\bm{\mathcal{X}}$ is decomposed as $\mathcal{I} \times \mathcal{P}$, where $\mathcal{I}$ represents an image domain (comprising solely a single image $I$), and $\mathcal{P}$ refers to a collection of prompts, which is exclusively employed in certain segmentation tasks. The output space is $\bm{\mathcal{Y}} = \mathcal{M} \times \mathcal{C}$, which encompasses a set of segmentation masks $\mathcal{M}$ and a vocabulary $\mathcal{C}$ of semantic categories associated with these masks. Eq. 1 furnishes a structured framework for understanding image segmentation, wherein a neural network is trained to map an input image, along with potentially user-specified prompts, to segmentation masks as well as corresponding semantic categories. Based on Eq. 1, we subsequently build a taxonomy for image segmentation.

2.1.2 Image Segmentation Category

According to whether $\mathcal{P}$ is provided, we categorize image segmentation methods into two classes (Fig. 1): generic image segmentation (GIS) and promptable image segmentation (PIS).

• Generic Image Segmentation. GIS aims to segment an image into distinct regions, each associated with a semantic category or an object. In GIS, the input space comprises solely the image, i.e., $\bm{\mathcal{X}} \equiv \mathcal{I}$, indicating $\mathcal{P} = \emptyset$. Based on the definition of the output space $\bm{\mathcal{Y}}$, GIS methods can be further categorized into three major types: (i) semantic segmentation aims to identify and label each pixel with a semantic class in $\mathcal{C}$. (ii) instance segmentation involves grouping pixels that belong to the same semantic class into separate object instances. (iii) panoptic segmentation integrates semantic and instance segmentation to predict per-pixel class and instance labels, and is able to provide a comprehensive scene parsing.

Furthermore, based on whether the testing vocabulary $\mathcal{C}_{test}$ includes novel classes absent from the training vocabulary $\mathcal{C}_{train}$, the three tasks are studied under two settings: closed-vocabulary (i.e., $\mathcal{C}_{train} \equiv \mathcal{C}_{test}$) and open-vocabulary (i.e., $\mathcal{C}_{train} \subset \mathcal{C}_{test}$) segmentation. Notably, the closed-vocabulary setup has been extensively studied over the past decade. However, its open-vocabulary counterpart is still in its infancy and has garnered attention only in recent years, particularly with the advent of FMs.

• Promptable Image Segmentation. PIS extends GIS by additionally incorporating a set of prompts $\mathcal{P}$, specifying the target to segment. In general, PIS methods only generate segmentation masks closely related to the prompts and do not directly predict classes. While the term “prompt” is relatively new, PIS has been studied for many years. Depending upon the prompt type, PIS methods can be grouped into the following categories: (i) interactive segmentation aims to segment out specific objects or parts according to user input, often provided through clicks, scribbles, boxes, or polygons, thus $\mathcal{P} = \{\textit{click}, \textit{scribble}, \textit{box}, \textit{polygon}\}$ consists of visual prompts; (ii) referring segmentation entails extracting the region referred to by a linguistic phrase, thus $\mathcal{P} = \{\textit{linguistic phrase}\}$ refers to textual prompts; (iii) few-shot segmentation targets segmenting novel objects in a given query image with a few annotated support images, i.e., $\mathcal{P} = \{(\textit{image}, \textit{mask})\}$ refers to a collection of image-mask pairs. While great progress has been made in these segmentation challenges, previous studies address the various prompt types independently. In sharp contrast, FM-based methods aim to integrate them into a unified framework. Moreover, in-context segmentation has emerged as a novel few-shot segmentation task.
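As a minimal sketch of how Eq. 1 and the taxonomy above fit together, the following Python snippet models an element of the input space as an image plus an optional prompt list, and derives the task family and vocabulary setting from it. All names here (`SegInput`, `task_family`, `vocabulary_setting`) are hypothetical and chosen only for illustration; they do not come from any library or from the surveyed methods.

```python
from dataclasses import dataclass, field
from typing import Any, List, Set


@dataclass
class SegInput:
    """An element of X = I x P: one image plus a (possibly empty) prompt list."""
    image: Any                                        # the single image I
    prompts: List[Any] = field(default_factory=list)  # P; left empty for GIS


def task_family(x: SegInput) -> str:
    """Return "GIS" when P is empty and "PIS" otherwise."""
    return "PIS" if x.prompts else "GIS"


def vocabulary_setting(c_train: Set[str], c_test: Set[str]) -> str:
    """Closed-vocabulary if C_train == C_test; open-vocabulary if C_train is a proper subset."""
    if c_train == c_test:
        return "closed-vocabulary"
    if c_train < c_test:  # proper-subset test on Python sets
        return "open-vocabulary"
    raise ValueError("C_train must equal, or be a proper subset of, C_test")


# Illustrative inputs: a generic query and a referring (text-prompted) query.
generic = SegInput(image="street.jpg")
referring = SegInput(image="street.jpg", prompts=["the red car"])
print(task_family(generic), task_family(referring))
print(vocabulary_setting({"car", "road"}, {"car", "road", "scooter"}))
```

The sketch deliberately leaves the prompt type open (`Any`), since clicks, phrases, and image-mask pairs all occupy the same slot $\mathcal{P}$ in the formulation.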

2.1.3 Learning Paradigms for Image Segmentation

Several prevalent learning strategies are employed to approximate the function $f$ in Eq. 1. (i) Supervised learning: modern image segmentation methods are generally learned in a fully supervised manner, necessitating a collection of training images and their desired outputs, i.e. per-pixel annotations. (ii) Unsupervised learning: in the absence of explicit annotated supervision, the task of approximating $f$ falls under unsupervised learning. Most existing unsupervised learning-based image segmentation models utilize self-supervised techniques, training networks with automatically-generated pseudo labels derived from image data. (iii) Weakly-supervised learning: in this case, supervision information may be inexact, incomplete or inaccurate. For inexact supervision, labels are typically acquired from a more easily annotated domain (e.g., image tag, bounding box, scribble). In the case of incomplete supervision, labels are provided for only a subset of training images. Inaccurate supervision entails per-pixel annotations for all training images, albeit with the presence of noise. (iv) Training free: in addition to the aforementioned strategies, a novel paradigm – training-free segmentation – has gained attention in the FM era, aiming to extract segmentation directly from pre-trained FMs, without involving any model training.

2.2 Foundation Model

FMs were first defined in [32] as “any model that is trained on broad data (generally using self-supervision at scale) that can be adapted to a wide range of downstream tasks”. The term ‘foundation’ underscores the critically central yet incomplete character of FMs, which are marked by two traits: homogenization of methodologies across research communities and the emergence of new capabilities. While the basic ingredients of FMs, such as deep neural networks and self-supervised learning, have been around for many years, the paradigm shift towards FMs is significant because emergence and homogenization allow narrow task-specific models to be replaced with more generic task-agnostic models that are not strongly tied to a particular task or domain. In the subsequent subsections, we provide a brief review of language (§2.2.1) and vision foundation models (§2.2.2). Notably, we only focus on topics relevant to this survey, and direct interested readers to [125, 33] for more comprehensive discussions.

2.2.1 Language Foundation Model

• Large Language Models (LLMs). Language modeling is one of the primary approaches to advancing the language intelligence of machines. In general, it aims to model the generative likelihood of word sequences, so as to predict the probabilities of future tokens. In the past two decades, language modeling has evolved from the earliest statistical language models (SLMs) to neural language models (NLMs), then to small-sized pre-trained language models (PLMs), and finally to today's LLMs [33]. As enlarged PLMs (in terms of model size, data size and training compute), LLMs not only achieve a significant zero-shot performance improvement (even in some cases matching finetuned models), but also show strong reasoning capabilities across various domains, e.g., code writing [47] and math problem solving [126]. A remarkable application of LLMs is ChatGPT, which has attracted widespread attention and transformed the way we interact with AI technology.

• Multimodal Large Language Models (MLLMs). MLLMs [72] are multimodal extensions of LLMs that bring together the reasoning power of LLMs with the capability to process non-textual modalities (e.g., vision, audio). This kind of unification expands the impact of LLMs with novel interfaces and capabilities, enabling them to solve new and more complicated tasks beyond the NLP domain. MLLMs represent the next level of LLMs. On one hand, multimodal perception is a natural way for knowledge acquisition and interaction with the real world, and thus serves as a fundamental component for achieving AGI; on the other hand, the multimodal extension expands the potential of pure language modeling to more complex tasks in, e.g., robotics and autonomous driving.

2.2.2 Visual Foundation Model

• Contrastive Language-Image Pre-training (CLIP). CLIP [71] embodies a language-supervised vision model trained on 400M image-text pairs sourced from the Internet. The model has an encoder-only architecture, consisting of separate encoders for image and text encoding. It is trained by an image-text contrastive learning objective:

$$\mathcal{L}_{i2t} = -\log\left[\frac{\exp(\text{sim}(\bm{x}_k, \bm{t}_k)/\tau)}{\sum_{j=1}^{J} \exp(\text{sim}(\bm{x}_k, \bm{t}_j)/\tau)}\right], \tag{2}$$

where $(\bm{x}_k, \bm{t}_k)$ denotes the image and text embeddings of the $k$-th image-text example $(x_k, t_k)$. $J$ and $\tau$ indicate the number of examples and the softmax temperature, respectively. The loss $\mathcal{L}_{i2t}$ maximizes agreement between the embeddings of matching image and text pairs while minimizing it for non-matching pairs. In practice, the text-to-image contrastive loss $\mathcal{L}_{t2i}$ is calculated similarly, and the model is trained by a joint loss: $\mathcal{L}_{\text{contrast}} = \mathcal{L}_{i2t} + \mathcal{L}_{t2i}$. ALIGN [127] is a follow-up work that harnesses $\mathcal{L}_{\text{contrast}}$ for visual representation learning. It simplifies the costly data curation process in CLIP, and succeeds in further scaling up representation learning with a noisy dataset of over one billion image-text pairs. Both CLIP and ALIGN acquire semantically-rich visual concepts and demonstrate impressive transferability in recognizing novel categories, leading to increased adoption for tackling zero-shot and open-vocabulary recognition tasks.
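To make the joint objective concrete, the following is a minimal pure-Python sketch of Eq. 2 and $\mathcal{L}_{\text{contrast}}$ on a toy batch. Several choices are assumptions for illustration rather than details given in this section: $\text{sim}$ is taken to be cosine similarity, the per-example loss is averaged over the batch, the default temperature `tau=0.07` is an arbitrary value, and the helper names (`cosine`, `directional_loss`, `contrastive_loss`) are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity, used here as the sim(.,.) of Eq. 2 (an assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def directional_loss(queries: List[Vector], keys: List[Vector], tau: float) -> float:
    """Eq. 2 averaged over k: query k should match the key at the same index."""
    total = 0.0
    for k, query in enumerate(queries):
        logits = [cosine(query, key) / tau for key in keys]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[k]  # equals -log softmax(logits)[k]
    return total / len(queries)


def contrastive_loss(image_embs: List[Vector], text_embs: List[Vector],
                     tau: float = 0.07) -> float:
    """Joint loss L_contrast = L_i2t + L_t2i."""
    return (directional_loss(image_embs, text_embs, tau)
            + directional_loss(text_embs, image_embs, tau))


# Toy batch: aligned pairs should score a lower loss than swapped pairs.
images = [[1.0, 0.0], [0.0, 1.0]]
print(contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]]))  # aligned
print(contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]]))  # swapped
```

The text-to-image term reuses the same function with the roles of queries and keys exchanged, which mirrors the statement that $\mathcal{L}_{t2i}$ is calculated similarly.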

• Self-Distillation with No Labels (DINO & DINOv2). DINO [56] interprets self-supervised learning of ViTs as a special case of self-distillation, wherein learning relies on the model’s own predictions rather than external labels. Despite being a relatively small-sized model, DINO demonstrates a profound understanding of the visual world, characterized by its highly structured feature space. Notably, DINO shows two emergent properties: its features are excellent k-NN classifiers, and they contain explicit information pertaining to image segmentation. DINOv2 [57] pushes the limits of visual features by scaling DINO in model and data sizes, along with an improved training recipe. The resultant model yields general-purpose features that close the performance gap with supervised alternatives across various benchmarks, while also showing notable properties, such as understanding of object parts and scene geometry. Strictly speaking, DINO is not a ‘large’ model in terms of parameter scale, but it is included due to its emergent properties that are useful for segmentation, and its role as the predecessor of DINOv2.

• Diffusion Models (DMs). DMs are a family of generative models that are Markov chains trained with variational inference. They have demonstrated remarkable potential in creating visually realistic samples, and set the current state-of-the-art in generation tasks. The milestone work, the denoising diffusion probabilistic model (DDPM) [128], was published in 2020 and has sparked an exponentially increasing interest in the generative AI community afterwards. A DDPM is defined as a parameterized Markov chain that generates data from Gaussian noise within a finite number of transitions during inference. Its training encompasses two interconnected processes. (i) Forward pass maps a data distribution $\bm{z}_0 \sim q(\bm{z}_0)$ to a simpler prior distribution $\bm{z}_t$ via a diffusion process:

$$\bm{z}_t \sim \mathcal{N}\left(\sqrt{\alpha_t}\,\bm{z}_{t-1}, (1-\alpha_t)\bm{I}\right), \tag{3}$$

where $\{\alpha_t\}$ are fixed coefficients that determine the noise schedule. (ii) Reverse pass leverages a trained neural network $\epsilon_\theta$ (typically a UNet) to gradually reverse the effects of the forward process by training it to estimate the noise $\epsilon$ which has been added to $\bm{z}_0$. Hence, the training objective can be derived as:

$$\mathcal{L}_{\text{DM}} = \mathbb{E}_{\bm{z}_0, \epsilon \sim \mathcal{N}(0,1), t}\left[\|\epsilon - \epsilon_\theta(\bm{z}_t(\bm{z}_0, \epsilon), t; \mathcal{C})\|_2^2\right], \tag{4}$$

where $\mathcal{C}$ denotes an additional conditioning input to $\epsilon_\theta$. Further, latent diffusion models (LDMs) extend DMs by training them in the low-dimensional latent space of an autoencoding model $\mathcal{E}$ (e.g., VQGAN [129]):

$$\mathcal{L}_{\text{LDM}} = \mathbb{E}_{\mathcal{E}(\bm{z}_0), \epsilon \sim \mathcal{N}(0,1), t}\left[\|\epsilon - \epsilon_\theta(\bm{z}_t(\mathcal{E}(\bm{z}_0), \epsilon), t; \mathcal{C})\|_2^2\right]. \tag{5}$$

This leads to many popular text-to-image DMs (T2I-DMs), e.g., Stable Diffusion (SD) [43]. Current T2I-DMs are able to generate high-fidelity images with rich texture, diverse content and intricate structures while having compositional and editable semantics. This phenomenon potentially suggests that T2I-DMs can implicitly learn both high-level and low-level visual concepts from massive image-text pairs. Moreover, recent research has highlighted the clear correlations between attention maps and text prompts in T2I-DMs [130, 131]. These properties extend the capability of T2I-DMs from generation to perception tasks [132, 133].
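The forward step of Eq. 3 and a single-sample estimate of the objective in Eq. 4 can be sketched in a few lines of pure Python. This is an illustrative sketch under stated assumptions, not the training code of any surveyed model: latents are plain lists of floats, `eps_theta` is a hypothetical placeholder callable standing in for the UNet, and `noisy_sample` uses the standard DDPM closed form $\bm{z}_t = \sqrt{\bar{\alpha}_t}\,\bm{z}_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ with $\bar{\alpha}_t = \prod_{s \le t} \alpha_s$, which is not stated in this section but follows from composing $t$ steps of Eq. 3.

```python
import math
import random
from typing import Callable, List, Optional


def forward_step(z_prev: List[float], alpha_t: float, rng: random.Random) -> List[float]:
    """One step of Eq. 3: z_t ~ N(sqrt(alpha_t) * z_{t-1}, (1 - alpha_t) I)."""
    std = math.sqrt(1.0 - alpha_t)
    return [math.sqrt(alpha_t) * z + std * rng.gauss(0.0, 1.0) for z in z_prev]


def noisy_sample(z0: List[float], alphas: List[float], t: int,
                 eps: List[float]) -> List[float]:
    """z_t(z_0, eps) via the closed form obtained by composing t steps of Eq. 3."""
    alpha_bar = math.prod(alphas[:t])  # product of alpha_1 ... alpha_t
    return [math.sqrt(alpha_bar) * z + math.sqrt(1.0 - alpha_bar) * e
            for z, e in zip(z0, eps)]


def dm_loss(z0: List[float], alphas: List[float], t: int, eps: List[float],
            eps_theta: Callable[[List[float], int, Optional[object]], List[float]],
            cond: Optional[object] = None) -> float:
    """Single-sample term of Eq. 4: squared L2 error between eps and eps_theta."""
    z_t = noisy_sample(z0, alphas, t, eps)
    predicted = eps_theta(z_t, t, cond)
    return sum((e - p) ** 2 for e, p in zip(eps, predicted))


# Toy usage with a placeholder "network" that always predicts zero noise.
rng = random.Random(0)
alphas = [0.9, 0.8]                      # hypothetical two-step noise schedule
z0 = [0.3, 0.7]
eps = [rng.gauss(0.0, 1.0) for _ in z0]
print(dm_loss(z0, alphas, 2, eps, lambda z_t, t, cond: [0.0] * len(z_t)))
```

Under the same assumptions, the LDM objective of Eq. 5 would only change the first argument: `z0` would be replaced by the output of the autoencoder's encoder applied to the image.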

• Segment Anything (SAM). SAM [49] has sparked a revolution in the field of image segmentation, and has profoundly influenced the development of large, general-purpose models in computer vision. Unlike the aforementioned vision FMs, SAM is built specifically for segmentation. Particularly, it is a promptable model trained on a corpus of 1 billion masks from 11 million images using a promptable segmentation task. It achieves powerful zero-shot task generality to tackle various image segmentation tasks, and allows the use of “prompts” in the form of points, masks, boxes, and even language to enhance segmentation interactivity. Beyond this, SAM has shown promising capabilities in a multitude of tasks, including medical imaging [111], image editing [134], and video segmentation [135]. We refer readers to [86] for a comprehensive understanding of SAM’s usage scenarios.

3 Segmentation Knowledge Emerges from FMs

Given the emergent capabilities of LLMs, a natural question arises: do segmentation properties emerge from FMs? The answer is yes, even for FMs not explicitly designed for segmentation, such as CLIP, DINO, and diffusion models. In this section, we elaborate on the techniques for extracting segmentation knowledge from these FMs, which effectively unlock a new frontier in image segmentation, i.e., acquiring segmentation without any training.

3.1 Segmentation Emerges from CLIP

Many studies [52, 53, 136] acknowledge that the standard CLIP is able to discern the appearance of objects, but is limited in understanding their locations. The main reason is that CLIP learns holistic visual representations that are invariant to spatial positions, whereas segmentation requires spatial-covariant features – local representations should vary w.r.t. their spatial positions in an image. To better explain this, we revisit self-attention in Transformers:

$$\text{SelfAttention}(\bm{q},\bm{k},\bm{v}) = \underbrace{\text{softmax}\left(\bm{q}\bm{k}^{\top}/\sqrt{d}\right)}_{\text{self-attention matrix}}\bm{v}, \tag{6}$$

where $\bm{q}=\bm{x}\bm{W}_q\in\mathbb{R}^{N\times d}$, $\bm{k}=\bm{x}\bm{W}_k\in\mathbb{R}^{N\times d}$, and $\bm{v}=\bm{x}\bm{W}_v\in\mathbb{R}^{N\times d}$ are the query, key, and value embeddings. $\bm{x}\in\mathbb{R}^{N\times d}$ is the input sequence with $N$ tokens, each being a $d$-dimensional vector. $\bm{W}\in\mathbb{R}^{d\times d}$ denotes a projection matrix whose parameters are learned in pre-training. CLIP applies attention pooling to the last self-attention layer:

$$\text{AttentionPooling}(\bm{q},\bm{k},\bm{v}) = \text{SelfAttention}(\bar{\bm{q}},\bm{k},\bm{v}), \tag{7}$$

where $\bar{\bm{q}}=\bar{\bm{x}}\bm{W}_q\in\mathbb{R}^{d}$, and $\bar{\bm{x}}\in\mathbb{R}^{d}$ is the globally average-pooled feature of $\bm{x}$. Eq. 7 encourages similar representations for different locations, leading to spatial-invariant features.

Despite this, MaskCLIP [52] finds that it is feasible to extract segmentation knowledge from CLIP with minimal modifications to the attention pooling module. Specifically, it simply sets the attention matrix to an identity matrix. In this way, each local visual token receives information only from its corresponding position, so that visual features (i.e., $\bm{v}$) are well localized. Such a straightforward modification results in an 11% increase in CLIP’s mIoU on COCO-Stuff. Furthermore, SCLIP [53] proposes to compute pairwise token correlations to allow each local token to attend to positions sharing similar information, i.e., the attention matrix is computed as $\text{softmax}(\bm{q}\bm{q}^{\top})+\text{softmax}(\bm{k}\bm{k}^{\top})$. CLIPSurgery [137] computes a value-value attention matrix, $\text{softmax}(\bm{v}\bm{v}^{\top})$, and incorporates this attention into each Transformer block rather than only the last one. NACLIP [138] computes a key-key attention matrix, $\text{softmax}(\bm{k}\bm{k}^{\top})$, and further weights the attention map with a Gaussian kernel to encourage more consistent attention across adjacent patches. GEM [139] presents a generalized way to calculate the attention matrix as $\text{softmax}(\bm{q}\bm{q}^{\top})+\text{softmax}(\bm{k}\bm{k}^{\top})+\text{softmax}(\bm{v}\bm{v}^{\top})$.
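As a minimal sketch of the attention variants listed above, the following NumPy code builds each attention matrix exactly as the formulas are written in this section and applies it to the value features. The function name `modified_attention` and the variant labels are hypothetical, and the sketch omits details that the actual methods may use (scaling by $\sqrt{d}$, temperature, output normalization, the Gaussian weighting in NACLIP, and multi-head handling).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def modified_attention(q, k, v, variant="identity"):
    """q, k, v: (N, d) token embeddings. Returns (attention matrix, attended values)."""
    n = q.shape[0]
    if variant == "identity":      # MaskCLIP-style: each token keeps its own value
        attn = np.eye(n)
    elif variant == "qq_kk":       # SCLIP-style: query-query plus key-key
        attn = softmax(q @ q.T) + softmax(k @ k.T)
    elif variant == "vv":          # CLIPSurgery-style: value-value
        attn = softmax(v @ v.T)
    elif variant == "kk":          # NACLIP-style, without the Gaussian kernel
        attn = softmax(k @ k.T)
    elif variant == "qq_kk_vv":    # GEM-style: sum of all three self-correlations
        attn = softmax(q @ q.T) + softmax(k @ k.T) + softmax(v @ v.T)
    else:
        raise ValueError(f"unknown variant: {variant}")
    return attn, attn @ v
```

With the identity variant the output equals $\bm{v}$, which makes explicit why the resulting features stay localized: no information is mixed across positions.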

3.2 Segmentation Emerges from DMs

A family of methods [55, 140, 141, 142, 89, 143, 144, 145, 146, 147] shows that pre-trained generative models, especially DMs, manifest remarkable segmentation capabilities. A major insight is that segmentation emerges from cross-attention maps in DMs. Formally, the cross-attention at one layer is computed as:

$$\bm{m} = \text{CrossAttention}(\bm{q},\bm{k}) = \text{softmax}\left(\bm{q}\bm{k}^{\top}/\sqrt{d}\right), \tag{8}$$
where $\bm{q}=\Phi(\bm{z}_t)\in\mathbb{R}^{hw\times d}$ and $\bm{k}=\Psi(\bm{e})\in\mathbb{R}^{N\times d}$.

Here $\Phi$ and $\Psi$ indicate linear layers of the U-Net that denoises in the latent space. $N$ and $d$ represent the length of text tokens and the feature dimensionality in the layer, respectively. $hw$ is the spatial size of the feature. $\bm{m}\in\mathbb{R}^{hw\times N}$ denotes the cross-attention map of a single head. As seen, $\bm{m}$ captures dense correlations between pixels and words, from which we are able to extract the mask $\bm{m}_{\texttt{CLS}}\in\mathbb{R}^{hw}$ associated with the class token $[\texttt{CLS}]$. In practice, most methods consolidate cross-attention matrices across blocks, timesteps, and attention heads into a single attention map [55, 140, 141, 142, 144] to obtain higher-quality attention maps. Nevertheless, cross-attention maps often lack clear object boundaries and may exhibit internal holes. Thus, they are typically refined by incorporating self-attention maps [55, 141] to yield the final segmentation mask as $\hat{\bm{m}}_{\texttt{CLS}}=\bm{a}_{\texttt{SA}}\bm{m}_{\texttt{CLS}}$, where $\bm{a}_{\texttt{SA}}\in\mathbb{R}^{hw\times hw}$ is a self-attention matrix.
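The pipeline above (Eq. 8, aggregation, and self-attention refinement) can be sketched as follows. This is an illustrative sketch only: the helper names `cross_attention_map` and `class_mask` are hypothetical, the maps are assumed to have already been resized to a common resolution $hw$, and the min-max normalization with a 0.5 threshold is an assumption made here to obtain a binary mask, not a rule taken from the cited methods.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_map(q, k):
    """Eq. 8. q: (hw, d) pixel queries, k: (N, d) text keys -> m: (hw, N)."""
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d), axis=-1)

def class_mask(maps, token_index, self_attn=None, threshold=0.5):
    """maps: list of (hw, N) maps collected over heads, blocks, and timesteps."""
    m = np.mean(maps, axis=0)            # consolidate into a single (hw, N) map
    m_cls = m[:, token_index]            # column of the class token: (hw,)
    if self_attn is not None:
        m_cls = self_attn @ m_cls        # refinement: a_SA (hw, hw) times m_CLS
    lo, hi = m_cls.min(), m_cls.max()
    m_cls = (m_cls - lo) / (hi - lo + 1e-8)   # assumed min-max normalization
    return m_cls > threshold             # assumed fixed threshold
```

Multiplying by the self-attention matrix propagates each pixel's score to the pixels it attends to, which is what fills holes and sharpens boundaries in the refined mask.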

3.3 Segmentation Emerges from DINO

DINO [56] and DINOv2 [57] demonstrate a surprising phenomenon: segmentation knowledge emerges in self-supervised vision transformers, but does not appear explicitly in either supervised ViTs or CNNs. Caron et al. show in DINO [56] that sensible object segmentation can be obtained from the self-attention of the class token [CLS] in the last attention layer. More formally, given an input sequence of $M$ ($=hw$) patches, the affinity vector $\bm{\alpha}_{\texttt{CLS}}$ can be computed as the pairwise similarities between the class token [CLS] and the patch tokens [I] in an attention head of the last layer:

$$\bm{\alpha}_{\texttt{CLS}} = \bm{q}_{\texttt{CLS}}\cdot\bm{k}_{\texttt{I}}^{\top} \in \mathbb{R}^{1\times M}, \tag{9}$$

where $\bm{q}$ and $\bm{k}$ denote the query and key features of the corresponding tokens, respectively. The final attention map is the average of $\bm{\alpha}_{\texttt{CLS}}$ over all attention heads, and can be directly binarized to yield segmentation masks.
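A minimal sketch of Eq. 9 followed by head averaging and binarization is given below. The function name `cls_attention_mask` is hypothetical, and thresholding at the mean of the averaged map is an assumption made here for illustration; the text only states that the map can be binarized and does not fix a rule.

```python
import numpy as np

def cls_attention_mask(q_cls, k_patches, h, w):
    """q_cls: (H, d) [CLS] query per head; k_patches: (H, M, d) patch keys, M = h * w."""
    # Eq. 9 for every head at once: alpha[head, m] = <q_cls[head], k_patches[head, m]>
    alpha = np.einsum("hd,hmd->hm", q_cls, k_patches)   # (H, M)
    attn = alpha.mean(axis=0)                            # average over heads: (M,)
    mask = attn > attn.mean()                            # assumed binarization rule
    return mask.reshape(h, w)
```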

Beyond this, some other works [148, 91, 149, 150] localize objects based on similarities between patch tokens:

$$\bm{A}_{\texttt{I}} = \bm{k}_{\texttt{I}}\cdot\bm{k}_{\texttt{I}}^{\top} \in \mathbb{R}^{M\times M}. \tag{10}$$

Here each element in $\bm{A}_{\texttt{I}}$ measures the similarity between a pair of tokens. The key features are typically chosen in the computation since they show better localization properties than others (i.e., query or value features) [151]. Based on the derived affinity matrix $\bm{A}_{\texttt{I}}$, LOST [148] directly mines potential objects based on an inverse selection strategy; DeepSpectral [91] and COMUS [150] group pixels by partitioning the affinity matrix based on spectral theory; MaskDistill [149] selects discriminative tokens based on $\bm{\alpha}_{\texttt{CLS}}$, and diffuses information of discriminative tokens based on $\bm{A}_{\texttt{I}}$ to estimate initial segmentation results.
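To make the spectral route concrete, the sketch below builds a patch affinity and splits the patches into two groups using the second-smallest eigenvector of the graph Laplacian. It is a simplified illustration rather than the procedure of any cited method: the names `patch_affinity` and `spectral_bipartition` are hypothetical, the affinity is a cosine-normalized variant of Eq. 10, negative affinities are clipped to zero, and an unnormalized Laplacian with a sign split is assumed.

```python
import numpy as np

def patch_affinity(k):
    """k: (M, d) key features -> (M, M) affinity (cosine-normalized variant of Eq. 10)."""
    k = k / (np.linalg.norm(k, axis=1, keepdims=True) + 1e-8)
    return k @ k.T

def spectral_bipartition(affinity):
    """Split M patches into two groups; returns a boolean vector of length M."""
    w = np.clip(affinity, 0.0, None)     # keep non-negative edge weights (assumption)
    deg = w.sum(axis=1)
    lap = np.diag(deg) - w               # unnormalized graph Laplacian
    _, eigvecs = np.linalg.eigh(lap)     # eigenvalues are returned in ascending order
    fiedler = eigvecs[:, 1]              # eigenvector of the second-smallest eigenvalue
    return fiedler > 0                   # which side counts as foreground is arbitrary
```

Because the sign of an eigenvector is arbitrary, a separate cue (for example, the [CLS] attention of Eq. 9) would be needed to decide which of the two groups is the object.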

4 Foundation Model based GIS

This section presents a comprehensive review of recent advances in FM-based GIS, including semantic (§4.1), instance (§4.2) and panoptic segmentation (§4.3). Our discussions are approached from a technical perspective to elucidate the fundamental concepts and highlight the roles of FMs in GIS.

4.1 Semantic Segmentation

4.1.1 CLIP-based Solution

How to transfer pre-trained knowledge in CLIP to segmentation? This question has driven a wide spectrum of studies to solve image segmentation based on CLIP. However, the task is challenging due to the inherent granularity gap between the image-level training task in CLIP and pixel-level prediction task in image segmentation. Popular solutions are:

• Training-free Semantic Segmentation. As discussed in §3.1, it is feasible to derive segmentation masks from CLIP with a slight modification of the self-attention module. On this basis, many approaches [52, 53, 137, 138, 139] achieve semantic segmentation by leveraging the CLIP text encoder as the classifier to determine the category of each mask. The whole process involves no extra training or fine-tuning.

• CLIP Fine-tuning. Following the popular pre-training-then-fine-tuning paradigm, a large number of methods fine-tune CLIP using segmentation data. They can be categorized as either full fine-tuning or parameter-efficient tuning approaches. Full fine-tuning methods tune the entire visual or textual encoder of CLIP. DenseCLIP [88], for instance, pioneers this approach by refining CLIP’s visual encoder through a pixel-text matching task. PPL [152] augments DenseCLIP with a probabilistic framework to learn more accurate textual descriptions based on visual cues. Though showing promising results, these methods tend to break the visual-language association within CLIP and lead to severe losses of open-vocabulary capacity. To alleviate this, CATSeg [153] introduces a cost aggregation-based framework to maintain the zero-shot capability of CLIP even after full fine-tuning. OTSeg [154] tackles it by leveraging an ensemble of multiple text prompts, and introduces a multi-prompt Sinkhorn attention to improve multimodal alignment. However, these methods typically necessitate a substantial volume of densely annotated training images. In contrast, ZegCLIP [155], LDVC [156], and ZegOT [157] employ parameter-efficient prompt tuning techniques to transfer CLIP. To prevent overfitting to seen categories, they all learn image-specific textual embeddings to achieve more accurate pixel-text alignment. SemiVL [158] adopts a partial tuning strategy that only tunes the parameters of self-attention layers. SAN [159] adapts the CLIP image encoder to segmentation via a lightweight adapter, and decouples the mask proposal and classification stages by predicting attention biases applied to deeper layers of CLIP for recognition.

• CLIP as Zero-Shot Classifier. Apart from model fine-tuning, many efforts directly use the pre-trained CLIP as a classifier, thereby preserving CLIP’s zero-shot transferability. These methods fall into two major types: mask classification and pixel classification.

Mask classification methods [160, 161, 162, 163, 164, 165, 166, 167, 168] in general follow a two-stage paradigm, wherein class-agnostic mask proposals are firstly extracted and then the pre-trained CLIP is used for classifying the proposals. Pioneering studies [160, 161] require a standalone, CLIP-unaware model for proposal generation, while recent approaches [162, 163, 165, 164, 166] tend to integrate mask generation and classification within a unified framework. All these methods maintain CLIP frozen during training, but the vanilla CLIP is insensitive to different mask proposals, constraining classification performance. OVSeg [167] and MAFT [168] tackle this issue by tuning CLIP during training to make it more mask-aware.

Pixel classification methods [87, 169, 170, 136, 171, 172] employ CLIP to recognize pixels. LSeg [87] achieves this by learning an independent image encoder to align with the original textual encoder in CLIP. Fusioner [169] introduces a cross-modality fusion module to capture the interactions between visual and textual features from the frozen CLIP, and decodes the fused features into segmentation masks. PACL [170] defines a new compatibility function for the contrastive loss to align patch tokens of the vision encoder with the [CLS] token of the text encoder; such patch-level alignment benefits zero-shot transfer to semantic segmentation. CLIPpy [136] endows CLIP with perceptual grouping through a series of modifications to the aggregation method and pre-training strategies. Due to the absence of fine-grained supervision, such CLIP-based segmentors cannot delineate the fine shape of targets. SAZS [173] alleviates this by developing a boundary-aware constraint.

• Semantic Segmentation Emerges from Text Supervision. Inspired by CLIP, a stream of research attempts to learn transferable semantic segmentation models purely from text supervision. GroupViT [174] and SegCLIP [175] augment vanilla ViTs with grouping modules to progressively group image pixels into segments. To address their granularity inconsistency issue, SGP [176] further mines non-learnable prototypical knowledge [146] as explicit supervision for group tokens to improve clustering results. Unlike these works, which require customized image encoders, [177] avoids modifying CLIP’s architecture and instead improves the optimization by sparsely contrasting the image-text features with the maximum responses. TagAlign [178] also focuses on the optimization part, and introduces fine-grained attributes as supervision signals to enable dense image-text alignment.

• Knowledge Distillation (KD). KD [179] is a simple but efficient approach to transfer the capability of a foundation model, and has achieved many successes in NLP and CV. In the field of semantic segmentation, ZeroSeg [180] and CLIP-ZSS [181] distill the semantic knowledge from CLIP’s visual encoder into a segmentation model. Moreover, many methods rely on self-distillation, either aligning localized dense features with the visual features of the corresponding image patches [182], or learning global semantics based on local information [183]. In addition, CLIP-DINOiser [184] treats DINO as a teacher to guide CLIP to learn DINO-like features that are friendly to segmentation.
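Two of the routes in the list above use CLIP text embeddings as a classifier: per pixel (or patch) in the training-free setting, and per mask proposal in the two-stage mask classification setting. The sketch below illustrates both under simplifying assumptions: dense visual features and class text embeddings are taken as given arrays, the function names `classify_patches` and `classify_masks` are hypothetical, and mask-average pooling stands in for whatever per-mask embedding a particular method actually computes.

```python
import numpy as np

def _l2_normalize(x):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

def classify_patches(patch_feats, text_embeds):
    """patch_feats: (h, w, d) dense features; text_embeds: (C, d). Returns (h, w) labels."""
    logits = _l2_normalize(patch_feats) @ _l2_normalize(text_embeds).T   # cosine, (h, w, C)
    return logits.argmax(axis=-1)

def classify_masks(feat_map, masks, text_embeds):
    """feat_map: (h, w, d); masks: (K, h, w) binary proposals. Returns (K,) labels."""
    m = masks.astype(float)
    area = m.sum(axis=(1, 2))[:, None] + 1e-8
    # Mask-average pooling (an assumption): one embedding per class-agnostic proposal.
    pooled = np.einsum("khw,hwd->kd", m, feat_map) / area
    logits = _l2_normalize(pooled) @ _l2_normalize(text_embeds).T        # (K, C)
    return logits.argmax(axis=-1)
```

In both cases the label set is defined only by the rows of `text_embeds`, so changing the vocabulary requires new text embeddings but no retraining.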

4.1.2 DM-based Solution

Beyond the discriminative model CLIP, there has been a growing interest in extending the horizon of generative models like DMs from generation tasks to semantic segmentation. From the technical perspective, current research can be grouped into the following categories.

• Training-free Semantic Segmentation. Based on the techniques in §3.2, [55, 141, 142] generate a mask $\bm{m}_{\texttt{CLS}}$ for each candidate class, and assign a category to each pixel by identifying the class with the highest confidence value. FreeSeg-Diff [89] follows a two-stage paradigm, that is, it clusters attention maps into class-agnostic masks and then classifies each mask with CLIP. These methods are limited by text prompt tokens, requiring an association between each semantic class and a prompt word, which is not always valid. To address this, OVAM [143] introduces an extra attribution prompt to enable the generation of semantic segmentation masks described by an open vocabulary, irrespective of the words in the text prompts used for image generation. Furthermore, OVDiff [145] takes a prototype learning perspective [146, 147] to build a set of categorical prototypes using T2I-DMs, which serve as nearest-neighbor classifiers for segmentation. DiffSeg [185] introduces an iterative merging process to merge self-attention maps in SD into valid segmentation masks. Unlike the aforementioned methods, FreeDA [54] employs SD to build a large pool of visual prototypes, and the most similar prototype is retrieved for each pixel to yield the segmentation prediction.

• Diffusion Features for Semantic Segmentation. Beyond attention maps, harnessing DMs’ latent representations for semantic segmentation is gaining popularity. Works like [63, 186] extract internal embeddings from text-free DMs for segmentation, but they are limited to closed-vocabulary settings. In contrast, a majority of methods [187, 188, 115] employ T2I-DMs (mostly SD) to mine semantic representations. LD-ZNet [115] shows that 1) the latent space of LDMs is a better input representation than other forms like RGB images for semantic segmentation, and 2) the middle layers (i.e., {6,7,8,9,10}) of the denoising UNet contain more semantic information than either the early or later blocks of the encoder (consistent with the observation in [189]). Beyond this, for T2I-DMs, the text prompt plays a crucial role in feature extraction as it serves as guidance for semantic synthesis. VPD [187] adopts a straightforward method that uses the class names in the dataset to form the text context of SD, in which the class embedding is extracted from the text encoder of CLIP (with the prompt “a photo of [CLS]”). TADP [188] and Vermouth [190] find that automatically generated captions serve as image-aligned text prompts that help extract more semantically meaningful visual features. In contrast, MetaPrompt [191] integrates SD with a set of learnable embeddings (called meta prompts) to activate task-relevant features within a recurrent feature refinement process. Furthermore, latent features show exceptional generalization performance to unseen domains with proper prompts.

• Semantic Segmentation as Denoising Diffusion. Beyond these mainstream directions, some works [192, 193, 194, 65] reformulate semantic segmentation as a denoising diffusion process. They learn an iterative denoising process to predict the ground-truth map $\bm{z}_0$ from random noise $\bm{z}_t\sim\mathcal{N}(0,1)$, conditioned on the corresponding visual features derived from an image encoder. Based on this insight, SegRefiner [195] considers a discrete diffusion formulation to refine coarse masks derived from existing segmentation models. Moreover, Peekaboo [90] is an interesting approach that treats segmentation as a foreground alpha mask optimization problem, which is optimized via SD at inference time. It alpha-blends an input image with a random background to generate a composite image, and then uses inference-time optimization to iteratively update the alpha mask until it converges to the optimal segmentation with respect to the image and text prompts.

• T2I-DMs as Semantic Segmentation Data Synthesizer. Collecting and annotating images with pixel-wise labels is time-consuming and laborious, and thus has always been a challenge for semantic segmentation. With recent advancements in AIGC, many studies [141, 196, 99, 98] explore the potential of T2I-DMs to build large-scale segmentation datasets (including synthetic images and associated mask annotations), which can serve as a more cost-effective data source to train any existing semantic segmentation model. The idea has also been adopted in specialist domains like medical image segmentation [197]. Rather than directly generating synthetic masks, some works [198, 199, 200] employ T2I-DMs for data augmentation based on a few labeled images.

4.1.3 DINO-based Solution

• Unsupervised Segmentation via Direct Grouping. Given the emergence of segmentation properties in DINO, many methods directly group DINO features into distinct regions via, e.g., k-means [151] or graph partition [148, 201, 202] based on spatially local affinities in Eq. 10. While being training-free, they are limited to discovering salient objects, and fail to generate masks for multiple semantic regions, which is critical for semantic segmentation.

• Unsupervised Semantic Segmentation via Self-training. Follow-up works investigate self-training approaches to address the aforementioned limitation. They tend to train segmentation models on pseudo labels automatically discovered from DINO features. Pseudo labels are in general obtained in a bottom-up manner, but the strategies differ across methods. DeepSpectral [91] performs spectral clustering over dense DINO features to over-cluster each image into segments, and afterwards clusters the DINO representations of such segments across images to determine pseudo segmentation labels. Those segments represent object parts that could be combined with over-clustering and community detection to enhance the quality of pseudo masks [203]. COMUS [150] combines unsupervised salient masks with DINO feature clustering to yield initial pseudo masks, which are exploited to train a semantic segmentation network to self-bootstrap the system on images with multiple objects. Notably, STEGO [92] finds that DINO’s features have correlation patterns that are largely consistent with true semantic labels, and accordingly presents a novel contrastive loss to distill unsupervised DINO features into compact semantic clusters. Furthermore, DepthG [204] incorporates spatial information in the form of depth maps into the STEGO training process; HP [205] proposes more effective hidden positive samples to enhance contrastive learning; EAGLE [206] extracts object-level semantic and structural cues from DINO features to guide the model to learn object-aware representations.
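The direct-grouping route above, and the pseudo-label discovery step of the self-training route, both reduce to clustering dense features. A minimal k-means sketch (plain Lloyd iterations in NumPy) is shown below; the function name `kmeans_group`, the random initialization, and the fixed iteration count are assumptions for illustration, and real pipelines typically rely on an optimized library implementation and on features from an actual DINO backbone.

```python
import numpy as np

def kmeans_group(feats, k, n_iters=20, seed=0):
    """feats: (M, d) patch features. Returns (M,) cluster ids in [0, k)."""
    rng = np.random.default_rng(seed)
    # Initialize centers from k distinct patches (assumed initialization scheme).
    centers = feats[rng.choice(len(feats), size=k, replace=False)].copy()
    for _ in range(n_iters):
        # Squared Euclidean distance of every patch to every center: (M, k)
        dists = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = feats[labels == c]
            if len(members) > 0:          # keep the old center if a cluster empties
                centers[c] = members.mean(axis=0)
    return labels
```

Reshaping the returned ids to $h\times w$ gives a region map; the cluster ids carry no class names, which is why the self-training methods add a further step to turn such regions into semantic pseudo labels.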

4.1.4 SAM-based Solution

• SAM for Weakly Supervised Semantic Segmentation. While SAM is semantically unaware, it attains a generalized and remarkable segmentation capability, which is widely leveraged to improve segmentation quality in weakly supervised settings. [207] uses SAM for post-processing of segmentation masks, while [208] leverages SAM for zero-shot inference. S2C [93] incorporates SAM at both the feature and logit levels. It performs prototype contrastive learning based on SAM’s segments, and extracts salient points from CAMs for prompting SAM.

4.1.5 Composition of FMs for Semantic Segmentation

FMs are endowed with distinct capabilities stemming from their pre-training objectives. For example, CLIP excels in semantic understanding, while SAM and DINO specialize in spatial understanding. As such, many approaches amalgamate an assembly of these FMs into a cohesive system that absorbs their expertise. Some of them are built under zero guidance [209, 89, 210]. They leverage DINO or SD to identify class-agnostic segments, map them to CLIP’s latent space, and translate the embedding of each segment into a word (i.e., class name) via image captioning models like BLIP. Another example is SAM-CLIP [94] that combines SAM and CLIP into a single model via multi-task distillation. Recently, RIM [95] builds a training-free framework under the collaboration of three VFMs. Concretely, it first constructs category-specific reference features based on SD and SAM, and then matches them with region features derived from SAM and DINO via relation-aware ranking.

4.2 Instance Segmentation

4.2.1 CLIP-based Solution

• CLIP as Zero-shot Instance Classifier. CLIP plays an important role in achieving open-vocabulary instance segmentation. [104, 96, 211] leverage the frozen CLIP text encoder as a classifier of instance mask proposals. OPSNet [97] utilizes CLIP visual and textual embeddings to enrich instance features, which are subsequently classified by the CLIP text encoder. [212] introduces a generative model to synthesize unseen features from CLIP text embeddings, thereby bridging the semantic and visual spaces and addressing the lack of training data for unseen categories. [213] presents a dynamic classifier to project CLIP text embeddings to image-specific visual prototypes, effectively mitigating the bias towards seen categories as well as the multi-modal domain gap.

4.2.2 DM-based Solution

• T2I-DMs as Instance Segmentation Data Synthesizer. DMs play a crucial role in instance segmentation by facilitating the generation of large-scale training datasets with accurate labels. MosaicFusion [98] introduces a training-free pipeline that simultaneously generates synthetic images via T2I-DMs and corresponding masks through aggregation over cross-attention maps. [214] adopts a cut-and-paste approach for data augmentation, where both foreground objects and background images are generated using DMs. DatasetDM [99] presents a semi-supervised approach that first learns a perception decoder to annotate images based on a small set of labeled data, and then generates images and annotations for various dense prediction tasks.

4.2.3 DINO-based Solution

• Unsupervised Instance Segmentation. Some methods [100, 215, 149, 101] attempt to amplify the innate localization ability of DINO to train instance-level segmentation models without any human labels. They typically work in a two-stage discover-and-learn process: they first discover multiple object masks from DINO features by, e.g., recursively applying normalized cuts [100], and then leverage them as pseudo labels to train instance segmentation models.

4.2.4 Composition of FMs for Instance Segmentation

X-Paste [102] revisits the traditional data boosting strategy, i.e., Copy-Paste, at scale to acquire large-scale object instances with high-quality masks for unlimited categories. It makes full use of FMs to prepare images, i.e., using SD to generate images and using CLIP to filter Web-retrieved images. Instances in the images are extracted with off-the-shelf segmentors, which are composed with background images to create training samples. DiverGen [216] improves X-Paste by focusing more on enhancing category diversity. It leverages SAM to more accurately extract instance masks. Orthogonal to these studies, Zip [217] combines CLIP and SAM to achieve training-free instance segmentation. It observes that clustering on features of CLIP’s middle layer is keenly attuned to object boundaries. Accordingly, it first clusters CLIP features to extract segments, then filters them according to boundary and semantic cues, and finally prompts SAM to produce instance masks.
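The compositing step at the core of Copy-Paste style pipelines can be sketched as follows. This is a bare illustration under stated assumptions: the function name `paste_instance` is hypothetical, the instance crop is assumed to fit inside the background at the chosen offset, and the scaling, blending, and occlusion handling that practical pipelines apply are omitted.

```python
import numpy as np

def paste_instance(background, instance, mask, top, left):
    """background: (H, W, 3); instance: (h, w, 3); mask: (h, w) bool.

    Returns the composite image and the pasted instance's full-size mask,
    which serves as the annotation for the new training sample.
    """
    out = background.copy()
    h, w = mask.shape
    region = out[top:top + h, left:left + w]   # a view into `out`
    region[mask] = instance[mask]              # copy only the foreground pixels
    full_mask = np.zeros(background.shape[:2], dtype=bool)
    full_mask[top:top + h, left:left + w] = mask
    return out, full_mask
```

Because the mask travels with the pasted pixels, each composite comes with an instance annotation at no extra labeling cost, which is what makes the strategy attractive at scale.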

Moreover, SAM can be directly turned into an instance segmentation model by feeding the bounding boxes of instances as prompts [218, 103]; the boxes can be obtained from object detectors, e.g., Faster R-CNN [30] or Grounding DINO [219].

4.3 Panoptic Segmentation

4.3.1 CLIP-based Solution

• CLIP as Zero-Shot Mask Classifier. Most recent panoptic segmentation approaches [104, 96, 220, 97, 212, 105, 221, 211] follow the query-based mask classification framework introduced in MaskFormer [22] / Mask2Former [23]. They generate class-agnostic mask proposals first and then utilize CLIP to classify the proposals, thereby endowing MaskFormer and Mask2Former with open-vocabulary segmentation capabilities. MaskCLIP [104] introduces a set of mask class tokens to extract mask representations more efficiently. MasQCLIP [96] augments MaskCLIP by applying additional projections to mask class tokens to obtain optimal attention weights. OPSNet [97] learns more generalizable mask representations based on the CLIP visual encoder, which are subsequently used to enhance query embeddings. Unpair-Seg [105] presents a weakly supervised framework that allows the model to benefit from cheaper image-text pairs. It learns a feature adapter to align mask representations with text embeddings, which are extracted from CLIP’s visual and language encoders, respectively. Despite these advances, such methods still require training a separate model for each task to achieve the best performance. FreeSeg [221] and DaTaSeg [222] design all-in-one models with the same architecture and inference parameters to establish remarkable performance in open-vocabulary semantic, instance, and panoptic segmentation problems. OMG-Seg [223] introduces a unified query representation to unify different task outputs, and is able to handle ten segmentation tasks across different datasets.

4.3.2 DM-based Solution

• Diffusion Features for Panoptic Segmentation. ODISE [106] explores internal representations within T2I DMs to accomplish open-vocabulary panoptic segmentation. It follows the architectural design of Mask2Former but leverages visual features derived from a pre-trained diffusion U-Net to predict binary mask proposals and associated mask representations. These proposals are finally recognized using CLIP as the zero-shot classifier.

• Panoptic Segmentation as Denoising Diffusion. Pix2Seq-𝒟 [107] formulates panoptic segmentation as a discrete data generation problem conditioned on pixels, using a Bit Diffusion generative model [224]. DFormer [67] introduces a diffusion-based mask classification scheme that learns to generate mask features and attention masks from noisy mask inputs. Further, LDMSeg [225] solves generative segmentation based on SD by first compressing segmentation labels to compact latent codes and then denoising the latents following the diffusion schedule.
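To make the "discrete data generation" view more concrete, the sketch below shows the analog-bits idea commonly associated with Bit Diffusion: a discrete label is written as binary digits, mapped to real values in {-1, +1} so that a continuous diffusion model can operate on it, and recovered by thresholding. The function names, the 8-bit width, and the `scale` parameter are illustrative assumptions, not the configuration of Pix2Seq-𝒟.

```python
def int_to_analog_bits(label, num_bits=8, scale=1.0):
    """Encode a non-negative integer label as real-valued 'analog bits' in {-scale, +scale}."""
    bits = [(label >> k) & 1 for k in reversed(range(num_bits))]  # most significant bit first
    return [(2 * b - 1) * scale for b in bits]

def analog_bits_to_int(values):
    """Decode (possibly noisy) analog bits by thresholding at zero."""
    label = 0
    for v in values:
        label = (label << 1) | (1 if v > 0 else 0)
    return label

clean = int_to_analog_bits(133)
# A fixed, bounded perturbation standing in for residual denoising error.
noisy = [v + 0.4 * ((-1) ** i) for i, v in enumerate(clean)]

print(int_to_analog_bits(5, num_bits=4))  # prints [-1.0, 1.0, -1.0, 1.0]
print(analog_bits_to_int(noisy))          # prints 133
```

Thresholding makes the decoding tolerant to moderate residual noise, which is one reason a continuous denoising process can be applied to categorical panoptic labels.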

4.3.3 DINO-based Solution

• Unsupervised Panoptic Segmentation. Building on the successes of STEGO [92] in semantic segmentation and CutLER [100] in instance segmentation, U2Seg [108] automatically identifies “things” and “stuff” within images to create pseudo labels that are subsequently used to train a panoptic segmentation model, such as Panoptic Cascade Mask R-CNN [226]. Moreover, [227] follows the bottom-up architecture of [228] to separately predict semantic and boundary maps, which are later fused to yield a panoptic segmentation mask.

4.3.4 SAM-based Solution

• Towards Semantic-Aware SAM. While SAM demonstrates strong zero-shot performance, it produces segmentation masks without semantic meaning. This has driven many research efforts, e.g., Semantic-SAM [109] and SEEM [50], to enhance the semantic awareness of SAM. Beyond the visual prompts that SAM uses for interactive segmentation, these models learn generic object queries to achieve generic segmentation at both the semantic and instance levels. In addition, the models are generally trained on a combination of multiple datasets with semantic annotations, such as COCO [229], ADE20K [230], and PASCAL VOC [231].

5 Foundation Model based PIS

5.1 Interactive Segmentation

5.1.1 SAM-based Solution

Since SAM was designed as a universal interactive segmentation system, it is naturally the first choice for researchers building advanced interactive segmentation frameworks.

• Multi-Granularity Interactive Segmentation. Most existing interactive segmentation methods determine a single segmentation mask based on the user’s input, which ignores spatial ambiguity. In contrast, SAM introduces a multi-granularity interactive segmentation pipeline: for each user interaction, the desired region may be a whole object or one of its nearby parts, so masks at several granularities are considered. To improve the segmentation quality, HQ-SAM [218] proposes a lightweight high-quality output token to replace the original SAM’s output token. After training on 44K highly accurate masks, HQ-SAM significantly boosts the mask prediction quality of SAM. Since SAM is class-agnostic, a line of works [232, 233] tunes SAM by aligning the query-segmented regions with corresponding textual representations from CLIP, while [109] designs a SAM-like framework that supports multi-granularity segmentation using the captioned SAM data. Although these multi-granularity interactive segmentation approaches alleviate spatial ambiguity, they result in excessive output redundancy and limited scalability. To solve this, GraCo [110] explores granularity-controllable interactive segmentation, which allows precise control of prediction granularity to resolve ambiguity.

• SAM for Medical Image Interactive Segmentation. Interactive segmentation is crucial in the medical field [234], for example to achieve highly precise segmentation of lesion regions or to reduce the manual effort of annotating medical data. Compared with the segmentation of natural images, medical image segmentation poses greater challenges due to intrinsic issues such as structural complexity, low contrast, and inter-observer variability. Recently, several studies [235, 236, 237] explore the zero-shot interactive segmentation capabilities of SAM in medical imaging. They cover a diverse range of anatomical and pathological targets across different medical imaging modalities, including CT [238], MRI [239], pathological images [240], and endoscopic images [94]. While these studies indicate that SAM performs comparably to state-of-the-art methods in identifying well-defined objects in certain modalities, it struggles or fails completely in more challenging situations, such as when targets have weak boundaries, low contrast, small size, or irregular shapes. This suggests that directly applying SAM, without fine-tuning or re-training, to previously unseen and challenging medical image segmentation tasks may result in suboptimal performance.

To enhance SAM’s performance on medical images, some approaches propose to fine-tune SAM on medical images. MedSAM [111] curates a large-scale dataset containing over one million medical image-mask pairs of 11 modalities, which are used for directly fine-tuning SAM. In contrast, other methods explore parameter-efficient fine-tuning strategies. SAMed [241] applies LoRA modules to the pre-trained SAM image encoder. SAMFE [242] finds that applying LoRA to the mask decoder yields superior performance in cases with few exemplars. SAM-Med2D [236] enhances the image encoder by integrating learnable adapter layers. MedSA [243] adapts SAM to volumetric medical images by introducing Space-Depth Transpose, in which a bifurcated attention mechanism captures spatial correlations in one branch and depth correlations in the other. 3DSAM-Adapter [244] introduces a holistic 2D to 3D adaptation method via carefully designed modification of the entire SAM architecture.
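For readers unfamiliar with LoRA, the low-rank update that such parameter-efficient methods attach to a frozen weight matrix can be written as below. The notation follows the generic LoRA formulation and is an assumption for illustration; it is not taken from SAMed or SAMFE, which may differ in where the modules are inserted and how they are scaled.

```latex
% W_0: frozen pre-trained weight; A, B: trainable low-rank factors; r: rank; \alpha: scaling constant
h = W_0 x + \Delta W x = W_0 x + \frac{\alpha}{r}\, B A x,
\qquad W_0 \in \mathbb{R}^{d \times k},\;
B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k},\;
r \ll \min(d, k).
```

Only $A$ and $B$ are updated, so the number of trainable parameters per adapted layer is $r(d + k)$ instead of $dk$, which is what makes this kind of tuning attractive when annotated medical data are scarce.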

5.2 Referring Segmentation

5.2.1 CLIP-based Solution

Referring segmentation aims to segment the referent described by a natural-language expression. The multi-modal knowledge in CLIP is broadly explored to tackle this multi-modal task.

• Training-free Referring Segmentation. ZS-RS [112] is a training-free referring image segmentation method that leverages cross-modal knowledge in CLIP. It begins by generating instance-level masks using an off-the-shelf mask generator, then extracts local-global features of masks and texts from CLIP, and finally identifies the desired mask based on cross-modal feature similarity. TAS [245] employs a pipeline similar to that of ZS-RS, but computes more fine-grained region-text matching scores to pick the correct mask.

• Multi-modal Knowledge Transfer. Many efforts have been devoted to transferring multi-modal knowledge within CLIP from the image level to the pixel level. A common idea [246, 247, 113, 248, 249, 250, 251, 252, 253] is to introduce a task decoder to fuse CLIP’s image and textual features, and train it with text-to-pixel contrastive learning [246]. In addition to the task decoder, ETRIS [247] and RISCLIP [113] integrate a Bridger module to encourage visual-language interactions at each encoder stage. EAVL [249] learns a set of convolution kernels based on the input image and language, and performs convolutions over the output of the task decoder to predict segmentation masks. UniRES [250] explores multi-granularity referring segmentation to unify object-level and part-level grounding tasks. TP-SIS [252] transfers multi-modal knowledge within CLIP for referring surgical instrument segmentation.

• Weakly Supervised Referring Segmentation. Moving towards real-world conditions, some works study weakly supervised referring segmentation to alleviate the cost of pixel-level labeling. TSEG [254] computes patch-text similarities with CLIP and guides the classification objective during training with a multi-label patch assignment mechanism. TRIS [255] proposes a two-stage pipeline that extracts coarse pixel-level maps from image-text attention maps, which are subsequently used to train a mask decoder.

5.2.2 DM-based Solution

• Training-free Referring Segmentation. Some works [114, 90] find that SD is an implicit referring segmentor with the help of the generative process. Peekaboo [90] formulates segmentation as a foreground alpha mask optimization problem with SD, where a fine-grained segmentation map should yield a high-fidelity image generation process. In this way, minimizing the discrepancy between the mask-involved noise and the target noise yields better text-aligned pixel representations. Ref-diff [114] first generates a set of object proposals from generative models, and then determines the desired mask based on proposal-text similarities.

• Diffusion Features for Referring Segmentation. Under textual conditioning, the modal-intertwined attention maps (cf. §3.2) can serve as an initial dense visual representation, which can then be used to yield the final segmentation mask. VPD [187] introduces a task-specific decoder to process encoded features fused from cross-attention maps and multi-level feature maps in the U-Net. Meanwhile, LD-ZNet [115] injects attention features into a mask decoder to generate better text-aligned pixel-level masks. Apart from the attention-based utilization, [256, 257] directly feed side outputs from each intermediate layer of the diffusion U-Net, as well as the textual embedding, to a mask decoder to yield final predictions.

Figure 2: MLLMs driven solutions lead to more powerful pixel reasoning and understanding capabilities, e.g., multi-target reasoning segmentation, instance segmentation with text descriptions, referring segmentation and conversation. (Figure adapted courtesy of [60])

5.2.3 LLMs/MLLMs-based Solution

LLMs/MLLMs have showcased remarkable reasoning ability and can answer complex questions, thereby bringing new possibilities for pixel reasoning and understanding (Fig. 2). In particular, LISA [59] studies a new segmentation task, called reasoning segmentation. Different from traditional referring segmentation, the segmentors in this setting are developed to segment the object based on implicit query text involving complex reasoning. Notably, the query text is not limited to a straightforward reference (e.g., “the front-runner”), but can be a more complicated description involving complex reasoning or world knowledge (e.g., “who will win the race?”). LISA employs LLaVA [258] to output a text response based on the input image, text query, and a [seg] token. The embedding for the customized [seg] token is decoded into the segmentation mask via the SAM decoder. Afterwards, LISA++ [259] extends LISA to differentiate individuals within the same category and enables more natural conversation in multi-turn dialogue. Based on these works, many efforts have been devoted to improving reasoning capability and segmentation accuracy. LLM-Seg [260] proposes using SAM to generate a group of mask proposals, from which the best-suited one is selected as the final segmentation prediction. Next-Chat [261] adds a [trigger] token that depicts the coordinates of the object box as a supplementary input for the MLLM to help generate better masks. Similarly, GSVA [262] introduces a rejection token [rej] to handle the empty-target case, where the object referred to in the instructions does not exist in the image and would otherwise lead to a false-positive prediction. Beyond incorporating functional tokens, [263, 264] propose using diverse textual descriptions, such as object attributes and parts, to enhance the object-text connection for accurate reasoning results. Regarding the cost of reasoning, PixelLLM [60] introduces a lightweight decoder to reduce the computational cost of the reasoning process. Osprey [265] extends MLLMs by incorporating fine-grained mask regions into language instructions, and delivers remarkable pixel-wise visual understanding capabilities.

5.2.4 Composition of FMs for Referring Segmentation

To enhance the textual representation for pixel-level understanding, some methods use LLMs as the text encoder to obtain improved textual embeddings for modal fusion. In particular, BERT [266], due to its simplicity and practicality, is by far the most common choice among works [267, 268, 269, 254, 270, 271, 272, 273, 274, 275, 276, 277]. Most of them design a fusion module to bridge the features between the visual encoder and BERT. In addition, some works [261, 278, 279] treat the LLM as a unified multi-modal handler, and use Vicuna [280] to map both image and text into a unified feature space, thereafter generating the segmentation output. With the powerful dialogue capabilities of the GPT-series models [39], some works [281, 282, 283] employ ChatGPT to rewrite descriptions with richer semantics, and encourage finer-grained image-text interaction in referring segmentation model training.

Apart from textual enhancement using LLMs, SAM [49] is widely chosen to provide a rich segmentation prior for referring segmentation. [284] presents a prompt-driven framework to bridge CLIP and SAM in an end-to-end manner through prompting mechanisms. [285] focuses on building referring segmentors based on a simple yet effective bi-encoder design, i.e., adopting SAM and an LLM to encode image and text patterns, respectively, and then fusing the multi-modal features for segmentation predictions. Such a combination of SAM and LLM, without bells and whistles, could be easily extended to the MLLM case. Therefore, [116, 117] propose to incorporate CLIP with SAM to improve the multi-modal fusion. Specifically, F-LMM [116] proposes to use CLIP to encode the visual features, which are decoded by SAM into the predicted segmentation map. PPT [117] first employs attention maps of CLIP to compute the peak region as explicit point prompts, which are directly used to segment the query target.
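As a rough illustration of turning an attention map into a point prompt, the sketch below picks the highest-response cell of a coarse map and converts it to pixel coordinates. It is a simplification under stated assumptions: the function name `peak_point_prompt`, the single-peak choice, and the cell-center convention are hypothetical and do not reproduce PPT’s actual procedure.

```python
def peak_point_prompt(attn, image_size):
    """Return the (x, y) pixel location at the center of the highest-attention cell.

    attn:       2-D list of attention scores on a coarse grid (rows x cols).
    image_size: (width, height) of the original image in pixels.
    """
    rows, cols = len(attn), len(attn[0])
    r, c = max(((i, j) for i in range(rows) for j in range(cols)),
               key=lambda ij: attn[ij[0]][ij[1]])
    width, height = image_size
    x = (c + 0.5) * width / cols   # column index -> horizontal pixel coordinate
    y = (r + 0.5) * height / rows  # row index -> vertical pixel coordinate
    return (x, y)

# Toy 2x2 attention map (assumed stand-in for a text-conditioned CLIP attention map).
attn = [[0.1, 0.2],
        [0.7, 0.3]]
print(peak_point_prompt(attn, image_size=(100, 60)))  # prints (25.0, 45.0)
```

The returned coordinate would then be passed as a foreground point to a promptable segmentor such as SAM; that call is omitted here because it requires model weights.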

5.3 Few-Shot Segmentation

5.3.1 CLIP-based Solution

• CLIP Features for Few-Shot Segmentation. Using CLIP to extract effective visual correlations from the support images to guide segmentation of the query image has become a prevailing pipeline for FSS. Existing methods can be categorized into two streams according to how CLIP-oriented visual features are used. The first class [118, 286, 287, 288, 289, 290] models the feature relationship between support and query images to explicitly segment the query image. WinCLIP [118] aggregates multi-scale CLIP-based visual features of the reference and query images to obtain an enhanced support-query correlation score map for pixel-level prediction. [286, 287, 288, 289] further refine the score maps with query- and support-based self-attention maps. [290] introduces the foreground-background correlation from the support images by crafting proper textual prompts. Another line of works [119, 251, 291] segments the query image under the guidance of prototypes generated from the support images, where a metric function, e.g., cosine similarity, is used to compute the query-prototype distance. RD-FSS [119] proposes to leverage the class description from the CLIP text encoder as textual prototypes, which are then correlated with visual features for dense prediction in a cross-attention manner. Additionally, PartSeg [291] aggregates both visual and textual prototypes to improve the pixel-level representation of the query image. Here the visual prototypes are obtained by pooling the CLIP-based visual features within the reference segmentation masks. To further enhance the prototypical representation, [251] uses CLIP to generate visual prototypes from masked support images, in which only the object of interest is retained.
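The prototype-based stream can be sketched as follows: a visual prototype is obtained by pooling support features inside the support mask, and each query location is labeled by thresholding its cosine similarity to that prototype. This is a generic illustration rather than the method of any cited paper; the toy features stand in for CLIP visual features, and the 0.5 threshold and function names are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-8)

def masked_average_pool(features, mask):
    """Build a prototype by averaging support features inside the support mask."""
    inside = [f for f_row, m_row in zip(features, mask)
              for f, m in zip(f_row, m_row) if m]
    if not inside:
        raise ValueError("support mask is empty")
    dim = len(inside[0])
    return [sum(f[d] for f in inside) / len(inside) for d in range(dim)]

def segment_query(query_features, prototype, threshold=0.5):
    """Mark query locations whose similarity to the prototype exceeds the threshold."""
    return [[1 if cosine(f, prototype) > threshold else 0 for f in row]
            for row in query_features]

# Toy support image: the object occupies the top row.
support_features = [[[1.0, 0.0], [1.0, 0.0]],
                    [[0.0, 1.0], [0.0, 1.0]]]
support_mask = [[1, 1],
                [0, 0]]
# Toy query image: object-like features lie on the main diagonal.
query_features = [[[0.9, 0.1], [0.2, 0.8]],
                  [[0.1, 0.9], [0.8, 0.2]]]

prototype = masked_average_pool(support_features, support_mask)
print(segment_query(query_features, prototype))  # prints [[1, 0], [0, 1]]
```

Textual prototypes, as used in RD-FSS and PartSeg, would enter the same comparison as additional vectors alongside `prototype`; how they are fused is method-specific and not shown.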

5.3.2 DM-based Solution

• Diffusion Features for Few-Shot Segmentation. The internal representations of DMs are useful for few-shot segmentation. Specifically, [292] directly leverages the latent diffusion features at a specific time step as the representations of the support image, which are decoded along with the original image via a mask decoder. In contrast, DifFSS [120] proposes to synthesize more support-style image-mask pairs using DMs. With the mask kept fixed, the generated support images contain the same mask-covered object but with diverse backgrounds, enriching the support patterns for better query segmentation.

• Few-Shot Segmentation as Denoising Diffusion. Some studies [293, 121] tackle few-shot segmentation by solving a denoising diffusion process. They fine-tune SD to explicitly generate segmentation masks for query images, with the main difference being the condition applied during fine-tuning. MaskDiff [293] uses the query image and masked support images as the condition, while SegICL [121] merely employs the support/query mask as the condition.

5.3.3 DINO-based Solution

• DINO Features for Few-Shot Segmentation. Some efforts [294, 122, 295, 296] exploit latent representations in DINO/DINOv2 to enhance query and support features. [294] directly uses DINOv2 to encode query and support images, and shows that DINOv2 outperforms other FMs, like SAM and CLIP. Based on this, SPINO [122] employs DINOv2 for few-shot panoptic segmentation. [295, 296] further mine query-support correlations through the cross- and self-attention of token embeddings in DINO, leading to more support-aware segmentation.

5.3.4 SAM-based Solution

• Prompt Generation for SAM. Given the provided support image sets, a line of works [297, 298, 299, 123, 300] focuses on generating proper prompts for SAM to segment the desired target in the query image. Notably, a majority of them [297, 298, 299] propose to generate a group of candidate points as prompts based on the support-query image-level correspondence/similarity, where the support mask, highlighting the query object’s semantics, is then used to select the object-oriented prompts. VRP-SAM [123] learns a set of visual reference prompts based on query-support correspondence, which are fed into a frozen SAM for segmentation. APSeg [300] extends VRP-SAM by exploring multiple support embeddings to generate more meaningful prompts for SAM.
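One simple way to realize correspondence-based point selection is forward-backward matching over patch features, sketched below. The sketch is an assumption-laden illustration, not the procedure of [297, 298, 299]: patches are flattened into lists, similarity is a plain dot product, and `point_prompts` is a hypothetical helper. A support foreground patch proposes its best-matching query patch, and the proposal is kept only if matching back from the query lands inside the support mask again.

```python
def dot(u, v):
    """Dot-product similarity between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def nearest(vec, candidates):
    """Index of the candidate with the highest dot-product similarity to vec."""
    return max(range(len(candidates)), key=lambda i: dot(vec, candidates[i]))

def point_prompts(support_feats, support_mask, query_feats):
    """Select query patch indices to use as point prompts.

    support_feats / query_feats: flat lists of patch feature vectors.
    support_mask: list of 0/1 flags, one per support patch (1 = foreground).
    """
    prompts = set()
    for s_idx, is_foreground in enumerate(support_mask):
        if not is_foreground:
            continue
        q_idx = nearest(support_feats[s_idx], query_feats)  # forward match
        back = nearest(query_feats[q_idx], support_feats)   # reverse match
        if support_mask[back]:                              # keep consistent matches only
            prompts.add(q_idx)
    return sorted(prompts)

# Toy features: patches 0, 1, 3 of the support are foreground, patch 2 is background.
support_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.3, 0.4]]
support_mask = [1, 1, 0, 1]
query_feats = [[0.0, 1.0], [1.0, 0.1], [0.1, 1.0]]

print(point_prompts(support_feats, support_mask, query_feats))  # prints [1]
```

In the toy example, the ambiguous support patch 3 proposes query patch 2, but the reverse match falls on a background patch, so that proposal is discarded. The surviving indices would be converted to pixel coordinates and passed to SAM as point prompts.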

5.3.5 LLM/MLLM-based Solution

Several attempts [124, 301] adopt LLMs/MLLMs to address FSS through instruction design. LLaFS [124] maps the fused support-query pattern into the language space, and lets an LLM output the coordinate description of the desired segmentation mask. [301] uses GPT-4 as the task planner to divide FSS into a sequence of sub-tasks based on the support set, and subsequently calls vision tools such as SAM and GPT4Vision to predict segmentation masks.

Figure 3: Illustration of VPImpainting [302], which solves image segmentation as a visual “fill-in-the-blank” problem. (Figure courtesy of [302])

5.3.6 In-Context Segmentation

The rapid progress of LLMs has led to an emergent capacity to learn in-context from a handful of given examples [38, 45]. Inspired by this capability, some researchers explore a similar setting in computer vision, i.e., in-context segmentation (ICS). Since it aims to segment a query image based on supports, ICS can be regarded as a sub-task of FSS. Yet, ICS requires no parameter updating and can be directly performed on pre-trained models without task-specific fine-tuning. LLMs that exhibit in-context learning are mostly generative models trained with masked language modeling or next-token prediction. Therefore, most works address ICS by exploring similar self-supervised patterns on visual models. VPImpainting [302] is a pioneering work that solves visual in-context learning as image inpainting. The architecture is illustrated in Fig. 3. It defines the visual prompt as a single grid-like image containing one or more input-output examples and a query, and then trains an inpainting model (via MAE [303]) to predict the rest of the image such that it is consistent with the given example(s). On this basis, [304, 305, 306] propose to retrieve the most suitable examples from a large dataset as the support. Furthermore, Painter [307] and SegGPT [51] are vision generalists built on in-context learning. They unify various vision tasks into the in-context learning framework by carefully redefining the outputs of core vision tasks in the same format as images. Some other works [308, 309] focus on establishing large vision models by formatting images, in the manner of a language tokenizer, into groups of sequences that act as visual sentences, and then performing LLM-like training via next-token prediction. It is worth noting that training such visual autoregressive models requires hundreds of billions of vision samples from varied vision tasks, e.g., image segmentation and depth estimation. PromptDiffusion [310] explores in-context learning for diffusion models by fine-tuning SD to generate the query mask conditioned on the support image-mask pair and the query image. Matcher [311] utilizes DINOv2 to locate the target in query images by bidirectional matching, and leverages the coarse location information as prompts for SAM. Tyche [312] extends ICS into a probabilistic segmentation framework by explicitly modeling training and testing uncertainty, and shows promising performance in medical image segmentation.
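The grid-like visual prompt used to cast segmentation as inpainting can be sketched as below. The 2x2 layout (support image and support mask on top, query image and a blank cell below), the helper names, and the assumption that all cells share one size are illustrative choices; the inpainting model itself is not implemented.

```python
def make_grid_prompt(support_img, support_mask, query_img, blank_value=0):
    """Compose a 2x2 canvas: [support image | support mask] over [query image | blank].

    All inputs are 2-D lists of the same height and width (an assumption of this sketch).
    """
    h, w = len(query_img), len(query_img[0])
    blank = [[blank_value] * w for _ in range(h)]
    top = [a + b for a, b in zip(support_img, support_mask)]
    bottom = [a + b for a, b in zip(query_img, blank)]
    return top + bottom

def read_prediction(canvas, h, w):
    """Crop the bottom-right cell, where an inpainting model would write the query mask."""
    return [row[w:2 * w] for row in canvas[h:2 * h]]

support_img = [[1, 2], [3, 4]]
support_mask = [[1, 0], [0, 0]]
query_img = [[5, 6], [7, 8]]

canvas = make_grid_prompt(support_img, support_mask, query_img)
for row in canvas:
    print(row)
# The blank cell is still empty here; a trained inpainting model would fill it in.
print(read_prediction(canvas, 2, 2))  # prints [[0, 0], [0, 0]]
```

Because the task is specified entirely by what is placed in the top row, the same frozen model can in principle be switched between tasks by changing the example pair, without any parameter update.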

6 Open Issues and Future Directions

Based on the reviewed research, the field of image segmentation has made tremendous progress in the FM era. Nonetheless, given the high diversity and complexity of segmentation tasks, coupled with the rapid evolution of FMs, several critical directions warrant ongoing exploration.

• Explaining the Emergence of Segmentation Knowledge in FMs. Although FMs differ significantly in architecture, data, and training objectives, we observe a consistent emergence of segmentation knowledge from them, which drives the development of impressive training-free segmentation models. However, current methods do not fully explain how these FMs learn to understand pixels, especially how pixels interact with other modalities, like texts in CLIP and Text-to-Image Diffusion Models. This calls for novel explainability techniques to enhance our understanding of pixels in FMs. Such understanding is crucial for minimizing the negative societal impacts of existing FMs, and will broaden the application of FMs to diverse visual domains and tasks.

• In-Context Segmentation. Motivated by the success of in-context learning in the language domain, there has been a growing interest in exploring its potential for vision tasks, such as image segmentation. However, the variability in output representations across vision tasks (such as the differing formats required for semantic, instance, and panoptic segmentation) renders ICS a particularly challenging problem. While some progress has been made, current results still lag behind those of bespoke models, especially in difficult tasks like panoptic segmentation. Additionally, the ability to perform segmentation at arbitrary levels of granularity through in-context learning remains an unexplored area. Lastly, the scale of models employed in ICS is considerably smaller than that of NLP counterparts like GPT-3, which may be a key factor limiting the performance of ICS. To achieve a breakthrough akin to GPT-3 in the vision domain, it is essential to develop large vision models [308]. This task poses significant difficulties and will require extensive collaboration within the vision community to address issues related to data, architecture, and training techniques.

• Efficient Image Segmentation Model. While FM-driven segmentation models exhibit remarkable performance, the majority of methods introduce significant computational overheads, such as heavy image encoders for feature computation and costly fine-tuning processes. These challenges impede the broader applicability and affordability of the models in practical scenarios. Key techniques to be explored include knowledge distillation, model compression, and parameter-efficient tuning. Most existing studies focus on improving the deployment efficiency solely for SAM; yet, attention to other FMs is equally vital.

• Powerful and Scalable Data Engine. Segmentation data are catalysts for progress in image segmentation. Much of the current success in deep-learning-based image segmentation owes to datasets such as PASCAL VOC [231], COCO [229], Cityscapes [313], and ADE20K [230]. Nonetheless, scaling up image data is a long-standing challenge and is becoming increasingly critical in the FM era, which calls for a powerful and scalable segmentation data engine. Recently, SAM [49] tackles this issue with a data engine that labels images via “model-in-the-loop”, yielding SA-1B with 11M images and 1B masks. Nevertheless, the engine is limited in realistic image labeling and lacks semantic awareness. A promising direction is to incorporate generative models into the system, which would create a more powerful data engine that can scale to arbitrary levels and is more favorable to data-scarcity scenarios like medical imaging [314] and satellite imagery [315].

• Diffusion Models as the New Data Source. Text-to-image diffusion models have proven feasible for building segmentation datasets by generating pairs of synthetic images and corresponding segmentation masks. However, many challenges remain. First, existing DMs like SD have difficulties in generating complex scenes, e.g., a crowded street with hundreds of objects or closely intertwined objects. To alleviate this, layout or box conditions, instead of text alone, should be provided to guide the generation. Second, the bias in LAION-5B, on which SD was trained, might be transferred to the generated dataset. This issue can be alleviated by absorbing advances in addressing bias in generative models. Third, the domain gap between synthetic and real datasets should be continuously studied. Fourth, current approaches are limited to generating data for semantic segmentation with a limited number of semantic categories; how to generalize them to generate instance-level segmentation masks and to scale up the semantic vocabulary remains unsolved.

• Mitigating Object Hallucination in MLLMs-based Models. Although MLLMs have demonstrated significant success in pixel understanding (cf. §5.2.3), they are prone to object hallucination [316], as LLMs are. Here, object hallucination refers to a model generating unintended descriptions or captions that contain objects which are inconsistent with, or even absent from, the target image. This issue greatly undermines the reliability of these models in real-world applications. Hence, we advocate that future research in MLLMs-based segmentation rigorously assess object hallucination in the proposed models, and take this issue into account when developing segmentation models.

7 Conclusion

In this survey, we provide the first comprehensive review of recent progress in image segmentation in the foundation model era. We introduce key concepts and examine the inherent segmentation knowledge in existing FMs such as CLIP, Diffusion Models, and DINO/DINOv2. Moreover, we summarize more than 300 image segmentation models for tackling generic and promptable image segmentation tasks. Finally, we highlight existing research gaps that need to be filled and illuminate promising avenues for future research. We hope that this survey will act as a catalyst, sparking future curiosity and fostering a sustained passion for exploring the potential of FMs in image segmentation.

References

  • [1] N. Otsu, “A threshold selection method from gray-level histograms,” IEEE Trans. Syst., Man, Cybern., vol. 9, no. 1, pp. 62–66, 1979.
  • [2] S. D. Yanowitz and A. M. Bruckstein, “A new method for image segmentation,” Computer Vision, Graphics, and Image Processing, vol. 46, no. 1, pp. 82–95, 1989.
  • [3] Y.-I. Ohta, T. Kanade, and T. Sakai, “Color information for region segmentation,” Computer graphics and image processing, vol. 13, no. 3, pp. 222–241, 1980.
  • [4] Y. Deng, B. S. Manjunath, and H. Shin, “Color image segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 1999, pp. 446–451.
  • [5] R. Adams and L. Bischof, “Seeded region growing,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 16, no. 6, pp. 641–647, 1994.
  • [6] R. Nock and F. Nielsen, “Statistical region merging,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 11, pp. 1452–1458, 2004.
  • [7] J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 8, pp. 888–905, 2000.
  • [8] W.-Y. Ma and B. Manjunath, “Edge flow: a framework of boundary detection and image segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 1997, pp. 744–749.
  • [9] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, “Slic superpixels compared to state-of-the-art superpixel methods,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 11, pp. 2274–2282, 2012.
  • [10] X. He, R. S. Zemel, and M. A. Carreira-Perpiñán, “Multiscale conditional random fields for image labeling,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2004.
  • [11] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 3431–3440.
  • [12] J. Wang, K. Sun, T. Cheng, B. Jiang, C. Deng, Y. Zhao, D. Liu, Y. Mu, M. Tan, X. Wang et al., “Deep high-resolution representation learning for visual recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 10, pp. 3349–3364, 2020.
  • [13] L. Li, T. Zhou, W. Wang, J. Li, and Y. Yang, “Deep hierarchical semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 1246–1257.
  • [14] D. Liu, J. Liang, T. Geng, A. Loui, and T. Zhou, “Tripartite feature enhanced pyramid network for dense prediction,” IEEE Trans. Image Process., vol. 32, pp. 2678–2692, 2023.
  • [15] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer Assisted Intervention, 2015, pp. 234–241.
  • [16] T. Zhou and W. Wang, “Cross-image pixel contrasting for semantic segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 46, no. 8, pp. 5398–5412, 2024.
  • [17] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Semantic image segmentation with deep convolutional nets and fully connected crfs,” arXiv preprint arXiv:1412.7062, 2014.
  • [18] ——, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 4, pp. 834–848, 2017.
  • [19] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, “Rethinking atrous convolution for semantic image segmentation,” arXiv preprint arXiv:1706.05587, 2017.
  • [20] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 801–818.
  • [21] W. Byeon, T. M. Breuel, F. Raue, and M. Liwicki, “Scene labeling with lstm recurrent neural networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 3547–3555.
  • [22] B. Cheng, A. Schwing, and A. Kirillov, “Per-pixel classification is not all you need for semantic segmentation,” in Neural Inf. Process. Syst., 2021, pp. 17 864–17 875.
  • [23] B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar, “Masked-attention mask transformer for universal image segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 1290–1299.
  • [24] J. Liang, T. Zhou, D. Liu, and W. Wang, “Clustseg: clustering for universal segmentation,” in Proc. ACM Int. Conf. Mach. Learn., 2023, pp. 20 787–20 809.
  • [25] J. Jain, J. Li, M. T. Chiu, A. Hassani, N. Orlov, and H. Shi, “Oneformer: One transformer to rule universal image segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 2989–2998.
  • [26] T. Zhou, W. Wang, E. Konukoglu, and L. Van Gool, “Rethinking semantic segmentation: A prototype view,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 2582–2593.
  • [27] S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo, Y. Wang, Y. Fu, J. Feng, T. Xiang, P. H. Torr et al., “Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 6881–6890.
  • [28] C. Liang, W. Wang, T. Zhou, J. Miao, Y. Luo, and Y. Yang, “Local-global context aware transformer for language-guided video segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 8, pp. 10 055–10 069, 2023.
  • [29] R. Girshick, “Fast r-cnn,” in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 1440–1448.
  • [30] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, 2016.
  • [31] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 2961–2969.
  • [32] R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill et al., “On the opportunities and risks of foundation models,” arXiv preprint arXiv:2108.07258, 2021.
  • [33] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong et al., “A survey of large language models,” arXiv preprint arXiv:2303.18223, 2023.
  • [34] M. Awais, M. Naseer, S. Khan, R. M. Anwer, H. Cholakkal, M. Shah, M.-H. Yang, and F. S. Khan, “Foundational models defining a new era in vision: A survey and outlook,” arXiv preprint arXiv:2307.13721, 2023.
  • [35] S. Latif, M. Shoukat, F. Shamshad, M. Usama, H. Cuayáhuitl, and B. W. Schuller, “Sparks of large audio models: A survey and outlook,” arXiv preprint arXiv:2308.12792, 2023.
  • [36] A. J. Thirunavukarasu, D. S. J. Ting, K. Elangovan, L. Gutierrez, T. F. Tan, and D. S. W. Ting, “Large language models in medicine,” Nature Medicine, vol. 29, no. 8, pp. 1930–1940, 2023.
  • [37] M. Jin, Q. Wen, Y. Liang, C. Zhang, S. Xue, X. Wang, J. Zhang, Y. Wang, H. Chen, X. Li et al., “Large models for time series and spatio-temporal data: A survey and outlook,” arXiv preprint arXiv:2310.10196, 2023.
  • [38] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” in Neural Inf. Process. Syst., 2020, pp. 1877–1901.
  • [39] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.
  • [40] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds et al., “Flamingo: a visual language model for few-shot learning,” in Neural Inf. Process. Syst., 2022, pp. 23 716–23 736.
  • [41] G. Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth et al., “Gemini: a family of highly capable multimodal models,” arXiv preprint arXiv:2312.11805, 2023.
  • [42] OpenAI, “Sora: Creating video from text,” https://openai.com/sora, 2024.
  • [43] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 10 684–10 695.
  • [44] R. Schaeffer, B. Miranda, and S. Koyejo, “Are emergent abilities of large language models a mirage?” in Neural Inf. Process. Syst., 2023, pp. 55 565–55 581.
  • [45] J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzler et al., “Emergent abilities of large language models,” Transactions on Machine Learning Research, 2022.
  • [46] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., “Chain-of-thought prompting elicits reasoning in large language models,” in Neural Inf. Process. Syst., 2022, pp. 24 824–24 837.
  • [47] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman et al., “Evaluating large language models trained on code,” arXiv preprint arXiv:2107.03374, 2021.
  • [48] S. Altman, “Planning for agi and beyond,” OpenAI Blog, February 2023.
  • [49] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo et al., “Segment anything,” in Proc. IEEE Int. Conf. Comput. Vis., 2023, pp. 4015–4026.
  • [50] X. Zou, J. Yang, H. Zhang, F. Li, L. Li, J. Wang, L. Wang, J. Gao, and Y. J. Lee, “Segment everything everywhere all at once,” in Neural Inf. Process. Syst., 2023, pp. 19 769–19 782.
  • [51] X. Wang, X. Zhang, Y. Cao, W. Wang, C. Shen, and T. Huang, “Seggpt: Towards segmenting everything in context,” in Proc. IEEE Int. Conf. Comput. Vis., 2023, pp. 1130–1140.
  • [52] C. Zhou, C. C. Loy, and B. Dai, “Extract free dense labels from clip,” in Proc. Eur. Conf. Comput. Vis., 2022, pp. 696–712.
  • [53] F. Wang, J. Mei, and A. Yuille, “Sclip: Rethinking self-attention for dense vision-language inference,” arXiv preprint arXiv:2312.01597, 2023.
  • [54] L. Barsellotti, R. Amoroso, M. Cornia, L. Baraldi, and R. Cucchiara, “Training-free open-vocabulary segmentation with offline diffusion-augmented prototype generation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2024.
  • [55] J. Wang, X. Li, J. Zhang, Q. Xu, Q. Zhou, Q. Yu, L. Sheng, and D. Xu, “Diffusion model is secretly a training-free open vocabulary semantic segmenter,” arXiv preprint arXiv:2309.02773, 2023.
  • [56] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerging properties in self-supervised vision transformers,” in Proc. IEEE Int. Conf. Comput. Vis., 2021, pp. 9650–9660.
  • [57] M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby et al., “Dinov2: Learning robust visual features without supervision,” Transactions on Machine Learning Research Journal, pp. 1–31, 2024.
  • [58] H. Rasheed, M. Maaz, S. Shaji, A. Shaker, S. Khan, H. Cholakkal, R. M. Anwer, E. Xing, M.-H. Yang, and F. S. Khan, “Glamm: Pixel grounding large multimodal model,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2024, pp. 13 009–13 018.
  • [59] X. Lai, Z. Tian, Y. Chen, Y. Li, Y. Yuan, S. Liu, and J. Jia, “LISA: Reasoning segmentation via large language model,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2024, pp. 9579–9589.
  • [60] Z. Ren, Z. Huang, Y. Wei, Y. Zhao, D. Fu, J. Feng, and X. Jin, “Pixellm: Pixel reasoning with large multimodal model,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2024, pp. 26 374–26 383.
  • [61] Z. Zhang, Y. Ma, E. Zhang, and X. Bai, “Psalm: Pixelwise segmentation with large multi-modal model,” arXiv preprint arXiv:2403.14598, 2024.
  • [62] S. Kazemzadeh, V. Ordonez, M. Matten, and T. Berg, “Referitgame: Referring to objects in photographs of natural scenes,” in Proc. Conference on Empirical Methods in Natural Language Processing, 2014, pp. 787–798.
  • [63] D. Baranchuk, A. Voynov, I. Rubachev, V. Khrulkov, and A. Babenko, “Label-efficient semantic segmentation with diffusion models,” in Int. Conf. Learn. Representations, 2022.
  • [64] J. Tian, L. Aggarwal, A. Colaco, Z. Kira, and M. Gonzalez-Franco, “Diffuse, attend, and segment: Unsupervised zero-shot segmentation using stable diffusion,” arXiv preprint arXiv:2308.12469, 2023.
  • [65] Y. Ji, Z. Chen, E. Xie, L. Hong, X. Liu, Z. Liu, T. Lu, Z. Li, and P. Luo, “Ddp: Diffusion model for dense visual prediction,” in Proc. IEEE Int. Conf. Comput. Vis., 2023, pp. 21 741–21 752.
  • [66] J. Chen, J. Lu, X. Zhu, and L. Zhang, “Generative semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 7111–7120.
  • [67] H. Wang, J. Cao, R. M. Anwer, J. Xie, F. S. Khan, and Y. Pang, “Dformer: Diffusion-guided transformer for universal image segmentation,” arXiv preprint arXiv:2306.03437, 2023.
  • [68] H.-D. Cheng, X. H. Jiang, Y. Sun, and J. Wang, “Color image segmentation: advances and prospects,” Pattern recognition, vol. 34, no. 12, pp. 2259–2281, 2001.
  • [69] Y.-J. Zhang, “A survey on evaluation methods for image segmentation,” Pattern recognition, vol. 29, no. 8, pp. 1335–1346, 1996.
  • [70] S. Minaee, Y. Boykov, F. Porikli, A. Plaza, N. Kehtarnavaz, and D. Terzopoulos, “Image segmentation using deep learning: A survey,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 7, pp. 3523–3542, 2021.
  • [71] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in Proc. ACM Int. Conf. Mach. Learn., 2021, pp. 8748–8763.
  • [72] S. Yin, C. Fu, S. Zhao, K. Li, X. Sun, T. Xu, and E. Chen, “A survey on multimodal large language models,” arXiv preprint arXiv:2306.13549, 2023.
  • [73] N. M. Zaitoun and M. J. Aqel, “Survey on image segmentation techniques,” Procedia Computer Science, vol. 65, pp. 797–806, 2015.
  • [74] S. Ghosh, N. Das, I. Das, and U. Maulik, “Understanding deep learning techniques for image segmentation,” ACM Computing Surveys, vol. 52, no. 4, pp. 1–35, 2019.
  • [75] X. Liu, Z. Deng, and Y. Yang, “Recent progress in semantic image segmentation,” Artificial Intelligence Review, vol. 52, pp. 1089–1106, 2019.
  • [76] F. Lateef and Y. Ruichek, “Survey on semantic segmentation using deep learning techniques,” Neurocomputing, vol. 338, pp. 321–348, 2019.
  • [77] W. Shen, Z. Peng, X. Wang, H. Wang, J. Cen, D. Jiang, L. Xie, X. Yang, and Q. Tian, “A survey on label-efficient deep image segmentation: Bridging the gap between weak supervision and dense prediction,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 8, pp. 9284–9305, 2023.
  • [78] G. Csurka, R. Volpi, B. Chidlovskii et al., “Semantic image segmentation: Two decades of research,” Foundations and Trends® in Computer Graphics and Vision, vol. 14, no. 1-2, pp. 1–162, 2022.
  • [79] C. Zhu and L. Chen, “A survey on open-vocabulary detection and segmentation: Past, present, and future,” IEEE Trans. Pattern Anal. Mach. Intell., 2024.
  • [80] Z. Wang, E. Wang, and Y. Zhu, “Image segmentation evaluation: a survey of methods,” Artificial Intelligence Review, vol. 53, no. 8, pp. 5637–5674, 2020.
  • [81] S. Jadon, “A survey of loss functions for semantic segmentation,” in IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology, 2020, pp. 1–7.
  • [82] T. Zhou, F. Porikli, D. J. Crandall, L. Van Gool, and W. Wang, “A survey on deep learning technique for video segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 6, pp. 7099–7122, 2022.
  • [83] R. Wang, T. Lei, R. Cui, B. Zhang, H. Meng, and A. K. Nandi, “Medical image segmentation using deep learning: A survey,” IET Image Processing, vol. 16, no. 5, pp. 1243–1267, 2022.
  • [84] J. A. Noble and D. Boukerroui, “Ultrasound image segmentation: a survey,” IEEE Trans. Medical Image., vol. 25, no. 8, pp. 987–1010, 2006.
  • [85] F.-A. Croitoru, V. Hondru, R. T. Ionescu, and M. Shah, “Diffusion models in vision: A survey,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 9, pp. 10 850–10 869, 2023.
  • [86] C. Zhang, L. Liu, Y. Cui, G. Huang, W. Lin, Y. Yang, and Y. Hu, “A comprehensive survey on segment anything model for vision and beyond,” arXiv preprint arXiv:2305.08196, 2023.
  • [87] B. Li, K. Q. Weinberger, S. Belongie, V. Koltun, and R. Ranftl, “Language-driven semantic segmentation,” in Int. Conf. Learn. Representations, 2022.
  • [88] Y. Rao, W. Zhao, G. Chen, Y. Tang, Z. Zhu, G. Huang, J. Zhou, and J. Lu, “Denseclip: Language-guided dense prediction with context-aware prompting,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 18 082–18 091.
  • [89] B. T. Corradini, M. Shukor, P. Couairon, G. Couairon, F. Scarselli, and M. Cord, “Freeseg-diff: Training-free open-vocabulary segmentation with diffusion models,” arXiv preprint arXiv:2403.20105, 2024.
  • [90] R. Burgert, K. Ranasinghe, X. Li, and M. S. Ryoo, “Peekaboo: Text to image diffusion models are zero-shot segmentors,” arXiv preprint arXiv:2211.13224, 2022.
  • [91] L. Melas-Kyriazi, C. Rupprecht, I. Laina, and A. Vedaldi, “Deep spectral methods: A surprisingly strong baseline for unsupervised semantic segmentation and localization,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 8364–8375.
  • [92] M. Hamilton, Z. Zhang, B. Hariharan, N. Snavely, and W. T. Freeman, “Unsupervised semantic segmentation by distilling feature correspondences,” in Int. Conf. Learn. Representations, 2022.
  • [93] H. Kweon and K.-J. Yoon, “From sam to cams: Exploring segment anything model for weakly supervised semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2024, pp. 19 499–19 509.
  • [94] A. Wang, M. Islam, M. Xu, Y. Zhang, and H. Ren, “Sam meets robotic surgery: an empirical study on generalization, robustness and adaptation,” in Medical Image Computing and Computer Assisted Intervention, 2023, pp. 234–244.
  • [95] Y. Wang, R. Sun, N. Luo, Y. Pan, and T. Zhang, “Image-to-image matching via foundation models: A new perspective for open-vocabulary semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2024, pp. 3952–3963.
  • [96] X. Xu, T. Xiong, Z. Ding, and Z. Tu, “Masqclip for open-vocabulary universal image segmentation,” in Proc. IEEE Int. Conf. Comput. Vis., 2023, pp. 887–898.
  • [97] X. Chen, S. Li, S.-N. Lim, A. Torralba, and H. Zhao, “Open-vocabulary panoptic segmentation with embedding modulation,” in Proc. IEEE Int. Conf. Comput. Vis., 2023, pp. 1141–1150.
  • [98] J. Xie, W. Li, X. Li, Z. Liu, Y. S. Ong, and C. C. Loy, “Mosaicfusion: Diffusion models as data augmenters for large vocabulary instance segmentation,” arXiv preprint arXiv:2309.13042, 2023.
  • [99] W. Wu, Y. Zhao, H. Chen, Y. Gu, R. Zhao, Y. He, H. Zhou, M. Z. Shou, and C. Shen, “Datasetdm: Synthesizing data with perception annotations using diffusion models,” in Neural Inf. Process. Syst., 2023, pp. 54 683–54 695.
  • [100] X. Wang, R. Girdhar, S. X. Yu, and I. Misra, “Cut and learn for unsupervised object detection and instance segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 3124–3134.
  • [101] S. Arica, O. Rubin, S. Gershov, and S. Laufer, “Cuvler: Enhanced unsupervised object discoveries through exhaustive self-supervised transformers,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2024, pp. 23 105–23 114.
  • [102] H. Zhao, D. Sheng, J. Bao, D. Chen, D. Chen, F. Wen, L. Yuan, C. Liu, W. Zhou, Q. Chu et al., “X-paste: revisiting scalable copy-paste for instance segmentation using clip and stablediffusion,” in Proc. ACM Int. Conf. Mach. Learn., 2023, pp. 42 098–42 109.
  • [103] T. Ren, S. Liu, A. Zeng, J. Lin, K. Li, H. Cao, J. Chen, X. Huang, Y. Chen, F. Yan et al., “Grounded sam: Assembling open-world models for diverse visual tasks,” arXiv preprint arXiv:2401.14159, 2024.
  • [104] Z. Ding, J. Wang, and Z. Tu, “Open-vocabulary universal image segmentation with maskclip,” in Proc. ACM Int. Conf. Mach. Learn., 2023, pp. 8090–8102.
  • [105] Z. Wang, X. Xia, Z. Chen, X. He, Y. Guo, M. Gong, and T. Liu, “Open-vocabulary segmentation with unpaired mask-text supervision,” arXiv preprint arXiv:2402.08960, 2024.
  • [106] J. Xu, S. Liu, A. Vahdat, W. Byeon, X. Wang, and S. De Mello, “Open-vocabulary panoptic segmentation with text-to-image diffusion models,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 2955–2966.
  • [107] T. Chen, L. Li, S. Saxena, G. Hinton, and D. J. Fleet, “A generalist framework for panoptic segmentation of images and videos,” in Proc. IEEE Int. Conf. Comput. Vis., 2023, pp. 909–919.
  • [108] D. Niu, X. Wang, X. Han, L. Lian, R. Herzig, and T. Darrell, “Unsupervised universal image segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2024, pp. 22 744–22 754.
  • [109] F. Li, H. Zhang, P. Sun, X. Zou, S. Liu, J. Yang, C. Li, L. Zhang, and J. Gao, “Semantic-sam: Segment and recognize anything at any granularity,” in Proc. Eur. Conf. Comput. Vis., 2024.
  • [110] Y. Zhao, K. Li, Z. Cheng, P. Qiao, X. Zheng, R. Ji, C. Liu, L. Yuan, and J. Chen, “Graco: Granularity-controllable interactive segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2024, pp. 3501–3510.
  • [111] J. Ma, Y. He, F. Li, L. Han, C. You, and B. Wang, “Segment anything in medical images,” Nature Communications, vol. 15, no. 1, p. 654, 2024.
  • [112] S. Yu, P. H. Seo, and J. Son, “Zero-shot referring image segmentation with global-local context features,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 19 456–19 465.
  • [113] S. Kim, M. Kang, D. Kim, J. Park, and S. Kwak, “Extending clip’s image-text alignment to referring image segmentation,” in Association for Computational Linguistics, 2024, pp. 4611–4628.
  • [114] M. Ni, Y. Zhang, K. Feng, X. Li, Y. Guo, and W. Zuo, “Ref-diff: Zero-shot referring image segmentation with generative models,” arXiv preprint arXiv:2308.16777, 2023.
  • [115] K. Pnvr, B. Singh, P. Ghosh, B. Siddiquie, and D. Jacobs, “Ld-znet: A latent diffusion approach for text-based image segmentation,” in Proc. IEEE Int. Conf. Comput. Vis., 2023, pp. 4157–4168.
  • [116] S. Wu, S. Jin, W. Zhang, L. Xu, W. Liu, W. Li, and C. C. Loy, “F-lmm: Grounding frozen large multimodal models,” arXiv preprint arXiv:2406.05821, 2024.
  • [117] Q. Dai and S. Yang, “Curriculum point prompting for weakly-supervised referring image segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2024, pp. 13 711–13 722.
  • [118] J. Jeong, Y. Zou, T. Kim, D. Zhang, A. Ravichandran, and O. Dabeer, “Winclip: Zero-/few-shot anomaly classification and segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 19 606–19 616.
  • [119] Z. Zhou, H.-M. Xu, Y. Shu, and L. Liu, “Unlocking the potential of pre-trained vision transformers for few-shot semantic segmentation through relationship descriptors,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2024, pp. 3817–3827.
  • [120] W. Tan, S. Chen, and B. Yan, “Diffss: Diffusion model for few-shot semantic segmentation,” arXiv preprint arXiv:2307.00773, 2023.
  • [121] L. Shen, F. Shang, Y. Yang, X. Huang, and S. Xiang, “Segicl: A universal in-context learning framework for enhanced segmentation in medical imaging,” arXiv preprint arXiv:2403.16578, 2024.
  • [122] M. Käppeler, K. Petek, N. Vödisch, W. Burgard, and A. Valada, “Few-shot panoptic segmentation with foundation models,” arXiv preprint arXiv:2309.10726, 2023.
  • [123] Y. Sun, J. Chen, S. Zhang, X. Zhang, Q. Chen, G. Zhang, E. Ding, J. Wang, and Z. Li, “Vrp-sam: Sam with visual reference prompt,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2024, pp. 23 565–23 574.
  • [124] L. Zhu, T. Chen, D. Ji, J. Ye, and J. Liu, “Llafs: When large language models meet few-shot segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2024, pp. 3065–3075.
  • [125] X. Liu, T. Zhou, Y. Wang, Y. Wang, Q. Cao, W. Du, Y. Yang, J. He, Y. Qiao, and Y. Shen, “Towards the unification of generative and discriminative visual foundation model: A survey,” arXiv preprint arXiv:2312.10163, 2023.
  • [126] A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V. Ramasesh, A. Slone, C. Anil, I. Schlag, T. Gutman-Solo et al., “Solving quantitative reasoning problems with language models,” in Neural Inf. Process. Syst., 2022, pp. 3843–3857.
  • [127] C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.-H. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision-language representation learning with noisy text supervision,” in Proc. ACM Int. Conf. Mach. Learn., 2021, pp. 4904–4916.
  • [128] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” in Neural Inf. Process. Syst., 2020, pp. 6840–6851.
  • [129] P. Esser, R. Rombach, and B. Ommer, “Taming transformers for high-resolution image synthesis,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 12 873–12 883.
  • [130] A. Hertz, R. Mokady, J. Tenenbaum, K. Aberman, Y. Pritch, and D. Cohen-or, “Prompt-to-prompt image editing with cross-attention control,” in Int. Conf. Learn. Representations, 2023.
  • [131] G. Parmar, K. Kumar Singh, R. Zhang, Y. Li, J. Lu, and J.-Y. Zhu, “Zero-shot image-to-image translation,” in ACM SIGGRAPH, 2023, pp. 1–11.
  • [132] A. C. Li, M. Prabhudesai, S. Duggal, E. Brown, and D. Pathak, “Your diffusion model is secretly a zero-shot classifier,” in Proc. IEEE Int. Conf. Comput. Vis., 2023, pp. 2206–2217.
  • [133] K. Clark and P. Jaini, “Text-to-image diffusion models are zero shot classifiers,” in Neural Inf. Process. Syst., 2023, pp. 58 921–58 937.
  • [134] D. Xie, R. Wang, J. Ma, C. Chen, H. Lu, D. Yang, F. Shi, and X. Lin, “Edit everything: A text-guided generative system for images editing,” arXiv preprint arXiv:2304.14006, 2023.
  • [135] Y. Cheng, L. Li, Y. Xu, X. Li, Z. Yang, W. Wang, and Y. Yang, “Segment and track anything,” arXiv preprint arXiv:2305.06558, 2023.
  • [136] K. Ranasinghe, B. McKinzie, S. Ravi, Y. Yang, A. Toshev, and J. Shlens, “Perceptual grouping in contrastive vision-language models,” in Proc. IEEE Int. Conf. Comput. Vis., 2023, pp. 5571–5584.
  • [137] Y. Li, H. Wang, Y. Duan, and X. Li, “Clip surgery for better explainability with enhancement in open-vocabulary tasks,” arXiv preprint arXiv:2304.05653, 2023.
  • [138] S. Hajimiri, I. B. Ayed, and J. Dolz, “Pay attention to your neighbours: Training-free open-vocabulary semantic segmentation,” arXiv preprint arXiv:2404.08181, 2024.
  • [139] W. Bousselham, F. Petersen, V. Ferrari, and H. Kuehne, “Grounding everything: Emerging localization properties in vision-language transformers,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2024, pp. 3828–3837.
  • [140] R. Yoshihashi, Y. Otsuka, T. Tanaka et al., “Attention as annotation: Generating images and pseudo-masks for weakly supervised semantic segmentation with diffusion,” arXiv preprint arXiv:2309.01369, 2023.
  • [141] Q. Nguyen, T. Vu, A. Tran, and K. Nguyen, “Dataset diffusion: Diffusion-based synthetic data generation for pixel-level semantic segmentation,” in Neural Inf. Process. Syst., 2023, pp. 76 872–76 892.
  • [142] R. Tang, L. Liu, A. Pandey, Z. Jiang, G. Yang, K. Kumar, P. Stenetorp, J. Lin, and F. Ture, “What the daam: Interpreting stable diffusion using cross attention,” in Association for Computational Linguistics, 2023, pp. 5644–5659.
  • [143] P. Marcos-Manchón, R. Alcover-Couso, J. C. SanMiguel, and J. M. Martínez, “Open-vocabulary attention maps with token optimization for semantic segmentation in diffusion models,” arXiv preprint arXiv:2403.14291, 2024.
  • [144] Y. Kawano and Y. Aoki, “Maskdiffusion: Exploiting pre-trained diffusion models for semantic segmentation,” arXiv preprint arXiv:2403.11194, 2024.
  • [145] L. Karazija, I. Laina, A. Vedaldi, and C. Rupprecht, “Diffusion models for zero-shot open-vocabulary segmentation,” arXiv preprint arXiv:2306.09316, 2023.
  • [146] T. Zhou and W. Wang, “Prototype-based semantic segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., 2024.
  • [147] W. Wang, C. Han, T. Zhou, and D. Liu, “Visual recognition with deep nearest centroids,” in Int. Conf. Learn. Representations, 2023.
  • [148] O. Siméoni, G. Puy, H. V. Vo, S. Roburin, S. Gidaris, A. Bursuc, P. Pérez, R. Marlet, and J. Ponce, “Localizing objects with self-supervised transformers and no labels,” in British Machine Vision Conference, 2021.
  • [149] W. Van Gansbeke, S. Vandenhende, and L. Van Gool, “Discovering object masks with transformers for unsupervised semantic segmentation,” arXiv preprint arXiv:2206.06363, 2022.
  • [150] A. Zadaianchuk, M. Kleindessner, Y. Zhu, F. Locatello, and T. Brox, “Unsupervised semantic segmentation with self-supervised object-centric representations,” in Int. Conf. Learn. Representations, 2023.
  • [151] S. Amir, Y. Gandelsman, S. Bagon, and T. Dekel, “Deep vit features as dense visual descriptors,” arXiv preprint arXiv:2112.05814, 2021.
  • [152] H. Kwon, T. Song, S. Jeong, J. Kim, J. Jang, and K. Sohn, “Probabilistic prompt learning for dense prediction,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 6768–6777.
  • [153] S. Cho, H. Shin, S. Hong, A. Arnab, P. H. Seo, and S. Kim, “Cat-seg: Cost aggregation for open-vocabulary semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2024, pp. 4113–4123.
  • [154] K. Kim, Y. Oh, and J. C. Ye, “Otseg: Multi-prompt sinkhorn attention for zero-shot semantic segmentation,” in Proc. Eur. Conf. Comput. Vis., 2024.
  • [155] Z. Zhou, Y. Lei, B. Zhang, L. Liu, and Y. Liu, “Zegclip: Towards adapting clip for zero-shot semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 11 175–11 185.
  • [156] Z. Zhang, T. Zhang, Y. Zhu, J. Liu, X. Liang, Q. Ye, and W. Ke, “Language-driven visual consensus for zero-shot semantic segmentation,” arXiv preprint arXiv:2403.08426, 2024.
  • [157] K. Kim, Y. Oh, and J. C. Ye, “Zegot: Zero-shot segmentation through optimal transport of text prompts,” arXiv preprint arXiv:2301.12171, 2023.
  • [158] L. Hoyer, D. J. Tan, M. F. Naeem, L. Van Gool, and F. Tombari, “Semivl: Semi-supervised semantic segmentation with vision-language guidance,” arXiv preprint arXiv:2311.16241, 2023.
  • [159] M. Xu, Z. Zhang, F. Wei, H. Hu, and X. Bai, “Side adapter network for open-vocabulary semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 2945–2954.
  • [160] M. Xu, Z. Zhang, F. Wei, Y. Lin, Y. Cao, H. Hu, and X. Bai, “A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model,” in Proc. Eur. Conf. Comput. Vis., 2022, pp. 736–753.
  • [161] S. Sun, R. Li, P. Torr, X. Gu, and S. Li, “Clip as rnn: Segment countless visual concepts without training endeavor,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2024, pp. 13 171–13 182.
  • [162] J. Ding, N. Xue, G.-S. Xia, and D. Dai, “Decoupling zero-shot semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 11 583–11 592.
  • [163] K. Han, Y. Liu, J. H. Liew, H. Ding, J. Liu, Y. Wang, Y. Tang, Y. Yang, J. Feng, Y. Zhao et al., “Global knowledge calibration for fast open-vocabulary segmentation,” in Proc. IEEE Int. Conf. Comput. Vis., 2023, pp. 797–807.
  • [164] Y. Zhang, Z. Wang, J. H. Liew, J. Huang, M. Zhu, J. Feng, and W. Zuo, “Associating spatially-consistent grouping with text-supervised semantic segmentation,” arXiv preprint arXiv:2304.01114, 2023.
  • [165] C. Han, Y. Zhong, D. Li, K. Han, and L. Ma, “Open-vocabulary semantic segmentation with decoupled one-pass network,” in Proc. IEEE Int. Conf. Comput. Vis., 2023, pp. 1086–1096.
  • [166] Z. Gui, S. Sun, R. Li, J. Yuan, Z. An, K. Roth, A. Prabhu, and P. Torr, “knn-clip: Retrieval enables training-free segmentation on continually expanding large vocabularies,” arXiv preprint arXiv:2404.09447, 2024.
  • [167] F. Liang, B. Wu, X. Dai, K. Li, Y. Zhao, H. Zhang, P. Zhang, P. Vajda, and D. Marculescu, “Open-vocabulary semantic segmentation with mask-adapted clip,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 7061–7070.
  • [168] S. Jiao, Y. Wei, Y. Wang, Y. Zhao, and H. Shi, “Learning mask-aware clip representations for zero-shot segmentation,” in Neural Inf. Process. Syst., 2023, pp. 35 631–35 653.
  • [169] C. Ma, Y. Yang, Y. Wang, Y. Zhang, and W. Xie, “Open-vocabulary semantic segmentation with frozen vision-language models,” in British Machine Vision Conference, 2022.
  • [170] J. Mukhoti, T.-Y. Lin, O. Poursaeed, R. Wang, A. Shah, P. H. Torr, and S.-N. Lim, “Open vocabulary semantic segmentation with patch aligned contrastive learning,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 19 413–19 423.
  • [171] Y. Lin, M. Chen, W. Wang, B. Wu, K. Li, B. Lin, H. Liu, and X. He, “Clip is also an efficient segmenter: A text-driven approach for weakly supervised semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 15 305–15 314.
  • [172] W. He, S. Jamonnak, L. Gou, and L. Ren, “Clip-s4: Language-guided self-supervised semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 11 207–11 216.
  • [173] X. Liu, B. Tian, Z. Wang, R. Wang, K. Sheng, B. Zhang, H. Zhao, and G. Zhou, “Delving into shape-aware zero-shot semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 2999–3009.
  • [174] J. Xu, S. De Mello, S. Liu, W. Byeon, T. Breuel, J. Kautz, and X. Wang, “Groupvit: Semantic segmentation emerges from text supervision,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 18 134–18 144.
  • [175] H. Luo, J. Bao, Y. Wu, X. He, and T. Li, “Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation,” in Proc. ACM Int. Conf. Mach. Learn., 2023, pp. 23 033–23 044.
  • [176] F. Zhang, T. Zhou, B. Li, H. He, C. Ma, T. Zhang, J. Yao, Y. Zhang, and Y. Wang, “Uncovering prototypical knowledge for weakly open-vocabulary semantic segmentation,” in Neural Inf. Process. Syst., 2023, pp. 73 652–73 665.
  • [177] M. Yi, Q. Cui, H. Wu, C. Yang, O. Yoshie, and H. Lu, “A simple framework for text-supervised semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 7071–7080.
  • [178] Q. Liu, K. Zheng, W. Wei, Z. Tong, Y. Liu, W. Chen, Z. Wang, and Y. Shen, “Tagalign: Improving vision-language alignment with multi-tag classification,” arXiv preprint arXiv:2312.14149, 2023.
  • [179] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
  • [180] J. Chen, D. Zhu, G. Qian, B. Ghanem, Z. Yan, C. Zhu, F. Xiao, S. C. Culatana, and M. Elhoseiny, “Exploring open-vocabulary semantic segmentation from clip vision encoder distillation only,” in Proc. IEEE Int. Conf. Comput. Vis., 2023, pp. 699–710.
  • [181] J. Chen, D. Deguchi, C. Zhang, X. Zheng, and H. Murase, “Clip is also a good teacher: A new learning framework for inductive zero-shot semantic segmentation,” arXiv preprint arXiv:2310.02296, 2023.
  • [182] S. Wu, W. Zhang, L. Xu, S. Jin, X. Li, W. Liu, and C. C. Loy, “Clipself: Vision transformer distills itself for open-vocabulary dense prediction,” in Int. Conf. Learn. Representations, 2024.
  • [183] M. F. Naeem, Y. Xian, X. Zhai, L. Hoyer, L. Van Gool, and F. Tombari, “Silc: Improving vision language pretraining with self-distillation,” arXiv preprint arXiv:2310.13355, 2023.
  • [184] M. Wysoczańska, O. Siméoni, M. Ramamonjisoa, A. Bursuc, T. Trzciński, and P. Pérez, “Clip-dinoiser: Teaching clip a few dino tricks,” arXiv preprint arXiv:2312.12359, 2023.
  • [185] J. Tian, L. Aggarwal, A. Colaco, Z. Kira, and M. Gonzalez-Franco, “Diffuse attend and segment: Unsupervised zero-shot segmentation using stable diffusion,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2024, pp. 3554–3563.
  • [186] S. Mukhopadhyay, M. Gwilliam, Y. Yamaguchi, V. Agarwal, N. Padmanabhan, A. Swaminathan, T. Zhou, and A. Shrivastava, “Do text-free diffusion models learn discriminative visual representations?” arXiv preprint arXiv:2311.17921, 2023.
  • [187] W. Zhao, Y. Rao, Z. Liu, B. Liu, J. Zhou, and J. Lu, “Unleashing text-to-image diffusion models for visual perception,” in Proc. IEEE Int. Conf. Comput. Vis., 2023, pp. 5729–5739.
  • [188] N. Kondapaneni, M. Marks, M. Knott, R. Guimaraes, and P. Perona, “Text-image alignment for diffusion-based perception,” arXiv preprint arXiv:2310.00031, 2023.
  • [189] E. Hedlin, G. Sharma, S. Mahajan, H. Isack, A. Kar, A. Tagliasacchi, and K. M. Yi, “Unsupervised semantic correspondence using stable diffusion,” in Neural Inf. Process. Syst., 2023, pp. 8266–8279.
  • [190] S. Dong, M. Zhu, K. Cheng, N. Wang, and X. Gao, “Bridging generative and discriminative models for unified visual perception with diffusion priors,” arXiv preprint arXiv:2401.16459, 2024.
  • [191] Q. Wan, Z. Huang, B. Kang, J. Feng, and L. Zhang, “Harnessing diffusion models for visual perception with meta prompts,” arXiv preprint arXiv:2312.14733, 2023.
  • [192] J. Wolleb, R. Sandkühler, F. Bieder, P. Valmaggia, and P. C. Cattin, “Diffusion models for implicit image segmentation ensembles,” in International Conference on Medical Imaging with Deep Learning, 2022, pp. 1336–1348.
  • [193] L. Zbinden, L. Doorenbos, T. Pissas, A. T. Huber, R. Sznitman, and P. Márquez-Neila, “Stochastic segmentation with conditional categorical diffusion models,” in Proc. IEEE Int. Conf. Comput. Vis., 2023, pp. 1119–1129.
  • [194] A. Rahman, J. M. J. Valanarasu, I. Hacihaliloglu, and V. M. Patel, “Ambiguous medical image segmentation using diffusion models,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 11 536–11 546.
  • [195] M. Wang, H. Ding, J. H. Liew, J. Liu, Y. Zhao, and Y. Wei, “SegRefiner: Towards model-agnostic segmentation refinement with discrete diffusion process,” in Neural Inf. Process. Syst., 2023, pp. 79 761–79 780.
  • [196] W. Wu, Y. Zhao, M. Z. Shou, H. Zhou, and C. Shen, “Diffumask: Synthesizing images with pixel-level annotations for semantic segmentation using diffusion models,” in Proc. IEEE Int. Conf. Comput. Vis., 2023, pp. 1206–1217.
  • [197] V. Fernandez, W. H. L. Pinaya, P. Borges, P.-D. Tudosiu, M. S. Graham, T. Vercauteren, and M. J. Cardoso, “Can segmentation models be trained with fully synthetically generated data?” in Medical Image Computing and Computer Assisted Intervention, 2022, pp. 79–90.
  • [198] J. Schnell, J. Wang, L. Qi, V. T. Hu, and M. Tang, “Generative data augmentation improves scribble-supervised semantic segmentation,” arXiv preprint arXiv:2311.17121, 2023.
  • [199] X. Yu, G. Li, W. Lou, S. Liu, X. Wan, Y. Chen, and H. Li, “Diffusion-based data augmentation for nuclei image segmentation,” in Medical Image Computing and Computer Assisted Intervention, 2023, pp. 592–602.
  • [200] Z. Zhang, L. Yao, B. Wang, D. Jha, E. Keles, A. Medetalibeyoglu, and U. Bagci, “Emit-diff: Enhancing medical image segmentation via text-guided diffusion model,” arXiv preprint arXiv:2310.12868, 2023.
  • [201] Y. Wang, X. Shen, S. X. Hu, Y. Yuan, J. L. Crowley, and D. Vaufreydaz, “Self-supervised transformers for unsupervised object discovery using normalized cut,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 14 543–14 553.
  • [202] Y. Wang, X. Shen, Y. Yuan, Y. Du, M. Li, S. X. Hu, J. L. Crowley, and D. Vaufreydaz, “Tokencut: Segmenting objects in images and videos with self-supervised transformer and normalized cut,” IEEE Trans. Pattern Anal. Mach. Intell., 2023.
  • [203] A. Ziegler and Y. M. Asano, “Self-supervised learning of object parts for semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 14 502–14 511.
  • [204] L. Sick, D. Engel, P. Hermosilla, and T. Ropinski, “Unsupervised semantic segmentation through depth-guided feature correlation and sampling,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2024, pp. 3637–3646.
  • [205] H. S. Seong, W. Moon, S. Lee, and J.-P. Heo, “Leveraging hidden positives for unsupervised semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 19 540–19 549.
  • [206] C. Kim, W. Han, D. Ju, and S. J. Hwang, “Eagle: Eigen aggregation learning for object-centric unsupervised semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2024, pp. 3523–3533.
  • [207] T. Chen, Z. Mai, R. Li, and W.-l. Chao, “Segment anything model (sam) enhanced pseudo labels for weakly supervised semantic segmentation,” arXiv preprint arXiv:2305.05803, 2023.
  • [208] Z. Chen and Q. Sun, “Weakly-supervised semantic segmentation with image-level labels: from traditional models to foundation models,” arXiv preprint arXiv:2310.13026, 2023.
  • [209] P. Rewatbowornwong, N. Chatthee, E. Chuangsuwanich, and S. Suwajanakorn, “Zero-guidance segmentation using zero segment labels,” in Proc. IEEE Int. Conf. Comput. Vis., 2023, pp. 1162–1172.
  • [210] Y. Kawano and Y. Aoki, “Tag: Guidance-free open-vocabulary semantic segmentation,” arXiv preprint arXiv:2403.11197, 2024.
  • [211] Q. Yu, J. He, X. Deng, X. Shen, and L.-C. Chen, “Convolutions die hard: Open-vocabulary segmentation with single frozen convolutional clip,” in Neural Inf. Process. Syst., 2023, pp. 32 215–32 234.
  • [212] S. He, H. Ding, and W. Jiang, “Primitive generation and semantic-related alignment for universal zero-shot segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 11 238–11 247.
  • [213] S. He, H. Ding, and W. Jiang, “Semantic-promoted debiasing and background disambiguation for zero-shot instance segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 19 498–19 507.
  • [214] Y. Ge, J. Xu, B. N. Zhao, N. Joshi, L. Itti, and V. Vineet, “Dall-e for detection: Language-driven compositional image synthesis for object detection,” arXiv preprint arXiv:2206.09592, 2022.
  • [215] S. Cao, D. Joshi, L. Gui, and Y.-X. Wang, “Hassod: Hierarchical adaptive self-supervised object detection,” in Neural Inf. Process. Syst., 2023, pp. 59 337–59 359.
  • [216] C. Fan, M. Zhu, H. Chen, Y. Liu, W. Wu, H. Zhang, and C. Shen, “Divergen: Improving instance segmentation by learning wider data distribution with more diverse generative data,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2024, pp. 3986–3995.
  • [217] C. Shi and S. Yang, “The devil is in the object boundary: Towards annotation-free instance segmentation using foundation models,” in Int. Conf. Learn. Representations, 2024.
  • [218] L. Ke, M. Ye, M. Danelljan, Y.-W. Tai, C.-K. Tang, F. Yu et al., “Segment anything in high quality,” in Neural Inf. Process. Syst., 2023, pp. 29 914–29 934.
  • [219] S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, C. Li, J. Yang, H. Su, J. Zhu et al., “Grounding dino: Marrying dino with grounded pre-training for open-set object detection,” arXiv preprint arXiv:2303.05499, 2023.
  • [220] X. Zou, Z.-Y. Dou, J. Yang, Z. Gan, L. Li, C. Li, X. Dai, H. Behl, J. Wang, L. Yuan et al., “Generalized decoding for pixel, image, and language,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 15 116–15 127.
  • [221] J. Qin, J. Wu, P. Yan, M. Li, R. Yuxi, X. Xiao, Y. Wang, R. Wang, S. Wen, X. Pan et al., “Freeseg: Unified, universal and open-vocabulary image segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 19 446–19 455.
  • [222] X. Gu, Y. Cui, J. Huang, A. Rashwan, X. Yang, X. Zhou, G. Ghiasi, W. Kuo, H. Chen, L.-C. Chen et al., “Dataseg: Taming a universal multi-dataset multi-task segmentation model,” in Neural Inf. Process. Syst., 2023, pp. 67 329–67 354.
  • [223] X. Li, H. Yuan, W. Li, H. Ding, S. Wu, W. Zhang, Y. Li, K. Chen, and C. C. Loy, “Omg-seg: Is one model good enough for all segmentation?” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2024, pp. 27 948–27 959.
  • [224] T. Chen, R. Zhang, and G. Hinton, “Analog bits: Generating discrete data using diffusion models with self-conditioning,” in Int. Conf. Learn. Representations, 2023.
  • [225] W. Van Gansbeke and B. De Brabandere, “A simple latent diffusion approach for panoptic segmentation and mask inpainting,” in Proc. Eur. Conf. Comput. Vis., 2024.
  • [226] A. Kirillov, R. Girshick, K. He, and P. Dollár, “Panoptic feature pyramid networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 6399–6408.
  • [227] N. Vödisch, K. Petek, M. Käppeler, A. Valada, and W. Burgard, “A good foundation is worth many labels: Label-efficient panoptic segmentation,” arXiv preprint arXiv:2405.19035, 2024.
  • [228] B. Cheng, M. D. Collins, Y. Zhu, T. Liu, T. S. Huang, H. Adam, and L.-C. Chen, “Panoptic-deeplab: A simple, strong, and fast baseline for bottom-up panoptic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 12 475–12 485.
  • [229] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Proc. Eur. Conf. Comput. Vis., 2014, pp. 740–755.
  • [230] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, “Scene parsing through ade20k dataset,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 633–641.
  • [231] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes challenge: A retrospective,” Int. J. Comput. Vis., vol. 111, pp. 98–136, 2015.
  • [232] T. Pan, L. Tang, X. Wang, and S. Shan, “Tokenize anything via prompting,” in Proc. Eur. Conf. Comput. Vis., 2024.
  • [233] H. Yuan, X. Li, C. Zhou, Y. Li, K. Chen, and C. C. Loy, “Open-vocabulary sam: Segment and recognize twenty-thousand classes interactively,” in Proc. Eur. Conf. Comput. Vis., 2024.
  • [234] T. Zhou, L. Li, G. Bredell, J. Li, J. Unkelbach, and E. Konukoglu, “Volumetric memory network for interactive medical image segmentation,” Medical Image Analysis, vol. 83, p. 102599, 2023.
  • [235] M. A. Mazurowski, H. Dong, H. Gu, J. Yang, N. Konz, and Y. Zhang, “Segment anything model for medical image analysis: an experimental study,” Medical Image Analysis, vol. 89, p. 102918, 2023.
  • [236] D. Cheng, Z. Qin, Z. Jiang, S. Zhang, Q. Lao, and K. Li, “Sam on medical images: A comprehensive study on three prompt modes,” arXiv preprint arXiv:2305.00035, 2023.
  • [237] Y. Zhang, Z. Shen, and R. Jiao, “Segment anything model for medical image segmentation: Current applications and future directions,” Computers in Biology and Medicine, p. 108238, 2024.
  • [238] T. Wald, S. Roy, G. Koehler, N. Disch, M. R. Rokuss, J. Holzschuh, D. Zimmerer, and K. Maier-Hein, “Sam.md: Zero-shot medical image segmentation capabilities of the segment anything model,” in Medical Imaging with Deep Learning, 2023.
  • [239] F. Putz, J. Grigo, T. Weissmann, P. Schubert, D. Hoefler, A. Gomaa, H. B. Tkhayat, A. Hagag, S. Lettmaier, B. Frey et al., “The segment anything foundation model achieves favorable brain tumor autosegmentation accuracy on mri to support radiotherapy treatment planning,” arXiv preprint arXiv:2304.07875, 2023.
  • [240] R. Deng, C. Cui, Q. Liu, T. Yao, L. W. Remedios, S. Bao, B. A. Landman, L. E. Wheless, L. A. Coburn, K. T. Wilson et al., “Segment anything model (sam) for digital pathology: Assess zero-shot segmentation on whole slide imaging,” arXiv preprint arXiv:2304.04155, 2023.
  • [241] K. Zhang and D. Liu, “Customized segment anything model for medical image segmentation,” arXiv preprint arXiv:2304.13785, 2023.
  • [242] W. Feng, L. Zhu, and L. Yu, “Cheap lunch for medical image segmentation by fine-tuning sam on few exemplars,” arXiv preprint arXiv:2308.14133, 2023.
  • [243] J. Wu, R. Fu, H. Fang, Y. Liu, Z. Wang, Y. Xu, Y. Jin, and T. Arbel, “Medical sam adapter: Adapting segment anything model for medical image segmentation,” arXiv preprint arXiv:2304.12620, 2023.
  • [244] S. Gong, Y. Zhong, W. Ma, J. Li, Z. Wang, J. Zhang, P.-A. Heng, and Q. Dou, “3dsam-adapter: Holistic adaptation of sam from 2d to 3d for promptable medical image segmentation,” arXiv preprint arXiv:2306.13465, 2023.
  • [245] Y. Suo, L. Zhu, and Y. Yang, “Text augmented spatial aware zero-shot referring image segmentation,” in Proc. Conference on Empirical Methods in Natural Language Processing, 2023, pp. 1032–1043.
  • [246] Z. Wang, Y. Lu, Q. Li, X. Tao, Y. Guo, M. Gong, and T. Liu, “Cris: Clip-driven referring image segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 11 686–11 695.
  • [247] Z. Xu, Z. Chen, Y. Zhang, Y. Song, X. Wan, and G. Li, “Bridging vision and language encoders: Parameter-efficient tuning for referring image segmentation,” in Proc. IEEE Int. Conf. Comput. Vis., 2023, pp. 17 503–17 512.
  • [248] H. Nguyen-Truong, E.-R. Nguyen, T.-A. Vu, M.-T. Tran, B.-S. Hua, and S.-K. Yeung, “Improving referring image segmentation using vision-aware text features,” arXiv preprint arXiv:2404.08590, 2024.
  • [249] Y. Yan, X. He, W. Wang, S. Chen, and J. Liu, “Eavl: Explicitly align vision and language for referring image segmentation,” in Proc. AAAI Conf. Artif. Intell., 2024.
  • [250] W. Wang, T. Yue, Y. Zhang, L. Guo, X. He, X. Wang, and J. Liu, “Unveiling parts beyond objects: Towards finer-granularity referring expression segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2024, pp. 12 998–13 008.
  • [251] T. Lüddecke and A. Ecker, “Image segmentation using text and image prompts,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 7086–7096.
  • [252] Z. Zhou, O. Alabi, M. Wei, T. Vercauteren, and M. Shi, “Text promptable surgical instrument segmentation with vision-language models,” in Neural Inf. Process. Syst., 2023, pp. 28 611–28 623.
  • [253] Y. Wang, J. Li, X. Zhang, B. Shi, C. Li, W. Dai, H. Xiong, and Q. Tian, “Barleria: An efficient tuning framework for referring image segmentation,” in Int. Conf. Learn. Representations, 2023.
  • [254] R. Strudel, I. Laptev, and C. Schmid, “Weakly-supervised segmentation of referring expressions,” arXiv preprint arXiv:2205.04725, 2022.
  • [255] F. Liu, Y. Liu, Y. Kong, K. Xu, L. Zhang, B. Yin, G. Hancke, and R. Lau, “Referring image segmentation using text supervision,” in Proc. IEEE Int. Conf. Comput. Vis., 2023, pp. 22 124–22 134.
  • [256] Y. Iioka, Y. Yoshida, Y. Wada, S. Hatanaka, and K. Sugiura, “Multimodal diffusion segmentation model for object segmentation from manipulation instructions,” in International Conference on Intelligent Robots and Systems, 2023, pp. 7590–7597.
  • [257] L. Qi, L. Yang, W. Guo, Y. Xu, B. Du, V. Jampani, and M.-H. Yang, “Unigs: Unified representation for image generation and segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2024, pp. 6305–6315.
  • [258] H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” in Neural Inf. Process. Syst., 2023, pp. 49 250–49 267.
  • [259] S. Yang, T. Qu, X. Lai, Z. Tian, B. Peng, S. Liu, and J. Jia, “An improved baseline for reasoning segmentation with large language model,” arXiv preprint arXiv:2312.17240, 2023.
  • [260] J. Wang and L. Ke, “Llm-seg: Bridging image segmentation and large language model reasoning,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2024, pp. 1765–1774.
  • [261] A. Zhang, L. Zhao, C.-W. Xie, Y. Zheng, W. Ji, and T.-S. Chua, “Next-chat: An lmm for chat, detection and segmentation,” arXiv preprint arXiv:2311.04498, 2023.
  • [262] Z. Xia, D. Han, Y. Han, X. Pan, S. Song, and G. Huang, “Gsva: Generalized segmentation via multimodal large language models,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2024, pp. 3858–3869.
  • [263] C. Wei, H. Tan, Y. Zhong, Y. Yang, and L. Ma, “Lasagna: Language-based segmentation assistant for complex queries,” arXiv preprint arXiv:2404.08506, 2024.
  • [264] Y. Yang, P.-T. Jiang, J. Wang, H. Zhang, K. Zhao, J. Chen, and B. Li, “Empowering segmentation ability to multi-modal large language models,” arXiv preprint arXiv:2403.14141, 2024.
  • [265] Y. Yuan, W. Li, J. Liu, D. Tang, X. Luo, C. Qin, L. Zhang, and J. Zhu, “Osprey: Pixel understanding with visual instruction tuning,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2024, pp. 28 202–28 211.
  • [266] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics, 2019, pp. 4171–4186.
  • [267] Z. Yang, J. Wang, Y. Tang, K. Chen, H. Zhao, and P. H. Torr, “Lavt: Language-aware vision transformer for referring image segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 18 155–18 165.
  • [268] Y. Hu, Q. Wang, W. Shao, E. Xie, Z. Li, J. Han, and P. Luo, “Beyond one-to-one: Rethinking the referring image segmentation,” in Proc. IEEE Int. Conf. Comput. Vis., 2023, pp. 4067–4077.
  • [269] S.-A. Liu, Y. Zhang, Z. Qiu, H. Xie, Y. Zhang, and T. Yao, “Caris: Context-aware referring image segmentation,” in ACM Multimedia, 2023, pp. 779–788.
  • [270] D. Kim, N. Kim, C. Lan, and S. Kwak, “Shatter and gather: Learning referring image segmentation with text supervision,” in Proc. IEEE Int. Conf. Comput. Vis., 2023, pp. 15 547–15 557.
  • [271] Y. X. Chng, H. Zheng, Y. Han, X. Qiu, and G. Huang, “Mask grounding for referring image segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2024, pp. 26 573–26 583.
  • [272] C. Liu, H. Ding, and X. Jiang, “Gres: Generalized referring expression segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 23 592–23 601.
  • [273] J. Liu, H. Ding, Z. Cai, Y. Zhang, R. K. Satzoda, V. Mahadevan, and R. Manmatha, “Polyformer: Referring image segmentation as sequential polygon generation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 18 653–18 663.
  • [274] Z. Zhang, Y. Zhu, J. Liu, X. Liang, and W. Ke, “Coupalign: Coupling word-pixel with sentence-mask alignments for referring image segmentation,” in Neural Inf. Process. Syst., 2022, pp. 14 729–14 742.
  • [275] Z. Yang, J. Wang, Y. Tang, K. Chen, H. Zhao, and P. H. Torr, “Semantics-aware dynamic localization and refinement for referring image segmentation,” in Proc. AAAI Conf. Artif. Intell., 2023, pp. 3222–3230.
  • [276] N. A. Shah, V. VS, and V. M. Patel, “Lqmformer: Language-aware query mask transformer for referring image segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2024, pp. 12 903–12 913.
  • [277] J. Wu, Y. Jiang, B. Yan, H. Lu, Z. Yuan, and P. Luo, “Segment every reference object in spatial and temporal spaces,” in Proc. IEEE Int. Conf. Comput. Vis., 2023, pp. 2538–2550.
  • [278] R. Pi, L. Yao, J. Gao, J. Zhang, and T. Zhang, “Perceptiongpt: Effectively fusing visual perception into llm,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2024, pp. 27 124–27 133.
  • [279] B. Zhu, P. Jin, M. Ning, B. Lin, J. Huang, Q. Song, M. Pan, and L. Yuan, “Llmbind: A unified modality-task integration framework,” arXiv preprint arXiv:2402.14891, 2024.
  • [280] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing et al., “Judging llm-as-a-judge with mt-bench and chatbot arena,” in Neural Inf. Process. Syst., 2023, pp. 46 595–46 623.
  • [281] W. Sun, Y. Du, G. Liu, R. Kompella, and C. G. Snoek, “Training-free semantic segmentation via llm-supervision,” arXiv preprint arXiv:2404.00701, 2024.
  • [282] W. Ji, L. Li, H. Fei, X. Liu, X. Yang, J. Li, and R. Zimmermann, “Towards complex-query referring image segmentation: A novel benchmark,” arXiv preprint arXiv:2309.17205, 2023.
  • [283] Q. Yu, J. Li, W. Ye, S. Tang, and Y. Zhuang, “Interactive data synthesis for systematic vision adaptation via llms-aigcs collaboration,” arXiv preprint arXiv:2305.12799, 2023.
  • [284] C. Shang, Z. Song, H. Qiu, L. Wang, F. Meng, and H. Li, “Prompt-driven referring image segmentation with instance contrasting,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2024, pp. 4124–4134.
  • [285] X. Huang, G. Luo, C. Zhu, B. Tong, Y. Zhou, X. Sun, and R. Ji, “Deep instruction tuning for segment anything model,” arXiv preprint arXiv:2404.00650, 2024.
  • [286] J. Wang, B. Zhang, J. Pang, H. Chen, and W. Liu, “Rethinking prior information generation with clip for few-shot segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2024, pp. 3941–3951.
  • [287] S. Chen, F. Meng, R. Zhang, H. Qiu, H. Li, Q. Wu, and L. Xu, “Visual and textual prior guided mask assemble for few-shot segmentation and beyond,” arXiv preprint arXiv:2308.07539, 2023.
  • [288] J. Wang, Y. Liu, Q. Zhou, and F. Wang, “Language-guided few-shot semantic segmentation,” in IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 2024, pp. 5035–5039.
  • [289] Y. Jia, W. Huang, J. Gao, Q. Wang, and Q. Li, “Embedding generalized semantic knowledge into few-shot remote sensing segmentation,” arXiv preprint arXiv:2405.13686, 2024.
  • [290] S. You, L. Weng, and F. Gao, “Weakly supervised few-shot segmentation through textual prompt,” in IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 2024, pp. 7950–7954.
  • [291] M. Han, H. Zheng, C. Wang, Y. Luo, H. Hu, J. Zhang, and Y. Wen, “Partseg: Few-shot part segmentation via part-aware prompt learning,” arXiv preprint arXiv:2308.12757, 2023.
  • [292] H. Huang, X. Yuan, S. Yu, W. Zhao, O. Alfarraj, A. Tolba, and F. Xia, “Few-shot semantic segmentation for consumer electronics: An inter-class relation mining approach,” IEEE Transactions on Consumer Electronics, 2024.
  • [293] M.-Q. Le, T. V. Nguyen, T.-N. Le, T.-T. Do, M. N. Do, and M.-T. Tran, “Maskdiff: Modeling mask distribution with diffusion probabilistic model for few-shot instance segmentation,” in Proc. AAAI Conf. Artif. Intell., 2024, pp. 2874–2881.
  • [294] R. Bensaid, V. Gripon, F. Leduc-Primeau, L. Mauch, G. B. Hacene, and F. Cardinaux, “A novel benchmark for few-shot semantic segmentation in the era of foundation models,” arXiv preprint arXiv:2401.11311, 2024.
  • [295] D. Kang, P. Koniusz, M. Cho, and N. Murray, “Distilling self-supervised vision transformers for weakly-supervised few-shot classification & segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 19 627–19 638.
  • [296] D. Anand, V. Singhal, D. D. Shanbhag, S. KS, U. Patil, C. Bhushan, K. Manickam, D. Gui, R. Mullick, A. Gopal et al., “One-shot localization and segmentation of medical images with foundation models,” arXiv preprint arXiv:2310.18642, 2023.
  • [297] C.-B. Feng, Q. Lai, K. Liu, H. Su, and C.-M. Vong, “Boosting few-shot semantic segmentation via segment anything model,” arXiv preprint arXiv:2401.09826, 2024.
  • [298] R. Zhang, Z. Jiang, Z. Guo, S. Yan, J. Pan, X. Ma, H. Dong, P. Gao, and H. Li, “Personalize segment anything model with one shot,” in Int. Conf. Learn. Representations, 2024.
  • [299] C. Zhao and L. Shen, “Part-aware personalized segment anything model for patient-specific segmentation,” arXiv preprint arXiv:2403.05433, 2024.
  • [300] W. He, Y. Zhang, W. Zhuo, L. Shen, J. Yang, S. Deng, and L. Sun, “Apseg: Auto-prompt network for cross-domain few-shot semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2024, pp. 23 762–23 772.
  • [301] T. Meng, Y. Tao, and W. Yin, “Few-shot classification & segmentation using large language models agent,” arXiv preprint arXiv:2311.12065, 2023.
  • [302] A. Bar, Y. Gandelsman, T. Darrell, A. Globerson, and A. Efros, “Visual prompting via image inpainting,” in Neural Inf. Process. Syst., 2022, pp. 25 005–25 017.
  • [303] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 16 000–16 009.
  • [304] Y. Sun, Q. Chen, J. Wang, J. Wang, and Z. Li, “Exploring effective factors for improving visual in-context learning,” arXiv preprint arXiv:2304.04748, 2023.
  • [305] Y. Zhang, K. Zhou, and Z. Liu, “What makes good examples for visual in-context learning?” in Neural Inf. Process. Syst., 2023, pp. 17 773–17 794.
  • [306] J. Zhang, B. Wang, L. Li, Y. Nakashima, and H. Nagahara, “Instruct me more! random prompting for visual in-context learning,” in Proc. IEEE Winter Conf. Appl. Comput. Vis., 2024, pp. 2597–2606.
  • [307] X. Wang, W. Wang, Y. Cao, C. Shen, and T. Huang, “Images speak in images: A generalist painter for in-context visual learning,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 6830–6839.
  • [308] Y. Bai, X. Geng, K. Mangalam, A. Bar, A. L. Yuille, T. Darrell, J. Malik, and A. A. Efros, “Sequential modeling enables scalable learning for large vision models,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2024, pp. 22 861–22 872.
  • [309] D. Sheng, D. Chen, Z. Tan, Q. Liu, Q. Chu, J. Bao, T. Gong, B. Liu, S. Xu, and N. Yu, “Towards more unified in-context visual understanding,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2024, pp. 13 362–13 372.
  • [310] Z. Wang, Y. Jiang, Y. Lu, P. He, W. Chen, Z. Wang, and M. Zhou, “In-context learning unlocked for diffusion models,” in Neural Inf. Process. Syst., 2023, pp. 8542–8562.
  • [311] Y. Liu, M. Zhu, H. Li, H. Chen, X. Wang, and C. Shen, “Matcher: Segment anything with one shot using all-purpose feature matching,” in Int. Conf. Learn. Representations, 2024.
  • [312] M. Rakic, H. E. Wong, J. J. G. Ortiz, B. A. Cimini, J. V. Guttag, and A. V. Dalca, “Tyche: Stochastic in-context learning for medical image segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2024, pp. 11 159–11 173.
  • [313] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 3213–3223.
  • [314] S. Pan, T. Wang, R. L. Qiu, M. Axente, C.-W. Chang, J. Peng, A. B. Patel, J. Shelton, S. A. Patel, J. Roper et al., “2d medical image synthesis using transformer-based denoising diffusion probabilistic model,” Physics in Medicine & Biology, vol. 68, no. 10, p. 105004, 2023.
  • [315] A. Toker, M. Eisenberger, D. Cremers, and L. Leal-Taixé, “Satsynth: Augmenting image-mask pairs through diffusion models for aerial semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2024, pp. 27 695–27 705.
  • [316] Y. Li, Y. Du, K. Zhou, J. Wang, W. X. Zhao, and J.-R. Wen, “Evaluating object hallucination in large vision-language models,” arXiv preprint arXiv:2305.10355, 2023.