VGDiffZero: Text-to-image Diffusion Models Can Be Zero-shot Visual Grounders

Abstract

Large-scale text-to-image diffusion models have shown impressive capabilities for generative tasks by leveraging strong vision-language alignment from pre-training. However, most vision-language discriminative tasks require extensive fine-tuning on carefully labeled datasets to acquire such alignment, at great cost in time and computing resources. In this work, we explore directly applying a pre-trained generative diffusion model to the challenging discriminative task of visual grounding without any fine-tuning or additional training data. Specifically, we propose VGDiffZero, a simple yet effective zero-shot visual grounding framework based on text-to-image diffusion models. We also design a comprehensive region-scoring method that considers both the global and local contexts of each isolated proposal. Extensive experiments on RefCOCO, RefCOCO+, and RefCOCOg show that VGDiffZero achieves strong performance on zero-shot visual grounding. Our code is available at https://github.com/xuyang-liu16/VGDiffZero.

Index Terms— Visual grounding, diffusion models, zero-shot learning, vision-language models

1 Introduction

Recently, large-scale vision-language pre-trained models [1, 2] have demonstrated strong performance across a wide range of downstream tasks. Most vision-language (VL) tasks can be broadly categorized into two types: generative tasks and discriminative tasks. For VL generative tasks, text-to-image diffusion models such as Stable Diffusion [3] have shown a powerful ability to generate high-fidelity and diverse images from text descriptions. Specifically, these generative diffusion models are first pre-trained on large-scale text-image pair datasets such as LAION-5B [4] to learn the correlations between textual descriptions and visual concepts, and are then applied to VL generative tasks such as image editing [5] and inpainting [6]. For VL discriminative tasks, the predominant approach is to obtain vision-language alignment abilities from large-scale pre-training datasets, which can then be utilized to facilitate downstream vision-language reasoning tasks [7, 8, 9]. Among the various VL discriminative tasks, visual grounding [10] is one of the most challenging: the aim is to localize a target object in an image given a textual description. Most supervised visual grounding works [11, 12, 13] study how to effectively fuse cross-modality features extracted independently by each encoder. Though these methods have achieved good performance on general visual grounding benchmarks such as RefCOCO and RefCOCO+ [10], training and fine-tuning them is expensive and time-consuming. Moreover, collecting task-specific data is even more challenging, since it requires accurate descriptions of the target area as well as high-quality bounding box annotations [14].

Fig. 1: Illustration of two types of vision-language tasks. Motivated by the strong abilities of text-to-image diffusion models, we propose VGDiffZero for zero-shot visual grounding.

Recent studies have started to leverage pre-trained diffusion models for a set of discriminative tasks, including classification [15], segmentation [16, 17] and image-text matching [18], under different settings. This progress demonstrates two key advantages of text-to-image diffusion models: (1) strong vision-language alignment abilities, and (2) sufficient knowledge of spatial relations and fine-grained disentangled concepts. These two advantages raise an intriguing question: is it possible to directly apply pre-trained text-to-image diffusion models to visual grounding, without costly fine-tuning?

Motivated by the above observations, in this paper we seek to directly leverage the power of pre-trained generative diffusion models, particularly Stable Diffusion [3], for the VL discriminative task of visual grounding, as shown in Figure 1. Specifically, we propose VGDiffZero, a simple yet novel zero-shot visual grounding framework using text-to-image diffusion models. Overall, VGDiffZero regards visual grounding as an isolated proposal selection process, carried out by our proposed comprehensive region-scoring method in two stages: Noise Injection and Noise Prediction. In the first stage, a series of detected object proposals undergo masking and cropping to obtain isolated proposals with global and local visual information. These isolated proposals are then encoded into latent vectors and injected with Gaussian noise. In the second stage, each noised latent vector, along with the text embeddings encoded by a pre-trained CLIP text encoder, is fed into the denoising UNet to predict the injected noise. By comparing all pairs of predicted and sampled noise, the proposal with the minimum error is output as the final prediction.

In general, our main contributions are threefold: (1) We propose a novel diffusion-based framework, termed VGDiffZero, for zero-shot visual grounding without any additional fine-tuning. To the best of our knowledge, this is the first attempt to tackle visual grounding using a generative diffusion model under the zero-shot setting. (2) We propose a comprehensive region-scoring method that incorporates global and local contexts of input images to enable accurate proposal selection. (3) Extensive experiments on the visual grounding benchmarks RefCOCO [10], RefCOCO+ [10] and RefCOCOg [19] demonstrate the effectiveness of our proposed VGDiffZero.

2 Methodology

In this section, we introduce VGDiffZero, our proposed diffusion-based framework for zero-shot visual grounding. We first revisit Stable Diffusion, the foundation of our model, and then delineate the two key stages in VGDiffZero.

2.1 Preliminaries: Stable Diffusion

Diffusion models [20, 21, 3] are a class of generative models that train neural networks to reverse a fixed forward diffusion process that gradually adds noise to the data. Stable Diffusion [3] is a latent diffusion model, which implements the diffusion process in the latent space rather than the data space. Specifically, Stable Diffusion consists of three components: a Variational Autoencoder (VAE) including an encoder and a decoder, a diffusion model including the denoising UNet [22] and the DDPM Sampler [21], and the text encoder of the pre-trained CLIP [1].

During training, Stable Diffusion learns to invert the latent diffusion process over image-text pairs $(x,y)$. To be specific, the VAE encoder first maps an image $x$ to latent vectors $z$, and Gaussian noise $\epsilon\sim\mathcal{N}(0,I)$ is then iteratively added to $z$:

q(z_{t}|z_{t-1})=\mathcal{N}(z_{t};\sqrt{1-\beta_{t}}z_{t-1},\beta_{t}I),\quad t=1,\ldots,T,

(1)

where $q(z_{t}|z_{t-1})$ is the conditional density of $z_{t}$ given $z_{t-1}$, $\left(\beta_{t}\right)^{T}_{t=1}$ are hyperparameters that determine the noise schedule, and $T$ is the total number of timesteps. The denoising UNet takes $z_{t}$, the current timestep $t$, and the text embeddings $\tau_{\theta}(y)$ as inputs to predict the noise $\epsilon_{t}$ as:

\epsilon_{t}=\text{UNet}(z_{t},t,\tau_{\theta}(y)),

(2)

where the text embeddings $\tau_{\theta}(y)$ are obtained from the text $y$ via the CLIP text encoder. The training objective is to minimize the prediction error $e$:

e=||\epsilon-\epsilon_{t}(z_{t},t,\tau_{\theta}(y))||^{2},

(3)

where $\epsilon$ and $\epsilon_{t}$ respectively denote the sampled Gaussian noise and the noise predicted by the denoising UNet.
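The forward noising of Eq. (1) and the error of Eq. (3) can be illustrated with a small NumPy sketch. This is not the authors' code: the linear schedule in `linear_betas` and its endpoint values are assumptions chosen for illustration, and the closed-form shortcut $z_{t}=\sqrt{\bar{\alpha}_{t}}z_{0}+\sqrt{1-\bar{\alpha}_{t}}\epsilon$ with $\bar{\alpha}_{t}=\prod_{s\leq t}(1-\beta_{s})$ is the standard DDPM identity [21] rather than something stated in the text above.

```python
import numpy as np


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Assumed linear noise schedule (beta_1, ..., beta_T); values are illustrative."""
    return np.linspace(beta_start, beta_end, num_steps)


def noise_step_by_step(z0, betas, t, rng):
    """Apply the transition of Eq. (1) t times, starting from z_0."""
    z = np.array(z0, dtype=float)
    for beta in betas[:t]:
        z = np.sqrt(1.0 - beta) * z + np.sqrt(beta) * rng.standard_normal(z.shape)
    return z


def noise_closed_form(z0, betas, t, eps):
    """Jump to z_t in one step: sqrt(abar_t) * z_0 + sqrt(1 - abar_t) * eps."""
    alpha_bar = np.prod(1.0 - betas[:t])
    return np.sqrt(alpha_bar) * z0 + np.sqrt(1.0 - alpha_bar) * eps


def prediction_error(eps, eps_pred):
    """Squared L2 error of Eq. (3) between sampled and predicted noise."""
    return float(np.sum((eps - eps_pred) ** 2))
```

Both noising routes yield samples from the same distribution over $z_{t}$; the closed form is convenient because it exposes the sampled noise $\epsilon$ that the prediction is compared against.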

During generation, the input text $y$ is first encoded into text embeddings $\tau_{\theta}(y)$ via the text encoder of CLIP. The latent vector $z_{T}$ is sampled from the standard normal distribution $\mathcal{N}(0,I)$, and the denoising UNet takes $z_{T}$ and $\tau_{\theta}(y)$ and recursively removes the noise to recover the denoised latent vector $z_{0}$. Finally, the denoised latent vector $z_{0}$ is decoded into an image via the VAE decoder. In this way, Stable Diffusion is able to generate images conditioned on the input text.
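The recursive denoising described above can be sketched as the following loop. It assumes the standard DDPM ancestral update with variance $\beta_{t}$ [21] and an illustrative linear schedule; `predict_noise` is a hypothetical stand-in for the text-conditioned UNet (the text embeddings are assumed to be bound inside it), and `make_oracle` is a toy predictor that knows the target latent, used only to show that the loop behaves as expected.

```python
import numpy as np


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Assumed linear schedule; returns betas, alphas and cumulative alpha_bars."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    return betas, alphas, np.cumprod(alphas)


def ddpm_sample(predict_noise, shape, num_steps, rng):
    """Start from z_T ~ N(0, I) and denoise step by step down to z_0."""
    betas, alphas, alpha_bars = make_schedule(num_steps)
    z = rng.standard_normal(shape)
    for t in range(num_steps, 0, -1):
        i = t - 1  # array index of timestep t
        eps_hat = predict_noise(z, t)
        mean = (z - betas[i] / np.sqrt(1.0 - alpha_bars[i]) * eps_hat) / np.sqrt(alphas[i])
        # No fresh noise is added on the final step (t = 1).
        noise = rng.standard_normal(shape) if t > 1 else np.zeros(shape)
        z = mean + np.sqrt(betas[i]) * noise
    return z


def make_oracle(z0, num_steps):
    """Toy predictor that returns the exact noise separating z_t from a known z_0."""
    _, _, alpha_bars = make_schedule(num_steps)

    def predict_noise(z_t, t):
        alpha_bar = alpha_bars[t - 1]
        return (z_t - np.sqrt(alpha_bar) * z0) / np.sqrt(1.0 - alpha_bar)

    return predict_noise
```

With the oracle predictor the loop returns the known latent, which is a quick sanity check of the update rule; a real model would replace the oracle with the UNet of Eq. (2).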

2.2 Zero-shot Visual Grounding via Diffusion Models

Generally, the visual grounding task can be viewed as the process of selecting the proposal that best fits the textual query [11, 23]. This involves two critical aspects: (1) Generating high-quality isolated proposals, by taking into account comprehensive visual information of each proposal, including the location within the global image as well as the internal visual information within the local proposal. (2) Establishing fine-grained alignments between region proposals and the textual query, by leveraging the vision-language matching abilities from large-scale pre-training datasets. To this end, we adopt VGDiffZero equipped with a designed comprehensive region-scoring method for zero-shot visual grounding. As depicted in Figure 2, VGDiffZero consists of two key stages: Noise Injection and Noise Prediction.

Noise Injection. Given that visual grounding involves identifying the region proposal best aligned with the textual query, the first step is to generate multiple object proposals from the input image. Following previous works [11, 23], we adopt a pre-trained object detector, i.e., Faster R-CNN [24], to extract candidate regions in the image. In order to isolate proposals while preserving both their locations within the whole image and their internal visual information, we devise a comprehensive approach that combines masking out the entire image except for the proposal region with cropping to retain only the proposal region. After processing all proposals via this comprehensive isolation approach, we acquire two sets of isolated proposals, the global set and the local set (corresponding to Global and Local in Figure 2, respectively). These are then individually encoded by the VAE encoder to obtain the latent representations $Z_{0}$ of each proposal. Gaussian noise $\epsilon\sim\mathcal{N}(0,I)$ is injected into each latent vector to produce the noised latent representations $Z_{\text{noised}}$. In summary, the Noise Injection process accomplishes three objectives: isolating region proposals, encoding proposals into latent representations, and injecting noise to diffuse the latents forward.
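A minimal sketch of the Noise Injection stage follows, under several assumptions that the text does not fix: boxes are integer `(x1, y1, x2, y2)` pixel coordinates, masked-out pixels are set to zero, resizing uses nearest-neighbour sampling, and noise is injected with the closed-form DDPM expression for a single noise level `alpha_bar_t`. The `encode` argument is a hypothetical placeholder for the VAE encoder, and all function names are illustrative.

```python
import numpy as np


def mask_proposal(image, box):
    """Global view: keep the box at its original location, zero out the rest."""
    x1, y1, x2, y2 = box
    masked = np.zeros_like(image)
    masked[y1:y2, x1:x2] = image[y1:y2, x1:x2]
    return masked


def crop_proposal(image, box):
    """Local view: keep only the pixels inside the box."""
    x1, y1, x2, y2 = box
    return image[y1:y2, x1:x2].copy()


def resize_nearest(image, size):
    """Nearest-neighbour resize of an H x W x C array to size x size x C."""
    height, width = image.shape[:2]
    rows = np.arange(size) * height // size
    cols = np.arange(size) * width // size
    return image[rows][:, cols]


def inject_noise(latent, alpha_bar_t, rng):
    """Return the noised latent together with the sampled noise."""
    eps = rng.standard_normal(latent.shape)
    noised = np.sqrt(alpha_bar_t) * latent + np.sqrt(1.0 - alpha_bar_t) * eps
    return noised, eps


def isolate_and_noise(image, boxes, encode, alpha_bar_t, size=512, seed=0):
    """For each box, build the global and local views, encode them, add noise."""
    rng = np.random.default_rng(seed)
    results = []
    for box in boxes:
        views = {"global": mask_proposal(image, box), "local": crop_proposal(image, box)}
        noised = {}
        for name, view in views.items():
            latent = encode(resize_nearest(view, size))  # placeholder for the VAE encoder
            noised[name] = inject_noise(latent, alpha_bar_t, rng)
        results.append(noised)
    return results
```

Each entry of the returned list holds a `(noised_latent, eps)` pair for the global and the local view of one proposal, which is what the Noise Prediction stage needs in order to compare predicted and sampled noise.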

Noise Prediction. Given an input text $y$, it is first encoded by the pre-trained CLIP text encoder to derive the text embeddings $\tau_{\theta}(y)$. The denoising UNet takes the two sets of noised latent vectors $Z_{\text{noised}}$ and the text embeddings $\tau_{\theta}(y)$ to predict the sampled noise $\epsilon$ for each isolated proposal. Subsequently, two sets of prediction errors, $e_{\text{mask}}$ and $e_{\text{crop}}$, are computed for each proposal by calculating the deviation between the predicted noise and the sampled noise (corresponding to Pred Noise and Noise in Figure 2, respectively). Since Stable Diffusion is pre-trained on semantically consistent image-text pairs, a smaller error indicates that the model more accurately predicts the noise conditioned on $z_{t}$ and $\tau_{\theta}(y)$, meaning the current region and the text are more semantically aligned. Finally, we compute the total error $e_{\text{total}}=e_{\text{mask}}+e_{\text{crop}}$ for each proposal and select the proposal with the minimum total error $e_{\text{total}}$ as the prediction output. In this way, our proposed VGDiffZero can consider both the global and local contexts of each isolated proposal for comprehensive proposal selection.
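The scoring rule above can be sketched as follows. The sketch assumes a single fixed timestep and one noise sample per view, neither of which is specified in the text, and `predict_noise` is a hypothetical callable standing in for the denoising UNet. `make_toy_predictor` is purely illustrative: it treats the "text embedding" as the latent that the text describes, so that a matching proposal yields zero error.

```python
import numpy as np


def noise_error(predict_noise, latent, text_emb, t, alpha_bar_t, rng):
    """Noise the latent, predict the noise, return the squared L2 error."""
    eps = rng.standard_normal(latent.shape)
    z_t = np.sqrt(alpha_bar_t) * latent + np.sqrt(1.0 - alpha_bar_t) * eps
    eps_pred = predict_noise(z_t, t, text_emb)
    return float(np.sum((eps - eps_pred) ** 2))


def select_proposal(predict_noise, masked_latents, cropped_latents,
                    text_emb, t, alpha_bar_t, seed=0):
    """Return the index minimising e_total = e_mask + e_crop, plus all totals."""
    rng = np.random.default_rng(seed)
    totals = []
    for z_mask, z_crop in zip(masked_latents, cropped_latents):
        e_mask = noise_error(predict_noise, z_mask, text_emb, t, alpha_bar_t, rng)
        e_crop = noise_error(predict_noise, z_crop, text_emb, t, alpha_bar_t, rng)
        totals.append(e_mask + e_crop)
    return int(np.argmin(totals)), totals


def make_toy_predictor(alpha_bar_t):
    """Toy stand-in for the UNet: exact only when the latent equals text_emb."""
    def predict_noise(z_t, t, text_emb):
        return (z_t - np.sqrt(alpha_bar_t) * text_emb) / np.sqrt(1.0 - alpha_bar_t)
    return predict_noise
```

In this toy setting the error of a proposal grows with the squared distance between its latents and the latent implied by the text, which mirrors the intuition that a lower noise-prediction error signals better region-text alignment.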

3 Experiments

Methods	RefCOCO val	RefCOCO test A	RefCOCO test B	RefCOCO+ val	RefCOCO+ test A	RefCOCO+ test B	RefCOCOg val	RefCOCOg test
Random	15.61	13.47	18.23	16.30	13.29	19.98	18.79	18.35
CPT-Blk	26.90	27.50	27.40	25.40	25.00	27.00	32.10	32.30
Cropping	26.04	26.34	28.95	26.34	26.28	29.41	31.64	32.37
Masking	27.17	29.47	26.21	27.64	29.62	27.29	32.66	32.56
VGDiffZero w/ Single IPM	26.78	29.56	27.28	27.41	29.55	27.21	32.82	32.39
VGDiffZero	27.95	30.34	29.11	28.39	30.79	29.79	33.53	33.24

Table 1: Comparison of accuracy (%) on RefCOCO [10], RefCOCO+ [10] and RefCOCOg [19] under the zero-shot setting.

3.1 Experimental Settings

VGDiffZero Setup. We use the detected object proposals from a pre-trained Faster R-CNN [24], and each isolated proposal (both masked and cropped) is resized to $512\times 512$. VGDiffZero is built on the pre-trained Stable Diffusion 2.1-base [3] with the DDPM Sampler [21] and 1,000 timesteps. We use the text encoder initialized from CLIP-ViT-H/14 [1].

Datasets and Evaluation Metrics. We evaluate VGDiffZero on three widely-used VG benchmarks: RefCOCO [10], RefCOCO+ [10] and RefCOCOg [19]. The RefCOCO, RefCOCO+, and RefCOCOg datasets contain 19,994, 19,992, and 26,771 images with 142,210, 141,564, and 104,560 referring expressions, respectively. RefCOCO and RefCOCO+ have shorter expressions (avg 1.6 nouns, 3.6 words), while RefCOCOg has longer, more complex expressions (avg 2.8 nouns, 8.4 words). We use Accuracy@0.5 as the evaluation metric, under which a prediction is counted as correct when its intersection-over-union (IoU) with the ground-truth box exceeds 0.5.
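For reference, Accuracy@0.5 can be computed as in the sketch below. It assumes boxes in `(x1, y1, x2, y2)` format and the common convention that a prediction is correct when its intersection-over-union (IoU) with the ground-truth box is strictly greater than the threshold; the function names are illustrative.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_threshold(predicted_boxes, gt_boxes, threshold=0.5):
    """Percentage of predictions whose IoU with the ground truth exceeds the threshold."""
    hits = sum(iou(p, g) > threshold for p, g in zip(predicted_boxes, gt_boxes))
    return 100.0 * hits / len(gt_boxes)
```

For example, a prediction that overlaps only half of an equally sized ground-truth box has an IoU of 1/3 and is counted as incorrect.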

Baselines. We compare VGDiffZero with related methods: (1) Random: Randomly selecting a detected proposal as the prediction. (2) CPT-Blk [23]: A strong zero-shot visual grounding baseline that shades detected proposals with different colors and uses a masked language prompt in which the referring expression is followed by “in [MASK] color”. The color with the highest probability by the pre-trained masked language model VinVL [7] is chosen as prediction. (3) Cropping: Isolating detected proposals by cropping the image around each proposal, then passing the cropped images through VGDiffZero. (4) Masking: Similar to Cropping, but using masking to isolate detected proposals before passing through VGDiffZero. (5) VGDiffZero w/ Single IPM: Given two sets of prediction errors from isolating proposal methods (IPM), cropping and masking, selecting the proposal with the lowest prediction error as the final prediction.
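The diffusion-based variants above differ only in how the per-proposal errors are combined. The sketch below shows one reading of those rules; in particular, interpreting "Single IPM" as taking the lowest error across both sets is an assumption, and the rule names are illustrative labels rather than identifiers from the released code.

```python
import numpy as np


def select(e_mask, e_crop, rule):
    """Pick a proposal index from masking and cropping errors under a given rule."""
    e_mask = np.asarray(e_mask, dtype=float)
    e_crop = np.asarray(e_crop, dtype=float)
    if rule == "masking":
        scores = e_mask
    elif rule == "cropping":
        scores = e_crop
    elif rule == "single_ipm":
        # Lowest single error over both isolation methods (assumed reading).
        scores = np.minimum(e_mask, e_crop)
    elif rule == "sum":
        # Full method: e_total = e_mask + e_crop.
        scores = e_mask + e_crop
    else:
        raise ValueError(f"unknown rule: {rule}")
    return int(np.argmin(scores))
```

With errors `e_mask = [3.0, 1.0, 2.5]` and `e_crop = [0.5, 2.0, 2.4]`, the "cropping" and "single_ipm" rules pick proposal 0, while "masking" and "sum" pick proposal 1, which illustrates how a single very low error in one view can dominate the Single IPM rule.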

3.2 Experimental Results and Analysis

Quantitative Comparison. The main experimental results are reported in Table 1, from which we can observe that: (1) Our proposed VGDiffZero outperforms the other zero-shot visual grounding baselines on every split. Notably, on RefCOCO+ test A, VGDiffZero achieves 5.79 percentage points higher accuracy than CPT-Blk [23]. This indicates that VGDiffZero can effectively and directly leverage the vision-language alignment abilities learned by pre-trained VL generative models to perform well on the visual grounding task. (2) Masking to isolate object proposals outperforms cropping on six of the eight splits (all except test B of RefCOCO and RefCOCO+), which suggests that preserving the global location of proposals plays a role in the visual grounding task. (3) Using both masking and cropping to isolate proposals achieves higher accuracy than using either method alone. This indicates that considering both the global and local contexts of isolated proposals enables more robust performance on visual grounding tasks.
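The comparisons above can be checked directly against Table 1. The short script below copies the table values and recomputes the quoted gap and the per-split comparison between masking and cropping; the helper names are illustrative.

```python
SPLITS = [
    "RefCOCO val", "RefCOCO test A", "RefCOCO test B",
    "RefCOCO+ val", "RefCOCO+ test A", "RefCOCO+ test B",
    "RefCOCOg val", "RefCOCOg test",
]

# Accuracy (%) copied from Table 1, in the order of SPLITS.
TABLE_1 = {
    "CPT-Blk": [26.90, 27.50, 27.40, 25.40, 25.00, 27.00, 32.10, 32.30],
    "Cropping": [26.04, 26.34, 28.95, 26.34, 26.28, 29.41, 31.64, 32.37],
    "Masking": [27.17, 29.47, 26.21, 27.64, 29.62, 27.29, 32.66, 32.56],
    "VGDiffZero w/ Single IPM": [26.78, 29.56, 27.28, 27.41, 29.55, 27.21, 32.82, 32.39],
    "VGDiffZero": [27.95, 30.34, 29.11, 28.39, 30.79, 29.79, 33.53, 33.24],
}


def gap(method_a, method_b, split):
    """Accuracy difference (percentage points) of method_a over method_b on a split."""
    index = SPLITS.index(split)
    return TABLE_1[method_a][index] - TABLE_1[method_b][index]


def wins(method_a, method_b):
    """Splits on which method_a is strictly more accurate than method_b."""
    return [s for s in SPLITS if gap(method_a, method_b, s) > 0]
```

Calling `wins("Masking", "Cropping")` returns six of the eight splits, and `gap("VGDiffZero", "CPT-Blk", "RefCOCO+ test A")` reproduces the 5.79-point difference up to floating-point rounding.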

Methods	RefCOCO	RefCOCO+	RefCOCOg
w/ core-exp	26.86	27.13	34.32
w/ full-exp	27.95	28.39	33.53

Table 2: Accuracy on the validation sets of RefCOCO, RefCOCO+ and RefCOCOg given the core expression and full expression, respectively.

Effect of Different Expression Processing Methods. To further investigate the impact of different expression processing methods on diffusion-based visual grounding, we compare two approaches: using the full expression as input to the text encoder of CLIP versus using the core expression extracted by spaCy [25]. As summarized in Table 2, using the full expression as text input performs better on the validation sets of RefCOCO and RefCOCO+, suggesting that preserving the complete expression is more advantageous when the given expressions are short and contain few objects, such as “A young woman in lightblue skiwear”. In contrast, the core expression performs better on RefCOCOg (34.32 vs. 33.53), suggesting that extracting the core noun phrase is more suitable for handling complex sentences with multiple objects, such as “A little rabbit crouching in the bushes under the shade of a tree”, since spaCy can extract the core expression “A little rabbit” to better separate the object from its surroundings.

SD Ver.	RefCOCO	RefCOCO+	RefCOCOg
1-2	27.11	26.73	32.34
1-4	27.64	27.61	32.73
1-5	27.86	27.97	32.81
2-1	27.95	28.39	33.53

Table 3: Accuracy on the validation sets of RefCOCO, RefCOCO+ and RefCOCOg using different versions of Stable Diffusion (short for SD Ver.).

Effect of Different Pre-trained Models. Since VGDiffZero is built on pre-trained Stable Diffusion [3], it is necessary to investigate how the choice of pre-trained model impacts visual grounding performance. We compare four versions of pre-trained Stable Diffusion and summarize the results on the three benchmark validation sets in Table 3. From SD-1-2 to SD-2-1, accuracy improves steadily on all three benchmarks. This suggests that larger-scale pre-training improves the vision-language alignment abilities of text-to-image diffusion models, which are precisely what visual grounding requires, thus enabling superior performance.

4 Conclusions

In this paper, we propose VGDiffZero, a novel zero-shot visual grounding framework that leverages pre-trained text-to-image diffusion models’ vision-language alignment abilities. Through the designed comprehensive region-scoring method, our VGDiffZero can consider both the global and local contexts of each isolated object proposal. Extensive experimental results demonstrate that VGDiffZero achieves satisfactory performance on three general visual grounding benchmarks.

Acknowledgements. This research was supported by STI 2030—Major Projects (2022ZD0208800), NSFC General Program (Grant No. 62176215), and NSFC Young Scientists Fund (Grant No. 62001316).

References

[1] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al., “Learning transferable visual models from natural language supervision,” in ICML, 2021.
[2] OpenAI, “GPT-4 technical report,” 2023.
[3] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer, “High-resolution image synthesis with latent diffusion models,” in CVPR, 2022.
[4] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al., “LAION-5B: An open large-scale dataset for training next generation image-text models,” arXiv preprint arXiv:2210.08402, 2022.
[5] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani, “Imagic: Text-based real image editing with diffusion models,” in CVPR, 2023.
[6] Shaoan Xie, Zhifei Zhang, Zhe Lin, Tobias Hinz, and Kun Zhang, “SmartBrush: Text and shape guided object inpainting with diffusion model,” in CVPR, 2023.
[7] Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao, “VinVL: Revisiting visual representations in vision-language models,” in CVPR, 2021.
[8] Ting Liu, Yue Hu, Wansen Wu, Youkai Wang, Kai Xu, and Quanjun Yin, “Dap: Domain-aware prompt learning for vision-and-language navigation,” arXiv preprint arXiv:2311.17812, 2023.
[9] Ting Liu, Wansen Wu, Yue Hu, Youkai Wang, Kai Xu, and Quanjun Yin, “Prompt-based context-and domain-aware pretraining for vision and language navigation,” arXiv preprint arXiv:2309.03661, 2023.
[10] Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg, “Modeling context in referring expressions,” in ECCV, 2016.
[11] Licheng Yu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, and Tamara L Berg, “MAttNet: Modular attention network for referring expression comprehension,” in CVPR, 2018.
[12] Wei Su, Peihan Miao, Huanzhang Dou, Gaoang Wang, Liang Qiao, Zheyang Li, and Xi Li, “Language adaptive weight generation for multi-task visual grounding,” in CVPR, 2023.
[13] Shilong Liu, Shijia Huang, Feng Li, Hao Zhang, Yaoyuan Liang, Hang Su, Jun Zhu, and Lei Zhang, “DQ-DETR: Dual query detection transformer for phrase extraction and grounding,” in AAAI, 2023.
[14] Seonghoon Yu, Paul Hongsuck Seo, and Jeany Son, “Zero-shot referring image segmentation with global-local context features,” in CVPR, 2023.
[15] Alexander C Li, Mihir Prabhudesai, Shivam Duggal, Ellis Brown, and Deepak Pathak, “Your diffusion model is secretly a zero-shot classifier,” in ICCV, 2023.
[16] Wenliang Zhao, Yongming Rao, Zuyan Liu, Benlin Liu, Jie Zhou, and Jiwen Lu, “Unleashing text-to-image diffusion models for visual perception,” in ICCV, 2023.
[17] Laurynas Karazija, Iro Laina, Andrea Vedaldi, and Christian Rupprecht, “Diffusion models for zero-shot open-vocabulary segmentation,” arXiv preprint arXiv:2306.09316, 2023.
[18] Xuehai He, Weixi Feng, Tsu-Jui Fu, Varun Jampani, Arjun Akula, Pradyumna Narayana, Sugato Basu, William Yang Wang, and Xin Eric Wang, “Discriminative diffusion models as few-shot vision and language learners,” arXiv preprint arXiv:2305.10722, 2023.
[19] Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy, “Generation and comprehension of unambiguous object descriptions,” in CVPR, 2016.
[20] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli, “Deep unsupervised learning using nonequilibrium thermodynamics,” in ICML, 2015.
[21] Jonathan Ho, Ajay Jain, and Pieter Abbeel, “Denoising diffusion probabilistic models,” in NeurIPS, 2020.
[22] Olaf Ronneberger, Philipp Fischer, and Thomas Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in MICCAI, 2015.
[23] Yuan Yao, Ao Zhang, Zhengyan Zhang, Zhiyuan Liu, Tat-Seng Chua, and Maosong Sun, “CPT: Colorful prompt tuning for pre-trained vision-language models,” arXiv preprint arXiv:2109.11797, 2021.
[24] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in NeurIPS, 2015.
[25] Matthew Honnibal and Mark Johnson, “An improved non-monotonic transition system for dependency parsing,” in EMNLP, 2015.