¹ S-Lab, Nanyang Technological University   ² CCDS, Nanyang Technological University   ³ SenseTime Research
Email: lanm0002@e.ntu.edu.sg, {chaofeng.chen, ypke}@ntu.edu.sg,
{wangxinjiang, fenglitong, wayne.zhang}@sensetime.com

Supplementary – ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference

Mengcheng Lan ¹, Chaofeng Chen ¹, Yiping Ke ², Xinjiang Wang ³, Litong Feng ³ (corresponding author), Wayne Zhang ³

Appendix

Appendix A Ablation study with different backbones and datasets

Fig. 1 shows the ablation results for each dataset across different CLIP models. Removing the residual connection and the FFN markedly improves the open-vocabulary semantic segmentation performance of CLIP on all datasets. The improvement is especially pronounced for the ViT-L/14 architecture, whose residual connection has a larger norm. These findings support the effectiveness of the proposed method.

Figure 1: Ablation study on each dataset under different architectures and attention mechanisms. Panels: (a) CLIP-B/16, (b) OpenCLIP-B/16, (c) CLIP-L/14, (d) OpenCLIP-L/14. Markers: ○ original CLIP; △ CLIP w/o residual connection; ☆ CLIP w/o residual connection and FFN.

Appendix B Impact of channel-wise residual features

In this part, we investigate the effect of residual features with low intensity. Specifically, we conduct experiments by selectively reintroducing channels from residual features that have lower average values. We report the results of eliminating the top $\beta$ high-value channels and the effect of normalizing $X_{\textup{res}}$ in Tab. 1. The best performance is achieved when $\beta \geq 70\%$. Additionally, normalizing $X_{\textup{res}}$ significantly reduces its scale, resulting in performance comparable to $\beta \geq 70\%$. These findings support our hypothesis that high-level supervision in CLIP emphasizes global feature direction in the residual latent space, which introduces noise into the residual features. For simplicity, we eliminate all channels in $X_{\textup{res}}$.
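
The channel-wise ablation described above can be illustrated with a minimal NumPy sketch. The function names `drop_top_channels` and `l2_normalize`, the ranking of channels by mean absolute value, and the per-token L2 normalization are assumptions made for illustration only; the toy tensors are random and are not CLIP features.

```python
import numpy as np

def drop_top_channels(x_res, beta):
    """Zero out the top `beta` fraction of channels of x_res.

    x_res: array of shape (num_tokens, num_channels).
    beta:  fraction in [0, 1]; beta=1.0 removes the residual entirely.
    Channels are ranked by mean absolute value over tokens (an assumed criterion).
    """
    num_channels = x_res.shape[1]
    k = int(round(beta * num_channels))
    strength = np.abs(x_res).mean(axis=0)
    top = np.argsort(strength)[::-1][:k]  # indices of the k strongest channels
    out = x_res.copy()
    out[:, top] = 0.0
    return out

def l2_normalize(x, eps=1e-12):
    """Scale every token vector to unit L2 norm (assumed form of the 'Norm' setting)."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

# Toy features: 4 tokens, 8 channels, with one dominant residual channel.
rng = np.random.default_rng(0)
x_attn = rng.normal(size=(4, 8))
x_res = rng.normal(size=(4, 8))
x_res[:, 3] += 50.0

x_sum_beta = x_attn + drop_top_channels(x_res, beta=0.25)  # drops 2 of 8 channels
x_sum_norm = x_attn + l2_normalize(x_res)                  # residual rescaled to unit norm
```

Setting `beta=1.0` in this sketch zeroes the whole residual, which corresponds to the simplification of eliminating all channels.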

Table 1: Average performance (mIoU) over all 8 datasets.
$\beta$ (%)   0     5     10    30    50    70    100   Norm
Avg.          22.1  30.2  33.5  37.4  38.0  38.1  38.1  38.1

Table 2: Average performance (mIoU) over 5 datasets without background class based on ViT-Base (upper block) and ViT-Large (lower block) architectures. The last column gives the gain in Avg. over the corresponding baseline.
Method                              VOC20  Context59  Stuff  Cityscape  ADE20K  Avg.   Gain
ViT-Base
CLIP [radford2021learning]           41.8        9.2    4.4        5.5     2.1  12.6
  +ClearCLIP                         80.9       35.9   23.9       30.0    16.7  37.5  +24.9
BLIP [li2022blip]                    37.3        7.8    5.4        4.3     2.0  11.4
  +ClearCLIP                         73.5       31.4   21.3       23.8    13.5  32.7  +21.3
OpenCLIP [cherti2023reproducible]    47.2        9.0    5.0        5.1     2.9  13.8
  +ClearCLIP                         81.4       34.1   23.1       31.8    18.9  37.9  +24.1
MetaCLIP [xu2023demystifying]        35.4        8.1    4.3        5.0     2.2  11.0
  +ClearCLIP                         78.3       34.8   23.5       27.9    17.4  36.4  +25.4
MaskCLIP [zhou2022extract]           74.9       26.4   16.4       12.6     9.8  28.0
  +ClearCLIP                         61.4       28.3   18.4       24.7    13.6  29.5   +1.8
SCLIP [wang2023sclip]                78.2       33.0   21.1       29.1    14.6  35.2
  +ClearCLIP                         77.9       35.6   23.6       31.0    17.0  37.9   +1.6
GEM [bousselham2023grounding]        79.9       35.9   23.7       30.8    15.7  37.2
  +ClearCLIP                         80.2       36.5   24.4       30.5    17.4  37.8   +0.6
ViT-Large
CLIP [radford2021learning]           15.8        4.5    2.4        2.9     1.2   5.4
  +ClearCLIP                         80.0       29.6   19.9       27.9    15.0  34.5  +29.1
BLIP [li2022blip]                    22.5        5.8    2.4        3.8     1.5   7.2
  +ClearCLIP                         67.5       16.8   11.5        9.3     7.1  22.4  +15.2
OpenCLIP [cherti2023reproducible]    39.7        7.0    4.1        3.9     2.3  11.4
  +ClearCLIP                         65.3       27.9   19.5       26.4    16.0  31.0  +19.6
MetaCLIP [xu2023demystifying]        22.7        6.2    3.6        5.1     2.2   8.0
  +ClearCLIP                         78.2       30.3   20.5       25.6    16.4  34.2  +26.2
MaskCLIP [zhou2022extract]           30.1       12.6    8.9       10.1     6.9  13.7
  +ClearCLIP                         65.1       26.5   17.6       21.2    15.1  29.1  +15.4
SCLIP [wang2023sclip]                60.3       20.5   13.1       17.0     7.1  23.6
  +ClearCLIP                         79.2       30.6   20.5       27.8    15.6  34.7  +11.1
GEM [bousselham2023grounding]        80.3       26.4   17.6       22.6    11.6  31.7
  +ClearCLIP                         79.7       29.9   19.4       25.9    14.2  33.8   +2.1
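
As a worked example of how the Avg. and gain entries of Tab. 2 relate to the per-dataset scores, the short sketch below recomputes them for the first CLIP / +ClearCLIP pair of rows. The helper name `mean_miou` is hypothetical, and rounding to one decimal place is an assumption about how the table was prepared.

```python
def mean_miou(scores):
    """Average per-dataset mIoU, rounded to one decimal place."""
    return round(sum(scores) / len(scores), 1)

# Scores for VOC20, Context59, Stuff, Cityscape, ADE20K, copied from Tab. 2.
clip_scores = [41.8, 9.2, 4.4, 5.5, 2.1]
clearclip_scores = [80.9, 35.9, 23.9, 30.0, 16.7]

avg_clip = mean_miou(clip_scores)            # 12.6
avg_clearclip = mean_miou(clearclip_scores)  # 37.5
gain = round(avg_clearclip - avg_clip, 1)    # 24.9

print(avg_clip, avg_clearclip, gain)
```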

Appendix C Integration across models

Our solution serves as a free lunch applicable to various architectures and segmentation models with just 2-3 lines of code modification. Specifically, for MaskCLIP and SCLIP, we eliminate the residual connection and the feed-forward network (FFN) of the last self-attention layer. For GEM, we use the attention output of the final layer as the final representation. Importantly, we preserve the original attention mechanisms of these methods. For the baseline models, i.e., CLIP, BLIP, OpenCLIP, and MetaCLIP, we apply our complete solution. The performance of the different models on five datasets is summarized in Tab. 2. Our solution improves the average mIoU of every model under both architectures, although a few individual entries decrease slightly (e.g., GEM on Cityscape, from 30.8 to 30.5). The gains are largest for the baseline models and smaller for methods that already modify the last layer, which suggests that the solution generalizes well across models.
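
To make the size of the modification concrete, the following NumPy sketch contrasts a standard pre-norm transformer block with a variant that returns only the attention output of the last layer. It is a simplified illustration under stated assumptions, not the released implementation: the attention is single-head, layer normalization has no learnable parameters, ReLU is a placeholder activation, and the names `last_block`, `clear`, and `params` are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(x, w_q, w_k, w_v, w_o):
    """Single-head scaled dot-product self-attention over the tokens in x."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return (weights @ v) @ w_o

def ffn(x, w1, w2):
    return np.maximum(x @ w1, 0.0) @ w2  # ReLU used as a placeholder activation

def last_block(x, p, clear=False):
    x_attn = attention(layer_norm(x), p["w_q"], p["w_k"], p["w_v"], p["w_o"])
    if clear:
        # Modified path: no residual connection and no FFN.
        return x_attn
    x_sum = x + x_attn                                   # residual connection
    return x_sum + ffn(layer_norm(x_sum), p["w1"], p["w2"])

# Toy sizes: 5 tokens, 8 channels, FFN hidden width 16.
rng = np.random.default_rng(0)
d, hidden, n_tokens = 8, 16, 5
shapes = {"w_q": (d, d), "w_k": (d, d), "w_v": (d, d), "w_o": (d, d),
          "w1": (d, hidden), "w2": (hidden, d)}
params = {name: rng.normal(scale=0.1, size=shape) for name, shape in shapes.items()}
x = rng.normal(size=(n_tokens, d))

out_default = last_block(x, params)            # standard block output
out_clear = last_block(x, params, clear=True)  # attention output only
```

In this sketch the change amounts to returning `x_attn` early, which is in line with the small code footprint described above.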

Figure 2: Visualization of feature maps with CLIP for two randomly selected examples from the COCOStuff dataset. The first row shows the first 64 feature maps of each type, while the second row displays all 768 feature maps of each type.

Appendix D Visualization of feature maps

To intuitively demonstrate how the residual connections affect performance, we visualize the feature maps of $X_{\textup{res}}$, $X_{\textup{attn}}$, and $X_{\textup{sum}}$ for two randomly selected samples in Fig. 2. The $X_{\textup{res}}$ feature maps associated with the residual connections are characterized by peak values in one channel (highlighted in a red box) that significantly surpass those of the other channels. $X_{\textup{sum}}$ is similar to $X_{\textup{res}}$, indicating the strong influence of $X_{\textup{res}}$ on the final feature. Conversely, the feature maps in $X_{\textup{attn}}$ exhibit a more uniform distribution across channels. Given that the segmentation map is derived from the cosine similarity of feature vectors at each spatial location, such a disparity implies that the features in $X_{\textup{sum}}$ and $X_{\textup{res}}$ are less discernible than those in $X_{\textup{attn}}$, thereby introducing noise into the segmentation results. This observation supports our hypothesis that the high-level supervision in CLIP emphasizes the global feature direction in the residual latent space, making local feature vectors less distinguishable and leading to noise in residual features.
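
The effect of a single peak channel on cosine similarity can be seen in a toy calculation. The vectors below are hypothetical four-channel features chosen for illustration, not values measured from CLIP: two attention features that are orthogonal become nearly identical in direction once the same large-valued residual channel is added to both.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two local attention features pointing in different directions.
attn_a = np.array([1.0, 0.0, 0.0, 0.0])
attn_b = np.array([0.0, 1.0, 0.0, 0.0])

# A shared residual with one dominant channel, mimicking the peak channel in Fig. 2.
res = np.array([0.0, 0.0, 0.0, 20.0])

sim_attn = cosine(attn_a, attn_b)             # 0.0: the two locations are easy to tell apart
sim_sum = cosine(attn_a + res, attn_b + res)  # 400 / 401, about 0.9975: nearly indistinguishable

print(sim_attn, sim_sum)
```

Because the class assignment at each location relies on cosine similarity, features dominated by the shared channel leave little angular margin to separate locations, which is consistent with the noisier segmentation observed with the residual path.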

Appendix E Additional qualitative examples

In this part, we present additional qualitative comparisons between ClearCLIP and state-of-the-art methods. Fig. 3 shows results on the COCOStuff and ADE20K datasets, and Fig. 4 shows results on the Pascal Context59 dataset. Similar to the findings in the main text, the results of ClearCLIP exhibit much less noise than those of the other methods, further supporting the advantage of our method.

Figure 3: Qualitative comparison between different open-vocabulary segmentation methods on (a) COCOStuff and (b) ADE20K datasets.

Figure 4: Qualitative comparison between different open-vocabulary segmentation methods on the Pascal Context59 dataset.