Email: lanm0002@e.ntu.edu.sg, {chaofeng.chen, ypke}@ntu.edu.sg
Email: {wangxinjiang, fenglitong, wayne.zhang}@sensetime.com
Supplementary – ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference
Appendix
Appendix A Ablation study with different backbones and datasets
We report the ablation results for each dataset and each CLIP model in Fig. 1. Our method, which removes the residual connection and the FFN, markedly improves the open-vocabulary semantic segmentation capability of CLIP on all datasets. The improvement is especially pronounced for the ViT-L/14 architecture, whose residual connection has a larger norm. These findings support the effectiveness of our proposed method.
Appendix B Impact of channel-wise residual features
In this part, we investigate the effect of low-intensity residual features. Specifically, we selectively reintroduce the channels of the residual features that have lower average values. Tab. 1 reports the results of eliminating a given percentage of the highest-value channels, together with the effect of normalizing the residual features. Performance improves as more channels are eliminated and saturates at 38.1 once 70% or more of them are removed. Normalizing the residual features significantly reduces their scale and yields performance comparable to eliminating all channels. These findings support our hypothesis that high-level supervision in CLIP emphasizes the global feature direction in the residual latent space, which introduces noise into the residual features. For simplicity, we eliminate all channels of the residual features.
Eliminated channels (%) | 0 | 5 | 10 | 30 | 50 | 70 | 100 | Norm |
Avg. | 22.1 | 30.2 | 33.5 | 37.4 | 38.0 | 38.1 | 38.1 | 38.1 |
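The channel-elimination experiment described above can be illustrated with a small NumPy sketch. It is not the authors' code: the function names `drop_top_channels` and `l2_normalize`, the toy shapes, and the choice to rank channels by mean absolute value are all assumptions made for illustration (the text only says "average values").

```python
import numpy as np

def drop_top_channels(x_res, percent):
    """Zero out the `percent`% of channels with the largest mean |value|.

    x_res has shape (num_tokens, num_channels). Ranking by mean absolute
    value is an assumption of this sketch, not a detail from the paper.
    """
    num_channels = x_res.shape[1]
    k = int(round(num_channels * percent / 100.0))
    out = x_res.copy()
    if k == 0:
        return out
    channel_strength = np.abs(x_res).mean(axis=0)
    top = np.argsort(channel_strength)[-k:]  # indices of the k strongest channels
    out[:, top] = 0.0
    return out

def l2_normalize(x, eps=1e-12):
    """Scale every token vector to unit L2 norm (the 'Norm' setting)."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

# Toy data: hypothetical attention output and residual features.
rng = np.random.default_rng(0)
x_attn = rng.normal(size=(4, 8))
x_res = rng.normal(size=(4, 8))
x_res[:, 3] += 50.0  # one dominant, high-value channel

# Final features when 0%, 50%, or 100% of the residual channels are removed.
finals = {p: x_attn + drop_top_channels(x_res, p) for p in (0, 50, 100)}
# Alternative: keep all channels but shrink the residual by normalizing it.
final_norm = x_attn + l2_normalize(x_res)
```

With 100% of the channels removed, the final feature is exactly the attention output, which is the setting adopted for simplicity in the text.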
Method | VOC20 | Context59 | Stuff | Cityscape | ADE20K | Avg. |
CLIP [radford2021learning] | 41.8 | 9.2 | 4.4 | 5.5 | 2.1 | 12.6 |
+ClearCLIP | 80.9 | 35.9 | 23.9 | 30.0 | 16.7 | 37.5 +24.9 |
BLIP [li2022blip] | 37.3 | 7.8 | 5.4 | 4.3 | 2.0 | 11.4 |
+ClearCLIP | 73.5 | 31.4 | 21.3 | 23.8 | 13.5 | 32.7 +21.3 |
OpenCLIP [cherti2023reproducible] | 47.2 | 9.0 | 5.0 | 5.1 | 2.9 | 13.8 |
+ClearCLIP | 81.4 | 34.1 | 23.1 | 31.8 | 18.9 | 37.9 +24.1 |
MetaCLIP [xu2023demystifying] | 35.4 | 8.1 | 4.3 | 5.0 | 2.2 | 11.0 |
+ClearCLIP | 78.3 | 34.8 | 23.5 | 27.9 | 17.4 | 36.4 +25.4 |
MaskCLIP [zhou2022extract] | 74.9 | 26.4 | 16.4 | 12.6 | 9.8 | 28.0 |
+ClearCLIP | 61.4 | 28.3 | 18.4 | 24.7 | 13.6 | 29.5 +1.8 |
SCLIP [wang2023sclip] | 78.2 | 33.0 | 21.1 | 29.1 | 14.6 | 35.2 |
+ClearCLIP | 77.9 | 35.6 | 23.6 | 31.0 | 17.0 | 37.9 +1.6 |
GEM [bousselham2023grounding] | 79.9 | 35.9 | 23.7 | 30.8 | 15.7 | 37.2 |
+ClearCLIP | 80.2 | 36.5 | 24.4 | 30.5 | 17.4 | 37.8 +0.6 |
CLIP [radford2021learning] | 15.8 | 4.5 | 2.4 | 2.9 | 1.2 | 5.4 |
+ClearCLIP | 80.0 | 29.6 | 19.9 | 27.9 | 15.0 | 34.5 +29.1 |
BLIP [li2022blip] | 22.5 | 5.8 | 2.4 | 3.8 | 1.5 | 7.2 |
+ClearCLIP | 67.5 | 16.8 | 11.5 | 9.3 | 7.1 | 22.4 +15.2 |
OpenCLIP [cherti2023reproducible] | 39.7 | 7.0 | 4.1 | 3.9 | 2.3 | 11.4 |
+ClearCLIP | 65.3 | 27.9 | 19.5 | 26.4 | 16.0 | 31.0 +19.6 |
MetaCLIP [xu2023demystifying] | 22.7 | 6.2 | 3.6 | 5.1 | 2.2 | 8.0 |
+ClearCLIP | 78.2 | 30.3 | 20.5 | 25.6 | 16.4 | 34.2 +26.2 |
MaskCLIP [zhou2022extract] | 30.1 | 12.6 | 8.9 | 10.1 | 6.9 | 13.7 |
+ClearCLIP | 65.1 | 26.5 | 17.6 | 21.2 | 15.1 | 29.1 +15.4 |
SCLIP [wang2023sclip] | 60.3 | 20.5 | 13.1 | 17.0 | 7.1 | 23.6 |
+ClearCLIP | 79.2 | 30.6 | 20.5 | 27.8 | 15.6 | 34.7 +11.1 |
GEM [bousselham2023grounding] | 80.3 | 26.4 | 17.6 | 22.6 | 11.6 | 31.7 |
+ClearCLIP | 79.7 | 29.9 | 19.4 | 25.9 | 14.2 | 33.8 +2.1 |
Appendix C Integration across models
Our solution serves as a free lunch that can be applied to various architectures and segmentation models by modifying just 2-3 lines of code. Specifically, for MaskCLIP and SCLIP, we remove the residual connection and the Feed-Forward Network (FFN) of the last self-attention layer. For GEM, we use the attention output of the final layer as the final representation. Importantly, we preserve the original attention mechanisms of these methods. For the baseline models, i.e., CLIP, BLIP, OpenCLIP, and MetaCLIP, we incorporate our complete solution. The performance of the different models on five datasets is summarized in Tab. 2. Our solution improves the average performance of every model on open-vocabulary semantic segmentation, although a few individual entries (e.g., MaskCLIP on VOC20) decrease, which demonstrates its strong generalizability.
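To make the scale of the change concrete, the following NumPy sketch contrasts a standard pre-norm transformer block with a variant that returns only the attention output. It is a simplified illustration under stated assumptions, not the code of any of the models above: the block is single-head, the layer norm has no learned parameters, ReLU stands in for the real activation, and all names (`last_block_standard`, `last_block_modified`, `params`) are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Per-token normalization without learned scale/shift, for brevity.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(x, p):
    # Single-head attention; the host method's own attention would go here.
    q, k, v = x @ p["w_q"], x @ p["w_k"], x @ p["w_v"]
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return (weights @ v) @ p["w_o"]

def ffn(x, p):
    # ReLU is a stand-in for the activation used by the real model.
    return np.maximum(x @ p["w_1"], 0.0) @ p["w_2"]

def last_block_standard(x, p):
    x_attn = attention(layer_norm(x), p)
    x_sum = x + x_attn                        # residual connection
    return x_sum + ffn(layer_norm(x_sum), p)  # FFN with its own residual

def last_block_modified(x, p):
    # Residual connection and FFN removed: keep only the attention output.
    return attention(layer_norm(x), p)

rng = np.random.default_rng(0)
dim, hidden, num_tokens = 8, 16, 5
shapes = {"w_q": (dim, dim), "w_k": (dim, dim), "w_v": (dim, dim),
          "w_o": (dim, dim), "w_1": (dim, hidden), "w_2": (hidden, dim)}
params = {name: 0.1 * rng.normal(size=shape) for name, shape in shapes.items()}

x = rng.normal(size=(num_tokens, dim))
out_standard = last_block_standard(x, params)
out_modified = last_block_modified(x, params)
```

The two functions differ only in the lines that add the residual and apply the FFN, which is why the modification amounts to a few lines in an existing implementation.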
Appendix D Visualization of feature maps
To show intuitively how the residual connection affects performance, we visualize the feature maps of the residual features, the final features, and the attention output for two randomly selected samples in Fig. 2. The feature maps of the residual features are dominated by a peak in a single channel (highlighted by a red box) whose values far exceed those of the other channels. The final features closely resemble the residual features, which indicates the strong influence of the residual connection on the final feature. Conversely, the feature maps of the attention output are distributed more uniformly across channels. Since the segmentation map is derived from the cosine similarity of the feature vectors at each spatial location, this disparity implies that the residual and final features are less discriminative than the attention output and therefore introduce noise into the segmentation results. This observation supports our hypothesis that the high-level supervision in CLIP emphasizes the global feature direction in the residual latent space, which makes local feature vectors less distinguishable and leads to noise in the residual features.
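The effect of a single dominant channel on cosine similarity can be reproduced on synthetic data. The sketch below is only an illustration: the feature values, the peak magnitude of 40, and all variable names are assumptions, and random vectors stand in for real CLIP patch and text embeddings.

```python
import numpy as np

def cosine_similarity(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

rng = np.random.default_rng(0)
num_patches, dim, num_classes = 6, 16, 3

# Hypothetical attention output (varied across patches) and text embeddings.
x_attn = rng.normal(size=(num_patches, dim))
text = rng.normal(size=(num_classes, dim))

# Hypothetical residual features: small everywhere except one peak channel
# that takes the same large value at every spatial location.
x_res = 0.1 * rng.normal(size=(num_patches, dim))
x_res[:, 0] = 40.0
x_sum = x_attn + x_res

sim_attn = cosine_similarity(x_attn, text)  # (num_patches, num_classes)
sim_sum = cosine_similarity(x_sum, text)

# How much each class score varies from patch to patch.
spread_attn = sim_attn.std(axis=0).mean()
spread_sum = sim_sum.std(axis=0).mean()

# How similar different patches are to each other after adding the residual.
patch_sim = cosine_similarity(x_sum, x_sum)
min_patch_sim = patch_sim[~np.eye(num_patches, dtype=bool)].min()

# Per-patch class prediction from each feature type.
seg_attn = sim_attn.argmax(axis=1)
seg_sum = sim_sum.argmax(axis=1)
```

In this toy setting the shared peak channel pulls all patch vectors toward the same direction, so their similarity scores against the text embeddings vary far less across locations than those of the attention output alone.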
Appendix E Additional qualitative examples
In this part, we present additional qualitative comparisons between ClearCLIP and state-of-the-art methods. Figs. 3 and 4 show results on the COCOStuff, ADE20K, and Pascal Context59 datasets. Consistent with the findings in the main text, the results of ClearCLIP exhibit much less noise than those of the other methods, which further supports the advantage of our method.