Deciphering the Role of Representation Disentanglement: Investigating Compositional Generalization in CLIP Models

Abbasi, Reza; Rohban, Mohammad Hossein; Baghshah, Mahdieh Soleymani

arXiv:2407.05897 (cs)

[Submitted on 8 Jul 2024 (v1), last revised 16 Jul 2024 (this version, v2)]

Abstract: CLIP models have recently been shown to exhibit Out of Distribution (OoD) generalization capabilities. However, Compositional Out of Distribution (C-OoD) generalization, a crucial aspect of a model's ability to understand unseen compositions of known concepts, remains relatively unexplored for CLIP models. Our goal is to address this gap and identify the factors that contribute to C-OoD generalization in CLIP models. We note that previous studies of the compositional understanding of CLIP models frequently fail to ensure that test samples are genuinely novel relative to the CLIP training data. To this end, we carefully synthesized a large and diverse dataset in the single-object setting, pairing objects with attributes in combinations that are highly unlikely to appear in the combined training datasets of various CLIP models. This dataset enables an authentic evaluation of C-OoD generalization. Our observations reveal varying levels of C-OoD generalization across different CLIP models. We propose that the disentanglement of CLIP representations serves as a critical indicator in this context. Using our synthesized dataset and other existing datasets, we assess various disentanglement metrics of text and image representations. Our study reveals that the disentanglement of image and text representations, particularly with respect to their compositional elements, plays a crucial role in improving the generalization of CLIP models in out-of-distribution settings. This finding suggests promising opportunities for advancing out-of-distribution generalization in CLIP models.

Comments:	Accepted at ECCV 2024
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2407.05897 [cs.CV]
	(or arXiv:2407.05897v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2407.05897

Submission history

From: Reza Abbasi
[v1] Mon, 8 Jul 2024 13:04:40 UTC (21,895 KB)
[v2] Tue, 16 Jul 2024 05:42:05 UTC (22,102 KB)
