Language Plays a Pivotal Role in the Object-Attribute Compositional Generalization of CLIP

Abbasi, Reza; Samiei, Mohammad; Rohban, Mohammad Hossein; Baghshah, Mahdieh Soleymani

arXiv:2403.18525 (cs)

[Submitted on 27 Mar 2024]

Abstract: Vision-language models, such as CLIP, have shown promising Out-of-Distribution (OoD) generalization under various types of distribution shifts. Recent studies attempted to investigate the leading cause of this capability. In this work, we follow the same path, but focus on a specific type of OoD data - images with novel compositions of attribute-object pairs - and study whether such models can successfully classify those images into composition classes. We carefully designed an authentic image test dataset called ImageNet-AO, consisting of attributes for objects that are unlikely to be encountered in the CLIP training sets. We found that CLIPs trained with large datasets such as OpenAI CLIP, LAION-400M, and LAION-2B show orders-of-magnitude improvement in effective compositional OoD generalization compared to both supervised models and CLIPs trained with smaller datasets, such as CC-12M and YFCC-15M. Our results provide evidence that the scale and diversity of training data and language supervision play a key role in unlocking the compositional generalization abilities of vision-language models.
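The task described in the abstract, classifying an image into one of several attribute-object composition classes, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the prompt template, the attribute and object lists, and the hand-written 4-dimensional vectors are all assumptions, used here as stand-ins for the outputs of real CLIP image and text encoders. Only the general zero-shot recipe (pick the prompt whose embedding has the highest cosine similarity to the image embedding) is shown.

```python
import math
from itertools import product


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def composition_prompts(attributes, objects):
    """One text prompt per attribute-object pair (template is an assumption)."""
    return {
        f"a photo of a {attr} {obj}": (attr, obj)
        for attr, obj in product(attributes, objects)
    }


def zero_shot_classify(image_embedding, text_embeddings):
    """Return the prompt whose embedding is most similar to the image."""
    return max(
        text_embeddings,
        key=lambda prompt: cosine(image_embedding, text_embeddings[prompt]),
    )


# Hypothetical toy vocabulary and embeddings (NOT real CLIP outputs).
ATTR_VECS = {"red": [1.0, 0.0], "blue": [0.0, 1.0]}
OBJ_VECS = {"dog": [1.0, 0.0], "car": [0.0, 1.0]}

prompts = composition_prompts(ATTR_VECS, OBJ_VECS)

# Toy "text encoder": concatenate the attribute and object vectors.
text_embeddings = {
    prompt: ATTR_VECS[attr] + OBJ_VECS[obj]
    for prompt, (attr, obj) in prompts.items()
}

# Toy "image embedding" for a picture of a blue dog, an unusual composition.
image_embedding = [0.1, 0.9, 0.8, 0.2]

prediction = zero_shot_classify(image_embedding, text_embeddings)
print(prediction)
```

With a real model, the toy vectors would be replaced by encoder outputs, and accuracy over a set of such images would measure compositional generalization in the sense the abstract describes.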

Comments:	Oral accepted at OODCV 2023 (this http URL)
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2403.18525 [cs.CV]
	(or arXiv:2403.18525v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2403.18525

Submission history

From: Reza Abbasi
[v1] Wed, 27 Mar 2024 12:59:44 UTC (5,208 KB)
