The Solution for Language-Enhanced Image New Category Discovery

Xu, Haonan; Chao, Dian; Wu, Xiangyu; Wan, Zhonghua; Yang, Yang

arXiv:2407.04994 (cs)

[Submitted on 6 Jul 2024]

Abstract: Treating texts as images, combining prompts with textual labels for prompt tuning, and leveraging the alignment properties of CLIP have been successfully applied in zero-shot multi-label image recognition. Nonetheless, relying solely on textual labels to store visual information is insufficient for representing the diversity of visual objects. In this paper, we propose reversing the training process of CLIP and introducing the concept of Pseudo Visual Prompts. These prompts are initialized for each object category and pre-trained on large-scale, low-cost sentence data generated by large language models. This process mines the aligned visual information in CLIP and stores it in class-specific visual prompts. We then employ contrastive learning to transfer the stored visual information to the textual labels, enhancing their visual representation capacity. Additionally, we introduce a dual-adapter module that simultaneously leverages knowledge from the original CLIP and newly learned knowledge derived from downstream datasets. Benefiting from the pseudo visual prompts, our method surpasses the state of the art not only on clean annotated text data but also on pseudo text data generated by large language models.
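The contrastive transfer step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the loss, so the symmetric InfoNCE-style form below, the function names, and the temperature of 0.07 are all assumptions chosen for illustration, not the authors' implementation. The idea shown is only that the visual prompt for class i and the textual label embedding for class i are treated as a positive pair, with all other classes as negatives.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(visual_prompts, text_embeddings, temperature=0.07):
    # Hypothetical loss: row i of each matrix belongs to class i (positive pair).
    v = l2_normalize(visual_prompts)
    t = l2_normalize(text_embeddings)
    logits = v @ t.T / temperature
    idx = np.arange(len(v))
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # prompt -> label
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx].mean()  # label -> prompt
    return 0.5 * (loss_v2t + loss_t2v)


# Toy check with four orthogonal class embeddings (stand-ins for real features).
class_embeddings = np.eye(4)
matched = symmetric_contrastive_loss(class_embeddings, class_embeddings)
mismatched = symmetric_contrastive_loss(
    class_embeddings, np.roll(class_embeddings, 1, axis=0)
)
print(matched < mismatched)
```

In this toy setup the loss is close to zero when each label embedding coincides with its own class prompt and large when the labels are shifted to the wrong class, which is the behavior a transfer objective of this kind relies on.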

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:	arXiv:2407.04994 [cs.CV]
	(or arXiv:2407.04994v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2407.04994

Submission history

From: Zhonghua Wan
[v1] Sat, 6 Jul 2024 08:09:29 UTC (1,828 KB)
