MLIP: Efficient Multi-Perspective Language-Image Pretraining with Exhaustive Data Utilization

Zhang, Yu; Zhang, Qi; Gong, Zixuan; Shi, Yiwei; Liu, Yepeng; Miao, Duoqian; Liu, Yang; Liu, Ke; Yi, Kun; Fan, Wei; Hu, Liang; Wang, Changwei

arXiv:2406.01460 (cs)

[Submitted on 3 Jun 2024 (v1), last revised 4 Jun 2024 (this version, v2)]

Abstract: Contrastive Language-Image Pretraining (CLIP) has achieved remarkable success, leading to rapid advancements in multimodal studies. However, CLIP faces a notable challenge in terms of inefficient data utilization. It relies on a single contrastive supervision for each image-text pair during representation learning, disregarding a substantial amount of valuable information that could offer richer supervision. Additionally, the retention of non-informative tokens leads to increased computational demands and time costs, particularly in CLIP's ViT image encoder. To address these issues, we propose Multi-Perspective Language-Image Pretraining (MLIP). In MLIP, we leverage the frequency transform's sensitivity to both high- and low-frequency variations, which complements the spatial domain's sensitivity to low-frequency variations only. By incorporating frequency transforms and token-level alignment, we expand CLIP's single supervision into multi-domain and multi-level supervision, enabling a more thorough exploration of informative image features. Additionally, we introduce a token merging method guided by comprehensive semantics from the frequency and spatial domains. This allows us to merge tokens into multi-granularity tokens with a controllable compression rate to accelerate CLIP. Extensive experiments validate the effectiveness of our design.
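
To make the token-merging idea concrete, here is a minimal NumPy sketch of merging ViT-style tokens at a controllable compression rate, using a score that mixes spatial and frequency cues. It is an illustration under stated assumptions, not the method from the paper. The function name `merge_tokens`, the `keep_ratio` parameter, the scoring rule (token norm plus upper-band FFT magnitude), and the assignment of dropped tokens to their most similar kept token are all hypothetical choices made for this example.

```python
import numpy as np


def merge_tokens(tokens: np.ndarray, keep_ratio: float) -> np.ndarray:
    """Reduce an (n, d) token matrix to about keep_ratio * n tokens.

    Hypothetical scoring: spatial energy (L2 norm) plus the magnitude of the
    upper half of each token's channel-wise spectrum. Low-scoring tokens are
    averaged into their most similar kept token rather than discarded.
    """
    n, d = tokens.shape
    k = min(n, max(1, int(round(n * keep_ratio))))

    # Frequency view: magnitude spectrum of each token along the channel axis.
    spectrum = np.abs(np.fft.fft(tokens, axis=1))
    high_band = spectrum[:, d // 4 : d // 2 + 1].sum(axis=1)

    # Assumed importance score combining both domains.
    score = np.linalg.norm(tokens, axis=1) + high_band

    keep = np.sort(np.argsort(-score)[:k])       # indices of retained tokens
    drop = np.setdiff1d(np.arange(n), keep)      # indices to be merged away

    # Cosine similarity between each dropped token and every kept token.
    unit = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    assign = (unit[drop] @ unit[keep].T).argmax(axis=1)

    # Average each dropped token into its assigned kept token.
    merged = tokens[keep].copy()
    counts = np.ones(k)
    for src, dst in zip(drop, assign):
        merged[dst] += tokens[src]
        counts[dst] += 1
    return merged / counts[:, None]


rng = np.random.default_rng(0)
patch_tokens = rng.normal(size=(196, 64))        # e.g. 14 x 14 patches, 64 channels
compressed = merge_tokens(patch_tokens, keep_ratio=0.5)
print(compressed.shape)
```

In this sketch `keep_ratio` plays the role of the compression rate: lowering it shortens the token sequence passed to later encoder layers, and the cost of self-attention grows quadratically with that length.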

Comments:	ICML 2024
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2406.01460 [cs.CV]
	(or arXiv:2406.01460v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2406.01460

Submission history

From: Yu Zhang
[v1] Mon, 3 Jun 2024 15:49:11 UTC (5,992 KB)
[v2] Tue, 4 Jun 2024 07:36:57 UTC (5,992 KB)