Unifying Vision-Language Representation Space with Single-tower Transformer

Jang, Jiho; Kong, Chaerin; Jeon, Donghyeon; Kim, Seonhoon; Kwak, Nojun

arXiv:2211.11153 (cs)

[Submitted on 21 Nov 2022]

Abstract: Contrastive learning is a form of distance learning that aims to learn invariant features from two related representations. In this paper, we explore the bold hypothesis that an image and its caption can be simply regarded as two different views of the underlying mutual information, and train a model to learn a unified vision-language representation space that encodes both modalities at once in a modality-agnostic manner. We first identify difficulties in learning a generic one-tower model for vision-language pretraining (VLP), and propose OneR as a simple yet effective framework for our goal. We discover intriguing properties, such as zero-shot object localization, text-guided visual reasoning and multi-modal retrieval, that distinguish OneR from previous works that learn modality-specific representation spaces, and present analyses to provide insights into this new form of multi-modal representation learning. Thorough evaluations demonstrate the potential of a unified modality-agnostic VLP framework.

Comments:	AAAI 2023, 11 pages
Subjects:	Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2211.11153 [cs.LG]
	(or arXiv:2211.11153v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2211.11153

Submission history

From: Chaerin Kong
[v1] Mon, 21 Nov 2022 02:34:21 UTC (6,757 KB)
