An Intermediate Fusion ViT Enables Efficient Text-Image Alignment in Diffusion Models

Hu, Zizhao; Jia, Shaochong; Rostami, Mohammad

arXiv:2403.16530 (cs)

[Submitted on 25 Mar 2024]

Abstract: Diffusion models are widely used for conditional cross-modal generation tasks such as text-to-image and text-to-video. However, state-of-the-art models still fail to align generated visual concepts with high-level language semantics such as object count and spatial relationships. We approach this problem from a multimodal data fusion perspective and investigate how different fusion strategies affect vision-language alignment. We find that, compared to the widely used early fusion of conditioning text in a pretrained image feature space, a specially designed intermediate fusion can: (i) boost text-to-image alignment with improved generation quality and (ii) improve training and inference efficiency by reducing low-rank text-to-image attention calculations. We run experiments on a text-to-image generation task using the MS-COCO dataset, comparing our intermediate fusion mechanism with the classic early fusion mechanism under two common conditioning methods on a U-shaped ViT backbone. Our intermediate fusion model achieves a higher CLIP Score and lower FID, with 20% fewer FLOPs and 50% faster training, compared to a strong U-ViT baseline with early fusion.
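To make the efficiency argument concrete, the following is a minimal back-of-the-envelope sketch, not the authors' architecture or cost model. It assumes a simplified setting in which early fusion concatenates text tokens with image tokens at the input, so every self-attention layer processes both, while a hypothetical intermediate fusion exposes the text tokens to only a few middle layers. The function names, the `fused_layers` parameter, and the token counts are all illustrative assumptions.

```python
def attention_flops(seq_len: int, dim: int) -> int:
    """Rough multiply-accumulate count for one self-attention layer.

    Counts only the QK^T score matrix and the attention-weighted sum over V
    (about seq_len * seq_len * dim each); linear projections are ignored.
    """
    return 2 * seq_len * seq_len * dim


def early_fusion_cost(n_img: int, n_txt: int, dim: int, depth: int) -> int:
    """Every layer attends over the concatenated image and text tokens."""
    return depth * attention_flops(n_img + n_txt, dim)


def intermediate_fusion_cost(
    n_img: int, n_txt: int, dim: int, depth: int, fused_layers: int
) -> int:
    """Hypothetical variant: only `fused_layers` layers see the text tokens.

    The remaining layers attend over image tokens alone.
    """
    if not 0 <= fused_layers <= depth:
        raise ValueError("fused_layers must be between 0 and depth")
    image_only = (depth - fused_layers) * attention_flops(n_img, dim)
    fused = fused_layers * attention_flops(n_img + n_txt, dim)
    return image_only + fused


if __name__ == "__main__":
    # Illustrative values only; these are not the paper's configuration.
    n_img, n_txt, dim, depth = 256, 77, 512, 12
    early = early_fusion_cost(n_img, n_txt, dim, depth)
    mid = intermediate_fusion_cost(n_img, n_txt, dim, depth, fused_layers=4)
    print(f"early fusion attention cost:        {early}")
    print(f"intermediate fusion attention cost: {mid}")
    print(f"ratio: {mid / early:.2f}")
```

Under this toy model the saving comes purely from the quadratic dependence of attention on sequence length; the reported 20% FLOPs reduction depends on the actual design and should not be read off this sketch.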

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2403.16530 [cs.CV]
	(or arXiv:2403.16530v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2403.16530

Submission history

From: Zizhao Hu
[v1] Mon, 25 Mar 2024 08:16:06 UTC (16,635 KB)
