Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning

Girdhar, Rohit; Singh, Mannat; Brown, Andrew; Duval, Quentin; Azadi, Samaneh; Rambhatla, Sai Saketh; Shah, Akbar; Yin, Xi; Parikh, Devi; Misra, Ishan

arXiv:2311.10709 (cs)

[Submitted on 17 Nov 2023 (v1), last revised 2 Aug 2024 (this version, v2)]

Abstract: We present Emu Video, a text-to-video generation model that factorizes the generation into two steps: first generating an image conditioned on the text, and then generating a video conditioned on the text and the generated image. We identify critical design decisions (adjusted noise schedules for diffusion, and multi-stage training) that enable us to directly generate high-quality, high-resolution videos without requiring a deep cascade of models as in prior work. In human evaluations, our generated videos are strongly preferred in quality over all prior work: 81% vs. Google's Imagen Video, 90% vs. Nvidia's PYOCO, and 96% vs. Meta's Make-A-Video. Our model also outperforms commercial solutions such as RunwayML's Gen2 and Pika Labs. Finally, our factorizing approach naturally lends itself to animating images based on a user's text prompt, where our generations are preferred 96% of the time over prior work.
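
The two-step factorization described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class `FactorizedTextToVideo`, its methods, and the toy stand-in generators are hypothetical names introduced only to show the control flow (text to image, then text plus image to video), and how the same second stage could animate a user-supplied image.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical placeholder types: strings stand in for real image/frame tensors.
Image = str
Video = List[str]


@dataclass
class FactorizedTextToVideo:
    """Illustrative two-stage pipeline; both stage interfaces are assumptions."""

    text_to_image: Callable[[str], Image]               # stage 1: text -> image
    image_text_to_video: Callable[[str, Image], Video]  # stage 2: (text, image) -> video

    def generate(self, prompt: str) -> Video:
        # Step 1: generate an image conditioned on the text.
        image = self.text_to_image(prompt)
        # Step 2: generate a video conditioned on the text and that image.
        return self.image_text_to_video(prompt, image)

    def animate(self, prompt: str, image: Image) -> Video:
        # Image animation: skip stage 1 and condition on a user-provided image.
        return self.image_text_to_video(prompt, image)


# Toy stand-ins so the sketch runs; real stages would be generative models.
def toy_text_to_image(prompt: str) -> Image:
    return f"image<{prompt}>"


def toy_image_text_to_video(prompt: str, image: Image, num_frames: int = 4) -> Video:
    return [f"frame{i}:{image}" for i in range(num_frames)]


pipeline = FactorizedTextToVideo(toy_text_to_image, toy_image_text_to_video)
video = pipeline.generate("a dog running on a beach")
animated = pipeline.animate("make it snow", "user_photo.png")
```

Because the second stage only needs a prompt and an image, the same component serves both text-to-video generation and animation of an existing image, which is the property the abstract's final sentence relies on.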

Comments:	ECCV 2024. Project page: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG); Multimedia (cs.MM)
Cite as:	arXiv:2311.10709 [cs.CV]
	(or arXiv:2311.10709v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2311.10709

Submission history

From: Rohit Girdhar
[v1] Fri, 17 Nov 2023 18:59:04 UTC (16,551 KB)
[v2] Fri, 2 Aug 2024 18:55:25 UTC (16,559 KB)
