Recent diffusion-based generative models employ methods such as one-shot fine-tuning of an image diffusion model for video generation. However, this leads to long video generation times and suboptimal efficiency. To avoid this overhead, zero-shot text-to-video models eliminate fine-tuning entirely and generate novel videos from a text prompt alone. While zero-shot generation greatly reduces generation time, many models rely on inefficient cross-frame attention processors, hindering the use of diffusion models for real-time video generation. We address this issue by introducing more efficient attention processors into a video diffusion model. Specifically, we use attention processors (namely xFormers, FlashAttention, and HyperAttention) that are highly optimized for efficiency and hardware parallelization. We then apply these processors to a video generator and evaluate them with both older diffusion models such as Stable Diffusion 1.5 and newer, higher-quality models such as Stable Diffusion XL. Our results show that efficient attention processors alone can reduce generation time by around 25% with no change in video quality. Combined with higher-quality models, the use of efficient attention processors in zero-shot generation yields substantial gains in both efficiency and quality, greatly expanding the applicability of video diffusion models to real-time video generation.
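As a minimal sketch of the kind of processor swap described above, the following assumes a Hugging Face diffusers Stable Diffusion pipeline; the actual zero-shot video pipeline and the HyperAttention integration are not shown here, and the model ID and prompt are illustrative only.

```python
# Sketch: enabling efficient attention processors in a diffusers pipeline.
# Assumes diffusers, torch >= 2.0, and (optionally) xformers are installed.
import torch
from diffusers import StableDiffusionPipeline
from diffusers.models.attention_processor import AttnProcessor2_0

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Option A: xFormers memory-efficient attention (requires the xformers package).
pipe.enable_xformers_memory_efficient_attention()

# Option B (alternative to A): PyTorch 2.0 scaled_dot_product_attention,
# which dispatches to FlashAttention kernels on supported hardware.
pipe.unet.set_attn_processor(AttnProcessor2_0())

# Single-image call shown for brevity; the video setting applies the same
# processor swap to the cross-frame attention layers.
image = pipe("an astronaut riding a horse", num_inference_steps=25).images[0]
```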