OD-VAE: An Omni-dimensional Video Compressor for Improving Latent Video Diffusion Model

Chen, Liuhan; Li, Zongjian; Lin, Bin; Zhu, Bin; Wang, Qian; Yuan, Shenghai; Zhou, Xing; Cheng, Xinhua; Yuan, Li

Computer Science > Computer Vision and Pattern Recognition

arXiv:2409.01199 (cs)

[Submitted on 2 Sep 2024 (v1), last revised 9 Sep 2024 (this version, v2)]

Abstract:The Variational Autoencoder (VAE), which compresses videos into latent representations, is a crucial preceding component of Latent Video Diffusion Models (LVDMs). At the same reconstruction quality, the more strongly the VAE compresses videos, the more efficient the LVDMs are. However, most LVDMs use a 2D image VAE, whose compression is purely spatial and typically leaves the temporal dimension uncompressed. How to perform temporal compression in a VAE to obtain more concise latent representations while preserving accurate reconstruction has seldom been explored. To fill this gap, we propose an omni-dimensional compression VAE, named OD-VAE, which compresses videos both temporally and spatially. Although this stronger compression makes video reconstruction considerably more challenging, OD-VAE still achieves high reconstruction accuracy through careful design. To obtain a better trade-off between video reconstruction quality and compression speed, four variants of OD-VAE are introduced and analyzed. In addition, a novel tail initialization is designed to train OD-VAE more efficiently, and a novel inference strategy is proposed to enable OD-VAE to handle videos of arbitrary length with limited GPU memory. Comprehensive experiments on video reconstruction and LVDM-based video generation demonstrate the effectiveness and efficiency of the proposed methods.
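
To make the efficiency argument concrete, the sketch below compares the latent grid a spatial-only VAE would produce with the grid from a VAE that also compresses in time. The abstract does not state the compression factors, so the clip size (64 frames at 256x256), the spatial factor of 8, and the temporal factor of 4 are illustrative assumptions, and `latent_shape` and `latent_positions` are hypothetical helper names rather than part of any released code.

```python
from math import ceil


def latent_shape(frames, height, width, t_factor, s_factor):
    """Latent grid size (frames, height, width) after downsampling.

    t_factor and s_factor are hypothetical temporal and spatial
    compression factors; t_factor=1 models a 2D image VAE that
    leaves the temporal dimension uncompressed.
    """
    return (
        ceil(frames / t_factor),
        ceil(height / s_factor),
        ceil(width / s_factor),
    )


def latent_positions(shape):
    """Number of spatio-temporal positions the diffusion model must process."""
    t, h, w = shape
    return t * h * w


# Assumed example clip: 64 frames at 256x256 resolution.
spatial_only = latent_shape(64, 256, 256, t_factor=1, s_factor=8)
omni = latent_shape(64, 256, 256, t_factor=4, s_factor=8)

# Ratio of latent positions: how much smaller the omni-dimensional latent is.
reduction = latent_positions(spatial_only) / latent_positions(omni)
```

Under these assumed factors the temporally compressed latent has one quarter as many positions as the spatial-only latent, which is the kind of saving that would make a downstream LVDM cheaper to train and sample from.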

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Cite as:	arXiv:2409.01199 [cs.CV]
	(or arXiv:2409.01199v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2409.01199

Submission history

From: Shenghai Yuan
[v1] Mon, 2 Sep 2024 12:20:42 UTC (3,150 KB)
[v2] Mon, 9 Sep 2024 13:49:53 UTC (3,149 KB)
