Research article | Open access
DOI: 10.1145/3581783.3612033

Make-It-4D: Synthesizing a Consistent Long-Term Dynamic Scene Video from a Single Image

Published: 27 October 2023

Abstract

We study the problem of synthesizing a long-term dynamic video from a single image. This is challenging because it requires consistent movement of visual content under large camera motions. Existing methods either hallucinate inconsistent perpetual views or struggle with long camera trajectories. Addressing these issues requires estimating the underlying 4D representation (3D geometry plus scene motion) and filling in the regions it leaves occluded. To this end, we present Make-It-4D, a novel method that generates a consistent long-term dynamic video from a single image. On the one hand, we represent the scene with layered depth images (LDIs), which are unprojected to form a feature point cloud. To animate the visual content, the feature point cloud is displaced according to the scene flow derived from motion estimation and the corresponding camera pose. This 4D representation enables our method to maintain the global consistency of the generated dynamic video. On the other hand, we fill in occluded regions with a pre-trained diffusion model that inpaints and outpaints the input image, which allows our method to operate under large camera motions. Benefiting from this design, our method is training-free, saving a significant amount of training time. Experimental results demonstrate the effectiveness of our approach and showcase compelling rendering results.
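As a concrete illustration of the pipeline the abstract describes, the sketch below unprojects a single RGB-D layer into a point cloud, displaces it along an estimated scene flow, and re-renders it under a target camera pose with a simple z-buffer splat. This is a minimal sketch, not the authors' implementation: the function names and the constant toy flow are ours, the paper's point cloud carries learned features rather than raw RGB, and the method builds on multi-layer LDIs rather than a single depth map.

```python
import numpy as np

def unproject(depth, rgb, K):
    """Lift one RGB-D layer into a point cloud in camera coordinates."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    pix = np.stack([u, v, np.ones_like(u)], -1).reshape(-1, 3).astype(float)
    rays = pix @ np.linalg.inv(K).T          # back-projected pixel rays
    return rays * depth.reshape(-1, 1), rgb.reshape(-1, 3)

def splat_render(points, colors, K, R, t, h, w):
    """Project points under pose (R, t) using a nearest-pixel z-buffer splat."""
    cam = points @ R.T + t                   # world -> target camera frame
    z = cam[:, 2]
    keep = z > 1e-6                          # discard points behind the camera
    uv = cam[keep] @ K.T
    uv = np.round(uv[:, :2] / uv[:, 2:3]).astype(int)   # perspective divide
    img = np.zeros((h, w, 3))
    zbuf = np.full((h, w), np.inf)
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    for (x, y), c, d in zip(uv[inside], colors[keep][inside], z[keep][inside]):
        if d < zbuf[y, x]:                   # keep only the nearest surface
            zbuf[y, x], img[y, x] = d, c
    return img                               # unfilled pixels are the disocclusions

# Toy usage (hypothetical numbers): a flat layer drifting under a constant flow.
K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 32.0], [0.0, 0.0, 1.0]])
depth = np.full((64, 64), 2.0)
rgb = np.random.rand(64, 64, 3)
pts, cols = unproject(depth, rgb, K)
flow = np.tile([0.01, 0.0, 0.0], (pts.shape[0], 1))  # stand-in for estimated scene flow
frames = [splat_render(pts + s * flow, cols, K, np.eye(3), np.zeros(3), 64, 64)
          for s in range(3)]                 # per-frame point displacement
```

In the full method, the holes such a splat leaves behind (disocclusions, and regions outside the original field of view under large camera motions) are exactly what the pre-trained diffusion model inpaints and outpaints, which is what lets the approach remain training-free.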




    Information

    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN: 9798400701085
    DOI: 10.1145/3581783
    This work is licensed under a Creative Commons Attribution 4.0 International License.


    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 27 October 2023


    Author Tags

    1. 4D representation
    2. global consistency
    3. inpainting and outpainting
    4. long-term dynamic video synthesis

    Qualifiers

    • Research-article

    Funding Sources

    • The National Natural Science Foundation of China
    • Industry Collaboration Projects (IAF-ICP) Funding Initiative

    Conference

    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa, ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


    Bibliometrics

    Article Metrics

    • Downloads (last 12 months): 439
    • Downloads (last 6 weeks): 26

    Reflects downloads up to 28 Jan 2025


    Cited By

    • (2024) Deep Learning-Based 2.5D Asset Generation Techniques for Virtual Production. Journal of Broadcast Engineering 29, 6, 1010-1025. https://doi.org/10.5909/JBE.2024.29.6.1010. Online publication date: 30-Nov-2024.
    • (2024) SM4Depth: Seamless Monocular Metric Depth Estimation across Multiple Cameras and Scenes by One Model. Proceedings of the 32nd ACM International Conference on Multimedia, 3469-3478. https://doi.org/10.1145/3664647.3681405. Online publication date: 28-Oct-2024.
    • (2024) Streetscapes: Large-scale Consistent Street View Generation Using Autoregressive Video Diffusion. ACM SIGGRAPH 2024 Conference Papers, 1-11. https://doi.org/10.1145/3641519.3657513. Online publication date: 13-Jul-2024.
    • (2024) State of the Art on Diffusion Models for Visual Computing. Computer Graphics Forum 43, 2. https://doi.org/10.1111/cgf.15063. Online publication date: 30-Apr-2024.
    • (2024) S-DyRF: Reference-Based Stylized Radiance Fields for Dynamic Scenes. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 20102-20112. https://doi.org/10.1109/CVPR52733.2024.01900. Online publication date: 16-Jun-2024.
    • (2024) 3D-SceneDreamer: Text-Driven 3D-Consistent Scene Generation. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10170-10180. https://doi.org/10.1109/CVPR52733.2024.00969. Online publication date: 16-Jun-2024.
    • (2024) Hierarchical Patch Diffusion Models for High-Resolution Video Generation. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 7569-7579. https://doi.org/10.1109/CVPR52733.2024.00723. Online publication date: 16-Jun-2024.
    • (2024) DreamMover: Leveraging the Prior of Diffusion Models for Image Interpolation with Large Motion. Computer Vision – ECCV 2024, 336-353. https://doi.org/10.1007/978-3-031-72633-0_19. Online publication date: 22-Nov-2024.
