Research article | Open access
DOI: 10.1145/3581783.3612033

Make-It-4D: Synthesizing a Consistent Long-Term Dynamic Scene Video from a Single Image

Published: 27 October 2023

Abstract

We study the problem of synthesizing a long-term dynamic video from a single image. This is challenging because it requires consistent movement of visual content under large camera motions. Existing methods either hallucinate inconsistent perpetual views or struggle with long camera trajectories. Addressing these issues requires estimating the underlying 4D representation (3D geometry plus scene motion) and filling in the regions it leaves occluded. To this end, we present Make-It-4D, a novel method that generates a consistent long-term dynamic video from a single image. On the one hand, we represent the scene with layered depth images (LDIs), which are unprojected to form a feature point cloud. To animate the visual content, the feature point cloud is displaced according to the scene flow derived from motion estimation and the corresponding camera pose. This 4D representation enables our method to maintain the global consistency of the generated dynamic video. On the other hand, we fill in occluded regions with a pre-trained diffusion model that inpaints and outpaints the input image, which allows our method to operate under large camera motions. Benefiting from this design, our method is training-free, saving a significant amount of training time. Experimental results demonstrate the effectiveness of our approach and showcase compelling rendering results.
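As a concrete illustration of the pipeline the abstract describes, the sketch below unprojects a single RGB-D layer into a point cloud, displaces it along an estimated scene flow, and re-renders it under a target camera pose with a simple z-buffer splat. This is a minimal sketch, not the authors' implementation: the function names and the constant toy flow are ours, the paper's point cloud carries learned features rather than raw RGB, and the method builds on multi-layer LDIs rather than a single depth map.

```python
import numpy as np

def unproject(depth, rgb, K):
    """Lift one RGB-D layer into a point cloud in camera coordinates."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    pix = np.stack([u, v, np.ones_like(u)], -1).reshape(-1, 3).astype(float)
    rays = pix @ np.linalg.inv(K).T          # back-projected pixel rays
    return rays * depth.reshape(-1, 1), rgb.reshape(-1, 3)

def splat_render(points, colors, K, R, t, h, w):
    """Project points under pose (R, t) using a nearest-pixel z-buffer splat."""
    cam = points @ R.T + t                   # world -> target camera frame
    z = cam[:, 2]
    keep = z > 1e-6                          # discard points behind the camera
    uv = cam[keep] @ K.T
    uv = np.round(uv[:, :2] / uv[:, 2:3]).astype(int)   # perspective divide
    img = np.zeros((h, w, 3))
    zbuf = np.full((h, w), np.inf)
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    for (x, y), c, d in zip(uv[inside], colors[keep][inside], z[keep][inside]):
        if d < zbuf[y, x]:                   # keep only the nearest surface
            zbuf[y, x], img[y, x] = d, c
    return img                               # unfilled pixels are the disocclusions

# Toy usage (hypothetical numbers): a flat layer drifting under a constant flow.
K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 32.0], [0.0, 0.0, 1.0]])
depth = np.full((64, 64), 2.0)
rgb = np.random.rand(64, 64, 3)
pts, cols = unproject(depth, rgb, K)
flow = np.tile([0.01, 0.0, 0.0], (pts.shape[0], 1))  # stand-in for estimated scene flow
frames = [splat_render(pts + s * flow, cols, K, np.eye(3), np.zeros(3), 64, 64)
          for s in range(3)]                 # per-frame point displacement
```

In the full method, the holes such a splat leaves behind (disocclusions, and regions outside the original field of view under large camera motions) are exactly what the pre-trained diffusion model inpaints and outpaints, which is what lets the approach remain training-free.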




    Information

    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN: 9798400701085
    DOI: 10.1145/3581783
    This work is licensed under a Creative Commons Attribution 4.0 International License.


    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 27 October 2023


    Author Tags

    1. 4D representation
    2. global consistency
    3. inpainting and outpainting
    4. long-term dynamic video synthesis

    Qualifiers

    • Research-article

    Funding Sources

    • The National Natural Science Foundation of China
    • Industry Collaboration Projects (IAF-ICP) Funding Initiative

    Conference

    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa, ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


    Bibliometrics

    Article Metrics

    • Downloads (last 12 months): 439
    • Downloads (last 6 weeks): 26

    Reflects downloads up to 28 Jan 2025


    Cited By

    • (2024) Deep Learning-Based 2.5D Asset Generation Techniques for Virtual Production. Journal of Broadcast Engineering 29, 6, 1010-1025. https://doi.org/10.5909/JBE.2024.29.6.1010. Online publication date: 30-Nov-2024.
    • (2024) SM4Depth: Seamless Monocular Metric Depth Estimation across Multiple Cameras and Scenes by One Model. Proceedings of the 32nd ACM International Conference on Multimedia, 3469-3478. https://doi.org/10.1145/3664647.3681405. Online publication date: 28-Oct-2024.
    • (2024) Streetscapes: Large-scale Consistent Street View Generation Using Autoregressive Video Diffusion. ACM SIGGRAPH 2024 Conference Papers, 1-11. https://doi.org/10.1145/3641519.3657513. Online publication date: 13-Jul-2024.
    • (2024) State of the Art on Diffusion Models for Visual Computing. Computer Graphics Forum 43, 2. https://doi.org/10.1111/cgf.15063. Online publication date: 30-Apr-2024.
    • (2024) S-DyRF: Reference-Based Stylized Radiance Fields for Dynamic Scenes. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 20102-20112. https://doi.org/10.1109/CVPR52733.2024.01900. Online publication date: 16-Jun-2024.
    • (2024) 3D-SceneDreamer: Text-Driven 3D-Consistent Scene Generation. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10170-10180. https://doi.org/10.1109/CVPR52733.2024.00969. Online publication date: 16-Jun-2024.
    • (2024) Hierarchical Patch Diffusion Models for High-Resolution Video Generation. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 7569-7579. https://doi.org/10.1109/CVPR52733.2024.00723. Online publication date: 16-Jun-2024.
    • (2024) DreamMover: Leveraging the Prior of Diffusion Models for Image Interpolation with Large Motion. Computer Vision – ECCV 2024, 336-353. https://doi.org/10.1007/978-3-031-72633-0_19. Online publication date: 22-Nov-2024.
