DOI: 10.1145/3610548.3618160

Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation

Published: 11 December 2023

Abstract

Large text-to-image diffusion models have exhibited impressive proficiency in generating high-quality images. However, when applying these models to the video domain, ensuring temporal consistency across video frames remains a formidable challenge. This paper proposes a novel zero-shot text-guided video-to-video translation framework to adapt image models to videos. The framework includes two parts: key frame translation and full video translation. The first part uses an adapted diffusion model to generate key frames, with hierarchical cross-frame constraints applied to enforce coherence in shapes, textures and colors. The second part propagates the key frames to other frames with temporal-aware patch matching and frame blending. Our framework achieves global style and local texture temporal consistency at a low cost (without re-training or optimization). The adaptation is compatible with existing image diffusion techniques, allowing our framework to take advantage of them, such as customizing a specific subject with LoRA and introducing extra spatial guidance with ControlNet. Extensive experimental results demonstrate the effectiveness of our proposed framework over existing methods in rendering high-quality and temporally-coherent videos. Code is available at our project page: https://www.mmlab-ntu.com/project/rerender/
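To make the two-stage design described in the abstract concrete, the sketch below outlines how key frame translation and propagation could be chained. This is a minimal illustration, not the authors' released code: the function names (translate_key_frame, propagate, rerender) and the fixed key-frame interval are hypothetical placeholders for the adapted diffusion model and the patch-based propagation step; the actual implementation is linked from the project page.

```python
# Minimal sketch of the two-stage pipeline described in the abstract (not the
# authors' released code). Every helper below is a hypothetical placeholder;
# see the project page for the actual implementation.
from typing import Dict, List, Optional

import numpy as np


def translate_key_frame(frame: np.ndarray, prompt: str,
                        prev_key: Optional[np.ndarray]) -> np.ndarray:
    """Hypothetical stand-in for key frame translation with an adapted image
    diffusion model, constrained against the previous key frame so that
    shapes, textures and colors stay coherent."""
    return frame  # placeholder: returns the input unchanged


def propagate(key_a: np.ndarray, key_b: np.ndarray,
              frames: List[np.ndarray]) -> List[np.ndarray]:
    """Hypothetical stand-in for temporal-aware patch matching and frame
    blending between two translated key frames."""
    return list(frames)  # placeholder


def rerender(frames: List[np.ndarray], prompt: str,
             interval: int = 10) -> List[np.ndarray]:
    # Stage 1: translate every `interval`-th frame as a key frame.
    keys: Dict[int, np.ndarray] = {}
    prev: Optional[np.ndarray] = None
    for i in range(0, len(frames), interval):
        keys[i] = prev = translate_key_frame(frames[i], prompt, prev)

    # Stage 2: propagate each pair of adjacent key frames to the frames
    # between them.
    out = list(frames)
    key_ids = sorted(keys)
    for a, b in zip(key_ids, key_ids[1:]):
        out[a:b + 1] = propagate(keys[a], keys[b], frames[a:b + 1])
    # Keep the translated key frames themselves in the output.
    for i, k in keys.items():
        out[i] = k
    return out
```

With real implementations plugged into the two placeholders, a call such as rerender(video_frames, "a cartoon fox in the snow") would translate the key frames first and then fill in the remaining frames by propagation, which is what lets the method avoid per-video re-training or optimization.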

Supplemental Material

MP4 File
Presentation video; full video inputs and results from the paper.
ZIP File
Presentation video; full video inputs and results from the paper.

References

[1]
Omri Avrahami, Ohad Fried, and Dani Lischinski. 2022. Blended latent diffusion. arXiv preprint arXiv:2206.02779 (2022).
[2]
Duygu Ceylan, Chun-Hao Paul Huang, and Niloy J Mitra. 2023. Pix2Video: Video Editing using Image Diffusion. arXiv preprint arXiv:2303.12688 (2023).
[3]
Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, and Mubarak Shah. 2023. Diffusion models in vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023).
[4]
Soheil Darabi, Eli Shechtman, Connelly Barnes, Dan B Goldman, and Pradeep Sen. 2012. Image melding: Combining inconsistent images using patch-based synthesis. ACM Transactions on Graphics 31, 4 (2012), 82–1.
[5]
Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, et al. 2021. CogView: Mastering text-to-image generation via transformers. In Advances in Neural Information Processing Systems, Vol. 34. 19822–19835.
[6]
Patrick Esser, Robin Rombach, and Bjorn Ommer. 2021. Taming transformers for high-resolution image synthesis. In Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition. 12873–12883.
[7]
Jakub Fišer, Ondřej Jamriška, David Simons, Eli Shechtman, Jingwan Lu, Paul Asente, Michal Lukáč, and Daniel Sýkora. 2017. Example-based synthesis of stylized facial animations. ACM Transactions on Graphics (TOG) 36, 4 (2017), 1–11.
[8]
Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. 2022. Make-A-Scene: Scene-based text-to-image generation with human priors. In Proc. European Conf. Computer Vision. Springer, 89–106.
[9]
Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. 2022. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618 (2022).
[10]
Eric Heitz and Fabrice Neyret. 2018. High-performance by-example noise using a histogram-preserving blending operator. Proceedings of the ACM on Computer Graphics and Interactive Techniques 1, 2 (2018), 1–25.
[11]
Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. 2022. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626 (2022).
[12]
Aaron Hertzmann, Charles E. Jacobs, Nuria Oliver, Brian Curless, and David H. Salesin. 2001. Image analogies. In Proc. Conf. Computer Graphics and Interactive Techniques. 327–340.
[13]
Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. 2022a. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303 (2022).
[14]
Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, Vol. 33. 6840–6851.
[15]
Jonathan Ho, Tim Salimans, Alexey A Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. 2022b. Video Diffusion Models. In Advances in Neural Information Processing Systems.
[16]
Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-Rank Adaptation of Large Language Models. In Proc. Int’l Conf. Learning Representations.
[17]
Xun Huang and Serge Belongie. 2017. Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization. In Proc. Int’l Conf. Computer Vision. 1510–1519.
[18]
Ondřej Jamriška, Šárka Sochorová, Ondřej Texler, Michal Lukáč, Jakub Fišer, Jingwan Lu, Eli Shechtman, and Daniel Sýkora. 2019. Stylizing video by example. ACM Transactions on Graphics 38, 4 (2019), 1–11.
[19]
Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. 2023. Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators. arXiv preprint arXiv:2303.13439 (2023).
[20]
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In Proc. European Conf. Computer Vision. Springer, 740–755.
[21]
Shaoteng Liu, Yuechen Zhang, Wenbo Li, Zhe Lin, and Jiaya Jia. 2023. Video-P2P: Video Editing with Cross-attention Control. arXiv preprint arXiv:2303.04761 (2023).
[22]
Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. 2021. SDEdit: Guided image synthesis and editing with stochastic differential equations. In Proc. Int’l Conf. Learning Representations.
[23]
Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. 2022. Null-text Inversion for Editing Real Images using Guided Diffusion Models. arXiv preprint arXiv:2211.09794 (2022).
[24]
Eyal Molad, Eliahu Horwitz, Dani Valevski, Alex Rav Acha, Yossi Matias, Yael Pritch, Yaniv Leviathan, and Yedid Hoshen. 2023. Dreamix: Video diffusion models are general video editors. arXiv preprint arXiv:2302.01329 (2023).
[25]
Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob Mcgrew, Ilya Sutskever, and Mark Chen. 2022. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. In Proc. IEEE Int’l Conf. Machine Learning. 16784–16804.
[26]
Chenyang Qi, Xiaodong Cun, Yong Zhang, Chenyang Lei, Xintao Wang, Ying Shan, and Qifeng Chen. 2023. FateZero: Fusing Attentions for Zero-shot Text-based Video Editing. arXiv preprint arXiv:2303.09535 (2023).
[27]
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In Proc. IEEE Int’l Conf. Machine Learning. PMLR, 8748–8763.
[28]
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research 21, 1 (2020), 5485–5551.
[29]
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. 2022. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125 (2022).
[30]
Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. 2021. Zero-shot text-to-image generation. In Proc. IEEE Int’l Conf. Machine Learning. 8821–8831.
[31]
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition. 10684–10695.
[32]
Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. 2022. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. arXiv preprint arXiv:2208.12242 (2022).
[33]
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. 2022. Photorealistic text-to-image diffusion models with deep language understanding. In Advances in Neural Information Processing Systems, Vol. 35. 36479–36494.
[34]
Chaehun Shin, Heeseung Kim, Che Hyun Lee, Sang-gil Lee, and Sungroh Yoon. 2023. Edit-A-Video: Single Video Editing with Object-Aware Consistency. arXiv preprint arXiv:2303.07945 (2023).
[35]
Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. 2023. Make-A-Video: Text-to-video generation without text-video data. In Proc. Int’l Conf. Learning Representations.
[36]
Jiaming Song, Chenlin Meng, and Stefano Ermon. 2021. Denoising Diffusion Implicit Models. In Proc. Int’l Conf. Learning Representations.
[37]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, Vol. 30.
[38]
Wen Wang, Kangyang Xie, Zide Liu, Hao Chen, Yue Cao, Xinlong Wang, and Chunhua Shen. 2023. Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models. arXiv preprint arXiv:2303.17599 (2023).
[39]
Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Weixian Lei, Yuchao Gu, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. 2022. Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation. arXiv preprint arXiv:2212.11565 (2022).
[40]
Haofei Xu, Jing Zhang, Jianfei Cai, Hamid Rezatofighi, and Dacheng Tao. 2022. GMFlow: Learning Optical Flow via Global Matching. In Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition. 8121–8130.
[41]
Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. 2018. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition. 1316–1324.
[42]
Han Zhang, Jing Yu Koh, Jason Baldridge, Honglak Lee, and Yinfei Yang. 2021. Cross-modal contrastive learning for text-to-image generation. In Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition. 833–842.
[43]
Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. 2017. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proc. Int’l Conf. Computer Vision. 5907–5915.
[44]
Lvmin Zhang and Maneesh Agrawala. 2023. Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543 (2023).
[45]
Minfeng Zhu, Pingbo Pan, Wei Chen, and Yi Yang. 2019. DM-GAN: Dynamic memory generative adversarial networks for text-to-image synthesis. In Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition. 5802–5810.


    Published In

    SA '23: SIGGRAPH Asia 2023 Conference Papers
    December 2023
    1113 pages
    ISBN:9798400703157
    DOI:10.1145/3610548
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].


    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 11 December 2023


    Author Tags

    1. Video translation
    2. off-the-shelf Stable Diffusion
    3. optical flow
    4. temporal consistency

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Funding Sources

    • Singapore MOE AcRF Tier 2
    • RIE2020 Industry Alignment Fund Industry Collaboration Projects (IAF-ICP) Funding Initiative

    Conference

    SA '23
    Sponsor:
    SA '23: SIGGRAPH Asia 2023
    December 12 - 15, 2023
    Sydney, NSW, Australia

    Acceptance Rates

    Overall Acceptance Rate 178 of 869 submissions, 20%

    Article Metrics

    • Downloads (Last 12 months): 343
    • Downloads (Last 6 weeks): 35
    Reflects downloads up to 03 Oct 2024

    Cited By
    • (2024) Lester: Rotoscope Animation through Video Object Segmentation and Tracking. Algorithms 17:8 (330). DOI: 10.3390/a17080330. Online publication date: 30-Jul-2024.
    • (2024) A Survey on Video Diffusion Models. ACM Computing Surveys. DOI: 10.1145/3696415. Online publication date: 18-Sep-2024.
    • (2024) State of the Art on Diffusion Models for Visual Computing. Computer Graphics Forum 43:2. DOI: 10.1111/cgf.15063. Online publication date: 30-Apr-2024.
    • (2024) LatentMan: Generating Consistent Animated Characters using Image Diffusion Models. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 7510–7519. DOI: 10.1109/CVPRW63382.2024.00746. Online publication date: 17-Jun-2024.
    • (2024) VBench: Comprehensive Benchmark Suite for Video Generative Models. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 21807–21818. DOI: 10.1109/CVPR52733.2024.02060. Online publication date: 16-Jun-2024.
    • (2024) CAMEL: CAusal Motion Enhancement Tailored for Lifting Text-Driven Video Editing. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 9079–9088. DOI: 10.1109/CVPR52733.2024.00867. Online publication date: 16-Jun-2024.
    • (2024) Style Injection in Diffusion: A Training-Free Approach for Adapting Large-Scale Diffusion Models for Style Transfer. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 8795–8805. DOI: 10.1109/CVPR52733.2024.00840. Online publication date: 16-Jun-2024.
    • (2024) Fresco: Spatial-Temporal Correspondence for Zero-Shot Video Translation. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 8703–8712. DOI: 10.1109/CVPR52733.2024.00831. Online publication date: 16-Jun-2024.
    • (2024) Video-P2P: Video Editing with Cross-Attention Control. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 8599–8608. DOI: 10.1109/CVPR52733.2024.00821. Online publication date: 16-Jun-2024.
    • (2024) Fairy: Fast Parallelized Instruction-Guided Video-to-Video Synthesis. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 8261–8270. DOI: 10.1109/CVPR52733.2024.00789. Online publication date: 16-Jun-2024.
