
I2VEdit: First-Frame-Guided Video Editing via Image-to-Video Diffusion Models

Published: 03 December 2024

Abstract

The remarkable generative capabilities of diffusion models have motivated extensive research in both image and video editing. Compared to video editing, which faces additional challenges in the time dimension, image editing has witnessed the development of more diverse, high-quality approaches and more capable software like Photoshop. In light of this gap, we introduce a novel and generic solution that extends the applicability of image editing tools to videos by propagating edits from a single frame to the entire video using a pre-trained image-to-video model. Our method, dubbed I2VEdit, adaptively preserves the visual and motion integrity of the source video depending on the extent of the edits, effectively handling global edits, local edits, and moderate shape changes, which existing methods cannot fully achieve. At the core of our method are two main processes: Coarse Motion Extraction to align basic motion patterns with the original video, and Appearance Refinement for precise adjustments using fine-grained attention matching. We also incorporate a skip-interval strategy to mitigate quality degradation from auto-regressive generation across multiple video clips. Experimental results demonstrate our framework’s superior performance in fine-grained video editing, proving its capability to produce high-quality, temporally consistent outputs.
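
To make the two-stage structure described in the abstract concrete, the following is a minimal, hypothetical Python sketch of a first-frame-guided editing loop. All names (extract_coarse_motion, refine_appearance, edit_video, skip_interval) and the toy frame-difference "motion" are illustrative placeholders, not the authors' implementation: in the actual method both stages operate inside a pre-trained image-to-video diffusion model (Stable Video Diffusion), with appearance refined via fine-grained attention matching rather than pixel-space warping.

import numpy as np

def extract_coarse_motion(source_clip: np.ndarray) -> np.ndarray:
    """Stage-1 stand-in: recover a coarse motion signal from the source clip.
    Here we simply use per-frame differences as a placeholder."""
    return np.diff(source_clip.astype(np.float32), axis=0)

def refine_appearance(edited_first_frame: np.ndarray,
                      motion: np.ndarray,
                      source_clip: np.ndarray) -> np.ndarray:
    """Stage-2 stand-in: propagate the edited first frame along the coarse
    motion. A real system would match fine-grained attention features against
    `source_clip` inside the diffusion model; this toy version just adds the
    placeholder motion deltas."""
    frames = [edited_first_frame.astype(np.float32)]
    for delta in motion:
        frames.append(frames[-1] + delta)
    return np.clip(np.stack(frames), 0, 255).astype(np.uint8)

def edit_video(source: np.ndarray,
               edited_first_frame: np.ndarray,
               clip_len: int = 16,
               skip_interval: int = 4) -> np.ndarray:
    """Propagate a single-frame edit through a long video, clip by clip.
    `skip_interval` is a stand-in for the paper's skip-interval idea: instead
    of chaining every clip off the previous one, the anchor frame is only
    updated occasionally, limiting auto-regressive error accumulation."""
    outputs = []
    anchor = edited_first_frame
    for i, start in enumerate(range(0, len(source), clip_len)):
        clip = source[start:start + clip_len]
        motion = extract_coarse_motion(clip)
        edited_clip = refine_appearance(anchor, motion, clip)
        outputs.append(edited_clip)
        if (i + 1) % skip_interval == 0:
            anchor = edited_clip[-1]
    return np.concatenate(outputs, axis=0)

# Toy usage with random data standing in for real video frames.
video = np.random.randint(0, 255, size=(64, 8, 8, 3), dtype=np.uint8)
edited_first = video[0].copy()
result = edit_video(video, edited_first)
print(result.shape)  # (64, 8, 8, 3)

This sketch only mirrors the control flow implied by the abstract (extract motion per clip, refine appearance from the edited first frame, stitch clips with a skip-interval anchor); the quality-critical components live in the diffusion model's sampling and attention layers.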

Supplemental Material

ZIP File
Supplementary and presentation video

    Published In

    SA '24: SIGGRAPH Asia 2024 Conference Papers
    December 2024
    1620 pages
    ISBN:9798400711312
    DOI:10.1145/3680528

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 03 December 2024

    Author Tags

    1. Video Editing
    2. Image Editing
    3. Stable Video Diffusion

    Qualifiers

    • Research-article

    Conference

    SA '24
    SA '24: SIGGRAPH Asia 2024 Conference Papers
    December 3 - 6, 2024
    Tokyo, Japan

    Acceptance Rates

    Overall Acceptance Rate 178 of 869 submissions, 20%
