A Survey on Video Diffusion Models

Published: 07 November 2024

Abstract

The recent wave of AI-generated content (AIGC) has witnessed substantial success in computer vision, with diffusion models playing a crucial role in this achievement. Due to their impressive generative capabilities, diffusion models are gradually superseding methods based on GANs and auto-regressive Transformers, demonstrating exceptional performance not only in image generation and editing but also in video-related research. However, existing surveys mainly focus on diffusion models in the context of image generation, with few up-to-date reviews of their application in the video domain. To address this gap, this article presents a comprehensive review of video diffusion models in the AIGC era. Specifically, we begin with a concise introduction to the fundamentals and evolution of diffusion models. We then present an overview of research on diffusion models in the video domain, categorizing the work into three key areas: video generation, video editing, and other video understanding tasks. We thoroughly review the literature in these three areas, providing finer-grained categorization and summarizing the practical contributions in each. Finally, we discuss the challenges faced by research in this domain and outline potential future trends. A comprehensive list of the video diffusion models studied in this survey is available at https://github.com/ChenHsing/Awesome-Video-Diffusion-Models.

References

[1]
Hamed Alqahtani, Manolya Kavakli-Thorne, Gulshan Kumar, and Ferozepur SBSSTC. 2019. An analysis of evaluation metrics of GANs. In Proceedings of the International Conference on Information Technology and Applications.
[2]
Jie An, Songyang Zhang, Harry Yang, Sonal Gupta, Jia-Bin Huang, Jiebo Luo, and Xi Yin. 2023. Latent-shift: Latent diffusion with temporal shift for efficient text-to-video generation. arXiv:2304.08477. Retrieved from https://arxiv.org/abs/2304.08477
[3]
Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. 2017. Localizing moments in video with natural language. In Proceedings of the IEEE International Conference on Computer Vision.
[4]
Omri Avrahami, Dani Lischinski, and Ohad Fried. 2022. Blended diffusion for text-driven editing of natural images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[5]
Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. 2021. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE International Conference on Computer Vision.
[6]
Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, et al. 2022. ediffi: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv:2211.01324. Retrieved from https://arxiv.org/abs/2211.01324
[7]
Omer Bar-Tal, Dolev Ofri-Amar, Rafail Fridman, Yoni Kasten, and Tali Dekel. 2022. Text2live: Text-driven layered image and video editing. In Proceedings of the European Conference on Computer Vision.
[8]
Georgios Batzolis, Jan Stanczuk, Carola-Bibiane Schönlieb, and Christian Etmann. 2021. Conditional image generation with score-based diffusion models. arXiv:2111.13606. Retrieved from https://arxiv.org/abs/2111.13606
[9]
Dan Bigioi, Shubhajit Basak, Hugh Jordan, Rachel McDonnell, and Peter Corcoran. 2023. Speech driven video editing via an audio-conditioned diffusion model. arXiv:2301.04474. Retrieved from https://arxiv.org/abs/2301.04474
[10]
Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. 2023. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv:2311.15127. Retrieved from https://arxiv.org/abs/2311.15127
[11]
Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. 2023. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[12]
Sam Bond-Taylor, Peter Hessey, Hiroshi Sasaki, Toby P. Breckon, and Chris G. Willcocks. 2022. Unleashing transformers: Parallel token prediction with discrete absorbing diffusion for fast high-resolution image generation from vector-quantized codes. In Proceedings of the European Conference on Computer Vision.
[13]
Tim Brooks, Janne Hellsten, Miika Aittala, Ting-Chun Wang, Timo Aila, Jaakko Lehtinen, Ming-Yu Liu, Alexei Efros, and Tero Karras. 2022. Generating long videos of dynamic scenes. In Proceedings of the Advances in Neural Information Processing Systems.
[14]
Tim Brooks, Aleksander Holynski, and Alexei A. Efros. 2023. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[15]
Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. 2017. Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[16]
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. 2020. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision.
[17]
Joao Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, and Andrew Zisserman. 2018. A short note about kinetics-600. arXiv:1808.01340. Retrieved from https://arxiv.org/abs/1808.01340
[18]
Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[19]
Duygu Ceylan, Chun-Hao Paul Huang, and Niloy J. Mitra. 2023. Pix2video: Video editing using image diffusion. In Proceedings of the IEEE International Conference on Computer Vision.
[20]
Wenhao Chai, Xun Guo, Gaoang Wang, and Yan Lu. 2023. StableVideo: Text-driven consistency-aware diffusion video editing. In Proceedings of the IEEE International Conference on Computer Vision.
[21]
Matthew Chang, Aditya Prakash, and Saurabh Gupta. 2023. Look ma, no hands! Agent-environment factorization of egocentric videos. arXiv:2305.16301. Retrieved from https://arxiv.org/abs/2305.16301
[22]
Shoufa Chen, Peize Sun, Yibing Song, and Ping Luo. 2022. Diffusiondet: Diffusion model for object detection. In Proceedings of the IEEE International Conference on Computer Vision.
[23]
Shoufa Chen, Mengmeng Xu, Jiawei Ren, Yuren Cong, Sen He, Yanping Xie, Animesh Sinha, Ping Luo, Tao Xiang, and Juan-Manuel Perez-Rua. 2024. GenTron: Diffusion transformers for image and video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[24]
Ting Chen and Lala Li. 2023. FIT: Far-reaching interleaved transformers. arXiv:2305.12689. Retrieved from https://arxiv.org/abs/2305.12689
[25]
Ting Chen, Lala Li, Saurabh Saxena, Geoffrey Hinton, and David J. Fleet. 2022. A generalist framework for panoptic segmentation of images and videos. In Proceedings of the IEEE International Conference on Computer Vision.
[26]
Ting Chen, Ruixiang Zhang, and Geoffrey Hinton. 2023. Analog bits: Generating discrete data using diffusion models with self-conditioning. In ICLR.
[27]
Tsai-Shien Chen, Chieh Hubert Lin, Hung-Yu Tseng, Tsung-Yi Lin, and Ming-Hsuan Yang. 2023. Motion-conditioned diffusion model for controllable video synthesis. arXiv:2304.14404. Retrieved from https://arxiv.org/abs/2304.14404
[28]
Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, et al. 2024. Panda-70m: Captioning 70m videos with multiple cross-modality teachers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[29]
Weifeng Chen, Jie Wu, Pan Xie, Hefeng Wu, Jiashi Li, Xin Xia, Xuefeng Xiao, and Liang Lin. 2023. Control-A-Video: Controllable text-to-video generation with diffusion models. arXiv:2305.13840. Retrieved from https://arxiv.org/abs/2305.13840
[30]
Yutao Chen, Xingning Dong, Tian Gan, Chunluan Zhou, Ming Yang, and Qingpei Guo. 2023. EVE: Efficient zero-shot text-based video editing with depth map guidance and temporal consistency constraints. arXiv:2308.10648. Retrieved from https://arxiv.org/abs/2308.10648
[31]
Zijiao Chen, Jiaxin Qing, and Juan Helen Zhou. 2023. Cinematic mindscapes: High-quality video reconstruction from brain activity. arXiv:2305.11675. Retrieved from https://arxiv.org/abs/2305.11675
[32]
Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, et al. 2023. Vicuna: An Open-source Chatbot Impressing gpt-4 with 90%* Chatgpt Quality.
[33]
Iya Chivileva, Philip Lynch, Tomas E. Ward, and Alan F. Smeaton. 2023. Measuring the quality of text-to-video model outputs: Metrics and dataset. arXiv:2309.08009. Retrieved from https://arxiv.org/abs/2309.08009
[34]
Jooyoung Choi, Sungwon Kim, Yonghyun Jeong, Youngjune Gwon, and Sungroh Yoon. 2021. Ilvr: Conditioning method for denoising diffusion probabilistic models. In Proceedings of the IEEE International Conference on Computer Vision.
[35]
Ernie Chu, Tzuhsuan Huang, Shuo-Yen Lin, and Jun-Cheng Chen. 2023. MeDM: Mediating image diffusion models for video-to-video translation with temporal correspondence guidance. arXiv:2308.10079. Retrieved from https://arxiv.org/abs/2308.10079
[36]
Yuren Cong, Mengmeng Xu, Christian Simon, Shoufa Chen, Jiawei Ren, Yanping Xie, Juan-Manuel Perez-Rua, Bodo Rosenhahn, Tao Xiang, and Sen He. 2023. FLATTEN: Optical FLow-guided ATTENtion for consistent text-to-video editing. arXiv:2310.05922. Retrieved from https://arxiv.org/abs/2310.05922
[37]
Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. 2016. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[38]
Paul Couairon, Clément Rambour, Jean-Emmanuel Haugeard, and Nicolas Thome. 2023. VidEdit: Zero-shot and spatially aware text-driven video editing. arXiv:2306.08707. Retrieved from https://arxiv.org/abs/2306.08707
[39]
Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. 2018. Scaling egocentric vision: The epic-kitchens dataset. In Proceedings of the European Conference on Computer Vision.
[40]
Duolikun Danier, Fan Zhang, and David Bull. 2023. LDMVFI: Video frame interpolation with latent diffusion models. arXiv:2303.09508. Retrieved from https://arxiv.org/abs/2303.09508
[41]
Ming Ding, Wendi Zheng, Wenyi Hong, and Jie Tang. 2022. Cogview2: Faster and better text-to-image generation via hierarchical transformers. In Proceedings of the Advances in Neural Information Processing Systems.
[42]
Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. 2017. CARLA: An open urban driving simulator. In Proceedings of the Conference on Robot Learning.
[43]
Zhongjie Duan, Chengyu Wang, Cen Chen, Weining Qian, and Jun Huang. 2024. Diffutoon: High-resolution editable toon shading via diffusion models. arXiv:2401.16224. Retrieved from https://arxiv.org/abs/2401.16224
[44]
Zhongjie Duan, Lizhou You, Chengyu Wang, Cen Chen, Ziheng Wu, Weining Qian, Jun Huang, Fei Chao, and Rongrong Ji. 2023. DiffSynth: Latent in-iteration deflickering for realistic video synthesis. arXiv:2308.03463. Retrieved from https://arxiv.org/abs/2308.03463
[45]
Frederik Ebert, Chelsea Finn, Alex X. Lee, and Sergey Levine. 2017. Self-supervised visual planning with temporal skip connections. Conference on Robot Learning (2017).
[46]
Frederik Ebert, Yanlai Yang, Karl Schmeckpeper, Bernadette Bucher, Georgios Georgakis, Kostas Daniilidis, Chelsea Finn, and Sergey Levine. 2022. Bridge data: Boosting generalization of robotic skills with cross-domain datasets. Robotics: Science and Systems (2022).
[47]
Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail, and Huaming Wang. 2023. Clap learning audio concepts from natural language supervision. In Proceedings of the ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing.
[48]
Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. 2023. Structure and content-guided video synthesis with diffusion models. In Proceedings of the IEEE International Conference on Computer Vision.
[49]
Patrick Esser, Robin Rombach, and Bjorn Ommer. 2021. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[50]
Fanda Fan, Chunjie Luo, Jianfeng Zhan, and Wanling Gao. 2024. AIGCBench: Comprehensive evaluation of image-to-video content generated by AI. arXiv:2401.01651. Retrieved from https://arxiv.org/abs/2401.01651
[51]
Alireza Fathi, Xiaofeng Ren, and James M. Rehg. 2011. Learning to recognize objects in egocentric activities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[52]
Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, and Tat-Seng Chua. 2023. Empowering dynamics-aware text-to-video diffusion with large language models. arXiv:2308.13812. Retrieved from https://arxiv.org/abs/2308.13812
[53]
Hao Fei, Shengqiong Wu, Hanwang Zhang, Tat-Seng Chua, and Shuicheng Yan. 2024. VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing.
[54]
Qijun Feng, Zhen Xing, Zuxuan Wu, and Yu-Gang Jiang. 2024. Fdgaussian: Fast gaussian splatting from single image via geometric-aware diffusion model. arXiv:2403.10242. Retrieved from https://arxiv.org/abs/2403.10242
[55]
Runyang Feng, Yixing Gao, Tze Ho Elden Tse, Xueqing Ma, and Hyung Jin Chang. 2023. DiffPose: SpatioTemporal diffusion model for video-based human pose estimation. In Proceedings of the IEEE International Conference on Computer Vision.
[56]
Ruoyu Feng, Wenming Weng, Yanhui Wang, Yuhui Yuan, Jianmin Bao, Chong Luo, Zhibo Chen, and Baining Guo. 2023. CCEdit: Creative and controllable video editing via diffusion models. arXiv:2309.16496. Retrieved from https://arxiv.org/abs/2309.16496
[57]
Alessandro Flaborea, Luca Collorone, Guido D’Amely, Stefano D’Arrigo, Bardh Prenkaj, and Fabio Galasso. 2023. Multimodal motion conditioned diffusion model for skeleton-based video anomaly detection. In Proceedings of the IEEE International Conference on Computer Vision.
[58]
Tsu-Jui Fu, Licheng Yu, Ning Zhang, Cheng-Yang Fu, Jong-Chyi Su, William Yang Wang, and Sean Bell. 2023. Tell me what happened: Unifying text-guided video completion via multimodal masked video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[59]
Zhe Gan, Linjie Li, Chunyuan Li, Lijuan Wang, Zicheng Liu, Jianfeng Gao, et al. 2022. Vision-language pre-training: Basics, recent advances, and future trends. Found. Trends Comput. Graph. Vis. (2022).
[60]
Yu Gao, Jiancheng Huang, Xiaopeng Sun, Zequn Jie, Yujie Zhong, and Lin Ma. 2024. Matten: Video generation with mamba-attention. arXiv:1409.0473. Retrieved from https://arxiv.org/abs/1701.00133
[61]
Songwei Ge, Thomas Hayes, Harry Yang, Xi Yin, Guan Pang, David Jacobs, Jia-Bin Huang, and Devi Parikh. 2022. Long video generation with time-agnostic vqgan and time-sensitive transformer. In Proceedings of the European Conference on Computer Vision.
[62]
Songwei Ge, Aniruddha Mahapatra, Gaurav Parmar, Jun-Yan Zhu, and Jia-Bin Huang. 2024. On the content bias in Fréchet video distance. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7277–7288.
[63]
Songwei Ge, Seungjun Nah, Guilin Liu, Tyler Poon, Andrew Tao, Bryan Catanzaro, David Jacobs, Jia-Bin Huang, Ming-Yu Liu, and Yogesh Balaji. 2023. Preserve your own correlation: A noise prior for video diffusion models. arXiv:2305.10474. Retrieved from https://arxiv.org/abs/2305.10474
[64]
Michal Geyer, Omer Bar-Tal, Shai Bagon, and Tali Dekel. 2023. TokenFlow: Consistent diffusion features for consistent video editing. arXiv:2307.10373. Retrieved from https://arxiv.org/abs/2307.10373
[65]
Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, and Ishan Misra. 2023. Emu video: Factorizing text-to-video generation by explicit image conditioning. arXiv:2311.10709. Retrieved from https://arxiv.org/abs/2311.10709
[66]
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2020. Generative adversarial networks. Communicatons of the ACM (2020).
[67]
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems.
[68]
Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. 2017. The “something something” video database for learning and evaluating visual common sense. In Proceedings of the IEEE International Conference on Computer Vision.
[69]
Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. 2022. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[70]
Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. 2012. A kernel two-sample test. J Mach Learn Res (2012).
[71]
Jiaxi Gu, Shicong Wang, Haoyu Zhao, Tianyi Lu, Xing Zhang, Zuxuan Wu, Songcen Xu, Wei Zhang, Yu-Gang Jiang, and Hang Xu. 2023. Reuse and diffuse: Iterative denoising for text-to-video generation. arXiv:2309.03549. Retrieved from https://arxiv.org/abs/2309.03549
[72]
Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. 2022. Vector quantized diffusion model for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[73]
Xianfan Gu, Chuan Wen, Jiaming Song, and Yang Gao. 2023. Seer: Language instructed video prediction with latent diffusion models. arXiv:2303.14897. Retrieved from https://arxiv.org/abs/2303.14897
[74]
Zhangxuan Gu, Haoxing Chen, Zhuoer Xu, Jun Lan, Changhua Meng, and Weiqiang Wang. 2022. Diffusioninst: Diffusion model for instance segmentation. arXiv:2212.02773. Retrieved from https://arxiv.org/abs/2212.02773
[75]
Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, and Bo Dai. 2023. AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv:2307.04725. Retrieved from https://arxiv.org/abs/2307.04725
[76]
Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Li Fei-Fei, Irfan Essa, Lu Jiang, and José Lezama. 2023. Photorealistic video generation with diffusion models. arXiv:2312.06662. Retrieved from https://arxiv.org/abs/2312.06662
[77]
William Harvey, Saeid Naderiparizi, Vaden Masrani, Christian Weilbach, and Frank Wood. 2022. Flexible diffusion modeling of long videos. In Proceedings of the Advances in Neural Information Processing Systems.
[78]
Yingqing He, Menghan Xia, Haoxin Chen, Xiaodong Cun, Yuan Gong, Jinbo Xing, Yong Zhang, Xintao Wang, Chao Weng, Ying Shan, et al. 2023. Animate-A-Story: Storytelling with retrieval-augmented video generation. arXiv:2307.06940. Retrieved from https://arxiv.org/abs/2307.06940
[79]
Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. 2022. Latent video diffusion models for high-fidelity video generation with arbitrary lengths. arXiv:2211.13221. Retrieved from https://arxiv.org/abs/2211.13221
[80]
Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. 2023. Prompt-to-prompt image editing with cross attention control. In ICLR.
[81]
Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, et al. 2022. Imagen video: High definition video generation with diffusion models. arXiv:2210.02303. Retrieved from https://arxiv.org/abs/2210.02303
[82]
Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. In Proceedings of the Advances in Neural Information Processing Systems.
[83]
Jonathan Ho, Chitwan Saharia, William Chan, David J. Fleet, Mohammad Norouzi, and Tim Salimans. 2022. Cascaded diffusion models for high fidelity image generation. J. Mach. Learn. Res. (2022).
[84]
Jonathan Ho and Tim Salimans. 2021. Classifier-free diffusion guidance. In Proceedings of the Advances in Neural Information Processing Systems.
[85]
Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J. Fleet. 2022. Video diffusion models. In Proceedings of the Advances in Neural Information Processing Systems.
[86]
Susung Hong, Junyoung Seo, Sunghwan Hong, Heeseong Shin, and Seungryong Kim. 2023. Large language models are frame-level directors for zero-shot text-to-video generation. arXiv:2305.14330. Retrieved from https://arxiv.org/abs/2305.14330
[87]
Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. 2023. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. In ICLR.
[88]
Tobias Höppe, Arash Mehrjou, Stefan Bauer, Didrik Nielsen, and Andrea Dittadi. 2022. Diffusion models for video prediction and infilling. Trans. Mach. Learn. Res. (2022).
[89]
Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. Lora: Low-rank adaptation of large language models. In ICLR.
[90]
Li Hu. 2024. Animate anyone: Consistent and controllable image-to-video synthesis for character animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8153–8163.
[91]
Yaosi Hu, Zhenzhong Chen, and Chong Luo. 2023. LaMD: Latent motion diffusion for video generation. arXiv:2304.11603. Retrieved from https://arxiv.org/abs/2304.11603
[92]
Zhihao Hu and Dong Xu. 2023. VideoControlNet: A motion-guided video-to-video translation framework by using diffusion model with ControlNet. arXiv:2307.14073. Retrieved from https://arxiv.org/abs/2307.14073
[93]
Chao Huang, Susan Liang, Yapeng Tian, Anurag Kumar, and Chenliang Xu. 2023. DAVIS: High-quality audio-visual separation with generative diffusion models. arXiv:2308.00122. Retrieved from https://arxiv.org/abs/2308.00122
[94]
Hanzhuo Huang, Yufan Feng, Cheng Shi, Lan Xu, Jingyi Yu, and Sibei Yang. 2023. Free-Bloom: Zero-shot text-to-video generator with LLM director and LDM animator. In Proceedings of the Advances in Neural Information Processing Systems.
[95]
Nisha Huang, Yuxin Zhang, and Weiming Dong. 2023. Style-A-Video: Agile diffusion for arbitrary text-based video style transfer. arXiv:2305.05464. Retrieved from https://arxiv.org/abs/2305.05464
[96]
Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. 2024. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[97]
Haroon Idrees, Amir R. Zamir, Yu-Gang Jiang, Alex Gorban, Ivan Laptev, Rahul Sukthankar, and Mubarak Shah. 2017. The thumos challenge on action recognition for videos “in the wild”. Computer Vision and Image Understanding (2017).
[98]
Yasamin Jafarian and Hyun Soo Park. 2021. Learning high fidelity depths of dressed humans by watching social media dance videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[99]
Ondrej Jamriska. 2018. Ebsynth: Fast Example-based Image Synthesis and Style Transfer.
[100]
Hyeonho Jeong and Jong Chul Ye. 2023. Ground-A-Video: Zero-shot grounded video editing using text-to-image diffusion models. arXiv:2310.01107. Retrieved from https://arxiv.org/abs/2310.01107
[101]
Yujin Jeong, Wonjeong Ryoo, Seunghyun Lee, Dabin Seo, Wonmin Byeon, Sangpil Kim, and Jinkyu Kim. 2023. The power of sound (TPoS): Audio reactive video generation with stable diffusion. In Proceedings of the IEEE International Conference on Computer Vision.
[102]
Yuanfeng Ji, Zhe Chen, Enze Xie, Lanqing Hong, Xihui Liu, Zhaoqiang Liu, Tong Lu, Zhenguo Li, and Ping Luo. 2023. Ddp: Diffusion model for dense visual prediction. arXiv:2303.17559. Retrieved from https://arxiv.org/abs/2303.17559
[103]
Yifan Jiang, Han Chen, and Hanseok Ko. 2023. Spatial-temporal transformer-guided diffusion based data augmentation for efficient skeleton-based action recognition. arXiv:2302.13434. Retrieved from https://arxiv.org/abs/2302.13434
[104]
Yuming Jiang, Shuai Yang, Tong Liang Koh, Wayne Wu, Chen Change Loy, and Ziwei Liu. 2023. Text2Performer: Text-driven human video generation. In Proceedings of the IEEE International Conference on Computer Vision.
[105]
Peng Jin, Hao Li, Zesen Cheng, Kehan Li, Xiangyang Ji, Chang Liu, Li Yuan, and Jie Chen. 2023. Diffusionret: Generative text-video retrieval with diffusion model. In Proceedings of the IEEE International Conference on Computer Vision.
[106]
Nazmul Karim, Umar Khalid, Mohsen Joneidi, Chen Chen, and Nazanin Rahnavard. 2023. SAVE: Spectral-shift-aware adaptation of image diffusion models for text-guided video editing. arXiv:2305.18670. Retrieved from https://arxiv.org/abs/2305.18670
[107]
Johanna Karras, Aleksander Holynski, Ting-Chun Wang, and Ira Kemelmacher-Shlizerman. 2023. Dreampose: Fashion image-to-video synthesis via stable diffusion. arXiv:2304.06025. Retrieved from https://arxiv.org/abs/2304.06025
[108]
Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. 2022. Elucidating the design space of diffusion-based generative models. In Proceedings of the Advances in Neural Information Processing Systems.
[109]
Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. 2021. Alias-free generative adversarial networks. In Proceedings of the Advances in Neural Information Processing Systems.
[110]
Tero Karras, Samuli Laine, and Timo Aila. 2019. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[111]
Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. 2020. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[112]
Bahjat Kawar, Michael Elad, Stefano Ermon, and Jiaming Song. 2022. Denoising diffusion restoration models. In Proceedings of the Advances in Neural Information Processing Systems.
[113]
Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. 2023. Text2video-zero: Text-to-image diffusion models are zero-shot video generators. In Proceedings of the IEEE International Conference on Computer Vision.
[114]
Anant Khandelwal. 2023. InFusion: Inject and attention fusion for multi concept zero shot text based video editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
[115]
Gyeongman Kim, Hajin Shim, Hyunsu Kim, Yunjey Choi, Junho Kim, and Eunho Yang. 2023. Diffusion video autoencoders: Toward temporally consistent face video editing via disentangled video encoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[116]
Pum Jun Kim, Seojun Kim, and Jaejun Yoo. 2024. STREAM: Spatio-TempoRal evaluation and analysis metric for video generative models. arXiv:2403.09669. Retrieved from https://arxiv.org/abs/2403.09669
[117]
Subin Kim, Kyungmin Lee, June Suk Choi, Jongheon Jeong, Kihyuk Sohn, and Jinwoo Shin. 2023. Collaborative score distillation for consistent visual synthesis. In Proceedings of the Advances in Neural Information Processing Systems.
[118]
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, et al. 2023. Segment anything. In Proceedings of the IEEE International Conference on Computer Vision.
[119]
Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. 2017. Dense-captioning events in videos. In Proceedings of the IEEE International Conference on Computer Vision.
[120]
Hilde Kuehne, Ali Arslan, and Thomas Serre. 2014. The language of actions: Recovering the syntax and semantics of goal-directed human activities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[121]
DeepFloyd Lab. 2023. DeepFloyd IF. Retrieved from https://github.com/deep-floyd/IF
[122]
Ariel Lapid, Idan Achituve, Lior Bracha, and Ethan Fetaya. 2023. GD-VDM: Generated depth for better diffusion-based video generation. arXiv:2306.11173. Retrieved from https://arxiv.org/abs/2306.11173
[123]
Guillaume Le Moing, Jean Ponce, and Cordelia Schmid. 2021. CCVS: Context-aware controllable video synthesis. In Proceedings of the Advances in Neural Information Processing Systems.
[124]
Seungwoo Lee, Chaerin Kong, Donghyeon Jeon, and Nojun Kwak. 2023. AADiff: Audio-aligned video synthesis with text-to-image diffusion. In CVPRW.
[125]
Seung Hyun Lee, Sieun Kim, Innfarn Yoo, Feng Yang, Donghyeon Cho, Youngseo Kim, Huiwen Chang, Jinkyu Kim, and Sangpil Kim. 2023. Soundini: Sound-guided diffusion for natural video editing. arXiv:2304.06818. Retrieved from https://arxiv.org/abs/2304.06818
[126]
Taegyeong Lee, Soyeong Kwon, and Taehwan Kim. 2024. Grid diffusion models for text-to-video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[127]
Yao-Chih Lee, Ji-Ze Genevieve Jang, Yi-Ting Chen, Elizabeth Qiu, and Jia-Bin Huang. 2023. Shape-aware text-driven layered video editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[128]
Chenghao Li, Chaoning Zhang, Atish Waghwase, Lik-Hang Lee, Francois Rameau, Yang Yang, Sung-Ho Bae, and Choong Seon Hong. 2023. Generative AI meets 3D: A survey on text-to-3D in AIGC era. arXiv:2305.06131. Retrieved from https://arxiv.org/abs/2305.06131
[129]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the International Conference on Machine Learning.
[130]
Pandeng Li, Chen-Wei Xie, Hongtao Xie, Liming Zhao, Lei Zhang, Yun Zheng, Deli Zhao, and Yongdong Zhang. 2023. MomentDiff: Generative video moment retrieval from random to real. In Proceedings of the Advances in Neural Information Processing Systems.
[131]
Shaoxu Li. 2023. Instruct-Video2Avatar: Video-to-avatar generation with instructions. arXiv:2306.02903. Retrieved from https://arxiv.org/abs/2306.02903
[132]
Xin Li, Wenqing Chu, Ye Wu, Weihang Yuan, Fanglong Liu, Qi Zhang, Fu Li, Haocheng Feng, Errui Ding, and Jingdong Wang. 2023. VideoGen: A reference-guided latent diffusion approach for high definition text-to-video generation. arXiv:2309.00398. Retrieved from https://arxiv.org/abs/2309.00398
[133]
Zhengqi Li, Richard Tucker, Noah Snavely, and Aleksander Holynski. 2023. Generative image dynamics. arXiv:2309.07906. Retrieved from https://arxiv.org/abs/2309.07906
[134]
Long Lian, Baifeng Shi, Adam Yala, Trevor Darrell, and Boyi Li. 2023. LLM-grounded video diffusion models. arXiv:2309.17444. Retrieved from https://arxiv.org/abs/2309.17444
[135]
Feng Liang, Bichen Wu, Jialiang Wang, Licheng Yu, Kunpeng Li, Yinan Zhao, Ishan Misra, Jia-Bin Huang, Peizhao Zhang, Peter Vajda, et al. 2024. Flowvid: Taming imperfect optical flows for consistent video-to-video synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[136]
Jun Hao Liew, Hanshu Yan, Jianfeng Zhang, Zhongcong Xu, and Jiashi Feng. 2023. Magicedit: High-fidelity and temporally coherent video editing. arXiv:2308.14749. Retrieved from https://arxiv.org/abs/2308.14749
[137]
Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. 2023. Magic3d: High-resolution text-to-3d content creation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[138]
Han Lin, Abhay Zala, Jaemin Cho, and Mohit Bansal. 2023. VideoDirectorGPT: Consistent multi-scene video generation via LLM-guided planning. arXiv:2309.15091. Retrieved from https://arxiv.org/abs/2309.15091
[139]
Ji Lin, Chuang Gan, and Song Han. 2019. Tsm: Temporal shift module for efficient video understanding. In Proceedings of the IEEE International Conference on Computer Vision.
[140]
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer Vision.
[141]
Binhui Liu, Xin Liu, Anbo Dai, Zhiyong Zeng, Zhen Cui, and Jian Yang. 2023. Dual-stream diffusion net for text-to-video generation. arXiv:2308.08316. Retrieved from https://arxiv.org/abs/2308.08316
[142]
Daochang Liu, Qiyue Li, AnhDung Dinh, Tingting Jiang, Mubarak Shah, and Chang Xu. 2023. Diffusion action segmentation. In Proceedings of the IEEE International Conference on Computer Vision.
[143]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2024. Visual instruction tuning. Advances in Neural Information Processing Systems (2024).
[144]
Hanyuan Liu, Minshan Xie, Jinbo Xing, Chengze Li, and Tien-Tsin Wong. 2023. Video colorization with pre-trained text-to-image diffusion models. arXiv:2306.01732. Retrieved from https://arxiv.org/abs/2306.01732
[145]
Jiawei Liu, Weining Wang, Wei Liu, Qian He, and Jing Liu. 2023. ED-T2V: An efficient training framework for diffusion-based text-to-video generation. In Proceedings of the International Joint Conference on Neural Networks.
[146]
Shaoteng Liu, Yuechen Zhang, Wenbo Li, Zhe Lin, and Jiaya Jia. 2023. Video-p2p: Video editing with cross-attention control. arXiv:2303.04761. Retrieved from https://arxiv.org/abs/2303.04761
[147]
Vivian Liu, Tao Long, Nathan Raw, and Lydia Chilton. 2023. Generative disco: Text-to-video generation for music visualization. arXiv:2304.08551. Retrieved from https://arxiv.org/abs/2304.08551
[148]
Wen Liu, Weixin Luo, Dongze Lian, and Shenghua Gao. 2018. Future frame prediction for anomaly detection–a new baseline. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[149]
Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, and Ying Shan. 2024. Evalcrafter: Benchmarking and evaluating large video generation models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[150]
Yuanxin Liu, Lei Li, Shuhuai Ren, Rundong Gao, Shicheng Li, Sishuo Chen, Xu Sun, and Lu Hou. 2024. Fetv: A benchmark for fine-grained evaluation of open-domain text-to-video generation. Advances in Neural Information Processing Systems (2024).
[151]
Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE International Conference on Computer Vision.
[152]
Haoyu Lu, Guoxing Yang, Nanyi Fei, Yuqi Huo, Zhiwu Lu, Ping Luo, and Mingyu Ding. 2023. VDT: An empirical study on video diffusion with transformers. arXiv:2305.13311. Retrieved from https://arxiv.org/abs/2305.13311
[153]
Shitong Luo and Wei Hu. 2021. Diffusion probabilistic models for 3d point cloud generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[154]
Zhengxiong Luo, Dayou Chen, Yingya Zhang, Yan Huang, Liang Wang, Yujun Shen, Deli Zhao, Jingren Zhou, and Tieniu Tan. 2023. VideoFusion: Decomposed diffusion models for high-quality video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[155]
Xin Ma, Yaohui Wang, Gengyun Jia, Xinyuan Chen, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, and Yu Qiao. 2024. Latte: Latent diffusion transformer for video generation. arXiv:2401.03048. Retrieved from https://arxiv.org/abs/2401.03048
[156]
Yue Ma, Yingqing He, Xiaodong Cun, Xintao Wang, Ying Shan, Xiu Li, and Qifeng Chen. 2023. Follow your pose: Pose-guided text-to-video generation using pose-free videos. arXiv:2304.01186. Retrieved from https://arxiv.org/abs/2304.01186
[157]
Kangfu Mei and Vishal Patel. 2023. Vidm: Video implicit diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence.
[158]
Willi Menapace, Aliaksandr Siarohin, Ivan Skorokhodov, Ekaterina Deyneka, Tsai-Shien Chen, Anil Kag, Yuwei Fang, Aleksei Stoliar, Elisa Ricci, Jian Ren, et al. 2024. Snap video: Scaled spatiotemporal transformers for text-to-video synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7038–7048.
[159]
Chenlin Meng, Robin Rombach, Ruiqi Gao, Diederik Kingma, Stefano Ermon, Jonathan Ho, and Tim Salimans. 2023. On distillation of guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[160]
Chenlin Meng, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. 2022. Sdedit: Image synthesis and editing with stochastic differential equations. In ICLR.
[161]
Midjourney. 2022. Midjourney. Retrieved from https://www.midjourney.com/
[162]
Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. 2019. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE International Conference on Computer Vision.
[163]
Shentong Mo and Yapeng Tian. 2024. Scaling diffusion mamba with bidirectional SSMs for efficient image and video generation. arXiv:2405.15881. Retrieved from https://arxiv.org/abs/2405.15881
[164]
Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. 2022. Null-text inversion for editing real images using guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[165]
Eyal Molad, Eliahu Horwitz, Dani Valevski, Alex Rav Acha, Yossi Matias, Yael Pritch, Yaniv Leviathan, and Yedid Hoshen. 2023. Dreamix: Video diffusion models are general video editors. arXiv:2302.01329. Retrieved from https://arxiv.org/abs/2302.01329
[166]
Sauradip Nag, Xiatian Zhu, Jiankang Deng, Yi-Zhe Song, and Tao Xiang. 2023. DiffTAD: Temporal action detection with proposal denoising diffusion. In Proceedings of the IEEE International Conference on Computer Vision.
[167]
Arsha Nagrani, Paul Hongsuck Seo, Bryan Seybold, Anja Hauth, Santiago Manen, Chen Sun, and Cordelia Schmid. 2022. Learning audio-video modalities from image captions. In Proceedings of the European Conference on Computer Vision.
[168]
Jisu Nam, Gyuseong Lee, Sunwoo Kim, Hyeonsu Kim, Hyoungwon Cho, Seyeon Kim, and Seungryong Kim. 2023. DiffMatch: Diffusion model for dense matching. arXiv:2305.19094. Retrieved from https://arxiv.org/abs/2305.19094
[169]
Haomiao Ni, Bernhard Egger, Suhas Lohit, Anoop Cherian, Ye Wang, Toshiaki Koike-Akino, Sharon X. Huang, and Tim K. Marks. 2024. TI2V-Zero: Zero-Shot image conditioning for text-to-video diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[170]
Haomiao Ni, Changhao Shi, Kai Li, Sharon X. Huang, and Martin Renqiang Min. 2023. Conditional image-to-video generation with latent flow diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[171]
Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. 2022. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. In Proceedings of the International Conference on Machine Learning.
[172]
Alexander Quinn Nichol and Prafulla Dhariwal. 2021. Improved denoising diffusion probabilistic models. In Proceedings of the International Conference on Machine Learning.
[173]
Yaniv Nikankin, Niv Haim, and Michal Irani. 2022. Sinfusion: Training diffusion models on a single image or video. In Proceedings of the International Conference on Machine Learning.
[174]
OpenAI. 2022. ChatGPT: A Large-Scale Generative Model for Conversational AI. Retrieved from https://openai.com/chatgpt
[175]
OpenAI. 2023. GPT-4 technical report. arXiv:2303.08774. Retrieved from https://arxiv.org/abs/2303.08774
[176]
OpenAI. 2024. Sora. Retrieved from https://openai.com/sora
[177]
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. In Proceedings of the Advances in Neural Information Processing Systems.
[178]
Yichen Ouyang, Hao Zhao, Gaoang Wang, et al. 2024. FlexiFilm: Long video generation with flexible conditions. arXiv:2404.18620. Retrieved from https://arxiv.org/abs/2404.18620
[179]
Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. 2022. On aliased resizing and surprising subtleties in gan evaluation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11410–11420.
[180]
William Peebles and Saining Xie. 2023. Scalable diffusion models with transformers. In Proceedings of the IEEE International Conference on Computer Vision.
[181]
Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alex Sorkine-Hornung, and Luc Van Gool. 2017. The 2017 davis challenge on video object segmentation. arXiv:1704.00675. Retrieved from https://arxiv.org/abs/1704.00675
[182]
Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. 2022. Dreamfusion: Text-to-3d using 2d diffusion. arXiv:2209.14988. Retrieved from https://arxiv.org/abs/2209.14988
[183]
Chenyang Qi, Xiaodong Cun, Yong Zhang, Chenyang Lei, Xintao Wang, Ying Shan, and Qifeng Chen. 2023. Fatezero: Fusing attentions for zero-shot text-based video editing. In Proceedings of the IEEE International Conference on Computer Vision.
[184]
Bosheng Qin, Juncheng Li, Siliang Tang, Tat-Seng Chua, and Yueting Zhuang. 2023. InstructVid2Vid: Controllable video editing with natural language instructions. arXiv:2305.12328. Retrieved from https://arxiv.org/abs/2305.12328
[185]
Bosheng Qin, Wentao Ye, Qifan Yu, Siliang Tang, and Yueting Zhuang. 2023. Dancing avatar: Pose and text-guided human motion videos synthesis with image diffusion model. arXiv:2308.07749. Retrieved from https://arxiv.org/abs/2308.07749
[186]
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning.
[187]
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2021. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. (2021).
[188]
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. 2022. Hierarchical text-conditional image generation with clip latents. arXiv:2204.06125. Retrieved from https://arxiv.org/abs/2204.06125
[189]
Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. 2021. Zero-shot text-to-image generation. In Proceedings of the International Conference on Machine Learning.
[190]
René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. 2020. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. TPAMI (2020).
[191]
Hadrien Reynaud, Mengyun Qiao, Mischa Dombrowski, Thomas Day, Reza Razavi, Alberto Gomez, Paul Leeson, and Bernhard Kainz. 2023. Feature-conditioned cascaded video diffusion models for precise echocardiogram synthesis. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention.
[192]
Anna Rohrbach, Atousa Torabi, Marcus Rohrbach, Niket Tandon, Christopher Pal, Hugo Larochelle, Aaron Courville, and Bernt Schiele. 2017. Movie description. International Journal of Computer Vision (2017).
[193]
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[194]
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[195]
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention.
[196]
Ludan Ruan, Yiyang Ma, Huan Yang, Huiguo He, Bei Liu, Jianlong Fu, Nicholas Jing Yuan, Qin Jin, and Baining Guo. 2023. Mm-diffusion: Learning multi-modal diffusion models for joint audio and video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[197]
Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. 2023. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[198]
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L. Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. 2022. Photorealistic text-to-image diffusion models with deep language understanding. In Proceedings of the Advances in Neural Information Processing Systems.
[199]
Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J. Fleet, and Mohammad Norouzi. 2022. Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence (2022).
[200]
Masaki Saito, Shunta Saito, Masanori Koyama, and Sosuke Kobayashi. 2020. Train sparsely, generate densely: Memory-efficient unsupervised training of high-resolution temporal gan. International Journal of Computer Vision (2020).
[201]
Tim Salimans and Jonathan Ho. 2022. Progressive distillation for fast sampling of diffusion models. In ICLR.
[202]
Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, Desmond Elliott, Loïc Barrault, Lucia Specia, and Florian Metze. 2018. How2: A large-scale dataset for multimodal language understanding. arXiv:1811.00347. Retrieved from https://arxiv.org/abs/1811.00347
[203]
Axel Sauer, Katja Schwarz, and Andreas Geiger. 2022. Stylegan-xl: Scaling stylegan to large diverse datasets. In Proceedings of the ACM SIGGRAPH.
[204]
Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. 2021. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv:2111.02114. Retrieved from https://arxiv.org/abs/2111.02114
[205]
Chaehun Shin, Heeseung Kim, Che Hyun Lee, Sang-gil Lee, and Sungroh Yoon. 2023. Edit-A-Video: Single video editing with object-aware consistency. arXiv:2303.07945. Retrieved from https://arxiv.org/abs/2303.07945
[206]
Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. 2019. First order motion model for image animation. In Proceedings of the Advances in Neural Information Processing Systems.
[207]
Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. 2023. Make-a-video: Text-to-video generation without text-video data. In ICLR.
[208]
Ivan Skorokhodov, Sergey Tulyakov, and Mohamed Elhoseiny. 2022. Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[209]
Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. 2015. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the International Conference on Machine Learning.
[210]
Yang Song, Conor Durkan, Iain Murray, and Stefano Ermon. 2021. Maximum likelihood training of score-based diffusion models. In Proceedings of the Advances in Neural Information Processing Systems.
[211]
Yang Song and Stefano Ermon. 2019. Generative modeling by estimating gradients of the data distribution. In Proceedings of the Advances in Neural Information Processing Systems.
[212]
Yang Song and Stefano Ermon. 2020. Improved techniques for training score-based generative models. In Proceedings of the Advances in Neural Information Processing Systems.
[213]
Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. 2021. Score-based generative modeling through stochastic differential equations. In ICLR.
[214]
Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. 2012. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv:1212.0402. Retrieved from https://arxiv.org/abs/1212.0402
[215]
Nitish Srivastava, Elman Mansimov, and Ruslan Salakhudinov. 2015. Unsupervised learning of video representations using lstms. In Proceedings of the International Conference on Machine Learning.
[216]
Sebastian Stein and Stephen J. McKenna. 2013. Combining embedded accelerometers with computer vision for recognizing food preparation activities. In Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing.
[217]
Jonathan C. Stroud, Zhichao Lu, Chen Sun, Jia Deng, Rahul Sukthankar, Cordelia Schmid, and David A. Ross. 2020. Learning video representations from textual web supervision. arXiv:2007.14937. Retrieved from https://arxiv.org/abs/2007.14937
[218]
Waqas Sultani, Chen Chen, and Mubarak Shah. 2018. Real-world anomaly detection in surveillance videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[219]
Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[220]
Zineng Tang, Ziyi Yang, Chenguang Zhu, Michael Zeng, and Mohit Bansal. 2023. Any-to-any generation via composable diffusion. In Proceedings of the Advances in Neural Information Processing Systems.
[221]
Linrui Tian, Qi Wang, Bang Zhang, and Liefeng Bo. 2024. Emo: Emote portrait alive-generating expressive portrait videos with audio2video diffusion model under weak conditions. arXiv:2402.17485. Retrieved from https://arxiv.org/abs/2402.17485
[222]
Yu Tian, Jian Ren, Menglei Chai, Kyle Olszewski, Xi Peng, Dimitris N. Metaxas, and Sergey Tulyakov. 2021. A good image generator is what you need for high-resolution video synthesis. In ICLR.
[223]
Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision.
[224]
Shuyuan Tu, Qi Dai, Zhi-Qi Cheng, Han Hu, Xintong Han, Zuxuan Wu, and Yu-Gang Jiang. 2024. Motioneditor: Editing video motion via content-aware diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[225]
Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. 2018. Mocogan: Decomposing motion and content for video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[226]
Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. 2023. Plug-and-play diffusion features for text-driven image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[227]
Anil Osman Tur, Nicola Dall’Asen, Cigdem Beyan, and Elisa Ricci. 2023. Exploring diffusion models for unsupervised video anomaly detection. IEEE VCIP (2023).
[228]
Anil Osman Tur, Nicola Dall’Asen, Cigdem Beyan, and Elisa Ricci. 2023. Unsupervised video anomaly detection with diffusion models conditioned on compact motion representations. International Conference on Image Analysis and Processing.
[229]
Anwaar Ulhaq, Naveed Akhtar, and Ganna Pogrebna. 2022. Efficient diffusion models for vision: A survey. arXiv:2210.09292. Retrieved from https://arxiv.org/abs/2210.09292
[230]
Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. 2018. Towards accurate generative models of video: A new metric and challenges. arXiv:1812.01717. Retrieved from https://arxiv.org/abs/1812.01717
[231]
Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly. 2019. FVD: A new metric for video generation. In ICLR.
[232]
Arash Vahdat, Karsten Kreis, and Jan Kautz. 2021. Score-based generative modeling in latent space. In Proceedings of the Advances in Neural Information Processing Systems.
[233]
Aaron Van Den Oord, Oriol Vinyals, et al. 2017. Neural discrete representation learning. Advances in Neural Information Processing Systems (2017).
[234]
Vikram Voleti, Alexia Jolicoeur-Martineau, and Chris Pal. 2022. MCVD-masked conditional video diffusion for prediction, generation, and interpolation. In Proceedings of the Advances in Neural Information Processing Systems.
[235]
Fu-Yun Wang, Wenshuo Chen, Guanglu Song, Han-Jia Ye, Yu Liu, and Hongsheng Li. 2023. Gen-L-Video: Multi-text to long video generation via temporal co-denoising. arXiv:2305.18264. Retrieved from https://arxiv.org/abs/2305.18264
[236]
Hefeng Wang, Jiale Cao, Rao Muhammad Anwer, Jin Xie, Fahad Shahbaz Khan, and Yanwei Pang. 2023. DFormer: Diffusion-guided transformer for universal image segmentation. arXiv:2306.03437. Retrieved from https://arxiv.org/abs/2306.03437
[237]
Hanlin Wang, Yilu Wu, Sheng Guo, and Limin Wang. 2023. Pdpp: Projected diffusion for procedure planning in instructional videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[238]
Junke Wang, Yi Jiang, Zehuan Yuan, Binyue Peng, Zuxuan Wu, and Yu-Gang Jiang. 2024. OmniTokenizer: A joint image-video tokenizer for visual generation. arXiv:2406.09399. Retrieved from https://arxiv.org/abs/2406.09399
[239]
Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, et al. 2020. Deep high-resolution representation learning for visual recognition. TPAMI (2020).
[240]
Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. 2023. ModelScope text-to-video technical report. arXiv:2308.06571. Retrieved from https://arxiv.org/abs/2308.06571
[241]
Tan Wang, Linjie Li, Kevin Lin, Chung-Ching Lin, Zhengyuan Yang, Hanwang Zhang, Zicheng Liu, and Lijuan Wang. 2023. DisCo: Disentangled control for referring human dance generation in real world. arXiv:2307.00040. Retrieved from https://arxiv.org/abs/2307.00040
[242]
Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. 2018. Video-to-video synthesis. In Proceedings of the Advances in Neural Information Processing Systems.
[243]
Wen Wang, Kangyang Xie, Zide Liu, Hao Chen, Yue Cao, Xinlong Wang, and Chunhua Shen. 2023. Zero-shot video editing using off-the-shelf image diffusion models. arXiv:2303.17599. Retrieved from https://arxiv.org/abs/2303.17599
[244]
Wenjing Wang, Huan Yang, Zixi Tuo, Huiguo He, Junchen Zhu, Jianlong Fu, and Jiaying Liu. 2023. VideoFactory: Swap attention in spatiotemporal diffusions for text-to-video generation. arXiv:2305.10874. Retrieved from https://arxiv.org/abs/2305.10874
[245]
Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William Yang Wang. 2019. Vatex: A large-scale, high-quality multilingual dataset for video-and-language research. In Proceedings of the IEEE International Conference on Computer Vision.
[246]
Xiang Wang, Hangjie Yuan, Shiwei Zhang, Dayou Chen, Jiuniu Wang, Yingya Zhang, Yujun Shen, Deli Zhao, and Jingren Zhou. 2023. VideoComposer: Compositional video synthesis with motion controllability. arXiv:2306.02018. Retrieved from https://arxiv.org/abs/2306.02018
[247]
Xiang Wang, Shiwei Zhang, Han Zhang, Yu Liu, Yingya Zhang, Changxin Gao, and Nong Sang. 2023. VideoLCM: Video latent consistency model. arXiv:2312.09109. Retrieved from https://arxiv.org/abs/2312.09109
[248]
Yanhui Wang, Jianmin Bao, Wenming Weng, Ruoyu Feng, Dacheng Yin, Tao Yang, Jingxu Zhang, Qi Dai, Zhiyuan Zhao, Chunyu Wang, et al. 2024. MicroCinema: A divide-and-conquer approach for text-to-video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[249]
Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, et al. 2023. LaVie: High-quality video generation with cascaded latent diffusion models. arXiv:2309.15103. Retrieved from https://arxiv.org/abs/2309.15103
[250]
Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinyuan Chen, Yaohui Wang, Ping Luo, Ziwei Liu, et al. 2023. InternVid: A large-scale video-text dataset for multimodal understanding and generation. arXiv:2307.06942. Retrieved from https://arxiv.org/abs/2307.06942
[251]
Yuhan Wang, Liming Jiang, and Chen Change Loy. 2023. StyleInV: A temporal style modulated inversion network for unconditional video generation. In Proceedings of the IEEE International Conference on Computer Vision.
[252]
Yuanzhi Wang, Yong Li, Xin Liu, Anbo Dai, Antoni Chan, and Zhen Cui. 2023. Edit temporal-consistent videos with image diffusion model. arXiv:2308.09091. Retrieved from https://arxiv.org/abs/2308.09091
[253]
Yaohui Wang, Xin Ma, Xinyuan Chen, Antitza Dantcheva, Bo Dai, and Yu Qiao. 2023. LEO: Generative latent image animator for human video synthesis. arXiv:2305.03989. Retrieved from https://arxiv.org/abs/2305.03989
[254]
Zhou Wang and Alan C. Bovik. 2009. Mean squared error: Love it or leave it? A new look at signal fidelity measures. IEEE Signal Processing Magazine (2009).
[255]
Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. 2004. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing (2004).
[256]
Zhouxia Wang, Ziyang Yuan, Xintao Wang, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. 2023. MotionCtrl: A unified and flexible motion controller for video generation. arXiv:2312.03641. Retrieved from https://arxiv.org/abs/2312.03641
[257]
Yujie Wei, Shiwei Zhang, Zhiwu Qing, Hangjie Yuan, Zhiheng Liu, Yu Liu, Yingya Zhang, Jingren Zhou, and Hongming Shan. 2024. DreamVideo: Composing your dream videos with customized subject and motion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[258]
Zejia Weng, Xitong Yang, Zhen Xing, Zuxuan Wu, and Yu-Gang Jiang. 2024. GenRec: Unifying video generation and recognition with diffusion models. arXiv:2408.15241. Retrieved from https://arxiv.org/abs/2408.15241
[259]
Bichen Wu, Ching-Yao Chuang, Xiaoyan Wang, Yichen Jia, Kapil Krishnakumar, Tong Xiao, Feng Liang, Licheng Yu, and Peter Vajda. 2024. Fairy: Fast parallelized instruction-guided video-to-video synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[260]
Jiayang Wu, Wensheng Gan, Zefeng Chen, Shicheng Wan, and Hong Lin. 2023. AI-generated content (AIGC): A survey. arXiv:2304.06632. Retrieved from https://arxiv.org/abs/2304.06632
[261]
Jay Zhangjie Wu, Guian Fang, Haoning Wu, Xintao Wang, Yixiao Ge, Xiaodong Cun, David Junhao Zhang, Jia-Wei Liu, Yuchao Gu, Rui Zhao, et al. 2024. Towards a better metric for text-to-video generation. arXiv:2401.07781. Retrieved from https://arxiv.org/abs/2401.07781
[262]
Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Weixian Lei, Yuchao Gu, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. 2023. Tune-A-Video: One-shot tuning of image diffusion models for text-to-video generation. In Proceedings of the IEEE International Conference on Computer Vision.
[263]
Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. 2023. NExT-GPT: Any-to-any multimodal LLM. arXiv:2309.05519. Retrieved from https://arxiv.org/abs/2309.05519
[264]
Jinbo Xing, Menghan Xia, Yuxin Liu, Yuechen Zhang, Yong Zhang, Yingqing He, Hanyuan Liu, Haoxin Chen, Xiaodong Cun, Xintao Wang, et al. 2023. Make-Your-Video: Customized video generation using textual and structural guidance. arXiv:2306.00943. Retrieved from https://arxiv.org/abs/2306.00943
[265]
Jinbo Xing, Menghan Xia, Yong Zhang, Haoxin Chen, Xintao Wang, Tien-Tsin Wong, and Ying Shan. 2023. DynamiCrafter: Animating open-domain images with video diffusion priors. arXiv:2310.12190. Retrieved from https://arxiv.org/abs/2310.12190
[266]
Zhen Xing, Qi Dai, Han Hu, Zuxuan Wu, and Yu-Gang Jiang. 2024. SimDA: Simple diffusion adapter for efficient video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[267]
Zhen Xing, Qi Dai, Zejia Weng, Zuxuan Wu, and Yu-Gang Jiang. 2024. AID: Adapting Image2Video diffusion models for instruction-guided video prediction. arXiv:2406.06465. Retrieved from https://arxiv.org/abs/2406.06465
[268]
Zhen Xing, Qi Dai, Zihao Zhang, Hui Zhang, Han Hu, Zuxuan Wu, and Yu-Gang Jiang. 2023. VIDiff: Translating videos via multi-modal instructions with diffusion models. arXiv:2311.18837. Retrieved from https://arxiv.org/abs/2311.18837
[269]
Wei Xiong, Wenhan Luo, Lin Ma, Wei Liu, and Jiebo Luo. 2018. Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[270]
Chao Xu, Shaoting Zhu, Junwei Zhu, Tianxin Huang, Jiangning Zhang, Ying Tai, and Yong Liu. 2023. Multimodal-driven talking face generation via a unified diffusion-based generator. arXiv:2305.02594. Retrieved from https://arxiv.org/abs/2305.02594
[271]
Jun Xu, Tao Mei, Ting Yao, and Yong Rui. 2016. MSR-VTT: A large video description dataset for bridging video and language. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[272]
Zhongcong Xu, Jianfeng Zhang, Jun Hao Liew, Hanshu Yan, Jia-Wei Liu, Chenxu Zhang, Jiashi Feng, and Mike Zheng Shou. 2024. MagicAnimate: Temporally consistent human image animation using diffusion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[273]
Hongwei Xue, Tiankai Hang, Yanhong Zeng, Yuchong Sun, Bei Liu, Huan Yang, Jianlong Fu, and Baining Guo. 2022. Advancing high-resolution video-language representation with large-scale video transcriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[274]
Hanshu Yan, Jun Hao Liew, Long Mai, Shanchuan Lin, and Jiashi Feng. 2023. MagicProp: Diffusion-based video editing via motion-aware appearance propagation. arXiv:2309.00908. Retrieved from https://arxiv.org/abs/2309.00908
[275]
Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. 2021. VideoGPT: Video generation using VQ-VAE and transformers. arXiv:2104.10157. Retrieved from https://arxiv.org/abs/2104.10157
[276]
Ling Yang, Zhilong Zhang, Yang Song, Shenda Hong, Runsheng Xu, Yue Zhao, Yingxia Shao, Wentao Zhang, Bin Cui, and Ming-Hsuan Yang. 2022. Diffusion models: A comprehensive survey of methods and applications. ACM Computing Surveys (2022).
[277]
Mengjiao Yang, Yilun Du, Bo Dai, Dale Schuurmans, Joshua B. Tenenbaum, and Pieter Abbeel. 2023. Probabilistic adaptation of text-to-video models. arXiv:2306.01872. Retrieved from https://arxiv.org/abs/2306.01872
[278]
Ruihan Yang, Prakhar Srivastava, and Stephan Mandt. 2022. Diffusion probabilistic modeling for video generation. arXiv:2203.09481. Retrieved from https://arxiv.org/abs/2203.09481
[279]
Siyuan Yang, Lu Zhang, Yu Liu, Zhizhuo Jiang, and You He. 2023. Video diffusion models with local-global context guidance. In Proceedings of the International Joint Conference on Artificial Intelligence.
[280]
Shuai Yang, Yifan Zhou, Ziwei Liu, and Chen Change Loy. 2023. Rerender a video: Zero-shot text-guided video-to-video translation. In Proceedings of SIGGRAPH Asia.
[281]
Shengming Yin, Chenfei Wu, Jian Liang, Jie Shi, Houqiang Li, Gong Ming, and Nan Duan. 2023. DragNUWA: Fine-grained control in video generation by integrating text, image, and trajectory. arXiv:2308.08089. Retrieved from https://arxiv.org/abs/2308.08089
[282]
Shengming Yin, Chenfei Wu, Huan Yang, Jianfeng Wang, Xiaodong Wang, Minheng Ni, Zhengyuan Yang, Linjie Li, Shuguang Liu, Fan Yang, et al. 2023. NUWA-XL: Diffusion over diffusion for eXtremely long video generation. arXiv:2303.12346. Retrieved from https://arxiv.org/abs/2303.12346
[283]
YouTube. [n.d.]. YouTube. Retrieved from https://www.youtube.com/
[284]
Jiashuo Yu, Yaohui Wang, Xinyuan Chen, Xiao Sun, and Yu Qiao. 2023. Long-term rhythmic video soundtracker. In Proceedings of the International Conference on Machine Learning.
[285]
Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. 2022. Scaling autoregressive models for content-rich text-to-image generation. Transactions on Machine Learning Research (2022).
[286]
Jianhui Yu, Hao Zhu, Liming Jiang, Chen Change Loy, Weidong Cai, and Wayne Wu. 2023. CelebV-Text: A large-scale facial text-video dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[287]
Sihyun Yu, Weili Nie, De-An Huang, Boyi Li, Jinwoo Shin, and Anima Anandkumar. 2024. Efficient video diffusion models via content-frame motion-latent decomposition. arXiv:2403.14148. Retrieved from https://arxiv.org/abs/2403.14148
[288]
Sihyun Yu, Kihyuk Sohn, Subin Kim, and Jinwoo Shin. 2023. Video probabilistic diffusion models in projected latent space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[289]
Sihyun Yu, Jihoon Tack, Sangwoo Mo, Hyunsu Kim, Junho Kim, Jung-Woo Ha, and Jinwoo Shin. 2022. Generating videos with dynamics-aware implicit generative adversarial networks. In Proceedings of the International Conference on Learning Representations.
[290]
Rowan Zellers, Ximing Lu, Jack Hessel, Youngjae Yu, Jae Sung Park, Jize Cao, Ali Farhadi, and Yejin Choi. 2021. MERLOT: Multimodal neural script knowledge models. In Proceedings of the Advances in Neural Information Processing Systems.
[291]
Yan Zeng, Guoqiang Wei, Jiani Zheng, Jiaxin Zou, Yang Wei, Yuchen Zhang, and Hang Li. 2023. Make pixels dance: High-dynamic video generation. arXiv:2311.10982. Retrieved from https://arxiv.org/abs/2311.10982
[292]
Fangneng Zhan, Yingchen Yu, Rongliang Wu, Jiahui Zhang, Shijian Lu, Lingjie Liu, Adam Kortylewski, Christian Theobalt, and Eric Xing. 2023. Multimodal image synthesis and editing: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023).
[293]
Chenshuang Zhang, Chaoning Zhang, Mengchun Zhang, and In So Kweon. 2023. Text-to-image diffusion model in generative AI: A survey. arXiv:2303.07909. Retrieved from https://arxiv.org/abs/2303.07909
[294]
Chaoning Zhang, Chenshuang Zhang, Sheng Zheng, Yu Qiao, Chenghao Li, Mengchun Zhang, Sumit Kumar Dam, Chu Myaet Thwal, Ye Lin Tun, Le Luang Huy, et al. 2023. A complete survey on generative AI (AIGC): Is ChatGPT from GPT-4 to GPT-5 all you need? arXiv:2303.11717. Retrieved from https://arxiv.org/abs/2303.11717
[295]
David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, and Mike Zheng Shou. 2023. Show-1: Marrying pixel and latent diffusion models for text-to-video generation. arXiv:2309.15818. Retrieved from https://arxiv.org/abs/2309.15818
[296]
Hui Zhang, Zuxuan Wu, Zhen Xing, Jie Shao, and Yu-Gang Jiang. 2023. AdaDiff: Adaptive step selection for fast diffusion. arXiv:2311.14768. Retrieved from https://arxiv.org/abs/2311.14768
[297]
Lvmin Zhang and Maneesh Agrawala. 2023. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE International Conference on Computer Vision.
[298]
Xing Zhang, Zuxuan Wu, Zejia Weng, Huazhu Fu, Jingjing Chen, Yu-Gang Jiang, and Larry S. Davis. 2021. VideoLT: Large-scale long-tailed video recognition. In Proceedings of the IEEE International Conference on Computer Vision.
[299]
Yabo Zhang, Yuxiang Wei, Dongsheng Jiang, Xiaopeng Zhang, Wangmeng Zuo, and Qi Tian. 2023. ControlVideo: Training-free controllable text-to-video generation. arXiv:2305.13077. Retrieved from https://arxiv.org/abs/2305.13077
[300]
Zicheng Zhang, Bonan Li, Xuecheng Nie, Congying Han, Tiande Guo, and Luoqi Liu. 2023. Towards consistent video editing with text-to-image diffusion models. arXiv:2305.17431. Retrieved from https://arxiv.org/abs/2305.17431
[301]
Zhixing Zhang, Bichen Wu, Xiaoyan Wang, Yaqiao Luo, Luxin Zhang, Yinan Zhao, Peter Vajda, Dimitris Metaxas, and Licheng Yu. 2024. AVID: Any-length video inpainting with diffusion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[302]
Henghao Zhao, Kevin Qinghong Lin, Rui Yan, and Zechao Li. 2023. DiffusionVMR: Diffusion model for video moment retrieval. arXiv:2308.15109. Retrieved from https://arxiv.org/abs/2308.15109
[303]
Min Zhao, Rongzhen Wang, Fan Bao, Chongxuan Li, and Jun Zhu. 2023. ControlVideo: Adding conditional control for one shot text-to-video editing. arXiv:2305.17098. Retrieved from https://arxiv.org/abs/2305.17098
[304]
Yuyang Zhao, Enze Xie, Lanqing Hong, Zhenguo Li, and Gim Hee Lee. 2023. Make-A-Protagonist: Generic video editing with an ensemble of experts. arXiv:2305.08850. Retrieved from https://arxiv.org/abs/2305.08850
[305]
Xian Zhong, Zipeng Li, Shuqin Chen, Kui Jiang, Chen Chen, and Mang Ye. 2023. Refined semantic enhancement towards frequency diffusion for video captioning. In Proceedings of the AAAI Conference on Artificial Intelligence.
[306]
Daquan Zhou, Weimin Wang, Hanshu Yan, Weiwei Lv, Yizhe Zhu, and Jiashi Feng. 2022. MagicVideo: Efficient video generation with latent diffusion models. arXiv:2211.11018. Retrieved from https://arxiv.org/abs/2211.11018
[307]
Luowei Zhou, Chenliang Xu, and Jason Corso. 2018. Towards automatic learning of procedures from web instructional videos. In Proceedings of the AAAI Conference on Artificial Intelligence.
[308]
Qihua Zhou, Ruibin Li, Song Guo, Yi Liu, Jingcai Guo, and Zhenda Xu. 2022. CaDM: Codec-aware diffusion modeling for neural-enhanced video streaming. arXiv:2211.08428. Retrieved from https://arxiv.org/abs/2211.08428
[309]
Yutong Zhou and Nobutaka Shimada. 2023. Vision + language applications: A survey. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[310]
Bingwen Zhu, Fanyi Wang, Tianyi Lu, Peng Liu, Jingwen Su, Jinxiu Liu, Yanhao Zhang, Zuxuan Wu, Yu-Gang Jiang, and Guo-Jun Qi. 2024. PoseAnimate: Zero-shot high fidelity pose controllable character animation. arXiv:2404.13680. Retrieved from https://arxiv.org/abs/2404.13680
[311]
Junchen Zhu, Huan Yang, Huiguo He, Wenjing Wang, Zixi Tuo, Wen-Huang Cheng, Lianli Gao, Jingkuan Song, and Jianlong Fu. 2023. MovieFactory: Automatic movie creation from text using large generative models for language and images. arXiv:2306.07257. Retrieved from https://arxiv.org/abs/2306.07257
[312]
Wojciech Zielonka, Timo Bolkart, and Justus Thies. 2022. Towards metrical reconstruction of human faces. In Proceedings of the European Conference on Computer Vision.

    Published In

    ACM Computing Surveys, Volume 57, Issue 2, February 2025, 974 pages. EISSN: 1557-7341. DOI: 10.1145/3696822. Editors: David Atienza and Michela Milano.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 07 November 2024
    Online AM: 18 September 2024
    Accepted: 22 August 2024
    Revised: 14 July 2024
    Received: 18 November 2023
    Published in CSUR Volume 57, Issue 2

    Author Tags

    1. Survey
    2. video diffusion model
    3. video generation
    4. video editing
    5. AIGC

    Qualifiers

    • Survey

    Funding Sources

    • National Natural Science Foundation of China
