A Reinforcement Learning-Based Automatic Video Editing Method Using Pre-trained Vision-Language Model

Authors:

Rui Huang

MM '23: Proceedings of the 31st ACM International Conference on Multimedia

Pages 6441 - 6450

https://doi.org/10.1145/3581783.3611878

Published: 27 October 2023

Abstract

In this era of video, automatic video editing techniques are attracting growing attention from industry and academia because they can reduce workloads and lower the requirements for human editors. Existing automatic editing systems are mainly scene- or event-specific, e.g., soccer game broadcasting. Automatic systems for general editing, e.g., movie or vlog editing, which covers various scenes and events, have rarely been studied, and adapting event-driven editing methods to general scenes is nontrivial. In this paper, we propose a two-stage scheme for general editing. First, unlike previous works that extract scene-specific features, we leverage a pre-trained Vision-Language Model (VLM) to extract editing-relevant representations as the editing context. Second, to close the gap between professional-looking videos and automatic productions generated with simple guidelines, we propose a Reinforcement Learning (RL)-based editing framework to formulate the editing problem and train a virtual editor to make better sequential editing decisions. Finally, we evaluate the proposed method on a more general editing task with a real movie dataset. Experimental results demonstrate the effectiveness and benefits of the proposed context representation and the learning ability of our RL-based editing framework.
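
The abstract frames editing as a sequence of decisions made by a virtual editor over an editing context. The sketch below illustrates only that framing, as a toy state/action/reward loop. It is not the paper's implementation: the `EditingEnv` class, the match-the-reference reward, and the `continuity_policy` baseline are all hypothetical, and plain lists of floats stand in for features that a pre-trained VLM would supply. No learning takes place here; an RL algorithm would replace the hand-written policy.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class EditingEnv:
    """Toy sequential editing task (hypothetical, not the paper's environment).

    candidates[t] holds the embeddings of the shots available at step t.
    reference[t] is the index a human editor chose at step t.
    """
    candidates: list
    reference: list
    t: int = 0
    history: list = field(default_factory=list)

    def reset(self):
        self.t = 0
        self.history = []
        return self.state()

    def state(self):
        """Editing context: shots chosen so far plus the current options."""
        options = self.candidates[self.t] if self.t < len(self.candidates) else []
        return {"history": list(self.history), "options": options}

    def step(self, action):
        """Select candidate `action`; reward is 1.0 if it matches the reference."""
        reward = 1.0 if action == self.reference[self.t] else 0.0
        self.history.append(self.candidates[self.t][action])
        self.t += 1
        done = self.t >= len(self.candidates)
        return self.state(), reward, done


def continuity_policy(state):
    """Hand-written baseline: pick the shot most similar to the previous one."""
    if not state["history"]:
        return 0
    prev = state["history"][-1]
    scores = [cosine(prev, opt) for opt in state["options"]]
    return scores.index(max(scores))


def rollout(env, policy):
    """Run one episode and return (chosen indices, total reward)."""
    state, done, total, chosen = env.reset(), False, 0.0, []
    while not done:
        action = policy(state)
        state, reward, done = env.step(action)
        chosen.append(action)
        total += reward
    return chosen, total
```

Calling `rollout(EditingEnv(candidates, reference), continuity_policy)` returns the selected shot indices and the number of steps on which the policy agreed with the reference edit. A trained policy would be evaluated through the same interface.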

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision representations
      2. Computer vision tasks
        Visual content-based indexing and retrieval

Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia

October 2023

9913 pages

ISBN: 9798400701085

DOI: 10.1145/3581783

General Chairs:
Abdulmotaleb El Saddik (University of Ottawa, Canada & MBZUAI, UAE)
Tao Mei (HiDream.ai, China)
Rita Cucchiara (University of Modena and Reggio Emilia, Italy)

Program Chairs:
Marco Bertini (University of Florence, Italy)
Diana Patricia Tobon Vallejo (Universidad de Medellin, Colombia)
Pradeep K. Atrey (University at Albany, State University of New York, USA)
M. Shamim Hossain (King Saud University, KSA)

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Shenzhen Science and Technology Program

Conference

MM '23

Sponsor:

SIGMM

MM '23: The 31st ACM International Conference on Multimedia

October 29 - November 3, 2023

Ottawa ON, Canada

Acceptance Rates

Overall Acceptance Rate 995 of 4,171 submissions, 24%
