
A Reinforcement Learning-Based Automatic Video Editing Method Using Pre-trained Vision-Language Model

Published: 27 October 2023

Abstract

In this era of video, automatic video editing techniques are attracting growing attention from industry and academia because they reduce workloads and lower the skill requirements for human editors. Existing automatic editing systems are mainly scene- or event-specific, e.g., soccer game broadcasting. Automatic systems for general editing, e.g., movie or vlog editing, which covers various scenes and events, have rarely been studied, and adapting event-driven editing methods to general scenes is nontrivial. In this paper, we propose a two-stage scheme for general editing. First, unlike previous works that extract scene-specific features, we leverage a pre-trained Vision-Language Model (VLM) to extract editing-relevant representations as the editing context. Second, to close the gap between professional-looking videos and automatic productions generated with simple guidelines, we propose a Reinforcement Learning (RL)-based editing framework that formulates the editing problem and trains a virtual editor to make better sequential editing decisions. Finally, we evaluate the proposed method on a more general editing task with a real movie dataset. Experimental results demonstrate the effectiveness and benefits of the proposed context representation and the learning ability of our RL-based editing framework.
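The abstract does not specify the policy, the reward, or the feature extractor, so the following is only a toy sketch of how sequential shot selection could be framed as an RL problem, not the authors' implementation. It assumes each candidate shot is already described by an embedding vector (random vectors stand in here for VLM features), uses a linear softmax policy, and trains it with plain REINFORCE. The names `policy`, `run_episode`, `reinforce_update`, and `placeholder_reward` are hypothetical, and the reward (similarity between the previous and the chosen shot) is a placeholder assumption.

```python
import math
import random

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def features(context, candidate):
    # Element-wise interaction between the editing context and a candidate shot.
    return [c * s for c, s in zip(context, candidate)]

def policy(weights, context, candidates):
    # Linear scores over interaction features, turned into selection probabilities.
    feats = [features(context, cand) for cand in candidates]
    scores = [sum(w * f for w, f in zip(weights, feat)) for feat in feats]
    return softmax(scores), feats

def run_episode(weights, shots_per_step, start_context, reward_fn, rng):
    # Roll out one edit: at each step sample a shot, then use it as the next context.
    context, trajectory = start_context, []
    for candidates in shots_per_step:
        probs, feats = policy(weights, context, candidates)
        action = rng.choices(range(len(candidates)), weights=probs)[0]
        reward = reward_fn(context, candidates[action])
        trajectory.append((probs, feats, action, reward))
        context = candidates[action]
    return trajectory

def reinforce_update(weights, trajectory, lr=0.1, gamma=0.99):
    # REINFORCE: w += lr * return * grad log pi(action), accumulated over the episode.
    ret = 0.0
    new_weights = list(weights)
    for probs, feats, action, reward in reversed(trajectory):
        ret = reward + gamma * ret
        for k in range(len(new_weights)):
            expected = sum(p * f[k] for p, f in zip(probs, feats))
            new_weights[k] += lr * ret * (feats[action][k] - expected)
    return new_weights

def placeholder_reward(context, chosen):
    # Assumption for illustration only: reward similarity to the previous shot.
    return sum(c * s for c, s in zip(context, chosen))

rng = random.Random(0)
dim, steps, n_candidates = 4, 5, 3
weights = [0.0] * dim
for _ in range(200):
    # Random embeddings stand in for VLM features of the candidate shots.
    shots = [[[rng.uniform(-1, 1) for _ in range(dim)]
              for _ in range(n_candidates)] for _ in range(steps)]
    start = [rng.uniform(-1, 1) for _ in range(dim)]
    traj = run_episode(weights, shots, start, placeholder_reward, rng)
    weights = reinforce_update(weights, traj)
```

In a real system the random vectors would be replaced by embeddings from a pre-trained model and the reward by whatever signal of editing quality is available. The sketch only shows how the decision at each step depends on the context left by the previous one.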

      Published In

      MM '23: Proceedings of the 31st ACM International Conference on Multimedia
      October 2023
      9913 pages
      ISBN:9798400701085
      DOI:10.1145/3581783

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. reinforcement learning
      2. video editing
      3. video representation

      Qualifiers

      • Research-article

      Funding Sources

      • Shenzhen Science and Technology Program

      Conference

      MM '23: The 31st ACM International Conference on Multimedia
      October 29 - November 3, 2023
      Ottawa ON, Canada
