research-article

Effective Video Summarization by Extracting Parameter-Free Motion Attention

Authors:

Sicheng Zhao

ACM Transactions on Multimedia Computing, Communications and Applications, Volume 20, Issue 7

Article No.: 219, Pages 1 - 20

https://doi.org/10.1145/3654670

Published: 16 May 2024

Abstract

Video summarization remains a challenging task despite increasing research efforts. Traditional methods focus solely on long-range temporal modeling of video frames, overlooking important local motion information that cannot be captured by frame-level video representations. In this article, we propose the Parameter-free Motion Attention Module (PMAM), which uses a multi-head attention architecture to exploit the motion cues contained in adjacent video frames. The PMAM requires no additional training of model parameters, leading to an efficient and effective understanding of video dynamics. Moreover, we introduce the Multi-feature Motion Attention Network (MMAN), which integrates the PMAM with local and global multi-head attention based on object-centric and scene-centric video representations. Combining the local motion information extracted by the PMAM with the long-range interactions modeled by the local and global multi-head attention mechanism can significantly enhance video summarization performance. Extensive experimental results on the benchmark datasets SumMe and TVSum demonstrate that the proposed MMAN outperforms other state-of-the-art methods.
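To make the idea of multi-head attention without trained weights more concrete, the following is a minimal sketch and not the authors' PMAM. It assumes each frame is already represented by a feature vector, splits that vector into heads, and lets each frame attend to its immediate neighbours using plain scaled dot products with no learned projection matrices. The function name `parameter_free_local_attention`, the `radius` window, and the choice of identity query/key/value mappings are all illustrative assumptions; the abstract does not specify these details.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def parameter_free_local_attention(frames, num_heads=2, radius=1):
    """Attend from each frame to its neighbours without any learned weights.

    frames: list of T feature vectors, each a list of D floats.
    num_heads: number of equal slices the feature dimension is split into.
    radius: how many frames on each side count as "adjacent" (assumption).
    Returns a list of T vectors of length D.
    """
    num_frames = len(frames)
    dim = len(frames[0])
    if dim % num_heads != 0:
        raise ValueError("feature dimension must be divisible by num_heads")
    head_dim = dim // num_heads

    output = []
    for t in range(num_frames):
        # Local window of adjacent frames, clipped at the video boundaries.
        start = max(0, t - radius)
        stop = min(num_frames, t + radius + 1)
        row = []
        for h in range(num_heads):
            part = slice(h * head_dim, (h + 1) * head_dim)
            query = frames[t][part]
            keys = [frames[j][part] for j in range(start, stop)]
            # Scaled dot-product scores; queries, keys and values are the
            # raw feature slices, so there is nothing to train.
            scores = [
                sum(q * k for q, k in zip(query, key)) / math.sqrt(head_dim)
                for key in keys
            ]
            weights = softmax(scores)
            # Weighted sum of the neighbouring slices (values == keys here).
            row.extend(
                sum(w * key[i] for w, key in zip(weights, keys))
                for i in range(head_dim)
            )
        output.append(row)
    return output
```

Because the attention weights form a convex combination, every output component stays within the range of the corresponding input components in its window; a downstream model could, for example, compare the output with the original frame feature to obtain a rough local-change signal. That use is likewise an assumption, not something stated in the abstract.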

Index Terms

Effective Video Summarization by Extracting Parameter-Free Motion Attention
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision tasks
        Video summarization

Published In

ACM Transactions on Multimedia Computing, Communications, and Applications Volume 20, Issue 7

July 2024

973 pages

EISSN:1551-6865

DOI:10.1145/3613662

Editor:
Abdulmotaleb El Saddik
Mohamed Bin Zayed University of Artificial Intelligence, UAE and University of Ottawa, Canada

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 May 2024

Online AM: 30 March 2024

Accepted: 20 March 2024

Revised: 19 March 2024

Received: 18 September 2023

Published in TOMM Volume 20, Issue 7

Funding Sources

Zhejiang Provincial Natural Science Foundation of China
National Natural Science Foundation of China (NSFC)
