
Effective Video Summarization by Extracting Parameter-Free Motion Attention

Published: 16 May 2024


Video summarization remains a challenging task despite increasing research efforts. Traditional methods focus solely on long-range temporal modeling of video frames, overlooking important local motion information that frame-level video representations cannot capture. In this article, we propose the Parameter-free Motion Attention Module (PMAM), which uses a multi-head attention architecture to exploit the motion clues contained in adjacent video frames. The PMAM requires no additional training of model parameters, which makes modeling video dynamics both efficient and effective. Moreover, we introduce the Multi-feature Motion Attention Network (MMAN), which integrates the PMAM with local and global multi-head attention based on object-centric and scene-centric video representations. Combining the local motion information extracted by the PMAM with the long-range interactions modeled by the local and global multi-head attention mechanisms significantly enhances video summarization performance. Extensive experiments on the benchmark datasets SumMe and TVSum demonstrate that the proposed MMAN outperforms other state-of-the-art methods.
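
The abstract does not give the PMAM formulation, so the following is only a rough sketch of what attention without trainable parameters over adjacent frames could look like. It is not the authors' method. The function name `parameter_free_local_attention`, the `window` parameter, and the choice to form heads by slicing the feature vector (instead of learned query/key/value projections) are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; -inf entries become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def parameter_free_local_attention(frames, num_heads=4, window=1):
    """Multi-head attention over neighbouring frames with no learned weights.

    ASSUMPTION: illustrative only, not the PMAM from the article.
    frames: (T, D) array of per-frame features.
    Returns (output of shape (T, D), weights of shape (num_heads, T, T)).
    """
    T, D = frames.shape
    if D % num_heads != 0:
        raise ValueError("feature size must be divisible by num_heads")
    d = D // num_heads

    # Split each feature vector into heads: (num_heads, T, d).
    # Queries, keys, and values are the raw slices, so nothing is trained.
    heads = frames.reshape(T, num_heads, d).transpose(1, 0, 2)

    # Scaled dot-product similarity between every pair of frames.
    scores = heads @ heads.transpose(0, 2, 1) / np.sqrt(d)

    # Keep only frames within `window` steps of each other (adjacent frames).
    idx = np.arange(T)
    local = np.abs(idx[:, None] - idx[None, :]) <= window
    scores = np.where(local, scores, -np.inf)

    weights = softmax(scores, axis=-1)
    out = weights @ heads  # (num_heads, T, d)

    # Merge heads back into (T, D).
    return out.transpose(1, 0, 2).reshape(T, D), weights
```

With `window=1`, each frame attends only to itself and its immediate neighbours, which is one simple way to restrict attention to adjacent frames. How the article actually derives motion information from those neighbours is not stated in the abstract.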





    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications  Volume 20, Issue 7
    July 2024
    973 pages
    • Editor:
    • Abdulmotaleb El Saddik


    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 16 May 2024
    Online AM: 30 March 2024
    Accepted: 20 March 2024
    Revised: 19 March 2024
    Received: 18 September 2023
    Published in TOMM Volume 20, Issue 7


    Author Tags

    1. Video summarization
    2. parameter-free
    3. motion attention
    4. feature fusion
    5. multi-head attention


    • Research-article

    Funding Sources

    • Zhejiang Provincial Natural Science Foundation of China
    • National Natural Science Foundation of China (NSFC)

