Deep Semantic and Attentive Network for Unsupervised Video Summarization

Authors:

Sheng-Hua Zhong,

Tongwei Ren

ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), Volume 18, Issue 2

Article No.: 55, Pages 1 - 21

https://doi.org/10.1145/3477538

Published: 16 February 2022

Abstract

With the rapid growth of video data, video summarization is a promising way to condense a lengthy video into a compact version. Although supervised summarization approaches have achieved state-of-the-art performance, they require frame-level annotations, which are time-consuming and tedious to produce. In this article, we propose a novel deep summarization framework, the Deep Semantic and Attentive Network for Video Summarization (DSAVS), which selects the most semantically representative summary by minimizing the distance between the video representation and the text representation, without any frame-level labels. A second challenge in video summarization is modeling temporal information over long time spans. Long Short-Term Memory (LSTM) models temporal dependencies well but does not work well with long video clips. We therefore introduce a self-attention mechanism into our summarization framework to capture long-range temporal dependencies among frames. Extensive experiments on two popular benchmark datasets, SumMe and TVSum, show that our proposed framework outperforms other state-of-the-art unsupervised approaches and even most supervised methods.
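
The two ideas named in the abstract, self-attention across frames and a distance between a video representation and a text representation, can be illustrated with a small sketch. This is not the DSAVS architecture or its loss: the scaled dot-product attention form, the score-weighted averaging of frames, the cosine distance, the random projection matrices, and every function and variable name below are assumptions chosen only to make the ideas concrete.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(row):
    """Numerically stable softmax of a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def project(rows, matrix):
    """Multiply each row vector (length d) by a d x k matrix."""
    cols = list(zip(*matrix))
    return [[dot(r, c) for c in cols] for r in rows]


def self_attention(frames, w_q, w_k, w_v):
    """Scaled dot-product self-attention over T frame feature vectors.

    Every frame attends to every other frame, so a dependency between
    distant frames is captured in a single step.
    Returns (attended features, T x T attention weights).
    """
    q, k, v = project(frames, w_q), project(frames, w_k), project(frames, w_v)
    scale = math.sqrt(len(k[0]))
    weights = [softmax([dot(qi, kj) / scale for kj in k]) for qi in q]
    v_cols = list(zip(*v))
    attended = [[dot(w, col) for col in v_cols] for w in weights]
    return attended, weights


def weighted_video_vector(features, scores):
    """Average frame features, weighted by per-frame importance scores."""
    total = sum(scores)
    dim = len(features[0])
    return [sum(s * f[i] for s, f in zip(scores, features)) / total
            for i in range(dim)]


def cosine_distance(a, b):
    """1 - cosine similarity; smaller means semantically closer."""
    return 1.0 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def rand_matrix(rows, cols):
    """Hypothetical stand-in for learned parameters or extracted features."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
T, d, d_k = 6, 8, 4                      # frames, feature size, projection size
frames = rand_matrix(T, d)               # toy frame features
w_q, w_k, w_v = rand_matrix(d, d_k), rand_matrix(d, d_k), rand_matrix(d, d_k)
attended, weights = self_attention(frames, w_q, w_k, w_v)

# Toy importance scores in (0, 1): sigmoid of a random linear read-out.
read_out = [random.gauss(0.0, 1.0) for _ in range(d_k)]
scores = [1.0 / (1.0 + math.exp(-dot(a, read_out))) for a in attended]

# Toy text embedding assumed to live in the same d_k-dimensional space.
text_vec = [random.gauss(0.0, 1.0) for _ in range(d_k)]
loss = cosine_distance(weighted_video_vector(attended, scores), text_vec)
```

In a trainable version, `loss` would be minimized with respect to the projection matrices and the read-out, so that frames with high scores pull the video vector toward the text vector. No frame-level labels appear anywhere in this objective, which is the property the abstract emphasizes.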

Index Terms

Deep Semantic and Attentive Network for Unsupervised Video Summarization
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision tasks
        Video summarization

Published In

ACM Transactions on Multimedia Computing, Communications, and Applications Volume 18, Issue 2

May 2022

494 pages

ISSN:1551-6857

EISSN:1551-6865

DOI:10.1145/3505207

Editor:
Alberto Del Bimbo
University of Firenze, Italy

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 February 2022

Accepted: 01 July 2021

Revised: 01 May 2021

Received: 01 November 2020

Published in TOMM Volume 18, Issue 2

Qualifiers

Research-article
Refereed

Funding Sources

Natural Science Foundation of Guangdong Province
Science and Technology Innovation Commission of Shenzhen
Shenzhen high-level talents program
