research-article

Deep Semantic and Attentive Network for Unsupervised Video Summarization

Published: 16 February 2022

Abstract

With the rapid growth of video data, video summarization has become a promising way to condense a lengthy video into a compact version. Although supervised summarization approaches have achieved state-of-the-art performance, they require frame-level annotations, which are time-consuming and tedious to produce. In this article, we propose a novel deep summarization framework, the Deep Semantic and Attentive Network for Video Summarization (DSAVS), which selects the most semantically representative summary by minimizing the distance between the video representation and the text representation, without any frame-level labels. Another challenge in video summarization is the difficulty of modeling temporal information over long time spans. Long Short-Term Memory (LSTM) models temporal dependencies well but does not work well with long video clips. We therefore introduce a self-attention mechanism into our summarization framework to capture long-range temporal dependencies among frames. Extensive experiments on two popular benchmark datasets, SumMe and TVSum, show that the proposed framework outperforms other state-of-the-art unsupervised approaches and even most supervised methods.
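The two ideas in the abstract, self-attention across frames and a distance between a video representation and a text representation, can be illustrated with a small NumPy sketch. This is not the DSAVS architecture: the abstract does not specify the layers, projections, or loss. The function names `self_attention` and `semantic_distance`, the use of identity query/key/value projections, the importance-weighted mean as the video vector, and the squared Euclidean distance are all assumptions made for illustration, as is the premise that frame features and the text embedding already share one embedding space.

```python
import numpy as np


def self_attention(frames: np.ndarray) -> np.ndarray:
    """Scaled dot-product self-attention over frame features of shape (T, d).

    Assumption: queries, keys and values are the raw features (no learned
    projections), so this only shows how every frame can attend to every
    other frame regardless of how far apart they are in time.
    """
    d = frames.shape[1]
    scores = frames @ frames.T / np.sqrt(d)        # (T, T) pairwise similarities
    scores -= scores.max(axis=1, keepdims=True)    # stabilise the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # each row sums to 1
    return weights @ frames                        # (T, d) context-aware features


def semantic_distance(frames: np.ndarray,
                      importance: np.ndarray,
                      text_vec: np.ndarray) -> float:
    """Squared Euclidean distance between a video vector and a text vector.

    Assumption: the video vector is the importance-weighted mean of the
    frame features, and text_vec lives in the same d-dimensional space.
    """
    w = importance / importance.sum()
    video_vec = w @ frames
    return float(np.sum((video_vec - text_vec) ** 2))


# Toy data: 6 frames with 4-dimensional features (hypothetical values).
rng = np.random.default_rng(0)
frames = rng.normal(size=(6, 4))
attended = self_attention(frames)

# Pretend the text describes exactly what the first three frames show.
text_vec = attended[:3].mean(axis=0)

uniform = np.ones(6)                                # every frame equally important
focused = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])  # only the first three frames

print(semantic_distance(attended, uniform, text_vec))
print(semantic_distance(attended, focused, text_vec))
```

In this toy setup, the importance scores that concentrate on the frames matching the text give the smaller distance, which is the intuition behind selecting a summary by minimizing a video-to-text distance without frame-level labels.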



    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications  Volume 18, Issue 2
    May 2022
    494 pages
    ISSN:1551-6857
    EISSN:1551-6865
    DOI:10.1145/3505207

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 16 February 2022
    Accepted: 01 July 2021
    Revised: 01 May 2021
    Received: 01 November 2020
    Published in TOMM Volume 18, Issue 2


    Author Tags

    1. Video summarization
    2. visual-semantic embedding
    3. self-attention

    Qualifiers

    • Research-article
    • Refereed

    Funding Sources

    • Natural Science Foundation of Guangdong Province
    • Science and Technology Innovation Commission of Shenzhen
    • Shenzhen high-level talents program
