Multi-Level Spatiotemporal Network for Video Summarization

Published: 10 October 2022

Abstract

    With the growing number of ubiquitous camera-equipped devices, video content is widely produced in industry. Automatic video summarization allows content consumers to effectively retrieve the moments that capture their primary attention. Existing supervised methods mainly focus on frame-level information. However, video fragments within different shots are naturally richer in semantics than individual frames. We leverage this as a free latent supervision signal and introduce a novel model named the multi-level spatiotemporal network (MLSN). Our approach comprises Multi-Level Feature Representations (MLFR) and a Local Relative Loss (LRL). The MLFR module consists of frame-level, fragment-level, and shot-level features with relative position encoding; for videos with different shot durations, it can flexibly capture and accommodate semantic information at different spatiotemporal granularities. LRL uses the partial ordering relations among the frames of each fragment to capture highly discriminative features and improve the sensitivity of the model. Our method substantially improves on the best existing published method by 7% on our industrial product dataset LSVD. Meanwhile, experimental results on two widely used benchmark datasets, SumMe and TVSum, demonstrate that our method outperforms most state-of-the-art ones.
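
    The abstract does not state the formula for the Local Relative Loss, so the following is only a minimal sketch of one way a loss could use partial ordering relations among the frames of each fragment. It assumes a pairwise hinge (margin ranking) penalty restricted to frame pairs inside the same fragment. The function name `local_relative_loss`, the `margin` parameter, and the `fragment_ids` input are hypothetical and are not taken from the paper.

```python
import numpy as np


def local_relative_loss(pred, target, fragment_ids, margin=0.1):
    """Hypothetical pairwise hinge loss over frame pairs within one fragment.

    This is an illustrative assumption, not the paper's definition of LRL.
    pred, target: per-frame predicted and annotated importance scores.
    fragment_ids: per-frame integer id of the fragment the frame belongs to.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    fragment_ids = np.asarray(fragment_ids)

    total, count = 0.0, 0
    for frag in np.unique(fragment_ids):
        idx = np.flatnonzero(fragment_ids == frag)
        p, t = pred[idx], target[idx]
        # higher[i, j] is True when frame i is annotated above frame j.
        higher = t[:, None] > t[None, :]
        # Penalize pairs whose predicted gap p[i] - p[j] is below the margin.
        penalty = np.maximum(0.0, margin - (p[:, None] - p[None, :]))
        total += penalty[higher].sum()
        count += int(higher.sum())
    # Average over all ordered pairs; 0.0 if no fragment has an ordered pair.
    return total / count if count else 0.0


# Two fragments: frames 0-2 and frames 3-4.
fragment_ids = [0, 0, 0, 1, 1]
target = [0.0, 1.0, 2.0, 1.0, 0.0]
loss_good = local_relative_loss([0.1, 0.5, 0.9, 0.8, 0.2], target, fragment_ids)
loss_bad = local_relative_loss([0.9, 0.5, 0.1, 0.2, 0.8], target, fragment_ids)
```

    In this sketch, predictions that respect the annotated order inside every fragment incur no penalty, while predictions that invert it are penalized. Frames from different fragments are never compared, which is the "local" part of the assumption.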

    Supplementary Material

    MP4 File (MM22-fp1505.mp4)
    Presentation video for MLSN

    Cited By

    • (2024) Explainable Video Summarization for Advancing Media Content Production. Encyclopedia of Information Science and Technology, Sixth Edition (1-24). DOI: 10.4018/978-1-6684-7366-5.ch065. Online publication date: 1-Jul-2024
    • (2024) BNoteHelper: A Note-based Outline Generation Tool for Structured Learning on Video-sharing Platforms. ACM Transactions on the Web 18:2 (1-30). DOI: 10.1145/3638775. Online publication date: 12-Mar-2024
    • (2023) Self-supervised Video Summarization Guided by Semantic Inverse Optimal Transport. Proceedings of the 31st ACM International Conference on Multimedia (6611-6622). DOI: 10.1145/3581783.3612087. Online publication date: 26-Oct-2023

      Published In

      MM '22: Proceedings of the 30th ACM International Conference on Multimedia
      October 2022
      7537 pages
      ISBN: 9781450392037
      DOI: 10.1145/3503161
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. local relative loss
      2. multi-level spatiotemporal network
      3. video summarization

      Qualifiers

      • Research-article

      Conference

      MM '22

      Acceptance Rates

      Overall Acceptance Rate 995 of 4,171 submissions, 24%

      Upcoming Conference

      MM '24
      The 32nd ACM International Conference on Multimedia
      October 28 - November 1, 2024
      Melbourne, VIC, Australia

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months): 155
      • Downloads (Last 6 weeks): 8

      Citations

      Cited By

      • (2023) Self-Supervised Adversarial Video Summarizer With Context Latent Sequence Learning. IEEE Transactions on Circuits and Systems for Video Technology 33:8 (4122-4136). DOI: 10.1109/TCSVT.2023.3240464. Online publication date: 1-Aug-2023
