Modality correlation-based video summarization

Multimedia Tools and Applications

Abstract

Video summarization, which extracts frames or shots from the original video, is an important technique for browsing, storing, and retrieving a rapidly growing volume of video data. Text information covers important content of a video, so a summary can be generated by exploring the correlation between frames and text. In this study, we propose a video summarization method based on modality correlation. With this method, we first learn the correlation between text and frames in each respective space, and then fuse the two correlations to obtain an importance score for each shot. Finally, the video shots with high importance scores are chosen as the video summary. Previous methods seldom use text to generate the summary, or use only the latent common information between text and frames. In contrast, the proposed method exploits both the latent common information and the modality-specific information. Experiments were conducted on the TVSum50 dataset, and the results verify the effectiveness of the proposed approach.
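The fuse-then-select step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the fusion rule or the selection budget, so everything here is an assumption made for illustration: the function names `fuse_correlations` and `select_shots`, the convex combination with weight `alpha`, and the fixed top-`k` selection are hypothetical and are not the authors' method. The two input arrays stand in for per-shot text-frame correlations that have already been learned in the two spaces.

```python
import numpy as np


def fuse_correlations(corr_a, corr_b, alpha=0.5):
    """Fuse two per-shot correlation scores into one importance score.

    Assumption: a convex combination weighted by `alpha`. The source only
    says the two correlations are fused, not how.
    """
    corr_a = np.asarray(corr_a, dtype=float)
    corr_b = np.asarray(corr_b, dtype=float)
    if corr_a.shape != corr_b.shape:
        raise ValueError("both correlation arrays need one score per shot")
    return alpha * corr_a + (1.0 - alpha) * corr_b


def select_shots(scores, k):
    """Return the indices of the k highest-scoring shots in temporal order."""
    ranked = np.argsort(scores)[::-1][:k]  # highest importance first
    return sorted(ranked.tolist())         # restore temporal order


# Toy per-shot correlations for a four-shot video (made-up numbers).
corr_space_1 = [0.9, 0.1, 0.5, 0.3]
corr_space_2 = [0.7, 0.2, 0.6, 0.1]

importance = fuse_correlations(corr_space_1, corr_space_2, alpha=0.5)
summary = select_shots(importance, k=2)
```

With these toy inputs the first and third shots receive the highest fused scores and form the summary. In practice the selection budget is often expressed as a fraction of the video length rather than a fixed `k`, which is a design choice this sketch leaves out.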

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grants 61671274, 61876098, and 61573219; in part by the Postdoctoral Science Foundation under Grant 2016M592190; and in part by the Fostering Project of Dominant Discipline and Talent Team of Shandong Province Higher Education Institutions.

Author information

Corresponding author

Correspondence to Xiushan Nie.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Wang, X., Nie, X., Liu, X. et al. Modality correlation-based video summarization. Multimed Tools Appl 79, 33875–33890 (2020). https://doi.org/10.1007/s11042-020-08690-3

  • DOI: https://doi.org/10.1007/s11042-020-08690-3
