
Video reasoning for conflict events through feature extraction

The Journal of Supercomputing

Abstract

The rapid growth of multimedia data and advances in deep learning have made it possible to train high-accuracy models in many fields. Video-understanding tools such as video classification, temporal action detection, and video summarization are now widely available. In daily life, many social incidents begin with a small conflict event; if a conflict and the danger that may follow can be recognized from video, such incidents can be prevented at an early stage. This research presents a video and audio reasoning network that infers possible conflict events from video and audio features. To make the model more generalizable to related tasks, we add a predictive network that estimates the risk of a conflict event, and we use multitask learning so that the learned video and audio representations transfer to similar tasks. We also propose several methods for fusing video and audio features, which improve the reasoning performance of the model. The proposed model, the Video and Audio Reasoning Network (VARN), is more accurate than competing models; compared with RandomNet, it achieves 2.9 times higher accuracy.
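
The abstract outlines the core idea: precomputed video and audio features are fused and passed through a multitask network with one head for conflict-event reasoning and one for risk prediction. The PyTorch sketch below is only an illustration of that fusion/multitask structure, not the authors' VARN implementation; the layer names, feature dimensions, class counts, loss weighting, and the concatenation-based fusion are all assumptions.

# Minimal sketch (assumed structure, not the authors' VARN code):
# fuse per-clip video and audio feature vectors, then predict both the
# conflict event class and a risk level from the shared representation.
import torch
import torch.nn as nn

class FusionMultitaskNet(nn.Module):
    def __init__(self, video_dim=2048, audio_dim=128, hidden_dim=512,
                 num_events=10, num_risk_levels=3):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, hidden_dim)
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        # Concatenation-based fusion; the paper proposes several fusion
        # methods, this is just one simple choice.
        self.fusion = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
        )
        # Two task heads: conflict-event reasoning and risk prediction.
        self.event_head = nn.Linear(hidden_dim, num_events)
        self.risk_head = nn.Linear(hidden_dim, num_risk_levels)

    def forward(self, video_feat, audio_feat):
        v = torch.relu(self.video_proj(video_feat))
        a = torch.relu(self.audio_proj(audio_feat))
        fused = self.fusion(torch.cat([v, a], dim=-1))
        return self.event_head(fused), self.risk_head(fused)

# Toy usage: a batch of 4 clips with precomputed (hypothetical) features.
model = FusionMultitaskNet()
video_feat = torch.randn(4, 2048)   # e.g. pooled CNN features per clip
audio_feat = torch.randn(4, 128)    # e.g. pooled spectrogram embeddings
event_logits, risk_logits = model(video_feat, audio_feat)

# Multitask objective: weighted sum of the two classification losses.
event_labels = torch.randint(0, 10, (4,))
risk_labels = torch.randint(0, 3, (4,))
loss = nn.functional.cross_entropy(event_logits, event_labels) \
     + 0.5 * nn.functional.cross_entropy(risk_logits, risk_labels)

Training both heads against a shared fused representation is the multitask idea the abstract refers to: the auxiliary risk-prediction task regularizes the shared video/audio features so they generalize better to related tasks.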





Acknowledgement

This work was supported in part by the 109 Promote the Park Smart Robot Innovation Self-Made Base Subsidy Program under Grant AIS10904, in part by the "Allied Advanced Intelligent Biomedical Research Center, STUST" from the Higher Education Sprout Project, Ministry of Education, Taiwan, and in part by the Ministry of Science and Technology (MOST) of Taiwan under Grant MOST 109-2221-E-218-026.

Author information


Corresponding author

Correspondence to Gwo-Jiun Horng.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Cheng, ST., Hsu, CW., Horng, GJ. et al. Video reasoning for conflict events through feature extraction. J Supercomput 77, 6435–6455 (2021). https://doi.org/10.1007/s11227-020-03514-5

