Abstract
The rapid growth of multimedia data and advances in deep learning have made it possible to train high-accuracy models for many fields. Techniques such as video classification, temporal action detection, and video summarization are now available for video understanding. In daily life, many social incidents begin with a small conflict. If conflicts and the dangers that follow them can be detected from video, social incidents can be prevented at an early stage. This research presents a video and audio reasoning network that infers possible conflict events from video and audio features. To make the model generalize better to other tasks, we also add a predictive network that estimates the risk of a conflict event, and we use multitask learning so that the learned video and audio representations transfer to similar tasks. We further propose several methods for integrating video features with audio features, which improve the reasoning performance of the model. The proposed model, the Video and Audio Reasoning Network (VARN), is more accurate than competing models; compared with RandomNet, it achieves 2.9 times higher accuracy.
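The abstract does not spell out the fusion or multitask details, so the following is only a minimal PyTorch sketch of the general idea it describes: a video feature vector and an audio feature vector are fused (here by simple concatenation, one of several possible integration methods) and passed through a shared trunk feeding two heads, one that classifies the conflict event and one that predicts its risk, trained jointly with a multitask loss. All module names, dimensions, and the choice of concatenation are illustrative assumptions, not the paper's actual VARN architecture.

```python
# Minimal sketch of video-audio feature fusion with two multitask heads.
# All names, dimensions, and the concatenation-based fusion are illustrative
# assumptions; the actual VARN design in the paper may differ.
import torch
import torch.nn as nn

class FusionReasoningNet(nn.Module):
    def __init__(self, video_dim=2048, audio_dim=128,
                 hidden_dim=512, num_events=10):
        super().__init__()
        # Shared trunk over the fused (concatenated) features.
        self.trunk = nn.Sequential(
            nn.Linear(video_dim + audio_dim, hidden_dim),
            nn.ReLU(),
        )
        # Head 1: infer which conflict event is occurring.
        self.event_head = nn.Linear(hidden_dim, num_events)
        # Head 2: predict the risk of the conflict event as a scalar in [0, 1].
        self.risk_head = nn.Sequential(nn.Linear(hidden_dim, 1), nn.Sigmoid())

    def forward(self, video_feat, audio_feat):
        fused = torch.cat([video_feat, audio_feat], dim=-1)  # late fusion
        h = self.trunk(fused)
        return self.event_head(h), self.risk_head(h)

# Joint multitask loss: classification for events, regression for risk.
model = FusionReasoningNet()
video_feat = torch.randn(4, 2048)   # e.g. pooled per-clip CNN features
audio_feat = torch.randn(4, 128)    # e.g. pooled spectrogram features
event_logits, risk = model(video_feat, audio_feat)
event_labels = torch.randint(0, 10, (4,))
risk_labels = torch.rand(4, 1)
loss = (nn.functional.cross_entropy(event_logits, event_labels)
        + nn.functional.mse_loss(risk, risk_labels))
loss.backward()
```

Sharing the trunk across both heads is what lets the multitask setup regularize the fused representation; swapping concatenation for another fusion operator only changes the first line of `forward`.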
Acknowledgement
This work was supported in part by the Year 109 Park Smart Robot Innovation Self-Made Base Subsidy Program under Grant AIS10904, in part by the "Allied Advanced Intelligent Biomedical Research Center, STUST" from the Higher Education Sprout Project, Ministry of Education, Taiwan, and in part by the Ministry of Science and Technology (MOST) of Taiwan under Grant MOST 109-2221-E-218-026.
Cite this article
Cheng, ST., Hsu, CW., Horng, GJ. et al. Video reasoning for conflict events through feature extraction. J Supercomput 77, 6435–6455 (2021). https://doi.org/10.1007/s11227-020-03514-5