Abstract
The rapid growth of multimedia data and advances in deep learning have made it possible to train high-accuracy models for many fields. Techniques such as video classification, temporal action detection, and video summarization are now available for video understanding. In daily life, many social incidents begin with a small conflict. If conflicts and the dangers that follow them can be detected from video, social incidents can be prevented at an early stage. This research presents a video and audio reasoning network that infers possible conflict events from video and audio features. To make the model generalize better to other tasks, we also add a predictive network that estimates the risk of a conflict event, and we use multitask learning so that the learned video and audio representations transfer to similar tasks. We further propose several methods for integrating video features with audio features, which improve the reasoning performance of the model. The proposed model, the Video and Audio Reasoning Network (VARN), is more accurate than competing models; compared with RandomNet, it achieves 2.9 times higher accuracy.
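The abstract does not spell out the fusion or multitask details, so the following is only a minimal PyTorch sketch of the general idea it describes: a video feature vector and an audio feature vector are fused (here by simple concatenation, one of several possible integration methods) and passed through a shared trunk feeding two heads, one that classifies the conflict event and one that predicts its risk, trained jointly with a multitask loss. All module names, dimensions, and the choice of concatenation are illustrative assumptions, not the paper's actual VARN architecture.

```python
# Minimal sketch of video-audio feature fusion with two multitask heads.
# All names, dimensions, and the concatenation-based fusion are illustrative
# assumptions; the actual VARN design in the paper may differ.
import torch
import torch.nn as nn

class FusionReasoningNet(nn.Module):
    def __init__(self, video_dim=2048, audio_dim=128,
                 hidden_dim=512, num_events=10):
        super().__init__()
        # Shared trunk over the fused (concatenated) features.
        self.trunk = nn.Sequential(
            nn.Linear(video_dim + audio_dim, hidden_dim),
            nn.ReLU(),
        )
        # Head 1: infer which conflict event is occurring.
        self.event_head = nn.Linear(hidden_dim, num_events)
        # Head 2: predict the risk of the conflict event as a scalar in [0, 1].
        self.risk_head = nn.Sequential(nn.Linear(hidden_dim, 1), nn.Sigmoid())

    def forward(self, video_feat, audio_feat):
        fused = torch.cat([video_feat, audio_feat], dim=-1)  # late fusion
        h = self.trunk(fused)
        return self.event_head(h), self.risk_head(h)

# Joint multitask loss: classification for events, regression for risk.
model = FusionReasoningNet()
video_feat = torch.randn(4, 2048)   # e.g. pooled per-clip CNN features
audio_feat = torch.randn(4, 128)    # e.g. pooled spectrogram features
event_logits, risk = model(video_feat, audio_feat)
event_labels = torch.randint(0, 10, (4,))
risk_labels = torch.rand(4, 1)
loss = (nn.functional.cross_entropy(event_logits, event_labels)
        + nn.functional.mse_loss(risk, risk_labels))
loss.backward()
```

Sharing the trunk across both heads is what lets the multitask setup regularize the fused representation; swapping concatenation for another fusion operator only changes the first line of `forward`.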
Acknowledgement
This work was supported in part by the Year 109 Park Smart Robot Innovation Self-Made Base Subsidy Program under Grant AIS10904, in part by the "Allied Advanced Intelligent Biomedical Research Center, STUST" from the Higher Education Sprout Project, Ministry of Education, Taiwan, and in part by the Ministry of Science and Technology (MOST) of Taiwan under Grant MOST 109-2221-E-218-026.
Cite this article
Cheng, ST., Hsu, CW., Horng, GJ. et al. Video reasoning for conflict events through feature extraction. J Supercomput 77, 6435–6455 (2021). https://doi.org/10.1007/s11227-020-03514-5