Abstract—Violent action recognition has significant importance in developing automated video surveillance systems. Over the last few years, violence detection, such as fight activity recognition, has mostly been achieved through hand-crafted feature detectors. Some researchers have also investigated learning-based representation models. These approaches achieved high accuracies on the Hockey and Movies benchmark datasets, which were specifically designed for the detection of violent sequences. However, these techniques have limitations in learning discriminative features for videos with abrupt camera motion, such as those in the Hockey dataset. Deep representation based approaches have been used successfully in image recognition and human action detection tasks. This paper proposes a deep representation based model using the concept of transfer learning for violent scene detection, to identify aggressive human behaviors. The results show that the proposed approach outperforms state-of-the-art accuracies by learning the most discriminative features, achieving 99.28% and 99.97% accuracy on the Hockey and Movies datasets respectively, for the task of violent action recognition in videos.

Keywords—Violence Detection, Fight Recognition, Surveillance Videos, Deep CNN, GoogleNet, Transfer Learning

I. INTRODUCTION
In video surveillance, hundreds and thousands of cameras are deployed within cities to critically assure public safety, but nowadays it is almost impossible to monitor all of them manually to keep an eye on violent activities. Rather, there is a significant requirement for automated video surveillance systems that automatically track and monitor such activities and, in case of an emergency, alert the controlling authorities to take appropriate measures against the detected violence. Violence recognition is a key step towards developing automated security surveillance systems that distinguish normal human activities from abnormal/violent actions. Normal human activities are often categorized as routine interactive behaviors, such as walking, jogging, running, and hand waving [1] [2]. Violence, by contrast, concerns unusual furious actions, such as a fight between two or more people [3].

In the last few years, the task of human action recognition has received much attention from the research community for detecting everyday human activities through video analysis; see the surveys [4] [5]. However, little attention had been paid to the problem of human violent action detection until violent fight sequences became available. The authors of [6] created two datasets specifically for fight detection, to distinguish violent/fight incidents from normal events. Before their availability, most datasets were concerned with general human activities; these datasets were the first of their kind focused on violent scene detection, supporting precise surveillance systems for monitoring indoor and outdoor environments.

Historically, human activity recognition has been achieved through traditional hand-crafted feature representation approaches such as Histogram of Oriented Gradients (HOG), Scale-Invariant Feature Transform (SIFT), Hessian3D, and Local Binary Patterns (LBP). More recently, there is a growing tendency to solve this problem by adopting learning-based deep representation techniques, such as Convolutional Neural Networks (CNN), 3D-CNN for spatio-temporal analysis, CNN followed by Recurrent Neural Networks (RNN), and Spiking Neural Networks (SNN); see the surveys [7] [8].

For violence detection, most existing approaches rely on hand-defined feature descriptors to distinguish fight sequences from normal ones, a scheme often used in the human action recognition domain. Since the introduction of the two violence/fight-specific datasets, most techniques have depended on hand-crafted feature representations for violence identification, such as Space-Time Interest Points (STIP), Motion SIFT (MoSIFT), motion features, and motion blobs, combined with audio-visual analysis including blood and flame detection [6], [9]–[12]. However, a few studies have applied deep learning techniques such as 2D-CNN, 3D-CNN, and C3D [13]–[15]. Beyond that, transfer learning with deep representation models remains scarcely explored for the violent/fight detection problem in the violent action recognition domain.

Deep learning based approaches are generally called end-to-end learning. Deep representation based Convolutional Neural Network (CNN) models have a history starting from hand-written digit classification and, over the years, evolved into state-of-the-art deep CNN architectures such as AlexNet, GoogleNet, and ResNet. These architectures were winning models of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), due to their remarkable accuracy on the image classification task; the networks are trained on 15 million annotated images covering 1000 categories [16]–[19]. However, successfully training a deep learning network requires a very large dataset to learn generalized features. To combat this huge data requirement, the concept of transfer learning has been adopted by many researchers as a promising strategy. In transfer learning, a CNN model pre-trained on a specific dataset, which has already learnt features for some specific task, is transferred and fine-tuned for a new task, even in an entirely different domain [20]. Due to this powerful concept, researchers started using transfer learning for numerous tasks.
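To make the fine-tuning step concrete, the sketch below shows one plausible realization; the paper does not state its implementation framework, so the PyTorch library choice, variable names, and hyperparameters here are illustrative assumptions. An ImageNet pre-trained GoogLeNet has its 1000-way classifier replaced by a two-class fight/non-fight head, and all weights are then fine-tuned at a small learning rate.

```python
# Minimal transfer-learning sketch. The paper does not name its framework,
# so PyTorch/torchvision and every hyperparameter below are assumptions.
import torch
import torch.nn as nn
from torchvision import models

# Source task: GoogLeNet pre-trained on ImageNet classification.
model = models.googlenet(weights=models.GoogLeNet_Weights.IMAGENET1K_V1)

# Target task: replace the 1000-way ImageNet head with a
# 2-way fight / non-fight classifier.
model.fc = nn.Linear(model.fc.in_features, 2)

# Fine-tune all layers; a small learning rate preserves the generic
# features already learnt on the source dataset.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
criterion = nn.CrossEntropyLoss()
```

One design choice in this sketch is to keep every layer trainable rather than freezing the convolutional base; either variant fits the transfer-learning concept depicted in Fig. 1.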
Fig. 1. Transfer learning concept, with the source task network architecture transformed into the target task network architecture. A source task network pre-trained on the source dataset is fine-tuned on the target dataset to become the target task network.
Fig. 2. Hockey and Movies dataset samples. First row: the left two images are fight frames and the right two are non-fight frames from the Hockey dataset. Second row: the left two images are fight frames and the right two are non-fight frames from the Movies dataset.
Fig. 3. Training progress on the Hockey and Movies datasets over the 5 epochs of the 1st fold. The left image shows the 1st fold training progress on the Hockey dataset; the right image shows the 1st fold training progress on the Movies dataset.
Fig. 4. Hockey and Movies dataset accuracies over the 10 folds. The left image shows the accuracy for each fold of the Hockey dataset; the right image shows the accuracy for each fold of the Movies dataset.
C. Training progress
Distinct GoogleNet models are trained for each fold separately. The models are trained in sequential order, with each fold trained for 5 epochs, on both datasets.

Fig. 3 shows the training progress over the 5 epochs of the 1st fold on the Hockey and Movies datasets. The accuracy indicators show a drastic increase in accuracy, with the loss reduced, in the very 1st and 2nd training epochs. The model approaches its highest accuracy quickly, in the 4th and 5th epochs, so training is stopped at this stage to avoid overfitting the network.

Fig. 4 shows the accuracy achieved at the end of each fold; 10 distinct accuracy values are reported for the 10 folds. The Hockey dataset, with abrupt camera motion in its video frames, produces varying highest-accuracy values across folds, whereas the Movies dataset achieves a more consistent highest accuracy rate on each fold.
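A compact sketch of this per-fold protocol is given below. It is illustrative only: the helpers `build_model` and `make_fold_loaders` are hypothetical, since the data pipeline and optimizer settings are not specified at this level of detail in the text; assume `build_model()` returns the two-class network from the earlier sketch, with GoogLeNet's auxiliary classifiers disabled so that `forward` returns plain logits.

```python
# Illustrative 10-fold cross-validation protocol: a fresh model per fold,
# 5 training epochs each, then per-fold test accuracy. Helper names
# (build_model, make_fold_loaders) are hypothetical.
import statistics
import torch

NUM_FOLDS, NUM_EPOCHS = 10, 5
fold_accuracies = []

for fold in range(NUM_FOLDS):
    model = build_model()                      # distinct model for each fold
    train_loader, test_loader = make_fold_loaders(fold)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
    criterion = torch.nn.CrossEntropyLoss()

    model.train()
    for _ in range(NUM_EPOCHS):                # stop at 5 epochs to avoid overfitting
        for frames, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(frames), labels)
            loss.backward()
            optimizer.step()

    model.eval()                               # per-fold accuracy, as in Fig. 4
    correct = total = 0
    with torch.no_grad():
        for frames, labels in test_loader:
            correct += (model(frames).argmax(dim=1) == labels).sum().item()
            total += labels.numel()
    fold_accuracies.append(100.0 * correct / total)

print("per-fold accuracies:", fold_accuracies)
print("mean accuracy: %.2f%%" % statistics.mean(fold_accuracies))
```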
TABLE I. COMPARISON OF CLASSIFICATION ACCURACY RESULTS ON THE HOCKEY AND MOVIES DATASETS

Year  Author/Method        Features/Classifiers          Testing Scheme    Hockey Acc. (%)  Movies Acc. (%)
2013  Bermejo et al. [6]   STIP (HOG) + HIK              5-Fold CV         91.7 ± -         49.0 ± -
                           STIP (HOF) + HIK                                88.6 ± -         59.0 ± -
                           MoSIFT + HIK                                    90.9 ± -         89.5 ± -
2014  Deniz et al. [9]     SVM                           10-Fold CV        90.1 ± 0         85.4 ± 9.3
                           Adaboost                                        90.1 ± 0         98.9 ± 0.2
2014  Ding et al. [14]     3D-CNN                        Train/Test split  91 ± -           -
2015  ViF [10]             SVM                           10-Fold CV        82.3 ± 0.2       96.7 ± 0.3
                           Adaboost                                        82.2 ± 0.4       92.8 ± 0.4
                           Random Forests                                  82.4 ± 0.6       88.9 ± 1.2
2015  LMP [10]             SVM                           10-Fold CV        75.9 ± 0.3       84.4 ± 0.8
                           Adaboost                                        76.5 ± 0.9       81.5 ± 2.1
                           Random Forests                                  77.7 ± 0.6       92.0 ± 0.1
2015  Serrano [10]         SVM                           10-Fold CV        72.5 ± 0.5       87.2 ± 0.7
                           Adaboost                                        71.7 ± 0.3       81.7 ± 0.2
                           Random Forests                                  82.4 ± 0.6       97.8 ± 0.4
2018  Serrano et al. [13]  C3D [33]                      10-Fold CV        87.4 ± 1.2       93.6 ± 0.8
                           2D-CNN                                          87.8 ± 0.3       93.1 ± 0.3
                           2D-CNN + HOG Forest                             94.6 ± 0.6       99.0 ± 0.5
-     Proposed Method      Deep CNN using transfer       10-Fold CV        99.28            99.97
                           learning
Results show that the proposed model outperforms the top-ranked approaches by learning the finest features on these challenging datasets. The model achieved 99.28% and 99.97% accuracy on the Hockey and Movies datasets respectively, in just 5 training epochs. The proposed method is able to learn the most discriminative features for abrupt camera motion and dynamic scene sequences, for the task of violent action detection in videos.

ACKNOWLEDGMENT
This research is supported by the PDE-GIR project, which has received funding from the European Union's Horizon 2020 research and innovation programme under Marie Skłodowska-Curie grant agreement No 778035.
REFERENCES
[1] C. Schuldt, I. Laptev, and B. Caputo, "Recognizing human actions: A local SVM approach," in Proceedings of the International Conference on Pattern Recognition, 2004.
[2] L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri, "Actions as space-time shapes," IEEE Trans. Pattern Anal. Mach. Intell., 2007.
[3] M. Marszałek, I. Laptev, and C. Schmid, "Actions in context," in 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPR Workshops), 2009.
[4] R. Poppe, "A survey on vision-based human action recognition," Image Vis. Comput., vol. 28, no. 6, pp. 976–990, 2010.
[5] S.-R. Ke, H. Thuc, Y.-J. Lee, J.-N. Hwang, J.-H. Yoo, and K.-H. Choi, "A review on video-based human activity recognition," Computers, 2013.
[6] E. Bermejo, O. Deniz, G. Bueno, and R. Sukthankar, "Violence detection in video using computer vision techniques," in CAIP'11: Proc. 14th Int. Conf. on Computer Analysis of Images and Patterns, Part II, 2011.
[7] A. Sargano, P. Angelov, and Z. Habib, "A comprehensive review on handcrafted and learning-based action representation approaches for human activity recognition," Appl. Sci., vol. 7, no. 1, p. 110, 2017.
[8] D. Wu, N. Sharma, and M. Blumenstein, "Recent advances in video-based human action recognition using deep learning: A review," pp. 2865–2872, 2017.
[9] O. Deniz, I. Serrano, G. Bueno, and T.-K. Kim, "Fast violence detection in video," in 2014 International Conference on Computer Vision Theory and Applications (VISAPP), 2014, vol. 2, pp. 478–485.
[10] I. S. Gracia, O. D. Suarez, G. B. Garcia, and T.-K. Kim, "Fast fight detection," PLoS One, 2015.
[11] J. Nam, M. Alghoniemy, and A. H. Tewfik, "Audio-visual content-based violent scene characterization," in Proc. 1998 International Conference on Image Processing (ICIP 98), 1998, vol. 1, pp. 353–357.
[12] W.-H. Cheng, W.-T. Chu, and J.-L. Wu, "Semantic context detection based on hierarchical audio models," in Proc. 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, 2003, pp. 109–115.
[13] I. Serrano, O. Deniz, J. L. Espinosa-Aranda, and G. Bueno, "Fight recognition in video using Hough forests and 2D convolutional neural network," IEEE Trans. Image Process., 2018.
[14] C. Ding, S. Fan, M. Zhu, W. Feng, and B. Jia, "Violence detection in video by using 3D convolutional neural networks," in International Symposium on Visual Computing, Springer, Cham, 2014, pp. 551–558.
[15] P. Zhou, Q. Ding, H. Luo, X. Hou, B. Jin, and P. Maass, "Violent interaction detection in video based on deep learning," in Journal of Physics: Conference Series, 2017, vol. 844, no. 1, p. 12044.
[16] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278–2323, 1998.
[17] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Adv. Neural Inf. Process. Syst., pp. 1–9, 2012.
[18] C. Szegedy et al., "Going deeper with convolutions," pp. 1–9, 2014.
[19] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[20] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," Lect. Notes Comput. Sci., vol. 8689, part 1, pp. 818–833, 2014.
[21] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, "Large-scale video classification with convolutional neural networks," in 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1725–1732.
[22] A. B. Sargano, X. Wang, P. Angelov, and Z. Habib, "Human action recognition using transfer learning with deep representations," in 2017 International Joint Conference on Neural Networks (IJCNN), 2017, pp. 463–469.
[23] J. Donahue et al., "DeCAF: A deep convolutional activation feature for generic visual recognition," 2013.
[24] C. Clarin, J. Dionisio, M. Echavez, and P. Naval, "DOVE: Detection of movie violence using motion intensity analysis on skin and blood," PCSC, vol. 6, pp. 150–156, 2005.
[25] W. Zajdel, J. D. Krijnders, T. Andringa, and D. M. Gavrila, "CASSANDRA: Audio-video sensor fusion for aggression detection," in 2007 IEEE Conference on Advanced Video and Signal Based Surveillance (AVSS), 2007, pp. 200–205.
[26] Y. Gong, W. Wang, S. Jiang, Q. Huang, and W. Gao, "Detecting violent scenes in movies by auditory and visual cues," in Pacific-Rim Conference on Multimedia, 2008, pp. 317–326.
[27] T. Giannakopoulos, D. Kosmopoulos, A. Aristidou, and S. Theodoridis, "Violence content classification using audio features," in Hellenic Conference on Artificial Intelligence, 2006, pp. 502–507.
[28] D. Chen, H. Wactlar, M. Chen, C. Gao, A. Bharucha, and A. Hauptmann, "Recognition of aggressive human behavior using binary local motion descriptors," in 2008 30th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBS), 2008, pp. 5238–5241.
[29] J. Lin and W. Wang, "Weakly-supervised violence detection in movies with audio and video based co-training," in Pacific-Rim Conference on Multimedia, 2009, pp. 930–935.
[30] T. Giannakopoulos, A. Makris, D. Kosmopoulos, S. Perantonis, and S. Theodoridis, "Audio-visual fusion for detecting violent scenes in videos," in Hellenic Conference on Artificial Intelligence, 2010, vol. 6040, pp. 91–100.
[31] L.-H. Chen, C.-W. Su, and H.-W. Hsu, "Violent scene detection in movies," Int. J. Pattern Recognit. Artif. Intell., vol. 25, no. 8, pp. 1161–1172, 2011.
[32] L. Xu, C. Gong, J. Yang, Q. Wu, and L. Yao, "Violent video detection based on MoSIFT feature and sparse coding," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 3538–3542.
[33] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3D convolutional networks," in Proceedings of the IEEE International Conference on Computer Vision, 2015.
[34] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," pp. 1–14, 2014.
[35] Y. Aytar, "Transfer learning for object category detection," University of Oxford, 2014.
[36] Y.-C. Su, T.-H. Chiu, C.-Y. Yeh, H.-F. Huang, and W. H. Hsu, "Transfer learning for video recognition with scarce training data for deep convolutional neural network," arXiv preprint arXiv:1409.4127, 2014.