2018 2nd European Conference on Electrical Engineering and Computer Science (EECS)

Violence Detection in Surveillance Videos with Deep Network using Transfer Learning

Aqib Mumtaz, Allah Bux Sargano, Zulfiqar Habib
Department of Computer Sciences, COMSATS University Islamabad, Lahore, Pakistan
fa16-rcs-022@cuilahore.edu.pk, aqib.mumtaz@gmail.com, allahbux@cuilahore.edu.pk, drzhabib@cuilahore.edu.pk

Abstract—Violent action recognition is of significant importance in developing automated video surveillance systems. Over the last few years, violence detection, such as fight activity recognition, has mostly been achieved through hand-crafted feature detectors. Some researchers have also investigated learning based representation models. These approaches achieved high accuracies on the Hockey and Movies benchmark datasets, which are specifically designed for the detection of violent sequences. However, these techniques have limitations in learning discriminating features for videos with the abrupt camera motion of the Hockey dataset. Deep representation based approaches have been successfully used in image recognition and human action detection tasks. This paper proposes a deep representation based model using the concept of transfer learning for violent scene detection, to identify aggressive human behaviors. The results show that the proposed approach outperforms state-of-the-art accuracies by learning the most discriminating features, achieving 99.28% and 99.97% accuracy on the Hockey and Movies datasets respectively for the task of violent action recognition in videos.

Keywords—Violence Detection, Fight Recognition, Surveillance Videos, Deep CNN, GoogleNet, Transfer Learning

I. INTRODUCTION

In video surveillance, hundreds and thousands of cameras are deployed within cities to assure public safety, but it is nowadays almost impossible to manually monitor all of them to keep an eye on violent activities. Rather, there is a significant requirement for automated video surveillance systems that automatically track and monitor such activities and, in case of emergency, alarm the controlling authorities to take appropriate measures against the detected violence. Violence recognition is a key step towards developing automated security surveillance systems that distinguish normal human activities from abnormal/violent actions. Normal human activities are often categorized as routine interactive behaviors, such as walking, jogging, running, and hand waving [1] [2]. Violence, however, refers to unusual furious actions, such as a fight happening between two or more people [3].

In the last few years, the task of human action recognition has received much attention from the research community for detecting everyday human activities through video analysis; see the surveys [4] [5]. However, little attention was paid to the problem of human violent action detection until violent/fight sequences became available. The authors of [6] created two datasets specifically for fight activity detection, to distinguish violent/fight incidents from normal events. Before their availability, most of the available datasets were particularly concerned with general human activities. Conversely, these datasets are the first of their kind focused on violent scene detection, for building precise surveillance systems that monitor indoor and outdoor environments.

Historically, human activity recognition has been achieved through traditional hand-crafted feature representation approaches such as Histogram of Oriented Gradients (HOG), Scale-Invariant Feature Transform (SIFT), Hessian3D, and Local Binary Patterns (LBP). More recently, there is a growing tendency to solve this problem by adopting learning based deep representation techniques, such as Convolutional Neural Networks (CNN), 3D-CNN for spatio-temporal analysis, CNN followed by Recurrent Neural Networks (RNN), and Spiking Neural Networks (SNN); see the surveys [7] [8].

For violence detection, most existing approaches rely on hand-defined feature descriptors to distinguish fight sequences from normal ones, a scheme often used in the human action recognition domain. Since the introduction of the two violence/fight specific datasets, most techniques have depended on hand-crafted feature representations for violence identification, such as Space-Time Interest Points (STIP), Motion SIFT (MoSIFT), motion features, and motion blobs, as well as audio-visual analysis along with blood and flame detection [6], [9]–[12]. However, only a few studies have used deep learning techniques such as 2D-CNN, 3D-CNN, and C3D [13]–[15]. Besides that, deep representation models based on transfer learning have scarcely been used to solve the violence/fight detection problem in the violent action recognition domain.

Deep learning based approaches are generally called end-to-end learning. The history of deep representation based Convolutional Neural Network (CNN) models starts from hand-written digit classification. Over the years, they evolved into state-of-the-art deep CNN architectures such as AlexNet, GoogleNet, and ResNet. These architectures are winning models of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), owing to their remarkable accuracy on the image classification task, and are trained on 15 million annotated images for 1000 categories [16]–[19]. However, in order to successfully train a deep learning network, a very large dataset is required to learn generalized features. To combat the challenge of this huge data requirement, the concept of transfer learning has been adopted by many researchers as a promising strategy. In transfer learning, a CNN model pre-trained on a specific dataset, which has already learnt features for a specific task, can be transferred and fine-tuned for a new task, even in an entirely different domain [20]. Due to this powerful concept, researchers started using transfer learning for numerous tasks

of image classification, as well as action recognition [21] [22]. Transfer learning offers optimal strategies for fine-tuning a network, in which a pre-trained network is fine-tuned on a target dataset to successfully perform the new task in the new domain [23].

Despite the fact that deep learning techniques have been successfully used for human action recognition, these techniques in combination with the transfer learning concept have not been considered by researchers for violence detection. This research work proposes a transfer learning based deep CNN model to detect violent/fight activities in video sequences. GoogleNet [18] is selected as the pre-trained model due to its deep network architecture with 12 times fewer parameters than AlexNet. It is fine-tuned on the Hockey and Movies datasets using transfer learning to create a deep representation classifier for violent scene identification. Results show that the proposed approach outperforms all competitive state-of-the-art published approaches, from both the hand-crafted and deep learning domains, on both datasets. The rest of the paper is organized into Related Work, Methodology, Datasets, Experiments, Results, and Conclusion in sections II, III, IV, V, VI, and VII respectively.

II. RELATED WORK

Initial proposals adopted the methodology of violence recognition using blood and flame detection, capturing the degree of motion, recognizing sound features by exploiting audio-visual correlation, detecting skin and blood pattern exposure, and discovering scream-like cues in audio through audio-video correlation for violent scene detection [11], [24]–[26]. Later, audio features were used to detect gunshots, explosions, and car-braking activities using Hidden Markov Models (HMM) and Gaussian mixture models [12]. Audio characteristics from the time and frequency domains have been classified using a Support Vector Machine (SVM) [27].

Furthermore, Chen et al. used spatio-temporal video cubes and local binary motion descriptors [28]. Lin and Wang exploited a weakly-supervised audio classifier co-trained with video features of motion, blood, and explosions [29]. Giannakopoulos et al. performed audio-visual feature analysis using statistics and average motion, followed by a K-Nearest Neighbors (KNN) classifier [30]. Chen et al. suggested detection of faces and the presence of blood [31] for determining potentially violent contents in videos.

Bermejo et al. exhibited encouraging results with 90% accuracy using the MoSIFT feature descriptor and introduced two benchmark datasets, the "Hockey dataset" and the "Movies dataset", specifically designed for the violence detection task [6]. Following that, Kernel Density Estimation (KDE) was exploited for feature selection on the MoSIFT descriptor with sparse coding, reporting 94.3% accuracy on the Hockey dataset for determining aggressive human behaviors [32]. Another approach describes the fuzzy regions that emerge in image frames due to abrupt violent motion patterns, reporting 98.9% accuracy on the Movies dataset [9]. Motion blobs, another form of motion features, have been used to discriminate fight from non-fight video frames by extracting basic blob features (perimeter, area, etc.), yielding 97.8% accuracy [10].

Recently, a 3D ConvNets based model with prior knowledge was investigated on the Hockey dataset [14]. The 3D Convolutional Neural Network architecture C3D [33] has been experimented on the Hockey and Movies sequences [13]. More recently, a 2D-CNN model using Hough Forest features was proposed. This system reveals the finest accuracy results, 94.6% and 99% on the Hockey and Movies datasets respectively, compared to all previous techniques based on hand-defined feature detectors and deep representation models [13].

In short, a significant number of previous algorithms perform audio-visual cue analysis by recognizing audio cues for violent activities or by examining blood and flame visual color patterns using hand-defined features. A few deep learning based approaches have also incorporated CNN and 3D-CNN architectures. Moreover, a 2D-CNN model, taking advantage of deep network representations in combination with hand-defined Hough Forest features, constructs the finest classifier for discriminating violent human behaviors. However, deep learning models have certain limitations. They require huge computational power along with an enormous amount of domain-specific data, and developing a large labeled dataset is a laborious and time-consuming task. This shortcoming leads to a major bottleneck in training a deep learning model from scratch for a target domain. To combat this challenge, the approach of transfer learning becomes useful, in which a source network pre-trained on a huge dataset is re-trained on the target domain specific dataset [20]. This scheme eliminates the need for producing a huge dataset as well as for training a model from scratch. In this regard, the winning models of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), such as AlexNet [17], VGGNet [34], GoogleNet [18], and ResNet [19], trained on 15 million annotated images for 1000 categories, are fortunately publicly available as open-source pre-trained models. These models can be used as pre-trained networks employing transfer learning to develop domain-specific target networks, such as for the task of detecting violent human behaviors.

III. METHODOLOGY

In the machine learning domain, learning based representation techniques achieve feature learning through an iterative optimization procedure. Feature learning is very appealing because it learns complex underlying data representations, especially for a complex task such as image recognition, as compared to hand-crafted feature descriptors. The features learnt for a specific problem can be re-utilized for solving another problem in a new task, a concept known as transfer learning. This approach has been successfully used in the object classification and categorization domain [35].

A deep CNN model is inherently data driven; it requires a large labeled dataset for training, and annotated dataset preparation is a complex and demanding task. On the other hand, providing an insufficient amount of data does not allow the CNN model to learn optimal deep features; instead the network suffers from a significant overfitting issue. To solve the problem of overfitting on small datasets while utilizing modern deep learning network architectures, the approach of transfer learning comes into play: an existing network architecture with pre-trained learned features serves as the source task network and is employed to build a new target task network architecture for a limited dataset [36]. Fig. 1 shows a general representation of the source task network, with convolutional blocks followed by dense fully connected layers, pre-trained on ImageNet with 1000 output classes. The source task network is utilized for transfer learning to create a target task network, which is trained on the Hockey and Movies datasets with 2 output classes for violent/fight and non-fight activities.
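For illustration, the following minimal sketch outlines this source-to-target network construction. The paper does not name an implementation framework, so the PyTorch/torchvision calls and the helper name build_target_network below are assumptions rather than the authors' code.

```python
# Minimal sketch (assumed PyTorch/torchvision implementation): build a target
# task network from an ImageNet pre-trained source network by replacing the
# 1000-class classifier with a 2-class violent/non-violent head.
import torch.nn as nn
from torchvision import models

def build_target_network(num_classes: int = 2) -> nn.Module:
    # Source task network: GoogleNet (Inception v1) pre-trained on ImageNet.
    model = models.googlenet(pretrained=True)
    # Target task network: swap the final fully connected layer so the
    # classifier outputs fight vs. non-fight scores instead of 1000 classes.
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

target_net = build_target_network()
```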

Fig. 1. Transfer learning concept with the source task network architecture transformed into the target task network architecture. A source task network pre-trained on the source dataset is fine-tuned on the target dataset to become the target task network.

In this paper, GoogleNet [18] is selected as the pre-trained source network architecture, with features learned from the ImageNet dataset of 15 million annotated images for 1000 categories. GoogleNet, codenamed Inception, is a 22-layer deep network composed of repeated inception modules. Although the network is 22 layers deep, it has 12 times fewer parameters than AlexNet, making it an efficient deep neural network architecture for computer vision. Based on these characteristics, GoogleNet is selected for the transfer learning experiments: the last dense fully connected classification layer, which classifies the 1000 ImageNet classes, is removed and replaced with the 2 classes of the Hockey and Movies datasets to discriminate violent/fight actions from non-fight ones.

Fig. 1 shows the overall scheme of transfer learning, with the source task network architecture transformed into the target network architecture classifying the 2 output classes of the target datasets.

IV. DATASETS

The experiments are conducted on two benchmark violence activity detection datasets, Hockey and Movies [6]. The Hockey dataset is the first of its kind specifically designed for fight activity recognition. It has 1000 video clips at 360x288 resolution, sub-divided into two categories, fight and non-fight, with each category containing 500 clips. The dataset is obtained from National Hockey League (NHL) games with real-life violent events.

The Movies dataset is also made particularly for fight activity detection. It has 200 video clips covering both fight and non-fight activities. Fight scenes are extracted from different action movie clips, whereas non-fight scenes are extracted from publicly available action recognition datasets. Unlike the Hockey dataset, this dataset is a collection of a wide range of diverse scenes recorded at different resolutions on different occasions, with an average resolution of 360x250 pixels.

The Hockey dataset is challenging due to the abrupt camera motion in the recorded non-fight scenes of real hockey games. The Movies dataset has view complexities due to its diverse collection of scenes, exhibiting variations in background with different illumination conditions and occlusions. The challenging characteristics of both the Hockey and Movies datasets make them the best suitable sources for the task of violence recognition. See Fig. 2 for dataset samples.

V. EXPERIMENTS

This section describes the experimental details of the proposed learned feature representation model using transfer learning, as discussed in section III.

A. Experimental setup

The GoogleNet model learns spatial features from images as a 2D deep CNN network. This model can be trained on video datasets by converting annotated video clips into labeled image sequences. As a pre-training step, both video datasets are converted into their corresponding frames to efficiently train the deep model.

The Hockey dataset, with 1000 video clips, generated 41056 annotated images for the two activities of fight and non-fight, where each image represents an adjacent video frame. Similarly, the Movies dataset, with 200 video clips, is converted into 9841 annotated images of adjacent frames for fight and non-fight actions.

B. Parameters

The network is fine-tuned on both datasets using a batch size of 64, a constant learning rate of 0.0001, and momentum set to 0.9. Based on experiments, a 5-epoch training scheme is adopted to find optimal results on both datasets. Due to the significant size of the target datasets, the network is fine-tuned by back-propagating errors throughout the network to the earlier layers. During training, GoogleNet is fed a pipeline of images resized to 224x224 pixels for each fold. Experiments are conducted on an NVIDIA 1080 Ti GPU.
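The frame-conversion step of the experimental setup can be sketched as follows. The paper does not describe its extraction tooling, so the OpenCV-based routine, the directory layout, and the file names below are illustrative assumptions.

```python
# Sketch of the pre-training step from Section V-A (assumed OpenCV pipeline):
# convert each annotated video clip into labeled image frames on disk.
import cv2
from pathlib import Path

def clips_to_frames(clip_dir: str, out_dir: str, label: str) -> int:
    """Decode every clip under clip_dir and save its frames under <out_dir>/<label>."""
    out = Path(out_dir) / label
    out.mkdir(parents=True, exist_ok=True)
    saved = 0
    for clip in sorted(Path(clip_dir).glob("*.avi")):
        cap = cv2.VideoCapture(str(clip))
        idx = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            cv2.imwrite(str(out / f"{clip.stem}_{idx:04d}.jpg"), frame)
            idx += 1
            saved += 1
        cap.release()
    return saved

# Example (hypothetical paths): Hockey fight/non-fight clips -> labeled frame folders.
n_fight = clips_to_frames("hockey/fight", "hockey_frames", "fight")
n_non_fight = clips_to_frames("hockey/non_fight", "hockey_frames", "non_fight")
```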

Fig. 2. Hockey and Movies dataset samples. First row: the left two images are fight frames and the right two are non-fight frames from the Hockey dataset. Second row: the left two images are fight frames and the right two are non-fight frames from the Movies dataset.

Fig. 3. Hockey and Movies datasets training progress for the 5 epochs of the 1st fold. The left image shows the 1st fold training progress on the Hockey dataset; the right image reports the 1st fold training progress on the Movies dataset.

Fig. 4. Hockey and Movies datasets accuracies over 10 folds. The left image shows the accuracy for each fold of the Hockey dataset; the right image reports the accuracy for each fold of the Movies dataset.

C. Training progress

Distinct GoogleNet models are trained for each fold separately. The models are trained in sequential order, with each fold trained for 5 epochs, on both datasets.

Fig. 3 shows the training progress over the 5 epochs of the 1st fold on the Hockey and Movies datasets. The accuracy indicators show a drastic increase in accuracy, with a corresponding reduction in loss, in the very first and second training epochs. The model approaches its highest accuracy quickly in the 4th and 5th epochs, so training is stopped at this stage to avoid network overfitting.

Fig. 4 shows the accuracy achieved at the end of each fold; 10 distinct accuracy values are reported for the 10 folds. The Hockey dataset, with abrupt camera motion in its video frames, produces varying highest-accuracy values across folds, whereas the Movies dataset achieves a more consistent highest accuracy for each fold.
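The per-fold fine-tuning protocol described in sections V-B and V-C can be summarized in the following sketch, again assuming a PyTorch implementation. The fold_datasets variable and the evaluation loop are hypothetical placeholders; the optimizer settings mirror the stated parameters (batch size 64, learning rate 0.0001, momentum 0.9, 5 epochs).

```python
# Sketch of per-fold fine-tuning (assumed PyTorch implementation; `fold_datasets`
# is a hypothetical list of (train_set, test_set) pairs of labeled frames).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def finetune_fold(model: nn.Module, train_set, test_set, device="cuda") -> float:
    """Fine-tune a fresh model on one fold and return its held-out accuracy."""
    model = model.to(device)
    loader = DataLoader(train_set, batch_size=64, shuffle=True)       # batch size 64
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
    criterion = nn.CrossEntropyLoss()
    model.train()
    for epoch in range(5):                                            # 5 epochs per fold
        for images, labels in loader:                                 # images resized to 224x224
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()                                           # errors propagated to all layers
            optimizer.step()
    # Evaluate fold accuracy on the held-out split.
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for images, labels in DataLoader(test_set, batch_size=64):
            preds = model(images.to(device)).argmax(dim=1).cpu()
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return correct / total

# fold_accuracies = [finetune_fold(build_target_network(), tr, te)
#                    for tr, te in fold_datasets]
# mean_accuracy = sum(fold_accuracies) / len(fold_accuracies)
```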

TABLE I. COMPARISON OF CLASSIFICATION ACCURACY RESULTS ON THE HOCKEY AND MOVIES DATASETS

Year | Author/Method       | Features/Classifiers             | Testing Scheme   | Hockey Accuracy (%) | Movies Accuracy (%)
2013 | Bermejo et al. [6]  | STIP (HOG) + HIK                 | 5-Fold CV        | 91.7                | 49.0
     |                     | STIP (HOF) + HIK                 | 5-Fold CV        | 88.6                | 59.0
     |                     | MoSIFT + HIK                     | 5-Fold CV        | 90.9                | 89.5
2014 | Deniz et al. [9]    | SVM                              | 10-Fold CV       | 90.1±0              | 85.4±9.3
     |                     | Adaboost                         | 10-Fold CV       | 90.1±0              | 98.9±0.2
2014 | Ding et al. [14]    | 3D-CNN                           | Train/Test split | 91                  | -
2015 | ViF [10]            | SVM                              | 10-Fold CV       | 82.3±0.2            | 96.7±0.3
     |                     | Adaboost                         | 10-Fold CV       | 82.2±0.4            | 92.8±0.4
     |                     | Random Forests                   | 10-Fold CV       | 82.4±0.6            | 88.9±1.2
2015 | LMP [10]            | SVM                              | 10-Fold CV       | 75.9±0.3            | 84.4±0.8
     |                     | Adaboost                         | 10-Fold CV       | 76.5±0.9            | 81.5±2.1
     |                     | Random Forests                   | 10-Fold CV       | 77.7±0.6            | 92.0±0.1
2015 | Serrano [10]        | SVM                              | 10-Fold CV       | 72.5±0.5            | 87.2±0.7
     |                     | Adaboost                         | 10-Fold CV       | 71.7±0.3            | 81.7±0.2
     |                     | Random Forests                   | 10-Fold CV       | 82.4±0.6            | 97.8±0.4
2018 | Serrano et al. [13] | C3D [33]                         | 10-Fold CV       | 87.4±1.2            | 93.6±0.8
     |                     | 2D-CNN                           | 10-Fold CV       | 87.8±0.3            | 93.1±0.3
     |                     | 2D-CNN + Hough Forest            | 10-Fold CV       | 94.6±0.6            | 99.0±0.5
-    | Proposed Method     | Deep CNN using Transfer Learning | 10-Fold CV       | 99.28               | 99.97

VI. RESULTS

The proposed method is evaluated on the two benchmark violent activity recognition datasets, i.e., the Hockey and Movies datasets [6]. To perform a comprehensive comparison, a wide range of existing algorithms with benchmark accuracy results is taken from the literature, covering both the domain of hand-defined feature detectors and that of deep representation based models, as discussed in section II.

In the hand-defined domain, the method proposed by Bermejo et al. achieved 90% as the benchmark accuracy with the introduction of the Hockey and Movies datasets [6]. Following that, the technique suggested by Deniz et al. using SVM and Adaboost reported 98.9% accuracy on the Movies dataset [9]. Later on, the Violent Flows (ViF) and LMP methods using SVM, Adaboost, and Random Forests classifiers were reported [10].

Moreover, in the deep learning domain, Ding et al. implemented a 3D-CNN model with a train/test split scheme [14]. More recently, Serrano et al. evaluated C3D and 2D-CNN models and further proposed the finest approach, incorporating a 2D-CNN with Hough Forest features. This approach elevated the accuracies to 94.6±0.6% and 99±0.5% for the Hockey and Movies datasets respectively, setting the accuracy bar to the next level [13].

The proposed approach of transfer learning using the GoogleNet deep model with already learnt features is compared against these established techniques to assess the performance of the suggested methodology. See Table I; results are formulated as the mean accuracy over the 10-fold cross validation scheme described in the training progress part of section V.

Finally, the proposed approach outperforms state-of-the-art accuracies on both datasets. Results show the highest accuracies of 99.28% and 99.97% on the Hockey and Movies datasets respectively. The proposed strategy specifically improved the Hockey dataset accuracy by learning generalized deep features for abrupt camera motion sequences, as compared to the benchmark techniques. Similarly, the model is able to distinguish violent actions in a wide variety of movie clips with dynamic scenes on the Movies dataset.

Foremost, the proposed model achieved superior accuracy results in just 5 training epochs on both datasets, eventually reducing the effort required to train a model on a target dataset.

VII. CONCLUSION

Violent action detection, such as fight scene recognition, has attracted computer vision researchers during the last few years, because detection of aggressive human behaviors is a preliminary requirement for developing automated video surveillance systems. Historically, violent action recognition tasks have usually been achieved through hand-crafted feature detectors, though some approaches have also proposed deep learning based models to detect aggressive human behaviors. Although deep representation based transfer learning approaches have been used for human action recognition, such as walking, jogging, running, and hand waving, there is a scarcity of transfer learning based deep models for violent sequence detection. For training the model, the Hockey and Movies datasets are the first of their kind specifically designed for violent/fight action recognition, as compared to other available human action detection datasets.

In this research, a learned representation based deep CNN model is proposed to identify aggressive behaviors in videos. Since training a deep network from scratch ends up facing network overfitting issues, an alternative transfer learning training strategy is adopted. GoogleNet, a very deep network architecture, is adopted as the source task network, pre-trained on the ImageNet dataset with 15 million annotated images for 1000 categories. By incorporating the concept of transfer learning, the source network is utilized to develop the target task network, which is then fine-tuned on the Hockey and Movies datasets with the discussed optimal parameters. The proposed model is trained using a 10-fold cross validation scheme, by developing a dataset image pipeline with image resizing as input to the fine-tuning network for each distinct fold. Distinct models are trained for 5 epochs per fold, producing mean accuracy results across the 10 folds on both datasets.

Results show that the proposed model outperforms the top ranked approaches by learning the finest features on these challenging datasets. The model achieved 99.28% and 99.97% accuracy on the Hockey and Movies datasets respectively, in just 5 training epochs. The proposed method is able to learn the most discriminating features for abrupt camera motion and dynamic scene sequences, for the task of violent action detection in videos.

ACKNOWLEDGMENT

This research is supported by the PDE-GIR project which has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 778035.

REFERENCES

[1] C. Schuldt, I. Laptev, and B. Caputo, "Recognizing human actions: A local SVM approach," in Proceedings - International Conference on Pattern Recognition, 2004.
[2] L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri, "Actions as space-time shapes," IEEE Trans. Pattern Anal. Mach. Intell., 2007.
[3] M. Marszałek, I. Laptev, and C. Schmid, "Actions in context," in 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2009, 2009.
[4] R. Poppe, "A survey on vision-based human action recognition," Image Vis. Comput., vol. 28, no. 6, pp. 976–990, 2010.
[5] S.-R. Ke, H. Thuc, Y.-J. Lee, J.-N. Hwang, J.-H. Yoo, and K.-H. Choi, "A Review on Video-Based Human Activity Recognition," Computers, 2013.
[6] E. Bermejo, O. Deniz, G. Bueno, and R. Sukthankar, "Violence Detection in Video Using Computer Vision Techniques," in CAIP'11 Proc. 14th Int. Conf. Computer Analysis of Images and Patterns - Vol. Part II, 2011.
[7] A. Sargano, P. Angelov, and Z. Habib, "A Comprehensive Review on Handcrafted and Learning-Based Action Representation Approaches for Human Activity Recognition," Appl. Sci., vol. 7, no. 1, p. 110, 2017.
[8] D. Wu, N. Sharma, and M. Blumenstein, "Recent Advances in Video-Based Human Action Recognition using Deep Learning: A Review," pp. 2865–2872, 2017.
[9] O. Deniz, I. Serrano, G. Bueno, and T.-K. Kim, "Fast violence detection in video," in Computer Vision Theory and Applications (VISAPP), 2014 International Conference on, 2014, vol. 2, pp. 478–485.
[10] I. S. Gracia, O. D. Suarez, G. B. Garcia, and T. K. Kim, "Fast fight detection," PLoS One, 2015.
[11] J. Nam, M. Alghoniemy, and A. H. Tewfik, "Audio-visual content-based violent scene characterization," in Image Processing, 1998. ICIP 98. Proceedings. 1998 International Conference on, 1998, vol. 1, pp. 353–357.
[12] W.-H. Cheng, W.-T. Chu, and J.-L. Wu, "Semantic context detection based on hierarchical audio models," in Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, 2003, pp. 109–115.
[13] I. Serrano, O. Deniz, J. L. Espinosa-Aranda, and G. Bueno, "Fight Recognition in Video Using Hough Forests and 2D Convolutional Neural Network," IEEE Trans. Image Process., 2018.
[14] C. Ding, S. Fan, M. Zhu, W. Feng, and B. Jia, "Violence Detection in Video by Using 3D Convolutional Neural Networks," in International Symposium on Visual Computing, Springer, Cham, 2014, pp. 551–558.
[15] P. Zhou, Q. Ding, H. Luo, X. Hou, B. Jin, and P. Maass, "Violent Interaction Detection in Video Based on Deep Learning," in Journal of Physics: Conference Series, 2017, vol. 844, no. 1, p. 012044.
[16] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[17] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," Adv. Neural Inf. Process. Syst., pp. 1–9, 2012.
[18] C. Szegedy et al., "Going Deeper with Convolutions," pp. 1–9, 2014.
[19] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[20] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," Lect. Notes Comput. Sci., vol. 8689 LNCS, no. PART 1, pp. 818–833, 2014.
[21] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, "Large-scale video classification with convolutional neural networks," in Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pp. 1725–1732, 2014.
[22] A. B. Sargano, X. Wang, P. Angelov, and Z. Habib, "Human action recognition using transfer learning with deep representations," in 2017 International Joint Conference on Neural Networks, pp. 463–469, 2017.
[23] J. Donahue et al., "DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition," 2013.
[24] C. Clarin, J. Dionisio, M. Echavez, and P. Naval, "DOVE: Detection of movie violence using motion intensity analysis on skin and blood," PCSC, vol. 6, pp. 150–156, 2005.
[25] W. Zajdel, J. D. Krijnders, T. Andringa, and D. M. Gavrila, "CASSANDRA: Audio-video sensor fusion for aggression detection," in 2007 IEEE Conference on Advanced Video and Signal Based Surveillance, AVSS 2007 Proceedings, 2007, pp. 200–205.
[26] Y. Gong, W. Wang, S. Jiang, Q. Huang, and W. Gao, "Detecting violent scenes in movies by auditory and visual cues," in Pacific-Rim Conference on Multimedia, 2008, pp. 317–326.
[27] T. Giannakopoulos, D. Kosmopoulos, A. Aristidou, and S. Theodoridis, "Violence content classification using audio features," in Hellenic Conference on Artificial Intelligence, 2006, pp. 502–507.
[28] D. Chen, H. Wactlar, M. Chen, C. Gao, A. Bharucha, and A. Hauptmann, "Recognition of aggressive human behavior using binary local motion descriptors," in Engineering in Medicine and Biology Society, 2008. EMBS 2008. 30th Annual International Conference of the IEEE, 2008, pp. 5238–5241.
[29] J. Lin and W. Wang, "Weakly-supervised violence detection in movies with audio and video based co-training," in Pacific-Rim Conference on Multimedia, 2009, pp. 930–935.
[30] T. Giannakopoulos, A. Makris, D. Kosmopoulos, S. Perantonis, and S. Theodoridis, "Audio-visual fusion for detecting violent scenes in videos," in Hellenic Conference on Artificial Intelligence, 2010, vol. 6040, pp. 91–100.
[31] L.-H. Chen, C.-W. Su, and H.-W. Hsu, "Violent scene detection in movies," Int. J. Pattern Recognit. Artif. Intell., vol. 25, no. 08, pp. 1161–1172, 2011.
[32] L. Xu, C. Gong, J. Yang, Q. Wu, and L. Yao, "Violent video detection based on MoSIFT feature and sparse coding," in ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, 2014, pp. 3538–3542.
[33] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3D convolutional networks," in Proceedings of the IEEE International Conference on Computer Vision, 2015.
[34] K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," pp. 1–14, 2014.
[35] Y. Aytar, "Transfer learning for object category detection," University of Oxford, 2014.
[36] Y.-C. Su, T.-H. Chiu, C.-Y. Yeh, H.-F. Huang, and W. H. Hsu, "Transfer Learning for Video Recognition with Scarce Training Data for Deep Convolutional Neural Network," arXiv preprint arXiv:1409.4127, 2014.
