2017 Allah Bux PHD
December 2017
Declaration of Authorship
I, Allah Bux, declare that the thesis titled, “Vision-based Human Action Recogni-
tion using Machine Learning Techniques” and the work presented in it are my own. I
confirm that:
• Where I have consulted the published work of others, this is always clearly attributed.
• Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.
• Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.
Detailed breakdown of the publications is presented in the first chapter of the thesis.
Signed:
Allah Bux
December 13, 2017
Abstract
The focus of this thesis is on automatic recognition of human actions in videos. Human action recognition aims to automatically identify the action performed by a human. This is a difficult problem due to the many challenges including, but not limited to, variations in human shape and motion, occlusion, cluttered backgrounds, stationary or moving cameras, changing illumination conditions, and viewpoint variations.
To start with, the most popular and prominent state-of-the-art techniques are reviewed,
evaluated, compared, and presented. Based on the literature review, these techniques are
categorized into handcrafted feature-based and deep learning-based approaches. The proposed action recognition framework builds on both families of techniques, which are adopted throughout the thesis by embedding novel algorithms for action recognition in both the handcrafted and deep learning domains.
First, a new method based on the handcrafted approach is presented. This method introduces a novel feature descriptor for multiview human action recognition. The descriptor employs region-based features extracted from the human silhouette. The proposed approach is quite simple and achieves state-of-the-art results without compromising the efficiency of the recognition process, which shows its suitability for real-time applications.
Second, two innovative methods based on the deep learning approach are presented, to go beyond the limitations of the handcrafted approach. The first method is based on transfer learning, using a pre-trained deep learning model as a source architecture; the features learned from a large annotated dataset are transferred to an action recognition task with a limited training dataset. The comparative analysis confirms that this method outperforms handcrafted feature-based methods. The second method employs Deep Belief Networks (DBNs) with restricted Boltzmann machines for action recognition in unconstrained videos. The proposed method automatically extracts a suitable feature representation, without any prior knowledge, using an unsupervised deep learning model. The effectiveness of the proposed method is confirmed by high recognition accuracy on challenging datasets.
Finally, the thesis is concluded with important discussions and research directions in the field.
This thesis is dedicated to the loving memories of my Mom
with eternal love and appreciation.
Contents
Declaration of Authorship i
Abstract ii
Acknowledgements iv
List of Figures ix
List of Tables xi
Abbreviations xii
1 Introduction 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Aims and objectives of the thesis . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Overview of the thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Validation methodology and software/hardware tools . . . . . . . . . . . 9
1.5 Contributions of the thesis . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.6 Publications related to the thesis . . . . . . . . . . . . . . . . . . . . . . 12
2 Literature Review 14
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 Handcrafted Representation-Based Approach . . . . . . . . . . . . . . . 22
2.2.1 Space-Time-Based Approaches . . . . . . . . . . . . . . . . . . 23
2.2.1.1 Space-Time Volumes (STVs) . . . . . . . . . . . . . . 25
2.2.1.2 Space-Time Trajectory . . . . . . . . . . . . . . . . . 25
2.2.1.3 Space-Time Features . . . . . . . . . . . . . . . . . . 28
2.2.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.2.3 Appearance-Based Approach . . . . . . . . . . . . . . . . . . . . 31
2.2.3.1 Shape-Based Approach . . . . . . . . . . . . . . . . . 31
2.2.3.2 Motion-Based Approach . . . . . . . . . . . . . . . . 32
Bibliography 127
List of Figures
4.1 Overview of the proposed system, first row indicates the source architec-
ture and second row shows the target architecture . . . . . . . . . . . . . 92
4.2 Feature extraction and hybrid classification model . . . . . . . . . . . . . 93
4.3 Sample frames for each action from four scenarios in KTH dataset . . . . 94
4.4 Confusion matrix of KTH dataset with 6 human actions . . . . . . . . . . 95
4.5 Sample frames for each action from UCF sports dataset . . . . . . . . . . 95
4.6 Confusion matrix of UCF sports action dataset . . . . . . . . . . . . . . . 97
5.1 Restricted Boltzmann Machine model with visible layer (V) and hidden
layer (H). The connections exist between the layers but not within the lay-
ers. The visible layer represents the input given to the model and hidden
layer is used to learn the features from the input data. . . . . . . . . . . . 104
5.2 Diagram of DBN with visible layer V and four hidden layers (H1-H4).
The blue arrows indicate the generative model, while red arrows indicate
the direction of recognition. . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.3 The DBN generative model with four hidden layers (H1-H4), used for
generating images with their class labels. . . . . . . . . . . . . . . . . . . 109
5.4 The discriminative DBN model for classification with a visible layer v, 3
hidden layers (H1-H3), and an output layer. . . . . . . . . . . . . . . . . 111
5.5 The proposed DBN model having visible layer v with 100x100 dimen-
sion input, 3 hidden layers (H1-H3), and an output layer with 10 nodes
representing 10 action classes. . . . . . . . . . . . . . . . . . . . . . . . 114
5.6 Sample frames for each action from UCF sports dataset . . . . . . . . . . 115
List of Tables
Abbreviations
1 Introduction
IN THIS CHAPTER
The topic of the PhD thesis and its motivation are introduced in Section 1.1. The aims and objectives of the thesis are presented in Section 1.2, and a brief overview is given in Section 1.3. The validation methodology and the software and hardware tools used are described in Section 1.4, and the major contributions of the thesis are presented in Section 1.5.
Finally, the chapter is concluded with the list of publications related to the thesis in
Section 1.6.
1.1 Introduction
Vision-based Human Activity Recognition (HAR) is an important research area in the
field of computer vision and machine learning. The purpose of HAR is to automatically understand what kind of action is performed in a video. This is a difficult problem due to the many challenges involved in HAR. These challenges include occlusion, variations in human shape and motion, cluttered backgrounds, stationary or moving cameras, different illumination conditions, and viewpoint variations. However, the intensity
of these challenges may vary depending on the category of an activity under considera-
tion. Generally, the activities fall into four categories, i.e., gestures, actions, interactions,
and group activities. This division is mainly based on the complexity and duration of the
activities [3].
Gesture: A gesture is defined as a basic movement of the human body parts that carry
some meaning. Head shaking, hand waving, and facial expressions are some good exam-
ples of gestures. Usually, a gesture takes a very short amount of time and its complexity is relatively low.
Action: An action is an activity performed by a single person, which may be composed of multiple gestures, such as walking, running, and jumping.
Interaction: An interaction is an activity involving two actors, where at least one is a human and the other one may be a human or an object. Thus, it could be a human-human or a human-object interaction. Hand shaking and hugging are examples of a human-human interaction, while a person using an ATM, a person using a computer, and a person stealing a bag are examples of a human-object interaction.
Group Activity: This is the most complex type of activity. Certainly, it can be a
combination of gestures, actions, and interactions. It involves more than two humans and
a single or multiple objects. A group of people protesting, two teams playing a game, and a group holding a meeting are examples of group activities.
As the title suggests, this thesis considers "gestures" and "actions" for recognition. Therefore, the recognition of "interactions" and "group activities" is not covered within the scope of this thesis.
In recent years, HAR has drawn much attention of researchers around the globe due to
its important applications in the real-world scenarios. This is a great motivating factor for
the author of the thesis to select it as a PhD topic. Some of the major applications of HAR are outlined below.
Intelligent Video Surveillance: Traditional video surveillance systems are based on networks of cameras and require laborious human monitoring for video content analysis. On the other
hand, intelligent video surveillance systems are aimed at automatically tracking the in-
dividuals or a crowd and recognizing their activities [4]. This includes the detection of
suspicious or criminal activities and reporting to the authorities for immediate action [5].
In this way, the workload of the security personnel can be reduced and alerts can be sig-
naled for security events, which can be helpful in preventing dangerous situations.
Ambient Assisted Living (AAL): The AAL is one of the important applications of
HAR-based systems. These systems are used in health care to understand and analyze
the patient’s activities, to facilitate health workers in treatment, diagnosis, and general
health care of the patients. In addition to this, these systems are also used for monitoring
the daily life activities of elderly people, to provide them a safe, independent and com-
fortable stay. Usually, these systems capture the continuous movement of elderly people,
automatically recognize their activities and detect any abnormality as it happens, such
as falling down, having a stroke or respiration issues. Among these abnormal activities,
"falls" are a major cause of fatal injury, especially for elderly people, and are considered a serious threat to their independent living.
Human-Computer Interaction (HCI): Instead of the traditional interfaces such as the mouse and keyboard, it is desirable to have more natural interfaces between
the computer and human operator by understanding the human gesture. An example of
such a type of interfacing is controlling the presentation of slides using hand movement
[7].
Human-Robot Interaction (HRI): Another important application area is robot-based activity recognition. Giving a robot the ability to recognize human activities is very important for HRI. This makes robots useful for industrial setups as well as domestic environments. One of the applications of HRI can be seen in humanoid robots that can recognize human emotions from a sequence of images [8]. In addition to this, the actor (robot) wearing
the camera may be involved in an ongoing activity. This requires not only recognizing an activity in real time but also recognizing the activity before its completion [9].
Entertainment: Human activity recognition systems are used for recognition of enter-
tainment activities such as dance [10] and sports [11]. The modelling of a player's actions in a game has received much attention from the sports community in recent years due to its important uses, such as adapting to changes in the game as they occur [12].
Intelligent Driving: Human activity recognition techniques are also employed to as-
sist drivers by providing different cues regarding the state of the driver while driving a
vehicle. It has been reported that secondary tasks performed by drivers, such as answering the phone, sending or receiving text messages, and eating or drinking while operating a vehicle, significantly increase the risk of accidents.
HAR approaches can be categorized according to their design methodology and data collection process. These approaches fall into vision-based,
non-visual sensor-based, and multi-modal categories. Indeed, HAR systems are complex
in nature, and comprised of different sub-systems. The schematic diagram of a typical ac-
tivity recognition system including its subsystems is presented in Figure 1.1. This thesis is focused on the vision-based approach.
and deep learning approaches, in order to decide which one works best;
which is considered as one of the major challenges for recognition of human actions
• O6. comparison between the supervised and unsupervised deep learning models on
the same dataset;
• O7. production of better results than the existing ones in terms of accuracy and computational efficiency.

1.3 Overview of the thesis

This section outlines the structure of the thesis. Chapters 2 to 4 are based on already published work in international journals and conferences, while the work presented in Chapter 5 is currently under review for journal publication. Each of these chapters starts with a brief introduction explaining the context of the chapter and its contribution. Chapter 6 concludes the thesis and presents possible extensions of the proposed methods as future research directions. A brief overview of each chapter is
presented below:
Chapter 2 presents a comprehensive review of the state-of-the-art handcrafted and learning-based HAR techniques. Well-known public datasets available for experimentation are also presented to provide further insight into the field. Finally, the chapter is concluded with important discussions and research directions.
Chapter 3 presents a novel feature descriptor for multiview human action recognition. This descriptor employs the region-based features extracted from human silhouettes.
To achieve this, a human silhouette is divided into regions in a radial fashion with the in-
terval of a certain degree, and then region-based geometrical and Hu-moments features
are obtained from each radial bin to articulate the feature descriptor. A multiclass support vector machine classifier is used for action classification. The proposed approach is quite simple and achieves state-of-the-art results without compromising the efficiency of the recognition process. The contribution is two-fold. Firstly, this method achieves high recognition accuracy compared to state-of-the-art methods. Secondly, the average testing speed of the proposed method is 34 frames per second, which is much higher than that of existing methods and shows its suitability for real-time applications. The experimental evaluation on the multiview IXMAS (INRIA Xmas Motion Acquisition Sequences) dataset confirmed the superior performance of the proposed method as compared to existing methods.
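To make the idea of the radial, region-based silhouette descriptor more concrete, the sketch below computes per-bin Hu-moment features from a binary silhouette using OpenCV and NumPy. The number of bins, the log-scaling of the moments, and the helper name are illustrative assumptions, not the exact design used in Chapter 3.

import cv2
import numpy as np

def radial_hu_descriptor(silhouette, n_bins=12):
    """Illustrative radial-bin Hu-moment descriptor for a binary silhouette.
    The silhouette is split into angular bins around its centroid; Hu moments
    are computed per bin and concatenated."""
    ys, xs = np.nonzero(silhouette)
    if len(xs) == 0:
        return np.zeros(n_bins * 7)
    cx, cy = xs.mean(), ys.mean()
    angles = np.arctan2(ys - cy, xs - cx)                 # angle of each foreground pixel
    bin_idx = ((angles + np.pi) / (2 * np.pi) * n_bins).astype(int) % n_bins
    feats = []
    for b in range(n_bins):
        mask = np.zeros_like(silhouette, dtype=np.uint8)
        mask[ys[bin_idx == b], xs[bin_idx == b]] = 255
        hu = cv2.HuMoments(cv2.moments(mask, binaryImage=True)).flatten()
        # log-scale the Hu moments so their ranges become comparable
        hu = -np.sign(hu) * np.log10(np.abs(hu) + 1e-12)
        feats.append(hu)
    return np.concatenate(feats)                          # n_bins * 7 dimensional descriptor

The resulting per-frame vectors would then be fed to a multiclass SVM, in the spirit of the approach described above.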
Chapter 4 presents an innovative method for human action recognition using a pre-
trained deep Convolutional Neural Networks (CNNs) model as a source architecture for
extracting features from the target dataset. Then, a hybrid Support Vector Machine and Nearest Neighbour classifier is applied to these features for action recognition. It is experimentally confirmed that the features learned from a large annotated dataset are transferable to an action recognition task with a limited training dataset. The comparative analysis confirmed that the proposed method outperforms the handcrafted feature-based methods.
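The transfer-learning pipeline described here can be sketched as follows, using an ImageNet pre-trained CNN from torchvision as a fixed feature extractor with an SVM on top. The ResNet-18 backbone, the input size, and the plain SVM are stand-ins chosen for illustration; they are assumptions rather than the exact source architecture and hybrid classifier of Chapter 4.

import torch
import torchvision.models as models
import torchvision.transforms as T
from sklearn.svm import SVC

# Load an ImageNet pre-trained CNN and drop its classification head so it acts
# as a fixed feature extractor (the choice of backbone is an assumption).
backbone = models.resnet18(pretrained=True)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_features(frames):
    """frames: list of HxWx3 uint8 video frames -> one feature vector per frame."""
    with torch.no_grad():
        batch = torch.stack([preprocess(f) for f in frames])
        return backbone(batch).numpy()

def train_action_classifier(train_frames, train_labels):
    # A plain linear SVM stands in here for the hybrid classifier described in the text.
    X = extract_features(train_frames)
    return SVC(kernel="linear").fit(X, train_labels)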
Chapter 5 presents an unsupervised Deep Belief Networks (DBNs) based method
for recognizing human activities in unconstrained videos. This method automatically ex-
tracts suitable feature representation without any prior knowledge using an unsupervised
deep learning model. In this work, the contribution is twofold. Firstly, the proposed method achieves high recognition accuracy compared to state-of-the-art deep learning methods, which confirms the potential of the less explored unsupervised DBNs architecture for modelling complex human activities. Secondly, the proposed model uses an automatic structure learning method for learning important parameters. Moreover, although the model has been designed with human activities in mind, it can also be adapted for other visual recognition tasks. The effectiveness of the proposed method is confirmed by high recognition results on challenging datasets.
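For context, the following minimal sketch shows how a restricted Boltzmann machine is trained with one-step contrastive divergence (CD-1) and how several RBMs are stacked, layer by layer, to pre-train a DBN. The learning rate, layer sizes, and use of CD-1 are illustrative choices and do not reflect the automatic structure learning scheme proposed in Chapter 5.

import numpy as np

class RBM:
    """Minimal binary RBM trained with one-step contrastive divergence (CD-1)."""
    def __init__(self, n_visible, n_hidden, lr=0.05, seed=0):
        rng = np.random.default_rng(seed)
        self.W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)
        self.b_h = np.zeros(n_hidden)
        self.lr, self.rng = lr, rng

    def _sigmoid(self, x):
        return 1.0 / (1.0 + np.exp(-x))

    def hidden_probs(self, v):
        return self._sigmoid(v @ self.W + self.b_h)

    def visible_probs(self, h):
        return self._sigmoid(h @ self.W.T + self.b_v)

    def cd1_update(self, v0):
        h0 = self.hidden_probs(v0)
        h0_sample = (self.rng.random(h0.shape) < h0).astype(float)
        v1 = self.visible_probs(h0_sample)            # reconstruction (negative phase)
        h1 = self.hidden_probs(v1)
        self.W += self.lr * (v0.T @ h0 - v1.T @ h1) / len(v0)
        self.b_v += self.lr * (v0 - v1).mean(axis=0)
        self.b_h += self.lr * (h0 - h1).mean(axis=0)

def pretrain_dbn(data, layer_sizes, epochs=10):
    """Greedy layer-wise pre-training: each RBM is trained on the hidden
    activations of the previous one, forming the stacked DBN."""
    rbms, x = [], data
    for n_hidden in layer_sizes:
        rbm = RBM(x.shape[1], n_hidden)
        for _ in range(epochs):
            rbm.cd1_update(x)
        rbms.append(rbm)
        x = rbm.hidden_probs(x)
    return rbms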
Chapter 6 concludes the thesis, and presents the comparative analysis and key find-
ings regarding all proposed methods. In addition to this, possible extensions of the pro-
posed methods both in short-term and long-term perspectives are presented in this chapter.
1.4 Validation methodology and software/hardware tools

In this research work, the experimental methodology is used to evaluate and validate the performance of the proposed methods. In this regard, experiments have been performed on well-known public benchmark datasets, including simple and complex datasets. The simple datasets include KTH [15], in which the videos were recorded with a static camera and background, while the complex datasets include UCF sports [16, 17], collected from a variety of sports broadcast on television channels such as BBC and ESPN. In the UCF sports dataset, actions were recorded in real sport environments.
For experimentation, MATLAB in its different versions (R2015, R2016, and R2017) was used as the base software for implementation of the proposed algorithms. The hardware used for experimentation includes an Intel Core i7-4770 CPU with 8GB RAM, and an Intel Xeon E5-2630 processor with 64GB RAM and an NVIDIA Quadro K2200 4GB GPU.
1.5 Contributions of the thesis

The major contributions of the thesis are as follows:

1. A comprehensive review of state-of-the-art handcrafted and learning-based HAR techniques. In addition to this, well-known public datasets and important applications of HAR are also presented to provide further insight into the field. This review covers all these aspects of HAR with comprehensive coverage of each part. This contribution is presented in Chapter 2, and has also been published.
2. A method for human action recognition from multiple views based on a view-invariant feature descriptor.
This method is based on a novel view-invariant feature descriptor, and can recognize actions from five different views, including the front, back, left, right, and top views. The proposed approach is efficient and achieves high recognition accuracy, which makes it suitable for real-time applications. This contribution is presented in Chapter 3, and has also been published [20].
3. A method for applying a hybrid classifier and transfer learning to action recog-
nition.
Deep learning models have achieved remarkable success in image classification and object recognition, where a huge amount of labeled data is available. However, these models do not generalize well when the size of the dataset is small. In this direction, the proposed method uses a pre-trained deep CNN model as a source architecture for extracting features from the target dataset, i.e., the HAR dataset. It is experimentally confirmed that we can benefit from pre-trained deep learning models rather than training a deep model from scratch in the case of a small dataset. Moreover, it is also confirmed that the hybrid classifier improves the recognition performance. This contribution is presented in Chapter 4, and has also been published [21].
4. A method for human action recognition using DBNs with automatic structure
learning.
In this contribution, an unsupervised Deep Belief Networks (DBNs) based method with automatic structure learning is proposed for HAR. The contribution of this method is twofold. Firstly, the proposed method achieved higher accuracy than similar state-of-the-art methods, which confirms the potential of the less explored unsupervised DBNs architecture for modelling complex human activities. Secondly, the proposed method uses an automatic structure learning approach, so that important parameters are learned automatically rather than being adjusted in the fine-tuning stage only. Moreover, although the model was designed with human activities in mind, it can also be adapted for other visual recognition tasks.
1.6 Publications related to the thesis

The following publications have resulted from the work presented in this thesis; the work in Chapter 5 is currently under review for journal publication.
Journal Papers
1. Allah Bux Sargano, Plamen Angelov, and Zulfiqar Habib. Human Action Recognition from Multiple Views Based on View-Invariant Feature Descriptor Using Support Vector Machines.
2. Allah Bux Sargano, Plamen Angelov, and Zulfiqar Habib. A Comprehensive Review on Handcrafted and Learning-Based Action Representation Approaches for Human Activity Recognition.
3. Allah Bux Sargano, Plamen Angelov, and Zulfiqar Habib. Human Action Recog-
Conference Papers
1. Allah Bux Sargano, Plamen Angelov, and Zulfiqar Habib. Vision based human
2. Allah Bux Sargano, Xiafeng Wang, Plamen Angelov, and Zulfiqar Habib. Human
2 Literature Review
IN THIS CHAPTER
A comprehensive review of handcrafted and learning-based HAR techniques is presented in this chapter, with a detailed introduction of these approaches given in Section 2.1. Detailed comparisons, evaluations, and important discussions of handcrafted techniques are given in Section 2.2. Learning-based approaches
such as dictionary learning and deep learning based techniques are described, com-
pared, and evaluated in Section 2.3. Important public datasets available for experi-
mentations and evaluation of HAR techniques are presented in Section 2.4. Finally,
the chapter is concluded in Section 2.5.
2.1 Introduction
Human Activity Recognition (HAR) plays an important role in many real-world appli-
cations. Its goal is to recognize the activities of a person or a group of persons from the
sensors and/or video data, including the knowledge of the context in which these activities
take place. Due to the advancement in sensor and visual technology, HAR based systems
have been widely used in many real-world applications. Specifically, the proliferation of
small-sized sensors has enabled smart devices to recognize human activities in a
context-aware manner [22]. Based on the design methodology and data collection pro-
cess, these approaches are categorized into visual sensor-based, non-visual sensor-based,
and multi-modal categories. The major difference between the visual and other types of
sensors is the way of perceiving the data. Visual sensors provide the data in the form
of 2D or 3D images or videos, whereas other sensors provide the data in the form of one-dimensional signals.
In recent years, the wearable devices have been featured with many small non-visual
sensors, which have enabled the development of pervasive applications. The wearable
devices such as smart-phones, smart-watches, and fitness wristbands are worn all day
long by many people. These devices have computing power, communication capability,
and are available at low cost, which make them suitable for HAR [24]. Currently, vari-
ous techniques have been proposed for sensor-based human activity recognition in daily life settings.
On the other hand, the visual sensor-based approach is one of the most popular HAR approaches in the computer vision and machine learning research community. This approach
has been employed in a wide range of application domains. In particular, the past decade
has witnessed enormous growth in its applications. The major applications of vision-
based HAR include: Human Computer Interaction (HCI), intelligent video surveillance, ambient assisted living, entertainment, intelligent driving, and content-based video search. In HCI, the activity recognition systems observe the task carried out by the user
and guide him/her to complete it by providing feedback. In video surveillance, the activ-
ity recognition system can automatically detect a suspicious activity and report it to the
authorities for immediate action. Similarly, in entertainment, these systems can recognize activities such as dance moves and sports actions.
Multi-modal HAR approach has also become popular during the last decade. In this
approach, visual and non-visual sensors are used at the same time to recognize the human
activities. This approach is specifically useful in the situations where one type of sensor is
not enough to meet the user requirements. For example, a visual sensor, such as a camera, can cover the subject and the context in which the activity takes place, but it may not be enough to analyze sensitive information such as temperature, user heart rate, and humidity in the surrounding environment.
However, non-visual sensors in general and wearable sensors in specific have several
limitations. Most of the wearable sensors need to be worn and run continuously, which raises a number of practical and technical issues. The major practical issues are acceptability and willingness to use wearable sensors, while the technical issues include battery life, ease of use, size, and effective-
ness of the sensor [26]. In addition to this, in some application domains such as video
surveillance where continuous monitoring of the people is required for suspicious activ-
ities, the non-visual sensor-based approach might not be effective. Therefore, the ultimate solution is a vision-based approach, which can be applied to most of the application domains. This is the rationale for the proposed work in this thesis. Based on their complexity and duration, vision-based activities fall into four categories, i.e., gestures, actions, interactions, and group activities.
Since the 1980s, researchers have been working on HAR from images and videos. One
of the important directions that researchers have been following for action recognition
is similar to the working of the human vision system. At a low level, the human vision
system can receive the series of observations regarding the movement and shape of the
human body in a short span of time. Then, these observations are passed to the interme-
diate human perception system for further recognition of the class of these observations,
such as walking, jogging, and running. In fact, the human vision and perception system has inspired this line of work, and researchers have invested much effort in it during the past few decades. However, unfortunately, due to many challenges such as illumination variations, viewpoint variations, occlusions, and the non-rigid shape of humans and objects, we are still far from the level of the human vision system. What we have achieved
so far may be a fraction of what a mature human vision system can do.
Based on the type of feature representation, human activity recognition approaches can be divided into two major categories. (1) The traditional handcrafted representation-based approach, which relies on expert-designed feature detectors and descriptors such as Histogram of Oriented Gradients (HOG), Speeded-Up Robust Features (SURF), and Local Binary Patterns (LBPs). This is followed by a generic trainable classifier for action recognition, as shown in Figure 2.2. (2) The learning-based representation
approach, which is a recently emerged approach with capability of learning features au-
tomatically from the raw data. This eliminates the need of handcrafted feature detectors
and descriptors required for action representation. Unlike the traditional handcrafted ap-
proach, it uses the concept of a trainable feature extractor followed by a trainable classifier.
The handcrafted representation-based approach has long been the dominant strategy for HAR. Generally, it consists of three major phases (foreground detection, handcrafted feature extraction and representation, and classification), as shown in Figure 2.4.
A good number of survey papers have been published on different phases of handcrafted representation-based HAR, and several of them discuss the HAR approaches at a general level. One survey, presented in the work of Aggarwal and Ryoo
[3], divides the activity recognition approaches into two major categories: single-layered and hierarchical approaches. Single-layered approaches recognize simple activities directly from a video sequence, while hierarchical approaches recognize more
complex activities by decomposing them into simple activities (sub-events). These are further sub-categorized, for example into space-time volumes and trajectories, based on the fea-
ture representation and classification methods used for recognition. A detailed survey on object segmentation discussed the libraries and public datasets available for this task. Another study, presented in [28], discussed the three levels of HAR, including core technology, HAR systems, and applications. Existing methods also face several challenges, such as occlusion, anthropometry, execution rate, background clutter, and camera motion,
as discussed in [29]. This survey categorized the existing methods based on their abili-
ties for handling these challenges on which potential research areas were also identified
[29]. In [30], human action recognition methods based on the feature representation and
classification were discussed. Similarly, Weinland et al. [31] surveyed the human activity recognition methods. Another survey covered semantic-based human action recognition methods [32], presenting the state-of-the-art methods for
activity recognition which use semantic-based features. In this paper, semantic space,
and semantic-based features such as pose, poselet, related objects, attributes, and scene
context are also defined and discussed. Different handcrafted features extraction and rep-
resentation methods have been proposed for human action recognition [33–37].
Deep learning uses computational models with multiple processing layers based on representation
learning with multiple levels of abstraction. This learning encompasses a set of meth-
ods that enable the machine to process the data in raw form and automatically transform
it into a suitable representation needed for classification. This is what we call trainable
feature extractors. This transformation process is handled at different layers, for exam-
ple, an image consists of an array of pixels, and then the first layer transforms it into
edges at a particular location and orientation. The second layer represents it as collec-
tion of motifs by recognizing the particular arrangement of edges in an image. The third
layer may combine the motifs into parts and the following layers would turn it into the
recognizable objects. These layers are learned from the raw data using a general purpose
learning procedure which does not need to be designed manually by the experts [38]. This
paper further examines the impact of deep learning on various computer-based fields such as 3D games and animation systems [39, 40], physical sciences, health-related issues [41–43], the natural sciences, and other domains.
Another important design consideration in vision-based HAR is the camera/sensor used for capturing the activity. The use of appropriate cameras for captur-
ing the activity has a great impact on the overall functionality of the recognition system.
In fact, these cameras have been instrumental to the progression of research in the field of
computer vision [46–50]. According to the nature and dimensionality of images captured
by these cameras, they are broadly divided into two categories, i.e., 2D and 3D cameras.
The objects in the real world exist in 3D form: when these are captured using 2D cameras
then one dimension is already lost, which causes the loss of some important information.
To avoid the loss of information, researchers are motivated to use 3D cameras for cap-
turing the activities. For the same reason, 3D-based approaches provide higher accuracy
than 2D-based approaches but at higher computational cost. Recently, some efficient 3D
cameras have been introduced for capturing images in 3D form. Among these, 3D Time-
of-flight (ToF) cameras, and Microsoft Kinect have become very popular for 3D imaging.
However, these sensors also have several limitations; for example, they only capture the frontal surfaces of the human and other objects in the scene. In addition to this, these sensors have a limited range of about 6-7 m, and their data can be distorted by scattered light.
Recently, thermal cameras have become popular. These are the passive sensors that
capture the infrared variations emitted by the objects with temperature above absolute
zero. Originally, these cameras were developed for surveillance as a night vision tool
for military, but due to the significant reduction in prices, these cameras have become
affordable for many applications. Deploying these cameras in computer vision systems
can overcome the limitations of normal grayscale and RGB cameras such as illumination
problems. In human activity recognition, these cameras can easily detect the human mo-
tion regardless of illumination conditions and colors of human surfaces and backgrounds.
These cameras can also overcome another major challenge in human activity recognition
known as cluttered backgrounds [52, 53]. However, there is no universal rule for
selecting the appropriate camera; it mainly depends on the nature of the problem and its
requirements.
A good number of survey and review papers have been published on HAR and related
processes. However, due to the great amount of work published on this subject, published reviews quickly become out-of-date. For the same reason, writing a review on human activity recognition is hard work and a challenging task. In this chapter, we provide a comprehensive review of human action recognition based on both handcrafted and learning-based action representations, along with well-known datasets. This chapter covers all these aspects of HAR in a single chapter with reference to the more recent publications. However, the major focus remains on human gesture and action recognition techniques, as this is the motive of the thesis.
2.2 Handcrafted Representation-Based Approach

The first category is the handcrafted feature-based representation. This approach has been popular among the HAR community and has achieved remarkable results on different well-known public datasets. In this approach, the important features from the sequence of image frames are extracted and the feature descriptor is built up using expert-designed feature detectors and descriptors. Then, classification is performed using a generic classifier such as the Support Vector Machine (SVM) [54]. This approach includes space-time, appearance-based, local binary pattern, and fuzzy logic-based methods.
Space-time-based approaches have four major components: space time interest point
(STIP) detector, feature descriptor, vocabulary builder, and classifier [55]. The STIP
detectors are further categorized into dense and sparse detectors. The dense detectors
such as V-FAST, the Hessian detector, and dense sampling densely cover all the video content for the detection of interest points, while sparse detectors such as the cuboid detector, Harris3D [56], and the Spatial Temporal Implicit Shape Model (STISM) use a sparse (local) subset of
this content. Various STIP detectors have been developed by different researchers [57,
58]. The feature descriptors are also divided into local and global descriptors. The local descriptors, such as HOG and N-jet, are based on local information such as texture, colour, and posture, while
global descriptors use global information such as illumination changes, phase changes,
and speed variation in a video. The vocabulary builders or aggregating methods group the extracted descriptors into a visual vocabulary, for example using the Bag-of-Visual-Words model. The 3D (XYT) volumes obtained by stacking consecutive 2D frames over time are called space-time volumes (STVs).
sure between two volumes for action recognition. Bobick and Davis [59] proposed an action recognition system using template matching; instead of using space-time volumes, they used Motion Energy Images (MEI) and Motion History Images (MHI) as a temporal template technique for action recognition. Hu et al. [60] extended this work, where MHI and two further features, including the histogram of oriented gradients (HOG), were combined for action representation, followed by a simulated annealing multiple instance learning support vector machine (SMILE-SVM) for action classification. The method proposed by Roh et al. [61] also extended the work of Bobick and Davis [59] from 2D to 3D space for view-independent human action recognition using a volume mo-
tion template. The comparative analysis of these techniques is presented in Table 2.1.
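As an illustration of the temporal-template idea behind Bobick and Davis [59], the sketch below accumulates a Motion History Image from grayscale frames: recent motion is bright and older motion fades linearly. The decay constant and frame-difference threshold are arbitrary illustrative choices, not values from the cited work.

import numpy as np

def motion_history_image(frames, tau=20, threshold=30):
    """Compute a Motion History Image (MHI) from a list of grayscale frames."""
    frames = [f.astype(np.float32) for f in frames]
    mhi = np.zeros_like(frames[0])
    for prev, curr in zip(frames[:-1], frames[1:]):
        motion = np.abs(curr - prev) > threshold          # binary motion mask
        mhi = np.where(motion, float(tau), np.maximum(mhi - 1.0, 0.0))
    return mhi / tau                                      # normalised template in [0, 1]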
In trajectory-based approaches, a person is represented as a set of points corresponding to the joint positions of his/her body. When a person performs an action, there are certain changes in his/her joint positions according to the nature of the action. These changes are recorded as space-time trajectories, which capture the 3D (XYZ) joint positions of the body in order to distinguish different types of actions. Following this
idea, many approaches have been proposed for action recognition based on the trajectories of body joints.
Inspired by the dense sampling in image classification, the concept of dense trajectories
for action recognition from videos was introduced [68]. The authors sampled the dense
points from each image frame and tracked them using displacement information from a
dense optical flow field. These types of trajectories cover the motion information and
are robust to irregular motion changes. This method achieved state-of-the-art results on challenging datasets. In the work of Wang and Schmid [73], an extension to the work
of Wang et al. [68] was proposed for the improvement of performance regarding camera
motion. For the estimation of camera motion, the authors used the Speeded-Up Robust Features (SURF) descriptor and dense optical flow. This significantly improved the performance of motion-based descriptors such as histograms of optical flow (HOF) and motion boundary histograms (MBH). However, incorporating a high density of trajectories within the video increases the computational cost. Many attempts have been made to re-
duce the computational cost of the dense trajectory-based methods. For this purpose, a
saliency map to extract the salient regions within the image frame was used in the work of Vig et al. [82]. Based on the saliency map, a significant number of dense trajectories can be discarded, which reduces the computational cost.
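The core mechanism of dense trajectories, sampling points on a regular grid and propagating them through a dense optical flow field, can be sketched as follows with OpenCV's Farneback flow. The sampling step, trajectory length, and flow parameters are assumptions chosen for illustration, not the settings of [68] or [73].

import cv2
import numpy as np

def track_dense_points(gray_frames, step=10, track_len=15):
    """Track grid-sampled points through 8-bit grayscale frames using dense optical flow."""
    h, w = gray_frames[0].shape
    ys, xs = np.mgrid[step // 2:h:step, step // 2:w:step]
    points = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float32)
    trajectories = [[tuple(p)] for p in points]
    for prev, curr in zip(gray_frames[:-1], gray_frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        for traj, p in zip(trajectories, points):
            x = int(np.clip(round(p[0]), 0, w - 1))
            y = int(np.clip(round(p[1]), 0, h - 1))
            dx, dy = flow[y, x]                 # displacement from the flow field
            p += (dx, dy)                       # follow the flow
            if len(traj) < track_len:
                traj.append((float(p[0]), float(p[1])))
    return trajectories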
Recently, a human action recognition method from depth movies captured by the Kinect
sensor was proposed [87]. This method represents the dynamic skeleton shapes of the human body as trajectories in a way that is invariant to the execution rate of the activity, and uses transported square-root vector fields (TSRVFs) of trajectories and the standard Euclidean norm to achieve computational efficiency. Another method
used different descriptors such as HOG, HOF and motion boundary histograms (MBH)
for recognition of actions of construction workers using dense trajectories [92]. Among
these descriptors, the authors reported the highest accuracy with a codebook of size 500. Explicitly modelling the motion information remains a challenging problem, and few methods have been proposed to this end. For this purpose, a human action recognition method was proposed using explicit motion modelling [93]. This method used visual code words generated from the dense trajectories for action representation.

Space-time features are extracted either from space-time volumes or from space-time trajectories for human action recognition. Generally, these features are local in nature. As in the case of space-time volumes and trajectories, these features can be divided into two categories: sparse and dense. The feature detectors that are based on interest point detectors such as
Harris3D [57], and Dollar [94] are considered as sparse, while feature detectors based on
optical flow are considered as dense. These interest point detectors provide the base for
most of the recently proposed algorithms. In the work of Thi et al. [95], the interest points were detected using Harris3D [57]; based on these points, they built the feature descriptor and used PCA (principal component analysis) with SVM for classification. In the work of Kihl et al. [84], the authors proposed a novel local polynomial space-time descriptor.
The most popular action representation methods in this category are based on the Bag-
of-Visual-Words (BoVW) model [96, 97] or its variants [98, 99]. The BoVW model typically involves feature extraction, dictionary learning, feature encoding, pooling, and normalization. Local features are extracted from the video; a visual dictionary is learned from a training set using a Gaussian Mixture Model (GMM) or k-means clustering; the features are encoded and pooled; and finally the video is represented as normalized pooled vectors, followed by a generic classifier for action recognition. The high performance of the BoVW model is
due to an effective low level feature such as dense trajectory features [73, 100], encoding
methods such as Fisher Vector [99], and space-time co-occurrence descriptors [65]. The
improved dense trajectory (iDT) [73] provides the best performance among the space-time feature-based approaches.
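A minimal sketch of the BoVW pipeline described above is given below: local descriptors are clustered into a codebook with k-means, each video is encoded as a normalised histogram of code words, and a generic SVM is trained on the histograms. The codebook size and kernel are illustrative choices, not values used in the cited works.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def build_vocabulary(train_descriptors, k=200):
    """Learn a visual vocabulary (codebook) by clustering all pooled local descriptors."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(np.vstack(train_descriptors))

def encode_video(descriptors, vocabulary):
    """Quantise each descriptor to its nearest code word; return the normalised histogram."""
    words = vocabulary.predict(descriptors)
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

def train_bovw_classifier(train_descriptors, labels, k=200):
    vocab = build_vocabulary(train_descriptors, k)
    X = np.array([encode_video(d, vocab) for d in train_descriptors])
    clf = SVC(kernel="rbf").fit(X, labels)   # generic classifier on the pooled histograms
    return vocab, clf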
The coding methods have played an important role in boosting the performance of these
approaches. Recently, a new encoding method named Stacked Fisher Vector (SFV) [77]
was developed as an extension of traditional single layer Fisher Vector (FV) [99]. Unlike
traditional FV, which encodes all local descriptors at once, SFV first performs encoding
in dense sub-volumes, then compresses these sub-volumes into FVs, and finally applies another FV encoding based on the compressed sub-volumes. For a detailed comparison of these encoding methods, the reader is referred to the corresponding studies.
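For reference, the standard first-order and second-order Fisher Vector statistics with respect to the means and standard deviations of a diagonal-covariance GMM can be written as follows; the notation follows the usual FV formulation rather than anything specific to [77] or [99]. For local descriptors $X=\{x_1,\dots,x_N\}$ and a GMM with weights $w_k$, means $\mu_k$, and standard deviations $\sigma_k$,
\[
\gamma_n(k) = \frac{w_k\,\mathcal{N}(x_n;\,\mu_k,\sigma_k^2)}{\sum_{j} w_j\,\mathcal{N}(x_n;\,\mu_j,\sigma_j^2)},
\qquad
\mathcal{G}^{X}_{\mu_k} = \frac{1}{N\sqrt{w_k}} \sum_{n=1}^{N} \gamma_n(k)\,\frac{x_n-\mu_k}{\sigma_k},
\qquad
\mathcal{G}^{X}_{\sigma_k} = \frac{1}{N\sqrt{2w_k}} \sum_{n=1}^{N} \gamma_n(k)\left[\frac{(x_n-\mu_k)^2}{\sigma_k^2}-1\right].
\]
The concatenation of these gradients over all mixture components forms the FV of the descriptor set; SFV applies this encoding twice, first on dense sub-volumes and then on their compressed FVs, as described above.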
2.2.2 Discussion
The space-time-based approaches discussed above have been evaluated on various well-known datasets including simple and complex activities, as recorded in Table 2.1.
Some merits of these approaches are as follows: (1) STVs-based approaches are suit-
able for recognition of gestures and simple actions. However, these approaches have
also produced comparable results on complex datasets such as Human Motion database
(HMDB-51), Hollywood2, and University of Central Florida (UCF-101); (2) The space-
time trajectory-based approaches are especially useful for recognition of complex activ-
ities. With the introduction of dense trajectories, these approaches have become popu-
lar due to their high accuracy for challenging datasets. In recent years, trajectory-based
approaches have been getting a lot of attention due to their reliability under noise and illumination changes, and they have achieved state-of-the-art results on many challenging datasets. It has been observed that descriptors such as HOG3D, HOG, HOF, and MBH are more suitable for handling intra-class variations and occlusions.
However, these approaches have some limitations as follows: (1) STVs-based ap-
proaches are not effective in recognizing multiple persons in a scene; these methods use a
sliding window for this purpose which is not very effective and efficient; (2) Trajectory-
based approaches are good at analyzing the movement of a person in a view invariant
manner but to correctly localize the 3D XYZ joint position of a person is still a chal-
lenging task; (3) Space-time features are more suitable for simple datasets; for effective recognition on complex datasets, denser features are required, which increases the computational complexity. These limitations can cause a hindrance to real-time ap-
plications.
In this section, 2D (XY) and 3D (XYZ) depth image-based methods which use ef-
fective shape, motion, or combination of shape and motion features are discussed. The
2D shape-based approaches [101, 102] use shape and contour-based features for action representation, while motion-based approaches, such as those of Efros et al. and Fathi and Mori [103, 104], use optical flow or its variants for action representation. Some approaches use both
shape and motion feature for action representation and recognition [105]. In 3D-based
approaches, a model of a human body is constructed for action representation; this model
can be based on cylinders, ellipsoids, visual hulls generated from silhouettes or surface
mesh. Some examples of these methods are 3D optical flow [106], shape histogram [107], and others.
The shape-based methods capture the local shape features from the human image or
silhouette [110]. These methods first obtain the foreground silhouette from an image frame using foreground segmentation techniques. Then, they extract the features from the silhouette itself (positive space) or from the surrounding regions of the silhouette (negative space) between the canvas and the human body [111]. Some of the important
features that can be extracted from the silhouette are contour points, region-based fea-
tures, and geometric features. The region-based human action recognition method was
proposed in [112]. This method divides the human silhouette into a fixed number of grids
and cells for action representation and used a hybrid classifier Support Vector Machine
and Nearest Neighbour (SVM-NN) for action recognition. For practical applications,
the human action recognition method should be computationally lean. In this direction, a method was proposed based on symbolic aggregate approximation (SAX) shapes [113]. In this method, a silhouette was transformed into a time-series and
these time-series were converted into a SAX vector for action representation followed by
a random forest algorithm for action recognition. A pose-based view invariant human
action recognition method was proposed based on the contour points with a sequence of
the multiview key poses for action representation [114]. An extension of this method was
proposed by Chaaraoui and Flórez-Revuelta [115]. This method uses the contour points of the human silhouette and a radial scheme for action representation, with a support vector machine for classification. Other methods were developed by extracting features from the surrounding regions (negative space) of the hu-
man silhouette [111, 116]. Another method used pose information for action recognition
[117]. In this method, first, the scale-invariant features were extracted from the silhouette, and then these features were clustered to build the key poses. Finally, the classification was performed based on these key poses.
Motion-based action recognition methods use motion features for action representation
followed by a generic classifier for action recognition. A novel motion descriptor was
proposed [118] for multiview action representation. This motion descriptor is based on
motion direction and histogram of motion intensity followed by the support vector ma-
chine for classification. Another method based on 2D motion templates using motion
history images and histogram of oriented gradients was proposed by Murtaza et al [119].
An action recognition method was proposed by Kliper-Gross et al [75], based on the key
elements of motion encoding and local changes in motion direction encoded with the
bag-of-words technique.
These methods combine shape-based and motion-based features for action representa-
tion. Optical flow and silhouette-based shape features were used for view-invariant action recognition in [120], followed by principal component analysis (PCA) for reducing the dimensionality of the data. Some other methods based on shape and motion information were proposed for action recognition [121, 122]. The coarse silhouette features,
radial grid-based features and motion features were used for multiview action recognition
in the work of Pelikan et al [122]. Meanwhile, Jiang et al [105] used shape-motion proto-
type trees for human action recognition. The authors represented action as a sequence of
prototypes in shape-motion space and used distance measure for sequence matching. This
method was tested on five public datasets and achieved state-of-the-art results. Eweiwi et al. [123] proposed a method based on action key poses, as a variant of Motion Energy Images (MEI) and Motion History Images (MHI), for action representation, followed by a generic classifier for action recognition.
In this section, the two important approaches that do not fit under the headings of
the above-mentioned categories are discussed. These approaches include Local Binary Pattern (LBP) and fuzzy logic-based methods.
Local binary pattern (LBP) [124] is a type of visual descriptor for texture classifica-
tion. Since its inception, several modified versions of this descriptor [125–127] have been introduced. A human action recognition method was proposed [128] based on LBP combined with appearance invari-
ance and patch matching method. This method was tested on different public datasets
and proved to be efficient for action recognition. Another method for activity recogni-
tion was proposed using LBP-TOP descriptor [129]. In this method, the action volume
was partitioned into sub-volumes and feature histogram was generated by concatenating
the histograms of sub-volumes. Using this representation, they encoded the motion at
three different levels: pixel level (a single bin in the histogram), region level (sub-volume histograms), and volume level (the concatenated global histogram). LBP-based methods have also been employed for multiview human action recognition. A multiview
human action recognition method was proposed [130] based on contour based pose fea-
tures and uniform rotation-invariant LBP followed by SVM for classification. Recently,
another motion descriptor named Motion Binary Pattern (MBP) was introduced for mul-
tiview action recognition [131]. This descriptor is a combination of Volume Local Binary
Pattern (VLBP) and optical flow. This method was evaluated on multiview INRIA Xmas
Motion Acquisition Sequences (IXMAS) dataset and achieved 80.55% recognition re-
sults.
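The basic LBP operator on which these spatio-temporal variants build can be sketched as follows: each pixel is encoded by comparing its eight neighbours to the centre value, and the histogram of the resulting codes serves as the texture descriptor. LBP-TOP applies the same idea on the three orthogonal planes (XY, XT, YT) of the video volume.

import numpy as np

def lbp_image(gray):
    """Basic 8-neighbour Local Binary Pattern codes for a grayscale image."""
    h, w = gray.shape
    centre = gray[1:h - 1, 1:w - 1]
    codes = np.zeros_like(centre, dtype=np.uint8)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = gray[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes |= ((neighbour >= centre).astype(np.uint8) << bit)   # one bit per neighbour
    return codes

def lbp_histogram(gray):
    codes = lbp_image(gray)
    return np.bincount(codes.ravel(), minlength=256) / codes.size  # normalised 256-bin histogram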
Traditional vision-based human action recognition methods employ the spatial or tem-
poral features followed by a generic classifier for action representation and classification.
However, it is difficult to scale up these methods for handling uncertainty and complexity
involved in real-world applications. For handling these difficulties, fuzzy-based methods are considered to be a better choice. A fuzzy-based framework was proposed for human
action recognition based on the fuzzy log-polar histograms and temporal self-similarities
for action representation followed by SVM for action classification [132]. The evaluation
of the proposed method on two public datasets confirmed the high accuracy and its suit-
ability for real world applications. Another method based on fuzzy logic was proposed
[133], which utilized the silhouette slices and movement speed features as input to the
fuzzy system, and employed the fuzzy c-means clustering technique to acquire the mem-
bership function for the proposed system. The results confirmed the better accuracy of
the proposed fuzzy system as compared to non-fuzzy systems for the same public dataset.
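For completeness, a minimal fuzzy c-means sketch is given below; it produces the graded cluster memberships that such fuzzy systems use as membership functions. The fuzzifier value, number of clusters, and iteration count are illustrative assumptions, not the settings of [133].

import numpy as np

def fuzzy_c_means(X, n_clusters=3, m=2.0, n_iter=100, seed=0):
    """Minimal fuzzy c-means: returns cluster centres and a membership matrix U (samples x clusters)."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), n_clusters))
    U /= U.sum(axis=1, keepdims=True)                     # memberships of each sample sum to 1
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]    # membership-weighted cluster centres
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        # standard FCM membership update based on relative distances
        ratio = dist[:, :, None] / dist[:, None, :]
        U = 1.0 / (ratio ** (2.0 / (m - 1.0))).sum(axis=2)
    return centers, U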
Most of the human action recognition methods are view dependent and can recognize
the action from a fixed view. However, a real-time human action recognition method must
be able to recognize the action from any viewpoint. To achieve this, many state-of-the-art
methods use data captured by multiple cameras. However, this is not a practical solution
because calibration of multiple cameras in real world scenarios is quite difficult. The use
of a single camera should be the ultimate solution for view invariant action recognition.
Along these lines, a fuzzy logic-based method was proposed [134] for view invariant
action recognition using a single camera. This method extracted human contours and used the fuzzy qualitative Poisson human model for view estimation, followed by clustering algorithms for view classification, as shown in Figure 2.6. The results indicate that the
proposed method is quite efficient for view-independent action recognition. Some
methods based on neuro-fuzzy systems (NFS) have also been proposed for human gesture
and action recognition [135, 136]. In addition to this, evolving systems [49, 137] have also been applied in this context.
2.2.5 Discussion
In this section, a comparison is made between the appearance-based, LBP, and fuzzy logic-based methods. These approaches are simple and have produced state-of-the-art results
on Weizmann, KTH, and multiview IXMAS datasets as recorded in Table 2.2. There
are two major approaches for multiview human action recognition based on shape and motion features, namely 2D and 3D approaches. The 3D approaches provide higher accuracy than 2D approaches, but at a higher computational cost, which makes these approaches less applicable to real-time applications. In addition to this, a 3D model may suffer from deficiencies which might have occurred due to segmentation errors in each viewpoint. Moreover, a good 3D model of different views can only be constructed when the views are captured with a carefully calibrated multi-camera setup, which further limits the applicability of a 3D model.
Unlike the handcrafted approach, where the action is represented by handcrafted feature detectors and descriptors, learning-based representation approaches have the capability to learn the features automatically from the raw data, thus introducing the concept of end-to-end learning, which means transformation from the pixel level to action classification. Some of these approaches are based on dictionary learning, while others employ deep learning-based models for action representation. We
have divided these approaches into two categories: the non-deep learning-based approach and the deep learning-based approach.
TABLE 2.2: Comparison of appearance, Local Binary Pattern, and Fuzzy logic-based
approaches
These approaches are based on genetic programming and dictionary learning, as discussed below. Dictionary learning aims to learn a sparse representation of the input data. The sparse representation is suitable for the cat-
egorization tasks in images and videos. Dictionary learning-based approaches have been widely used for tasks such as image classification and action recognition [146]. The concept of dictionary learning is similar to the Bag-of-
visual-words(BoVW) model because both are based on the representative vectors learned
from the large number of samples. These representative vectors are called code words,
forming a code-book in BoVW model, and dictionary atoms in the context of dictionary
learning. One way to get the sparse representation of input data is to learn over-complete
basis dictionary. In a paper published by Guha and Ward [147], three over-complete
dictionary learning frameworks were investigated for human action recognition. An overcomplete dictionary-based method was proposed for human action recognition [148] based on a hierarchical descriptor.
The cross-view action recognition problem was addressed by using transferable dictio-
nary pair [149]. In this method the authors learned the view specific dictionaries where
each dictionary corresponds to one camera view and extended this work by constructing
a common dictionary which shares information from different views [150]. A supervised cross-domain dictionary learning-based method was also proposed for visual recognition
Dictionary learning-based methods also use unsupervised learning; for example, Zhu
and Shao [152] proposed an unsupervised approach for cross-view human action recog-
nition. This method does not require target view label information or correspondence
annotations for action recognition. The set of low-level trajectory features are coded us-
ing locality-constrained linear coding (LLC) [153] to form the coding descriptors, then
peak values are pooled to form a histogram that captures the local structure of each action.
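The general dictionary learning idea used by these methods can be sketched with scikit-learn as follows: an over-complete set of atoms is learned from pooled local features, and each video is described by pooling the sparse codes of its features over that dictionary. This is a generic sparse-coding sketch under assumed parameters, not an implementation of LLC or of the transferable dictionaries discussed above.

import numpy as np
from sklearn.decomposition import DictionaryLearning

def learn_action_dictionary(local_features, n_atoms=100, alpha=1.0):
    """Learn an over-complete dictionary of atoms from pooled local features."""
    dl = DictionaryLearning(n_components=n_atoms, alpha=alpha,
                            transform_algorithm="lasso_lars", max_iter=50)
    dl.fit(np.vstack(local_features))
    return dl

def encode_with_dictionary(dl, video_features):
    codes = dl.transform(video_features)          # sparse codes, one row per local feature
    return np.abs(codes).max(axis=0)              # max-pooled sparse representation of the video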
Genetic programming is an evolutionary technique inspired by natural evolution. It can be used to solve problems without prior knowledge of their solutions, by identifying the sequence of unknown primitive operations that can maximize the performance of a task. A genetic programming-based method was introduced for action recognition [154]. In this method, instead of using handcrafted features, the authors automatically learned the spatio-temporal motion features for action recognition, evolving them from primitive operations such as the Gabor filter and wavelet. In this way, an effective set of features was learned for action recognition. This method was evaluated on three challenging datasets and outperformed
A Bayesian network is a probabilistic graphical model that represents random variables and their dependencies in the form of a directed acyclic graph. Different variations of the Bayesian network have been introduced, such as the conditional Bayesian network and the Multi-Entity Bayesian Network (MEBN). In the work of Zhang et al. [155], an Interval Temporal Bayesian Network (ITBN) was intro-
duced for recognition of complex human activities. This method combined the Bayesian
network with the interval algebra to model the time dependencies over time intervals.
In order to evaluate the performance of the proposed method, a cargo loading dataset
was considered for experimentations and evaluations. Khan et al [156] proposed an-
other method for action detection using dynamic conditional Bayesian network, which
also achieved state-of-the-art results. In Park et al. [157], MEBN was used for predictive situation awareness (PSAW) using multiple sensors. These networks are robust for reasoning about uncertainty in complex domains when predicting and estimating the states of interest.
Recent studies show that there are no universally best handcrafted feature descriptors for all datasets; therefore, learning features directly from the raw data may be more advantageous. Deep learning is based on learning multiple levels of representation and abstraction that can make sense of data such
as speech, images, and text. Deep learning-based methods have ability to process the im-
ages/videos in their raw forms and automate the process of feature extraction, representa-
tion, and classification. These methods use trainable feature extractors and computational
models with multiple processing layers for action representation and recognition. Based
on a research study on deep learning presented [158], we have classified the deep learn-
ing models into three categories: (1) generative/unsupervised models, e.g., Deep Belief Networks (DBNs) and restricted Boltzmann machines; (2) discriminative/supervised models, such as Deep Neural Networks (DNNs), Recurrent Neural Networks (RNNs), and Convolutional
Neural Networks (CNNs); (3) Hybrid models, these models use the characteristics of
both models, for example a goal of discrimination may be assisted with the outcome of
the generative model. However, these models are not discussed here separately.
Unsupervised deep learning models do not require the target class labels during the
learning process. These models are specifically useful when labelled data are relatively
scarce or unavailable. Deep learning models have been investigated since the 1960s [159], but researchers paid little attention to these models. This was mainly due to the success of shallow models such as SVMs [160], and the unavailability of the huge amounts of data required for training such models.
A remarkable surge in the history of deep models was triggered by the work of Hin-
ton et al [161] where the highly efficient DBN and training algorithm was introduced
followed by the feature reduction technique [162]. The DBN was trained layer by layer
using RBMs [163]; the parameters learned during this unsupervised pre-training phase were then fine-tuned in a supervised manner. Since the introduction of this efficient model, there has been a lot of interest in applying deep learning models to various recognition tasks.
A method using unsupervised feature learning from video data was proposed in the
work of Le et al [164] for action recognition. The authors used an independent subspace
analysis algorithm to learn spatio-temporal features, combining them with deep learning techniques such as convolution and stacking for action representation and recognition.
Deep Belief Networks (DBNs) trained with RBMs were used for human action recogni-
tion [165]. The proposed method outperformed the handcrafted feature-based approaches
on two public datasets. Learning continuously from the streaming video without any la-
bels is an important but challenging task. This issue was addressed in a paper by Hasan
and Roy-Chowdhury [166] using an unsupervised deep learning model. Most of the ac-
tion datasets have been recorded under a controlled environment; action recognition from
unconstrained videos is a challenging task. A method addressing human action recognition from unconstrained videos has also been proposed in this direction.
Unsupervised learning played a pivotal role in reviving the interests of the researchers
in deep learning. However, it has been overshadowed by the purely supervised learn-
ing since the major breakthrough in deep learning used CNNs for object recognition
[168]. Conversely, an important study by the pioneers of the latest deep learning models suggests that unsupervised learning is going to be far more important than its supervised counterpart in the long run [38], since we discover the world by observing it rather than being told the name of every object; human and animal learning is mostly unsupervised.
According to the literature survey of human action recognition, the most frequently
used model under the supervised category is Convolutional Neural Network (CNN). The
CNN [169] is a type of deep learning model which has shown excellent performance in visual recognition tasks such as image classification, object recognition, and human action recognition [168, 170]. This is a hierarchical learning model with
multiple hidden layers to transform the input volume into output categories. Its archi-
tecture consists of three main types of layers: convolutional layer, pooling layer, and
fully-connected layer as shown in Figure 2.8. Understanding the operation of the different
layers of a CNN requires mapping their activations back into pixel space; this is done with the
help of Deconvolutional Networks (Deconvnets) [171]. The Deconvnets use the same pro-
cess as CNN but in reverse order for mapping from feature space to pixel space. Initially,
the deep CNN [169] was used for representation and recognition of objects from still
images [168]. This was extended to action recognition from videos [172] using stacked
video frames as input to the network but the results were worse than even the handcrafted
shallow representations [73, 97]. This issue was investigated by Simonyan and Zisser-
man [2], who came up with the idea of a two-stream (spatial and temporal) CNN for action recognition, as shown in Figure 2.9. Both streams were implemented as ConvNets; the spatial stream recognizes
the action from still video frames and the temporal stream performs action recognition
from the motion in the form of dense optical flow. Afterwards, these two streams were
combined using late fusion for action recognition. This method achieved superior results
to one of the best shallow handcrafted-based representation methods [73]. However, the
two-stream architecture may not be suitable for real-time applications due to its compu-
tational complexity.
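To make the two-stream idea concrete, the following Python sketch (using PyTorch) builds a toy spatial stream over a single RGB frame and a temporal stream over a stack of optical-flow fields, and fuses their class scores by late averaging. The network sizes, class count, and flow-stack length are illustrative assumptions, not the architecture of [2].

# A minimal sketch of the two-stream idea: a spatial stream over one RGB frame
# and a temporal stream over stacked optical flow, fused late by averaging scores.
import torch
import torch.nn as nn

def make_stream(in_channels: int, num_classes: int) -> nn.Sequential:
    """A small ConvNet used for both streams; only the input depth differs."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, kernel_size=7, stride=2, padding=3), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(64, num_classes),
    )

class TwoStreamNet(nn.Module):
    def __init__(self, num_classes: int = 11, flow_stack: int = 10):
        super().__init__()
        self.spatial = make_stream(3, num_classes)                 # RGB frame
        self.temporal = make_stream(2 * flow_stack, num_classes)   # stacked (dx, dy) flow fields

    def forward(self, rgb, flow):
        # Late fusion: average the softmax class scores of the two streams.
        p_spatial = torch.softmax(self.spatial(rgb), dim=1)
        p_temporal = torch.softmax(self.temporal(flow), dim=1)
        return (p_spatial + p_temporal) / 2

model = TwoStreamNet()
scores = model(torch.randn(1, 3, 224, 224), torch.randn(1, 20, 224, 224))
print(scores.shape)  # torch.Size([1, 11])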
Most of the deep CNN models for action recognition are limited to handling 2D inputs, so the motion information across frames is not fully captured by a standard 2D CNN model. This problem was addressed by Xu et al. [173] by introducing a 3D convolutional neural network model for airport surveillance. This model captures features from both the spatial and the temporal dimensions by performing 3D convolutions in the convolutional layers. This method achieved state-of-the-art results on airport video surveillance datasets.
The supervised CNN models, whether 2D or 3D, can also be accompanied by some unsupervised components. One such unsupervised technique is slow feature analysis (SFA) [174], which extracts slowly varying features from the input signal in an unsupervised manner. Besides other recognition problems, it has also proved to be effective for human
action recognition [175]. Sun et al [176], combined two-layered SFA learning with 3D
CNN for automated action representation and recognition. This method achieved state-
of-the-art results on three public datasets including KTH, UCF sports, and Hollywood2.
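The core of SFA can be illustrated with a small, self-contained example. The following NumPy sketch implements plain linear SFA (whiten the signal, then keep the projections whose temporal derivative has the smallest variance); it is only a one-layer illustration under simplifying assumptions, not the two-layer SFA pipeline of Sun et al. [176].

# Linear slow feature analysis: find projections of a whitened signal whose
# temporal derivative has minimal variance.
import numpy as np

def linear_sfa(x: np.ndarray, n_features: int) -> np.ndarray:
    """x has shape (T, D): T time steps of a D-dimensional signal."""
    x = x - x.mean(axis=0)
    # Whiten the signal so all directions have unit variance.
    cov = np.cov(x, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    whitener = vecs @ np.diag(1.0 / np.sqrt(vals + 1e-8)) @ vecs.T
    z = x @ whitener
    # Slow directions = eigenvectors of the derivative covariance with the
    # smallest eigenvalues (eigh returns eigenvalues in ascending order).
    dz = np.diff(z, axis=0)
    dvals, dvecs = np.linalg.eigh(np.cov(dz, rowvar=False))
    w = dvecs[:, :n_features]
    return z @ w  # slowly varying output features, shape (T, n_features)

# Example: a slow sinusoid mixed with a fast one and noise.
t = np.linspace(0, 10, 1000)
signal = np.column_stack([np.sin(0.5 * t), np.sin(20 * t), np.random.randn(1000)])
slow = linear_sfa(signal, n_features=1)
print(slow.shape)  # (1000, 1)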
Other types of supervised models include Recurrent Neural Networks (RNNs). A method
using RNN was proposed [177] for skeleton-based action recognition. The human skele-
ton was divided into five parts and then separately fed into five subnets. The results of
these subnets were fused in the higher layers, and the final representation was fed into a single-layer classifier. For further details regarding this model, the reader may refer to Du et al. [177].
The deep learning-based models for human action recognition require a huge amount
of video data for training. While collecting and annotating huge amounts of video data is expensive and time consuming, great success has been achieved in the domains of image classification, object recognition, speech
recognition, and human action recognition using the standard 2D and 3D CNN models.
However, there still exist some issues such as high computational complexity of training
CNN kernels, and huge data requirements for training. To curtail these issues researchers
have been working to come up with variations of these models. With this in mind, factorized spatio-temporal convolutional networks were proposed for human action recognition. This type of network factorizes the standard 3D CNN model into 2D spatial kernels in the lower layers (spatial convolutional layers), learned in a sequential process, and 1D temporal kernels in the upper layers (temporal convolutional layers). This reduces the number of parameters to be learned by the network and thus the computational complexity of training the CNN kernels.
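The parameter saving of this factorization can be illustrated with a short PyTorch sketch, in which a 3×3×3 convolution is replaced by a 1×3×3 spatial convolution followed by a 3×1×1 temporal convolution. The layer sizes are illustrative assumptions, and the block is not the architecture of the cited works.

# Factorizing a 3D convolution into a spatial (1xkxk) and a temporal (kx1x1) part.
import torch
import torch.nn as nn

class FactorizedConv3d(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        # Spatial kernel: operates within each frame (temporal extent of 1).
        self.spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, k, k),
                                 padding=(0, k // 2, k // 2))
        # Temporal kernel: operates across frames only.
        self.temporal = nn.Conv3d(out_ch, out_ch, kernel_size=(k, 1, 1),
                                  padding=(k // 2, 0, 0))
        self.relu = nn.ReLU()

    def forward(self, clip):
        # clip: (batch, channels, frames, height, width)
        return self.relu(self.temporal(self.relu(self.spatial(clip))))

block = FactorizedConv3d(3, 16)
out = block(torch.randn(1, 3, 8, 112, 112))
print(out.shape)  # torch.Size([1, 16, 8, 112, 112])

# Parameter comparison with a full 3x3x3 3D convolution:
full = nn.Conv3d(3, 16, kernel_size=3, padding=1)
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(block), "vs", count(full))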
A deep 3D CNN model, known as C3D, was proposed by Tran et al. [179] for human action recognition. The evaluation of this
method on four public datasets confirmed three important findings: (1) 3D CNN is more
suitable for spatio-temporal features than 2D CNN; (2) The CNN architecture with small
3×3×3 kernels is the best choice for spatio-temporal features; and (3) the proposed method achieved results competitive with the state-of-the-art methods on the evaluated datasets.
Some studies have reported that incorporating handcrafted features into the CNN model
can improve the performance of action recognition. Along this direction, combining in-
formation from multiple sources using CNN was proposed by Park et al [180]. In this
method, handcrafted features were used to perform spatially varying soft-gating, and a fusion method was used for combining multiple CNNs trained on different sources. Recently, Yu
et al [181] came up with the variation of CNN called stratified pooling-based CNN (SP-
CNN). Since each video has a different number of frame-level features, to combine and
get a video-level feature is a challenging task. The SP-CNN method addressed this is-
sue by proposing variation in the CNN model as follows: (a) adjustment of pre-trained
CNN on target dataset; (b) extraction of features at frame-level; (c) using principal com-
ponent analysis (PCA) for dimensionality reduction; (d) stratified pooling frame-level
features into video-level features; and (e) SVM for multiclass classification. This architecture achieved the highest accuracy reported so far on HMDB-51, as shown in Table 2.4.
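The frame-to-video part of such a pipeline (steps b-e) can be sketched as follows in Python, assuming frame-level CNN features have already been extracted; the stratified pooling is approximated here by average-pooling over equal temporal segments, which may differ from the exact pooling used in SP-CNN.

# Frame features -> PCA -> segment-wise pooling -> video descriptor -> multiclass SVM.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

def video_descriptor(frame_feats: np.ndarray, pca: PCA, n_strata: int = 4) -> np.ndarray:
    """frame_feats: (n_frames, feat_dim) features of one video."""
    reduced = pca.transform(frame_feats)                     # (c) dimensionality reduction
    strata = np.array_split(reduced, n_strata, axis=0)       # (d) split frames into strata
    return np.concatenate([s.mean(axis=0) for s in strata])  # pool each stratum, concatenate

# Toy data: 20 videos with a variable number of 256-d frame features each.
rng = np.random.default_rng(0)
videos = [rng.normal(size=(int(rng.integers(30, 60)), 256)) for _ in range(20)]
labels = rng.integers(0, 3, size=20)

pca = PCA(n_components=32).fit(np.vstack(videos))
X = np.stack([video_descriptor(v, pca) for v in videos])
clf = SVC(kernel="linear", decision_function_shape="ovo").fit(X, labels)  # (e) multiclass SVM
print(clf.predict(X[:5]))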
Semantic-based features such as pose and poselets are important cues for describing the category of an action being performed. In this direction, CNN-based methods, including a fuzzy CNN, were proposed by Ijjina and Mohan, and Chéron et al. [182, 183] using local pose-based features. These descriptors are based on the motion and appearance information acquired from tracking human body parts. These methods were evaluated on the Human Motion Database (HMDB) and produced superior results to other state-of-the-art methods. It
has been observed that the context/scene where the action is carried out also provides im-
portant cues regarding the category of an action. Gkioxari et al. [184] used this contextual information for human action recognition by adapting the region-based Convolutional Neural Network (R-CNN) of Girshick et al. [185] for classification, where the actor was considered as the primary region and contextual cues were taken from secondary regions.
One of the major challenges in human action recognition is view variance. The same
action viewed from different angles looks quite different. This issue was addressed by
Rahmani and Mian [186] using CNN. This method generates the training data by fitting
synthetic 3D human model to real motion and renders human poses from different view-
points. The CNN model has shown better performance than handcrafted representation-
based methods for multi-view human action recognition. The comparison of non-deep
learning and deep learning-based methods on different public datasets is shown in Table
2.3.
2.3.3 Discussion
In this discussion, the learning-based representation approaches have been divided into genetic programming, dictionary learning, and supervised and unsupervised deep learning-based approaches, according to the type of feature representation used in each category. However, this division boundary is not strict and approaches may overlap.
Dictionary learning-based approaches introduced the concept of unified learning of a dictionary and the corresponding classifier in a single learning procedure, which leads to the concept of end-to-end learning. On the other hand, genetic programming (GP) is an evolutionary approach, inspired by natural selection, used to solve a problem without prior domain knowledge. In human action recognition, GP is used to design holistic descriptors that are adaptive and robust for action recognition. These methods have achieved state-of-the-art results on challenging datasets.
Deep learning has emerged as a highly popular direction within machine learning, and it has outperformed the traditional approaches in many applications of computer vision. A highly advantageous property of deep learning algorithms is their ability to learn features from the raw data, which eliminates the need for handcrafted feature detectors and descriptors. There are two categories of deep learning models, i.e., unsupervised and supervised models. The DBN is a popular unsupervised model which has been used for human action recognition. This model has already achieved high performance on several benchmark datasets. On the
other hand, CNN is one of the most popular deep learning models in the supervised cat-
egory. Most of the existing learning-based representations either directly apply CNN to
video frames or variations of CNN for spatio-temporal features extraction and represen-
tation. These models have also achieved excellent results on challenging human activity
recognition datasets as recorded in Table 2.3. So far, supervised deep learning models
have achieved better performance but some studies suggest that unsupervised learning is
going to be far more important in the long run. Since the world is discovered by observing it rather than being told the name of every object, human and animal learning is mostly unsupervised [38].
The deep learning models also have some limitations: these models require a huge amount of data for training. Most of the action recognition datasets, such as KTH [15], IXMAS [141], HMDB-51 [72], and UCF Sports [16, 17], are comparatively small for training these models. However, recently the large-scale ActivityNet dataset [199] was proposed, with 200 action categories and 849 hours of video in total. This dataset is suitable for training deep learning-based algorithms, and we can expect a major breakthrough as such large-scale datasets become available.
2.4 Datasets
In this section, well-known public datasets for human activity recognition are dis-
cussed. The focus is on recently developed datasets which have been frequently used
for experimentations.
2.4.1 Weizmann
This dataset [10] was introduced by the Weizmann Institute of Science in 2005. The dataset consists of 10 simple actions with a static background, i.e., walk, run, skip, jack, jump forward (jump), jump in place (pjump), gallop-sideways (side), bend, wave1, and wave2. It is considered relatively simple and is mostly used for recognition of simple actions. Some methods, such as [111, 112], have reported 100% accuracy on this dataset. The background of the dataset is simple, and only one person performs the action in each video sequence.
2.4.2 KTH
The KTH dataset [15] was created by the Royal Institute of Technology, Sweden, in
2004. This dataset consists of six types of human actions (walking, jogging, running,
boxing, hand clapping and hand waving) performed by 25 actors with 4 different scenar-
ios. Thus, it contains 25 × 6 × 4 = 600 video sequences. These videos were recorded
with static camera and background; therefore, this dataset is also considered relatively
simple for evaluation of human activity recognition algorithms. The method proposed
by Sadanand and Corso [69] achieved 98.2% accuracy on this dataset, which is the highest accuracy reported so far. One example frame of each action from the four different scenarios is shown in Figure 2.12.
2.4.3 IXMAS
INRIA Xmas Motion Acquisition Sequences (IXMAS) [141], a multiview dataset, was created for the evaluation of view-invariant action recognition methods.
This dataset consists of 13 daily life actions performed by 11 actors 3 times each. These
actions include crossing arms, stretching head, sitting down, checking watch, getting up,
walking, turning around, punching, kicking, waving, picking, pointing, and throwing.
These actions were recorded with five calibrated cameras including 4 side cameras and
Figure 2.12: One frame example of each action from four different scenarios in the KTH dataset.
a top camera. The extracted silhouettes of the video sequences are also provided for ex-
perimentation. Basically, two types of methods have been proposed for multiview action
recognition, i.e., 2D and 3D-based methods. The 3D based methods have reported higher
accuracy than the 2D based methods on this dataset but at a higher computational cost.
The highest accuracy reported on this dataset is 100% by Holte et al [143] using 3D mo-
tion descriptors (HOF3D descriptors and 3D spatial pyramids (SP)). The example frames
for each action from five different camera views are shown in Figure 2.13.
2.4.4 HMDB-51
The HMDB-51 [72] is one of the largest datasets available for activity recognition
developed by Serre lab, Brown University, USA in 2011. It consists of 51 types of daily
life actions comprised of 6849 video clips collected from different sources such as movies,
YouTube, and Google videos. The highest accuracy reported so far on this dataset is
74.7% by Yu et al. [181] using SP-CNN, as shown in Table 2.4. Example frames for the actions are shown in Figures 2.14 and 2.15.
2.4.5 Hollywood2
The Hollywood2 [200] action dataset was created by INRIA (Institut National de Recherche en Informatique et en Automatique) in 2009. It consists of 12 action classes (get out of car, answer phone, kiss, hug, handshake, sit down, stand up, sit up, run, eat, fight, and drive car) with dynamic backgrounds. This dataset is very challenging: it consists of short unconstrained movie clips with multiple persons, cluttered backgrounds, cam-
era motion, and large intra-class variations. This dataset is meant for evaluation of HAR
algorithms in real life scenarios. Many researchers have evaluated their algorithms on
Figure 2.13: One frame example for each action from five different camera views in the IXMAS dataset.
this dataset; the best accuracy achieved so far is 75.2% by Fernando et al. [193] using rank
pooling and CNN. Some example frames from Hollywood2 dataset are shown in Figure
2.16.
Figure 2.14: Exemplar frames for actions 1 to 28 from the HMDB-51 action dataset.
Figure 2.15: Exemplar frames for actions 29 to 51 from the HMDB-51 action dataset.
2.4.6 UCF-101
The UCF-101 action recognition dataset [188] was created by the Centre for Research in Computer Vision, University of Central Florida, USA in 2012. This is one of the largest action datasets, containing 101 action categories collected from YouTube. This dataset is an
extension of UCF-50 [201] dataset with 50 action categories. UCF-101 contains 13,320
videos in total, aimed at encouraging the researchers to develop their algorithms for hu-
man action recognition in realistic scenarios. Example frames for some of the actions are shown in Figure 2.18.
2.4.7 UCF Sports
The UCF Sports action dataset was created by the Centre for Research in Computer Vision, University of Central Florida, USA in 2008 [16, 17]. It consists of 11 sports action categories (including golf swing, riding, and diving) broadcast on television channels. The dataset includes a total of
150 video sequences of realistic scenarios. The best accuracy achieved on this dataset so
Figure 2.18: Exemplar frames for actions 58 to 101 from the UCF-101 dataset.
far is 95.0% by Sadanand and Corso [69] using STVs, as shown in Table 2.4. Example frames for each action are shown in Figure 2.19.
2.4.8 YouTube Action Dataset
The YouTube action dataset [89] was developed in 2009. This is a challenging dataset
due to camera motion, viewpoint variations, illumination conditions, and cluttered back-
grounds. It contains 11 action categories: biking, diving, basketball shooting, horse rid-
ing, swinging, soccer juggling, trampoline jumping, volleyball spiking, golf swinging,
tennis swinging, and walking with a dog. The highest accuracy achieved so far on this
dataset is 93.38% by Peng et al. [77] using FV and SFV. Example frames for each action are shown in Figure 2.20.
2.4.9 ActivityNet
ActivityNet [199] was created in 2015. This is a large-scale video dataset covering a wide range of complex human activities. It provides 203 action categories and a total of 849 hours of video data. This dataset is specifically helpful for training classifiers which require a huge amount of training data, such as deep neural networks. According to the dataset authors, baseline evaluations were reported for both untrimmed and trimmed video classification, achieving 50.2% accuracy on trimmed videos. They used deep features (DF), motion features (MF), and static features (SF), as shown in Table 2.3. Some example frames from this dataset are also provided.

Figure 2.20: Exemplar frames of 11 sports actions from the YouTube action dataset.
2.5 Conclusions
This chapter provided a comprehensive literature review of state-of-the-art human activity representation and recognition approaches, including both handcrafted and learning-based approaches. The handcrafted representation-based approaches have been around for quite a long time. These approaches have achieved good results on different benchmark datasets. However, most of the best-performing handcrafted representation methods are based on local densely-sampled descriptors, which incur a high
computational cost. In these approaches, the important features from the sequence of
image frames are extracted to build the feature vector using human-engineered feature detectors and descriptors, including, among others, local binary patterns and fuzzy logic-based approaches. However, handcrafted approaches cannot automatically learn features from the raw data. On the other hand, learning-based techniques use trainable feature extractors followed by a trainable classi-
fier, which leads to the concept of end-to-end learning, or learning from the pixel level to action category identification. This eliminates the need for handcrafted feature detectors and
descriptors used for action representation. These approaches include evolutionary, dictio-
nary learning, and deep learning-based approaches. Recently, the research community has
paid a lot of attention to these approaches, mainly due to their high performance on challenging benchmark datasets. However, the data-driven deep models, often referred to as black-box models, have some limitations. Firstly, it is dif-
ficult to incorporate problem-specific prior knowledge into these models. Secondly, some
of the best performing deep learning-based methods are still dependent on handcrafted
features. In HAR, the performance of the pure learning-based methods is still not up to
the mark. This is mainly due to the unavailability of huge datasets for action recognition
unlike in the object recognition where huge dataset such as ImageNet is available.
In order to provide further insight into the field, some well-known public datasets for
activity recognition are presented for experimentations and evaluations of HAR tech-
niques. These datasets include KTH, Weizmann, IXMAS, UCF Sports, Hollywood2, HMDB-51, UCF-101, YouTube, and ActivityNet.

Chapter 3
Human Action Recognition From Multiple Views
IN THIS CHAPTER
A novel method for human action recognition from multiple views, i.e., view-invariant action recognition, is presented. The introduction of the proposed method is given in Section 3.1, while state-of-the-art view-invariant action recognition techniques are reviewed in Section 3.2. The proposed methodology and experimental results are explained in Sections 3.3 and 3.4, respectively. Finally, the
chapter is concluded in Section 3.5.
3.1 Introduction
Over the past decade, many techniques have been proposed for human action recogni-
tion. However, it is still a challenging task due to many issues involved in it [30]. The
major challenges and issues in HAR are as follows: (1) occlusion; (2) variation in human
appearance, shape, and clothes; (3) cluttered backgrounds; (4) stationary or moving cam-
eras; (5) different illumination conditions; and (6) viewpoint variations. Among these
challenges, viewpoint variation is one of the major problems in HAR since most of the
approaches for human activity classification are view-dependent and can recognize the
activity from one fixed view captured by a single camera. These approaches assume the same camera view during training and testing. This condition cannot be main-
tained in real world application scenarios. Moreover, if this condition is not met, their
accuracy decreases drastically because the same actions look quite different when cap-
tured from different viewpoints [3, 202]. A single camera-based approach also fails to
recognize the action when an actor is occluded by an object or when some parts of the
action are hidden due to unavoidable self-occlusion. To avoid these issues and get the
complete picture of an action, more than one camera is used to capture the action; this is known as multiview action recognition [203].
There are two major approaches for recognition of human actions from multiviews:
3D approach and 2D approach [5]. In the first approach, a 3D model of a human body is
constructed from multiple views and motion representation is formed from it for action
recognition. This model can be based on cylinders, ellipsoids, visual hulls generated from
silhouettes, or surface mesh. Some examples of motion representation are 3D optical flow
[106], shape histogram [107], motion history volume [108], 3D body skeleton [109], and
spatio-temporal motion patterns [204]. Usually, the 3D approach provides higher accu-
racy than a 2D approach, but at a higher computational cost, which makes it less applicable for real-time applications. Moreover, the 3D model is constructed from the silhouettes of differ-
ent views. Hence, the model is exposed to deficiencies which might have occurred due to
segmentation errors in each viewpoint. Moreover, a good 3D model of different views can
only be constructed when the views overlap. Therefore, a sufficient number of overlapping viewpoints is required to build a reliable 3D model.
However, recently some 3D cameras have been introduced for capturing images in 3D
form. Among these, 3D time-of-flight (ToF) cameras and Microsoft Kinect have become
very popular for 3D imaging. These devices overcome some of the difficulties faced in constructing a 3D model from multiple cameras. However, these sensors also have several limitations. For example, in contrast
to a fully reconstructed 3D model from multiple views, these sensors only capture the
frontal surfaces of the human and other objects in the scene. In addition to this, these sensors have a limited range of about 6-7 m, and data can be distorted by scattered light
from the reflective surfaces [51]. Due to the limitations of the 3D approach, researchers
prefer to employ a 2D approach for human-action recognition from multiple views [205].
The methods based on 2D models extract features from 2D images covering multiple
views. Different methods have been proposed for multiview action recognition based
on 2D models. However, three important lines of work are mentioned here. The first line of work is based on extracting appropriate feature descriptor(s) or a fusion of different features [130, 206] before classification. In the second line of work, action classification is carried out either by a single universal classifier, or multiple classifiers are trained and later on their results are fused to get the final result [207, 208]. The third one utilizes
the learning-based model, such as deep learning and its variations, to learn the effective
and discriminative features directly from the raw data for multiview action recognition
[189, 209].
The proposed method in this chapter falls under the first category. This method is based
on a novel feature descriptor for multiview human action recognition. This descriptor
employs region-based features extracted from the human silhouette. To achieve this, the
human silhouette is divided into regions in a radial fashion with the interval of a cer-
tain degree, and then region-based geometrical and Hu-moments features are obtained
from each radial bin to articulate the feature descriptor. A multiclass support vector machine classifier is used for action classification. The proposed approach is quite simple and achieves state-of-the-art results without compromising the efficiency of the recognition process. The contribution is two-fold. Firstly, the proposed approach achieves high recognition accuracy on a challenging multiview dataset. Secondly, taking advantage of this simple representation, the proposed method provides a high recognition speed of 34 frames per second on the challenging multiview IXMAS (INRIA Xmas Motion Acquisition Sequences) dataset. Due to its high accuracy and speed, the proposed approach is suitable for real-time applications.

3.2 Related Work

Most of the existing multiview action recognition methods are based on a 2D approach. These methods extract features from 2D image frames of all available
views and combine these features for action recognition. Then, classifier is trained using
all these viewpoints. After training the classifier, some methods use all viewpoints for
classification [210], while others use a single viewpoint for classification of a query action
[211–213]. In both cases, the query view is part of the training data. However, if the query
view is different than the learned views, this is known as cross-view action recognition.
This is even more challenging than the multiview action recognition [150, 214].
Different types of features such as motion, shape, or combination of motion and shape
features have been used for multiview action recognition. In [212], silhouette-based fea-
tures were acquired from five synchronized and calibrated cameras. The action recogni-
tion from multiple views was performed by computing the R transform of the silhouette
surfaces and manifold learning. In [114], contour points of the human silhouette were
used for pose representation to achieve multiview action recognition. A related bag-of-key-poses approach was proposed in [115] for action recognition from multiple views; this method used contour points of the
silhouette and radial scheme for pose representation. Then, model fusion of multiple
camera streams was used to build the bag of key poses, which worked as a dictionary
for known poses and helped to convert training sequences into key poses for a sequence-matching-based classification. Another silhouette-based method extracted uniform rotation-invariant local binary patterns (LBP) and contour-based
pose features from the silhouette. The classification was performed using a multiclass
support vector machine. In [117], scale-invariant features were extracted from the sil-
houette and clustered to build the key poses. Finally, the classification was performed using a classifier operating on these key poses.
Optical flow and silhouette-based features were used for view-invariant action recog-
nition in [120], and principal component analysis (PCA) was used for reducing the di-
mensionality of the data. In [122], coarse silhouette features, radial grid-based features
and motion features were used for multiview action recognition. Another method for
viewpoint changes and occlusion-handling was proposed in [211]. This method used
the histogram of oriented gradients (HOG) features with local partitioning, and obtained
the final results by fusing the results of the local classifiers. A novel motion descriptor
based on motion direction and histogram of motion intensity was proposed in [118] for multiview action recognition. Another method based on 2D motion templates, motion history images, and histograms of
oriented gradients was proposed in [119]. A hybrid model which combines convolution
neural networks (CNN) with hidden Markov model (HMM) was used for action classifi-
cation [209]. In this method, the CNN was used to learn the effective and robust features
directly from the raw data, and HMM was used to learn the statistical dependencies over time. In summary, many methods have been proposed for multiview action recognition, but very few are actually suitable for real-time applications due to their high computational cost. Therefore, the cost of the feature extraction and action classification has to
be reduced as much as possible. The proposed system has been designed around these
parameters. It uses the human silhouette as input for feature extraction and formulating
the region-based novel feature descriptor. Human silhouette can easily be extracted using
foreground detection techniques. At the moment our focus is not on foreground segmen-
tation; rather, we are focused on later phases of the action recognition such as feature
extraction and classification. Although the human silhouette is a representation that con-
tains much less information than the original image, we show that it contains sufficient
information to recognize the human actions with high accuracy. This work proposes a
novel feature descriptor for multiview human-action recognition based on human silhou-
ette. When a silhouette is partitioned into radial bins, then region descriptors are formed
which are represented as a triangle or quadrangle. Accordingly, it makes the pose de-
scription process simple and accurate. The descriptive detail of the proposed system is presented in the following subsections.
This work considers the centroid of a silhouette as the point of origin and divides the human silhouette into R radial bins with an equal angular interval. Here the value of R is selected to be 12, which constitutes 12 radial bins with an angular interval of 30° each. This value has been selected through experimentation to cover the maximum range of viewing angles; the overall procedure is summarized in Algorithm 3.1 (Feature Extraction and Classification for HAR from Multiple Views). This is unlike [215], where radial histograms were used as the feature descriptor, and [115], where the contour points of each radial bin were considered as features and model fusion was used to achieve multiview action recognition by obtaining key poses for each action through K-means clustering. The proposed method computes effi-
cient and discriminative geometrical and Hu-moment features from each radial bin of the
silhouette itself. Then, these features from all bins are concatenated into a single feature vector for classification.
This approach has three major advantages. Firstly, it divides the silhouettes into radial
bins, which cover almost all viewing angles, thus provides an easy way to compute the
features for different views of an action. Secondly, it uses the selected discriminative fea-
tures and avoids extra computation such as fusion or clustering as used in [115]. Thirdly,
due to employing selected features from each bin, it does not require any dimensionality
reduction technique.
3.3.1 Pre-Processing
Usually, some lines or small size regions are formed around the silhouette due to seg-
mentation errors, loose clothing of the subject under consideration, and other noise. These
unnecessary regions do not offer any important information for action recognition, but
rather create problems for the feature extraction algorithm and increase the complexity.
By removing these small regions, complexity can be reduced and the feature-extraction
process can be made more accurate. In our case, a region was considered to be small and
unnecessary if its area is less than 1/10 of the silhouette area. This threshold was set based on experimentation. An example of the silhouette before and after noise removal is shown in the first row of Figure 3.1.
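A minimal version of this clean-up step can be written with scikit-image, keeping only connected components whose area is at least one tenth of the total silhouette area; the 1/10 threshold follows the text, while the toy mask below is purely illustrative.

# Remove small connected components from a binary silhouette mask.
import numpy as np
from skimage.measure import label, regionprops

def clean_silhouette(mask: np.ndarray) -> np.ndarray:
    """mask: 2D boolean array (True = foreground)."""
    labeled = label(mask)
    total_area = mask.sum()
    cleaned = np.zeros_like(mask, dtype=bool)
    for region in regionprops(labeled):
        # Keep only regions at least 1/10 of the silhouette area.
        if region.area >= total_area / 10.0:
            cleaned[labeled == region.label] = True
    return cleaned

# Example: a large blob plus a small noise speck.
mask = np.zeros((100, 100), dtype=bool)
mask[20:80, 30:70] = True   # main silhouette region
mask[5:8, 5:8] = True       # small noise region
print(mask.sum(), clean_silhouette(mask).sum())  # the speck is removed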
The success of any recognition system mainly depends on proper feature selection and
extraction mechanism. For action recognition from different views, a set of discriminative features is required. The proposed descriptor is based on two types of features: (1) region-based geometric features and (2) Hu-moments features extracted from each radial bin of the silhouette. As the first step of the feature extraction process, the centroid of the silhouette, consisting of n foreground pixels (x_i, y_i), is computed as
\[ C_m = (x_c, y_c) \tag{3.1} \]
where
\[ x_c = \frac{\sum_{i=1}^{n} x_i}{n} \quad \text{and} \quad y_c = \frac{\sum_{i=1}^{n} y_i}{n} \tag{3.2} \]
The shape of a region can be described by two types of features. One is the region-based
and the other is boundary-based features. The region-based features are less affected by
noise and occlusion than boundary-based features. Therefore, region-based features are
better choice to describe the shape of a region [111]. Generally, a shape is described by
a set of vectors known as descriptors. A good descriptor is one which has the ability to
reconstruct the shape from the feature points. The resultant shape should be an approxi-
mation of the original shape and should yield similar feature values. The proposed region-based features are computed through the following steps:
1. First of all, the centroid of a human silhouette is calculated using Equation 3.2.
2. After computing the centroid by Equation 3.2, a radius of the silhouette is com-
puted.
3. The silhouette is divided into 12 radial bins with respect to its centroid using a
mask; this division has been made with intervals of 30°. These bins are shown in
Figure 3.4.
4. The following region-based features are computed for each bin of the silhouette:
• Critical points: Features are derived from the critical points (corner points) of each shape and their distances. There can be different numbers of critical points for each shape; therefore, the mean and standard deviation of these distances are used as features.
• Area: A simple and natural property of a region is its area. In the proposed descriptor, the area of each bin is computed, which provides 12 features for each human silhouette.
• Eccentricity: The ratio of the lengths of the major and minor axes of an object is known as its eccentricity; in practice it is computed as the ratio of the distance between the foci of the fitted ellipse and the length of its major axis. Its value is between 0 and 1, depending upon the shape of the region.
• Perimeter: The perimeter of a region is the length of its boundary, obtained by summing the distances between adjacent pairs of boundary pixels. We have computed the perimeter of each bin, which forms a triangle or quadrangle.
• Orientation: The orientation is the angle between the x-axis and the major axis of the ellipse fitted to the region. Its value can be between −90° and +90°.
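The steps above can be sketched in Python using scikit-image region properties as a stand-in for the geometric features (area, eccentricity, perimeter, and orientation of each radial bin); the critical-point statistics are omitted for brevity, and the exact feature set of the proposed descriptor may differ in detail.

# Radial-bin partition of a silhouette and per-bin geometric features.
import numpy as np
from skimage.measure import label, regionprops

def radial_bin_features(mask: np.ndarray, n_bins: int = 12) -> np.ndarray:
    ys, xs = np.nonzero(mask)
    yc, xc = ys.mean(), xs.mean()                     # step 1: centroid (Eq. 3.2)
    rows, cols = np.indices(mask.shape)
    angles = np.arctan2(rows - yc, cols - xc)         # angle of every pixel w.r.t. the centroid
    bin_idx = ((angles + np.pi) / (2 * np.pi) * n_bins).astype(int) % n_bins  # step 3: 30° bins
    features = []
    for b in range(n_bins):
        bin_mask = mask & (bin_idx == b)
        if not bin_mask.any():
            features.extend([0.0, 0.0, 0.0, 0.0])
            continue
        # step 4: geometric features of the largest component in the bin
        props = max(regionprops(label(bin_mask)), key=lambda p: p.area)
        features.extend([props.area, props.eccentricity, props.perimeter, props.orientation])
    return np.asarray(features)

mask = np.zeros((120, 80), dtype=bool)
mask[20:100, 25:55] = True                            # toy "silhouette"
print(radial_bin_features(mask).shape)                # (48,) = 12 bins x 4 features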
The use of invariant moments for binary shape representation was proposed in [216]. The moments which are invariant with respect to rotation, scale, and translation are known as Hu-moment invariants. The proposed method computes seven such moments for each radial bin of the silhouette. The moment of order (p, q) of an image f(x, y) of size M×N is
defined as:
\[ m_{p,q} = \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x, y)\, x^{p} y^{q} \tag{3.3} \]
Here p is the order of x and q is the order of y. The central moments are calculated in the same way, except that the values of x and y are displaced by the mean values as follows:
\[ \mu_{p,q} = \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x, y)\, (x - x_{avg})^{p} (y - y_{avg})^{q} \tag{3.4} \]
where
\[ x_{avg} = \frac{m_{10}}{m_{00}} \quad \text{and} \quad y_{avg} = \frac{m_{01}}{m_{00}} \tag{3.5} \]
The normalized central moments are then given by
\[ \eta_{p,q} = \frac{\mu_{p,q}}{\mu_{00}^{\gamma}}, \qquad \gamma = \frac{p+q+2}{2}, \qquad p + q = 2, 3, \ldots \tag{3.6} \]
Based on these normalized central moments, [216] introduced seven Hu-moment invariants as combinations of the \(\eta_{p,q}\). For example, the second and sixth invariants are
\[ h_2 = (\eta_{20} - \eta_{02})^{2} + 4\eta_{11}^{2} \tag{3.8} \]
\[ h_6 = (\eta_{20} - \eta_{02})\left[(\eta_{30} + \eta_{12})^{2} - (\eta_{21} + \eta_{03})^{2}\right] + 4\eta_{11}(\eta_{30} + \eta_{12})(\eta_{21} + \eta_{03}) \tag{3.12} \]
The remaining invariants h1, h3, h4, h5, and h7 (Equations 3.7 and 3.9-3.13) are defined analogously from the normalized central moments.
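Because Equations 3.3-3.6 match the standard moment definitions, the per-bin Hu invariants can be computed with scikit-image as in the following sketch; the toy bin mask is illustrative only.

# Central moments, normalized moments, and the seven Hu invariants for one bin.
import numpy as np
from skimage.measure import moments_central, moments_normalized, moments_hu

def hu_features(bin_mask: np.ndarray) -> np.ndarray:
    """bin_mask: 2D binary image of one radial bin; returns h1..h7."""
    img = bin_mask.astype(float)
    mu = moments_central(img)        # central moments mu_{p,q} about the centroid (Eqs. 3.4-3.5)
    nu = moments_normalized(mu)      # normalized central moments eta_{p,q} (Eq. 3.6)
    return moments_hu(nu)            # the seven Hu invariants

bin_mask = np.zeros((60, 60), dtype=bool)
bin_mask[10:50, 20:40] = True
print(hu_features(bin_mask))         # rotation-, scale-, and translation-invariant values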
After computing the features from video, an appropriate classifier has to be used for
classification of the actions. In supervised learning, discriminative models are more ef-
fective than generative models [218]. The Support Vector Machine (SVM) is one of the most popular discriminative classifiers, and it has shown better performance than some older classifiers such as backpropagation neural networks, Naïve Bayes,
and k-nearest neighbors (KNNs) in many classification problems [62]. The SVM was
first proposed in [160]; originally it was developed for binary classification, but it was later
extended to the multiclass classification problem. There are two main approaches for mul-
ticlass SVM. The first approach considers all classes of data directly into one optimiza-
tion formulation, while the second approach constructs and combines binary classifiers to
build a multiclass classifier. The second approach is computationally less expensive and
easy to implement. Many algorithms have been derived for multiclass classification us-
ing this approach, such as one-against-all [219], one-against-one [220], Directed Acyclic
Chapter 3. Human Action Recognition From Multiple Views 80
port Vector Machine (ECOC-SVM) [222], and Support Vector Machines with Binary Tree
[220] are two commonly used methods for multiclass classification. The one-against-all
needs N SVM binary classifiers for an N class classification problem, while one-against-
one method needs (NxN-1)/2 binary classifiers for an N number of classes, each trained
The proposed method used the multiclass SVM classifier implementation of [54] for multi-
view action recognition, which uses the one-against-one method with radial basis function
(RBF) kernel. Moreover, to estimate the best parameters for the classifier, a grid search was conducted to find the best values of the parameters γ and C. Here, γ represents the width of the RBF kernel and C represents the weight of the error penalty. The appropriate (C, γ) pair was selected as the one giving the highest cross-validation accuracy.
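A hedged sketch of this parameter search, using scikit-learn's RBF-kernel SVC (which internally uses the one-against-one scheme) and a logarithmic grid over (C, γ), is given below; the grid values and toy data are illustrative assumptions.

# Grid search over (C, gamma) for a one-against-one RBF SVM.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 48))            # e.g., 48-d radial-bin descriptors
y = rng.integers(0, 11, size=300)         # 11 action classes

param_grid = {
    "C": [2 ** k for k in range(-3, 8, 2)],
    "gamma": [2 ** k for k in range(-9, 2, 2)],
}
search = GridSearchCV(
    SVC(kernel="rbf", decision_function_shape="ovo"),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, search.best_score_)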
3.4 Experimentations
For the evaluation of the proposed method, comprehensive experiments have been carried out on the multiview IXMAS dataset. A leave-one-sequence-out (LOSO) scheme has been used for view-invariance evaluation. In this scheme, the
classifier is trained on all sequences except one, which is used for testing. This process
is repeated for all possible combinations and results are averaged. This is a common
strategy used by different researchers such as [118, 211] for evaluation of their methods.
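The LOSO protocol can be expressed compactly with scikit-learn's LeaveOneGroupOut, using the sequence index as the group label, as in the following illustrative sketch with synthetic data.

# Leave-one-sequence-out evaluation: hold out one sequence (group) per fold.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 48))          # per-clip descriptors
y = rng.integers(0, 11, size=200)       # action labels
groups = rng.integers(0, 20, size=200)  # sequence identifier of each sample

scores = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
    clf = SVC(kernel="rbf", C=8, gamma=0.01).fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))
print("mean LOSO accuracy:", np.mean(scores))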
The IXMAS is a challenging and well-known dataset with multiple actors and camera
views. This dataset is popular among human-action recognition methods for testing multiview action recognition. It includes 13 daily life action classes with 5 different cameras, in-
cluding one top-view camera and four side cameras, as shown in Figure 3.5. Each action is performed three times by each actor, and the actors freely change their orientation in each sequence during action execution. The change in orientation is indicated by action
labels, and no additional information is provided other than these labels. Most of the ex-
isting methods use selected action classes and actors for experimentation [118, 211]. For
comparison, this work uses 11 action classes performed by 12 actors as shown in Figure
3.6. The name and the label index of these actions are shown in Table 3.1.
Figure 3.5: Five camera views of the check watch action from the IXMAS dataset.
The recognition results of the proposed method are presented in a confusion matrix, shown in Figure 3.7. The results confirm that the proposed
method outperforms the similar state-of-the-art 2D based methods such as [63, 114, 118,
206, 211, 213, 226–229], as recorded in Table 3.2. It is important to mention here that the number of classes, actors, and views used in the experiments varies among these meth-
ods. For example, in [63], 89.4% accuracy has been reported but they excluded camera 4
from experimentation. Likewise, [229] also excluded the top camera and considered only
the remaining 4 cameras. Moreover, most of the published methods are not appropri-
ate for real-time application due to their high computational cost. The proposed method
considers all views of the IXMAS dataset including the top view for recognition. The
results indicate that the proposed method is superior to the similar 2D methods, not only in terms of recognition accuracy but also in terms of speed.
The resolution of the IXMAS dataset is only 390 × 291, which is a very low resolution. For the implementation, MATLAB R2015b on an Intel Core i7-4770 CPU with 8 cores @ 3.4 GHz, 8 GB RAM, and the Windows 10 operating system was used. However, only 4 cores of the CPU were utilized during the execution of the code. The average testing speed of the proposed method is 2.88 ms per frame, which
corresponds to almost 34 frames per second (FPS); this is much better than the existing comparable methods.
3.5 Conclusions
In this chapter, a multiview human-action recognition method based on a novel region-
based feature descriptor is presented. The multiview HAR methods can be divided into 2D and 3D approaches; the 3D approach usually provides higher accuracy, but its higher computational cost makes it less applicable for real-time applications. The proposed method uses the hu-
man silhouette as input to the features extraction process. Although the human silhouette
contains much less information than the original image, the proposed method practically
confirms that it is sufficient for action recognition with high accuracy. Moreover, to get the
view-invariant features, the silhouette is divided into radial bins with a fixed angular interval, and region-based geometric and Hu-moments features are computed for each bin. For action recognition, a multiclass support vector
machine with RBF kernel was employed. The proposed method was evaluated on the
well-known IXMAS dataset. This dataset is a challenging benchmark available for eval-
uation of multiview action recognition methods. The results indicate that the proposed
method outperforms the state-of-the-art 2D methods both in terms of speed and recogni-
tion accuracy. The testing speed of the proposed method is 34 frames per second, which
makes it very much suitable for real-time applications. As far as more complex datasets
such as HMDB-51, YouTube, and Hollywood-II are concerned, the proposed method was
not tested on these datasets. However, the authors believe that the proposed method should produce good accuracy on these datasets as well, because once the silhouette is ex-
tracted the scene complexities should not matter. However, perfect silhouette extraction
in complex scenarios is still a challenging task which may further affect the feature ex-
traction process as well. As future work, the authors would like to extend this work to cross-view action recognition, where the query view is different from the learned views, using more complex datasets.
Chapter 4
Human Action Recognition Using Transfer Learning
IN THIS CHAPTER
An innovative method for human action recognition using transfer learning is pre-
sented. Transfer learning is defined as using a pre-trained model on specific dataset
for solving a new task, even in a different domain. The introduction of the proposed
method is presented in Section 4.1, and related work is presented in Section 4.2.
The proposed methodology and experimental results are explained in Sections 4.3 and 4.4, respectively. Finally, the chapter is concluded in Section 4.5.
4.1 Introduction
In recent years, learning-based representation, and in particular deep learning, has
introduced the concept of end-to-end learning by using the trainable feature extractor
followed by a trainable classifier [19, 146]. These approaches have revealed remark-
able progress for action recognition in videos. Deep learning models, such as the model introduced in
[162] for reducing the dimensionality of data, CNNs [230], and Deep Belief Networks (DBNs) [161], have been widely used for image classification, object recognition, and
action recognition.
However, training a new deep learning model from scratch requires huge amount of
data, high computational resources, and hours, in some cases days, of training. In real-world applications, collecting and annotating domain-specific video data is time consuming and expensive. Hence, collecting a sufficient amount of data may not
be a viable option in many cases [231, 232], which makes it quite challenging to apply
deep learning models for such problems. For combating this challenge, researchers revis-
ited their strategies for visual categorization to make them in-line with the working of the
human vision system. Humans have the capability to learn thousands of categories in their
lives from just few samples. It is believed that humans achieve this capability by accu-
mulating knowledge over a time period and transferring it for learning new objects [233].
Researchers are convinced that the knowledge of previously learned objects assists in the learning of
new objects through their similarity and connection with the new objects. Based on this
idea, some studies suggest that the deep learning models trained for a classification task,
can be employed for a new classification task [234–236]. Thus, the CNN models trained
on a specific dataset or task can be fine-tuned for a new task even in a different domain
[185, 237, 238]. This concept is known as transfer learning or domain adaptation.
Transfer learning has been studied as a machine learning technique for a long time,
for solving different visual categorization problems. In recent years, due to an explosion
of information such as images, audio, and video over the internet, demands for high
accuracy, and computational efficiency have increased. Due to these reasons, transfer
learning has attracted a lot of interest in the areas of machine learning and computer
vision. When traditional machine learning techniques reach their limits, transfer learning unlocks new avenues for visual categorization. It has fundamentally changed the way machines learn and treat classification tasks [239].
Transfer learning mainly employs two approaches: 1) preserving the original pre-
trained network and updating the weights based on the new training dataset, 2) using
the pre-trained network for feature extraction and representation, followed by a generic
classifier such as SVM for classification [170]. The second approach has been success-
fully applied for many recognition and classification tasks [234, 240]. The proposed
technique for human action recognition also falls under the second category. In this work,
the recently proposed benchmark for deep models such as AlexNet [168], and GoogleNet
[241] are investigated. Based on experimentation, the AlexNet has been selected as source
model for building a target model. The source model has been used for feature extraction, followed by a hybrid SVM-KNN classifier for action classification.

4.2 Related Work

There are two major approaches for human action recognition: methods using handcrafted representations and methods using deep learning. Action recognition using
handcrafted feature descriptors, such as extended SURF[242], HOG-3D [243], and some
other shape and motion based features descriptors [20, 73, 78, 244, 245], have achieved
good performance for human action recognition. However, these approaches have several limitations, such as the reliance on manually designed feature detectors, descriptors, and vocabulary building methods for feature extraction and representation. On the other hand, deep learning-based approaches learn features automatically from the raw data.
This approach has been used in several domains such as image classification, speech
recognition, and object recognition, just to name few [246]. These models have also been
explored for human activity recognition. Some prominent contributions, like 3D ConvNets [179], Deep ConvNets [172], and Two-stream ConvNets [2], have achieved good results.
The on-line deep learning method such as [18], is also getting more attention for action
recognition. In [248], a human action recognition method was proposed using unsuper-
vised on-line deep learning technique. This method achieved an accuracy of 89.86%. In general, handcrafted features are well suited to capturing motion but have less discriminative power. Conversely, deep network architectures are inefficient in
capturing the salient motion. For addressing this issue, [189] combined deep convolu-
tional networks with trajectory based features for recognizing human actions. However,
deep learning-based methods also have some limitations. These models require huge
dataset for training, and collecting huge amount of domain-specific data is time con-
suming and expensive. Therefore, training the deep learning model from scratch is not
feasible for domain-specific problems. This problem can be solved using a pre-trained
network as a source architecture for training the target model with a small dataset, which is known as transfer learning.
Fortunately, the winner models of ImageNet Large Scale Visual Recognition Challenge
(ILSVRC) such as AlexNet [168], GoogleNet [241], and ResNet [249] are publicly avail-
able as pre-trained networks. These networks can be used for transfer learning. One of the
important ways to employ the existing models for a new task is to use pre-trained models
as a feature extraction machine and combine this deep representation with off-the-shelf
classifiers for action recognition [234]. Some researchers have also used cross-domain knowledge transfer, where the transfer was performed between the KTH, TRECVID [251], and Microsoft Research action datasets. The TRECVID and Microsoft Research action datasets were used as the source
domain, while KTH was used as the target domain. In addition to this, some researchers have explored other transfer learning strategies for action recognition as well.
4.3 Methodology
In machine learning, utilizing the previously learned knowledge for solving a new task
is known as transfer learning or knowledge transfer [252]. The transfer learning using
deep CNNs is very helpful for training the model with a limited size dataset, because
CNNs are prone to over-fitting with small datasets. Over-fitting can be avoided by in-
creasing the size of the training data, however, it is very difficult and expensive to provide
the large amount of annotated data. In this situation, transfer learning comes in handy
and solves this problem by using the pre-trained deep representation as a source archi-
tecture for building the new architecture [253]. In this work, the AlexNet [168] has been
employed as a source architecture for solving the human action recognition problem. The
AlexNet was trained on the ImageNet dataset; it takes as input 224 × 224 pixel RGB images and categorizes each image into its respective class. This architecture consists of five con-
volutional layers (C1-C5) and three fully connected layers (Fc6-Fc8), as shown in the first row of Figure 4.1. Fine-tuning all of the network parameters on the small training dataset of the new task is problematic and time consuming.
Therefore, a source architecture has been used as a feature extractor followed by an off-
the-shelf hybrid SVM-KNN classifier for action recognition. The value of K in the nearest
neighbor algorithm is selected through cross validation. The proposed work is innovative
and presents an interesting combination of deep learning and a hybrid classifier, which results in boosting the performance of the human action recognition method. An overview of the proposed system is given in Figure 4.1, where the first row indicates the source architecture and the second row shows the target architecture. The experimental results confirm the efficiency of the proposed work. Moreover, it is also demonstrated that the hybrid classifier has an advantage over a single classifier; the hybrid classification approach is therefore adopted for the proposed technique. The proposed technique is tested on two well-known action
datasets, i.e., KTH [15] and UCF Sports [17]. The description of these datasets and the corresponding results are presented below.
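The feature-extraction part of the proposed pipeline can be sketched as follows: a pre-trained AlexNet (here taken from torchvision) is used only up to the fc7 layer, and an SVM and a KNN are then trained on the extracted features. The thesis does not spell out the exact SVM-KNN combination rule at this point, so the simple probability averaging shown below is only one possible interpretation, and the toy frames and class counts are illustrative.

# Pre-trained AlexNet as a fixed feature extractor, with an SVM + KNN combination.
import torch
import torch.nn as nn
import numpy as np
from torchvision import models
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

alexnet = models.alexnet(weights="DEFAULT").eval()
fc7 = nn.Sequential(*list(alexnet.classifier.children())[:-1])  # drop the final 1000-way layer

@torch.no_grad()
def extract_features(frames: torch.Tensor) -> np.ndarray:
    """frames: (N, 3, 224, 224) pre-processed RGB frames -> (N, 4096) fc7 features."""
    x = alexnet.avgpool(alexnet.features(frames)).flatten(1)
    return fc7(x).numpy()

# Toy example with random frames standing in for video key-frames.
X_train = extract_features(torch.randn(40, 3, 224, 224))
y_train = np.random.randint(0, 6, size=40)               # e.g., 6 KTH actions
X_test = extract_features(torch.randn(5, 3, 224, 224))

svm = SVC(kernel="rbf", probability=True).fit(X_train, y_train)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)  # K chosen by cross-validation in the thesis
hybrid_scores = (svm.predict_proba(X_test) + knn.predict_proba(X_test)) / 2
print(svm.classes_[hybrid_scores.argmax(axis=1)])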
The KTH [15] is a well-known public dataset comprised of 6 actions, including walking,
running, jogging, hand waving, boxing, and hand clapping. There were 25 actors involved
in performing these actions in four different scenarios: outdoors, outdoors with scale variation, outdoors with different clothes, and indoors. The
sample frames for each action from four different scenarios are shown in Figure 4.3. This
is a single-view dataset with a uniform background, recorded with a fixed camera at a frame rate of 25 fps.
During experimentation, the dataset is divided into two parts; one part is used for train-
ing, while the other is used for testing the correctness of the proposed method, following [248]. The proposed method achieves 98.15% accuracy on the KTH dataset, which is higher
than the similar methods such as [73, 114, 164, 173, 248, 254, 255], as shown in Table
4.1. The confusion matrix indicating the accuracy of each action, and correspondence be-
tween the target classes along x-axis and output classes along y-axis, is shown in Figure
4.4.
The UCF dataset [17] encompasses 10 sports actions collected from videos broadcasted
on television channels such as ESPN and BBC. These actions include: golf swing, div-
ing, lifting, kicking, running, riding horse, swing-bench, skateboarding, swing-side, and
Figure 4.3: Sample frames for each action from the four scenarios in the KTH dataset.
walking. These actions were recorded in real sport environments, exhibiting variations in viewpoint, illumination, and background, which makes it a challenging dataset. The sample frames for each action are shown in Figure 4.5.
Figure 4.5: Sample frames for each action from the UCF Sports dataset.
The proposed study uses a popular Leave-One-Out (LOO) cross validation scheme.
Some other methods have also used Leave-One-Sequence-Out (LOSO), and Leave-One-
Person-Out (LOPO) cross validation, which are quite similar to LOO validation [112].
In LOO cross validation, all video sequences are used for training except one, which is
used for testing the performance of the classifier. This method is repeated for all available
video sequences. Finally, the results of these sequences are summed up and the average
result is considered as the final result. This validation scheme has been employed by
many similar research methods, such as [100, 256], for assessing the performance of their
methods. Since, the proposed method uses the same validation scheme, it provides a fair
comparison with similar methods. The proposed transfer learning method achieved an
accuracy of 91.47% on UCF sports dataset, which is higher than other similar methods
as shown in Table 4.2. The detailed confusion matrix indicating the accuracy of each
action, and the correspondence between the target classes along the x-axis and output classes along the y-axis, is shown in Figure 4.6.
4.5 Conclusion
This chapter presented a human action recognition method based on transfer learning
using a pre-trained deep CNN architecture and a hybrid SVM-KNN classifier. The source
architecture is used as a feature extractor machine for the new task and hybrid SVM-KNN
classifier is trained on the target datasets. It was experimentally confirmed that with the
help of transfer learning we can successfully utilize the already learned knowledge for
learning a new task with a limited training dataset. Transfer learning is very useful when
the dataset is not sufficient for training the deep learning model from scratch. Moreover,
training a deep learning model from scratch requires time and computational resources,
which can be saved using transfer learning. In addition to this, it was confirmed that a
hybrid classifier has an advantage over the single classifier in boosting the accuracy of
the recognition system. Moreover, unlike handcrafted representation based methods, the
proposed approach is simpler and directly works with RGB images, thus, eliminating the
need of pre-processing and manual feature extraction. The effectiveness of the proposed
method was confirmed on two well-known KTH, and UCF sports action datasets, and
achieved 98.15%, and 91.47% accuracy, respectively. The comparative analysis confirms
that the proposed method outperforms similar state-of-the-art methods for human action
recognition using transfer learning. In future, the authors would like to extend this method to more challenging action recognition datasets.

Chapter 5
Human Action Recognition Using Deep Belief Networks
IN THIS CHAPTER
An innovative method for human action recognition using unsupervised Deep Be-
lief Networks (DBNs) is presented. The introduction of the proposed method is given in Section 5.1, while the background of the DBN model and related work are presented in Section 5.2. The proposed methodology and experimental results are elaborated in Sections 5.3 and 5.4, respectively. Finally, the chapter is
concluded in Section 5.5.
5.1 Introduction
Over the past few years, vision based human activity recognition (HAR) has caught
the attention of the research community due to its important applications in real world
scenarios. A lot of methods have been proposed for human action recognition using a
deep learning approach. Nevertheless, most of these methods are based on supervised
learning and little attention has been paid towards unsupervised deep learning models.
However, as a matter of fact, unsupervised learning is arguably far more important than its supervised counterpart [38], since we discover the world by observing it rather than being told the name of every object.
Video content on the internet is increasing rapidly. Every minute, 100 hours of video are uploaded onto YouTube alone, and according to some reports, by 2017, 74% of all internet traffic will consist of video content. Most of this video content is not annotated; hence we cannot make sense of it with supervised learning unless it is annotated. However, the manual annotation process is time consuming and expensive. This rapid growth of video content demands efficient and effective
ways of understanding the video content. On the other hand, search engines use the text
meta-data such as title, description, and tags provided by the users for searching and pro-
cessing videos on the internet. These meta-data are incomplete, sparse, and in some cases
inconsistent with the content of the videos. As a consequence, automatic video under-
standing techniques such as classification, retrieval, and analysis are inevitable. The great majority of these internet videos contain realistic scenes and activities. These videos differ significantly from the videos which
contain simple activities recorded under controlled settings. Most of the existing tech-
niques for HAR target these simple activities such as in Weizmann [10] and KTH [15]
datasets. However, datasets such as UCF sports [260], UCF11 [89], and UCF50 [201] are
collected from different TV channels and YouTube videos and represent realistic and un-
constrained activities. A good number of research articles have been published on HAR
using handcrafted and deep learning based techniques, but recognizing actions from unconstrained videos is still a challenging task.
There are two major approaches to address this problem, these include the traditional
handcrafted feature-based, and the deep learning-based approach. The handcrafted ap-
proach suffers from several limitations such as, its inability to automatically learn fea-
tures from the data, and shallow representation of data in classifiers. On the other hand,
a deep learning-based approach employs the concept of end-to-end learning by using the
trainable feature extractor followed by a trainable classifier. Deep learning has emerged as a family of techniques that learn multiple layers of features from the raw data. Deep learning algorithms develop a multi-layer
representation of different patterns in the data, where each successive layer is responsi-
ble for learning increasingly complex features. Higher level features are learned from
the input of the lower layers; thus the representation attains an increased level of abstraction at each con-
secutive layer. This automatic learning ability of deep learning models eliminates the
need for handcrafted feature detectors and descriptors. In addition to this, these models can also exploit large amounts of unlabeled data.
Deep learning models such as Convolutional Neural Networks (CNNs) [230], Deep Belief Networks (DBNs) [261], deep recurrent neural networks [262, 263], and deep Boltzmann machines [264] have been successfully employed for many visual categorization tasks. The DBN is an unsupervised generative model capable of learning from the input data without any prior knowledge. This model is particularly useful when we have few samples of labeled data or we are dealing with unlabeled data. As
discussed earlier, most of the available videos are in unlabeled form, and annotating them manually is a time consuming and expensive process.
In this chapter, an unsupervised deep learning based method for human action recog-
nition in unconstrained videos is proposed. This method employs Deep Belief Networks
with restricted Boltzmann machine to construct a deep neural network model. This
method takes raw image data as input and automatically learns the feature representation
suitable for the recognition task.
Historically, the importance of deep neural network models was overshadowed by the success of shallow networks such
as Support Vector Machines (SVMs) [265]. In recent years, the achievement of these
models was confirmed with the introduction of an efficient Deep Belief Network (DBN)
model in [161]. This is a multi-layer generative model where each higher layer represents
more abstract information. In the first phase, the model is pre-trained using multiple
layers of Restricted Boltzmann Machines (RBMs), then in the second phase, parameters
are fine-tuned in a supervised manner using backpropagation [266, 267]. With efficient
and successful training of DBNs, deep learning models have become popular and have been applied
to many applications such as object recognition [268, 269], image classification [270],
face image analysis [271], visual task categorization [272], and speech recognition [263].
The Restricted Boltzmann Machine (RBM) is closely related to the Deep Boltzmann
Machine (DBM) [273]. This model is used by the DBN for learning the network weights in a
layer-wise fashion. The more general Boltzmann machine models consist of a visible or input
layer followed by many hidden layers and impose no constraints for connecting any type of
nodes. However, to make the training process simpler and easier, certain restrictions are
imposed on the RBM, such as having no links between nodes of the same type, making it a
bi-partite graph. Hence, this model consists of a visible
layer followed by hidden layers, where symmetric connections exist between the layers
but not within the layers, as shown in Figure 5.1. As a consequence, the RBM model is much
quicker to train than the original DBM due to the conditional independence of the hidden
layers. Therefore, RBM is preferred over the DBM model for training the deep belief
networks.
The RBM and its variants have been applied to many applications such as document
retrieval [274], visual categorization [162, 275–277], motion data modeling [278], human
activity recognition [279, 280], speech and audio data analysis [281, 282]. In visual
categorization tasks, the visible or input layer corresponds to the input (e.g., image pixels),
and hidden layers correspond to the feature detectors, analogous to collections of neurons in the
visual cortex. The RBM either uses binary visible and hidden nodes for binary input
or Gaussian visible nodes for continuous valued data. The subsequent sections discuss
how the RBM model is trained with different types of visible nodes.
FIGURE 5.1: Restricted Boltzmann Machine model with visible layer (V) and hidden
layer (H). The connections exist between the layers but not within the layers. The visible
layer represents the input given to the model and hidden layer is used to learn the features
from the input data.
In the RBM model, the probability assigned to the input layer and hidden layers con-
figuration (v, h) can be calculated using the energy function [283]. In the case of binary
visible and hidden nodes, the energy function is defined as:

E(v, h) = -a^{T} v - b^{T} h - v^{T} W h \qquad (5.1)
where v ∈ {0, 1}^n and h ∈ {0, 1}^m represent the states of the input and hidden nodes,
respectively. W ∈ R^{n×m} is a weight matrix that indicates the symmetric connections
between the input and hidden nodes, while a ∈ R^n and b ∈ R^m indicate the biases for the
input and hidden nodes, respectively. The probability of the joint configuration (v, h) can be
calculated as:
p(v, h) = \frac{e^{-E(v,h)}}{\sum_{\eta,\mu} e^{-E(\eta,\mu)}} \qquad (5.2)
and the probability that the RBM model assigns to a visible configuration v can be obtained
by marginalizing over h, as:

p(v) = \frac{\sum_{h} e^{-E(v,h)}}{\sum_{\eta,\mu} e^{-E(\eta,\mu)}} \qquad (5.3)
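To make the notation concrete, the following is a minimal NumPy sketch (not the thesis's MATLAB implementation) of the binary energy function in Equation 5.1; computing the probabilities in Equations 5.2 and 5.3 exactly would additionally require the partition function, i.e. a sum over all 2^n · 2^m configurations, which quickly becomes intractable for realistic model sizes.

```python
import numpy as np

def rbm_energy(v, h, W, a, b):
    """Energy of a joint configuration (v, h) for a binary RBM (Equation 5.1)."""
    return -(a @ v) - (b @ h) - (v @ W @ h)

# Tiny example with random parameters: n visible and m hidden binary units.
rng = np.random.default_rng(0)
n, m = 6, 4
W = 0.01 * rng.standard_normal((n, m))   # symmetric weights between the two layers
a, b = np.zeros(n), np.zeros(m)          # visible and hidden biases
v = rng.integers(0, 2, size=n).astype(float)
h = rng.integers(0, 2, size=m).astype(float)
print(rbm_energy(v, h, W, a, b))
```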
In unsupervised RBM learning, the log probability of the training data is increased,
which reduces the energy assigned to the training data. During the training phase, this is
achieved by adjusting the weight and bias values. In this direction, the log probability of the
training data can be maximized using a stochastic steepest ascent rule as:

\Delta w_{ij} = \varepsilon \left( \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{model} \right) \qquad (5.4)
where ε is a learning rate and ⟨·⟩_Φ denotes an expectation under the distribution Φ. For an
RBM with binary visible and hidden nodes and a randomly chosen training example v, the
probability of hidden node h_j being 1 is:

p(h_j = 1 \mid v) = \frac{1}{1 + \exp(-b_j - \sum_i v_i w_{ij})} \qquad (5.5)

This is a logistic sigmoid function applied to the total input of the hidden node. An unbiased
sample of ⟨v_i h_j⟩_data is then given by v_i h_j. Similarly, the probability of the visible node
v_i being 1, given a hidden vector h, is:

p(v_i = 1 \mid h) = \frac{1}{1 + \exp(-a_i - \sum_j h_j w_{ij})} \qquad (5.6)
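The two conditional distributions can be expressed in a few lines. The sketch below is a minimal NumPy illustration of Equations 5.5 and 5.6, reusing the parameters W, a, b and the example vector v from the previous snippet; it is not the implementation used in the thesis.

```python
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, b):
    # Equation 5.5: p(h_j = 1 | v) = sigmoid(b_j + sum_i v_i w_ij)
    return sigmoid(b + v @ W)

def p_v_given_h(h, W, a):
    # Equation 5.6: p(v_i = 1 | h) = sigmoid(a_i + sum_j h_j w_ij)
    return sigmoid(a + h @ W.T)

# Hidden activation probabilities for the example training vector v defined above.
print(p_h_given_v(v, W, b))
```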
However, calculating an unbiased sample of ⟨v_i h_j⟩_model would require a long sequence
of alternating Gibbs sampling between the visible and hidden layers; therefore, the faster
Contrastive Divergence (CD) algorithm is mostly used as the learning procedure for the RBM.
Although CD only approximates the gradient of the log probability of the training data, in
practice CD has proven to be quite helpful in producing a good model. Stacked RBMs
trained using CD are powerful tools for pre-training the DBN as a generative model. This
model has been successfully applied to visual categorization and other related tasks. A single
CD step proceeds as follows:
1. The visible states are initialized with a training example.
2. Hidden states are sampled according to Equation 5.5.
3. Visible states are sampled according to Equation 5.6. This is known as the
reconstruction step.
However, in order to improve the performance of the model, the sampling may be
repeated multiple times in each step. This is known as the general form of the CD algorithm,
CDn, where n denotes the number of sampling iterations.
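As an illustration of how these steps combine into the weight update of Equation 5.4, below is a hedged NumPy sketch of a single CD1 update for one training example. It reuses the helper functions defined above and is a simplified sketch (no momentum or weight decay), not the MATLAB routine used in the experiments.

```python
def cd1_update(v0, W, a, b, lr=0.001):
    """One Contrastive Divergence (CD1) update for a binary RBM."""
    # Steps 1-2: positive phase - hidden probabilities and sampled states for the data.
    ph0 = p_h_given_v(v0, W, b)
    h0 = (np.random.rand(*ph0.shape) < ph0).astype(float)
    # Step 3: reconstruction - visible probabilities given the sampled hidden states.
    pv1 = p_v_given_h(h0, W, a)
    # Negative phase: hidden probabilities for the reconstruction.
    ph1 = p_h_given_v(pv1, W, b)
    # Equation 5.4 (approximated by CD1): dW = lr * (<v h>_data - <v h>_model)
    W = W + lr * (np.outer(v0, ph0) - np.outer(pv1, ph1))
    a = a + lr * (v0 - pv1)
    b = b + lr * (ph0 - ph1)
    return W, a, b

W, a, b = cd1_update(v, W, a, b)   # one update using the example vector v from above
```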
The RBM model with Gaussian visible nodes is more difficult to train than with binary
visible nodes. However, with continuous valued data, Gaussian visible nodes provide a better
representation of the input. In this case, the energy function becomes:

E(v, h) = \sum_i \frac{(v_i - a_i)^2}{2\sigma_i^2} - \sum_j b_j h_j - \sum_{i,j} \frac{v_i}{\sigma_i} h_j w_{ij} \qquad (5.8)

and the probability of hidden node h_j being 1 is given by:

p(h_j = 1 \mid v) = \frac{1}{1 + \exp(-b_j - \sum_i \frac{v_i}{\sigma_i} w_{ij})} \qquad (5.9)

During the reconstruction step, the value of each Gaussian visible node is sampled from a
Gaussian distribution whose mean is determined by the top-down input from the hidden layer.
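A rough NumPy sketch of the Gaussian-visible case follows, assuming the standard Gaussian-Bernoulli formulation with a shared σ for all visible units (a common simplification); the hidden activation implements Equation 5.9 and the reconstruction draws each visible node from a Gaussian centred on its top-down input. The helper names are illustrative only.

```python
def gaussian_p_h_given_v(v, W, b, sigma=1.0):
    # Equation 5.9 with a shared standard deviation sigma for all visible units.
    return sigmoid(b + (v / sigma) @ W)

def gaussian_reconstruct_v(h, W, a, sigma=1.0):
    # Assumed standard form: v_i ~ N(a_i + sigma * sum_j h_j w_ij, sigma^2)
    mean = a + sigma * (h @ W.T)
    return mean + sigma * np.random.standard_normal(mean.shape)
```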
Deep Belief Networks (DBNs) are graphical models with multi-layer architecture and
can work as a generative or unsupervised model. The multiple layers of a DBN enable
automatic feature extraction in a non-linear fashion even from complex data distributions.
This model can be trained effectively using the greedy layer-wise algorithm proposed in [161]. In the first phase, the model is pre-
trained in a greedy layer-wise fashion using unsupervised RBMs for learning network
weights and other parameters. In this layer-wise training, once a hidden layer is trained,
the weights for this layer become fixed, and the activations on this hidden layer provide in-
put to the next RBM and so on. In the second phase, the model can be fine-tuned in a
generative or discriminative manner to solve a specific task. The fine-tuning process depends on whether the model is used generatively or discriminatively, as described below.
An example diagram of DBN with binary visible nodes is shown in Figure 5.2.
The DBN generative model consists of multiple consecutive layers, where lower layers
have directed connections, and top two layers have undirected connections and form an
associative memory. Moreover, adding a further hidden layer to a generative DBN model
may improve the performance of the model if certain pre-conditions are met [161]. The
fine-tuning of a generative model can be conducted using the wake-sleep algorithm [286].
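A minimal sketch of the greedy layer-wise pre-training just described is given below; training each RBM reuses the cd1_update helper from the earlier snippet, and the function is an illustration of the stacking idea under simplified assumptions (one example per update, no momentum), not the exact procedure of [161].

```python
def pretrain_dbn(data, hidden_sizes, epochs=10, lr=0.001):
    """Greedy layer-wise pre-training: each RBM is trained on the activations
    of the previous, already-trained (and then frozen) layer."""
    rbms = []
    layer_input = data                          # shape: (num_examples, num_visible)
    n_visible = data.shape[1]
    for n_hidden in hidden_sizes:
        W = 0.01 * np.random.standard_normal((n_visible, n_hidden))
        a, b = np.zeros(n_visible), np.zeros(n_hidden)
        for _ in range(epochs):
            for v0 in layer_input:              # one CD1 update per example
                W, a, b = cd1_update(v0, W, a, b, lr)
        rbms.append((W, a, b))
        # Freeze this layer; its activations become the input to the next RBM.
        layer_input = sigmoid(b + layer_input @ W)
        n_visible = n_hidden
    return rbms
```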
FIGURE 5.2: Diagram of DBN with visible layer V and four hidden layers (H1-H4).
The blue arrows indicate the generative model, while red arrows indicate the direction of
recognition.
After the pre-training phase, a generative model can be trained in a completely unsuper-
vised manner, or labels can be used for training the model. The model proposed in [285]
learns to generate new data along with appropriate class labels. In this connection, labels
are introduced at the penultimate layer of the DBN, and the top RBM layer is used for learning
using these labels, as shown in Figure 5.3. This trained model can be used for a classifi-
cation task: in this case, given an input, the bottom-up activations are calculated. The
bottom-up activations on the penultimate layer are then kept fixed, Gibbs sampling is con-
ducted between the top two layers, and the probability of each class label is calculated.
FIGURE 5.3: The DBN generative model with four hidden layers (H1-H4), used for
generating images with their class labels.
First, the DBN model is trained in a layer-wise fashion as a generative model, then an
output node is added with class labels, and the model is fine-tuned using backpropagation
to solve the classification task, typically by minimizing the cross-entropy loss
L = -\sum_i p_i \log \hat{p}_i, where p_i indicates the training label of each category i and
\hat{p}_i denotes the probability of the training example belonging to category i as predicted
by the model. An example diagram of the discriminative DBN is shown in Figure 5.4. The
discriminative DBN model provides better performance than the generative models on
classification tasks [285].
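To illustrate the discriminative set-up, the sketch below adds a softmax output layer on top of the pre-trained stack and evaluates the cross-entropy loss described above; backpropagating its gradient through all layers is omitted for brevity, and the layer shapes are assumptions rather than the thesis's exact configuration.

```python
def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def dbn_forward(v, rbms, W_out, b_out):
    """Bottom-up pass through the pre-trained layers plus an added output layer."""
    h = v
    for W, _, b in rbms:
        h = sigmoid(b + h @ W)
    return softmax(b_out + h @ W_out)           # predicted class probabilities p_hat

def cross_entropy(p, p_hat):
    # Supervised fine-tuning objective: L = -sum_i p_i * log(p_hat_i),
    # where p is the one-hot training label and p_hat the model prediction.
    return -np.sum(p * np.log(p_hat + 1e-12))
```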
Structural learning can improve the performance and learning capability of the model
by automatically identifying the set of best parameters during the training process. These
parameters describe the architecture of the model such as the number of layers, number
of nodes in each layer, and learning rate. In this regard, a number of approaches have
been proposed in the literature for structure learning of deep belief networks and other
deep learning models. However, due to scalability issues, the use of structural learning
for complex real world problems is limited. The most promising methods of structure
learning are based on Bayesian non-parametric networks. These methods enforce a prior
FIGURE 5.4: The discriminative DBN model for classification with a visible layer v, 3
hidden layers (H1-H3), and an output layer.
probability distribution over the structural components of the model, and use a probabilis-
tic inference method to infer the structure and parameters of the model. A popular method
for structure learning using Bayesian non-parametric methods was proposed in [287]. In
the case of deep belief networks, heuristic methods [288] and evolutionary methods [289]
have been proposed in the literature for structure learning. For a specific class of neu-
ral networks, a method proposed in [290] can adapt the network's structure in an online
manner.
As discussed earlier, the existing HAR techniques are based on handcrafted and deep learning based approaches. However, deep learning based
methods are gaining popularity due to their capability of automatic feature extraction and
representation. Recently, there has been increased interest in deep architectures such as
DBNs [161]. The goal of these models is to learn more abstract representations from the
input data in a layer-wise fashion using an unsupervised pre-training phase. After the
pre-training phase, class labels can be introduced at the penultimate layer of the model, as
discussed in section 5.2.2.1. The proposed framework for human action recognition is based
on a DBN with multiple layers of RBMs for recognizing human actions in unconstrained
videos.
The major focus of the proposed method is to learn the effective feature representation
suitable for action recognition. In this direction, we used multiple layers of RBMs for
unsupervised pre-training of the network. This pre-training method is quite helpful in
initializing the weights of the DBN, which increases
the learning capability of the model and avoids over-fitting. This pre-training step gener-
ates many layers of feature detectors and descriptors suitable for feature representation.
After the pre-training phase, we used backpropagation fine tuning which slightly modified
the feature detectors and descriptors according to the class labels and made them suitable
for discrimination tasks. The major advantage of the pre-training phase is that the label
information is not used for learning the feature representation from scratch, as is the case in
supervised learning.
However, it has been observed that training and fine-tuning of the DBN, with multiple
RBM layers, is a time consuming task. Therefore, we resized the images extracted from
the video frames to 100x100 pixels, which made 10,000 dimensions, as the actual input to
the model at the visible or input layer. In addition to this we used 3 hidden layers and an
output layer, which corresponds to the 10 output classes of the model as shown in Figure
5.5. The parameter configuration of the model is shown in Table 5.1. We performed ex-
periments using a MATLAB R2017a implementation on a graphics workstation with an Intel
Xeon Processor E5-2630 v3 (8C, 2.4GHz, Turbo, HT, 20M, 85W), 64GB RAM, and an NVIDIA
GPU. In this section, the experimental results of the proposed method are discussed.
TABLE 5.1: Parameter configuration of the proposed DBN model
S# Parameter Value
1 No. of total layers 5
2 No. of hidden layers 3
3 Unsupervised learning rate 0.001
4 Supervised learning rate 0.1
5 No. of unsupervised epochs 100
6 No. of supervised epochs 300
7 Initial momentum 0.5
8 Final momentum 0.9
9 Batch size 100
10 Training accuracy 100%
11 Testing accuracy 88.86%
FIGURE 5.5: The proposed DBN model having visible layer v with 100x100 dimension
input, 3 hidden layers (H1-H3), and an output layer with 10 nodes representing 10 action
classes.
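Purely for illustration, the hyper-parameters listed in Table 5.1 and the architecture of Figure 5.5 could be collected as follows; the hidden layer widths below are placeholders (they are not specified in the table), and the thesis experiments were run in MATLAB rather than with these Python helpers.

```python
# Hypothetical configuration mirroring Table 5.1 and Figure 5.5.
dbn_config = {
    "input_dim": 100 * 100,              # frames resized to 100x100 pixels (10,000 inputs)
    "hidden_layers": [1024, 512, 256],   # 3 hidden layers; widths are assumed placeholders
    "n_classes": 10,                     # 10 UCF Sports action classes at the output layer
    "unsupervised_lr": 0.001,            # pre-training learning rate
    "supervised_lr": 0.1,                # fine-tuning learning rate
    "unsupervised_epochs": 100,
    "supervised_epochs": 300,
    "momentum": (0.5, 0.9),              # initial and final momentum
    "batch_size": 100,
}
# rbms = pretrain_dbn(frames, dbn_config["hidden_layers"],
#                     epochs=dbn_config["unsupervised_epochs"],
#                     lr=dbn_config["unsupervised_lr"])
```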
The performance of the proposed method has been assessed over the UCF Sports dataset [16].
This dataset is comprised of 10 sports actions collected from a variety of sports broadcast on
television channels such as BBC and ESPN. Actions include: diving, golf swing, kicking,
lifting, riding horse, running, skateboarding, swing-bench, swing-side, and walking. The
videos exhibit variations in viewpoint, scale, and background, as well as camera motion
and occlusions, which make it a challenging dataset. The sample frames for each action
are shown in Figure 5.6.
FIGURE 5.6: Sample frames for each action from the UCF sports dataset
The proposed method uses a popular Leave-One-Out (LOO) cross validation scheme,
as recommended in pioneering work on this dataset [260]. In this scheme, one action
sequence is used for testing, while the rest of the action sequences are used for training the
classifier [21]. This process is repeated for all action sequences and resultant accuracies
are averaged. This validation scheme has been adopted in [68, 100, 245, 257, 291] for the
evaluation of their proposed techniques. This enables a fair comparison of our model with
these state-of-the-art methods. The proposed unsupervised deep belief networks based
method achieved an average accuracy of 88.86% on this dataset, which is higher than the
accuracy reported by comparable state-of-the-art unsupervised deep learning based methods.
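A minimal sketch of the Leave-One-Out protocol described above is given below: each action sequence is held out once for testing while the classifier is trained on the remaining sequences, and the per-fold accuracies are averaged. The train and predict callables are placeholders standing in for the DBN training and inference steps.

```python
def leave_one_out_accuracy(sequences, labels, train, predict):
    """Leave-One-Out cross validation over action sequences."""
    correct = 0
    for i in range(len(sequences)):
        train_x = [s for j, s in enumerate(sequences) if j != i]   # all but sequence i
        train_y = [l for j, l in enumerate(labels) if j != i]
        model = train(train_x, train_y)
        correct += int(predict(model, sequences[i]) == labels[i])  # test on held-out sequence
    return correct / len(sequences)                                # averaged accuracy
```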
5.5 Conclusion
This chapter presented an unsupervised deep belief networks based method for recognizing
human actions in unconstrained videos. The proposed method automatically
extracts suitable feature representation without any prior knowledge using an unsuper-
vised deep learning model. In this work, the contribution is twofold. Firstly, the results
indicate that the proposed method achieved higher accuracy than state-of-the-art unsu-
pervised deep learning based methods. This demonstrates the potential of less explored
unsupervised DBNs architecture for modelling complex human activities. Secondly, al-
though the model has been designed by keeping human activities in mind, and its effec-
tiveness has been demonstrated for action recognition, the model can also be adapted for
other visual recognition tasks as well. As future work, this
method can be extended for unstructured human activities such as crowd behavior anal-
ysis using the unsupervised DBN model, since the major portion of video content available
on the internet is unlabeled.
Conclusion and Future Research Directions
IN THIS CHAPTER
The conclusive summary, research objectives, and extensions related to the pro-
posed techniques are presented. In this regard, research objectives and their real-
ization is presented in Section 6.1. The possible extensions and enhancements of
the proposed techniques in the short-term and long-term perspectives are presented
in Section 6.2.
The first two objectives of the thesis were concerned with reviewing, evaluating, and
comparing the state-of-the-art handcrafted and deep learning approaches, in order to decide
which one works the best.
These two objectives have been realized by conducting a comprehensive review of the
state-of-the-art techniques. In this regard, most popular and prominent techniques were
reviewed, compared, evaluated, and analyzed on different well known public datasets.
The key findings and analysis regarding handcrafted and deep learning based techniques
are summarized below.
Handcrafted-based techniques
In this category, Space-Time-based techniques have been popular and have achieved
promising results on simple and complex datasets. Among these techniques, trajectory
and space-time feature based methods have shown overall superior performance, and
are also reliable under noise and different illumination conditions. However, these tech-
niques also have some limitations. Firstly, correct localization of subject and movement
tracking is still a challenging task in these methods. Secondly, these methods are com-
putationally expensive, and may not be suitable for real-time applications. Therefore,
by considering the nature of the problem at hand, one may resort to other techniques.
Deep learning-based techniques
Deep learning has emerged as a highly popular direction within machine learning,
which has outperformed the traditional approaches in many applications of computer vi-
sion. The highly advantageous property of deep learning algorithms is their ability to
learn features from the raw data, which eliminates the need of handcrafted feature detec-
tors and descriptors. However, this approach also has some limitations. Deep learning
models require a huge amount of data for training the algorithm. However, most of the
well-known public HAR datasets such as KTH [15], IXMAS [141], HMDB-51 [72], and
UCF Sports [16, 17] are small in size, and therefore not suitable for training deep learning
models from scratch.
Based on the above findings, major gaps were identified, and novel methods for human
action recognition have been developed based on handcrafted and deep learning-based
approaches.
Viewpoint variation is considered as one of the major challenges for the recognition of
human actions in different views.
This objective has been realized by developing a novel method for human action recog-
nition from multiple views presented in chapter 3, and briefly discussed here. This method
employs region-based features extracted from the human silhouette obtained from videos.
The human silhouette contains much less information than the original image frame.
Nevertheless, this compact representation is sufficient to recognize the action with high
accuracy, as this method achieved high accuracy compared to state-
of-the-art handcrafted based methods, including trajectory based methods. In this work,
two goals are achieved. Firstly, the proposed method achieves higher recognition accu-
racy than state-of-the-art methods on well-known multiview IXMAS (INRIA Xmas Mo-
tion Acquisition Sequences) dataset. Secondly, the average testing speed of the proposed
method is 34 frames per second, which is much higher than existing methods, and shows
its suitability for real-time applications. The comparison with state-of-the-art methods in
terms of accuracy is shown in Table 6.1, and in terms of speed of execution is shown in
Table 6.2.
This objective has been realized by developing an innovative method for HAR based on
transfer learning presented in chapter 4, and briefly discussed here. As mentioned earlier,
deep learning models cannot be trained effectively with a small dataset. To overcome
this limitation, and benefit from the best properties of deep learning models, a transfer
learning approach was adopted. This method uses a pre-trained Convolutional Neural
Networks (CNNs) model as a source architecture for extracting the features from the
target dataset, followed by a hybrid Support Vector Machines and K-Nearest Neighbor
(SVM-KNN) classifier for action recognition. The proposed method was evaluated on
two well-known action datasets, i.e., UCF sports and KTH. The comparative analysis
confirmed the superior performance of the proposed method, as shown in Tables 6.3 and 6.4.
TABLE 6.4: Comparison with state-of-the-art methods on the UCF sports action dataset
This objective has been realized by developing an innovative method for human ac-
tion recognition based on the unsupervised deep learning method presented in chapter 5,
and briefly discussed here. Deep learning based methods have gained popularity due to
their high recognition results in different computer vision tasks. However, this success is
mainly limited to the domains where a huge amount of annotated datasets are available.
In other words, these success stories are mainly associated with supervised deep learning
models. Due to these reasons, little attention has been paid towards unsupervised deep
learning models. However, as a matter of fact, unsupervised learning is far more important than
its supervised counterpart [38], since we discover the world by observing it rather than being
told the name of every object. Human and animal learning is mostly unsupervised. Fur-
thermore, the huge amount of video content available on the Internet is unlabeled. Hence,
we cannot recognize these contents with supervised learning techniques. The ultimate goal,
therefore, is to understand such content without relying on manual annotation.
Keeping in mind the importance of unsupervised deep learning, this thesis proposed an
unsupervised deep belief networks based method for recognition of human activities in
unconstrained videos. The proposed method automatically extracts suitable feature repre-
sentation without any prior knowledge using an unsupervised deep learning model. In this
work, our contribution is twofold. Firstly, the proposed method achieved higher accuracy
than similar state-of-the-art methods, which confirmed the potential of less explored un-
supervised DBNs architecture for modelling complex human activities as shown in Table
6.5. Secondly, the model uses the automatic structure learning method for learning im-
portant parameters. Although the model has been designed by keeping human activities
in mind, this model can also be adapted for other visual recognition tasks as well.
O6. Comparison between the supervised and unsupervised deep learning models
on the same dataset.
This objective has been realized by evaluating the proposed supervised (transfer learn-
ing) and unsupervised deep learning-based methods on the same dataset. It has been learned
that the proposed method using transfer learning produced better results (91.47% accuracy), as com-
pared to our unsupervised deep belief network based method, which produced 88.86%
accuracy. However, in cases where labeled data are rare or unavailable, the unsupervised
deep learning based methods prevail over the supervised deep learning or transfer learning
based methods. The comparison of the proposed methods with state-of-the-art methods is
presented in the aforementioned tables.
O7. Production of better results than the existing ones in terms of accuracy and efficiency.
In fact, this was the ultimate objective of this research work. It has been realized by
proposing novel methods for human action recognition, which produced superior results
compared to state-of-the-art methods, as shown in Tables 6.1, 6.2, 6.3, 6.4, and 6.5.
This section presents the short-term plan for possible extensions of the proposed meth-
ods.
Human action recognition from multiple views is proposed in chapter 3. This method
showed high accuracy and efficiency on a challenging dataset. In the future, the authors aim
to extend this method for cross-view action recognition and evaluate it on recently developed
multiview datasets.
This innovative method for human action recognition is based on transfer learning.
In this method, the pre-trained deep CNN architecture was used as a feature extraction
machine and a hybrid SVM-KNN classifier was employed for classification. This is con-
sidered as a partial transfer learning approach. In the future, the authors would like to extend this method
for more challenging datasets, such as [188], using deep/full transfer learning.
The proposed techniques were trained and tested on the same datasets. In the future, the
authors would like to test the proposed techniques on cross datasets, and see whether the
performance generalizes to unseen datasets.
This section presents the long-term plan for improvement and adaptation of the pro-
posed methods.
This method is based on unsupervised deep belief networks for human action recogni-
tion in unconstrained videos. The proposed method automatically extracts suitable feature
representation, without any prior knowledge, using an unsupervised deep learning model.
The proposed method achieved higher accuracy than similar state-of-the-art deep learn-
ing based methods. For future work, this method can be extended for unstructured human
activities such as crowd behavior analysis.
Extension of proposed deep learning based methods for real time action recognition
Most of the existing deep learning based methods are not suitable for real time action
recognition, and they require changes to meet real time requirements. In the future, the
authors plan to make the required changes in the proposed deep learning based algorithms
to make them suitable for real-time action recognition.
Extension of proposed deep learning based methods for detecting suspicious events
in videos
Automatic detection of suspicious events is very important for video surveillance. Most
of the existing techniques for suspicious events detection are based on handcrafted ap-
pearance and motion features. However, these methods are not optimal for analysis of
complex video scenes. The authors would like to extend the proposed deep learning based
methods for the automatic detection of suspicious events in videos.
Bibliography
[1] A. Karpathy, F. Li, and J. Johnson, “CS231n: Convolutional neural networks for vi-
sual recognition,” Online Course, 2016.
[3] J. K. Aggarwal and M. S. Ryoo, “Human activity analysis: A review,” ACM Com-
puting Surveys (CSUR), vol. 43, no. 3, p. 16, 2011.
[4] L. Kratz and K. Nishino, “Tracking pedestrians using local spatio-temporal motion
patterns in extremely crowded scenes,” IEEE transactions on pattern analysis and
machine intelligence, vol. 34, no. 5, pp. 987–1002, 2012.
[5] T. Xiang and S. Gong, “Video behavior profiling for anomaly detection,” IEEE
transactions on pattern analysis and machine intelligence, vol. 30, no. 5, pp. 893–
908, 2008.
[6] M. Mubashir, L. Shao, and L. Seed, “A survey on fall detection: Principles and
approaches,” Neurocomputing, vol. 100, pp. 144–152, 2013.
[7] H.-K. Lee and J.-H. Kim, “An hmm-based threshold model approach for gesture
recognition,” IEEE Transactions on pattern analysis and machine intelligence,
vol. 21, no. 10, pp. 961–973, 1999.
[8] L. Zhang, M. Jiang, D. Farid, and M. A. Hossain, “Intelligent facial emotion recog-
nition and semantic-based topic detection for a humanoid robot,” Expert Systems
with Applications, vol. 40, no. 13, pp. 5160–5168, 2013.
[11] Y. Luo, T.-D. Wu, and J.-N. Hwang, “Object-based analysis and interpretation of
human motion in sports video sequences by dynamic bayesian networks,” Com-
puter Vision and Image Understanding, vol. 92, no. 2, pp. 196–216, 2003.
[13] S. G. Klauer, F. Guo, J. Sudweeks, and T. A. Dingus, “An analysis of driver inatten-
tion using a case-crossover approach on 100-car data: Final report,” report, 2010.
[15] C. Schuldt, I. Laptev, and B. Caputo, “Recognizing human actions: a local svm
approach,” in Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th In-
ternational Conference on, vol. 3, pp. 32–36, IEEE.
[16] K. Soomro and A. R. Zamir, Action recognition in realistic sports videos, pp. 181–
208. Springer, 2014.
[18] A. Bux, P. Angelov, and Z. Habib, Vision based human activity recognition: a
review, pp. 341–371. Springer, 2017.
[20] A. B. Sargano, P. Angelov, and Z. Habib, “Human action recognition from multiple
views based on view-invariant feature descriptor using support vector machines,”
Applied Sciences, vol. 6, no. 10, p. 309, 2016.
[25] H. Xu, J. Liu, H. Hu, and Y. Zhang, “Wearable sensor-based human activity recog-
nition method with multi-features extracted from hilbert-huang transform,” Sen-
sors, vol. 16, no. 12, p. 2048, 2016.
[28] S.-R. Ke, H. L. U. Thuc, Y.-J. Lee, J.-N. Hwang, J.-H. Yoo, and K.-H. Choi, “A re-
view on video-based human activity recognition,” Computers, vol. 2, no. 2, pp. 88–
131, 2013.
[29] M. Ramanathan, W.-Y. Yau, and E. K. Teoh, “Human action recognition with video
data: research and evaluation challenges,” IEEE Transactions on Human-Machine
Systems, vol. 44, no. 5, pp. 650–663, 2014.
[30] R. Poppe, “A survey on vision-based human action recognition,” Image and vision
computing, vol. 28, no. 6, pp. 976–990, 2010.
[34] A. Jalal, S. Kamal, and D. Kim, “A depth video sensor-based life-logging human
activity recognition system for elderly care in smart indoor environments,” Sensors,
vol. 14, no. 7, pp. 11735–11759, 2014.
[35] A. Jalal, N. Sarif, J. T. Kim, and T.-S. Kim, “Human activity recognition via rec-
ognized body parts of human depth silhouettes for residents monitoring services at
smart home,” Indoor and Built Environment, vol. 22, no. 1, pp. 271–279, 2013.
[36] J. Li and N. Allinson, “Building recognition using local oriented features,” IEEE
Transactions on Industrial Informatics, vol. 9, no. 3, pp. 1697–1704, 2013.
[37] A. Jalal, S. Kamal, and D. Kim, “Shape and motion features approach for activ-
ity tracking and recognition from kinect video camera,” in Advanced Information
Networking and Applications Workshops (WAINA), 2015 IEEE 29th International
Conference on, pp. 445–450, IEEE.
[38] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553,
pp. 436–444, 2015.
[39] R. Yuan and W. Hui, “Object identification and recognition using multiple con-
tours based moment invariants,” in 2008 International Symposium on Information
Science and Engineering, vol. 1, pp. 140–144, IEEE.
[42] A. Farooq, A. Jalal, and S. Kamal, “Dense rgb-d map-based human tracking and ac-
tivity recognition using skin joints features and self-organizing map,” KSII Trans-
actions on internet and information systems, vol. 9, no. 5, pp. 1856–1869, 2015.
[43] A. Jalal and S. Kim, “The mechanism of edge detection using the block matching
criteria for the motion estimation,” in Proceedings of the Conference on Human
Computer Interaction, Daegu, Korea, vol. 3031, pp. 484–489.
[45] S. Kamal and A. Jalal, “A hybrid feature extraction approach for human detection,
tracking and activity recognition using depth sensors,” Arabian Journal for Science
and Engineering, vol. 41, no. 3, pp. 1043–1051, 2016.
[47] S. Kamal, A. Jalal, and D. Kim, “Depth images-based human detection, tracking
and activity recognition using spatiotemporal features and modified hmm,”
[48] K. Lai, L. Bo, X. Ren, and D. Fox, “Sparse distance learning for object recognition
combining rgb and depth information,” in Robotics and Automation (ICRA), 2011
IEEE International Conference on, pp. 4007–4013, IEEE.
[49] A. Jalal, J. T. Kim, and T.-S. Kim, “Development of a life logging system via
depth imaging-based human activity recognition for smart homes,” in Proceedings
of the International Symposium on Sustainable Healthy Buildings, Seoul, Korea,
pp. 91–95.
[50] J.-Y. Chang, J.-J. Shyu, and C.-W. Cho, “Fuzzy rule inference based human activity
recognition,” in 2009 IEEE Control Applications,(CCA) Intelligent Control,(ISIC),
pp. 211–215, IEEE.
[52] R. Gade and T. B. Moeslund, “Thermal cameras and applications: A survey,” Ma-
chine vision and applications, vol. 25, no. 1, pp. 245–262, 2014.
[53] J. Han and B. Bhanu, “Human activity recognition in thermal infrared imagery,”
in Computer Vision and Pattern Recognition-Workshops, 2005. CVPR Workshops.
IEEE Computer Society Conference on, pp. 17–17, IEEE, 2005.
[54] C.-C. Chang and C.-J. Lin, “Libsvm: a library for support vector machines,” ACM
Transactions on Intelligent Systems and Technology (TIST), vol. 2, no. 3, p. 27,
2011.
[56] I. Sipiran and B. Bustos, “Harris 3d: a robust extension of the harris operator
for interest point detection on 3d meshes,” The Visual Computer, vol. 27, no. 11,
pp. 963–976, 2011.
[57] I. Laptev, “On space-time interest points,” International Journal of Computer Vi-
sion, vol. 64, no. 2-3, pp. 107–123, 2005.
[58] A. Gilbert, J. Illingworth, and R. Bowden, “Scale invariant action recognition us-
ing compound features mined from dense spatio-temporal corners,” in European
Conference on Computer Vision, pp. 222–233, Springer.
[59] A. F. Bobick and J. W. Davis, “The recognition of human movement using tem-
poral templates,” IEEE Transactions on pattern analysis and machine intelligence,
vol. 23, no. 3, pp. 257–267, 2001.
[60] Y. Hu, L. Cao, F. Lv, S. Yan, Y. Gong, and T. S. Huang, “Action detection in
complex scenes with spatial and temporal ambiguities,” in 2009 IEEE 12th Inter-
national Conference on Computer Vision, pp. 128–135, IEEE.
[61] M.-C. Roh, H.-K. Shin, and S.-W. Lee, “View-independent human action recogni-
tion with volume motion template on single stereo camera,” Pattern Recognition
Letters, vol. 31, no. 7, pp. 639–647, 2010.
[62] H. Qian, Y. Mao, W. Xiang, and Z. Wang, “Recognition of human activities using
svm multi-class classifier,” Pattern Recognition Letters, vol. 31, no. 2, pp. 100–
111, 2010.
[63] X. Wu, D. Xu, L. Duan, and J. Luo, “Action recognition using context and appear-
ance distribution features,” in Computer Vision and Pattern Recognition (CVPR),
2011 IEEE Conference on, pp. 489–496, IEEE.
[64] N. Ikizler and P. Duygulu, “Histogram of oriented rectangles: A new pose descrip-
tor for human action recognition,” Image and Vision Computing, vol. 27, no. 10,
pp. 1515–1526, 2009.
[65] X. Peng, Y. Qiao, Q. Peng, and X. Qi, “Exploring motion boundary based sampling
and spatial-temporal context descriptors for action recognition,” in BMVC.
[67] M. Chen, L. Gong, T. Wang, and Q. Feng, “Action recognition using lie algebrized
gaussians over dense local spatio-temporal features,” Multimedia Tools and Appli-
cations, vol. 74, no. 6, pp. 2127–2142, 2015.
[68] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu, “Action recognition by dense trajec-
tories,” in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Confer-
ence on, pp. 3169–3176, IEEE.
[70] S. Ma, L. Sigal, and S. Sclaroff, “Space-time tree ensemble for action recognition,”
in Proceedings of the IEEE Conference on Computer Vision and Pattern Recogni-
tion, pp. 5024–5032.
[71] C. Wang, Y. Wang, and A. L. Yuille, “An approach to pose-based action recog-
nition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 915–922.
[72] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, “Hmdb: a large video
database for human motion recognition,” in 2011 International Conference on
Computer Vision, pp. 2556–2563, IEEE.
[73] H. Wang and C. Schmid, “Action recognition with improved trajectories,” in Pro-
ceedings of the IEEE International Conference on Computer Vision, pp. 3551–
3558.
[74] Y.-G. Jiang, Q. Dai, X. Xue, W. Liu, and C.-W. Ngo, “Trajectory-based model-
ing of human actions with motion reference points,” in European Conference on
Computer Vision, pp. 425–438, Springer.
[76] L. Wang, Y. Qiao, and X. Tang, “Motionlets: Mid-level 3d parts for human mo-
tion recognition,” in Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pp. 2674–2681.
[77] X. Peng, C. Zou, Y. Qiao, and Q. Peng, “Action recognition with stacked fisher
vectors,” in European Conference on Computer Vision, pp. 581–595, Springer.
[78] M. Jain, H. Jegou, and P. Bouthemy, “Better exploiting motion for better action
recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pat-
tern Recognition, pp. 2555–2562.
[80] M. Hoai and A. Zisserman, “Improving human action recognition using score
distribution and ranking,” in Asian Conference on Computer Vision, pp. 3–20,
Springer.
[82] E. Vig, M. Dorr, and D. Cox, “Space-variant descriptor sampling for action recog-
nition based on saliency and eye movements,” in European conference on computer
vision, pp. 84–97, Springer.
[83] S. Mathe and C. Sminchisescu, Dynamic eye movement datasets and learnt
saliency models for visual action recognition, pp. 842–856. Springer, 2012.
[84] O. Kihl, D. Picard, and P.-H. Gosselin, “Local polynomial space–time descriptors
for action classification,” Machine Vision and Applications, vol. 27, no. 3, pp. 351–
361, 2016.
[85] T. Lan, Y. Zhu, A. Roshan Zamir, and S. Savarese, “Action recognition by hi-
erarchical mid-level action elements,” in Proceedings of the IEEE International
Conference on Computer Vision, pp. 4552–4560.
[86] J. Yuan, Z. Liu, and Y. Wu, “Discriminative subvolume search for efficient action
detection,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE
Conference on, pp. 2442–2449, IEEE.
[87] B. B. Amor, J. Su, and A. Srivastava, “Action recognition using rate-invariant anal-
ysis of skeletal shape trajectories,” IEEE transactions on pattern analysis and ma-
chine intelligence, vol. 38, no. 1, pp. 1–13, 2016.
[89] J. Liu, J. Luo, and M. Shah, “Recognizing realistic actions from videos in the wild,”
in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference
on, pp. 1996–2003, IEEE.
[90] A. Yilmaz and M. Shah, “Actions sketch: A novel action representation,” in 2005
IEEE Computer Society Conference on Computer Vision and Pattern Recognition
(CVPR’05), vol. 1, pp. 984–989, IEEE.
[91] Y. Sheikh, M. Sheikh, and M. Shah, “Exploring the space of a human action,” in
Tenth IEEE International Conference on Computer Vision (ICCV’05) Volume 1,
vol. 1, pp. 144–149, IEEE.
[92] J. Yang, Z. Shi, and Z. Wu, “Vision-based action recognition of construction work-
ers using dense trajectories,” Advanced Engineering Informatics, vol. 30, no. 3,
pp. 327–336, 2016.
[93] Y.-G. Jiang, Q. Dai, W. Liu, X. Xue, and C.-W. Ngo, “Human action recognition in
unconstrained videos by explicit motion modeling,” IEEE Transactions on Image
Processing, vol. 24, no. 11, pp. 3781–3795, 2015.
[94] P. Dollár, V. Rabaud, G. Cottrell, and S. Belongie, “Behavior recognition via sparse
spatio-temporal features,” in 2005 IEEE International Workshop on Visual Surveil-
lance and Performance Evaluation of Tracking and Surveillance, pp. 65–72, IEEE.
[95] T. H. Thi, J. Zhang, L. Cheng, L. Wang, and S. Satoh, “Human action recognition
and localization in video using structured learning of local space-time features,” in
Advanced Video and Signal Based Surveillance (AVSS), 2010 Seventh IEEE Inter-
national Conference on, pp. 204–211, IEEE.
[96] J. Sivic and A. Zisserman, “Video google: A text retrieval approach to object
matching in videos,” in Computer Vision, 2003. Proceedings. Ninth IEEE Inter-
national Conference on, pp. 1470–1477, IEEE.
[97] X. Peng, L. Wang, X. Wang, and Y. Qiao, “Bag of visual words and fusion methods
for action recognition: Comprehensive study and good practice,” Computer Vision
and Image Understanding, 2016.
[98] L. Liu, L. Wang, and X. Liu, “In defense of soft-assignment coding,” in 2011
International Conference on Computer Vision, pp. 2486–2493, IEEE.
[99] F. Perronnin, J. Sánchez, and T. Mensink, “Improving the fisher kernel for large-
scale image classification,” in European conference on computer vision, pp. 143–
156, Springer.
[100] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu, “Dense trajectories and motion
boundary descriptors for action recognition,” International journal of computer
vision, vol. 103, no. 1, pp. 60–79, 2013.
[102] C. Thurau and V. Hlaváč, “Pose primitive based human action recognition in videos
or still images,” in Computer Vision and Pattern Recognition, 2008. CVPR 2008.
IEEE Conference on, pp. 1–8, IEEE.
[104] A. Fathi and G. Mori, “Action recognition by learning mid-level motion features,”
in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference
on, pp. 1–8, IEEE.
[105] Z. Jiang, Z. Lin, and L. Davis, “Recognizing human actions by learning and match-
ing shape-motion prototype trees,” IEEE Transactions on Pattern Analysis and Ma-
chine Intelligence, vol. 34, no. 3, pp. 533–547, 2012.
[106] M. B. Holte, T. B. Moeslund, N. Nikolaidis, and I. Pitas, “3d human action recogni-
tion for multi-view camera systems,” in 3D Imaging, Modeling, Processing, Visual-
ization and Transmission (3DIMPVT), 2011 International Conference on, pp. 342–
349, IEEE.
[107] P. Huang, A. Hilton, and J. Starck, “Shape similarity for 3d video sequences of
people,” International Journal of Computer Vision, vol. 89, no. 2-3, pp. 362–381,
2010.
[108] D. Weinland, R. Ronfard, and E. Boyer, “Free viewpoint action recognition using
motion history volumes,” Computer Vision and Image Understanding, vol. 104,
no. 2, pp. 249–257, 2006.
[110] L. Wang and D. Suter, “Recognizing human activities from silhouettes: Motion
subspace and factorial discriminative graphical model,” in 2007 IEEE Conference
on Computer Vision and Pattern Recognition, pp. 1–8, IEEE.
[112] D. Vishwakarma and R. Kapoor, “Hybrid classifier based human activity recog-
nition using the silhouette and cells,” Expert Systems with Applications, vol. 42,
no. 20, pp. 6957–6965, 2015.
[116] S. A. Rahman, I. Song, M. K. Leung, I. Lee, and K. Lee, “Fast action recognition
using negative space features,” Expert Systems with Applications, vol. 41, no. 2,
pp. 574–587, 2014.
[118] S. Chun and C.-S. Lee, “Human action recognition using histogram of motion
intensity and direction from multiple views,” IET Computer Vision, 2016.
[120] M. Ahmad and S.-W. Lee, “Hmm-based human action recognition using multiview
image sequences,” in Pattern Recognition, 2006. ICPR 2006. 18th International
Conference on, vol. 1, pp. 263–266, IEEE.
[122] S. Pehlivan and D. A. Forsyth, “Recognizing activities in multiple views with fu-
sion of frame judgments,” Image and Vision Computing, vol. 32, no. 4, pp. 237–
249, 2014.
[123] A. Eweiwi, S. Cheema, C. Thurau, and C. Bauckhage, “Temporal key poses for hu-
man action recognition,” in Computer Vision Workshops (ICCV Workshops), 2011
IEEE International Conference on, pp. 1310–1317, IEEE.
[126] M. Pietikäinen, A. Hadid, G. Zhao, and T. Ahonen, Computer vision using local
binary patterns, vol. 40. Springer Science & Business Media, 2011.
[127] G. Zhao and M. Pietikäinen, “Dynamic texture recognition using local binary pat-
terns with an application to facial expressions,” IEEE transactions on pattern anal-
ysis and machine intelligence, vol. 29, no. 6, pp. 915–928, 2007.
[128] L. Yeffet and L. Wolf, “Local trinary patterns for human action recognition,”
in 2009 IEEE 12th International Conference on Computer Vision, pp. 492–497,
IEEE.
[134] C. H. Lim and C. S. Chan, “Fuzzy qualitative human model for viewpoint identifi-
cation,” Neural Computing and Applications, vol. 27, no. 4, pp. 845–856, 2016.
[136] B. Yousefi and C. K. Loo, “Bio-inspired human action recognition using hy-
brid max-product neuro-fuzzy classifier and quantum-behaved pso,” arXiv preprint
arXiv:1509.03789, 2015.
[140] R. Mattivi and L. Shao, “Human action recognition using lbp-top as sparse spatio-
temporal feature descriptor,” in International Conference on Computer Analysis of
Images and Patterns, pp. 740–747, Springer.
[141] D. Weinland, E. Boyer, and R. Ronfard, “Action recognition from arbitrary views
using 3d exemplars,” in 2007 IEEE 11th International Conference on Computer
Vision, pp. 1–7, IEEE.
[146] F. Zhu, L. Shao, J. Xie, and Y. Fang, “From handcrafted to learned representations
for human action recognition: A survey,” Image and Vision Computing, 2016.
[147] T. Guha and R. K. Ward, “Learning sparse representations for human action recog-
nition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34,
no. 8, pp. 1576–1588, 2012.
[148] H. Wang, C. Yuan, W. Hu, and C. Sun, “Supervised class-specific dictionary learn-
ing for sparse modeling in action recognition,” Pattern Recognition, vol. 45, no. 11,
pp. 3902–3911, 2012.
[150] J. Zheng, Z. Jiang, and R. Chellappa, “Cross-view action recognition via transfer-
able dictionary learning,” 2016.
[151] F. Zhu and L. Shao, “Weakly-supervised cross-domain dictionary learning for vi-
sual recognition,” International Journal of Computer Vision, vol. 109, no. 1-2,
pp. 42–59, 2014.
[152] F. Zhu and L. Shao, “Correspondence-free dictionary learning for cross-view ac-
tion recognition,” in ICPR, pp. 4525–4530.
[153] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong, “Locality-constrained lin-
ear coding for image classification,” in Computer Vision and Pattern Recognition
(CVPR), 2010 IEEE Conference on, pp. 3360–3367, IEEE.
[154] L. Liu, L. Shao, X. Li, and K. Lu, “Learning spatio-temporal representations for
action recognition: a genetic programming approach,” IEEE transactions on cy-
bernetics, vol. 46, no. 1, pp. 158–170, 2016.
[155] Y. Zhang, Y. Zhang, E. Swears, N. Larios, Z. Wang, and Q. Ji, “Modeling temporal
interactions with interval temporal bayesian networks for complex activity recog-
nition,” IEEE transactions on pattern analysis and machine intelligence, vol. 35,
no. 10, pp. 2468–2483, 2013.
[156] F. M. Khan, S. C. Lee, and R. Nevatia, “Conditional bayesian networks for action
detection,” in Advanced Video and Signal Based Surveillance (AVSS), 2013 10th
IEEE International Conference on, pp. 256–262, IEEE, 2013.
[158] L. Deng and D. Yu, “Deep learning,” Signal Processing, vol. 7, pp. 3–4, 2014.
[160] C. Cortes and V. Vapnik, “Support-vector networks,” Machine learning, vol. 20,
no. 3, pp. 273–297, 1995.
[161] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief
nets,” Neural computation, vol. 18, no. 7, pp. 1527–1554, 2006.
[165] P. Foggia, A. Saggese, N. Strisciuglio, and M. Vento, “Exploiting the deep learning
paradigm for recognizing human actions,” in Advanced Video and Signal Based
Surveillance (AVSS), 2014 11th IEEE International Conference on, pp. 93–98,
IEEE.
[167] L. Ballan, M. Bertini, A. Del Bimbo, L. Seidenari, and G. Serra, “Effective code-
books for human action representation and classification in unconstrained videos,”
IEEE Transactions on Multimedia, vol. 14, no. 4, pp. 1234–1245, 2012.
[173] S. Ji, W. Xu, M. Yang, and K. Yu, “3d convolutional neural networks for human ac-
tion recognition,” IEEE transactions on pattern analysis and machine intelligence,
vol. 35, no. 1, pp. 221–231, 2013.
[175] Z. Zhang and D. Tao, “Slow feature analysis for human action recognition,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 3, pp. 436–
450, 2012.
[176] L. Sun, K. Jia, T.-H. Chan, Y. Fang, G. Wang, and S. Yan, “Dl-sfa: deeply-learned
slow feature analysis for action recognition,” in Proceedings of the IEEE Confer-
ence on Computer Vision and Pattern Recognition, pp. 2625–2632.
[177] Y. Du, W. Wang, and L. Wang, “Hierarchical recurrent neural network for skeleton
based action recognition,” in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pp. 1110–1118.
[178] L. Sun, K. Jia, D.-Y. Yeung, and B. E. Shi, “Human action recognition using fac-
torized spatio-temporal convolutional networks,” in Proceedings of the IEEE In-
ternational Conference on Computer Vision, pp. 4597–4605.
[181] S. Yu, Y. Cheng, S. Su, G. Cai, and S. Li, “Stratified pooling based deep con-
volutional neural networks for human action recognition,” Multimedia Tools and
Applications, pp. 1–16, 2016.
[182] E. P. Ijjina and C. K. Mohan, “Human action recognition based on motion cap-
ture information using fuzzy convolution neural networks,” in Advances in Pattern
Recognition (ICAPR), 2015 Eighth International Conference on, pp. 1–6, IEEE,
2015.
[183] G. Chéron, I. Laptev, and C. Schmid, “P-cnn: Pose-based cnn features for action
recognition,” in Proceedings of the IEEE International Conference on Computer
Vision, pp. 3218–3226.
[185] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for
accurate object detection and semantic segmentation,” in Proceedings of the IEEE
conference on computer vision and pattern recognition, pp. 580–587.
[186] H. Rahmani and A. Mian, “3d action recognition from novel viewpoints,” in CVPR,
June.
[187] A. Alfaro, D. Mery, and A. Soto, “Action recognition in video using sparse coding
and relative features,” arXiv preprint arXiv:1605.03222, 2016.
[188] K. Soomro, A. R. Zamir, and M. Shah, “Ucf101: A dataset of 101 human actions
classes from videos in the wild,” arXiv preprint arXiv:1212.0402, 2012.
[189] L. Wang, Y. Qiao, and X. Tang, “Action recognition with trajectory-pooled deep-
convolutional descriptors,” in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pp. 4305–4314.
[190] Z. Lan, M. Lin, X. Li, A. G. Hauptmann, and B. Raj, “Beyond gaussian pyramid:
Multi-skip feature stacking for action recognition,” in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pp. 204–212.
[192] B. Mahasseni and S. Todorovic, “Regularizing long short term memory with 3d
human-skeleton sequences for action recognition,”
[194] W. Zhu, J. Hu, G. Sun, X. Cao, and Y. Qiao, “A key volume mining deep framework
for action recognition,” in Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pp. 1991–1999.
[195] C. Wang, Y. Wang, and A. L. Yuille, “Mining 3d key-pose-motifs for action recog-
nition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 2639–2647.
[196] V. Veeriah, N. Zhuang, and G.-J. Qi, “Differential recurrent neural networks for
action recognition,” in Proceedings of the IEEE International Conference on Com-
puter Vision, pp. 4041–4049.
[202] D. Rudoy and L. Zelnik-Manor, “Viewpoint selection for human actions,” Interna-
tional Journal of Computer Vision, vol. 97, no. 3, pp. 243–254, 2012.
[203] B. Saghafi, D. Rajan, and W. Li, “Efficient 2d viewpoint combination for human
action recognition,” Pattern Analysis and Applications, vol. 19, no. 2, pp. 563–577,
2016.
[204] S. Ali and M. Shah, “Human action recognition in videos using kinematic features
and multiple instance learning,” Pattern Analysis and Machine Intelligence, IEEE
Transactions on, vol. 32, no. 2, pp. 288–303, 2010.
[208] A. Iosifidis, A. Tefas, and I. Pitas, “Multi-view action recognition based on action
volumes, fuzzy distances and cluster discriminant analysis,” Signal Processing,
vol. 93, no. 6, pp. 1445–1457, 2013.
[209] J. Lei, G. Li, J. Zhang, Q. Guo, and D. Tu, “Continuous action segmentation
and recognition using hybrid convolutional neural network-hidden markov model
model,” IET Computer Vision, 2016.
[210] N. Gkalelis, N. Nikolaidis, and I. Pitas, “View indepedent human movement recog-
nition from multi-view video exploiting a circular invariant posture representation,”
in Multimedia and Expo, 2009. ICME 2009. IEEE International Conference on,
pp. 394–397, IEEE.
[211] D. Weinland, M. Özuysal, and P. Fua, Making action recognition robust to occlusions
and viewpoint changes, pp. 635–648. Springer, 2010.
[212] R. Souvenir and J. Babbs, “Learning the viewpoint manifold for action recogni-
tion,” in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Con-
ference on, pp. 1–7, IEEE.
[213] J. Liu and M. Shah, “Learning human actions via information maximization,” in
Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference
on, pp. 1–8, IEEE.
[214] W. Nie, A. Liu, W. Li, and Y. Su, “Cross-view action recognition by cross-domain
learning,” Image and Vision Computing, 2016.
[215] C.-H. Hsieh, P. S. Huang, and M.-D. Tang, “Human action recognition using sil-
houette histogram,” in Proceedings of the Thirty-Fourth Australasian Computer
Science Conference-Volume 113, pp. 11–16, Australian Computer Society, Inc.
[216] M.-K. Hu, “Visual pattern recognition by moment invariants,” information Theory,
IRE Transactions on, vol. 8, no. 2, pp. 179–187, 1962.
[217] Z. Huang and J. Leng, “Analysis of hu’s moment invariants on image scaling and
rotation,” in Computer Engineering and Technology (ICCET), 2010 2nd Interna-
tional Conference on, vol. 7, pp. V7–476, IEEE, 2010.
[219] V. N. Vapnik and V. Vapnik, Statistical learning theory, vol. 1. Wiley New York,
1998.
[220] U. H.-G. Kreßel, “Pairwise classification and support vector machines,” in Advances
in kernel methods, pp. 255–268, MIT Press.
[221] J. C. Platt, N. Cristianini, and J. Shawe-Taylor, “Large margin dags for multiclass
classification,” in NIPS, vol. 12, pp. 547–553.
[222] T. G. Dietterich and G. Bakiri, “Solving multiclass learning problems via error-
correcting output codes,” Journal of artificial intelligence research, pp. 263–286,
1995.
[223] S. Cheong, S. H. Oh, and S.-Y. Lee, “Support vector machines with binary tree
architecture for multi-class classification,” Neural Information Processing-Letters
and Reviews, vol. 2, no. 3, pp. 47–51, 2004.
[224] C.-W. Hsu and C.-J. Lin, “A comparison of methods for multiclass support vector
machines,” Neural Networks, IEEE Transactions on, vol. 13, no. 2, pp. 415–425,
2002.
[225] K. Manosha Chathuramali and R. Rodrigo, “Faster human activity recognition with
svm,” in Advances in ICT for Emerging Regions (ICTer), 2012 International Con-
ference on, pp. 197–203, IEEE.
[226] K. K. Reddy, J. Liu, and M. Shah, “Incremental action recognition using feature-
tree,” in Computer Vision, 2009 IEEE 12th International Conference on, pp. 1010–
1017, IEEE.
[227] F. Lv and R. Nevatia, “Single view human action recognition using key pose match-
ing and viterbi path searching,” in Computer Vision and Pattern Recognition, 2007.
CVPR’07. IEEE Conference on, pp. 1–8, IEEE.
[231] X. Cao, Z. Wang, P. Yan, and X. Li, “Transfer learning for pedestrian detection,”
Neurocomputing, vol. 100, pp. 51–57, 2013.
[232] D. Wu, F. Zhu, and L. Shao, “One shot learning gesture recognition from rgbd
images,” in Computer Vision and Pattern Recognition Workshops (CVPRW), 2012
IEEE Computer Society Conference on, pp. 7–12, IEEE, 2012.
[236] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, “Learning and transferring mid-level
image representations using convolutional neural networks,” in Proceedings of the
IEEE conference on computer vision and pattern recognition, pp. 1717–1724.
[237] H. Nam and B. Han, “Learning multi-domain convolutional neural networks for
visual tracking,” arXiv preprint arXiv:1510.07945, 2015.
[238] H. Noh, S. Hong, and B. Han, “Learning deconvolution network for semantic seg-
mentation,” in Proceedings of the IEEE International Conference on Computer
Vision, pp. 1520–1528.
[239] L. Shao, F. Zhu, and X. Li, “Transfer learning for visual categorization: A sur-
vey,” IEEE transactions on neural networks and learning systems, vol. 26, no. 5,
pp. 1019–1034, 2015.
[240] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “Cnn features off-
the-shelf: an astounding baseline for recognition,” in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition Workshops, pp. 806–813.
[242] G. Willems, T. Tuytelaars, and L. Van Gool, “An efficient dense and scale-invariant
spatio-temporal interest point detector,” Computer Vision–ECCV 2008, pp. 650–
663, 2008.
[244] A. Fathi, Y. Li, and J. M. Rehg, “Learning to recognize daily actions using gaze,”
in European Conference on Computer Vision, pp. 314–327, Springer, 2012.
[248] K. Charalampous and A. Gasteratos, “On-line deep learning method for action
recognition,” Pattern Analysis and Applications, vol. 19, no. 2, pp. 337–354, 2016.
[249] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recog-
nition,” in Proceedings of the IEEE conference on computer vision and pattern
recognition, pp. 770–778, 2016.
[252] Y. Aytar, Transfer learning for object category detection. Thesis, 2014.
[253] Y.-C. Su, T.-H. Chiu, C.-Y. Yeh, H.-F. Huang, and W. H. Hsu, “Transfer learn-
ing for video recognition with scarce training data for deep convolutional neural
network,” arXiv preprint arXiv:1409.4127, 2014.
[254] M. A. R. Ahad, M. N. Islam, and I. Jahan, “Action recognition based on binary pat-
terns of action-history and histogram of oriented gradient,” Journal on Multimodal
User Interfaces, vol. 10, no. 4, pp. 335–344, 2016.
[255] S. Ding and S. Qu, “An improved interest point detector for human action recogni-
tion,” in Control and Decision Conference (CCDC), 2016 Chinese, pp. 4355–4360,
IEEE, 2016.
[256] Y. Tian, Q. Ruan, G. An, and Y. Fu, “Action recognition using local consistent
group sparse coding with spatio-temporal structure,” in Proceedings of the 2016
ACM on Multimedia Conference, pp. 317–321, ACM.
[263] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with deep recur-
rent neural networks,” in Acoustics, speech and signal processing (icassp), 2013
ieee international conference on, pp. 6645–6649, IEEE, 2013.
[268] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing
human-level performance on imagenet classification,” in Proceedings of the IEEE
international conference on computer vision, pp. 1026–1034, 2015.
[269] Y. Xia, L. Zhang, W. Xu, Z. Shan, and Y. Liu, “Recognizing multi-view ob-
jects with occlusions using a deep architecture,” Information Sciences, vol. 320,
pp. 333–345, 2015.
[271] Y. Sun, X. Wang, and X. Tang, “Deep learning face representation from predicting
10,000 classes,” in Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pp. 1891–1898, 2014.
[272] D. Turcsany and A. Bargiela, “Learning local receptive fields in deep belief net-
works for visual feature detection,” in International Conference on Neural Infor-
mation Processing, pp. 462–470, Springer, 2014.
[274] P. Xie, Y. Deng, and E. Xing, “Diversifying restricted boltzmann machine for doc-
ument modeling,” in Proceedings of the 21th ACM SIGKDD International Confer-
ence on Knowledge Discovery and Data Mining, pp. 1315–1324, ACM, 2015.
[276] S. A. Eslami, N. Heess, C. K. Williams, and J. Winn, “The shape boltzmann ma-
chine: a strong model of object shape,” International Journal of Computer Vision,
vol. 107, no. 2, pp. 155–176, 2014.
[277] D. Turcsany, A. Bargiela, and T. Maul, “Local receptive field constrained deep
networks,” Information Sciences, vol. 349, pp. 229–247, 2016.
[278] S. Nie, Z. Wang, and Q. Ji, “A generative restricted boltzmann machine based
method for high-dimensional motion data modeling,” Computer Vision and Image
Understanding, vol. 136, pp. 14–22, 2015.
[279] L. Xu, Y. Li, Y. Wang, and E. Chen, “Temporally adaptive restricted boltzmann
machine for background modeling,” in AAAI, pp. 1938–1944, 2015.
[280] F. Zhao, Y. Huang, L. Wang, T. Xiang, and T. Tan, “Learning relevance restricted
boltzmann machine for unstructured group activity and event understanding,” In-
ternational Journal of Computer Vision, vol. 119, no. 3, pp. 329–345, 2016.
[281] G. Dahl, A.-r. Mohamed, G. E. Hinton, et al., “Phone recognition with the mean-
covariance restricted boltzmann machine,” in Advances in neural information pro-
cessing systems, pp. 469–477, 2010.
[283] J. J. Hopfield, “Neural networks and physical systems with emergent collective
computational abilities,” Proceedings of the national academy of sciences, vol. 79,
no. 8, pp. 2554–2558, 1982.
[290] P. P. Angelov and D. P. Filev, “Flexible models with evolving structure,” Interna-
tional Journal of Intelligent Systems, vol. 19, no. 4, pp. 327–340, 2004.
[291] Y. Tian, R. Sukthankar, and M. Shah, “Spatiotemporal deformable part models for
action detection,” in Proceedings of the IEEE conference on computer vision and
pattern recognition, pp. 2642–2649, 2013.
[292] A.-A. Liu, N. Xu, W.-Z. Nie, Y.-T. Su, Y. Wong, and M. Kankanhalli, “Benchmark-
ing a multimodal and multiview and interactive dataset for human action recogni-
tion,” IEEE Transactions on cybernetics, 2017.