UNIVERSITY OF CALIFORNIA,
IRVINE

A Joint Model for Tracking and Recognizing
Human Actions in Video Sequences

THESIS
MASTER OF SCIENCE
in Computer Science
by
Goutham Patnaikuni
Thesis Committee:
Professor Deva Ramanan, Chair
Professor Alexander Ihler
Professor Charless Fowlkes
2009
© 2009 Goutham Patnaikuni
The thesis of Goutham Patnaikuni
is approved and is acceptable in quality and form for
publication on microfilm and in digital formats:
Committee Chair
DEDICATION
I dedicate this thesis to Professor Paul Utgoff, who lost his battle to appendiceal cancer in
October 2008. I learnt of his passing only recently and was deeply saddened to hear about
it. It seems like it was just yesterday when I was in his office discussing homework
assignments for his Artificial Intelligence course. Professor, this is for you, with love and
resolution.
TABLE OF CONTENTS
Page
LIST OF FIGURES vi
ACKNOWLEDGMENTS viii
1 Introduction 1
1.1 Contributions of this thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Organization of this thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Related Work 4
2.1 Local features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Global features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 Other HMM based approaches . . . . . . . . . . . . . . . . . . . . . . . . . 6
3 Theory 8
3.1 Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.1.1 Intuitions behind Margins . . . . . . . . . . . . . . . . . . . . . . . . 9
3.1.2 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.1.3 Functional and geometrical margins . . . . . . . . . . . . . . . . . . 10
3.1.4 The optimal margin classifier . . . . . . . . . . . . . . . . . . . . . . 12
3.1.5 Optimal margin classifiers . . . . . . . . . . . . . . . . . . . . . . . . 13
3.1.6 Multiclass SVMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2 Hidden Markov Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2.1 Transition and Emission probabilities . . . . . . . . . . . . . . . . . 18
3.2.2 Maximum Likelihood for the HMM . . . . . . . . . . . . . . . . . . . 19
3.2.3 The forward-backward algorithm . . . . . . . . . . . . . . . . . . . . 23
3.2.4 Scaling Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2.5 The Viterbi Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4 Approach 37
4.1 The Support Vector Machine Approach . . . . . . . . . . . . . . . . . . . . 37
4.1.1 Optical Flow based HOG descriptors . . . . . . . . . . . . . . . . . . 37
4.1.2 A multiclass SVM framework for action recognition . . . . . . . . . 39
4.1.2.1 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.1.2.2 Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.2 The Hidden Markov Model Approach . . . . . . . . . . . . . . . . . . . . . 41
4.2.1 Building a vocabulary of visual words . . . . . . . . . . . . . . . . . 41
4.2.2 A hidden Markov model framework for video sequences . . . . . . . 43
4.2.2.1 Dimensionality reduction of optical flow features . . . . . . 43
4.2.2.2 Visual word SVM based features . . . . . . . . . . . . . . . 44
4.2.2.3 Classifying a new pre-tracked video sequence . . . . . . . . 44
4.3 A Unified model for joint tracking and recognition . . . . . . . . . . . . . . 46
4.3.1 Cross product space of location and visual words . . . . . . . . . . . 46
4.3.2 Joint model for tracking and recognition . . . . . . . . . . . . . . . . 46
4.3.2.1 Bounded velocity motion model . . . . . . . . . . . . . . . 47
4.3.2.2 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.4 Motivation for joint tracking and recognition . . . . . . . . . . . . . . . . . 49
4.4.1 The Tracking problem . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.4.2 Experiment and baseline . . . . . . . . . . . . . . . . . . . . . . . . . 50
5 Results 53
5.1 Action Classification on KTH Dataset . . . . . . . . . . . . . . . . . . . . . 53
5.2 Action Classification on UCF action dataset . . . . . . . . . . . . . . . . . . 58
Bibliography 64
Appendices 67
A Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
A.1 KTH Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
A.2 UCF Action Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
LIST OF FIGURES

LIST OF TABLES
ACKNOWLEDGMENTS
I would like to start off by thanking Professor Deva Ramanan. Working with him for the
past year has been an absolute pleasure. He has been a great source of knowledge and
support and I cannot thank him enough for it. He is, without a doubt, the best advisor I
have ever had. My thanks also go to Professors Ihler and Fowlkes, for their time and help.
This work would not have been possible without the support of my friends here in Irvine. I
would particularly like to thank my good friend Kristian Hermansen, whom I have known
since my days as an undergraduate at the University of Massachusetts, for his constant
support and for being a source of humor at times when I needed it.
I have to thank my family, including my parents, my sister and my brother-in-law for being
there for me and believing in me. I particularly appreciate all the intellectual conversations
I have had with my brother-in-law about my research. It is good to have someone with a
PhD from Stanford in your family.
ABSTRACT
A Joint Model for Tracking and Recognizing
Human Actions in Video Sequences
By
Goutham Patnaikuni
In this thesis, we propose two approaches for human activity recognition from video sequences - one based solely on Support Vector Machines and another in which the video sequence is modelled as a Markov process. The major difference between our methods and
previous work done in action recognition is that we do away with the assumption that
the persons in a video sequence are tracked prior to the recognition process, and instead
combine the tracking and recognition problems into one. We believe that this is not only
a much more reasonable approach to action recognition, but also that combining the two
Chapter 1
Introduction
Human action recognition has many applications, such as the indexing of video databases and automatic tagging of videos on sites such as Youtube. Various visual
cues, based on motion and shape, have been used for action recognition. In this thesis, the
focus will be on recognizing the actions in video sequences using motion cues. Specifically,
optical flow will be used as a feature set. We discuss two different methods for solving the
action recognition problem; one of them extends the SVM based framework introduced in
Dalal et al. [1]. In this setting, a video sequence, broken down into a sequence of frames, is essentially treated as a “bag-of-words”, i.e., the order of the frames is inconsequential. The second method models the sequence of frames in a video as a Markov process, meaning that the temporal order of the frames is taken into account.
Many interesting approaches have been proposed for solving the action recognition problem. Although some of these methods have reported impressive results, most of them suffer from the same weakness - they assume that tracking of the human figure is done
prior to recognition. In most cases, an external module is used to localize the motion of the
human figure in each frame of a video sequence. A common technique used is background
subtraction. In background subtraction, the goal is to identify moving objects from the
portion of a video frame that differs significantly from a background model. Although it
works well in cases where the background is static, background subtraction is not known to perform well when the background itself changes. Another commonly used technique is a HOG based classifier, developed by Navneet Dalal and Bill Triggs and commonly known as the
Dalal-Triggs detector. The detector has a single fixed template which determines whether
a given image pattern corresponds to a person. The Dalal-Triggs detector was originally
trained to detect pedestrians in images and is not meant to detect a wide variety of human
poses. This means that while it may accurately locate a person standing in an upright pose, it may fail to detect people in more varied configurations.
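As an illustrative sketch only (not part of the method pursued in this thesis), frame differencing, the simplest form of background subtraction, can be written as follows; the image size and threshold value are arbitrary choices:

```python
def background_subtract(frame, background, threshold=25):
    # Mark pixels whose absolute difference from the background model
    # exceeds the threshold as foreground (moving) pixels.
    return [[abs(f - b) > threshold for f, b in zip(fr, br)]
            for fr, br in zip(frame, background)]

# Toy 4x4 "images": a static black background and one bright 2x2 patch.
background = [[0] * 4 for _ in range(4)]
frame = [row[:] for row in background]
for r in (1, 2):
    for c in (1, 2):
        frame[r][c] = 200

mask = background_subtract(frame, background)
print(sum(sum(row) for row in mask))   # 4 foreground pixels
```

A moving camera or a changing background immediately violates the static background model, which is exactly the weakness discussed above.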
While the above two methods can be used for localizing motion in simple scenarios, com-
plex ones require a more intuitive approach. In such situations, the kind of action being
performed, and not the global composition of a frame, should provide the primary cue for localization and recognition of motion patterns. In the case of action recognition, the ability to localize the figure is closely tied to knowing which action is being performed.

1.1 Contributions of this thesis

The main contribution of this thesis is to address the issues described above. In this thesis,
this problem is tackled by building a family of templates, one for each motion pattern,
instead of working with a single template for localizing motion. A motion pattern could
either be related to an action class or a “word” from a visual vocabulary. We believe that
having a family of templates will enable us to detect a wide array of human poses and will
therefore lead to a better track, which in turn will increase the accuracy with which an action
can be predicted. We later describe a method in which these templates are incorporated
into a hidden Markov model based approach for solving the action recognition problem. We
introduce a novel joint model for tracking and recognizing actions in video sequences.
1.2 Organization of this thesis
The thesis is organized as follows. Chapter 2 provides a brief overview of related work in human activity recognition. Chapter 3 goes into detail about the machinery of Support Vector Machines and Hidden Markov Models, which are the basis for the methods used in this thesis for recognizing human actions. Chapter 4 discusses these methods in detail, specifically describing how the tracking and recognition problems are combined into one. Chapter 5 discusses the results of testing the methods described in Chapter 4 on two datasets. Chapter 6
concludes this thesis with a summary and a discussion of possibilities for extensions.
Chapter 2
Related Work
The literature on action recognition is quite rich. We avoid an in-depth review of all methods
and instead refer the reader to [30]. We instead focus on related work that is most applicable
to the approach we pursue. As mentioned before, different visual cues are used to detect
human actions. This section will concentrate on methods that use motion based cues. The
methods we discuss fall into two broad categories - ones that consider global level features and ones that consider local level features.

2.1 Local features

Several recent approaches have concentrated on capturing local level features in video and
using them to understand the underlying motion in video sequences. The motivation for
doing so was to overcome some of the limitations of the methods that only considered motion
on a global scale, such as the inability to deal with multiple moving objects and variations
in background. Several local features for video have been proposed recently; one of which is
Space-time interest points[4]. In images, points with significant variation in local intensities
are considered to be of interest and are called spatial interest points. Space-time interest
points are an extension of spatial interest points and are meant to correspond to interesting
events in video data. Niebles et al. [3] model a video sequence as a collection of spatial-temporal words by extracting space-time interest points from video sequences. Probability distributions over these words for each action category are automatically learnt using a probabilistic Latent Semantic Analysis (pLSA) [15] model. Using this model, a new video sequence is categorized and the motions within it localized. Other work also deals with space-time interest points and uses them for recognition of spatio-temporal events and activities. Caputo et al. [7] combine space-time features and SVMs and use the resulting classifier for recognition.
Part based models are also increasingly being used in action recognition. This trend is
partly inspired by the success of these models in object detection. In [32], a discriminatively
trained, multi-scale, deformable part model is used to build models for people and objects
such as cars, bottles, and couches. [33, 34, 35] also use part based models for both human detection and action recognition.

2.2 Global features

Several global based features have been captured and used for action recognition in the past.
One commonly used feature is optical flow, which is an approximation of motion between
temporally adjacent frames. Efros et al. [9] build motion descriptors based on optical flow and use these in a k-nearest neighbor framework to classify actions. Wang et al. [2] use the
same descriptor to build a visual vocabulary and represent a video as a bag of visual words.
They later use this representation to build a model based on latent Dirichlet allocation
(LDA) [5]. [18] also uses optical flow to model human actions as a flexible constellation of parts. Such methods attempt to capture the underlying motion similarity amongst videos of a given action class. Shechtman and Irani [19] avoid explicit flow computations by employing a
rank-based constraint directly on the intensity information of spatio-temporal cuboids to measure motion similarity.
Rodriguez et al. [11] introduce a template-based method for recognizing human actions called Action MACH, based on a Maximum Average Correlation Height (MACH) filter. They obtain a single template for an action by synthesizing a single Action MACH filter for a given action class. These Action MACH filters combine the training sequences of an action class into a single composite template. These templates are then correlated with testing sequences in the frequency domain via an FFT. Once an Action MACH filter is synthesized,
similar actions in a testing video sequence are detected by applying the action MACH filter
to the video.
In this thesis, we use optical flow as a feature set. Although it does not explicitly capture
local interest points in video, localization is offered to some degree by the way optical flow is used in our framework.

2.3 Other HMM based approaches

Using HMMs for action recognition is very common. Typically, the hidden state is an
activity to be inferred, and observations are image measurements. Yamato et al.[20] describe
recognizing tennis strokes with HMMs. Wilson and Bobick[21] describe the use of HMMs
for recognizing gestures such as pushes. Yang et al.[22] use HMMs to recognize handwriting
gestures.
In order to simplify the training process of learning the state transition matrix, there has
been a great deal of interest in models obtained by modifying a basic activity-state HMM.
Variations include a coupled HMM (CHMM) [21, 22], a layered HMM (LHMM) [23, 24,
25], a parametric HMM (PHMM) [26], an entropic HMM (EHMM) [27], and variable length HMMs.
In this thesis, we use HMMs to infer activities using optical flow based feature sets as
observations. Later on, we build a joint model for tracking and recognizing actions in
video.
Chapter 3
Theory
This chapter discusses support vector machines and hidden Markov models in detail. Both form the basis of the action recognition methods described in Chapter 4.

3.1 Support Vector Machines

Support Vector Machines (SVMs for short) are known to be among the best “off-the-shelf”
supervised learning algorithms. SVMs are used in solving problems such as text categorization, hand-written character recognition, image classification and, in this case, action recognition. Conceptually, an SVM constructs an (N − 1) dimensional hyperplane that optimally separates N dimensional data into two categories. Input data fed into an SVM can be viewed as two sets of vectors in an N dimensional space. An SVM will construct a separating hyperplane in that space, one which maximizes the margin between the two sets of vectors.
3.1.1 Intuitions behind Margins
The intuition behind margins can be best explained by considering logistic regression. In logistic regression, the probability p(y = 1|x; θ) is modelled by hθ(x) = g(θT x), and the label “1” is predicted on an input x if and only if hθ(x) ≥ 0.5, or equivalently, if θT x ≥ 0. For a positive training example (y = 1), the larger θT x is, the larger hθ(x) is, and thus the higher the degree of “confidence” that the label is 1. The prediction can be thought of as a very confident one that y = 1 if θT x ≫ 0, and similarly a very confident one that y = 0 if θT x ≪ 0. Given a training set, a good fit is found if a θ can be found such that θT x(i) ≫ 0 whenever y(i) = 1, and θT x(i) ≪ 0 whenever y(i) = 0, since this would reflect a
very confident set of classifications for all the training examples. For points that are very far
away from the separating hyperplane, a prediction can be made rather confidently. On the
other hand, for a point that is very close to the hyperplane, a confident prediction may not
be possible because even a small change in the separating hyperplane could easily change
the prediction.
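This intuition can be made concrete with a small numerical sketch; the parameter vector θ and the two inputs below are hypothetical:

```python
import math

def sigmoid(z):
    # logistic function g(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + math.exp(-z))

def score(theta, x):
    # theta^T x
    return sum(t * xi for t, xi in zip(theta, x))

theta = [2.0, -1.0]          # hypothetical parameters
x_far = [5.0, 1.0]           # theta^T x = 9: far from the boundary
x_near = [0.3, 0.5]          # theta^T x = 0.1: very close to the boundary

print(sigmoid(score(theta, x_far)))   # close to 1: confident that y = 1
print(sigmoid(score(theta, x_near)))  # close to 0.5: not confident
```

A small perturbation of θ flips the prediction for the nearby point but not for the distant one, which is exactly the margin intuition described above.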
3.1.2 Notation
Consider a linear classifier for a binary classification problem with labels y ∈ {−1, 1} and features x. The classifier is parameterized by (w, b) and written as hw,b(x) = g(wT x + b), where g(z) = 1 if z ≥ 0, and g(z) = −1 otherwise. The “w, b” notation allows the intercept term b to be treated separately from the other parameters.
3.1.3 Functional and geometrical margins
Given a training example (x(i), y(i)), the functional margin of (w, b) with respect to the training example is defined as:

γ̂(i) = y(i)(wT x(i) + b)    (3.1)

If y(i) = 1, then for the functional margin to be large (for the prediction to be confident and correct), wT x(i) + b needs to be a large positive number. Conversely, if y(i) = −1, then for the functional margin to be large, wT x(i) + b needs to be a large negative number. Moreover, if y(i)(wT x(i) + b) > 0, then the prediction on this example is correct. A large functional margin therefore represents a confident and correct prediction.
For a linear classifier with the choice of g above, if w and b were to be replaced with 2w and 2b respectively, then since g(wT x + b) = g(2wT x + 2b), this would not change hw,b(x) at all. However, replacing (w, b) with (2w, 2b) results in multiplying the functional margin by a factor of 2. This means that by scaling w and b, the functional margin can be made arbitrarily large without changing anything meaningful. The functional margin of (w, b) with respect to a training set S = {(x(i), y(i)); i = 1, . . . , m} is defined as the smallest of the functional margins of the individual training examples. Denoted by γ̂, this can be written as:

γ̂ = min_{i=1,...,m} γ̂(i)    (3.2)
In Fig 3.1, the decision boundary corresponding to (w, b) is shown, along with the vector w. It should be noted that w is orthogonal to the separating hyperplane. In the figure, the distance of point A from the decision boundary, γ(i), is given by the line segment AB.

Figure 3.1: Linear decision boundary

w/||w|| is a unit-length vector pointing in the same direction as w. Since A represents x(i), the point B is given by x(i) − γ(i) · w/||w||. Since this point lies on the decision boundary, it satisfies wT x + b = 0. Hence:

wT (x(i) − γ(i) w/||w||) + b = 0    (3.4)

Solving for γ(i):

γ(i) = (wT x(i) + b)/||w|| = (w/||w||)T x(i) + b/||w||    (3.5)
More generally, for both negative and positive examples, the geometric margin of (w, b) with respect to a training example (x(i), y(i)) is defined as:

γ(i) = y(i)((w/||w||)T x(i) + b/||w||)    (3.6)
It should be noted that if ||w|| = 1, then the functional margin is equal to the geometric margin. The geometric margin is invariant to rescaling of the parameters; i.e., if w and b are replaced by 2w and 2b, the geometric margin does not change. Because of this invariance to the scaling of the parameters, when trying to fit w and b to the training data, an arbitrary scaling constraint can be imposed on w without changing anything important. Given a training set S = {(x(i), y(i)); i = 1, . . . , m}, the geometric margin of (w, b) with respect to S can be defined as the smallest of the geometric margins on the individual training examples:

γ = min_{i=1,...,m} γ(i)    (3.7)
Given a training set, it seems natural to try and find a decision boundary that maximizes the geometric margin, since this would reflect a very confident set of predictions on the training set and a good fit to the training data. This will result in a classifier that separates the positive training examples from the negative training examples with a “gap” equal to the geometric margin.
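The definitions above can be checked numerically; the separator (w, b) and the training example below are hypothetical:

```python
import math

def functional_margin(w, b, x, y):
    # functional margin: y (w^T x + b)
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b)

def geometric_margin(w, b, x, y):
    # Equation (3.6): the functional margin divided by ||w||
    return functional_margin(w, b, x, y) / math.sqrt(sum(wi * wi for wi in w))

w, b = [3.0, 4.0], -1.0      # hypothetical separator
x, y = [1.0, 1.0], 1         # hypothetical training example

f = functional_margin(w, b, x, y)
f2 = functional_margin([2 * wi for wi in w], 2 * b, x, y)
g = geometric_margin(w, b, x, y)
g2 = geometric_margin([2 * wi for wi in w], 2 * b, x, y)
print(f, f2)   # scaling (w, b) -> (2w, 2b) doubles the functional margin
print(g, g2)   # ... but leaves the geometric margin unchanged
```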
3.1.4 The optimal margin classifier

Assuming that the training data is linearly separable, i.e. it is possible to separate the
positive and negative examples using a hyperplane, the question is how to find one that
achieves the maximum geometric margin. The following optimization problem can be posed:

max_{γ,w,b} γ
s.t. y(i)(wT x(i) + b) ≥ γ, i = 1, . . . , m
||w|| = 1

The objective is to maximize γ, subject to every training example having functional margin at least γ. The ||w|| = 1 constraint ensures that the functional margin equals the geometric margin, so it is guaranteed that all the geometric margins are at least γ. Thus, solving this problem will result in the (w, b) with the largest possible geometric margin with respect to the training set.

The ||w|| = 1 constraint makes the problem a hard one to solve because it is non-convex and cannot directly be plugged into a standard optimization algorithm; the answer is to transform the problem:

max_{γ̂,w,b} γ̂/||w||
s.t. y(i)(wT x(i) + b) ≥ γ̂, i = 1, . . . , m
Here, γ̂/||w|| is maximized, subject to the functional margins all being at least γ̂. Since the functional and geometric margins are related by γ = γ̂/||w||, this gives the desired answer, and the difficult constraint ||w|| = 1 no longer has to be dealt with. On the other hand, there is no off-the-shelf software that optimizes an objective function of the form γ̂/||w|| directly.

Using the ability to add an arbitrary scaling constant on w and b without changing anything, the scaling constraint that the functional margin of (w, b) with respect to the training set be equal to 1, i.e. γ̂ = 1, can be introduced. Since maximizing γ̂/||w|| = 1/||w|| is the same as minimizing ||w||2, this gives the following optimization problem:

min_{w,b} (1/2)||w||2
s.t. y(i)(wT x(i) + b) ≥ 1, i = 1, . . . , m

The problem can now be solved efficiently. The above is an optimization problem with a convex quadratic objective and linear constraints. Its solution gives the optimal margin classifier. This can be solved using quadratic programming and Lagrange multipliers. The constraints can be written as:
g i (w ) = -y (i) (w T x (i) + b) + 1 ≤ 0.
There is one such constraint for each training example. From the KKT conditions, the
only training examples with αi > 0 are the ones that have functional margins equal to one
(the ones corresponding to constraints that hold with equality, gi(w) = 0). In Fig 3.2, the maximum margin separating hyperplane is shown.

Figure 3.2: Maximum margin separating hyperplane and support vectors

The points with the smallest margins are exactly the ones closest to the decision boundary. In this case, only three points (one negative and two positive examples) lie on the dashed lines parallel to the decision boundary. This means that only three αi ’s will be non-zero at
the optimal solution. These three points are called the support vectors. The number of
support vectors is less than the size of the training set. Constructing the Lagrangian for the optimization problem:

L(w, b, α) = (1/2)||w||2 − Σ_{i=1}^{m} αi [y(i)(wT x(i) + b) − 1]    (3.8)
It should be noted that there are only αi and no βi Lagrange multipliers, since the problem has only inequality constraints.
To find the dual form of the problem, L(w,b,α) will have to be minimized with respect to w
and b (for fixed α), to get θD . This can be done by setting the derivatives of L with respect
to w and b to 0:
∇w L(w, b, α) = w − Σ_{i=1}^{m} αi y(i) x(i) = 0    (3.9)

This implies that:

w = Σ_{i=1}^{m} αi y(i) x(i)    (3.10)
Setting the derivative with respect to b to zero gives:

∂L(w, b, α)/∂b = Σ_{i=1}^{m} αi y(i) = 0    (3.11)
Taking the definition of w in Equation (3.10) and plugging it back into the Lagrangian in
Equation (3.8):
L(w, b, α) = Σ_{i=1}^{m} αi − (1/2) Σ_{i,j=1}^{m} y(i) y(j) αi αj (x(i))T x(j) − b Σ_{i=1}^{m} αi y(i)    (3.12)
The above equation was obtained by minimizing L with respect to w and b. Putting this together with the constraints αi ≥ 0 and the constraint (3.11) leads to the following dual problem:

max_α W(α) = Σ_{i=1}^{m} αi − (1/2) Σ_{i,j=1}^{m} y(i) y(j) αi αj ⟨x(i), x(j)⟩    (3.13)
s.t. αi ≥ 0, i = 1, . . . , m
Σ_{i=1}^{m} αi y(i) = 0
In the dual problem above, the parameters of the maximization problem are all αi ’s. If
there were an algorithm to solve the dual equation above, the optimal w ’s can be found as
a function of the α’s using Equation (3.10). Having found w∗, the optimal value of the intercept b∗ can also be found. If a prediction has to be made at a new input point x, wT x + b is calculated and y = 1 is predicted if this quantity is bigger than zero. Using Equation (3.10), this quantity can be written as:

wT x + b = (Σ_{i=1}^{m} αi y(i) x(i))T x + b    (3.14)

= Σ_{i=1}^{m} αi y(i) ⟨x(i), x⟩ + b    (3.15)
Once the αi ’s are found, a quantity that depends only on the inner product between x and
the points in the training set will have to be calculated. Many of the terms in the sum
above will be zero because the αi ’s will all be zeros except for the support vectors and only
the inner product between x and the support vectors will have to be calculated.
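As a sketch of this observation, prediction via Equation (3.15) touches only the support vectors; the training points, α values and intercept below are hypothetical, not the solution of an actual quadratic program:

```python
# Hypothetical 2-D training set with made-up alpha values: only the
# support vectors have alpha > 0; these are NOT from a real QP solver.
X = [[1.0, 1.0], [2.0, 2.0], [0.0, 0.0], [-1.0, -1.0]]
y = [1, 1, -1, -1]
alpha = [0.5, 0.0, 0.5, 0.0]
b = -1.0                      # hypothetical intercept

def inner(u, v):
    return sum(a * c for a, c in zip(u, v))

def predict(x_new):
    # Equation (3.15): only terms with alpha_i > 0 (the support
    # vectors) contribute to the sum.
    s = sum(a * yi * inner(xi, x_new)
            for a, yi, xi in zip(alpha, y, X) if a > 0)
    return 1 if s + b > 0 else -1

print(predict([3.0, 3.0]))    # 1
print(predict([-2.0, -2.0]))  # -1
```

Equation (3.10) recovers the same w explicitly, but the inner-product form is what later allows kernels to be substituted.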
3.1.6 Multiclass SVMs

The support vector machine is fundamentally a binary classifier. In practice, however, one
may have to tackle problems involving more than two classes. In this project, for example,
SVMs are being used in a multiclass scenario. Various methods have been proposed for combining multiple two-class SVMs in order to build a multiclass classifier.

One commonly used approach is to construct K separate SVMs, in which the kth SVM yk(x) is trained using the data from class Ck as the positive examples and the data from the remaining K − 1 classes as the negative examples; this is known as the one-versus-the-rest approach. This heuristic approach suffers from the problem that the different classifiers are trained on different tasks, and there is no guarantee that the real-valued quantities yk(x) for the different classifiers will have appropriate scales.
Another approach is to train K (K -1)/2 different 2-class SVMs on all possible pairs of
classes, and then to classify test points according to which class has the highest number of
’votes’; an approach that is called one-versus-one. The problem with this approach is that it requires more training time than the one-versus-the-rest approach. Similarly, to evaluate test points, significantly more computation is required.
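The one-versus-one voting scheme can be sketched as follows; the pairwise decision functions here are trivial stand-ins for trained two-class SVMs:

```python
from itertools import combinations

K = 3  # number of classes

def make_pairwise(a, b):
    # Stand-in for a trained 2-class SVM on classes a and b: the class
    # whose index is nearer to the scalar input wins (illustration only).
    return lambda x: a if abs(x - a) < abs(x - b) else b

pairwise = {pair: make_pairwise(*pair) for pair in combinations(range(K), 2)}

def one_vs_one_predict(x):
    # Each of the K(K-1)/2 classifiers casts one vote; the class with
    # the highest number of votes is predicted.
    votes = [0] * K
    for clf in pairwise.values():
        votes[clf(x)] += 1
    return max(range(K), key=lambda k: votes[k])

print(len(pairwise))            # K(K-1)/2 = 3 classifiers
print(one_vs_one_predict(0.2))  # 0
print(one_vs_one_predict(1.9))  # 2
```

The K(K−1)/2 growth in the number of classifiers is the training and evaluation cost discussed above.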
3.2 Hidden Markov Models
A Hidden Markov Model (HMM) is a statistical model in which the system being modeled is assumed to be a Markov process with unobserved (hidden) states; the challenge is to determine the hidden parameters from the observable data. The extracted model parameters can then
be used to perform further analysis, for example for pattern recognition applications. In a
regular Markov model, the state is directly visible to the observer, and therefore the state
transition probabilities are the only parameters. In a hidden Markov model, the state is not
directly visible, but variables influenced by the state are visible. Each state has a probability
distribution over the possible output tokens. Therefore the sequence of tokens generated by
an HMM gives some information about the sequence of states. The HMM is widely used
in speech recognition, natural language modelling, on-line handwriting recognition and the analysis of biological sequences.
3.2.1 Transition and Emission probabilities

As in a standard mixture model, the latent variables are discrete multinomial variables zn describing which component of the mixture is responsible for generating the corresponding observation xn. Unlike a standard mixture model, however, the probability distribution of zn is allowed to depend on the state of the previous latent variable zn−1 through a conditional distribution p(zn |zn−1). Because the latent variables are K-dimensional binary variables, this conditional distribution corresponds to a table of numbers A, the elements of which are known as transition probabilities. They are given by Ajk = p(znk = 1|zn−1,j = 1), and because they are probabilities, they satisfy 0 ≤ Ajk ≤ 1 with Σk Ajk = 1, so the matrix A has K(K − 1) independent parameters. The conditional distribution can therefore be written as:
p(zn |zn−1, A) = Π_{k=1}^{K} Π_{j=1}^{K} Ajk^{zn−1,j znk}    (3.17)
The initial latent node z1 is special in that it does not have a parent node, and so it has a marginal distribution p(z1) represented by a vector of probabilities π with elements πk ≡ p(z1k = 1), so that:
p(z1 |π) = Π_{k=1}^{K} πk^{z1k}    (3.18)

where Σk πk = 1.
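Given a transition matrix A and an initial distribution π satisfying these constraints, a state sequence can be sampled directly; the parameter values below are hypothetical:

```python
import random

# Hypothetical parameters: rows of A and the vector pi sum to one.
A = [[0.9, 0.1],
     [0.2, 0.8]]
pi = [0.5, 0.5]

def sample_state(dist, rng):
    # Draw one state index from a discrete distribution.
    r, acc = rng.random(), 0.0
    for k, p in enumerate(dist):
        acc += p
        if r < acc:
            return k
    return len(dist) - 1

def sample_chain(n, seed=0):
    # z1 ~ pi, then z_n ~ A[z_{n-1}] for n = 2..N.
    rng = random.Random(seed)
    z = [sample_state(pi, rng)]
    for _ in range(n - 1):
        z.append(sample_state(A[z[-1]], rng))
    return z

print(sample_chain(5))
```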
The specification of the probabilistic model is completed by defining the conditional distri-
butions of the observed variables p(xn |zn , φ) where φ is the set of parameters governing the
distribution. These are known as emission probabilities. These could be given by a Gaussian (or any other continuous probability distribution) if the elements of x are continuous variables. Because xn is observed, the distribution p(xn |zn, φ) consists, for a given value of φ, of a vector of K numbers corresponding to the K possible states of the binary vector zn. The emission probabilities can therefore be written as:
p(xn |zn, φ) = Π_{k=1}^{K} p(xn |φk)^{znk}    (3.19)
The joint probability distribution over both latent and observed variables is then given by:
p(X, Z|θ) = p(z1 |π) [ Π_{n=2}^{N} p(zn |zn−1, A) ] Π_{m=1}^{N} p(xm |zm, φ)    (3.20)

where X = {x1, . . . , xN }, Z = {z1, . . . , zN }, and θ = {π, A, φ} denotes the set of parameters governing the model. The model is tractable for a wide range of emission distributions.
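Equation (3.20) can be evaluated directly for a tiny discrete HMM; the parameters below are hypothetical:

```python
# Hypothetical HMM parameters theta = {pi, A, phi} with K = 2 states
# and two discrete output symbols.
pi = [0.6, 0.4]
A = [[0.7, 0.3],
     [0.4, 0.6]]
phi = [[0.9, 0.1],        # phi[k][v] = p(x = v | z = k)
       [0.2, 0.8]]

def joint_prob(x_seq, z_seq):
    # Equation (3.20): p(z1) p(x1|z1) * prod_n p(zn|zn-1) p(xn|zn)
    p = pi[z_seq[0]] * phi[z_seq[0]][x_seq[0]]
    for n in range(1, len(x_seq)):
        p *= A[z_seq[n - 1]][z_seq[n]] * phi[z_seq[n]][x_seq[n]]
    return p

print(joint_prob([0, 1, 1], [0, 1, 1]))
```

Summing this joint over all observation and latent sequences of a fixed length gives 1, confirming it is a properly normalized distribution.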
3.2.2 Maximum Likelihood for the HMM

For an observed data set X = {x1, . . . , xN }, one can determine the parameters of the HMM
using maximum likelihood. The likelihood function is obtained from the joint distribution
in Equation (3.20) by marginalizing over the latent variables:

p(X|θ) = Σ_Z p(X, Z|θ)    (3.21)
Because the joint distribution p(X, Z|θ) does not factorize over n, the summation over Z cannot be performed independently for each of the zn. Nor can it be performed explicitly, because there are N variables to be summed over, each of which has K states, resulting in a total of K^N terms. Thus the number of terms in the summation grows exponentially with the length of the chain. In fact, the summation in Equation (3.21) corresponds to summing over exponentially many paths through the lattice diagram.
Figure: (a) A state transition diagram; (b) a lattice representing the transition diagram unfolded over time.
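The exponential cost of Equation (3.21) can be seen by evaluating it by brute force for a tiny chain; the parameters are hypothetical, and N is kept small so that the K^N-term sum is feasible:

```python
from itertools import product

K, N = 2, 6
pi = [0.5, 0.5]
A = [[0.8, 0.2], [0.3, 0.7]]
phi = [[0.9, 0.1], [0.2, 0.8]]   # discrete emission probabilities

def joint(x, z):
    # p(X, Z | theta) as in Equation (3.20)
    p = pi[z[0]] * phi[z[0]][x[0]]
    for n in range(1, N):
        p *= A[z[n - 1]][z[n]] * phi[z[n]][x[n]]
    return p

x = [0, 0, 1, 1, 1, 0]
paths = list(product(range(K), repeat=N))   # all K^N latent paths
likelihood = sum(joint(x, z) for z in paths)
print(len(paths))      # already 64 terms for K = 2, N = 6
print(likelihood)
```

For a video of even a few hundred frames this enumeration is hopeless, which is what motivates the forward-backward recursions below.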
To find an efficient framework for maximizing the likelihood function in the hidden
Markov Model, one can use the expectation maximization (EM) algorithm. The EM algo-
rithm starts with some initial selection for the model parameters, denoted by θold . In the E
step, the model parameters can be used to find the posterior distribution of the latent vari-
ables p(Z |X, θold ). This posterior distribution can then be used to evaluate the expectation
Q(θ, θold) = Σ_Z p(Z|X, θold) ln p(X, Z|θ)    (3.22)
Introducing some notation, γ(zn) will denote the marginal posterior distribution of the latent variable zn, and ξ(zn−1, zn) the joint posterior distribution of two successive latent variables, so that:

γ(zn) = p(zn |X, θold)    (3.23)
ξ(zn−1, zn) = p(zn−1, zn |X, θold)    (3.24)

For each value of n, γ(zn) can be stored using a set of K non-negative numbers that sum to unity, and ξ(zn−1, zn) using a K × K matrix of non-negative numbers that again sum to unity. γ(znk) will denote the conditional probability of znk = 1, with a similar notation for ξ(zn−1,j, znk) and for other probabilistic variables introduced earlier. Because
the expectation of a binary random variable is just the probability that it takes the value
1:
γ(znk) = E[znk] = Σ_z γ(z)znk    (3.25)

ξ(zn−1,j, znk) = E[zn−1,j znk] = Σ_z γ(z)zn−1,j znk    (3.26)
Substituting the joint distribution p(X, Z|θ) in Equation (3.20) into (3.22), and making use of the definitions of γ and ξ, the following is obtained:
Q(θ, θold) = Σ_{k=1}^{K} γ(z1k) ln πk + Σ_{n=2}^{N} Σ_{j=1}^{K} Σ_{k=1}^{K} ξ(zn−1,j, znk) ln Ajk + Σ_{n=1}^{N} Σ_{k=1}^{K} γ(znk) ln p(xn |φk)    (3.27)
The goal of the E step will be to evaluate the quantities γ(zn ) and ξ(zn−1 , zn ) efficiently.
In the M step, the quantity Q(θ, θold) is maximized with respect to the parameters θ = {π, A, φ}, in which γ(zn) and ξ(zn−1, zn) are treated as constants. Maximization with respect to π and A is
easily achieved using appropriate Lagrange multipliers with the results:
πk = γ(z1k) / Σ_{j=1}^{K} γ(z1j)    (3.28)

Ajk = Σ_{n=2}^{N} ξ(zn−1,j, znk) / Σ_{l=1}^{K} Σ_{n=2}^{N} ξ(zn−1,j, znl)    (3.29)
The EM algorithm must be initialized by choosing the starting values for π and A, which
should respect the summation constraints associated with their probabilistic interpretation.
Any elements of π and A that are initially set to zero will remain zero in subsequent EM
updates. A typical initialization procedure would involve selecting random starting values for these parameters subject to the summation and non-negativity constraints.
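The M-step updates (3.28) and (3.29) amount to normalizing expected state counts; the E-step quantities below are made-up numbers, not the output of an actual E step:

```python
N, K = 3, 2
# Made-up E-step quantities: gamma[n][k] approximates p(z_nk = 1 | X),
# xi[n][j][k] the joint posterior of (z_n = j, z_{n+1} = k).
gamma = [[0.7, 0.3], [0.4, 0.6], [0.5, 0.5]]
xi = [[[0.3, 0.4], [0.1, 0.2]],
      [[0.2, 0.2], [0.3, 0.3]]]

# Equation (3.28): normalize the first-step posterior.
pi = [gamma[0][k] / sum(gamma[0]) for k in range(K)]

# Equation (3.29): expected j -> k transition counts, row-normalized.
A = [[sum(xi[n][j][k] for n in range(N - 1)) /
      sum(xi[n][j][l] for n in range(N - 1) for l in range(K))
      for k in range(K)] for j in range(K)]

print(pi)
print(A)
```

By construction π sums to one and every row of A sums to one, so the updates respect the probabilistic constraints mentioned above.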
To maximize Q(θ,θold ) with respect to φk , it should be noted that only the final term in
Equation (3.27) depends on φk . If the parameters φk are different for the different compo-
nents, then this term decouples into a sum of terms, one for each value of k, each of which
can be maximized independently. This would reduce to simply maximizing the weighted log
likelihood function for the emission density p(x|φk ) with the weights γ(z nk ). For example,
in the case of Gaussian emission densities, p(x|φk) = N (x|µk, Σk), and maximization of Q(θ, θold) then gives:

µk = Σ_{n=1}^{N} γ(znk)xn / Σ_{n=1}^{N} γ(znk)    (3.30)

Σk = Σ_{n=1}^{N} γ(znk)(xn − µk)(xn − µk)T / Σ_{n=1}^{N} γ(znk)    (3.31)
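For a one-dimensional Gaussian emission density, the updates (3.30) and (3.31) are simply weighted averages; the observations and responsibilities below are hypothetical:

```python
# Hypothetical 1-D observations and made-up responsibilities gamma(z_nk)
# for a single state k.
xs = [1.0, 2.0, 4.0]
gamma_k = [0.5, 0.25, 0.25]

w = sum(gamma_k)
mu_k = sum(g * x for g, x in zip(gamma_k, xs)) / w        # Equation (3.30)
var_k = sum(g * (x - mu_k) ** 2
            for g, x in zip(gamma_k, xs)) / w             # Equation (3.31)
print(mu_k, var_k)   # 2.0 1.5
```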
For the case of discrete multinomial observed variables, the conditional distribution of the observations takes the form:

p(x|z) = Π_{i=1}^{D} Π_{k=1}^{K} µik^{xi zk}    (3.32)
and the corresponding M step equations are given by:
µik = Σ_{n=1}^{N} γ(znk)xin / Σ_{n=1}^{N} γ(znk)    (3.33)
The EM algorithm requires the initial values for the parameters of the emission distribution.
3.2.3 The forward-backward algorithm

There needs to be an efficient procedure to evaluate the quantities γ(znk) and ξ(zn-1,j, znk),
corresponding to the E step of the EM algorithm. The graph for the Hidden Markov
Model is a tree, so this means that the posterior distribution of the latent variables can be
obtained efficiently using a two-stage message passing algorithm. In the particular context
of the hidden Markov Model, this is known as the forward-backward algorithm (Rabiner, 1989) or the Baum-Welch algorithm (Baum, 1972). There are several variants of the basic algorithm, all of which lead to the exact marginals, depending on the precise form of the messages that are propagated along the chain. The most widely used of these is called the alpha-beta algorithm.
The evaluation of the posterior distributions of the latent variables is independent of the form of the emission density p(x|z) and of whether the observed variables are discrete or continuous. All that is required are the values of the quantities p(xn|zn) for each value of zn for every n. Also, the explicit dependence on the model parameters θold shall be omitted from here on, since these are fixed throughout.
The following conditional independencies hold:
p(X|zn−1 , zn ) = p(x1 , ..., xn−1 |zn−1 )p(xn |zn )p(xn+1 , ..., xN |zn ) (3.39)
where X = {x1, ..., xN}. These relations are easily proved using d-separation. For instance, for the first of these results, every path from any one of the nodes x1, ..., xn−1 to the node xn passes through node zn, which is observed. Because all such paths are head-to-tail, it follows that the conditional independence property must hold. These relations can also be proved directly from the joint distribution of the hidden Markov model using the sum and product rules of probability.
To evaluate γ(znk), the fact that for a discrete multinomial random variable the expected value of one of its components is just the probability of that component taking the value 1 proves useful. Given this fact, the goal is to find the posterior distribution p(zn|x1, ..., xN) of zn given the observed data set x1, ..., xN. This represents a vector of length K whose entries correspond to the expected values of znk. Using Bayes' theorem,

γ(zn) = p(zn|X) = p(X|zn) p(zn) / p(X)    (3.42)
The denominator p(X) is implicitly conditioned on the parameters θold of the HMM and hence represents the likelihood function. Using the conditional independence property (3.34), together with the product rule of probability, gives

γ(zn) = α(zn) β(zn) / p(X)    (3.43)

where

α(zn) ≡ p(x1, ..., xn, zn)    (3.44)

β(zn) ≡ p(xn+1, ..., xN |zn)    (3.45)

The quantity α(zn) represents the joint probability of observing all of the given data up to time n together with the value of zn, whereas β(zn) represents the conditional probability of all the future data from time n + 1 up to N given the value of zn. Again, α(zn) and β(zn) each represent a set of K numbers, one for each of the possible settings of the 1-of-K coded binary vector zn. From now on, the notation α(znk) shall be used to denote the value of α(zn) when znk = 1, and similarly for β(znk).
The recursion relations that allow α(zn) and β(zn) to be evaluated efficiently can now be derived. Making use of the conditional independence properties, in particular Eqs. (3.35) and (3.36), together with the sum and product rules, α(zn) can be expressed in terms of α(zn−1)
Figure 3.4: Forward recursion for the evaluation of the α variables
as follows
α(zn) = p(xn|zn) Σ_{zn−1} α(zn−1) p(zn|zn−1)    (3.47)
It should be noted that there are K terms in the summation, and the right hand side has to be evaluated for each of the K values of zn, so each step of the α recursion has a computational cost that scales like O(K²). The forward recursion is illustrated by the lattice diagram in Fig 3.4.
In order to start this recursion, an initial condition is required. This is given by:

α(z1) = p(x1, z1) = p(z1) p(x1|z1) = Π_{k=1}^K {πk p(x1|φk)}^{z1k}    (3.48)
which says that α(z1k), for k = 1, ..., K, takes the value πk p(x1|φk). Starting at the first node of the chain, one can work along the chain and evaluate α(zn) for every latent node. Because each step of the recursion involves multiplying by a K × K matrix, the overall cost of evaluating these quantities for the whole chain is O(K²N).
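As a concrete illustration, the α recursion above can be sketched as follows (an illustrative NumPy sketch, not from the thesis; the emission probabilities p(xn|zn) are assumed to be precomputed):

```python
import numpy as np

def forward(pi, A, emission):
    """Unscaled forward recursion for an HMM.

    pi:       (K,) initial state probabilities p(z1)
    A:        (K, K) transition matrix, A[j, k] = p(zn = k | zn-1 = j)
    emission: (N, K) precomputed emission probabilities p(xn | znk = 1)

    Returns alpha, an (N, K) array with alpha[n, k] = p(x1..xn, znk = 1).
    """
    N, K = emission.shape
    alpha = np.zeros((N, K))
    alpha[0] = pi * emission[0]                       # initial condition (3.48)
    for n in range(1, N):
        # Each step costs O(K^2): a K-vector times a K x K matrix.
        alpha[n] = emission[n] * (alpha[n - 1] @ A)   # recursion (3.47)
    return alpha
```

Summing the final row of alpha gives the likelihood p(X), as derived later in this section.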
The recursion relation for β(zn) can be found similarly, by making use of the conditional independence properties:

β(zn) = Σ_{zn+1} β(zn+1) p(xn+1|zn+1) p(zn+1|zn)    (3.50)
It should be noted that in this case, the algorithm goes backward, evaluating β(zn) in terms of β(zn+1). At each step, the effect of observation xn+1 is absorbed through the emission probability p(xn+1|zn+1), multiplied by the transition matrix p(zn+1|zn), and then zn+1 is marginalized out.
Figure 3.5: Backward recursion for the evaluation of the β variables
Again, a starting condition for the recursion, namely a value for β(zN), is required. This can be obtained by setting n = N in Eq (3.43) and replacing α(zn) with its definition (3.44) to give:

p(zN|X) = p(X, zN) β(zN) / p(X)    (3.51)

which is correct provided β(zN) is set to 1 for every setting of zN.
In the M step equations, the quantity p(X) will cancel out, as can be seen in the M step equation for μk:

μk = Σ_{n=1}^N γ(znk) xn / Σ_{n=1}^N γ(znk) = Σ_{n=1}^N α(znk) β(znk) xn / Σ_{n=1}^N α(znk) β(znk)    (3.52)
However, the quantity p(X) represents the likelihood function, whose value is typically monitored during the EM optimization, and so it is useful to be able to evaluate it. Summing both sides of (3.43) over zn, and using the fact that the left hand side is a normalized distribution, gives:

p(X) = Σ_{zn} α(zn) β(zn)    (3.53)
Thus the likelihood function can be evaluated by computing this sum for any convenient choice of n. For instance, if only the likelihood function has to be evaluated, then this can be done by running the α recursion from the start to the end of the chain and then using this result for n = N, making use of the fact that β(zN) is a vector of 1's. In this case, no β recursion is required, and

p(X) = Σ_{zN} α(zN)    (3.54)
To interpret this expression for the likelihood, note that the joint distribution p(X, Z) must be summed over all possible values of Z. Each such choice represents a particular choice of hidden state for every time step; in other words, every term in the summation is a path through the lattice diagram, and the number of such paths is exponential in the length of the chain. By expressing the likelihood function in the form (3.54), the computational cost has been reduced from exponential to linear in the length of the chain by swapping the order of the summations and multiplications, so that at each time step n the contributions from all paths passing through each of the states znk can be summed to give the K values α(zn).
To evaluate the quantities ξ(zn−1, zn), which correspond to the values of the conditional probabilities p(zn−1, zn|X), a similar argument gives

ξ(zn−1, zn) = α(zn−1) p(xn|zn) p(zn|zn−1) β(zn) / p(X)    (3.55)

Here, the conditional independence property (3.39) was used together with the definitions of α(zn) and β(zn) given by (3.44) and (3.45). Thus ξ(zn−1, zn) can be calculated directly by using the results of the α and β recursions.
To summarize: to train a hidden Markov model using the EM algorithm, one first makes an initial selection of the parameters θold, where θ ≡ (π, A, φ). The A and π parameters are often initialized either uniformly or randomly from a uniform distribution (respecting their summation constraints), while initialization of the parameters φ will depend on the form of the distribution. Both the forward α recursion and the backward β recursion are then run, and the results are used to evaluate γ(zn) and ξ(zn−1, zn). At this stage, the likelihood function can also be evaluated. This completes the E step, and these results are used to find a revised set of parameters θnew using the M step equations given above. The E and M steps are then alternated until some convergence criterion is satisfied, for example when the change in the likelihood function falls below some threshold.
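The procedure just summarized can be sketched as a single EM iteration (a minimal, unscaled sketch, not from the thesis, for discrete multinomial emissions; all names are illustrative):

```python
import numpy as np

def em_step(pi, A, B, obs):
    """One EM iteration for an HMM with discrete (multinomial) emissions.

    pi: (K,) initial probabilities; A: (K, K) transitions;
    B: (K, V) emission probabilities; obs: length-N sequence of symbol ids.
    Returns updated (pi, A, B) plus the log likelihood of obs under the
    *old* parameters. Unscaled recursions, so only suitable for short chains.
    """
    obs = np.asarray(obs)
    N, K = len(obs), len(pi)
    em = B[:, obs].T                               # em[n, k] = p(xn | znk = 1)

    alpha = np.zeros((N, K)); beta = np.ones((N, K))
    alpha[0] = pi * em[0]
    for n in range(1, N):                          # forward recursion (3.47)
        alpha[n] = em[n] * (alpha[n - 1] @ A)
    for n in range(N - 2, -1, -1):                 # backward recursion (3.50)
        beta[n] = A @ (em[n + 1] * beta[n + 1])

    pX = alpha[-1].sum()                           # likelihood (3.54)
    gamma = alpha * beta / pX                      # marginals, Eq (3.43)
    # xi[n, j, k] follows Eq (3.55) for the transition from step n to n+1
    xi = alpha[:-1, :, None] * A[None] * (em[1:] * beta[1:])[:, None, :] / pX

    # M step: re-estimate pi, A and the emission table B
    new_pi = gamma[0] / gamma[0].sum()
    new_A = xi.sum(0) / xi.sum(0).sum(1, keepdims=True)
    new_B = np.zeros_like(B)
    for v in range(B.shape[1]):                    # analogue of Eq (3.33)
        new_B[:, v] = gamma[obs == v].sum(0)
    new_B /= new_B.sum(1, keepdims=True)
    return new_pi, new_A, new_B, np.log(pX)
```

Iterating this step monitors the log likelihood, which EM guarantees to be non-decreasing from one iteration to the next.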
It should be noted that in the recursion relations, the observations enter only through conditional distributions of the form p(xn|zn). The recursions are therefore independent of the type of emission density, so long as its value can be computed for each of the K possible states of zn. Since the observed variables {xn} are fixed, the quantities p(xn|zn) can be pre-computed as functions of zn once at the start and used throughout the recursions.
The maximum likelihood approach is most effective when the number of data points is large
in relation to the number of parameters. Here, the hidden Markov model can be trained
effectively, using maximum likelihood, provided the training sequence is sufficiently long.
Alternatively, one can also use multiple shorter sequences, which requires a straightforward modification of the hidden Markov model EM algorithm. In the case of left-to-right models, this is particularly important because, in a given observation sequence, a given state transition will typically be observed only a small number of times.
Another quantity of interest is the predictive distribution, in which the observed data is X = {x1, ..., xN} and one wishes to predict xN+1. Again, one can make use of the sum and
product rules together with the conditional independence properties (3.39) and (3.41) to
derive:
p(xN+1|X) = Σ_{zN+1} p(xN+1, zN+1|X)
          = Σ_{zN+1} p(xN+1|zN+1) p(zN+1|X)
          = Σ_{zN+1} p(xN+1|zN+1) Σ_{zN} p(zN+1, zN|X)
          = Σ_{zN+1} p(xN+1|zN+1) Σ_{zN} p(zN+1|zN) p(zN|X)
          = Σ_{zN+1} p(xN+1|zN+1) Σ_{zN} p(zN+1|zN) p(zN, X) / p(X)
          = (1 / p(X)) Σ_{zN+1} p(xN+1|zN+1) Σ_{zN} p(zN+1|zN) α(zN)    (3.56)
which can be evaluated by first running a forward α recursion and then computing the final summations over zN and zN+1. The result of the summation over zN can be stored and used, once the value of xN+1 is observed, to run the α recursion forward to the next step in order to predict the subsequent value xN+2. In the equation above, the influence of all the data from x1 to xN is summarized in the K values of α(zN).
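For discrete emissions, the predictive computation of (3.56) amounts to one matrix-vector product followed by a mixture of emission rows (an illustrative sketch; variable names are not from the thesis):

```python
import numpy as np

def predictive(alpha_N, A, B):
    """Predictive distribution p(x_{N+1} | x_1..x_N), Eq (3.56).

    alpha_N: (K,) final forward message alpha(zN); unscaled values are
             fine, since the division by p(X) reduces to normalization.
    A:       (K, K) transition matrix
    B:       (K, V) discrete emission probabilities p(x | z)
    Returns a length-V distribution over the next observed symbol.
    """
    state_pred = (alpha_N / alpha_N.sum()) @ A   # p(z_{N+1} | X)
    return state_pred @ B                        # mix the emission rows
```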
An important issue must be addressed before the forward-backward algorithm can be used in practice. In the algorithm, at each step, the value α(zn) is obtained from the previous value α(zn−1) by multiplying by the quantities p(zn|zn−1) and p(xn|zn). Because these probabilities are often significantly less than unity, going forward along the chain, the values of α(zn) can go to zero exponentially quickly, exceeding the dynamic range of the computer.
In the case of i.i.d. data, this problem was circumvented by evaluating log likelihood functions. That will not work here, because sums of products of small numbers would have to be formed. Therefore, rescaled versions of α(zn) and β(zn), whose values remain of order unity, are used. The corresponding scaling factors cancel out when these rescaled quantities are used in the EM algorithm.
In (3.44), α(zn) = p(x1, ..., xn, zn), representing the joint distribution of all the observations up to xn and the latent variable zn. Now, a normalized version of α is introduced:

α̂(zn) = p(zn|x1, ..., xn) = α(zn) / p(x1, ..., xn)    (3.57)
which is expected to be well behaved numerically because it is a probability distribution over K variables for any value of n. In order to relate the scaled and original alpha variables, scaling factors defined by conditional distributions over the observed variables are introduced:

cn = p(xn|x1, ..., xn−1)    (3.58)

so that, from the product rule,

p(x1, ..., xn) = Π_{m=1}^n cm    (3.59)
and so

α(zn) = p(zn|x1, ..., xn) p(x1, ..., xn) = (Π_{m=1}^n cm) α̂(zn)    (3.60)
The recursion equation (3.47) for α can then be turned into one for α̂, given by

cn α̂(zn) = p(xn|zn) Σ_{zn−1} α̂(zn−1) p(zn|zn−1)    (3.61)
At each stage of the forward message passing phase, cn will have to be evaluated and stored, which is easily done because it is the coefficient that normalizes the right hand side of (3.61) to give α̂(zn).
Rescaled variables β̂(zn) can be similarly defined using

β(zn) = (Π_{m=n+1}^N cm) β̂(zn)    (3.62)

which will again remain within machine precision because, from (3.45), the quantity β̂(zn) is a ratio of conditional probabilities.
The recursion result (3.50) for β then gives the following recursion for the re-scaled variables:

cn+1 β̂(zn) = Σ_{zn+1} β̂(zn+1) p(xn+1|zn+1) p(zn+1|zn)    (3.64)

In applying this recursion relation, the scaling factors cn that were computed in the α phase are used.
The likelihood function can then be computed from the scaling factors using

p(X) = Π_{n=1}^N cn    (3.65)

Similarly, using (3.43) and (3.55), together with (3.65), the required marginals are given by

γ(zn) = α̂(zn) β̂(zn)    (3.66)

ξ(zn−1, zn) = (1/cn) α̂(zn−1) p(xn|zn) p(zn|zn−1) β̂(zn)    (3.67)
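The scaled recursions can be sketched as follows (an illustrative NumPy sketch; emission probabilities are assumed precomputed, and names are not from the thesis):

```python
import numpy as np

def scaled_forward_backward(pi, A, emission):
    """Scaled forward-backward recursions, Eqs (3.61) and (3.64).

    pi: (K,) initial probabilities; A: (K, K) transitions;
    emission: (N, K) with emission[n, k] = p(xn | znk = 1).
    Returns alpha_hat, beta_hat, the scaling factors c, and the
    log likelihood log p(X) = sum_n log cn (Eq 3.65).
    """
    N, K = emission.shape
    alpha_hat = np.zeros((N, K)); beta_hat = np.ones((N, K))
    c = np.zeros(N)

    a = pi * emission[0]
    c[0] = a.sum(); alpha_hat[0] = a / c[0]
    for n in range(1, N):
        a = emission[n] * (alpha_hat[n - 1] @ A)   # rhs of (3.61)
        c[n] = a.sum()                             # cn normalizes the rhs
        alpha_hat[n] = a / c[n]
    for n in range(N - 2, -1, -1):                 # backward pass (3.64)
        beta_hat[n] = A @ (emission[n + 1] * beta_hat[n + 1]) / c[n + 1]
    return alpha_hat, beta_hat, c, np.log(c).sum()
```

The product alpha_hat * beta_hat directly gives the marginals γ(zn), which sum to one at every time step, confirming that the scaling factors have cancelled.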
Figure 3.6: A graphical depiction of the Viterbi algorithm
In many applications of hidden Markov models, the latent variables have some meaningful
interpretation, and so it is often of interest to find the most probable sequence of hidden
states for a given observation sequence. Because the graph for a hidden Markov model
is a directed tree, this problem can be solved exactly using the max-sum algorithm. The
problem of finding the most probable sequence of latent states is not the same as that of finding the set of states that are individually the most probable. The latter problem can be solved by running the forward-backward (sum-product) algorithm to find the latent marginals γ(zn) and then maximizing each of these individually. However, the set of such states will not, in general, correspond to the most probable sequence of states. In fact, this set of states might even represent a sequence having zero probability, if it so happens that two successive states, which in isolation are individually the most probable, are connected by a transition having zero probability.
The most probable sequence of states can be found efficiently using the max-sum algorithm, known in this context as the Viterbi algorithm. Fig 3.6 shows a fragment of the hidden
Markov model expanded as a lattice diagram. The number of possible paths through the
lattice diagram grows exponentially with the length of the chain. The Viterbi algorithm
searches this space of paths efficiently to find the most probable path with a computational
cost that grows only linearly with the length of the chain.
The variable zN is treated as the root, and messages are passed to the root starting with
the leaf nodes. The messages passed in the max-sum algorithm are given by

μ_{zn→fn+1}(zn) = μ_{fn→zn}(zn)    (3.68)

μ_{fn+1→zn+1}(zn+1) = max_{zn} { ln fn+1(zn, zn+1) + μ_{zn→fn+1}(zn) }    (3.69)

If μ_{zn→fn+1}(zn) is eliminated between these two equations, a recursion for the f → z messages can be obtained.
A simple algorithm that keeps track of the best path to every possible latent variable is used to find the sequence of latent variables corresponding to the most likely path.
Intuitively, the Viterbi algorithm can be understood as follows. Naively, one could consider
all of the exponentially many paths through the lattice, evaluate the probability for each,
and then select the path having the highest probability. However, a dramatic saving in
computational cost can be made as follows. Suppose the probability of each path is evaluated
by summing up products of transition and emission probabilities going forward along each
path through the lattice. Considering a particular time step n and a particular state k at
that time step, there will be many possible paths converging on the corresponding node in
the lattice diagram. However, only the path that has the highest probability so far
needs to be retained. Because there are K states at time step n, K such paths will need to
be kept track of at step n. At time step n+1, there will be K² possible paths to consider, comprising K possible paths leading out of each of the K current states, but again, only K of these, corresponding to the best path for each state at time n+1, will have to be retained. When the final time step N is reached, it will be known which state corresponds to the overall most probable path. Because there is a unique best path coming into that state, the path can be traced back to step N−1 to see what state it occupied at that time, and so on, recursively, back to the start of the chain.
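The forward scoring and backtracking just described can be sketched as follows (an illustrative NumPy implementation working in log space; names are not from the thesis):

```python
import numpy as np

def viterbi(pi, A, emission):
    """Most probable state sequence for an HMM (Viterbi algorithm).

    pi: (K,) initial probabilities; A: (K, K) transitions;
    emission: (N, K) with emission[n, k] = p(xn | znk = 1).
    Returns the most probable path as a list of state indices.
    """
    N, K = emission.shape
    with np.errstate(divide="ignore"):              # allow log(0) = -inf
        log_pi, log_A, log_em = np.log(pi), np.log(A), np.log(emission)

    score = log_pi + log_em[0]                      # best log prob ending in each state
    back = np.zeros((N, K), dtype=int)              # best predecessor of each state
    for n in range(1, N):
        cand = score[:, None] + log_A               # cand[j, k]: come from j, go to k
        back[n] = cand.argmax(axis=0)               # keep only the best path per state
        score = cand.max(axis=0) + log_em[n]
    path = [int(score.argmax())]                    # overall best final state
    for n in range(N - 1, 0, -1):                   # trace the unique path backward
        path.append(int(back[n][path[-1]]))
    return path[::-1]
```

Each step keeps only K running scores and K backpointers, which is exactly the linear-in-N, O(K²)-per-step behaviour described above.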
Chapter 4
Approach
Two different approaches will be discussed in this section. The first approach is purely SVM
(Support Vector Machine) based and is derived from Dalal et al.[1]. The machinery behind
the second approach is based on Hidden Markov Models. We describe our joint model for
tracking and recognition in which we track and classify actions simultaneously. Both of these approaches are described in the sections that follow.
The SVM approach is an extension of [1]. Whereas in [1], the focus is on building a binary
classifier to make person/no-person detections in images, the SVM approach uses an optical
flow based feature set and learns an SVM for each class of actions.
Dalal et al.[1] discuss how locally normalized Histogram of Oriented Gradient (HOG)
descriptors provide better performance at human detection relative to other feature sets.
These descriptors are computed over a grid of uniformly spaced cells and use overlapping
local contrast normalizations for improved performance. The intuition is that local object
appearance and shape in static images can be characterized well by the distribution of local
intensity gradients.
Here, local intensity gradients are replaced by optical flow to characterize local motion
patterns in video sequences. This is implemented by dividing an image into regions called
cells and accumulating a 1-D histogram of the optical flows over the pixels of the cell.
To achieve better invariance to effects such as illumination and shadowing, local responses
from cells are normalized by accumulating cells over larger spatial regions called blocks and
normalizing over all cells in a block. Just as in [1], we use an 8x8 cell with each block having
4 cells in it. Each pixel calculates a weighted “vote” based on the orientation and magnitude
of the optical flow vector centered at it and the votes are accumulated into orientation bins
over cells. There are 9 orientation bins from 0 - 180 (degrees) and 4 normalizations for every
cell. A detection window is tiled with a dense grid of overlapping HOG descriptors. Let L be the number of pixels in the image. For efficiency reasons, we only score windows centered
at every 8th pixel. If the dimensionality of the detection window is n×m, and pi is the
n×m dimensional patch extracted from location i in the image, we can write
xi = ψ(pi )
for the HOG feature vector computed at the patch. This feature vector is (n/8) × (m/8) × 36 dimensional. Fig. 4.1 shows an image, the optical flow and the HOG descriptor computed at the image.
Figure 4.1: (a) is the original image. (b) is the optical flow computed at the image. (c) is the HOG descriptor computed at the image.
The multiclass SVM formulation described in Sec 3.1.6 is the version of the SVM used here.
4.1.2.1 Training
Given a training set of video sequences, we collect training pairs {xi , yi } where xi is as
described above and yi ∈ {1, . . . , C}, where C is the number of action classes. Positive
features for an SVM for action class C are obtained from bounding boxes around the
actions corresponding to class C. Negative features were obtained from random patches in
images containing actions from the remaining classes. We train a model wC by minimizing the SVM objective of Sec 3.1; the resulting model can be written as

wC = Σ_{i=1}^d αi yi xi    (4.1)

where d is the size of the training data and αi are Lagrange multipliers, whose values can be obtained by solving the dual optimization problem.
The training is typically done iteratively - after learning an initial SVM classifier, all the
negative training examples are searched exhaustively for false positives (hard examples).
These false positives are then appended to the negative training set and the SVM is retrained
using the augmented negative training set, which gives rise to a new classifier.
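The hard-negative mining loop can be sketched as follows (illustrative only; `train` stands in for a linear SVM solver, and the feature layout is an assumption):

```python
import numpy as np

def mine_hard_negatives(train, pos, neg_pool, rounds=3, thresh=0.0):
    """Iterative hard-negative mining for a linear detector.

    train:    callable taking (X, y) and returning a weight vector w such
              that w @ x is the detection score (a stand-in for SVM training).
    pos:      (P, D) positive features.
    neg_pool: (M, D) pool of candidate negative windows.
    Starts from a random subset of negatives, then repeatedly appends false
    positives (pool negatives scoring above `thresh`) and retrains.
    """
    rng = np.random.default_rng(0)
    neg = neg_pool[rng.choice(len(neg_pool),
                              size=min(10, len(neg_pool)), replace=False)]
    for _ in range(rounds):
        X = np.vstack([pos, neg])
        y = np.concatenate([np.ones(len(pos)), -np.ones(len(neg))])
        w = train(X, y)
        hard = neg_pool[neg_pool @ w > thresh]   # false positives in the pool
        if len(hard) == 0:
            break
        neg = np.vstack([neg, hard])             # augment and retrain
    return w
```

In practice duplicates in the augmented negative set would be pruned; the sketch omits this for brevity.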
4.1.2.2 Testing
To classify a new video sequence broken down into a sequence of optical flows, we convert each image into a large (framewidth/8) × (frameheight/8) × 36 dimensional HOG descriptor. Then, for a given C, we scan the n×m dimensional detection window across all locations and scales, scoring the classifier for C at every detection window by convolving the image HOG descriptor with the model wC. Writing a frame as its set of windows {x1, . . . , xm}, where m is the number of windows in the image across locations and scales, each frame is assigned the label of the class whose SVM scores highest on it.
The entire video sequence X can be classified by taking a majority vote across frames:

C(X) = C(x1, . . . , xN) = argmax_C Σ_{i=1}^N I(C(xi) = C)    (4.3)
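The majority vote of Eq (4.3) can be sketched as follows (illustrative; the per-frame score format is an assumption):

```python
from collections import Counter

def classify_video(frame_scores):
    """Majority vote across frames, in the manner of Eq (4.3).

    frame_scores: list of per-frame dicts {class_label: svm_score},
    one dict per frame (a hypothetical format).
    Returns the action class receiving the most per-frame votes.
    """
    # Each frame votes for its highest-scoring class; the video takes
    # the class with the most votes.
    votes = Counter(max(scores, key=scores.get) for scores in frame_scores)
    return votes.most_common(1)[0][0]
```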
4.2 The Hidden Markov Model Approach
The SVM approach is similar to a bag-of-words approach; even if it were fed a sequence of
frames constituting a video out of order, it would still classify the video sequence just as it
would if the frames were fed to it in order. In reality, the optical flow in a particular frame
very much depends on the optical flows of the frames that precede it. This scenario can be
modeled as a Markov process. We use the training set of videos to build Hidden Markov
Models (HMMs) for each action class. Given a new, previously unseen video sequence, which is broken down into a sequence of optical flows, the HMMs learnt in the training phase are used to classify it.
As mentioned in section 3.2, a Hidden Markov Model (HMM) consists of a transition model
A, an emission model φ and an initial model π. The first step towards building an HMM
model is to build a vocabulary of “visual words” using the frames belonging to the videos
from the training set. These visual words will then represent our hidden variables in the
HMM.
To build a visual vocabulary, the motion descriptor in Efros et al.[9] is used on bounding
boxes around the person in a frame. This motion descriptor has been shown to perform
reliably with noisy image sequences. Given a video sequence in which the person appears
in the center of the field of view, the optical flow is computed at each frame using the
Lucas-Kanade algorithm[10]. The optical flow vector field F is then split into 2 scalar fields
Fx and Fy, corresponding to the x and y components of the optical flow. Fx and Fy are each half-wave rectified into non-negative channels, so that Fx = Fx⁺ − Fx⁻ and Fy = Fy⁺ − Fy⁻. These four non-negative channels are then blurred with a Gaussian kernel and normalized to obtain the final four channels F̂x⁺, F̂x⁻, F̂y⁺, and F̂y⁻.
The motion descriptors of 2 different frames are computed as follows: Suppose the four
Figure 4.2: Cluster centers from the UCF action dataset
channels for frame A are a1, a2, a3 and a4 and, similarly, the four channels for frame B are b1, b2, b3 and b4. Then the similarity between frames A and B is

S(A, B) = Σ_{c=1}^4 Σ_{x,y∈I} ac(x, y) bc(x, y)    (4.4)
To construct the codebook, an affinity matrix A is computed on all frames in the training set, where A(i, j) is the similarity between frame i and frame j calculated using the equation above. K-medoid clustering is then run on this affinity matrix to obtain K clusters. Each frame in the training set belongs to one of these K clusters, and each cluster represents a visual word. Fig. 4.2 shows 15 of the 45 "cluster centers" from the UCF action dataset[11]. Each cluster center is the medoid frame of its cluster. From now on, the visual vocabulary V will be represented as the K element set {w1, . . . , wK}. For the KTH dataset, which has 2391 video sequences in it, the size of the vocabulary we use is 255.
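The codebook construction can be sketched as follows (an illustrative k-medoids on a precomputed affinity matrix; representing each frame by its flattened channels, so that the channel correlation of Eq (4.4) becomes a dot product, is an assumption):

```python
import numpy as np

def build_codebook(channels, K, iters=20):
    """Cluster frames into K visual words with k-medoids on an affinity matrix.

    channels: (N, D) array, one row per frame, holding the blurred flow
    channels flattened into a single vector (an assumed representation).
    Returns the medoid frame indices (the "cluster centers") and the
    cluster assignment of every frame.
    """
    # Affinity A[i, j]: sum over channels of pointwise products, which for
    # flattened channel vectors is a dot product (cf. Eq 4.4).
    A = channels @ channels.T
    N = len(channels)
    medoids = np.linspace(0, N - 1, K).astype(int)  # simple deterministic init
    for _ in range(iters):
        assign = A[:, medoids].argmax(axis=1)       # most similar medoid
        new_medoids = medoids.copy()
        for k in range(K):
            members = np.where(assign == k)[0]
            if len(members):
                # new medoid: member with highest total similarity to cluster
                sub = A[np.ix_(members, members)]
                new_medoids[k] = members[sub.sum(axis=1).argmax()]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids, A[:, medoids].argmax(axis=1)
```

In practice k-medoids is usually run from several random initializations; a single deterministic start is used here only to keep the sketch short.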
4.2.2 A hidden Markov model framework for video sequences
For an N frame video sequence X = {x1, . . . , xN}, where xj is the optical flow computed at frame j, each xj can represent an observed state in a HMM, and the video sequence X can then be represented as a sequence of these observed states. The visual words computed in the section above represent the hidden variables z in the HMM. Given a labeled set of videos as training data, we can then build a hidden Markov model θC = {AC, φC, πC} for each action class C.
We assume that our observations are continuous vectors and therefore assume a Gaussian
emission density. We explore two different feature vectors. The first is the optical flow
based HOG feature set described in Sec. 4.1.1. For an n × m image patch, the length of the feature vector is (n/8) × (m/8) × 36. The problem with directly modelling this feature vector as
a Gaussian is that it may be too large to fit a full covariance matrix. This means that we
may end up with a singular covariance matrix meaning Σ−1 will not exist. One solution to
this problem is to restrict the space of matrices Σ under consideration. Instead of trying to
fit a full covariance matrix, we could choose to fit a covariance matrix Σ that is diagonal. In either case, we found both an efficiency and a performance gain by explicitly reducing the dimensionality of the data by Principal Component Analysis (PCA). We project the data down to D dimensions via xi = V^T ψ(pi), where V is a ((n/8) × (m/8) × 36) × D dimensional matrix of principal component directions.
4.2.2.2 Visual word SVM based features
Here, we train an SVM wi ,i ∈ {1, . . . , K} to detect each visual word. Positive instances
come from bounding boxes from images in the cluster for which i is the center. Negative
instances are randomly sampled patches from images belonging to other clusters along with
image patches with no person in them. We found that adding the latter helps improve
performance since the SVM is trained to discriminate a particular action word from both
other action words and background patches that do not correspond to any action class. We
can now define the feature extracted from the ith patch as

xi = wT ψ(pi) = [w1T ; . . . ; wKT ] ψ(pi)    (4.6)

where the rows of wT are the K visual word models.
This is also a linear dimensionality reduction scheme. Our reduction scheme is discriminative in that it exploits training labels, while PCA does not. One may also employ other reduction schemes; we chose this SVM-based scheme in anticipation of moving our model toward a fully discriminative structured prediction framework. Given the definitions of our observed feature vectors x and hidden states z, we use the EM machinery described in Sec. 3.2 to learn a hidden Markov model θC for each action class.
A new, previously unseen video sequence can be classified using the hidden Markov models constructed in the training phase. In the pre-tracked scenario, the location of the person in each frame is known, i.e. each frame is clipped to only include the person of interest. The video sequence essentially reduces to a sequence of poses, and the optical flow can be computed at each pose. This sequence of flows constitutes the observed states. A sum-product algorithm can then be used to calculate the probability of the sequence of observed states under the HMM for each action class built in the training phase. If αn(i) represents the probability of the partial observed sequence x1, . . . , xn produced by all possible state sequences that end at the i-th visual word given θC, it can be recursively defined as:
αn(i) = Σ_{j=1}^K αn−1(j) AC_ji φC_i(xn)    (4.7)
where K is the size of the visual vocabulary and AC is the matrix of transition probabilities for action class C. For an m-frame video sequence, the probability of the entire sequence X under the HMM for class C is

p(X|θC) = Σ_{j=1}^K αm(j)    (4.8)
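Classification by scoring a sequence under each class HMM can be sketched as follows (illustrative; log-space scaling is omitted for brevity, and the data layout is an assumption):

```python
import numpy as np

def sequence_likelihood(pi, A, emission):
    """p(X | theta_C) via the forward recursion of Eqs (4.7)-(4.8).

    pi: (K,) initial probabilities over visual words; A: (K, K) transitions;
    emission: (N, K) with emission[n, i] = phi_i(x_n).
    """
    alpha = pi * emission[0]
    for n in range(1, len(emission)):
        alpha = (alpha @ A) * emission[n]      # Eq (4.7) in vector form
    return alpha.sum()                          # Eq (4.8)

def classify(models, emissions):
    """Pick the class whose HMM assigns the sequence the highest probability.

    models: dict mapping class label -> (pi, A); emissions: dict mapping
    class label -> (N, K) emission probabilities for the same sequence
    (a hypothetical structure; the emissions depend on each class's phi).
    """
    scores = {c: sequence_likelihood(pi, A, emissions[c])
              for c, (pi, A) in models.items()}
    return max(scores, key=scores.get)
```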
After scoring the sequence under the HMM for every class, the final classifier for a video sequence picks the class whose HMM assigns the sequence the highest probability. As mentioned earlier, the ability to perform robust tracking prior to classification cannot be assumed in general. In the next section, this assumption is relaxed and a model for jointly tracking and recognizing actions is presented.
4.3 A Unified model for joint tracking and recognition
Here, we use HMM machinery to build a joint model for tracking and recognizing actions.
Given a video sequence, we do not assume that the person in it is tracked; instead we try
to infer the person’s location in each frame along with his/her action.
In the pre-tracked scenario, z ∈ {1,. . . ,K } where K is the number of visual words. In the
current scenario, where pre-tracking is not assumed and the location of the person in a frame is unknown, z ∈ {1, . . . , L}×{1, . . . , K}, where L is equal to the number of pixels in a frame. Given that the hidden variables z now lie in the cross
product space of locations and visual words, a HMM for tracking can be given by
p(X, Z|θG) = p(z1|πG) Π_{n=2}^N p(zn|zn−1, AG) Π_{n=1}^N p(xn|zn, φG)    (4.10)
We present a model for joint tracking and recognition by building action specific HMMs for each action class C:

p(X, Z, C|θC) = p(C) p(z1|πC) Π_{n=2}^N p(zn|zn−1, AC) Π_{n=1}^N p(xn|zn, φC)    (4.11)
where θC is the hidden Markov model for action class C. Our model enforces that all the
visual words zn are consistent with a single action class whereas (4.10) cannot guarantee
this property. Fig. 4.3 shows a graphical representation of our joint model. In it, the
transition and emission probabilities are conditioned on the variable C, the action class.
46
Figure 4.3: An illustration of the joint model. The variable C determines the transition
and emission probabilities.
As mentioned earlier, in the present scenario, z ∈ {1,. . . ,L}×{1,. . . ,K }. This state space
can be very large for large values of L. For the KTH dataset[7], where the spatial resolution
is 160x120, L ∼ 19200. In order to reduce this state space, we place a prior on the movement of the person: the person cannot move more than δ pixels from frame to frame. The probability of a transition from state zn−1 to state zn is zero if the corresponding locations are more than δ pixels apart. This bounded velocity motion model can be used to speed up the Viterbi algorithm, rewriting the Viterbi recursion as a distance transform. The standard Viterbi algorithm is O(K²) (where K is the number of states) whereas using distance transform techniques takes O(K) time, an order of magnitude improvement.
Figure 4.4: An illustration of the bounded velocity motion model. Each node represents
a (location,word ) hidden state. The only transitions considered are shown by the lines
connecting nodes at n-1 and n. The rest of the transitions are not taken into consideration
4.3.2.2 Inference
Given our joint model for tracking and recognition, we can now run inference on a new video sequence and find both the best path of hidden states and its action class. The inference procedure is described below.
For a specific C, we employ dynamic programming to find the best sequence of hidden states. If SC(zn = i) is defined as the score of the best path that ends in state i under θC, it can be computed recursively in the manner of the Viterbi algorithm of Chapter 3, with the bounded velocity motion model (recalling that zn = (ln, wn)) restricting the transitions that need to be considered.
In the section above, we presented a joint model for tracking and recognition. Here, we
present an informal motivation for such a model. We run several baseline algorithms on
a video sequence and compare their performance, tracking wise, to our joint tracking and
recognition model.
There does not exist an out-of-the-box technology for tracking people in video sequences.
Tracking people is difficult because people deform due to articulations, can move quickly from frame to frame, and are often surrounded by clutter. Many previous
approaches have relied on Kalman filtering or Particle filtering [30]. Although it may not appear principled at first glance, many recent approaches have instead adopted a tracking-by-detection philosophy, in which a track is stitched together by linking object detections from across frames[31].
Such an approach has the advantage of not needing to be hand initialized and being robust
to errors from drifting because the tracker is essentially re-initializing itself each frame. For
example, Forsyth[30] explicitly advocates the above approach for face tracking.
Given that tracking by detection has been advocated by recent approaches, for our baseline,
we feel comfortable in proposing to link detections across frames using the detector of Dalal et al.[1]. Commonly known as the Dalal-Triggs detector, it is considered the state-of-the-art for human detection in images and is particularly trained to find pedestrians in images.
We also propose two other baselines. All our baselines are outlined below
In B and C, the hidden variable z ∈ {1,. . . ,L} where L is the number of discrete locations
in a frame. They both use the bounded velocity motion model for transition probabilities.
where xi is the HOG feature vector extracted from location i and w is the model. It should
be noted that in its typical implementation, the Dalal-Triggs detector scans a 64×128
detection window across locations as well as scales. Here, the version of the detector used
does not search over scales. Images are rescaled such that the person of interest appears at the scale of the detection window.
(a) Kicking sequence tracked by baseline A
Figure 4.5: (a) tracked using baseline A is not consistent in the person it tracks. (b) tracked
using baseline B is consistent but does not track the person of interest-the person kicking
the ball. (c) and (d) seem to be doing equally well in tracking the person of interest
Baseline C is similar to the joint model except that there is no model for transitions
between visual words - these transitions are simply not taken into consideration. We run
all our baselines and our joint model on a kicking sequence from the UCF dataset[11]. We
run baseline C and our joint model given that the classification of the sequence is already
known to us. The joint model has a Gaussian emission density and uses the feature vector
in Sec 4.2.2.2. Fig. 4.5 shows the tracks output by the baselines and our joint model.
From the figure, it seems that the Dalal-Triggs baseline (baseline A) is inconsistent in its
tracking. One would expect the bounding box in each frame to be around the same person,
but this is clearly not the case here. Baseline B enforces this consistency by using the
Dalal-Triggs detector in a dynamic programming framework, using the bounded velocity
motion model for transition probabilities. Both A and B are unable to track the person of
interest, i.e. the person kicking the ball. Instead, they seem to be tracking the person whose pose most resembles that of a pedestrian. The performance of baselines A and B on the
video sequence demonstrates why pre-tracking using an external module cannot be assumed
in general.
Baseline C outputs a track that is identical to the one output by the joint model. Although
baseline C outputs a reasonable track without considering transitions between visual words,
we believe that these transitions do convey important information that will lead to more robust tracking. Our joint model can be viewed as tracking with a set of templates, where each template encodes a visual action word. We are not aware of a similar approach in the tracking literature.
Even though we do not see any major improvement of our joint model over baseline C, we note that in our model, jointly tracking and recognizing activities is not more expensive than separately pre-tracking and labelling actions. Since both activities can be efficiently done together, we argue that this will result in a more robust system in more difficult scenarios because it can enforce the constraint that all visual word templates are consistent with a single action class.
Chapter 5
Results
The algorithms described in Chapter 4 are tested on two datasets: the KTH human motion dataset[7] and the UCF action dataset[11].
The KTH human motion dataset is one of the largest video datasets of human actions.
It contains six types of human actions (walking, jogging, running, boxing, hand waving
and hand clapping) performed several times by 25 subjects both outdoors and indoors,
with different clothes. All sequences were taken over homogeneous backgrounds with a
static camera with 25fps frame rate. Representative frames from the dataset are shown in
Fig. 5.1. The dataset was divided roughly equally into a training and testing set of video
sequences. The training sequences were tracked and stabilized so that the figures appear
in the center of the field of view. This was done using a combination of techniques,
including the Dalal-Triggs detector.
The SVM approach was first tested on the dataset. Once the optical flow was computed on
every frame in the training set, six linear SVMs were built; one for each class, as described
in Sec. 4.1. In building the negative training set for a particular action class, feature vectors
Figure 5.1: Representative frames from the KTH action dataset
corresponding to frames that had no person in them were included, along with feature
vectors from frames in the other five action classes. Given all six SVMs, a new video
sequence was converted to a sequence of optical flows and classified by a majority vote
taken across all frames. The confusion matrix for the KTH dataset using the SVM approach
is shown in Table 5.1.
Table 5.1: Confusion matrix for the KTH dataset using the SVM approach (overall
accuracy = 80.5%)
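The per-frame classification and majority vote described above can be sketched as follows. This is a simplified numpy version; the weight and feature layouts are hypothetical, each frame votes for the class whose one-vs-rest SVM gives it the highest decision value, and ties are broken by the lowest class index.

```python
import numpy as np

def classify_sequence(frame_features, svm_weights, svm_biases):
    """Label a video by majority vote over per-frame linear-SVM decisions.

    frame_features: (T, D) array, one optical-flow feature vector per frame
    svm_weights:    (K, D) array, one weight vector per action class
    svm_biases:     (K,)   array of bias terms
    Returns the index of the winning action class.
    """
    # per-frame decision values for all K one-vs-rest SVMs
    scores = frame_features @ svm_weights.T + svm_biases   # (T, K)
    per_frame_labels = np.argmax(scores, axis=1)           # each frame votes
    votes = np.bincount(per_frame_labels, minlength=svm_weights.shape[0])
    return int(np.argmax(votes))
```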
We also tested our joint model on the KTH dataset. A visual vocabulary was constructed
as discussed in Sec. 4.2.1. We built an HMM for each of the six action classes and ran
inference on every video sequence in the test set. As described in Sec. 4.3.2.2, a sequence
was assigned the action corresponding to the highest scoring HMM. The confusion matrix
using 255 codewords and optical flow features (Sec. 4.2.2.1) with a Gaussian emission model
is shown in Table 5.2. The confusion matrix using 255 codewords and visual word SVM
features (Sec. 4.2.2.2) with a Gaussian emission model is shown in Table 5.3.
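Scoring a test sequence under each action HMM amounts to running the forward algorithm in log space and taking the arg max. The sketch below assumes spherical-Gaussian emissions with a shared scalar variance for simplicity; the actual emission model of Sec. 4.2.2 may differ, and all parameters here are hypothetical.

```python
import numpy as np

def log_likelihood(obs, log_pi, log_A, means, var):
    """Log-likelihood of a sequence under one HMM with spherical-Gaussian
    emissions, computed by the forward algorithm in log space.

    obs:    (T, D) observed feature vectors
    log_pi: (K,)   log initial state probabilities (K visual words)
    log_A:  (K, K) log transition matrix, log_A[i, j] = log P(j | i)
    means:  (K, D) per-state emission means; var: shared scalar variance
    """
    T, D = obs.shape
    # log N(obs[t] | means[k], var * I) for every frame t and state k
    sq = ((obs[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)   # (T, K)
    log_b = -0.5 * (sq / var + D * np.log(2 * np.pi * var))
    alpha = log_pi + log_b[0]
    for t in range(1, T):
        # log-sum-exp over previous states, shifted by the max for stability
        m = alpha.max()
        alpha = m + np.log(np.exp(alpha - m) @ np.exp(log_A)) + log_b[t]
    m = alpha.max()
    return m + np.log(np.exp(alpha - m).sum())

def classify(obs, hmms):
    """Pick the action whose HMM assigns the sequence the highest likelihood."""
    return int(np.argmax([log_likelihood(obs, *h) for h in hmms]))
```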
       box   clap  wave  jog   run   walk
box    0.98  0.01  0.01  0.00  0.00  0.00
clap   0.01  0.96  0.03  0.00  0.00  0.00
wave   0.02  0.03  0.95  0.00  0.00  0.00
jog    0.00  0.00  0.00  0.72  0.25  0.03
run    0.00  0.00  0.00  0.23  0.74  0.03
walk   0.00  0.00  0.00  0.04  0.03  0.93
Table 5.2: Confusion matrix for the KTH dataset using our joint model and optical flow
features (overall accuracy = 88%)
Table 5.3: Confusion matrix for the KTH dataset using the joint model and visual word
SVM based features (overall accuracy = 89.67%)
Table 5.4 shows how the methods introduced in this thesis perform on the KTH dataset
compared to other action recognition methods. It should be noted that it is not entirely
clear how the other methods split the training and testing sets. The methods introduced in
this thesis perform reasonably well on the KTH dataset given that pre-tracking is not assumed.
Fig. 5.3 shows consecutive frames from KTH running sequences. Bounding boxes represent
the estimated location of the person. As seen in the figure, both the SVM approach and
the joint model do rather well in tracking a wide variety of poses and localizing motion in
a sequence of frames.
Figure 5.2: The frames in the top row are from a jogging sequence; the frames in the
bottom row are from a running sequence. These actions are similar to one another.
From the confusion matrices, it is apparent that the “boxing” action has the highest
accuracy. This is because it is unlike any of the other five actions. For example, “waving”
and “clapping” can be quite similar to one another because in many cases, both actions start
out the same way, with the hands of the person stretched out by his/her side. Similarly, the
“walking”, “running” and “jogging” actions are also similar since they all involve movement
of the legs. Specifically, the running and jogging actions are often misclassified as one
another. This is reasonable, since running and jogging are similar actions and can be hard to
tell apart. Fig. 5.2 shows a few consecutive frames from a running and a jogging sequence.
(a) boxing sequence
Figure 5.3: The persons in (a) and (b) were tracked (and their actions classified) using the
SVM approach. Persons in (c) and (d) were tracked using the joint model and optical flow
features. Persons in (e) and (f) were tracked using the joint model and visual word SVM
based features.
Figure 5.4: Representative frames from the UCF Sports Action Dataset
The UCF (University of Central Florida) vision lab collected a set of eight actions from
various sports featured on channels such as BBC and ESPN. Actions in this dataset include
golf swings, diving, kicking a soccer ball, weight lifting, horse riding, running, skating and
swingbenching. Representative frames of this dataset are shown in Fig. 5.4. The original
dataset consists of some frames in which more than one person appears, unlike the KTH
dataset which consists of only one person per frame. The dataset also includes cropped
versions of the frames, containing only the person of interest in the center. The dataset was
divided into training and testing sets of video sequences.
To test the SVM approach on this dataset, eight SVMs, one for each action, were built in
the training phase. The negative dataset for a particular action consisted of feature vectors
from frames in the other seven actions whereas the positive dataset consisted of feature
vectors from frames in the action class. Classification of a new sequence of frames was done
as described in Sec. 4.1. The confusion matrix for the UCF dataset using the SVM approach
is shown in Table 5.5.
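The one-vs-rest training setup described above can be sketched with scikit-learn's LinearSVC. The library choice is an assumption for illustration; the thesis does not name the SVM package it used.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_one_vs_rest(features_by_action):
    """Train one linear SVM per action. For action k, frames of action k are
    the positives and frames of every other action are the negatives.

    features_by_action: dict mapping action name -> (N_k, D) feature array
    Returns a dict of fitted LinearSVC models, one per action.
    """
    models = {}
    for action, pos in features_by_action.items():
        # negatives: feature vectors from all other action classes
        neg = np.vstack([f for a, f in features_by_action.items() if a != action])
        X = np.vstack([pos, neg])
        y = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
        models[action] = LinearSVC(C=1.0).fit(X, y)
    return models
```

For the KTH experiments, the negative set would additionally include feature vectors from person-free frames, as described in the text.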
We built eight HMMs, one for each class, to test our joint model for tracking and recognition
on the UCF dataset. The confusion matrix for the UCF dataset using a vocabulary of 45
visual words, optical flow features and a Gaussian emission model is shown in Table 5.6.
Finally, the confusion matrix for the UCF dataset using the joint model, a vocabulary
of 45 words and visual word SVM based feature vectors is shown in Table 5.7.
            golf swing  dive  kick  lift  horse ride  run   skate  swingbench
golf swing     0.80     0.00  0.01  0.00     0.06     0.02  0.08      0.03
dive           0.01     0.81  0.00  0.06     0.00     0.02  0.02      0.08
kick           0.00     0.01  0.67  0.01     0.03     0.20  0.06      0.02
lift           0.00     0.06  0.02  0.76     0.05     0.00  0.00      0.11
horse ride     0.05     0.02  0.02  0.03     0.85     0.01  0.01      0.01
run            0.01     0.01  0.22  0.00     0.01     0.68  0.05      0.02
skate          0.06     0.01  0.08  0.01     0.00     0.06  0.77      0.01
swingbench     0.02     0.06  0.02  0.01     0.00     0.00  0.00      0.80
Table 5.5: Confusion matrix for the UCF dataset using the SVM approach (overall accuracy
= 76.75%)
            golf swing  dive  kick  lift  horse ride  run   skate  swingbench
golf swing     0.89     0.00  0.00  0.00     0.03     0.01  0.06      0.01
dive           0.01     0.88  0.00  0.04     0.00     0.02  0.01      0.04
kick           0.00     0.00  0.72  0.01     0.02     0.18  0.05      0.02
lift           0.00     0.04  0.02  0.85     0.02     0.00  0.00      0.07
horse ride     0.02     0.00  0.01  0.01     0.93     0.01  0.01      0.01
run            0.00     0.01  0.19  0.01     0.01     0.74  0.03      0.01
skate          0.06     0.00  0.07  0.01     0.00     0.05  0.81      0.01
swingbench     0.00     0.03  0.01  0.06     0.00     0.00  0.00      0.90
Table 5.6: Confusion matrix for the UCF dataset using the joint model with optical flow
based features (overall accuracy = 84%)
            golf swing  dive  kick  lift  horse ride  run   skate  swingbench
golf swing     0.91     0.00  0.00  0.00     0.02     0.01  0.05      0.01
dive           0.01     0.90  0.00  0.03     0.00     0.02  0.00      0.04
kick           0.00     0.00  0.76  0.00     0.02     0.17  0.03      0.02
lift           0.00     0.04  0.01  0.88     0.01     0.00  0.00      0.06
horse ride     0.01     0.00  0.01  0.01     0.94     0.01  0.01      0.01
run            0.00     0.00  0.16  0.01     0.01     0.78  0.03      0.01
skate          0.04     0.00  0.04  0.00     0.00     0.05  0.86      0.01
swingbench     0.00     0.02  0.00  0.05     0.00     0.00  0.00      0.93
Table 5.7: Confusion matrix for the UCF dataset using the joint model and visual
word SVM based features (overall accuracy = 87.1%)
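Since these confusion matrices are row-normalized and the classes are roughly balanced, the overall accuracy can be approximated as the mean of the diagonal. As a quick sanity check against the figure reported for Table 5.7 (the exact value depends on the per-class sequence counts, which are not listed here):

```python
import numpy as np

# Row-normalized confusion matrix from Table 5.7 (joint model + visual word
# SVM features on the UCF dataset); rows are true classes, columns predictions.
C = np.array([
    [0.91, 0.00, 0.00, 0.00, 0.02, 0.01, 0.05, 0.01],
    [0.01, 0.90, 0.00, 0.03, 0.00, 0.02, 0.00, 0.04],
    [0.00, 0.00, 0.76, 0.00, 0.02, 0.17, 0.03, 0.02],
    [0.00, 0.04, 0.01, 0.88, 0.01, 0.00, 0.00, 0.06],
    [0.01, 0.00, 0.01, 0.01, 0.94, 0.01, 0.01, 0.01],
    [0.00, 0.00, 0.16, 0.01, 0.01, 0.78, 0.03, 0.01],
    [0.04, 0.00, 0.04, 0.00, 0.00, 0.05, 0.86, 0.01],
    [0.00, 0.02, 0.00, 0.05, 0.00, 0.00, 0.00, 0.93],
])

# With roughly balanced classes, overall accuracy ~ mean of the diagonal.
print(round(C.diagonal().mean(), 3))  # 0.87, close to the reported 87.1%
```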
From the confusion matrices, it seems that there is some confusion between the “kick” and
“run” actions. Given that these actions can be similar (for example, they both involve a wide
stance), this is quite reasonable. Otherwise, both the SVM and HMM approaches seem to be
doing reasonably well, even on complex actions like diving. The joint model in combination
with the visual word SVM based feature set, in particular, does well on this dataset. Fig. 5.5
shows several sequences from the UCF action dataset with bounding boxes around detected
motion patterns. Again, both the SVM approach and the joint model seem to track the
person of interest well.
We are only aware of one other method that uses the UCF action dataset for testing; [11]
achieves an accuracy of 69.2% on it. Although our method performs substantially better
than [11], it should be noted that a few actions mentioned in [11], such as pole vaulting
and swing-baseball, do not appear in our version of the dataset. It may be that the publicly
available dataset is a simpler version of the one used in [11].
(a) lifting sequence
Figure 5.5: The persons in (a) and (b) were tracked (and their actions classified) using the
SVM approach. (c) and (d) were tracked using the joint model with an optical flow based
feature set. (e), (f), (g) and (h) were tracked using the joint model with visual word SVM
based features.
Chapter 6
In this thesis, we tried two different methods for tackling the action recognition problem.
From the results on the KTH and UCF action datasets, it is apparent that the hidden
Markov model approach outperformed the Support Vector Machine based approach. Intu-
itively, this makes sense because the pose of a person in a frame is certainly dependent on
the person’s pose in previous frames. This suggests that the order in which frames appear
in a video sequence carries useful information for recognition.
This thesis introduced a new method by which tracking and recognition can be done simul-
taneously. It tested rather well on the KTH and UCF action datasets; the next natural step
would be to test it on video sequences with more complex motion patterns, such as YouTube
videos. So far, only optical flow has been used as a feature set, but as motion patterns get
more obscure, it may be worth exploring other feature sets and using them in conjunction
with optical flow.
Throughout our work, we assumed that there is only one person of interest in a video
sequence whose actions need to be tracked and recognized. The next step would be to
consider how to extend the framework introduced here to track and recognize multiple
human actions. Another simplifying assumption made in this thesis is that a person can
only perform a single action in a video sequence, whereas in reality, this is hardly ever
the case. It might be worth exploring an extension of the methods introduced here by
considering transitions from one action class to another along with transitions among visual
words.
Finally, we believe that a fully discriminative model such as a structural SVM [20] may
perform better, since both the transition model and the location emission model would then
be trained jointly in a discriminative fashion.
Bibliography
[1] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In IEEE
Conference on Computer Vision and Pattern Recognition, San Diego, USA, 2005, pages
886-893.
[13] K. Murphy. Hidden Markov Model (HMM) Toolbox for Matlab. Software retrieved
from http://www.cs.ubc.ca/~murphyk/Software/HMM/hmm.html.
[18] Y. Wang and G. Mori. Learning a discriminative hidden part model for human action
recognition. In Advances in Neural Information Processing Systems, 2008.
[21] M. Brand. Coupled hidden Markov models for complex action recognition. MIT Media
Lab Vision and Modeling Technical Report TR-407, 1997.
[22] M. Brand, N. Oliver, and A. P. Pentland. Coupled hidden Markov models for complex
action recognition. In IEEE Conference on Computer Vision and Pattern Recognition,
1997, pages 994-999.
[23] N. Oliver, A. Garg, and E. Horvitz. Layered representations for learning and inferring
office activity from multiple sensory channels. Computer Vision and Image Understanding,
vol. 96, no. 2, pp. 163-180, November 2004.
[24] N. Oliver, E. Horvitz, and A. Garg. Layered representations for human activity
recognition. In IEEE International Conference on Multimodal Interfaces, 2002, pages
3-8.
[26] A. D. Wilson and A. F. Bobick. Parametric hidden Markov models for gesture
recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21,
no. 9, pp. 884-900, September 1999.
[29] A. Galata, N. Johnson, and D. Hogg. Learning structured behavior models using
variable length Markov models. In International Workshop on Modelling People, 1999.
[29] A. Galata, N. Johnson, and D. Hogg. Learning behavior models of human activities.
In British Machine Vision Conference, 1999.
[34] B. Epshtein and S. Ullman. Semantic hierarchies for recognizing objects and parts.
In CVPR, 2007.
[35] S. Ioffe and D. Forsyth. Probabilistic methods for finding people. International
Journal of Computer Vision, pages 45-69, June 2001.
Appendices
A Data Sets
In this appendix, we give a brief introduction to the data sets used in this thesis.
The KTH human motion dataset is one of the largest video datasets of human actions. It
contains six types of human actions (walking, jogging, running, boxing, hand waving and
hand clapping) performed several times by 25 subjects both outdoors and indoors, with
different clothes. All sequences were taken over homogeneous backgrounds with a static
camera with 25fps frame rate. The data set is fairly synthetic and does not represent real
world scenarios.
The UCF (University of Central Florida) vision lab collected a set of eight actions from
various sports featured on channels such as BBC and ESPN. Actions in this dataset include
golf swings, diving, kicking a soccer ball, weight lifting, horse riding, running, skating
and swingbenching. The original dataset consists of some frames in which more than one
person appears, unlike the KTH dataset which consists of only one person per frame. This
relatively new data set contains close to 200 video sequences at a resolution of 720x480.
The collection represents a natural pool of actions featured in a wide range of scenes and
viewpoints.