UNIVERSITY OF CALIFORNIA,

IRVINE

A Joint Model for Tracking and Recognizing


Human Actions in Video Sequences

THESIS

submitted in partial satisfaction of the requirements


for the degree of

MASTER OF SCIENCE

in Computer Science

by

Goutham Patnaikuni

Thesis Committee:
Professor Deva Ramanan, Chair
Professor Alexander Ihler
Professor Charless Fowlkes

2009

© 2009 Goutham Patnaikuni
The thesis of Goutham Patnaikuni
is approved and is acceptable in quality and form for
publication on microfilm and in digital formats:

Committee Chair

University of California, Irvine


2009

DEDICATION

I dedicate this thesis to Professor Paul Utgoff, who lost his battle to appendiceal cancer in
October 2008. I learnt of his passing only recently and was deeply saddened to hear about
it. It seems like it was just yesterday when I was in his office discussing homework
assignments for his Artificial Intelligence course. Professor, this is for you, with love and
resolution

TABLE OF CONTENTS

Page

LIST OF FIGURES vi

LIST OF TABLES vii

ACKNOWLEDGMENTS viii

ABSTRACT OF THIS THESIS ix

1 Introduction 1
1.1 Contributions of this thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Organization of this thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Related Work 4
2.1 Local features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Global features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 Other HMM based approaches . . . . . . . . . . . . . . . . . . . . . . . . . 6

3 Theory 8
3.1 Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.1.1 Intuitions behind Margins . . . . . . . . . . . . . . . . . . . . . . . . 9
3.1.2 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.1.3 Functional and geometrical margins . . . . . . . . . . . . . . . . . . 10
3.1.4 The optimal margin classifier . . . . . . . . . . . . . . . . . . . . . . 12
3.1.5 Optimal margin classifiers . . . . . . . . . . . . . . . . . . . . . . . . 13
3.1.6 Multiclass SVMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2 Hidden Markov Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2.1 Transition and Emission probabilities . . . . . . . . . . . . . . . . . 18
3.2.2 Maximum Likelihood for the HMM . . . . . . . . . . . . . . . . . . . 19
3.2.3 The forward-backward algorithm . . . . . . . . . . . . . . . . . . . . 23
3.2.4 Scaling Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2.5 The Viterbi Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 34

4 Approach 37
4.1 The Support Vector Machine Approach . . . . . . . . . . . . . . . . . . . . 37
4.1.1 Optical Flow based HOG descriptors . . . . . . . . . . . . . . . . . . 37
4.1.2 A multiclass SVM framework for action recognition . . . . . . . . . 39
4.1.2.1 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.1.2.2 Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

4.2 The Hidden Markov Model Approach . . . . . . . . . . . . . . . . . . . . . 41
4.2.1 Building a vocabulary of visual words . . . . . . . . . . . . . . . . . 41
4.2.2 A hidden Markov model framework for video sequences . . . . . . . 43
4.2.2.1 Dimensionality reduction of optical flow features . . . . . . 43
4.2.2.2 Visual word SVM based features . . . . . . . . . . . . . . . 44
4.2.2.3 Classifying a new pre-tracked video sequence . . . . . . . . 44
4.3 A Unified model for joint tracking and recognition . . . . . . . . . . . . . . 46
4.3.1 Cross product space of location and visual words . . . . . . . . . . . 46
4.3.2 Joint model for tracking and recognition . . . . . . . . . . . . . . . . 46
4.3.2.1 Bounded velocity motion model . . . . . . . . . . . . . . . 47
4.3.2.2 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.4 Motivation for joint tracking and recognition . . . . . . . . . . . . . . . . . 49
4.4.1 The Tracking problem . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.4.2 Experiment and baseline . . . . . . . . . . . . . . . . . . . . . . . . . 50

5 Results 53
5.1 Action Classification on KTH Dataset . . . . . . . . . . . . . . . . . . . . . 53
5.2 Action Classification on UCF action dataset . . . . . . . . . . . . . . . . . . 58

6 Conclusion and Future Work 62

Bibliography 64

Appendices 67
A Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
A.1 KTH Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
A.2 UCF Action Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

LIST OF FIGURES

Page

3.1 Linear decision boundary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11


3.2 Maximum margin separating hyperplane and support vectors . . . . . . . . 14
3.3 The transition state diagram unfolded over time . . . . . . . . . . . . . . . 20
3.4 Forward recursion for the evaluation of the α variables . . . . . . . . . . . . 26
3.5 Backward recursion for the evaluation of the β variables . . . . . . . . . . . 28
3.6 A graphical depiction of the Viterbi algorithm . . . . . . . . . . . . . . . . . 34

4.1 Visualization of features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38


4.2 Cluster centers from the UCF action dataset . . . . . . . . . . . . . . . . . 42
4.3 An illustration of the joint model . . . . . . . . . . . . . . . . . . . . . . . . 47
4.4 An illustration of the bounded velocity motion model . . . . . . . . . . . . 48
4.5 Case for joint tracking and recognition . . . . . . . . . . . . . . . . . . . . . 51

5.1 Representative frames from the KTH action dataset . . . . . . . . . . . . . 54


5.2 Similarity between running and jogging actions . . . . . . . . . . . . . . . . 56
5.3 KTH sequences with tracks . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.4 Representative frames from the UCF Sports Action Dataset . . . . . . . . . 58
5.5 UCF sequences with tracks . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

LIST OF TABLES

Page

5.1 KTH confusion matrix with SVM approach . . . . . . . . . . . . . . . . . . 54


5.2 KTH confusion matrix using the joint model and optical flow features . . . 55
5.3 KTH confusion matrix using the joint model and visual word SVM based
features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.4 Comparison of different methods on KTH . . . . . . . . . . . . . . . . . . . 55
5.5 UCF confusion matrix using SVM approach . . . . . . . . . . . . . . . . . . 59
5.6 UCF confusion matrix using the joint model with optical flow based features 59
5.7 UCF confusion matrix using the joint model and visual word SVM based
features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

ACKNOWLEDGMENTS

I would like to start off by thanking Professor Deva Ramanan. Working with him for the
past year has been an absolute pleasure. He has been a great source of knowledge and
support and I cannot thank him enough for it. He is, without a doubt, the best advisor I
have ever had. My thanks also go to Professors Ihler and Fowlkes, for their time and help.

This work would not have been possible without the support of my friends here in Irvine. I
would particularly like to thank my good friend Kristian Hermansen, whom I have known
since my days as an undergraduate at the University of Massachusetts, for his constant
support and for being a source of humor at times when I needed it.

I have to thank my family, including my parents, my sister and my brother in law for being
there for me and believing in me. I particularly appreciate all the intellectual conversations
I have had with my brother-in-law about my research. It is good to have someone with a
PhD from Stanford in your family.

ABSTRACT
A Joint Model for Tracking and Recognizing
Human Actions in Video Sequences

By

Goutham Patnaikuni

Master of Science in Computer Science

University of California, Irvine, 2009

Professor Deva Ramanan, Chair

In this thesis, we propose a new method for human activity recognition from video sequences

using a combination of discriminative and generative models. We present two different

approaches - one based solely on Support Vector Machines and another in which the video

sequence is modelled as a Markov process. The major difference between our methods and

previous work done in action recognition is that we do away with the assumption that

the persons in a video sequence are tracked prior to the recognition process, and instead

combine the tracking and recognition problems into one. We believe that this is not only

a much more reasonable approach to action recognition, but also that combining the two

problems increases the accuracy of tracking.
Chapter 1

Introduction

Recognizing human actions in video sequences is a challenging problem in computer vision.

It has applications in areas such as human computer interaction, surveillance, searching

video databases and automatic tagging of videos on sites such as YouTube. Various visual

cues, based on motion and shape, have been used for action recognition. In this thesis, the

focus will be on recognizing the actions in video sequences using motion cues. Specifically,

optical flow will be used as a feature set. We discuss two different methods for solving the

action recognition problem; one of them extends the SVM based framework introduced in

Dalal et al. [1]. In this setting, a video sequence, broken down into a sequence of frames, is

essentially treated as a “bag of words”, i.e., the order of the frames is inconsequential. The

second method models the sequence of frames in a video as a Markov process, meaning that

the sequence in which frames occur in a video is taken into consideration.

1.1 Contributions of this thesis

Many interesting approaches have been proposed to solving the action recognition problem.

Although some of these methods have reported some impressive results, most of them

suffer from the same weakness - they assume that tracking of the human figure is done

prior to recognition. In most cases, an external module is used to localize the motion of the

human figure in each frame of a video sequence. A common technique used is background

subtraction. In background subtraction, the goal is to identify moving objects from the

portion of a video frame that differs significantly from a background model. Although it

works well in cases where the background is static, background subtraction is not known to

perform well when the scene consists of complex, non-static backgrounds.

Another popular out-of-the-box module is the Histogram of Oriented Gradient (HOG)

based classifier, developed by Navneet Dalal and Bill Triggs and commonly known as the

Dalal-Triggs detector. The detector has a single fixed template which determines whether

a given image pattern corresponds to a person. The Dalal-Triggs detector was originally

trained to detect pedestrians in images and is not meant to detect a wide variety of human

poses. This means that while it may accurately locate a person standing in an upright pose

in an image, it may not be able to detect a person in a crouching or sitting position.

While the above two methods can be used for localizing motion in simple scenarios, com-

plex ones require a more intuitive approach. In such situations, the kind of action being

performed, and not global composition of a frame, should provide the primary cue for lo-

calization and recognition of motion patterns. In the case of action recognition, the ability

to perform robust tracking prior to recognition cannot be assumed in general.

The main contribution of this thesis is to address the issues described above. In this thesis,

this problem is tackled by building a family of templates, one for each motion pattern,

instead of working with a single template for localizing motion. A motion pattern could

either be related to an action class or a “word” from a visual vocabulary. We believe that

having a family of templates will enable us to detect a wide array of human poses and will

therefore lead to a better track, which in turn will increase the accuracy with which an action

can be predicted. We later describe a method in which these templates are incorporated

into a hidden Markov model based approach for solving the action recognition problem. We

introduce a novel joint model for tracking and recognizing actions in video sequences.

1.2 Organization of this thesis

The thesis is organized as follows. Chap 2 provides a brief overview of related work in

human activity recognition. Chap 3 goes into detail about the machinery of Support Vector

Machines and Hidden Markov Models which are the basis for the methods used in this

thesis for recognizing human actions. Chap 4 discusses these methods in detail, specifically

describing how the tracking and recognition problems are combined into one. Chap 5

discusses the results of testing the methods described in Chap 4 on two datasets. Chap 6

concludes this thesis with a summary and a discussion of possibilities for extensions.

Chapter 2

Related Work

The literature on action recognition is quite rich. We avoid an in-depth review of all methods

and refer the reader to [30]. Instead, we focus on related work that is most applicable

to the approach we pursue. As mentioned before, different visual cues are used to detect

human actions. This section will concentrate on methods that use motion based cues. The

methods we discuss fall into two broad categories - ones that consider global level features

and others that try to capture and interpret local features.

2.1 Local features

Several recent approaches have concentrated on capturing local level features in video and

using them to understand the underlying motion in video sequences. The motivation for

doing so was to overcome some of the limitations of the methods that only considered motion

on a global scale, such as the inability to deal with multiple moving objects and variations

in background. Several local features for video have been proposed recently, one of which is

space-time interest points [4]. In images, points with significant variation in local intensities

are considered to be of interest and are called spatial interest points. Space-time interest

points are an extension of spatial interest points and are meant to correspond to interesting

events in video data. Niebles et al. [3] model a video sequence as a collection of

spatial-temporal words by extracting space-time interest points from video sequences. Probability

distributions of the spatial-temporal words and intermediate topics corresponding to human

action categories are automatically learnt using a probabilistic Latent Semantic Analysis

(pLSA)[15] model. Using this model, a new video sequence is categorized and the motions

in it localized. Lindeberg et al. [6] introduce several local space-time descriptors associated

with space-time interest points and use them for recognition of spatio-temporal events and

activities. Caputo et al. [7] combine space-time features and SVMs and use the resulting

approach to classify human actions.

Part based models are also increasingly being used in action recognition. This trend is

partly inspired by the success of these models in object detection. In [32], a discriminatively

trained, multi-scale, deformable part model is used to build models for people and objects

such as cars, bottles, and couches. [33, 34, 35] also use part based models for both human

and object recognition. In [18], a discriminative part-based approach based on hidden

conditional random fields is used for human action recognition.

2.2 Global features

Several global features have been used for action recognition in the past.

One commonly used feature is optical flow, which is an approximation of motion between

temporally adjacent frames. Efros et al. [9] build motion descriptors based on optical flow

and use these in a k-nearest neighbor framework to classify actions. Wang et al. [2] use the

same descriptor to build a visual vocabulary and represent a video as a bag of visual words.

They later use this representation to build a model based on latent Dirichlet allocation

(LDA)[5]. [18] also uses optical flow to model human actions as a flexible constellation of

parts. In order to avoid explicit computation of optical flow, a number of template-based

methods attempt to capture the underlying motion similarity amongst videos of a given

action class. Shechtman and Irani [19] avoid explicit flow computations by employing a

rank-based constraint directly on the intensity information of spatio-temporal cuboids to

enforce consistency between a template and a target.

Rodriguez and Ahmed[11] introduce a template-based method for recognizing human ac-

tions called Action MACH based on a Maximum Average Correlation Height (MACH)

filter. They address the common limitations of template-based methods in generating a

single template for an action by synthesizing a single Action MACH filter for a given action

class. These Action MACH filters combine the training sequences of an action class into a

single composite template. These templates are then correlated with testing sequences in

the frequency domain via a fast Fourier transform (FFT). Once an Action MACH filter is synthesized,

similar actions in a testing video sequence are detected by applying the action MACH filter

to the video.

In this thesis, we use optical flow as a feature set. Although it does not explicitly capture

local interest points in video, localization is offered to some degree by using optical flow in

conjunction with the framework introduced in Dalal et al. [1].

2.3 Other HMM based approaches

Using HMMs for action recognition is very common. Typically, the hidden state is an

activity to be inferred, and observations are image measurements. Yamato et al.[20] describe

recognizing tennis strokes with HMMs. Wilson and Bobick[21] describe the use of HMMs

for recognizing gestures such as pushes. Yang et al.[22] use HMMs to recognize handwriting

gestures.

In order to simplify the training process of learning the state transition matrix, there has

been a great deal of interest in models obtained by modifying a basic activity-state HMM.

Variations include a coupled HMM (CHMM) [21, 22], a layered HMM (LHMM) [23, 24,

25], a parametric HMM (PHMM) [26], an entropic HMM (EHMM) [27], and variable length

Markov models (VLMM) [28, 29].

In this thesis, we use HMMs to infer activities using optical flow based feature sets as

observations. Later on, we build a joint model for tracking and recognizing actions in

video.

Chapter 3

Theory

This chapter discusses support vector machines and hidden Markov models in detail. Both

of these constitute the machinery used to classify video sequences.

3.1 Support Vector Machines

Support Vector Machines (SVMs for short) are known to be among the best “off-the-shelf”

supervised learning algorithms. SVMs are used in solving problems such as text catego-

rization, hand-written character recognition, image classification and in this case, action

recognition. A Support Vector Machine is typically a binary classifier (although it can be

extended to a multiclass classifier) which performs classification by constructing an N-1

dimensional hyperplane that optimally separates N dimensional data into two categories.

Input data fed into an SVM can be viewed as two sets of vectors in an N dimensional space.

An SVM will construct a separating hyperplane in that space, one which maximizes the

margin between the two data sets.

3.1.1 Intuitions behind Margins

The intuition behind margins can be best explained by considering logistic regression. In

logistic regression, the probability p(y = 1 | x; θ) is modeled by h_θ(x) = g(θ^T x). When h_θ(x)

≥ 0.5, or equivalently θ^T x ≥ 0, the label “1” is predicted. For a positive training example

(y = 1), the larger θ^T x is, the larger h_θ(x) is, and thus the higher the degree of “confidence”

that the label is 1. The prediction can be thought of as a “confident” one that y = 1 if θ^T x

≫ 0. Similarly, logistic regression can be thought of as making a confident prediction that y

= 0 if θ^T x ≪ 0. Given a training set, a good fit is obtained if a θ can be found such that

θ^T x^(i) ≫ 0 whenever y^(i) = 1, and θ^T x^(i) ≪ 0 whenever y^(i) = 0, since this would reflect a

very confident set of classifications for all the training examples. For points that are very far

away from the separating hyperplane, a prediction can be made rather confidently. On the

other hand, for a point that is very close to the hyperplane, a confident prediction may not

be possible because even a small change in the separating hyperplane could easily change

the prediction.
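
The link between the size of θ^T x and the confidence of a prediction can be illustrated with a minimal Python sketch. The parameter vector and the two query points below are hypothetical values chosen only for illustration; they are not taken from the experiments in this thesis.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_with_confidence(theta, x):
    """Return (predicted label, modeled probability that y = 1)."""
    z = sum(t * xi for t, xi in zip(theta, x))  # theta^T x
    p = sigmoid(z)
    return (1 if p >= 0.5 else 0), p


theta = [2.0, -1.0]   # hypothetical parameters
near = [0.1, 0.1]     # theta^T x = 0.1, close to the decision boundary
far = [5.0, -3.0]     # theta^T x = 13, far from the decision boundary

label_near, p_near = predict_with_confidence(theta, near)
label_far, p_far = predict_with_confidence(theta, far)
```

Both points receive the label 1, but the modeled probability for the point near the boundary is only slightly above 0.5, whereas for the distant point it is very close to 1.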

3.1.2 Notation

Consider a linear classifier for a binary classification problem with labels y and features

x, where y ∈ {-1, 1} (instead of {0, 1}). Using parameters w and b, the classifier can be written as

\[ h_{w,b}(x) = g(w^T x + b) \tag{3.1} \]

Here, g(z) = 1 if z ≥ 0, and g(z) = -1 otherwise. The “w, b” notation allows the intercept

term b to be treated separately from other parameters.

3.1.3 Functional and geometrical margins

Given a training example (x^(i), y^(i)), the functional margin of (w, b) with respect to the

training example can be defined as

\[ \hat{\gamma}^{(i)} = y^{(i)} \left( w^T x^{(i)} + b \right) \tag{3.2} \]

If y^(i) = 1, then for the functional margin to be large (for the prediction to be confident and

correct), w^T x^(i) + b needs to be a large positive number. Conversely, if y^(i) = -1, then for

the functional margin to be large, w^T x^(i) + b needs to be a large negative number. Moreover,

if y^(i) (w^T x^(i) + b) > 0, then the prediction on this example is correct. A large functional

margin therefore represents a confident and correct prediction.

For a linear classifier with the above choice of g, if w and b were to be replaced with 2w and

2b respectively, since g(w^T x + b) = g(2w^T x + 2b), this would not change h_{w,b}(x) at all.

However, replacing (w, b) with (2w, 2b) results in multiplying the functional margin by a

factor of 2. This means that by scaling w and b, the functional margin can be made arbitrarily

large without changing anything meaningful. A normalization condition such as ||w||_2 = 1

removes this ambiguity; that is, (w, b) can be replaced by (w/||w||_2, b/||w||_2). Given a training

set S = {(x^(i), y^(i)); i = 1, ..., m}, the functional margin of (w, b) with respect to S is defined

as the smallest of the functional margins of the individual training examples. Denoted by γ̂,

this can be written as:

\[ \hat{\gamma} = \min_{i=1,\dots,m} \hat{\gamma}^{(i)} \tag{3.3} \]

Figure 3.1: Linear decision boundary

In Fig 3.1, the decision boundary corresponding to (w, b) is shown, along with the vector

w. It should be noted that w is orthogonal to the separating hyperplane. In the figure, the

distance of point A from the decision boundary, γ^(i), is given by line segment AB.

w/||w|| is a unit length vector pointing in the same direction as w. Since A represents x^(i),

the point B is given by x^(i) − γ^(i) · w/||w||. But this point lies on the decision boundary, and all

points x on the decision boundary satisfy the equation w^T x + b = 0. Hence,

\[ w^T \left( x^{(i)} - \gamma^{(i)} \frac{w}{\|w\|} \right) + b = 0 \tag{3.4} \]

Solving for γ^(i) yields

\[ \gamma^{(i)} = \frac{w^T x^{(i)} + b}{\|w\|} = \left( \frac{w}{\|w\|} \right)^T x^{(i)} + \frac{b}{\|w\|} \tag{3.5} \]

More generally, for both negative and positive examples, the geometric margin of (w, b) with

respect to a training example (x^(i), y^(i)) is defined to be

\[ \gamma^{(i)} = y^{(i)} \left( \left( \frac{w}{\|w\|} \right)^T x^{(i)} + \frac{b}{\|w\|} \right) \tag{3.6} \]

It should be noted that if ||w|| = 1, then the functional margin is equal to the geometric

margin. The geometric margin is invariant to rescaling of the parameters; i.e., if w and b are

replaced by 2w and 2b, the geometric margin does not change. Because of this invariance to

the scaling of the parameters, when trying to fit w and b to the training data, an arbitrary

scaling constraint on w can be imposed without changing anything important. Given a

training set S = {(x^(i), y^(i)); i = 1, ..., m}, the geometric margin of (w, b) with respect to S can

be defined as the smallest of the geometric margins on the individual training examples:

\[ \gamma = \min_{i=1,\dots,m} \gamma^{(i)} \tag{3.7} \]
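
The difference between the two margins under rescaling can be checked numerically. The following is a minimal Python sketch, assuming a small hypothetical two-dimensional training set and a hand-picked (w, b); the function names are ours and purely illustrative.

```python
import math


def functional_margin(w, b, X, y):
    """Smallest y_i * (w^T x_i + b) over the training set (Eqs. 3.2 and 3.3)."""
    return min(
        yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        for xi, yi in zip(X, y)
    )


def geometric_margin(w, b, X, y):
    """Functional margin divided by ||w|| (Eqs. 3.6 and 3.7)."""
    norm = math.sqrt(sum(wj * wj for wj in w))
    return functional_margin(w, b, X, y) / norm


# Hypothetical, linearly separable toy data.
X = [[2.0, 2.0], [3.0, 0.0], [-1.0, -1.0], [0.0, -3.0]]
y = [1, 1, -1, -1]

w, b = [3.0, 4.0], 1.0                     # ||w|| = 5
w2, b2 = [2.0 * wj for wj in w], 2.0 * b   # the same hyperplane, rescaled by 2

fm, gm = functional_margin(w, b, X, y), geometric_margin(w, b, X, y)
fm2, gm2 = functional_margin(w2, b2, X, y), geometric_margin(w2, b2, X, y)
```

Rescaling (w, b) by 2 doubles the functional margin (from 6 to 12 here), while the geometric margin stays at 1.2.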

3.1.4 The optimal margin classifier

Given a training set, it seems that it is natural to try and find a decision boundary that

maximizes the geometric margin, since this would reflect a very confident set of predictions

on the training set and a good fit to the training data. This will result in a classifier

that separates the positive training examples from the negative training examples with a

gap equal to the geometric margin.

Assuming that the training data is linearly separable, i.e., it is possible to separate the

positive and negative examples using a hyperplane, the question is how to find one that

achieves maximum geometric margin. The following optimization problem can be posed:

\[
\begin{aligned}
\max_{\gamma, w, b} \quad & \gamma \\
\text{s.t.} \quad & y^{(i)} \left( w^T x^{(i)} + b \right) \ge \gamma, \quad i = 1, \dots, m \\
& \|w\| = 1
\end{aligned}
\]

The objective is to maximize γ, subject to each training example having functional margin

at least γ. The ||w|| = 1 constraint ensures that the functional margin equals the geometric

margin, so it is guaranteed that all geometric margins are at least γ. Thus, solving this

problem will result in (w,b) with the largest possible geometric margin with respect to the

training set.

The ||w|| = 1 constraint is non-convex, which makes the problem a hard one to solve because it

cannot directly be plugged into a standard optimization algorithm; the answer is to transform the problem:

\[
\begin{aligned}
\max_{\hat{\gamma}, w, b} \quad & \frac{\hat{\gamma}}{\|w\|} \\
\text{s.t.} \quad & y^{(i)} \left( w^T x^{(i)} + b \right) \ge \hat{\gamma}, \quad i = 1, \dots, m
\end{aligned}
\]

Here, γ̂/||w|| is maximized, subject to the functional margins all being at least γ̂. Since the

functional and geometric margins are related by γ = γ̂/||w||, this is the desired answer. The

difficult constraint ||w|| = 1 no longer has to be dealt with. On the other hand, the objective

γ̂/||w|| is itself non-convex, so there is still no off-the-shelf software that optimizes it directly.

Using the ability to add an arbitrary scaling constraint on w and b without changing anything,

the scaling constraint that the functional margin of (w, b) with respect to the training set

must be 1 can be introduced, i.e., γ̂ = 1. Plugging this into the problem above and

noting that maximizing γ̂/||w|| = 1/||w|| is the same as minimizing ||w||^2 results in the following

optimization problem:

\[
\begin{aligned}
\min_{w, b} \quad & \frac{1}{2} \|w\|^2 \\
\text{s.t.} \quad & y^{(i)} \left( w^T x^{(i)} + b \right) \ge 1, \quad i = 1, \dots, m
\end{aligned}
\]

The problem can now be solved efficiently. The problem above is an optimization problem

with a convex quadratic objective and linear constraints. Its solution gives the optimal margin

classifier. This can be solved using quadratic programming and Lagrange multipliers.
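
As a small worked illustration (a toy problem assumed here for exposition, not one of the experiments in this thesis), consider two training points in the plane: x^(1) = (1, 0) with y^(1) = +1 and x^(2) = (−1, 0) with y^(2) = −1. Writing w = (w_1, w_2), the quadratic program with unit functional margin becomes:

```latex
\begin{aligned}
\min_{w,b} \quad & \tfrac{1}{2}\left(w_1^2 + w_2^2\right) \\
\text{s.t.} \quad & w_1 + b \ge 1, \qquad w_1 - b \ge 1 .
\end{aligned}
% Adding the two constraints gives 2 w_1 \ge 2, i.e. w_1 \ge 1, so the
% objective is at least 1/2, attained at w = (1, 0).
% With w_1 = 1 the constraints read b \ge 0 and b \le 0, hence b = 0.
% The resulting geometric margin is 1 / \|w\| = 1.
```

The optimal hyperplane is therefore the line x_1 = 0, which lies at distance 1 from each of the two points; both constraints hold with equality at the solution.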

3.1.5 Optimal margin classifiers

The constraints of the optimization problem above can be written as

\[ g_i(w) = -y^{(i)} \left( w^T x^{(i)} + b \right) + 1 \le 0. \]

There is one such constraint for each training example. From the KKT conditions, the

only training examples with Lagrange multipliers α_i > 0 are the ones that have functional margins equal to one

(the ones corresponding to constraints that hold with equality, g_i(w) = 0). In Fig 3.2, the

maximum margin separating hyperplane is shown by the solid line.

Figure 3.2: Maximum margin separating hyperplane and support vectors

The points with the smallest margins are exactly the ones closest to the decision boundary.

In this case, only three points (one negative and two positive examples) lie on the dashed

lines parallel to the decision boundary. This means that only three α_i's will be non-zero at

the optimal solution. These three points are called the support vectors. The number of

support vectors can be much smaller than the size of the training set. Constructing the Lagrangian

for the optimization problem:

\[ L(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{m} \alpha_i \left[ y^{(i)} \left( w^T x^{(i)} + b \right) - 1 \right] \tag{3.8} \]

It should be noted that there are only α_i and no β_i Lagrange multipliers, since the problem

only has inequality constraints.

To find the dual form of the problem, L(w, b, α) has to be minimized with respect to w

and b (for fixed α), to get the dual objective θ_D(α). This can be done by setting the derivatives of L with respect

to w and b to 0:

\[ \nabla_w L(w, b, \alpha) = w - \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)} = 0 \tag{3.9} \]

This implies that

\[ w = \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)} \tag{3.10} \]

Setting the derivative with respect to b to zero gives

\[ \frac{\partial}{\partial b} L(w, b, \alpha) = -\sum_{i=1}^{m} \alpha_i y^{(i)} = 0 \tag{3.11} \]

Taking the definition of w in Equation (3.10) and plugging it back into the Lagrangian in

Equation (3.8):

\[ L(w, b, \alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j (x^{(i)})^T x^{(j)} - b \sum_{i=1}^{m} \alpha_i y^{(i)} \tag{3.12} \]

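The above equation was obtained by minimizing L with respect to w and b. By Equation (3.11),

its last term is zero. Putting this together with the constraints α_i ≥ 0 and the constraint (3.11)

leads to the following dual problem:

\[
\begin{aligned}
\max_{\alpha} \quad & W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)} \rangle \\
\text{s.t.} \quad & \alpha_i \ge 0, \quad i = 1, \dots, m \\
& \sum_{i=1}^{m} \alpha_i y^{(i)} = 0
\end{aligned}
\]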

In the dual problem above, the parameters of the maximization problem are the α_i's. Once

the dual problem above has been solved, the optimal w can be found as

a function of the α's using Equation (3.10). Having found w*, the optimal value for the intercept

term b can be found as:

\[ b^* = -\frac{\max_{i: y^{(i)} = -1} w^{*T} x^{(i)} + \min_{i: y^{(i)} = 1} w^{*T} x^{(i)}}{2} \tag{3.13} \]

Equation (3.10) also gives the optimal value of w in terms of the optimal value of α. If a

prediction at a new point x must be made, w^T x + b can be calculated and y = 1 can be

predicted if this quantity is bigger than zero. But by using Equation (3.10), this quantity

can also be written as:

\begin{align}
w^T x + b &= \left( \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)} \right)^T x + b \tag{3.14} \\
&= \sum_{i=1}^{m} \alpha_i y^{(i)} \langle x^{(i)}, x \rangle + b \tag{3.15}
\end{align}

Once the α_i's are found, only a quantity that depends on the inner products between x and

the points in the training set has to be calculated. Many of the terms in the sum

above will be zero because the α_i's are all zero except for the support vectors, so only

the inner products between x and the support vectors have to be calculated.
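
To make the roles of Equations (3.10), (3.13) and (3.15) concrete, the following Python sketch recovers w and b from a given dual solution and evaluates the decision value at a new point. The three training points and the multipliers α = (0.5, 0.5, 0) are a hand-solved toy case assumed for illustration; no solver is invoked, and the helper names are ours.

```python
def dot(u, v):
    """Inner product <u, v> of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def recover_w(alpha, X, y):
    """Eq. (3.10): w = sum_i alpha_i * y_i * x_i."""
    dim = len(X[0])
    return [
        sum(a * yi * xi[d] for a, xi, yi in zip(alpha, X, y))
        for d in range(dim)
    ]


def recover_b(w, X, y):
    """Eq. (3.13): b = -(max over negatives + min over positives of w^T x) / 2."""
    max_neg = max(dot(w, xi) for xi, yi in zip(X, y) if yi == -1)
    min_pos = min(dot(w, xi) for xi, yi in zip(X, y) if yi == 1)
    return -(max_neg + min_pos) / 2.0


def decision_value(alpha, X, y, b, x_new):
    """Eq. (3.15): only support vectors (alpha_i > 0) contribute to the sum."""
    return sum(
        a * yi * dot(xi, x_new)
        for a, xi, yi in zip(alpha, X, y)
        if a > 0
    ) + b


# Toy data: the third point lies well beyond the margin, so its multiplier is 0.
X = [[1.0, 0.0], [-1.0, 0.0], [3.0, 1.0]]
y = [1, -1, 1]
alpha = [0.5, 0.5, 0.0]   # assumed dual solution, derived by hand

w = recover_w(alpha, X, y)
b = recover_b(w, X, y)
x_new = [2.0, 5.0]
score = decision_value(alpha, X, y, b, x_new)
label = 1 if score > 0 else -1
```

For this toy case the recovered hyperplane is w = (1, 0), b = 0, and the dual-form decision value agrees with w^T x + b computed directly.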

3.1.6 Multiclass SVMs

The support vector machine is fundamentally a binary classifier. In practice, however, one

may have to tackle problems involving more than two classes. In this project, for example,

SVMs are being used in a multiclass scenario. Various methods have been proposed to

combine multiple binary SVMs to build a multiclass classifier.

One commonly used approach is to construct K separate SVMs, in which the k-th SVM

y_k(x) is trained using data from class C_k as the positive examples and the data from the

remaining K − 1 classes as the negative examples. This is known as the one-versus-the-rest

approach. A new point x is classified using

\[ y(x) = \max_{k} y_k(x) \tag{3.16} \]

This heuristic approach suffers from the problem that the different classifiers are

trained on different tasks, and there is no guarantee that the real-valued quantities y_k(x)

for different classifiers will have appropriate scales.

Another approach is to train K(K−1)/2 different 2-class SVMs on all possible pairs of

classes, and then to classify test points according to which class has the highest number of

“votes”, an approach that is called one-versus-one. The problem with this approach is that

it requires significantly more training time than the one-versus-the-rest approach. Similarly, to evaluate

test points, significantly more computation is required.
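
The two combination schemes can be sketched in a few lines of Python. The sketch below assumes that the binary classifiers have already been trained and are available as plain scoring functions; the class names, weights, and the `linear` helper are hypothetical stand-ins, not the classifiers trained in this thesis.

```python
from itertools import combinations


def linear(w, b):
    """Return a scoring function x -> w^T x + b (stand-in for a trained binary SVM)."""
    return lambda x: sum(wi * xi for wi, xi in zip(w, x)) + b


def one_vs_rest(scorers, x):
    """Eq. (3.16): choose the class whose classifier gives the largest score."""
    scores = {k: f(x) for k, f in scorers.items()}
    return max(scores, key=scores.get)


def one_vs_one(pairwise, classes, x):
    """Each of the K(K-1)/2 pairwise classifiers casts one vote."""
    votes = {k: 0 for k in classes}
    for a, b in combinations(classes, 2):
        # A positive score is a vote for the first class of the pair.
        winner = a if pairwise[(a, b)](x) > 0 else b
        votes[winner] += 1
    return max(votes, key=votes.get)


classes = ["walk", "run", "wave"]   # hypothetical action labels

scorers = {
    "walk": linear([1.0, 0.0], 0.0),
    "run": linear([0.0, 1.0], 0.0),
    "wave": linear([-1.0, -1.0], 0.0),
}
pairwise = {
    ("walk", "run"): linear([1.0, -1.0], 0.0),
    ("walk", "wave"): linear([2.0, 1.0], 0.0),
    ("run", "wave"): linear([1.0, 2.0], 0.0),
}

x = [0.2, 0.9]
ovr_label = one_vs_rest(scorers, x)
ovo_label = one_vs_one(pairwise, classes, x)
```

With K = 3 classes, one-versus-the-rest evaluates 3 classifiers and one-versus-one also evaluates 3; the gap grows quadratically in K, which is the extra cost noted above.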

3.2 Hidden Markov Models

A Hidden Markov Model (HMM) is a statistical model in which the system being modeled

is assumed to be a Markov process with unknown parameters; the challenge is to determine

the hidden parameters from the observable data. The extracted model parameters can then

be used to perform further analysis, for example for pattern recognition applications. In a

regular Markov model, the state is directly visible to the observer, and therefore the state

transition probabilities are the only parameters. In a hidden Markov model, the state is not

directly visible, but variables influenced by the state are visible. Each state has a probability

distribution over the possible output tokens. Therefore the sequence of tokens generated by

an HMM gives some information about the sequence of states. The HMM is widely used

in speech recognition, natural language modelling, on-line handwriting recognition and the

analysis of biological sequences such as proteins and DNA.

3.2.1 Transition and Emission probabilities

As in a standard mixture model, the latent variables are the discrete multinomial variables

zn describing which component of the mixture is responsible for generating the corresponding

observation xn. The probability distribution of zn is now allowed to depend on the state

of the previous latent variable zn−1 through a conditional distribution p(zn|zn−1). Because

the latent variables are K-dimensional binary variables, this conditional distribution corresponds

to a table of numbers A, the elements of which are known as transition probabilities.

They are given by Ajk = p(znk = 1|zn−1,j = 1), and because they are probabilities, they

satisfy 0 ≤ Ajk ≤ 1 with $\sum_k A_{jk} = 1$, so the matrix A has K(K−1) independent parameters.

The conditional distribution can be written in the form:

\[ p(z_n \mid z_{n-1}, A) = \prod_{k=1}^{K} \prod_{j=1}^{K} A_{jk}^{\, z_{n-1,j}\, z_{nk}} \tag{3.17} \]

The initial latent node z1 is special in that it does not have a parent node, and so it

has a marginal distribution p(z1 ) represented by a vector of probabilities π with elements

πk ≡ p(z1k = 1), so that:

\[ p(z_1 \mid \pi) = \prod_{k=1}^{K} \pi_k^{z_{1k}} \tag{3.18} \]

where $\sum_k \pi_k = 1$.

The specification of the probabilistic model is completed by defining the conditional distri-

butions of the observed variables p(xn |zn , φ) where φ is the set of parameters governing the

distribution. These are known as emission probabilities. These could be given by a Gaus-

sian (or any other continuous probability distribution) if the elements of x are continuous

variables, or by conditional probability tables if x is discrete. Because xn is observed, the

distribution p(xn |zn , φ) consists, for a given value of φ, of a vector of K numbers corre-

sponding to the K possible states of the binary vector zn . The emission probabilities can

be represented in the form:

\[ p(x_n \mid z_n, \phi) = \prod_{k=1}^{K} p(x_n \mid \phi_k)^{z_{nk}} \tag{3.19} \]

The joint probability distribution over both latent and observed variables is then given by:

\[ p(X, Z \mid \theta) = p(z_1 \mid \pi) \left[ \prod_{n=2}^{N} p(z_n \mid z_{n-1}, A) \right] \prod_{m=1}^{N} p(x_m \mid z_m, \phi) \tag{3.20} \]

where X = {x1 ,. . . ,xN }, Z = {z1 ,...,zN }, and θ = {π,A,φ} denotes the set of parameters

governing the model. The model is tractable for a wide range of emission distributions

including discrete tables, Gaussians and mixtures of Gaussians.
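To make the factorization in Equation (3.20) concrete, the following sketch evaluates ln p(X, Z|θ) for one particular state path. It assumes discrete observations with an emission table B (a stand-in for the emission parameters φ), and all numerical values are made up for illustration.

```python
import numpy as np

def hmm_log_joint(states, obs, pi, A, B):
    """ln p(X, Z | theta) for a state path `states` and observations `obs`.
    pi[k]   = p(z_1 = k)
    A[j, k] = p(z_n = k | z_{n-1} = j)
    B[k, v] = p(x_n = v | z_n = k)   (assumed discrete emission table)"""
    logp = np.log(pi[states[0]])                      # initial state term
    for n in range(1, len(states)):
        logp += np.log(A[states[n - 1], states[n]])   # transition terms
    for n in range(len(states)):
        logp += np.log(B[states[n], obs[n]])          # emission terms
    return logp

# Toy two-state model with binary observations.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.2, 0.8]])
B = np.array([[0.9, 0.1],
              [0.3, 0.7]])

log_joint = hmm_log_joint([0, 0, 1], [0, 0, 1], pi, A, B)
```

For this path the joint probability is the product 0.6 · 0.7 · 0.3 · 0.9 · 0.9 · 0.7, that is, one initial term, two transition terms and three emission terms.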

3.2.2 Maximum Likelihood for the HMM

For an observed data set X = {x1 ,...,xN }, one can determine the parameters of the HMM

using maximum likelihood. The likelihood function is obtained from the joint distribution

in Equation (3.20) by marginalizing over the latent variables:

\[ p(X \mid \theta) = \sum_{Z} p(X, Z \mid \theta) \tag{3.21} \]

Because the joint distribution p(X, Z|θ) does not factorize over n, the summations

over zn cannot be treated independently. Nor can the summations be performed explicitly,

because there are N variables to be summed over, each of which has K states, resulting in

a total of K^N terms. Thus the number of terms in the summation grows exponentially with

the length of the chain. In fact, the summation in Equation (3.21) corresponds to summing

over exponentially many paths through the lattice diagram in Fig 3.3 (b).

Figure 3.3: The transition state diagram unfolded over time. (a) A state transition diagram. (b) A lattice representing the transition diagram unfolded over time.

To find an efficient framework for maximizing the likelihood function in the hidden

Markov Model, one can use the expectation maximization (EM) algorithm. The EM algo-

rithm starts with some initial selection for the model parameters, denoted by θold . In the E

step, the model parameters can be used to find the posterior distribution of the latent vari-

ables p(Z |X, θold ). This posterior distribution can then be used to evaluate the expectation

of the logarithm of the complete-data likelihood function as a function of the parameters θ,

to give the function Q(θ,θold ) defined by:

\[ Q(\theta, \theta^{old}) = \sum_{Z} p(Z \mid X, \theta^{old}) \ln p(X, Z \mid \theta) \tag{3.22} \]

Introducing some notation, γ(zn) denotes the marginal posterior distribution of a latent

variable zn, and ξ(zn−1, zn) denotes the joint posterior distribution of two successive latent

variables, so that:

\[ \gamma(z_n) = p(z_n \mid X, \theta^{old}) \tag{3.23} \]

\[ \xi(z_{n-1}, z_n) = p(z_{n-1}, z_n \mid X, \theta^{old}) \tag{3.24} \]

For each value of n, γ(zn ) can be stored using a set of K non-negative numbers that sum to

unity, and similarly, ξ(zn−1 , zn ) can be stored in a K × K matrix of non-negative numbers

that sum to unity. γ(znk ) can denote the conditional probability of z nk = 1, with a similar

notation for ξ(zn−1,j , znk ) and for other probabilistic variables introduced earlier. Because

the expectation of a binary random variable is just the probability that it takes the value

1:

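\[ \gamma(z_{nk}) = E[z_{nk}] = \sum_{z_n} \gamma(z_n)\, z_{nk} \tag{3.25} \]

\[ \xi(z_{n-1,j}, z_{nk}) = E[z_{n-1,j}\, z_{nk}] = \sum_{z_{n-1}, z_n} \xi(z_{n-1}, z_n)\, z_{n-1,j}\, z_{nk} \tag{3.26} \]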

Substituting the joint distribution p(X, Z|θ) from Equation (3.20) into (3.22), and making use

of the definitions of γ and ξ:

\[ Q(\theta, \theta^{old}) = \sum_{k=1}^{K} \gamma(z_{1k}) \ln \pi_k + \sum_{n=2}^{N} \sum_{j=1}^{K} \sum_{k=1}^{K} \xi(z_{n-1,j}, z_{nk}) \ln A_{jk} + \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \ln p(x_n \mid \phi_k) \tag{3.27} \]

The goal of the E step will be to evaluate the quantities γ(zn ) and ξ(zn−1 , zn ) efficiently.

In the M step, the quantity Q(θ, θold) is maximized with respect to the parameters θ = {π, A, φ}, while

γ(zn) and ξ(zn−1, zn) are treated as constant. Maximization with respect to π and A is

easily achieved using appropriate Lagrange multipliers, with the results:

\[ \pi_k = \frac{\gamma(z_{1k})}{\sum_{j=1}^{K} \gamma(z_{1j})} \tag{3.28} \]

\[ A_{jk} = \frac{\sum_{n=2}^{N} \xi(z_{n-1,j}, z_{nk})}{\sum_{l=1}^{K} \sum_{n=2}^{N} \xi(z_{n-1,j}, z_{nl})} \tag{3.29} \]

The EM algorithm must be initialized by choosing the starting values for π and A, which

should respect the summation constraints associated with their probabilistic interpretation.

Any elements of π and A that are initially set to zero will remain zero in subsequent EM

updates. A typical initialization procedure would involve selecting random starting values

for these parameters subject to the summation and non-negativity constraints.
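The updates for π and A amount to normalizing expected counts. The sketch below assumes the posteriors γ and ξ have already been computed and are stored as arrays with the layout described in the docstring; the function name and the toy posteriors are illustrative assumptions.

```python
import numpy as np

def m_step_pi_A(gamma, xi):
    """M step for the initial and transition probabilities.
    gamma: (N, K) array, gamma[n, k] = gamma(z_nk)
    xi:    (N-1, K, K) array, xi[n-1, j, k] = xi(z_{n-1,j}, z_nk)"""
    pi = gamma[0] / gamma[0].sum()                    # Eq. (3.28)
    counts = xi.sum(axis=0)                           # expected j -> k transitions
    A = counts / counts.sum(axis=1, keepdims=True)    # Eq. (3.29), row-normalized
    return pi, A

# Toy posteriors for N = 3 time steps and K = 2 states (made-up values whose
# marginals are mutually consistent).
gamma = np.array([[0.8, 0.2],
                  [0.5, 0.5],
                  [0.1, 0.9]])
xi = np.array([[[0.45, 0.35], [0.05, 0.15]],
               [[0.08, 0.42], [0.02, 0.48]]])

pi_new, A_new = m_step_pi_A(gamma, xi)
```

A state that receives zero expected count would lead to a division by zero in this sketch; a practical implementation would need to guard against that case.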

To maximize Q(θ,θold ) with respect to φk , it should be noted that only the final term in

Equation (3.27) depends on φk . If the parameters φk are different for the different compo-

nents, then this term decouples into a sum of terms one for each value of k, each of which

can be maximized independently. This would reduce to simply maximizing the weighted log

likelihood function for the emission density p(x|φk) with the weights γ(znk). For example,

in the case of Gaussian emission densities, p(x|φk) = N(x|µk, Σk), and maximization of the

function Q(θ, θold) then gives:

\[ \mu_k = \frac{\sum_{n=1}^{N} \gamma(z_{nk})\, x_n}{\sum_{n=1}^{N} \gamma(z_{nk})} \tag{3.30} \]

\[ \Sigma_k = \frac{\sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k)(x_n - \mu_k)^{T}}{\sum_{n=1}^{N} \gamma(z_{nk})} \tag{3.31} \]

For the case of discrete multinomial observed variables, the conditional distribution of the

observed variables takes the form:

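\[ p(x \mid z) = \prod_{i=1}^{D} \prod_{k=1}^{K} \mu_{ik}^{\, x_i z_k} \tag{3.32} \]

and the corresponding M step equations are given by:

\[ \mu_{ik} = \frac{\sum_{n=1}^{N} \gamma(z_{nk})\, x_{ni}}{\sum_{n=1}^{N} \gamma(z_{nk})} \tag{3.33} \]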

The EM algorithm requires the initial values for the parameters of the emission distribution.

3.2.3 The forward-backward algorithm

There needs to be an efficient procedure to evaluate the quantities γ(znk ) and ξ(zn-1,j , znk ),

corresponding to the E step of the EM algorithm. The graph for the Hidden Markov

Model is a tree, so this means that the posterior distribution of the latent variables can be

obtained efficiently using a two-stage message passing algorithm. In the particular context

of the hidden Markov Model, this is known as the forward backward (Rabiner, 1989) or the

Baum-Welch algorithm (Baum, 1972). There are several variants of the basic algorithm,

all of which lead to the exact marginals and which differ in the precise form of the messages that

are propagated along the chain. The most widely used of these is called the alpha-beta

algorithm, discussed below.

The evaluation of the posterior distributions of the latent variables is independent of the

form of the emission density p(x|z) or of whether the observed variables are discrete or

continuous. All that is required are the values of the quantities p(xn|zn) for each value of

zn for every n. Also, the explicit dependence on the model parameters θold shall be omitted

because these are fixed throughout.

The following conditional independencies hold:

\[ p(X \mid z_n) = p(x_1, \ldots, x_n \mid z_n)\, p(x_{n+1}, \ldots, x_N \mid z_n) \tag{3.34} \]

\[ p(x_1, \ldots, x_{n-1} \mid x_n, z_n) = p(x_1, \ldots, x_{n-1} \mid z_n) \tag{3.35} \]

\[ p(x_1, \ldots, x_{n-1} \mid z_{n-1}, z_n) = p(x_1, \ldots, x_{n-1} \mid z_{n-1}) \tag{3.36} \]

\[ p(x_{n+1}, \ldots, x_N \mid z_n, z_{n+1}) = p(x_{n+1}, \ldots, x_N \mid z_{n+1}) \tag{3.37} \]

\[ p(x_{n+2}, \ldots, x_N \mid z_{n+1}, x_{n+1}) = p(x_{n+2}, \ldots, x_N \mid z_{n+1}) \tag{3.38} \]

\[ p(X \mid z_{n-1}, z_n) = p(x_1, \ldots, x_{n-1} \mid z_{n-1})\, p(x_n \mid z_n)\, p(x_{n+1}, \ldots, x_N \mid z_n) \tag{3.39} \]

\[ p(x_{N+1} \mid X, z_{N+1}) = p(x_{N+1} \mid z_{N+1}) \tag{3.40} \]

\[ p(z_{N+1} \mid z_N, X) = p(z_{N+1} \mid z_N) \tag{3.41} \]

where X = {x1, ..., xN}. These relations are easily proved using d-separation. For instance,

for the second of these results, every path from any one of the nodes x1, ..., xn−1 to the node

xn passes through node zn, which is observed. Because all such paths are head-to-tail, it

follows that the conditional independence property must hold. These relations can also be

proved directly from the joint distribution of the hidden Markov model using the sum and

product rules of probability.

To evaluate γ(znk ), the fact that for a discrete multinomial random variable the expected

value of one of its components is just the probability of that component having a 1 proves

useful. Given this fact, the goal is to find the posterior distribution p(zn|x1, ..., xN) of zn

given the observed data set x1, ..., xN. This represents a vector of length K whose entries

correspond to the expected values of znk. Using Bayes' theorem:

\[ \gamma(z_n) = p(z_n \mid X) = \frac{p(X \mid z_n)\, p(z_n)}{p(X)} \tag{3.42} \]

The denominator p(X) is implicitly conditioned on the parameters θold of the HMM and

hence represents the likelihood function. Using the conditional independence property

(3.34), together with the product rule of probability:

\[ \gamma(z_n) = \frac{p(x_1, \ldots, x_n, z_n)\, p(x_{n+1}, \ldots, x_N \mid z_n)}{p(X)} = \frac{\alpha(z_n)\, \beta(z_n)}{p(X)} \tag{3.43} \]

where

\[ \alpha(z_n) \equiv p(x_1, \ldots, x_n, z_n) \tag{3.44} \]

\[ \beta(z_n) \equiv p(x_{n+1}, \ldots, x_N \mid z_n) \tag{3.45} \]

The quantity α(zn ) represents the joint probability of observing all of the given data up to

a time n and the value of zn , whereas β(zn ) represents the conditional probability of all

the future data from time n + 1 up to N given the value of zn. Again, α(zn) and β(zn)

each represent a set of K numbers, one for each of the possible settings of the 1-of-K coded

binary vector zn . From now on, the notation α(znk ) shall be used to denote the value of

α(zn ) when z nk = 1, with the analogous interpretation of β(znk ).

The recursion relations that allow α(zn) and β(zn) to be evaluated efficiently can now be

derived. Making use of the conditional independence properties, in particular Eq (3.35) and

(3.36), together with the sum and product rules, α(zn) can be expressed in terms of α(zn−1)

as follows:

\[
\begin{aligned}
\alpha(z_n) &= p(x_1, \ldots, x_n, z_n) \\
&= p(x_1, \ldots, x_n \mid z_n)\, p(z_n) \\
&= p(x_n \mid z_n)\, p(x_1, \ldots, x_{n-1} \mid z_n)\, p(z_n) \\
&= p(x_n \mid z_n)\, p(x_1, \ldots, x_{n-1}, z_n) \\
&= p(x_n \mid z_n) \sum_{z_{n-1}} p(x_1, \ldots, x_{n-1}, z_{n-1}, z_n) \\
&= p(x_n \mid z_n) \sum_{z_{n-1}} p(x_1, \ldots, x_{n-1}, z_n \mid z_{n-1})\, p(z_{n-1}) \\
&= p(x_n \mid z_n) \sum_{z_{n-1}} p(x_1, \ldots, x_{n-1} \mid z_{n-1})\, p(z_n \mid z_{n-1})\, p(z_{n-1}) \\
&= p(x_n \mid z_n) \sum_{z_{n-1}} p(x_1, \ldots, x_{n-1}, z_{n-1})\, p(z_n \mid z_{n-1})
\end{aligned}
\tag{3.46}
\]

Figure 3.4: Forward recursion for the evaluation of the α variables

Making use of definition (3.44) for α(zn ),

\[ \alpha(z_n) = p(x_n \mid z_n) \sum_{z_{n-1}} \alpha(z_{n-1})\, p(z_n \mid z_{n-1}) \tag{3.47} \]

It should be noted that there are K terms in the summation, and the right hand side has to

be evaluated for each of the K values of zn, so each step of the α recursion has a computational

cost that scales like O(K^2). The forward recursion is illustrated by the lattice

diagram in Fig 3.4.

In order to start this recursion, an initial condition is required. This is given by:

\[ \alpha(z_1) = p(x_1, z_1) = p(z_1)\, p(x_1 \mid z_1) = \prod_{k=1}^{K} \{ \pi_k\, p(x_1 \mid \phi_k) \}^{z_{1k}} \tag{3.48} \]

which says that α(z1k), for k = 1, ..., K, takes the value πk p(x1|φk). Starting at the first node of

the chain, one can work along the chain and evaluate α(zn) for every latent node. Because

each step of the recursion involves multiplying by a K × K matrix, the overall cost of

evaluating these quantities for the whole chain is O(K^2 N).
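The forward recursion can be sketched in a few lines. The version below assumes the emission values p(xn|zn) have been pre-computed into an array `lik`, and it checks the result against a brute-force sum over all K^N paths on a toy chain; the function names and the numbers are illustrative assumptions.

```python
import numpy as np
from itertools import product

def forward(pi, A, lik):
    """Unscaled alpha recursion. lik[n, k] = p(x_n | z_n = k).
    Returns alpha with shape (N, K), alpha[n, k] = alpha(z_nk)."""
    N, K = lik.shape
    alpha = np.zeros((N, K))
    alpha[0] = pi * lik[0]                        # initial condition
    for n in range(1, N):
        # sum over the previous state j of alpha[n-1, j] * A[j, k]
        alpha[n] = lik[n] * (alpha[n - 1] @ A)
    return alpha

def brute_force_likelihood(pi, A, lik):
    """Sum of p(X, Z) over all K**N state paths (for checking only)."""
    N, K = lik.shape
    total = 0.0
    for path in product(range(K), repeat=N):
        p = pi[path[0]] * lik[0, path[0]]
        for n in range(1, N):
            p *= A[path[n - 1], path[n]] * lik[n, path[n]]
        total += p
    return total

# Toy two-state chain of length 4 with made-up parameters.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.2, 0.8]])
lik = np.array([[0.9, 0.3],
                [0.9, 0.3],
                [0.1, 0.7],
                [0.9, 0.3]])

alpha = forward(pi, A, lik)
likelihood = alpha[-1].sum()      # p(X) as the sum of alpha over the last node
```

Each step costs one K × K matrix-vector product, whereas the brute-force check enumerates K^N paths and is only feasible for very short chains.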

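The recursion relation for β(zn) can be found similarly, by making use of the conditional

independence properties (3.37) and (3.38):

\[
\begin{aligned}
\beta(z_n) &= p(x_{n+1}, \ldots, x_N \mid z_n) \\
&= \sum_{z_{n+1}} p(x_{n+1}, \ldots, x_N, z_{n+1} \mid z_n) \\
&= \sum_{z_{n+1}} p(x_{n+1}, \ldots, x_N \mid z_n, z_{n+1})\, p(z_{n+1} \mid z_n) \\
&= \sum_{z_{n+1}} p(x_{n+1}, \ldots, x_N \mid z_{n+1})\, p(z_{n+1} \mid z_n) \\
&= \sum_{z_{n+1}} p(x_{n+2}, \ldots, x_N \mid z_{n+1})\, p(x_{n+1} \mid z_{n+1})\, p(z_{n+1} \mid z_n)
\end{aligned}
\tag{3.49}
\]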

Making use of the definition (3.45) for β(zn ):

\[ \beta(z_n) = \sum_{z_{n+1}} \beta(z_{n+1})\, p(x_{n+1} \mid z_{n+1})\, p(z_{n+1} \mid z_n) \tag{3.50} \]

It should be noted that in this case, the algorithm goes backward, evaluating β(zn ) in terms

of β(zn+1). At each step, the effect of the observation xn+1 is absorbed through the emission

probability p(xn+1|zn+1), multiplied by the transition matrix p(zn+1|zn), and then zn+1 is

marginalized. This is illustrated in Fig 3.5.

Figure 3.5: Backward recursion for the evaluation of the β variables

Again, a starting condition for the recursion, namely a value for β(zN), is required. This can

be obtained by setting n = N in Eq (3.43) and replacing α(zN) with its definition (3.44) to

give:

\[ p(z_N \mid X) = \frac{p(X, z_N)\, \beta(z_N)}{p(X)} \tag{3.51} \]

which will be correct provided β(zN) = 1 for all settings of zN.
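A minimal sketch of the backward recursion, paired with a forward pass so that the posterior marginals γ can be formed, is given below. It assumes the emission values are pre-computed in an array `lik`; the names and the toy numbers are illustrative assumptions.

```python
import numpy as np

def forward(pi, A, lik):
    """Unscaled alpha recursion; lik[n, k] = p(x_n | z_n = k)."""
    N, K = lik.shape
    alpha = np.zeros((N, K))
    alpha[0] = pi * lik[0]
    for n in range(1, N):
        alpha[n] = lik[n] * (alpha[n - 1] @ A)
    return alpha

def backward(A, lik):
    """Unscaled beta recursion, started from beta(z_N) = 1."""
    N, K = lik.shape
    beta = np.ones((N, K))
    for n in range(N - 2, -1, -1):
        # sum over the next state k of A[j, k] * p(x_{n+1} | k) * beta[n+1, k]
        beta[n] = A @ (lik[n + 1] * beta[n + 1])
    return beta

# Toy two-state chain of length 4 with made-up parameters.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.2, 0.8]])
lik = np.array([[0.9, 0.3],
                [0.9, 0.3],
                [0.1, 0.7],
                [0.9, 0.3]])

alpha = forward(pi, A, lik)
beta = backward(A, lik)
p_X = alpha[-1].sum()
gamma = alpha * beta / p_X        # posterior marginals gamma(z_n)
```

The sum of α(zn)β(zn) over the states gives the same value at every position n of the chain, which is a convenient sanity check for an implementation.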

In the M step equations, the quantity p(X) will cancel out, as can be seen in the M step

equation for µk given by (3.30), which takes the form

\[ \mu_k = \frac{\sum_{n=1}^{N} \gamma(z_{nk})\, x_n}{\sum_{n=1}^{N} \gamma(z_{nk})} = \frac{\sum_{n=1}^{N} \alpha(z_{nk})\, \beta(z_{nk})\, x_n}{\sum_{n=1}^{N} \alpha(z_{nk})\, \beta(z_{nk})} \tag{3.52} \]

However, the quantity p(X) represents the likelihood function whose value is typically monitored

during the EM optimization, and so it is useful to be able to evaluate it. Summing

both sides of (3.43) over zn, and using the fact that the left hand side is a normalized

distribution:

\[ p(X) = \sum_{z_n} \alpha(z_n)\, \beta(z_n) \tag{3.53} \]

Thus the likelihood function can be evaluated by computing this sum, for any convenient

choice of n. For instance, if only the likelihood function has to be evaluated, then this can

be done by running the α recursion from the start to the end of the chain, and then using

this result for n = N, making use of the fact that β(zN) is a vector of 1's. In this case, no

β recursion is required, and the following is obtained:

\[ p(X) = \sum_{z_N} \alpha(z_N) \tag{3.54} \]

To compute the likelihood, the joint distribution p(X, Z) should be summed over all possible

values of Z. Each such value represents a particular choice of hidden state for every

time step; in other words, every term in the summation is a path through the lattice diagram.

The number of such paths is exponential in the length of the chain. By expressing the likelihood function in the form

(3.54), the computational cost has been reduced from being exponential in the length of

the chain to being linear, by swapping the order of the summation and multiplications, so

that at each time step n, the contributions from all paths passing through each of the states

znk are summed to give the intermediate quantities α(zn).

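Next, consider the evaluation of the quantities ξ(zn−1, zn), which correspond to the values of the conditional

probabilities p(zn−1, zn|X). Applying Bayes' theorem:

\[
\begin{aligned}
\xi(z_{n-1}, z_n) &= p(z_{n-1}, z_n \mid X) \\
&= \frac{p(X \mid z_{n-1}, z_n)\, p(z_{n-1}, z_n)}{p(X)} \\
&= \frac{p(x_1, \ldots, x_{n-1} \mid z_{n-1})\, p(x_n \mid z_n)\, p(x_{n+1}, \ldots, x_N \mid z_n)\, p(z_n \mid z_{n-1})\, p(z_{n-1})}{p(X)} \\
&= \frac{\alpha(z_{n-1})\, p(x_n \mid z_n)\, p(z_n \mid z_{n-1})\, \beta(z_n)}{p(X)}
\end{aligned}
\tag{3.55}
\]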

Here, the conditional independence property (3.39) was used together with the definitions of

α(zn) and β(zn) given by (3.44) and (3.45). Thus ξ(zn−1, zn) can be calculated directly by using the results

of the α and β recursions.

To summarize the steps required to train a hidden Markov model using the EM algorithm:

one first makes an initial selection of the parameters θold, where θ ≡ (π, A, φ).

The A and π parameters are often initialized either uniformly or randomly from a uniform

distribution (respecting their non-negativity and summation constraints). Initialization of

the parameters φ will depend on the form of the distribution. Then both the forward α

recursion and the backward β recursion are run, and the results are used to evaluate γ(zn) and ξ(zn−1, zn).

At this stage, the likelihood function can be evaluated. This completes the E step, and these

results can be used to find a revised set of parameters θnew using the M step equations

from Section 3.2.2. The E and M steps are then alternated until

convergence, for instance until the change in the log likelihood falls below some threshold.

It should be noted that in the recursion relations, the observations enter through conditional

distributions of the form p(xn|zn). The recursions are therefore independent of the type or

dimensionality of the observed variables or of the form of this conditional distribution, so

long as its value can be computed for each of the K possible states of zn . Since the observed

variables {xn } are fixed, the quantities p(xn |zn ) can be pre-computed as functions of zn at

the start of the EM algorithm, and remain fixed throughout.

The maximum likelihood approach is most effective when the number of data points is large

in relation to the number of parameters. Here, the hidden Markov model can be trained

effectively, using maximum likelihood, provided the training sequence is sufficiently long.

Alternatively, one can also use multiple shorter sequences which requires a straightforward

modification of the hidden Markov model EM algorithm. In the case of left-to-right mod-

els, this is particularly important because, in a given observation sequence, a given state

transition corresponding to a non-diagonal element of A will be seen at most once.

Another quantity of interest is the predictive distribution, in which the observed data is X

= {x1, ..., xN} and one wishes to predict xN+1. Again, one can make use of the sum and

product rules together with the conditional independence properties (3.40) and (3.41) to

derive:

\[
\begin{aligned}
p(x_{N+1} \mid X) &= \sum_{z_{N+1}} p(x_{N+1}, z_{N+1} \mid X) \\
&= \sum_{z_{N+1}} p(x_{N+1} \mid z_{N+1})\, p(z_{N+1} \mid X) \\
&= \sum_{z_{N+1}} p(x_{N+1} \mid z_{N+1}) \sum_{z_N} p(z_{N+1}, z_N \mid X) \\
&= \sum_{z_{N+1}} p(x_{N+1} \mid z_{N+1}) \sum_{z_N} p(z_{N+1} \mid z_N)\, p(z_N \mid X) \\
&= \sum_{z_{N+1}} p(x_{N+1} \mid z_{N+1}) \sum_{z_N} p(z_{N+1} \mid z_N)\, \frac{p(z_N, X)}{p(X)} \\
&= \frac{1}{p(X)} \sum_{z_{N+1}} p(x_{N+1} \mid z_{N+1}) \sum_{z_N} p(z_{N+1} \mid z_N)\, \alpha(z_N)
\end{aligned}
\tag{3.56}
\]

which can be evaluated by first running a forward α recursion and then computing the final

summations over zN and zN+1. The result of the first summation over zN can be stored and used once the value of xN+1 is observed, in order to

run the α recursion forward to the next step and predict the subsequent value xN+2.

In the equation above, the influence of all the data from x1 to xN is summarized in the

K values of α(zN).

3.2.4 Scaling Factors

There is an important issue to be addressed before making use of the forward-backward

algorithm in practice. In the algorithm, at each step, the value α(zn ) is obtained from the

previous value α(zn−1 ) by multiplying by quantities p(zn |zn−1 ) and p(xn |zn ). Because the

probabilities are often significantly less than unity, going forward along the chain, the values

α(zn ) can go to zero exponentially quickly.

In the case of i.i.d. data, this problem was circumvented by evaluating log likelihood

functions. This will not work here because sums of products of small numbers are

being formed. Therefore, rescaled versions of α(zn) and β(zn) whose values remain of order

unity are used. The corresponding scaling factors cancel out when these rescaled quantities

are used in the EM algorithm.

From (3.44), α(zn) = p(x1, ..., xn, zn) represents the joint distribution of all the observations

up to xn and the latent variable zn. A normalized version of α is now introduced:

\[ \hat{\alpha}(z_n) = p(z_n \mid x_1, \ldots, x_n) = \frac{\alpha(z_n)}{p(x_1, \ldots, x_n)} \tag{3.57} \]

which is expected to be well behaved numerically because it is a probability distribution

over K variables for any value of n. In order to relate the scaled and original alpha vari-

ables, scaling factors defined by conditional distributions over the observed variables are

introduced:

cn = p(xn |x1 , ..., xn−1 ) (3.58)

From the product rule:

\[ p(x_1, \ldots, x_n) = \prod_{m=1}^{n} c_m \tag{3.59} \]

and so

\[ \alpha(z_n) = p(z_n \mid x_1, \ldots, x_n)\, p(x_1, \ldots, x_n) = \left( \prod_{m=1}^{n} c_m \right) \hat{\alpha}(z_n) \tag{3.60} \]

The recursion equation (3.47) for α can be turned into one for α̂ given by

\[ c_n\, \hat{\alpha}(z_n) = p(x_n \mid z_n) \sum_{z_{n-1}} \hat{\alpha}(z_{n-1})\, p(z_n \mid z_{n-1}) \tag{3.61} \]

At each stage of the forward message passing phase, c n will have to be evaluated and stored,

which is easily done because it is the coefficient that normalizes the right hand side of the

equation above to give α̂(zn).

Rescaled variables β̂(zn) can be similarly defined using

\[ \beta(z_n) = \left( \prod_{m=n+1}^{N} c_m \right) \hat{\beta}(z_n) \tag{3.62} \]

which will again remain within machine precision because, from (3.45), the quantities

β̂(zn) are simply the ratio of two conditional probabilities:

\[ \hat{\beta}(z_n) = \frac{p(x_{n+1}, \ldots, x_N \mid z_n)}{p(x_{n+1}, \ldots, x_N \mid x_1, \ldots, x_n)} \tag{3.63} \]

The recursion result (3.50) for β then gives the following recursion for the re-scaled variables

\[ c_{n+1}\, \hat{\beta}(z_n) = \sum_{z_{n+1}} \hat{\beta}(z_{n+1})\, p(x_{n+1} \mid z_{n+1})\, p(z_{n+1} \mid z_n) \tag{3.64} \]

In applying this recursion relation, the scaling factors c n that were computed in the α phase

are used.

From (3.59), the likelihood function can be found using

\[ p(X) = \prod_{n=1}^{N} c_n \tag{3.65} \]

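Similarly, using (3.43) and (3.55), together with (3.60), (3.62) and (3.65), the required marginals are given by

\[ \gamma(z_n) = \hat{\alpha}(z_n)\, \hat{\beta}(z_n) \tag{3.66} \]

\[ \xi(z_{n-1}, z_n) = c_n^{-1}\, \hat{\alpha}(z_{n-1})\, p(x_n \mid z_n)\, p(z_n \mid z_{n-1})\, \hat{\beta}(z_n) \tag{3.67} \]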
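The rescaled recursions can be sketched as follows. This is an illustrative implementation under the assumption of a discrete emission table B and a synthetic observation sequence; the function name, the parameters and the data are all made up. Dividing by the scaling factor of step n in the pairwise posterior is what makes each slice of ξ sum to one, since the scaling factors contained in α, β and p(X) cancel except for that single factor.

```python
import numpy as np

def scaled_forward_backward(pi, A, lik):
    """Scaled alpha-beta recursions. lik[n, k] = p(x_n | z_n = k).
    Returns gamma (N, K), xi (N-1, K, K) and ln p(X)."""
    N, K = lik.shape
    alpha_hat = np.zeros((N, K))
    beta_hat = np.ones((N, K))
    c = np.zeros(N)

    a = pi * lik[0]
    c[0] = a.sum()
    alpha_hat[0] = a / c[0]
    for n in range(1, N):
        a = lik[n] * (alpha_hat[n - 1] @ A)    # unnormalized forward message
        c[n] = a.sum()                          # scaling factor c_n
        alpha_hat[n] = a / c[n]

    for n in range(N - 2, -1, -1):
        beta_hat[n] = (A @ (lik[n + 1] * beta_hat[n + 1])) / c[n + 1]

    gamma = alpha_hat * beta_hat
    # xi[n-1, j, k] for n = 2..N, divided by the scaling factor of step n
    xi = (alpha_hat[:-1, :, None] * A[None, :, :]
          * (lik[1:] * beta_hat[1:])[:, None, :]) / c[1:, None, None]
    log_likelihood = np.log(c).sum()            # ln p(X) = sum of ln c_n
    return gamma, xi, log_likelihood

# Synthetic example: a long binary sequence and a made-up two-state model.
rng = np.random.default_rng(0)
obs = rng.integers(0, 2, size=2000)
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.2, 0.8]])
B = np.array([[0.9, 0.1],
              [0.3, 0.7]])
lik = B[:, obs].T                                # shape (2000, 2)

gamma, xi, log_likelihood = scaled_forward_backward(pi, A, lik)
```

For a sequence of this length the unscaled α values would be expected to underflow in double precision, whereas the scaled quantities stay of order unity and the likelihood is accumulated in the log domain.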

Figure 3.6: A graphical depiction of the Viterbi algorithm

3.2.5 The Viterbi Algorithm

In many applications of hidden Markov models, the latent variables have some meaningful

interpretation, and so it is often of interest to find the most probable sequence of hidden

states for a given observation sequence. Because the graph for a hidden Markov model

is a directed tree, this problem can be solved exactly using the max-sum algorithm. The

problem of finding the most probable sequence of latent states is not the same as that

of finding the set of states that are individually the most probable. The latter problem

can be solved by running the forward-backward (sum-product) algorithm to find the latent

marginals γ(zn ) and then maximizing each of these individually. However, the set of such

states will not, in general correspond to the most probable sequence of states. In fact, this

set of states might even represent a sequence having zero probability, if it so happens that

two successive states, which in isolation are individually the most probable, are such that

the transition matrix element connecting them is zero.

Finding the most probable sequence of states can be solved efficiently using the max-sum

algorithm, which is known as the Viterbi algorithm. Fig 3.6 shows a fragment of the hidden

Markov model expanded as a lattice diagram. The number of possible paths through the

lattice diagram grows exponentially with the length of the chain. The Viterbi algorithm

searches this space of paths efficiently to find the most probable path with a computational

cost that grows only linearly with the length of the chain.

The variable zN is treated as the root, and messages are passed to the root starting with

the leaf nodes. The messages passed in the max-sum algorithm are given by

\[ \mu_{z_n \to f_{n+1}}(z_n) = \mu_{f_n \to z_n}(z_n) \tag{3.68} \]

\[ \mu_{f_{n+1} \to z_{n+1}}(z_{n+1}) = \max_{z_n} \left\{ \ln f_{n+1}(z_n, z_{n+1}) + \mu_{z_n \to f_{n+1}}(z_n) \right\} \tag{3.69} \]

If µzn→fn+1(zn) is eliminated between these two equations, a recursion for the f → z messages can be

obtained:

\[ \omega(z_{n+1}) = \ln p(x_{n+1} \mid z_{n+1}) + \max_{z_n} \left\{ \ln p(z_{n+1} \mid z_n) + \omega(z_n) \right\} \tag{3.70} \]

where ω(zn) ≡ µfn→zn(zn).

The messages are initialized using

ω(z1 ) = ln p(z1 ) + ln p(x1 |z1 ) (3.71)

A simple algorithm that keeps track of the path to every possible latent variable is used to

find the sequence of latent variables that corresponds to the most likely path.

Intuitively, the Viterbi algorithm can be understood as follows. Naively, one could consider

all of the exponentially many paths through the lattice, evaluate the probability for each,

and then select the path having the highest probability. However, a dramatic saving in

computational cost can be made as follows. Suppose the probability of each path is evaluated

by accumulating products of transition and emission probabilities going forward along each

path through the lattice. Considering a particular time step n and a particular state k at

that time step, there will be many possible paths converging on the corresponding node in

the lattice diagram. However, only the path that has the highest probability so far

needs to be retained. Because there are K states at time step n, K such paths need to

be kept track of at step n. At time step n+1, there will be K^2 possible paths to consider,

comprising K possible paths leading out of each of the K current states, but again,

only K of these, corresponding to the best path for each state at time n+1, have to be

retained. When the final time step N is reached, it will be known which state corresponds

to the overall most probable path. Because there is a unique path coming into that state,

the path can be traced back to step N−1 to see what state it occupied at that time, and so

on back through the lattice to the state at n = 1.
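The procedure described above can be sketched in the log domain with a table of back-pointers. The code assumes strictly positive initial, transition and emission probabilities (so that no logarithm of zero occurs) and pre-computed emission values in `lik`; the names and numbers are illustrative assumptions, and the exhaustive search is included only as a check on a toy chain.

```python
import numpy as np
from itertools import product

def viterbi(pi, A, lik):
    """Most probable state path. lik[n, k] = p(x_n | z_n = k).
    Returns (path, log probability of that path jointly with the data)."""
    N, K = lik.shape
    log_A = np.log(A)
    omega = np.log(pi) + np.log(lik[0])           # initialization
    back = np.zeros((N, K), dtype=int)
    for n in range(1, N):
        cand = omega[:, None] + log_A             # cand[j, k]: best path ending in j, then j -> k
        back[n] = np.argmax(cand, axis=0)         # best predecessor of each state k
        omega = np.log(lik[n]) + cand.max(axis=0)
    path = [int(np.argmax(omega))]
    for n in range(N - 1, 0, -1):                 # trace the back-pointers
        path.append(int(back[n, path[-1]]))
    return path[::-1], float(omega.max())

def log_joint(path, pi, A, lik):
    """ln p(X, Z) for one explicit path (used for checking)."""
    lp = np.log(pi[path[0]]) + np.log(lik[0, path[0]])
    for n in range(1, len(path)):
        lp += np.log(A[path[n - 1], path[n]]) + np.log(lik[n, path[n]])
    return float(lp)

# Toy two-state chain of length 4 with made-up parameters.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.2, 0.8]])
lik = np.array([[0.9, 0.3],
                [0.9, 0.3],
                [0.1, 0.7],
                [0.9, 0.3]])

best_path, best_logp = viterbi(pi, A, lik)
exhaustive_best = max(log_joint(p, pi, A, lik)
                      for p in product(range(2), repeat=lik.shape[0]))
```

Only K running scores and K back-pointers per time step are stored, so the cost grows linearly with the length of the chain, in contrast to the exhaustive search over K^N paths.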

Chapter 4

Approach

Two different approaches will be discussed in this section. The first approach is purely SVM

(Support Vector Machine) based and is derived from Dalal et al.[1]. The machinery behind

the second approach is based on Hidden Markov Models. We describe our joint model for

tracking and recognition in which we track and classify actions simultaneously. Both of

these are discussed in detail in the following sections.

4.1 The Support Vector Machine Approach

The SVM approach is an extension of [1]. Whereas in [1], the focus is on building a binary

classifier to make person/no-person detections in images, the SVM approach uses an optical

flow based feature set and learns an SVM for each class of actions.

4.1.1 Optical Flow based HOG descriptors

Dalal et al. [1] discuss how locally normalized Histogram of Oriented Gradient (HOG)

descriptors provide better performance at human detection relative to other feature sets.

These descriptors are computed over a grid of uniformly spaced cells and use overlapping

local contrast normalizations for improved performance. The intuition is that local object

appearance and shape in static images can be characterized well by the distribution of local

intensity gradients.

Here, local intensity gradients are replaced by optical flow to characterize local motion

patterns in video sequences. This is implemented by dividing an image into regions called

cells and accumulating a 1-D histogram of the optical flows over the pixels of the cell.

To achieve better invariance to effects such as illumination and shadowing, local responses

from cells are normalized by accumulating cells over larger spatial regions called blocks and

normalizing over all cells in a block. Just as in [1], we use an 8x8 cell with each block having

4 cells in it. Each pixel calculates a weighted “vote” based on the orientation and magnitude

of the optical flow vector centered at it and the votes are accumulated into orientation bins

over cells. There are 9 orientation bins from 0 - 180 (degrees) and 4 normalizations for every

cell. A detection window is tiled with a dense grid of overlapping HOG descriptors. Let

i ∈ {1, . . . , L} where L is the number of discrete locations in an image. L is proportional to

the number of pixels in the image. For efficiency reasons, we only score windows centered

at every 8th pixel. If the dimensionality of the detection window is n×m, and pi is the

n×m dimensional patch extracted from location i in the image, we can write

xi = ψ(pi )

for the HOG feature vector computed at the patch. This feature vector is $\frac{n}{8} \times \frac{m}{8} \times 36$

dimensional. Fig. 4.1 shows an image, the optical flow and the HOG descriptor computed

at the image.
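The cell-level voting step can be illustrated with the following sketch. It is a simplified assumption-laden version: each pixel casts a single magnitude-weighted vote into the nearest of 9 orientation bins over 0 to 180 degrees, with no interpolation between bins and without the block normalization that produces the final 36 values per cell. The function name and the synthetic flow field are illustrative and do not reproduce the implementation used in this work.

```python
import numpy as np

def flow_cell_histograms(flow, cell=8, bins=9):
    """flow: (H, W, 2) array of optical flow vectors (u, v).
    Returns an (H // cell, W // cell, bins) array of magnitude-weighted
    orientation histograms, with orientation folded into [0, 180) degrees."""
    H, W, _ = flow.shape
    ch, cw = H // cell, W // cell
    u, v = flow[..., 0], flow[..., 1]
    mag = np.hypot(u, v)                              # vote weight
    ang = np.degrees(np.arctan2(v, u)) % 180.0        # unsigned orientation
    bin_idx = np.minimum((ang / (180.0 / bins)).astype(int), bins - 1)
    hist = np.zeros((ch, cw, bins))
    for i in range(ch):
        for j in range(cw):
            sl = (slice(i * cell, (i + 1) * cell),
                  slice(j * cell, (j + 1) * cell))
            hist[i, j] = np.bincount(bin_idx[sl].ravel(),
                                     weights=mag[sl].ravel(),
                                     minlength=bins)
    return hist

# Synthetic flow field: every pixel moves one unit along +x.
flow = np.zeros((16, 24, 2))
flow[..., 0] = 1.0
hist = flow_cell_histograms(flow)
```

With 8 × 8 pixel cells, a 16 × 24 flow field yields a 2 × 3 grid of 9-bin histograms, and in this synthetic case all 64 votes of each cell fall into the first bin.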

Figure 4.1: (a) Original image. (b) Optical flow computed at the image. (c) HOG descriptor computed for the image.

4.1.2 A multiclass SVM framework for action recognition

Typically, an SVM is a binary classifier but can be extended to a multiclass framework as

described in Sec 3.1.6. This is the version of the SVM that is used here. Given a training

dataset of video sequences, an SVM is built for every action class.

4.1.2.1 Training

Given a training set of video sequences, we collect training pairs {xi , yi } where xi is as

described above and yi ∈ {1, . . . , C}, where C is the number of action classes. Positive

features for an SVM for action class C are obtained from bounding boxes around the

actions corresponding to class C. Negative features were obtained from random patches in

images containing actions from the remaining classes. We train a model wC that minimizes

the hinge loss as follows

\[ w_C = \sum_{i=1}^{d} \alpha_i\, y_i\, x_i \tag{4.1} \]

where d is the size of the training data and αi are Lagrange multipliers, whose values can

be found by solving a dual optimization problem.

The training is typically done iteratively - after learning an initial SVM classifier, all the

negative training examples are searched exhaustively for false positives (hard examples).
These false positives are then appended to the negative training set and the SVM is retrained

using the augmented negative training set, which gives rise to a new classifier.

4.1.2.2 Testing

To classify a new video sequence broken down into a sequence of optical flows, we convert

each image into a large $\frac{\text{framewidth}}{8} \times \frac{\text{frameheight}}{8} \times 36$ dimensional HOG descriptor. Then

for a given C, we scan the n×m dimensional detection window across all locations and

scales, scoring the classifier for C at every detection window by convolving the image HOG

descriptor with 36 templates representing different subsets of weights from wC .

If X = {x1, . . . , xN} represents the sequence of optical flows in an N-image video, and Xj

= {x1, . . . , xm} denotes the windows of frame xj, where m is the number of windows in the image across locations and scales,

xj can be classified as follows:

\[ C(x_j) = \arg\max_{C} \max_{i} \; w_C \cdot x_i \tag{4.2} \]

The entire video sequence X can be classified by taking a majority vote across frames:

\[ C(X) = C(x_1, \ldots, x_N) = \arg\max_{C} \sum_{i=1}^{N} I(C(x_i) = C) \tag{4.3} \]

where I is the indicator function.
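The two-level decision rule can be sketched as below. The sketch assumes the window features of each frame are stacked into a matrix and that each class has a single linear weight vector without a bias term; the function names, the identity weight matrix and the toy features are illustrative assumptions.

```python
import numpy as np

def classify_frame(window_feats, W):
    """window_feats: (m, d) features of the m windows in one frame.
    W: (C, d) matrix holding one weight vector per action class.
    Returns the class whose best-scoring window is highest."""
    scores = window_feats @ W.T                  # (m, C) window-by-class scores
    return int(np.argmax(scores.max(axis=0)))    # max over windows, then over classes

def classify_video(frames, W):
    """frames: list of (m, d) arrays, one per frame. Majority vote over frames."""
    labels = [classify_frame(f, W) for f in frames]
    counts = np.bincount(labels, minlength=W.shape[0])
    return int(np.argmax(counts)), labels

# Toy example with C = 3 classes and d = 3 features.
W = np.eye(3)
frames = [np.array([[1.0, 0.0, 0.0], [0.0, 0.5, 0.0]]),
          np.array([[0.0, 2.0, 0.0]]),
          np.array([[0.2, 0.1, 0.0], [0.9, 0.0, 0.3]])]

video_label, frame_labels = classify_video(frames, W)
```

Because the vote only counts per-frame labels, shuffling the order of the frames leaves the video label unchanged.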

4.2 The Hidden Markov Model Approach

The SVM approach is similar to a bag-of-words approach; even if it were fed a sequence of

frames constituting a video out of order, it would still classify the video sequence just as it

would if the frames were fed to it in order. In reality, the optical flow in a particular frame

very much depends on the optical flows of the frames that precede it. This scenario can be

modeled as a Markov process. We use the training set of videos to build Hidden Markov

Models (HMMs) for each action class. Given a new, previously unseen video sequence which

is broken down into a sequence of optical flows, the HMMs learnt in the training phase are

then used to classify this observed sequence.

4.2.1 Building a vocabulary of visual words

As mentioned in section 3.2, a Hidden Markov Model (HMM) consists of a transition model

A, an emission model φ and an initial model π. The first step towards building an HMM

model is to build a vocabulary of “visual words” using the frames belonging to the videos

from the training set. These visual words will then represent our hidden variables in the

HMM.

To build a visual vocabulary, the motion descriptor in Efros et al.[9] is used on bounding

boxes around the person in a frame. This motion descriptor has been shown to perform

reliably with noisy image sequences. Given a video sequence in which the person appears

in the center of the field of view, the optical flow is computed at each frame using the

Lucas-Kanade algorithm[10]. The optical flow vector field F is then split into 2 scalar fields

$F_x$ and $F_y$, corresponding to the x and y components of the optical flow. $F_x$ and $F_y$ are further half-wave rectified into four non-negative channels $F_x^-$, $F_x^+$, $F_y^-$, and $F_y^+$, such that $F_x = F_x^+ - F_x^-$ and $F_y = F_y^+ - F_y^-$. These four non-negative channels are then blurred with a Gaussian kernel and normalized to obtain the final four channels $\hat{F}_x^-$, $\hat{F}_x^+$, $\hat{F}_y^-$, and $\hat{F}_y^+$.
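The half-wave rectification step can be illustrated on a tiny flow field. This sketch covers only the split into non-negative channels; the Gaussian blur and normalization are omitted, and the 2 × 2 example field is made up.

```python
def half_wave_rectify(field):
    """Split a signed 2-D field into non-negative parts with field = pos - neg."""
    pos = [[max(v, 0.0) for v in row] for row in field]
    neg = [[max(-v, 0.0) for v in row] for row in field]
    return pos, neg


# Hypothetical x-component of an optical flow field.
flow_x = [[0.5, -1.0],
          [0.0, 2.0]]
fx_pos, fx_neg = half_wave_rectify(flow_x)

# The original field is recovered as the difference of the two channels.
reconstructed = [[p - n for p, n in zip(prow, nrow)]
                 for prow, nrow in zip(fx_pos, fx_neg)]
```

Applying the same function to the y-component yields the remaining two channels.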

Figure 4.2: Cluster centers from the UCF action dataset

The similarity between the motion descriptors of two frames is computed as follows. Suppose the four channels for frame A are $a_1$, $a_2$, $a_3$ and $a_4$; similarly, the four channels for frame B are $b_1$, $b_2$, $b_3$ and $b_4$. Then the similarity between frame A and frame B is:

$$S(A, B) = \sum_{c=1}^{4} \sum_{(x,y) \in I} a_c(x, y)\, b_c(x, y) \tag{4.4}$$

where I is the spatial extent of the motion descriptors.
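A direct transcription of the similarity in Eq. 4.4 is given below as a sketch. Each descriptor is assumed to be a list of four channels, each channel a list of rows; the 1 × 2 example channels are made up.

```python
def similarity(channels_a, channels_b):
    """Eq. 4.4: sum over channels and pixels of a_c(x, y) * b_c(x, y)."""
    return sum(
        a * b
        for chan_a, chan_b in zip(channels_a, channels_b)
        for row_a, row_b in zip(chan_a, chan_b)
        for a, b in zip(row_a, row_b)
    )


# Hypothetical descriptors: four channels, each a 1 x 2 image.
frame_a = [[[1, 0]], [[0, 2]], [[0, 0]], [[1, 1]]]
frame_b = [[[1, 1]], [[0, 1]], [[3, 3]], [[0, 2]]]
s_ab = similarity(frame_a, frame_b)
```

The measure is symmetric in its two arguments, so the affinity matrix built from it is symmetric as well.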

To construct the codebook, an affinity matrix A is computed on all frames in the training

set, where A(i, j) is the similarity between frame i and frame j calculated using the equation

above. K-Medoid clustering is then run on this affinity matrix to obtain K clusters. Each

frame in the training set belongs to one of these K clusters. Each cluster represents a visual

word. Fig. 4.2 shows 15 of the 45 “cluster centers” from the UCF action dataset[11]. Each

one of the images in the figure represents a visual word.
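The clustering step can be sketched with a basic alternating K-medoids procedure that works directly on a similarity matrix. The text does not say which K-medoids variant or initialization was used, so the naive initialization and the toy one-dimensional "frames" below are assumptions.

```python
def k_medoids(affinity, k, max_iters=100):
    """Cluster items from a symmetric similarity matrix.

    Returns (medoid indices, cluster label per item). Higher affinity
    means more similar, matching the similarity S(A, B).
    """
    n = len(affinity)
    medoids = list(range(k))  # naive deterministic start (an assumption)
    labels = [0] * n
    for _ in range(max_iters):
        # Assignment: each item joins the medoid it is most similar to.
        labels = [max(range(k), key=lambda c: affinity[i][medoids[c]])
                  for i in range(n)]
        # Update: each cluster picks the member most similar to the others.
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                new_medoids.append(medoids[c])
                continue
            new_medoids.append(
                max(members, key=lambda m: sum(affinity[m][i] for i in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, labels


# Toy data: six 1-D "frames", with similarity = negative distance.
points = [0, 1, 2, 10, 11, 12]
affinity = [[-abs(a - b) for b in points] for a in points]
medoids, labels = k_medoids(affinity, k=2)
```

Each returned medoid is an actual training frame, which is why the cluster centers can be displayed as images in Fig. 4.2.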

From now on, the visual vocabulary V will be represented as the K-element set $\{w_1, \dots, w_K\}$.

For the KTH dataset, which has 2391 video sequences in it, the size of the vocabulary we

used was 255.

4.2.2 A hidden Markov model framework for video sequences

For an N-frame video sequence $X = \{x_1, \dots, x_N\}$, where $x_j$ is the optical flow computed at frame j, each $x_j$ could represent an observed state in an HMM. The video sequence

X could be represented as a sequence of these observed states. The visual words computed

in the section above could represent the hidden variables z in a HMM. Given a labeled set

of videos as training data, we could then build a hidden Markov model θC = {AC ,φC ,π C }

for every action class C in the training set.

4.2.2.1 Dimensionality reduction of optical flow features

We assume that our observations are continuous vectors and therefore assume a Gaussian

emission density. We explore two different feature vectors. The first is the optical flow

based HOG feature set described in Sec. 4.1.1. For an n × m image patch, the length of the feature vector is (n/8) × (m/8) × 36. The problem with directly modelling this feature vector as a Gaussian is that it may be too large to fit a full covariance matrix. This means that we may end up with a singular covariance matrix, meaning $\Sigma^{-1}$ will not exist. One solution to

this problem is to restrict the space of matrices Σ under consideration. Instead of trying to

fit a full covariance matrix, we could choose to fit a covariance matrix Σ that is diagonal.

Although restricting the covariance matrix to be diagonal is an acceptable solution in some

cases, we found both an efficiency and performance gain by explicitly reducing the dimen-

sionality of the data by Principal Component Analysis (PCA). We project the data down

to a smaller dimension D such that

$$x_i = V^T \psi(p_i) = \begin{bmatrix} V_1^T \\ \vdots \\ V_D^T \end{bmatrix} \psi(p_i) \tag{4.5}$$

where V is an ((n/8) × (m/8) × 36) × D dimensional matrix whose columns $V_1, \dots, V_D$ are the principal component directions.
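The singularity concern that motivates this reduction can be made concrete with a short bound. This is a sketch under assumptions not stated in the text: the covariance of a state is estimated from M feature vectors $\psi_1, \dots, \psi_M$ of dimension d with mean $\bar{\psi}$, and the patch size is taken to be the 64 × 128 window quoted later for the Dalal-Triggs detector.

```latex
\hat{\Sigma} = \frac{1}{M} \sum_{i=1}^{M} (\psi_i - \bar{\psi})(\psi_i - \bar{\psi})^{T},
\qquad
\operatorname{rank}(\hat{\Sigma}) \le \min(d,\; M - 1).

% Assumed 64 x 128 patch:
d = \frac{64}{8} \times \frac{128}{8} \times 36 = 4608
\quad\Longrightarrow\quad
\hat{\Sigma} \text{ is singular whenever } M \le 4608.

% After projecting to D dimensions, a necessary condition for invertibility is
M > D.
```

Under these assumptions a state would need more than 4608 training frames before a full covariance could even be invertible, whereas after the projection only more than D frames are needed.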

4.2.2.2 Visual word SVM based features

Here, we train an SVM wi ,i ∈ {1, . . . , K} to detect each visual word. Positive instances

come from bounding boxes from images in the cluster for which i is the center. Negative

instances are randomly sampled patches from images belonging to other clusters along with

image patches with no person in them. We found that adding the latter helps improve

performance since the SVM is trained to discriminate a particular action word from both

other action words and background patches that do not correspond to any action class. We

can now define the feature extracted from the ith patch as

$$x_i = w^T \psi(p_i) = \begin{bmatrix} w_1^T \\ \vdots \\ w_K^T \end{bmatrix} \psi(p_i) \tag{4.6}$$

This is also a linear dimensionality reduction scheme. Our reduction scheme is discrimina-

tive in that it exploits training labels, while PCA does not. One may also employ other

discriminative techniques such as linear discriminant analysis (LDA). We chose an SVM

based scheme in anticipation of moving our model toward a fully discriminative structured

prediction model[20], which we briefly describe in our future work.

Given the definitions of our observed feature vectors x and hidden states z, we use the

standard EM algorithm for HMMs to learn the parameters $\theta_C$ for every action class C.

4.2.2.3 Classifying a new pre-tracked video sequence

A new, previously unseen video sequence can be classified using the hidden Markov models

constructed in the training phase. In the pre-tracked scenario, the location of the person

in each frame is known, i.e., each frame is clipped to include only the person of interest. The video sequence essentially reduces to a sequence of poses, and optical flow can be computed at each pose. This sequence of flows constitutes the observed states. A sum-product algorithm can then be used to calculate the probability of the sequence of observed states under the HMM for each action class built in the training phase. If

$$\alpha_n(i) = p(x_1, \dots, x_n, z_n = i \mid \theta_C)$$

represents the probability of the partial observed sequence $x_1, \dots, x_n$ produced by all possible state sequences that end at the i-th visual word given $\theta_C$, it can be recursively defined as:

$$\alpha_n(i) = \left[ \sum_{j=1}^{K} \alpha_{n-1}(j)\, A^C_{ji} \right] \phi^C_i(x_n) \tag{4.7}$$

where K is the size of the visual vocabulary. $A^C$ is a matrix of transition probabilities for action class C, and $A^C_{ji}$ is the probability of transitioning from state j to state i in class C. $\phi^C$ is the emission model, and $\phi^C_i(x_n)$ represents the probability $p(x_n \mid z_n = i)$. For an N-length video sequence, the probability of the entire sequence X under the HMM for class C is obtained by summing the final forward variables over all hidden states:

$$p(X \mid \theta_C) = \sum_{j=1}^{K} \alpha_N(j) \tag{4.8}$$

After scoring the sequence under the HMM for every class, the final classifier for a video sequence X can be written as

$$C^* = \arg\max_{C} \; p(X \mid \theta_C) \tag{4.9}$$
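The forward recursion and the resulting classifier can be sketched as below. Several things here are assumptions made for illustration: the per-step rescaling (a standard guard against numerical underflow, not mentioned in the text), the one-dimensional Gaussian emissions, and the two toy classes "slow" and "fast".

```python
import math


def log_likelihood(obs, pi, A, emission):
    """Scaled forward algorithm (Eq. 4.7); returns log p(X | theta)."""
    K = len(pi)
    alpha = [pi[i] * emission(i, obs[0]) for i in range(K)]
    log_p = 0.0
    for x in obs[1:]:
        # Rescale alpha to sum to one; the log of the scale accumulates in log_p.
        scale = sum(alpha)
        log_p += math.log(scale)
        alpha = [a / scale for a in alpha]
        alpha = [sum(alpha[j] * A[j][i] for j in range(K)) * emission(i, x)
                 for i in range(K)]
    # Summing the final forward variables gives the sequence likelihood.
    return log_p + math.log(sum(alpha))


def gaussian_emission(means, var=1.0):
    """1-D Gaussian emission density per hidden state (a toy stand-in)."""
    def density(i, x):
        return (math.exp(-(x - means[i]) ** 2 / (2 * var))
                / math.sqrt(2 * math.pi * var))
    return density


# Two hypothetical action classes, each a 2-state HMM.
A = [[0.9, 0.1], [0.1, 0.9]]
models = {
    "slow": ([0.5, 0.5], A, gaussian_emission([0.0, 1.0])),
    "fast": ([0.5, 0.5], A, gaussian_emission([4.0, 5.0])),
}
obs = [4.2, 4.9, 5.1, 3.8]
scores = {c: log_likelihood(obs, *theta) for c, theta in models.items()}
best_class = max(scores, key=scores.get)  # the arg max over classes
```

Working with log-likelihoods does not change the arg max, since the logarithm is monotonic.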

As mentioned earlier, the ability to perform robust tracking prior to classification cannot

be assumed in general. In the next section, this assumption is relaxed and a model for jointly

tracking and predicting actions is presented.

4.3 A Unified model for joint tracking and recognition

Here, we use HMM machinery to build a joint model for tracking and recognizing actions.

Given a video sequence, we do not assume that the person in it is tracked; instead we try

to infer the person’s location in each frame along with his/her action.

4.3.1 Cross product space of location and visual words

In the pre-tracked scenario, $z \in \{1, \dots, K\}$, where K is the number of visual words. In the current scenario, where pre-tracking is not assumed and the location of a person in a frame is “hidden”, $z \in \{1, \dots, L\} \times \{1, \dots, K\}$, where L is the number of locations and is proportional to the number of pixels in a frame. Given that the hidden variables z now lie in the cross product space of locations and visual words, an HMM for tracking can be given by

$$p(X, Z \mid \theta_G) = p(z_1 \mid \pi^G)\, p(x_1 \mid z_1, \phi^G) \prod_{n=2}^{N} p(x_n \mid z_n, \phi^G)\, p(z_n \mid z_{n-1}, A^G) \tag{4.10}$$

where $\theta_G$ is a global hidden Markov model trained on the entire dataset.

4.3.2 Joint model for tracking and recognition

We present a model for joint tracking and recognition by building action specific HMMs for every action class. This is given by

$$p(X, Z, C \mid \theta_C) = p(C)\, p(z_1 \mid \pi^C)\, p(x_1 \mid z_1, \phi^C) \prod_{n=2}^{N} p(x_n \mid z_n, \phi^C)\, p(z_n \mid z_{n-1}, A^C) \tag{4.11}$$

where θC is the hidden Markov model for action class C. Our model enforces that all the

visual words zn are consistent with a single action class whereas (4.10) cannot guarantee

this property. Fig. 4.3 shows a graphical representation of our joint model. In it, the

transition and emission probabilities are conditioned on the variable C, the action class.

Figure 4.3: An illustration of the joint model. The variable C determines the transition
and emission probabilities.

4.3.2.1 Bounded velocity motion model

As mentioned earlier, in the present scenario, $z \in \{1, \dots, L\} \times \{1, \dots, K\}$. This state space can be very large for large values of L. For the KTH dataset[7], where the spatial resolution is 160×120, L ≈ 19200. In order to reduce this state space, we place a prior on the movement of a person between consecutive frames: we assume that a person cannot be displaced by more than δ pixels from frame to frame. The probability of a transition from state $z_{n-1} = (l_{n-1}, w_{n-1})$ to $z_n = (l_n, w_n)$, which we will represent as $T(z_{n-1}, z_n)$, is

$$T(z_{n-1}, z_n) = \begin{cases} 0 & \text{if } |l_n - l_{n-1}| > \delta \\ c & \text{if } |l_n - l_{n-1}| \le \delta \end{cases} \tag{4.12}$$

Fig. 4.4 below illustrates the bounded velocity motion model.
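The effect of the bound on the number of transitions can be estimated with some quick arithmetic. The values of L and K come from the text; the bound `delta = 5` and the choice to measure displacement separately along each image axis are assumptions made only for this illustration.

```python
L = 160 * 120   # candidate locations in a KTH frame (from the text)
K = 255         # visual words used for KTH (from the text)
delta = 5       # hypothetical displacement bound, in pixels

states = L * K
full_transitions = states ** 2

# With |dx| <= delta and |dy| <= delta, each location has at most
# (2 * delta + 1) ** 2 admissible predecessor locations.
neighbours = (2 * delta + 1) ** 2
bounded_transitions = states * neighbours * K

reduction = full_transitions / bounded_transitions  # equals L / neighbours
```

With these numbers the joint state space has about 4.9 million states, and the bound reduces the transitions considered per frame by a factor of roughly 159.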

As an alternative to the bounded velocity motion model, the distance transform technique of Felzenszwalb and Huttenlocher[14] can be used to speed up the Viterbi algorithm by rewriting the Viterbi recursion as a distance transform. Each step of the standard Viterbi algorithm takes $O(K^2)$ time (where K is the number of states), whereas the distance transform takes $O(K)$ time, a substantial speedup for large values of K.
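To illustrate the idea, the sketch below computes a one-dimensional distance transform with a linear (L1) penalty in two passes and compares it against the quadratic-time definition. It assumes the transition cost between locations p and q has the form c·|p − q| in the negative-log domain; the thesis does not state which cost it would use, and [14] also treats other costs such as the squared distance.

```python
def l1_distance_transform(f, c=1):
    """Compute D(p) = min over q of f(q) + c * |p - q| in linear time."""
    d = list(f)
    # Forward pass: propagate the best cost coming from the left.
    for p in range(1, len(d)):
        d[p] = min(d[p], d[p - 1] + c)
    # Backward pass: propagate the best cost coming from the right.
    for p in range(len(d) - 2, -1, -1):
        d[p] = min(d[p], d[p + 1] + c)
    return d


def brute_force_transform(f, c=1):
    """Same quantity by the quadratic-time definition, for comparison."""
    n = len(f)
    return [min(f[q] + c * abs(p - q) for q in range(n)) for p in range(n)]


# Hypothetical per-location costs from the previous frame.
costs = [4, 1, 7, 9, 0, 6]
fast = l1_distance_transform(costs, c=2)
slow = brute_force_transform(costs, c=2)
```

In a Viterbi step, `costs` would play the role of the negated log scores of the previous frame, and the transformed array gives, for every location, the best predecessor cost including the motion penalty.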

Figure 4.4: An illustration of the bounded velocity motion model. Each node represents a (location, word) hidden state. The only transitions considered are shown by the lines connecting nodes at n−1 and n. The rest of the transitions are not taken into consideration.

4.3.2.2 Inference

Given our joint model for tracking and recognition, we can now run inference on a new video

sequence and find both the best path of hidden states and its action class. We describe the

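process of inference as follows:

$$\max_{C,Z} p(Z, C \mid X) = \max_{i} \max_{Z} p(Z, C = i \mid X) \propto \max_{i} \left[ p(C = i) \max_{Z} \; p(z_1 \mid \pi^i)\, p(x_1 \mid z_1, \phi^i) \prod_{n=2}^{N} p(x_n \mid z_n, \phi^i)\, p(z_n \mid z_{n-1}, A^i) \right] \tag{4.13}$$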

For a specific C, we employ dynamic programming to find the best sequence of hidden states. If $S_C(z_n = i)$ is defined as the score of the best path that ends in state i at frame n under $\theta_C$, it can be defined recursively as follows

$$S_C(z_n = i) = p(x_n \mid z_n = i) \max_{j} \; S_C(z_{n-1} = j)\, p(z_n = i \mid z_{n-1} = j) \tag{4.14}$$

where

$$S_C(z_1 = i) = p(z_1 = i)\, p(x_1 \mid z_1 = i)$$

Recalling our bounded velocity motion model and that $z_n = (l_n, w_n)$, if we define $l(z_n) = l_n$, the equation above can be rewritten as

$$S_C(z_n = i) = p(x_n \mid z_n = i) \max_{\{j \,:\, |l(i) - l(j)| \le \delta\}} S_C(z_{n-1} = j)\, p(z_n = i \mid z_{n-1} = j) \tag{4.15}$$

Finally, if $S_C$ is the score of the highest scoring path,

$$S_C = \max_{i} \; S_C(z_N = i) \tag{4.16}$$

The final classifier can then be written as

$$C^* = \arg\max_{C} \; S_C \tag{4.17}$$
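The dynamic program above can be sketched in the log domain as follows. This is an illustration under simplifying assumptions that are not from the thesis: locations are one-dimensional, the prior over initial locations is uniform, the constant c of the bounded velocity model is dropped, and the emission scores are made up. Running it once per action class and taking the arg max of the returned scores would give the final classifier.

```python
import math


def viterbi_bounded(log_emit, log_word_trans, log_word_prior, delta):
    """Best (location, word) path under a bounded-velocity transition model.

    log_emit[n][l][w] is log p(x_n | z_n = (l, w)).
    Returns (log score of the best path, list of (location, word) states).
    """
    n_frames, n_locs = len(log_emit), len(log_emit[0])
    n_words = len(log_word_prior)
    states = [(l, w) for l in range(n_locs) for w in range(n_words)]
    # Initial score: uniform prior over locations times the word prior.
    score = {(l, w): log_word_prior[w] - math.log(n_locs) + log_emit[0][l][w]
             for l, w in states}
    backpointers = []
    for n in range(1, n_frames):
        new_score, pointers = {}, {}
        for l, w in states:
            # Only predecessors within delta locations are searched.
            lo, hi = max(0, l - delta), min(n_locs, l + delta + 1)
            best = max(((lp, wp) for lp in range(lo, hi) for wp in range(n_words)),
                       key=lambda s: score[s] + log_word_trans[s[1]][w])
            new_score[(l, w)] = (score[best] + log_word_trans[best[1]][w]
                                 + log_emit[n][l][w])
            pointers[(l, w)] = best
        score = new_score
        backpointers.append(pointers)
    last = max(score, key=score.get)  # best final state
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[last], path


# Toy input: one visual word, four locations, three frames.
# A strong distractor sits at location 3 in the middle frame.
LOW = -5.0
log_emit = [
    [[0.0], [LOW], [LOW], [LOW]],
    [[LOW], [-1.0], [LOW], [0.0]],
    [[LOW], [LOW], [0.0], [LOW]],
]
bounded_score, bounded_path = viterbi_bounded(log_emit, [[0.0]], [0.0], delta=1)
free_score, free_path = viterbi_bounded(log_emit, [[0.0]], [0.0], delta=3)
```

With `delta=1` the track moves smoothly through locations 0, 1, 2, while with an effectively unbounded `delta=3` it jumps to the distractor at location 3 in the middle frame.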

4.4 Motivation for joint tracking and recognition

In the section above, we presented a joint model for tracking and recognition. Here, we

present an informal motivation for such a model. We run several baseline algorithms on

a video sequence and compare their performance, tracking wise, to our joint tracking and

recognition model.

4.4.1 The Tracking problem

There does not exist an out-of-the-box technology for tracking people in video sequences.

Tracking people is difficult because of deformation due to articulation, the speed of movement from frame to frame, and the clutter surrounding the person to be tracked. Many previous approaches have relied on Kalman filtering or particle filtering[30]. Although it may not necessarily seem intuitive, recent approaches have advocated a track-by-detection philosophy, in which a track is stitched together by linking object detections across frames[31]. Such an approach has the advantage of not needing to be hand initialized and of being robust to errors from drifting, because the tracker is essentially re-initializing itself at each frame. For example, Forsyth[30] explicitly advocates the above approach for face tracking.

4.4.2 Experiment and baseline

Given that tracking by detection has been advocated by recent approaches, for our baseline we link detections across frames using the detector of Dalal et al.[1]. Commonly known as the Dalal-Triggs detector, it is known to be the state of the art for human detection in images and is trained specifically to find pedestrians. We also propose two other baselines. All our baselines are outlined below:

A: Linking detections across frames using the Dalal-Triggs detector

B: Employing the Dalal-Triggs detector in a dynamic programming framework

C: Reducing the joint model for tracking+recognition to a simple location HMM

In B and C, the hidden variable $z \in \{1, \dots, L\}$, where L is the number of discrete locations in a frame. They both use the bounded velocity motion model for transition probabilities. In B, the emission model is given by

$$p(x_i \mid i) \propto e^{w \cdot x_i}$$

where xi is the HOG feature vector extracted from location i and w is the model. It should

be noted that in its typical implementation, the Dalal-Triggs detector scans a 64×128

detection window across locations as well as scales. Here, the version of the detector used

does not search over scales. Images are rescaled such that the person of interest appears in

the center of the detection window.

(a) Kicking sequence tracked by baseline A

(b) Kicking sequence tracked by baseline B

(c) Kicking sequence tracked by baseline C

(d) Kicking sequence tracked by the joint model

Figure 4.5: The track in (a), produced by baseline A, is not consistent in the person it tracks. The track in (b), produced by baseline B, is consistent but does not follow the person of interest, i.e., the person kicking the ball. The tracks in (c) and (d) seem to do equally well in tracking the person of interest.

Baseline C is similar to the joint model except that there is no model for transitions

between visual words - these transitions are simply not taken into consideration. We run

all our baselines and our joint model on a kicking sequence from the UCF dataset[11]. We

run baseline C and our joint model given that the classification of the sequence is already

known to us. The joint model has a Gaussian emission density and uses the feature vector

in Sec 4.2.2.2. Fig. 4.5 shows the tracks output by the baselines and our joint model.

From the figure, it seems that the Dalal-Triggs baseline (baseline A) is inconsistent in its

tracking. One would expect the bounding box in each frame to be around the same person,

but this is clearly not the case here. Baseline B enforces this consistency by using the

Dalal-Triggs detector in a dynamic programming framework, using the bounded velocity

motion model for transition probabilities. Both A and B are unable to track the person of interest, i.e., the person kicking the ball. Instead, they seem to be tracking the person whose pose most resembles that of a pedestrian. The performance of baselines A and B on the video sequence demonstrates why pre-tracking using an external module cannot be assumed in general.

Baseline C outputs a track that is identical to the one output by the joint model. Although

baseline C outputs a reasonable track without considering transitions between visual words,

we believe that these transitions do convey important information that will lead to more

accurate tracks in more complex scenarios.

We note that baseline C is itself a novel contribution to tracking people in videos. It

estimates a spatio-temporal track by linking together responses from a family of HOG

templates, where each template encodes a visual action word. We are not aware of a similar

work in this vein.

Even though we do not see any major improvement of our joint model over baseline C, we note that in our model, jointly tracking and recognizing activities is not more expensive than separately pre-tracking and labelling actions. Since both activities can be efficiently done

together, we argue that this will result in a more robust system in more difficult scenarios

because it can enforce the constraint that all visual word templates are consistent with a

single action class.

Chapter 5

Results

The algorithms described in Chapter 4 are tested on two datasets: the KTH human motion dataset[7] and the UCF Sports Action dataset[11].

5.1 Action Classification on KTH Dataset

The KTH human motion dataset is one of the largest video datasets of human actions.

It contains six types of human actions (walking, jogging, running, boxing, hand waving

and hand clapping) performed several times by 25 subjects both outdoors and indoors,

with different clothes. All sequences were taken over homogeneous backgrounds with a

static camera with 25fps frame rate. Representative frames from the dataset are shown in

Fig. 5.1. The dataset was divided roughly equally into a training and testing set of video

sequences. The training sequences were tracked and stabilized so that the figures appear

in the center of the field of view. This was done using a combination of the Dalal-Triggs detector, background subtraction and the pedestrian detector of Sabzmeydani et al.[12].

The SVM approach was first tested on the dataset. Once the optical flow was computed on every frame in the training set, six linear SVMs were built, one for each class, as described in Sec. 4.1. In building the negative training set for a particular action class, feature vectors corresponding to frames that had no person in them were included, along with feature vectors from frames in the other five action classes. Given all six SVMs, a new video sequence was converted to a sequence of optical flows and classified by a majority vote taken across all frames. The confusion matrix for the KTH dataset using the SVM approach is shown in Table 5.1.

Figure 5.1: Representative frames from the KTH action dataset


action  box    clap   wave   jog    run    walk
box     0.95   0.02   0.01   0.01   0.01   0.00
clap    0.03   0.88   0.08   0.00   0.00   0.01
wave    0.03   0.06   0.89   0.02   0.00   0.00
jog     0.01   0.00   0.00   0.59   0.32   0.08
run     0.00   0.00   0.00   0.26   0.66   0.08
walk    0.00   0.00   0.00   0.09   0.05   0.86

Table 5.1: Confusion matrix for the KTH dataset using the SVM approach (overall accuracy = 80.5%)
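The overall accuracy quoted in the caption is consistent with averaging the diagonal of the confusion matrix, i.e., weighting every class equally. That reading is an assumption, since the text does not say how the figure was computed; the short check below reproduces it from the entries of Table 5.1.

```python
# Rows and columns follow the order box, clap, wave, jog, run, walk.
confusion = [
    [0.95, 0.02, 0.01, 0.01, 0.01, 0.00],
    [0.03, 0.88, 0.08, 0.00, 0.00, 0.01],
    [0.03, 0.06, 0.89, 0.02, 0.00, 0.00],
    [0.01, 0.00, 0.00, 0.59, 0.32, 0.08],
    [0.00, 0.00, 0.00, 0.26, 0.66, 0.08],
    [0.00, 0.00, 0.00, 0.09, 0.05, 0.86],
]

n_classes = len(confusion)
# Mean of the per-class accuracies on the diagonal.
overall_accuracy = sum(confusion[i][i] for i in range(n_classes)) / n_classes
row_sums = [sum(row) for row in confusion]
```

The same computation applied to the other confusion matrices for KTH gives the 88% and 89.67% figures reported in their captions.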

We also tested our joint model on the KTH dataset. A visual vocabulary was constructed as discussed in Sec. 4.2.1. We built an HMM for each of the six action classes and ran inference on every video sequence in the test set. As described in Sec. 4.3.2.2, the action was classified as the action corresponding to the highest scoring HMM. The confusion matrix using 255 codewords and optical flow features (Sec. 4.2.2.1) with a Gaussian emission model is shown in Table 5.2.

The confusion matrix using 255 codewords and visual word SVM features (Sec. 4.2.2.2) with a Gaussian emission model is shown in Table 5.3.

action  box    clap   wave   jog    run    walk
box     0.98   0.01   0.01   0.00   0.00   0.00
clap    0.01   0.96   0.03   0.00   0.00   0.00
wave    0.02   0.03   0.95   0.00   0.00   0.00
jog     0.00   0.00   0.00   0.72   0.25   0.03
run     0.00   0.00   0.00   0.23   0.74   0.03
walk    0.00   0.00   0.00   0.04   0.03   0.93

Table 5.2: Confusion matrix for the KTH dataset using our joint model and optical flow features (overall accuracy = 88%)

action  box    clap   wave   jog    run    walk
box     0.98   0.01   0.01   0.00   0.00   0.00
clap    0.01   0.97   0.02   0.00   0.00   0.00
wave    0.02   0.02   0.96   0.00   0.00   0.00
jog     0.00   0.00   0.00   0.75   0.22   0.03
run     0.00   0.00   0.00   0.21   0.77   0.02
walk    0.00   0.00   0.00   0.02   0.03   0.95

Table 5.3: Confusion matrix for the KTH dataset using the joint model and visual word SVM based features (overall accuracy = 89.67%)

Table 5.4 shows how the methods introduced in this thesis perform on the KTH dataset compared to other action recognition methods. It should be noted that it is not very clear how the other methods split the training and testing sets. The methods introduced in this thesis perform reasonably well on the KTH dataset given that pretracking is not assumed.

method                                     accuracy   pretracking
Mori et al.[18]                            97.2%      yes
Wang et al.[2]                             92.43%     yes
HMM and visual word SVM feature vector     89.67%     no
HMM and optical flow feature vector        88%        no
Niebles et al.[3]                          81.5%      no
SVM approach                               80.5%      no

Table 5.4: Comparison of different methods on KTH

Fig. 5.3 shows consecutive frames from KTH running sequences. Bounding boxes represent

the estimated location of the person. As seen in the figure, both the SVM approach and

the joint model do rather well in tracking a wide variety of poses and localizing motion in

a sequence of frames

From the confusion matrices, it is apparent that the “boxing” action has the highest accuracy. This is because it is unlike any of the other five actions. For example, “waving” and “clapping” can be quite similar to one another because in many cases, both actions start out the same way, with the hands of the person stretched out by his/her side. Similarly, the “walking”, “running” and “jogging” actions are also similar since they all involve movement of the legs. Specifically, the running and jogging actions are often misclassified as one another. This is reasonable, since running and jogging are similar actions and can be hard to tell apart. Fig. 5.2 shows a few consecutive frames from a running and a jogging sequence.

Figure 5.2: The frames in the top row are from a jogging sequence, the frames in the bottom row from a running sequence. These actions are similar to one another.

(a) boxing sequence

(b) jogging sequence

(c) hand waving sequence

(d) hand clapping sequence

(e) running sequence

(f) walking sequence

Figure 5.3: The persons in (a) and (b) were tracked (and their actions classified) using the SVM approach. Persons in (c) and (d) were tracked using the joint model and optical flow features. Persons in (e) and (f) were tracked using the joint model and visual word SVM based features.

Figure 5.4: Representative frames from the UCF Sports Action Dataset

5.2 Action Classification on UCF action dataset

The UCF (University of Central Florida) vision lab collected a set of eight actions from

various sports featured on channels such as BBC and ESPN. Actions in this dataset include

golf swings, diving, kicking a soccer ball, weight lifting, horse riding, running, skating and swingbenching. Representative frames of this dataset are shown in Fig. 5.4. The original dataset contains some frames in which more than one person appears, unlike the KTH dataset, which has only one person per frame. The dataset also includes cropped versions of the frames including only the person of interest in the center. The dataset was divided roughly equally into training and testing sets.

To test the SVM approach on this dataset, eight SVMs, one for each action, were built in

the training phase. The negative dataset for a particular action consisted of feature vectors

from frames in the other seven actions whereas the positive dataset consisted of feature

vectors from frames in the action class. Classification of a new sequence of frames was done

as described in Sec 4.1. The confusion matrix for the UCF dataset using the SVM approach

is shown in Table 5.5

We built eight HMMs, one for each class, to test our joint model for tracking and recognition on the UCF dataset. The confusion matrix for the UCF dataset using a vocabulary of 45 visual words, optical flow features and a Gaussian emission model is shown in Table 5.6.

Finally, the confusion matrix for the UCF dataset using the joint model, a vocabulary of 45 words and visual word SVM based feature vectors is shown in Table 5.7.

action      golf swing  dive        kick        lift        horse ride  run         skate       swingbench
golf swing  0.80        0.00        0.01        0.00        0.06        0.02        0.08        0.03
dive        0.01        0.81        0.00        0.06        0.00        0.02        0.02        0.08
kick        0.00        0.01        0.67        0.01        0.03        0.20        0.06        0.02
lift        0.00        0.06        0.02        0.76        0.05        0.00        0.00        0.11
horse ride  0.05        0.02        0.02        0.03        0.85        0.01        0.01        0.01
run         0.01        0.01        0.22        0.00        0.01        0.68        0.05        0.02
skate       0.06        0.01        0.08        0.01        0.00        0.06        0.77        0.01
swingbench  0.02        0.06        0.02        0.01        0.00        0.00        0.00        0.80

Table 5.5: Confusion matrix for the UCF dataset using the SVM approach (overall accuracy = 76.75%)

action      golf swing  dive        kick        lift        horse ride  run         skate       swingbench
golf swing  0.89        0.00        0.00        0.00        0.03        0.01        0.06        0.01
dive        0.01        0.88        0.00        0.04        0.00        0.02        0.01        0.04
kick        0.00        0.00        0.72        0.01        0.02        0.18        0.05        0.02
lift        0.00        0.04        0.02        0.85        0.02        0.00        0.00        0.07
horse ride  0.02        0.00        0.01        0.01        0.93        0.01        0.01        0.01
run         0.00        0.01        0.19        0.01        0.01        0.74        0.03        0.01
skate       0.06        0.00        0.07        0.01        0.00        0.05        0.81        0.01
swingbench  0.00        0.03        0.01        0.06        0.00        0.00        0.00        0.90

Table 5.6: Confusion matrix for the UCF dataset using the joint model with optical flow based features (overall accuracy = 84%)

action      golf swing  dive        kick        lift        horse ride  run         skate       swingbench
golf swing  0.91        0.00        0.00        0.00        0.02        0.01        0.05        0.01
dive        0.01        0.90        0.00        0.03        0.00        0.02        0.00        0.04
kick        0.00        0.00        0.76        0.00        0.02        0.17        0.03        0.02
lift        0.00        0.04        0.01        0.88        0.01        0.00        0.00        0.06
horse ride  0.01        0.00        0.01        0.01        0.94        0.01        0.01        0.01
run         0.00        0.00        0.16        0.01        0.01        0.78        0.03        0.01
skate       0.04        0.00        0.04        0.00        0.00        0.05        0.86        0.01
swingbench  0.00        0.02        0.00        0.05        0.00        0.00        0.00        0.93

Table 5.7: Confusion matrix for the UCF dataset using the joint model and visual word SVM based features (overall accuracy = 87.1%)

From the confusion matrices, it seems like there is some confusion between the “kick” and

“run” actions. Given that these actions can be similar (for example, they both involve a wide

stance), this is quite reasonable. Otherwise, both the SVM and HMM approach seem to be

doing reasonably well, even on complex actions like diving. The joint model in combination

with the visual word SVM based feature set, in particular, does well on this dataset. Fig 5.5

shows several sequences from the UCF action dataset with bounding boxes around detected

motion patterns. Again, both the SVM approach and the joint model seem to be tracking

the person in the image sequence accurately.

We are aware of only one other method that uses the UCF action dataset for testing: [11] achieves an accuracy of 69.2% on it. Although our method appears to do considerably better than [11] (87.1% versus 69.2%), it should be noted that a few actions mentioned in [11] did not appear in our version of the dataset. For example, pole vaulting and swing-baseball do not appear here. It may be that the publicly available dataset is a simpler subset of the collected dataset.

(a) lifting sequence

(b) golf swing sequence

(c) running sequence

(d) swingbench sequence

(e) diving sequence

(f) kicking sequence

(g) riding sequence

(h) skateboarding sequence

Figure 5.5: The persons in (a) and (b) were tracked (and their actions classified) using the SVM approach. (c) and (d) were tracked using the joint model with an optical flow based feature set. (e), (f), (g) and (h) were tracked using the joint model with visual word SVM based features.

Chapter 6

Conclusion and Future Work

In this thesis, we tried two different methods for tackling the action recognition problem.

From the results on the KTH and UCF action datasets, it is apparent that the hidden

Markov model approach outperformed the Support Vector Machine based approach. Intu-

itively, this makes sense because the pose of a person in a frame is certainly dependent on

the person’s pose in previous frames. This suggests that the order in which frames appear

in a video sequence should be taken into consideration when classifying it.

This thesis introduced a new method by which tracking and recognition can be done simul-

taneously. It tested rather well on the KTH and UCF action datasets; the next natural step

would be to test it on video sequences with more complex motion patterns, like a YouTube

video. So far, only optical flow has been used as a feature set, but as motion patterns get more obscure, it may be worth exploring other feature sets and using them in conjunction with optical flow.

Throughout our work, we assumed that there is only one person of interest in a video

sequence whose actions need to be tracked and recognized. The next step would be to

consider how to extend the framework introduced here to track and recognize multiple

human actions. Another simplifying assumption made in this thesis is that a person can

only perform a single action in a video sequence, whereas in reality, this is hardly ever

the case. It might be worth exploring an extension of the methods introduced here by

considering transitions from one action class to another along with transitions among visual

words.

Finally, we believe that a fully discriminative model such as a structural SVM[20] may

perform better since both the transition model and location emission models will be trained

jointly so as to produce correct tracks on the training data.

Bibliography

[1] N. Dalal and B. Triggs. Histograms of Oriented Gradients for Human Detection. In IEEE Conference on Computer Vision and Pattern Recognition 2005, San Diego, USA, pages 886 to 893.

[2] Y. Wang, P. Sabzmeydani and G. Mori. Semi-Latent Dirichlet Allocation: A Hierarchical Model for Human Action Recognition. In International Conference on Computer Vision 2007.

[3] J. C. Niebles, H. Wang and L. Fei-Fei. Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words. In International Journal of Computer Vision, Volume 79, Issue 3 (September 2008), pages 299 to 318.

[4] I. Laptev and T. Lindeberg. Space-Time Interest Points. In International Conference on Computer Vision 2003, Nice, France, pages 432 to 439.

[5] D. M. Blei, A. Y. Ng, M. I. Jordan and J. Lafferty. Latent Dirichlet allocation. In Journal of Machine Learning Research, Jan 2003, pages 993 to 1022.

[6] T. Lindeberg and I. Laptev. Local Descriptors for Spatio-Temporal Recognition. In ECCV Workshop 2004 “Spatial Coherence for Visual Motion Analysis”, Springer LNCS Vol. 3667, pages 91 to 103.

[7] B. Caputo, C. Schuldt and I. Laptev. Recognizing Human Actions: A Local SVM Approach. In ICPR 2004, Cambridge, UK, pages 32 to 36.

[8] I. Laptev. Local Spatio-Temporal Image Features for Motion Interpretation.

[9] A. Efros, A. C. Berg, G. Mori and J. Malik. Recognizing action at a distance. In IEEE International Conference on Computer Vision 2003, Volume 2, pages 726 to 733.

[10] B. D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In Proceedings of the DARPA Image Understanding Workshop (April 1981), pages 121 to 130.

[11] D. Rodriguez, J. Ahmed and M. Shah. Action MACH: A Spatio-temporal Maximum Average Correlation Height Filter for Action Recognition. In Computer Vision and Pattern Recognition, 2008.

[12] P. Sabzmeydani and G. Mori. Detecting pedestrians by learning shapelet features. In 2003.

64
[13] K. Murphy. Hidden Markov Model (HMM) Toolbox for Matlab. Software retrieved
from http://www.cs.ubc.ca/ murphyk/Software/HMM/hmm.html

[14] P. Felzenszwalb D. P. Huttenlocher. Distance Transforms of Sampled Functions. In


ornell Computing and Information Science Technical Report TR2004-1963, September
2004.

[15] T. Hofmann. Probabilistic Latent Semantic Indexing In of the Twenty-Second Annual


International SIGIR Conference on Research and Development in Information Retrieval
(SIGIR-99), 1999.

[16] C. Bishop. Pattern Recognition and Machine Learning.

[17] A. Ng. Support Vector Machines. Retrieved from


http://www.stanford.edu/class/cs229/notes/cs229-notes3.pdf.

[18] G. Mori and Y. Wang. Learning a Discriminative Hidden Part Model for Human Action
Recognition. 2008.

[19] E. Shechtman and M. Irani. Space-time behavior based correlation. 2005.

[20] I. Tsochantaridis, T. Joachims, T. Hofmann and Y. Altun. Large margin methods
for structured and interdependent output variables. Vol. 6, pages 1453-1484, 2005.

[21] M. Brand. Coupled hidden Markov models for complex action recognition. In Media Lab
Vision and Modeling TR-407, MIT, 1997.

[22] M. Brand, N. Oliver, and A. P. Pentland. Coupled hidden Markov models for complex
action recognition. In Conference on Computer Vision and Pattern Recognition 1997,
pages 994-999.

[23] N. Oliver, A. Garg, and E. Horvitz. Layered representations for learning and inferring
office activity from multiple sensory channels. Vol. 96, no. 2, pp. 163-180, November
2004.

[24] N. Oliver, E. Horvitz, and A. Garg. Layered representations for human activity
recognition. In IEEE International Conference on Multimodal Interfaces, 2002, pages 3-8.

[25] T. Mori, Y. Segawa, M. Shimosaka, and T. Sato. Hierarchical recognition of
daily human actions based on continuous hidden Markov models. In Face and Gesture
Recognition, 2004, pages 779-784.

[26] A. D. Wilson and A. F. Bobick. Parametric hidden Markov models for gesture
recognition. In IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 21, no. 9, pp.
884-900, September 1999.

[27] M. Brand and V. Kettnaker. Discovery and segmentation of activities in video. In
IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 844-851, August
2000.

[28] A. Galata, N. Johnson, and D. Hogg. Learning structured behavior models using
variable length Markov models. In International Workshop on Modelling People, 1999.

[29] A. Galata, N. Johnson, and D. Hogg. Learning behavior models of human activities.
In British Machine Vision Conference, 1999.

[30] D. A. Forsyth, O. Arikan, L. Ikemoto, J. O'Brien and D. Ramanan. Computational
studies of human motion: part 1, tracking and motion synthesis. In Foundations and Trends
in Computer Graphics and Vision, July 2005.

[31] D. Ramanan and D. A. Forsyth. Automatic Annotation of Everyday Movements.
Dec 2003.

[32] D. McAllester and D. Ramanan. A discriminatively trained, multiscale, deformable
part model. 2008.

[33] D. Crandall, P. Felzenszwalb, and D. Huttenlocher. Spatial priors for part-based
recognition using statistical models. Pages 10-17, 2005.

[34] B. Epshtein and S. Ullman. Semantic hierarchies for recognizing objects and parts.
In CVPR, 2007.

[35] S. Ioffe and D. Forsyth. Probabilistic methods for finding people. Pages 45-69,
June 2001.

Appendices

A Data Sets

In this appendix, we give a brief introduction to the data sets used in this thesis.

A.1 KTH Dataset

The KTH human motion dataset is one of the largest video datasets of human actions. It
contains six types of human actions (walking, jogging, running, boxing, hand waving and
hand clapping), each performed several times by 25 subjects, both outdoors and indoors and
in different clothes. All sequences were taken over homogeneous backgrounds with a static
camera at a frame rate of 25 fps. The data set is fairly synthetic and does not represent
real-world scenarios.

A.2 UCF Action Dataset

The UCF (University of Central Florida) vision lab collected a set of eight actions from
various sports featured on channels such as BBC and ESPN. Actions in this dataset include
golf swings, diving, kicking a soccer ball, weight lifting, horse riding, running, skating
and swingbenching. Some frames in the original dataset contain more than one person,
unlike the KTH dataset, in which each frame contains only one person. This relatively new
data set contains close to 200 video sequences at a resolution of 720x480. The collection
represents a natural pool of actions featured in a wide range of scenes and viewpoints.
