Proceedings of Seminar and Project
Computer Vision: Object and People Tracking
Winter semester 2013/14
Dr. Gabriele Bleser and Prof. Didier Stricker
Department Augmented Vision
University of Kaiserslautern and DFKI GmbH
Introduction
The seminar and project Computer Vision: Object and People Tracking (INF73-72-S-7, INF-73-82-L-7) are continuative courses based on and applying the
knowledge taught in the lectures 3D Computer Vision (INF-73-51-V-7) and Computer Vision: Object and People Tracking (INF-73-52-V-7). The goal of the
project is to research, design, implement and evaluate algorithms and methods
for tackling computer vision problems. The seminar is more theoretical. Its educational objective is to train the ability to become acquainted with a specific
research topic, review scientific articles and give a comprehensive presentation
supported by media.
In the winter semester 2013/14, projects and seminars addressed image-based
recognition and tracking tasks in the 2D image (e.g. objects and text) and in
the 3D domain (e.g. body motion tracking and gaze estimation). The results are
documented in these proceedings.
Organisers and supervisors
The courses are organised by the Department Augmented Vision (http://ags.cs.uni-kl.de), more specifically by:
Prof. Dr. Didier Stricker
Dr. Gabriele Bleser
In the winter semester 2013/14, the projects were supervised by the following
department members:
Dr. Alain Pagani
Christian Bailer
Mohamed Selim
Nils Petersen
Stephan Krauss
Sebastian Palacio
Markus Miezal
August 2014
Dr. Gabriele Bleser
Real time text recognition in natural scene
Pramod Murthy1 and Alain Pagani2
1 murthy.pramod@gmail.com
2 alain.pagani@dfki.de
Abstract. With the increasing usage of mobile camera based imaging, text recognition in uncontrolled environments provides a key input modality for many augmented reality applications. The goal of the project was to develop a real-time solution for text detection and recognition. The work was performed in two stages, which together cover the typical stages of the text recognition process. The first part of the study evaluated the feasibility of a standard OCR system for recognizing text, from camera based document images to text in natural scenes. The later part consists of implementing text detection and localization by selecting from a set of Extremal Regions (ER). Combining the two components, an end-to-end solution was developed. Finally, we discuss various issues faced and possible improvements to solve them.
Keywords: Text detection and localization, Text recognition, OCR, Natural scene, ER detection.
1 Introduction
The problem of recognizing text in natural scenes has drawn significant attention in computer vision research in recent years. One of the major factors contributing to this is the high usage of camera imaging in mobile devices. Text recognition for document imaging is solved using robust and powerful algorithms, but text detection and recognition in natural scenes is still unsolved. This is due to the fact that localizing text is a very expensive task: there are 2^N subsets of pixels which might represent text in an image (where N is the number of pixels) [6]. Nevertheless, solutions would lead to interesting augmented reality applications, like assisting blind people to navigate, serving product reviews and relevant information on a smartphone, or providing translation services for textual information in wearable gadgets. The scope of the project was to understand the problem of real-time text detection and recognition in natural scenes and to provide different ways of improving text recognition accuracy.
2 Related Work
We can broadly classify the approaches for text detection and extraction into
four different classes.
Edge based methods Edges are good features for text detection in natural scenes. These methods usually apply an edge detector first and then apply morphological operators to separate text from the background scene. Edge based detectors rely heavily on characters exhibiting strong edges and fail under strong illumination or when shadows are overlaid on the scene. Proposed approaches include multiscale edge based extraction [5, 7], self-adapting thresholding with morphological operations [1], and pyramid based decomposition of the image [4, 3] with color based edge detection to support changes in text color and varying font sizes in the scene.
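A minimal sketch of this family of methods, using OpenCV's Canny edge detector followed by a morphological closing that merges character edges into candidate text blobs; the thresholds, kernel size and filtering heuristics below are illustrative assumptions, not values taken from the cited papers.

```python
import cv2

def edge_based_text_candidates(image_bgr):
    """Return bounding boxes of candidate text regions found via edges + morphology."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 100, 200)                          # characters usually give strong edges
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (9, 3))
    blobs = cv2.morphologyEx(edges, cv2.MORPH_CLOSE, kernel)   # merge edges of neighbouring characters
    contours, _ = cv2.findContours(blobs, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        if w > h and w * h > 100:                              # crude heuristic: text lines are wide
            boxes.append((x, y, w, h))
    return boxes
```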
Texture based methods Texture based methods use textural properties of text to distinguish it from the background. A probable text region is extracted by applying texture analysis methods such as Gaussian filtering, wavelet decomposition, the Fourier transform, the Discrete Cosine Transform (DCT) and local binary patterns (LBP). Finally, a classifier (usually a trained machine learning model) is used to decide on text regions. One approach for detecting multilingual text [12] uses histograms of oriented gradients (HOG), mean of gradients (MG) and LBP as features, and an AdaBoost classifier to decide on text regions.
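The sketch below illustrates the general idea (it is not the exact feature set or training setup of [12]): LBP and HOG features are extracted from grayscale patches with scikit-image and fed to an AdaBoost classifier. The patch size, LBP parameters and the randomly generated stand-in training data are assumptions.

```python
import numpy as np
from skimage.feature import hog, local_binary_pattern
from sklearn.ensemble import AdaBoostClassifier

def texture_features(patch):
    """Concatenate an LBP histogram and a HOG descriptor for one grayscale patch."""
    lbp = local_binary_pattern(patch, P=8, R=1, method="uniform")
    lbp_hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)
    hog_feat = hog(patch, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
    return np.hstack([lbp_hist, hog_feat])

# Stand-in for labeled 32x32 text / non-text patches (real training data would be used instead).
rng = np.random.default_rng(0)
patches = rng.integers(0, 256, size=(20, 32, 32)).astype(np.uint8)
labels = rng.integers(0, 2, size=20)

X = np.array([texture_features(p) for p in patches])
clf = AdaBoostClassifier(n_estimators=100).fit(X, labels)
```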
Connected Components Connected component based methods group a set of small regions successively until all regions which are part of the text are identified. These methods usually segment candidate text components by edge detection and color clustering. The non-text components are pruned with heuristic rules or classifiers. Since CC based methods produce few candidate regions compared to other approaches, they have a low computational cost. The located regions can then be used directly as input for an optical character recognition engine. There are also iterative approaches with a modified Conditional Random Field algorithm that obtain connected components using Belief Propagation inference and OCR filtering stages [11]. Another method analyzes separate color image layers with a Block Adjacency Graph (BAG) [9].
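A rough sketch of the connected component idea using OpenCV: binarize the image, label the components, and prune non-text components with simple heuristic rules (the thresholds below are illustrative assumptions).

```python
import cv2

def connected_component_candidates(gray):
    """Label components of a binarized image and keep the plausibly character-like ones."""
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    n, _, stats, _ = cv2.connectedComponentsWithStats(binary)
    candidates = []
    for i in range(1, n):                                      # label 0 is the background
        x, y, w, h, area = stats[i]
        aspect = w / float(h)
        fill = area / float(w * h)
        if area > 10 and 0.1 < aspect < 10.0 and fill > 0.1:   # heuristic pruning rules
            candidates.append((x, y, w, h))
    return candidates
```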
Stroke based methods Another set of approaches uses the stroke width of text as a feature to detect text and distinguish it from the background. The image is segmented using the stroke width feature and grouped by clustering. An operator called the Stroke Width Transform (SWT) allows characters to be detected over different scales by merging pixels of similar stroke width into connected components [2]. These approaches are primarily designed for text with horizontal orientation and would fail to recognize text with different orientations.
As each class of methods fails in certain conditions because of its inherent approach and varied character orientations, researchers have proposed new approaches that combine the strengths of each. A hybrid algorithm detects text in arbitrary orientations by adapting two sets of features based on SWT and a two level classification scheme, and achieved good results on ICDAR datasets [10]. The extremal regions approach of Neumann and Matas [6] provides real-time performance with a reduced memory footprint. Since the goal of the project was to develop a real-time text recognition system, a computationally inexpensive approach with a low memory footprint was needed. We therefore focused on hybrid approaches with extremal regions for text localization, which compare favorably with the approaches with the highest f-measure [11]. The extremal regions method was trained on the ICDAR 2003 dataset and evaluated on the ICDAR 2011 and street view datasets. The results were the second highest, with a precision of 73.1%, recall of 64.7% and f-measure of 68.7%, while on the more challenging Street View Text (SVT) dataset the recognition rates were a precision of 67.0%, recall of 29.0% and f-measure of 41.0% [6].
3 System Design
Designing a text recognition system is a quite challenging task, as the images may exhibit several variations in imaging and distortions.
Fig. 1. Image processing pipeline
A typical image processing pipeline for text detection and recognition in natural scenes is shown in Figure 1 [11]. An input image is preprocessed to resize it to the optimal resolution and to calculate features for text detection. In the text detection and localization step, a bounding box is found around the text region present within the image. This extracted region is further enhanced and separated from the background. Typically, the resulting image is converted into a binary image before it is fed into an OCR engine. The OCR engine should be configured appropriately for the language of the text, so that it loads the corresponding language model for text correction.
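A minimal sketch of such a pipeline using OpenCV for preprocessing and the pytesseract wrapper around the Tesseract engine. The resize factor, Otsu binarization and the "eng" language model are illustrative assumptions, not the configuration used in this project.

```python
import cv2
import pytesseract

def recognize_text(image_path, lang="eng"):
    """Preprocess an image and run Tesseract on it, returning the recognized text."""
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    gray = cv2.resize(gray, None, fx=2.0, fy=2.0, interpolation=cv2.INTER_CUBIC)  # upscale small text
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)  # binarize before OCR
    return pytesseract.image_to_string(binary, lang=lang)

print(recognize_text("scene.jpg"))   # hypothetical input image
```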
Since OCR systems have evolved with improved accuracy rates, we performed an evaluation study of an OCR system to benchmark text recognition rates for camera based document images.
4 OCR evaluation
The Tesseract OCR engine was chosen for performing OCR on camera based document images. Tesseract, combined with other open source imaging libraries, supports different image input formats and converts them to text in over 60 languages [8].
A Canon EOS 5D Mark II camera with a resolution of 21 megapixels was used for taking images at a distance of roughly 70 cm. A set of 14 images was captured of a single page document printed in Arial font. Each image in the set contained the document printed at a different font size. Font sizes ranging from 8 to 30 were considered, increasing by 2 units at each level, as shown in Figure 2. In the same way, a collection of images containing single page documents varying in font size and font type was collected. OCR is applied to the different images to detect and recognize the characters written in the document. The recognized text output is compared with the ground truth text for the respective images. Finally, the various accuracy measures for the OCR output are calculated.
4.1 Experimental setup
A database of natural images containing a single page document was used. Two sets of images were used, Arial and Times New Roman, according to the font used for printing the document text, as shown in Figure 2. Each set had a total of 14 images, in which each successive image contains a document with the font size incremented by 2. The first image in the set to be processed had a document text with font size 8, whereas the last document had font size 30. The following steps explain the overall setup as shown in Figure 1.
1. A single image in the given set of images for a single font type was selected.
2. The image is processed using the Tesseract OCR engine.
3. An output-OCR txt file is generated for the Tesseract OCR output.
4. The image selected in step 1 is reduced to the next resolution size and the process is continued from step 2 until the smallest reduction level is reached.
5. The output-OCR text file is compared with a text file containing the accurate text (ground truth) present in the image.
6. The different measures (precision, recall and F-measure) are calculated, as sketched below.
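A minimal sketch of the word-level comparison, assuming the standard conventions (precision relative to the words returned by the OCR, recall relative to the ground truth):

```python
from collections import Counter

def word_level_scores(ocr_text, ground_truth_text):
    """Word-level precision, recall and F-measure between OCR output and ground truth."""
    ocr_words = Counter(ocr_text.split())
    gt_words = Counter(ground_truth_text.split())
    correct = sum((ocr_words & gt_words).values())      # multiset intersection of matching words
    precision = correct / max(sum(ocr_words.values()), 1)
    recall = correct / max(sum(gt_words.values()), 1)
    f_measure = 2 * precision * recall / max(precision + recall, 1e-9)
    return precision, recall, f_measure
```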
Fig. 2. Sample images containing a document with Arial font and font sizes 8 (a) and 30 (b), and Times New Roman font with sizes 8 (c) and 30 (d)
Measurement procedure Three measures, listed below, were used to evaluate the OCR output at word level:
Precision: the fraction of correctly recognized words relative to the total number of words retrieved in the OCR output document.
Recall: the fraction of correctly recognized words relative to the number of words present in the ground truth document.
F-Measure: the weighted harmonic mean of precision and recall.
F = 2 · (precision · recall) / (precision + recall)

4.2 Results
Fig. 3. The precision (a), recall (b) and f-measures (c) for images containing varying
text size for arial font
Fig. 4. The precision (a), recall (b) and f-measures (c) values for images containing
varying text size for times New Roman font
The feasibility of text retrieval in natural images with text in two standard fonts (Arial and Times New Roman) has been analyzed. We measured the accuracy of word detection for a single page document image taken by a camera at different font sizes and at a human readable distance (approximately 1 m), as shown in Figures 3 and 4. We also measured the optimal character height in pixels for correct detection of text, in order to estimate the resolution of the optics to be used by a text recognition system. Finally, it was observed from Figures 5 and 6 that the character height (x-height) should be at least 16 pixels to obtain a text recognition accuracy of 90% and above.
5 Text detection and localization
Text detection and localization is the process of determining the location of text in the image and generating bounding boxes around it [11]. This part of the image is then served as input to OCR for further evaluation. There is certainly a need for image correction and other preprocessing steps, such as finding the orientation of the text and correcting camera angles or perspective distortions. We implemented the text localization and detection method using extremal regions by Neumann and Matas [6]. An Extremal Region (ER) is a region of the image whose outer boundary pixels have strictly higher values than the region itself. The method can be divided into two stages of classification. In the first stage, a set of ERs is computed using a sequential search. The ER detector has a complexity of O(2pN), where p denotes the number of channels used and N the number of pixels. The probability of each ER being a character is estimated using a set of incrementally computable descriptors, which consist of the following elements:
– Area The area of the region.
– Bounding box The top-right and bottom-left corners of the region.
– Perimeter The length of the boundary of the region.
– Euler number The difference between the number of connected components and the number of holes in a binary image.
– Horizontal crossings A vector with the number of transitions between pixels belonging to the ER and pixels not belonging to the ER in a given row i of the ER region r.

Fig. 5. Plots of precision (a), recall (b) and f-measure (c) values vs character heights (x-height) for arial font
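For illustration, the sketch below computes these descriptors for a single binary region mask with NumPy and OpenCV. It is deliberately non-incremental and assumes the region does not touch the image border; the actual detector of [6] maintains the descriptors in O(1) as the threshold increases.

```python
import cv2
import numpy as np

def region_descriptors(mask):
    """Area, bounding box, perimeter, Euler number and horizontal crossings of a 0/1 mask."""
    area = int(mask.sum())
    ys, xs = np.nonzero(mask)
    bbox = (xs.min(), ys.min(), xs.max(), ys.max())          # (left, top, right, bottom)
    contours, _ = cv2.findContours(mask, cv2.RETR_LIST, cv2.CHAIN_APPROX_NONE)
    perimeter = sum(cv2.arcLength(c, closed=True) for c in contours)
    n_fg, _ = cv2.connectedComponents(mask)                  # labels incl. background
    n_bg, _ = cv2.connectedComponents((1 - mask).astype(np.uint8))
    holes = max(n_bg - 2, 0)                                 # background labels minus the outer background
    euler = (n_fg - 1) - holes
    crossings = [int(np.count_nonzero(np.diff(row))) for row in mask]
    return area, bbox, perimeter, euler, crossings

mask = np.zeros((20, 20), np.uint8)
cv2.circle(mask, (10, 10), 6, 1, thickness=2)                # a ring: one component, one hole
print(region_descriptors(mask))
```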
A sequential classifier selects ER regions using the incrementally computable descriptors as features. The classification is applied in two steps to make it more computationally efficient. During the first step, the features are computed in O(1) per ER while incrementally increasing the threshold value from 0 to 255. The value of the class conditional probability p(r | character) is tracked at each threshold, and an ER is only selected when the probability is above a global limit P_min and the difference between its local maximum and a local minimum is greater than ∆_min. A Real AdaBoost classifier using decision trees processes the incrementally computable descriptors in O(1). Its output is calibrated to a class conditional probability p using logistic regression to select extremal regions. The parameter constants P_min = 0.2 and ∆_min = 0 are set to obtain a high recall rate (95.6%) [6]. In the second stage, we applied the Tesseract OCR engine to obtain the text in the different extremal regions. Tesseract provides adaptation options, such as assuming the input to be a single block of text, or automatic orientation correction and segmentation detection mode.

Fig. 6. Plots of precision (a), recall (b) and f-measure (c) values vs character heights (x-height) for times new roman font
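If OpenCV is built with the contrib text module, the Neumann–Matas ER pipeline described above is available directly. The sketch below follows OpenCV's own text detection sample; the classifier XML files ship with opencv_contrib, and the parameter values are that sample's defaults rather than the ones used in this project.

```python
import cv2

img = cv2.imread("scene.jpg")                          # hypothetical input image
channels = cv2.text.computeNMChannels(img)             # per-channel ER extraction as in [6]

for channel in channels:
    erc1 = cv2.text.loadClassifierNM1("trained_classifierNM1.xml")
    er1 = cv2.text.createERFilterNM1(erc1, 16, 0.00015, 0.13, 0.2, True, 0.1)   # first stage
    erc2 = cv2.text.loadClassifierNM2("trained_classifierNM2.xml")
    er2 = cv2.text.createERFilterNM2(erc2, 0.5)                                  # second stage
    regions = cv2.text.detectRegions(channel, er1, er2)
    rects = cv2.text.erGrouping(img, channel, [r.tolist() for r in regions])     # group ERs into words
    for x, y, w, h in rects:
        cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imwrite("detections.jpg", img)
```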
6 Experiments

6.1 Experiment 1
We applied our system to images with different types of segmentation and media. The system was applied to images containing product information. The Tesseract engine succeeded at segmentation and layout analysis in images where the product had a flat surface geometry. The recognition rate decreased dramatically for cylindrical objects (Figure 7).
Fig. 7. Sample product images containing text in different layouts.
6.2 Experiment 2
In the next experiment, we applied our system to an offline high quality video stream to evaluate it under different illumination changes. The video was split into frames that were used as input to the whole system. The system did detect text, but the Tesseract OCR engine only recognized text in font types and languages covered by the supported language models, as shown in Figure 8.
Figure 9 shows the overall framework developed for text detection and recognition. The input video stream is split into frames. To each frame, a text detection filter is applied to obtain a list of extremal regions. The extremal regions are cropped from the image and processed before they are given as input to the Tesseract OCR system. The Tesseract OCR engine needs to be initialized with the appropriate language models for recognizing the text in the cropped images.
7 Conclusion
A real-time text detection and recognition method was developed and applied to detect text in natural scene images of varying complexity. A feasibility study of the Tesseract OCR system on camera based text documents showed that a character height of at least 16 pixels is needed to retrieve 90% of the words. Further, the Tesseract engine was combined with the Neumann and Matas algorithm [6] for text detection and localization. The combined system was successfully applied to varying text layouts and sizes in natural scenes.

Fig. 8. Text recognition in a video.

Fig. 9. Real time text detection and recognition framework.
References
[1] Ying-ying CUI, Jie YANG, and Dong LIANG. An edge-based approach for
sign text extraction. Image Technology, 1:007, 2006.
[2] Boris Epshtein, Eyal Ofek, and Yonatan Wexler. Detecting text in natural scenes with stroke width transform. In Computer Vision and Pattern
Recognition (CVPR), 2010 IEEE Conference on, pages 2963–2970. IEEE,
2010.
[3] Nobuo Ezaki, Marius Bulacu, and Lambert Schomaker. Text detection from
natural scene images: towards a system for visually impaired persons. In
Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on, volume 2, pages 683–686. IEEE, 2004.
[4] Anil K Jain and Bin Yu. Automatic text location in images and video
frames. Pattern recognition, 31(12):2055–2076, 1998.
[5] Xiaoqing Liu and Jagath Samarabandu. Multiscale edge-based text extraction from complex images. In Multimedia and Expo, 2006 IEEE International Conference on, pages 1721–1724. IEEE, 2006.
[6] Lukas Neumann and Jiri Matas. Real-time scene text localization and recognition. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE
Conference on, pages 3538–3545. IEEE, 2012.
[7] Wen-wu Ou, Jun-min Zhu, and Chang-ping Liu. Text location in natural
scene. Journal of Chinese Information Processing, 5:006, 2004.
[8] Ray Smith. Tesseract ocr engine. Lecture. Google Code. Google Inc, 2007.
[9] Kongqiao Wang and Jari A Kangas. Character location in scene images
from digital camera. Pattern recognition, 36(10):2287–2299, 2003.
[10] Cong Yao, Xiang Bai, Wenyu Liu, Yi Ma, and Zhuowen Tu. Detecting
texts of arbitrary orientations in natural images. In Computer Vision and
Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 1083–1090.
IEEE, 2012.
[11] Honggang Zhang, Kaili Zhao, Yi-Zhe Song, and Jun Guo. Text extraction
from natural scene image: A survey. Neurocomputing, 122:310–323, 2013.
[12] Gang Zhou, Yuehu Liu, Quan Meng, and Yuanlin Zhang. Detecting multilingual text in natural scene. In Access Spaces (ISAS), 2011 1st International
Symposium on, pages 116–120. IEEE, 2011.
Struck (Structured Output Tracking with
Kernels) and related work
Achim Otting1 and Christian Bailer2
1 otting@rhrk.uni-kl.de
2 christian.bailer@dfki.de
Abstract. This paper presents the Struck (Structured Output Tracking with Kernels) algorithm. It augments the idea of adaptive tracking-by-detection with structured output learning with SVMs. In contrast to traditional tracking-by-detection methods, Struck directly links learning and tracking and does not need an intermediate binarization step. Inaccurate tracking may lead to wrong classifications and further drift. A budgeting mechanism for the number of support vectors makes real-time application of Struck possible. Experiments show that Struck can outperform state-of-the-art tracking-by-detection methods. In this paper a way to make Struck more robust to appearance changes is also presented.
Keywords: Struck, Structured Output, SVM, Tracking, Kernels
1 Introduction

1.1 Motivation
Visual object tracking is used in many application areas like traffic surveillance [13], medicine [18] and human computer interaction [7]. After many years of research, some challenges in visual object tracking still remain. Especially appearance changes of the target object like rotation, scale changes and deformations, as well as illumination changes and occlusions, remain difficult [12]. No general visual object tracking algorithm is available which is able to solve all the mentioned problems in all possible situations. Wu et al. [20] designed large scale experiments to assess state-of-the-art visual object tracking algorithms. The evaluation shows that the Struck approach, which is presented in this paper, can exceed the results of the competing algorithms in many benchmarks. Struck uses structured output SVMs to estimate the object's position. A budgeting mechanism for the number of support vectors makes real-time applications possible.
1.2 Related Work
In the past, several approaches for visual object tracking were developed. Asset-2 (A Scene Segmenter Establishing Tracking, Version 2) [17] and the traffic monitoring system of [13] extract two dimensional features to find the optical flow of an image for the tracking task. Motion models can be considered to reduce the search space. Another way to track objects is to classify samples generated from the video sequence into background and target objects. Tracking-by-detection can, in an advanced implementation (e.g. [3]), handle object appearance and disappearance, but does not use motion models. Struck augments the ideas of adaptive tracking-by-detection.
Tracking-by-detection can be seen as detection over time. Today, good detection methods are available. Avidan [1] describes a tracking system for vehicles that searches for strong edges in images and uses a support vector machine (SVM) to decide whether the detected edges belong to a vehicle or not.
Adaptive tracking-by-detection methods allow handling changes in appearance. This is realized by online-trained classifiers. The approaches of Babenko et al. [3], Grabner et al. [10] and Saffari et al. [16] are boosting-based.
1.3 Preliminaries
In this chapter some basics of how to design multi-class classifiers are discussed, which are important to understand the fundamental principles of Struck. To classify multiple classes with SVMs, it is possible to define several binary classifiers and combine their results; One-vs-One and One-vs-Rest training are often used. One can also design one single classifier which classifies multiple classes [8]. This is called structured output learning.
Tracking By Detection The tracking algorithm predicts for every time t the position p_t of the bounding box. The classifier learns for every t the predefined features and assigns to every sample x a binary label +1 or −1 according to z = sign(h(x)). The assumption is that during tracking the maximum classification confidence lies around p_{t−1} and goes with p_t. The translation y_t of the object at t is

y_t = arg max_{y ∈ Y} h(x_t^{p_{t−1} ◦ y})     (1)
and the new estimated object position is computed as

p_t = p_{t−1} ◦ y_t.     (2)
As mentioned by Hare et al. [11], the parts of the adaptive tracking-by-detection approaches can now be described in short as (1) sample generation and labeling and (2) classifier update.
They also point out the main drawbacks of this approach. One problem is that the samples have the same weight during training. So a negative sample which highly overlaps the bounding box has the same weight as one which has only a little overlap. The poorly labeled samples can reduce the classifier accuracy and also the tracking accuracy. There also exists no universal strategy for sample generation and labeling. This leads to label noise. Another problem they point out is that the maximum classifier confidence does not necessarily go together with the best estimate of the object's location. This is because the objectives of the classifier and the tracker are very different: the first classifies the samples as target object or background and the second estimates the object's location. There exists no direct connection between these two parts during learning, which results in poor labels.
Structured Output The goal is to enable the classifier not only to process arbitrary input but also to generate arbitrary (more complex than simple binary) output with only one single classifier. To do so, one has to bring the outputs into a relationship with each other. Such classifiers can make better use of the training data for learning. In the case of object detection, the output space can be four points describing the bounding box surrounding the target object. The values of the points depend on each other: the values of the left and top points are lower than the values of the right and bottom points. In addition, the scores of the bounding boxes are correlated; bounding boxes that highly overlap will have nearly the same score. Considering these dependencies will improve training and testing [4].
SVMs To separate samples of two classes (x_1, y_1), ..., (x_n, y_n) with x ∈ R^n and y ∈ {−1, 1}, an SVM creates a hyperplane described by ⟨w, x⟩ − b = 0 from these samples. The SVM minimizes (in the not linearly separable case) φ(ξ) = (1/2)‖w‖² + C Σ_{i=1}^l ξ_i. ξ is called the slack variable and is needed to consider the loss.
So the constrained optimization problem is

P(w, b, ξ) = min_w (1/2)‖w‖² + C Σ_{i=1}^l ξ_i     (3)

subject to ∀i: y_i(⟨w, x_i⟩ − b) ≥ 1 − ξ_i and ξ_i ≥ 0.
To get rid of the constraints in (3), positive dual variables α_i are introduced. Each constraint is multiplied with a dual variable and added; the result is called the Lagrangian of the optimization problem:

L(w, b, α) = min_{w,b} (1/2)‖w‖² + C Σ_{i=1}^l ξ_i − Σ_{i=1}^l α_i (y_i(⟨w, x_i⟩ − b) + ξ_i − 1).     (4)
The unconstrained optimization problem (dual problem) is now a maximization:

D(α) = max_α Σ_{i=1}^l α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩     (5)

subject to ∀i: 0 ≤ α_i ≤ C and Σ_{i=1}^l α_i y_i = 0.
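As a quick numerical illustration of this soft-margin formulation (not part of the Struck implementation), a linear SVM can be fit with scikit-learn and its support vectors and dual coefficients inspected; the toy data below is made up.

```python
import numpy as np
from sklearn.svm import SVC

# Toy two-class data; C is the slack penalty from Eq. (3).
X = np.array([[0.0, 0.0], [0.2, 0.3], [0.4, 0.1], [1.0, 1.0], [1.2, 0.8], [0.5, 0.6]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(clf.support_vectors_)   # the x_i with non-zero alpha_i
print(clf.dual_coef_)         # signed dual coefficients y_i * alpha_i, bounded by C
print(clf.intercept_)         # corresponds to the bias b (up to sign convention)
```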
2 Struck
Hare et al. [11] (Struck) are convinced that the existing tracking-by-detection algorithms only address the label noise by making their classifier more robust. But the real problem is that the labeler and the learner are separated. The Struck algorithm, which is presented in this chapter, does not depend on a labeler but directly links learning and tracking. There is no intermediate binarization step. It directly learns the object transformation with a structured output SVM. Struck is based on the work of Crammer and Singer [8].
New samples are classified according to y_t = arg max_y F(x_t^{p_{t−1}}, y), where F(x, y) = ⟨w, φ(x, y)⟩ measures the compatibility between (x, y) pairs. The problem for structured output SVMs is defined as

P(w, b, ξ) = min_w (1/2)‖w‖² + C Σ_{i=1}^n ξ_i     (6)

subject to ∀i: ξ_i ≥ 0
∀i, ∀y ≠ y_i: ⟨w, δφ_i(y)⟩ ≥ ∆(y_i, y) − ξ_i

where δφ_i(y) = φ(x_i, y_i) − φ(x_i, y), so that ⟨w, δφ_i(y)⟩ = F(x_i, y_i) − F(x_i, y).
Note that this is not a linearly separable problem and therefore joint kernel maps φ(x, y) are used. The inequality contains a loss function ∆(y_i, y) to handle the first issue (all samples being equally weighted). The loss function should be 0 iff y_i = y and increase as y_i and y diverge.
2.1 Online Optimization
The (dual) optimization problem of Struck which has to be solved is (see [11] and [5] for details):

D(β) = max_β − Σ_{i,y} ∆(y_i, y) β_i^y − (1/2) Σ_{i,y,j,ȳ} β_i^y β_j^ȳ ⟨φ(x_i, y), φ(x_j, ȳ)⟩     (7)

subject to ∀i, ∀y: β_i^y ≤ δ(y, y_i) C
∀i: Σ_y β_i^y = 0

with δ(y, ȳ) = 1 if y = ȳ and 0 otherwise, ⟨φ(x_i, y), φ(x_j, ȳ)⟩ a kernel comparing two image patches from frames x_i, x_j at positions y, ȳ, and

F(x, y) = Σ_{i,ȳ} β_i^ȳ ⟨φ(x_i, ȳ), φ(x, y)⟩.     (8)
Now let us discuss some basic definitions. If β_i^y ≠ 0 for an (x_i, y) pair, then this pair is called a support vector and x_i a support pattern. For any given support pattern there exists only one support vector with β_i^{y_i} > 0, namely (x_i, y_i), which is called the positive support vector. For all other support vectors β_i^y < 0 holds; they are called negative support vectors.
To solve the dual problem (7), an SMO algorithm [15] is used. The SMO step can be found in Listing 1.1. The coefficients of the support vectors and the gradients are updated during this SMOStep. The algorithm takes a sample, y+ and y−. As one can see, the dual variables β_i^{y+} and β_i^{y−} are incremented and decremented by the same value (lines 8, 9). This ensures that the second constraint of (7) remains fulfilled. A coefficient β_i^y equal to zero leads to its support vector being removed. At the end, the gradients are updated according to

g_i(y) = −∆(y, y_i) − F(x_i, y).     (9)
Listing 1.1. SMOStep taken from [11]

1  Require: i, y+, y−
2  k00 = ⟨φ(xi, y+), φ(xi, y+)⟩
3  k11 = ⟨φ(xi, y−), φ(xi, y−)⟩
4  k01 = ⟨φ(xi, y+), φ(xi, y−)⟩
5  λu = (gi(y+) − gi(y−)) / (k00 + k11 − 2 k01)
6  λ = max(0, min(λu, C δ(y+, yi) − βi^{y+}))
7  Update coefficients
8  βi^{y+} += λ
9  βi^{y−} −= λ
10 Update gradients
11 for (xj, ȳ) ∈ S do
12   k0 = ⟨φ(xj, ȳ), φ(xi, y+)⟩
13   k1 = ⟨φ(xj, ȳ), φ(xi, y−)⟩
14   gj(ȳ) −= λ (k0 − k1)
15 end for
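For concreteness, a minimal Python transcription of this step is sketched below. The support set is assumed to be kept in dictionaries beta[(i, y)] and grad[(i, y)] (with gradients for the chosen pair already present), kernel(i, y, j, y2) is assumed to return ⟨φ(x_i, y), φ(x_j, y2)⟩, and y_true[i] holds y_i; these data structures are illustrative, not the authors' implementation.

```python
def smo_step(i, y_pos, y_neg, beta, grad, kernel, C, y_true):
    """One SMO step on the pair (y_pos, y_neg) of sample i, cf. Listing 1.1."""
    k00 = kernel(i, y_pos, i, y_pos)
    k11 = kernel(i, y_neg, i, y_neg)
    k01 = kernel(i, y_pos, i, y_neg)
    lam_u = (grad[(i, y_pos)] - grad[(i, y_neg)]) / (k00 + k11 - 2.0 * k01)
    upper = (C if y_pos == y_true[i] else 0.0) - beta.get((i, y_pos), 0.0)   # C*delta(y+, y_i) - beta
    lam = max(0.0, min(lam_u, upper))
    # Update coefficients; the paired update keeps sum_y beta_i^y = 0.
    beta[(i, y_pos)] = beta.get((i, y_pos), 0.0) + lam
    beta[(i, y_neg)] = beta.get((i, y_neg), 0.0) - lam
    # Update the gradients of all support vectors in S.
    for (j, y) in list(grad.keys()):
        k0 = kernel(j, y, i, y_pos)
        k1 = kernel(j, y, i, y_neg)
        grad[(j, y)] -= lam * (k0 - k1)
```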
Three strategies exist to choose a triple ((xi , yi ) , y+ , y− ) which is given as
input into SMOStep.
– PROCESSNEW takes as input a pattern xi and returns immediately if the pattern is already a support pattern. Otherwise xi is not a support pattern, it follows that (xi, yi) is not a support vector and β_i^{y_i} = 0, and the sample class yi is assigned to the positive class y+. y− is computed as arg min_{y∈Y} gi(y).
Listing 1.2. PROCESSNEW(xi) taken from [5]

1 if xi is a support pattern then exit
2 y+ = yi
3 y− = arg min_{y∈Y} gi(y)
4 Perform SMOStep(i, y+, y−)
– PROCESSOLD chooses a support pattern at random, recomputes its class and assigns it to the positive class y+. The constraint only considers existing support vectors. y− is computed as in PROCESSNEW. The new y+ and y− are those along the highest gradient.
Listing 1.3. PROCESSOLD taken from [5]

1 take a random support pattern xi
2 y+ = arg max_{y∈Y} gi(y) subject to βi^y ≤ C δ(y, yi)
3 y− = arg min_{y∈Y} gi(y)
4 Perform SMOStep(i, y+, y−)
– OPTIMIZE chooses a support pattern randomly and takes all classes of its support vectors. Then it reassigns y+ and y− from among these classes.
Listing 1.4. OPTIMIZE taken from [5]

1 take a random support pattern xi
2 Let Yi = {y ∈ Y such that (xi, y) ∈ S}
3 y+ = arg max_{y∈Yi} gi(y) subject to βi^y ≤ C δ(y, yi)
4 y− = arg min_{y∈Yi} gi(y)
5 Perform SMOStep(i, y+, y−)
Listing 1.5. Struck taken from [11]

1  Require: ft, pt−1, St−1
2  Estimate change in object location
3  yt = arg max_{y∈Y} F(xt^{pt−1}, y)
4  pt = pt−1 ◦ yt
5  Update discriminant function
6  (i, y+, y−) ← PROCESSNEW(xt^{pt}, y0)
7  SMOSTEP(i, y+, y−)
8  BUDGETMAINTENANCE()
9  for j = 1 to nR do
10   (i, y+, y−) ← PROCESSOLD()
11   SMOSTEP(i, y+, y−)
12   BUDGETMAINTENANCE()
13   for k = 1 to nO do
14     (i, y+, y−) ← OPTIMIZE()
15     SMOSTEP(i, y+, y−)
16   end for
17 end for
18 return pt, St
As one can see, PROCESSNEW and PROCESSOLD compute the classes and may create new support vectors. They can be expensive as they search over the whole transformation space to minimize (9). OPTIMIZE only reassigns the classes and updates the coefficients. This operation is much faster because the search space only contains the classes of the already computed support vectors. After every PROCESSNEW step (which processes a new sample (xi, yi)) follow nR = 10 iterations of a REPROCESS step; each REPROCESS step consists of one PROCESSOLD and nO = 10 OPTIMIZE steps (Listing 1.5) [11].
2.2 Budget
For a permanent object tracking task the number of support vectors grows unbounded, because every new incoming sample may lead to the creation of a new support vector. F is part of (9) and therefore also part of PROCESSNEW and PROCESSOLD, which become more and more expensive as new samples lead to new support vectors [11].
To ensure that object tracking can be done in real time it is necessary to remove support vectors. Wang et al. [19] show that the gradient error (that is, the performance degradation) is proportional to the change of the weight vector. So minimizing the weight degradation ‖∆w‖ minimizes the gradient error E. The pre- and postcondition of BUDGETMAINTENANCE is that Σ_y β_i^y = 0 has to hold. There is only one support vector (x_i, y_i) with β_i^y ≥ 0 for a support pattern x_i, so for budget maintenance only the support vectors with β_i^y ≤ 0 are removal candidates [11].
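A hedged sketch of one way to realize this, reusing the beta/grad/kernel structures from the SMO sketch above and following the removal criterion of [11]: for each negative support vector, compute the change ‖∆w‖² that removing it (and shifting its coefficient onto the positive support vector of the same pattern) would cause, and remove the one with the smallest change.

```python
def budget_maintenance(beta, grad, kernel, y_true, budget):
    """Remove the negative support vector whose removal changes the weight vector least."""
    support = [(i, y) for (i, y), b in beta.items() if b != 0.0]
    if len(support) <= budget:
        return
    best, best_cost = None, float("inf")
    for (i, y) in support:
        if beta[(i, y)] >= 0.0:                  # positive support vectors are never removed
            continue
        yi, b = y_true[i], beta[(i, y)]
        # ||dw||^2 if beta_i^y is removed and added to beta_i^{y_i}.
        cost = b * b * (kernel(i, y, i, y) + kernel(i, yi, i, yi) - 2.0 * kernel(i, y, i, yi))
        if cost < best_cost:
            best, best_cost = (i, y), cost
    if best is not None:
        i, y = best
        beta[(i, y_true[i])] = beta.get((i, y_true[i]), 0.0) + beta[(i, y)]
        del beta[(i, y)]
        grad.pop((i, y), None)
        # A full implementation would also update the gradients of the remaining support vectors.
```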
3 Further Improvements with Robust Tracking with Weighted Online Structured Learning
Yao et al. [21] identified two main problems of adaptive tracking-by-detection methods from which the presented Struck also suffers. One problem is that the methods need online classifiers to adapt the object's appearance model and therefore discard old training data. If the appearance model changes, the old training data is not available for a necessary redetection. The other problem is that the traditional methods weight all samples equally and either fully include or discard the samples. But newer samples in the time dependent sequence of frames should affect the classification more than older ones and should therefore be weighted higher. Yao et al. developed in [21] a robust tracker with online weighted learning. Their tracker can handle kernels and structured output. A reservoir is a constant sized subset of a very large set. Their algorithm uses a weighted reservoir, whose elements are chosen with a probability based on their weights. To update the set, the current sample replaces an element with a certain probability. In a possibly infinite sequence it is not feasible to use all samples for learning. Because online learning does not consider all samples, it performs worse than batch learning; this additional loss of online learning compared to batch learning is called the regret. But the error of the robust tracker with online weighted learning is not much higher than the error of batch learning [6, 21].
4 Experiments

4.1 Tracking by Detection
This chapter presents the experiments executed by Hare et al. [11], which compare benchmark results of Struck [11] with results of boosting-based and random-forest-based approaches [3], [10], [14] and [16]. For better comparability, Struck uses features similar to the other approaches, although these features were optimized for boosting-based approaches. Struck achieves much better results than the other approaches even with features that were not optimized for it. Further experiments and results of the combination of multiple kernels for Struck are discussed afterwards. Hare et al. [11] compared the performance of the approaches on 8 different video sequences. Sylvester and David include light, scale and pose changes. The two Face sequences include occlusion, and Face2 additionally appearance changes. The Girl sequence contains appearance and pose changes. The challenges in the Tiger and Coke sequences are occlusions, pose changes, fast motion and, in the Coke sequence, a specular object. The fast motions additionally cause motion blur [3]. The sequences can be found at [2].
Sequence   Struck∞  Struck100  Struck50  Struck20  MIForest  OMCLP  MIL   Frag  OAB
Coke       0.57     0.57       0.56      0.52      0.35      0.24   0.33  0.08  0.17
David      0.80     0.80       0.81      0.35      0.72      0.61   0.57  0.43  0.26
Face1      0.86     0.86       0.86      0.81      0.77      0.80   0.60  0.88  0.48
Face2      0.86     0.86       0.86      0.83      0.77      0.78   0.68  0.44  0.68
Girl       0.80     0.80       0.80      0.79      0.71      0.64   0.53  0.60  0.40
Sylvester  0.68     0.68       0.67      0.58      0.59      0.67   0.60  0.62  0.52
Tiger1     0.70     0.70       0.69      0.68      0.55      0.53   0.52  0.19  0.23
Tiger2     0.56     0.57       0.55      0.39      0.53      0.44   0.53  0.15  0.28

Table 1. Tracking-by-detection benchmark results using the VOC overlap measure, taken from [11]
Table 1 compares Struck with the other algorithms. The feature vector uses Haar-like features arranged on a grid at two different scales and is smoothed with a Gaussian. It is sufficient to search the points on a polar grid within a bounded distance. The experiment was repeated five times and the median result was taken (see details in [11]).
The PASCAL Visual Object Classes (VOC) overlap measure

a_o = area(B_p ∩ B_gt) / area(B_p ∪ B_gt)

[9] measures how well the algorithm predicts the position of the bounding box. It considers the predicted bounding box B_p and the ground truth bounding box B_gt. A sample is counted as positive if the overlap of both bounding boxes is greater than 0.5.
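A small sketch of this overlap measure for axis-aligned boxes given as (x, y, w, h):

```python
def voc_overlap(box_p, box_gt):
    """Intersection-over-union of two boxes (x, y, w, h); a value > 0.5 counts as positive."""
    xp, yp, wp, hp = box_p
    xg, yg, wg, hg = box_gt
    iw = max(0.0, min(xp + wp, xg + wg) - max(xp, xg))
    ih = max(0.0, min(yp + hp, yg + hg) - max(yp, yg))
    inter = iw * ih
    union = wp * hp + wg * hg - inter
    return inter / union if union > 0 else 0.0

print(voc_overlap((0, 0, 10, 10), (5, 5, 10, 10)))   # 25 / 175 ≈ 0.14
```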
In all but one experiment, Struck outperforms the other approaches. It is remarkable that a low budget for Struck is sufficient to get good results. In four of eight experiments, Struck20 is better than all other approaches. The best results for Struck are often achievable with a low budget of 50. Only Frag is a little better than Struck when tracking Face1, because Frag was optimized for partial occlusions. The performance of Frag decreases dramatically in Face2, because it does not only contain occlusion but also appearance changes, which Frag cannot handle since it has no adaptive object model [3]. Struck20 performs poorly on the David sequence. The background is very dynamic and changes its appearance, so 20 support vectors may be too few to generalize the background. With the unoptimized Struck algorithm and a budget size of 100, Hare et al. achieved an average of 13.2 FPS. This shows that Struck is suitable for real-time applications [11].
Figure 1 shows the support vectors after tracking for Struck64. The green boxes represent positive support vectors (the object) and the red boxes represent negative support vectors (the background). The positive support vectors can track the appearance change of the object, so the approach is suitable for adaptive tracking by detection. Many more negative support vectors are needed because the background has a higher variance than the foreground object [11].

Fig. 1. Tracked girl, taken from [11]
4.2 Combination of multiple Kernels

As an improvement, Hare et al. [11] tested the combination of multiple features through averaged kernels k(x, x̄) = (1/N_k) Σ_{i=1}^{N_k} k^(i)(x^(i), x̄^(i)). The more advanced Multiple Kernel Learning would also learn the kernel weights, but experiments showed that learning the weights would not improve the tracking performance significantly. Table 2 shows the results for combinations of multiple kernels. Unfortunately, the results are not significantly better and sometimes even worse. The performance depends strongly on the choice of features. In future research, Hare et al. want to investigate better choices of features.
Sequence   A     B     C     AB    AC    BC    ABC
Coke       0.57  0.57  0.69  0.62  0.65  0.68  0.63
David      0.80  0.83  0.67  0.84  0.68  0.87  0.87
Face1      0.86  0.82  0.86  0.82  0.87  0.82  0.83
Face2      0.86  0.79  0.79  0.83  0.86  0.78  0.84
Girl       0.80  0.77  0.68  0.79  0.80  0.79  0.79
Sylvester  0.68  0.75  0.72  0.73  0.72  0.77  0.73
Tiger1     0.70  0.69  0.77  0.69  0.74  0.74  0.72
Tiger2     0.57  0.60  0.61  0.53  0.63  0.57  0.56

Table 2. Combined kernels, taken from [11]. Column A: Haar-like features with a Gaussian kernel (σ = 0.2), column B: raw features with a Gaussian kernel (σ = 0.1), column C: histogram features with intersection kernels.
5 Conclusion
In this paper we discussed Struck as an extension of adaptive tracking-by-detection using structured output. In contrast to traditional tracking-by-detection approaches, it does not depend on a labeler and has no intermediate binarization step. Inaccuracies in the tracking step of traditional approaches may cause the false classification of samples and may finally lead to further drift. The augmentation of the adaptive tracking-by-detection idea enables Struck to handle appearance changes. Struck shows consistently good results in different challenging tracking tasks. The budgeting mechanism allows it to be used for real-time applications. Robust Tracking with Weighted Online Structured Learning makes the structured learning approach more robust against appearance changes, because a bounded set of old training samples is kept for training.
The experiments Hare et al. [11] performed are mainly limited to a static background and dynamic foreground (except for the David sequence). The performance of Struck when tracking dynamic objects against a dynamic background should also be investigated.
References
1. S. Avidan. Support vector tracking. In IEEE Transactions on Pattern Analysis
and Machine Intelligence, volume 26, pages 1064–1072, August 2004.
2. B. Babenko. Tracking with online multiple instance learning (miltrack). http:
//vision.ucsd.edu/~bbabenko/project_miltrack.shtml. visited 2014-03-04.
3. B. Babenko, Yang M.-H., and S. Belongie. Visual tracking with online multiple
instance learning. In 2009 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 2009.
4. M. B. Blaschko and C. H. Lampert. Learning to localize objects with structured
output regression. In European Conference on Computer Vision 2008, 2008.
5. A Bordes, L Bottou, P Gallinari, and J Weston. Solving multiclass support vector
machines with larank. In Proceedings of the 24th International Conference on
Machine Learning, ICML ’07, pages 89–96, New York, NY, USA, 2007. ACM.
6. A. Bordes, N. Usunier, and L. Bottou. Sequence labelling svms trained in one
pass. In Walter Daelemans, Bart Goethals, and Katharina Morik, editors, Machine
Learning and Knowledge Discovery in Databases, volume 5211 of Lecture Notes in
Computer Science, pages 146–161. Springer Berlin Heidelberg, 2008.
7. L. Bretzner, I. Laptev, and T. Lindeberg. Hand gesture recognition using multi-scale colour features, hierarchical models and particle filtering. In Proceedings Face and Gesture 2002, pages 423–428, 2002.
8. K. Crammer and Y. Singer. On the algorithmic implementation of multiclass
kernel-based vector machines. Journal of Machine Learning Research, 2:265–292,
March 2001.
9. M. Everingham and J. Winn. The pascal visual object classes challenge 2012
(voc2012) development kit. http://pascallin.ecs.soton.ac.uk/challenges/
VOC/voc2012/htmldoc/devkit_doc.html, 2012. visited 2014-01-15.
10. H. Grabner, M. Grabner, and H. Bischof. Real-time tracking via on-line boosting.
In Proceedings of the British Machine Vision Conference, pages 6.1–6.10. BMVA
Press, 2006.
11. S. Hare, A. Saffari, and P.H.S. Torr. Struck: Structured output tracking with
kernels. In 2011 IEEE International Conference on Computer Vision (ICCV),
pages 263–270, November 2011.
12. A. S. Jalal and V. Singh. The state-of-the-art in visual object tracking. Informatica
(Slovenia), 36(3):227–248, 2012.
13. K Kiratiratanapruk and S Siddhichai. Vehicle detection and tracking for traffic
monitoring system. In TENCON 2006. 2006 IEEE Region 10 Conference, pages
1–4, November 2006.
14. C. Leistner, A. Saffari, and H. Bischof. Miforests: Multiple-instance learning with
randomized trees. In Proceedings of ECCV 2010 - 11th European Conference on
Computer Vision, volume 6, pages 29–42, September 2010.
15. John C. Platt. Advances in kernel methods. chapter Fast Training of Support
Vector Machines Using Sequential Minimal Optimization, pages 185–208. MIT
Press, Cambridge, MA, USA, 1999.
16. A. Saffari, M. Godec, T. Pock, C. Leistner, and H. Bischof. Online multi-class
lpboost. In 2010 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), pages 3570–3577, June 2010.
17. S.M. Smith. Asset-2: real-time motion segmentation and shape tracking. In Proceeding on Fifth International Conference Computer Vision, 1995, pages 237–244,
Jun 1995.
18. X Tang, G. C Sharp, and S. B Jiang. Fluoroscopic tracking of multiple implanted
fiducial markers using multiple object tracking. Physics in Medicine and Biology,
52(14):4081–4098, 2007.
19. Z. Wang, Crammer K., and Vucetic S. Multi-class pegasos on a budget. In Proceedings of the 27 th International Conference on Machine Learning, pages 1143–1150,
2010.
20. Y. Wu, J. Lim, and M.-H. Yang. Online object tracking: A benchmark. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, 2013.
21. R. Yao, Q. Shi, C. Shen, Zhang Y., and A. van den Hengel. Robust tracking with
weighted online structured learning. In Andrew W. Fitzgibbon, Svetlana Lazebnik,
Pietro Perona, Yoichi Sato, and Cordelia Schmid, editors, ECCV (3), volume 7574
of Lecture Notes in Computer Science, pages 158–172. Springer, 2012.
Combining Head Pose and Eye Location
Information for Gaze Estimation
Daniel Gröger1 and Mohamed Selim2
1 groeger@rhrk.uni-kl.de
2 mohamed.selim@dfki.de
Abstract. The combination of eye location and head pose estimation promises an improvement of gaze estimation using only a simple webcam in front of the user. Creating such an estimation system with high accuracy and real-time performance will open up a broad range of applications, such as assisting disabled people, monitoring driver behavior, or gaze-based interaction on large screen displays. In this seminar paper the approach by Valenti et al. [16] is presented and its limitations and benefits are discussed. A short background summary is given, the approach is described, and experimental results are analyzed. It is shown that an accurate real-time system is feasible even though some limitations apply.
Keywords: gaze estimation, head pose estimation, eye-tracking
1 Introduction
Estimating a person’s gaze can be useful for many applications. The gaze is
directly connected to a person’s attention and can therefore be used to learn
about the user’s interest in observed objects. This knowledge may then be used
to learn more about the user’s behavior or to build systems capable of interacting
with the user based on the expressed interest. The possible applications for
this knowledge cover diverse topics such as marketing and usability studies,
interaction devices for disabled people, attention monitoring while driving, and
interactive reading.
Gaze estimation and head pose estimation have been studied separately to a
large extent in the past and numerous solutions have been proposed that vary
strongly in respect to the usage scenario. For gaze estimation static eye-trackers
using infrared illumination, head-mounted eye-trackers in the form of glasses,
and electro-oculography based gaze estimation are used in practice [2, 14, 5].
To estimate the head pose hardware like head mounted inertial measurement
units (IMUs), time of flight based systems (Microsoft Kinect), and multi camera
setups are used. While they work well in their respective use case, the goal of this
work is to investigate a solution which allows for maximal movement freedom
of the user, while using minimal affordable hardware such as a webcam and no
wearable equipment whatsoever.
Image-based solutions to estimate the gaze as well as the head pose separately
from an image captured by a webcam have been proposed in several papers
as well, but most of these approaches are limited in terms of trackable head
poses. The aim of this paper is to study a novel approach by Valenti et al. [16]
which combines eye location and head pose estimation in a non-sequential way
to overcome these limitations. To accomplish this, the used technologies for eye
location and head pose estimation are introduced and the necessary details of
the novel system provided.
In Section 2 a short summary of background on gaze and head pose estimation will be given to put the studied approach into context. The details of the
approach are then explained in Section 3 and experimental results illustrated
in Section 4. The implementation and corresponding issues will be described in
Section 5. Subsequently limitations regarding the implementation, the general
approach, and the evaluation will be discussed in Section 6 and finally the conclusion regarding the studied approach will be summarized in the last Section.
2 Background
To put the studied approach for gaze estimation into context, a short overview
of recent approaches will be given in this section.
2.1 Eye location based approaches
For general eye location based gaze estimation several approaches have been
established. Image-based devices extract the eye position from images that are
captured from a camera observing the eye’s movements. Electro-Oculography
measures the electromagnetic variation that occurs upon movement of the eye
muscles using electrodes attached to the head, from which the current eye position is calculated and thus the point of gaze can be obtained. Search coil
describes a thin metal rod that is being placed on the eye to measure electromagnetic induction. For this purpose an electromagnetic field is applied and
the gaze position calculated from the induction. For further details and other
methods, see Holmqvist et al. [9].
In this paper the image-based approach is used, as the desired input device
is a webcam that is observing the user. The image-based techniques using a
stationary camera can be divided further into active systems that use a light
source, e.g. infra red light, for tracking and passive systems that rely solely on
the camera image. For either one different challenges arise.
IR based approaches are widely used in user studies on psychological attention, reading, etc. [7, 2]. They typically operate at close range (approximately the distance of a subject sitting in front of a monitor), require a restriction of the user's head movement (usually achieved by a head rest), and require a calibration procedure. An additional limitation is the interference of sunlight with the system's IR illumination of the eye, which makes outdoor usage of such systems very difficult. These characteristics show that IR based stationary systems are usually designed to be very accurate and used in a lab environment rather than for everyday use.
Appearance-based approaches require more complex processing as no additional information such as IR reflections is available. This usually involves
detecting the eyes, compensating for head motion, and processing the image to
estimate the gaze. Still, more complex but less restrictive solutions are more
suitable for many applications, since the human gaze does not only consist of
eye movements but head movements are also important.
State of the art appearance-based methods use a variety of techniques to
locate the eyes and their centers and show that non-IR based eye gaze estimation
is becoming feasible. Asteriadis et al. [1] use feature matching on the edge map
of the eye area and a training set to localize the eyes. Other recent approaches
rely on machine learning and some kind of feature extraction. Hamouz et al. [8]
use gabor filters to extract 10 features and two support vector machine (SVM)
classifiers, Türkan et al. [15] use SVMs and edge projection, and Campadelli et
al. [6] use an eye detector and a process to refine the eye positions using Haar
wavelets and a SVM. In the studied gaze estimation system an approach by
Valenti et al. [17] based on isophote curvatures is used, which outperforms the
aforementioned approaches regarding eye center location while being robust to
small pose as well as illumination changes.
2.2 Head pose based approaches
To estimate the user's gaze, the head pose can be used as well. Once it is known where the user's head is facing, the gaze can be estimated using the field of view. As with eye location based estimation, systems that use more complex hardware, such as IMUs, multi-camera setups, or Microsoft's Kinect, already provide reasonable solutions to estimate a user's head pose. Yet, to achieve more flexibility and reduce hardware cost, only image based approaches will be considered here.
A variety of image based head pose estimation algorithms exist, of which many make use of image features and use either a 3-D or 2-D model to represent the head. For 3-D-model based approaches, more or less complex generic or user specific models are used, but complex models are usually less tolerant to initialization errors, are computationally expensive, and suffer from large drift over time. Therefore, simpler models, like the cylindrical head model (CHM), are used in publications and demonstrate good real-time performance in studies [4, 11, 19]. In the studied gaze estimation system the approach by Xiao et al. [19] is used, which is capable of tracking the head even when it is turned by more than 30° from the frontal position with respect to the camera.
3 The studied approach
The studied novel visual gaze estimation system is a combination of an isophote
based eye location detector by Valenti et al. [17] and a head pose estimator by
Xiao et al.[19] that will be described in more detail in the following subsections.
3.1 Accurate Eye Center Location and Tracking Using Isophote Curvature
The main idea behind the eye center location approach is that each eye's appearance in the captured image can be assumed to be a bright ellipse with a dark ellipse in its center. These shapes can be described by a combination of isophotes, which are curves that connect points of equal intensity in the image. Isophotes are used as they possess several useful properties for object descriptors: they do not intersect each other and can therefore be used to fully describe an image, and their shape is invariant to linear lighting changes and to rotation.
Fig. 1. ‘(a) Source image, (b) the obtained centermap, and (c) the 3-D representation
of the latter.‘[16]
Once the whole image is described as isophotes, we still need to determine
those that correspond to the eye center. For this step, Valenti et al. make use of
the isophote curvature k, which is expressed as
k=−
δI 2 δ 2 I
δy δx2
2
δI δ I δI
− 2 δx
δxδy δy +
3
δI 2
δI 2 2
+
δx
δy
δI 2 δ 2 I
δx δy 2
(1)
δI
is the first-order derivative of the intensity function I on the y-dimension.
where δy
The intensity on the outer side of the curve determines the sign of this curvature, which enables us to discriminate between dark and bright isophote centers
and therefore find the dark area of cornea and iris within the brighter sclera
(negative sign of curvature).
To find global isophote centers, the distance D(x, y) to the closest isophote center is estimated for every pixel in the image using k as

D(x, y) = − { ∂I/∂x, ∂I/∂y } ( (∂I/∂x)² + (∂I/∂y)² ) / ( (∂I/∂y)² ∂²I/∂x² − 2 (∂I/∂x)(∂²I/∂x∂y)(∂I/∂y) + (∂I/∂x)² ∂²I/∂y² ).     (2)
These values, ignoring points with a positive curvature, are then mapped into an accumulator, weighting each vote by a curvedness measure [10] for each point. Finally, to yield a single estimate for each cluster, the accumulator is convolved with a Gaussian kernel and the maximum is chosen as the estimated location of the eye center (see Figure 1).
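A rough NumPy/OpenCV sketch of this voting scheme follows; the derivative smoothing scale, the curvedness weighting and the final blur are assumptions for illustration, and the original implementation in [17] differs in detail.

```python
import cv2
import numpy as np

def eye_center_from_centermap(eye_patch_gray, sigma=2.0):
    """Estimate the eye center of a grayscale eye patch via isophote center voting."""
    img = cv2.GaussianBlur(eye_patch_gray.astype(np.float64), (0, 0), sigma)
    Iy, Ix = np.gradient(img)            # axis 0 is y (rows), axis 1 is x (columns)
    Iyy, Iyx = np.gradient(Iy)
    Ixy, Ixx = np.gradient(Ix)
    num = Iy**2 * Ixx - 2 * Ix * Ixy * Iy + Ix**2 * Iyy
    grad_sq = Ix**2 + Iy**2
    k = -num / np.maximum(grad_sq**1.5, 1e-9)               # isophote curvature, Eq. (1)
    safe_num = np.where(num == 0, 1e-9, num)
    dx = -Ix * grad_sq / safe_num                           # offset to the isophote center, Eq. (2)
    dy = -Iy * grad_sq / safe_num
    curvedness = np.sqrt(Ixx**2 + 2 * Ixy**2 + Iyy**2)      # vote weight [10]
    acc = np.zeros_like(img)
    h, w = img.shape
    ys, xs = np.nonzero(k < 0)                               # dark centers only (cornea/iris)
    for y, x in zip(ys, xs):
        cx, cy = int(round(x + dx[y, x])), int(round(y + dy[y, x]))
        if 0 <= cx < w and 0 <= cy < h:
            acc[cy, cx] += curvedness[y, x]
    acc = cv2.GaussianBlur(acc, (0, 0), sigma)
    cy, cx = np.unravel_index(np.argmax(acc), acc.shape)
    return cx, cy
```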
3.2 Robust Full-Motion Recovery of Head by Dynamic Templates and Re-registration Techniques
The approach to estimate the head pose by Xiao et al. [19] uses a cylindrical head
model (CHM) which is initialized assuming a frontal face position of the user.
It is based on the idea of estimating the motion between two images assuming
constant intensity.
I(F (u, µ), t + 1) = I(u, t)
(3)
where F (u, µ) is a parametric motion model with motion parameter µ, u the
old location, F (u, µ) the new location, and t the time. Since the CHM is three
dimensional, the motion parameter vector µ has 6 degrees of freedom, µ =
[ω x , ω y , ω z , tx , ty , tz ] where ω x , ω y , and ω z are rotation parameters (pitch, yaw,
and roll angles respectively) and tx , ty , and tz translation parameters.
These parameters are initialized on the initial frame containing the frontal
face. The positions of the eyes are detected used to improve the tx and ty parameters estimated by the center of the detected face. The distance between the
eyes is used to estimate tz . Since a frontal position is assumed, the pitch (ω x )
and yaw (ω y ) angles are set to zero and the roll (ω z ) is determined using the
eyes’ positions.
On the next frame, the motion is estimated using (3), where F(u, µ) is derived from the rigid movement of a point X = [x, y, z, 1]^T between t and t + 1,

X(t + 1) = M · X(t), with M = [ 1, −ω_z, ω_y, t_x;  ω_z, 1, −ω_x, t_y;  −ω_y, ω_x, 1, t_z;  0, 0, 0, 1 ]     (4)
and the camera projection matrix, assuming the matrix depends only on the focal length f_L. This yields the following equation for the image point u of X = [x, y, z, 1]^T at t + 1 (with x, y, z taken at time t):

u(t + 1) = [ x − y ω_z + z ω_y + t_x ,  x ω_z + y − z ω_x + t_y ] · f_L / (−x ω_y + y ω_x + z + t_z)     (5)
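A small NumPy sketch of this update step, building M from a motion parameter vector µ and projecting the transformed point; the focal length value is an arbitrary assumption.

```python
import numpy as np

def transform_and_project(X, mu, f_L=500.0):
    """Apply the small-angle rigid motion (4) to X = [x, y, z, 1] and project it with (5)."""
    wx, wy, wz, tx, ty, tz = mu
    M = np.array([[1.0, -wz,  wy, tx],
                  [ wz, 1.0, -wx, ty],
                  [-wy,  wx, 1.0, tz],
                  [0.0, 0.0, 0.0, 1.0]])
    x, y, z, _ = M @ np.asarray(X, dtype=float)
    u = f_L * np.array([x, y]) / z      # perspective projection with focal length f_L
    return (x, y, z), u

# Example: a point on the cylinder surface under a small yaw rotation and x-translation.
print(transform_and_project([0.1, 0.0, 1.0, 1.0], [0.0, 0.05, 0.0, 0.01, 0.0, 0.0]))
```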
Equation (3) is then solved for µ using the Lucas-Kanade method [12] with applied weights as follows:

µ = − ( Σ_Ω w (I_u F_µ)^T (I_u F_µ) )^{−1} Σ_Ω w I_t (I_u F_µ)^T     (6)
where Fµ is the partial differential of F (•), Ω is the face region (called template),
and w are weights.
The weights are determined in several steps to account for the following concerns. For robustness against noise, non-rigid motion, and occlusion, iteratively
re-weighted least squares (IRLS) [3] is used and compensation is applied for
side effects caused by IRLS. To account for non-uniform pixel density, i.e. that
densely packed pixels originating from the border of the cylinder should contribute
less to the motion, the density is calculated and the weights adapted accordingly.
For more details please refer to [19].
Before applying equation (6) iteratively, the initial µ is calculated using the
partial differential Fµ at µ = 0. Subsequently the incremental transformation
is computed using µ after each iteration and all incremental transformations
composed for the final transformation matrix.
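A single weighted update step solving (6) can be sketched as follows (illustrative only; in the real system the 2x6 Jacobians F_mu come from the cylindrical head model and the weights w from IRLS and the density compensation, neither of which is shown here).

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// One weighted Lucas-Kanade step solving Eq. (6) (sketch). All matrices are CV_64F.
cv::Mat motionStep(const std::vector<cv::Mat>& Iu,   // 1x2 image gradient per pixel of Omega
                   const std::vector<cv::Mat>& Fmu,  // 2x6 motion Jacobian per pixel
                   const std::vector<double>& It,    // temporal difference per pixel
                   const std::vector<double>& w)     // weight per pixel
{
    cv::Mat H = cv::Mat::zeros(6, 6, CV_64F);
    cv::Mat b = cv::Mat::zeros(6, 1, CV_64F);
    for (size_t i = 0; i < Iu.size(); ++i) {
        cv::Mat J = Iu[i] * Fmu[i];        // 1x6 row vector (I_u F_mu)
        H += w[i] * (J.t() * J);           // sum_Omega w (I_u F_mu)^T (I_u F_mu)
        b += w[i] * It[i] * J.t();         // sum_Omega w I_t (I_u F_mu)^T
    }
    cv::Mat mu;
    cv::solve(H, -b, mu, cv::DECOMP_SVD);  // mu = -H^{-1} b as in Eq. (6)
    return mu;                              // 6x1 vector [wx, wy, wz, tx, ty, tz]
}
```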
To deal with large head movements, the template is adapted in each frame and the error is measured in order to re-register the template once the error becomes too large.
3.3 Combining eye location and CHM tracking
Each system on its own is capable of achieving better results than other proposed solutions, yet each one has its limitations. The gaze estimation system
requires a frontal head pose, or less than 30◦ rotation at the cost of accuracy, to
achieve good results and the CHM tracker may not be able to recover when converging to the wrong location. Hence, Valenti et al. [16] propose a simultaneous
integration of both systems to overcome these limitations.
This is done by correcting the eye location using the estimated head pose
and improving the head pose estimation by using the eye detectors results as a
cue for quality control. In the first frame the eye locations are detected using
the standard eye locator described in Section 3.1 and used as reference points.
These reference points are projected onto the estimated cylindrical model and an
area around the target point is extracted. The obtained patches are then transformed using the transformation matrix of the CHM tracker. The eye locator is
then applied to the transformed patches that are expected to be more similar
to a frontal eye image and therefore deliver better results (see Figure 2). In addition
the center of each patch is used as a reference when choosing the peak of the
accumulator, since peaks closer to the center are closer to the initial location
estimate.
To check the results of the head pose estimation, the pose vector is calculated
from the 3-D eye locations, given that the location of the eyes in 3-D is known,
and compared to the resulting vector of the head tracker. If the difference exceeds
a certain threshold, the transformation matrix of the tracker is recomputed based
on the average of the vectors. Furthermore, the standard eye locator is used to
check whether the eye locations calculated by the head tracker are close to the
estimated locations of the standard locator, which should be accurate when a
frontal view is encountered.
Fig. 2. "Examples of extreme head poses and the respective pose-normalized eye locations. The results of the eye locator in the pose normalized eye region is represented by a white dot." [16]
3.4 Gaze estimation
Given a reliable eye location and head pose in 3-D one can basically calculate the
user's gaze and field of view. Of course, the exact parameters of a person's eyes would have to be known and hence an approximation is used. In their paper, Valenti
et al. [16] suggest a binocular field of view that spans 120◦ of visual angle surrounded by a monocular field of view to each side horizontally of approximately
30◦ each. This binocular field of view is centered on the gaze point M and can
be approximately described by a pyramid OABCD (see Figure 3). The pyramid is then
intersected with the plane of the observed target scene P to yield the area that
is visible to the user.
This estimation is a good approximation of the user’s field of view but does
not take the eye movement in the ocular cavities into account. Valenti et al.
assume that “the point of interest (defined by the eyes) does not fall outside the
head-pose-defined field of view.”[16, p. 808] based on the study in [13]. Furthermore, they state that a simple “2-D mapping of the location of the pupil (with
respect to an arbitrary anchor point) and known locations on the screen”[16,
p. 808] is sufficient to interpolate the focused point in the scene, since the eyes
can be assumed to only shift in vertical and horizontal direction in the ocular
cavities [18].
A calibration plane is constructed in front of the head and rays from the
center of the head through known points on the target plane (points displayed
for calibration purposes) are intersected with the calibration plane. Now the known
points can be retargeted and a recalibration simulated every time the user’s
Fig. 3. "Representation of the visual field of view at distance d." [16]
head moves. The calibration plane and the calibration points move according to the head pose model in 3-D and hence the new intersections of rays between head and calibration points with the target plane can be calculated as new known points and a new mapping learned.
4 Experimental results
Valenti et al. [16] report three independent evaluations, one for each of the
studied system’s components and an overall evaluation of the system. First, the
eye locator supported by the head pose estimation was tested on the Boston
University head pose database [11]. The system using head cues is compared to
the standard eye locator on video sequences of subjects performing nine different
head motions. The performance was measured in terms of the error, i.e. based on
the Euclidean distance between estimated eye location and manually annotated
ground truth. The results show a significant overall improvement in performance
compared to the baseline results and a specific improvement in accuracy from
16% to 23% for an allowed error larger than 0.1 [16].
Second, the head pose estimation is evaluated using the same database as
before. For pitch(ω x ), yaw(ω y ), and roll(ω z ) the root-mean-square-error (RMSE)
and the standard deviation (STD) are computed between ground truth measured
by magnetic sensors on the subject’s head and the estimates by the head pose
estimator. The estimator is used in two different configurations regarding the
template update described in 3.2. For one the first frame is kept as the template
throughout the video sequence and for the other the template is updated each
frame. Both methods are tested with and without eye location information and
the results compared.
The comparison shows that the updated approach achieves the best results
when no eye information is used. With eye information both configurations perform equally well or better than the baseline, while the fixed template approach
outperforms the update approach. This is most likely due to small errors introduced by the eye locator that cannot be corrected by the head pose estimation.
In addition Valenti et al. compare their results to two similar studies and report
“comparable or better results with respect to the compared methods” [16, p.
810].
Finally, two experiments are performed to evaluate the overall system performance. For both, data was recorded from 11 male and female subjects sitting
in front of a computer monitor equipped with a webcam under different lighting
conditions. The first task involved looking at a dot on the screen while moving the head towards the dot and then, once the desired position facing the dot is reached, moving the head randomly while still gazing at the dot. The second task is to follow the
dot around the screen naturally.
For both tasks the face of the user and the position of the dot on the screen
are recorded as ground truth. Based on this data three algorithms are evaluated.
1. Eyes-only gaze estimator: The approach representing traditional mapping of anchor-pupil vectors to the screen (as in [18]) without taking head
motion into account.
2. Pose-normalized gaze estimator: An approach using anchor-pupil vectors that have been normalized using the head pose. (see [16])
3. Pose-retargeted gaze estimator: The method proposed in Section 3.4
using pose-normalized displacement vectors, which is different from the pose-normalized gaze estimator because of the retargeting of known points as
described in Section 3.4.
Based on the mean error and STD reported for tasks one and two, the results can be summarized as follows. The pose-normalized and pose-retargeted gaze estimators achieved a significantly lower error than the eyes-only gaze estimator on task one and also outperformed it on task two, where the eyes-only gaze estimator failed as expected due to head motion. The pose-retargeted gaze estimator improved on the results of the pose-normalized gaze estimator in both tasks and achieved a significantly lower error in task two as opposed to task one, mainly because the eye displacements with respect to the head were natural in task two.
The pose-retargeted gaze estimator “has a mean error of (87.18, 103.86) pixels,
corresponding to an angle of (1.9◦ , 2.2◦ ) in the x- and y-direction, respectively”
[16, p. 813], which is quite impressive considering that the human fovea covers
roughly 2◦ of visual angle.
5 Implementation
Implementing the discussed approach is a goal of this work in order to fully understand the approach's details and possible issues that may arise. To this
end, the implementation of the eye center locator, the head pose estimator, and
the overall system will be discussed in the following subsections.
Fig. 4. "Schematic diagram of the components of the system." [16]
5.1 System Overview
Starting with the eye center locator, the parts of the system were implemented
using the C++ OpenCV library (http://opencv.org/). As illustrated in Figure 4 the system is initialized by detecting a frontal face and the eyes. This is accomplished by using the OpenCV Haar Cascade Classifier (http://docs.opencv.org/modules/objdetect/doc/cascade_classification.html#haar-feature-based-cascade-classifier-for-object-detection). After initialization of the cylindrical head model the head is tracked in the subsequent frame and the eye regions are extracted according to the head model. The eye center locator is then applied on the normalized eye patches and the results transformed back into the model's 3D space.
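The initialization step can be sketched as follows (a minimal example, not the project's actual code; the cascade file names are the standard ones shipped with OpenCV and their paths may differ on other systems).

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Detect a frontal face and the eyes with the OpenCV Haar cascades (sketch).
int main()
{
    cv::CascadeClassifier faceCascade("haarcascade_frontalface_alt.xml");
    cv::CascadeClassifier eyeCascade("haarcascade_eye.xml");
    cv::VideoCapture cap(0);
    cv::Mat frame, gray;

    while (cap.read(frame)) {
        cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
        cv::equalizeHist(gray, gray);

        std::vector<cv::Rect> faces;
        faceCascade.detectMultiScale(gray, faces, 1.1, 3, 0, cv::Size(80, 80));
        for (const cv::Rect& face : faces) {
            std::vector<cv::Rect> eyes;
            eyeCascade.detectMultiScale(gray(face), eyes);   // search eyes inside the face region
            cv::rectangle(frame, face, cv::Scalar(0, 255, 0), 2);
            for (const cv::Rect& eye : eyes)
                cv::rectangle(frame, eye + face.tl(), cv::Scalar(255, 0, 0), 2);
        }
        cv::imshow("init", frame);
        if (cv::waitKey(1) == 27) break;   // ESC quits
    }
    return 0;
}
```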
The overall system was tested under Ubuntu (http://www.ubuntu.com/) running on an Intel Xeon 8-core 3.4 GHz processor with 16GB RAM. As cameras, the Playstation Eye® and a Logitech HD Webcam C270 were used, both recording images of 640x320 pixels
resolution at a distance of approximately 80cm from the head to the camera (see
Figure 5).
Fig. 5. Camera and display setup for calibration
5.2 Eye Center Locator
According to Valenti et al.[17] the isophote curvature, isophote center displacement vectors, and curvedness are calculated using the image gradient of the
smoothed input image (see Section 3.1 for details). The sign of the curvature
is used to discriminate between votes and the curvedness serves as a weight for
each displacement vector vote into an accumulator. Finally, the accumulator is
convolved with a Gaussian kernel to merge clusters of votes and yield several
high peaks.
The implementation of this part posed one major difficulty. The calculation
of the derivatives in combination with the initial smoothing of the image has a
great impact on curvature and displacement vectors. To reduce artifacts caused
by the discrete nature of the image when calculating the derivatives, it is desirable to smooth the image. Unfortunately, using a high standard deviation for
the Gaussian kernel used for smoothing will cause the curvature to degenerate
and the displacement vectors to become inaccurate. Furthermore, different techniques to calculate the derivative, for example Scharr or Sobel kernels, have an
influence on the center voting, as they take more or less of each pixel’s neighborhood into account.
Fig. 6. Accuracy of Eye Center Locator on the BioID database under different configurations for initial smoothing and accumulator smoothing.
To evaluate the implemented eye center locator, the BioID database was used
and results directly compared to Valenti et al.[17] Since optimal parameter values
for the initial smoothing as well as the Gaussian kernel to apply on the accumulator (center map) were unknown, the two dimensional parameter space (initial
σ and accumulator σ) was explored to find the configuration with the highest
accuracy (see Figure 6). Figure 7 shows the detailed results for a fixed initial σ of
0.1 of which the accumulator σ of 3 was chosen as the optimal configuration out
of the tested space. For these parameter values the worst eye accuracy measure
yields a score of 60.61% compared to Valenti’s results of 77.15%. For future work
on this implementation it is desirable to investigate this discrepancy further and
determine the correlation between parameter values and eye region resolution.
Fig. 7. Accuracy of Eye Center Locator on the BioID database for initial smoothing
of 0.1 and varying accumulator smoothing.
5.3 Head Pose Estimator
The head pose estimation was implemented according to Xiao et al. [19] to fully understand the concepts and to later apply the parts used in Valenti's algorithm description [16]. For this purpose a modular approach was chosen to keep
the parts, for example the cylindrical head model, as reusable as possible. The
implementation consists of the cylindrical head model, the main algorithm to
calculate the six-dimensional motion model in an iterative way using the Lucas-Kanade method [12], calculation of weights for the iterative approach, and the
template re-registration algorithm.
The main issue for this part of the implementation was the complexity of
the overall system. Small errors and changes have a large impact on the overall estimation of the motion model and easily cause the iterative approach not
to converge. This made finding errors in the calculation very difficult and led to the overall head pose estimation being neither robust nor accurate. An alternative approach was implemented later to compare the results and was found to be
more robust. The approach is based on using the OpenCV implementation of
the Lucas-Kanade optical flow algorithm (http://docs.opencv.org/modules/video/doc/motion_analysis_and_object_tracking.html#calcopticalflowpyrlk) and solving the "Perspective-n-Point (PnP)" problem (http://docs.opencv.org/modules/calib3d/doc/camera_calibration_and_3d_reconstruction.html#solvepnp) to estimate the new head pose.
To compare the overall results of the system the latter implementation was
used. Hence, it is desirable for further work to use the approach by Xiao et al.
once the implementation is more robust. This is also desirable as it is not clear
whether Valenti et al. use any of the applied improvements, such as weighting and
template re-registration, in their adaptation of the head pose estimation algorithm.
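A sketch of this alternative tracking step is given below (hypothetical function and variable names; the 3D model points are assumed to come from the cylindrical head model, and the SOLVEPNP_EPNP flag is the OpenCV 3 name, older versions use CV_EPNP).

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Track 2D feature points with pyramidal Lucas-Kanade optical flow and recover
// the new head pose by solving the Perspective-n-Point problem (sketch).
void trackHeadPose(const cv::Mat& prevGray, const cv::Mat& currGray,
                   const std::vector<cv::Point2f>& prevPts,   // projections of model points in the previous frame
                   const std::vector<cv::Point3f>& modelPts,  // corresponding 3D points on the cylinder
                   const cv::Mat& K, cv::Mat& rvec, cv::Mat& tvec)
{
    std::vector<cv::Point2f> currPts;
    std::vector<uchar> status;
    std::vector<float> err;
    cv::calcOpticalFlowPyrLK(prevGray, currGray, prevPts, currPts, status, err);

    // Keep only successfully tracked correspondences.
    std::vector<cv::Point2f> img2d;
    std::vector<cv::Point3f> obj3d;
    for (size_t i = 0; i < status.size(); ++i) {
        if (status[i]) {
            img2d.push_back(currPts[i]);
            obj3d.push_back(modelPts[i]);
        }
    }

    // Estimate the new pose from the 3D-2D correspondences.
    cv::solvePnP(obj3d, img2d, K, cv::Mat(), rvec, tvec, false, cv::SOLVEPNP_EPNP);
}
```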
5.4 Combination
The combination of head pose estimation and eye center location was implemented according to Valenti et al.[16] Based on the estimated head pose the
regions around the eyes were extracted and unwarped to a pose-normalized
representation (see Figure 2). On these normalized eye regions the eye locator
was applied and the estimated position transformed back into the 3D space of
the head model. An evaluation of the implemented system’s performance could
only be done by examining sample sequences but not using ground truth as in
Valenti’s report due to issues with the implementation and the limited time for
this work.
Fig. 8. Left: Camera image with rendered head model, estimated eye centers, and point
between eye centers. Right: Calibration points on target plane (blue) and estimated
focused point (red).
5.5 Gaze Estimation
Finally, the gaze estimation based on the estimated head pose and eye displacement vectors was implemented according to the studied approach. To this end,
the implementation of a calibration plane was added to the cylindrical head
model and a calibration GUI (graphical user interface) was created to add calibration points to the plane
and collect corresponding screen coordinates and eye displacement vectors (see
Figure 8).
Again a final empirical evaluation was not possible as the overall system did
not perform in a robust and accurate way. The initial fit of the head model to the
user was not accurate enough and the eye center locator exhibited some jitter when estimating the eye center location. This led to possibly wrong displacement
vectors during calibration and therefore an unreliable mapping between displacement vectors and screen coordinates.
To overcome this problem and compare the system’s performance to the one
reported by Valenti et al.[16] the single components need to be revised and open
implementation problems solved.
6 Limitations
In this section I will discuss the limitations of the studied approach. First, I
will discuss limitations regarding the evaluation scenario and subsequently what
needs to be investigated further towards a more general application of webcam
based gaze estimation.
The evaluation scenario involved users sitting in front of a computer screen
on which a camera is mounted to capture the user’s face. In this context Valenti
et al. [16] report good results in terms of mean error in pixels onscreen but
also hint to a couple of problems. One of these is the placement of the camera
on top of the monitor which leads to partial eye occlusion by the eye lids when
gazing at points on the bottom of the screen. Furthermore problems with the low
resolution (320 x 240 pixels) on the Boston University head pose database [11]
are reported for the evaluation of the eye location detector with head pose cues.
This leads to the question in what way the user’s distance to the camera and
the camera’s resolution impact the performance of the gaze estimation system.
On a similar note, one might wonder whether the screen size or the position
of the face within the image has any effect on the performance, since a larger
screen would lead to an even more displaced position of the camera. The reported
on-screen mean error of (87.18, 103.86) pixels on the dot following task can be
considered a good result, but depending on the usage scenario the error might
be too large (considering the density of information on computer screens).
A second issue is the calibration procedure. A calibration with target points
needs to be conducted to achieve a mapping between the eye and the observed
scene. Applications thus require the user to go through the calibration procedure for every session with the system. A further investigation of what is
possible with different methods for automatic calibration would be useful.
Towards a more general scenario one may consider using this technology in
a car to assist the driver, on top of a TV set in the living room, or behind
a shopping window to interact with people passing by. In this context several
open questions arise that make interesting topics for further research. Regarding
all of the example scenarios the general limitations of the system regarding the
user’s distance to the camera, the resolution of the camera, and the position
of the face within the image need to be explored further. Regarding the in-car
application the limitations on the camera angle would be interesting, as the
camera would have to be placed on the dashboard or in the rear view mirror.
For the shopping window, multi-user tracking and non-frontal initialization, as well as occlusion handling and recovery, are interesting.
Limitations concerning the implementation of the discussed approach are
mainly related to unknown parameters or unclear implementation details. The
large influence of smoothing parameters is a clear limitation as optimal values
have to be determined for each application of the approach. Also the relation
between face size and these values needs to be investigated further, to make the
system usable at varying distances to the camera. Regarding the head tracking, similar limitations regarding the distance need to be determined, as it relies on the optical flow of the face region and therefore less information is available at reduced resolutions of the face in the image.
Summarizing, the limitations of the studied approach need to be defined more precisely, which requires further research on this topic. The overall goal is still an auto-calibrating gaze estimation system that works at short and long range, preferably for multiple simultaneous users, and up to large head pose rotation angles relative to the camera.
7 Conclusion
The main contribution of the studied approach is a gaze estimation system that
works well for users in front of a computer monitor using only a webcam. Valenti
et al. show that the systems achieve a better performance together than on their
own. It is possible to increase the operation range of the eye locator by more
than 15◦ and the overall system is capable of estimating the gaze with a small
mean error between 2◦ and 5◦ of visual angle in real-time. With this technology
a broad range of applications are possible that can be made easily accessible to
many users, since the required hardware is very cheap compared to other eye-tracking equipment. Nevertheless, there remain several open questions regarding
the application of this technology in other scenarios and the general limitations
imposed by the camera hardware and usage setup. The implementation of the
system showed that there are several issues that can be resolved by investigating the relationship between parameters further. It also showed that there are
improvements that may be considered in future work.
References
1. S. Asteriadis, N. Nikolaidis, A. Hajdu, and I. Pitas. An eye detection algorithm
using pixel to edge information. In Proc. of 2nd IEEE-EURASIP Int. Symposium
on Control, Communications, and Signal Processing, ISCCSP 2006., 2006.
2. Ralf Biedert, Jörn Hees, Andreas Dengel, and Georg Buscher. A robust realtime
reading-skimming classifier. In Proceedings of the Symposium on Eye Tracking
Research and Applications, ETRA ’12, pages 123–130, New York, NY, USA, 2012.
ACM.
3. Michael Black. Robust incremental optical flow. PhD thesis, Yale University, 1992.
4. L.M. Brown. 3d head tracking using motion adaptive texture-mapping. In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001
IEEE Computer Society Conference on, volume 1, pages I–998–I–1003 vol.1, 2001.
5. Andreas Bulling, Jamie A. Ward, Hans Gellersen, and Gerhard Tröster. Robust
recognition of reading activity in transit using wearable electrooculography. In
Proceedings of the 6th International Conference on Pervasive Computing, Pervasive
’08, pages 19–37, Berlin, Heidelberg, 2008. Springer-Verlag.
6. Paola Campadelli and Raffaella Lanzarotti. Precise eye localization through a
general-to-specific model definition. In Proc. of BMVC, 2006.
7. Laura A. Granka, Thorsten Joachims, and Geri Gay. Eye-tracking analysis of user
behavior in www search. In Proceedings of the 27th Annual International ACM
SIGIR Conference on Research and Development in Information Retrieval, SIGIR
’04, pages 478–479, New York, NY, USA, 2004. ACM.
8. M. Hamouz, J. Kittler, J. K Kamarainen, P. Paalanen, H. Kalviainen, and J. Matas.
Feature-based affine-invariant localization of faces. Pattern Analysis and Machine
Intelligence, IEEE Transactions on, 27(9):1490–1495, 2005.
9. Kenneth Holmqvist, Marcus Nyström, Richard Andersson, Richard Dewhurst, Halszka Jarodzka, and Joost van de Weijer. Eye tracking: A comprehensive guide to
methods and measures. Oxford University Press, 2011.
10. Jan J. Koenderink and Andrea J. van Doorn. Surface shape and curvature scales.
Image Vision Comput., 10(8):557–565, October 1992.
11. M. La Cascia, S. Sclaroff, and V. Athitsos. Fast, reliable head tracking under varying illumination: an approach based on registration of texture-mapped 3d models.
Pattern Analysis and Machine Intelligence, IEEE Transactions on, 22(4):322–336,
2000.
12. Bruce D. Lucas and Takeo Kanade. An iterative image registration technique
with an application to stereo vision. In Proceedings of the 7th International Joint
Conference on Artificial Intelligence - Volume 2, IJCAI’81, pages 674–679, San
Francisco, CA, USA, 1981. Morgan Kaufmann Publishers Inc.
13. Rainer Stiefelhagen and Jie Zhu. Head orientation and gaze direction in meetings.
In CHI ’02 Extended Abstracts on Human Factors in Computing Systems, CHI EA
’02, pages 858–859, New York, NY, USA, 2002. ACM.
14. T. Toyama, A. Dengel, W. Suzuki, and K. Kise. Wearable reading assist system:
Augmented reality document combining document retrieval and eye tracking. In
Document Analysis and Recognition (ICDAR), 2013 12th International Conference
on, pages 30–34, 2013.
15. M. Türkan, M. Pardás, and A. Çetin. Human eye localization using edge projection.
In Proc. Comput. Vis. Theory Appl., 2007.
16. R. Valenti, N. Sebe, and T. Gevers. Combining head pose and eye location information for gaze estimation. Image Processing, IEEE Transactions on, 21(2):802–815,
2012.
17. Roberto Valenti and Theo Gevers. Accurate eye center location and tracking using
isophote curvature. In Proc. IEEE Conference on Computer Vision and Pattern
Recognition, 2008. CVPR 2008., pages 1–8, 2008.
18. Roberto Valenti, Jacopo Staiano, Nicu Sebe, and Theo Gevers. Webcam-based
visual gaze estimation. In Proceedings of the 15th International Conference on Image Analysis and Processing, ICIAP ’09, pages 662–671, Berlin, Heidelberg, 2009.
Springer-Verlag.
19. Jing Xiao, Tsuyoshi Moriyama, Takeo Kanade, and Jeffrey Cohn. Robust full-motion recovery of head by dynamic templates and re-registration techniques. International Journal of Imaging Systems and Technology, 13:85 – 94, September
2003.
SLAM for dynamic AR environments
Philipp Hasper1 and Nils Petersen2
1
2
hasper@cs.uni-kl.de
Nils.Petersen@dfki.de
Abstract. The goal of this project was to implement a monocular SLAM
system for tracking in an unknown environment and to examine how
changes in the environment during runtime affect the tracking result.
The approach is based on PTAM and splits the pipeline into two modules, namely "Tracker" and "Mapper": The Mapper constructs a map of
the environment while the Tracker tries to localize itself in this map.
Keywords: SLAM, Augmented Reality, Tracking
1 Introduction
The term SLAM is short for "Simultaneous Localization and Mapping" and denotes a way of tracking in an unknown environment: A moving agent builds a map of its environment and simultaneously localizes itself in this newly created map (cf. Figure 1). The origin of SLAM lies in the field of robot navigation [1, Section 2.4.3], where usually several different sensors (odometer, GPS, inertial, . . . ) are fused; these measurements are then used to derive the position and/or pathway, and finally a map of the environment is built.
1.1 Visual SLAM
In the beginning, environment reconstruction using only 2D cameras as sensors
was only known as Structure from Motion (SfM) - a process which is intrinsically offline, i.e. the calculations are performed only after all measurements have
already been acquired.
One of the early applications of online SfM (i.e. SLAM) in the Computer
Vision community was proposed by [2] which made the real-time localization of
robots using only a single camera possible. This technique of monocular visual
SLAM is also suitable for tracking e.g. in the context of Augmented Reality
(AR). While AR applications are usually based on pre-defined scenes to recognize and annotations to display, the visual appearance at run time may differ
significantly. This may be due to changes in perspective when the user moves
around once a pre-defined scene has been recognized correctly. In this case, scene
recognition is to be followed by a tracking procedure to maintain a valid state
of AR annotations and to guide future scene recognition.
The PTAM system [4,5] (Parallel Tracking and Mapping) was specifically
designed for this purpose: SLAM for small AR workspaces. Their approach is
Fig. 1. Visualization of the FastSLAM system showing a robot’s path and the observed
landmarks. Taken from [8]
the present paper’s basis and our adaptation of it will be extensively discussed
in the following sections.
One major limitation of PTAM is that accuracy is decreased in dynamic
environments, i.e. when objects are moving or the illumination changes.
In the outlook we will discuss how to incorporate Robust Dynamic SLAM [13]
to overcome those limitations.
2 Our approach
Our system is similar to PTAM [4] and shares the characteristic design choice of
separating tracking and mapping in two separate modules, namely Tracker and
Mapper (cf. Figure 2). The pipeline displayed in said figure is described in the
following. A second design choice is also inherited, that is the use of keyframes to
derive 3D information for the map construction. Keyframes are camera images
which are characteristic for the scene, i.e. while an image taken during fast camera movement would yield little useful information, an image showing substantially new information compared to the previous images should be selected as a keyframe.
The map consists of a point cloud and localization is done by finding projections of those 3D points in the current camera image. We developed the system in C++ and OpenCV (version 2.4.8), which reduces the porting effort significantly since
this library is available for all our targeted platforms (e.g. Android).
2.1 Camera calibration
To account for camera distortions not covered by the pinhole camera abstraction, the computation of the so-called camera intrinsics and their incorporation into the
[Figure 2 diagram: the Tracker performs Initialization, Reproject existing map points, Find Matches and Perspective-N-Point Pose Estimation; the Mapper performs Find closest existing keyframe, Triangulate new points, Add new keyframe and Map refinement.]
Fig. 2. Our system’s processing pipeline consisting of two separate modules. The
Tracker calls the Mapper whenever tracking is considered to be stable. The Mapper
then decides if a new keyframe should be added and if so, the current camera image is
used for a new triangulation. Besides that, the Mapper refines the existing map when
it is in an idle state.
calculations is advisable. OpenCV’s built-in functionality is used here to obtain
the intrinsics as a 3x3 matrix denoted by K and five distortion coefficients using
multiple images of a chessboard pattern with given dimensions (cf. Figure 3).
Fig. 3. One of the 12 images of a chessboard pattern used to calculate radial and
tangential distortion of an image taken with the Samsung Galaxy S2. The left side
shows the image with the detected corners marked. The right side shows the corrected
image.
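The calibration procedure can be sketched with OpenCV as follows (board dimensions and square size are example values, not those used in the project; the function name is illustrative).

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Estimate the intrinsic matrix K and the distortion coefficients from several
// chessboard images (sketch).
void calibrate(const std::vector<cv::Mat>& images, cv::Mat& K, cv::Mat& dist)
{
    const cv::Size boardSize(9, 6);     // number of inner chessboard corners (assumed)
    const float squareSize = 0.025f;    // edge length of one square in metres (assumed)

    // Planar 3D coordinates of the chessboard corners.
    std::vector<cv::Point3f> board;
    for (int y = 0; y < boardSize.height; ++y)
        for (int x = 0; x < boardSize.width; ++x)
            board.push_back(cv::Point3f(x * squareSize, y * squareSize, 0.f));

    std::vector<std::vector<cv::Point3f>> objectPoints;
    std::vector<std::vector<cv::Point2f>> imagePoints;
    for (const cv::Mat& img : images) {
        std::vector<cv::Point2f> corners;
        if (cv::findChessboardCorners(img, boardSize, corners)) {
            cv::Mat gray;
            cv::cvtColor(img, gray, cv::COLOR_BGR2GRAY);
            cv::cornerSubPix(gray, corners, cv::Size(11, 11), cv::Size(-1, -1),
                             cv::TermCriteria(cv::TermCriteria::EPS + cv::TermCriteria::COUNT, 30, 0.01));
            imagePoints.push_back(corners);
            objectPoints.push_back(board);
        }
    }

    std::vector<cv::Mat> rvecs, tvecs;
    cv::calibrateCamera(objectPoints, imagePoints, images[0].size(), K, dist, rvecs, tvecs);
}
```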
2.2 Feature detection and matching
The 3D map points are obtained by triangulation of 2D points found in the camera images. For this purpose we need a point feature detector and a point feature
descriptor to match found features. PTAM uses FAST[11] as point features and
matches them by comparing zero-mean SSD of 8x8-pixel patches. An extensive
evaluation of different point feature detectors and descriptors for visual SLAM
was performed in [9], with the conclusion to use BRISK or ORB [12]; we use the latter. Feature matching is done by brute-force matching of
ORB descriptors.
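A minimal sketch of this detection and matching step is shown below (the feature count and the crossCheck filter are assumptions; the create() factory is the OpenCV 3 API, version 2.4 constructs cv::ORB directly).

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// ORB detection and brute-force matching between two frames (sketch).
void matchFrames(const cv::Mat& img1, const cv::Mat& img2,
                 std::vector<cv::KeyPoint>& kp1, std::vector<cv::KeyPoint>& kp2,
                 std::vector<cv::DMatch>& matches)
{
    cv::Ptr<cv::ORB> orb = cv::ORB::create(1000);   // up to 1000 features (assumed)
    cv::Mat desc1, desc2;
    orb->detectAndCompute(img1, cv::noArray(), kp1, desc1);
    orb->detectAndCompute(img2, cv::noArray(), kp2, desc2);

    // The Hamming norm matches ORB's binary descriptors.
    cv::BFMatcher matcher(cv::NORM_HAMMING, /*crossCheck=*/true);
    matcher.match(desc1, desc2, matches);
}
```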
2.3 Triangulation and Initialization
One key technique used in the proposed approach is triangulation, which denotes the process of deriving 3D points from a set of 2D point matches obtained
from two camera images with sufficient baseline. The first step is to find the
Fundamental Matrix F which maps one point in the first image to a line in the
second image (cf. Figure 4).
Fig. 4. The epipolar constraint: A point in one view (XL ) can be projected into
the other view as a ray (eR XR ). Taken from https://en.wikipedia.org/wiki/File:
Epipolar_geometry.svg
We use RANSAC for this; the distance threshold for the epipolar inliers is 3 pixels and the confidence threshold 99%. Second, the Essential Matrix is computed as E = K^T F K and the rotation and translation matrices are derived
by singular value decomposition. There are two solutions each due to projective
ambiguity - R1, R2, t1, t2 - so we assume the first camera's projection matrix to be P = I and the second one's to be P'_{i,j} = K[R_i | t_j].
The triangulation is done by constructing a system of linear equations from the fact that x = P X and x' = P'_{i,j} X, and solving it as discussed in [3].
Finally, we test which of the four possible projection matrices Pi,j leads to
a triangulation with more than 60% of all points in front of both cameras. This
step is necessary because there is only one camera pose with all triangulated
points being in front of both cameras. In reality, even the correct pose will have
some points violating this constraint due to numerical or measurement errors, so we use a 60% threshold instead of 100%. Points which are not in front of both
cameras are removed and the remaining ones are returned as the triangulation’s
result.
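The two-view geometry described above can be sketched as follows. Note that recoverPose (OpenCV 3) already performs the cheirality test internally, whereas our OpenCV 2.4-based implementation tests the four candidate poses manually as described above; K is assumed to be a 3x3 CV_64F matrix and the function name is illustrative.

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Two-view initialization sketch: fundamental matrix with RANSAC, essential
// matrix E = K^T F K, pose recovery and triangulation of the matches.
cv::Mat initializeMap(const std::vector<cv::Point2f>& pts1,
                      const std::vector<cv::Point2f>& pts2,
                      const cv::Mat& K)
{
    std::vector<uchar> inliers;
    cv::Mat F = cv::findFundamentalMat(pts1, pts2, cv::FM_RANSAC, 3.0, 0.99, inliers);
    cv::Mat E = K.t() * F * K;

    cv::Mat R, t;
    cv::recoverPose(E, pts1, pts2, K, R, t, inliers);   // selects the valid pose among the four candidates

    // Projection matrices: P = K [I | 0], P' = K [R | t].
    cv::Mat P1 = K * cv::Mat::eye(3, 4, CV_64F);
    cv::Mat Rt;
    cv::hconcat(R, t, Rt);
    cv::Mat P2 = K * Rt;

    cv::Mat points4D;   // homogeneous 3D points, one column per match
    cv::triangulatePoints(P1, P2, pts1, pts2, points4D);
    return points4D;
}
```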
This process of triangulation is first done in the initialization: The user is
advised to point the camera at the scene to be tracked and then smoothly offset it in a translational manner, creating a small parallax movement. Then,
one image prior to this and one after this offset are used for triangulation by
matching point features (cf. Figure 5) and performing the procedure explained
above.
Fig. 5. Parallax movement is needed for map initialization. The matches between two
images of the same scene with a small translational movement are used for triangulation.
Fig. 6. The point cloud derived from
the initialization pictured in Figure 5.
For illustration purposes the number of
detected feature points was vastly increased so the scene is recognizable. Usually, a much smaller number of matches is
needed.
Each triangulated 3D point is added to the map and is assigned a feature
descriptor calculated from the first of the two images. Finally, said first image
is also added to the map as its first keyframe. Each keyframe is also assigned its
camera pose (in the case of initialization this is the identity matrix).
2.4 Camera pose estimation
Once a map of 3D points is given, the current camera pose can be obtained
by projecting the points into the image based on a pose assumption imposed
by previous camera poses and a motion model (the use of a stationary motion
model has shown acceptable results).
This is done by solving the Perspective-n-Point problem. We use EPnP [6], which reduces the problem to O(n) with n being the number of map points.
Fig. 7. Matches of map points and the current camera image’s feature points. To
visualize the map points, they are projected into the keyframe they were constructed
from. On the left you see two exemplary keyframes with the map points drawn in blue
and the lines indicating matches between map points and image features of the current
camera image on the right. Those matches of 3D to 2D points are then used for pose
reconstruction with a Perspective-N-Point algorithm.
2.5 Adding a new keyframe
Whenever tracking is assumed to be good, the Tracker calls the Mapper with
the current camera frame and the corresponding pose estimation. The Mapper
then decides if the camera image should be added as keyframe based on the
last added keyframe (they have to be at least 20 frames apart). To add the new
keyframe, the spatially closest already existing keyframe in the map is determined (the Tracker’s pose assumption and each stored pose are compared) and
a new triangulation is done with those two images (cf. Section 2.3).
2.6 Map refinement
The map has to be refined to remove spurious triangulations and to fuse triangulated points which are actually different measurements of one identical landmark.
PTAM uses Levenberg-Marquardt[7] bundle adjustment for this.
Currently, the integration of Levenberg-Marquardt in our system is a work
in progress and the given results are without bundle adjustment but with a
simplistic refinement approach using non-minimum suppression: Whenever two
map points are closer than a given threshold the one with the higher reprojection
error assigned in the triangulation is removed.
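This simplistic refinement can be sketched as follows (the distance threshold and the MapPoint structure are illustrative assumptions, not the project's actual data structures).

```cpp
#include <opencv2/opencv.hpp>
#include <algorithm>
#include <vector>

struct MapPoint {
    cv::Point3f position;
    float reprojectionError;   // error assigned during triangulation
    bool valid = true;
};

// Non-minimum suppression: whenever two map points are closer than a threshold,
// drop the one with the higher reprojection error (sketch).
void refineMap(std::vector<MapPoint>& map, float minDistance = 0.01f)
{
    for (size_t i = 0; i < map.size(); ++i) {
        if (!map[i].valid) continue;
        for (size_t j = i + 1; j < map.size(); ++j) {
            if (!map[j].valid) continue;
            cv::Point3f d = map[i].position - map[j].position;
            float dist2 = d.x * d.x + d.y * d.y + d.z * d.z;
            if (dist2 < minDistance * minDistance) {
                if (map[i].reprojectionError > map[j].reprojectionError) {
                    map[i].valid = false;   // point i is removed, continue with the next i
                    break;
                } else {
                    map[j].valid = false;
                }
            }
        }
    }
    map.erase(std::remove_if(map.begin(), map.end(),
                             [](const MapPoint& p) { return !p.valid; }),
              map.end());
}
```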
3 Handling of dynamic scenes
RDSLAM [13] improves monocular SLAM by incorporating two major enhancements: Occlusion handling during the projection of map points into the current
camera image and a prior-adaptive RANSAC algorithm for pose estimation.
Occlusion handling works as follows: The current visual appearance of each
map point whose projection lies in the current camera image is compared with
its stored descriptor. If they differ significantly this means either a) the map
point is occluded or b) the map point is invalid due to changes in the (dynamic)
environment. In the first case, the map point should be excluded from the pose
estimation but it should stay in the map. In the latter, it should be removed
permanently. The distinction between a) and b) is made by evaluating neighbouring features (a code sketch of this decision logic follows the list):
1. Denote the map point by X and its projection in the image by x. Collect all
currently tracked map points whose projection into the image is less than 20
pixel away from x and call this set neighbourhood.
2. If the neighbourhood is empty, this is an indicator for X being occluded by
a moving object since a moving object’s feature points are considered as
outliers during tracking. In this case we keep the map point.
3. If the neighbourhood contains points x′ with their respective 3D points X ′
and X is closer to the camera than all X ′ , the map point is not occluded.
Therefore it is safe to assume that the point became invalid due to changes
in the environment and we remove the map point.
4. If there are some neighbouring points whose 3D point X ′ is closer to the
camera than X, there are two possible reasons for this: a) X ′ belongs to an
object which occludes X or b) X ′ appears due to changes in the environment
and X is actually invalid. To distinguish both cases, X and X ′ are projected
in the keyframe X was constructed from.
– If those projected points are distant, they don’t belong to the same object,
hence a) is the case and we keep the map point.
– If they are close to each other, b) is the case and we remove the map
point.
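The decision logic above can be summarized in the following sketch (the depth values, the keyframe projections and the distance threshold are illustrative assumptions, not values from [13]).

```cpp
#include <cmath>
#include <vector>

// Neighbouring tracked map point X' of the tested point X: its depth and its
// projection into the keyframe X was originally triangulated from.
struct Neighbour {
    float depth;
    float kfX, kfY;
};

// Returns true if the tested map point X should be removed from the map.
bool shouldRemoveMapPoint(float depthX, float kfXx, float kfXy,
                          const std::vector<Neighbour>& neighbours,
                          float sameObjectRadius = 15.f)
{
    // Case 2: no tracked neighbours -> X is likely occluded by a moving object; keep it.
    if (neighbours.empty()) return false;

    bool hasCloserNeighbour = false;
    for (const Neighbour& n : neighbours) {
        if (n.depth >= depthX) continue;      // this neighbour lies behind X
        hasCloserNeighbour = true;
        // Case 4: X' lies in front of X. If their projections into X's keyframe
        // are close, X' replaced X's structure -> X became invalid.
        float dx = n.kfX - kfXx, dy = n.kfY - kfXy;
        if (std::sqrt(dx * dx + dy * dy) < sameObjectRadius)
            return true;
    }
    // Case 3: X is closer than all neighbours -> not occluded, so the scene
    // changed and X is invalid. Otherwise X is occluded by a closer object
    // and should be kept.
    return !hasCloserNeighbour;
}
```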
The second technique Tan et al. propose is PARSAC, a prior-based adaptive
RANSAC which enforces an evenly distributed sampling and uses a weighting
scheme based on the results from the previous frame.
4 Results and Conclusion
The proposed system continuously tracks and adds new keyframes (cf. Figure
8). As already mentioned, the tracking accuracy suffers from dynamic changes
so the next step would be to incorporate the techniques discussed in Section 3
(cf. Figure 9). Additionally, the selection of distinctive keyframes could be improved further, e.g. by cumulating feature measurements from several subsequent
frames. Thirdly, panoramic movement is likely to occur in AR environments (especially when the user wears an HMD and frequently pans the head) but can break the map since there is no baseline for triangulation. Pirchheim et al. [10] propose a
solution for this problem.
Fig. 8. Screenshots of our SLAM system. The blue dots are the 3D map points projected into the image. Compare left and right image to see that map points got added
due to the addition of a new keyframe.
Fig. 9. Projection of map points in case of occlusions (left) or changes in the environment (right). In the right scene, the red box indicates points which have become
invalidated since the structure they originated from (in this case they were constructed
from the letters at the top of the book) is removed. The green box contains scene points
which are not observable from the current position but are most likely to be still valid
since the original structure (the title of the lying book) is untouched. The goal is to
remove the points in the red box from the map and to maintain those in the green one.
References
1. Gabriele Bleser. Towards Visual-Inertial SLAM for Mobile Augmented Reality.
PhD thesis, Technical University Kaiserslautern, 2009.
2. Andrew J Davison, Ian D Reid, Nicholas D Molton, and Olivier Stasse.
MonoSLAM: real-time single camera SLAM. IEEE transactions on pattern analysis and machine intelligence, 29(6):1052–67, June 2007.
3. RI Hartley and Peter Sturm. Triangulation. Computer vision and image understanding, 68(2):146–157, 1997.
4. Georg Klein and David Murray. Parallel Tracking and Mapping for Small AR
Workspaces. In 2007 6th IEEE and ACM International Symposium on Mixed and
Augmented Reality, pages 1–10. IEEE, November 2007.
5. Georg Klein and David Murray. Improving the Agility of Keyframe-Based SLAM.
In Proceedings of the European Conference on Computer Vision (ECCV) 2008,
2008.
6. Vincent Lepetit, Francesc Moreno-Noguer, and Pascal Fua. EPnP: An Accurate
O(n) Solution to the PnP Problem. International Journal of Computer Vision,
81(2):155–166, July 2008.
7. Donald W Marquardt. An algorithm for least-squares estimation of nonlinear parameters. Journal of the Society for Industrial & Applied Mathematics, 11(2):431–
441, 1963.
8. Michael Montemerlo. FastSLAM: A factored solution to the simultaneous localization and mapping problem with unknown data association. PhD thesis, 2003.
9. Zhen Peng. Efficient matching of robust features for embedded SLAM. Diploma
thesis, University of Stuttgart, 2012.
10. Christian Pirchheim, Dieter Schmalstieg, and Gerhard Reitmayr. Handling Pure
Camera Rotation in Keyframe-Based SLAM. In IEEE International Symposium
on Mixed and Augmented Reality, pages 229–238, 2013.
11. Edward Rosten and Tom Drummond. Machine Learning for High-Speed Corner
Detection. In Aleš Leonardis, Horst Bischof, and Axel Pinz, editors, Computer
Vision – ECCV 2006, pages 430–443, 2006.
12. Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. ORB: An efficient alternative to SIFT or SURF. In 2011 International Conference on Computer
Vision, pages 2564–2571. IEEE, November 2011.
13. Wei Tan, Haomin Liu, Zilong Dong, Guofeng Zhang, and Hujun Bao. Robust
Monocular SLAM in Dynamic Environments. In IEEE International Symposium
on Mixed and Augmented Reality, pages 209–218, 2013.
Visual Tracking Decomposition
Mohammed Anwar1 and Alain Pagani2
1
2
anwarower@gmail.com
alain.pagani@dfki.de
Abstract. In this work we present an explanation of the Visual Tracking Decomposition methodology given in [8], digging further into the
mathematical concepts behind its proposed approach. We explain the
theory of basic Bayesian Tracking and some useful mathematical concepts in the paper. We give an idea about Principal Component Analysis,
Diffusion Distance and Markov Chain Monte Carlo. We explain in detail
how each of these concepts is applied in the paper. In conclusion, this
work is meant to be an aiding supplementary material for understanding
the paper.
Keywords: Visual Tracking Decomposition, SPCA, Markov Chain Monte
Carlo
1 Introduction
Object Tracking is a problem that remains not perfectly solved in Computer Vision [10]. Researchers have recently tackled real-world scenarios rather than confining themselves to the lab environment. There are numerous difficulties encountered in Visual Tracking. The intractability of tracking an object emerges from severe changes in its appearance or motion. The domain of these changes includes pose, illumination and occlusion, in addition to abrupt motion due to a low video frame rate. This problem is tackled by the given paper. Here, a novel method is proposed relying on the concept of decomposing the process of tracking into several basic models, allowing them to account for various motion assumptions. The results of the basic models are then combined, offering a highly complex visual tracker.
2 Mathematical Background
2.1 Bayesian Tracking
The Bayes Filter is the backbone of various probability based trackers. It is used
to model and predict stochastically the states of a model in a dynamic system
based on the control input to the model and the sensor readings from the surroundings. Since we are speaking about visual tracking, the control input plays
no role in our considerations. The Bayes Filter uses the concept of a complete
state based on the Markov assumption, which means that the previous state is
always considered a well-composed summary of the history up to its point in
time. This removes the need for keeping track of past inputs and measurements. It is impossible to discuss Bayesian Tracking without giving a notion of a state.
A state for a given object subject to tracking is a collection of all aspects relating
it to the environment such as position, scale, velocity, etc [15]. We denote the
state of an object for a given time t by Xt . There is usually a spectrum of interaction between the object under tracking and its surrounding environment. An
important form of such is the observational data. These are observations done
by the object to evaluate the correctness of its current belief. We denote the
measurements observed at time t by Yt . The Bayesian Tracking procedure can
be formalized for visual tracking by the following equation [15]:
p(X_t | Y_{1:t}) \propto p(Y_t | X_t) \int p(X_t | X_{t-1})\, p(X_{t-1} | Y_{1:t-1})\, dX_{t-1} \qquad (1)
It is quite obvious that this equation has a recursive form. The term p(X_{t-1} | Y_{1:t-1}) is the belief one time step back. An explanation of the remaining terms is provided in Section 2.1.1.
2.1.1 Bayesian Tracking Components The Bayesian Tracking process is
composed of several models that collaboratively work to estimate the state of
the tracked object. We will explicitly mention them below to demonstrate how
they basically work. The explanation of the approach deployed in the paper will
be analogous to the basic approach.
Motion Model The motion model is a prediction tool. It assumes a certain
motion pattern and uses it to give an estimate of the future state given the current one. Examples are the constant velocity motion model and the constant
acceleration motion model. In equation (1), the motion model is p(Xt |Xt−1 ). Xt
is the predicted state given a current state of Xt−1 .
Observation Model The observation model has basically two components, which are the object under tracking and the observation measurements. The object defines the aspects of the tracked target including shape, appearance, etc. The observation measurements are responsible for correcting the current belief. Examples are taking sensor measurements, taking camera shots or comparing image patches as in visual tracking. In (1), the observation model is the term p(Yt |Xt ). It expresses, given an estimated state Xt, the probability of obtaining the measurement record Yt. The consistency between Yt and
the estimated state evaluates the tracking accuracy.
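The recursive update in (1) is typically approximated with a sampling-based filter. As a minimal, self-contained illustration (not the method of [8], which combines several decomposed models via MCMC sampling), the following sketch implements one predict-update-resample step of a bootstrap particle filter for a 2D position state; the noise parameters are assumptions.

```cpp
#include <cmath>
#include <random>
#include <vector>

struct Particle { double x, y, weight; };

// One bootstrap particle filter step approximating Eq. (1) for a 2D position state.
void particleFilterStep(std::vector<Particle>& particles,
                        double measX, double measY, std::mt19937& rng)
{
    std::normal_distribution<double> motionNoise(0.0, 5.0);   // random-walk motion model (assumed)
    const double sigmaObs = 10.0;                             // observation noise in pixels (assumed)

    // 1. Prediction: sample from the motion model p(X_t | X_{t-1}).
    for (Particle& p : particles) {
        p.x += motionNoise(rng);
        p.y += motionNoise(rng);
    }

    // 2. Update: weight each particle by the observation likelihood p(Y_t | X_t).
    double sum = 0.0;
    for (Particle& p : particles) {
        double dx = p.x - measX, dy = p.y - measY;
        p.weight = std::exp(-(dx * dx + dy * dy) / (2.0 * sigmaObs * sigmaObs));
        sum += p.weight;
    }
    for (Particle& p : particles) p.weight /= sum;

    // 3. Resampling: draw a new particle set proportional to the weights.
    std::vector<double> w;
    for (const Particle& p : particles) w.push_back(p.weight);
    std::discrete_distribution<size_t> pick(w.begin(), w.end());
    std::vector<Particle> resampled;
    for (size_t i = 0; i < particles.size(); ++i) {
        Particle p = particles[pick(rng)];
        p.weight = 1.0 / particles.size();
        resampled.push_back(p);
    }
    particles.swap(resampled);
}
```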
2.1.2 Bayesian Tracking Example To project the previously mentioned
concepts into a concrete example, let us take a look now on figure 1. In (a) we
have a given frame from a football game. We need to pick some object to be our
subject of tracking. This object is selected in (b) to be the ball. In this sense we
can say that the object model is a circle with the football texture, etc. The state
in this case is just a 2 dimensional position. Normally, since the frames are projected from a 3-D animated world, the linear motions would lose their linearity.
However, for the sake of simplicity, we will assume that we have here a linear
motion model that is estimated through previous frames to be in the direction
from the player’s foot to the goal. Our motion model in this sense assumes the
ball would continue its path with the same velocity. The uncertainty is typically embedded in the system by exploiting a Gaussian distribution centered at the predicted new state. Now what happens from (b) to (c) is that we use the motion model to predict the new state (dashed boundaries), then we correct our assumption using the observation model. Here, the observation model measures
the similarity/dissimilarity between the patch described by the predicted state
and the template we have in the object model (see (c)).
2.2 Diffusion Distance
The Diffusion Distance [9] is an approach to measure distances between histogram-based local descriptors. Histogram-based local descriptors are generally susceptible to deformations, illumination changes and noise. Normally, the distance between descriptors is measured through a bin-to-bin correspondence, which means that the corresponding differences between classes are calculated within their quantization value ranges. The distances used here can be the Euclidean distance, the Kullback-Leibler divergence or various others. These methods are, nevertheless, sensitive to effects that take place globally. For example, if an object is moving in an image sequence and two frames are being compared, a large distance could be estimated although the same object is being depicted. Here, the data is translated into a local region and during the comparison some regions in the environment will not be considered. To tackle this problem, a cross-bin approach can be used to compare bins that could be found in different places. The Diffusion Distance offers a cross-bin comparison for multi-dimensional histogram-based local descriptors. The difference between the histograms is treated here as a temperature field and a diffusion through this field is modelled. Afterwards, the norm is integrated over the diffusion field over time, offering a dissimilarity measure between the histograms. A Gaussian pyramid is used for this diffusion difference. The Gaussian filter smoothens the data and reduces the dimension.
For the diffusion distance, two n dimensional histograms h1 (x) and h2 (x) are
observed, where x ∈ Rn . Thus the Diffusion Distance is defined through :
K(h_1, h_2) = \sum_{l=0}^{L} k(|d_l(x)|) \qquad (2)
where
d_0(x) = h_1(x) - h_2(x) \qquad (3)
d_l(x) = [d_{l-1}(x) * \phi(x, \sigma)] \downarrow_2, \quad l = 1, \ldots, L \qquad (4)
(a) Initial frame
(b) Registering the required object for tracking. The template
defined by the red boundaries is in this case the object model.
(c) Using the motion model to predict the next state (dashed
boundaries), then measuring the correctness of the predicted
state using the observation model.
Fig. 1. A simplified example capturing the aspects of Bayesian Tracking in application.
L is the number of levels of the Gaussian pyramid, [\cdot] \downarrow_2 denotes downsampling by half, \phi is a Gaussian filter with standard deviation \sigma, and k(\cdot) is a metric by which the norm of the histograms is calculated. It is defined in [9] to be the L_1 norm:

\|x\| = \sum_{i=1}^{n} |x_i| \qquad (5)
The calculation of the Diffusion Distance has a complexity that is linear in
terms of the number of histogram bins to be compared. As the dimension of the
Gaussian pyramid exponentially decreases and only a small Gaussian filter is
used, the convolution is carried out in linear time for the L levels.
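A possible implementation of Eqs. (2)-(5) is sketched below, assuming the histograms are given as two-dimensional float matrices; the pyramid depth and σ are assumptions, and the function name is illustrative.

```cpp
#include <opencv2/opencv.hpp>

// Diffusion distance between two 2D histograms: difference, repeated Gaussian
// smoothing and downsampling by half, and accumulation of the L1 norms per level.
double diffusionDistance(const cv::Mat& h1, const cv::Mat& h2,
                         int levels = 3, double sigma = 0.5)
{
    cv::Mat d;
    cv::subtract(h1, h2, d, cv::noArray(), CV_32F);   // d_0 = h1 - h2, Eq. (3)
    double distance = cv::norm(d, cv::NORM_L1);        // k(|d_0|), Eq. (5)

    for (int l = 1; l <= levels && d.rows >= 2 && d.cols >= 2; ++l) {
        cv::GaussianBlur(d, d, cv::Size(), sigma);                     // convolve with phi(x, sigma)
        cv::resize(d, d, cv::Size(), 0.5, 0.5, cv::INTER_NEAREST);     // downsample by half, Eq. (4)
        distance += cv::norm(d, cv::NORM_L1);                          // add k(|d_l|), Eq. (2)
    }
    return distance;
}
```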
2.3 Principal Component Analysis
Principal Component Analysis (PCA) is a powerful mathematical tool that is used in this paper in the form of its extension (SPCA). PCA is used when dealing with multivariate data, offering means of dimensionality reduction with minimal loss of information. In figure 5, we are confronted with a view projection problem from 3D to 2D. The different views illustrate different axes of projection. That reveals why the top figure is the most illustrative one: the axes of projection are aligned with the two directions of maximum data variation (in this case, the maximum variation of the data is spanned by the two most distant points on the teapot's body). That results in conveying the largest amount of information possible about the data. Finding the principal components of a set of data can be expressed as finding a new basis for defining it. To be able to smoothly use linear algebra, we assume linearity and orthogonality of the principal components. The first principal component of the data is the direction of the highest degree of variation. That constrains the next most important principal component to lie on an axis perpendicular to it; among these, the second principal component is again the one showing the highest variance of the data, and so on. PCA thus eliminates the redundancy in the given set, allowing us to express it in a simpler way. The redundancy in a given set is mirrored in a statistical factor, namely the covariance between its attributes. If we have an attribute that covaries closely with another one, then we only need the behavior of one of them to deduce the other. This is done by expressing the covarying attributes as linear combinations of each other. Eventually, we could say that we eliminated a portion of redundancy in our data set (see figure 2).
2.3.1 Mathematical Background of PCA We have mentioned formerly
that using PCA is a matter of choosing a new basis for the given dataset. Now
we will give a glimpse of the mathematics behind it.
Let us denote a dataset with X, where X is a matrix obtained by stacking
the data instances next to each other, so that each column corresponds to one
Fig. 2. A range of possible redundancies in data between the 2 variables r1 and r2 .
The dashed line is a best-fit obtained by r2 = kr1 [13].
data instance. The problem of finding a new basis is then finding a P matrix,
that projects the data to Y :
Y = P X \qquad (6)
Equally important, we need this new image of the data Y to exhibit as little covariance between its dimensions as possible. An insight into the covariance is gained by building the covariance matrix. Assuming that the data X have a mean of zero, the covariance matrix is derived by:

C = Y Y^T \qquad (7)
In the covariance matrix C, an element cij is the dot product of all the values for
a given data dimension i and the respective values for another given dimension
j. That is the covariance between the i-th dimension and the j-th dimension.
Two dimensions that do not covary at all have a covariance of 0. A diagonal entry cii, however, represents the variance along the dimension i. Thus, an idealized covariance matrix would be one where all the off-diagonal entries are 0. Now we can formally state our problem, which is picking P so that C is diagonalized.
We proceed by substituting (6) into (7), obtaining:

C = (P X)(P X)^T \qquad (8)
C = (P X)(X^T P^T) \qquad (9)
C = P (X X^T) P^T = P Q P^T \qquad (10)

From linear algebra, we know that the matrix Q is symmetric and decomposable to:

Q = A \Sigma A^T \qquad (11)
where \Sigma is a diagonal matrix and A is an orthogonal matrix of Q's eigenvectors. That yields:

C = P A \Sigma A^T P^T \qquad (12)

Since we only want to keep the diagonalized structure of \Sigma, a smart solution is to neutralize P with A, namely by choosing P to be A^T (up to a scale):

C = (A^T A) \Sigma (A^T A) \qquad (13)

which finally gives us the diagonalized matrix:

C = I \Sigma I \qquad (14)
A key fact here is that the inverse of an orthogonal matrix is equal to its transpose [14]. In conclusion, we can see that the principal components of a dataset are derived by calculating the eigenvectors of the covariance matrix. The method discussed here is just one way of calculating principal components; another technique uses the Singular Value Decomposition [17]. An informative step-by-step tutorial about PCA can be found in [13].
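The following OpenCV sketch mirrors this eigendecomposition procedure for data stored with one instance per column, as in the text; it is an illustration of Eqs. (6)-(14), not code from the studied paper (constant names follow the OpenCV 3 API).

```cpp
#include <opencv2/opencv.hpp>
#include <algorithm>

// PCA by eigendecomposition of the covariance matrix. The rows of the returned
// matrix P are the principal directions, sorted by decreasing variance.
cv::Mat principalComponents(const cv::Mat& X)
{
    cv::Mat Xd;
    X.convertTo(Xd, CV_64F);

    // Centre the data so that each dimension (row) has zero mean.
    cv::Mat rowMean;
    cv::reduce(Xd, rowMean, 1, cv::REDUCE_AVG);
    cv::Mat Xc = Xd - cv::repeat(rowMean, 1, Xd.cols);

    // Covariance matrix Q = X X^T, here with the usual 1/(n-1) scale factor.
    cv::Mat Q = (Xc * Xc.t()) * (1.0 / std::max(Xd.cols - 1, 1));

    // Q = A Sigma A^T; choosing P = A^T diagonalizes the covariance of Y = P X.
    cv::Mat eigenvalues, eigenvectors;
    cv::eigen(Q, eigenvalues, eigenvectors);   // eigenvectors are returned as rows, sorted by eigenvalue
    return eigenvectors;                        // this is P; project the data with Y = P * Xc
}
```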
2.3.2 Sparse Component Analysis Although Applying PCA could significantly reduce the dimensionality of a given dataset, it still suffers from a big
disadvantage. In most of the times, the principal components are linear combinations of all the variables. In many cases, it is benefitial to sacrifice a degree of
data accuracy to describe each principal component in terms of few variables.
That translates up to axes in which a few entries are non-zero. The extenstion
(SPCA) used in this paper takes over solving this problem. It adds another constraint to the formal problem of PCA, which is that their cardinality (number of
non-zero entries) should be minimized as possible to enable a better conveyance
of the physical properties of the system. Consequently, this offers the chance to
use less space for storing the data leading to the desired degree of compactness.
The formal problem would then be :
maximize xT Ax − ρCard2 (x)
(15)
subject to x > 0
(16)
Here, A is the covariance matrix of the data. The cardinality function Card expresses the number of non-zero entries. ρ is a factor that reflects our preference between the accuracy and the sparsity of the components. This problem is in fact NP-hard [11]. It is beyond the scope of this work to explicitly derive the solution to the problem; however, one of the approaches is offered in [4]. That solution relies on a class of mathematical approximation techniques called convex optimization [2]. The same optimization tools are used in the subject paper. An alternative solution to the problem is Iterative Elimination, where variables are recursively eliminated according to a criterion that minimizes the loss of explained variance, and the sparse principal component analysis problem is solved again and again until the desired sparsity is achieved [17]. The application of SPCA is sometimes very powerful and offers very compact solutions. For example, in [4] it reduced the number of involved variables from 533 to 14.
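To give an idea of how Iterative Elimination can look in practice, here is a small illustrative sketch (my own simplification, not the exact algorithm of [17] or the semidefinite approach of [4]): the variable with the smallest loading in the leading eigenvector is dropped and the eigenproblem is re-solved until the desired cardinality is reached.

import numpy as np

def sparse_pc_by_elimination(A, cardinality):
    # A: covariance (or Gramian) matrix; cardinality: desired number of non-zero entries.
    active = list(range(A.shape[0]))
    while len(active) > cardinality:
        sub = A[np.ix_(active, active)]
        _, vecs = np.linalg.eigh(sub)
        leading = vecs[:, -1]                           # eigenvector of the largest eigenvalue
        active.pop(int(np.argmin(np.abs(leading))))     # eliminate the least important variable
    x = np.zeros(A.shape[0])
    _, vecs = np.linalg.eigh(A[np.ix_(active, active)])
    x[active] = vecs[:, -1]                             # sparse component on the surviving variables
    return x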
2.4 Markov Chain Monte Carlo
2.4.1 Markov Chains A Markov Chain is a sequence of random variables
x^(0), ..., x^(M) fulfilling the following condition for m ∈ {1, ..., M − 1}:

p(x^(m+1) | x^(1), ..., x^(m)) = p(x^(m+1) | x^(m))   (17)
In words, this means that the next state depends only on the previous state, regardless of how the history before the previous state looked. This is called the Markov Assumption [1].

Fig. 3. A grasshopper’s right and left hops can be modelled as a Markov Chain. [6]

In figure 3, we have a grasshopper taking steps either to the right or to the left. It initially starts at state 0. The
decision of whether to hop right or left only depends on the current state. In
this sense we can see that this behaviour can be modelled by a Markov Chain.
2.4.2 Markov Chain Monte-Carlo Methods One of the strength points
of Markov Chains is that they can be used to sample from distributions that are
actually intractable to directly sample from [6]. The algorithms using this tool are
called the Markov Chain Monte Carlo methods. To sample a target distribution,
an MCMC algorithm uses a Markov Chain whose stationary distribution matches the target distribution. It is worthwhile to observe in figure 3 how the probability distribution spreads out in each step: initially we have a probability of 1 of being at state 0, and after one step the distribution is {0.5, 0.25, 0.25} as indicated in the graph. Further on, the
distribution keeps getting divided for new states as the grasshopper moves. That
continuous change in the Markov Chain’s distribution is however not desirable
if we need it to simulate the target fixed distribution, which would imply that
it at least keeps the same distribution along time. This poses a requirement on
the Markov Chain used, which is that the probability that a given state xi is
the current one would be the same over time. Luckily, Markov Chains have the
ability to do that when they exhibit a stationary behaviour. In this case, as time
goes by, the transition probabilities between states converge to a certain limit.
Let us make that more formal. Given a Markov Chain with a state X and a
transition matrix T , the following equation holds:
p(X^(t+1) = x′) = Σ_x P(X^(t) = x) T(x → x′)   (18)
This equation rules the dynamics of the chain. It means that the probability of being at state x′ at time t + 1 is equal to the sum of the probabilities of all paths leading to x′. Each path leading to x′ contributes the joint probability of being at a given state and of transitioning from this state to x′. We then denote the stationary distribution by
π. The probability of having state x′ is given accordingly by π(x′ ). If the system
is in stationary state, that implies that :
p(X^(t+1) = x′) = p(X^(t) = x′) = Σ_x P(X^(t) = x) T(x → x′)   (19)

∴ π(x′) = Σ_x π(x) T(x → x′)   (20)
Arriving at this stationary state, nevertheless, is not guaranteed for all Markov Chains; therefore we have to ensure it by performing a regularity test. If, for any given state in a Markov Chain, all the other states are reachable, including the state itself (through a self-loop), we can deduce that the chain is regular [6]. The regularity of a Markov Chain in turn ensures that it has a stationary distribution that is also unique, regardless of the initial state.
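The convergence towards a unique stationary distribution can be checked numerically by iterating equation (18). The sketch below uses a small regular three-state chain loosely inspired by the grasshopper example (the transition values are hypothetical, not those of figure 3).

import numpy as np

# Rows of T hold the transition probabilities T(x -> x'); each row sums to one.
T = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])

p = np.array([1.0, 0.0, 0.0])      # probability 1 of starting at state 0
for _ in range(1000):
    p_next = p @ T                 # one application of Eq. (18)
    if np.allclose(p_next, p):     # the distribution has stopped changing
        break
    p = p_next
# For this regular chain p converges to the uniform distribution (1/3, 1/3, 1/3).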
2.4.3 Metropolis Hastings Algorithm [6] The objective of an MCMC method is to sample a target distribution using a Markov Chain that has the desired stationary distribution [6]. The actual implementation, however, depends on the field of application. The Metropolis-Hastings algorithm is one of the MCMC algorithms. The pivot of this algorithm is the reversibility of a given chain. To
understand what a reversible chain means let us observe figure 4. Here, we can see
a Markov Chain depicted that is regular. Let x and x′ be any 2 arbitrary states.
The probability of going from state x to state x′ is equal to the joint probability
between being at state x (given by π(x)) and the probability of traversing then
to state x′ (given by T (x → x′ )). That is modelled in the graph by traversing the
red edge. Similarly, going from state x′ to x is modelled by traversing the green
edge. If the probabilities of traversing the red edge and the green edge are equal, we can say that the transition from state x to x′ is reversible. That is:
π(x)T (x → x′ ) = π(x′ )T (x′ → x)
(21)
Since x and x′ are arbitrary, we can generalize this for any 2 states in the chain,
which makes it a reversible chain. The finding is very useful. If there is a regular
chain that is reversible for a given distribution π, then this distribution is in fact
its unique stationary distribution. That helps us to know for a given Markov
Chain if it will really converge to our target probability distribution or not. The
proof for this property is as follows: starting from (21), we sum both sides over x, getting:
Σ_x π(x) T(x → x′) = Σ_x π(x′) T(x′ → x)   (22)

π(x′) is constant with respect to x, so we can pull it out of the summation, getting:

Σ_x π(x) T(x → x′) = π(x′) Σ_x T(x′ → x)   (23)

Since summing the transition probabilities from a given state yields one, we are left with:

Σ_x π(x) T(x → x′) = π(x′)   (24)
This is in fact the stationary state equation (20). Our objective is to sample from an intractable probability distribution. Since that is hard, we will be using a Markov Chain that has the same stationary distribution. We ensure this condition by choosing to have the chain reversible at our target distribution. A question would then be: how can we proceed to take samples using the Markov Chain? The answer is the main axis of the Metropolis-Hastings algorithm. The Metropolis-Hastings chain is made to explore the state space freely according to a proposal distribution Q(x → x′). Next, an evaluating agent comes into play, which judges whether a proposed next state is good enough and accordingly accepts or rejects it. This is an acceptance ratio A(x → x′) that corrects the proposal. Mathematically, that means decomposing the transition probability T(x → x′) as:
T (x → x′ ) = Q(x → x′ )A(x → x′ )
(25)
Plugging (25) into (21) we get :
π(x) Q(x → x′) A(x → x′) = π(x′) Q(x′ → x) A(x′ → x)   (26)

A(x → x′) / A(x′ → x) = [π(x′) Q(x′ → x)] / [π(x) Q(x → x′)]   (27)

Since the ratio between the two acceptance probabilities is what matters, we can conventionally always select A(x′ → x) to be equal to 1. Moreover, we pose a condition ensuring that the acceptance ratio does not exceed 1. This gives us:

A(x → x′) = min{1, [π(x′) Q(x′ → x)] / [π(x) Q(x → x′)]}   (28)
Fig. 4. A Markov Chain where reversibility holds.
2.5 Interactive Markov Chain Monte Carlo Based Tracking
Interactive Markov Chain Monte Carlo (IMCMC) is a probabilistic graphical sampling approach. It is usually used when multiple trackers are exploited, as it offers a way to integrate the knowledge between them. In [5], an object is assigned a local tracker using its local appearance and a global tracker using global features of the image. In the Visual Tracking Decomposition paper, the same concept of integration is applied, but between various local trackers. It is worth mentioning, though, that Visual Tracking Decomposition applied this mathematical tool first. A detailed explanation of IMCMC can be found in [3], while another application of it can be found in [12].
3 Visual Tracking Decomposition
To alleviate the problem of abrupt appearance changes, the paper adopts a divide-and-conquer technique. Instead of using a single tracker, numerous simpler trackers are used that communicate with each other. Therefore, we can say that the operation of the tracker is conducted in a diamond-like workflow. The upper half of the diamond corresponds to distributing the work into numerous smaller components. On the other hand, the lower half of the diamond depicts fusing
the knowledge and integrating the beliefs of the basic components into the final
decision. In the following we will discuss how each basic tracker is built and how
it works, then we will discuss how the results of basic trackers are integrated
together.
3.1 Basic Tracker
As we formerly mentioned, a tracker is defined by an observation model and a
motion model. In the next subsections we will see how the observation model
and the motion model of a basic tracker are formed.
3.1.1 Observation Model To accommodate the problem of severe appearance change, several types of image features were used for each object’s image patch over time. The types of features used were hue, saturation, intensity and edge. For each feature type, measurements were taken for the object at each unit of time. A sample f_i^j denotes the sample at time i for feature type j. The set of all samples f_i^j is denoted by S. It is not clearly explained how exactly the template is expressed in terms of a vector f; most likely, the rows of each patch are stacked together into a single row. In this sense, we could say that S is the object model we are dealing with in the paper. However, since the tracker is to be decomposed into several basic ones, it consequently follows that the object model should be distributed between them. A question now arises: how should such a distribution be performed? To answer this, let us first observe a key example that
would deliver an intuitive idea about the solution. In figure 5, we are confronted with four views of the same teapot.

Fig. 5. Several views of the same teapot. The view on the top is illustratively the most helpful. [19]

A question would be: Which is the view
offering the most details? With our human intuition, most of us will agree that the view on the top is the most illustratively informative one. In figure 6, we can see two different views of the same 3-dimensional dataset. However, in the right
view, we can perceive an inherent property about the data that has not been
obvious before, which is that the data points are almost co-planar. These two
examples make us wonder if there is some objective measure that could be able
to give us the same insights. This would then be helpful to efficiently describe
the template set used for the object model in the paper. Luckily, the answer
is yes. In fact, such a tool is nothing but the principal components. Exploiting this mathematical tool in the decomposition of the object model is a main contribution of the paper.
Fig. 6. An example delivering an intuition about principal component analysis [18]: (a) a 3-dimensional dataset; (b) a view suggesting a reduction in the dimensionality of the data, since the points lie almost on the same plane
Our requirements are, however, more complicated than just reducing the dimensionality of the dataset. We need to apply a decomposition that fulfills three conditions. First: we need to capture as wide a range of variations of the tracked object's appearance as possible. Second: we do not want two different basic trackers to track the same group of features; otherwise there would be no point in decomposing the object model. Formally, that means we require the decomposed sub-models to be complementary. Third: to cope with the limitations of our resources, we would like the models to be as compact as possible. Since PCA maximizes the variance captured from the data, it can achieve our first requirement. The principal components are also orthogonal, which offers the needed complementarity between the sub-models. We still need, however, to find a solution for our third requirement. That is why the paper relies on SPCA for building the sub-models. As discussed in subsection 2.3.2, SPCA offers components that interpolate between being principal and having the least possible number of non-zero entries. This solution will affect how perfectly the first two conditions are fulfilled, but, as a good compromise, it enables us to achieve all three requirements, each to an acceptable degree.
The dataset used in the paper is a vector a constructed from the template set S:

a = (f_1^1, ..., f_t^1, ..., f_1^u, ..., f_t^u)   (29)
From here, we construct a so-called Gramian matrix of the form A = a^T a. There is an elegant solution worth highlighting here. Since we are investigating the covariance between various patches of the same dimension, it is intuitive to think of the data dimensions in our dataset as the number of pixels in the patch. However, that would have implied taking A to be a a^T instead, so that each entry would be the dot product of two vectors, each representing the list of values of a given pixel position across all patches. In the covariance matrix calculated in the paper, on the other hand, each element corresponds to the dot product of two vectors, each corresponding to the list of all pixel values of a single patch. So, what would we gain by abandoning the intuitive way? In fact we gain a lot. This way, we force the Gramian matrix at each step to be of a fixed size equal to the square of the number of patches in S, whatever the dimension of the patch is. Had we not done that, we would have had to deal with a much bigger matrix, requiring possibly much more memory, time and computation. The idea was carried out before in [16], verifying that the principal components are nearly the same in both cases.
After applying SPCA and obtaining the sparse principal components of the dataset, the components are used to perform a decomposition of the data into the object sub-models. Each object sub-model will then be used by a basic tracker. The dataset expressed by the vector a in (29) is projected once onto each principal component, forming a given object sub-model each time, which we denote by M_t^i, meaning the sub-model M built from the i-th component constructed at time t. This is equivalent to looking up the non-zero entries in the component and taking the corresponding features into the resulting object sub-model. This process is illustrated in figure 7.
Distance between two Sets as an Observation Model For each object sub-model a corresponding observation model is assigned. The probability for a given basic observation model is then inferred from a score function between its object model M_t^i and the measurement Yt. Yt is defined as the patch taken at state Xt (a state defines position and scale). For a given object model Mt, the function is:

P(Yt | Xt) = e^(−λ DD(Yt, Mt))   (30)

The probability decays exponentially with the distance score given by DD(Yt, Mt), where Mt is the object model at time t and λ is a tuning parameter. The distance measure used here is the diffusion distance elaborated in subsection 2.2.
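As a small illustration, the observation likelihood of (30) can be written as below; diffusion_distance is only a hypothetical stand-in for the diffusion distance of subsection 2.2 (here a plain L1 histogram distance), and lam corresponds to the tuning parameter λ.

import numpy as np

def diffusion_distance(h1, h2):
    # Placeholder for the diffusion distance of subsection 2.2;
    # a simple L1 histogram distance is used here purely for illustration.
    return np.abs(h1 - h2).sum()

def observation_likelihood(Y_t, M_t, lam=1.0):
    # P(Y_t | X_t) = exp(-lam * DD(Y_t, M_t)), cf. Eq. (30)
    return np.exp(-lam * diffusion_distance(Y_t, M_t))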
3.1.2 Motion Model The paper exploits two simple motion models, which
are smooth motion and abrupt motion. Both of the models are Gaussian distributed. The difference between the two models is only the variance, which is
relatively small in one and relatively large in the other.

Fig. 7. Building object models for the basic trackers from the calculated sparse principal components [7]

The small one assumes smooth motion while the large one assumes abrupt motion. The two assumptions
are depicted in figure 8. Formally, we can express a given motion model as the
following :
P (Xt |Xt−1 ) = G(Xt−1 , σ 2 )
(31)
That means if we know that the previous state was Xt−1 , the current state Xt
will be Gaussian distributed, centered at Xt−1 with a standard deviation equal to σ.
Fig. 8. Two Gaussians. The left has a wider spread and accounts for an abrupt motion
assumption and the right has a tighter region of confidence, suiting a smooth motion.
3.1.3 State Estimation for a Basic Tracker The observation model offers us a probability distribution that decays exponentially with the dissimilarity between the object model and the correction measurements. The actual exploitation of such a probability distribution, however, is not that easy. The difficulty resides in the continuity of the distribution and the intractability of allocating the corresponding probability to each sample in the sample space. Nevertheless, it turns out that there is a technique that can choose random samples from the distribution that together inherently reflect it. The key concept behind this approach is that the domain elements are selected with a frequency proportional to their respective probability in the target distribution. We refer to this process technically as sampling. This concept will be used to deduce discrete instantiations from the given probability distribution and then use them to do the final estimation of the state. One of the tools for achieving this sampling is the Markov Chain.
Using the Metropolis-Hastings Algorithm The Metropolis-Hastings algorithm works as follows: initially, a random sample is picked from the proposal distribution. Then, the proposal distribution suggests another sample. We evaluate the probability of this proposed value with respect to the current sample and calculate an acceptance ratio. Afterwards, a random number from a uniform distribution is picked and the sample is accepted if this random number is less than the acceptance ratio. Otherwise, the new sample is simply set equal to the current sample, and we try another suggested sample.
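Spelled out in code, the procedure looks roughly as follows for a one-dimensional toy problem; target_pdf and the Gaussian proposal are illustrative stand-ins for the tracker's observation and motion models, and the Q terms of (28) cancel here because the proposal is symmetric.

import numpy as np

def metropolis_hastings(target_pdf, x0, n_samples, proposal_std=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = x0
    samples = [x0]
    for _ in range(n_samples - 1):
        x_star = rng.normal(x, proposal_std)              # propose from a symmetric Gaussian
        a = min(1.0, target_pdf(x_star) / target_pdf(x))  # acceptance ratio, cf. Eq. (28)
        if rng.uniform() < a:
            x = x_star                                    # accept the proposal
        samples.append(x)                                 # otherwise keep the current sample
    return np.array(samples)

# Example: sampling an unnormalized mixture of two Gaussians.
target = lambda x: np.exp(-0.5 * (x - 2.0) ** 2) + 0.5 * np.exp(-0.5 * (x + 2.0) ** 2)
chain = metropolis_hastings(target, x0=0.0, n_samples=5000)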
Now let us project this theoretical outline onto the problem in the paper. The target probability distribution we are trying to model is the observation model p(Yt | Xt). Now, what could be the distribution proposing the samples? The question almost answers itself through a rephrasal: given a certain current state, which state is most likely to occur next? This is nothing but the function of the motion model discussed in section 3.1.2. So the proposing distribution is the same as p(Xt | Xt−1). The sampling we are speaking of here is in the state space, which means that each sample corresponds to a state. We will use the notation Xt for the current sample at time t and the notation Xt* for the sample proposed upon it. So now we talk of p(Yt | Xt*) as the target distribution and p(Xt* | Xt) as the proposing distribution. To measure the acceptance ratio mentioned previously for a given sample, we plug the terms into the Metropolis-Hastings formula (28):
γparallel = min(1, [p(Yt | Xt*) p(Xt | Xt*)] / [p(Yt | Xt) p(Xt* | Xt)])   (32)
In (32), the term p(Xt* | Xt) is the probability of the randomly drawn sample Xt* under the motion model distribution based on the previous sample Xt. On the other hand, the term p(Yt | Xt*) reflects how consistent the proposed sample is with the observation model, as computed in (30). After collecting samples for a predefined number of iterations, the best estimate is then calculated using maximum a posteriori estimation:
Xt = arg max_{Xt^(l)} p(Xt^(l) | Y1:t),   l = 1, ..., N   (33)

Here Xt^(l) is the l-th sample and N is the total number of samples evaluated at time t.

3.2 Integration of Basic Trackers
Up till this point, we have seen how a single MCMC based tracker works. We will
turn now to another main contribution of the paper, which is using a number of
these basic trackers and integrating their knowledge together.
3.2.1 Forming different Basic Trackers It is worthwhile to mention how the different basic trackers are built. Up to this point, it should be clear that each object model is associated with only one observation model. For a given object model, different basic trackers are formed by pairing its corresponding observation model with each of the assumed motion models. As a result, if we have in total s object models and r motion models, we end up with s × r basic trackers. In the paper, the number of object models was fixed by taking only the 4 most significant sparse principal components from (15). On the other hand, the number of motion models was 2. That makes up only 8 basic trackers in total (see Table 1).
Motion \ Observation   P1(Yt|Xt)   P2(Yt|Xt)   P3(Yt|Xt)   P4(Yt|Xt)
P1(Xt|Xt−1)            T11         T21         T31         T41
P2(Xt|Xt−1)            T12         T22         T32         T42

Table 1. Creating different basic trackers by taking possible pairs of each observation model and motion model. Since we have 2 possible motion models and 4 object models, we have in total 8 trackers.
3.2.2 Integrating Basic Models The usage of various basic trackers simultaneously can be regarded as if P (Yt |Xt ) in (1) is decomposed into :
P(Yt | Xt) = Σ_{i=1}^{s} wi Pi(Yt | Xt)   (34)
Where Pi (Yt |Xt ) denotes the i-th basic observation model and s is the total
number of basic observation models. Similarly, P (Xt |Xt−1 ) in (1) is decomposed
into :
P(Xt | Xt−1) = Σ_{j=1}^{r} wj Pj(Xt | Xt−1)   (35)
Where Pj (Xt |Xt−1 ) denotes the j-th basic motion model and r is the total
number of basic motion models. Nevertheless, (34) and (35) are only means of
delivering an intuition for the decomposition process. The reason for this is that they are not explicitly evaluated or used further. The two equations are just a reflection of another mathematical process happening at the same time. More
details are in the next subsection.
3.2.3 Using Interactive Markov Chain Monte Carlo After knowing how
each basic tracker estimates the next state, let us see how the trackers then integrate their knowledge. The Metropolis-Hastings algorithm we mentioned before is in fact a part of a bigger picture, namely the Interactive Markov Chain
Monte Carlo [3]. This algorithm has two modes: the parallel mode (which operates as Metropolis-Hastings) and the interactive mode, which is responsible for integrating the knowledge of the basic trackers. The interactive mode allows each tracker to be influenced by its peers. In proportion to how its belief compares with the sum of the other trackers' beliefs, each tracker has a probability of getting its state accepted by the other trackers. This means that a tracker accepts the state of another tracker Tij, built from the i-th observation model and the j-th motion model, with an acceptance rate of:
γinteractive = pi(Yt | Xt^j) / Σ_{i=1}^{s} Σ_{j=1}^{r} pi(Yt | Xt^j)   (36)
The acceptance of the state of a given tracker Tij by another tracker is equivalent to doubling the weight of the i-th observation model and the j-th motion model in (34) and (35) respectively. In this way, the weights in (34) and (35) are calculated.
Getting the samples for each basic tracker as formerly explained, the most likely
state is then determined according to (33).
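For completeness, a minimal sketch of the interaction step is given below; the array likelihoods is assumed to hold pi(Yt | Xt^j) for every basic tracker, and the function simply evaluates the acceptance rate of (36) for the tracker Tij. The names are hypothetical.

import numpy as np

def interactive_acceptance(likelihoods, i, j):
    # likelihoods[i, j] = p_i(Y_t | X_t^j) for the tracker built from the
    # i-th observation model and the j-th motion model, cf. Eq. (36).
    return likelihoods[i, j] / likelihoods.sum()

# Example with s = 4 observation models and r = 2 motion models (8 basic trackers).
L = np.random.default_rng(0).random((4, 2))
gamma = interactive_acceptance(L, i=0, j=1)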
4 Results
The proposed VTD tracker was evaluated quantitatively and qualitatively in comparison with four different tracking approaches: standard MCMC, Mean Shift (MS), Online Appearance Learning (OAL) and Multiple Instance Learning (MIL). The VTD scored the best results, overcoming illumination changes, occlusion, background clutter and abrupt motion. None of the other trackers were able to cope with all of these difficulties. Moreover, the behaviour of VTD was compared against variants of itself, once without SPCA and once without Interactive MCMC. In both cases, it was shown that the full design performs best.
5 Discussion
It has been shown that the Visual Tracking Decomposition methodology is a valuable contribution to the body of tracking research. Choosing SPCA to decompose the object model is a novel way of achieving compactness while keeping as much information as possible. However, the proposed approach suffers from a significant drawback. In various parts of the paper, tuning parameters are injected into the building equations (e.g. the object model, the motion model, etc.). The values used for the parameters are mentioned, but it is never explained why they were chosen specifically. That leaves the door open for a collapse, or at least a large drop in performance, if those parameters are for any reason chosen incorrectly. Moreover, the parameters are not proven to hold for all scenarios, which means that the ones used in the paper can simply fail in other trials. Another problem is the high demand on memory, computation and time resources. The requirements scale up quite fast with the number of trackers involved. The algorithm is not meant to be a real-time one, but an excessively long delay would still be unwanted.
6 Conclusion
The approach discussed here offered promising results in the field by overcoming the appearance change problem. We demonstrated how the novel approach exploited SPCA for decomposing the object model into compact, complementary and expressive sub-models, each for a basic tracker. We eventually saw how the integration between trackers was performed through IMCMC, leading to improved performance. The drawbacks of the approach are the tunable parameters and low scalability due to the required resources. As future work, it would be beneficial to consider the parallelization of the algorithm, which could lead to a significant speed-up.
References
1. Christopher M. Bishop. Pattern Recognition and Machine Learning (Information
Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
2. Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004.
3. Jukka Corander, Magnus Ekdahl, and Timo Koski. Parallel interacting MCMC for
learning of topologies of graphical models. Data Mining and Knowledge Discovery,
17(3):431–456, 2008.
4. Laurent El Ghaoui, Michael I. Jordan, and Gert R. G. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. In SIAM Review, pages
41–48. MIT Press, 2004.
5. Zia Khan, Tucker Balch, and Frank Dellaert. An MCMC-based particle filter for tracking multiple interacting targets. In Proc. ECCV, pages 279–290, 2003.
6. D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
7. Junseok Kwon and Kyoung Mu Lee. Visual tracking decomposition : Presentation
slides, 2009.
8. Junseok Kwon and Kyoung Mu Lee. Visual tracking decomposition. In CVPR,
pages 1269–1276, 2010.
9. Haibin Ling. Diffusion distance for histogram comparison. In CVPR'06, pages 246–253, 2006.
10. Emilio Maggio and Andrea Cavallaro. Video Tracking: Theory and Practice. Wiley Publishing, 1st edition, 2011.
11. Baback Moghaddam, Yair Weiss, and Shai Avidan. Generalized spectral bounds for sparse LDA. In Proceedings of the Twenty-Third International Conference on Machine Learning (ICML), 2006.
12. S. Santhoshkumar, S. Karthikeyan, and B.S. Manjunath. Robust multiple object
tracking by detection with interacting markov chain monte carlo. In Image Processing (ICIP), 2013 20th IEEE International Conference on, pages 2953–2957,
Sept 2013.
13. John Shlens. A tutorial on principal component analysis.
14. Gilbert Strang. Linear Algebra and Its Applications. Wellesley-Cambridge Press,
2009.
15. Sebastian Thrun, Wolfram Burgard, and Dieter Fox. Probabilistic Robotics (Intelligent Robotics and Autonomous Agents). The MIT Press, 2005.
16. Matthew Turk and Alex Pentland. Eigenfaces for recognition. J. Cognitive Neuroscience, 3(1):71–86, January 1991.
17. Yang Wang and Qiang Wu. Sparse pca by iterative elimination algorithm. Advances
in Computational Mathematics, 36(1):137–151, 2012.
18. YouTube http://www.youtube.com/watch?v=4pnQd6jnCWk. Principal component analysis (pca), 2009.
19. YouTube: http://www.youtube.com/watch?v=BfTMmoDFXyE. A layman’s introduction to principal component analysis, 2009.
Gradient Response Maps vs. HOG Features for
Detection of Texture-Less Objects
Sahar Javadi1 and Stephan Krauss2
1 Technical University of Kaiserslautern, javadi@rhrk.uni-kl.de
2 German Research Center for Artificial Intelligence, stephan.krauss@dfki.de
Abstract. In this paper a comparison between two approaches, Gradient Response Maps and Histograms of Oriented Gradients (HOG), in the context of detection of texture-less objects, is presented by first explaining the parameters of each method separately in detail and then presenting the experimental results of a case study comparing the two methods.
Key words: Gradient Response Maps, Histograms of Oriented Gradients (HOG)
1 Introduction
Real time detection and learning of texture-less or low textured objects is a
critical and challenging task in the area of Computer Vision. Many applications
such as robotics in which a robot needs to deal with a continuously changing
environment and learn new objects in real time strongly need efficient approaches
with low computational cost.
Real-time 3D detection of object instances using Gradient Response Maps [1] is a new approach for the detection of untextured objects which does not need a time-consuming training phase and can consequently be used in time-critical applications such as robotics. The robustness of this approach is due to the spreading of image gradient orientations, which makes it possible to test only a small subset of all possible pixel locations when parsing an image and to represent a 3D object with a limited set of templates. This approach can be improved in the presence of a dense depth sensor, in which case the 3D surface normal orientations are also taken into account.
Histograms of Oriented Gradients (HOG) [2] is another related and popular
method for object detection which is based on the statistical description of the
distribution of intensity gradients in localized portions of the image. The basic
idea behind this approach is that local appearance of an object or its shape can
be characterized by the distribution of local intensity gradients or edge directions quite well, even if a precise knowledge of the location of the corresponding
gradient or edge is not available. This approach gives reliable results but it is
computationally complex and consequently not suitable for on-line applications.
This paper is organized as follows. In the next two sections a detailed description of the two approaches, Gradient Response Maps and Histograms of Oriented Gradients (HOG), is presented. In Section 4 a comprehensive comparison between the two methods is given by describing the evaluation results of experiments comparing the two methods from different aspects such as robustness, speed and occlusion. The last section concludes with a summary and discussion.
2 Gradient Response Maps
In this section a comprehensive description of the Gradient Response Maps approach with detailed explanation of its parameters is presented, and it is shown
how a new representation of the input image can be built in order to parse the
image quickly for finding the objects.
2.1 Similarity Measure
The first very important component to explain is the similarity measure. What
this similarity measure basically does is that for each gradient orientation on the
object, it searches in a neighbourhood of the associated gradient location for the
most similar orientation in the input image. This can be shown as:
ε(I, τ, c) = Σ_{r∈P} ( max_{t∈R(c+r)} | cos(ori(O, r) − ori(I, t)) | )   (1)
In this formula, R(c + r) = [c + r − T/2, c + r + T/2] × [c + r − T/2, c + r + T/2] is the neighbourhood of size T centred at location c + r in the input image. ori(O, r) is the gradient orientation in radians at location r in a reference image O of the object to detect. The locations r considered in O are specified in a list denoted by P. In the above formula, considering only the orientation of the gradients and not their magnitude or direction makes the measure robust to contrast changes, and taking the absolute value of the cosine allows the measure to handle the
object occluding boundaries. The reason for considering image gradients in the
very first step is that they are proven to be robust to illumination changes and
noise and are normally more discriminant than other representation forms.
In the following sections it is shown how this similarity measure can be computed efficiently by spreading the computed gradient orientations.
2.2 Computing the Gradient Orientations
The orientation of gradients is computed on each color channel of the input
image in order to increase the robustness, and then the gradient orientation of
the channel whose gradient magnitude is largest is considered as the gradient
orientation in that location of the image, according to the formula below. In this
formula for an RGB color image I the gradient orientation map Ig (x) at location
x is computed as
Ig (x) = ori(Ĉ(x))
(2)
where
Ĉ(x) = arg max_{C ∈ {R,G,B}} | ∂C/∂x |   (3)
In order to quantize the gradient orientation map, the gradient directions are
omitted and only the gradient orientations are taken into account. The orientation space is then divided into n0 equal bins, as illustrated in Fig. 1.
Fig. 1. Quantizing the gradient orientations
2.3 Spreading the Orientations
In this subsection it is shown how a new binary representation of the gradients around each image location is built in order to avoid computing the max operator in Eq. 1 every time a new template needs to be evaluated against an image location. This new binary representation, along with lookup tables, is then used for precomputing the maximal values efficiently. The computation procedure is shown in Fig. 2.
Fig. 2. Spreading the gradient orientations
As can be seen in Fig. 2, the first step in the computation of the new binary representation of the input image is to quantize the orientations into a small number of values. By doing so, the new representation of the input image can be obtained simply by spreading the gradient orientations of the input image around their locations. To encode all possible combinations of orientations spread at a specific location, a binary string is used in which each individual bit corresponds to one quantized orientation and is set to 1 if this orientation is present in the neighbourhood of this location and 0 otherwise. These binary strings are then used as indices for accessing lookup tables in order to compute the similarity measure quickly.
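A rough sketch of this encoding is given below for a single-channel image of quantized orientations: each orientation is turned into a one-hot bit string and the bit strings are OR-combined over a T × T neighbourhood. The function and variable names are my own, and the sketch ignores the per-channel selection of [1].

import numpy as np

def spread_orientations(quantized, n0=8, T=5):
    # quantized: integer image with orientation bins in [0, n0); n0 <= 8 so that
    # the result fits into an 8-bit string per pixel.
    h, w = quantized.shape
    onehot = (1 << quantized).astype(np.uint8)       # one bit per quantized orientation
    spread = np.zeros((h, w), dtype=np.uint8)
    r = T // 2
    for dy in range(-r, r + 1):                      # OR-combine shifted copies of the image
        for dx in range(-r, r + 1):
            shifted = np.zeros_like(onehot)
            ys, xs = slice(max(0, dy), h + min(0, dy)), slice(max(0, dx), w + min(0, dx))
            ys_s, xs_s = slice(max(0, -dy), h + min(0, -dy)), slice(max(0, -dx), w + min(0, -dx))
            shifted[ys, xs] = onehot[ys_s, xs_s]
            spread |= shifted
    return spread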
2.4 Precomputing the Response Maps
The precomputation of response maps is shown in Fig. 3.
Fig. 3. Precomputing the Response Maps
As shown in this figure, the new binarized image and the lookup tables are used together to precompute the max operations of the similarity measure for each location and each possible orientation in the template. The results are stored in 2D maps. To compute the similarity measure, it is then enough to sum the values read from these maps. Since these maps are shared between the templates, once the maps are computed the matching of several templates against the input image can be done quickly.
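Continuing the sketch, the response maps and the final template score can be illustrated as follows: a lookup table stores, for every template orientation and every possible bit string, the maximal |cos| term of Eq. (1), and the score of a template is then a sum of values read from the precomputed maps. Again, this is only an illustration with hypothetical names, not the optimized implementation of [1].

import numpy as np

def build_lookup_table(n0=8):
    # lut[o, b] = max |cos| between template orientation o and any orientation
    # encoded in the n0-bit string b (orientation bins cover [0, pi)).
    lut = np.zeros((n0, 1 << n0))
    for o in range(n0):
        for b in range(1 << n0):
            best = 0.0
            for k in range(n0):
                if b & (1 << k):
                    best = max(best, abs(np.cos((o - k) * np.pi / n0)))
            lut[o, b] = best
    return lut

def score_template(spread_image, lut, template):
    # template: list of (dy, dx, orientation) offsets relative to the anchor c,
    # with dy, dx >= 0. Returns the similarity of Eq. (1) for every anchor position.
    h, w = spread_image.shape
    response = [lut[o][spread_image] for o in range(lut.shape[0])]   # one 2D map per orientation
    score = np.zeros((h, w))
    for dy, dx, o in template:
        score[:h - dy, :w - dx] += response[o][dy:, dx:]             # sum of precomputed maxima
    return score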
2.5 Extension to Dense Depth Sensors
In case a depth sensor is available this approach can be extended using quantized
surface normals which leads to more robustness in the detection of objects.
The similarity measure is then defined as the dot product of the normalized
surface gradients, instead of the cosine difference for the image gradients. The
combined similarity measure in this case is simply the sum of the measure for
image gradients and that of surface normals.
3 Histograms of Oriented Gradients
This approach is based on the idea that local appearance and shape of objects
can often be characterized rather well by the distribution of local intensity gra-
dients or edge directions, even if the precise knowledge of the position of the
corresponding gradient or edge position is not available.
In practice, the image window is divided into small spatial regions (cells), and
gradient directions or edge orientations over the pixels of the cell are accumulated
as a local 1-D histogram. A contrast normalization is also done in overlapping
descriptor blocks before using them. Each block is a collection of neighbouring
cells. Normalized descriptor blocks are referred to as Histogram of Oriented
Gradient (HOG) descriptors. An overview of the HOG approach in the case of human detection is depicted in Fig. 4. The classifier used in this case is a conventional SVM.
Fig. 4. An overview of HOG approach
Experimental results for examining the influence of different descriptor parameters show that fine-scale gradients, fine orientation binning, relatively coarse
spatial binning, and high quality local contrast normalization in overlapping descriptor blocks are all important for good performance.
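As a brief illustration, the descriptor described above can be computed with scikit-image's hog function; the parameter values below are typical choices in the spirit of [2] and are not claimed to match the evaluated implementation.

import numpy as np
from skimage.feature import hog

window = np.random.rand(128, 64)            # a grayscale detection window stands in for a real patch

descriptor = hog(window,
                 orientations=9,            # fine orientation binning
                 pixels_per_cell=(8, 8),    # small spatial cells
                 cells_per_block=(2, 2),    # overlapping blocks
                 block_norm='L2-Hys')       # local contrast normalization

# The resulting vector is then fed to a classifier, e.g. a linear SVM, as in [2].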
The similarity measure used in the classifier of the HOG approach is normally the Euclidean metric or the cosine similarity, which are computed as follows, respectively:

d = √((p1 − q1)² + ... + (pn − qn)²)   (4)

cos(θ) = (A · B) / (|A| |B|)   (5)

4 Experimental Validation
In this section, the experimental validation results of comparing the gradient response maps method, called LINE, to HOG and three other methods, i.e. DOT [3], TLD [4] and Steger [5], are presented. While DOT is recognized as a fast template matching method, HOG and Steger are better known as slow but very robust template matching methods. For the experimental validation, three different variations of LINE have been used: LINE-2D, which uses just image gradients; LINE-3D, which uses just the surface normals; and LINE-MOD, which is multimodal and uses both. These methods are compared to each other with respect to robustness, speed and occlusion.
4.1 Robustness
The six methods mentioned above are evaluated using six sequences made of
more than 2000 real images each. Illumination and large viewpoint changes on
heavily cluttered backgrounds are contained in each sequence of images. As the experimental results illustrated in Fig. 5 and Fig. 6 show, LINE-2D outperforms the other methods, except for the Steger method, which gives the same results. One possible reason is that both the LINE and Steger methods use similar similarity measures. As can be seen in these figures, when a depth sensor is available, i.e. in LINE-MOD, there are just a few false positives and this method outperforms all other methods without decreasing the runtime performance. The middle column in these figures represents the results when a threshold is set for each approach to allow a 97% true positive rate and only the hypothesis with the largest response is evaluated.
Fig. 5. Comparison of the robustness of six methods on real 3D objects
4.2 Speed
Since the procedure of template learning is considered to be instantaneous, for
evaluating different approaches from the aspect of speed, only runtime performance is taken into account.

Fig. 6. Comparison of the robustness of six methods on real 3D objects

As the experimental results presented in Fig. 6 show, the LINE approach is generally a real-time approach. In this case LINE-MOD is
rather slower than LINE-2D and LINE-3D due to the slower preprocessing stage.
The DOT method is initially fast but as the number of templates increases it
becomes slower.
4.3 Occlusion
As can be seen in Fig. 7, the robustness of the LINE-2D and LINE-MOD methods is also evaluated by adding synthetic noise and illumination changes to the image. The results suggest that both methods behave linearly with respect to occlusion.
5 Comparison of Gradient Response Maps and HOGs from the Similarity Measure Aspect
The LINE method and HOG are both gradient-based methods which take into account the orientation of gradients. Although exploiting gradient orientations (and not their direction) makes both methods robust to background clutter and also to small shifts and deformations, each method achieves this capability in a different way. The HOG method does this by first quantizing the orientations and using local histograms. However, this can be unstable when strong gradients appear in the background.

Fig. 7. Comparison of the speed of six methods

Fig. 8. Left: the LINE approach is linear with respect to occlusion. Right: average recognition score for the six real 3D objects with respect to occlusion

On the other hand, the similarity measure of the LINE
method, for each gradient orientation on the object, searches in a neighbourhood of the associated gradient location for the most similar orientation in the
input image. Taking the absolute value of the cosine between gradient orientations in the similarity measure of the LINE method allows it to correctly handle object occluding boundaries. Considering only the orientation of gradients and not their norms makes the measure robust to contrast changes. This is something that is missing in the HOG method and consequently makes that method vulnerable to strong contrast changes.
6 Summary and Discussion
In this paper a comprehensive comparison was drawn between the LINE method, with its three variations LINE-2D, LINE-3D and LINE-MOD, and HOG. Moreover, the results of the experimental validations for LINE, HOG and
three other methods were presented and discussed. These methods are validated
for recognition of six different objects on heavily cluttered backgrounds from
different aspects such as robustness, speed, and occlusion.
The obtained results suggested that the LINE method is a real-time method
which in most of the cases outperforms all other considered methods. However
there are certain issues in the way this comparison was done. It is noteworthy
that this paper reviews the results of another paper which basically attempts
to prove the efficiency of the LINE method in comparison to four other methods i. e., HOG, DOT, Steger, and TLD. The main point of the current paper
is however the comparison of LINE and HOG methods by reviewing the past
obtained results. These two methods are focused on because they both use gradient features for the detection of objects. The LINE method, however, uses just the orientation of the gradients, while the HOG method takes both the orientation and the magnitude of the gradients into account.
The first issue that arises concerns the robustness aspect and the way these methods have been evaluated. Considering that the results were reported by experimenting on only 6 objects, the question is whether 6 objects, all located in a room and not in an outdoor environment, are good representatives for a firm judgement about the robustness of a method, i.e. LINE, in comparison to other methods such as HOG. On the other hand, there are no clear explanations of the cases where the HOG method outperforms the LINE-2D method.
Another debatable issue concerns speed. The question is whether a comparison between a method like LINE, which is specifically designed for real-time applications, and a method like HOG, which is not meant to be used in real-time situations, is the right thing to do. From the perspective of this work's author, it is not a fair comparison.
References
1. Hinterstoisser, S., Cagniart, C., Ilic, S., Sturm, P.F., Navab, N., Fua, P., Lepetit,
V.: Gradient response maps for real-time detection of textureless objects. IEEE
Trans. Pattern Anal. Mach. Intell. 34(5) (2012) 876–888
2. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In
Schmid, C., Soatto, S., Tomasi, C., eds.: International Conference on Computer
Vision and Pattern Recognition. Volume 2., INRIA Rhone-Alpes, ZIRST-655, av.
de l’Europe, Montbonnot-38334 (2005) 886–893
3. Hinterstoisser, S., Lepetit, V., Ilic, S., Fua, P., Navab, N.: Dominant orientation
templates for real-time detection of texture-less objects. In: CVPR. (2010) 2257–
2264
4. Kalal, Z., Matas, J., Mikolajczyk, K.: P-n learning: Bootstrapping binary classifiers
by structural constraints. In: IEEE Conference on Computer Vision and Pattern
Recognition. (2010) 49–56
5. Steger, C.: Occlusion, clutter, and illumination invariant object recognition (2002)
have been created by applying homographies. One has to question the statistical
significance of the performed evaluation, especially when it was outperformed
by SURF in one of the six evaluated datasets. Furthermore, the dataset with the highest performance gain with respect to the compared feature detection methods is a dataset they created themselves, consisting of a single image with only Gaussian noise added at different intensities.
Another concern is the runtime evaluation of the proposed method. Whereas
the total time needed seems to be approximately the same as SIFT while still
outperforming it in every evaluated dataset, it doesn’t hold up compared to
more recent methods in terms of computation time. Compared to SURF, the
computation time increases by a factor of 2.5-4, and even worse, compared to
STAR, it needs approximately 8 times as much time.
The improvements proposed in the accelerated KAZE features publication address its most important shortcoming: computation time. It
achieves this by changing the computation method of the non-linear scale space.
As a result, the computation time is faster than SURF, making it possible to
use it for real-time applications.
The most important question to ask is if the proposed optimizations come
at the expense of a lower quality (with respect to stability of the features under
noise or scale changes). However, the comparison of the evaluation results does not indicate that this is the case.
While the work on key point descriptors using a non-linear scale space seems
promising, more care should be taken in presenting the results in a consistent
way. For example, they included precision/recall graphs in the original paper,
whereas for their accelerated features, they only show a table listing the matching
score and recall matches. Most importantly, the authors could have described their datasets better, or at least mentioned that they are composed of only one original picture with different homographies applied.
5 Conclusion
This seminar report gives a short overview of KAZE features, a feature detection method that makes use of a non-linear scale space. These features seem to outperform most state-of-the-art feature detectors, like SIFT or SURF, although the computation time is comparable to that of SIFT and a lot longer than that of SURF. This disadvantage is rectified by the accelerated KAZE features, which achieved even better results during their evaluation. Even though the authors only included
the evaluation of 6 different datasets, the gain in precision and recall seems to
indicate that they might outperform other state-of-the-art feature detectors in
a more rigorous evaluation. Even though the evaluation might not be the most
rigorous, the proposed feature detector seems very promising.
References
1. P. F. Alcantarilla, A. Bartoli, and A. J. Davison. KAZE features. In Eur. Conf. on
Computer Vision (ECCV), 2012.
2. P. F. Alcantarilla, J. Nuevo, and A. Bartoli. Fast explicit diffusion for accelerated
features in nonlinear scale spaces. In British Machine Vision Conf. (BMVC), 2013.
3. Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. Surf: Speeded up robust features.
In ECCV, pages 404–417, 2006.
4. David G. Lowe. Distinctive image features from scale-invariant keypoints, 2003.
5. Joachim Weickert, Bart M. Ter Haar Romeny, and Max A. Viergever. Efficient
and reliable schemes for nonlinear diffusion filtering. IEEE Transactions on Image
Processing, 7:398–410, 1998.
Fusion of inertial body tracking with Kinect
body tracking
Artem Avtandilov1 and Markus Miezal2
1 TU Kaiserslautern, artem@rhrk.uni-kl.de
2 markus.miezal@dfki.de
Abstract. Inertial body tracking and body tracking with the Microsoft
Kinect sensor provide certain benefits for real-time human body motion
capturing, but both lack precision when it comes to disturbances: be
it occlusion for visual sensor or magnetic field disturbances for inertial
measurement units (IMUs). This paper proposes an approach to fuse
inertial body tracking and Kinect body tracking in order to achieve better
estimate of skeletal motions.
Keywords: Inertial motion capture, sensor fusion, body sensor network,
Kinect
1 Introduction
Different technologies for body tracking have their strong and weak sides. While Kinect suffers from occlusion and is slower, inertial tracking systems can only detect posture and are vulnerable to magnetic disturbances. On the other hand, Kinect is capable of detecting a 3D pose and extracting segment lengths, but inertial measurement units are faster and tracking is more robust when not disturbed. Motivated by the above and by [4] and [5], this paper introduces a practical approach for the fusion of inertial body tracking with Kinect body tracking. There have been many attempts to fuse data from inertial measurement units and Kinect, such as [2], where the emphasis is on mapping rather than body tracking. The combination of inertial measurement units (IMUs) and Kinect is further used in [6] for gesture recognition using 5 XSens IMUs. [3] advances those approaches to joint angle estimation with applications in clinical rehabilitation.
2 Proposed approach

2.1 Kinect SDK. Data structure
The Kinect for Windows software development kit (SDK) enables the use of C++
to create applications that support body tracking features by using the Kinect
sensor and a Windows enabled machine. The Developer Toolkit provides sample
software that has been modified in order to record samples of body posture and
process them further. The environment enables the use of all hardware features
of Kinect sensor: RGB camera, depth sensor and multi-array microphone. The
SDK already provides a fully functioning version of body tracking that has been used in different generations of the XBox game console. The precision of the joint positions produced by Kinect is not very high, but it is already sufficient for fusion with inertial data and for mapping applications [7]. Sample information includes the 3D coordinates of 20 joint positions of the tracked body, the timestamp of the sample and information about whether specific points were inferred or directly tracked. A rendering of the most recent detection is output to the screen, which makes it easy to monitor the current state. Kinect is capable of tracking two full bodies simultaneously and can detect the presence of four more^3.
Fig. 1. Tracking capabilities of the Kinect sensor: up to two distinguishable subjects can be tracked with all joint positions (blue and purple as shown), and the presence of four more subjects in the scene can be detected (dots as shown)
2.2 Prerequisites for fusion
The 3D points produced by Kinect and the joint points extracted from the kinematic chains of the inertial measurement system lie in different coordinate systems. Moreover, they are not ideally synchronized in time, since the data is captured asynchronously on two different machines, which leads to a time offset and drift. At this stage, the desired sensor fusion is impossible. The data needs to be aligned with respect to time and in the spatial domain, and the former has to be carried out first.
3 Image from Microsoft: http://msdn.microsoft.com/en-us/library/hh973074.aspx
Fig. 2. Coordinate systems represented visually: global coordinates based on the IMU tracking information (left) and the Kinect coordinate system that is to be aligned with the IMUs
Timing synchronization. In order to synchronize the two streams of tracking information produced by the two different measuring platforms (Kinect and IMUs), an easily recognizable event has been inserted into the sequence of actions recorded from the human. At the beginning of each recorded sequence, the subject has to clap. The clapping event can easily be extracted from the IMUs' acceleration and from the Kinect's hand positions.
Fig. 3. Data for synchronization: (a) normalized acceleration of the hand recorded from the IMU; (b) distance between the hand points produced by Kinect.
Fig. 3(a) shows the normalized acceleration of the hand IMU. The peak
indicates the clapping event. From the Kinect data it is easy to calculate a
timestamp of the event when hands were closest to each other (see Fig. 3(b)).
Comparing these two timestamps, the time offset between the two systems can be determined and an adjustment can be performed. As noted before, the Kinect software runs on a Windows machine and the IMU data are recorded under Linux. Naturally, they have diverging clocks, but synchronizing them with the calculated offset is sufficient for short-term recordings. Furthermore, Kinect and the IMUs provide tracking information at different frequencies, 30 and 100 Hz respectively.
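A sketch of this synchronization step is given below; the array layouts are hypothetical, but the logic follows the text: the clap is located as the acceleration peak in the IMU data and as the minimum hand distance in the Kinect data, and the difference of the two timestamps is the clock offset.

import numpy as np

def estimate_time_offset(imu_t, imu_acc, kinect_t, left_hand, right_hand):
    # imu_t, kinect_t: timestamp arrays; imu_acc: (N, 3) hand accelerations;
    # left_hand, right_hand: (M, 3) Kinect hand positions.
    acc_norm = np.linalg.norm(imu_acc, axis=1)
    t_clap_imu = imu_t[np.argmax(acc_norm)]               # acceleration peak, cf. Fig. 3(a)
    hand_dist = np.linalg.norm(left_hand - right_hand, axis=1)
    t_clap_kinect = kinect_t[np.argmin(hand_dist)]        # hands closest together, cf. Fig. 3(b)
    return t_clap_imu - t_clap_kinect                     # offset to add to the Kinect timestamps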
Skeleton alignment. The spatial alignment becomes possible because the skeletons are compatible and correspondences between joints exist. The alignment consists of a rotation and a translation. At this point, only the rotational alignment is needed. Since the IMU tracking system is fixed with respect to the pelvis while the Kinect skeleton can move freely, the Kinect skeleton's translation has to be reset to the IMU skeleton's origin. The rotational alignment is performed only once, when processing begins, as follows [1]. To find the optimal rotation, both datasets are re-centred so that both centroids are at the origin, as shown^4 in Fig. 4.
Fig. 4. Relative rotation between two centralized point clouds.
This removes the translation component, leaving only the rotation. The next
step involves accumulating a matrix, called H, and using the Singular Value
Decomposition to find the rotation as follows:
H = Σ_{i=1}^{N} (P_A^i − centroid_A)(P_B^i − centroid_B)^T   (1)

[U, S, V] = SVD(H)   (2)

4 Image from Nghia Ho: http://nghiaho.com/?attachment_id=807
Finally, the rotation matrix from A to B can be calculated using U and V :
R = V U^T   (3)
The translation alignment is performed on every following frame where Kinect data is present. The inertial tracking system's pelvis position is fixed, so the vector from the Kinect skeleton's pelvis to the inertial tracking system's pelvis is found and subtracted from every point of the Kinect skeleton. This final step allows the sensor fusion between the two data streams.
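The alignment of equations (1)-(3) can be written compactly as below; PA and PB are corresponding joint positions of the two skeletons, and the determinant check is a common safeguard against reflections that the text does not discuss explicitly.

import numpy as np

def align_skeletons(PA, PB):
    # PA, PB: (N, 3) corresponding joint positions. Returns R, t with PB ~ PA @ R.T + t.
    cA, cB = PA.mean(axis=0), PB.mean(axis=0)
    H = (PA - cA).T @ (PB - cB)          # accumulate H, cf. Eq. (1)
    U, S, Vt = np.linalg.svd(H)          # cf. Eq. (2)
    R = Vt.T @ U.T                       # R = V U^T, cf. Eq. (3)
    if np.linalg.det(R) < 0:             # guard against a reflection (not covered in the text)
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = cB - R @ cA                      # translation aligning the centroids
    return R, t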
2.3 Measurement models
The inertial tracking system is based on extended Kalman filter (EKF) which
operates on a kinematic chain, formed by Denavit-Hartenberg-Transformations
(DH). The filter’s state comprises all time-dependent parameters of the kinematic chain (angles, angular velocities and angular accelerations) and is propagated in time with a standard constant angular acceleration model with white
noise in acceleration[5].
To fuse the Kinect data into this system, the joint position data has to be
related to the Kinect measurements using the underlying kinematic chain. Since
the Kinect produces only 3D points, a fusion on angle level is impossible because
not all degrees of freedom are covered by the 3D point model (e.g. rotation along
the bone). The fusion has to be performed on point level. Two measurement
models have been implemented. The first assumes that all 3D points in the
IMUs skeleton are equal to their corresponding point in the Kinect’s skeleton.
The second one weakens this approach, assuming only that the direction to the next corresponding joint is equal.
Measurement model 1 (MM1) measures positions of 11 corresponding joints:
0 = Y − P + e   (4)
Here Y represents joint points extracted from the state by multiplying through
all transformations of the underlying kinematic chain to a specific joint, P stands
for joint points produced by Kinect and e denotes zero-mean Gaussian measurement noise. On the downside, the segment lengths produced by these models might differ in reality and the alignment of the skeletons cannot be perfect; these are the reasons why measurement model 2 (MM2) is introduced, where segment directions are measured. The cosine of the angle between two vectors can be expressed with the dot product as follows:
cos(ϑ) = (A · B) / (Anorm Bnorm)   (5)
Two perfectly aligned vectors form an angle of zero degrees, the cosine of which equals 1. This leads to the following simplification: let vector
Y represent segments acquired from the state and vector P the corresponding
vectors from the Kinect, then MM2 is defined as:
0 = (Y / Ynorm) · (P / Pnorm) − 1 + e   (6)
The results of the two introduced measurement models are further discussed in the next section.
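The two residuals can be summarized in code as follows; Y stands for a joint position or segment vector derived from the kinematic chain and P for the corresponding Kinect quantity, following equations (4) and (6). This is only a sketch of the measurement functions, not of the full EKF update.

import numpy as np

def residual_mm1(Y, P):
    # MM1: corresponding joint positions should coincide, 0 = Y - P + e, cf. Eq. (4)
    return Y - P

def residual_mm2(Y, P):
    # MM2: corresponding segment directions should be parallel,
    # 0 = (Y/|Y|) . (P/|P|) - 1 + e, cf. Eq. (6)
    return np.dot(Y / np.linalg.norm(Y), P / np.linalg.norm(P)) - 1.0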
3 Experiments
The described system has been tested against magnetic disturbances brought to
the scene in order to make data produced by IMUs unreliable.
3.1 Magnet affecting the scene
Fig. 5. IMU skeleton with a magnet in the scene: (a) IMU skeleton reproduced in the visual image when the subject is about to pick up a strong magnet, demonstrating normal behaviour; (b) IMU skeleton reproduced in the visual image when the subject is affected by the magnet, demonstrating how magnetic disturbances make inertial tracking unreliable.
The normalized magnetometer data of the left hand demonstrates significant disturbances. Further analysis and graphical representation of the recorded data reveal inadequate behaviour of the kinematic chains when no Kinect information is present (see Fig. 6(a)). Even though the magnet was placed on the left wrist, the huge error quickly propagates through the inertial measurement model to all parts of the chain.
Even the torso is significantly influenced by the disturbed measurements. Closer
look at Z coordinate of torso shows that while in fact it is supposed to be stable
7
(a) Left hand magnetometer normalized
data when magnet has been added to
the scene
(b) Distance between hand points produced by Kinect
Fig. 6. Torso coordinates when magnet has been added to the scene
like at the beginning of the measurement, it changes value unexpectedly and
then gets stuck at zero being unable to produce tracking information. A reliable
tracking can not be achieved in this setup (see Fig. 6(b)).
3.2 Kinect data used to restore corrupted scene
Using the two Kinect measurement models, the experimental data shows that the visual output produced by the system and the coordinates of specific joints are much more robust when Kinect tracking information is taken into account, whereas the IMUs alone cannot handle external magnetic disturbances; without the fusion, the propagating magnetic disturbance affects the coordinates of the shown joints.
Fig. 7. Propagation onto the torso coordinates: (a) measurement model 1; (b) measurement model 2.
Further analysis of the produced results demonstrates the behavior of the system. For MM1, Fig. 7 shows jitter on the Z coordinate compared to the original tracking with IMUs only. This shows that the torso is being dragged towards the Kinect information through MM1 and the EKF. Later, when the magnet is added to the scene, the Z coordinate looks much more stable compared to the original IMU data, but here the inertial tracking information tries to drag the torso out of the position held by the Kinect. And when the effect of the magnetic disturbance propagated to the torso becomes most severe and the IMUs are no longer capable of tracking at all, driving the Z and other coordinates to zero, MM1 still holds them at a reliable level. The jitter is sufficiently small, as can be clearly seen in the recording. Some segments, for example the shoulders, differ between the Kinect and IMU models, and when MM1 starts running during the recording replay, the shoulders are dragged down to where the Kinect tracks them in its skeleton reproduction, and similarly the spine is dragged backwards. MM2 demonstrates similar performance, but it clearly measures different features, namely the angles between the same segments from the two different sources. This makes the system prioritize keeping corresponding segments parallel to each other. Depending on the noise levels, this can result in better or worse performance compared to MM1, as discussed further below. In particular, it can be clearly seen that when the system can no longer handle the disturbances and the torso is dragged out of position, the other segments still remain parallel.
4 Conclusion
This paper proposes a basic concept for the fusion of inertial body tracking and Kinect body tracking. The results presented above allow us to conclude that the tracking information received from the Kinect sensor is capable of improving the estimate of human body movements in 3D, particularly when magnetic disturbances occur. The two measurement models demonstrated different performance depending on the noise levels, magnet locations and other factors. Measurement model 1 performed better at lower noise levels and demonstrated more robust results, whereas measurement model 2 was able to produce better estimates at higher noise levels and with more intense magnetic fields. Data received from the Kinect makes the system more robust; however, it cannot be completely reliable due to the significant level of jitter inherent in the nature of optical tracking itself.
References
1. Paul J. Besl and Neil D. McKay. A method for registration of 3-D shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1992.
2. Bas des Bouvrie. Improving RGBD indoor mapping with IMU data. MSc thesis, Faculty EEMCS, Delft University of Technology, 2011.
3. Antonio Padilha Lanari Bo et al. Joint angle estimation in rehabilitation with inertial sensors and its integration with Kinect. 33rd Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 2011.
4. Jamie Shotton et al. Real-time human pose recognition in parts from single depth images. Communications of the ACM, 2013.
5. Markus Miezal et al. A generic approach to inertial tracking of arbitrary kinematic chains. BodyNets, 2013.
6. Oresti Banos et al. Kinect=IMU? Learning MIMO signal mappings to automatically translate activity recognition systems across sensor modalities. 16th International Symposium on Wearable Computers, 2012.
7. Kourosh Khoshelham and Sander Oude Elberink. Accuracy and resolution of Kinect depth data for indoor mapping applications. Sensors, 2012.