
Computer Vision: Object and People Tracking

Proceedings of Seminar and Project Computer Vision: Object and People Tracking, Winter semester 2013/14
Dr. Gabriele Bleser and Prof. Didier Stricker
Department Augmented Vision, University of Kaiserslautern and DFKI GmbH

Introduction
The seminar and project Computer Vision: Object and People Tracking (INF-73-72-S-7, INF-73-82-L-7) are continuative courses based on and applying the knowledge taught in the lectures 3D Computer Vision (INF-73-51-V-7) and Computer Vision: Object and People Tracking (INF-73-52-V-7). The goal of the project is to research, design, implement and evaluate algorithms and methods for tackling computer vision problems. The seminar is more theoretical: its educational objective is to train the ability to become acquainted with a specific research topic, review scientific articles and give a comprehensive presentation supported by media. In the winter semester 2013/14, projects and seminars addressed image-based recognition and tracking tasks in the 2D image (e.g. objects and text) and in the 3D domain (e.g. body motion tracking and gaze estimation). The results are documented in these proceedings.

Organisers and supervisors
The courses are organised by the Department Augmented Vision (http://ags.cs.uni-kl.de), more specifically by:
Prof. Dr. Didier Stricker
Dr. Gabriele Bleser
In the winter semester 2013/14, the projects were supervised by the following department members: Dr. Alain Pagani, Christian Bailer, Mohamed Selim, Nils Petersen, Stephan Krauss, Sebastian Palacio, Markus Miezal

August 2014, Dr. Gabriele Bleser

Real-time text recognition in natural scenes
Pramod Murthy (murthy.pramod@gmail.com) and Alain Pagani (alain.pagani@dfki.de)

Abstract. With the increasing usage of mobile camera-based imaging, text recognition in uncontrolled environments provides a key input modality for many augmented reality applications. The goal of the project was to develop a real-time solution for text detection and recognition. The work was performed in two stages that together cover the typical stages of the text recognition process. The first part of the study evaluated the feasibility of a standard OCR system, from recognizing text in camera-based document images to text in natural scenes. The second part consisted of implementing text detection and localization by selecting from a set of Extremal Regions (ERs). By combining the two components, an end-to-end solution was developed. Finally, we discuss various issues faced and possible improvements to solve them.

Keywords: Text detection and localization, text recognition, OCR, natural scene, ER detection.

1 Introduction
The problem of recognizing text in natural scenes has drawn significant attention in computer vision research in recent years. One of the major factors contributing to this is the high usage of camera imaging in mobile devices. Text recognition for document imaging is solved using robust and powerful algorithms, but the problem of text detection and recognition in natural scenes is still unsolved. This is due to the fact that localizing text is a very expensive task: there are 2^N possible subsets which might represent text in an image, where N is the number of pixels [6].
Nevertheless, solutions would certainly lead to interesting augmented reality applications, such as assisting blind people in navigation, serving product reviews and relevant information on a smartphone, or providing translation services for textual information on wearable gadgets. The scope of the project was to understand the problem of real-time text detection and recognition in natural scenes and to explore different ways of improving text recognition accuracy.

2 Related Work
We can broadly classify the approaches for text detection and extraction into four different classes.

Edge-based methods. Edges are good features for text detection in natural scenes. These methods usually apply an edge detector first and then use morphological operators to separate text from the background scene. Edge-based detectors rely heavily on characters exhibiting strong edges, which fails under strong illumination or when shadows are overlaid on the scene. Proposed approaches include multiscale edge-based extraction [5, 7], self-adapting thresholding with morphological operations [1], and pyramid-based decomposition of the image with color-based edge detection to support changes in text color and varying font sizes in the scene [3, 4].

Texture-based methods. Texture-based methods use textural properties of the text to distinguish it from the background. A probable text region is extracted by applying texture analysis methods such as Gaussian filtering, wavelet decomposition, the Fourier transform, the Discrete Cosine Transform (DCT) and local binary patterns (LBP). Finally, a classifier (usually a trained machine learning model) is used to decide on text regions. For example, an approach for detecting multilingual text [12] uses histograms of oriented gradients (HOG), mean of gradients (MG) and LBP as features with an AdaBoost classifier to decide on text regions.

Connected-component-based methods. Connected-component (CC) based methods group a set of small regions successively until all regions which are part of the text are identified. These methods usually segment candidate text components by edge detection and color clustering; the non-text components are pruned with heuristic rules or classifiers. Since CC-based methods produce few candidate regions compared to other approaches, they have a low computational cost, and the located regions can be used directly as input for an optical character recognition engine. There are also iterative approaches that obtain connected components with a modified Conditional Random Field algorithm using Belief Propagation inference and OCR filtering stages [11], and another method that analyzes separate color image layers with a Block Adjacency Graph (BAG) [9].

Stroke-based methods. Another set of approaches uses the stroke width of text as a feature to detect text and distinguish it from the background. The image is segmented using the stroke width feature and grouped together by clustering. An operator called the Stroke Width Transform (SWT) detects characters over different scales by merging pixels of similar stroke width into connected components [2]. These approaches are primarily designed for horizontally oriented text and fail to recognize text with other orientations.

As each class of methods fails under certain conditions because of its inherent approach and varied character orientations, researchers have proposed new approaches that combine the strengths of each.
A hybrid algorithm detects text in any arbitrary orientation by adapting two sets of features based on SWT and a two-level classification scheme, achieving good results on ICDAR datasets [10]. The extremal regions approach of Neumann and Matas [6] provides real-time performance with a reduced memory footprint. Since the goal of the project was to develop a real-time text recognition system with low computational cost and a low memory footprint, we focused on the extremal regions approach for text localization, which is among the approaches with the highest f-measure in the comparison of [11]. The extremal regions method was trained on the ICDAR 2003 dataset and evaluated on the ICDAR 2011 and Street View Text datasets. The results were the second highest, with precision 73.1%, recall 64.7% and f-measure 68.7%, while on the more challenging Street View Text (SVT) dataset the recognition rates were precision 67.0%, recall 29.0% and f-measure 41.0% [6].

3 System Design
Designing a text recognition system is a challenging task, as the images may exhibit several imaging variations and distortions.

Fig. 1. Image processing pipeline

A typical image processing pipeline for text detection and recognition in natural scenes is shown in Figure 1 [11]. An input image is preprocessed to resize it to the optimal resolution and to calculate features for text detection. In the text detection and localization step, a bounding box is found around the text region present within the image. This extracted region is further enhanced and separated from the background. Typically, the extracted image is binarized before it is fed into an OCR engine. The OCR engine should be configured for the language of the text, loading the corresponding language model for text correction. Since OCR systems have evolved with improved accuracy rates, we performed an evaluation study of an OCR system to benchmark text recognition rates for camera-based document images.

4 OCR evaluation
The Tesseract OCR engine was chosen for performing OCR on camera-based document images. Tesseract, combined with other open source imaging libraries, supports different image input formats and converts them to text in over 60 languages [8]. A Canon EOS 5D Mark II camera with a resolution of 21 megapixels was used for taking images at a distance of roughly 70 cm. A set of 14 images was captured of a single-page document printed in Arial font. Each image in the set shows the document printed at a different font size, ranging from 8 to 30 in steps of 2 units, as shown in Figure 2. In this way a collection of images containing single-page documents varying in font size and font type was collected. OCR is applied to the different images to detect and recognize the characters written in the document. The recognized text output is compared with the ground truth text for the respective images, and finally the various accuracy measures of the OCR output are calculated.

4.1 Experimental setup
A database of natural images containing a single-page document was used. Two sets of images were used, Arial and Times New Roman, based on the font used for printing the document text in the images, as shown in Figure 2. Each set had a total of 14 images, where each successive image shows the document with the font size incremented by 2.
The first image in each set contains document text with font size 8, whereas the last one has font size 30. The following steps explain the overall setup, as shown in Figure 1:
1. A single image in the given set of images for a single font type is selected.
2. The image is processed using the Tesseract OCR engine.
3. An output-OCR text file is generated from the Tesseract OCR output.
4. The image selected in step 1 is reduced to the next resolution level and the process is continued from step 2 until the smallest reduction level is reached.
5. The output-OCR text file is compared with a text file containing the accurate text (ground truth) present in the image.
6. The different measures (precision, recall and f-measure) are calculated.

Fig. 2. Sample images containing a document in Arial font with font sizes 8 (a) and 30 (b), and in Times New Roman font with sizes 8 (c) and 30 (d)

Measurement procedure
Three measures were used to assess the OCR output at word level:
Precision: the fraction of correctly recognized words among all words retrieved in the OCR output document.
Recall: the fraction of correctly recognized words among the words present in the ground truth document.
F-measure: the weighted harmonic mean of precision and recall,

F = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}

4.2 Results

Fig. 3. Precision (a), recall (b) and f-measure (c) for images containing varying text sizes in Arial font
Fig. 4. Precision (a), recall (b) and f-measure (c) for images containing varying text sizes in Times New Roman font

The feasibility of text retrieval in natural images with text in two standard fonts (Arial and Times New Roman) has been analyzed. We measured the word detection accuracy on single-page document images taken by a camera for different font sizes at a human-readable distance (approximately 1 m), as shown in Figures 3 and 4. We also measured the optimal character height in pixels for correct detection of text, in order to estimate the resolution of the optics to be used by a text recognition system. From Figures 5 and 6 it was observed that the character height (x-height) should be at least 16 pixels to reach a text recognition accuracy of 90% and above.
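To make the measurement procedure concrete, the following is a minimal Python sketch of the word-level scoring described above. The file names and the simple set-based word matching are assumptions made for illustration, not the exact matching procedure used in the study.

```python
# Minimal sketch of word-level precision, recall and f-measure between an OCR
# output file and a ground-truth text file (file names are hypothetical).

def load_words(path):
    with open(path, encoding="utf-8") as f:
        return f.read().split()

def word_level_scores(ocr_path, ground_truth_path):
    ocr_words = load_words(ocr_path)
    gt_words = load_words(ground_truth_path)
    gt_set = set(gt_words)
    # a retrieved word counts as correct if it also appears in the ground truth
    correct = sum(1 for w in ocr_words if w in gt_set)
    precision = correct / len(ocr_words) if ocr_words else 0.0
    recall = correct / len(gt_words) if gt_words else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall > 0 else 0.0)
    return precision, recall, f_measure

if __name__ == "__main__":
    p, r, fm = word_level_scores("output-ocr.txt", "ground_truth.txt")
    print(f"precision={p:.3f} recall={r:.3f} f-measure={fm:.3f}")
```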
5 Text detection and localization
Text detection and localization is the process of determining the location of text in the image and generating bounding boxes around it [11]. These parts of the image then serve as input to the OCR for further evaluation. In general, image correction or other preprocessing steps are also needed, such as finding the orientation of the text and correcting camera angles or perspective distortions. We implemented the text localization and detection method using extremal regions by Neumann and Matas [6]. An Extremal Region (ER) is a region of the image whose outer boundary pixels have strictly higher values than the region itself. The method can be divided into two classification stages. In the first stage, a set of ERs is computed using a sequential search. The ER detector has complexity O(2pN), where p denotes the number of channels used and N the number of pixels. The probability of each ER being a character is estimated using a set of incrementally computable descriptors, which consist of the following elements:
– Area: the area of the region.
– Bounding box: the top-right and bottom-left corners of the region.
– Perimeter: the length of the boundary of the region.
– Euler number: the difference between the number of connected components and the number of holes in a binary image.
– Horizontal crossings: a vector with the number of transitions between pixels belonging to the ER and pixels not belonging to the ER in a given row i of the ER region r.

Fig. 5. Plots of precision (a), recall (b) and f-measure (c) versus character height (x-height) for Arial font
Fig. 6. Plots of precision (a), recall (b) and f-measure (c) versus character height (x-height) for Times New Roman font

A sequential classifier selects ER regions using the incrementally computable descriptors as features. The classification is applied in two steps to make it more computationally efficient. During the first step, the features are computed by incrementally increasing the threshold value from 0 to 255 in O(1) per ER. The value of the class-conditional probability p(r | character) is tracked at each threshold, and an ER is only selected when the probability is above a global limit P_min and the difference between the local maximum and local minimum is greater than Δ_min. A Real AdaBoost classifier using decision trees computes the incrementally computable descriptors in O(1). The output is calibrated to a class-conditional probability function p using logistic regression to select extremal regions. The parameters P_min = 0.2 and Δ_min = 0 are set to obtain a high recall rate (95.6%) [6]. In the second stage, we applied the Tesseract OCR engine to recognize the text in the different extremal regions. Tesseract provides adaptation options, such as assuming a single block of text or an automatic orientation correction and segmentation detection mode.

6 Experiments
6.1 Experiment 1
We applied our system to images with different types of segmentation and media. The system was applied to images containing product information. The Tesseract engine segmented and analyzed the layout successfully in images where the product had a flat surface geometry. The recognition rate decreased dramatically for cylindrical objects (Figure 7).

Fig. 7. Sample product images containing text in different layouts.

6.2 Experiment 2
In the next experiment, we applied our system to an offline high-quality video stream to evaluate the system under different illumination changes. The video was split into frames which were used as input to the whole system. The system did detect text, but the Tesseract OCR engine recognized only text with certain font types and supported language models, as shown in Figure 8. Figure 9 shows the overall framework developed for text detection and recognition. The input video stream is split into frames. A text detection filter is applied to each frame to obtain a list of extremal regions. The extremal regions are cropped from the image and processed before they are given as input to the Tesseract OCR system. The Tesseract OCR engine needs to be initialized with appropriate language models for recognizing the text in the cropped images.

Fig. 8. Text recognition in a video.
Fig. 9. Real-time text detection and recognition framework.
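As a rough illustration of this detect-then-recognize framework, the sketch below uses OpenCV's MSER detector (maximally stable extremal regions, a close relative of the ER detector used here) as a stand-in for the full two-stage ER classifier and passes the cropped candidate regions to Tesseract via pytesseract. The size thresholds and the Tesseract configuration are assumptions of this sketch, not the parameters of the implemented system.

```python
import cv2
import pytesseract  # requires a local Tesseract installation

def recognize_text_regions(image_path, lang="eng"):
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # MSER as a simple stand-in for the sequential ER classifier
    mser = cv2.MSER_create()
    _, boxes = mser.detectRegions(gray)
    results = []
    for (x, y, w, h) in boxes:
        # discard regions below the ~16 px character height found in Section 4
        if w < 8 or h < 16:
            continue
        crop = gray[y:y + h, x:x + w]
        # --psm 7 treats the crop as a single line of text
        text = pytesseract.image_to_string(crop, lang=lang, config="--psm 7").strip()
        if text:
            results.append(((x, y, w, h), text))
    return results
```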
7 Conclusion
A real-time text detection and recognition method was developed and applied to detect text in natural scene images of varying complexity. A feasibility study of the Tesseract OCR system on camera-based text documents showed that a character height of at least 16 pixels is needed to retrieve 90% of the words. Further, the Tesseract engine was combined with the Neumann and Matas algorithm [6] for text detection and localization. The combined system was successfully applied to varying text layouts and sizes in natural scenes.

References
[1] Ying-ying Cui, Jie Yang, and Dong Liang. An edge-based approach for sign text extraction. Image Technology, 1:007, 2006.
[2] Boris Epshtein, Eyal Ofek, and Yonatan Wexler. Detecting text in natural scenes with stroke width transform. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2963–2970. IEEE, 2010.
[3] Nobuo Ezaki, Marius Bulacu, and Lambert Schomaker. Text detection from natural scene images: towards a system for visually impaired persons. In Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on, volume 2, pages 683–686. IEEE, 2004.
[4] Anil K. Jain and Bin Yu. Automatic text location in images and video frames. Pattern Recognition, 31(12):2055–2076, 1998.
[5] Xiaoqing Liu and Jagath Samarabandu. Multiscale edge-based text extraction from complex images. In Multimedia and Expo, 2006 IEEE International Conference on, pages 1721–1724. IEEE, 2006.
[6] Lukas Neumann and Jiri Matas. Real-time scene text localization and recognition. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3538–3545. IEEE, 2012.
[7] Wen-wu Ou, Jun-min Zhu, and Chang-ping Liu. Text location in natural scene. Journal of Chinese Information Processing, 5:006, 2004.
[8] Ray Smith. Tesseract OCR engine. Lecture, Google Code, Google Inc., 2007.
[9] Kongqiao Wang and Jari A. Kangas. Character location in scene images from digital camera. Pattern Recognition, 36(10):2287–2299, 2003.
[10] Cong Yao, Xiang Bai, Wenyu Liu, Yi Ma, and Zhuowen Tu. Detecting texts of arbitrary orientations in natural images. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 1083–1090. IEEE, 2012.
[11] Honggang Zhang, Kaili Zhao, Yi-Zhe Song, and Jun Guo. Text extraction from natural scene image: A survey. Neurocomputing, 122:310–323, 2013.
[12] Gang Zhou, Yuehu Liu, Quan Meng, and Yuanlin Zhang. Detecting multilingual text in natural scene. In Access Spaces (ISAS), 2011 1st International Symposium on, pages 116–120. IEEE, 2011.

Struck (Structured Output Tracking with Kernels) and related work
Achim Otting (otting@rhrk.uni-kl.de) and Christian Bailer (christian.bailer@dfki.de)

Abstract. This paper presents the Struck (Structured Output Tracking with Kernels) algorithm. It augments the idea of adaptive tracking-by-detection with structured output learning with SVMs. In contrast to traditional tracking-by-detection methods, Struck directly links learning and tracking and does not need an intermediate binarization step. Inaccurate tracking may lead to wrong classification and further drift. A budgeting mechanism for the number of support vectors makes real-time application of Struck possible. Experiments show that Struck can outperform state-of-the-art tracking-by-detection methods. In this paper a way to make Struck more robust to appearance changes is also presented.

Keywords: Struck, structured output, SVM, tracking, kernels

1 Introduction
1.1 Motivation
Visual object tracking is used in many application areas like traffic surveillance [13], medicine [18] and human-computer interaction [7]. After many years of research there still exist several challenges in visual object tracking.
Especially appearance changes of the target object, like rotation, scale changes and deformations, as well as illumination changes and occlusions, remain difficult [12]. No general visual object tracking algorithm is available which is able to solve all the mentioned problems in all possible situations. Wu et al. [20] designed large-scale experiments to assess state-of-the-art visual object tracking algorithms. The evaluation shows that the Struck approach, which is presented in this paper, can exceed the results of the competing algorithms in many benchmarks. Struck uses structured output SVMs to estimate the object's position. A budgeting mechanism for the number of support vectors makes real-time applications possible.

1.2 Related Work
In the past, several approaches for visual object tracking were developed. Asset-2 (A Scene Segmenter Establishing Tracking, Version 2) [17] and the traffic monitoring system of [13] extract two-dimensional features to find the optical flow of an image for the tracking task. Motion models can be considered to reduce the search space. Another way to track objects is to classify samples generated from the video sequence into background and target objects. Tracking-by-detection can, in an advanced implementation (e.g. [3]), handle object appearance and disappearance, but does not use motion models. Struck augments the ideas of adaptive tracking-by-detection. Tracking-by-detection can be seen as detection over time, and good detection methods are available today. Avidan [1] describes a tracking system for vehicles that searches for strong edges in images and uses a support vector machine (SVM) to decide whether the found edges belong to a vehicle or not. Adaptive tracking-by-detection methods allow handling changes in appearance. This is realized by online-trained classifiers. The approaches of Babenko et al. [3], Grabner et al. [10] and Saffari et al. [16] are boosting-based.

1.3 Preliminaries
In this chapter some basics of how to design multi-class classifiers are discussed, which are important to understand the fundamental principles of Struck. To classify multiple classes with SVMs it is possible to define several binary classifiers and combine their results; one-vs-one and one-vs-rest training are often used. One can also design one single classifier which classifies multiple classes [8]. This is called structured output learning.

Tracking by Detection
The tracking algorithm predicts for every time t the position p_t of the bounding box. The classifier learns for every t the predefined features and assigns to every sample x a binary label +1 or -1 according to z = sign(h(x)). The assumption is that during tracking the maximum classification confidence lies around p_{t-1} and moves with p_t. The translation y of the object at time t is

y_t = \arg\max_{y \in Y} h\left(x_t^{p_{t-1} \circ y}\right)    (1)

and the new estimated object position is computed as

p_t = p_{t-1} \circ y_t.    (2)

As mentioned by Hare et al. [11], the two parts of adaptive tracking-by-detection approaches can now be described in short as (1) sample generation and labeling and (2) classifier update. They also point out the main drawbacks of this approach. One problem is that the samples have the same weight during training, so a negative sample which highly overlaps the bounding box has the same weight as one which has only a little overlap. The poorly labeled samples can reduce the classifier accuracy and also the tracking accuracy. There also exists no universal strategy for sample generation and labeling. This leads to label noise. Another problem they point out is that the maximum classifier confidence does not necessarily coincide with the best estimate of the object's location. This is because the objectives of the classifier and the tracker are very different: the first classifies the samples as target object or background, and the second estimates the object's location. There exists no direct connection between these two parts during learning, which results in poor labels.
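The following is a minimal Python sketch of the prediction step in equations (1) and (2): candidate translations y around the previous position p_{t-1} are scored with the classifier h and the best one is applied. The rectangular grid search, the search radius and the extract_patch and h placeholders are assumptions made for illustration; Hare et al. [11] search on a polar grid with a bounded radius.

```python
import numpy as np

def track_step(frame, p_prev, h, extract_patch, search_radius=30, step=5):
    """One tracking-by-detection step: returns (p_t, y_t)."""
    best_y, best_score = (0, 0), -np.inf
    for dy in range(-search_radius, search_radius + 1, step):
        for dx in range(-search_radius, search_radius + 1, step):
            candidate = (p_prev[0] + dx, p_prev[1] + dy)    # p_{t-1} composed with y
            score = h(extract_patch(frame, candidate))      # h(x_t^{p_{t-1} o y})
            if score > best_score:
                best_score, best_y = score, (dx, dy)
    p_new = (p_prev[0] + best_y[0], p_prev[1] + best_y[1])  # p_t = p_{t-1} o y_t
    return p_new, best_y
```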
Structured Output
The goal is to enable the classifier not only to process arbitrary input but also to generate arbitrary (more complex than simple binary) output with only one single classifier. To do so, one has to bring the outputs into a relationship with each other. Such classifiers can make better use of the training data for learning. In the case of object detection, the output space can be four points describing the bounding box surrounding the target object. The values of the points depend on each other: the values of the left and top points are lower than the values of the right and bottom points. In addition, the scores of the bounding boxes correlate; bounding boxes that highly overlap will have nearly the same score. Considering these dependencies improves training and testing [4].

SVMs
To separate samples of two classes (x_1, y_1), ..., (x_l, y_l) with x_i \in \mathbb{R}^n and y_i \in \{-1, 1\}, an SVM creates a hyperplane described by \langle w, x \rangle - b = 0 from these samples. In the not linearly separable case, the SVM minimizes the objective \phi(w, \xi) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{l}\xi_i, where the \xi_i are called slack variables and are needed to account for the loss. The constrained optimization problem is

P(w, b, \xi) = \min_{w, b, \xi} \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{l}\xi_i    (3)
subject to \forall i: y_i(\langle w, x_i \rangle - b) \ge 1 - \xi_i and \xi_i \ge 0.

To get rid of the constraints in (3), positive dual variables \alpha_i are introduced. Each constraint is multiplied with a dual variable and added, which yields the Lagrangian of the optimization problem

L(w, b, \xi, \alpha) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{l}\xi_i - \sum_{i=1}^{l}\alpha_i\left(y_i(\langle w, x_i \rangle - b) + \xi_i - 1\right).    (4)

The unconstrained optimization problem (dual problem) is now a maximization

D(\alpha) = \max_{\alpha} \sum_{i=1}^{l}\alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j \langle x_i, x_j \rangle    (5)
subject to \forall i: 0 \le \alpha_i \le C and \sum_{i=1}^{l}\alpha_i y_i = 0.

2 Struck
Hare et al. [11] (Struck) are convinced that the existing tracking-by-detection algorithms only address label noise by making their classifier more robust, while the real problem is that the labeler and the learner are separated. The Struck algorithm, which is presented in this chapter, does not depend on a labeler but directly links learning and tracking. There is no intermediate binarization step: it directly learns the object transformation with a structured output SVM. Struck is based on the work of Crammer and Singer [8]. New samples are classified according to y_t = \arg\max_{y} F(x_t^{p_{t-1}}, y), where F(x, y) = \langle w, \phi(x, y) \rangle measures the compatibility between (x, y) pairs. The problem for structured output SVMs is defined as

P(w, \xi) = \min_{w, \xi} \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i    (6)
subject to \forall i: \xi_i \ge 0 and \forall i, \forall y \ne y_i: \langle w, \delta\phi_i(y) \rangle \ge \Delta(y_i, y) - \xi_i

where \delta\phi_i(y) = \phi(x_i, y_i) - \phi(x_i, y), so that \langle w, \delta\phi_i(y) \rangle = F(x_i, y_i) - F(x_i, y). Note that this is not a linearly separable problem and therefore joint kernel maps \phi(x, y) are used. The inequality contains a loss function \Delta(y_i, y) to handle the first issue (all samples being equally weighted). The loss function should be 0 iff y_i = y and increase as y_i and y diverge.
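A sketch of a loss of this kind is given below: it is based on bounding-box overlap, is zero for identical transformations and approaches one as the boxes diverge. This mirrors the VOC overlap measure used in the experiments of Section 4; treating it as the exact loss used by Struck is an assumption of this sketch.

```python
# Overlap-based loss Delta(y_i, y) for structured output tracking.
# Boxes are given as (x, y, w, h).

def box_overlap(a, b):
    """VOC-style overlap: intersection area divided by union area."""
    ix = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def loss(box_yi, box_y):
    """Zero iff the boxes coincide, growing towards one as they diverge."""
    return 1.0 - box_overlap(box_yi, box_y)
```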
2.1 Online Optimization
The (dual) optimization problem of Struck which has to be solved is (see [11] and [5] for details)

D(\beta) = \max_{\beta} -\sum_{i,y}\Delta(y_i, y)\,\beta_i^y - \frac{1}{2}\sum_{i,y}\sum_{j,\bar{y}}\beta_i^y\,\beta_j^{\bar{y}}\,\langle \phi(x_i, y), \phi(x_j, \bar{y}) \rangle    (7)
subject to \forall i, \forall y: \beta_i^y \le \delta(y, y_i)\,C and \forall i: \sum_y \beta_i^y = 0

with \delta(y, \bar{y}) = 1 if y = \bar{y} and 0 otherwise, with \langle \phi(x_i, y), \phi(x_j, \bar{y}) \rangle as a kernel comparing two image patches from frames x_i, x_j at positions y, \bar{y}, and

F(x, y) = \sum_{i,\bar{y}} \beta_i^{\bar{y}}\,\langle \phi(x_i, \bar{y}), \phi(x, y) \rangle.    (8)

Now let us discuss some basic definitions. If \beta_i^y \ne 0 for an (x_i, y) pair, then this pair is called a support vector and x_i a support pattern. For any given support pattern there exists only one support vector with \beta_i^{y_i} > 0, namely (x_i, y_i), which is called the positive support vector. For all other support vectors \beta_i^y < 0 holds; they are called negative support vectors. To solve the dual problem (7), an SMO algorithm [15] is used. The SMO step can be found in Listing 1.1. The coefficients of the support vectors and the gradients are updated during this SMOStep. The algorithm takes a sample index i together with y_+ and y_-. As can be seen, the dual variables \beta_i^{y_+} and \beta_i^{y_-} are incremented and decremented by the same value (lines 8 and 9). This ensures that the second constraint of (7) remains fulfilled. If a coefficient \beta_i^y becomes zero, its support vector is removed. At the end, the gradients are updated according to

g_i(y) = -\Delta(y, y_i) - F(x_i, y).    (9)

Listing 1.1. SMOStep, taken from [11]
1  Require: i, y_+, y_-
2  k00 = <phi(x_i, y_+), phi(x_i, y_+)>
3  k11 = <phi(x_i, y_-), phi(x_i, y_-)>
4  k01 = <phi(x_i, y_+), phi(x_i, y_-)>
5  lambda_u = (g_i(y_+) - g_i(y_-)) / (k00 + k11 - 2*k01)
6  lambda = max(0, min(lambda_u, C*delta(y_+, y_i) - beta_i^{y_+}))
7  Update coefficients
8    beta_i^{y_+} += lambda
9    beta_i^{y_-} -= lambda
10 Update gradients
11   for (x_j, y) in S do
12     k0 = <phi(x_j, y), phi(x_i, y_+)>
13     k1 = <phi(x_j, y), phi(x_i, y_-)>
14     g_j(y) -= lambda * (k0 - k1)
15   end for

Three strategies exist to choose a triple ((x_i, y_i), y_+, y_-) which is given as input to SMOStep.

– PROCESSNEW takes as input a pattern x_i and returns immediately if the pattern is already a support pattern. Otherwise x_i is not a support pattern, so (x_i, y_i) is not a support vector and beta_i^y = 0 for all y; the sample class y_i is assigned to the positive class y_+, and y_- is computed as arg min_{y in Y} g_i(y).

Listing 1.2. PROCESSNEW(x_i), taken from [5]
1 if x_i is a support pattern then exit
2 y_+ = y_i
3 y_- = arg min_{y in Y} g_i(y)
4 Perform SMOStep(i, y_+, y_-)

– PROCESSOLD chooses a support pattern randomly, recomputes its class and assigns it to the positive class; the constraint only considers existing support vectors. y_- is computed as in PROCESSNEW, so the new y_+ and y_- lie along the highest gradient.

Listing 1.3. PROCESSOLD, taken from [5]
1 take random support pattern x_i
2 y_+ = arg max_{y in Y} g_i(y) subject to beta_i^y <= C*delta(y, y_i)
3 y_- = arg min_{y in Y} g_i(y)
4 Perform SMOStep(i, y_+, y_-)

– OPTIMIZE chooses a support pattern randomly and takes only the classes of its existing support vectors; it then reassigns y_+ and y_- out of these classes.

Listing 1.4. OPTIMIZE, taken from [5]
1 take random support pattern x_i
2 let Y_i = {y in Y such that (x_i, y) in S}
3 y_+ = arg max_{y in Y_i} g_i(y) subject to beta_i^y <= C*delta(y, y_i)
4 y_- = arg min_{y in Y_i} g_i(y)
5 Perform SMOStep(i, y_+, y_-)
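Before turning to the overall tracking loop in Listing 1.5, the following Python sketch shows one way the SMOStep update of Listing 1.1 (invoked by all three strategies above) can be realized. The dictionaries for beta and g, the kernel callback and the set S of current support vectors are assumptions made for this illustration, and the bookkeeping of S (adding new vectors and removing those whose coefficient becomes zero) is omitted.

```python
def smo_step(i, y_pos, y_neg, beta, g, S, kernel, C, y_true):
    """One SMO step on pattern i with classes y_pos / y_neg (cf. Listing 1.1)."""
    k00 = kernel(i, y_pos, i, y_pos)
    k11 = kernel(i, y_neg, i, y_neg)
    k01 = kernel(i, y_pos, i, y_neg)
    denom = k00 + k11 - 2.0 * k01
    lam_u = (g[(i, y_pos)] - g[(i, y_neg)]) / denom if denom > 0 else 0.0
    delta = 1.0 if y_pos == y_true[i] else 0.0
    lam = max(0.0, min(lam_u, C * delta - beta.get((i, y_pos), 0.0)))
    # update coefficients (lines 8 and 9 of Listing 1.1); the sum over y stays zero
    beta[(i, y_pos)] = beta.get((i, y_pos), 0.0) + lam
    beta[(i, y_neg)] = beta.get((i, y_neg), 0.0) - lam
    # update gradients of all current support vectors
    for (j, y) in S:
        k0 = kernel(j, y, i, y_pos)
        k1 = kernel(j, y, i, y_neg)
        g[(j, y)] -= lam * (k0 - k1)
```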
Listing 1.5. Struck, taken from [11]
1  Require: f_t, p_{t-1}, S_{t-1}
2  Estimate change in object location
3    y_t = arg max_{y in Y} F(x_t^{p_{t-1}}, y)
4    p_t = p_{t-1} o y_t
5  Update discriminant function
6    (i, y_+, y_-) <- PROCESSNEW(x_t^{p_t}, y_0)
7    SMOSTEP(i, y_+, y_-)
8    BUDGETMAINTENANCE()
9    for j = 1 to n_R do
10     (i, y_+, y_-) <- PROCESSOLD()
11     SMOSTEP(i, y_+, y_-)
12     BUDGETMAINTENANCE()
13     for k = 1 to n_O do
14       (i, y_+, y_-) <- OPTIMIZE()
15       SMOSTEP(i, y_+, y_-)
16     end for
17   end for
18 return p_t, S_t

As can be seen, PROCESSNEW and PROCESSOLD compute the classes and may create new support vectors. They can be expensive as they search over the whole transformation space to minimize (9). OPTIMIZE only reassigns the classes and updates the coefficients. This operation is much faster because the search space only contains the classes of the already computed support vectors. After every PROCESSNEW step (to process a new sample (x_i, y_i)) follow n_R = 10 iterations of a REPROCESS step, where each REPROCESS step contains one PROCESSOLD and n_O = 10 OPTIMIZE steps (Listing 1.5) [11].

2.2 Budget
For a permanent object tracking task the number of support vectors increases without bound, because every new incoming sample may lead to the creation of a new support vector. F is part of (9) and therefore also part of PROCESSNEW and PROCESSOLD, which become more and more expensive as new samples lead to new support vectors [11]. To ensure that the object tracking can be done in real time, it is necessary to remove support vectors. Wang et al. [19] show that the gradient error (that is, the performance degradation) is proportional to the change of the weight vector, so minimizing the weight degradation \|\Delta w\| minimizes the gradient error E. The pre- and postcondition of BUDGETMAINTENANCE is that \sum_y \beta_i^y = 0 has to hold. There is only one support vector (x_i, y_i) with \beta_i^y \ge 0 per support pattern x_i, so for budget maintenance only the support vectors with \beta_i^y \le 0 are removal candidates [11].
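A sketch of this budget maintenance idea is given below: among the negative support vectors, remove the one whose removal changes the weight vector the least, shifting its coefficient onto the positive support vector of the same pattern so that the sum-to-zero condition keeps holding. The concrete cost expression follows from this weight-change argument and is an assumption of the sketch rather than a verbatim transcription of [11].

```python
def budget_maintenance(beta, S, kernel, y_true, budget):
    """Remove negative support vectors until at most `budget` vectors remain."""
    while len(S) > budget:
        best, best_cost = None, float("inf")
        for (i, y) in S:
            if beta.get((i, y), 0.0) >= 0:     # only negative support vectors are candidates
                continue
            b = beta[(i, y)]
            # squared change of w if beta[(i, y)] is shifted onto (i, y_i)
            cost = b * b * (kernel(i, y, i, y)
                            + kernel(i, y_true[i], i, y_true[i])
                            - 2.0 * kernel(i, y, i, y_true[i]))
            if cost < best_cost:
                best, best_cost = (i, y), cost
        if best is None:
            break
        i, y = best
        beta[(i, y_true[i])] = beta.get((i, y_true[i]), 0.0) + beta[(i, y)]  # keep sum_y beta = 0
        del beta[(i, y)]
        S.discard((i, y))
```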
3 Further Improvements with Robust Tracking with Weighted Online Structured Learning
Yao et al. [21] identified two main problems of adaptive tracking-by-detection methods from which the presented Struck also suffers. One problem is that the methods need online classifiers to adapt the object's appearance model and therefore discard old training data. If the appearance model changes, the old training data is not available for a necessary redetection. The other problem is that the traditional methods weight all samples equally and either fully include or discard them. But newer samples should affect the classification in the time-dependent sequence of frames more than older ones and should therefore be weighted higher. Yao et al. developed in [21] a robust tracker with online weighted learning. Their tracker can handle kernels and structured output. A constant-sized subset of a huge set is called a reservoir; their algorithm uses a weighted reservoir, whose elements are chosen with a probability based on their weights. To update the set, the current sample replaces an element with a certain probability. In a possibly infinite sequence it is not feasible to use all samples for learning. Because online learning does not consider all samples, it performs worse than batch learning; this additional loss of online learning compared to batch learning is called the regret. However, the error of the robust tracker with online weighted learning is not much higher than the error of batch learning [6], [21].

4 Experiments
4.1 Tracking by Detection
This chapter presents the experiments executed by Hare et al. [11], which compare benchmark results of Struck [11] with results of boosting-based and random-forest-based approaches [3], [10], [14] and [16]. For better comparability, Struck uses features similar to the other approaches, although these features were optimized for boosting-based approaches. Struck achieves much better results than the other approaches even with features that were not optimized for it. Further experiments and results for the combination of multiple kernels for Struck are discussed afterwards.

Hare et al. [11] compared the performance of the approaches on 8 different video sequences. Sylvester and David include light, scale and pose changes. The two Face sequences include occlusion, and Face2 additionally appearance changes. The Girl sequence contains appearance and pose changes. The challenges in the Tiger and Coke sequences are occlusions, pose changes, fast motion and, in the Coke sequence, a specular object. The fast motions additionally cause motion blur [3]. The sequences can be found at [2].

Sequence   Struck∞ Struck100 Struck50 Struck20 MIForest OMCLP MIL  Frag OAB
Coke       0.57    0.57      0.56     0.52     0.35     0.24  0.33 0.08 0.17
David      0.80    0.80      0.81     0.35     0.72     0.61  0.57 0.43 0.26
Face1      0.86    0.86      0.86     0.81     0.77     0.80  0.60 0.88 0.48
Face2      0.86    0.86      0.86     0.83     0.77     0.78  0.68 0.44 0.68
Girl       0.80    0.80      0.80     0.79     0.71     0.64  0.53 0.60 0.40
Sylvester  0.68    0.68      0.67     0.58     0.59     0.67  0.60 0.62 0.52
Tiger1     0.70    0.70      0.69     0.68     0.55     0.53  0.52 0.19 0.23
Tiger2     0.56    0.57      0.55     0.39     0.53     0.44  0.53 0.15 0.28
Table 1. Benchmark results for tracking-by-detection using the VOC overlap measure, taken from [11]

Table 1 compares Struck with the other algorithms. The feature vector uses Haar-like features arranged on a grid at two different scales and is smoothed with a Gaussian kernel. It is sufficient to search the points on a polar grid within a bounded distance. The experiment was repeated five times and the median result was taken (see details in [11]). The PASCAL Visual Object Classes (VOC) overlap measure

a_o = \frac{\text{area}(B_p \cap B_{gt})}{\text{area}(B_p \cup B_{gt})}

[9] measures how well the algorithm predicts the position of the bounding box. It considers the predicted bounding box B_p and the ground truth bounding box B_{gt}. A sample is counted as positive if the overlap of both bounding boxes is greater than 0.5.

In all but one experiment, Struck outperforms the other approaches. It is remarkable that a low budget is sufficient for Struck to get good results: in four of eight experiments Struck20 is better than all other approaches, and the best results for Struck are often achievable with a low budget of 50. Only Frag is slightly better than Struck when tracking Face1, because Frag was optimized for partial occlusions. The performance of Frag decreases dramatically on Face2, because this sequence contains not only occlusion but also appearance changes, which Frag cannot handle since it has no adaptive object model [3]. Struck20 performs poorly on the David sequence: the background is very dynamic and changes its appearance, so 20 support vectors may be too few to generalize the background. With the unoptimized Struck algorithm and a budget size of 100, Hare et al. achieved an average of 13.2 FPS, which shows that Struck is suitable for real-time applications [11].

Figure 1 shows the support vectors after tracking for Struck64. The green boxes represent positive support vectors (the object) and the red boxes represent negative support vectors (the background). The positive support vectors can track the appearance change of the objects, so the method is suitable for adaptive tracking-by-detection. Many more negative support vectors are needed because the background has a higher variance than the foreground objects [11].

Fig. 1. Tracked girl, taken from [11]
4.2 Combination of multiple kernels
As an improvement, Hare et al. [11] tested the combination of multiple features through averaged kernels,

k(x, \bar{x}) = \frac{1}{N_k}\sum_{i=1}^{N_k} k^{(i)}\left(x^{(i)}, \bar{x}^{(i)}\right).

Advanced multiple kernel learning would also learn the weights, but experiments showed that learning the weights would not improve the tracking performance significantly. Table 2 shows the results for combinations of multiple kernels. Unfortunately, the results are not significantly better and sometimes even worse. The performance depends strongly on the choice of the features; in future research Hare et al. want to investigate better feature choices.

Kernels Coke David Face1 Face2 Girl Sylvester Tiger1 Tiger2
A       0.57 0.80  0.86  0.86  0.80 0.68      0.70   0.57
B       0.57 0.83  0.82  0.79  0.77 0.75      0.69   0.60
C       0.69 0.67  0.86  0.79  0.68 0.72      0.77   0.61
AB      0.62 0.84  0.82  0.83  0.79 0.73      0.69   0.53
AC      0.65 0.68  0.87  0.86  0.80 0.72      0.74   0.63
BC      0.68 0.87  0.82  0.78  0.79 0.77      0.74   0.57
ABC     0.63 0.87  0.83  0.84  0.79 0.73      0.72   0.56
Table 2. Combined kernels, taken from [11]. Row A: Haar-like features with Gaussian kernel (σ = 0.2); row B: raw features with Gaussian kernel (σ = 0.1); row C: histogram features with intersection kernel.

5 Conclusion
In this paper we discussed Struck as an extension of adaptive tracking-by-detection using structured output. In contrast to traditional tracking-by-detection approaches, it does not depend on a labeler and has no intermediate binarization step. Inaccuracies in the tracking step of traditional approaches may cause false classification of samples and may finally lead to further drift. The augmentation of the adaptive tracking-by-detection idea enables Struck to handle appearance changes. Struck shows consistently good results in different challenging tracking tasks. The budgeting mechanism allows using it in real-time applications. Robust tracking with weighted online structured learning makes the structured learning approach more robust against appearance changes, because a bounded set of old training samples is kept for training. The experiments of Hare et al. [11] are mainly limited to static backgrounds and dynamic foregrounds (except for the David sequence). The performance of Struck when tracking dynamic objects in front of a dynamic background should also be investigated.

References
1. S. Avidan. Support vector tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26:1064–1072, August 2004.
2. B. Babenko. Tracking with online multiple instance learning (MILTrack). http://vision.ucsd.edu/~bbabenko/project_miltrack.shtml. Visited 2014-03-04.
3. B. Babenko, M.-H. Yang, and S. Belongie. Visual tracking with online multiple instance learning. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
4. M. B. Blaschko and C. H. Lampert. Learning to localize objects with structured output regression. In European Conference on Computer Vision 2008, 2008.
5. A. Bordes, L. Bottou, P. Gallinari, and J. Weston. Solving multiclass support vector machines with LaRank. In Proceedings of the 24th International Conference on Machine Learning, ICML '07, pages 89–96, New York, NY, USA, 2007. ACM.
6. A. Bordes, N. Usunier, and L. Bottou. Sequence labelling SVMs trained in one pass. In Walter Daelemans, Bart Goethals, and Katharina Morik, editors, Machine Learning and Knowledge Discovery in Databases, volume 5211 of Lecture Notes in Computer Science, pages 146–161. Springer Berlin Heidelberg, 2008.
7. L. Bretzner, I. Laptev, and T. Lindeberg. Hand gesture recognition using multi-scale colour features, hierarchical models and particle filtering. In Proceedings Face and Gesture 2002, pages 423–428, 2002.
8. K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292, March 2001.
9. M. Everingham and J. Winn. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Development Kit. http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2012/htmldoc/devkit_doc.html, 2012. Visited 2014-01-15.
10. H. Grabner, M. Grabner, and H. Bischof. Real-time tracking via on-line boosting. In Proceedings of the British Machine Vision Conference, pages 6.1–6.10. BMVA Press, 2006.
11. S. Hare, A. Saffari, and P. H. S. Torr. Struck: Structured output tracking with kernels. In 2011 IEEE International Conference on Computer Vision (ICCV), pages 263–270, November 2011.
12. A. S. Jalal and V. Singh. The state-of-the-art in visual object tracking. Informatica (Slovenia), 36(3):227–248, 2012.
13. K. Kiratiratanapruk and S. Siddhichai. Vehicle detection and tracking for traffic monitoring system. In TENCON 2006, 2006 IEEE Region 10 Conference, pages 1–4, November 2006.
14. C. Leistner, A. Saffari, and H. Bischof. MIForests: Multiple-instance learning with randomized trees. In Proceedings of ECCV 2010 - 11th European Conference on Computer Vision, volume 6, pages 29–42, September 2010.
15. J. C. Platt. Fast training of support vector machines using sequential minimal optimization. In Advances in Kernel Methods, pages 185–208. MIT Press, Cambridge, MA, USA, 1999.
16. A. Saffari, M. Godec, T. Pock, C. Leistner, and H. Bischof. Online multi-class LPBoost. In 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3570–3577, June 2010.
17. S. M. Smith. ASSET-2: Real-time motion segmentation and shape tracking. In Proceedings of the Fifth International Conference on Computer Vision, 1995, pages 237–244, June 1995.
18. X. Tang, G. C. Sharp, and S. B. Jiang. Fluoroscopic tracking of multiple implanted fiducial markers using multiple object tracking. Physics in Medicine and Biology, 52(14):4081–4098, 2007.
19. Z. Wang, K. Crammer, and S. Vucetic. Multi-class Pegasos on a budget. In Proceedings of the 27th International Conference on Machine Learning, pages 1143–1150, 2010.
20. Y. Wu, J. Lim, and M.-H. Yang. Online object tracking: A benchmark. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, 2013.
21. R. Yao, Q. Shi, C. Shen, Y. Zhang, and A. van den Hengel. Robust tracking with weighted online structured learning. In Andrew W. Fitzgibbon, Svetlana Lazebnik, Pietro Perona, Yoichi Sato, and Cordelia Schmid, editors, ECCV (3), volume 7574 of Lecture Notes in Computer Science, pages 158–172. Springer, 2012.

Combining Head Pose and Eye Location Information for Gaze Estimation
Daniel Gröger (groeger@rhrk.uni-kl.de) and Mohamed Selim (mohamed.selim@dfki.de)
Abstract. The combination of eye location and head pose estimation promises an improvement of gaze estimation using only a simple webcam in front of the user. Creating such an estimation system with high accuracy and real-time performance will open up a broad range of applications such as assisting disabled people, monitoring driver behavior, or gaze-based interaction on large screen displays. In this seminar paper the approach by Valenti et al. [16] is presented and its limitations and benefits are discussed. A short background summary is given, the approach is described, and experimental results are analyzed. It is shown that an accurate real-time system is feasible even though some limitations apply.

Keywords: gaze estimation, head pose estimation, eye-tracking

1 Introduction
Estimating a person's gaze can be useful for many applications. The gaze is directly connected to a person's attention and can therefore be used to learn about the user's interest in observed objects. This knowledge may then be used to learn more about the user's behavior or to build systems capable of interacting with the user based on the expressed interest. The possible applications for this knowledge cover diverse topics such as marketing and usability studies, interaction devices for disabled people, attention monitoring while driving, and interactive reading.

Gaze estimation and head pose estimation have been studied separately to a large extent in the past, and numerous solutions have been proposed that vary strongly with respect to the usage scenario. For gaze estimation, static eye-trackers using infrared illumination, head-mounted eye-trackers in the form of glasses, and electro-oculography based gaze estimation are used in practice [2, 14, 5]. To estimate the head pose, hardware like head-mounted inertial measurement units (IMUs), time-of-flight based systems (Microsoft Kinect), and multi-camera setups are used. While these work well in their respective use cases, the goal of this work is to investigate a solution which allows maximal freedom of movement for the user, while using minimal affordable hardware such as a webcam and no wearable equipment whatsoever. Image-based solutions that estimate the gaze as well as the head pose separately from an image captured by a webcam have been proposed in several papers as well, but most of these approaches are limited in terms of trackable head poses. The aim of this paper is to study a novel approach by Valenti et al. [16] which combines eye location and head pose estimation in a non-sequential way to overcome these limitations. To accomplish this, the technologies used for eye location and head pose estimation are introduced and the necessary details of the novel system are provided. In Section 2 a short summary of background on gaze and head pose estimation is given to put the studied approach into context. The details of the approach are then explained in Section 3 and experimental results are illustrated in Section 4. The implementation and corresponding issues are described in Section 5. Subsequently, limitations regarding the implementation, the general approach, and the evaluation are discussed in Section 6, and finally the conclusions regarding the studied approach are summarized in the last section.

2 Background
To put the studied approach for gaze estimation into context, a short overview of recent approaches is given in this section.

2.1 Eye location based approaches
For general eye location based gaze estimation several approaches have been established.
Image-based devices extract the eye position from images that are captured by a camera observing the eye's movements. Electro-oculography measures the electromagnetic variation that occurs upon movement of the eye muscles using electrodes attached to the head, from which the current eye position is calculated and thus the point of gaze can be obtained. A search coil is a thin metal rod placed on the eye to measure electromagnetic induction; for this purpose an electromagnetic field is applied and the gaze position is calculated from the induction. For further details and other methods, see Holmqvist et al. [9].

In this paper the image-based approach is used, as the desired input device is a webcam observing the user. Image-based techniques using a stationary camera can be divided further into active systems that use a light source, e.g. infrared light, for tracking and passive systems that rely solely on the camera image. Different challenges arise for each. IR-based approaches are widely used in user studies such as psychological attention, reading, etc. [7, 2]. They typically operate at close range (approximately the distance of a subject sitting in front of a monitor), require a restriction of the user's head movement (usually achieved by a head rest), and require a calibration procedure. An additional limitation is the interference of sunlight with the system's IR illumination of the eye, which makes outdoor usage of such systems very difficult. These characteristics show that IR-based stationary systems are usually designed to be very accurate and to be used in a lab environment rather than for everyday use.

Appearance-based approaches require more complex processing, as no additional information such as IR reflections is available. This usually involves detecting the eyes, compensating for head motion, and processing the image to estimate the gaze. Still, more complex but less restrictive solutions are more suitable for many applications, since the human gaze does not only consist of eye movements; head movements are also important. State-of-the-art appearance-based methods use a variety of techniques to locate the eyes and their centers and show that non-IR based eye gaze estimation is becoming feasible. Asteriadis et al. [1] use feature matching on the edge map of the eye area and a training set to localize the eyes. Other recent approaches rely on machine learning and some kind of feature extraction. Hamouz et al. [8] use Gabor filters to extract 10 features and two support vector machine (SVM) classifiers, Türkan et al. [15] use SVMs and edge projection, and Campadelli et al. [6] use an eye detector and a process to refine the eye positions using Haar wavelets and an SVM. In the studied gaze estimation system an approach by Valenti et al. [17] based on isophote curvatures is used, which outperforms the aforementioned approaches regarding eye center location while being robust to small pose changes as well as illumination changes.

2.2 Head pose based approaches
To estimate the user's gaze, the head pose can be used as well. Once it is known where the user's head is facing, the gaze can be estimated using the field of view. As with eye location based estimation, systems that use more complex hardware, such as IMUs, multi-camera setups, or Microsoft's Kinect, already provide reasonable solutions to estimate a user's head pose. Yet, to achieve more flexibility and reduce hardware cost, only image-based approaches are considered here.
A variety of image-based head pose estimation algorithms exist, many of which make use of image features and either a 3-D or a 2-D model to represent the head. For 3-D-model based approaches, more or less complex generic or user-specific models are used, but complex models are usually less tolerant to initialization errors, are computationally expensive, and suffer from large drift over time. Therefore, simpler models, like the cylindrical head model (CHM), are used in publications and demonstrate good real-time performance in studies [4, 11, 19]. In the studied gaze estimation system the approach by Xiao et al. [19] is used, which is capable of tracking the head even when it is turned by more than 30° from the frontal position with respect to the camera.

3 The studied approach
The studied visual gaze estimation system is a combination of an isophote-based eye location detector by Valenti et al. [17] and a head pose estimator by Xiao et al. [19], which are described in more detail in the following subsections.

3.1 Accurate Eye Center Location and Tracking Using Isophote Curvature
The main idea behind the eye center location approach is that each eye's appearance in the captured image can be assumed to be a bright ellipse with a dark ellipse in its center. These shapes can be described by a combination of isophotes, which are curves that connect points of equal intensity in the image. Isophotes are used as they possess several useful properties for object descriptors: they do not intersect each other and can therefore be used to fully describe an image, and their shape is invariant to linear lighting changes and to rotation.

Fig. 1. '(a) Source image, (b) the obtained centermap, and (c) the 3-D representation of the latter.' [16]

Once the whole image is described by isophotes, we still need to determine those that correspond to the eye center. For this step, Valenti et al. make use of the isophote curvature k, which is expressed as

k = -\frac{I_y^2 I_{xx} - 2\, I_x I_{xy} I_y + I_x^2 I_{yy}}{\left(I_x^2 + I_y^2\right)^{3/2}}    (1)

where I_x = \partial I / \partial x and I_y = \partial I / \partial y are the first-order derivatives of the intensity function I in the x- and y-dimension, and I_{xx}, I_{xy}, I_{yy} are the corresponding second-order derivatives. The intensity on the outer side of the curve determines the sign of this curvature, which enables us to discriminate between dark and bright isophote centers and therefore to find the dark area of the cornea and iris within the brighter sclera (negative sign of the curvature). To find global isophote centers, the displacement D(x, y) to the closest isophote center is estimated for every pixel in the image using k as

D(x, y) = -\frac{\{I_x, I_y\}\left(I_x^2 + I_y^2\right)}{I_y^2 I_{xx} - 2\, I_x I_{xy} I_y + I_x^2 I_{yy}}.    (2)

These values, ignoring points with a positive curvature, are then mapped into an accumulator which performs weighting based on a curvedness measure [10] for each point. Finally, to yield a single estimate for each cluster, the accumulator is convolved with a Gaussian kernel and the maximum is chosen as the estimated location of the eye center (see Figure 1).
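The following is a minimal Python sketch of this centermap construction using equations (1) and (2): image derivatives are computed, displacement vectors to the isophote centers are cast into an accumulator, and bright centers (positive curvature) are ignored. The choice of derivative filters, the missing Gaussian smoothing and the omitted curvedness weighting are simplifications relative to Valenti et al. [17].

```python
import numpy as np

def centermap(gray):
    """Accumulator of isophote-center votes; the eye center is near its maximum."""
    I = gray.astype(np.float64)
    Iy, Ix = np.gradient(I)              # first-order derivatives
    Ixy, Ixx = np.gradient(Ix)           # second-order derivatives
    Iyy, _ = np.gradient(Iy)
    num = Ix**2 + Iy**2
    den = Iy**2 * Ixx - 2 * Ix * Ixy * Iy + Ix**2 * Iyy
    den = np.where(den == 0, np.finfo(float).eps, den)
    dx = -Ix * num / den                 # displacement vectors D(x, y), eq. (2)
    dy = -Iy * num / den
    curvature = -den / np.maximum(num, np.finfo(float).eps) ** 1.5   # eq. (1)
    h, w = I.shape
    ys, xs = np.mgrid[0:h, 0:w]
    cx = np.clip(np.round(xs + dx).astype(int), 0, w - 1)
    cy = np.clip(np.round(ys + dy).astype(int), 0, h - 1)
    acc = np.zeros_like(I)
    mask = curvature < 0                 # keep dark centers (cornea and iris)
    np.add.at(acc, (cy[mask], cx[mask]), 1.0)
    return acc
```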
3.2 Robust Full-Motion Recovery of the Head by Dynamic Templates and Re-registration Techniques
The approach to estimate the head pose by Xiao et al. [19] uses a cylindrical head model (CHM) which is initialized assuming a frontal face position of the user. It is based on the idea of estimating the motion between two images assuming constant intensity,

I(F(u, \mu), t + 1) = I(u, t)    (3)

where F(u, \mu) is a parametric motion model with motion parameter \mu, u is the old location, F(u, \mu) the new location, and t the time. Since the CHM is three-dimensional, the motion parameter vector \mu has 6 degrees of freedom, \mu = [\omega_x, \omega_y, \omega_z, t_x, t_y, t_z], where \omega_x, \omega_y and \omega_z are rotation parameters (pitch, yaw and roll angles respectively) and t_x, t_y and t_z are translation parameters. These parameters are initialized on the initial frame containing the frontal face. The detected positions of the eyes are used to improve the t_x and t_y parameters estimated from the center of the detected face, and the distance between the eyes is used to estimate t_z. Since a frontal position is assumed, the pitch (\omega_x) and yaw (\omega_y) angles are set to zero and the roll (\omega_z) is determined using the eyes' positions. On the next frame, the motion is estimated using (3), where F(u, \mu) is derived using the rigid movement of a point X = [x, y, z, 1]^T between t and t + 1,

X(t+1) = M \cdot X(t) = \begin{pmatrix} 1 & -\omega_z & \omega_y & t_x \\ \omega_z & 1 & -\omega_x & t_y \\ -\omega_y & \omega_x & 1 & t_z \\ 0 & 0 & 0 & 1 \end{pmatrix} \cdot X(t)    (4)

and the camera projection matrix, assuming that the matrix depends only on the focal length f_L. This yields the following equation for an image point u of X = [x, y, z, 1]^T at t + 1:

u(t+1) = \frac{f_L}{-x\omega_y + y\omega_x + z + t_z} \begin{pmatrix} x - y\omega_z + z\omega_y + t_x \\ x\omega_z + y - z\omega_x + t_y \end{pmatrix}    (5)

Equation (3) is then solved for \mu using the Lucas-Kanade method [12] with applied weights as follows:

\mu = -\left(\sum_{\Omega} w\,\left(I_u F_\mu\right)^T \left(I_u F_\mu\right)\right)^{-1} \left(\sum_{\Omega} w\, I_t \left(I_u F_\mu\right)^T\right)    (6)

where F_\mu is the partial differential of F(\cdot) with respect to \mu, I_u and I_t are the spatial and temporal image gradients, \Omega is the face region (called the template), and w are weights. The weights are determined in several steps to account for the following concerns. For robustness against noise, non-rigid motion and occlusion, iteratively re-weighted least squares (IRLS) [3] is used and compensation is applied for side effects caused by IRLS. To account for non-uniform pixel density, i.e. the fact that highly dense pixels originating from the border of the cylinder should contribute less to the motion, the density is calculated and the weights are adapted accordingly. For more details please refer to [19]. Before applying equation (6) iteratively, the initial \mu is calculated using the partial differential F_\mu at \mu = 0. Subsequently, the incremental transformation is computed using \mu after each iteration and all incremental transformations are composed into the final transformation matrix. To deal with large head movements, the template is adapted each frame and the error is measured in order to re-register the template once the error is too large.
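A sketch of one weighted Lucas-Kanade update from equation (6) is given below. The array shapes for the spatial gradients I_u, the motion Jacobian F_mu, the temporal differences I_t and the per-pixel weights w over the template region are assumptions made for this illustration; IRLS weighting, density compensation and the iterative composition of increments are left out.

```python
import numpy as np

def motion_update(I_u, F_mu, I_t, w):
    """One solve of eq. (6).

    I_u:  N x 2      spatial image gradients per template pixel
    F_mu: N x 2 x 6  Jacobian of the motion model per pixel
    I_t:  N          temporal intensity differences
    w:    N          per-pixel weights
    Returns the 6-DoF increment mu = [wx, wy, wz, tx, ty, tz].
    """
    J = np.einsum("ni,nij->nj", I_u, F_mu)     # steepest-descent rows, N x 6
    H = np.einsum("n,ni,nj->ij", w, J, J)      # weighted 6 x 6 normal matrix
    b = np.einsum("n,n,ni->i", w, I_t, J)      # weighted residual term
    return -np.linalg.solve(H, b)
```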
3.3 Combining eye location and CHM tracking

Each system on its own is capable of achieving better results than other proposed solutions, but each one has its limitations. The gaze estimation system requires a frontal head pose, or less than 30° rotation at the cost of accuracy, to achieve good results, and the CHM tracker may not be able to recover when converging to the wrong location. Hence, Valenti et al. [16] propose a simultaneous integration of both systems to overcome these limitations. This is done by correcting the eye location using the estimated head pose and improving the head pose estimation by using the eye detector's results as a cue for quality control.

In the first frame the eye locations are detected using the standard eye locator described in Section 3.1 and used as reference points. These reference points are projected onto the estimated cylindrical model and an area around the target point is extracted. The obtained patches are then transformed using the transformation matrix of the CHM tracker. The eye locator is then applied to the transformed patches, which are expected to be more similar to a frontal eye image and therefore deliver better results (see Figure 2). In addition, the center of each patch is used as a reference when choosing the peak of the accumulator, since peaks closer to the center are closer to the initial location estimate. To check the results of the head pose estimation, the pose vector is calculated from the 3-D eye locations, given that the location of the eyes in 3-D is known, and compared to the resulting vector of the head tracker. If the difference exceeds a certain threshold, the transformation matrix of the tracker is recomputed based on the average of the vectors. Furthermore, the standard eye locator is used to check whether the eye locations calculated by the head tracker are close to the estimated locations of the standard locator, which should be accurate when a frontal view is encountered.

Fig. 2. "Examples of extreme head poses and the respective pose-normalized eye locations. The results of the eye locator in the pose-normalized eye region is represented by a white dot." [16]

3.4 Gaze estimation

Given a reliable eye location and head pose in 3-D, one can basically calculate the user's gaze and field of view. Of course, the exact parameters of a person's eyes would have to be known; hence, an approximation is used. In their paper, Valenti et al. [16] suggest a binocular field of view that spans 120° of visual angle, surrounded horizontally by a monocular field of view of approximately 30° on each side. This binocular field of view is centered on the gaze point M and can be approximately described by a pyramid OABCD (see Figure 3). The pyramid is then intersected with the plane of the observed target scene P to yield the area that is visible to the user.

Fig. 3. "Representation of the visual field of view at distance d." [16]

This estimation is a good approximation of the user's field of view but does not take the eye movement in the ocular cavities into account. Valenti et al. assume that "the point of interest (defined by the eyes) does not fall outside the head-pose-defined field of view" [16, p. 808], based on the study in [13]. Furthermore, they state that a simple "2-D mapping of the location of the pupil (with respect to an arbitrary anchor point) and known locations on the screen" [16, p. 808] is sufficient to interpolate the focused point in the scene, since the eyes can be assumed to only shift in vertical and horizontal direction in the ocular cavities [18]. A calibration plane is constructed in front of the head, and rays from the center of the head through known points on the target plane (points displayed for calibration purposes) are intersected with the calibration plane. Now the known points can be retargeted and a recalibration simulated every time the user's head moves. The calibration plane and the calibration points move according to the head pose model in 3-D, and hence the new intersections of the rays between head and calibration points with the target plane can be calculated as new known points and a new mapping learned.
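As an illustration of such a 2-D mapping, the sketch below fits a simple affine mapping from pose-normalized pupil-anchor displacement vectors to screen coordinates by least squares; this is a generic choice for illustration only, as the paper does not prescribe a particular interpolation model.

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Fit screen = [dx, dy, 1] * A from calibration pairs (displacement -> screen point),
// where A is a 3x2 affine mapping estimated by linear least squares.
cv::Mat fitGazeMapping(const std::vector<cv::Point2f>& displacements,
                       const std::vector<cv::Point2f>& screenPoints)
{
    const int n = static_cast<int>(displacements.size());
    cv::Mat X(n, 3, CV_64F), Y(n, 2, CV_64F);
    for (int i = 0; i < n; ++i) {
        X.at<double>(i, 0) = displacements[i].x;
        X.at<double>(i, 1) = displacements[i].y;
        X.at<double>(i, 2) = 1.0;                  // affine offset term
        Y.at<double>(i, 0) = screenPoints[i].x;
        Y.at<double>(i, 1) = screenPoints[i].y;
    }
    cv::Mat A;
    cv::solve(X, Y, A, cv::DECOMP_SVD);            // least-squares fit
    return A;                                      // 3x2 mapping
}

// Map a new displacement vector to an on-screen coordinate.
cv::Point2f mapGaze(const cv::Mat& A, const cv::Point2f& d)
{
    cv::Mat x = (cv::Mat_<double>(1, 3) << d.x, d.y, 1.0);
    cv::Mat s = x * A;                             // 1x2 result
    return cv::Point2f(static_cast<float>(s.at<double>(0, 0)),
                       static_cast<float>(s.at<double>(0, 1)));
}
```

When the head moves, the retargeted known points described above would simply be fed into `fitGazeMapping` again to simulate the recalibration.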
4 Experimental results

Valenti et al. [16] report three independent evaluations, one for each of the studied system's components, and an overall evaluation of the system. First, the eye locator supported by the head pose estimation was tested on the Boston University head pose database [11]. The system using head cues is compared to the standard eye locator on video sequences of subjects performing nine different head motions. The performance was measured in terms of the error, i.e. based on the Euclidean distance between the estimated eye location and manually annotated ground truth. The results show a significant overall improvement in performance compared to the baseline results and a specific improvement in accuracy from 16% to 23% for an allowed error larger than 0.1 [16].

Second, the head pose estimation is evaluated using the same database as before. For pitch (\omega_x), yaw (\omega_y), and roll (\omega_z), the root-mean-square error (RMSE) and the standard deviation (STD) are computed between ground truth measured by magnetic sensors on the subject's head and the estimates of the head pose estimator. The estimator is used in two different configurations regarding the template update described in Section 3.2: for one, the first frame is kept as the template throughout the video sequence, and for the other, the template is updated each frame. Both methods are tested with and without eye location information and the results compared. The comparison shows that the update approach achieves the best results when no eye information is used. With eye information, both configurations perform equally well or better than the baseline, while the fixed-template approach outperforms the update approach. This is most likely due to small errors introduced by the eye locator that cannot be corrected by the head pose estimation. In addition, Valenti et al. compare their results to two similar studies and report "comparable or better results with respect to the compared methods" [16, p. 810].

Finally, two experiments are performed to evaluate the overall system performance. For both, data of 11 male and female subjects sitting in front of a computer monitor equipped with a webcam was recorded under different lighting conditions. The first task involved looking at a dot on the screen while moving the head towards the dot, and then moving it randomly once the desired position facing the dot is reached, while still gazing at the dot. The second task is to follow the dot around the screen naturally. For both tasks the face of the user and the position of the dot on the screen are recorded as ground truth. Based on this data, three algorithms are evaluated:

1. Eyes-only gaze estimator: The approach representing traditional mapping of anchor-pupil vectors to the screen (as in [18]) without taking head motion into account.
2. Pose-normalized gaze estimator: An approach using anchor-pupil vectors that have been normalized using the head pose (see [16]).
3. Pose-retargeted gaze estimator: The method proposed in Section 3.4 using pose-normalized displacement vectors, which differs from the pose-normalized gaze estimator because of the retargeting of known points as described in Section 3.4.

Based on the mean error and STD reported on tasks one and two, the results can be summarized as follows. The pose-normalized gaze estimator and the pose-retargeted gaze estimator achieved a significantly lower error than the eyes-only gaze estimator on task one and also outperformed it on task two, where the eyes-only gaze estimator failed as expected due to head motion. The pose-retargeted gaze estimator improved on the results of the pose-normalized gaze estimator in both tasks and achieved a significantly lower error in task two as opposed to task one, mainly because the eye displacement with respect to the head was natural in task two.
The pose-retargeted gaze estimator "has a mean error of (87.18, 103.86) pixels, corresponding to an angle of (1.9°, 2.2°) in the x- and y-direction, respectively" [16, p. 813], which is quite impressive considering that the human fovea covers roughly 2° of visual angle.

5 Implementation

The implementation of the discussed approach is a goal of this work, in order to fully understand the approach's details and the possible issues that may arise. To this end, the implementation of the eye center locator, the head pose estimator, and the overall system will be discussed in the following subsections.

Fig. 4. "Schematic diagram of the components of the system." [16]

5.1 System Overview

Starting with the eye center locator, the parts of the system were implemented using the C++ OpenCV library (http://opencv.org/). As illustrated in Figure 4, the system is initialized by detecting a frontal face and the eyes. This is accomplished by using the OpenCV Haar Cascade Classifier (http://docs.opencv.org/modules/objdetect/doc/cascade_classification.html#haar-feature-based-cascade-classifier-for-object-detection). After initialization of the cylindrical head model, the head is tracked in the subsequent frames and the eye regions are extracted according to the head model. The eye center locator is then applied on the normalized eye patches and the results transformed back into the model's 3D space.

The overall system was tested under Ubuntu (http://www.ubuntu.com/) running on an Intel Xeon 8-core 3.4 GHz processor with 16 GB RAM. As cameras, the Playstation Eye® and a Logitech HD Webcam C270 were used, both recording images at a resolution of 640x320 pixels at a distance of approximately 80 cm from the head to the camera (see Figure 5).

Fig. 5. Camera and display setup for calibration

5.2 Eye Center Locator

According to Valenti et al. [17], the isophote curvature, isophote center displacement vectors, and curvedness are calculated using the image gradient of the smoothed input image (see Section 3.1 for details). The sign of the curvature is used to discriminate between votes, and the curvedness serves as a weight for each displacement vector vote into an accumulator. Finally, the accumulator is convolved with a Gaussian kernel to merge clusters of votes and yield several high peaks.

The implementation of this part posed one major difficulty. The calculation of the derivatives in combination with the initial smoothing of the image has a great impact on curvature and displacement vectors. To reduce artifacts caused by the discrete nature of the image when calculating the derivatives, it is desirable to smooth the image. Unfortunately, using a high standard deviation for the Gaussian kernel used for smoothing will cause the curvature to degenerate and the displacement vectors to become inaccurate. Furthermore, different techniques to calculate the derivative, for example Scharr or Sobel kernels, have an influence on the center voting, as they take more or less of each pixel's neighborhood into account.
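For reference, a minimal sketch of the initialization step described in Section 5.1 is given below; it detects a frontal face and the eyes with the OpenCV Haar cascade classifier. The cascade file names are those shipped with OpenCV, and taking the first detection is a simplification, not necessarily what the implementation does.

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Hedged sketch: detect a frontal face and both eyes to initialize the CHM.
// Returns false if no face or fewer than two eyes are found.
bool detectFaceAndEyes(const cv::Mat& frame, cv::Rect& face, std::vector<cv::Rect>& eyes)
{
    static cv::CascadeClassifier faceCascade("haarcascade_frontalface_alt.xml");
    static cv::CascadeClassifier eyeCascade("haarcascade_eye.xml");

    cv::Mat gray;
    cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
    cv::equalizeHist(gray, gray);                       // improve contrast for the cascades

    std::vector<cv::Rect> faces;
    faceCascade.detectMultiScale(gray, faces, 1.1, 3, 0, cv::Size(80, 80));
    if (faces.empty()) return false;
    face = faces[0];                                    // take the first detection for simplicity

    eyeCascade.detectMultiScale(gray(face), eyes, 1.1, 3, 0, cv::Size(20, 20));
    return eyes.size() >= 2;                            // both eyes needed to set tx, ty, tz and roll
}
```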
Fig. 6. Accuracy of the Eye Center Locator on the BioID database under different configurations for initial smoothing and accumulator smoothing.

To evaluate the implemented eye center locator, the BioID database was used and the results were directly compared to Valenti et al. [17]. Since optimal parameter values for the initial smoothing as well as for the Gaussian kernel applied to the accumulator (centermap) were unknown, the two-dimensional parameter space (initial σ and accumulator σ) was explored to find the configuration with the highest accuracy (see Figure 6). Figure 7 shows the detailed results for a fixed initial σ of 0.1, of which an accumulator σ of 3 was chosen as the optimal configuration out of the tested space. For these parameter values the worst-eye accuracy measure yields a score of 60.61%, compared to Valenti's result of 77.15%. For future work on this implementation it is desirable to investigate this discrepancy further and determine the correlation between parameter values and eye region resolution.

Fig. 7. Accuracy of the Eye Center Locator on the BioID database for initial smoothing of 0.1 and varying accumulator smoothing.

5.3 Head Pose Estimator

The head pose estimation was implemented according to Xiao et al. [19] in order to fully understand the concepts and later apply the parts used in Valenti's algorithm description [16]. For this purpose a modular approach was chosen to keep the parts, for example the cylindrical head model, as reusable as possible. The implementation consists of the cylindrical head model, the main algorithm to calculate the six-dimensional motion model in an iterative way using the Lucas-Kanade method [12], the calculation of weights for the iterative approach, and the template re-registration algorithm.

The main issue for this part of the implementation was the complexity of the overall system. Small errors and changes have a large impact on the overall estimation of the motion model and easily cause the iterative approach not to converge. This made finding errors in the calculation very difficult and led to the overall head pose estimation being not robust and accurate. An alternative approach was implemented later to compare the results and was found to be more robust. This approach is based on using the OpenCV implementation of the Lucas-Kanade optical flow algorithm (http://docs.opencv.org/modules/video/doc/motion_analysis_and_object_tracking.html#calcopticalflowpyrlk) and solving the "Perspective-n-Point (PnP)" problem (http://docs.opencv.org/modules/calib3d/doc/camera_calibration_and_3d_reconstruction.html#solvepnp) to estimate the new head pose. To compare the overall results of the system, the latter implementation was used. Hence, it is desirable for further work to use the approach by Xiao et al. once the implementation is more robust. This is also desirable as it is not clear whether Valenti et al. use any of the applied improvements, such as weighting and template re-registration, in their adaptation of the head pose estimation algorithm.
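A minimal sketch of this alternative approach is shown below. It assumes OpenCV 3+ naming and that the 3-D points on the cylindrical head model corresponding to the tracked 2-D features are maintained elsewhere; it is a sketch of the idea, not the exact implementation.

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Hedged sketch: track face features with pyramidal Lucas-Kanade optical flow and
// recover the head pose by solving PnP against the matching 3-D points on the CHM.
bool trackHeadPose(const cv::Mat& prevGray, const cv::Mat& gray,
                   std::vector<cv::Point2f>& imagePoints,     // tracked 2-D features
                   std::vector<cv::Point3f>& modelPoints,     // matching 3-D CHM points
                   const cv::Mat& K, const cv::Mat& distCoeffs,
                   cv::Mat& rvec, cv::Mat& tvec)               // pose from the previous frame
{
    std::vector<cv::Point2f> nextPoints;
    std::vector<uchar> status;
    std::vector<float> err;
    cv::calcOpticalFlowPyrLK(prevGray, gray, imagePoints, nextPoints, status, err);

    // keep only the successfully tracked 2-D/3-D correspondences
    std::vector<cv::Point2f> pts2d;
    std::vector<cv::Point3f> pts3d;
    for (size_t i = 0; i < status.size(); ++i) {
        if (!status[i]) continue;
        pts2d.push_back(nextPoints[i]);
        pts3d.push_back(modelPoints[i]);
    }
    if (pts2d.size() < 6) return false;                        // too few points for a stable pose

    // the previous rvec/tvec serve as the initial guess (useExtrinsicGuess = true)
    bool ok = cv::solvePnP(pts3d, pts2d, K, distCoeffs, rvec, tvec, true);
    if (ok) { imagePoints = pts2d; modelPoints = pts3d; }
    return ok;
}
```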
5.4 Combination

The combination of head pose estimation and eye center location was implemented according to Valenti et al. [16]. Based on the estimated head pose, the regions around the eyes were extracted and unwarped to a pose-normalized representation (see Figure 2). On these normalized eye regions the eye locator was applied and the estimated position transformed back into the 3D space of the head model. An evaluation of the implemented system's performance could only be done by examining sample sequences, but not using ground truth as in Valenti's report, due to issues with the implementation and the limited time for this work.

Fig. 8. Left: Camera image with rendered head model, estimated eye centers, and point between eye centers. Right: Calibration points on the target plane (blue) and estimated focused point (red).

5.5 Gaze Estimation

Finally, the gaze estimation based on the estimated head pose and eye displacement vectors was implemented according to the studied approach. To this end, the implementation of a calibration plane was added to the cylindrical head model and a calibration GUI (graphical user interface) was created to add calibration points to the plane and collect corresponding screen coordinates and eye displacement vectors (see Figure 8). Again, a final empirical evaluation was not possible as the overall system did not perform in a robust and accurate way. The initial fit of the head model to the user was not accurate enough, and the eye center locator exhibited some jitter when estimating the eye center location. This led to possibly wrong displacement vectors during calibration and therefore an unreliable mapping between displacement vectors and screen coordinates. To overcome this problem and compare the system's performance to the one reported by Valenti et al. [16], the single components need to be revised and the open implementation problems solved.

6 Limitations

In this section I will discuss the limitations of the studied approach. First, I will discuss limitations regarding the evaluation scenario and subsequently what needs to be investigated further towards a more general application of webcam-based gaze estimation.

The evaluation scenario involved users sitting in front of a computer screen on which a camera is mounted to capture the user's face. In this context Valenti et al. [16] report good results in terms of mean error in pixels on-screen, but also hint at a couple of problems. One of these is the placement of the camera on top of the monitor, which leads to partial eye occlusion by the eyelids when gazing at points on the bottom of the screen. Furthermore, problems with the low resolution (320 x 240 pixels) of the Boston University head pose database [11] are reported for the evaluation of the eye location detector with head pose cues. This leads to the question in what way the user's distance to the camera and the camera's resolution impact the performance of the gaze estimation system. On a similar note, one might ask whether the screen size or the position of the face within the image has any effect on the performance, since a larger screen would lead to an even more displaced position of the camera. The reported on-screen mean error of (87.18, 103.86) pixels on the dot-following task can be considered a good result, but depending on the usage scenario the error might be too large (considering the density of information on computer screens).

A second issue is the calibration procedure. A calibration with target points needs to be conducted to achieve a mapping between the eye and the observed scene. Applications thus require the user to go through the calibration procedure for every session using the system. A further investigation of what is possible with different methods for automatic calibration would be useful.

Towards a more general scenario, one may consider using this technology in a car to assist the driver, on top of a TV set in the living room, or behind a shopping window to interact with people passing by. In this context several open questions arise that make interesting topics for further research.
Regarding all of the example scenarios, the general limitations of the system concerning the user's distance to the camera, the resolution of the camera, and the position of the face within the image need to be explored further. Regarding the in-car application, the limitations on the camera angle would be interesting, as the camera would have to be placed on the dashboard or in the rear-view mirror. For the shopping window, multi-user tracking and non-frontal initialization, as well as further occlusion and recovery, are interesting.

Limitations concerning the implementation of the discussed approach are mainly related to unknown parameters or unclear implementation details. The large influence of the smoothing parameters is a clear limitation, as optimal values have to be determined for each application of the approach. Also, the relation between face size and these values needs to be investigated further, to make the system usable at varying distances to the camera. Regarding the head tracking, similar limitations regarding the distance need to be determined, as it relies on the optical flow of the face region and therefore less information is available at a reduced resolution of the face in the image. Summarizing, the limitations of the studied approach need to be defined more precisely, which requires further research on this topic. The overall goal is still an auto-calibrating gaze estimation system which works at short and long range, preferably for multiple simultaneous users, and for large head pose rotation angles relative to the camera.

7 Conclusion

The main contribution of the studied approach is a gaze estimation system that works well for users in front of a computer monitor using only a webcam. Valenti et al. show that the two systems achieve better performance together than on their own. It is possible to increase the operation range of the eye locator by more than 15°, and the overall system is capable of estimating the gaze with a small mean error between 2° and 5° of visual angle in real time. With this technology a broad range of applications is possible that can be made easily accessible to many users, since the required hardware is very cheap compared to other eye-tracking equipment. Nevertheless, there remain several open questions regarding the application of this technology in other scenarios and the general limitations imposed by the camera hardware and usage setup. The implementation of the system showed that there are several issues that can be resolved by investigating the relationship between parameters further. It also showed that there are improvements that may be considered in future work.

References

1. S. Asteriadis, N. Nikolaidis, A. Hajdu, and I. Pitas. An eye detection algorithm using pixel to edge information. In Proc. of 2nd IEEE-EURASIP Int. Symposium on Control, Communications, and Signal Processing, ISCCSP 2006, 2006.
2. Ralf Biedert, Jörn Hees, Andreas Dengel, and Georg Buscher. A robust realtime reading-skimming classifier. In Proceedings of the Symposium on Eye Tracking Research and Applications, ETRA '12, pages 123–130, New York, NY, USA, 2012. ACM.
3. Michael Black. Robust incremental optical flow. PhD thesis, Yale University, 1992.
4. L. M. Brown. 3d head tracking using motion adaptive texture-mapping. In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, volume 1, pages I–998–I–1003, 2001.
5. Andreas Bulling, Jamie A. Ward, Hans Gellersen, and Gerhard Tröster.
Robust recognition of reading activity in transit using wearable electrooculography. In Proceedings of the 6th International Conference on Pervasive Computing, Pervasive '08, pages 19–37, Berlin, Heidelberg, 2008. Springer-Verlag.
6. Paola Campadelli and Raffaella Lanzarotti. Precise eye localization through a general-to-specific model definition. In Proc. of BMVC, 2006.
7. Laura A. Granka, Thorsten Joachims, and Geri Gay. Eye-tracking analysis of user behavior in www search. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '04, pages 478–479, New York, NY, USA, 2004. ACM.
8. M. Hamouz, J. Kittler, J.-K. Kamarainen, P. Paalanen, H. Kalviainen, and J. Matas. Feature-based affine-invariant localization of faces. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 27(9):1490–1495, 2005.
9. Kenneth Holmqvist, Marcus Nyström, Richard Andersson, Richard Dewhurst, Halszka Jarodzka, and Joost van de Weijer. Eye tracking: A comprehensive guide to methods and measures. Oxford University Press, 2011.
10. Jan J. Koenderink and Andrea J. van Doorn. Surface shape and curvature scales. Image Vision Comput., 10(8):557–565, October 1992.
11. M. La Cascia, S. Sclaroff, and V. Athitsos. Fast, reliable head tracking under varying illumination: an approach based on registration of texture-mapped 3d models. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 22(4):322–336, 2000.
12. Bruce D. Lucas and Takeo Kanade. An iterative image registration technique with an application to stereo vision. In Proceedings of the 7th International Joint Conference on Artificial Intelligence - Volume 2, IJCAI'81, pages 674–679, San Francisco, CA, USA, 1981. Morgan Kaufmann Publishers Inc.
13. Rainer Stiefelhagen and Jie Zhu. Head orientation and gaze direction in meetings. In CHI '02 Extended Abstracts on Human Factors in Computing Systems, CHI EA '02, pages 858–859, New York, NY, USA, 2002. ACM.
14. T. Toyama, A. Dengel, W. Suzuki, and K. Kise. Wearable reading assist system: Augmented reality document combining document retrieval and eye tracking. In Document Analysis and Recognition (ICDAR), 2013 12th International Conference on, pages 30–34, 2013.
15. M. Türkan, M. Pardás, and A. Çetin. Human eye localization using edge projection. In Proc. Comput. Vis. Theory Appl., 2007.
16. R. Valenti, N. Sebe, and T. Gevers. Combining head pose and eye location information for gaze estimation. Image Processing, IEEE Transactions on, 21(2):802–815, 2012.
17. Roberto Valenti and Theo Gevers. Accurate eye center location and tracking using isophote curvature. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2008. CVPR 2008., pages 1–8, 2008.
18. Roberto Valenti, Jacopo Staiano, Nicu Sebe, and Theo Gevers. Webcam-based visual gaze estimation. In Proceedings of the 15th International Conference on Image Analysis and Processing, ICIAP '09, pages 662–671, Berlin, Heidelberg, 2009. Springer-Verlag.
19. Jing Xiao, Tsuyoshi Moriyama, Takeo Kanade, and Jeffrey Cohn. Robust full-motion recovery of head by dynamic templates and re-registration techniques. International Journal of Imaging Systems and Technology, 13:85–94, September 2003.

SLAM for dynamic AR environments

Philipp Hasper (hasper@cs.uni-kl.de) and Nils Petersen (Nils.Petersen@dfki.de)

Abstract.
The goal of this project was to implement a monocular SLAM system for tracking in an unknown environment and to examine how changes in the environment during runtime affect the tracking result. The approach is based on PTAM and splits the pipeline into two modules, namely "Tracker" and "Mapper": the Mapper constructs a map of the environment while the Tracker tries to localize itself in this map.

Keywords: SLAM, Augmented Reality, Tracking

1 Introduction

The term SLAM is short for "Simultaneous Localization and Mapping" and denotes a way of tracking in an unknown environment: a moving agent builds a map of its environment and simultaneously localizes itself in this newly created map (cf. Figure 1). The origin of SLAM lies in the field of robot navigation [1, Section 2.4.3], where usually several different sensors (odometer, GPS, inertial, ...) are fused; these measurements are then used to derive position and/or pathway and finally a map of the environment is built.

Fig. 1. Visualization of the FastSLAM system showing a robot's path and the observed landmarks. Taken from [8].

1.1 Visual SLAM

In the beginning, environment reconstruction using only 2D cameras as sensors was only known as Structure from Motion (SfM) - a process which is intrinsically offline, i.e. the calculations are performed only after all measurements have already been acquired. One of the early applications of online SfM (i.e. SLAM) in the Computer Vision community was proposed by [2], which made the real-time localization of robots using only a single camera possible. This technique of monocular visual SLAM is also suitable for tracking, e.g. in the context of Augmented Reality (AR). While AR applications are usually based on pre-defined scenes to recognize and annotations to display, the visual appearance at run time may differ significantly. This may be due to changes in perspective when the user moves around once a pre-defined scene has been recognized correctly. In this case, scene recognition is to be followed by a tracking procedure to maintain a valid state of AR annotations and to guide future scene recognition. The PTAM system [4, 5] (Parallel Tracking and Mapping) was specifically designed for this purpose: SLAM for small AR workspaces. Their approach is the present paper's basis, and our adaptation of it will be extensively discussed in the following sections. One major limitation of PTAM is that accuracy is decreased in dynamic environments, i.e. when objects are moving or the illumination changes. In the outlook we will discuss how to incorporate Robust Dynamic SLAM [13] to overcome those limitations.

2 Our approach

Our system is similar to PTAM [4] and shares the characteristic design choice of separating tracking and mapping into two separate modules, namely Tracker and Mapper (cf. Figure 2). The pipeline displayed in said figure is described in the following. A second design choice is also inherited, namely the use of keyframes to derive 3D information for the map construction. Keyframes are camera images which are characteristic for the scene - i.e. while an image taken during a fast camera movement would yield little useful information, an image showing substantially new information compared to the previous images is desired to be selected as a keyframe. The map consists of a point cloud, and localization is done by finding projections of those 3D points in the current camera image.
We developed with C++ and OpenCV (version 2.4.8), which reduces the porting effort significantly since this library is available for all our targeted platforms (e.g. Android).

Fig. 2. Our system's processing pipeline consisting of two separate modules. The Tracker calls the Mapper whenever tracking is considered to be stable. The Mapper then decides if a new keyframe should be added and, if so, the current camera image is used for a new triangulation. Besides that, the Mapper refines the existing map when it is in an idle state.

2.1 Camera calibration

To account for camera distortions, i.e. deviations from the pinhole camera abstraction, it is advisable to compute the so-called camera intrinsics and incorporate them into the calculations. OpenCV's built-in functionality is used here to obtain the intrinsics as a 3x3 matrix denoted by K and five distortion coefficients, using multiple images of a chessboard pattern with given dimensions (cf. Figure 3).

Fig. 3. One of the 12 images of a chessboard pattern used to calculate radial and tangential distortion of an image taken with the Samsung Galaxy S2. The left side shows the image with the detected corners marked. The right side shows the corrected image.

2.2 Feature detection and matching

The 3D map points are obtained by triangulation of 2D points found in the camera images. For this purpose we need a point feature detector and a point feature descriptor to match the found features. PTAM uses FAST [11] as point features and matches them by comparing the zero-mean SSD of 8x8-pixel patches. An extensive evaluation of different point feature detectors and descriptors for visual SLAM was performed by [9], and her conclusion was to use BRISK or ORB [12], so we are using the latter. Feature matching is done by brute-force matching of ORB descriptors.

2.3 Triangulation and Initialization

One key technique used in the proposed approach is triangulation, which denotes the process of deriving 3D points from a set of 2D point matches obtained from two camera images with sufficient baseline. The first step is to find the Fundamental Matrix F, which maps one point in the first image to a line in the second image (cf. Figure 4).

Fig. 4. The epipolar constraint: A point in one view (XL) can be projected into the other view as a ray (eR XR). Taken from https://en.wikipedia.org/wiki/File:Epipolar_geometry.svg

We use RANSAC for this - the distance threshold used for the epipolar inliers is 3 pixels and the confidence threshold is 99%. Second, the Essential Matrix is computed as E = K^T F K, and the rotation and translation matrices are derived by singular value decomposition. There are two solutions each due to projective ambiguity - R1, R2, t1, t2 - so we assume the first camera's projection matrix to be P = I and the second one's to be P'_{i,j} = K[R_i | t_j]. The triangulation is done by constructing a system of linear equations from the fact that x = P X and x' = P'_{i,j} X and solving it as discussed in [3]. Finally, we test which of the four possible projection matrices P'_{i,j} leads to a triangulation with more than 60% of all points in front of both cameras. This step is necessary because there is only one camera pose with all triangulated points lying in front of both cameras. In reality, even the correct pose will have some points violating this constraint due to numerical or measurement errors, so we use a 60% threshold instead of 100%. Points which are not in front of both cameras are removed and the remaining ones are returned as the triangulation's result.
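A condensed sketch of this two-view geometry is given below. It assumes OpenCV 3+ for cv::decomposeEssentialMat (the project itself used 2.4.8, where the decomposition would be done manually via SVD); the thresholds mirror the ones stated in the text, and the function signature is illustrative.

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Hedged sketch of Section 2.3: estimate F with RANSAC, derive E = K^T F K,
// test the four pose hypotheses and keep the triangulation passing the 60% cheirality test.
bool initializeMap(const std::vector<cv::Point2f>& pts1,
                   const std::vector<cv::Point2f>& pts2,
                   const cv::Mat& K,                        // 3x3 intrinsics, CV_64F
                   cv::Mat& R_best, cv::Mat& t_best,
                   std::vector<cv::Point3d>& mapPoints)
{
    cv::Mat F = cv::findFundamentalMat(pts1, pts2, cv::FM_RANSAC, 3.0, 0.99);
    if (F.empty()) return false;
    cv::Mat E = K.t() * F * K;                              // E = K^T F K

    cv::Mat R1, R2, t;
    cv::decomposeEssentialMat(E, R1, R2, t);                // translation known up to sign
    cv::Mat P1 = K * cv::Mat::eye(3, 4, CV_64F);            // first camera: K[I|0]

    const cv::Mat Rs[4] = { R1, R1, R2, R2 };
    const cv::Mat ts[4] = { t, -t, t, -t };
    for (int h = 0; h < 4; ++h) {
        cv::Mat Rt, X4;
        cv::hconcat(Rs[h], ts[h], Rt);
        cv::Mat P2 = K * Rt;
        cv::triangulatePoints(P1, P2, pts1, pts2, X4);       // 4xN homogeneous points
        X4.convertTo(X4, CV_64F);

        std::vector<cv::Point3d> pts;
        int inFront = 0;
        for (int i = 0; i < X4.cols; ++i) {
            cv::Mat x = X4.col(i) / X4.at<double>(3, i);     // dehomogenize
            cv::Mat xc2 = Rs[h] * x.rowRange(0, 3) + ts[h];  // same point in the second camera frame
            if (x.at<double>(2) > 0 && xc2.at<double>(2) > 0) {
                ++inFront;
                pts.push_back(cv::Point3d(x.at<double>(0), x.at<double>(1), x.at<double>(2)));
            }
        }
        if (inFront > 0.6 * X4.cols) {                       // cheirality test: >60% in front
            R_best = Rs[h]; t_best = ts[h]; mapPoints = pts;
            return true;
        }
    }
    return false;
}
```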
This process of triangulation is first done in the initialization: the user is advised to hold the camera at the scene to be tracked and then smoothly offset it in a translational manner, creating a small parallax movement. Then, one image prior to this offset and one after it are used for triangulation by matching point features (cf. Figure 5) and performing the procedure explained above.

Fig. 5. Parallax movement is needed for map initialization. The matches between two images of the same scene with a small translational movement are used for triangulation.

Fig. 6. The point cloud derived from the initialization pictured in Figure 5. For illustration purposes the number of detected feature points was vastly increased so the scene is recognizable. Usually, a much smaller number of matches is needed.

Each triangulated 3D point is added to the map and is assigned a feature descriptor calculated from the first of the two images. Finally, said first image is also added to the map as its first keyframe. Each keyframe is also assigned its camera pose (in the case of initialization this is the identity matrix).

2.4 Camera pose estimation

Once a map of 3D points is given, the current camera pose can be obtained by projecting the points into the image based on a pose assumption imposed by previous camera poses and a motion model (the use of a stationary motion model has shown acceptable results). This is done by Perspective-N-Point. We use EPnP [6], which reduces the problem to O(n), with n being the number of map points.

Fig. 7. Matches of map points and the current camera image's feature points. To visualize the map points, they are projected into the keyframe they were constructed from. On the left you see two exemplary keyframes with the map points drawn in blue and the lines indicating matches between map points and image features of the current camera image on the right. Those matches of 3D to 2D points are then used for pose reconstruction with a Perspective-N-Point algorithm.

2.5 Adding a new keyframe

Whenever tracking is assumed to be good, the Tracker calls the Mapper with the current camera frame and the corresponding pose estimation. The Mapper then decides if the camera image should be added as a keyframe based on the last added keyframe (they have to be at least 20 frames apart). To add the new keyframe, the spatially closest already existing keyframe in the map is determined (the Tracker's pose assumption and each stored pose are compared) and a new triangulation is done with those two images (cf. Section 2.3).

2.6 Map refinement

The map has to be refined to remove spurious triangulations and to fuse triangulated points which are actually different measurements of one identical landmark. PTAM uses Levenberg-Marquardt [7] bundle adjustment for this. Currently, the integration of Levenberg-Marquardt in our system is a work in progress, and the given results are without bundle adjustment but with a simplistic refinement approach using non-minimum suppression: whenever two map points are closer than a given threshold, the one with the higher reprojection error assigned in the triangulation is removed.
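A small sketch of this simplistic refinement is shown below; the `MapPoint` structure with a stored reprojection error is a hypothetical stand-in for the actual map representation.

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Hypothetical map point: 3-D position plus the reprojection error from triangulation.
struct MapPoint {
    cv::Point3f position;
    float reprojError;
};

// Non-minimum suppression as described in Section 2.6: whenever two map points are
// closer than minDistance, drop the one with the higher reprojection error.
void refineMap(std::vector<MapPoint>& map, float minDistance)
{
    std::vector<bool> removed(map.size(), false);
    for (size_t i = 0; i < map.size(); ++i) {
        if (removed[i]) continue;
        for (size_t j = i + 1; j < map.size(); ++j) {
            if (removed[j]) continue;
            cv::Point3f d = map[i].position - map[j].position;
            if (d.dot(d) < minDistance * minDistance) {
                // keep the point with the lower reprojection error
                if (map[i].reprojError > map[j].reprojError) { removed[i] = true; break; }
                removed[j] = true;
            }
        }
    }
    std::vector<MapPoint> kept;
    for (size_t i = 0; i < map.size(); ++i)
        if (!removed[i]) kept.push_back(map[i]);
    map.swap(kept);
}
```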
3 Handling of dynamic scenes

RDSLAM [13] improves monocular SLAM by incorporating two major enhancements: occlusion handling during the projection of map points into the current camera image, and a prior-adaptive RANSAC algorithm for pose estimation. Occlusion handling works as follows: the current visual appearance of each map point whose projection lies in the current camera image is compared with its stored descriptor. If they differ significantly, this means either a) the map point is occluded, or b) the map point is invalid due to changes in the (dynamic) environment. In the first case, the map point should be excluded from the pose estimation but it should stay in the map. In the latter, it should be removed permanently. The distinction whether a) or b) holds is done by evaluating neighbouring features (a sketch of this decision procedure is given at the end of this section):

1. Denote the map point by X and its projection in the image by x. Collect all currently tracked map points whose projection into the image is less than 20 pixels away from x and call this set the neighbourhood.
2. If the neighbourhood is empty, this is an indicator for X being occluded by a moving object, since a moving object's feature points are considered as outliers during tracking. In this case we keep the map point.
3. If the neighbourhood contains points x' with their respective 3D points X', and X is closer to the camera than all X', the map point is not occluded. Therefore it is safe to assume that the point became invalid due to changes in the environment, and we remove the map point.
4. If there are some neighbouring points whose 3D point X' is closer to the camera than X, there are two possible reasons for this: a) X' belongs to an object which occludes X, or b) X' appears due to changes in the environment and X is actually invalid. To distinguish the two cases, X and X' are projected into the keyframe X was constructed from.
   - If those projected points are distant, they do not belong to the same object, hence a) is the case and we keep the map point.
   - If they are close to each other, b) is the case and we remove the map point.

The second technique Tan et al. propose is PARSAC, a prior-based adaptive RANSAC which enforces an evenly distributed sampling and uses a weighting scheme based on the results from the previous frame.
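The following is a minimal sketch of the occlusion-vs-invalidation decision above; the `TrackedPoint` structure, the radii, and the way the source keyframe's projection matrix is passed in are illustrative assumptions, not RDSLAM's actual data structures.

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Hypothetical per-point data: 3-D position, depth in the current camera frame
// and projection into the current image.
struct TrackedPoint {
    cv::Point3d position;
    double depth;
    cv::Point2f projection;
};

static cv::Point2f projectWith(const cv::Mat& P, const cv::Point3d& X)   // P: 3x4, CV_64F
{
    cv::Mat x = P * (cv::Mat_<double>(4, 1) << X.x, X.y, X.z, 1.0);
    return cv::Point2f(float(x.at<double>(0) / x.at<double>(2)),
                       float(x.at<double>(1) / x.at<double>(2)));
}

// Decision rule of Section 3: true means "remove the map point" (case b),
// false means "keep it, it is only occluded" (case a). Radii are illustrative.
bool shouldRemove(const TrackedPoint& X, const std::vector<TrackedPoint>& tracked,
                  const cv::Mat& sourceKeyframeP,       // projection matrix of X's source keyframe
                  float neighbourRadius = 20.f, float sameObjectRadius = 10.f)
{
    bool hasNeighbour = false, hasCloser = false;
    cv::Point2f xKf = projectWith(sourceKeyframeP, X.position);
    for (const TrackedPoint& Xp : tracked) {
        cv::Point2f d = Xp.projection - X.projection;
        if (d.dot(d) > neighbourRadius * neighbourRadius) continue;   // not in the neighbourhood
        hasNeighbour = true;
        if (Xp.depth < X.depth) {
            hasCloser = true;
            cv::Point2f k = projectWith(sourceKeyframeP, Xp.position) - xKf;
            if (k.dot(k) < sameObjectRadius * sameObjectRadius)
                return true;                    // step 4b: same structure, X has become invalid
        }
    }
    if (!hasNeighbour) return false;            // step 2: probably occluded by a moving object
    if (!hasCloser)    return true;             // step 3: X lies in front of all neighbours
    return false;                               // step 4a: a closer occluding object, keep X
}
```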
4 Results and Conclusion

The proposed system continuously tracks and adds new keyframes (cf. Figure 8). As already mentioned, the tracking accuracy suffers from dynamic changes, so the next step would be to incorporate the techniques discussed in Section 3 (cf. Figure 9). Additionally, the selection of distinctive keyframes could be improved further, e.g. by accumulating feature measurements from several subsequent frames. Thirdly, panoramic movement is likely to occur in AR environments (especially when the user wears an HMD and often pans his head) but can break the map since there is no baseline for triangulation. Pirchheim et al. [10] propose a solution for this problem.

Fig. 8. Screenshots of our SLAM system. The blue dots are the 3D map points projected into the image. Compare the left and right image to see that map points got added due to the addition of a new keyframe.

Fig. 9. Projection of map points in case of occlusions (left) or changes in the environment (right). In the right scene, the red box indicates points which have become invalidated since the structure they originated from (in this case they were constructed from the letters at the top of the book) is removed. The green box contains scene points which are not observable from the current position but are most likely still valid since the original structure (the title of the lying book) is untouched. The goal is to remove the points in the red box from the map and to maintain those in the green one.

References

1. Gabriele Bleser. Towards Visual-Inertial SLAM for Mobile Augmented Reality. PhD thesis, Technical University Kaiserslautern, 2009.
2. Andrew J. Davison, Ian D. Reid, Nicholas D. Molton, and Olivier Stasse. MonoSLAM: real-time single camera SLAM. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6):1052–67, June 2007.
3. R. I. Hartley and Peter Sturm. Triangulation. Computer Vision and Image Understanding, 68(2):146–157, 1997.
4. Georg Klein and David Murray. Parallel Tracking and Mapping for Small AR Workspaces. In 2007 6th IEEE and ACM International Symposium on Mixed and Augmented Reality, pages 1–10. IEEE, November 2007.
5. Georg Klein and David Murray. Improving the Agility of Keyframe-Based SLAM. In Proceedings of the European Conference on Computer Vision (ECCV) 2008, 2008.
6. Vincent Lepetit, Francesc Moreno-Noguer, and Pascal Fua. EPnP: An Accurate O(n) Solution to the PnP Problem. International Journal of Computer Vision, 81(2):155–166, July 2008.
7. Donald W. Marquardt. An algorithm for least-squares estimation of nonlinear parameters. Journal of the Society for Industrial & Applied Mathematics, 11(2):431–441, 1963.
8. Michael Montemerlo. FastSLAM: A factored solution to the simultaneous localization and mapping problem with unknown data association. PhD thesis, 2003.
9. Zhen Peng. Efficient matching of robust features for embedded SLAM. Diploma thesis, University of Stuttgart, 2012.
10. Christian Pirchheim, Dieter Schmalstieg, and Gerhard Reitmayr. Handling Pure Camera Rotation in Keyframe-Based SLAM. In IEEE International Symposium on Mixed and Augmented Reality, pages 229–238, 2013.
11. Edward Rosten and Tom Drummond. Machine Learning for High-Speed Corner Detection. In Aleš Leonardis, Horst Bischof, and Axel Pinz, editors, Computer Vision - ECCV 2006, pages 430–443, 2006.
12. Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. ORB: An efficient alternative to SIFT or SURF. In 2011 International Conference on Computer Vision, pages 2564–2571. IEEE, November 2011.
13. Wei Tan, Haomin Liu, Zilong Dong, Guofeng Zhang, and Hujun Bao. Robust Monocular SLAM in Dynamic Environments. In IEEE International Symposium on Mixed and Augmented Reality, pages 209–218, 2013.

Visual Tracking Decomposition

Mohammed Anwar (anwarower@gmail.com) and Alain Pagani (alain.pagani@dfki.de)

Abstract. In this work we present an explanation of the Visual Tracking Decomposition methodology given in [8], digging further into the mathematical concepts behind its proposed approach. We explain the theory of basic Bayesian Tracking and some useful mathematical concepts of the paper. We give an idea of Principal Component Analysis, the Diffusion Distance and Markov Chain Monte Carlo, and explain in detail how each of these concepts is applied in the paper. In conclusion, this work is meant to serve as supplementary material for understanding the paper.

Keywords: Visual Tracking Decomposition, SPCA, Markov Chain Monte Carlo

1 Introduction

Object tracking is a problem that remains imperfectly solved in Computer Vision [10]. Researchers have recently tackled real-world scenarios rather than confining themselves to the lab environment.
There are numerous difficulties encountered in visual tracking. The intractability of tracking an object emerges from severe changes in its appearance or motion. The domain of these changes includes pose, illumination and occlusion, in addition to abrupt motion due to a low video frame rate. This problem is tackled by the given paper. Here, a novel method is proposed, based on the concept of decomposing the process of tracking into several basic models, allowing them to account for various motion assumptions. The results of the basic models are then combined, offering a highly complex visual tracker.

2 Mathematical Background

2.1 Bayesian Tracking

The Bayes filter is the backbone of various probability-based trackers. It is used to model and stochastically predict the states of a model in a dynamic system based on the control input to the model and the sensor readings from the surroundings. Since we are speaking about visual tracking, the control input plays no role in our considerations. The Bayes filter uses the concept of a complete state based on the Markov assumption, which means that the previous state is always considered a well-composed summary of the history up to its point in time. This drops the need for keeping track of past inputs and measurements.

One cannot discuss Bayesian tracking without giving a notion of a state. A state of a given object subject to tracking is a collection of all aspects relating it to the environment, such as position, scale, velocity, etc. [15]. We denote the state of an object at a given time t by X_t. There is usually a spectrum of interaction between the object under tracking and its surrounding environment. An important form of this interaction is the observational data; these observations are used to evaluate the correctness of the current belief. We denote the measurements observed at time t by Y_t. The Bayesian Tracking procedure can be formalized for visual tracking by the following equation [15]:

p(X_t \mid Y_{1:t}) \propto p(Y_t \mid X_t) \int p(X_t \mid X_{t-1})\, p(X_{t-1} \mid Y_{1:t-1})\, dX_{t-1}   (1)

It is quite obvious that this equation has a recursive form: the term p(X_{t-1} \mid Y_{1:t-1}) is the belief one time step back. An explanation of the remaining terms is provided in Section 2.1.1.

2.1.1 Bayesian Tracking Components

The Bayesian Tracking process is composed of several models that work collaboratively to estimate the state of the tracked object. We will explicitly mention them below to demonstrate how they basically work. The explanation of the approach deployed in the paper will be analogous to this basic approach.

Motion Model. The motion model is a prediction tool. It assumes a certain motion pattern and uses it to give an estimate of the future state given the current one. Examples are the constant velocity motion model and the constant acceleration motion model. In equation (1), the motion model is p(X_t \mid X_{t-1}): X_t is the predicted state given a current state X_{t-1}.

Observation Model. The observation model has basically two components: the object under tracking and the observation measurements. The object defines the aspects of the tracked target, including shape, appearance, etc. The observation measurements serve to correct the current belief. Examples are taking sensor measurements, taking camera shots, or comparing image patches as in visual tracking. In (1), the observation model is the term p(Y_t \mid X_t). It means: given an estimated state X_t, what is the probability of getting the measurement record Y_t? The consistency between Y_t and the estimated state evaluates the tracking accuracy.
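To make the recursion in (1) concrete, the following is a minimal particle-style sketch with a constant-position motion model diffused by Gaussian noise and a generic patch-similarity likelihood; all names, the diffusion, and the template-matching observation model are illustrative choices, not the paper's concrete models, and resampling is omitted for brevity.

```cpp
#include <opencv2/opencv.hpp>
#include <algorithm>
#include <vector>

// Minimal particle representation of the belief p(X_t | Y_1:t): each particle is a
// candidate state (here simply a 2-D position) with a weight.
struct Particle { cv::Point2f state; double weight; };

// One Bayes-filter step for eq. (1): predict with the motion model p(X_t | X_t-1)
// and re-weight with the observation model p(Y_t | X_t).
void bayesStep(std::vector<Particle>& particles,
               const cv::Mat& frame, const cv::Mat& templatePatch,   // same type, e.g. CV_8UC1
               cv::RNG& rng, double motionSigma = 5.0)
{
    double sum = 0.0;
    for (Particle& p : particles) {
        // motion model: diffuse the previous state (constant position + Gaussian noise)
        p.state.x += (float)rng.gaussian(motionSigma);
        p.state.y += (float)rng.gaussian(motionSigma);

        // observation model: compare the predicted patch with the object template
        cv::Rect roi(cv::Point(p.state) - cv::Point(templatePatch.cols / 2, templatePatch.rows / 2),
                     templatePatch.size());
        double likelihood = 1e-6;
        if ((roi & cv::Rect(0, 0, frame.cols, frame.rows)) == roi) {
            cv::Mat score;
            cv::matchTemplate(frame(roi), templatePatch, score, cv::TM_CCOEFF_NORMED);
            likelihood = std::max(1e-6, 0.5 * (score.at<float>(0, 0) + 1.0));  // map [-1,1] to (0,1]
        }
        p.weight *= likelihood;
        sum += p.weight;
    }
    for (Particle& p : particles) p.weight /= sum;   // normalize the posterior
}
```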
2.1.2 Bayesian Tracking Example

To project the previously mentioned concepts onto a concrete example, let us now take a look at Figure 1. In (a) we have a given frame from a football game. We need to pick some object to be our subject of tracking; this object is selected in (b) to be the ball. In this sense we can say that the object model is a circle with the football texture, etc. The state in this case is just a 2-dimensional position. Normally, since the frames are projected from a 3-D animated world, linear motions would lose their linearity. However, for the sake of simplicity, we will assume that we have a linear motion model that was estimated from previous frames to point in the direction from the player's foot to the goal. Our motion model in this sense assumes that the ball will continue its path with the same velocity. The uncertainty is typically embedded in the system by exploiting a Gaussian distribution centered at the predicted new state. Now what happens from (b) to (c) is that we use the motion model to predict the new state (dashed boundaries), and then we correct our assumption using the observation model. Here, the observation model measures the similarity/dissimilarity between the patch described by the predicted state and the template we have in the object model (see (c)).

Fig. 1. A simplified example capturing the aspects of Bayesian Tracking in application: (a) initial frame; (b) registering the required object for tracking, where the template defined by the red boundaries is in this case the object model; (c) using the motion model to predict the next state (dashed boundaries), then measuring the correctness of the predicted state using the observation model.

2.2 Diffusion Distance

The Diffusion Distance [9] is an approach to measuring distances between histogram-based local descriptors. Histogram-based local descriptors are generally susceptible to deformations, illumination and noise. Normally, the distance between descriptors is measured through a bin-to-bin correspondence, which means that differences are computed between corresponding bins with the same quantization value ranges. The distances used here can be the Euclidean distance, the Kullback-Leibler divergence or various others. These methods are, nevertheless, sensitive to effects that take place globally. For example, if an object is moving in an image sequence and two frames are being compared, a large distance could be estimated although the same object is being depicted: the data is translated into a different local region, and during the comparison some regions in the environment will not be considered. To tackle this problem, a cross-bin approach can be used to compare bins that could be found in different places.

The Diffusion Distance offers a cross-bin comparison for multi-dimensional histogram-based local descriptors. The difference between the histograms is viewed as a temperature field, and a diffusion through this field is modelled. Afterwards, a norm of the diffusion field is integrated over time, offering a dissimilarity measure between the histograms. A Gaussian pyramid is used for this diffusion process; the Gaussian filter smoothens the data and reduces the dimension. For the diffusion distance, two n-dimensional histograms h_1(x) and h_2(x) are observed, where x ∈ R^n. The Diffusion Distance is thus defined through

K(h_1, h_2) = \sum_{l=0}^{L} k(|d_l(x)|)   (2)

where

d_0(x) = h_1(x) - h_2(x)   (3)

d_l(x) = [d_{l-1}(x) * \phi(x, \sigma)] \downarrow_2, \quad l = 1, \dots, L   (4)

Here L is the number of levels of the Gaussian pyramid, [\cdot]\downarrow_2 denotes downsampling by half, and \phi is a Gaussian filter with standard deviation \sigma. k(\cdot) is the metric by which the norm of the difference histograms is calculated; it is defined in [9] to be the L1-norm,

\|x\|_1 = \sum_{i=1}^{n} |x_i|   (5)

The calculation of the Diffusion Distance has a complexity that is linear in the number of histogram bins to be compared. As the dimension of the Gaussian pyramid decreases exponentially and only a small Gaussian filter is used, the convolution is carried out in linear time for the L levels.
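Since cv::pyrDown performs exactly the Gaussian smoothing plus downsampling by two of equation (4), the diffusion distance can be sketched in a few lines; treating the histograms as 2-D float arrays (cv::Mat) is an assumption made here for illustration.

```cpp
#include <opencv2/opencv.hpp>

// Hedged sketch of the diffusion distance (eqs. (2)-(5)) between two
// multi-dimensional histograms given as 2-D arrays of equal size.
double diffusionDistance(const cv::Mat& h1, const cv::Mat& h2, int levels = 5)
{
    cv::Mat d;
    cv::subtract(h1, h2, d, cv::noArray(), CV_32F);     // d_0 = h1 - h2, eq. (3)
    double K = cv::norm(d, cv::NORM_L1);                // k(|d_0|), L1 norm as in eq. (5)
    for (int l = 1; l <= levels; ++l) {
        if (d.rows < 2 || d.cols < 2) break;            // pyramid cannot shrink further
        cv::Mat next;
        cv::pyrDown(d, next);                           // Gaussian filter + downsample by 2, eq. (4)
        d = next;
        K += cv::norm(d, cv::NORM_L1);                  // accumulate eq. (2)
    }
    return K;
}
```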
2.3 Principal Component Analysis

Principal Component Analysis (PCA) is a powerful mathematical tool that is used in this paper in the form of its extension, SPCA. PCA is used when dealing with multivariate data, offering a means of dimensionality reduction with minimal loss of information. In Figure 5, we are confronted with a view projection problem from 3D to 2D. The different views illustrate different axes of projection, and that explains why the top view is the most illustratively helpful: its axes of projection are aligned with the two directions of maximum data variation (in this case, the direction between the two most distant points on the teapot's body). This results in conveying the largest possible amount of information about the data.

Finding the principal components of a set of data can be expressed as finding a new basis for representing it. To be able to smoothly use linear algebra, we assume linearity and orthogonality of the principal components. The most principal component of the data is the direction of the highest degree of variation. That constrains the next most important principal component to lie on an axis perpendicular to it; among those, the second most principal component is again the one showing the highest variance of the data, and so on. PCA thus eliminates the redundancy in the given set, allowing us to express it in a simpler way. The redundancy in a given set is reflected in a statistical quantity, namely the covariance between its attributes. If we have an attribute that is closely covarying with another one, then we only need the behavior of one of them to deduce the other. This is done by expressing the covarying attributes as linear combinations of each other. Eventually, we can say that we have eliminated a portion of the redundancy in our data set (see Figure 2).

2.3.1 Mathematical Background of PCA

We have mentioned formerly that using PCA is a matter of choosing a new basis for the given dataset. Now we will give a glimpse of the mathematics behind it. Let us denote a dataset by X, where X is a matrix obtained by stacking the data instances next to each other, so that each column corresponds to one data instance.

Fig. 2. A range of possible redundancies in data between the two variables r1 and r2. The dashed line is a best fit obtained by r2 = k r1 [13].

The problem of finding a new basis is then finding a matrix P that projects the data to Y:

Y = P X   (6)

Equally important, we need this new image of the data Y to exhibit the best possible covariance structure. Insight into the covariance is gained by building the covariance matrix.
Assuming that the data X has zero mean, the covariance matrix is derived by

C = Y Y^T   (7)

In the covariance matrix C, an element c_{ij} is the dot product of all the values of a given data dimension i and the respective values of another given dimension j; that is the covariance between the i-th dimension and the j-th dimension. Two dimensions that do not covary at all have a covariance of 0. A diagonal entry c_{ii}, however, represents the variance along the dimension i. Thus, an idealized covariance matrix would be one where all off-diagonal entries are 0. Now we can formally state our problem, which is picking P so that C is diagonalized. We proceed by substituting (6) into (7), so we get

C = (P X)(P X)^T   (8)

C = (P X)(X^T P^T)   (9)

C = P (X X^T) P^T = P Q P^T   (10)

From linear algebra, we know that the matrix Q is symmetric and decomposable into

Q = A \Sigma A^T   (11)

where \Sigma is a diagonal matrix and A is an orthogonal matrix of Q's eigenvectors. That yields

C = P A \Sigma A^T P^T   (12)

Since we only want to keep the diagonal structure of \Sigma, a smart solution is to neutralize A with P, namely by choosing P to be A^T (up to a scale):

C = (A^T A) \Sigma (A^T A)   (13)

which finally gives us the diagonalized matrix

C = I \Sigma I   (14)

A key fact here is that the inverse of an orthogonal matrix is equal to its transpose [14]. In conclusion, we can see that the principal components of a dataset are derived by calculating the eigenvectors of its covariance matrix. The method discussed here is just one way of calculating principal components; another technique uses Singular Value Decomposition [17]. An informative step-by-step tutorial about PCA can be found in [13].

2.3.2 Sparse Component Analysis

Although applying PCA can significantly reduce the dimensionality of a given dataset, it still suffers from a big disadvantage: most of the time, the principal components are linear combinations of all the variables. In many cases, it is beneficial to sacrifice a degree of data accuracy in order to describe each principal component in terms of a few variables. That translates into axes in which only a few entries are non-zero. The extension (SPCA) used in this paper solves this problem. It adds another constraint to the formal problem of PCA, namely that the cardinality (number of non-zero entries) of the components should be minimized as far as possible to enable a better conveyance of the physical properties of the system. Consequently, this offers the chance to use less space for storing the data, leading to the desired degree of compactness. The formal problem is then

maximize   x^T A x - \rho\,\mathrm{Card}^2(x)   (15)

subject to   x > 0   (16)

Here, A is the covariance matrix of the data. The cardinality function Card expresses the number of non-zero entries. \rho is a factor that reflects our preference between the accuracy and the sparsity of the components. This problem is in fact NP-hard [11]. It is beyond the scope of this work to explicitly derive the solution to the problem; one of the approaches is offered in [4]. The solution relies on a class of mathematical approximation called convex optimization [2]; the same optimization tools are used in the subject paper. An alternative solution to the problem is Iterative Elimination: there, variables are recursively eliminated according to a given criterion that targets minimizing the loss of explained variance, and the sparse principal component analysis problem is reconsidered over and over until the desired sparsity is achieved [17]. The application of SPCA is sometimes very powerful and offers very compact solutions; for example, it could reduce the number of involved variables from 533 to 14 in [4].

2.4 Markov Chain Monte Carlo

2.4.1 Markov Chains

A Markov Chain is a sequence of random variables X^{(0)}, \dots, X^{(M)} fulfilling the following condition for m \in \{1, \dots, M-1\}:

p(x^{(m+1)} \mid x^{(1)}, \dots, x^{(m)}) = p(x^{(m+1)} \mid x^{(m)})   (17)

This means in words that the next state depends only on the previous state, regardless of how the history before the previous state looked. This is called the Markov assumption [1].

Fig. 3. A grasshopper's right and left hops can be modelled as a Markov Chain. [6]

In Figure 3, we have a grasshopper that is taking steps either to the right or to the left. It is initially starting at state 0. The decision of whether to hop right or left depends only on the current state. In this sense we can see that this behaviour can be modelled by a Markov Chain.

2.4.2 Markov Chain Monte-Carlo Methods

One of the strengths of Markov Chains is that they can be used to sample from distributions that are actually intractable to sample from directly [6]. The algorithms using this tool are called Markov Chain Monte Carlo methods. To sample a target distribution, an MCMC algorithm uses a Markov Chain that has the same distribution. It is worthwhile to observe in Figure 3 how the probability distribution spreads out in each step: initially we have a probability of 1 of being at state 0, afterwards the distribution is {0.5, 0.25, 0.25} as indicated in the graph, and further on the distribution keeps getting divided over new states as the grasshopper moves. That continuous change in the Markov Chain's distribution is, however, not desirable if we need it to simulate a fixed target distribution, which implies that it should at least keep the same distribution over time. This poses a requirement on the Markov Chain used, namely that the probability that a given state x_i is the current one stays the same over time. Luckily, Markov Chains have the ability to do that when they exhibit stationary behaviour: in this case, as time goes by, the transition probabilities between states converge to a certain limit. Let us make that more formal. Given a Markov Chain with a state X and a transition matrix T, the following equation holds:

p(X^{(t+1)} = x') = \sum_x P(X^{(t)} = x)\, T(x \to x')   (18)

This equation governs the dynamics of the chain. It means that the probability of being at state x' at time t+1 is equal to the sum of the probabilities of all paths leading to x'. Each path leading to x' is composed of the joint probability of being at a given state and of the transition from this state to x'. We denote the stationary distribution by \pi; the probability of having state x' is accordingly given by \pi(x'). If the system is in the stationary state, this implies that

p(X^{(t+1)} = x') = p(X^{(t)} = x') = \sum_x P(X^{(t)} = x)\, T(x \to x')   (19)

\therefore \quad \pi(x') = \sum_x \pi(x)\, T(x \to x')   (20)

Arriving at this stationary state is, nevertheless, not guaranteed for all Markov Chains; therefore we have to ensure this by performing a regularity test. If from any given state in a Markov Chain all the other states are reachable, including itself (through a self-loop), we can deduce that it is regular [6]. The regularity of a Markov Chain in turn ensures that it has a stationary distribution that is also unique, regardless of the initial state.
2.4.3 Metropolis-Hastings Algorithm [6] The objective of an MCMC method is to sample from a target distribution using a Markov Chain that has the desired stationary distribution [6]. The actual implementation, however, depends on the field of application. The Metropolis-Hastings algorithm is one of the MCMC algorithms. The pivot of this algorithm is the reversibility of a given chain. To understand what a reversible chain means, let us observe figure 4, which depicts a regular Markov Chain. Let x and x′ be two arbitrary states. The probability of going from state x to state x′ is the joint probability of being at state x (given by π(x)) and of then traversing to state x′ (given by T(x → x′)); in the graph this corresponds to traversing the red edge. Similarly, going from state x′ to x corresponds to traversing the green edge. If the probabilities of traversing the red edge and the green edge are equal, we say that the transfer between x and x′ is reversible. That is:

π(x) T(x → x′) = π(x′) T(x′ → x)    (21)

Since x and x′ are arbitrary, we can generalize this to any two states in the chain, which makes it a reversible chain. This finding is very useful: if a regular chain is reversible with respect to a given distribution π, then this distribution is in fact its unique stationary distribution. That tells us whether a given Markov Chain will really converge to our target probability distribution. The proof of this property is as follows. Starting from (21), we sum both sides over x:

Σ_x π(x) T(x → x′) = Σ_x π(x′) T(x′ → x)    (22)

π(x′) is constant with respect to x, so we can pull it out of the summation:

Σ_x π(x) T(x → x′) = π(x′) Σ_x T(x′ → x)    (23)

Since the transition probabilities out of a given state sum to one, we are left with:

Σ_x π(x) T(x → x′) = π(x′)    (24)

which is in fact the stationary state equation (20). Our objective is to sample from an intractable probability distribution. Since this is hard, we use a Markov Chain that has this distribution as its stationary distribution, and we ensure this by choosing the chain to be reversible with respect to our target distribution. The question is then: how do we actually draw samples with this Markov Chain? The answer is the main axis of the Metropolis-Hastings algorithm. The Metropolis-Hastings chain explores the state space freely according to a proposal distribution Q(x → x′). Then an evaluating agent comes into play, which judges whether a proposed next state is good enough and accordingly accepts it or not. This is an acceptance ratio A(x → x′) that corrects the proposal. Mathematically, this means decomposing the transition probability T(x → x′) into:

T(x → x′) = Q(x → x′) A(x → x′)    (25)

Plugging (25) into (21) we get:

π(x) Q(x → x′) A(x → x′) = π(x′) Q(x′ → x) A(x′ → x)    (26)

A(x → x′) / A(x′ → x) = [π(x′) Q(x′ → x)] / [π(x) Q(x → x′)]    (27)

Since only the ratio between the two acceptance probabilities matters, we can by convention always set A(x′ → x) equal to 1. Moreover, we impose the condition that the acceptance ratio must not exceed 1. This gives us:

A(x → x′) = min{1, [π(x′) Q(x′ → x)] / [π(x) Q(x → x′)]}    (28)

Fig. 4. A Markov Chain where reversibility holds.
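For illustration, the following is a minimal generic Metropolis-Hastings sampler built around the acceptance ratio (28). It is our own sketch, not the implementation used in the paper, and all names are assumptions; the example at the bottom samples a standard normal with a symmetric random-walk proposal:

```python
import numpy as np

def metropolis_hastings(log_target, proposal_sample, proposal_logpdf, x0, n_samples, rng):
    """log_target(x)           : log of the (unnormalised) target density pi(x)
       proposal_sample(x, rng) : draw x' from Q(x -> x')
       proposal_logpdf(x, xp)  : log Q(x -> x')"""
    samples, x = [], x0
    for _ in range(n_samples):
        xp = proposal_sample(x, rng)
        # acceptance ratio of equation (28), computed in log space for stability
        log_a = (log_target(xp) + proposal_logpdf(xp, x)) - \
                (log_target(x) + proposal_logpdf(x, xp))
        if np.log(rng.uniform()) < min(0.0, log_a):
            x = xp                       # accept the proposed state
        samples.append(x)                # otherwise keep the current state
    return np.array(samples)

rng = np.random.default_rng(0)
draws = metropolis_hastings(
    log_target=lambda x: -0.5 * x**2,                       # standard normal
    proposal_sample=lambda x, rng: x + rng.normal(scale=1.0),
    proposal_logpdf=lambda x, xp: 0.0,                      # symmetric: Q cancels
    x0=0.0, n_samples=5000, rng=rng)
print(draws.mean(), draws.std())                            # roughly 0 and 1
```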
2.5 Interactive Markov Chain Monte Carlo Based Tracking

Interactive Markov Chain Monte Carlo (IMCMC) is a probabilistic graphical sampling approach. It is usually used when multiple trackers are exploited, as it offers a way to integrate the knowledge between them. In [5], an object is assigned a local tracker using its local appearance and a global tracker using global features of the image. In the Visual Tracking Decomposition paper, the same concept of integration is applied, but between various local trackers; it is worth mentioning, though, that the Visual Tracking Decomposition paper applied this mathematical tool first. A detailed explanation of IMCMC can be found in [3], while another application of it can be found in [12].

3 Visual Tracking Decomposition

To alleviate the problem of abrupt appearance changes, the paper adopts a divide-and-conquer technique. Instead of a single tracker, numerous simpler trackers are used that communicate with each other. We can therefore say that the operation of the tracker follows a diamond-like workflow: the upper half of the diamond corresponds to distributing the work into smaller, numerous components, while the lower half depicts fusing the knowledge and integrating the beliefs of the basic components into the final decision. In the following we discuss how each basic tracker is built and how it works, and then how the results of the basic trackers are integrated.

3.1 Basic Tracker

As mentioned before, a tracker is defined by an observation model and a motion model. In the next subsections we will see how the observation model and the motion model of a basic tracker are formed.

3.1.1 Observation Model To accommodate the problem of severe appearance changes, several types of image features are used for each object image patch over time. The feature types used are hue, saturation, intensity and edge. For each feature type, measurements are taken for the object at each unit of time. A sample f_i^j denotes the sample at time i for feature type j, and the set of all samples f_i^j is denoted by S. The paper only poorly explains how exactly a template is expressed as a vector f; most likely the rows of each patch are stacked into a single vector. In this sense, S is the object model we are dealing with in the paper. However, since the tracker is to be decomposed into several basic ones, it follows that the object model must be distributed among them. A question now arises: how should such a distribution be performed? To answer this, let us first observe a key example that delivers an intuitive idea about the solution. In figure 5 we are confronted with four views of the same teapot. Which view offers the most details? With our human intuition, most of us will agree that the view on the top is the most informative one.

Fig. 5. Several views of the same teapot. The view on the top is illustratively the most helpful. [19]

In figure 6 we see two different views of the same dataset. In the right view, however, we can perceive an inherent property of the data that was not obvious before, namely that the data points are almost co-planar. These two examples make us wonder whether there is an objective measure that could give us the same insights; this would help us to describe the template set used for the object model efficiently. Luckily, the answer is yes. In fact, such a tool is nothing but the principal components.
Exploiting this mathematical tool in the decomposition of the object model is a main contribution of the paper.

Fig. 6. An example delivering an intuition about principal component analysis. (a) A 3-dimensional dataset. (b) This view suggests a reduction of the dimensionality of the data, since the points lie almost on the same plane. [18]

Our requirements are, however, more complicated than just reducing the dimensionality of the dataset. We need a decomposition that fulfils three conditions. First, we need to capture as wide a variation of the tracked object's appearance as possible. Second, we do not want two different basic trackers to track the same group of features, otherwise there would be no point in decomposing the object model; formally, we require the decomposed sub-models to be complementary. Third, to cope with limited resources, we would like the models to be as compact as possible. Since PCA maximizes the variance captured by the components, it achieves our first requirement. The principal components are also orthogonal, which offers the needed complementarity between the sub-models. We still need, however, a solution for the third requirement. That is why the paper relies on SPCA for building the sub-models. As discussed in subsection 2.3.2, SPCA offers components that trade off being principal against having as few non-zero entries as possible. This compromise affects how perfectly the first two conditions are fulfilled, but it enables us to achieve all three requirements, each to an acceptable degree.

The dataset used in the paper is a vector a constructed from the template set S:

a = (f_1^1 ... f_t^1 ... f_1^u ... f_t^u)    (29)

From this, a so-called Gramian matrix is formed by A = a^T a. There is an elegant trick worth highlighting here. Since we are investigating the covariance between various patches of the same dimension, it is intuitive to think of the data dimensions as the pixels of a patch. That, however, would imply taking A to be a a^T, so that each entry is the dot product of two vectors, each listing the values of one pixel across all patches. In the covariance matrix calculated by the paper, on the other hand, each element corresponds to the dot product of two vectors, each listing all pixel values of a single patch. What do we gain by abandoning the intuitive way? In fact, a lot: this way the Gramian matrix at each step has a fixed size equal to the square of the number of patches in S, whatever the dimension of a patch is. Otherwise we would have to deal with a much bigger matrix, requiring potentially much more memory, time and computation. The same idea was used before in [16], where the principal components were verified to be nearly the same in both cases. After applying SPCA and obtaining the sparse principal components of the dataset, the components are used to decompose the data into the object sub-models, each of which is then used by one basic tracker. The dataset expressed by the vector a in (29) is projected once onto each sparse principal component, each time forming one object sub-model, which we denote by M_t^i, i.e. the sub-model M built from the i-th component constructed at time t.
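The following sketch (our own code, in the spirit of the eigenface trick of [16]; all names are assumptions) illustrates why working with the small Gramian a^T a loses nothing: every eigenvector of the large matrix a a^T can be recovered from an eigenvector of the small one.

```python
import numpy as np

# 'a' stacks the flattened feature templates f_i^j as columns:
# shape (n_pixels, n_templates), with n_pixels >> n_templates.
rng = np.random.default_rng(0)
n_pixels, n_templates = 4096, 10
a = rng.standard_normal((n_pixels, n_templates))

# Small Gramian used in the paper: n_templates x n_templates
A_small = a.T @ a
w_small, u = np.linalg.eigh(A_small)

# The "intuitive" pixel covariance would be n_pixels x n_pixels. Its leading
# eigenvector is recovered as a @ u (up to normalisation), without ever
# forming the big matrix.
v = a @ u[:, -1]
v /= np.linalg.norm(v)
big_times_v = a @ (a.T @ v)                  # (a a^T) v, computed implicitly
assert np.allclose(big_times_v, w_small[-1] * v, atol=1e-8)
```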
Projecting onto a sparse component is equivalent to looking up the non-zero entries of that component and taking the corresponding features f_t^i into the resulting object sub-model. This process is illustrated in figure 7.

Fig. 7. Building object models for the basic trackers from the calculated sparse principal components. [7]

Distance between two Sets as an Observation Model For each object sub-model a corresponding observation model is assigned. The probability for a given basic observation model is then inferred from a score function between its object model M_t^i and the measurement Y_t, where Y_t is the patch taken at state X_t (the state defines position and scale). For a given object model M_t the function is:

P(Y_t | X_t) = exp(−λ DD(Y_t, M_t))    (30)

The probability decays exponentially with the distance score DD(Y_t, M_t), where M_t is the object model at time t and λ is a tuning parameter. The distance measure used here is the diffusion distance elaborated in subsection 2.2.

3.1.2 Motion Model The paper exploits two simple motion models, a smooth-motion model and an abrupt-motion model. Both are Gaussian distributed; the only difference is the variance, which is relatively small in one and relatively large in the other. The small variance corresponds to the smooth-motion assumption, the large variance to the abrupt-motion assumption; the two assumptions are depicted in figure 8. Formally, a given motion model can be expressed as:

P(X_t | X_{t−1}) = G(X_{t−1}, σ^2)    (31)

That means if the previous state was X_{t−1}, the current state X_t is Gaussian distributed, centered at X_{t−1} with standard deviation σ.

Fig. 8. Two Gaussians. The left one has a wider spread and accounts for the abrupt-motion assumption; the right one has a tighter region of confidence, suiting smooth motion.

3.1.3 State Estimation for a Basic Tracker The observation model gives us a probability distribution that decays exponentially with the dissimilarity between the object model and the current measurement. Actually exploiting such a probability distribution, however, is not easy. The difficulty lies in the continuity of the distribution and the intractability of evaluating the corresponding probability for every point of the state space. Nevertheless, there are techniques that can draw random samples which together inherently reflect the distribution: states are selected with a frequency proportional to their respective probability under the target distribution. We refer to this process as sampling. This concept is used to obtain discrete instantiations of the given probability distribution, which are then used for the final state estimate. One of the tools for achieving this sampling is the Markov Chain.

Using the Metropolis-Hastings Algorithm The Metropolis-Hastings algorithm works as follows. Initially, a random sample is picked from the proposal distribution. Then the proposal distribution suggests another sample. We evaluate the probability of this proposed value with respect to the current sample and calculate an acceptance ratio. Afterwards, a random number is drawn from a uniform distribution and the proposed sample is accepted if this random number is less than the acceptance ratio. Otherwise, the new sample is simply set equal to the current sample, and another suggested sample is tried.
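Before mapping this onto the tracker, the two ingredients defined above, the observation model (30) and the Gaussian motion model (31), can be sketched as follows. This is our own code with our own names; diffusion_distance is a stand-in for the measure of [9] and is not implemented here.

```python
import numpy as np

def observation_likelihood(Y, M, lam, diffusion_distance):
    """p(Y_t | X_t) = exp(-lam * DD(Y_t, M_t)), equation (30)."""
    return np.exp(-lam * diffusion_distance(Y, M))

def propose_next_state(X_prev, sigma, rng):
    """Gaussian motion model of equation (31): X_t ~ N(X_{t-1}, sigma^2 I).
    A small sigma encodes the smooth-motion assumption, a large sigma the
    abrupt-motion assumption; the state holds position and scale."""
    return X_prev + rng.normal(scale=sigma, size=X_prev.shape)

rng = np.random.default_rng(0)
X_prev = np.array([120.0, 80.0, 1.0])                 # x, y, scale
X_smooth = propose_next_state(X_prev, sigma=2.0, rng=rng)
X_abrupt = propose_next_state(X_prev, sigma=20.0, rng=rng)
```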
Now let us project this theoretical groundwork onto the problem in the paper. The target probability distribution we are trying to sample is the observation model p(Y_t | X_t). What could the distribution proposing the samples be? The question almost answers itself when rephrased: given a certain current state, which state is most likely to occur next? This is exactly the function of the motion model discussed in section 3.1.2, so the proposal distribution is p(X_t | X_{t−1}). The sampling here happens in the state space, which means that each sample corresponds to a state. We use the notation X_t for the current sample at time t and X_t^* for the sample proposed upon it, so we now speak of p(Y_t | X_t^*) as the target distribution and of p(X_t^* | X_t) as the proposal distribution. To obtain the acceptance ratio for a given sample, we plug these terms into the Metropolis-Hastings formula (28):

γ_parallel = min(1, [p(Y_t | X_t^*) p(X_t | X_t^*)] / [p(Y_t | X_t) p(X_t^* | X_t)])    (32)

In (32), the term p(X_t^* | X_t) is the probability of the randomly drawn sample X_t^* under the motion model, given the previous sample X_t, while the term p(Y_t | X_t^*) reflects how consistent the proposed sample is with the observation model, computed as in (30). After collecting samples for a predefined number of iterations, the best estimate is calculated using a maximum a posteriori estimation:

X_t = arg max_{X_t^(l)} p(X_t^(l) | Y_{1:t})   for l = 1, ..., N    (33)

Here X_t^(l) is the l-th sample and N is the total number of samples evaluated at time t.

3.2 Integration of Basic Trackers

Up to this point we have seen how a single MCMC-based tracker works. We now turn to another main contribution of the paper, namely using a number of these basic trackers and integrating their knowledge.

3.2.1 Forming different Basic Trackers It is worthwhile to mention how the different basic trackers are built. Up to this point, it should be clear that each object model is associated with exactly one observation model. For a given object model, different basic trackers are formed by pairing its corresponding observation model with each of the assumed motion models. As a result, if we have s observation models (one per object model) and r motion models in total, we end up with s × r basic trackers. In the paper, the number of observation models was fixed at s = 4 by taking only the 4 dominant sparse principal components from (15), and the number of motion models was r = 2, which makes 8 basic trackers in total (see Table 1).

Table 1. Creating the basic trackers by pairing each observation model with each motion model. With 2 motion models and 4 observation models, there are 8 trackers in total.

Motion \ Observation   P1(Yt|Xt)   P2(Yt|Xt)   P3(Yt|Xt)   P4(Yt|Xt)
P1(Xt|Xt-1)            T11         T21         T31         T41
P2(Xt|Xt-1)            T12         T22         T32         T42

3.2.2 Integrating Basic Models Using various basic trackers simultaneously can be regarded as decomposing P(Y_t | X_t) in (1) into:

P(Y_t | X_t) = Σ_{i=1}^{s} w_i P_i(Y_t | X_t)    (34)

where P_i(Y_t | X_t) denotes the i-th basic observation model and s is the total number of basic observation models. Similarly, P(X_t | X_{t−1}) in (1) is decomposed into:

P(X_t | X_{t−1}) = Σ_{j=1}^{r} w_j P_j(X_t | X_{t−1})    (35)

where P_j(X_t | X_{t−1}) denotes the j-th basic motion model and r is the total number of basic motion models.
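As a sketch of this step (our own code and names, not the authors' implementation), the s × r tracker grid of Table 1 and the parallel-mode acceptance ratio of equation (32) can be written as:

```python
import numpy as np
from itertools import product

def gaussian_motion_density(a, b, sigma):
    """p(a | b) under the Gaussian motion model of equation (31), up to a constant."""
    return np.exp(-0.5 * np.sum((a - b) ** 2) / sigma ** 2)

def parallel_acceptance(p_obs, sigma, X, X_star, Y):
    """gamma_parallel = min{1, [p(Y|X*) p(X|X*)] / [p(Y|X) p(X*|X)]}, equation (32).
    For this symmetric Gaussian proposal the motion terms cancel, but they are
    kept to mirror the formula."""
    num = p_obs(Y, X_star) * gaussian_motion_density(X, X_star, sigma)
    den = p_obs(Y, X) * gaussian_motion_density(X_star, X, sigma)
    return min(1.0, num / den)

# 4 observation models (one per sparse component) x 2 motion models = 8 trackers
observation_models = ["P1", "P2", "P3", "P4"]
motion_models = [("smooth", 2.0), ("abrupt", 20.0)]
basic_trackers = list(product(observation_models, motion_models))
assert len(basic_trackers) == 8
```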
Equations (34) and (35) are, however, only a means of delivering an intuition for the decomposition process: they are not explicitly evaluated or used further. The two equations are just a mirror image of another mathematical process happening at the same time; more details follow in the next subsection.

3.2.3 Using Interactive Markov Chain Monte Carlo Knowing how each basic tracker estimates the next state, let us see how the trackers integrate their knowledge. The Metropolis-Hastings algorithm mentioned before is in fact part of a bigger picture, namely Interactive Markov Chain Monte Carlo [3]. This algorithm has two modes: the parallel mode, which operates as Metropolis-Hastings, and the interactive mode, which is responsible for integrating the knowledge of the basic trackers. In the interactive mode each tracker can be influenced by its peers: in proportion to how its belief measures up against the sum of the other beliefs, each tracker has a probability of getting its state accepted by the other trackers. This means that a tracker accepts the state of another tracker T_ij, built from the i-th observation model and the j-th motion model, with an acceptance rate of:

γ_interactive = p_i(Y_t | X_t^j) / [ Σ_{i=1}^{s} Σ_{j=1}^{r} p_i(Y_t | X_t^j) ]    (36)

The acceptance of the state of a given tracker T_ij by another tracker is equivalent to doubling the weight of the i-th observation model and the j-th motion model in (34) and (35) respectively; in this way the weights in (34) and (35) are calculated. Obtaining the samples for each basic tracker as explained before, the most likely state is then determined according to (33).

4 Results

The proposed VTD tracker was evaluated quantitatively and qualitatively in comparison with four different tracking approaches: standard MCMC, Mean Shift (MS), Online Appearance Learning (OAL) and Multiple Instance Learning (MIL). VTD scored the best results, coping with illumination changes, occlusion, background clutter and abrupt motion, while none of the other trackers was able to withstand all of these difficulties. Moreover, VTD itself was compared to variants without SPCA and without Interactive MCMC; in both cases the full design performed best.

5 Discussion

It has been shown that the Visual Tracking Decomposition methodology is a valuable contribution to tracking research. Choosing SPCA to decompose the object model is a novel way of achieving compactness while keeping as much information as possible. However, the proposed approach suffers from a significant drawback. In various parts of the paper, tuning parameters are injected into the building equations (e.g. the object model and the motion model). The values used for these parameters are reported, but it is never explained why they were chosen in particular. This leaves the door open for a collapse, or at least a significant drop in performance, if those parameters are misestimated for any reason. Moreover, the parameters are not shown to hold for all scenarios, which means that the values used in the paper may simply fail in other trials. Another problem is the high demand on memory, computation and time resources, which scales up quickly with the number of trackers involved. The algorithm is not meant to be a real-time one, but an overly long delay would still be unwanted.
6 Conclusion

The approach discussed here offers promising results by addressing the appearance change problem. We demonstrated how the novel approach exploits SPCA to decompose the object model into compact, complementary and expressive sub-models, one for each basic tracker. We then saw how the integration between trackers is performed through IMCMC, leading to improved performance. The drawbacks of the approach are the tunable parameters and the low scalability due to the required resources. As future work, it would be beneficial to consider a parallelization of the algorithm, which could lead to a significant speed-up.

References

1. Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
2. Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004.
3. Jukka Corander, Magnus Ekdahl, and Timo Koski. Parallell interacting MCMC for learning of topologies of graphical models. Data Mining and Knowledge Discovery, 17(3):431–456, 2008.
4. Laurent El Ghaoui, Michael I. Jordan, and Gert R. G. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. In SIAM Review, pages 41–48. MIT Press, 2004.
5. Zia Khan, Tucker Balch, and Frank Dellaert. An MCMC-based particle filter for tracking multiple interacting targets. In Proc. ECCV, pages 279–290, 2003.
6. D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
7. Junseok Kwon and Kyoung Mu Lee. Visual tracking decomposition: presentation slides, 2009.
8. Junseok Kwon and Kyoung Mu Lee. Visual tracking decomposition. In CVPR, pages 1269–1276, 2010.
9. Haibin Ling. Diffusion distance for histogram comparison. In CVPR, pages 246–253, 2006.
10. Emilio Maggio and Andrea Cavallaro. Video Tracking: Theory and Practice. Wiley Publishing, 1st edition, 2011.
11. Baback Moghaddam, Yair Weiss, and Shai Avidan. Generalized spectral bounds for sparse LDA. In Proceedings of the Twenty-Third International Conference on Machine Learning (ICML), 2006.
12. S. Santhoshkumar, S. Karthikeyan, and B. S. Manjunath. Robust multiple object tracking by detection with interacting Markov chain Monte Carlo. In Image Processing (ICIP), 2013 20th IEEE International Conference on, pages 2953–2957, Sept 2013.
13. Jonathon Shlens. A tutorial on principal component analysis.
14. Gilbert Strang. Linear Algebra and Its Applications. Wellesley-Cambridge Press, 2009.
15. Sebastian Thrun, Wolfram Burgard, and Dieter Fox. Probabilistic Robotics (Intelligent Robotics and Autonomous Agents). The MIT Press, 2005.
16. Matthew Turk and Alex Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71–86, January 1991.
17. Yang Wang and Qiang Wu. Sparse PCA by iterative elimination algorithm. Advances in Computational Mathematics, 36(1):137–151, 2012.
18. YouTube: http://www.youtube.com/watch?v=4pnQd6jnCWk. Principal component analysis (PCA), 2009.
19. YouTube: http://www.youtube.com/watch?v=BfTMmoDFXyE. A layman's introduction to principal component analysis, 2009.

Gradient Response Maps vs. HOG Features for Detection of Texture-Less Objects

Sahar Javadi1 and Stephan Krauss2
1 Technical University of Kaiserslautern, javadi@rhrk.uni-kl.de
2 German Research Center for Artificial Intelligence, stephan.krauss@dfki.de

Abstract.
In this paper a comparison between two approaches, Gradient Response Maps and Histograms of Oriented Gradients (HOG), is presented in the context of the detection of texture-less objects. The parameters of each method are first explained separately in detail, followed by the experimental results of a case study comparing the two methods.

Key words: Gradient Response Maps, Histograms of Oriented Gradients (HOG)

1 Introduction

Real-time detection and learning of texture-less or low-textured objects is a critical and challenging task in the area of computer vision. Many applications, such as robotics, where a robot needs to deal with a continuously changing environment and learn new objects in real time, strongly need efficient approaches with low computational cost. Real-time 3D detection of object instances using Gradient Response Maps [1] is a new approach for the detection of untextured objects which does not need a time-consuming training phase and can consequently be used in time-critical applications such as robotics. The robustness of this approach is due to the spreading of image gradient orientations, which makes it possible to test only a small subset of all possible pixel locations when parsing an image and to represent a 3D object with a limited set of templates. This approach can be improved in the presence of a dense depth sensor by additionally taking the 3D surface normal orientations into account. Histograms of Oriented Gradients (HOG) [2] is another related and popular method for object detection, which is based on a statistical description of the distribution of intensity gradients in localized portions of the image. The basic idea behind this approach is that the local appearance of an object or its shape can be characterized quite well by the distribution of local intensity gradients or edge directions, even if precise knowledge of the location of the corresponding gradient or edge is not available. This approach gives reliable results but is computationally complex and consequently not suitable for on-line applications. This paper is organized as follows. In the next two sections a detailed description of the two approaches, Gradient Response Maps and Histograms of Oriented Gradients (HOG), is presented. In Section 4 a comprehensive comparison between the two methods is given by describing the evaluation results of experiments comparing them with respect to different aspects such as robustness, speed and occlusion. The last section concludes with a summary and discussion.

2 Gradient Response Maps

In this section a comprehensive description of the Gradient Response Maps approach, with a detailed explanation of its parameters, is presented, and it is shown how a new representation of the input image can be built in order to parse the image quickly for finding the objects.

2.1 Similarity Measure

The first very important component to explain is the similarity measure. What this similarity measure basically does is that, for each gradient orientation on the object, it searches in a neighbourhood of the associated gradient location for the most similar orientation in the input image. This can be written as:

ε(I, τ, c) = Σ_{r ∈ P} ( max_{t ∈ R(c+r)} | cos(ori(O, r) − ori(I, t)) | )    (1)

In this formula, R(c + r) = [c + r − T/2, c + r + T/2] × [c + r − T/2, c + r + T/2] is the neighbourhood of size T centred at location c + r in the input image, and ori(O, r) is the gradient orientation in radians at location r in a reference image O of an object to detect.
The locations r are considered in O as specified in a list denoted by P. In the above formula, considering only the orientation of the gradients and not their magnitude or direction makes the measure robust to contrast changes, and taking the absolute value of the cosine allows the measure to handle object occluding boundaries. The reason for considering image gradients in the very first step is that they have proven to be robust to illumination changes and noise and are normally more discriminative than other representation forms. In the following sections it is shown how this similarity measure can be computed efficiently by spreading the computed gradient orientations.

2.2 Computing the Gradient Orientations

The orientation of the gradients is computed on each color channel of the input image in order to increase robustness, and the gradient orientation of the channel whose gradient magnitude is largest is taken as the gradient orientation at that location of the image, according to the formula below. For an RGB color image I, the gradient orientation map I_g(x) at location x is computed as:

I_g(x) = ori(Ĉ(x))    (2)

where

Ĉ(x) = arg max_{C ∈ {R, G, B}} | ∂C/∂x |    (3)

In order to quantize the gradient orientation map, the gradient directions are omitted and only the gradient orientations are taken into account. The orientation space is then divided into n0 equal spacings, as illustrated in Fig. 1.

Fig. 1. Quantizing the gradient orientations

2.3 Spreading the Orientations

In this subsection it is shown how a new binary representation of the gradients around each image location is built in order to avoid computing the max operator in Eq. 1 every time a new template needs to be evaluated against an image location. This new binary representation, together with lookup tables, is then used for precomputing the maximal values efficiently. The computation procedure is shown in Fig. 2.

Fig. 2. Spreading the gradient orientations

As can be seen in Fig. 2, the first step in the computation of the new binary representation of the input image is to quantize the orientations into a small number of values. Having done so, the new representation of the input image is obtained simply by spreading the gradient orientations of the input image around their locations. To encode all possible combinations of orientations spread to a specific location, a binary string is used in which each individual bit corresponds to one quantized orientation and is set to 1 if this orientation is present in the neighbourhood of this location and to 0 otherwise. These binary strings are then used as indices for accessing lookup tables in order to compute the similarity measure quickly.

2.4 Precomputing the Response Maps

The precomputation of the response maps is shown in Fig. 3.

Fig. 3. Precomputing the Response Maps

As shown in this figure, the new binarized image and the lookup tables are used together to precompute the max operations in the similarity measure for each location and each possible orientation in the template. The results are stored in 2D maps. To compute the similarity measure it is then enough to sum values read from these maps. Because these maps are shared between the templates, once the maps are computed, matching several templates against the input image can be done quickly.
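As a simplified illustration of sections 2.2 and 2.3 (our own sketch, not the authors' implementation; it works on a single grey-value channel and omits the dominant-orientation filtering of the original method), the quantization and spreading steps could look as follows:

```python
import numpy as np

def quantize_orientations(gray, n0=8, mag_thresh=10.0):
    """One bit per quantized orientation; gradient direction is ignored."""
    gy, gx = np.gradient(gray.astype(float))
    mag = np.hypot(gx, gy)
    ori = np.mod(np.arctan2(gy, gx), np.pi)                 # orientations in [0, pi)
    bins = np.minimum((ori / np.pi * n0).astype(int), n0 - 1)
    return np.where(mag > mag_thresh, 1 << bins, 0)         # weak gradients dropped

def spread_orientations(labels, T=8):
    """OR together the orientation bits of every pixel in a T x T neighbourhood,
    giving the binary representation used to index the lookup tables."""
    spread = np.zeros_like(labels)
    for dy in range(T):
        for dx in range(T):
            shifted = np.roll(np.roll(labels, dy - T // 2, axis=0),
                              dx - T // 2, axis=1)
            spread |= shifted
    return spread
```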
2.5 Extension to Dense Depth Sensors

If a depth sensor is available, this approach can be extended using quantized surface normals, which leads to more robustness in the detection of objects. The similarity measure is then defined as the dot product of the normalized surface normals, instead of the cosine difference used for the image gradients. The combined similarity measure in this case is simply the sum of the measure for the image gradients and that for the surface normals.

3 Histograms of Oriented Gradients

This approach is based on the idea that the local appearance and shape of objects can often be characterized rather well by the distribution of local intensity gradients or edge directions, even if precise knowledge of the position of the corresponding gradient or edge is not available. In practice, the image window is divided into small spatial regions (cells), and the gradient directions or edge orientations over the pixels of each cell are accumulated into a local 1-D histogram. A contrast normalization is also performed over overlapping descriptor blocks before they are used, where each block is a collection of neighbouring cells. The normalized descriptor blocks are referred to as Histogram of Oriented Gradient (HOG) descriptors. An overview of the HOG approach for the case of human detection is depicted in Fig. 4; the classifier used in this case is a conventional SVM.

Fig. 4. An overview of the HOG approach

Experimental results examining the influence of different descriptor parameters show that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good performance. The similarity measure used in the classifier of the HOG approach is normally the Euclidean metric or the cosine similarity, computed respectively as:

d = sqrt( (p_1 − q_1)^2 + ... + (p_n − q_n)^2 )    (4)

cos(θ) = (A · B) / (|A| |B|)    (5)

4 Experimental Validation

In this section the experimental validation results comparing the method using gradient response maps, called LINE, to HOG and three other methods, i.e. DOT [3], TLD [4] and Steger [5], are presented. While DOT is recognized as a fast template matching method, HOG and Steger are better known as slow but very robust template matching methods. For the experimental validation, three different variations of LINE have been used: LINE-2D, which uses just the image gradients, LINE-3D, which uses just the surface normals, and LINE-MOD, which is multimodal and uses both. These methods are compared to each other with respect to robustness, speed and occlusion.

4.1 Robustness

The six methods mentioned above are evaluated using six sequences of more than 2000 real images each. Each sequence contains illumination and large viewpoint changes on heavily cluttered backgrounds. As the experimental results in Fig. 5 and Fig. 6 show, LINE-2D outperforms the other methods, except for the Steger method, which gives similar results. One possible reason is that the LINE and Steger methods use similar similarity measures. As can be seen in these figures, when a depth sensor is available, i.e. with LINE-MOD, there are only a few false positives, and this method outperforms all other methods without decreasing the runtime performance. The middle column in these figures shows the results when a threshold is set for each approach to allow a 97% true positive rate and only the hypothesis with the largest response is evaluated.
Fig. 5. Comparison of the robustness of the six methods on real 3D objects

4.2 Speed

Since the procedure of template learning is considered to be instantaneous, only the runtime performance is taken into account when evaluating the different approaches with respect to speed. As the experimental results presented in Fig. 6 show, the LINE approach is generally a real-time approach. LINE-MOD is somewhat slower than LINE-2D and LINE-3D due to the slower preprocessing stage. The DOT method is initially fast, but as the number of templates increases it becomes slower.

Fig. 6. Comparison of the robustness of the six methods on real 3D objects

4.3 Occlusion

As can be seen in Fig. 7, the robustness of LINE-2D and LINE-MOD is also evaluated by adding synthetic noise and illumination changes to the image. The results suggest that both methods behave linearly with respect to occlusion.

Fig. 7. Comparison of the speed of the six methods

Fig. 8. Left: the LINE approach is linear with respect to occlusion. Right: average recognition score for the six real 3D objects with respect to occlusion.

5 Comparison of Gradient Response Maps and HOGs from the Similarity Measure Aspect

The LINE method and HOG are both gradient-based methods which take the orientation of the gradients into account. Although exploiting gradient orientations (and not their direction) makes both methods robust to background clutter as well as to small shifts and deformations, each method achieves this capability in a different way. The HOG method does this by first quantizing the orientations and using local histograms; however, this can be unstable when strong gradients appear in the background. The similarity measure of the LINE method, on the other hand, searches for each gradient orientation on the object in a neighbourhood of the associated gradient location for the most similar orientation in the input image. Taking the absolute value of the cosine between gradient orientations in the similarity measure of the LINE method allows it to correctly handle object occluding boundaries. Considering only the orientation of the gradients and not their norms makes the measure robust to contrast changes. This is something that is missing in the HOG method, and consequently makes that method vulnerable to strong contrast changes.

6 Summary and Discussion

In this paper a comprehensive comparison between the LINE method, in its three different variations LINE-2D, LINE-3D and LINE-MOD, and HOG was drawn. Moreover, the results of the experimental validation of LINE, HOG and three other methods were presented and discussed. These methods were validated for the recognition of six different objects on heavily cluttered backgrounds with respect to different aspects such as robustness, speed and occlusion. The obtained results suggest that the LINE method is a real-time method which in most cases outperforms all other considered methods. However, there are certain issues in the way this comparison was done. It is noteworthy that this paper reviews the results of another paper, which basically attempts to prove the efficiency of the LINE method in comparison to four other methods, i.e. HOG, DOT, Steger and TLD. The main point of the current paper, however, is the comparison of the LINE and HOG methods by reviewing the previously obtained results. These two methods are focused on because they both use gradient features for the detection of objects, although the LINE method uses just the orientation of the gradients, while the HOG method takes both the orientation and the magnitude of the gradients into account.
The first issue concerns the robustness aspect and the way these methods have been evaluated. Considering that the results were obtained by experimenting on only 6 objects, the question is whether 6 objects (all located in a room and not in an outdoor environment) are good representatives for a firm judgement about the robustness of a method, i.e. LINE, in comparison to other methods such as HOG. Furthermore, there are no clear explanations of the cases where the HOG method outperforms the LINE-2D method. Another debatable issue concerns speed: is it basically right to compare a method like LINE, which is specifically designed for real-time applications, with a method like HOG, which is not meant to be used in real-time situations? From the perspective of this work's author, this is not a fair comparison.

References

1. Hinterstoisser, S., Cagniart, C., Ilic, S., Sturm, P.F., Navab, N., Fua, P., Lepetit, V.: Gradient response maps for real-time detection of textureless objects. IEEE Trans. Pattern Anal. Mach. Intell. 34(5) (2012) 876–888
2. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: International Conference on Computer Vision and Pattern Recognition, Volume 2 (2005) 886–893
3. Hinterstoisser, S., Lepetit, V., Ilic, S., Fua, P., Navab, N.: Dominant orientation templates for real-time detection of texture-less objects. In: CVPR (2010) 2257–2264
4. Kalal, Z., Matas, J., Mikolajczyk, K.: P-N learning: Bootstrapping binary classifiers by structural constraints. In: IEEE Conference on Computer Vision and Pattern Recognition (2010) 49–56
5. Steger, C.: Occlusion, clutter, and illumination invariant object recognition (2002)

The evaluation datasets have been created by applying homographies. One has to question the statistical significance of the performed evaluation, especially since the method was outperformed by SURF on one of the six evaluated datasets. Furthermore, the dataset with the highest performance gain with respect to the compared feature detection methods is a dataset the authors created themselves, consisting of an image with only Gaussian noise added at different intensities. Another concern is the runtime evaluation of the proposed method. Whereas the total time needed seems to be approximately the same as SIFT, while still outperforming it on every evaluated dataset, it does not hold up against more recent methods in terms of computation time. Compared to SURF, the computation time increases by a factor of 2.5-4, and even worse, compared to STAR it needs approximately 8 times as much time. The improvements proposed in the accelerated KAZE features publication seem to address its most important shortcoming: computation time. It achieves this by changing the computation method of the non-linear scale space. As a result, the computation time is lower than that of SURF, making it possible to use it for real-time applications. The most important question to ask is whether the proposed optimizations come at the expense of lower quality (with respect to the stability of the features under noise or scale changes). However, the comparison of the evaluation results does not indicate that this is the case. While the work on keypoint descriptors using a non-linear scale space seems promising, more care should be taken in presenting the results in a consistent way.
For example, the authors included precision/recall graphs in the original paper, whereas for their accelerated features they only show a table listing the matching score and recalled matches. Most importantly, the authors could have described their datasets better, or even mentioned that they are composed of only one original picture with different homographies applied.

5 Conclusion

This seminar report gives a short overview of KAZE features, a feature detector that makes use of a non-linear scale space. These features seem to outperform most state-of-the-art feature detectors, like SIFT or SURF, although the computation time is comparable to that of SIFT and considerably longer than that of SURF. This disadvantage is rectified by the accelerated-KAZE features, which achieved even better results in their evaluation. Even though the authors only included the evaluation of 6 different datasets, the gain in precision and recall seems to indicate that they might outperform other state-of-the-art feature detectors in a more rigorous evaluation. Although the evaluation might not be the most rigorous, the proposed feature detector seems very promising.

References

1. P. F. Alcantarilla, A. Bartoli, and A. J. Davison. KAZE features. In Eur. Conf. on Computer Vision (ECCV), 2012.
2. P. F. Alcantarilla, J. Nuevo, and A. Bartoli. Fast explicit diffusion for accelerated features in nonlinear scale spaces. In British Machine Vision Conf. (BMVC), 2013.
3. Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. SURF: Speeded up robust features. In ECCV, pages 404–417, 2006.
4. David G. Lowe. Distinctive image features from scale-invariant keypoints, 2003.
5. Joachim Weickert, Bart M. Ter Haar Romeny, and Max A. Viergever. Efficient and reliable schemes for nonlinear diffusion filtering. IEEE Transactions on Image Processing, 7:398–410, 1998.

Fusion of inertial body tracking with Kinect body tracking

Artem Avtandilov1 and Markus Miezal2
1 TU Kaiserslautern, artem@rhrk.uni-kl.de
2 markus.miezal@dfki.de

Abstract. Inertial body tracking and body tracking with the Microsoft Kinect sensor provide certain benefits for real-time human body motion capturing, but both lack precision when it comes to disturbances, be it occlusion for the visual sensor or magnetic field disturbances for the inertial measurement units (IMUs). This paper proposes an approach to fuse inertial body tracking and Kinect body tracking in order to achieve a better estimate of skeletal motions.

Keywords: Inertial motion capture, sensor fusion, body sensor network, Kinect

1 Introduction

Different technologies for body tracking have their strong and weak sides. While Kinect suffers from occlusion and is slower, inertial tracking systems can only estimate posture and are vulnerable to magnetic disturbances. On the other hand, Kinect is capable of detecting a 3D pose and extracting segment lengths, whereas inertial measurement units are faster and their tracking is more robust when not disturbed. Motivated by the above and by [4] and [5], this paper introduces a practical approach for the fusion of inertial body tracking with Kinect body tracking. There have been several attempts to fuse data from inertial measurement units and Kinect, such as [2], where the emphasis is on mapping rather than body tracking. The combination of inertial measurement units (IMUs) and Kinect is further used in [6] for gesture recognition using 5 XSens IMUs, and [3] advances those approaches to joint angle estimation with applications in clinical rehabilitation.

2 Proposed approach

2.1 Kinect SDK. Data structure
The Kinect for Windows software development kit (SDK) enables the use of C++ to create applications that support body tracking features using the Kinect sensor and a Windows-enabled machine. The Developer Toolkit provides sample software that has been modified in order to record samples of body posture and process them further. The environment enables the use of all hardware features of the Kinect sensor: RGB camera, depth sensor and multi-array microphone. The SDK already provides a fully functioning version of body tracking that has been used in different generations of the Xbox game console. The precision of the joint positions produced by Kinect is not very high, but it is sufficient for fusion with inertial data and for mapping applications [7]. The sample information includes the 3D coordinates of 20 joint positions of the tracked body, the timestamp of the sample and, for each joint, whether it was inferred or directly tracked. A rendering of the most recent detection is output to the screen, which makes it easy to monitor the current state. Kinect is capable of tracking two full bodies simultaneously and can detect four more (see Fig. 1; image from Microsoft, http://msdn.microsoft.com/en-us/library/hh973074.aspx).

Fig. 1. Tracking capabilities of the Kinect sensor: up to two distinguishable subjects can be tracked with all joint positions (blue and purple as shown), and the presence of four more subjects in the scene can be detected (dots as shown).

2.2 Prerequisites for fusion

The 3D points produced by Kinect and the joint points extracted from the kinematic chains of the inertial measurement system lie in different coordinate systems. Moreover, they are not ideally synchronized in time, since the data is captured asynchronously on two different machines, which leads to a time offset and drift. At this stage, the desired sensor fusion is impossible. The data needs to be aligned with respect to time and in the spatial domain, with the former carried out first.

Fig. 2. Coordinate systems represented visually: global coordinates based on the IMU tracking information (left) and the Kinect coordinate system that is to be aligned with the IMUs.

Timing synchronization. In order to synchronize the two streams of tracking information produced by the two different measurement platforms (Kinect and IMUs), an easily recognizable event has been inserted into the sequence of actions recorded from the human: at the beginning of each recorded sequence, the subject has to clap. The clapping event can easily be extracted from the IMUs' acceleration and from the Kinect's hand positions.

Fig. 3. Data for synchronization: (a) normalized acceleration of the hand recorded from the IMU; (b) distance between the hand points produced by Kinect.

Fig. 3(a) shows the normalized acceleration of the hand IMU; the peak indicates the clapping event. From the Kinect data it is easy to calculate the timestamp of the event when the hands were closest to each other (see Fig. 3(b)). By comparing these two timestamps, the time offset between the two systems can be determined and an adjustment can be performed. As noted before, the Kinect software runs on a Windows machine and the IMU data are recorded under Linux. Naturally, their clocks diverge, but synchronizing them with the calculated offset is sufficient for short-term recordings. Furthermore, Kinect and the IMUs provide tracking information at different frequencies, 30 Hz and 100 Hz respectively.
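A minimal sketch of this clap-based offset estimation (our own code, not the project's implementation; the array names are assumptions) could look as follows: the clap appears as a peak in the IMU hand acceleration and as a minimum of the Kinect hand-to-hand distance.

```python
import numpy as np

def estimate_time_offset(imu_t, imu_acc_norm, kin_t, kin_hand_dist):
    """Offset to add to Kinect timestamps to align them with the IMU clock."""
    t_clap_imu = imu_t[np.argmax(imu_acc_norm)]      # acceleration peak (Fig. 3a)
    t_clap_kin = kin_t[np.argmin(kin_hand_dist)]     # hands closest together (Fig. 3b)
    return t_clap_imu - t_clap_kin

# Usage: kin_t_aligned = kin_t + estimate_time_offset(imu_t, acc, kin_t, dist)
```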
Skeleton alignment. The spatial alignment becomes possible because the skeletons are compatible and correspondences between the joints exist. The alignment consists of a rotation and a translation. At this point, only the rotational alignment is needed: since the IMU tracking system is fixed with respect to the pelvis while the Kinect skeleton can move freely, the Kinect skeleton's translation has to be reset to the IMU skeleton's origin. The rotational alignment is performed only once, when processing begins, as follows [1]. To find the optimal rotation, both datasets are re-centralized so that both centroids are at the origin, as shown in Fig. 4 (image from Nghia Ho, http://nghiaho.com/?attachment_id=807).

Fig. 4. Relative rotation between two centralized point clouds.

This removes the translation component, leaving only the rotation. The next step involves accumulating a matrix, called H, and using the Singular Value Decomposition to find the rotation as follows:

H = Σ_{i=1}^{N} (P_A^i − centroid_A)(P_B^i − centroid_B)^T    (1)

[U, S, V] = SVD(H)    (2)

Finally, the rotation matrix from A to B can be calculated using U and V:

R = V U^T    (3)

The translation alignment is performed on every following frame where Kinect data is present. The inertial tracking system's pelvis position is fixed, so the vector from the Kinect skeleton's pelvis to the inertial tracking system's pelvis is found and subtracted from every point of the Kinect skeleton. This final step allows the sensor fusion between the two data streams.

2.3 Measurement models

The inertial tracking system is based on an extended Kalman filter (EKF) which operates on a kinematic chain formed by Denavit-Hartenberg transformations (DH). The filter's state comprises all time-dependent parameters of the kinematic chain (angles, angular velocities and angular accelerations) and is propagated in time with a standard constant angular acceleration model with white noise in acceleration [5]. To fuse the Kinect data into this system, the Kinect measurements have to be related to the filter state through the underlying kinematic chain. Since the Kinect produces only 3D points, a fusion on the angle level is impossible, because not all degrees of freedom are covered by the 3D point model (e.g. rotation about the bone axis); the fusion has to be performed on the point level. Two measurement models have been implemented. The first assumes that every 3D point in the IMU skeleton is equal to its corresponding point in the Kinect skeleton. The second weakens this assumption, requiring only that the directions to the next corresponding joint are equal. Measurement model 1 (MM1) measures the positions of 11 corresponding joints:

0 = Y − P + e    (4)

Here Y represents the joint points extracted from the state by multiplying through all transformations of the underlying kinematic chain up to a specific joint, P stands for the joint points produced by Kinect, and e denotes zero-mean Gaussian measurement noise. On the downside, the segment lengths of the two models might differ in reality and the alignment of the skeletons cannot be perfect; these are the reasons why measurement model 2 (MM2) is introduced, in which segment directions are measured. The cosine of the angle between two vectors can be expressed with the dot product as follows:

cos(ϑ) = (A · B) / (‖A‖ ‖B‖)    (5)

Two perfectly aligned vectors form an angle of zero degrees, whose cosine equals 1.
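For illustration, the SVD-based rotation estimation of equations (1)-(3) can be sketched as follows (our own code; the reflection guard at the end is a standard safeguard not discussed in the text):

```python
import numpy as np

def find_rotation(A, B):
    """A, B: (N, 3) arrays of corresponding joint positions.
    Returns R that maps the centred points of A onto the centred points of B."""
    cA, cB = A.mean(axis=0), B.mean(axis=0)
    H = (A - cA).T @ (B - cB)          # equation (1), summed over all joints
    U, S, Vt = np.linalg.svd(H)        # equation (2)
    R = Vt.T @ U.T                     # equation (3): R = V U^T
    if np.linalg.det(R) < 0:           # guard against a reflection solution
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    return R

# Quick self-check with a known rotation and translation
rng = np.random.default_rng(0)
A = rng.standard_normal((11, 3))                       # 11 corresponding joints
theta = 0.5
R_true = np.array([[np.cos(theta), -np.sin(theta), 0],
                   [np.sin(theta),  np.cos(theta), 0],
                   [0, 0, 1]])
B = A @ R_true.T + np.array([1.0, 2.0, 3.0])
assert np.allclose(find_rotation(A, B), R_true)
```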
The cosine property above leads to the following simplification. Let the vector Y represent a segment acquired from the state and the vector P the corresponding segment from the Kinect; then MM2 is defined as:

0 = (Y / ‖Y‖) · (P / ‖P‖) − 1 + e    (6)

The results of the two introduced measurement models are discussed further in the next section.

3 Experiments

The described system has been tested against magnetic disturbances brought into the scene in order to make the data produced by the IMUs unreliable.

3.1 Magnet affecting the scene

Fig. 5. IMU skeleton with a magnet in the scene. (a) The IMU skeleton overlaid on the visual image when the subject is about to pick up a strong magnet, demonstrating normal behaviour. (b) The IMU skeleton overlaid on the visual image when the subject is affected by the magnet, demonstrating how magnetic disturbances make inertial tracking unreliable.

The normalized magnetometer data of the left hand shows significant disturbances. Further analysis and graphical representation of the recorded data reveal inadequate behaviour of the kinematic chains when no Kinect information is present (see Fig. 6(a)). Even though the magnet was placed at the left wrist, the large error quickly propagates through the inertial measurement model to all parts of the chain; even the torso is significantly influenced by the disturbed measurements. A closer look at the Z coordinate of the torso shows that, while it should in fact be stable as at the beginning of the measurement, it changes value unexpectedly and then gets stuck at zero, so that no tracking information can be produced. Reliable tracking cannot be achieved in this setup (see Fig. 6(b)).

Fig. 6. Torso coordinates when the magnet has been added to the scene. (a) Normalized magnetometer data of the left hand when the magnet has been added to the scene. (b) Distance between the hand points produced by Kinect.

3.2 Kinect data used to restore the corrupted scene

Using the two Kinect measurement models, the experimental data shows that the visual output produced by the system and the coordinates of specific joints are much more robust when the Kinect tracking information is taken into account, whereas the IMUs alone cannot handle the external magnetic disturbances; otherwise the propagation of the magnetic disturbances affects the coordinates of the shown joints.

Fig. 7. Measurement model 2 propagation onto the torso coordinates. (a) Measurement model 1 propagation onto the torso coordinates. (b) Distance between the hand points produced by Kinect.

Further analysis of the produced results demonstrates the behaviour of the system. Fig. 7(b) for MM1 shows jitter on the Z coordinate compared to the original tracking with the IMUs only. This shows that the torso is being dragged towards the point that matches the Kinect information, through MM1 and the EKF. Later, when the magnet is added to the scene, the Z coordinate looks much more stable compared to the original IMU data: the inertial tracking information tries to drag the torso out of place while it is held by the Kinect. And when the effect of the magnetic disturbances propagated to the torso becomes most severe and the IMUs are not capable of tracking at all, rushing the Z and other coordinates to zero, MM1 still holds it at a reliable level. The jitter is sufficiently small, as can clearly be seen in the recording. Some segments, for example the shoulders, differ between the Kinect and IMU models, and when MM1 starts running during the recording replay, the shoulders are dragged down to where the Kinect tracks them in its skeleton reproduction, and similarly the spine is dragged backwards.
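Before comparing the behaviour of the two models further, the residuals of section 2.3 can be recapped compactly (a sketch with our own names; Y is the joint position or segment predicted from the filter state, P the corresponding Kinect quantity):

```python
import numpy as np

def mm1_residual(Y, P):
    """Equation (4): 0 = Y - P + e, so the innovation fed to the EKF is P - Y."""
    return P - Y

def mm2_residual(Y, P):
    """Equation (6): 0 = (Y/|Y|) . (P/|P|) - 1 + e, one scalar per segment;
    the innovation is 1 minus the dot product of the normalized segments."""
    y = Y / np.linalg.norm(Y)
    p = P / np.linalg.norm(P)
    return 1.0 - float(y @ p)
```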
MM2 demonstrates similar performance, but it clearly measures different features, namely the angles between the same segments from the two different sources. This makes the system prioritize keeping corresponding segments parallel to each other. Depending on the noise levels, this can result in better or worse performance compared to MM1, as discussed below. In particular, it can clearly be seen that when the system can no longer handle the disturbances and the torso is dragged out of place, the other segments still remain parallel.

4 Conclusion

This paper proposes a basic concept for the fusion of inertial body tracking and Kinect body tracking. The results above allow the conclusion that tracking information received from the Kinect sensor can improve the estimate of human body movements in 3D, particularly when magnetic disturbances occur. The two measurement models demonstrated different performance depending on the noise levels, the magnet location and other factors. Measurement model 1 performed better at lower noise levels and gave more robust results, whereas measurement model 2 produced better estimates at higher noise levels and with more intense magnetic fields. Data received from the Kinect makes the system more robust; however, it cannot be completely reliable due to the significant jitter inherent in optical tracking.

References

1. Paul J. Besl. A method for registration of 3-D shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1992.
2. Bas des Bouvrie. Improving RGBD indoor mapping with IMU data. MSc thesis, Faculty EEMCS, Delft University of Technology, 2011.
3. Antonio Padilha Lanari Bo et al. Joint angle estimation in rehabilitation with inertial sensors and its integration with Kinect. 33rd Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 2011.
4. Jamie Shotton et al. Real-time human pose recognition in parts from single depth images. Communications of the ACM, 2013.
5. Markus Miezal et al. A generic approach to inertial tracking of arbitrary kinematic chains. BodyNets, 2013.
6. Oresti Banos et al. Kinect=IMU? Learning MIMO signal mappings to automatically translate activity recognition systems across sensor modalities. 16th International Symposium on Wearable Computers, 2012.
7. Kourosh Khoshelham and Sander Oude Elberink. Accuracy and resolution of Kinect depth data for indoor mapping applications. Sensors, 2012.