Dissertations and Theses, 12-2014

Scholarly Commons Citation: John, Tennyson Samuel, "Vision-Aided Navigation for GPS-Denied Environments Using Landmark Feature Identification" (2014). Dissertations and Theses. 218. https://commons.erau.edu/edt/218

VISION-AIDED NAVIGATION FOR GPS-DENIED ENVIRONMENTS USING LANDMARK FEATURE IDENTIFICATION

by

TENNYSON SAMUEL JOHN

A Thesis Submitted to the College of Engineering, Department of Aerospace Engineering, in partial fulfillment of the requirements for the degree of Master of Science in Aerospace Engineering

Embry-Riddle Aeronautical University
Daytona Beach, Florida
December 2014

Acknowledgements

It gives me immense pleasure to acknowledge the people who provided me the necessary guidance, support, and continual motivation towards the completion of this thesis.

I would like to extend my gratitude and specially acknowledge my advisor, Dr. Richard J. Prazenica, for his technical guidance, supervision, and continual moral support during the course of this thesis. He enabled me to strike the right balance between classes and thesis work, to explore the versatile character embedded in me, and to bring forth quality output at all times. All of the above contributed greatly to the completion of this thesis work.

I would like to thank Dr. Hever Moncayo for serving on the committee and for providing me with quality technical knowledge time and again over the course of this thesis. I would like to thank Dr. Hamilton Hagar, Jr. for serving on the committee and for providing necessary guidance and support amidst his busy schedule. I would like to express my sincere appreciation to my friends and colleagues in the Flight Dynamics and Controls Laboratory for their support and encouragement throughout my thesis work.

Finally, I would like to share this feat with my parents and express my sincere appreciation for their never-ending love, continual motivation, and strong support extended at all times.

ABSTRACT

Author: Tennyson Samuel John
Title: Vision-Aided Navigation for GPS-Denied Environments using Landmark Feature Identification
Institution: Embry-Riddle Aeronautical University
Degree: Master of Science in Aerospace Engineering
Year: 2014

In recent years, unmanned autonomous vehicles have been used in diverse applications because of their multifaceted capabilities. In most cases, the navigation systems for these vehicles are dependent on Global Positioning System (GPS) technology. Many applications of interest, however, entail operations in environments in which GPS is intermittent or completely denied. These applications include operations in complex urban or indoor environments as well as missions in adversarial environments where GPS might be denied using jamming technology. This thesis investigates the development of vision-aided navigation algorithms that utilize processed images from a monocular camera as an alternative to GPS.
The vision-aided navigation approach explored in this thesis entails defining a set of inertial landmarks, the locations of which are known within the environment, and employing image processing algorithms to detect these landmarks in image frames collected from an onboard monocular camera. These vision-based landmark measurements effectively serve as surrogate GPS measurements that can be incorporated into a navigation filter.

Several image processing algorithms were considered for landmark detection and this thesis focuses in particular on two approaches: the continuous adaptive mean shift (CAMSHIFT) algorithm and the adaptable compressive (ADCOM) tracking algorithm. These algorithms are discussed in detail and applied for the detection and tracking of landmarks in monocular camera images. Navigation filters are then designed that employ sensor fusion of accelerometer and rate gyro data from an inertial measurement unit (IMU) with vision-based measurements of the centroids of one or more landmarks in the scene. These filters are tested in simulated navigation scenarios subject to varying levels of sensor and measurement noise and varying numbers of landmarks. Finally, conclusions and recommendations are provided regarding the implementation of this vision-aided navigation approach for autonomous vehicle navigation systems.

TABLE OF CONTENTS

Acknowledgements
Abstract
List of Figures
Chapter 1 Introduction
  1.1 Motivation
  1.2 Literature Review
  1.3 Scope of Work
  1.4 Technical Objectives
Chapter 2 Detection and Tracking Algorithms
  2.1 CAMshift Algorithm
    2.1.1 Mean Shift Theory
    2.1.2 Mass Center Calculation
    2.1.3 Histogram Back-Projection
    2.1.4 Effect of Scaling and Orientation Features in the CAMshift Algorithm
  2.2 Adaptable (Advanced) Compressive (ADCOM) Tracking Algorithm
    2.2.1 Background Subtraction Based Tracker
    2.2.2 Particle Filter Based L1 Tracker
    2.2.3 Fundamental Types of Online Tracking Algorithms
    2.2.4 Need for a Stable, Robust and Efficient Algorithm
    2.2.5 Random Projection
    2.2.6 Sparse Measurement Matrix Representation
    2.2.7 ADCOM Tracking Algorithm
      2.2.7.1 Haar-Like Features
      2.2.7.2 Dimensionality Reduction (Lossless Compression)
      2.2.7.3 Naïve Bayes Classifier
  2.3 Secondary Algorithms to Support Data-Dependent and Data-Independent Algorithms
    2.3.1 Template Matching
    2.3.2 Pattern Recognition
    2.3.3 Color Detection
    2.3.4 Edge Detection
    2.3.5 Corner Detection
    2.3.6 Kanade-Lucas-Tomasi Tracker (KLT)
Chapter 3 Image Processing Implementation
  3.1 Implementation and Analysis of CAMshift Algorithm
  3.2 Implementation and Analysis of ADCOM Algorithm
Chapter 4 Navigation Filters
  4.1 Kalman Filter
  4.2 Extended Kalman Filter
    4.2.1 Experimental EKF for Vision-Aided Navigation
Chapter 5 Extended Kalman Filter (EKF) Implementation
  5.1 Position and Velocity Estimates Using a 4-State EKF (with Accelerometer Data)
  5.2 Position and Velocity Estimates Using a 4-State EKF (without Accelerometer Data)
  5.3 4-State Extended Kalman Filter with Additive White Gaussian Noise
  5.4 Extended Kalman Filter for Tracking Multiple Landmarks/Targets
  5.5 6-State Extended Kalman Filter
Chapter 6 Conclusion and Future Recommendations
  6.1 Future Recommendations
Bibliography

List of Figures

Figure No.  Title
2.1   Mean Shift Convergence
2.2   CAMshift Implementation
2.3a  RGB Color Cube
2.3b  HSV Color System
2.4   ADCOM Implementation
3.1a  Red Ball - Object Tracking
3.1b  Centroid Tracking
3.2a  Green Bottle - Object Tracking
3.2b  Centroid Tracking
3.3a  Cone - Object Tracking
3.3b  Centroid Tracking
3.4a  Rock - Object Tracking
3.4b  Centroid Tracking
3.5   Red Ball - Object Tracking (ADCOM)
3.6   Green Bottle - Object Tracking
3.7   Cone - Object Tracking
3.8   Rock - Object Tracking
3.9   ADCOM Implementation
4.1   Kalman Filter Process
4.2   Non-Linear to Linear Model
4.3   Example of Tracking an Aerial Vehicle from the Ground
4.4   Depiction of EKF Logic
5.1   Corobot Classic CL4
5.2   MicroStrain 3DM-GX3-45
5.3a  GoPro Hero 2
5.3b  Microsoft LifeCam HD-3000
5.4   Path of the UGV
5.5   Raw GPS Lat-Long-Altitude Data in NED
5.6   Euler Angles
5.7   Accelerometer Data from MicroStrain
5.8   Model with Accelerometer Data
5.9   Measurement Plot
5.10  X Position (EKF with Accelerometer Data)
5.11  X Position Zoom
5.12  Y Position
5.13  Y Position Zoom
5.14  Velocity in X Direction
5.15  Velocity in Y Direction
5.16  EKF without Accelerometer Data
5.17  X Position (without Accelerometer Data)
5.18  Y Position
5.19  Velocity in X Direction
5.20  Velocity in Y Direction
5.21  X Estimate with Measurement Noise Characteristics
5.22  Y Position with Measurement Noise Characteristics
5.23  Velocity in X Direction with Measurement Noise Characteristics
5.24  Velocity in Y Direction with Measurement Noise Characteristics
5.25  Simulink Model of EKF with Measurement Noise
5.26  X Position (EKF for Multiple Landmarks)
5.27  Y Position
5.28  Velocity in X Direction
5.29  Velocity in Y Direction
5.30  Simulink Model for Multiple Target Tracking
5.31  Simulink Model for 6-State EKF with Accelerometer Data
5.32  Cartesian to Spherical Transformation
5.33  Spherical Coordinate Transformation
5.34  Measurement Data
5.35  Position Estimates for 3D
5.36  Velocity Estimates for 3D

CHAPTER 1 Introduction

1.1 Motivation

The Global Positioning System (GPS) is used in multifarious systems as a position estimation sensor for vehicle navigation. Most inertial navigation systems (INS) are dependent on GPS to correct for drift errors that accumulate when processing inertial measurement unit (IMU) data (i.e., accelerometer and rate gyro data). For many scenarios of interest, however, such as UAV missions in urban environments, indoor operations, or space exploration missions, GPS data may be intermittent, corrupted, or completely denied. As a result, there has been an increasing need for navigation solutions that do not depend on GPS. With the advent of computer vision and control theory in autonomous navigation applications, monocular camera systems on unmanned vehicles have received growing interest as an alternative sensor to GPS. Therefore, vision-aided navigation systems represent a potentially important enabling technology for autonomous vehicle development. The diverse variety of available image processing algorithms serves as a primary reason for developing, implementing, and testing two completely different tracking algorithms for advanced navigation applications with unmanned systems.
1.2 Literature Review

Unmanned Aerial Vehicles (UAVs) constitute a research field that has been extensively explored in the last decade [1]. Earlier studies on autonomous vehicles focused on modeling and identification, simulation, sensor integration, control design, and fault analysis [2-5]. The effective use of visual odometry from high-end cameras, interfaced with onboard sensors, as a suitable alternative method for autonomous vehicle navigation has been discussed frequently in recent years. These efforts were undertaken because the classical use of GPS to correct for drift in INS systems cannot sustain autonomous flight in GPS-denied environments [6, 7].

Early researchers in this field addressed the problem of navigation and control in GPS-denied environments using a variety of techniques. Researchers have investigated laser devices such as scanning rangefinders and Light Detection and Ranging (LIDAR) devices for mapping the surrounding environment and have used this information for navigation. However, these range sensors rely on the properties of the returned signal; the resulting estimates can become too noisy, and the accuracy is insufficient for estimating velocity for feedback control [8]. Vision-based tracking algorithms such as field estimation [9] and feature point tracking [10, 11], along with related research on computer vision techniques for UAVs, have been applied to several applications. Camera sensors have tremendous potential for localization, target identification, and surface mapping applications since they provide data about features such as landmarks, corners, edges, and patterns in the environment, which can be used to infer information about vehicle motion and position [12]. One reference paper [13] proposed a visual odometry algorithm based on geometric homography. In this approach, vision data were used to compute frame-by-frame odometry for a Simultaneous Localization and Mapping (SLAM) algorithm.

In early work, instability arising from camera mounting, which in most cases added noise, led to poor navigation solutions that were deemed responsible for a potentially destabilizing coupling between the navigation and control systems of the vehicle. The best solutions available were based on camera-based monocular SLAM (Simultaneous Localization and Mapping) techniques, which were later replaced by the incorporation of inertial measurements from the IMU, integrated with image processing data, for navigation purposes. SLAM is considered to be a hybrid of two well-known problems: tracking and navigation through localization [13]. It is a mapping technique used to develop a map of an unknown environment, or to update a previous map, thereby aiding navigation in GPS-denied environments. The earliest use of this strategy was in the 1980s [14].

Vision-based methods have been proposed even in the context of autonomous landing at a landmark, as seen in [15]. In that work, inertial sensors are combined with a single camera and a specially designed landing pad as a landmark in order to be GPS-independent. The problem of autonomously landing a UAV or helicopter in an unknown environment is discussed in [16]. In general, the degree of autonomy of an unmanned vehicle during navigation depends on factors such as the ability to cope with the unexpected loss of the GPS signal and the ability to navigate using natural landmarks.
Several generic solutions for vision-based autonomous waypoint navigation and safe landing on unknown or known landmarks have been implemented using classical image processing techniques [17-19], and vision-based solutions have also been developed for the problem of collision awareness and avoidance [20-23]. The algorithms used earlier for extracting information from the camera were often computationally intensive and difficult to run given the limited resources available for data processing after extraction. To compensate for the loss of resolution and the inaccuracy of the data processing unit, a Kalman filter based navigation system, integrated with an IMU and fused with a vision system, was introduced to alleviate the computational burden by allowing frame-to-frame prediction of camera motion and object position in the image frame, thereby helping to resolve scale ambiguity and improving overall accuracy.

1.3 Scope of Work

The two algorithms considered in this thesis for object tracking for vehicle navigation in GPS-denied environments belong to two extremes of the image processing spectrum. The continuous adaptive mean shift (CAMshift) algorithm is highly data-dependent, learning and training itself based on changes in the environment and taking into consideration the factors affecting those changes. The advanced compressive (ADCOM) algorithm is data-independent in the sense that it makes use of features extracted from the object to be tracked, classifies these features based on positive and negative samples, and continuously and robustly tracks these compressed features. In general, these algorithms, which are developed towards the goal of landmark tracking for navigational purposes, employ visual cues based on shape, edges, color, intensity, corners, patterns, and the center of mass of the densest region for landmark identification and position estimation, thereby providing assistance for real-time tracking in environments where GPS data are inaccessible. In most cases, factors such as variation in lighting, shadows, occlusions, dynamic objects in the scene, and motion and vibration effects introduce an element of uncertainty, which may affect the robustness of an algorithm in real-time scenarios.

The work done in this thesis is centered on the fidelity of the two algorithms and on testing their tracking capabilities in severe environments pertaining to high-risk applications. The primary reason for choosing these two algorithms is that, in previous work, the algorithms tested for object detection typically shared the same fundamental basis or mode of operation. This work gives the user the option of choosing between two algorithms with very different operational modes for suitable object tracking applications. Also, the data obtained from the post-processing phase can be used for navigation of unmanned vehicles in GPS-denied environments. In this work, experiments are designed and set up to evaluate the performance of the two algorithms by processing data taken from static landmarks under conditions such as reduced resolution, constantly changing luminosity, motion blur, occlusions, and featureless landmarks. The most vital difference between the CAMshift algorithm and ADCOM is that the former requires the reference object or landmark under consideration to remain in the scene, while the latter requires the reference object only during initialization of the object's position [24].
The study of the efficiency and versatility of the two algorithms is validated by experiments conducted under various conditions for navigation purposes. The developed navigation filter provides a solution for vehicle navigation in remote areas. This will eventually aid the domain of vision-aided navigation using landmark feature tracking in GPS-denied environments, thereby expanding the scope to newer, flexible methods for commercial and military applications.

1.4 Technical Objectives

This research emphasizes some unique image processing methods, which are adopted for navigation in GPS-denied environments by making use of processed data from high-end cameras as a replacement for GPS navigation measurements. A key objective of the research is to provide an in-depth analysis of the emerging tracking algorithms, which are tested in specific applications such as UAV vision-based object tracking under various conditions. An extended Kalman filter is developed to process the real-time data provided by the image processing algorithms. The primary technical objectives of the project are:

1. Generate a detailed vision data set using an Unmanned Ground Vehicle (UGV) and an Unmanned Aerial Vehicle (UAV) equipped with video cameras for the purposes of 3D landmark recognition.
2. Derive and analyze robust image processing techniques for identifying known landmarks in the environment from the vision data using video from monocular cameras.
3. Based on the landmarks that are identified within the scene, derive navigation laws/filters to integrate the landmark measurements with IMU measurements to compute a GPS-denied navigation solution.

Two specific image processing algorithms were selected for landmark detection and tracking, providing an opportunity to perform a comparative study of the two techniques. Chapter Two specifically discusses these two algorithms. The first is the CAMshift (continuously adaptive mean shift) algorithm with the implementation of scaling and orientation compensation to reduce the effects of occlusion and vibration due to camera motion. Earlier works incorporated a separate algorithm to compensate for vibration due to camera motion. This algorithm works fundamentally on the principle of shifting the centroid of the region of interest (ROI) until it coincides with the center of mass of the densest region in an image sequence of dynamic vision data. The second technique, advanced compressive tracking (ADCOM), is derived from the compressive sensing technique used predominantly for audio signals. This algorithm develops a sparse matrix for signal reconstruction from a set of random projections (features), which consists predominantly of the compressed positive (foreground) and negative (background) samples of the target under consideration. This technique mitigates the effects of occlusion, motion blur, illumination change, and appearance model changes that often occur as the frames are continuously updated.

The fourth chapter introduces the type of navigation filter tested with real-time data from an unmanned vehicle. An extended Kalman filter is designed to provide a robust navigation system. The bridge between the image processing data and the navigation filters lies in the fact that the position of the target tracked using the above two algorithms is used as a measurement update in the Kalman filter.
In chapters three and five, the 3D landmark tracking results with real-time data and the state estimates from the navigation filter using simulated and real-time IMU data are presented and discussed in detail. Finally, chapter six discusses future advancements in this field and the integration of these techniques into a real-time vision-aided navigation system.

CHAPTER 2 Detection and Tracking Algorithms

The concepts of object recognition, detection, and tracking have been explored in the field of computer vision since the early 1980s. The earliest use of these techniques was tested in a project using an IMSAI 8080 microcomputer, Altair computers, and a Cromemco Cyclops camera. The software was written in assembly language, with some BASIC running on CP/M (Control Program for Microcomputers), which allowed the floppy disk to be shared by multiple computers. The camera was programmed to recognize, detect, and track a ball even though it operated at a low 32 x 32 pixel resolution. The ball was made black for high contrast and fast position determination. The maze surface used was represented as a map/matrix of ball position, speed, and direction vectors. This entire process used three computers: one for ball position and speed from the camera data, one for stepper motor control, and one for coordinating the ball position and speed in the vector field.

As the name suggests, object recognition involves the positive identification of an object based on one or more properties of the object under consideration, such as specific features, edges, corners, and color intensity. Object detection is similar in principle to object recognition: the target in a frame is identified and compared to either a specific shape, pattern, or template, and then labeled or segmented as observed. This chapter emphasizes two important detection and tracking algorithms that were developed to suit the core purpose of this research. These algorithms can be implemented as real-time tracking algorithms in the flight computer of a fixed-wing UAV or a quadrotor for the purpose of vision-aided navigation. Also, some secondary algorithms, which could be used alongside the primary algorithms, are discussed. The two primary object detection and tracking algorithms considered in this research are the continuous adaptive mean shift (CAMshift) algorithm and the adaptable compressive tracking algorithm.

2.1 CAMshift Algorithm

The Continuously Adaptive Mean Shift (CAMshift) algorithm is an advancement of the mean shift tracking algorithm, providing a robust algorithm for object detection and tracking. The tracking is predominantly dependent on properties such as color, edge, and texture. The CAMshift algorithm reduces the tracking error by continuously changing the search window based on target properties in cases where the scaling of the image is improper or the search window is either too small or too large. Basically, the algorithm adapts to changes in the color, scaling, orientation, and texture of the image during the target search and localization processes. A key enhancement of the CAMshift algorithm over the mean shift algorithm is that CAMshift is designed specifically for dynamically changing distributions, while the mean shift algorithm is designed for static distributions. The CAMshift algorithm is composed of the following basic steps:
1. The target to be tracked is chosen at first, either by providing its position in the first frame or by manually selecting it.
2. A histogram of the hue signal from the complete image is generated.
3. The region of interest is increased, and the zeroth and first moments of the image, which correspond to the area and mean of the probability density function, respectively, are calculated.
4. The mean shift algorithm is iterated until convergence occurs.
5. The calculated centroid is used to center the search window for the next frame, and the zeroth moment is used to recalculate the image size and window size.
6. The mean shift algorithm is iterated continuously until all frames are computed (i.e., the mean location moves less than a preset threshold).

When tracking a colored object in a video sequence, CAMshift operates on a color probability distribution image derived from color histograms. CAMshift calculates the centroid of the 2D color probability distribution within its 2D window of calculation, re-centers the window, and then calculates the area for the next window size. Thus, it is not necessary to calculate the color probability distribution over the whole image; instead, the calculation of the distribution can be restricted to a smaller image region surrounding the current CAMshift window [25].

The essence of the CAMshift algorithm lies in the idea of region matching, a process of finding correspondences between image elements from two image frames with different viewpoints. The match criterion is a similarity measure based on the color probability distribution, which continuously changes with the changes in the image elements in consecutive frames. The mean shift algorithm determines the convergence from the initial search window location and scales the best match based on the color-histogram similarity. The location is updated by applying the mean shift algorithm across each pixel location, and scaling is performed to find the densest region of similarity to the target's probability density function. This similarity refers to the Bhattacharyya distance d, represented in terms of its coefficient ρ:

$$\rho[p(y), q] = \sum_{u=1}^{m} \sqrt{p_u(y)\, q_u}, \qquad d[p(y), q] = \sqrt{1 - \rho[p(y), q]} \tag{2.1}$$

The algorithm makes use of the mean shift theory in order to determine the probability density histogram over successively changing frames. In Equation (2.1), p and q represent the discrete 2D pixel distributions of the target and candidate models, and the Bhattacharyya coefficient measures the similarity between the candidate and target histograms under consideration. The mean shift theorem is used to continuously update the center of mass and the coordinates of the centroid over a search window until the densest region is obtained. The mean shift algorithm is discussed in detail in the following section.
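As a concrete illustration of the steps listed above, the following minimal sketch (assuming Python with OpenCV; the video file name and initial search window are placeholders) builds a hue histogram for a selected landmark, back-projects it onto each frame, and lets cv2.CamShift iterate mean shift to convergence, returning the landmark centroid that would later serve as a vision measurement.

```python
import cv2
import numpy as np

# Minimal CAMshift loop sketch; file name and initial ROI are placeholders.
cap = cv2.VideoCapture("landmark_video.mp4")
ok, frame = cap.read()
if not ok:
    raise RuntimeError("could not open the example video")

x, y, w, h = 300, 200, 80, 80                 # initial search window around the landmark
roi = frame[y:y + h, x:x + w]

# Step 2: hue histogram of the selected region (low S/V pixels masked out)
hsv_roi = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
mask = cv2.inRange(hsv_roi, (0, 60, 32), (180, 255, 255))
hist = cv2.calcHist([hsv_roi], [0], mask, [180], [0, 180])
cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)

# Steps 3-6: back-project the histogram and iterate mean shift until convergence
term = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
track_window = (x, y, w, h)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    back_proj = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
    rot_box, track_window = cv2.CamShift(back_proj, track_window, term)
    cx, cy = rot_box[0]                        # centroid used as the vision measurement
    print(f"landmark centroid: ({cx:.1f}, {cy:.1f})")
cap.release()
```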
An initial 11 ROI is selected and its centroid is noted. The center of mass for the group of points is then calculated. This is similar to the centroid for the weight of all the points in the ROI, which may not be the same as the centroid of the ROI. The mean shift vector is found as the shift or distance between the initial estimate (initial mean of the data points) and the new center of mass. The centroid is then shifted to the new center of mass. This procedure of finding the centroid and center of mass and the mean shift vector is repeated iteratively until the centroid and center of mass overlap, indicating the densest region. The mean shift vector M h is represented as:  nx   W i ( y0 ) xi    y0 M h ( y0 )   i 1nx    Wi ( y0 )   i 1  (2.2) Where x i are the data points as 2D vectors, Wi is the weight for each data point (dependent on the distance from the initial mean location to that particular data point), n x is the total number of data points and y 0 is the initial estimate or mean location. The process is performed iteratively until the mean shift vector becomes zero, indicating that the densest region has been reached. The mean shift vector [28] always points towards the direction of maximum increase in density. The denominator of the above equation represents the added weights for all data points, which is referred to as the “kernel”. Some noteworthy properties of the mean shift algorithm are: 1. The mean shift vector has the direction of the gradient of the density estimate. 12 2. The mean shift vector is computed iteratively to obtain the maximum density in the local neighborhood. 3. The mean shift algorithm is an effective means of finding modes or peaks in a set of data points in a distribution manifesting an underlying probability density function (PDF). It can be noted that, given a set of data, mean shift helps to analyze the distribution that is nonparametric. Parametric distributions, such as Gaussian or exponential distributions, can be represented in terms of analytical formulas. Nonparametric distributions describe distributions in which a Gaussian or a mixed-multiple Gaussian curve cannot be used to fit the data points. Computing the gradient of the nonparametric distribution using the mean shift is better than the nonparametric density estimation without the gradient. In this process, the mode is found as the peak in the distribution, and the mode corresponds to the gradient. The height of the distribution in a histogram is proportional to the number of data points. A taller histogram suggests a denser region with more points. This can be also indicated using the Bhattacharyya coefficient. The kernel density estimation for the nonparametric data results in finding the probability density function. The weights are assigned based on the different types of kernels such as Gaussian, uniform and Epanechnikov. Each of these kernels has a specific profile associated with it. Equation (2.3) represents the probability density function arising from a radially symmetric kernel whose profile is given as ‘k’. This equation leads to a relationship with the mean shift vector. 13 P( x)  c n 2 k ( x  xi )  n i 1 (2.3) The gradient of the probability function given above would lead to equation (2.4) which can be directly related to mean shift vector when compared to equation (2.3). 
The gradient of the probability density function in Equation (2.3) leads to Equation (2.4), which can be related directly to the mean shift vector:

$$\nabla P(x) = \frac{c}{n}\left[\sum_{i=1}^{n} g_i\right]\left[\frac{\displaystyle\sum_{i=1}^{n} x_i\, g_i}{\displaystyle\sum_{i=1}^{n} g_i} - x\right], \qquad \text{where } g_i = g\!\left(\|x - x_i\|^2\right) = -k'\!\left(\|x - x_i\|^2\right) \tag{2.4}$$

It can be observed from Equations (2.3) and (2.4) that the mean shift vector is one of the terms in the gradient of the probability function. In other words, a relationship exists between the two, which is given in Equation (2.5):

$$M_h(x) = \frac{\nabla P(x)}{\dfrac{c}{n}\displaystyle\sum_{i=1}^{n} g_i} \qquad \text{(where the } g_i \text{ represent the weights)} \tag{2.5}$$

The mean shift algorithm incorporates some crucial stages that are employed in the proper execution of the CAMshift algorithm. These stages are as follows:
1. Choose a search window and an initial location of the search window.
2. Compute the mean location (centroid) in the search window by finding the zeroth moment and then the first moments for x and y.
3. Center the search window at the computed mean location.
4. Repeat the above steps until convergence occurs, i.e., the mean shift vector approaches zero and the centroid converges with the center of mass (densest region).

Figure (2.1) depicts the mean shift process as convergence occurs at the densest region.

Figure 2.1: Mean Shift Convergence

2.1.2 Mass Center Calculation

The mean shift convergence criterion states that the mean shift component is implemented by continually computing new values of the mean, or centroid, location (x_c, y_c) for the window position computed in the previous frame until there is no significant shift in position. The maximum number of mean shift iterations is usually taken to be 10-20. Since sub-pixel accuracy cannot be visually observed, a minimum shift of one pixel in either the x or y direction is selected as the convergence criterion. Furthermore, the algorithm must terminate in the case where M_00 (the zeroth moment) is zero, which corresponds to a window consisting entirely of zero intensity. For the CAMshift process, the zeroth and first moments are calculated initially, as part of the mean shift approach, in order to compute the centroid of the tracked object. They are given by the following mass-center calculations:

a) Compute the zeroth moment:

$$M_{00} = \sum_{x}\sum_{y} I(x, y) \tag{2.6}$$

where I(x, y) represents the intensity of the discrete probability image at (x, y) within the search window.

b) Find the first moments for x and y:

$$M_{10} = \sum_{x}\sum_{y} x\, I(x, y), \qquad M_{01} = \sum_{x}\sum_{y} y\, I(x, y) \tag{2.7}$$

c) Compute the mean search window location:

$$x_c = \frac{M_{10}}{M_{00}}, \qquad y_c = \frac{M_{01}}{M_{00}} \tag{2.8}$$

CAMshift is designed for dynamically changing probability distributions. These occur when a tracked object moves so that the size and location of the probability distribution change significantly in time. The CAMshift algorithm adjusts the search window size in the course of its operation. The initial window size can be set to any reasonable value. CAMshift uses the zeroth and first moments to continuously adapt its window size within or over each video frame. The window radius, or height and width, is set to a function of the zeroth moment found during the continuous search. The algorithm is then performed using any initial nonzero window size.
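A minimal sketch of the mass-center calculation in Equations (2.6)-(2.8), applied to a placeholder back-projected probability image (Python with NumPy assumed):

```python
import numpy as np

# Moments of a back-projected probability image (values in [0, 255]);
# `prob` is a placeholder array standing in for the back-projection output.
prob = np.random.randint(0, 256, size=(120, 160)).astype(float)

ys, xs = np.mgrid[0:prob.shape[0], 0:prob.shape[1]]

M00 = prob.sum()                 # zeroth moment, Eq. (2.6)
M10 = (xs * prob).sum()          # first moment in x, Eq. (2.7)
M01 = (ys * prob).sum()          # first moment in y, Eq. (2.7)

if M00 > 0:
    xc, yc = M10 / M00, M01 / M00    # centroid of the search window, Eq. (2.8)
    # The window size is then set as a function of M00 (e.g., proportional to sqrt(M00)).
    print(f"centroid: ({xc:.1f}, {yc:.1f}), sqrt(M00) = {np.sqrt(M00):.1f}")
```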
The motivation for the CAMshift algorithm, which is a modification of the mean shift theory, is that the mean shift algorithm fails as a tracker when subjected to adverse real-time conditions. A window size that works at one distribution scale is not suitable for another scale as the colored object moves towards or away from the camera. The mean shift algorithm makes use of a single probability density function for the entire tracking process. Therefore, the algorithm may fail when subjected to camera motion, occlusions, or noise in the form of distractors, as well as a search window that is too large or too small.

The flowchart in Figure (2.2) describes the CAMshift algorithm with the implementation of the mean shift theory. The algorithm continuously adapts the search window based on the scaling and orientation of the tracked object.

Figure 2.2: CAMshift Implementation

2.1.3 Histogram Back-Projection

The probability density function (PDF) may be determined using any method that matches a particular pixel value to a probability distribution of a landmark or its surroundings [29]. A common method used for this calculation is histogram back-projection. First, the PDF is generated by obtaining an initial histogram, using the first step of the CAMshift algorithm, from the initial search window (ROI) of the filtered image. The histogram makes use of the hue channel in HSV color space. As discussed by Bradski, the Hue Saturation Value (HSV) color system corresponds to projecting the standard Red, Green, Blue (RGB) color space along its principal diagonal from white to black (shown in Figure (2.3a)), which results in the hex cone shown in Figure (2.3b). Descending the V axis in Figure (2.3b) gives smaller hex cones corresponding to smaller (darker) RGB subcubes in Figure (2.3a). The unique quality of HSV space is that it separates hue (color) from saturation (color concentration) and brightness. The color models are then created by computing 1D histograms from the H (hue) channel in HSV space.

Figure 2.3(a): RGB Color Cube; Figure 2.3(b): HSV Color System

A noteworthy consideration when using real cameras with discrete pixel values is that a problem can occur in HSV space, as can be seen in Figure 2.3(b). When brightness is low (V near 0), saturation is also low (S near 0). Hue then becomes quite noisy, since in such a small hex cone small changes in RGB values are not adequately represented by discrete hue pixels. This leads to large deviations in the hue values, causing an erroneous and noisy histogram. To overcome this problem, hue pixels that have very low corresponding brightness values are neglected. This means that, for very dim scenes, the camera must auto-adjust or be adjusted for more brightness in order to track effectively. With sunlight and bright white colors, an upper threshold can be applied to ignore hue pixels with correspondingly high brightness. At very low saturation, hue is not defined, so hue pixels that have very low corresponding saturation are also ignored.

The 1D histogram obtained initially is quantized into bins, which lessens the computational and spatial complexity and allows similar color values to be clustered together [30]. The histogram bins are then scaled between the minimum and maximum probability image intensities using Equation (2.9):

$$P_u = \min\!\left(\frac{255\, q_u}{\max_u(q_u)},\; 255\right), \qquad u = 1, \ldots, m \tag{2.9}$$

Here m denotes the number of bins, and q_u and P_u represent the original and rescaled values of histogram bin u, respectively.
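The following sketch illustrates how such a hue histogram can be built in practice (Python with OpenCV assumed): low-saturation and low-brightness pixels are masked out as discussed above, and the bins are rescaled as in Equation (2.9). The synthetic ROI patch, number of bins, and threshold values are illustrative only.

```python
import cv2
import numpy as np

# Hue histogram of a target ROI with S/V masking and Eq. (2.9) bin scaling.
# A synthetic reddish patch stands in for the selected ROI.
roi_bgr = np.full((60, 60, 3), (40, 60, 200), dtype=np.uint8)
hsv = cv2.cvtColor(roi_bgr, cv2.COLOR_BGR2HSV)

# Ignore pixels whose saturation or brightness is too low (or brightness too high)
# for hue to be reliable; thresholds here are illustrative.
mask = cv2.inRange(hsv, (0, 30, 30), (180, 255, 250))

# 1D hue histogram quantized into m = 16 bins
hist = cv2.calcHist([hsv], [0], mask, [16], [0, 180])

# Equation (2.9): scale bins so the largest bin maps to 255
hist = np.minimum(255.0 * hist / hist.max(), 255.0)
print(hist.ravel())
```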
Histogram back-projection is a primitive operation that associates the pixel values in the image with the value of the corresponding histogram bin, re-applying the previously generated histogram values to the current frame. The back-projection of the target histogram onto any consecutive frame generates a probability image in which the value of each pixel characterizes the probability that the input pixel belongs to the histogram that was used. This is accomplished by rescaling the probability (histogram) values to a new range, so that pixels with the highest probability of belonging to the sample histogram appear as visible intensities in the 2D histogram back-projection image.

The CAMshift algorithm deals with image problems that frequently occur during colored object recognition and tracking, such as irregular object motion due to perspective, image noise, distractors, and occlusions. The algorithm constantly re-scales itself in a way that naturally fits the structure of the image data. Based on the potential velocity and acceleration of the object, and on the scaling due to its distance from the camera, the size of its color probability distribution in the HSV plane is scaled accordingly. In this manner, when objects close to the camera move rapidly in the image plane, their color distribution occupies a larger area; the search window is then scaled larger, capturing larger movements. When objects are distant, the color distribution is small, so the search window size is also small, but distant objects are slower to traverse the video scene. This natural adaptation to distribution scale and translation, without predictive filters or a parametric distribution, furthers the computational savings and serves as a built-in remedy to the problem of erratic object motion and sudden changes in camera motion. This process ignores distribution outliers and substantially reduces the effects of occlusions and noise; thus tracking parameters need not be smoothed, filtered, or sampled, resulting in robust noise tolerance.

The robustness of CAMshift with respect to noise, transient occlusions, and distractors mostly depends on the search window matching the size of the object being tracked. It has been shown that it is better to err on the side of the search window being too small rather than too large. The search window size depends on the zeroth moment M_00. To indirectly control the search window size, the color histogram is adjusted up or down by a constant, truncating at zero or saturating at the maximum pixel value. The adjustment affects the pixel values in the color probability distribution image, which affects M_00 and hence the window size. For an 8-bit hue signal, the histogram is adjusted by 20 to 80 (the maximum being 255), which tends to reduce the search window size to just within the size of the tracked object, hence eliminating noise. HSV brightness and saturation thresholds are employed in cases where hue is not well defined, i.e., for very low or high brightness or low saturation. Low and high thresholds are set at 10-20% of the maximum pixel value.
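A small sketch of the histogram-adjustment heuristic just described, assuming an 8-bit hue histogram and an illustrative offset of 50 (within the suggested 20-80 range):

```python
import numpy as np

# Shifting the hue histogram down by a constant and truncating at zero
# suppresses weak bins, which lowers M00 in the back-projection and
# therefore shrinks the search window. Values are illustrative.
hist = np.array([10, 40, 90, 255, 180, 25, 5], dtype=float)   # example 8-bit histogram
adjusted = np.clip(hist - 50.0, 0.0, 255.0)
print(adjusted)   # [  0.   0.  40. 205. 130.   0.   0.]
```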
It can be noted that the Kalman filtering process, or any other smoothing process typically used in object tracking, is an optimal estimation method with the criterion of minimal error covariance. It has the advantages of low computational cost and real-time performance. However, the standard Kalman filter employs a linear Gaussian state-space model, which may not be consistent with the motion of the tracked object in the real world. In comparison, an improved object tracking algorithm such as the CAMshift algorithm has the distinct advantage that it does not make assumptions regarding the motion of the camera and can provide improved tracking in the presence of noise and occlusions.

2.1.4 Effect of Scaling and Orientation Features in the CAMshift Algorithm

The use of moments to determine the scale and orientation of the probability distribution of a tracked object was first described in 1986 [31], [37] and later, in the late 1990s, by Freeman and Bradski. Here, the scale and orientation of the target are estimated using the moment features of the weight image. Therefore, once the weight image is properly calculated, it leads to accurate moment features and consequently good estimates of changes in the target. The scale and orientation can be well estimated using the probability density function together with the moment features of the weight image [32].

First, in order to estimate the weighted area of the target in the target candidate region, the zeroth moment M_00 is taken into account. The Bhattacharyya coefficient is an indicator of the similarity between the target model and the target candidate model. Higher values of this coefficient mean that more features from the target are present than from the background [33], [34], [35], and it also indicates a lower estimation error for the zeroth moment. Therefore, the Bhattacharyya coefficient is a reliable indicator considering its effect on the zeroth moment and thereby on the target area. This is given by the following equation:

$$A = c(\rho)\, M_{00} \tag{2.10}$$

where A is the area, M_00 is the zeroth moment, and c(ρ) is a monotonically increasing function of the Bhattacharyya coefficient ρ (0 ≤ ρ ≤ 1): c(ρ) = ρ in the case of a linear function, or c(ρ) = exp((ρ − 1)/σ) in other cases. Based on the experimental results for this thesis, the linear function was used and tested successfully. In other experiments conducted previously, where the exponential function was used, the value of σ ranged between 1 and 2 for robust tracking results adaptive to the video content.

In order to determine the orientation of the major axis and the scale of the distribution, an equivalent rectangle is found that has the same moments as those measured from the 2D probability distribution image. Two methods are suggested and compared for calculating the orientation and scale of the object. The first method is the conventional one, which uses the following moment expressions to compute the scaling and orientation estimates:

$$M_{11} = \sum_{x}\sum_{y} xy\, I(x, y), \qquad M_{20} = \sum_{x}\sum_{y} x^2 I(x, y), \qquad M_{02} = \sum_{x}\sum_{y} y^2 I(x, y) \tag{2.11}$$

The first two eigenvalues (the major length and width of the probability distribution) are calculated in closed form with the help of the second-order central moments:

$$\mu_{11} = \frac{M_{11}}{M_{00}} - x_c y_c, \qquad \mu_{20} = \frac{M_{20}}{M_{00}} - x_c^2, \qquad \mu_{02} = \frac{M_{02}}{M_{00}} - y_c^2 \tag{2.12}$$

From the above, the major and minor semi-axes l and w of the probability distribution can be determined as:

$$l = \sqrt{\frac{(\mu_{20} + \mu_{02}) + \sqrt{4\mu_{11}^2 + (\mu_{20} - \mu_{02})^2}}{2}}, \qquad w = \sqrt{\frac{(\mu_{20} + \mu_{02}) - \sqrt{4\mu_{11}^2 + (\mu_{20} - \mu_{02})^2}}{2}} \tag{2.13}$$

The object orientation (the head roll in the case of face tracking), or major-axis inclination, can also be found using the central moments. In most cases, CAMshift thereby tracks four degrees of freedom.
The orientation is given by:

$$\theta = \frac{1}{2}\arctan\!\left(\frac{2\mu_{11}}{\mu_{20} - \mu_{02}}\right) \tag{2.14}$$

With this method, the CAMshift algorithm can be used for general-purpose object tracking using a background-weighted histogram and arbitrarily quantized color features of the target, but the scaling and orientation values are not as accurate as those of the second method, which uses covariance matrices.

The second method entails forming a covariance matrix from the second-order central moments. The covariance matrix given below is estimated in order to obtain the width, height, and orientation of the target:

$$\mathrm{Cov} = \begin{bmatrix} \mu_{20} & \mu_{11} \\ \mu_{11} & \mu_{02} \end{bmatrix} \tag{2.15}$$

Therefore, by making use of the estimated area A from Equation (2.10) together with the central moment features, the width, height, and orientation are computed. This is done using the singular value decomposition (SVD) [38]. The SVD method implies that

$$\mathrm{Cov} = V \, S \, V^{T} = \begin{bmatrix} v_{11} & v_{12} \\ v_{21} & v_{22} \end{bmatrix} \begin{bmatrix} \lambda_1^2 & 0 \\ 0 & \lambda_2^2 \end{bmatrix} \begin{bmatrix} v_{11} & v_{12} \\ v_{21} & v_{22} \end{bmatrix}^{T} \tag{2.16}$$

where λ₁² and λ₂² are the eigenvalues of the matrix Cov. The orientation of the two main axes (pitch and roll estimation) is given by the two column vectors of the matrix V, (v₁₁, v₂₁) and (v₁₂, v₂₂), respectively.

A noteworthy comparison between the two methods is that the values of l and w correspond to λ₁ and λ₂, respectively, but the values calculated by the second method are more accurate.

An extension of the above method has been developed to compute more accurate width and height estimates. In some cases, the target is represented in the form of an ellipse, for which the lengths of the semi-major axis and semi-minor axis are denoted by a and b, respectively. Theoretically, a and b can be related to λ₁ and λ₂ as follows:

$$\frac{a}{b} = \frac{\lambda_1}{\lambda_2} \quad \Rightarrow \quad a = k\lambda_1 \ \text{ and } \ b = k\lambda_2$$

where k is a scale factor. By equating the known area of the target representation, either a rectangle or an ellipse, with the area A estimated from Equation (2.10), the following expressions are obtained:

$$k = \begin{cases} \sqrt{A/(\lambda_1\lambda_2)} & \text{rectangular target} \\ \sqrt{A/(\pi\lambda_1\lambda_2)} & \text{elliptical target} \end{cases} \qquad
a = \begin{cases} \sqrt{\lambda_1 A/\lambda_2} & \text{rectangular target} \\ \sqrt{\lambda_1 A/(\pi\lambda_2)} & \text{elliptical target} \end{cases} \qquad
b = \begin{cases} \sqrt{\lambda_2 A/\lambda_1} & \text{rectangular target} \\ \sqrt{\lambda_2 A/(\pi\lambda_1)} & \text{elliptical target} \end{cases} \tag{2.17}$$

The adjusted covariance matrix with the above implementation in place is then given by:

$$\mathrm{Cov} = V \, S_1 \, V^{T} = \begin{bmatrix} v_{11} & v_{12} \\ v_{21} & v_{22} \end{bmatrix} \begin{bmatrix} a^2 & 0 \\ 0 & b^2 \end{bmatrix} \begin{bmatrix} v_{11} & v_{12} \\ v_{21} & v_{22} \end{bmatrix}^{T}$$

A fundamental distinguishing factor between algorithms that iteratively estimate the covariance matrix for consecutive frames based on mean shift tracking and the above method is that the latter efficiently makes use of the area of the target representation together with the covariance matrix to estimate the width, height, and orientation of the object. This algorithm can robustly estimate the scale and orientation changes of the target under the CAMshift framework. The CAMshift algorithm with augmented scaling and orientation estimation provides a simple and effective method to estimate the attitude and scale of the target while successfully tracking the object using the adaptive features incorporated in the algorithm. Therefore, knowing the area of the target improves the accuracy of the algorithm. The estimated values can be used in vision-based navigation algorithms.
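The first method above can be sketched directly from the moment expressions, as in the following example (Python with NumPy assumed; the probability image is synthetic), which evaluates Equations (2.11)-(2.14) to obtain the semi-axes and orientation:

```python
import numpy as np

# Scale and orientation of a tracked distribution from second-order moments.
# `prob` is a placeholder back-projected probability image.
prob = np.zeros((100, 100))
prob[30:70, 20:80] = 1.0                      # synthetic bright region

ys, xs = np.mgrid[0:prob.shape[0], 0:prob.shape[1]]
M00 = prob.sum()
xc, yc = (xs * prob).sum() / M00, (ys * prob).sum() / M00

# Central moments, Eq. (2.12)
mu11 = (xs * ys * prob).sum() / M00 - xc * yc
mu20 = (xs**2 * prob).sum() / M00 - xc**2
mu02 = (ys**2 * prob).sum() / M00 - yc**2

# Major/minor semi-axes and orientation, Eqs. (2.13)-(2.14)
common = np.sqrt(4 * mu11**2 + (mu20 - mu02)**2)
l = np.sqrt(((mu20 + mu02) + common) / 2)
w = np.sqrt(((mu20 + mu02) - common) / 2)
theta = 0.5 * np.arctan2(2 * mu11, mu20 - mu02)   # quadrant-safe form of Eq. (2.14)
print(f"l={l:.1f}, w={w:.1f}, theta={np.degrees(theta):.1f} deg")
```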
2.2 Adaptable (Advanced) Compressive (ADCOM) Tracking Algorithm

Compressive sensing theory deals with discrete audio or video signals that can be adequately sensed using far fewer measurements than the dimension of the ambient space in which they reside. The technique maintains the appearance model in the case of an image, or the quality of an audio signal, without compromising the structure of the signal. The signal under consideration is accurately recovered from the data collected during the sensing process; there is no loss of data or addition of noise when this technique is implemented.

2.2.1 Background Subtraction Based Tracker

From previously used methods, it can be seen that large amounts of data can degrade tracking performance. One of the most intuitive applications of compressive sensing in visual tracking is the modification of background subtraction [39] so that it can operate on compressive measurements. Background subtraction aims to differentiate the object-containing foreground from the background. This process not only helps to localize objects, but also reduces the amount of data that must be processed at later stages of tracking [40], [41].

Usually, the foreground information occupies a sparse spatial support compared to the background and may be caused by the motion and appearance change of objects within the scene. A background subtraction algorithm is conventionally performed by obtaining the object silhouettes on a single image plane or multiple image planes. However, traditional background subtraction techniques require that the full image be available before the process can begin, and they may require complicated density estimates for each pixel, which become burdensome in the presence of high-resolution imagery.
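For context, the following is a minimal sketch of a conventional per-pixel background subtraction step (not the compressive-measurement variant discussed above), assuming OpenCV's mixture-of-Gaussians model and a placeholder video file:

```python
import cv2

# Conventional background subtraction baseline: a per-pixel mixture-of-Gaussians
# model separates the sparse foreground from the background. Video path and
# parameters are placeholders.
cap = cv2.VideoCapture("scene.mp4")
subtractor = cv2.createBackgroundSubtractorMOG2(history=200, varThreshold=25)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg_mask = subtractor.apply(frame)            # 255 where foreground is detected
    n_fg = cv2.countNonZero(fg_mask)
    print(f"foreground pixels: {n_fg}")
cap.release()
```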
2.2.2 Particle Filter Based L1 Tracker

The concept of sparse representation has recently been used in the L1 tracker, where an object is modeled by a sparse coding process with a given dictionary of target and trivial templates. The algorithm seeks the best candidate that minimizes the reconstruction error using only this sparse linear combination of target and trivial templates. However, the computational complexity of this tracker is rather high, demanding high processing speeds and thereby limiting its use in real-time experiments. The L1 tracker and its extensions have been developed in the particle filter (PF) framework.

A common problem arising in visual tracking is estimating the posterior probability distribution. By definition, the posterior probability is the probability distribution of an unknown quantity, treated as a random variable, conditional on the relevant evidence or ground truth taken from a practical real-time experiment. In Bayesian statistics, the particle filter framework is a sequential Monte Carlo method that uses samples of the conditional distribution in order to approximate it and, thus, the desired estimates. The L1 tracker makes use of a technique known as sequential importance sampling in the form of a bootstrap filter.

The sampling technique involves three stages. First, the samples are drawn from the known prior distribution. Second, a prediction step involving the generation of candidate samples is followed by calculating importance weights based on the observation, which are later normalized; for each particle, its representation is computed independently by solving a constrained L1 minimization problem with non-negativity constraints. Finally, the filter enters the selection step, where samples are generated from a discrete distribution over the candidate particles. In order to adapt to the appearance and structural changes of the tracked object, the template is updated depending on both the weights assigned to each template and the similarity between the templates and the current estimate of the target candidate.

2.2.3 Fundamental Types of Online Tracking Methods

The ADCOM algorithm used in this thesis is derived from the three types of online tracking methods discussed below. This discussion highlights the difference, flexibility, and versatility of the ADCOM algorithm when compared to the other available visual tracking methods [42].

Generative Method

The generative method involves learning a model to represent the object and then using the model to search for the image region with minimal reconstruction error. The tracking problem is formulated as searching for the regions that are most similar to the tracked targets. These methods are based on either template matching or subspace models. The continuously adaptive mean shift (CAMshift) algorithm is an example of this method; these methods consider the color probability distribution or the color histogram in their tracking approach.

Discriminative Method

The discriminative method poses the tracking problem as a binary classification task in order to find the decision boundary that separates the object from the background; it subtracts the background information or ignores the background data entirely, thereby drawing a boundary between the features that best discriminate between object and background. Online updating of the learned target is a common feature of this approach. Mostly, the negative samples are eliminated and only the positive features are learned and tracked. Drift is a major issue to be addressed when implementing these methods. The tracking-learning-detection (TLD) algorithm is an apt example of the discriminative method.

Collaborative Method

Collaborative methods involve a semi-supervised learning approach in which positive and negative samples are selected via an online classifier with structural constraints. Both foreground and background information are used in the tracking process. While most tracking algorithms work on the premise that the object appearance, environmental lighting conditions, or intensity do not change significantly as consecutive frames are processed, the ADCOM algorithm belongs to a category of online tracking algorithms that account for the appearance variation of the target and background, thereby facilitating the tracking task in various circumstances. The compressed samples of the foreground target and the background are obtained using the same sparse measurement matrix. The tracking task is formulated by classifying the positive and negative samples via a naive Bayes classifier under the Bayesian framework.
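As an illustration of this kind of classifier, the following sketch scores a candidate's compressed feature vector with a Gaussian naive Bayes log-likelihood ratio; the Gaussian class models, equal class priors, and all numerical values are assumptions for illustration rather than the exact ADCOM formulation:

```python
import numpy as np

# Naive Bayes scoring of compressed features: each feature v_i is modeled as
# Gaussian under the positive (target) and negative (background) classes, and
# the per-feature log-likelihood ratios are summed (equal priors assumed).
def naive_bayes_score(v, mu_pos, sig_pos, mu_neg, sig_neg):
    def log_gauss(x, mu, sig):
        return -0.5 * np.log(2 * np.pi * sig**2) - (x - mu) ** 2 / (2 * sig**2)
    return np.sum(log_gauss(v, mu_pos, sig_pos) - log_gauss(v, mu_neg, sig_neg))

# Illustrative class-conditional parameters for a 4-dimensional compressed feature
mu_pos, sig_pos = np.array([1.0, 0.5, -0.2, 2.0]), np.array([0.4, 0.3, 0.5, 0.6])
mu_neg, sig_neg = np.array([0.0, 0.0,  0.0, 0.0]), np.array([1.0, 1.0, 1.0, 1.0])

candidate = np.array([0.9, 0.4, -0.1, 1.8])       # compressed features of a candidate window
print(naive_bayes_score(candidate, mu_pos, sig_pos, mu_neg, sig_neg))  # > 0 favors target
```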
There are two main drawbacks to this approach. First, while these adaptive appearance models are data-dependent, there are very few data that can be properly utilized by these algorithms at the outset. Second, online tracking algorithms often encounter the problem of drift: algorithms that adapt to their surrounding conditions work with misaligned samples that overlap and accumulate, causing degradation of the appearance model. In most cases, these algorithms also fail to include crucial details from the background that would likely improve tracking performance, stability and accuracy. The proposed solution is an algorithm whose appearance model is based on features extracted from a multi-scale image feature space using data-independent features. Compressive tracking [45] is introduced to help alleviate some of the challenges associated with performing classical tracking in the presence of large amounts of data. The amount of data that the system must handle can be drastically reduced using this technique. The algorithm employs non-adaptive random projections that preserve the structure of the image feature space of objects, and a very sparse measurement matrix is adopted to efficiently extract the features for the appearance model. The two key elements of this algorithm are therefore the non-adaptive random projections and the sparse measurement matrix; both are fundamental to compressive sensing and are here applied to the object tracking domain. This algorithm has previously been tested in environments where a static background is processed using data-independent features.

2.2.5 Random Projection

The use of random projection for compressive tracking rests on the Johnson-Lindenstrauss lemma [46], which must be satisfied in order to apply random projection to compress a matrix. The lemma states that a set of points in a high-dimensional space can be projected into a much lower-dimensional random subspace, whose dimension is independent of the original dimension, while preserving with high probability much of the structure of the point set in terms of its inter-point (Euclidean) distances and angles. Let A ∈ R^{n×D} denote the n points in D dimensions [47]. The method multiplies A, which corresponds to the image space, by a random matrix C ∈ R^{D×k}, reducing the D dimensions down to just k in order to speed up the computation. The matrix C is the random measurement matrix, which typically consists of entries drawn from the standard normal distribution N(0, 1). The projected data matrix, which lies in a lower-dimensional image space, is given by

B = AC ∈ R^{n×k}        (2.18)

If the random measurement matrix C in (2.18) satisfies the Johnson-Lindenstrauss lemma, then A can be reconstructed from B with minimal error and with high probability, provided that A is compressible in nature, as is the case for audio or image vector spaces. In this way, the information contained in the higher-dimensional matrix A is preserved in B. This theoretical foundation enables the linear analysis of high-dimensional signals via low-dimensional random projections. For an efficient projection, a very sparse measurement matrix needs to satisfy the restricted isometry property (RIP).
According to the restricted isometry property [48], a random matrix C of size D × k is said to satisfy the RIP with restricted isometry constant R(m, D, k; C) if, for every m-sparse row vector A ∈ R^{1×D} (that is, every A with ‖A‖_0 ≤ m),

R(m, D, k; C) := min y ≥ 0   subject to   (1 − y)‖A‖_2^2 ≤ ‖AC‖_2^2 ≤ (1 + y)‖A‖_2^2        (2.19)

The RIP constant R(m, D, k; C) is the maximum distance from 1 of all the eigenvalues of the (k choose m) submatrices C_K^T C_K derived from C, where K is an index set of cardinality m that restricts C to the columns indexed by K. A typical sparse measurement matrix satisfying the restricted isometry property is the random Gaussian matrix C ∈ R^{D×k} with entries c_ij ~ N(0, 1), as commonly used in previous works [49][50][51]. In the ADCOM algorithm, a very sparse random measurement matrix C with i.i.d. (independent and identically distributed) entries is considered, based on the work of Achlioptas [52]. According to his result, the entries are drawn as

c_ij = √s × { +1 with probability 1/(2s);   0 with probability 1 − 1/s;   −1 with probability 1/(2s) }        (2.20)

Achlioptas proved that this type of matrix with s = 1, 2 or 3 satisfies the Johnson-Lindenstrauss lemma. Such a matrix is very easy to compute, requiring only a uniform integer random number generator. With s = 3, one can achieve a threefold speedup because only 1/3 of the data need to be processed (hence the name sparse random projections) and 2/3 of the computations can be avoided. The multiplication by the factor √s can be deferred until after the projection, so no floating-point operations are needed during the projection itself and all computations amount to highly optimized database aggregation operations. This approach was first experimentally tested on image and text data as an application of random projection in 2001 [53]. The sparse random projections sample the data at a rate of 1/s (i.e., when s = 3, only one-third of the data are sampled), which is a considerably smaller data sample size. Statistical and empirical results reveal that, in certain cases, one does not even have to sample one-third of the data in order to obtain good estimates. In fact, when the data are approximately normal, a matrix as sparse as s = D/(4 log D) can be used instead of the s = √D used in this algorithm, because of the exponential error tail bounds that are common in normal-like distributions, such as the binomial and gamma distributions. For certain tracking applications, such as facial recognition against a static background, and to improve robustness, choosing s = √D is recommended, which can further reduce the portion of samples that need to be processed.
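To make the construction in Eqs. (2.18) and (2.20) concrete, the following MATLAB sketch generates an Achlioptas-style sparse measurement matrix and projects a set of high-dimensional feature vectors onto it. The dimensions, the sparsity parameter and the random test data are illustrative assumptions, not values taken from this work.

```matlab
% Sketch: very sparse random measurement matrix (Eq. 2.20) and projection (Eq. 2.18).
D = 4096;              % original feature dimension (illustrative)
k = 50;                % compressed dimension (illustrative)
s = 3;                 % sparsity parameter; entries are nonzero w.p. 1/s

% Entries of C: +sqrt(s) w.p. 1/(2s), 0 w.p. 1 - 1/s, -sqrt(s) w.p. 1/(2s)
u = rand(D, k);
C = zeros(D, k);
C(u < 1/(2*s))     =  sqrt(s);
C(u > 1 - 1/(2*s)) = -sqrt(s);

% Project n feature vectors (rows of A) from D to k dimensions
n = 100;
A = randn(n, D);       % placeholder for multi-scale image features
B = A * C;             % low-dimensional representation, Eq. (2.18)

% Rough check of distance preservation (Johnson-Lindenstrauss behavior)
dOrig = norm(A(1,:) - A(2,:));
dProj = norm(B(1,:) - B(2,:)) / sqrt(k);   % rescale by sqrt(k)
fprintf('distance before: %.2f   after projection: %.2f\n', dOrig, dProj);
```

In practice the matrix C would be generated once and reused for every frame, since the classifier features must correspond to the same random projections throughout the tracking sequence.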
2.2.6 Sparse Measurement Matrix Representation

The fundamental definition of a sparse matrix is a matrix containing few non-zero elements and mostly zeros in its rows and columns. Sparse matrices have several attractive properties, such as low computational complexity in both encoding and reconstruction, simple incremental updates to signals, and low storage requirements. The Bayesian framework, which is incorporated as a classifier in the ADCOM algorithm, generally assumes an a priori probability distribution that favors sparsity for the signal vector and uses a maximum a posteriori (MAP) estimator, which provides a point ("best") estimate that incorporates the observation [54]. To represent the sparse structure, consider an image I of size N × M that is vectorized into a column vector X of size P × 1, where P = N·M, by concatenating the individual columns of I in order. The nth element of the image vector is referred to as I(n), where n = 1, ..., P. Let us assume that the basis Ψ = [ψ_1, ..., ψ_P] provides a K-sparse representation of X:

X = Σ_{n=1}^{P} θ(n) ψ_n = Σ_{l=1}^{K} θ(n_l) ψ_{n_l}        (2.21)

where θ(n) is the coefficient of the nth basis vector ψ_n (of size P × 1) and the coefficients indexed by n_l are the K non-zero entries of the basis decomposition. Equation (2.21) can be written more compactly as

X = Ψθ        (2.22)

where θ is a P × 1 column vector with K non-zero elements. If ‖·‖_p denotes the l_p norm, with the l_0 norm simply counting the non-zero elements of θ, then the image I is called K-sparse when ‖θ‖_0 = K. The tracking task is formulated as a binary classification via a naive Bayes classifier with online updates in the compressed domain. It is to be noted that working in the compressed domain does not cause a significant decrease in the tracking performance of the ADCOM algorithm.

2.2.7 ADCOM Tracking Algorithm

Before discussing the process of classification and tracking, the method first selects relevant features for tracking based on dimensionality reduction in the form of compression.

2.2.7.1 Haar-Like Features

The ADCOM algorithm makes use of Haar-like features to determine significant features of the tracked object, which are then compressed using the random sparse measurement matrix. This compressive feature extraction with dimensionality reduction preserves the structure of the object. Haar-like features make use of the change in contrast values between sets of adjacent pixels enclosed within rectangular windows [55]; the intensity values of individual pixels are not considered. The changes in contrast between the rectangular pixel groups are used to distinguish relatively light and dark areas; hence, two or three adjacent groups with a relative contrast variance form a Haar-like feature. The concept of an integral image, in the form of summed-area tables, is incorporated in the feature extraction process to improve the speed of feature classification. The integral image is an intermediate representation of an image frame in which simple rectangular adjacent pixel features are calculated. It uses a series of these features, which are simple rectangular areas with relatively dark and light regions, to find corresponding matching patterns in adjacent groups of pixels. The integral image consists of a 2D (or, for color images, 3D) matrix of the same size as the original image, in the form of a lookup table, in which each element represents the sum of all pixel intensity values located above and to the left of that pixel location [56]. The integral image can be computed recursively as

I(x, y) = i(x, y) + I(x − 1, y) + I(x, y − 1) − I(x − 1, y − 1)        (2.23)

where i(x, y) is the intensity of the pixel under consideration and the remaining terms are integral image values at the adjacent pixel locations. The changes in contrast between two or more rectangular groups of pixels contribute to the feature classification, which is then compressed for further processing.
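The integral image of Eq. (2.23) can be built in a single pass, after which the sum over any rectangular region requires only four look-ups. The short MATLAB sketch below illustrates this; the image file name and rectangle coordinates are placeholders rather than values from this work.

```matlab
% Sketch: integral image (Eq. 2.23) and constant-time rectangle sums.
img = double(imread('frame001.png'));          % placeholder file name
if ndims(img) == 3, img = mean(img, 3); end    % reduce color images to one channel

II  = cumsum(cumsum(img, 1), 2);               % integral image (summed-area table)
IIp = zeros(size(II) + 1);                     % zero-pad so border rectangles work
IIp(2:end, 2:end) = II;

% Sum of intensities in the rectangle spanning rows r1..r2 and columns c1..c2
r1 = 10; c1 = 20; r2 = 40; c2 = 60;            % placeholder rectangle
rectSum = IIp(r2+1, c2+1) - IIp(r1, c2+1) - IIp(r2+1, c1) + IIp(r1, c1);

% A two-rectangle Haar-like response: left half minus right half of the window
cm = floor((c1 + c2) / 2);
leftSum  = IIp(r2+1, cm+1) - IIp(r1, cm+1) - IIp(r2+1, c1)   + IIp(r1, c1);
rightSum = IIp(r2+1, c2+1) - IIp(r1, c2+1) - IIp(r2+1, cm+1) + IIp(r1, cm+1);
haarResponse = leftSum - rightSum;
```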
2.2.7.2 Dimensionality Reduction (Lossless Compression)

Compared to traditional dimensionality reduction methods, the proposed algorithm makes no use of data-dependent parameters, nor does it require additional computation for an eigenvalue decomposition. Mathematically, the algorithm operates on a set of input feature samples, which are labeled as positive or negative samples depending on their distance from the tracked object. These are taken into the compression process as a library of positive, negative and zero entities [57]. The algorithm then iteratively selects the features that minimize the residual output error, so that the resulting features have a direct correspondence to the performance requirements of the given task. Furthermore, the proposed algorithm makes use of a sparse measurement matrix with which the higher-dimensional multi-scale feature vectors are compressed to produce a lower-dimensional feature set that is classified using a naïve Bayes classifier. An example illustrates the dimensionality reduction and compression process.

Figure 2.4 ADCOM implementation: the high-dimensional feature vector A = (A_1, A_2, ..., A_k)^T is multiplied by the sparse random matrix C (gray, black and white entries denoting positive, negative and zero values, respectively) to produce the compressed vector B, with B_i = Σ_j C_ij A_j.

Here, consider a sample Q ∈ R^{w×h}, an image patch taken from within the search window. The sample Q is convolved with a set of rectangular filters at multiple scales, which can be defined as

h_{i,j}(x, y) = 1 for 1 ≤ x ≤ i and 1 ≤ y ≤ j, and 0 otherwise

where i and j represent the width and height of a rectangular filter, respectively. The arrows in Figure 2.4 illustrate that a non-zero (gray or black) entry in one row of C sensing an element of A is equivalent to a rectangle filter of a particular width and height acting on the pixel intensities at a fixed position of the input image. The set of filtered image frames is concatenated into a very high-dimensional multi-scale image feature column vector A = (A_1, A_2, ..., A_k)^T. Note that k is of the order of (w·h)^2 here, so k represents a very high dimensionality. The sparse random matrix is then used to project this higher-dimensional vector to a much lower-dimensional space without compromising the structure of the compressed image data. The lower-dimensional image feature vector B is comprised of elements that are linear combinations of spatially distributed rectangle features at multiple scales. The compressed features are computed from relative intensity differences, in a manner parallel to Haar-like feature detection. As discussed in Section 2.2.7.1, these Haar-like features preserve the information of the original image in a compressed form. The features obtained after compression are employed by the classifier for object tracking.
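A minimal sketch of this compression step is shown below, combining the rectangle sums of Section 2.2.7.1 with a sparse set of random weights in the spirit of Eq. (2.20). The number of features, the rectangles per feature and the patch location are illustrative assumptions; in practice the random rectangles and weights would be drawn once and reused for every frame.

```matlab
% Sketch: compressed multi-scale rectangle features for one image patch.
% Assumes IIp is the zero-padded integral image of the current frame and the
% patch occupies rows r0..r0+h-1 and columns c0..c0+w-1.
nFeat  = 50;      % compressed feature dimension (illustrative)
nRects = 4;       % non-zero entries per compressed feature (sparse row of C)

rng(1);           % repeatable draw; in practice, generate once and store
feat = zeros(nFeat, 1);
for i = 1:nFeat
    for j = 1:nRects
        rh = randi([2 h]);  rw = randi([2 w]);    % random rectangle size
        rr = r0 + randi([0 h - rh]);              % random rectangle position
        rc = c0 + randi([0 w - rw]);
        wgt = sign(randn);                        % +/-1 sparse weight
        % Rectangle sum via four integral-image look-ups
        sRect = IIp(rr+rh, rc+rw) - IIp(rr, rc+rw) - IIp(rr+rh, rc) + IIp(rr, rc);
        feat(i) = feat(i) + wgt * sRect;
    end
end
% feat plays the role of the compressed vector B used by the classifier.
```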
2.2.7.3 Naïve Bayes Classifier

A naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem (from Bayesian statistics) with strong (naive) independence assumptions; a more descriptive term for the underlying probability model would be "independent feature model." Bayes' theorem describes how the conditional probability of each of a set of possible causes for a given observed outcome can be computed from knowledge of the probability of each cause and the conditional probability of the outcome for each cause [58]. In simple terms, a naive Bayes classifier [59] assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature of the same class. Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a supervised learning setting, such as in a tracking algorithm where the classifier is trained in the compressed vector space to separate the positive foreground samples from the negative background samples. The data needed by the classifier are based on the positive and negative samples extracted by the Haar-like feature extraction. Here the input library is comprised of these extracted samples, which are distributed over the random matrix as non-zero and zero entries. The output labels are given as y ∈ {0, 1}, and a training set of i.i.d. entries is obtained by pairing the extracted feature vectors with the output labels. The naïve Bayes classifier uses the training set to calculate estimates of the probabilities p(B_i | y) and p(y); these estimates are calculated so that the number of occurrences of each event in the training set is accounted for. The classifier is modeled as

H(B) = log[ Π_{i=1}^{n} p(B_i | y = 1) p(y = 1) / Π_{i=1}^{n} p(B_i | y = 0) p(y = 0) ] = Σ_{i=1}^{n} log[ p(B_i | y = 1) / p(B_i | y = 0) ]        (2.24)

where a uniform prior p(y = 1) = p(y = 0) is assumed so that the prior terms cancel. Based on the work of Diaconis and Freedman [60], most random projections of high-dimensional random matrices are Gaussian. The mainstay of data analysis, as in this case, is the use of lower-dimensional projections to study higher-dimensional data sets; data sets of two or more dimensions can then be represented as one-dimensional histograms. According to their work, the standard measure of randomness is entropy, and for a fixed scale the maximum entropy is attained by the Gaussian distribution. Following this reasoning, the probability estimates p(B_i | y = 1) and p(B_i | y = 0) are assumed to be Gaussian distributed with four parameters (μ_i^1, σ_i^1, μ_i^0, σ_i^0) corresponding to the mean and standard deviation of the positive and negative features, respectively. Applying maximum likelihood estimation (an entirely analytic maximization procedure) to estimate the parameters of this classifier model leads to the following update equations:

μ_i^1 ← λ μ_i^1 + (1 − λ) μ^1
σ_i^1 ← sqrt[ λ (σ_i^1)^2 + (1 − λ)(σ^1)^2 + λ(1 − λ)(μ_i^1 − μ^1)^2 ]        (2.25)

where λ > 0 is a learning-rate parameter and

σ^1 = sqrt[ (1/n) Σ_{k | y = 1} (B_i(k) − μ^1)^2 ],    μ^1 = (1/n) Σ_{k | y = 1} B_i(k)

are computed from the positive samples of the current frame (the negative-class parameters μ_i^0 and σ_i^0 are updated in the same manner). Two properties distinguish the ADCOM algorithm from other trackers. First, some generative and discriminative algorithms make use of training samples cropped from previous frames, which must be stored and updated, whereas the ADCOM algorithm uses a data-independent random measurement matrix. Second, the algorithm extracts a linear combination of generalized Haar-like features, in contrast to other trackers that use the whole template of the landmark for tracking purposes. These two properties of the ADCOM algorithm clearly differentiate it from other tracking algorithms.
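A minimal MATLAB sketch of the classifier in Eqs. (2.24) and (2.25) is given below. It assumes that the compressed features of the current frame's positive and negative samples are already available (for example, from the feature-extraction sketch above); the learning rate and the variable names are illustrative.

```matlab
% Sketch: naive Bayes classifier (Eq. 2.24) with online update (Eq. 2.25).
% posFeat, negFeat     : nFeat-by-nPos and nFeat-by-nNeg compressed feature matrices
% mu1, sig1, mu0, sig0 : nFeat-by-1 running parameters (initialized beforehand)
lambda = 0.85;                          % learning rate (illustrative)

% --- Online update of the positive-class parameters, Eq. (2.25) ---
muNew  = mean(posFeat, 2);
sigNew = std(posFeat, 1, 2);            % normalized by n, as in Eq. (2.25)
sig1 = sqrt(lambda*sig1.^2 + (1-lambda)*sigNew.^2 ...
            + lambda*(1-lambda)*(mu1 - muNew).^2);
mu1  = lambda*mu1 + (1-lambda)*muNew;
% (mu0 and sig0 are updated the same way using negFeat)

% --- Classifier response, Eq. (2.24), for one candidate feature vector B ---
gauss = @(x, mu, sg) exp(-(x - mu).^2 ./ (2*sg.^2)) ./ (sqrt(2*pi)*sg);
p1 = gauss(B, mu1, sig1) + 1e-30;       % small constant guards against log(0)
p0 = gauss(B, mu0, sig0) + 1e-30;
H  = sum(log(p1) - log(p0));            % candidate with the largest H is selected
```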
2.3 Secondary Algorithms to Support Data-Dependent and Data-Independent Algorithms

2.3.1 Template Matching

Template matching is a simple process in which a template is matched against an image, where the template is a sub-image containing the shape or contour of the object to be tracked. The generic algorithm centers the template on an image point and sums the number of points in the template that match those in the image. The procedure is repeated over the entire image, and the point that yields the best match, corresponding to the maximum count, is defined as the point where the shape (given by the template) lies within the original image. Template matching may be used when the standard deviation of the template image relative to the source image is small. Templates are most often used to identify printed characters, numbers, and other small, simple objects. Template matching is performed on either bi-level (black and white) or grey-level images, depending on the application. In Simulink models, the template can be loaded from the main image during the object tracking process; the best match found in the search over each image continues the tracking of that particular portion of the original image. Formally, template matching can be defined as a method of parameter estimation, in which the parameters define the position of the template. Template matching uses a specific similarity criterion to locate an object, based on the correlation principle expressed by

p = Σ_x Σ_y (A_xy − Ā)(B_xy − B̄) / sqrt{ [Σ_x Σ_y (A_xy − Ā)^2] [Σ_x Σ_y (B_xy − B̄)^2] }        (2.26)

where Ā and B̄ are the two-dimensional means of the respective image matrices and (x, y) are the spatial coordinates within A and B. This correlation coefficient closely resembles the traditional statistical correlation, the difference being that the traditional coefficient is calculated in one dimension instead of two. A high correlation coefficient in a pixel-by-pixel comparison between the template and the region of interest (ROI) indicates a good match.

2.3.2 Pattern Recognition

Pattern recognition entails analyzing raw data and categorizing them into available classes or subsets. In order to perform this classification, a feature space is typically selected to represent the data in a manner that simplifies the classification task. Once the features are identified, each class or category is defined using specific models. When data from an unlabeled object are read, the class or category of this object is determined by inferring which of the descriptions best classifies the features. This process of detecting, describing and recognizing visual patterns has led to advances in automating several tasks such as optical character recognition, scene analysis, fingerprint identification, and facial recognition. When applied to tracking hand gestures, histograms of image measurements such as the gradient orientation or the color probability of the tracked hand may be used to recognize gestures captured in the video image. In generic applications, such as face or hand motion tracking, a system is operated in two modes. First, the system operates in a training mode in which the measurement data (e.g., gradient orientation histograms) taken from specific tracked regions of a video frame are linked with identified gestures or motions of the object. Next, the system operates in a performance mode, in which hand or face regions of video frames are compared with the stored hand or face regions to identify the gesture or motion captured in the video frame. The identified and recognized motions or gestures are then used for interface control.

2.3.3 Color Detection

Most state-of-the-art and common approaches to object detection and tracking rely heavily on intensity-based features without considering the color information in the image.
This exclusion of color information is usually due to large variations in color caused by changes in illumination, compression, motion blur, and occlusions. These variations make the task of robust color description especially difficult. On the other hand, and in contrast to object detection, color has been shown to yield excellent results in combination with shape features for image classification. There are three main criteria to consider when choosing a color tracking algorithm for vision-aided navigation purposes. Feature Combination: There exist two main approaches to combining shape and color information: early and late fusion. Early fusion combines shape and color at the pixel level, which are then processed together throughout the rest of the learning and classification pipelines. In late fusion, shape and color are described separately from the beginning and the exact binding between the two features is lost. Early fusion, in general, results in more discriminative features than late fusion since it preserves the spatial binding between color and shape features. Photometric Invariance: One of the main challenges in color representation is the large variation in features caused by scene-accidental effects such as illumination changes and varying shadows. Photometric invariance theory provides guidelines on how to ensure invariance with respect to such events; however, photometric invariance comes at the cost of discriminative power. The choice of the color descriptor used should take into consideration both its photometric invariance as well as its discriminative power. 44 Compactness: Existing luminance-based object detection methods use complex representations. For example, the part-based method models an object as a collection of parts, where each part is represented by a number of histograms of gradient orientations over a number of cells. Each cell is represented by an N-dimensional vector. Training such a complex model, for just a single class, can require large gigabytes of memory space and a supercomputer to process the information and track the objects. When extending these cells with color information, it is therefore imperative to use a color descriptor or a color classifier as compact as possible to reduce both memory usage and total learning time. A specific color signal can be selected and separately processed, which can be used for tracking purposes based on its variation. 2.3.4 Edge Detection Edge detection is the process of finding edges in an image in order to facilitate image segmentation, data compression, and image reconstruction. An edge may be regarded as a boundary between two dissimilar regions, points or sets of pixels in an image. In computer vision and image processing systems, in order to interpret an image, the separation of the image into object and background regions is a critical step. Segmentation partitions the image into a set of disjoint regions that are visually different, and are uniform and meaningful with respect to some characteristics or computed properties, such as grey level, intensity, texture or color to facilitate image analysis. Edge-based methods are the most commonly used techniques for performing image segmentation. The edges in an image are the significant characteristics that provide an indication of spatial frequency. Edge detection is commonly used for feature detection and feature extraction as part of object identification algorithms. 
The edge detection 45 process marks points in a digital image at which the luminous intensity changes sharply. In many image analysis processes, the edges of objects in the image are first detected. Edge representation of an image significantly reduces the amount of data to be processed, yet it retains useful information about the shapes of objects in the scene. The effectiveness of many image processing and computer vision tasks depends on the detection of meaningful edges. Edge detection has proven to be a challenging task in low level image processing. Various approaches are available for edge detection, some of which are based on error minimization, maximizing an object function, neural networks, fuzzy logic, wavelet approaches, Bayesian approaches, morphology, and genetic algorithms. 2.3.5 Corner Detection The point where two edges meet in an image frame is defined as a corner. Corner detection is based on detecting corner feature points with the primary goal of obtaining robust, stable and well-defined image features for object tracking and recognition of three-dimensional landmarks or objects. There is no image reconstruction available from the corner detection process. Corner detection is frequently used in motion detection, image matching, tracking, image mosaicing, panoramic stitching, 3D modeling and object recognition. Corners in images contain a great deal useful information about the scene. Extracting corners accurately is significant in image processing, which can reduce many of the calculations. The corner detection procedure on object boundaries involves first segmenting a scene image into meaningful regions and then extracting boundaries from the regions of interest. 46 2.3.6 KLT Tracker The Kanade-Lucas-Tomasi (KLT) tracker has been studied extensively and used for feature point tracking. The KLT tracking algorithm is based on the assumption of small frame-to-frame motion of features, and initially KLT computes the optical flow of feature points and performs nonlinear optimization. Feature tracking is a first step in many machine vision applications including optical flow, object tracking, 3D reconstruction and collision avoidance. Higher-level computer vision algorithms require, therefore, robust tracking performance without accounting for the motion of the camera. KLT feature point tracking uses template image alignment techniques. The fundamental assumption of KLT feature tracking is small spatial and temporal changes in appearance across consecutive images. Mathematically speaking, the KLT tracker solves a nonlinear optimization problem and has a limited convergence region for a true global minimum. Incorrect solutions based on local minima can be minimized when a new search region is given from an external inertial sensor instead of simply using the last tracking state. 47 CHAPTER 3 Image Processing Implementation Image processing is a technique of continuously and rigorously passing raw camera data in terms of image frames through an analysis phase thereby obtaining a processed image at the output. The analysis phase may include image display and printing, image manipulation, features extraction, image enhancement, and image compression. An image is an array or a matrix of square pixels (discretized points in a graphical image). As discussed in Chapter 2, the two essential image processing algorithms considered in this work are CAMshift (continuously adaptive mean shift) and ADCOM (advanced compressive tracking). 
Both of these algorithms were implemented in a series of steps in order to obtain robust tracking results, which pave the way for further analysis. The analysis of both algorithms is based on experimentation and testing with four different examples, which are studied in detail.

3.1 Implementation and Analysis of CAMshift Algorithm

The CAMshift algorithm is robust and depends entirely on the color probability or the texture of the object being tracked. It is adaptive in that it continuously changes the window size based on changes in the object properties. A trademark feature of this algorithm is that it provides robust tracking in dynamically changing environments. The implementation of the algorithm is carried out using the following sequence, executed in MATLAB; a condensed sketch of the resulting loop is given after the list.

Step 1: Initialize the directory in which the camera data are stored as N frames at a certain sample rate (frames per second, fps), along with the initial threshold level and the pixel rate of expansion for the search window.

Step 2: Define the search window size and location after close inspection of the position of the target/landmark in the camera frame.

Step 3: Extract the hue (color channel) as a 1D histogram for processing in the region of interest, as it is highly suitable for generating a color probability distribution.

Step 4: Initialize and calculate the zeroth-, first- and second-order moments, along with the centroid of the tracked object based on these moments.

Step 5: Define the covariance matrix using the singular value decomposition method to estimate the width, height and orientation of the tracked object.

Step 6: Calculate the orientation along the longitudinal and lateral axes. Steps 5 and 6 are non-trivial, as they can be used to study changes in the properties of the object and how the algorithm adapts to these changes.

Step 7: Calculate the new window size after the convergent values (based on the centroid) are computed. This calculation is based on the area of the probability distribution and makes use of a scaling factor; an appropriate scaling factor is one that does not generate unreasonably large increases or decreases in the size of the search window.

Step 8: Display and print the tracking and scaling results, store the tracked frames, and plot a map based on the tracked centroids.
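The sketch below condenses Steps 3 through 7 into one CAMshift iteration for a single frame. The histogram bin count, window variable and iteration limit are illustrative assumptions, and boundary clamping as well as the covariance/orientation computation of Steps 5 and 6 are omitted for brevity; it is not the exact code used in this work.

```matlab
% Condensed CAMshift iteration (illustrative sketch).
% frame   : current RGB frame;  histHue : 1-by-nBins hue histogram of the target
% win     : current search window [rowMin colMin height width]
nBins  = 16;
hsv    = rgb2hsv(double(frame)/255);
hueIdx = min(floor(hsv(:,:,1)*nBins) + 1, nBins);   % quantized hue channel
prob   = reshape(histHue(hueIdx), size(hueIdx));    % histogram back-projection

for iter = 1:10                                     % mean shift iterations
    r = win(1); c = win(2); h = win(3); w = win(4);
    P   = prob(r:r+h-1, c:c+w-1);
    M00 = sum(P(:));                                % zeroth-order moment
    if M00 < eps, break; end
    [cols, rows] = meshgrid(1:w, 1:h);
    xc = sum(sum(P.*cols))/M00;                     % centroid from first moments
    yc = sum(sum(P.*rows))/M00;
    win(1) = max(1, round(r + yc - h/2));           % re-center the search window
    win(2) = max(1, round(c + xc - w/2));
end

% CAMshift step: adapt the window size to the area of the distribution (Step 7).
% The scale constant is application-dependent, as discussed under Step 7.
side   = 2*sqrt(M00);
win(3) = max(4, round(1.1*side));
win(4) = max(4, round(1.1*side));
```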
Example 1: Red Ball Tracking
Figure 3.1 (a) Object tracking (Frames 5, 10, 40, 60, 80, 110, 125, 135, 145) (b) Centroid tracking

Example 2: Green Bottle Tracking
Figure 3.2 (a) Object tracking (Frames 4, 5, 6, 7, 25, 40, 43, 46, 75, 77) (b) Centroid tracking

Example 3: Cone Tracking (camera mounted on a quadcopter)
Figure 3.3 (a) Object tracking (Frames 5, 10, 15, 20, 25, 30, 35, 45, 50) (b) Centroid tracking

Example 4: Rock Tracking (camera mounted on an unmanned ground vehicle)
Figure 3.4 (a) Object tracking (Frames 25, 50, 75, 100, 125, 150, 185, 200) (b) Centroid tracking

These examples demonstrate the centroid tracking ability of the CAMshift algorithm. In each example, it can be noted that the background is dynamically changing rather than static. All the experiments included a static object in the foreground but a dynamic background and a moving camera, either mounted on a vehicle or hand-held. The plots in each example show the centroid tracking in the image reference plane.

Red Ball Tracker Analysis: A simple 8-megapixel camera was used to capture the frames for this example. The red ball was a unique landmark or target compared to the background in which it was tracked; the color probability histogram was able to obtain a dense probability function because the red color was easily distinguishable. Even when new objects were introduced into the image frame, the tracker showed no signs of tracking error. The window size changes according to the orientation; a window size of 80 x 80 was selected for this example after careful testing. In this case, the effects of illumination changes were minimal, hence the hue value was not affected.

Green Bottle Tracker Analysis: Initial tests involved the use of two or three green bottles with varied color distribution on the object itself. The tracker did not perform as well as in the previous example when tested with three almost identical targets. The other two bottles differed in size; hence the single object was tested in an environment with a background color similar to that of the object itself. The window size in this example was set at 50 x 70. In almost all the frames under examination, the window did not significantly change in size. It should be noted that, in all the experiments conducted, the scaling factor for the new window computation after convergence differed based on the orientation of the object/landmark.

Cone Tracker Analysis: The cone tracking experiments were conducted under varied conditions with numerous constraints in the environment. Some of these constraints were varying light intensity on the target, similarly sized cones in the scene with the same color probability, multiple landmarks in the vicinity of the original cone, and camera motion or vibration due to mounting on the quadcopter. The size of the search window in this case was 60 x 25. It was observed that, in a few frames, the algorithm erroneously identified and tracked a similar cone placed close to the original target cone. This occurred as the search window increased and a similarly dense color probability distribution was observed; therefore the similar cone was tracked as well.

Rock Tracker Analysis: The final set of experiments was conducted with a rock as a static foreground landmark, using a camera mounted on an unmanned ground vehicle that runs on a high-speed dual-core processor. The controller for the ground vehicle was already programmed and operated through a separate keyboard that provides commands for maneuvering the vehicle. The camera was internally programmed for 15 fps, although accurate centroid tracking can be achieved at lower frame rates. The downside of this approach lies in the fact that if there is constant change in illumination, motion, or occlusion, a robust estimate of the centroid is not possible.

3.2 Implementation and Analysis of ADCOM Algorithm

As discussed in Chapter 2, the advanced compressive (ADCOM) tracking algorithm continuously makes use of features derived from the landmark or target, classifying them as positive or negative features. A search window established around this landmark facilitates the continuous extraction of positive and negative features. The initial determination of the search window size, which is specified in pixels, is an important factor affecting the performance of the algorithm. A poorly sized search window can extract features from different objects if the background is almost the same throughout.
Smaller search window sizes help in the extraction of these features as it would be concentrated to a localized area around the target. The steps involved in the ADCOM tracker are as follows: Step 1: Load the frame data, initialize the search window, and input the corresponding parameters for the negative and positive samples (i.e., the number of samples and radial scope). Step 2: Determine the classifier parameters, the rectangle window parameters, learning rate, and positive/negative classifier parameters. Step 3: Using a Haar-like feature detector, extract the features from the landmark and calculate a sample template. Step 4: Extract the positive and negative features and extract the data from them for updating the classifier. Step 5: Display the tracking results and update the classifier for the next consecutive frames. 56 Example 1: Red Ball Tracking Figure 3.5 Object tracking (Frame # 5, 10, 40, 60, 80,110,125,135,145) 57 Example 2: Green Bottle Tracking Figure 3.6 Object tracking (Frame # 4, 5, 6, 7, 25, 40, 43, 46, 75, 77) 58 Example 3: Cone tracking (Camera mounted on a quad copter) Figure 3.7 Object tracking (Frame # 5, 10, 15, 20, 25, 30, 35, 45, 50) 59 Example 4: Rock Tracking (camera mounted on an unmanned gorund vehicle) Figure 3.8 Object tracking (Frame # 25, 50, 75, 100, 125, 150, 185, 200) 60 Red Ball Tracker Analysis: It was observed from the experiment that the tracker separated the positive and negative samples inside the search window and hence any change in the orientation of the camera with respect to the landmark (red ball) did not induce drift while tracking. The tracking ability did not get affected by the introduction of a flesh-toned hand as it was tracking the positive samples from the red ball, which was constantly updated using the Bayes classifier. The number of trained positive samples was 50 and this value is low when compared to the other experiments as there were fewer data that needed to be extracted from the background. Green Bottle Tracker Analysis: The tracking capability of this algorithm was tested to a certain extreme while conducting this particular experiment because three similar landmarks were in the scene with the objective of tracking only one landmark/target. The prominent difference between the CAMshift and the ADCOM algorithms was clearly shown in this particular test as the ADCOM algorithm was able to track over the entire library of frames in spite of introducing similar targets in the environment. The key difference in performance lies in the fact that the ADCOM algorithm incorporates a structured classifier method that is partially dependent on the color intensity of the landmark, unlike the CAMshift algorithm, which is purely driven by the color probability density function. The search window size was expanded to 150 pixels around the target in order to expand the range of search for negative samples in the background. The ADCOM algorithm is more adaptable than the CAMshift algorithm for applications in which there are multiple, similar moving targets and only one of these is to be tracked continuously. 61 Cone Tracker Analysis: The cone tracker experiment was carried out under varied illumination as the light intensity in the environment continuously changed owing to various factors such as the intrusion of the sun’s rays and temporary yellow light set up in the environment. 
It is to be noted that a large set of templates was extracted as the number of unnecessary targets in the background image was high in number in comparison to the other experiments. The number of trained negative samples was 75 samples as the algorithm needs to be tuned due to the presence of trivial targets that could interrupt the tracking capability. The search window size was around 130 pixels, which was selected due to presence of a duplicate cone with almost the same features. The tracking shows that only the selected landmark was successfully tracked until a point where the target was completely lost in the frame due to the sudden change in the attitude of the aerial vehicle carrying the camera. Rock Tracker Analysis: The rock tracker experiment was more robust than the cone tracker experiment when testing the ADCOM algorithm as there were very few similar objects in the environment when compared to cone example. Hence lesser trained negative samples were required for this experiment. As the ground vehicle moves along its path, a larger search window is employed to ensure that some of the environmental occlusions are also considered among the 50 trained negative samples. In cases where there is vehicle motion with a static camera in environments where more prospective targets are seen in the image frame, one can use a larger search window size with a larger number of trained negative samples. When compared to the CAMshift algorithm, the ADCOM algorithm provides better performance in several ways including 62 the ability to adjust the tracking ability in the presence of camera vibration due to the motion of the ground vehicle over uneven terrain and the ability to recognize, analyze and track only the features pertaining to the target/landmark without compromising on the scaling effects. The flow of the ADCOM algorithm or the process can be pictorially represented as shown in Figure 3.9. Figure 3.9 ADCOM Implementation In brief, the algorithm observes the search window and as the initialization process occurs, it begins to recognize and identify the landmark. Once the Haar-Feature detector is employed the positive and negative samples are classified, extracted and placed in a higher multi-dimensional vector space. The compression process which works on the principle of random projection compresses(lossless compression) to a lower dimension using the random sparse matrix as descried in Chapter 2.Finally the classifier uses the compressed data to track the features in every frame as the classifier gets updated at every time step. 63 CHAPTER 4 Navigation Filters Navigation is defined as the process of determining the current motion parameters of a vehicle, such as acceleration, velocity and position of the center of mass, and planning solutions to traverse a path or a map based on the estimated parameters. The system that provides and supports this process is termed a navigation system. One of the most commonly used navigation systems, which has been developed for a wide range of vehicles, is the Inertial Navigation System (INS) [61]. This system integrates the accelerations and angular rates provided by an Inertial Measurement Unit (IMU) to compute the position, velocity, and attitude of the vehicle. Because IMU solutions for position and attitude are subject to drift errors that build over time, Global Positioning System (GPS) data are typically fused with IMU data in a Kalman filter within the INS in order to provide optimal estimates of the vehicle position, velocity and attitude. 
This thesis investigates navigation filter implementations in which vision-based measurements are fused with IMU data for cases in which GPS is unavailable.

4.1 Kalman Filter

The Kalman filter is an optimal recursive estimator in which the estimated state from the previous time step and the current measurement are required to compute an estimate of the current state. The filter is an implementation of the least-squares algorithm and infers the parameters or states of interest from uncertain observations. The recursive nature of the filter [62][63] allows it to process the data sequentially as they arrive, providing real-time estimation. The process of finding an optimal estimate from noisy data requires filtering the noise, which is accomplished in the Kalman filter under the assumption of Gaussian white process and measurement noise. The Kalman filter not only filters the data measurements, but also projects these measurements onto the state estimate. The Kalman filter is a special case of the Bayes filter in which the noise is assumed to be Gaussian and the current state is related linearly to the previous state. A set of mathematical equations implements the prediction and measurement update phases in such a manner that the filter minimizes the estimated error covariance when this linear-Gaussian condition is met. The Kalman filter combines the subsystems based on knowledge of the sensor measurement noise covariance and the process noise covariance. The Kalman filter provides a solution to the linear-quadratic problem [64], which is the problem of estimating the current state of a linear dynamic system perturbed by white Gaussian noise, using measurements that are linearly dependent on the states of the system but are also subject to Gaussian noise. The resulting predictor-corrector estimator is a recursive optimal solution with respect to any quadratic function of the estimation error based on the error covariance matrix [65][66]. A simplified equation representing the Kalman filter is given below; it shows that the Kalman filter finds the optimal averaging factor for each consecutive state and utilizes the previous state estimate to compute the current state estimate [67]:

X̂_k = K_k Z_k + (I − K_k) X̂_{k−1}        (4.1)

The fundamental purpose is to find the estimate of a state vector X. The state estimate at the current time step is given by X̂_k, where Z_k and K_k represent the measurement and Kalman gain respectively, and X̂_{k−1} is the previous state estimate [68]. A simplified step-by-step guide to the entire Kalman filter process is given below. The first step involves building a system model, which is used to propagate the state from one time step to the next. The model includes the state dynamic equation and a measurement equation that expresses the measurement as a linear function of the state:

x_k = A x_{k−1} + B u_k + w_{k−1}        (4.2)

z_k = H x_k + v_k        (4.3)

Equation (4.2) expresses x_k as a linear combination of its previous value plus a control signal u_k, subject to additive process noise. In Equation (4.3), the measurement is expressed as a linear function of the state, subject to additive measurement noise. The process and measurement (or sensor) noise are assumed to be Gaussian and statistically independent.
The filter requires estimates of the mean and covariance of the noise terms w_{k−1} and v_k, which represent the process and measurement noise respectively. The Kalman filter is frequently applied with success even in cases where the assumptions of independent Gaussian process and measurement noise do not strictly hold. An initial value for the state estimate must also be provided. The Kalman filter process is then implemented in two stages: the prediction (time update) phase and the measurement update (correction) phase. At every time step, both stages are executed to estimate the next state and update the filter parameters. The first stage involves projecting both the most recent state estimate and an estimate of the error covariance (from the previous time step) forward in time to compute a predicted (a priori) estimate of the state at the current time. The second stage involves correcting the predicted state estimate calculated in the first stage by incorporating the most recent output measurement to generate an updated (a posteriori) state estimate. The Q and R matrices represent the covariance of the process and output noise and are specified when initializing the filter. This process is iterated at every time step. The two-stage Kalman filter process is summarized in Figure 4.1.

Figure 4.1 Kalman Filter Process

In Figure 4.1, x̂_k^− is the "prior estimate," which represents the propagated state estimate before the measurement update correction. Similarly, P_k^− represents the prior error covariance. The measurement update equations are then used to compute x̂_k, which is the current state estimate at discrete time k. The error covariance P_k and the Kalman gain matrix are also updated, as they are required for the estimate at time k+1.
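A minimal MATLAB sketch of this two-stage recursion for the linear model of Eqs. (4.2) and (4.3) is given below; the model matrices, noise covariances and initial estimate are assumed to be supplied by the user.

```matlab
% Sketch: one step of the linear Kalman filter for the model of Eqs. (4.2)-(4.3).
% A, B, H : system, input and measurement matrices
% Q, R    : process and measurement noise covariances
% xhat, P : a posteriori estimate and covariance from the previous step
% u, z    : current control input and measurement

% --- Prediction (time update) ---
xpred = A*xhat + B*u;                      % a priori state estimate
Ppred = A*P*A' + Q;                        % a priori error covariance

% --- Correction (measurement update) ---
K    = Ppred*H' / (H*Ppred*H' + R);        % Kalman gain
xhat = xpred + K*(z - H*xpred);            % a posteriori state estimate
P    = (eye(length(xhat)) - K*H) * Ppred;  % a posteriori error covariance
```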
4.2 Extended Kalman Filter

The extended Kalman filter (EKF) provides an approximation of the optimal state estimate. The nonlinear system dynamics are approximated by a linearized version of the nonlinear system model around the previous state estimate. For this approximation to be valid, the linearization must be a good approximation of the nonlinear model in the vicinity of the state estimate. The EKF is therefore not an exactly optimal filter, but rather is implemented based on a set of approximations. A nonlinear discrete-time process with input and measurement noise is shown in Figure 4.2 below.

Figure 4.2 Nonlinear to Linear Model

The nonlinear representation can be written in standard state-space form as

x_k = f(x_{k−1}, u_k, k) + w_{k−1}        (4.4)

y_k = h(x_k, u_k, k)        (4.5)

ȳ_k = y_k + v_k        (4.6)

In these equations:
1. k denotes the current discrete time step (with k−1 representing the previous time step).
2. u_k is a vector of inputs, x_k is the state vector (which may be observable but is not measured), y_k is the vector of modeled process outputs, and ȳ_k is the vector of measured process outputs.
3. w_k and v_k are the process and measurement noise, respectively; they are assumed to be zero-mean Gaussian noise with covariances Q_k and R_k, respectively.
4. f(·) and h(·) are generic nonlinear functions relating the past state, current input, and current time to the next state and current output, respectively.

Given the inputs, the measured outputs and the necessary assumptions on the process and output noise, the purpose of the extended Kalman filter is to estimate the unmeasured states (assuming they are observable) and the actual process outputs. Observability is a measure of how well the unmeasured internal states of a system can be determined from knowledge of the external outputs, or system response. Similar to the Kalman filter, the extended Kalman filter uses a two-step predictor-corrector algorithm. The first step projects both the most recent state estimate and an estimate of the error covariance (from the previous time step) forward in time to compute a predicted (a priori) estimate of the states at the current time. The second step corrects the predicted state estimate calculated in the first step by incorporating the most recent process measurement, generating an updated (a posteriori) state estimate. However, due to the nonlinear nature of the process being estimated, the covariance prediction and update equations cannot directly utilize the nonlinear functions f(·) and h(·). To address this, Jacobian matrices are defined with respect to the states:

F_k = ∂f/∂x evaluated at (x̂_k, u_k, k)
H_k = ∂h/∂x evaluated at (x̂_k, u_k, k)

The EKF predictor and corrector steps are then given by:

Step 1: Prediction
x̂_k^− = f(x̂_{k−1}, u_k, k)
P_k^− = F_{k−1} P_{k−1} F_{k−1}^T + Q_k

Step 2: Correction/Update
K_k = P_k^− H_k^T (H_k P_k^− H_k^T + R_k)^{−1}
x̂_k = x̂_k^− + K_k (ȳ_k − h(x̂_k^−, u_k, k))
P_k = (I − K_k H_k) P_k^−

In these equations, P_k is an estimate of the error covariance and K_k is the Kalman gain matrix. After both the prediction and correction steps have been performed, x̂_k represents the current state estimate. Both x̂_k and P_k are stored and used in the predictor step at the next time step. However, unlike the standard Kalman filter, the extended Kalman filter is not guaranteed to be optimal in any sense. Furthermore, if the process model is inaccurate due to the use of the Jacobians, which represent a linearization of the model, the extended Kalman filter can diverge, leading to very poor estimates. In practice, when a reasonably accurate model is developed, the extended Kalman filter often yields reliable state estimation.
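A compact MATLAB sketch of this predictor-corrector step is shown below; the model functions f and h and their Jacobians are assumed to be provided as function handles for the particular system, and the sketch follows the equations above rather than any specific implementation from this work.

```matlab
% Sketch: one extended Kalman filter step for the model of Eqs. (4.4)-(4.6).
% f, h       : function handles for the state and measurement models
% Fjac, Hjac : function handles returning the Jacobians of f and h w.r.t. x
% xhat, P    : previous a posteriori estimate and covariance
% u, ybar    : current input and measured output;  Q, R : noise covariances

% --- Prediction ---
F     = Fjac(xhat, u);                  % linearize about the previous estimate
xpred = f(xhat, u);                     % a priori state estimate
Ppred = F*P*F' + Q;                     % a priori error covariance

% --- Correction / update ---
H    = Hjac(xpred, u);                  % linearize the measurement model
K    = Ppred*H' / (H*Ppred*H' + R);     % Kalman gain
xhat = xpred + K*(ybar - h(xpred, u));  % a posteriori state estimate
P    = (eye(length(xhat)) - K*H)*Ppred; % a posteriori error covariance
```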
4.2.1 Experimental EKF for Vision-Aided Navigation

A simple EKF implementation is provided and modified for the experimentation (simulation) phase of this project. One of the earliest applications of the EKF [68][69] was the problem of tracking flying objects from the ground. In this example of object tracking, at each time step the tracked object or target (either dynamic or static) has a measured range and bearing from the observer. In the typical case, the observer is the location of a radar device tracking the object, as seen in Figure 4.3.

Figure 4.3 Example of tracking an aerial vehicle from the ground [71]

The range and bearing of the tracked object are related to its displacements, which are given as the distances from the observer in the x and y directions. In this particular tracking problem, the estimated states are not only the x and y displacements of the target but also its linear velocities in the x and y directions. These four states are estimated from measurements of range and bearing that are subject to Gaussian white noise. The EKF is suitable for this type of problem because the displacements and linear velocities are nonlinearly related to the measured range and bearing. For experimental purposes, a 4-state EKF can be developed to determine the ability of a vehicle to track a target or landmark at a given position in the earth reference frame. This scenario is quite similar to that depicted in Figure 4.3, except that the target is now fixed and the observer is a vehicle in motion. The filter makes use of the four-state vector in Equation (4.7) and the body-centered accelerations in the x and y directions (as measured by accelerometers), given in Equation (4.8). Equation (4.10) represents the linearized form of these equations to be used in the Kalman filter for state estimation. The filter employs measurements of the bearing angle and the range, which can be expressed as functions of the state vector using the Cartesian-to-polar transformation; Equation (4.11) gives the range and bearing for a target at the location (x_T, y_T).

X = {x, v_x, y, v_y}^T        (4.7)

U = {a_x^B, a_y^B}^T        (4.8)

X_{k+1} = f_k(X_k, U_k)        (4.9)

The linearized form of Equation (4.9) is given as

X_{k+1} = A_k X_k + B_k U_k        (4.10)

Range = sqrt[(x − x_T)^2 + (y − y_T)^2]
Bearing (in rad) = arctan[(y − y_T)/(x − x_T)]        (4.11)

The 4-state filter can easily be augmented into a 6-state EKF for estimating position and velocity in 3D, which requires a modification in the form of a Cartesian-to-spherical transformation to include an elevation angle. Note that the Jacobians with respect to the state and measurement variables must be calculated during the state estimation process. A 9-state EKF can then be formulated with a state vector comprised of the inertial position, velocity and Euler angles. The control inputs in this formulation are the body-centered accelerations (as provided by accelerometers) and the Euler rates (as provided by rate gyros); the IMU is used to capture these inertial data for navigation purposes. It is to be noted that the filter incorporates image processing data as part of the estimation, since the intended application is GPS-denied locations. The position of the object tracked in the image (camera) frame has to be transformed in order to determine an accurate location with respect to the vehicle body frame. The range to the target, while not directly measured using vision, can be inferred from a priori knowledge of the size of the target and the number of pixels the target covers in the image frame.
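As an illustration of the measurement model in Eq. (4.11) and the measurement Jacobian needed by the EKF correction step, a minimal MATLAB sketch for the four-state case is shown below; the target location and sample state are placeholders, and atan2 is used in place of arctan to resolve the quadrant ambiguity.

```matlab
% Sketch: range/bearing measurement model (Eq. 4.11) and its Jacobian
% for the state vector X = [x; vx; y; vy] and a fixed target (xT, yT).
xT = 10; yT = 5;                                  % placeholder target location (m)

h = @(X) [ sqrt((X(1)-xT)^2 + (X(3)-yT)^2);       % range
           atan2(X(3)-yT, X(1)-xT) ];             % bearing (rad)

Hjac = @(X) [ (X(1)-xT)/sqrt((X(1)-xT)^2+(X(3)-yT)^2), 0, ...
              (X(3)-yT)/sqrt((X(1)-xT)^2+(X(3)-yT)^2), 0;
             -(X(3)-yT)/((X(1)-xT)^2+(X(3)-yT)^2),     0, ...
              (X(1)-xT)/((X(1)-xT)^2+(X(3)-yT)^2),     0 ];

% Example evaluation at a sample state (position 12 m, 9 m; velocities 1, 0.5 m/s)
X = [12; 1; 9; 0.5];
z = h(X);        % predicted range and bearing used to form the residual
H = Hjac(X);     % 2-by-4 Jacobian used in the EKF correction step
```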
The following steps should be carried out when implementing the 9-state extended Kalman filter for the purpose of vision-aided navigation.

Step 1: Determine the vector distance of the target object from the camera mounted on the vehicle and the vector distance between the camera and the center of gravity/origin of the vehicle. These vectors are represented as r_T^E = [x_T, y_T, z_T]^T and r_{C/B}^B = [x_{C/B}, y_{C/B}, z_{C/B}]^T, respectively. The inertial-to-body-frame direction cosine matrix (R_E^B), which is a function of the Euler angles, is estimated in the process. The body-to-camera rotation matrix R_B^C is fixed and known based on the camera mounting on the body of the vehicle.

Step 2: The states, control inputs and the estimated inertial data are defined as

X_{k+1} = X_k + V_{x,k} Δt
Y_{k+1} = Y_k + V_{y,k} Δt        (Inertial Position)
Z_{k+1} = Z_k + V_{z,k} Δt

V_{x,k+1} = V_{x,k} + a_x^E Δt
V_{y,k+1} = V_{y,k} + a_y^E Δt        (Inertial Velocity)
V_{z,k+1} = V_{z,k} + a_z^E Δt

φ_{k+1} = φ_k + (p + q sin φ_k tan θ_k + r cos φ_k tan θ_k) Δt
θ_{k+1} = θ_k + (q cos φ_k − r sin φ_k) Δt        (Euler Angles)
ψ_{k+1} = ψ_k + [(q sin φ_k + r cos φ_k) sec θ_k] Δt

[a_x^E, a_y^E, a_z^E − g]^T = R_B^E [a_x^B, a_y^B, a_z^B]^T        (Inertial Acceleration, with the body-frame accelerations provided by the IMU)

where

R_B^E = [cos ψ  −sin ψ  0;  sin ψ  cos ψ  0;  0  0  1] [cos θ  0  sin θ;  0  1  0;  −sin θ  0  cos θ] [1  0  0;  0  cos φ  −sin φ;  0  sin φ  cos φ]

Step 3: The extended Kalman filter process is then implemented. The covariances of the process and output Gaussian noise are determined, as well as the Jacobian matrices with respect to all the input states and process outputs; this establishes the linearized state-space equations mentioned above. The target position estimated from the image processing algorithm is correlated with the vector distance, or range, between the camera and the target, given by r_{T/C}^C = [x_{T/C}, y_{T/C}, z_{T/C}]^T.

Step 4: The prediction and correction phases are carried out as discussed above, based on the Riccati equation. The measurement equations are as follows:

Range = sqrt[(x − x_T)^2 + (y − y_T)^2 + (z − z_T)^2]
Azimuth (in rad) = arctan[(y − y_T)/(x − x_T)]
Zenith/Inclination angle (in rad) = arctan[ sqrt((x − x_T)^2 + (y − y_T)^2) / (z − z_T) ]

The EKF described above is proposed for use in a navigation system in which image processing algorithms are applied for the purpose of object tracking and autonomous navigation in the absence of the Global Positioning System. This process is depicted in Figure 4.4. In conclusion, for the purpose of vision-aided navigation with typically nonlinear dynamic systems, where both the system dynamics and the measurements are nonlinear, state estimation can be provided using a 9-state EKF implementation.

Figure 4.4 Depiction of EKF logic

The EKF has some limitations, which should be kept in mind when developing a model and performing state estimation:
1. The linear and quadratic transformations produce reliable results only when the error propagation can be well approximated by a linear or quadratic function. If this condition is not met, the performance of the filter can be extremely poor; in the worst case, the state estimates can diverge altogether.
2. The Jacobian matrices need to exist so that the transformation can be applied. There are cases for which these Jacobians cannot be defined; for example, a jump-linear system has discontinuous parameters.
3. In many cases, the calculation of the Jacobian and Hessian matrices (the latter in the case of a second-order Kalman filter implementation) can be a difficult process subject to computational errors in both model development and software implementation. These errors are often difficult to identify, as it is hard to determine which part of the system produces them simply by viewing the estimates, especially since the performance of the filter is uncertain.

CHAPTER 5 Extended Kalman Filter (EKF) Implementation

An unmanned ground vehicle (Corobot Classic CL4) was used as a common platform carrying sensors for data acquisition and storage. Figures 5.1a and 5.1b show the UGV used for collecting data for both image processing and navigation filter design. Some of the features of the Corobot are listed below.

Basic Robot Control:
• The on-board motor controller enables speed control of both sides of the motors from −100% to 100% of the total motor speed.
• The high-speed encoder board enables the measurement of high-resolution encoder ticks of the front left and front right wheels. If needed, it can also accommodate tick data from the back wheels.
• The on-board interface kit allows mounting an additional 5 analog sensors and has 8 digital I/O pins. These numbers can be easily increased.
CHAPTER 5 Extended Kalman Filter (EKF) Implementation

An Unmanned Ground Vehicle (Corobot Classic CL4) was used as a common platform carrying sensors for data acquisition and storage. Figures 5.1a and 5.1b show the UGV used for collecting data for both image processing and navigation filter design. Some of the features of the Corobot are listed below:

Basic Robot Control:
• The onboard motor controller enables speed control of the motors on both sides from -100% to 100% of their total speed.
• The high-speed encoder board measures high-resolution encoder ticks for the front left and front right wheels and, if needed, can also accommodate tick data from the back wheels.
• The onboard interface kit allows mounting an additional 5 analog sensors and provides 8 digital I/O pins; these numbers can easily be increased.

Software:
• The robot control software can drive the robot forward, backward, left and right at varying speeds.
• Event-driven notifications are provided on changes in encoder ticks or sensor values.
• Source code or an API is available in programming languages such as C++ and C#; other programming languages are possible.
• Available for both operating systems: Windows 7 and Ubuntu 12.04 (ROS compliant).

Figure 5.1 Corobot Classic CL4 [72]

The inertial measurements were recorded using a MicroStrain 3DM-GX3-45 module, which has an attitude and heading reference system and an extended Kalman filter running continuously to integrate the data measured by the gyros, accelerometers and magnetometer. Figure 5.2 shows the MicroStrain used for experimental purposes. Some of its significant features are listed below:
• precise position, velocity and attitude estimation
• high-speed sample rate and flexible data outputs
• high performance under vibration and high accelerations
• smallest, lightest industrial GPS/INS available
• simple integration supported by an SDK and a comprehensive API

Figure 5.2 MicroStrain 3DM-GX3-45 [73]

Vision Data Collection and Storage:
Two different vision sensors were used for the purpose of tracking a landmark. These were mounted on either the ground vehicle or an aerial vehicle in order to record video/image frames of the landmark in an environment.

GoPro HD Hero2
• The 11-megapixel 1/2.3-inch sensor captures video in resolutions from 720p HD (1280 x 720 pixels) at 30 or 60 frames per second to 1080p (1920 x 1080 pixels) at 30 frames per second in NTSC.
• The resolution can be reduced to WVGA (848 x 480 pixels) to record frame rates as high as 120 fps, suitable for super slow motion.
• The f/2.8 fixed-focus lens provides up to a 170-degree field of view.

Figure 5.3 (a) GoPro Hero2 (b) Microsoft LifeCam HD-3000 [74]

Microsoft LifeCam HD-3000
• A basic webcam for high-definition video recording and streaming at 720p HD with 16:9 widescreen resolution.
• Live video recording; easy to configure when connected to a processor.

Unmanned Ground Vehicle Tests
Figures 5.4 and 5.5 show the 3D path tracked by the unmanned ground vehicle and the real-time GPS latitude-longitude data transformed into the North-East-Down (NED) plane. The path was verified using camera imagery. For the scenarios considered in these filter experiments, the ground vehicle was driven in a 7 m x 7 m square area.

Figure 5.4 Path of the UGV
Figure 5.5 GPS latitude-longitude-altitude data converted to local NED coordinates

Figure 5.6 represents the Euler angles estimated using the angular rates measured by the gyros. Due to unleveled ground, small roll and pitch angles were observed as the vehicle was traveling along its path.

Figure 5.6 Euler angles
Figure 5.7 Accelerometer data from MicroStrain

Figure 5.7 represents the raw accelerometer data from the MicroStrain on the ground vehicle. These noisy data are used in the EKF for the estimation of the various states. The 1-g acceleration due to gravity is subtracted from the accelerometer data in the Z axis in all the simulations.

5.1 Position And Velocity Estimates Using 4-State EKF (With Accelerometer Data)

Figure 5.8 Simulink model with accelerometer data
Figure 5.9 Measurement plot

Since the ground vehicle essentially travels a 2D trajectory, the 4-state Kalman filter was implemented first. This filter makes use of the accelerometer data for the control input U described in Equation (4.10). Figure 5.8 shows the implementation of the 4-state EKF including the accelerometer data. A transformation is made from the body reference frame to the earth reference frame using the direction cosine matrix, which is derived from the Euler angles obtained by integrating the rate gyro data. Figure 5.9 represents the measurement data used inside the filter to calculate the measurement residual and to update the states. In this example, the target measurements, which would be obtained using image processing in practice, are simulated by selecting a target location and computing the relative target location from the GPS data, which represent the true position of the vehicle and are referred to as the ground truth.
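As an illustrative sketch of the prediction step of this 4-state filter (not the Simulink model itself), the snippet below uses the same first-order discretization as the propagation equations in this thesis; the function and variable names are hypothetical, and the accelerations are assumed to have already been rotated into the earth frame.

```python
import numpy as np

def predict_4state(x, u_e, P, Q, dt):
    """Prediction step of the 4-state EKF.

    x   : [x, vx, y, vy]  planar position and velocity estimate
    u_e : [ax_E, ay_E]    accelerometer data rotated into the earth frame
    P, Q: state covariance and process noise covariance
    """
    # First-order (Euler) discretization of the state equations
    A = np.array([[1.0, dt,  0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, dt],
                  [0.0, 0.0, 0.0, 1.0]])
    B = np.array([[0.0, 0.0],
                  [dt,  0.0],
                  [0.0, 0.0],
                  [0.0, dt]])
    x_pred = A @ x + B @ u_e      # linearized propagation, Equation (4.10)
    P_pred = A @ P @ A.T + Q      # covariance propagation
    return x_pred, P_pred
```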
POSITION ESTIMATES

Figure 5.10 X position
Figure 5.11 X position (zoom)
Figure 5.12 Y position

Figures 5.10-5.13 show the position estimates in the x-y directions compared to the x-y position measured by the GPS (in flat-earth coordinates), which is considered the ground truth. It can be observed later in this chapter that the performance of the filter is robust when the accelerometer data are included.

Figure 5.13 Y position (zoom)

VELOCITY ESTIMATES:

Figure 5.14 Velocity in X direction
Figure 5.15 Velocity in Y direction

Figures 5.14 and 5.15 provide the velocity estimates, where differentiated GPS data are used to derive the ground-truth velocity. The velocity estimates make use of the accelerometer measurements in the earth reference frame. The noisy accelerometer data are smoothed in the filter using a suitable Q matrix with higher process noise covariance.

5.2 Position And Velocity Estimates Using 4-State EKF (Without Accelerometer Data)

Figure 5.16 EKF without accelerometer data

In this implementation, the equation used to estimate the inertial velocity reduces to the form given in Equation (5.1) when the accelerometer data are excluded. In this case, the noisy acceleration data are omitted from the estimation process. Note that the velocity model then takes the form of a random walk model.

\begin{aligned} V_{x,k+1} &= V_{x,k} + a_x^E\,\Delta t \\ V_{y,k+1} &= V_{y,k} + a_y^E\,\Delta t \end{aligned} \quad (\text{Inertial velocity}), \quad \text{where } a_x^E = a_y^E = 0    (5.1)

In this case the linearized equation becomes:

X_{k+1} = A_k X_k    (5.2)

A comparison between the tracking ability of the EKF with and without accelerometer data is provided, and conclusions are drawn based on the best setup of the EKF.

POSITION ESTIMATES:

Figure 5.17 X position
Figure 5.18 Y position

Figures 5.17 and 5.18 compare the X and Y estimates to the ground truth. It can be observed that the EKF performs better along the X axis than along the Y axis when compared to the respective plots with accelerometer data included.

VELOCITY ESTIMATES:

Figure 5.19 Velocity in X direction
Figure 5.20 Velocity in Y direction

The velocity estimates shown in Figures 5.19 and 5.20 are less robust than the estimates shown earlier with the accelerometer data included.

5.3 4-State Extended Kalman Filter with Additive White Gaussian Noise

In Figures 5.21-5.24, the position and velocity estimates with three different levels of measurement noise are shown. In each case, random additive white Gaussian noise is added separately to the measurement data. In spite of external measurement noise, which can arise from many factors, the filter performs robustly and tracks the position and velocity states in all three cases. The three cases correspond to adding WGN to the range measurement with variances of 5 m, 12.5 m and 20 m. Similarly, WGN was added to the bearing measurements with variances of 2.5 deg, 5 deg and 8 deg.
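A minimal sketch of how such noise cases can be generated is given below (Python/NumPy). The function name is hypothetical, and the quoted noise levels are treated here as standard deviations purely for illustration; the mechanism is the same either way.

```python
import numpy as np

gen = np.random.default_rng(0)

def add_measurement_noise(range_true, bearing_true, sigma_range, sigma_bearing_deg):
    """Add white Gaussian noise to a simulated range/bearing measurement."""
    range_meas = range_true + gen.normal(0.0, sigma_range)
    bearing_meas = bearing_true + gen.normal(0.0, np.radians(sigma_bearing_deg))
    return range_meas, bearing_meas

# The three (range, bearing) noise cases considered in Section 5.3
noise_cases = [(5.0, 2.5), (12.5, 5.0), (20.0, 8.0)]
```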
Figure 5.21 X estimate with measurement noise
Figure 5.22 Y position with measurement noise
Figure 5.23 Velocity in X direction with measurement noise
Figure 5.24 Velocity in Y direction with measurement noise
Figure 5.25 Simulink model of EKF with measurement noise

5.4 Extended Kalman Filter for Tracking Multiple Landmarks/Targets:

The simulation discussed below depicts the navigation and tracking ability of the 4-state EKF when two or more landmarks are introduced in the inertial frame. The introduction of multiple landmarks in the environment for vehicle navigation aims at studying the versatility and robustness of the EKF, which can be experimentally tested in real-time scenarios involving multiple target tracking for reconnaissance or other applications. In this simulated example, the first target is located at the origin (0, 0) and the second target is located at a distance of 5 m from the first along the y-axis, at (0, 5). The logic involves a change in the calculation of the range and the bearing angle. Given two targets at (x_1, y_1) and (x_2, y_2), the range and bearing to these targets are given by:

\begin{aligned} \text{Range}_1 &= \sqrt{(x - x_1)^2 + (y - y_1)^2}, & \text{Bearing}_1 &= \tan^{-1}\!\left(\frac{y - y_1}{x - x_1}\right) \\ \text{Range}_2 &= \sqrt{(x - x_2)^2 + (y - y_2)^2}, & \text{Bearing}_2 &= \tan^{-1}\!\left(\frac{y - y_2}{x - x_2}\right) \end{aligned}    (5.3)

In Equation (5.3), (x_1, y_1) and (x_2, y_2) represent the coordinates of the two targets. The EKF tracking with respect to two targets, one target, and the ground truth (raw GPS data) is shown in the following simulations. In later stages, camera data in the inertial frame would replace the simulated target measurement data. Equation (5.3) can be implemented as a separate subsystem to perform the Cartesian-to-polar transformation based on the X-Y coordinates of the targets and the vehicle.

POSITION ESTIMATES:

Figure 5.26 X position
Figure 5.27 Y position

Figures 5.26 and 5.27 show the performance of the filter for tracking the X-Y position when two targets are present in the environment and when only one target is present.

VELOCITY ESTIMATES:

Figure 5.28 Velocity in X direction
Figure 5.29 Velocity in Y direction

Figures 5.28 and 5.29 represent the velocity estimates in the X-Y directions for two targets and for one target. The performance of the filter is more robust when tracking the single target. The tracking performance for two or more targets can be improved by placing them at a greater distance from each other. The effects of target placement and the number of targets on the estimation process merit further investigation.

Figure 5.30 Simulink model for multiple target tracking

Figure 5.30 represents the implementation of the multiple landmark/target tracking example. It is to be noted that the size of the measurement vector changes when two or more targets are tracked: the number of measurements becomes two times the number of targets, and the dimensions of the measurement Jacobian matrix change accordingly. The above example can be expanded to track more targets depending on the application in which it is used.
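The stacked measurement vector described above can be sketched as follows (Python/NumPy; the function name is hypothetical, and the landmark list mirrors the (0, 0) and (0, 5) example used in the simulation).

```python
import numpy as np

def multi_target_measurements(x, targets):
    """Stack range and bearing measurements for any number of fixed targets.

    x       : [x, vx, y, vy]  planar vehicle state
    targets : list of (xi, yi) landmark positions
    Returns a vector of length 2 * len(targets): [r1, b1, r2, b2, ...]
    """
    z = []
    for (xt, yt) in targets:
        dx, dy = x[0] - xt, x[2] - yt
        z.append(np.hypot(dx, dy))        # range to this landmark
        z.append(np.arctan2(dy, dx))      # bearing to this landmark
    return np.array(z)

# Example with the two landmarks used in the simulation: (0, 0) and (0, 5)
z = multi_target_measurements(np.array([3.0, 0.1, 4.0, 0.0]), [(0.0, 0.0), (0.0, 5.0)])
```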
5.5 6-State Extended Kalman Filter

A final set of experiments was conducted to estimate the 6 states corresponding to the position and velocity in the X-Y-Z directions. This 3D example uses the same data from the MicroStrain attached to the unmanned ground vehicle, now taking the Z direction into account as well. The equations corresponding to the 3D position and velocity are as follows:

\begin{aligned} X_{k+1} &= X_k + V_{x,k}\,\Delta t \\ Y_{k+1} &= Y_k + V_{y,k}\,\Delta t \\ Z_{k+1} &= Z_k + V_{z,k}\,\Delta t \end{aligned} \quad (\text{Inertial position})    (5.4)

\begin{aligned} V_{x,k+1} &= V_{x,k} + a_x^E\,\Delta t \\ V_{y,k+1} &= V_{y,k} + a_y^E\,\Delta t \\ V_{z,k+1} &= V_{z,k} + a_z^E\,\Delta t \end{aligned} \quad (\text{Inertial velocity})    (5.5)

The implementation of the above 6-state extended Kalman filter with the accelerometer data is shown in Figure 5.31.

Figure 5.31 Simulink model for 6-state EKF with accelerometer data

The 6-state EKF makes use of a Cartesian-to-spherical transformation to estimate the range, bearing angle and zenith angle [75]. The Cartesian-to-spherical block shown in Figure 5.32 performs this transformation.

Figure 5.32 Cartesian to spherical transformation

The state and measurement equations in the filter change accordingly for the six-state EKF. The measurement equations are given in Equation (5.6). The Jacobian matrix for the measurement data is obtained by linearizing these equations about the current state estimate.

\begin{aligned} \text{Range (in m)} &= \sqrt{x^2 + y^2 + z^2} \\ \text{Azimuth (in rad)} &= \tan^{-1}\!\left(\frac{y}{x}\right) \\ \text{Bearing (in rad)} &= \tan^{-1}\!\left(\frac{\sqrt{x^2 + y^2}}{z}\right) \end{aligned}    (5.6)

In spherical coordinates, the bearing angle corresponds to the inclination (or elevation) angle, while the azimuth angle is measured in the reference plane from the azimuth reference direction to the orthogonal projection of the line segment connecting the origin and the vehicle position. The spherical coordinate transformation is illustrated in Figure 5.33.

Figure 5.33 Spherical coordinate transformation

The measurement data after the spherical transformation, representing the range, bearing and azimuth angle, are shown in Figure 5.34. The size of the measurement data matrix varies based on the number of measurements.

Figure 5.34 Measurement data

POSITION ESTIMATES:

Figure 5.35 Position estimates for 3D

VELOCITY ESTIMATES:

Figure 5.36 Velocity estimates for 3D

Figures 5.35 and 5.36 show the 6-state estimates and the ground truth. The position estimates in the Y and Z axes exhibit a small offset, which does not affect the tracking ability in this case. The same approach of using multiple objects in the scene can also be incorporated in 3D.
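To close out the chapter, the Cartesian-to-spherical measurement transformation of Equation (5.6) can be sketched as below (Python/NumPy). The use of arctan2 for quadrant-safe angles is an implementation choice made here for illustration, not a detail taken from the thesis.

```python
import numpy as np

def cartesian_to_spherical(x, y, z):
    """Convert a relative Cartesian position to (range, azimuth, bearing), per Eq. (5.6)."""
    rng = np.sqrt(x**2 + y**2 + z**2)               # range in metres
    azimuth = np.arctan2(y, x)                      # angle in the x-y plane
    bearing = np.arctan2(np.sqrt(x**2 + y**2), z)   # inclination measured from the z axis
    return rng, azimuth, bearing
```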
CHAPTER 6 Conclusion and Future Recommendations

The technical objectives of this thesis were focused on the development of a vision-aided navigation strategy to compute vehicle navigation solutions for operations in GPS-denied environments. This strategy was based on the concept of applying image processing algorithms to detect and track known landmarks in the environment using an onboard camera. In this concept of operations, the landmarks consist of features in the environment for which the inertial location and scale are known based on previous measurements. If one or more of these landmarks can be successfully located in the image plane of an onboard camera, a measurement of the location of the landmark relative to the vehicle is obtained. Because the inertial locations of the landmarks in the environment are known, these camera measurements indirectly provide surrogate GPS measurements, which can then be fused with inertial measurement unit (IMU) data (i.e., accelerometer and gyro data) in an extended Kalman filter to provide a navigation solution. Note that, while the range to an object cannot typically be computed from a single camera image, the range can be inferred in cases where the scale of the object is known.

These technical objectives were investigated both theoretically and experimentally. The idea of implementing specific image/video processing algorithms for the purpose of vision-aided navigation was discussed. The specific algorithms considered were classified as data-independent and data-dependent processing algorithms. The CAMshift (continuously adaptive mean shift) algorithm is a more application-specific algorithm and served the purpose of studying the effect of camera scaling and vehicle orientation on landmark tracking objectives. The ADCOM (advanced compressive tracking) algorithm served the purpose of landmark tracking without focusing on the color probability histogram of the tracked landmark. By design, this algorithm is able to compensate for various types of camera motion, occlusions, background data interference and illumination effects. There is no need to apply separate functions to compensate for the above-mentioned effects, especially the effects due to scaling and orientation. The effectiveness of the two application-specific image processing algorithms was demonstrated with experimental results, which showed the tracking of the landmark and estimated the coordinates of the landmark in the camera/image reference frame. The results shown pertain to static landmarks in the environment with the vehicle in motion. The testing of the algorithms using video/image frames collected from high-fidelity cameras mounted on aerial and ground vehicles demonstrates the robustness of the algorithms. It is also to be noted that the CAMshift and ADCOM algorithms are data dependent and data independent, respectively, which broadens the scope of using these algorithms for newer reconnaissance tasks.

The comparative study of the two image processing algorithms identified three essential differences applicable to the fundamental operation of the algorithms. First, the CAMshift algorithm makes use of only the color probability function and tracks the centroid of the landmark, while the ADCOM algorithm makes use of features extracted from the landmark being observed. These features are not necessarily restricted to texture or color but are varied over a number of generic parameters. Second, while using the CAMshift algorithm, owing to the constant adaptation of the bounding box to orientation effects, there can be a loss of, or a delay in, the object being tracked, as the structure of the tracked object can become compromised with constant motion of the bounding box. In contrast, the ADCOM algorithm compresses the tracked features into a lower-dimensional space without losing any feature data necessary for tracking; hence the structure of the image data is not compromised. Third, the CAMshift algorithm uses kernel-based estimation as a classifier, whereas the ADCOM algorithm makes use of a naïve Bayes classifier.

A robust extended Kalman filter (EKF) was proposed for the process of vision-aided navigation in GPS-denied environments. The navigation filter provides a navigation solution for the vehicle states using IMU data and noisy target measurements, which would be obtained in practice from image processing. This filter was tested using simulated target measurements, where the coordinate transformation inside the simulation enables the calculation of the range between the vehicle and the landmark along with the angle of inclination to the landmark, represented in terms of the bearing angle.
The extended Kalman filter was tested for varying levels of measurement noise, specified in terms of a white Gaussian noise model. This modeled the effect of external noise during real-time implementation on a vehicle and showed how the filter robustly compensates for the noise disturbances. Data collected from an unmanned ground vehicle provided test data for the Kalman filter, as the vehicle's path and Euler angles were calculated for filter processing. A Cartesian-to-spherical coordinate transformation was used to simulate the range and angles to the target in this case. The essential feature of the 4-state EKF was the ability to estimate the planar position and velocity of the vehicle using the accelerometer data as control inputs and measurements of multiple landmarks in the environment as outputs. The use of such filter logic is well suited for applications where multiple static objects are tracked by the same vehicle for reconnaissance purposes in the defense sector. A final set of simulations involved the use of a 6-state EKF to estimate the 3D motion of the vehicle while tracking an object at a specific location in the environment. This proved slightly less robust than the 4-state filter, as the addition of data in the Z direction introduced high noise levels, resulting in drift in the state estimates. This result might be explained by the fact that the test data were obtained from a ground vehicle for which the motion in the Z direction was minimal.

6.1 Future Recommendations

The two core processes investigated in this thesis, image processing and navigation filtering, need to be integrated into one vision-aided navigation system. Several components are required to form such an integrated system. One required component is a camera calibration process to derive the transformation of the pixel points in the image reference frame to the inertial (earth) reference frame. Some earlier works have conducted extensive research on this transformation in simulated environments but did not test it with high-end, application-specific image processing algorithms coupled with navigation filters [76][77][78]. The pixel points in the inertial frame can be used as a surrogate for GPS measurements when encountering GPS-denied terrain, or even in cases where GPS is accessible. The same data can then be used to navigate to given target locations calculated using the above camera frame transformation. The application-specific algorithms coupled with a 6-state or a simple 4-state EKF could be applied in real-time vision-aided navigation systems such as the ambulance drone (Flying Defibrillator), door-to-door product delivery, remote sensing, IPMC (Ionic Polymer Metal Composite) powered robotic fish, and agricultural surveillance coupled with a bio-sensor mechanism in the commercial sector.

The use of the above experimentally verified techniques in the defense sector extends to reconnaissance applications, tracking multiple moving targets, and augmenting LIDAR imaging to develop 3D maps of inaccessible locations. The need for newer application-specific algorithms has raised awareness amongst researchers to investigate methods that are suitable, compatible and robust for use in real-time applications. The intriguing task of providing robust navigation solutions for an unmanned vehicle in environments where GPS is inaccessible, using landmark feature identification techniques, lays the foundation for exploring the multifarious domain of UAV navigation in remote, inaccessible and undiscovered terrains.
Henceforth, this continued effort pushes the boundary of such advanced and in-depth application-driven research to outer space objectives. 105 BIBLIOGRAPHY [1] Valavanis, K., ‘Advances in unmanned aerial vehicles: state of the art and the road to autonomy’, Intelligent Systems, Control and Automation: Science and Engineering 33, 2007. [2] Bejar, M., Ollero, A., Cuesta, F., ‘Modeling and control of autonomous helicopters, advances in control theory and applications’,Lect. Notes Control Inf. Sci. 353,2007. [3] Lee, D., Jin Kim, H., Sastry, S., ‘Feedback linearization vs. adaptive sliding mode control for a quadrotor helicopter’,Int. J. Control Autom. Syst. 7(3), 419–428,2009. [4] Bernard, M., Kondak, K., Hommel, G., ‘Framework for development and test of embedded flight control software for autonomous small size helicopters’,Embedded Systems – Modeling, Technology, and Applications, pp. 159–168,2006. [5] Monteriù, A., Asthana, P., Valavanis, K., Longhi, S., ‘Model-based sensor fault detection and isolation system for unmanned ground vehicles: theoretical aspects (partiandii)’,Proceedings of the IEEE International Conference on Robotics and Automation (ICRA),2007. [6] Conte, G., Doherty, P., ‘An integrated UAV navigation system based on aerial image matching’, IEEE Aerospace Conference, pp. 1–10, 2008. [7] Luo P., Pei, H., ‘An autonomous helicopter with vision based navigation’, IEEE International Conference on Control and Automation, 2007. 106 [8] R. He, S. Prentice, and N. Roy, ‘Planning in Information Space for a Quadrotor Helicopter in a GPS-denied Environment’, IEEE Intl. Conf. Robotics and Automation, Pasadena, California, May 2008 [9] He, Z., Iyer, R.V., Chandler, P.R., ‘Vision-based UAV flight control and obstacle avoidance’, American Control Conference, 2006. [10] Mondragon, I.F., Campoy, P., Correa, J.F., Mejias, L.: ‘Visual model feature tracking for UAV control’, IEEE International Symposium on Intelligent Signal Processing, WISP, 2007. [11] Campoy, P., Correa, J.F., Mondragón, I., Martínez, C., Olivares, M., Mejías, L., Artieda, J.: ‘Computer vision onboard UAVs for civilian tasks’. J. Intell. Robot. Syst. 54(1–3), 105–135 ,2009. [12] Chowdhary,Girish., Johnson Eric N.,Magree Daniel., Wu Allen., Shein Andy.,: ‘GPS-Denied Indoor and Outdoor Monocular Vision Aided Navigation and Control of Unmanned Aircraft’,January,2013. [13] Caballero, F., Merino, L., Ferruz, J., Ollero, A., ‘Vision-based odometry and SLAM for medium and high altitude flying UAVs’. J. Intell. Robot. Syst. 54(1–3), 137– 161,2009. [14] Williams, Paul., Crump, Michael., ‘All-Source Navigation For Enhancing UAV Operations in GPS-Denied Environments’,28th International Congress of Aeronautical Sciences,2012. 107 [15] Smith, R.C., On the representation of spatial uncertainty. Int. J. Robotics Research, 5(4), pp.56-68,1987. [16] Merz,T.,Duranti,S.,Conte,G. ‘Autonomous landing of an unmanned helicopter based on vision and inertial sensing’,Experimental Robotics IX, Springer Tracts in Advanced Robotics, vol. 21, pp. 343–352,2006. [17] Meingast, M., Geyer, C., Sastry, S., ‘Vision based terrain recovery for landing unmanned aerial vehicles’,43rd IEEE Conference on Decision and Control (CDC),vol.2,pp.1670–1675,2004. [18] Shakernia, O., Vidal, R., Sharp, C.S., Ma, Y., Sastry, S.S.: Multiple view motion estimation and control for landing an unmanned aerial vehicle, Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pp. 
2793–2798 (2002) [19] Saripalli, S., Montgomery, J., Sukhatme, G., ‘Visually-guided landing of an unmanned aerial vehicle’,IEEE Trans. Robot. Autom. 19(3), 371–381,2003. [20] Saripalli, S., Sukhatme, G.S., ‘Landing a helicopter on a moving target’, IEEE International Conference on Robotics and Automation (ICRA), pp. 2030–2035,2007. [21] Garcia-Padro, P.J., Sukhatme, G.S., Montgomery, J.F., ‘Towards vision-based safe landing for an autonomous helicopter’, Robotics and Autonomous Systems, vol. 38, no. 1, pp. 19–29(11). Elsevier, 2002. [22] Johnson, A., Montgomery, J., Matthies, L., ‘Vision guided landing of an autonomous helicopter in hazardous terrain’,Proceedings of the IEEE International Conference on Robotics and Automation,2005. 108 [23] Templeton, T., Shim, D.H., Geyer, C., Sastry, S., ‘Autonomous vision-based landing and terrain mapping using am MPC-controlled unmanned rotorcraft’, Proceedings of the IEEE International Conference on Robotics and Automation, pp. 1349–1356, 2007. [24] W. Kropatsch, 'History of Computer Vision A Personal Perspective', Institute of Computer Aided Automation 183/2 Vienna University of Technology Pattern Recognition and Image Processing Group, 2008. [25] Z. Wang and F. Yang, 'Object Tracking Algorithm Based on Camshift and Grey Prediction Model in Occlusions', in the 2nd International Conference on Computer Application and System Modeling (2012), France, 2012. [26] Fukunaga, K. (1990), ‘Introduction to Statistical Pattern Recognition’, 2nd Edition. Academic Press, New York, 1990. [27] D. Comaniciu, V. Ramesh, and P. Meer, ‘Real-time tracking of non-rigid objects using mean shift’, IEEE Proc. on Computer Vision and Pattern Recognition on, pages673–678, 2000. [28] R. Collins, 'Mean-shift Tracking', Computer Science Engineering,CSE598G, Penn State University, 2006. [29] L. Jae-Yeong and W. Yu, 'Visual tracking by partition-based histogram backprojection and maximum support criteria', in Robotics and Biomimetics (ROBIO), 2011 IEEE International Conference, Karon Beach, Phuket, 2011, pp. 2860 - 2865. 109 [30] Schugk, D., Kummert, A., and Nunn, C., ‘Adaptation of the Mean Shift Tracking Algorithm to Monochrome Vision Systems for Pedestrian Tracking Based on HoGFeatures’, SAE Technical Paper 2014-01-0170, 2014. [31] Bradski, G. R. (1998), ‘Computer vision face tracking for use in a perceptual user interface’, Intel Technology Journal, 2nd Quarter, 1998. [32] Freeman, W. T., Tanaka, K., Ohta, J. and Kyuma, ‘Computer Vision for Computer Games’,Int. Conf. On Automatic Face and Gesture Recognition, pp.100-105, 1996. [33] P. Fieguth and D. Terzopoulos, ‘Color-based tracking of heads and other mobile objects at video frame rates,’In Proc. Of IEEE CVPR, pp. 21-27, 1997. [34] Hu J., Juan C., Wang J.: ‘A spatial-color mean-shift object tracking algorithm with scale and orientation estimation’, Pattern Recognition Letters, 2008, 29, (16), pp. 21652173. [35] M. Hunke and A. Waibel, ‘Face locating and tracking for human-computer interaction,’ Proc. Of the 2gth Asilomar Conf. On Signals, Sys. and Comp., pp. 1277128 I, 1994 [36] K. Sobottka and I. Pitas, ‘Segmentation and tracking of faces in color images,’ Proc. Of the Second Intl. Conf. On Auto. Face and Gesture Recognition, pp. 236-241, 1996 [37] M. Swain and D. Ballard, ‘Color indexing,’ Intl. J. of Computer Vision, 7( I) pp. 1 132, 1991. [38] Horn R. A., Johnson C. R., “Topics in Matrix Analysis”, Cambridge University Press, U.K., 1991. 110 [39] Volkan Cevher, Aswin Sankaranarayanan, Marco F. 
Duarte1, Dikpal Reddy, Richard G. Baraniuk, and Rama Chellappa, ‘Compressive sensing for Background subtraction’, Rice University,ECE,Houston TX 77005,2008. [40] Aggarwal, A., Biswas, S., Singh, S., Sural, S., Majumdar, A.K, ‘Object Tracking Using Background Subtraction and Motion Estimation in MPEG Videos’. In: ACCV, Springer (2006) 121–130 [41]Cao, Y., Lei, Z., Huang, X., Zhang, Z. and Zhong, T, ‘A Vehicle Detection Algorithm Based on Compressive Sensing and Background Subtraction”, AASRI Procedia, 1, pp.480-485,2012. [42] Fan, B., Du, Y. and Cong, Y. (2014). ‘Online Learning Discriminative Dictionary with Label Information for Robust Object Tracking’, Abstract and Applied Analysis, 2014, pp.1-12. [43]Wu, Y., Jia, N. and Sun, J., ‘Real-time multi-scale tracking based on compressive sensing’,2014. [44] Garrett Warnell and Rama Chellappa, ‘Compressive Sensing in Visual Tracking, Recent Developments in Video Surveillance’, Dr. Hazem El-Alfy (Ed.), ISBN: 978-95351-0468-1, 2012. [45] K. Zhang, L. Zhang and M. Yang, 'Real-Time Compressive Tracking', ECCV 2012, Part III, LNCS 7574, pp. pp. 866–879, 2012. [46] P. Frankl and H. Maehara, ‘The Johnson-Lindenstrauss lemma and the sphericity of some graphs’, Journal of Combinatorial Theory A, 44(3):355–362, 1987. 111 [47] P. Li, T. Hastie and k. Church, ‘Very Sparse random projections’, in Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, New York, 2006. [48] E. Candès, 'The restricted isometry property and its implications for compressed sensing',Comptes Rendus Mathematique, vol. 346, no. 9-10, pp. 589-592, 2008. [49] Li, H., Shen, C., Shi, Q., ‘Real-time visual tracking using compressive sensing’, In: CVPR, pp. 1305–1312, 2011. [50] Wright, J., Yang, A., Ganesh, A., Sastry, S., Ma, Y, ‘Robust face recognition via sparse representation’,PAMI 31, 210–227,2009. [51] Liu, L., Fieguth, P., ‘Texture classification from random features’, PAMI 34, 574– 586, 2012. [52] Dimitris Achlioptas, ‘Database-friendly random projections: Johnson-Lindenstrauss with binary coins’, Journal of Computer and System Sciences, 66(4):671–687, 2003. [53] Ella Bingham and Heikki Mannila. Random projection in dimensionality reduction: Applications to image and text data. In Proc. of KDD, pages 245–250, San Francisco, CA, 2001. [54] K. Wu and X. Guo, 'Compressive Sensing with Sparse Measurement Matrices', in Vehicular Technology Conference, China, pp. pp 1-5, 2011. [55] A. Panning, A. Al-Hamadi, R. Niese and B. Michaelis, 'Facial expression recognition based on Haar-like feature detection', Pattern Recognition and Image Analysis, vol. 18, no. 3, pp. 447-452, 2008. 112 [56] R. Lienhart and J. Maydt, 'An Extended Set of Haar-like Features for Rapid Object Detection', in image Processing Proceedings International Conference, USA, pp. I-900 I-903 vol.1, 2002. [57] Yang, J., Bouzerdoum, A., Tivive, F. & Phung, S., ‘Dimensionality reduction using compressed sensing and its application to a large-scale visual recognition task’, WCCI 2010 IEEE World Congress on Computational Intelligence,pp. 1607-1614,2010. [58] H. Zhang, 'EXPLORING CONDITIONS FOR THE OPTIMALITY OF NAÏVE BAYES', International Journal of Pattern Recognition and Artificial Intelligence, vol. 19, no. 02, pp. 183-198, 2005. [59] J. Xue and D. Titterington, 'Comment on “On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes”', Neural Processing Letters, vol. 28, no. 3, pp. 169-187, 2008. [60] P. Diaconis and D. 
Freedman, 'Asymptotics of Graphical Projection Pursuit', The Annals of Statistics, vol. 12, no. 3, pp. 793-815, 1984. [61] V. Indelman, P. Gurfil, E. Rivlin and H. Rotstein, 'Real-Time Vision-Aided Localization and Navigation Based on Three-View Geometry', IEEE Trans. Aerosp. Electron. Syst., vol. 48, no. 3, pp. 2239-2259, 2012. [62] Greg Welch, Gary Bishop, ‘An Introduction to the Kalman Filter’, University of North Carolina at Chapel Hill Department of Computer Science, 2001. [63] M.S.Grewal, A.P. Andrews, ‘Kalman Filtering - Theory and Practice Using MATLAB’, Wiley, 2001. 113 [64] Maybeck, P. S. ‘The Kalman filter: An introduction to concepts’, Autonomous Robot Vehicles. I. J. Cox and G. T. Wilfong. New York, Springer-Verlag: 194-204, 1990. [65] Gary Bishop and Greg Welch, ‘An Introduction to the Kalman Filter’, University of North Carolina SIGGRAPH 2001 course notes. ACM Inc., North Carolina, 2001. [66] V. Sazdovski, T. Kolemishevska-Gugulovska and M. Stankovski, 'Kalman Filter Implementation For Unmanned Aerial Vehicles Navigation Developed Within A Graduate Course', Institute of ASE at Faculty of EE, St. Cyril and Methodius University, MK-1000, Skopje, Republic of Macedonia, 2005. [67] John Spletzer, ‘The Discrete Kalman Filter’. Lecture notes CSC398/498. Lehigh University. Bethlehem, PA, USA. March 2005. [68] Rudolph van der Merwe, Alex T. Nelson, Eric Wan, ‘An Introduction to Kalman Filtering.’ OGI School of Science & Engineering lecture. Oregon Health & Sciences University. November 2004. [69]Simon Julier and Jeffrey Uhlmann, ‘A new extension of the kalman filter to nonlinear systems’,Int. Symp. Aerospace/Defense Sensing, Simul. And Controls, Orlando, FL, 1997. [70] N.J. Gordon, D.J. Salmond, and A.F.M. Smith., ‘A novel approach to nonlinear/nonGaussian Bayesian state estimation’,IEEE Proceedings on Radar and Signal Processing, volume 140, pages 107-113, 1993. 114 [71] G. Consulting, 'Using an Extended Kalman Filter for Object Tracking in Simulink',Goddardconsulting.ca,2014.[Online].Available:http://www.goddardconsulting. ca/simulink-extended-kalman-filter-tracking.html#Figure2. [72] Robotics.coroware.com, 'CoroBot Classic four wheel drive (CL4)', 2014. [Online]. Available: http://robotics.coroware.com/Template2.aspx?WebPage=CL4. [73] Microstrain.com, '3DM-GX3®-45 -- Product no longer stocked – limited availability', 2011. [Online]. Available: http://www.microstrain.com/inertial/3dm-gx3-45. [74] Shop.gopro.com, 'GoPro Official Website: The World's Most Versatile Camera', 2014. [Online]. Available: http://shop.gopro.com/. [75] Zwillinger, D. (Ed.). ‘Spherical Coordinates in Space.’ §4.9.3 in CRC Standard Mathematical Tables and Formulae. Boca Raton, FL: CRC Press, pp. 297-298, 1995. [76] A. E. Johnson and L. H. Matthies, “Precise image-based motion Estimation for autonomous small body exploration,” in Proc.5th Int’l Symp. On Artificial Intelligence, Robotics and Automation in Space, Noordwijk, The Netherlands, June 1-3 1999, pp. 627–634 [77] J. Lobo and J. Dias, 'Relative Pose Calibration Between Visual and Inertial Sensors', The International Journal of Robotics Research, vol. 26, no. 6, pp. 561-575, 2007. [78] Kelly and G. Sukhatme, 'Fast relative pose calibration for visual and inertial sensors’, Springer Berlin Heidelberg, vol. 54, pp. 515-524, 2009. 115 116