

Pedestrian Detection using Labeled Depth Data

Kwang Hee Won, Sisay Gurmu, Soon Ki Jung
School of Computer Science and Engineering, Kyungpook National University, Daegu, South Korea
khwon@vr.knu.ac.kr, sisay@vr.knu.ac.kr, skjung@knu.ac.kr

Abstract—This paper presents a pedestrian detection algorithm for labeled depth data obtained from road scenes. Our approach computes feature responses for the head and legs of the human body using depth and label data. It then detects pedestrians by removing edges from, and partitioning, a bipartite graph of head and leg response blobs using prior knowledge about the human body. In the experiments, the proposed algorithm produces better results than a method that uses the histogram of gradients feature and the ground plane of road scenes.

Keywords—pedestrian detection; depth data; road scene

I. INTRODUCTION
Depth data has been used to analyze the geometry of a captured scene. The 'Kinect' sensor is a popular depth sensor which produces high-quality depth data for indoor environments, and many researchers have developed human detection algorithms that utilize it for human-computer interaction in various applications [1, 2]. For outdoor environments, especially for intelligent vehicle systems, it is necessary to acquire depth data to analyze the 3D geometry of the road and to detect pedestrians for driver assistance applications. This is challenging for several reasons.

First, the field of view is large and the range of depth is from 0 to infinity. Indoor sensors such as the 'Kinect' may therefore be unsuitable, because they cannot cover the required depth range. This also makes human detection difficult, because humans in a road scene are relatively small compared to those in indoor depth data. It is thus not easy to recognize the body parts which could cooperatively be used to determine the existence of a human; in addition, pedestrians' hands may be inside their pockets, and they may be carrying bags.

Next, various illumination conditions interfere with the acquisition of depth data from stereo depth sensors. Because they rely on a stereo correspondence search to obtain depth data, varying lighting conditions can degrade the accuracy of the depth data. The LIDAR sensor, on the other hand, does not have such problems and is accurate with sufficient depth resolution; however, it is too expensive for some applications. Meanwhile, the stereo depth sensor is still widely used for intelligent vehicle systems, and recently more robust dense stereo matching algorithms have been developed and successfully applied to the field of vehicle intelligence [3, 4].

This paper presents a pedestrian detection algorithm which utilizes labeled depth data from a stereo vision sensor. In a road scene, we assume that there are three types of labels: 'ground', 'upstanding objects', and 'background'. The label of each pixel is determined from its disparity values (the slope of the ground plane for the 'ground' label) during the stereo matching process introduced in [3]. Those labels and depth values are then used to obtain feature responses for the heads and legs of pedestrians; for the aforementioned reasons, the algorithm only searches for heads and legs. A bipartite graph is then generated from the two categories of responses, and finally the algorithm detects pedestrians by partitioning it with prior knowledge about the scene.

This paper is organized as follows. In Sec. II, we introduce related work on object detection from depth data. In Sec. III, we introduce a stereo matching/labeling method which produces depth data with labels. In Sec. IV, we define features for the two body parts, head and leg, and we introduce the process of detecting pedestrians from the feature responses. In Sec. V, we show experimental results, and we conclude the paper and present future work in Sec. VI.

II. RELATED WORKS
The Histogram of Gradients (HoG), the representative 2D-feature-based human detection approach, is widely used for intelligent vehicle systems [5]. However, depth data has several advantages over 2D intensity images for the human detection task. Depth data provides the 3D structure of a given scene, which is useful for removing false alarms and missed detections that can be dangerous to drivers in safety assistance systems. For example, the authors of [1] implemented a 'detect all' style obstacle detection approach using a stereo disparity map.

There are several human detection approaches that exploit the range data of the target scene [6, 7]. In [8], depth data is used to set regions of interest (ROIs) by constraining geometric features such as the heights of humans: the authors assumed the largest part of the scene to be the ground plane, calculated the height of each object relative to it, and used 2D features to classify the ROIs into targets and negatives. Part-based human detection approaches are popular for indoor environments [1, 2]: Plagemann et al. detect the head, hands, and feet, while Shotton et al. used more body parts to recognize humans and their posture. However, pedestrians in road scenes are smaller than in the indoor case, and some body parts are not easily detected, for the reasons given above.

In this paper, we use a dense stereo disparity map and label data as the input to our algorithm. In contrast to previous research, we additionally utilize the ground-plane, background, and object labels. We define head and leg features which are easily observable in outdoor scenes, and we detect pedestrians by partitioning a head-leg graph generated from the feature responses.

III. STEREO MATCHING AND LABELING
Recently, a combined stereo matching and labeling algorithm for obstacle detection in road scenes was proposed by Won and Jung [3]. The authors assumed that a road scene can be labeled into three categories: 'ground', 'objects', and 'background'.
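For illustration, such a three-way labeling could be sketched as follows. This is a toy stand-in, not the matching-cost formulation of [3]: it assumes a hypothetical precomputed ground-plane model giving the expected ground disparity per image row, and classifies each pixel by comparing its disparity against that model.

```python
import numpy as np

GROUND, OBJECT, BACKGROUND = 0, 1, 2

def label_pixels(disparity, ground_row_disparity, tol=1.0, background_max=0.5):
    """Toy per-pixel labeling of a disparity map.

    disparity: (H, W) array of disparity values.
    ground_row_disparity: (H,) array, the disparity a planar ground
        surface would produce at each image row (hypothetical model).
    Pixels close to the ground model are 'ground', near-zero disparity
    (very far) pixels are 'background', everything else is 'object'.
    """
    h, w = disparity.shape
    labels = np.full((h, w), OBJECT, dtype=np.uint8)
    expected = np.repeat(ground_row_disparity[:, None], w, axis=1)
    labels[np.abs(disparity - expected) < tol] = GROUND   # lies on the ground plane
    labels[disparity < background_max] = BACKGROUND       # (near-)zero disparity: far away
    return labels
```

The actual method of [3] embeds this decision into homography-based matching costs during stereo matching rather than thresholding a finished disparity map.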
We make use of the label information as well as the depth data to collect feature responses for detecting candidate heads and legs of pedestrians. The algorithm of [3] uses a series of predefined homography-based matching costs for each label, representing the slope of the ground plane and the distances of the obstacles and the background. As a result, it produces the labels of the road scene together with high-quality dense depth data. Examples of the labels and disparity values, and the corresponding depth data, are shown in Fig. 1.

Fig. 1. (from left) Input road scene, the labels, and the disparity map.

We produce the depth map from the homogeneous disparity map [4], which contains reliable disparity values supported by a small surrounding region of similar disparity values.

IV. FEATURE DEFINITIONS FOR BODY PARTS
In this paper, we define features for the 'head' and the 'leg' using depth data and labels. For each pixel in 2D coordinates, we select four offsets: up, down, left, and right. The scale of the offsets is defined using k0, which represents a number of pixels at a certain depth d0, for example 8 meters in the experiments of Sec. V. For a given 2D coordinate p, the scale of the feature is adaptively determined by the depth value d:

k = k0 d0 / d.   (1)

A. 'Head' and 'Leg' Features
In Fig. 2, the four offsets of the 'leg' feature are represented in pixel coordinates and in 3D coordinates. First, the label of p has to be 'object'. Then, the label of pb is 'ground', because pedestrians stand on the ground. We expect pu to be observed on a part of the legs or the lower body, while pl and pr can be observed on the ground or on the surface of another object; thus, the labels of pl and pr can be both 'ground' and 'object'.

Fig. 2. The 'leg' feature and its 3D representation.

The feature response of the leg feature, rL, at pixel p is computed by

rL(p) = δg(pb) · dot(pu − p, pb × pr) · dot(pu − p, pl × pb),   (2)

where δg(·) is a delta function which returns 1 for the 'ground' label and 0 otherwise, dot(·) is the inner product of two 3-vectors, and pb × pr and pl × pb are the cross products of pb and pr, and of pl and pb, respectively (the 'red' arrows in Fig. 2). To detect slanted legs, we also compute responses for rotated versions of the feature of Fig. 2.
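The adaptive feature scale and the leg response can be sketched in a few lines. This is a simplified stand-in rather than the full implementation: the five 3D points are assumed to have already been back-projected from the depth map at pixel p and its four scaled offsets, and the label test on pb is passed in as a boolean.

```python
import numpy as np

def feature_scale(k0, d0, d):
    """Pixel offset k shrinks linearly with depth d: k = k0 * d0 / d."""
    return k0 * d0 / d

def leg_response(p3d, pu3d, pl3d, pr3d, pb3d, pb_is_ground):
    """Sketch of the leg response
        r_L = delta_g(pb) * dot(pu - p, pb x pr) * dot(pu - p, pl x pb).

    All arguments except the last are 3D points (numpy arrays);
    `pb_is_ground` plays the role of the delta function on pb's label.
    """
    if not pb_is_ground:
        return 0.0
    up = pu3d - p3d  # vector from p toward the expected leg/lower-body point
    return float(np.dot(up, np.cross(pb3d, pr3d)) *
                 np.dot(up, np.cross(pl3d, pb3d)))
```

In the paper the responses are additionally computed for rotated offset patterns (slanted legs), for several k0, and averaged; that bookkeeping is omitted here.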
The 'head' feature is defined using the same number and positions of supporting points as the leg feature. This time, however, p and pb lie on the 'object' label, and pu, pl, and pr can carry either the 'object' or the 'background' label. If one of pl and pr is located at a depth hole, we apply a small penalty to reflect the uncertainty. The response of the 'head' feature, rH, is then defined as

rH(p) = exp(−|Z(p) − Z(pb)| / σ) Δb(pu, p) Δb(pr, p) Δb(pl, p),   (3)

where Z(·) is the depth value, σ scales the difference of depth values, and Δb takes the maximum of the two delta functions δb(·) and δt(·,·) defined by

δb(p) = 1 if the label of p is 'background', and 0 otherwise,   (4)

δt(pu, p) = 1 if |Z(pu) − Z(p)| > T, and 0 otherwise,   (5)

where T is a threshold representing the maximum possible size of a head in the depth direction.

For each pixel p and its four offsets, we check the 4-neighbor pixels of each coordinate to reduce the influence of noise and depth holes. The feature responses are computed for several values of k0 and averaged. Then, we threshold each response map and obtain candidate blobs of heads and legs.

B. Pedestrian Detection via Bipartite Graph Partitioning
From the thresholded feature response maps of heads and legs, the algorithm builds a bipartite graph which connects each head to every leg. Using prior knowledge about humans, the algorithm removes unpromising edges and partitions the graph to detect pedestrians.
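The graph construction and partitioning step can be illustrated with a small sketch. It is a toy version under simplifying assumptions: blobs are reduced to 3D centroids, a single Euclidean-distance threshold stands in for the per-axis and height checks, and the component-label tie-breaking is omitted.

```python
import numpy as np

def partition_head_leg_graph(heads, legs, max_dist):
    """Toy head-leg bipartite partitioning.

    heads, legs: lists of 3D blob centroids (numpy arrays).
    Edges longer than `max_dist` are dropped; each leg keeps only its
    closest head, and each head keeps at most its two closest legs.
    Returns a list of (head_index, [leg_indices]) pedestrian candidates.
    """
    # candidate edges surviving the distance test
    edges = {(h, l): np.linalg.norm(heads[h] - legs[l])
             for h in range(len(heads)) for l in range(len(legs))
             if np.linalg.norm(heads[h] - legs[l]) <= max_dist}
    # each leg keeps only its closest connected head
    best_head = {}
    for (h, l), d in edges.items():
        if l not in best_head or d < edges[(best_head[l], l)]:
            best_head[l] = h
    # each head keeps at most its two closest remaining legs
    result = []
    for h in range(len(heads)):
        legs_h = sorted((edges[(h, l)], l) for l in best_head if best_head[l] == h)
        if legs_h:
            result.append((h, [l for _, l in legs_h[:2]]))
    return result
```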
The edges are removed by the following rules:
• If the distance between a head and a leg along any axis (X, Y, or Z) is larger than the maximum human height.
• If the 3D distance between them is larger than the maximum human height or smaller than the minimum human height.
• For each head blob, at most the two closest legs survive among the connected ones, because a human has at most two legs.
• If a leg blob is connected to multiple head blobs, only the closest of them survives.
For the last rule, the component labels generated from the homogeneous disparity map are used as a measure first, and then the 3D distance. The strategy for removing edges is rather heuristic, but it works well for the graph of response blobs.

Fig. 3. Bipartite graph partitioned into pedestrians and false responses.

V. EXPERIMENTAL RESULTS
For the experiments, we captured stereo road scenes and produced disparity and label data using the method introduced in [3]. Then, we produced the homogeneous disparity map and converted it to depth data. We tested 150 frames to evaluate the performance of the proposed algorithm; the sequence contains 344 pedestrians. The ground truth data was manually obtained for pedestrians whose bounding boxes are larger than 30 by 50 pixels.

We compared our results with those of a HoG human detector similar to the method introduced in [8]. In the implementation, we utilize the ground region to constrain the locations of candidate bounding boxes; in other words, this HoG detector only detects pedestrians which are supported by the ground region. The INRIA pedestrian data set [5] is used to train the HoG detector, the same ground region obtained by [3] is used, and we choose its best result based on detection rate and false alarms. For our method, k0 values of 8, 12, and 16 pixels are selected for head features, and 10, 12, and 14 pixels for leg features, at a distance d0 of 8 meters. Human heights are set between 1.2 and 2.0 meters, and the minimum size of a response blob is set to 8 pixels. Bounding boxes which overlap a ground truth bounding box by more than 50 percent are considered correct detections.

Table I shows that the proposed algorithm produces better results than the ground-plane HoG in terms of detection rate and false alarm rate. Figure 4 shows examples of the detection results.

TABLE I. DETECTION RESULTS
Methods            detection rate   missing rate   false alarm rate
ground plane HoG   53.20%           46.80%         47.71%
proposed           68.31%           31.69%         27.47%
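The 50-percent overlap criterion can be made concrete with a short sketch. Since the exact overlap measure is not pinned down above, this sketch assumes the common intersection-over-union reading; a detector box is matched to a ground truth box when their IoU exceeds 0.5.

```python
def overlap_ratio(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def is_correct_detection(det, gt, threshold=0.5):
    """A detection counts as correct when it overlaps the ground truth
    bounding box by more than `threshold` (IoU reading assumed)."""
    return overlap_ratio(det, gt) > threshold
```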
Fig. 4. Pedestrian detection results of the proposed algorithm.

Figure 5 shows a detection result which contains one false negative, together with its corresponding depth data. An artifact in the depth data (inside the red circle) causes one pedestrian to be missed. We believe such problems can be relieved by exploiting the frame coherence of 3D objects or by using a 2D appearance model. Another weakness of the proposed algorithm is that it produces false alarms for pole-like objects whose heights are similar to that of a human (the red rectangle in Fig. 6). Combining 2D features would also compensate for these weaknesses and increase the overall performance of the detection algorithm.

Fig. 5. Detection result with one missing pedestrian (top row) and its corresponding depth data (bottom).

Fig. 6. False positive response caused by a pole-like object.

VI. CONCLUSIONS AND FUTURE WORKS
In this paper, we introduced a pedestrian detection algorithm which makes use of labeled depth data. The algorithm utilizes the responses of head and leg features designed over depth values and labels. The proposed feature design can represent more complex constraints than simple depth features, and the experimental results show that the algorithm works reasonably well. In the future, we will develop simpler and more discriminative features by utilizing simplified 3D mesh data. We will also compare the proposed algorithm with other pedestrian detection algorithms that work outdoors, especially in road scenes, and we will optimize and improve the edge removal and graph partitioning process to detect single persons as well as groups of people.

ACKNOWLEDGMENT
This research is supported by the MKE (The Ministry of Knowledge Economy), Korea, under the CITRC (Convergence Information Technology Research Center) support program (NIPA-2013-H0401-12-1006) supervised by the NIPA (National IT Industry Promotion Agency). This work was also supported by the Industrial Strategic Technology Development Program, the Development of Driver Oriented Vehicle Augmented Reality System based on HUD (Head Up Display) Technology (10040927), funded by the Ministry of Knowledge Economy (MKE, Korea).

REFERENCES
[1] C. Plagemann, V. Ganapathi, D. Koller, and S. Thrun, "Real-time identification and localization of body parts from depth images," in IEEE International Conference on Robotics and Automation (ICRA), pp. 3108-3113, May 2010.
[2] J. Shotton et al., "Real-time human pose recognition in parts from single depth images," in CVPR, 2011.
[3] K. H. Won and S. K. Jung, "Billboard sweep stereo for obstacle detection in road scenes," Electronics Letters, vol. 48, no. 24, pp. 1528-1530, November 2012.
[4] S. Hermann and R. Klette, "Iterative semi-global matching for robust driver assistance systems," in ACCV, 2012.
[5] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in CVPR, 2005.
[6] O. Neeti, "A survey of techniques for human detection from video," Master's thesis, University of Maryland, 2006.
[7] H. Yang and S. W. Lee, "Reconstruction of 3D human body pose from stereo image sequences based on top-down learning," Pattern Recognition, vol. 40, no. 11, pp. 3120-3131, 2007.
[8] H. Hattori, A. Seki, M. Nishiyama, and T. Watanabe, "Stereo-based pedestrian detection using multiple patterns," in BMVC, 2009.