

2016 23rd International Conference on Pattern Recognition (ICPR), Cancún Center, Cancún, México, December 4-8, 2016

Depth-based 3D Hand Pose Tracking

Kha Gia Quach∗, Chi Nhan Duong∗, Khoa Luu† and Tien D. Bui∗
∗ Department of Computer Science and Software Engineering, Concordia University, Montreal, Quebec, Canada. Email: {k_q, c_duon, bui}@encs.concordia.ca
† CyLab Biometrics Center and the Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA. Email: kluu@andrew.cmu.edu

Abstract—In this paper, we propose two new approaches using the Convolutional Neural Network (CNN) and the Recurrent Neural Network (RNN) for tracking 3D hand poses. The first approach is a detection-based algorithm while the second is a data-driven method. Our first contribution is a new tracking-by-detection strategy extending the CNN-based single-frame detection method to a multiple-frame tracking approach by taking prediction history into account using an RNN. Our second contribution is the use of an RNN to simulate the fitting of a 3D model to the input data; this relaxes the need for a carefully designed fitting function and optimization algorithm. With such strategies, we show that our tracking frameworks can automatically correct failed detections made in previous frames due to occlusions. Our proposed method is evaluated on two public hand datasets, i.e. NYU and ICVL, and compared against other recent hand tracking methods. Experimental results show that our approaches achieve state-of-the-art accuracy and efficiency in the challenging problem of 3D hand pose estimation.

I. INTRODUCTION

The problem of human hand pose estimation has been studied for decades. In particular, markerless hand tracking has gained a lot of attention due to its numerous applications in human-computer interaction with virtual and augmented reality, 3D gaming, gestural input, etc. Major progress has been made recently in hand pose estimation thanks to depth sensors and deep learning models [1]. However, the problem is still far from solved. The challenges remain but shift toward dealing with occlusions, low resolution, clutter, etc. In addition, the real-time requirement makes the problem even harder. This is because natural hand movement exhibits many degrees of freedom (≥ 25), fast motions with rapid changes in direction, self-similar parts, and self-occlusion. All these factors make estimating the pose of articulated objects from depth video data hard, unless the method is robust against such self-similarity and self-occlusion.

A. Motivation

There are two common approaches to the problem of hand pose estimation, i.e. model-based generative tracking and discriminative hand pose detection. Both approaches have their drawbacks, e.g. tracking frameworks cannot recover from initialization errors while detectors may fail in case of occlusions. Thus, the motivation of our work is to develop new strategies that combine the benefits of both generative tracking and discriminative detection into a unified framework yielding high efficiency and robust performance. In addition, this paper presents two ways to build a hybrid framework that can track a human hand from a single depth-image sequence.

B. Contributions

In this paper, we propose a hand pose tracking framework based on CNN and RNN.
The first contribution, used in our first strategy, is a new tracking-by-detection approach which combines Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), inspired by the work on visual object tracking [2]. In this approach, we take prediction history into account to extend single-frame detection into a multiple-frame tracking method. The RNN outputs a hand pose prediction computed from the visual features generated by the CNN and the memory state of the RNN, as shown in Fig. 1(a). In other words, the RNN plays the role of memorizing hand pose predictions from previous frames. The results of CNN-based hand pose estimation are thus affected by previous frames rather than solely by the current input frame. Finally, the prediction can optionally emphasize attention areas in the input before it is fed into the convolutional network in the next frame.

The second contribution is a new generative model-based tracking approach where each time step of the RNN can be seen as an efficient approximate model-fitting step of traditional model-based tracking. The similarity function is learned from training data and integrated into the learning of the RNN. The CNN extracts features from the current depth frame and produces a good initial hand pose if necessary, as shown in Fig. 1(b). Our methods are able to recover from errors and are fast, reliable, and robust against occlusions.

The rest of the paper is organized as follows. First, we review two important categories of hand pose estimation methods in Section II, i.e. model-based tracking and discriminative detection. In Section III, we present a novel hand pose tracking framework, followed by a description of the settings for training the model in Section III-B3. Results and analysis are given in Section IV. Finally, the conclusion of this paper is presented in Section V.

[Fig. 1. The diagram of our hand pose tracking models: (a) tracking-by-detection approach; (b) model-based tracking approach.]

II. PREVIOUS WORK

A variety of approaches has been proposed over the last decade for hand pose estimation. These techniques rely on markers or wearable gloves, RGB input from single or multiple cameras, and more recently depth or RGB-D input. Good overviews of earlier and recent work are given in [3] and [1]. In this section, we review only the more recent work relevant to our approach, which is based on a single depth frame captured by a depth camera. Hand pose estimation methods can be divided into two main categories: appearance-based (discriminative) methods and model-based (generative) methods.

The first group of approaches is based on discriminative models that aim at directly predicting the joint locations from RGB-D or depth images. These models are often used in detection-based frameworks. Some popular architectures in this group include decision forests [4], [5], [6], [7], [8], [9], part-based models [10], and deep models. We focus on deep models; readers can refer to [1] for an overview of other detection-based approaches. Following the current trend in computer vision, deep neural networks have been explored for hand pose estimation. Tompson et al. [11] proposed a three-stage pipeline including hand detection with a decision forest, joint location estimation with a deep network, and joint refinement with inverse kinematics (IK). They also introduced a large-scale dataset for training and evaluation called the NYU hand dataset, so this method is referred to as DeepNYU [11]. Oberweger et al.
[12] employed a similar deep network but proposed to learn a prior subspace to replace the IK stage. Farabet et al. [13] developed a pixel-labeling approach which predicts joint labels for each pixel, followed by a clustering stage to predict joint locations. This approach is inspired by a similar method for the human pose recognition problem [14], but a deep network was used instead of a decision forest.

The second group consists of generative, model-based methods. This type of method usually has four main building blocks: a hand model, a similarity function measuring the difference between the observed image and the model, an optimization algorithm (e.g. particle swarm) to maximize the similarity function with respect to the model parameters, and an initial pose as the optimization starting point. Such model-driven systems are usually found in tracking-based settings [15], [16], [17], [18], [19]. Oikonomidis et al. [15] proposed a model-based method using particle swarm optimization (PSO). This method achieves 15 fps with GPU acceleration and uses skin color segmentation, which is sensitive to lighting. Oikonomidis et al. [20] later presented a more advanced sampling strategy that improves tracking efficiency without compromising quality. However, gradient-based optimization approaches converge faster and more accurately than PSO when close to the solution, and are therefore well suited for real-time applications. Melax et al. [16] proposed a depth-based tracking method using efficient parallel physics simulations. Although this method is fast, finger articulations are often incorrectly tracked. Oberweger et al. [17] proposed a model-based method, called DeepPrior, that learns to generate realistic depth images of hands and to predict updates that improve the current estimate of the hand pose, given the input depth image and a generated image for this estimate. The method uses a feedback loop constructed from deep networks learned from training data rather than a carefully designed similarity function and an optimization algorithm. Paliouras and Argyros [19] aim at automating the definition of the objective function in the optimization problem of model-based methods. All these works require careful initialization in order to guarantee convergence and therefore rely on tracking based on the last frame's pose or on separate initialization methods. Such tracking-based methods have difficulty handling drastic changes between two frames, which are common as the hand tends to move fast.

There have been some works trying to combine both discriminative and generative models into a unified framework [21], [22], [23]. Qian et al. [21] proposed a hybrid method that uses a simple hand model consisting of a number of spheres in combination with discriminative fingertip detection. This method achieves 25 fps but requires the fingertips to be visible; otherwise tracking becomes difficult since fingertips can easily be occluded or misdetected.

III. OUR PROPOSED HAND POSE TRACKING FRAMEWORK

In this section, we first present a tracking-by-detection approach to the hand pose tracking problem using deep learning techniques [24], particularly CNNs and RNNs. Then we describe a model-based tracking approach that simulates the process of model fitting by cascaded RNN regression. The first proposed model is an RNN that takes depth video frames as inputs and returns the estimated hand pose (see Fig. 1(a)).
This model can be equivalently written in the tracking probability form as follows,

p(z_1, z_2, \dots, z_T \mid x_1, x_2, \dots, x_T) = \prod_{t=1}^{T} p(z_t \mid z_{<t}, x_{\le t}) \quad (1)

where z_t and x_t are the estimated hand pose and the input depth frame at time t, respectively, z_{<t} is the history of all estimated poses before time t, and x_{\le t} is the history of all input frames up to time t.

The second proposed framework consists of two main components: pose cascade regression and a tracking-detection gating mechanism (see Fig. 1(b)). The pose regression module is a recurrent neural network that takes an estimated hand pose as input and returns an improved hand pose estimate at each step/stage. The tracking-detection gating mechanism allows the model to choose between the previously estimated pose and the pose prediction from the CNN.

A. Tracking-by-detection approach

1) Modeling: At each time step t of the RNN, the input depth frame x_t is first passed through a convolutional network to extract visual features. This gives us a feature vector φ(x_t) of the input depth frame x_t as follows,

\phi(x_t) = \mathrm{CNN}_{\theta_c}(m(x_t)) \quad (2)

where CNN is a convolutional network with parameters θ_c, and m(·) denotes the preprocessing step, which includes hand localization and segmentation and returns the local window containing only the currently tracked hand. The preprocessing steps are discussed in detail in Section III-A2. This feature vector is then fed into an RNN, and the RNN updates its internal state h_t based on the previous hidden state h_{t−1}, the previously estimated hand pose z̃_{t−1}, and the current frame's feature vector φ(x_t) as follows,

h_t = \mathrm{RNN}_{\theta_r}(h_{t-1}, \tilde{z}_{t-1}, \phi(x_t)) \quad (3)

where RNN_{θ_r} can be any recurrent activation function, such as gated recurrent units (GRUs) [25], long short-term memory units (LSTMs) [26], or a simple logistic function, with parameters θ_r. Using the updated hidden state h_t, the RNN computes the predictive distribution over the hand pose history as follows,

p(z_t \mid z_{<t}, \phi(x_{\le t})) = \mathrm{out}_{\theta_o}(h_t) \quad (4)

where θ_o is the set of parameters defining the output neural network. The whole process (Eqs. (2)-(4)) is applied iteratively as the stream of new frames arrives. Fig. 1(a) graphically illustrates this process.

2) Pre-processing: To extract CNN visual features, we first detect the hand location in the input depth frame. Then, a fixed-size metric cube is segmented from the depth image around the hand. The cube is resized to 128 × 128 and the depth values within the cube are normalized to [−1, 1]. Depth values are clipped to the front and rear faces of the cube. This preprocessing step was also used in [6] to provide invariance to different hand-to-camera distances.

3) Training: The CNN that we constructed to extract visual features consists of three convolutional layers, three max-pooling layers and three fully-connected layers. This is a generic network similar to the one in [27]. Let C denote a convolutional layer, P a pooling layer and FC a fully connected layer. Both C and FC layers use Rectified Linear Unit (ReLU) [28] activation functions. For C layers, the size is defined as (w × h × d), where the first two values define the dimension of the filters and the last one is the number of filters. For P layers, the size is defined as (w × h). The convolutional network can be described concisely as C(5 × 5 × 8) → P(4 × 4) → C(5 × 5 × 8) → P(2 × 2) → C(3 × 3 × 8) → P(1 × 1) → FC(1024) → FC(1024) → FC(30).
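To make this layer specification concrete, the following is a minimal sketch of such a feature extractor. The paper does not state the framework, strides, padding, or input channel count, so PyTorch, unit strides, no padding, and a single-channel depth input are assumptions here; only the layer sizes above are taken from the text.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the CNN feature extractor described above:
# C(5x5x8) -> P(4x4) -> C(5x5x8) -> P(2x2) -> C(3x3x8) -> P(1x1) -> FC(1024) -> FC(1024) -> FC(30).
cnn_features = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=5), nn.ReLU(),   # C(5x5x8): 128x128 -> 124x124
    nn.MaxPool2d(4),                             # P(4x4):   124x124 -> 31x31
    nn.Conv2d(8, 8, kernel_size=5), nn.ReLU(),   # C(5x5x8): 31x31 -> 27x27
    nn.MaxPool2d(2),                             # P(2x2):   27x27 -> 13x13
    nn.Conv2d(8, 8, kernel_size=3), nn.ReLU(),   # C(3x3x8): 13x13 -> 11x11
    nn.MaxPool2d(1),                             # P(1x1): identity under the assumed stride
    nn.Flatten(),
    nn.Linear(8 * 11 * 11, 1024), nn.ReLU(),     # FC(1024)
    nn.Linear(1024, 1024), nn.ReLU(),            # FC(1024)
    nn.Linear(1024, 30),                         # FC(30): the feature vector phi(x_t)
)

# Example: a preprocessed 128x128 depth crop, normalized to [-1, 1]
x_t = torch.rand(1, 1, 128, 128) * 2.0 - 1.0
phi = cnn_features(x_t)                          # shape (1, 30)
```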
For the RNN, we consider a simple network with one hidden layer of 30 units; the output layer has dimension J × 3, where J is the number of joints, and the input layer has the dimension of the last CNN layer, i.e. 30, plus the dimension of the output layer. We optimize the network parameters by error backpropagation using the stochastic gradient descent (SGD) algorithm. The batch size is 128. The learning rate starts at 0.01 and decays over the epochs at a rate of 0.95. The networks are trained for 1000 epochs. At each SGD update, we generate a minibatch of examples by the following steps:
• Choose a group of consecutive frames.
• Extract their CNN features and ground-truth poses.
• Provide the ground-truth poses and the corresponding CNN features as the input sequences for the RNN.
• Use the ground-truth poses shifted by one frame (i.e. starting from frame t + 1) as the output sequences for the RNN.
After these steps, each training example is a pair of data, containing a sequence of CNN features and a sequence of ground-truth joint locations. We use this minibatch of N generated examples to compute the gradient of the minibatch log-likelihood.

B. Model-based tracking approach

1) Modeling: For each input depth frame t, we also first compute its CNN feature vector φ(x_t). However, the role of the RNN is now to evolve the hand pose from an initial state to the correct one, given the feature vector φ(x_t) of the current depth frame. Note that the input depth frame has also gone through the preprocessing step described in Section III-A2. This feature vector is also fed into an RNN, and the RNN updates its internal state h_t^{(k)} based on the previous hidden state h_t^{(k−1)} and the previously estimated hand pose z̃_t^{(k−1)} as follows,

h_t^{(k)} = \mathrm{RNN}_{\theta_r}\bigl(h_t^{(k-1)}, \tilde{z}_t^{(k-1)}\bigr) \quad (5)

where the first input in the RNN sequence (with K stages), z̃_t^{(1)} = z̃_{t−1}^{(K)}, is the estimated pose of the previous frame t − 1, and the initial hidden state of the RNN is h_t^{(1)} = φ(x_t). Using the updated hidden state h_t^{(k)}, the RNN computes the predictive distribution over the hand pose history as follows:

p\bigl(z_t^{(k)} \mid z_t^{(<k)}, \phi(x_t)\bigr) = \mathrm{out}_{\theta_o}\bigl(h_t^{(k)}\bigr) \quad (6)

where θ_o is the set of parameters defining the output neural network. This whole cascaded fitting process is applied iteratively to improve the estimated pose at each stage for the current depth frame. Fig. 1(b) graphically illustrates this process.

2) Hand pose subspace modeling: Besides using joint coordinates to represent the hand pose, we also model the hand pose via PCA. A hand pose subspace is learned from the training data. In other words, we learn two functions, PoseToVec and VecToPose, to perform the forward and inverse transformations between the embedding space and the joint coordinate space.

3) Training: The number of hidden units in each stage of the RNN is the same as in the first framework; however, the dimensions of the inputs and outputs are both J × 3. We also use SGD, but the way we generate training samples is completely different. At each SGD update, we generate a minibatch of examples by the following steps (a minimal sketch of the resulting cascade step and stage-wise targets is given below):
• For each video frame, extract its CNN features.
• The input to the first stage of the RNN is the hand pose estimated by the CNN.
• The output at each stage of the RNN is generated based on the distance between the first stage's input pose and the ground-truth pose of the current frame, so that the final stage equals the ground-truth pose.
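As a concrete illustration of Eqs. (5)-(6) and the stage-wise targets above, the following is a minimal sketch of one cascade-fitting pass. The recurrent cell type is not fixed by the paper, so a GRU cell is assumed; K = 10 stages follows Fig. 2; the linear spacing of the intermediate targets is our reading of the training procedure and should be treated as an assumption.

```python
import torch
import torch.nn as nn

J, FEAT_DIM, K = 14, 30, 10          # joints (NYU evaluation subset), CNN feature size, assumed stages
POSE_DIM = J * 3

cell = nn.GRUCell(input_size=POSE_DIM, hidden_size=FEAT_DIM)   # stands in for RNN_{theta_r}
out_layer = nn.Linear(FEAT_DIM, POSE_DIM)                       # stands in for out_{theta_o}

def cascade_fit(phi_xt, z_init):
    """Refine an initial pose for the current frame; phi_xt (the frame's CNN
    feature) initialises the hidden state as in Eq. (5)."""
    h, z = phi_xt, z_init            # h_t^(1) = phi(x_t); z_t^(1) = previous frame's pose or CNN estimate
    poses = []
    for _ in range(K):
        h = cell(z, h)               # Eq. (5): h_t^(k) = RNN(h_t^(k-1), z_t^(k-1))
        z = out_layer(h)             # Eq. (6): z_t^(k) = out(h_t^(k))
        poses.append(z)
    return poses                     # one pose per stage; the last is the frame's estimate

def stage_targets(z_init, z_gt, stages=K):
    """Per-stage training targets: a schedule that starts at the first stage's
    input pose and ends exactly at the ground truth (linear spacing assumed)."""
    return [z_init + (k / stages) * (z_gt - z_init) for k in range(1, stages + 1)]

# Usage example with dummy tensors
phi_xt = torch.randn(1, FEAT_DIM)    # CNN feature of the current frame
z_prev = torch.zeros(1, POSE_DIM)    # pose carried over from the previous frame
refined = cascade_fit(phi_xt, z_prev)[-1]
```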
Based on our experiments, we use this training strategy throughout this paper since it achieves our training target of improving the pose estimate stage by stage (see Fig. 2 and Fig. 3).

IV. EXPERIMENTS

In this section, we evaluate the performance of our proposed frameworks on two different datasets including real and synthetic data. We also perform two different sets of experiments, covering normal and occlusion cases, to show the robustness of our proposed approaches.

A. Datasets

Only a few public datasets exist for hand pose estimation. Most early works perform quantitative evaluation only on synthetic data. Several recent hand tracking works release sufficiently large training and testing sets to allow a fair comparison. We evaluated our method on both real data and synthetic (occluded) data, including the NYU Hand [11] and ICVL [6] datasets.

Real data: The ICVL dataset [6] consists of 180K ground-truth-annotated training images captured from 10 different subjects with varying hand sizes. The dataset also contains two testing sequences (denoted A and B), each containing 1000 frames capturing various poses with severe scale and viewpoint changes. The NYU dataset [11] contains 72K training and 8K test frames which are well annotated (i.e. containing J = 36 annotated joints). We follow the evaluation protocol of [17], [11] and use the same subset of J = 14 joints. The training set contains hand poses of one person, while the test set has hands from two persons. It also shows high pose variability, which can make pose estimation challenging. The dataset was captured using a structured-light RGB-D sensor, but we used only the depth images in our experiments.

B. Evaluation metrics and methods for comparison

Metrics: There are three quantitative metrics. The first two are the average and maximum per-joint errors (in millimeters) across all frames. The third is the success rate, i.e. the percentage of correct frames whose average (or maximum) joint distance error is lower than a threshold (∼20 mm); a short sketch of this measure is given below. This strict measure has been described and used in [6].

Methods: We demonstrate the efficacy of our frameworks by comparing them to the deep-model-based methods DeepNYU [11] and DeepPrior [17].

C. Experiments on Real Datasets

First, we run experiments on the normal test sets of the two real databases, i.e. ICVL and NYU. This experiment shows how well the model can handle distinct poses in the ICVL test set and unconstrained poses in the NYU test set. Second, we perform another experiment where we add random occlusions to the depth input frames (input frames are chosen randomly).

[Fig. 2. Average joint errors over 10 recurrent stages.]
[Fig. 3. Joint prediction over different stages (1, 2, 4, 8, 10).]
[Fig. 4. Comparison of different methods on the ICVL dataset: (a) success rate over different thresholds; (b) per-joint max errors; (c) per-joint average errors. Best viewed in color.]
[Fig. 5. Comparison of different methods on the NYU hand dataset: (a) success rate over different thresholds; (b) per-joint max errors; (c) per-joint average errors. Best viewed in color.]
[Fig. 6. Comparison of three methods (from top to bottom: our PR-CNN, Oberweger et al. and Tompson et al.) on the NYU dataset.]
[Fig. 7. Comparison of two methods (from top to bottom: Oberweger et al. and our PR-CNN) on the NYU dataset with occlusions.]

The results in terms of prediction errors for the two experiments are shown in Fig. 4 for the ICVL test set and Fig. 5 for the NYU test set.
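For concreteness, the success-rate measure reported in Figs. 4(a) and 5(a) can be sketched as follows; this is a minimal implementation under our reading of the metric, assuming joint coordinates in millimeters and a (frames, joints, 3) array layout.

```python
import numpy as np

def success_rate(pred, gt, threshold=20.0, use_max=True):
    """Fraction of frames whose max (or average) per-joint Euclidean error is below `threshold` mm.
    pred, gt: arrays of shape (num_frames, num_joints, 3), in millimeters (assumed layout)."""
    per_joint_err = np.linalg.norm(pred - gt, axis=-1)                            # (num_frames, num_joints)
    per_frame_err = per_joint_err.max(axis=1) if use_max else per_joint_err.mean(axis=1)
    return float((per_frame_err < threshold).mean())
```

Sweeping the threshold over a range of values produces success-rate curves of the kind plotted in Figs. 4(a) and 5(a).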
Note that the per-joint errors are also computed for each finger and the palm, indexed as C: palm, T: thumb, I: index, M: middle, R: ring, P: pinky, W: wrist. The qualitative results for the first and second experiments are shown in Fig. 6 and Fig. 7, respectively. As we can see from the second experiment, the performance of DeepPrior is significantly affected by occlusions, while our PR-CNN method still works well in this case. Our methods achieve competitive results compared to other state-of-the-art methods.

V. CONCLUSION

We proposed two different network architectures for tracking 3D hand poses, both using CNNs and RNNs. With the second architecture, i.e. the model-based approach, we show that the framework can automatically correct failed detections made in previous frames due to occlusions. We have compared the architectures on two datasets and shown that they are competitive with the previous state of the art in terms of both localization accuracy and robustness.

ACKNOWLEDGMENT

This work is supported in part by the Natural Sciences and Engineering Research Council (NSERC) of Canada.

REFERENCES

[1] J. S. Supancic, III, G. Rogez, Y. Yang, J. Shotton, and D. Ramanan, "Depth-based hand pose estimation: Data, methods, and challenges," in The IEEE International Conference on Computer Vision (ICCV), December 2015.
[2] Q. Gan, Q. Guo, Z. Zhang, and K. Cho, "First step toward model-free, anonymous object tracking with recurrent neural networks," arXiv preprint arXiv:1511.06425, 2015.
[3] A. Erol, G. Bebis, M. Nicolescu, R. D. Boyle, and X. Twombly, "Vision-based hand pose estimation: A review," Computer Vision and Image Understanding, vol. 108, no. 1, pp. 52–73, 2007.
[4] C. Keskin, F. Kıraç, Y. E. Kara, and L. Akarun, "Hand pose estimation and hand shape classification using multi-layered randomized decision forests," in Computer Vision–ECCV 2012. Springer, 2012, pp. 852–863.
[5] C. Xu and L. Cheng, "Efficient hand pose estimation from a single depth image," in The IEEE International Conference on Computer Vision (ICCV), December 2013.
[6] D. Tang, H. Jin Chang, A. Tejani, and T.-K. Kim, "Latent regression forest: Structured estimation of 3d articulated hand posture," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.
[7] D. Tang, J. Taylor, P. Kohli, C. Keskin, T.-K. Kim, and J. Shotton, "Opening the black box: Hierarchical sampling optimization for estimating human hand pose," in The IEEE International Conference on Computer Vision (ICCV), December 2015.
[8] P. Li, H. Ling, X. Li, and C. Liao, "3d hand pose estimation using randomized decision forest with segmentation index points," in The IEEE International Conference on Computer Vision (ICCV), December 2015.
[9] C. Xu, A. Nanjappa, X. Zhang, and L. Cheng, "Estimate hand poses efficiently from single depth images," International Journal of Computer Vision, vol. 116, no. 1, pp. 21–45, 2016.
[10] X. Sun, Y. Wei, S. Liang, X. Tang, and J. Sun, "Cascaded hand pose regression," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[11] J. Tompson, M. Stein, Y. Lecun, and K. Perlin, "Real-time continuous pose recovery of human hands using convolutional networks," ACM Transactions on Graphics (TOG), vol. 33, no. 5, p. 169, 2014.
[12] M. Oberweger, P. Wohlhart, and V. Lepetit, "Hands deep in deep learning for hand pose estimation," in Proc. Computer Vision Winter Workshop (CVWW), 2015.
[13] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, "Learning hierarchical features for scene labeling," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 35, no. 8, pp. 1915–1929, 2013.
[14] J. Shotton, T. Sharp, A. Kipman, A. Fitzgibbon, M. Finocchio, A. Blake, M. Cook, and R. Moore, "Real-time human pose recognition in parts from single depth images," Communications of the ACM, vol. 56, no. 1, pp. 116–124, 2013.
[15] I. Oikonomidis, N. Kyriazis, and A. A. Argyros, "Efficient model-based 3d tracking of hand articulations using kinect," in BMVC, vol. 1, no. 2, 2011, p. 3.
[16] S. Melax, L. Keselman, and S. Orsten, "Dynamics based 3d skeletal hand tracking," in Proceedings of Graphics Interface 2013. Canadian Information Processing Society, 2013, pp. 63–70.
[17] M. Oberweger, P. Wohlhart, and V. Lepetit, "Training a feedback loop for hand pose estimation," in The IEEE International Conference on Computer Vision (ICCV), December 2015.
[18] N. Kyriazis, I. Oikonomidis, P. Panteleris, D. Michel, A. Qammaz, A. Makris, K. Tzevanidis, P. Douvantzis, K. Roditakis, and A. Argyros, "A generative approach to tracking hands and their interaction with objects," in Man–Machine Interactions 4. Springer, 2016, pp. 19–28.
[19] K. Paliouras and A. A. Argyros, "Towards the automatic definition of the objective function for model-based 3D hand tracking," in Man–Machine Interactions 4: 4th International Conference on Man–Machine Interactions, ICMMI 2015, Kocierz Pass, Poland, October 6–9, 2015. Cham: Springer International Publishing, 2016, pp. 353–363. [Online]. Available: http://dx.doi.org/10.1007/978-3-319-23437-3_30
[20] I. Oikonomidis, M. Lourakis, and A. Argyros, "Evolutionary quasi-random search for hand articulations tracking," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3422–3429.
[21] C. Qian, X. Sun, Y. Wei, X. Tang, and J. Sun, "Realtime and robust hand tracking from depth," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.
[22] S. Sridhar, F. Mueller, A. Oulasvirta, and C. Theobalt, "Fast and robust hand tracking using detection-guided optimization," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[23] C. Choi, A. Sinha, J. Hee Choi, S. Jang, and K. Ramani, "A collaborative filtering approach to real-time hand pose estimation," in The IEEE International Conference on Computer Vision (ICCV), December 2015.
[24] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[25] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," arXiv preprint arXiv:1406.1078, 2014.
[26] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[27] A. Toshev and C. Szegedy, "Deeppose: Human pose estimation via deep neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1653–1660.
[28] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.