International Journal of Computer Vision manuscript No.
(will be inserted by the editor)
3D Hand Pose Detection in Egocentric RGB-D Images
Grégory Rogez · J. S. Supančič III · Maryam Khademi ·
J. M. M. Montiel · Deva Ramanan
Received: date / Accepted: date
Abstract We focus on the task of everyday hand pose estimation from egocentric viewpoints. For this task, we show
that depth sensors are particularly informative for extracting
near-field interactions of the camera wearer with his/her environment. Despite the recent advances in full-body pose estimation using Kinect-like sensors, reliable monocular hand
pose estimation in RGB-D images is still an unsolved problem. The problem is considerably exacerbated when analyzing hands performing daily activities from a first-person
viewpoint, due to severe occlusions arising from object manipulations and a limited field-of-view. Our system addresses
these difficulties by exploiting strong priors over viewpoint
and pose in a discriminative tracking-by-detection framework. Our priors are operationalized through a photorealistic synthetic model of egocentric scenes, which is used to
generate training data for learning depth-based pose classifiers. We evaluate our approach on an annotated dataset
of real egocentric object manipulation scenes and compare
to both commercial and academic approaches. Our method
provides state-of-the-art performance for both hand detection and pose estimation in egocentric RGB-D images.
Keywords egocentric vision, hand pose, object manipulation, RGB-D sensor
This work was supported by EU grant Egovision4Health.
Grégory Rogez, J. M. M. Montiel
Aragon Institute of Engineering Research (i3A), Universidad de
Zaragoza, Spain
E-mail: {grogez,josemari}@unizar.es
Grégory Rogez, J. S. Supančič III, Maryam Khademi, Deva Ramanan
Dept. of Computer Science, University of California, Irvine, USA
E-mail: {grogez,jsupanci,mkhademi,dramanan}@unizar.es
1 Introduction
Much recent work has explored various applications of egocentric RGB cameras, spurred on in part by the availability of low-cost mobile sensors such as Google Glass, Microsoft SenseCam, and the GoPro camera. Many of these
applications, such as life-logging [14], medical rehabilitation [59], and augmented reality [3], require inferring the
interactions of the first-person observer with his/her environment while recognizing his/her activities. Whereas third-person-view activity analysis is often driven by human full-body pose, egocentric activities are often defined by hand
pose and the objects that the camera wearer interacts with.
Towards that end, we specifically focus on the tasks of hand
detection and hand pose estimation from egocentric viewpoints of daily activities. We show that depth-based cues,
extracted from an egocentric depth camera, provide an extraordinarily helpful signal for egocentric hand-pose estimation.
One may hope that depth simply “solves” the problem,
given successful systems for real-time human pose estimation based on the Kinect sensor [47] and prior work on articulated hand pose estimation for RGB-D sensors [30, 55, 17, 36]. Recent approaches have also tried to exploit the
2.5D data from Kinect-like devices to understand complex
scenarios such as object manipulation [21] or two interacting
hands [32]. We show that various assumptions about visibility/occlusion and manual tracker initialization may not hold
in an egocentric setting, making the problem still quite challenging.
Challenges: Three primary challenges arise for hand pose
estimation in everyday egocentric views, compared to 3rd
person views. First, tracking is less reliable. Even assuming
that manual initialization is possible, a limited field-of-view
from an egocentric viewpoint causes hands to frequently
move outside the camera view frustum.
Fig. 1 Challenges. We contrast third-person depth (a) and RGB (b) images (overlaid with pose estimates from [36]) with depth and RGB images (c,d) from egocentric views of daily activities. Hands
leaving the field-of-view, self-occlusions, occlusions due to objects
and malsegmentability due to interactions with the environment are
common hard cases in egocentric settings.
This makes it difficult to apply tracking models that rely on accurate estimates from previous frames, since the hand may not even be
visible. Second, active hands are difficult to segment. Many
previous systems for 3rd-person views make use of simple
depth-based heuristics to both detect and segment the hand.
These are difficult to apply during frames where users interact with objects and surfaces in their environment. Finally,
fingers are often occluded by the hand (and other objects
being manipulated) in egocentric views, considerably complicating articulated pose estimation. See examples in Fig. 1.
Our approach: We describe a successful approach to
hand-pose estimation that makes use of the following key
observations. First, depth cues provide an extraordinarily
helpful signal for pose estimation in the near-field, first-person
viewpoints. Though this observation may seem obvious, state-of-the-art methods for egocentric hand detection do not make use of depth [25, 24]. Moreover, in our scenario, depth cues
are not “cheating” as humans themselves make use of stereopsis for near-field analysis [44,11]. Second, the egocentric setting provides strong priors over viewpoint, grasps,
and interacting-objects. We operationalize these priors by
generating synthetic training data with a rendered 3D hand
model. In contrast to previous work that uses a “floating
hand”, we mount a synthetic egocentric camera to a virtual
full-body character interacting with a library of everyday objects. This allows us to make use of contextual cues for both
data generation and recognition (see Fig. 3). Third, we treat
pose estimation (and detection) as a discriminative multiclass classification problem. To efficiently evaluate a large
number of pose-specific classifiers, we make use of hierarchical cascade architectures. Unlike much past work, we
classify global poses rather than local parts, which allows
us to better reason about self-occlusions. Our classifiers process single frames, using a tracking-by-detection framework
that avoids the need for manual initialization (see Fig. 2c-e).
Evaluation: Unlike human pose estimation, there exist no standard benchmarks for hand pose estimation, especially in egocentric videos. We believe that quantifiable performance is important for many broader applications such as
health-care rehabilitation, for example. Thus, for the evaluation of our approach, we have collected and annotated (full
3D hand poses) our own benchmark dataset of real egocentric object manipulation scenes, which we will release
to spur further research. It is surprisingly difficult to collect annotated datasets of hands performing real-world interactions; indeed, much prior work on hand pose estimation evaluates results on synthetically-generated data. We developed a semi-automatic labelling tool which allows us to accurately annotate partially occluded hands and fingers in 3D,
given real-world RGB-D data. We compare to both commercial and academic approaches to hand pose estimation,
and demonstrate that our method provides state-of-the-art
performance for both hand detection and pose estimation in
egocentric RGB-D images.
Overview: This work is an extension of [39]. This manuscript explains our approach in considerably more detail, reviews a broader collection of related work, and provides new
and extensive comparisons with state-of-the-art methods. We
review related work in Sec. 2 and present our approach in
Sec. 3, focusing on our synthetic training data generation
procedure (Sec. 3.1) and our hierarchical multi-class architecture (Sec. 3.2). We conclude with experimental results in
Sec. 4.
2 Related work
Egocentric hands: Previous work examined the problem of recognizing objects [10, 34] and interpreting American Sign Language poses [51] from wearable cameras. Much
work has also focused on hand detection [25, 24], hand tracking [20,19, 18, 29], finger tracking [7], and hand-eye tracking [43] from wearable cameras. Often, hand pose estimation is examined during active object manipulations [28, 38,
37,9]. Most such previous work makes use of RGB sensors.
Our approach demonstrates that an egocentric depth camera
makes things considerably easier.
Egocentric depth: Depth-based wearable cameras are
attractive because depth cues can be used to better reason
about occlusions arising from egocentric viewpoints. There
Fig. 2 System overview. (a) Chest-mounted RGB-D camera. (b) Synthetic egocentric hand exemplars are used to define a set of hand pose classes
and train a multi-class hand classifier. The depth map is processed to select a sparse set of image locations (c), which are classified to obtain a list
of probable hand poses (d). Our system produces a final estimate by reporting one or more top-scoring pose classes (e).
has been surprisingly little prior work in this vein, with notable exceptions focusing on targeted applications such as
navigation for the blind [27] and recent work on egocentric object understanding [5, 26]. We posit that one limitation may be the need for small form-factors for wearable
technology, while structured light sensors such as the Kinect
often make use of large baselines. We show that time-of-flight depth cameras are an attractive alternative for wearable depth-sensing, since they do not require large baselines
and so require smaller form-factors.
Depth-based pose: Our approach is closely inspired by
the Kinect system and its variants [47], which makes use
of synthetically generated depth maps for articulated pose
estimation. Notably, Kinect follows in the tradition of local part models [4, 8], which are attractive in that they require less training data to model a large collection of target
poses. However, it is unclear if local methods can deal with
large occlusions (such as those encountered during egocentric object manipulations) where local information can be
ambiguous. Our approach differs in that our classifiers classify global poses rather than local parts. Finally, much previous work assumes that hands are easily segmented or detected. Such assumptions simply do not hold for everyday
egocentric interactions.
Interacting objects: Estimating the pose of a hand manipulating an object is challenging [13] due to occlusions
and ambiguities in segmenting the object versus the hand.
It is attractive to exploit contextual cues by simultaneously tracking hands [21, 22] and the object. [2] uses multiple cameras to reduce the number of full occlusions. We jointly
model hand and objects with synthetic hand-object exemplars as in [42]. However, instead of modeling floating hands,
we model them in a realistic egocentric context that is constrained by the full human body.
Tracking vs detection: Temporal reasoning is also particularly attractive because one can use dynamics to resolve
ambiguities arising from self and object occlusions. Much
prior work on hand-pose estimation takes this route [30, 55,
17, 36]. Our approach differs in that we focus on single-image hand pose estimation, which is required to avoid manual (re)initialization. Exceptions include [58, 54, 56], who also
process single images but focus on third-person views.
Generative vs discriminative: Generative model-based
approaches have historically been more popular for hand
pose estimation [52]. A detailed 3D model of the hand pose
is usually employed for articulated pose tracking [31, 32]
and detailed 3D pose estimation [23]. Discriminative approaches [55,17] for hand pose estimation tend to require
large datasets of training examples, whether synthetic, realistic, or combined [55]. Learning formalisms include boosted classifier trees [33], randomized decision forests [17], and regression forests [55, 54]. Sridhar et al. [50] propose a hybrid
approach that combines discriminative part-based pose retrieval with a generative model-based tracker. Our approach
uses a computer graphics model to generate training data,
which is then used to learn discriminative pose-specific classifiers.
Hierarchical cascades: We approach pose estimation as
a hierarchical multi-class classification task, a strategy that
dates back at least to Gavrila et al. [12].
Fig. 3 Training data. On the left, we show our avatar mounted with a virtual egocentric camera. In the middle, we show the EveryDayHands animation library [6] used to generate realistic hand-object configurations. On the right, we present some examples of the resulting training images rendered using Poser.
Our framework follows a line of work that focuses on efficient implementation through coarse-to-fine hierarchical cascades [61, 53, 4, 40]. Our work differs in its discriminative training and large-scale ensemble averaging over an exponentially-large set of
cascades, both of which considerably improve accuracy and
speed.
3 Our method
Our method works by using a computer graphics model to
generate synthetic training data. We then use this data to
train a classifier for pose estimation. We describe each stage
in turn.
3.1 Synthesizing training data
We represent a hand pose as a vector of joint angles of a
kinematic skeleton θ. We use a hand-specific forward kinematic model to generate a 3D hand mesh given a particular
θ. In addition to hand pose parameters θ, we also need to
specify a camera vector φ that specifies both a viewpoint
and position. We experimented with various priors and various rendering packages.
Floating hands vs full-body characters: Much work
on hand pose estimation makes use of an isolated “floating”
hand mesh model to generate synthetic training data. Popular software packages include the open-source libhand [57]
and commercial Poser [48, 46]. We posit that modeling a full
character body, and specifically, the full arm, will provide
important contextual cues for hand pose estimation. To generate egocentric data, we mount a synthetic camera on the
chest of a virtual full-body character, naturally mimicking
our physical data collection process. To generate data corresponding to different body and hand shapes, we make use of
Poser’s character library.
Pose prior: Our hand model consists of 26 joint angles, $\theta \in [0, 2\pi]^{26}$. It is difficult to specify priors over such high-dimensional spaces. We take a non-parametric data-driven
approach. We first obtain a training set of joint angles {θn }
from a collection of grasping motion capture data [41]. We
then augment this core set of poses with synthetic perturbations, making use of rejection sampling to remove invalid
poses. Specifically, we first generate proposals by perturbing the ith joint angle of training sample n with Gaussian
noise:

$\theta_n[i] \rightarrow \theta_n[i] + \epsilon, \quad \text{where} \quad \epsilon \sim \mathcal{N}(0, \sigma_i) \qquad (1)$
The noise variance σi is obtained by manual tuning on validation data. Notably, we also perturb the entire arm of the
full character-body, which generates natural (egocentric) viewpoint variations of hand configurations. Note that we consider smaller perturbations for fingers to keep grasping poses
reasonable. We remove those samples that result in poses
that are self-intersecting or lie outside the field-of-view.
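As a rough illustration of this sampling step (not the authors' released code), the Python sketch below perturbs a motion-capture exemplar with per-joint Gaussian noise and rejects invalid proposals; the is_valid callback, which would wrap the mesh-based self-intersection and field-of-view tests, is a hypothetical placeholder.

import numpy as np

def perturb_pose(theta, sigma, is_valid, max_tries=100, rng=np.random):
    """Rejection-sample a perturbed pose around a mocap exemplar (cf. Eq. 1).

    theta    : array of 26 joint angles (radians) for one training sample
    sigma    : per-joint noise standard deviations (smaller for fingers,
               larger for the arm, tuned on validation data)
    is_valid : callback rejecting self-intersecting or out-of-view proposals
    """
    for _ in range(max_tries):
        proposal = theta + rng.normal(0.0, sigma, size=theta.shape)
        if is_valid(proposal):
            return proposal
    return None  # no valid perturbation found for this exemplar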
Viewpoint prior: The above pose perturbation procedure for θ naturally generates realistic egocentric camera
viewpoints φ for our full character models. We also performed some diagnostic experiments with a floating hand
model. To specify a viewpoint prior in such cases, we limited the azimuth φ_az to lie between 180° ± 30° (corresponding to rear viewpoints), the elevation φ_el to lie between −30° and 10° (since hands tend to lie below the chest mount), and the bank φ_b to lie between ±30°. We obtained these ranges by looking at
a variety of collected data (not used for testing).
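For the floating-hand diagnostic, sampling a camera viewpoint within these ranges could look like the sketch below; uniform sampling is our assumption, since only the ranges are stated.

import numpy as np

def sample_floating_hand_viewpoint(rng=np.random):
    """Sample a camera viewpoint (in degrees) within the stated egocentric ranges."""
    azimuth = rng.uniform(150.0, 210.0)     # 180 +/- 30: rear viewpoints
    elevation = rng.uniform(-30.0, 10.0)    # hands tend to lie below the chest mount
    bank = rng.uniform(-30.0, 30.0)         # +/- 30
    return azimuth, elevation, bank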
Interacting objects: We wish to explore egocentric hand
pose estimation in the context of natural, functional hand
movement. This often involves interactions with the surrounding environment and manipulations of nearby objects. We
posit that generating such contextual training data will be
important for good test-time accuracy. However, modeling
the space of hand grasps and the world of manipulable objects is itself a formidable challenge. We make use of the
Fig. 4 Hierarchy of hand poses. We visualize a hierarchical graph G = (V, E ) of quantized poses with K = 16 leaves. The ith node in this tree
represents a coarse pose class, visualized with the average hand pose and the average gradient map over all the exemplars in that coarse pose.
EveryDayHands animation library [6], which contains 40
canonical hand grasps. This package was originally designed
as a computer animation tool, but we find the library to cover
a reasonable taxonomy of grasps for egocentric recognition.
A surprising empirical fact is that humans tend to use a small
number of grasps for everyday activities - by some counts,
9 grasps are enough to account for 80% of human interactions [60]. Following this observation, we manually amassed
a collection of everyday common objects from model repositories [1]. Our objects include spheres and cylinders (of
varying sizes), utensils, phones, cups, etc. We paired each
object with a viable grasp (determined through visual inspection), yielding a final set of 52 hand-object combinations. We apply our rejection-sampling technique to generate a large number of grasp-pose and viewpoint perturbations, producing a
final dataset of 10,000 synthetic egocentric hand-object examples. Some examples are shown in Fig. 3.
3.2 Hierarchical cascades

We use our training set to learn a model that simultaneously detects hands and estimates their pose. Both tasks are addressed with a scanning-window classifier that outputs one of K discrete pose classes or a background label. One may need a large K to model many poses, increasing training/testing times and memory footprints. We address such difficulties through coarse-to-fine sharing and scanning-window cascades. Such architectures have been previously explored in [61, 53, 4, 40]. We contribute methods for efficient discriminative training, efficient run-time evaluation, and ensemble averaging. To describe our contributions, we first recast previous approaches in a mathematical framework that is amenable to our proposed modifications.

Hierarchical quantization: Firstly, we represent each training example as a depth image x and a label vector y of joint positions in a canonical coordinate frame with normalized position and scale. We quantize this space of poses {y} into K discrete values with K-means clustering. We then agglomeratively merge these quantized poses into a hierarchical tree G = (V, E) with K leaves, following the procedure of [40]. Each node i ∈ V represents a coarse pose class. We visualize the tree in Fig. 4.

Coarse-to-fine sharing: Given a test image x, a binary classifier tuned for coarse pose-class i ∈ V is evaluated as:

$f_i(x) = \prod_{j \in A_i} h_j(x), \quad \text{where} \quad h_j(x) = \mathbb{1}[w_j^T x > 0] \qquad (2)$

where $\mathbb{1}$ is the indicator function which evaluates to 1 or 0. Here, f_i is a "strong" classifier for class i obtained by ANDing together "weak" binary predictions h_j from a set j ∈ A_i. A_i is the set of "ancestor" nodes encountered on the path from i to the root of the tree (i, its parents, grandparents, etc.). Each weak classifier h_j is a thresholded linear function that is defined on appearance features extracted from window x. We use HOG appearance features extracted from subregions within the window, meaning that w_j can be interpreted as a zero-padded "part" template tuned for coarse-pose j. Parts higher in the tree tend to be generic and capture appearance features common to many pose classes. Parts lower in the tree, toward the leaves, tend to capture pose-specific details.

Breadth-first search (BFS): The prediction for pose-class i will be 1 if and only if all classifiers in the ancestor set predict 1.
Fig. 5 Hierarchical cascades. We approach pose estimation as a K-way classification problem. We define a linear-chain cascade of rejectors for each of the K pose classes (left). By sharing "weak" classifiers across these cascades, we can efficiently organize the collection into a coarse-to-fine tree G = (V, E) with K leaves (right). Weak classifiers near the root of the tree are tuned to fire on a large collection of pose classes, while those near the leaves are specific to particular pose classes. Each linear cascade can be recovered from the tree by enumerating the ancestors of a leaf node.
If node i fails to fire, its children (and their descendants) cannot be detected and so can be immediately
rejected. We can efficiently prune away large portions of
the pose-space when evaluating region x with a truncated
breadth-first search (BFS) through graph G (see Alg. 1),
making scanning-window evaluation at test-time quite efficient. Though a queue-based BFS is a natural implementation, we have not seen this explicitly described in previous work on hierarchical cascades [61, 53, 4, 40]. As we will show, such an "algorithmic" perspective immediately suggests straightforward improvements for training and model averaging.
input : Image window x, classifiers {w_i, i ∈ V}
output: vote(i) for each leaf class i.
create a queue Q;
enqueue 1 onto Q;
while Q ≠ empty do
    i = Q.dequeue();
    if w_i^T x > 0 then
        for ∀k ∈ child(i) do
            enqueue k onto Q;
        end
        if child(i) = ∅ then
            vote(i) = 1;
        end
    end
end
Algorithm 1: Classification with a single cascade. We perform a truncated breadth-first search (BFS) using a first-in, first-out queue. We insert the children of node i into the queue only if the weak classifier w_i successfully fires.
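To make the control flow concrete, here is a minimal Python sketch of this truncated BFS; the weight dictionary w, the child lists children, and the HOG feature vector x are assumed inputs, not part of the released system.

from collections import deque
import numpy as np

def classify_single_cascade(x, w, children, root=0):
    """Truncated BFS over the pose tree (cf. Alg. 1 and Eq. 2).

    x        : 1D feature vector (e.g., HOG) for one window
    w        : dict node -> weight vector (zero-padded "part" template)
    children : dict node -> list of child nodes ([] for leaves)
    Returns  : set of leaf pose classes whose whole ancestor chain fired.
    """
    votes = set()
    queue = deque([root])
    while queue:
        i = queue.popleft()
        if np.dot(w[i], x) > 0:          # weak classifier h_i fires
            if not children[i]:          # leaf node: record a vote
                votes.add(i)
            for k in children[i]:        # otherwise expand the children
                queue.append(k)
    return votes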
Multi-class detection: When Eq (2) is evaluated on leaf
classes, it is mathematically equivalent to a collection of K
linear chain cascades tuned for particular poses. These linear chains are visualized in Fig. 5. From this perspective,
multiple poses (or leaf classes) may fire on a single window
x. This is in contrast to other hierarchical classifiers such
as decision trees, where only a single leaf can be reached.
We generally report the highest-scoring pose as the final result, but alternate high-scoring hypotheses may still be useful (since they can be later refined using say, a tracker). We
define the score for a particular pose class by aggregating
binary predictions over a large ensemble of cascades, as described below.
Ensembles of cascades: To increase robustness, we aggregate predictions across an ensemble of classifiers. [40]
describes an approach that makes use of a pool of weak part
classifiers at node i:
$h_i(x) \in H_i, \quad \text{where} \quad |H_i| = M \qquad (3)$
One can instantiate a tree by selecting a weak classifier (from
its candidate pool Hi ) for each node i in graph G = (V, E).
This defines a potentially exponentially-large set of instantiations $M^{|V|}$, where M is the size of each candidate pool. In
practice, [40] found that averaging predictions from a small
random subset of trees significantly improved results.
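In the explicit construction of [40], one ensemble member is obtained by picking a single weak classifier per node from its pool, as in this hedged sketch (pools maps each node to its M candidate weight vectors):

import random

def sample_random_cascade(pools, rng=random):
    """Instantiate one of the M^|V| possible trees by selecting a single
    weak classifier per node from its candidate pool H_i."""
    return {i: rng.choice(pool) for i, pool in pools.items()}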
3.3 Joint training of exponential ensembles
In this section, we present several improvements that apply in our problem domain. Because of local ambiguities
due to self-occlusions, we expect individual part templates
to be rather weak. This in turn may cause premature cascade rejections. We describe modifications that reduce premature rejections through joint training of weak classifiers
and exponentially-large ensembles of cascades.
Sequential training: Much previous work assumes pose-specific classifiers are given [53, 4] or independently learned [40]. For example, [40] trains weak classifiers w_i by treating all training examples from pose-class/node i as positives,
and examples from all other poses as negatives. Instead, we
use only the training examples that pass through the rejection cascade up to node i. This better reflects the scenario at
test-time. This requires classifiers to be trained in a sequential fashion, in a similar coarse-to-fine BFS over nodes from
the root to the leaves (Alg. 2).
input : Training data (x_n, y_n) and tree G = (V, E).
output: Weak part classifiers {w_i, i ∈ V}.
1  create a queue Q;
2  enqueue (1, {x_n}, {y_n}) onto Q;
3  while Q ≠ empty do
4      (i, x, y) = Q.dequeue();
5      w_i = Train(x, y ∈ class_i);
6      (x, y) := {(x_n, y_n) : w_i^T x_n > 0};
7      for ∀k ∈ child(i) do
8          enqueue (k, x, y) onto Q;
9      end
10 end
Algorithm 2: Cascade sequential training. We perform a BFS through pose classes, training weak classifiers w_i by enqueueing node indices i and the training data (x, y) that reaches that node. Train returns a (linear SVM) model given training examples with binary labels. With a slight abuse of notation, y ∈ class_i denotes a set of binary indicators that specify which examples belong to class_i, where class_i is the set of leaf classes reachable through a BFS from node i.

Exponentially-large ensembles: Rogez et al. [40] average votes across a small number (around one hundred) of explicitly-constructed trees. By averaging over a larger set, we reduce the chance of a premature cascade rejection. We describe a simple procedure for exactly computing the average over the exponentially-large set of $M^{|V|}$ cascades in Alg. 3. Our insight is that one can compute an implicit summation (of votes) over the set by caching partial summations t during the BFS. We refer the reader to the algorithm and caption for a detailed description. To train the pool of weak classifiers, we can leverage our sequential training procedure. Simply replace Lines 5 and 6 of Alg. 2 with the following:

$\{w_{ij} : j = 1 \ldots M\} = \text{TrainEnsemble}(x, y \in \text{class}_i) \qquad (4)$
$(x, y) := \{(x_n, y_n) : \sum_j w_{ij}^T x_n > 0\} \qquad (5)$

where TrainEnsemble is a learning algorithm that returns an ensemble of M models by randomly selecting subsets of training data or subsets of features. We select random subsets of features corresponding to local regions from window x. This allows the returned models w_ij to be visualized as "parts" (Fig. 5).

Fig. 6 Comparison with random cascades. We show in (a) that our new detector is equivalent to an exponentially-large number of Random Cascades (RC) from [40]. In (b), we show that the RC computational cost increases linearly with the number of cascades and that, when considering a very large number of cascades, our model is more efficient.

input : Image window x, weak classifier pools {H_i, i ∈ V}
output: vote(i) for each leaf class i.
create a queue Q;
enqueue (1, 1) onto Q;
while Q ≠ empty do
    (i, t) = Q.dequeue();
    t := t · Σ_{h_i ∈ H_i} h_i(x);
    if t > 0 then
        for ∀k ∈ child(i) do
            enqueue (k, t) onto Q;
        end
        if child(i) = ∅ then
            vote(i) = t;
        end
    end
end
Algorithm 3: Classification with an exponentially-large number of cascades. When processing a node i, all its associated weak classifiers h_i ∈ H_i are evaluated. We keep track of the (exponentially-large) number of successful ensemble components by enqueueing a running estimate t and node index i. Once the queue is empty, vote(i) is populated with the number of ensemble components that fired on leaf class i (which is upper bounded by $M^{|V|}$).
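A Python sketch of this implicit averaging, under the same assumptions as the earlier single-cascade sketch (per-node pools of weight vectors, child lists, and a feature vector x), might read:

from collections import deque
import numpy as np

def classify_exponential_ensemble(x, pools, children, root=0):
    """Implicit vote counting over all M^|V| cascade instantiations (cf. Alg. 3).

    x        : 1D feature vector for one window
    pools    : dict node -> list of M weight vectors (the pool H_i)
    children : dict node -> list of child nodes ([] for leaves)
    Returns  : dict leaf -> number of ensemble components that fired on it.
    """
    votes = {}
    queue = deque([(root, 1)])           # (node, running product of firing counts)
    while queue:
        i, t = queue.popleft()
        t *= sum(int(np.dot(w, x) > 0) for w in pools[i])  # how many h_i in H_i fire
        if t > 0:
            if not children[i]:
                votes[i] = t             # leaf: record the accumulated count
            for k in children[i]:
                queue.append((k, t))
    return votes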
Fig. 7 Test data. We show several examples of test RGB-D images captured with the chest-mounted Intel Creative camera from Fig. 2a. Real
egocentric object manipulation scenes have been collected and annotated (full 3D hand poses) for evaluation.
3.4 Implementation issues
Sparse search: We leverage two additional assumptions to
speed up our scanning-window cascades: 1) hands must lie
in a valid range of depths, i.e., hands cannot appear further
away from the chest-mounted camera than physically possible and 2) hands tend to be of a canonical size s. These
assumptions allow for a much sparser search compared to a
classic scanning window, as only “valid windows” need be
classified. A median filter is first applied to the depth map
d(x, y). Locations greater than arm's length (75 cm) away are then pruned. Assuming a standard pinhole camera with focal length f, the expected image height of a hand at a valid location (x, y) is given by $S_{map}(x, y) = \frac{f \cdot s}{d(x, y)}$. We apply our hierarchical cascades to valid positions on a search
grid (16-pixel strides in x-y direction) and quantized scales
given by $S_{map}$, visualized as red dots in Fig. 2c.
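A minimal sketch of this valid-window selection is given below; the hand size of 0.15 m and the 5x5 median filter are illustrative assumptions, as only the 75 cm range, the 16-pixel stride, and the pinhole model are specified above.

import numpy as np
from scipy.ndimage import median_filter

def valid_windows(depth_m, focal_px, hand_size_m=0.15,
                  max_range_m=0.75, stride=16):
    """Return candidate (x, y, window_height_px) triples for the sparse search.

    Locations further than arm's length are pruned; the expected window
    height follows the pinhole model S_map = f * s / d.
    """
    d = median_filter(depth_m, size=5)            # denoise the depth map
    candidates = []
    h, w = d.shape
    for y in range(0, h, stride):                 # coarse 16-pixel grid
        for x in range(0, w, stride):
            if 0 < d[y, x] <= max_range_m:        # within arm's length
                s_map = focal_px * hand_size_m / d[y, x]
                candidates.append((x, y, s_map))
    return candidates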
Features: We experiment with two additional sets of
features x. Following much past work, we make use of HOG
descriptors computed from RGB signals. We also evaluated
oriented gradient histograms on depth images (HOG-D). While
not as common, such a gradient-based depth descriptor can
be shown to capture histograms of normal directions (since
normals can be computed from the cross product of depth
gradients) [49]. For depth, we use 5x5 HOG blocks and 16
signed orientation bins.
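As an illustration of the HOG-D idea (not the exact descriptor used here), a per-cell signed-orientation histogram of depth gradients can be computed as follows; block normalization over 5x5 cells would be applied on top of such cells.

import numpy as np

def depth_hog_cell(depth_patch, n_bins=16):
    """Signed-orientation gradient histogram for one cell of a depth patch (HOG-D style).

    Depth gradients encode surface normal direction, so binning their
    orientation gives a coarse histogram of normals.
    """
    gy, gx = np.gradient(depth_patch.astype(np.float32))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx)                      # signed orientation in (-pi, pi]
    bins = ((ang + np.pi) / (2 * np.pi) * n_bins).astype(int) % n_bins
    hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
    norm = np.linalg.norm(hist) + 1e-6            # simple per-cell normalization
    return hist / norm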
4 Experiments
Depth sensor: Much recent work on depth-processing has
been driven by the consumer-grade PrimeSense sensor [45],
which is based on structured light technology. At its core,
this approach relies on two-view stereopsis (where correspondence estimation is made easier by active illumination).
This may require large baselines between two views, which
is undesirable for our egocentric application for two reasons:
first, this requires larger form-factors, making the camera
less mobile. Second, this produces occlusions for points in
the scene that are not visible in both views. Time-of-flight
depth sensing, while less popular, is based on a pulsed light
emitter that can be placed arbitrarily close to the main camera, as no baseline is required. This produces smaller form
factors and reduces occlusions in that camera view. Specifically, we make use of the consumer-grade TOF sensor from
Creative [15] (see Fig. 2a).
Dataset: We have collected and annotated (full 3D
hand poses) our own benchmark dataset of real egocentric
object manipulation scenes, which we will release to spur
further research 1 . We developed a semi-automatic labelling
tool which allows us to accurately annotate partially occluded
hands and fingers in 3D. A few 2D joints are first manually
labelled in the image and used to select the closest synthetic
exemplars in the training set. A full hand pose is then created
by combining the manual labelling and the selected 3D exemplar. This pose is manually refined, leading to the selection
of a new exemplar, and the creation of a new pose. This iterative process is followed until an acceptable labelling is
achieved. We captured 4 sequences of 1000 frames each,
which were annotated every 10 frames in both RGB and
Depth. We use 2 different subjects (male/female) and 4 different indoor scenes. Some examples are presented in Fig. 7.
Parameters: We train a cascade model with K = 100 classes, a hierarchy of 6 levels, and M = 3 weak classifiers per node. We synthesize 100 training images per class.
We experimented with larger numbers of classes (up to K =
1000), but did not observe significant improvement. We suspect this is due to the restricted set of viewpoints and grasp
poses present in egocentric interactions (which we see as a
contribution of our work). As a point of contrast, we trained
a non-egocentric hand baseline in Fig. 8 that operated best
at K = 800 classes.
1 Please visit www.gregrogez.net/
Fig. 8 Third-person vs. first-person. Numerical results for 3rd-person (a-b) and egocentric (c-d) sequences. We compare our method (tuned for generic priors) to state-of-the-art techniques from industry (NITE2 [35] and PXC [15]) and academia (FORTH [30], Keskin et al. [16] and Xu et al. [58]) in terms of hand detection (a,c) and finger tip detection (b,d). We refer the reader to the main text for additional description, but emphasize that (1) our method is competitive with (or outperforms) prior art for detection and pose estimation and (2) pose estimation is considerably harder in egocentric views.
4.1 Benchmark performance
In this subsection, we validate our proposed architecture on
an in-house dataset of 3rd-person hands. Our goal is to compare performance with standard baselines, verifying that our
architecture is competitive. In this section, we make use of
a generic set of 3rd-person views for training and use K =
800 pose classes. We then use this system as a starting point
for egocentric analysis, exploring various configurations and
priors further in the next section.
Evaluation: We evaluate both hand detections and pose
estimation. A candidate detection is deemed correct if it sufficiently overlaps the ground-truth bounding-box (in terms
of area of intersection over union) by at least 50%. As some
baseline systems report the pose of only confident fingers,
we measure finger-tip detection accuracy as a proxy for pose
estimation. To make this comparison fair, we only score visible finger tips and ignore occluded ones.
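For reference, the overlap criterion amounts to the following check, assuming boxes in (x1, y1, x2, y2) pixel format:

def is_correct_detection(det, gt, thresh=0.5):
    """Return True if the detected box overlaps the ground truth by IoU >= thresh."""
    ix1, iy1 = max(det[0], gt[0]), max(det[1], gt[1])
    ix2, iy2 = min(det[2], gt[2]), min(det[3], gt[3])
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
    inter = iw * ih
    area_det = (det[2] - det[0]) * (det[3] - det[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_det + area_gt - inter
    return union > 0 and inter / union >= thresh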
Baselines: We compare our method to state-of-the-art
techniques from industry [35, 15] and academia [30,58, 16,
24]. Because public code is not available, we re-implemented
Xu et al. [58] and Keskin et al. [16], verifying that our performance matched published results in [54]. Xu proposes a
three stage pipeline: (1) detect position and in-plane rotation
with a Hough forest (2) estimate joint angles with a second
stage Hough forest (3) apply an articulated model to validate
global consistency of joint estimates. Keskin’s model also
has three stages: (1) estimate global hand shape from local
votes (2) given the estimated shape, apply a shape-specific
decision forest to predict a part label for each pixel and
(3) apply mean-shift to regress joint positions. Because Keskin’s model assumes detection is solved, we experimented
with several different first-stage detectors before settling on
using Xu’s first stage Hough forest, due to its superior per-
formance. Thus, both models share the same hand detector
in our evaluation.
Third-person vs egocentric: Following Fig. 8, our hierarchical cascades are competitive for 3rd-person hand detection (a) and state-of-the-art for finger detection (b). When
evaluating the same models (without retraining) on egocentric test data (c) and (d), most methods (including ours) perform significantly worse. FORTH and NITE2 trackers catastrophically fail since hands frequently leave the view, and so
are omitted from (c) and (d). Random Forest baselines [58,
16] drop in performance, even for hand detection. We posit
this drop comes from difficulties in segmenting egocentric hands
(as shown in Fig. 9). To test this hypothesis, we develop a
custom segmentation heuristic that looks for arm pixels near
the image border, followed by connected-component segmentation. We also experiment with the (RGB) pixel-level
hand detection algorithm from [24]. These segmentation algorithms outperform many baselines for hand-detection, but
still underperform our hierarchical cascade. We conclude
that (1) hand pose estimation is considerably harder in the
egocentric setting and (2) our (generic) pose estimation system is a state-of-the-art starting point for our subsequent
analysis. Finally, we posit that our strong performance (at
least for hand detection) arises from our global hand classifiers, while most baselines tend to classify local parts.
4.2 Diagnostic analysis
In this section, we further explore various configurations and
priors for our approach, tuned for the egocentric setting.
Evaluation: Since our algorithm always returns a full
articulated hand pose, we evaluate pose estimation with 2D RMS re-projection error of keypoints. This time, we score
the 20 keypoints defining the hand, including those which
Fig. 10 Quantitative results varying our prior. We evaluate the different priors with respect to (a) viewpoint-consistent hand detection (precision-recall curve), (b) 2D RMS error, (c) viewpoint-consistent detections and (d) 2D RMS error conditioned on viewpoint-consistent detections. Please see the text for a detailed description of our evaluation criteria and analysis. In general, egocentric-pose priors considerably improve performance, validating our egocentric-synthesis engine from Sec. 3.1. When tuned for N = 10 candidates per image, our system produces pose hypotheses that appear accurate enough to initialize a tracker.
Fig. 9 Qualitative results obtained by the state-of-the-art method of [16] (columns: segmentation, pose on depth, RGB image): (top) a test sample akin to those addressed by prior work. The hand is easily segmentable from the background and no object interaction is present. In this case, the method from [16] correctly identifies 4 of the fingers and has a small localization error for the 5th. (middle) Here the hand is holding a ball. Now, only the pinky is correctly localized, despite the algorithm being provided with object interaction in the training data. This demonstrates the combinatorial increase in difficulty posed by introducing objects. (bottom) Finally, when we are able to correctly detect the hand but cannot easily segment it from the background, methods based on per-pixel classification fail because they produce strong "garbage" classifications for the background.
are occluded. We believe this is important, because numerous occlusions arise from egocentric viewpoints and object
manipulation tasks. This evaluation criterion will give a better sense of how well our method actually recognizes global
hand poses, even in case of partially occluded hands. For
additional diagnosis, we categorize errors into detection failures, correct detections but incorrect viewpoint, and correct
detection and viewpoint but incorrect articulated pose. Specif-
ically, viewpoint-consistent detections are detections for which
the RMS error of all 2D joint positions falls below a coarse
threshold (10 pixels). Conditional 2D RMS error is the re-projection error for well-detected (viewpoint-consistent) hands.
Finally, we also plot accuracy as a function of the number of
N candidate detections per image. With enough hypotheses,
accuracy must max out at 100%, but we demonstrate that
good accuracy is often achievable with a small number of
candidates (which may later be re-ranked, by say, a tracker).
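A sketch of these criteria, assuming predicted and ground-truth keypoints as arrays of shape (20, 2) in pixels and the 10-pixel threshold used here:

import numpy as np

def rms_error_2d(pred_kpts, gt_kpts):
    """2D RMS re-projection error over all 20 keypoints (occluded ones included)."""
    return float(np.sqrt(np.mean(np.sum((pred_kpts - gt_kpts) ** 2, axis=1))))

def evaluate_frame(candidates, gt_kpts, vp_thresh_px=10.0):
    """Score the best of N candidate poses for one frame.

    A frame counts as a viewpoint-consistent detection if some candidate's
    RMS error falls below vp_thresh_px; the conditional 2D RMSE is the
    error of that best candidate.
    """
    errors = [rms_error_2d(c, gt_kpts) for c in candidates]
    best = min(errors) if errors else float("inf")
    detected = best < vp_thresh_px
    return detected, (best if detected else None)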
Pose+viewpoint prior: We explore 3 different priors:
1) a generic prior obtained using a floating "libhand" hand with all possible random camera viewpoints and pose configurations, 2) a viewpoint prior obtained by limiting a floating hand to valid egocentric viewpoints, and 3) a viewpoint
& pose prior obtained using our full synthesis engine described in Sec. 3.1, i.e. using a virtual egocentric camera
mounted on a full body avatar manipulating objects. Note
that we respectively consider 800, 140 and 100 classes to
train these models. In Fig. 10, we show that, in general,
a viewpoint prior produces a marginal improvement, while
our full egocentric-specific pose and viewpoint prior considerably improves accuracy in all cases. This suggests that
our synthesis algorithm correctly operationalizes egocentric
viewpoint and pose priors, which in turn leads us to make
better hypotheses for daily activities/grasp poses. With a modest number of candidates (N = 10), our final system produces viewpoint-consistent detections in 90% of the test frames
with an average 2D RMS error of 5 pixels. From a qualitative perspective, this performance appears accurate enough
to initialize a tracker.
Ablative analysis: To further analyze our system, we
perform an ablative analysis that turns “off” different aspects of our system: sequential training, ensemble of cascades, depth feature, sparse search and additional object prior.
Hand detection and conditional 2D hand RMS error are given
in Fig. 11. Depth HOG features and sequential training of parts are by far the most crucial components of our system. Turning these components off decreases the detection rate by a substantial amount (between 10 and 30%). Our exponentially-large ensemble of cascades and sparse search marginally improve accuracy but are much more efficient: on average, the exponentially-large ensemble is 2.5 times faster than an explicit search over a 100-element ensemble (as in [40]), while the sparse search is 3.15 times faster than a dense grid. Modeling objects produces better detections, particularly for larger numbers of candidates. In general, we find this additional prior helps more for those test frames with object manipulations, as detailed below.

Fig. 11 Ablative analysis. We evaluate performance when turning off particular aspects of our system, considering both (a) viewpoint-consistent detections and (b) 2D RMS error conditioned on well-detected hands. When turning off our exponentially-large ensemble or synthetic training, we use the default 100-component ensemble as in [40]. When turning off the depth feature, we use a classifier trained on aligned RGB images. Please see the text for further discussion of these results.

Modeling objects: In Fig. 12, we analyze the effect of object interactions on egocentric hand detection and pose estimation when employing an object prior (or not). We plot the accuracy (for both viewpoint-consistent detections and conditional 2D RMS error) on those test frames with (a,b) and without (c,d) object interactions. The corresponding plots computed on the whole dataset (using both types of frames) are already shown in Fig. 11a and Fig. 11b. We see that additional modeling of interacting hands and objects (with an object prior) somewhat improves performance for frames with object manipulation, without affecting the performance of the system for the frames without objects.

Fig. 12 Object prior. Effect of modeling a hand with and without an object on hand detection (a and c) and hand pose recognition (b and d). Results are given for test frames with (a and b) and without (c and d) object manipulation. Again, we measure viewpoint-consistent detections (a and c) and 2D RMS error conditioned on well-detected hands (b and d). Both hand detection and pose recognition are considerably more challenging for frames with object interactions, likely due to additional occlusions from the manipulated objects. The use of an object prior provides a small but noticeable improvement for frames with objects, but does not affect the performance of the system for the frames without objects.

Number of parts: In Fig. 13, we show the effect of varying the number of parts M at each branch of our cascade model. We analyze both hand detection rate and average hand pose accuracy. These plots clearly validate our choice of using M = 3 parts per branch (Fig. 13a). The performance decreases when considering more parts because the classifier is more likely to produce a larger number of false positives. Additionally, we can see in Fig. 13b that using more than 3 parts does not improve the accuracy in terms of hand pose.

Pixel threshold: In Fig. 14, we show the effect of varying the pixel-overlap threshold used for computing correct detections and average pose accuracy. A lower pixel threshold decreases the detection rate but increases pose accuracy on detected hands, and vice versa. Our 10-pixel threshold is a good trade-off between these two criteria. In (c) and (d), we show the percentage of detection and pose correctness for two different thresholds: 10 pixels (c), used throughout the main paper, and 5 pixels (d). The measure we use for detection (blue area) is much stricter than a simple bounding-box overlap criterion (green area), as it only considers a detection valid when the hand pose is also correctly estimated.

Qualitative results: We invite the reader to view our supplementary videos. We illustrate successes in difficult scenarios in Fig. 15 and analyze common failure modes in Fig. 16. Please see the figures for additional discussion.
Fig. 13 Choice of the number of parts. We evaluate the importance of choosing the right number of parts in our cascade model by analyzing the performance achieved for different numbers. For each case, we compute the viewpoint-consistent detections (a) and 2D RMS error conditioned on well-detected hands (b) when varying the number of possible candidates.

Fig. 14 Pixel threshold. Effect of the pixel threshold on viewpoint-consistent detections (a) and 2D RMS error conditioned on well-detected hands (b). A lower pixel threshold decreases the detection rate but increases pose accuracy on detected hands, while a higher threshold increases the detection rate and decreases pose accuracy. We chose a threshold of 10 pixels as a trade-off between these two criteria. In (c) and (d), we show the percentage of detection and pose correctness for two different thresholds: 10 pixels (c), used throughout the main paper, and 5 pixels (d). The green area corresponds to correct detections (in terms of bounding-box overlap between estimated and ground-truth poses) but incorrect poses. The measure we use for detection (blue area) is much stricter, as it only considers a detection valid when the hand pose is also correctly estimated.
5 Conclusion
We have focused on the task of hand pose estimation from
egocentric viewpoints. For this problem specification, we
have shown that TOF depth sensors are particularly informative for extracting near-field interactions of the camera
wearer with his/her environment. We describe a detailed,
computer graphics model for generating egocentric training data with realistic full-body and object interactions. We
use this data to train discriminative K-way classifiers for
quantized pose estimation. To deal with a large number of
classes, we advance previous methods for hierarchical cascades of multi-class rejectors, both in terms of accuracy and speed. Finally, we have provided an insightful analysis of the performance of our algorithm on a new real-world annotated dataset of egocentric scenes. Our method provides state-of-the-art performance for both hand detection and pose estimation in egocentric RGB-D images.

References
1. https://3dwarehouse.sketchup.com/ 5
2. Ballan, L., Taneja, A., Gall, J., Gool, L.J.V., Pollefeys, M.: Motion
capture of hands in action using discriminative salient points. In:
ECCV (6), pp. 640–653 (2012) 3
3. den Bergh, M.V., Gool, L.J.V.: Combining rgb and tof cameras
for real-time 3d hand gesture interaction. In: WACV, pp. 66–72
(2011) 1
4. Chen, B., Perona, P., Bourdev, L.: Hierarchical cascade of classifiers for efficient poselet evaluation. In: BMVC, pp. 1–10 (2014)
3, 5, 6, 7
5. Damen, D., Gee, A.P., Mayol-Cuevas, W.W., Calway, A.: Egocentric real-time workspace monitoring using an rgb-d camera. In:
IROS (2012) 3
6. Daz3D: Every-hands pose library. http://www.daz3d.com/
everyday-hands-poses-for-v4-and-m4 (2013) 4
7. Dominguez, S., Keaton, T., Sayed, A.: A robust finger tracking
method for multimodal wearable computer interfacing. Multimedia, IEEE Transactions on 8(5), 956–972 (2006). DOI
10.1109/TMM.2006.879872 2
8. Erol, A., Bebis, G., Nicolescu, M., Boyle, R.D., Twombly, X.:
Vision-based hand pose estimation: A review. CVIU 108(1-2),
52–73 (2007) 3
9. Fathi, A., Farhadi, A., Rehg, J.: Understanding egocentric activities. In: ICCV (2011) 2
10. Fathi, A., Ren, X., Rehg, J.: Learning to recognize objects in egocentric activities. In: CVPR (2011) 2
11. Fielder, A.R., Moseley, M.J.: Does stereopsis matter in humans?
Eye 10(2), 233–238 (1996) 2
12. Gavrila, D.M., Philomin, V.: Real-time object detection for smart
vehicles. In: Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, vol. 1, pp. 87–93. IEEE
(1999) 3
Fig. 15 Good detections. We show a sample of challenging frames where the hand is correctly detected by our system. Reflective objects (top
row: wine bottle, pan, phone, knife and plastic bottle) produce incorrect depth maps due to interactions with our sensor’s infrared illuminant.
Novel objects (middle row: envelope, juice box, book, apple, spray and chocolate powder box) require generalization to objects not synthesized
at train-time, while noisy depth data (bottom row) showcases the robustness of our system.
13. Hamer, H., Schindler, K., Koller-Meier, E., Gool, L.J.V.: Tracking
a hand manipulating an object. In: ICCV (2009) 3
14. Hodges, S., Williams, L., Berry, E., Izadi, S., Srinivasan, J., Butler,
A., Smyth, G., Kapur, N., Wood, K.: SenseCam: A retrospective
memory aid. UbiComp (2006) 1
15. Intel: Perceptual computing sdk (2013).
URL http:
//software.intel.com/en-us/vcsource/tools/
perceptual-computing-sdk 8, 9
16. Keskin, C., Kıraç, F., Kara, Y.E., Akarun, L.: Hand pose estimation and hand shape classification using multi-layered randomized
decision forests. In: ECCV 2012, pp. 852–863 (2012) 9, 10
17. Keskin, C., Kiraç, F., Kara, Y.E., Akarun, L.: Hand pose estimation and hand shape classification using multi-layered randomized
decision forests. In: ECCV (6), pp. 852–863 (2012) 1, 3
18. Kölsch, M.: An appearance-based prior for hand tracking. In:
ACIVS (2), pp. 292–303 (2010) 2
19. Kölsch, M., Turk, M.: Hand tracking with flocks of features. In:
CVPR (2), p. 1187 (2005) 2
20. Kurata, T., Kato, T., Kourogi, M., Jung, K., Endo, K.: A
functionally-distributed hand tracking method for wearable visual
interfaces and its applications. In: MVA, pp. 84–89 (2002) 2
21. Kyriazis, N., Argyros, A.A.: Physically plausible 3d scene tracking: The single actor hypothesis. In: CVPR (2013) 1, 3
22. Kyriazis, N., Argyros, A.A.: Scalable 3d tracking of multiple interacting objects. In: CVPR (2014) 3
23. de La Gorce, M., Fleet, D.J., Paragios, N.: Model-based 3d hand
pose estimation from monocular video. IEEE PAMI 33(9), 1793–
1805 (2011) 3
24. Li, C., Kitani, K.M.: Model recommendation with virtual probes
for egocentric hand detection. In: ICCV (2013) 2, 9
25. Li, C., Kitani, K.M.: Pixel-level hand detection in ego-centric
videos. In: CVPR (2013) 2
26. Lin, Y., Hua, G., Mordohai, P.: Egocentric object recognition
leveraging the 3d shape of the grasping hand. In: ECCV Workshop on Consumer Depth Camera for Vision (CDC4V), pp. 1–11
(2014) 3
Fig. 16 Hard cases. We show frames where the hand is not correctly detected by our system, even with 40 candidates. These hard cases include excessively-noisy depth data, hands manipulating reflective material (phone) or unseen/deformable objects that look considerably different from those in our training set (e.g. keys, towels), and truncated hands.
27. Mann, S., Huang, J., Janzen, R., Lo, R., Rampersad, V., Chen, A., Doha, T.: Blind navigation with a wearable range camera and vibrotactile helmet. In: ACM International Conf. on Multimedia, MM '11 (2011) 2
28. Mayol, W., Davison, A., Tordoff, B., Molton, N., Murray, D.: Interaction between hand and wearable camera in 2d and 3d environments. In: BMVC (2004) 2
29. Morerio, P., Marcenaro, L., Regazzoni, C.S.: Hand detection in first person vision. In: FUSION (2013) 2
30. Oikonomidis, I., Kyriazis, N., Argyros, A.: Efficient model-based 3d tracking of hand articulations using kinect. In: BMVC (2011) 1, 3, 9
31. Oikonomidis, I., Kyriazis, N., Argyros, A.A.: Full dof tracking of a hand interacting with an object by modeling occlusions and physical constraints. In: ICCV (2011) 3
32. Oikonomidis, I., Kyriazis, N., Argyros, A.A.: Tracking the articulated motion of two strongly interacting hands. In: CVPR (2012) 1, 3
33. Ong, E.J., Bowden, R.: A boosted classifier tree for hand shape detection. In: FGR (2004) 3
34. Pirsiavash, H., Ramanan, D.: Detecting activities of daily living in first-person camera views. In: CVPR (2012) 2
35. PrimeSense: Nite2 middleware (2013). URL http://www.openni.org/files/nite/ 9
36. Qian, C., Sun, X., Wei, Y., Tang, X., Sun, J.: Realtime and robust hand tracking from depth. In: CVPR (2014) 1, 2, 3
37. Ren, X., Gu, C.: Figure-ground segmentation improves handled object recognition in egocentric video. In: CVPR, pp. 3137–3144. IEEE (2010) 2
38. Ren, X., Philipose, M.: Egocentric recognition of handled objects: Benchmark and analysis. In: IEEE Workshop on Egocentric Vision (2009) 2
39. Rogez, G., Khademi, M., Supancic, J., Montiel, J., Ramanan, D.: 3d hand pose detection in egocentric rgbd images. In: ECCV Workshop on Consumer Depth Camera for Vision (CDC4V), pp. 1–11 (2014) 2
40. Rogez, G., Rihan, J., Orrite, C., Torr, P.H.S.: Fast human pose detection using randomized hierarchical cascades of rejectors. IJCV 99(1), 25–52 (2012) 3, 5, 6, 7, 11
41. Romero, J., Feix, T., Kjellstrom, H., Kragic, D.: Spatio-temporal modeling of grasping actions. In: IROS (2010) 4
42. Romero, J., Kjellstrom, H., Ek, C.H., Kragic, D.: Non-parametric hand pose estimation with object context. Im. and Vision Comp. 31(8), 555–564 (2013) 3
43. Ryoo, M.S., Matthies, L.: First-person activity recognition: What are they doing to me? In: CVPR (2013) 2
44. Sakata, H., Taira, M., Kusunoki, M., Murata, A., Tsutsui, K.i., Tanaka, Y., Shein, W.N., Miyashita, Y.: Neural representation of three-dimensional features of manipulation objects with stereopsis. Experimental Brain Research 128(1-2), 160–169 (1999) 2
45. Sense, P.: The PrimeSensor™ reference design 1.08. Prime Sense
(2011) 8
46. Shakhnarovich, G., Viola, P., Darrell, T.: Fast pose estimation with
parameter-sensitive hashing. In: Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on, pp. 750–757. IEEE
(2003) 4
47. Shotton, J., Fitzgibbon, A.W., Cook, M., Sharp, T., Finocchio, M.,
Moore, R., Kipman, A., Blake, A.: Real-time human pose recognition in parts from single depth images. In: CVPR (2011) 1,
3
48. SmithMicro: Poser10. http://poser.smithmicro.com/
(2010) 4
49. Spinello, L., Arras, K.O.: People detection in rgb-d data. In: IROS
(2011) 8
50. Sridhar, S., Oulasvirta, A., Theobalt, C.: Interactive markerless articulated hand motion tracking using rgb and depth data. In: ICCV
(2013) 3
51. Starner, T., Schiele, B., Pentland, A.: Visual contextual awareness
in wearable computing. In: International Symposium on Wearable
Computing (1998) 2
52. Stenger, B., Thayananthan, A., Torr, P., Cipolla, R.: Model-based
hand tracking using a hierarchical bayesian filter. PAMI 28(9),
1372–1384, (2006) 3
53. Stenger, B., Thayananthan, A., Torr, P.H., Cipolla, R.: Estimating
3d hand pose using hierarchical multi-label classification. Image
and Vision Computing 25(12), 1885–1894 (2007) 3, 7
54. Tang, D., Chang, H.J., Tejani, A., Kim, T.: Latent regression forest: Structured estimation of 3d articulated hand posture. In:
CVPR (2014) 3, 9
55. Tang, D., Yu, T.H., Kim, T.K.: Real-time articulated hand pose estimation using semi-supervised transductive regression forests. In:
ICCV (2013) 1, 3
56. Tompson, J., Stein, M., LeCun, Y., Perlin, K.: Real-time continuous pose recovery of human hands using convolutional networks.
ACM Trans. Graph. 33(5), 169 (2014) 3
57. Šarić, M.: Libhand: A library for hand articulation (2011). URL
http://www.libhand.org/. Version 0.9 4
58. Xu, C., Cheng, L.: Efficient hand pose estimation from a single
depth image. In: ICCV (2013) 3, 9
59. Yang, R., Sarkar, S., Loeding, B.L.: Handling movement epenthesis and hand segmentation ambiguities in continuous sign language recognition using nested dynamic programming. PAMI
32(3), 462–477 (2010) 1
60. Zheng, J.Z., De La Rosa, S., Dollar, A.M.: An investigation of
grasp type and frequency in daily household and machine shop
tasks. In: ICRA, pp. 4169–4175 (2011) 5
61. Zehnder, P., Koller-Meier, E., Van Gool, L.J.: An efficient shared
multi-class detection cascade. In: BMVC, pp. 1–10 (2008) 3, 5, 6