Active Hand Tracking
Jérôme Martin, Vincent Devin and James L. Crowley
GRAVIR - IMAG
INRIA Rhône–Alpes
655 av. de l’Europe
38330 Monbonnot, FRANCE
E-mail: Jerome.Martin@imag.fr
Abstract

This paper describes a system which uses multiple visual processes to detect and track hands for gesture recognition and human–computer interaction. The system is based on an architecture in which a supervisor selects and activates visual processes. Each process provides a confidence factor which makes it possible for the system to dynamically reconfigure itself in response to events in the scene. Visual processes for hand tracking based on image differencing and normalised histogram matching are described. The result of hand detection is used by a recursive estimator (a Kalman filter) to provide an estimate of the position and size of the hand. The resulting system provides robust and precise tracking which operates continuously at approximately 5 images per second on a 150 megahertz Silicon Graphics Indy.

1 Computer Vision for Perceptual User Interface

Krueger [7], followed by Wellner and Mackay [11], have shown how human gesture can be used to define natural modes of interacting with computer systems. The techniques described by Krueger and by Wellner were simple concept demonstrations. As such, they were fragile and worked only in constrained environments. We are exploring methods for building perceptual user interfaces which are robust and reliable. Our primary means of obtaining reliability is the integration of multiple computer vision techniques in a system which dynamically reconfigures itself in response to its own estimates of the confidence in its results.

In our systems, we employ computer vision techniques with low computational complexity, so that the techniques can operate in real time. We favor techniques with clear mathematical foundations, for which it is possible to predict the proper operating conditions and to generate an on-line estimate of the probability of success. Such techniques have often been rejected in the past because they were not universal. We maintain that no individual vision technique will be universal. Reliability and robustness to changes in operating conditions can instead be achieved by combining multiple complementary techniques. Such combination requires dynamically estimating the confidence in the output of each process and dynamically reconfiguring the system as operating conditions change.

In our work, two aspects of gesture are explored: hand tracking and gesture recognition. This paper deals with hand tracking; information on gesture recognition can be found in [8]. Our search for complementary processes has led us to use diverse detection techniques such as normalised color histograms, multi-dimensional histograms [10] and normalised energy cross-correlation [4]. Such techniques have been rejected in the past because they are known to fail under certain circumstances. In the context of a system which integrates multiple processes, such failure is not a problem.

Locating and normalising a hand is a process of tracking. Several methods can be used for tracking, each with different success and failure conditions. A reliable tracking system can be obtained by integrating and coordinating several complementary tracking processes. Integration and coordination are performed using a synchronous architecture in which a supervisor selects, activates and provides parameters for visual processes. The system includes a control part, described as a graph using combination operators and modifiers, and a decision part which reorganises the control in reaction to changes in the confidence factor.

The following section presents the visual process architecture used for the system and describes techniques for estimating a confidence factor for detection. A tracking process based on a recursive estimator is formulated as a zeroth-order Kalman filter. Visual processes for image differencing and skin-color detection are described. The results of experimental measurements of precision, probability of error and computation time are presented.
2 Integration of Visual Processes for Tracking
The hand tracking system described in this paper is based
on an architecture in which a supervisor activates and coordinates a number of reactive visual processes (see figure
1). The SERVP (Synchronous Ensemble of Reactive Visual
Processes) architecture [2] was developed as a synchronous
approach to the integration of processes for active vision.
The SERVP architecture has been used in the construction
of several systems including appearance based navigation
[1], tracking faces for video–communications [3] and in detection and tracking of individuals [5]. For the work described in this paper, we use a software skeleton, named
Chord [6, 12], which is based on the SERVP architecture.
The Chord system includes facilities for both synchronous
processing on a single host and asynchronous processing on
multiple hosts. Chord also includes operators for combining and managing processes.
Figure 1. Synchronous Ensemble of Reactive Visual Processes (a supervisor, with control and decision parts, coordinating image-processing and action visual processes)

2.1 Visual Processes

A visual process is graphically represented as a box with input and output ports, as shown in figure 2. Processes read data (images and parameters) from input ports, perform computations, and then write results to output ports. Processes react to commands for initialization, for starting and stopping execution, and for termination. Processes can generate event messages based on the results of computation, including exception conditions which trigger the supervisor to modify the system configuration. A process can be run for one or more cycles. At the end of its execution, the process communicates a message of success or failure to the supervisor.

Figure 2. A generic model of a visual process (a box with data input and output ports, a command input and an event output)

In order to control the execution and the relationships of processes, Chord defines a set of control operators and modifiers [6]. These include a sequential operator, a concurrent operator, a conditional sequential operator, a conditional concurrent operator, a watchdog, a loop, negation, and synchronous/asynchronous modifiers.

The decision part of the architecture is expressed as a set of rules which are executed by forward chaining and which react to commands as well as to messages from the visual processes. The supervisor receives messages from the visual processes concerning processing state and visual events. These messages are encoded as items in a working memory. Such items trigger rules which can modify process parameters as well as the control graph. The visual events used in the system described below are based on the confidence factor which each process generates as part of its output. Other events are possible, for example for the recognition of objects or actions. In our example, visual processes pass information to a tracking process which maintains an estimate of the center point and the size of the hand. This tracking process is implemented as a recursive estimator based on a Kalman filter. Figure 3 shows the current control graph.

Figure 3. The control graph of our system using the Chord formalism (a loop over the sequence of Image Processing, Tracking and Decision)

In this graph, circles represent Chord operators. The graph executes a loop over the sequence of three visual processes: Image Processing, Tracking and Decision. The Tracking process, represented by a rounded box, is itself described by a Chord control graph and is composed of low-level visual processes (image differencing, skin detection by color, and cross-correlation, followed by the recursive estimator process).
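As an illustration of this style of control, the loop-over-a-sequence graph of figure 3 can be sketched with two combinators. This is a minimal sketch in Python, not the actual Chord API; the process bodies and the fixed cycle count are placeholders:

```python
# Illustrative sketch (not the real Chord system): visual processes are
# callables returning success/failure, composed by `seq` and `loop`
# operators in the spirit of the control graph described above.

def seq(*processes):
    """Run processes in order; fail as soon as one fails."""
    def run(state):
        for p in processes:
            if not p(state):
                return False
        return True
    return run

def loop(process, cycles):
    """Re-run a process graph for a fixed number of cycles.
    (The real supervisor would instead react to events between cycles.)"""
    def run(state):
        for _ in range(cycles):
            process(state)
        return True
    return run

# Placeholder processes standing in for Image Processing, Tracking, Decision.
def image_processing(state):
    state["detections"] = ["bbox"]
    return True

def tracking(state):
    state["estimate"] = state.get("detections")
    return True

def decision(state):
    # A real decision step would inspect confidence factors here.
    return True

system = loop(seq(image_processing, tracking, decision), cycles=3)
state = {}
system(state)
print(state["estimate"])  # → ['bbox']
```

The point of the combinator form is that the supervisor can rewrite the graph (swap a process, change a parameter) between cycles without the processes themselves being aware of it.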
2.2 Estimating the Confidence Factor of Visual Processes

The primary visual event used for control in the system described below is the confidence factor, denoted CF. Confidence is represented by a numerical value between 0 (no confidence) and 1 (certainty). The CF is an estimate of the likelihood that a successful detection was achieved, and is thus computed as a probability using a pre-trained sample of correct detections. During system set-up, a large number of correct detections are identified for each process. The mean μ_s and the covariance matrix C_s of these examples define the parameters for likely correct detections. Given these parameters, the probability of observing a particular parameter vector given a correct detection can be estimated by a Gaussian probability law. We use an unnormalised probability as an indicator of the probability of a correct observation, given the observed vector Y. This is computed as:

CF_Y = exp( -(1/2) (Y - μ_s)^T C_s^{-1} (Y - μ_s) )

2.3 The Tracking Process

The tracking process is formulated as a recursive estimation process (see figure 4). Such a process maintains an estimate of a state vector and of its uncertainty. Because a hand often undergoes very fast accelerations, we do not attempt to estimate hand velocities. Our tracking process is thus a zeroth-order estimation system with a state vector composed of the position (x, y) and the bounding box (w, h) of the hand.

Figure 4. The tracking process is a zeroth-order recursive estimator for position and size (a predict, observe, validate, update cycle)

The covariance of the four parameters is a 4x4 covariance matrix. However, if we assume that errors in position are independent of errors in size, then the covariance matrix can be separated into two independent 2x2 covariance matrices.

3 Visual Processes for Tracking Hands

Robust tracking can be obtained by driving several detection processes. This section briefly describes processes for detecting hands using image differencing and skin detection. More details can be found in [9].

3.1 Image differencing

Two different image differencing processes are included in our system. The first computes differences between successive images. The difference in luminance of pixels from two successive images is close to zero for pixels of the background. By choosing and maintaining an appropriate threshold, moving objects are detected within a static scene. As drawbacks, other intruding moving objects can also be detected, and a static hand leads to a failure of detection. The second process computes the difference in luminance between the live image and a pre-stored background image; it always detects the hand, but also the arm and some shadows.

Whereas differencing two successive images detects only part of the hand, differencing against a background image provides a large region of interest. This is the reason why our system toggles between these two processes depending on the confidence factor. Assuming that the differencing-from-successive-images process is the currently active process, a static hand leads to a small bounding box and consequently to a small confidence factor. An event is generated and the supervisor starts the differencing-from-background process. Such a decision can be formulated as a control rule:

(EVENT CF_low)
(PROCESS differencing_from_successive_images)
=>
(stop differencing_from_successive_images)
(start differencing_from_background)

In this rule, EVENT and PROCESS are messages, while start and stop are commands. When something is detected using the differencing-from-background process, the supervisor toggles back to the differencing-from-successive-images process.
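The confidence factor that drives this rule (section 2.2) is inexpensive to compute. The sketch below, in Python with assumed variable names and synthetic example data, trains the mean and covariance from a sample of correct detections and evaluates the unnormalised Gaussian:

```python
# Sketch of the confidence factor of section 2.2: an unnormalised Gaussian
# evaluated at the observed parameter vector Y, with mean and covariance
# trained from a sample of correct detections. Names and data are illustrative.
import numpy as np

def train_cf(samples):
    """Estimate mean and covariance from correct detections (rows of samples)."""
    mu = samples.mean(axis=0)
    cov = np.cov(samples, rowvar=False)
    return mu, cov

def confidence_factor(Y, mu, cov):
    """CF = exp(-1/2 (Y - mu)^T cov^-1 (Y - mu)); lies in [0, 1]."""
    d = Y - mu
    return float(np.exp(-0.5 * d @ np.linalg.solve(cov, d)))

# Hypothetical training set: (x, y, w, h) of hand-verified correct detections.
rng = np.random.default_rng(0)
samples = rng.normal([100, 80, 40, 50], [5, 5, 3, 3], size=(200, 4))
mu, cov = train_cf(samples)

print(confidence_factor(mu, mu, cov))               # exactly 1.0 at the mean
print(confidence_factor(mu + 100, mu, cov) < 0.01)  # far from the mean: near 0
```

Because the probability is left unnormalised, a detection identical to the training mean scores exactly 1, which matches the CF range of 0 to 1 used by the supervisor's rules.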
3.2 Skin detection
Skin can be easily detected using a normalised color vector [10]. The color pixels are normalised by dividing out the luminance component. This removes the effect of changes in the orientation of the skin surface with respect to a light source. The remaining 2-vector, composed of (r, g), is nearly constant for skin regions. A 2-D histogram h(r, g) of the pixels from a region containing skin will show a strong peak at the skin color. We use a sample of N pixels, obtained automatically when another process succeeds in detection, to initialise the normalised color histogram. The histogram makes it possible to use table lookup to estimate the conditional probability of observing a color vector C = (r, g) given that the pixel corresponds to skin:

p(C | skin) = (1/N) h(r, g)

Regions where this probability is above a low threshold are detected and described using connected components analysis to provide possible parameters for hand tracking.
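The histogram training and table lookup can be sketched as follows; the bin count, sample data and threshold are illustrative choices, not values from the system described here:

```python
# Sketch of skin detection via a normalised-colour histogram (section 3.2).
# RGB pixels are normalised to (r, g) = (R, G) / (R + G + B); a 2-D histogram
# trained on N skin pixels gives p(C|skin) = h(r, g) / N by table lookup.
import numpy as np

BINS = 32  # illustrative histogram resolution

def chromaticity(rgb):
    """Map RGB pixels of shape (..., 3) to (r, g) chromaticity in [0, 1]."""
    s = rgb.sum(axis=-1, keepdims=True) + 1e-9
    return (rgb / s)[..., :2]

def train_histogram(skin_rgb):
    """Build the p(C|skin) lookup table from N sample skin pixels."""
    rg = chromaticity(skin_rgb)
    idx = np.minimum((rg * BINS).astype(int), BINS - 1)
    h = np.zeros((BINS, BINS))
    np.add.at(h, (idx[..., 0], idx[..., 1]), 1)
    return h / len(skin_rgb)

def skin_probability(image_rgb, table):
    """Per-pixel p(C|skin) for an image, by table lookup."""
    rg = chromaticity(image_rgb)
    idx = np.minimum((rg * BINS).astype(int), BINS - 1)
    return table[idx[..., 0], idx[..., 1]]

# Hypothetical skin sample, clustered in chromaticity space.
rng = np.random.default_rng(1)
skin = np.clip(rng.normal([180, 120, 90], 10, size=(1000, 3)), 1, 255)
table = train_histogram(skin)

image = np.tile(skin[0], (4, 4, 1))       # a 4x4 patch of a known skin colour
mask = skin_probability(image, table) > 0.0005
print(mask.all())  # → True
```

The thresholded mask would then be passed to connected components analysis (e.g. a labeling routine) to extract candidate hand regions.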
4 Experiments

The following section presents experiments on the different visual processes. For these experiments, we use a database of 100 images from a sequence of images of a moving hand.

4.1 Evaluating precision

For each image of the database, the bounding box determined automatically is compared to a bounding box selected by hand, in order to measure the error in position (x, y) of the center and in size (w, h). Figures 5 and 6 give the errors in position and size for the visual processes: differencing with a static background (back), successive-image differencing (succ), skin-color detection (skin), and the combination of background differencing and skin-color detection. The combined process corresponds to computing an approximate bounding box which encloses the hand using background differencing, and then computing the skin detection within that bounding box.

These experiments show that the bounding box obtained by differencing successive images is very imprecise. Differencing the image against a static background tends to be somewhat imprecise, but better than differencing successive images. Skin detection using normalised color histograms, and the combination of background differencing and skin-color detection, are the most precise, with approximately the same results.

Figure 5. Error of position in x and y for the four visual processes

Figure 6. Error of size in width and height for the four visual processes

4.2 Evaluating computation time

In order to compare computation times, we made experiments on entire images of size 192 by 144; the results are shown in figure 7. Visual processes using image differencing are faster than processes using skin-color detection. For the combined method, in addition to the computation time of the image differencing, the skin-color detection is computed on a sub-image; the overall computation time therefore depends on the size of the sub-image.

Figure 7. Computation time for images of 192 by 144 pixels

4.3 Execution of Visual Processes

In order to verify the control graph, we use a sequence of images including hand movement, important changes in hand size, and appearance and disappearance of the hand. Figure 8 shows a small part of the total sequence.

Figure 8. Part of the execution of the visual processes, showing which detection process is active over time

In this figure, five parts have been numbered, corresponding to successive images in the sequence. In part (1), the successive-images-differencing process is executed first. Because there is no moving object in the scene, an event is generated. This event activates the background-image-differencing process, followed by the skin-color-detection process. In part (2), the hand has moved enough to be tracked by the successive-images-differencing process, giving a normal bounding box. Because the bounding box is small, the computation for skin detection is inexpensive. At stage (3), the hand is closed (during the training of the recursive estimator, the hand was always open); consequently the bounding box does not correspond to the expected one. The successive-images-differencing process fails and the background-image-differencing process is activated. During parts (4) and (5), the hand is moving with correct size and the successive-images-differencing process is activated.
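The zeroth-order recursive estimator of section 2.3, which consumes the detections discussed above, can be sketched as a standard Kalman predict/update pair on position alone. The noise values here are assumptions, and under the independence assumption of section 2.3 the size (w, h) would be filtered by an identical, separate 2x2 filter:

```python
# Sketch of the zeroth-order recursive estimator (section 2.3), assumed noise
# values: the state (x, y) is predicted unchanged, its uncertainty grows by a
# process noise Q, and each validated detection y updates it with the usual
# Kalman gain.
import numpy as np

Q = np.eye(2) * 25.0   # process noise: how far the hand may move per frame
R = np.eye(2) * 4.0    # observation noise of the detection process

def predict(x, P):
    """Zeroth order: position unchanged, uncertainty grows."""
    return x, P + Q

def update(x, P, y):
    """Fold a detected position y into the estimate."""
    K = P @ np.linalg.inv(P + R)          # Kalman gain
    return x + K @ (y - x), (np.eye(2) - K) @ P

# Start with a vague estimate, then fold in two detections.
x, P = np.array([0.0, 0.0]), np.eye(2) * 100.0
for y in [np.array([10.0, 5.0]), np.array([12.0, 6.0])]:
    x, P = predict(x, P)
    x, P = update(x, P, y)

print(np.allclose(x, [12, 6], atol=2.5))  # estimate has converged near the detections
```

With no velocity in the state, fast hand accelerations cost nothing in model mismatch; the price is that Q must be large enough to cover the largest expected inter-frame motion.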
5 Conclusion

The integration of complementary visual processes can produce a reliable and robust system. Integration and coordination are provided by the Chord architecture, an extension of the Synchronous Ensemble of Reactive Visual Processes model developed in the VAP project [5]. Coordination of visual processes requires signaling visual events to the supervisor. Such visual events correspond to the confidence factor given by each visual process according to its results. Experiments on the visual processes were performed in order to build the control graph according to the strengths and weaknesses of each process. These experiments lead to the conclusion that the region of interest can be reduced if the visual processes are called sequentially. In the near future, new visual processes will be added to the current system; active contours and cross-correlation could be used as tracking processes. This system is part of a larger project on gesture recognition [9].
6 Acknowledgments
This work has been partially supported by France Telecom CNET (Project COMEDI).
References
[1] C. Andersen, S. Jones, and J. Crowley. Appearance based
processes for visual navigation. In 5th International Symposium on Intelligent Robotic Systems, SIRS’97, pages 227–
236, Royal Institute of Technology, Stockholm, Sweden, July
1997.
[2] J. Crowley and J. Bedrune. Integration and control of reactive visual processes. In J.-O. Eklundh, editor, European
Conference on Computer Vision, (ECCV’94), volume 801 of
Lecture Notes in Computer Science, Stockholm, Sweden,
May 1994. Springer Verlag.
[3] J. Crowley and F. Bérard. Multi–modal tracking of faces for
video communications. In IEEE Conference on Computer
Vision and Pattern Recognition (CVPR’97), pages 640–645,
San Juan, Puerto Rico, June 1997.
[4] J. Crowley, F. Bérard, and J. Coutaz. Finger tracking as an
input device for augmented reality. In International Workshop on Automatic Face– and Gesture–Recognition, Zurich,
Switzerland, June 1995.
[5] J. L. Crowley and H. I. Christensen, editors. Vision As Process. ESPRIT Basic Research Series. Springer Verlag, 1995.
[6] S. Jones. Robust Task Achievement. PhD thesis, Institut National Polytechnique de Grenoble, GRAVIR – IMAG, May
1997.
[7] M. Krueger. Artificial Reality II. Addison-Wesley, 1991.
[8] J. Martin and J. Crowley. Un système visuel de reconnaissance de gestes (A visual system for gesture recognition). Submitted to RFIA'98.
[9] J. Martin and J. Crowley. An appearance–based approach
to gesture–recognition. In A. D. Bimbo, editor, International Conference on Image Analysis and Processing, number 1311 in Lecture Notes in Computer Science, Florence,
Italy, Sept. 17–19, 1997. Springer Verlag.
[10] B. Schiele and A. Waibel. Estimation of the head orientation based on a face–color–intensifier. In 3rd International
Symposium on Intelligent Robotic Systems ’95, 10–14 July
1995.
[11] P. Wellner, W. Mackay, and R. Gold. Computer-augmented environments: back to the real world. Special Issue of Communications of the ACM, 36(7), 1993.
[12] B. Zoppis. Outils pour l’Intégration et le Contrôle en Vision
et Robotique Mobile. PhD thesis, Institut National Polytechnique de Grenoble, GRAVIR – IMAG, Juin 1997.