An Efficient Human-Computer Interaction
Framework Using Skin Color Tracking
and Gesture Recognition
Nam Vo, Quang Tran, Thang Ba Dinh, Tien Ba Dinh
Quan M. Nguyen
Faculty of Information Technology
University of Science
Ho Chi Minh City, Vietnam
{nam.poke, tmquang}@gmail.com,
{dbthang, dbtien}@fit.hcmus.edu.vn
International Training Program Center
Sai Gon Technology University
Ho Chi Minh City, Vietnam
nmquan@troyhcmc.edu.vn
Abstract—Skin color is one of the most important sources of information widely used in Human-Computer Interaction (HCI) systems, where detecting the presence of a human plays a key role. The dynamics and articulations of the human body are hard to capture directly, which makes primitive features like skin color attractive. However, skin color varies across different persons, which limits us from building one generative skin model for all persons. Here, we propose an efficient skin color detection and tracking method based on a face detector that refines the skin model for a specific person, together with a tracking method using a simple data association algorithm. Beyond that, we also introduce a robust gesture recognition method using different kinds of extracted information, such as local features, hand motion, head orientation, and fingertips. The final framework interprets the recognized gestures for interacting with a computer through multiple applications: controlling the computer mouse and playing games.
Keywords—Human-Computer Interaction, skin color tracking, gesture recognition.
I. INTRODUCTION
Hand gesture recognition is an interesting research topic in the computer vision community, mainly for the purpose of Human-Computer Interaction (HCI). It can be described as a way of allowing computers to understand human gestures. A number of applications using a hand gesture control interface have been presented, such as television control [1, 2], interaction between fingers and papers [3], and some impressive prototypes like Project Natal1 and SixthSense2.
Most hand interactive systems include three different layers: detection, tracking, and recognition [4].
• Detection: involves detecting the objects of interest in the scene. Existing research does this based on knowledge of object features (such as skin color or hand shape). Skin color segmentation is one of the popular approaches; many studies consider different color spaces (RGB, HSV, YCrCb, ...) and color models (histogram, Gaussian, mixture of Gaussians). Though many techniques have been proposed, skin detection is still one of the most challenging problems [5].
1 http://www.xbox.com/en-us/live/projectnatal/
2 http://www.pranavmistry.com/projects/sixthsense/
• Tracking: is the process toward understanding the movement of the observed objects. Tracking not only keeps track of the motion of objects but also gives us the knowledge to update model parameters and features at a certain moment in time. Ongoing active approaches employ Kalman filters, particle filters, mean-shift, etc. [6, 7]. Many data association techniques have also been proposed to track multiple objects over time based on detection results [8, 9].
• Recognition: at this layer, information about hand features is extracted depending on each specific application, such as hand posture recognition or fingertip detection.
There have been many previous works dedicated to hand posture and sign recognition [10, 11]. In [10], a sign language recognition system using Hidden Markov Models was presented. Meanwhile, in [11], another sophisticated system was proposed that extracts natural hand parameters from monocular image sequences in order to retrieve detailed information about the finger constellation and 3D hand posture.
Another common type of system is hand-driven control, which usually focuses on the motion of hands and fingertips. Kenji Oka [12] proposed a fast method for tracking hands and fingertips for application in augmented desk interface systems. By using an infrared camera to capture images, the method gives good results even in challenging situations. More recently, Wang and Popović [13] used a color glove to simplify the pose estimation problem. Based on the designed color pattern on the glove, the system employs a nearest-neighbor approach to track hands and reconstructs 3D hand pose from 2D captured image sequences. However, it is not convenient to find a color glove in practical applications.
978-1-4244-8075-3/10/$26.00 ©2010 IEEE
The problem of using only bare hands with little special setup as an input is much more challenging. Argyros [14] proposed a quite robust approach for hand tracking and gesture recognition and applied it to controlling the computer mouse. Inspired by his work, we extend our system into a more powerful framework for different applications.
In this paper, we introduce a framework that allows a more natural, human-centered way of interacting with computers. Using images captured from a simple webcam in which the upper body of the user is visible, our framework is able to detect and track the user's head and hands. By extracting features from these objects, it can recognize and interpret the user's gestures as commands for the computer in real time.
The rest of the paper is organized as follows. Section 2
describes the techniques we used in detail. Section 3 presents
experiment results and applications of the framework. Section
4 summarizes our work.
II. METHOD DESCRIPTIONS
Our system framework operates as follows. First, skin pixels are detected and used to find skin blobs. The skin color model is learned online during initialization using the detected face of the user. In the next step, a simple, non-Bayesian tracking method that associates tracked objects with detected blobs is applied. It can handle objects moving fast, in complex trajectories, and moving in and out of the field of view. Finally, several features of interest, such as object orientation, hand palm position, fingertip position and orientation, and hand shape, are extracted. They are used to recognize the gestures of the user for computer interaction.
A. Skin detection
Skin detection is one of the most challenging problems in
computer graphics because human skin color varies from
person to person, race to race, and is strongly sensitive to
lightning condition. In our framework, the skin color is learned
online based on the similar skin color of the face which is
successfully detected with high accuracy.
For initialization, our framework uses the boosted Haar wavelet classifier to detect the face. This method was initially proposed by Paul Viola, then improved by Rainer Lienhart [15, 16], and is widely used in many applications because of its efficiency and robustness. After that, skin pixels within the face region are detected using the "explicit skin cluster" technique [17] with the following thresholds in RGB color space:
• R > G, R > B (skin color usually has a high red component)
• Max(R, G, B) > 30 (exclude black colors)
• Min(R, G, B) < 235 (exclude white colors)
It is important to note that the thresholds are not very tight, since pixels in the face region have a high probability of being skin pixels. The detected skin pixels are used as training data to detect skin observed on the hand, arm, and face.
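The loose thresholds above can be written as a single per-pixel predicate. The following is a minimal sketch (the function name and sample values are ours, not from the paper's C++ prototype):

```python
def is_loose_skin(r, g, b):
    """Loose skin test from the 'explicit skin cluster' rules:
    red dominates, and very dark / very bright pixels are excluded."""
    return (r > g and r > b          # skin usually has a high red component
            and max(r, g, b) > 30    # exclude near-black pixels
            and min(r, g, b) < 235)  # exclude near-white pixels

# Pixels inside the detected face box that pass this test become
# the training samples for the per-user skin color model.
face_pixels = [(180, 120, 100), (20, 20, 20), (250, 245, 240), (200, 90, 80)]
training = [p for p in face_pixels if is_loose_skin(*p)]
```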
For the main task, our framework uses the approach in [14], which employs Bayes' rule to calculate the probability P(s|c) of a color c being skin color:
P(s|c) = P(c|s) P(s) / P(c)
where P(s) is the prior probability of skin color, P(c) is the probability of the occurrence of each color c, and P(c|s) is the probability of a skin pixel having color c (the likelihood). These probabilities are calculated using the data collected in the initialization phase.

Figure 2. Face detection: (a) input image, (b) detected skin pixels
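In practice these probabilities come from color histograms collected during initialization. The following sketch (names are ours; the paper works on quantized colors) builds a P(s|c) lookup from labeled samples:

```python
from collections import Counter

def skin_posterior(all_colors, skin_colors):
    """Estimate P(s|c) = P(c|s) P(s) / P(c) from initialization data.

    all_colors:  quantized colors of every pixel collected so far
    skin_colors: the subset of those labeled skin (face-region pixels)
    """
    n_all, n_skin = len(all_colors), len(skin_colors)
    p_s = n_skin / n_all                 # prior P(s)
    hist_all = Counter(all_colors)       # for P(c)
    hist_skin = Counter(skin_colors)     # for P(c|s)

    def posterior(c):
        if c not in hist_all:
            return 0.0                   # unseen color: no evidence for skin
        p_c = hist_all[c] / n_all
        p_c_given_s = hist_skin.get(c, 0) / n_skin
        return p_c_given_s * p_s / p_c   # Bayes' rule

    return posterior
```

Note that for histogram estimates this reduces to the ratio of skin counts to total counts for each color bin.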
Two thresholds Tmin and Tmax are used to determine whether a pixel has skin color:
• Pixels with color c such that P(s|c) > Tmax are classified as skin-color.
• Pixels with color c such that P(s|c) > Tmin that lie nearby another skin pixel are also considered skin-color.
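The two-threshold rule amounts to region growing from strong seeds, similar to hysteresis thresholding. A minimal sketch (threshold values and names are illustrative, not the paper's):

```python
from collections import deque

def classify_skin(prob, t_min=0.2, t_max=0.6):
    """Two-threshold skin classification on a grid of P(s|c) values:
    pixels above t_max are skin seeds; pixels above t_min become skin
    only if connected to a skin pixel (grown with a BFS)."""
    h, w = len(prob), len(prob[0])
    skin = [[prob[y][x] > t_max for x in range(w)] for y in range(h)]
    queue = deque((y, x) for y in range(h) for x in range(w) if skin[y][x])
    while queue:
        y, x = queue.popleft()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if (0 <= ny < h and 0 <= nx < w
                    and not skin[ny][nx] and prob[ny][nx] > t_min):
                skin[ny][nx] = True      # weak pixel adjacent to skin
                queue.append((ny, nx))
    return skin
```

An isolated pixel between the two thresholds is rejected, which is what suppresses most false alarms.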
Figure 1. Overview of proposed framework
By using two thresholds, this approach decreases the false alarm rate drastically. Moreover, learning skin color automatically from the user's face is a very efficient way to handle the variation in skin color between different people. Changes in illumination conditions can be adapted to by re-running the initialization phase periodically or at special events, for instance, when the background changes, when the user's face moves close to the camera, or when the difference between the detected colors of the object and the learned skin color becomes large enough.
Employing the method proposed by Suzuki [18], we search for connected components and form skin blobs. Components whose areas are smaller than a predefined threshold are considered noise and are ignored.
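The paper uses Suzuki's border following (as implemented by OpenCV's findContours); an equivalent sketch for binary masks, with the same area filtering, is a flood-fill labeling (the minimum area here is illustrative):

```python
def skin_blobs(mask, min_area=3):
    """Group skin pixels into 4-connected blobs; drop small ones as noise."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    blobs = []
    for y in range(h):
        for x in range(w):
            if mask[y][x] and not seen[y][x]:
                stack, blob = [(y, x)], []
                seen[y][x] = True
                while stack:                      # iterative flood fill
                    cy, cx = stack.pop()
                    blob.append((cy, cx))
                    for ny, nx in ((cy+1, cx), (cy-1, cx), (cy, cx+1), (cy, cx-1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
                if len(blob) >= min_area:         # smaller components are noise
                    blobs.append(blob)
    return blobs
```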
An example is shown in Figure 4, where there are two objects o1, o2 and three detected blobs b1, b2, b3. Object o1 and blob b2 will be associated because their distance is the shortest; then come object o2 and blob b1. Finally, a new object is created and associated with b3.
While this method is simple and fast, it works very well in the context of our problem.
Figure 4. An example of object hypotheses and skin blobs
C. Gesture recognition
Using face detection result, head can be easily identified
among the tracked objects. Two biggest remaining objects will
be identified as left hand and right hand based on their relative
positions. The fitted ellipse of associated blob of the object is
used to approximate its position, size and orientation.
Figure 3. Skin detection: (a) input image, (b) pixels that pass the first
threshold Tmax, (c) pixels that pass the second threshold Tmin , (d) pixels that
are classified as skin pixels.
B. Multiple object tracking
In [14], Argyros proposed a method for tracking multiple
skin objects moving in complex trajectories and occluding each
other. We propose a similar, but simpler approach assuming
that objects do not occlude each other; one object is assumed to
correspond to one blob, and vice versa.
First, the position of a blob is approximated by fitting an
ellipse to the blob using least square method [19]. Tracking
then is done by associating detected blobs with object
hypotheses:
• Object hypothesis tracking: link the unassociated blob and the unassociated object whose distance is the shortest. Repeat this until running out of unassociated blobs or unassociated objects.
• Object hypothesis generation: each remaining unassociated blob is assumed to correspond to a new object moving into the scene. Therefore, a new object is created and associated with the blob.
• Object hypothesis removal: each remaining unassociated object is assumed to have disappeared; hence, it is removed if its position is near the edges of the camera's field of view. Otherwise, the object is allowed to "survive" for a period of time before being discarded, to tolerate detection failures.
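The three steps above can be sketched as one greedy update per frame. This is an illustrative implementation (the dictionary layout and the age/border parameters are ours):

```python
import math

def associate(objects, blobs, max_age=10, border=5, width=320, height=240):
    """One tracking step: greedily link the closest (object, blob) pairs,
    remove stale unmatched objects (immediately if near the border),
    and spawn a new object hypothesis for each leftover blob."""
    pairs = sorted(
        ((math.dist(o["pos"], b), oi, bi)
         for oi, o in enumerate(objects) for bi, b in enumerate(blobs)),
        key=lambda t: t[0])
    used_o, used_b = set(), set()
    for _, oi, bi in pairs:                       # shortest distance first
        if oi not in used_o and bi not in used_b:
            used_o.add(oi); used_b.add(bi)
            objects[oi]["pos"], objects[oi]["age"] = blobs[bi], 0
    survivors = []
    for oi, o in enumerate(objects):              # hypothesis removal
        if oi in used_o:
            survivors.append(o)
            continue
        o["age"] += 1
        x, y = o["pos"]
        near_edge = (x < border or y < border
                     or x > width - border or y > height - border)
        if not near_edge and o["age"] <= max_age:
            survivors.append(o)                   # tolerate brief detection failures
    for bi, b in enumerate(blobs):                # hypothesis generation
        if bi not in used_b:
            survivors.append({"pos": b, "age": 0})
    return survivors
```

Run on the Figure 4 configuration (two objects, three blobs), this links o1-b2 and o2-b1 and creates a new hypothesis for b3.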
Hand palm segmentation: for interaction purposes, only the hand palm part is focused on. We identify the inside circle (the incircle) of the object, i.e., the largest circle inscribed within the object, in order to estimate the hand palm position. The hand palm is separated from the wrist-arm part, which usually lies far below the incircle.
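The incircle center is the inside point whose distance to the nearest background pixel is largest, i.e., the argmax of a distance transform (OpenCV's distanceTransform computes this efficiently). A brute-force sketch for small binary masks, written for clarity rather than speed:

```python
import math

def incircle(mask):
    """Approximate the palm incircle: the mask pixel farthest from any
    non-mask pixel (the image border is also treated as outside)."""
    h, w = len(mask), len(mask[0])
    outside = [(y, x) for y in range(h) for x in range(w) if not mask[y][x]]
    best_c, best_r = None, -1.0
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            r = min(y + 1, x + 1, h - y, w - x)    # distance to image border
            for oy, ox in outside:
                r = min(r, math.dist((y, x), (oy, ox)))
            if r > best_r:
                best_c, best_r = (y, x), r
    return best_c, best_r
```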
Fingertip extraction: the convex hull of each hand object's blob is computed, and its vertices are considered fingertip candidates. To measure the curvature at each candidate point, we calculate the angle formed at the candidate point P by two nearby points P1 and P2, which are equally separated from P by a number of contour points, k. P is then classified as a fingertip if its angle measurement is smaller than a defined threshold. The line connecting P and the midpoint of P1 and P2 indicates the orientation of the finger. The parameter k could be set to a constant, but that is not effective when the size of the object varies greatly. In our framework, we compute k as: k = r / 2, where r is the radius of the incircle.
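The curvature test can be sketched directly from this description (the 60-degree threshold here is illustrative; the paper does not state its value):

```python
import math

def fingertip_angle(contour, i, k):
    """Angle in degrees at P = contour[i], formed with the points
    k steps before and after it along the closed contour."""
    p = contour[i]
    p1 = contour[(i - k) % len(contour)]
    p2 = contour[(i + k) % len(contour)]
    v1 = (p1[0] - p[0], p1[1] - p[1])
    v2 = (p2[0] - p[0], p2[1] - p[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    cos_a = dot / (math.hypot(*v1) * math.hypot(*v2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_a))))

def is_fingertip(contour, i, incircle_radius, thresh_deg=60.0):
    k = max(1, int(incircle_radius / 2))   # adaptive k = r / 2
    return fingertip_angle(contour, i, k) < thresh_deg
```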
Figure 5. (a) Hand object, (b) hand palm segmented, with all five fingertips detected
Object motion estimation: although the object motion can be deduced from its positions over time, most applications need smoother and more accurate information. For that purpose, we adopt the pyramidal Lucas-Kanade feature tracker (KLT tracker) [20, 21], a very well-known optical flow estimation method. More specifically, a number of features on the object are found and tracked (Figure 6). The motion of the object is then estimated as the median of the movements of the features. This is very useful when the object is lost for a short period of time due to detection or tracking failure: its motion can still be obtained.
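The median aggregation is what gives the estimate its robustness: a few badly tracked features do not affect the result. A minimal sketch over point pairs returned by any feature tracker:

```python
import statistics

def object_motion(prev_pts, curr_pts):
    """Per-axis median of the feature displacements between two frames,
    so outlier tracks (lost or drifting features) are ignored."""
    dx = statistics.median(c[0] - p[0] for p, c in zip(prev_pts, curr_pts))
    dy = statistics.median(c[1] - p[1] for p, c in zip(prev_pts, curr_pts))
    return dx, dy
```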
Hand gesture recognition: since our framework is for hand-driven control, it does not need to recognize a large vocabulary of hand postures. Here, we do not employ any special technique for this purpose, but rather deduce the hand gesture from the extracted information about the hand palm and fingertips. For example, a hand object with five detected fingers can be classified as the "opening" gesture.
Figure 7. Frames (a)-(d) from an experimental sequence
We also define a vocabulary of gestures to recognize and test it on several captured image sequences (Table I).
TABLE I. THE VOCABULARY AND RULES TO RECOGNIZE GESTURES

Gesture                    Rule
Hand Open                  4 or 5 visible fingertips.
Hand Close                 No visible fingertips.
Hand Point left/right/up   1 visible fingertip; its orientation (with respect to the vertical axis) is in [45, 135], [225, 315], or [-45, 45] degrees, corresponding to the left, right, and up directions respectively.
Hand pose Victory          2 visible fingertips.
Head tilting               Head's orientation (with respect to the vertical axis) is more than 30 degrees.

Figure 6. Two consecutive frames (a), (b) where features (red dots) on the hand object are tracked
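The rules of Table I map directly onto a small decision function; this sketch uses the table's thresholds, with the boundary handling (overlapping interval endpoints) resolved by rule order:

```python
def recognize_gesture(n_fingertips, finger_angle_deg=None, head_angle_deg=0.0):
    """Rule set from Table I. Angles are measured with respect to the
    vertical axis, in degrees (conventions as in the table)."""
    if abs(head_angle_deg) > 30:
        return "head tilting"
    if n_fingertips in (4, 5):
        return "hand open"
    if n_fingertips == 0:
        return "hand close"
    if n_fingertips == 2:
        return "victory"
    if n_fingertips == 1 and finger_angle_deg is not None:
        a = finger_angle_deg % 360
        if 45 <= a <= 135:
            return "point left"
        if 225 <= a <= 315:
            return "point right"
        if a >= 315 or a <= 45:     # [-45, 45] wrapped into [0, 360)
            return "point up"
    return "unknown"
```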
III. EXPERIMENTS
A. Method analysis
A prototype of the proposed framework is implemented in C++ with the OpenCV library. It runs on a computer with a 1.86 GHz processor at a rate of 20 Hz. The webcam captures image sequences at a resolution of 320x240. Practical experiments show that our framework works well in environments with little noise (i.e., few objects whose color is similar to human skin) and balanced lighting conditions. Otherwise, an off-line training phase is needed. Here, we describe a representative experiment.
First, the user's face is detected and used as input for skin color learning. After that, skin objects in the scene, including the user's head and left hand, are detected (Figure 7a). He begins to move his left hand in a random trajectory at varying speed, sometimes very fast. Next, the left hand leaves the field of view, and then two hands appear (Figure 7b). The user moves both hands at the same time. Then, he opens his right hand, moves it around, moves it closer to and farther away from the camera, and opens and closes some fingers (Figures 7c, 7d). Though noise sometimes appears, the head and hands are successfully detected and tracked throughout the experiment, and so are the fingertips.
To recognize the gestures, some rules that use the extracted features of objects are applied. Moreover, [4] requires three criteria to be satisfied to determine the specific point in time at which a gesture takes place. Our framework makes use of the two following criteria:
• The gesture must last for at least a predefined amount of time.
• The hand performing a gesture must have a velocity smaller than a predefined threshold. This is due to the observation that it is hard for people to move a hand and change its posture at the same time. Moreover, the image of an object moving at high speed is often not clear.
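The two criteria amount to a temporal filter over per-frame classifications. A minimal sketch (the frame count and speed threshold are illustrative values, not from the paper):

```python
def confirmed_gesture(history, min_frames=10, max_speed=5.0):
    """Trigger a gesture only if the same raw label persists for at least
    min_frames frames AND the hand stays nearly still the whole time.

    history: list of (label, hand_speed) per frame, newest last."""
    if len(history) < min_frames:
        return None
    recent = history[-min_frames:]
    label = recent[0][0]
    if all(l == label and s < max_speed for l, s in recent):
        return label
    return None
```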
The four test sequences consist of 10191 frames (about six minutes in total). In these videos, a user shows different gestures with the head and one or two hands visible. The average accuracy is 94.1%. Most failure cases are caused by poor skin detection.
TABLE II. ACCURACY IN CONDUCTED EXPERIMENTS

Sequence   Total frames   Failed frames   Accuracy
1          2990           128             95.9%
2          1535           27              98.2%
3          1834           101             94.5%
4          3704           342             90.8%
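The overall accuracy figure can be recomputed from the per-sequence frame counts in Table II as the frame-weighted average, i.e., one minus the ratio of failed frames to total frames:

```python
# (total frames, failed frames) per sequence, from Table II
sequences = [(2990, 128), (1535, 27), (1834, 101), (3704, 342)]

total = sum(t for t, _ in sequences)
failed = sum(f for _, f in sequences)
overall = round(100.0 * (total - failed) / total, 1)   # 94.1
```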
B. Applications
Using the extracted features, the user's gestures are recognized. The framework then interprets these gestures as input to control the computer (shown in Figure 8). Different applications can be developed using whichever gestures are suitable. Here, we introduce some applications used to test our framework.
Figure 8. Gesture recognition and interpretation flowchart

1) Controlling the computer cursor
The framework is first applied to controlling the computer cursor with the user's right hand. The right hand is detected, its motion is retrieved and interpreted as cursor movement, and an index-finger-up gesture is interpreted as a mouse left-button-down event. This interface allows the user to control many programs.
Due to limitations when interacting with the Windows operating system, it is not as smooth as using a mouse. However, we find it comfortable for operating many programs, such as playing Solitaire (Figure 9).

2) Playing racing games
Racing games such as Need for Speed3 attract a lot of gamers. Our application tries to bring a more realistic feeling to gamers.
The most important action during the game is controlling the wheel. In this application, the user holds his two hands as if gripping a “virtual wheel” (Figure 10) and turns it left or right to interact with the game. Here, the interpretation is based on the relative positions of the two detected hands. More specifically, we calculate the angle between the two hands with respect to the horizontal axis. From this, we can compute how far the “virtual wheel” is turned and map the action to the game accordingly.
The gesture of the right hand is further used to control other actions:
• Closed hand: move forward.
• Open hand: stop.
• Victory posture: move backward.
Experiments show good results; playing a racing game using this interface gives a more exciting and realistic feeling.
Figure 10. Hand gestures in playing racing games
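The steering interpretation amounts to the angle of the line through the two hand centers with respect to the horizontal axis. An illustrative sketch (function name and coordinate convention are ours; image coordinates with y growing downward):

```python
import math

def wheel_angle(left_hand, right_hand):
    """Steering angle of the 'virtual wheel' in degrees: the inclination
    of the segment between the two hand centers, 0 when the hands are level."""
    dx = right_hand[0] - left_hand[0]
    dy = right_hand[1] - left_hand[1]
    return math.degrees(math.atan2(dy, dx))
```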
3) Playing first-person shooter games
The last application is playing Counter-Strike4, a very well-known first-person shooter game that originated as a Half-Life5 modification. We interpret the following actions, which allow the user to play the game without mouse and keyboard (Figure 11):
• Look and fire: the right hand is used to simulate these actions. The user shapes the right hand like the gun in the game. The motion of the hand is interpreted as motion in the game, and thumb down/up corresponds to the begin/stop firing actions.
• Reload the bullets: the extracted head orientation is used to recognize the head tilting gesture, which is interpreted as reloading the gun.
• Move forward: using the left hand, index finger up/down corresponds to the move forward/stop actions.
3 http://www.needforspeed.com/
4 http://www.counter-strike.com/
5 http://www.gamespot.com/pc/action/halflife/index.html
Figure 9. Controlling cursor to play Solitaire
Using these gestures, the user can move around, fire, and reload. However, it is still difficult to aim at a specific point; improvement is needed for comfortable play.
Figure 11. Hand gestures in playing first-person shooter game
IV. CONCLUSION AND FUTURE WORK
We have presented an efficient Human-Computer Interaction framework using skin object detection, tracking, and feature extraction. The system is able to interpret many human gestures with real-time performance. We also demonstrated several practical applications as a basic step in developing a real system.
In the future, we would like to increase performance using multithreaded programming to get smoother results. Also, more techniques can be employed, such as hand posture recognition and eye detection and tracking, to extract more features and support more gestures. This will allow us to fully control a computer with more varied and complicated commands.
ACKNOWLEDGMENT
This work is a part of the KC.01/06-10 project supported by
the Ministry of Science and Technology, 2009-2010.
REFERENCES
[1] W.T. Freeman and C.D. Weissman, "Television control by hand gestures," IEEE International Workshop on Automatic Face and Gesture Recognition, Zurich, Switzerland, 1995.
[2] L. Bretzner, I. Laptev, T. Lindeberg, S. Lenman, and Y. Sundblad, "A prototype system for computer vision based human computer interaction," Technical report CVAP251, ISRN KTH NA/P–01/09–SE, Stockholm, Sweden, April 23-25, 2001.
[3] Z. Zhang, "Vision-based interaction with fingers and papers," in Proc. International Symposium on the CREST Digital Archiving Project, pp. 83–106, Tokyo, Japan, May 23-24, 2003.
[4] X. Zabulis, H. Baltzakis, and A.A. Argyros, "Vision-based hand gesture recognition for human-computer interaction," in The Universal Access Handbook, Lawrence Erlbaum Associates, Inc. (LEA), Series on Human Factors and Ergonomics, ISBN: 978-0-8058-6280-5, pp. 34.1–34.30, June 2009.
[5] V. Vezhnevets, V. Sazonov, and A. Andreeva, "A survey on pixel-based skin color detection techniques," Proc. Graphicon, pp. 85–92, Moscow, Russia, September 2003.
[6] R. Hess and A. Fern, "Discriminatively trained particle filters for complex multi-object tracking," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2009.
[7] Z. Li, J. Chen, and N.N. Schraudolph, "An improved mean-shift tracker with kernel prediction and scale optimisation targeting for low-frame-rate video tracking," in 19th Intl. Conf. Pattern Recognition (ICPR), Tampa, Florida, 2008.
[8] C. Huang, B. Wu, and R. Nevatia, "Robust object tracking by hierarchical association of detection responses," ECCV 2008, vol. 2, pp. 788–801.
[9] F. Yan, W. Christmas, and J. Kittler, "Layered data association using graph-theoretic formulation with applications to tennis ball tracking in monocular sequences," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, pp. 1814–1830, New York, 2008.
[10] T. Starner, A. Pentland, and J. Weaver, "Real-time American sign language recognition using desk and wearable computer based video," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 12, pp. 1371–1375, December 1998.
[11] H. Fillbrandt, S. Akyol, and K.F. Kraiss, "Extraction of 3D hand shape and posture from image sequences for sign language recognition," IEEE International Workshop on Analysis and Modeling of Faces and Gestures, vol. 17, pp. 181–186, October 2003.
[12] K. Oka, Y. Sato, and H. Koike, "Real-time tracking of multiple fingertips and gesture recognition for augmented desk interface systems," IEEE Computer Graphics and Applications, vol. 22, no. 6, pp. 64–71, November-December 2002.
[13] R.Y. Wang and J. Popović, "Real-time hand-tracking with a color glove," ACM Transactions on Graphics, vol. 28, no. 3, article 63, August 2009.
[14] A.A. Argyros and M.I.A. Lourakis, "Real time tracking of multiple skin-colored objects with a possibly moving camera," in Proceedings of the European Conference on Computer Vision (ECCV'04), Springer-Verlag, vol. 3, pp. 368–379, Prague, Czech Republic, May 11-14, 2004.
[15] P. Viola and M.J. Jones, "Rapid object detection using a boosted cascade of simple features," IEEE CVPR, 2001.
[16] R. Lienhart, A. Kuranov, and V. Pisarevsky, "Empirical analysis of detection cascades of boosted classifiers for rapid object detection," DAGM'03, 25th Pattern Recognition Symposium, Magdeburg, Germany, pp. 297–304, Sep. 2003.
[17] F. Gasparini and R. Schettini, "Skin segmentation using multiple thresholding," in Internet Imaging VII, vol. 6061 of Proceedings of SPIE, pp. 1–8, San Jose, Calif., USA, January 2006.
[18] S. Suzuki and K. Abe, "Topological structural analysis of digitized binary images by border following," Computer Vision, Graphics and Image Processing, vol. 30, no. 1, pp. 32–46, April 1985.
[19] Z. Zhang, "Parameter estimation techniques: A tutorial with application to conic fitting," Image and Vision Computing 15, pp. 59–76, 1996.
[20] J.-Y. Bouguet, "Pyramidal implementation of the Lucas Kanade feature tracker – Description of the algorithm," Intel Corporation, Microprocessor Research Labs.
[21] "KLT: An implementation of the Kanade-Lucas-Tomasi feature tracker," http://www.ces.clemson.edu/~stb/klt/.