An Efficient Human-Computer Interaction Framework Using Skin Color Tracking and Gesture Recognition

Nam Vo, Quang Tran, Thang Ba Dinh, Tien Ba Dinh
Faculty of Information Technology, University of Science, Ho Chi Minh City, Vietnam
{nam.poke, tmquang}@gmail.com, {dbthang, dbtien}@fit.hcmus.edu.vn

Quan M. Nguyen
International Training Program Center, Sai Gon Technology University, Ho Chi Minh City, Vietnam
nmquan@troyhcmc.edu.vn

Abstract—Skin color is one of the most important sources of information in Human-Computer Interaction (HCI) systems, where detecting the presence of a human plays a key role. The dynamics and articulations of the human body are hard to capture directly, which makes a primitive feature such as skin color attractive. However, skin color varies from person to person, which prevents us from building one generative skin model for everybody. We propose an efficient skin color detection and tracking method that uses a face detector to refine the skin model for a specific person, together with a tracking method based on a simple data association algorithm. Beyond that, we introduce a robust gesture recognition method that uses several kinds of extracted information: local features, hand motion, head orientation, and fingertips. The resulting framework interprets the recognized gestures for interacting with a computer in multiple applications: controlling the computer mouse and playing games.

Keywords: Human-Computer Interaction; skin color tracking; gesture recognition

I. INTRODUCTION

Hand gesture recognition is an active research topic in the computer vision community, mainly for the purpose of Human-Computer Interaction (HCI). It can be described as a way of allowing computers to understand human gestures. A number of applications with hand gesture control interfaces have been presented, such as television control [1, 2] and interaction between fingers and paper [3], along with impressive prototypes like Project Natal (http://www.xbox.com/en-us/live/projectnatal/) and SixthSense (http://www.pranavmistry.com/projects/sixthsense/). Most hand-interactive systems include three different layers: detection, tracking, and recognition [4].

• Detection: finding the objects of interest in the scene. Existing work does this based on knowledge of object features, such as skin color or hand shape. Skin color segmentation is one of the most popular approaches; many studies consider different color spaces (RGB, HSV, YCrCb, ...) and color models (histogram, Gaussian, mixture of Gaussians). Although many techniques have been proposed, skin detection remains one of the most challenging problems [5].

• Tracking: the process of understanding the movement of the observed objects. Tracking not only follows the motion of objects but also provides the knowledge needed to update the model parameters and features at each moment in time. Active approaches employ the Kalman filter, the particle filter, mean-shift, etc. [6, 7]. Many data association techniques have also been proposed to track multiple objects over time based on detection results [8, 9].

• Recognition: at this layer, information about hand features is extracted depending on the specific application, for example hand posture recognition or fingertip detection. Many previous works are dedicated to hand posture and sign recognition [10, 11]. In [10], a sign language recognition system using a Hidden Markov Model was presented.
Meanwhile, in [11], a more sophisticated system was proposed that extracts natural hand parameters from monocular image sequences in order to retrieve detailed information about the finger constellation and the 3D hand posture. Another common type of system is hand-driven control, which usually focuses on the motion of the hands and fingertips. Oka et al. [12] proposed a fast method for tracking hands and fingertips for application in augmented desk interface systems. By using an infrared camera to capture images, the method gives good results even in challenging situations. More recently, Wang and Popović [13] used a color glove to simplify the pose estimation problem. Based on the color pattern designed on the glove, their system employs a nearest-neighbor approach to track hands and reconstructs the 3D hand pose from 2D captured image sequences. However, a color glove is not convenient in practical applications. The problem of using only bare hands, with little special setup, as an input is much more challenging. Argyros and Lourakis [14] proposed a robust approach for hand tracking and gesture recognition and applied it to controlling the computer mouse. Inspired by this work, we extend our system into a more powerful framework for different applications.

In this paper, we introduce a framework that allows a more natural, human-centered way of interacting with computers. Using images captured from a simple webcam in which the upper body of the user is visible, our framework detects and tracks the user's head and hands. By extracting features from these objects, it recognizes and interprets the gestures of the user as commands for the computer in real time.

The rest of the paper is organized as follows. Section II describes the techniques we used in detail. Section III presents experimental results and applications of the framework. Section IV summarizes our work.

II. METHOD DESCRIPTIONS

Our framework operates as follows (Figure 1). First, skin pixels are detected and grouped into skin blobs. The skin color model is learned online during initialization using the detected face of the user. In the next step, a simple, non-Bayesian tracking method associates tracked objects with detected blobs. It can handle objects that move fast, follow complex trajectories, and move in and out of the field of view. Finally, several features of interest, such as object orientation, hand palm position, fingertip position and orientation, and hand shape, are extracted. They are used to recognize the gestures of the user for computer interaction.

Figure 1. Overview of the proposed framework

A. Skin detection

Skin detection is one of the most challenging problems in computer vision because human skin color varies from person to person and race to race, and is strongly sensitive to lighting conditions. In our framework, the skin color is learned online from the user's face, which can be detected with high accuracy. For initialization, our framework uses the boosted Haar wavelet classifier to detect the face. This method was initially proposed by Viola and Jones, then improved by Lienhart [15, 16], and is widely used in many applications because of its efficiency and robustness.
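A minimal sketch of this initialization step is given below, written against the modern OpenCV C++ API (our 2010 prototype used the API of the time); the cascade parameters are stock defaults rather than values from the paper:

```cpp
// Initialization sketch: detect the user's face with the boosted Haar
// cascade [15, 16]; the face pixels later seed the skin color model.
#include <opencv2/objdetect.hpp>
#include <opencv2/imgproc.hpp>
#include <vector>

// Returns the largest detected face, or an empty Rect if none is found.
cv::Rect detectFace(const cv::Mat& frameBGR, cv::CascadeClassifier& cascade)
{
    cv::Mat gray;
    cv::cvtColor(frameBGR, gray, cv::COLOR_BGR2GRAY);
    cv::equalizeHist(gray, gray);

    std::vector<cv::Rect> faces;
    // Scale step 1.1 and 3 min-neighbors are common defaults, not paper values.
    cascade.detectMultiScale(gray, faces, 1.1, 3, 0, cv::Size(40, 40));

    cv::Rect best;                       // empty Rect has area 0
    for (const cv::Rect& f : faces)
        if (f.area() > best.area()) best = f;
    return best;
}
```

The classifier would be loaded once, e.g. with cascade.load("haarcascade_frontalface_alt.xml") using the stock model shipped with OpenCV, and detection simply retried on later frames when no face is found.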
After that, skin pixels within the face region are detected using the "explicit skin cluster" technique [17] with the following thresholds in RGB color space:

• R > G and R > B (skin color usually has a high red component);
• Max(R, G, B) > 30 (excludes near-black colors);
• Min(R, G, B) < 235 (excludes near-white colors).

It is important to note that these thresholds are not very tight, since pixels in the face region already have a high probability of being skin pixels. The detected skin pixels are used as training data for detecting skin observed on the hands, arms, and face.

Figure 2. Face detection: (a) input image, (b) detected skin pixels

For the main task, our framework uses the approach in [14], which employs Bayes' rule to calculate the probability P(s|c) of a color c being skin color:

P(s|c) = P(c|s) P(s) / P(c)

where P(s) is the prior probability of skin color, P(c) is the probability of occurrence of color c, and P(c|s) is the probability of a skin pixel having color c (the likelihood). These probabilities are calculated from the data collected in the initialization phase. Two thresholds, Tmin and Tmax, determine whether a pixel has skin color:

• pixels with color c such that P(s|c) > Tmax are classified as skin-colored;
• pixels with color c such that P(s|c) > Tmin that also lie near another skin pixel are considered skin-colored as well.

By using two thresholds, this approach decreases the false alarm rate drastically. Moreover, learning the skin color automatically from the user's face is a very efficient way to handle the variation in skin color between different people. Changes in illumination conditions can be adapted to by repeating the initialization phase periodically or at special events, for instance when the background changes, when the user's face moves close to the camera, or when the difference between the detected object colors and the learned skin color becomes large enough.

Figure 3. Skin detection: (a) input image, (b) pixels that pass the first threshold Tmax, (c) pixels that pass the second threshold Tmin, (d) pixels classified as skin pixels

Employing the border-following method proposed by Suzuki and Abe [18], we then search for connected components and form skin blobs. Components whose areas are smaller than a predefined threshold are considered noise and are ignored. Sketches of the skin classifier and of this blob-extraction step follow.
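First, a sketch of the skin model under stated assumptions: a 32x32x32 quantized RGB histogram stands in for the color statistics (the bin count is our choice), a single dilation step approximates the "nearby another skin pixel" rule, and the Tmin/Tmax values are illustrative:

```cpp
#include <opencv2/imgproc.hpp>
#include <algorithm>
#include <vector>

// Explicit skin cluster rules [17], applied inside the detected face box.
static bool clusterRule(const cv::Vec3b& bgr)
{
    int B = bgr[0], G = bgr[1], R = bgr[2];
    int hi = std::max(R, std::max(G, B));
    int lo = std::min(R, std::min(G, B));
    return R > G && R > B && hi > 30 && lo < 235;
}

class SkinModel {
    static const int BINS = 32, Q = 256 / BINS;
    std::vector<float> pSC;                         // P(s|c) per color bin
    static int idx(const cv::Vec3b& p)
    { return (p[2] / Q * BINS + p[1] / Q) * BINS + p[0] / Q; }

public:
    float tMin = 0.2f, tMax = 0.6f;                 // illustrative thresholds

    // Counting all frame pixels estimates P(c); counting rule-passing
    // pixels inside the face estimates P(c|s)P(s) up to the same
    // normalizer, so P(s|c) = P(c|s)P(s)/P(c) reduces to a count ratio.
    void learn(const cv::Mat& frameBGR, const cv::Rect& face)
    {
        std::vector<int> nSkin(BINS * BINS * BINS, 0), nAll(BINS * BINS * BINS, 0);
        for (int y = 0; y < frameBGR.rows; ++y)
            for (int x = 0; x < frameBGR.cols; ++x) {
                const cv::Vec3b& p = frameBGR.at<cv::Vec3b>(y, x);
                ++nAll[idx(p)];
                if (face.contains(cv::Point(x, y)) && clusterRule(p))
                    ++nSkin[idx(p)];
            }
        pSC.assign(nAll.size(), 0.f);
        for (size_t i = 0; i < nAll.size(); ++i)
            if (nAll[i] > 0) pSC[i] = (float)nSkin[i] / (float)nAll[i];
    }

    // Dual-threshold classification: Tmax alone, or Tmin next to a
    // confident skin pixel. Returns a 0/255 mask.
    cv::Mat classify(const cv::Mat& frameBGR) const
    {
        cv::Mat strong(frameBGR.size(), CV_8U), weak(frameBGR.size(), CV_8U);
        for (int y = 0; y < frameBGR.rows; ++y)
            for (int x = 0; x < frameBGR.cols; ++x) {
                float p = pSC[idx(frameBGR.at<cv::Vec3b>(y, x))];
                strong.at<uchar>(y, x) = p > tMax ? 255 : 0;
                weak.at<uchar>(y, x)   = p > tMin ? 255 : 0;
            }
        cv::Mat grown;
        cv::dilate(strong, grown, cv::Mat());       // 3x3 neighborhood
        return strong | (weak & grown);
    }
};
```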
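Second, the blob-extraction step. cv::findContours implements exactly the Suzuki-Abe border following of [18]; the minimum-area threshold is illustrative. The ellipse fit used by the tracker of Section II-B is included here for completeness (cv::fitEllipse is a least-squares fit, though not necessarily the exact method of [19]):

```cpp
#include <opencv2/imgproc.hpp>
#include <vector>

std::vector<cv::RotatedRect> extractBlobs(const cv::Mat& skinMask,
                                          double minArea = 200.0)
{
    cv::Mat work = skinMask.clone();      // findContours may modify its input
    std::vector<std::vector<cv::Point>> contours;
    cv::findContours(work, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_NONE);

    std::vector<cv::RotatedRect> blobs;
    for (const std::vector<cv::Point>& c : contours) {
        if (cv::contourArea(c) < minArea) continue;   // noise component
        if (c.size() < 5) continue;                   // fitEllipse needs >= 5 points
        blobs.push_back(cv::fitEllipse(c));           // center, size, orientation
    }
    return blobs;
}
```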
B. Multiple object tracking

In [14], Argyros and Lourakis proposed a method for tracking multiple skin-colored objects that move in complex trajectories and occlude each other. We propose a similar but simpler approach, assuming that objects do not occlude each other; one object is assumed to correspond to one blob, and vice versa. First, the position of a blob is approximated by fitting an ellipse to it using a least-squares method [19]. Tracking is then done by associating detected blobs with object hypotheses according to three rules (sketched in code after Table I):

• Object hypothesis tracking: link the unassociated blob and the unassociated object whose distance is the shortest. Repeat this until running out of unassociated blobs or unassociated objects.

• Object hypothesis generation: each remaining unassociated blob is assumed to correspond to a new object moving in the scene, so a new object is created and associated with the blob.

• Object hypothesis removal: each remaining unassociated object is assumed to have disappeared; it is removed immediately if its position is near the edges of the camera's field of view. Otherwise, the object is allowed to "survive" for a period of time before being discarded, to tolerate detection failures.

An example is shown in Figure 4, where there are two objects o1, o2 and three detected blobs b1, b2, b3. Object o1 and blob b2 are associated because their distance is the shortest; then it is the turn of object o2 and blob b1. Finally, a new object is created and associated with b3. While this method is simple and fast, it works very well in the context of our problem.

Figure 4. An example of object hypotheses and skin blobs

C. Gesture recognition

Using the face detection result, the head can easily be identified among the tracked objects. The two biggest remaining objects are identified as the left and right hands based on their relative positions. The fitted ellipse of the object's associated blob is used to approximate its position, size, and orientation.

Hand palm segmentation: for interaction purposes, only the hand palm is focused on. We identify the inside circle (named the incircle) of the object, that is, the largest circle within the object, in order to estimate the hand palm position. The hand palm is thereby separated from the wrist-arm part, which usually lies far below the incircle.

Fingertip extraction: the convex hull of each hand object's blob is computed, and its vertices are considered fingertip candidates. To measure the curvature at a candidate point P, we calculate the angle formed at P by two nearby contour points P1 and P2, each separated from P by k contour points. P is classified as a fingertip if this angle is smaller than a defined threshold. The line connecting P and the midpoint of P1 and P2 indicates the orientation of the finger. The parameter k could be set to a constant, but that is not effective when the size of the object varies greatly. In our framework, we compute k = r / 2, where r is the radius of the incircle.

Figure 5. (a) Hand object, (b) segmented hand palm, with all five fingertips detected

Object motion estimation: although object motion can be deduced from the object positions over time, most applications need smoother and more accurate information. For that purpose, we adopt the pyramidal Lucas-Kanade feature tracker (KLT tracker) [20, 21], a well-known optical flow estimation method. A number of features on the object are found and tracked (Figure 6), and the motion of the object is then estimated as the median of the feature movements. This is very useful when the object is lost for a short period of time due to detection or tracking failure: its motion can still be obtained.

Figure 6. Two consecutive frames in which features (red dots) on the hand object are tracked

Hand gesture recognition: since our framework targets hand-driven control, it does not need to recognize a large vocabulary of hand postures. We do not employ any special technique for this purpose, but rather deduce the hand gesture from the extracted information about the hand palm and fingertips. For example, a hand object with five detected fingertips is classified as the "Hand Open" gesture. We define a vocabulary of gestures to recognize and test it on several captured image sequences (Table I). Code sketches of the blob-object association, the fingertip extraction, and these gesture rules follow the table.

TABLE I. THE VOCABULARY AND RULES TO RECOGNIZE GESTURES

Gesture                   Rule
Hand Open                 4 or 5 visible fingertips.
Hand Close                No visible fingertips.
Hand Point Left/Right/Up  1 visible fingertip whose orientation (with respect to the vertical axis) is in [45, 135], [225, 315], or [-45, 45] degrees, corresponding to the left, right, and up directions respectively.
Hand Pose Victory         2 visible fingertips.
Head Tilting              The head's orientation (with respect to the vertical axis) exceeds 30 degrees.
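A sketch of the greedy association rules of Section II-B; the struct, the grace period, and the border margin are illustrative choices, not values from the paper:

```cpp
#include <opencv2/core.hpp>
#include <cmath>
#include <limits>
#include <vector>

struct TrackedObject {
    cv::Point2f pos;     // last associated blob center
    int missed;          // consecutive frames without an associated blob
    bool matched;        // scratch flag used during one association pass
};

void associate(std::vector<TrackedObject>& objects,
               const std::vector<cv::RotatedRect>& blobs,
               cv::Size frame, int maxMissed = 15, float margin = 20.f)
{
    std::vector<bool> used(blobs.size(), false);
    for (auto& o : objects) o.matched = false;

    // 1. Tracking: repeatedly link the closest unassociated pair.
    for (;;) {
        float best = std::numeric_limits<float>::max();
        int bo = -1, bb = -1;
        for (int i = 0; i < (int)objects.size(); ++i) {
            if (objects[i].matched) continue;
            for (int j = 0; j < (int)blobs.size(); ++j) {
                if (used[j]) continue;
                cv::Point2f d = objects[i].pos - blobs[j].center;
                float dist = std::hypot(d.x, d.y);
                if (dist < best) { best = dist; bo = i; bb = j; }
            }
        }
        if (bo < 0) break;                 // ran out of objects or blobs
        objects[bo].pos = blobs[bb].center;
        objects[bo].matched = true;
        objects[bo].missed = 0;
        used[bb] = true;
    }

    // 2. Generation: each leftover blob spawns a new object.
    for (int j = 0; j < (int)blobs.size(); ++j)
        if (!used[j]) objects.push_back({blobs[j].center, 0, true});

    // 3. Removal: unassociated objects near the border disappear at once;
    //    others get a grace period so detection gaps do not kill the track.
    for (auto it = objects.begin(); it != objects.end(); ) {
        bool nearEdge = it->pos.x < margin || it->pos.y < margin ||
                        it->pos.x > frame.width - margin ||
                        it->pos.y > frame.height - margin;
        if (!it->matched && (nearEdge || ++it->missed > maxMissed))
            it = objects.erase(it);
        else
            ++it;
    }
}
```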
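A sketch of the fingertip extraction, under one assumption the paper does not spell out: the incircle radius is recovered as the maximum of a distance transform, a common way to find the largest inscribed circle. The angle threshold is illustrative:

```cpp
#include <opencv2/imgproc.hpp>
#include <algorithm>
#include <cmath>
#include <vector>

struct Fingertip { cv::Point tip; cv::Point2f dir; };   // dir: finger orientation

std::vector<Fingertip> findFingertips(const std::vector<cv::Point>& contour,
                                      const cv::Mat& blobMask,
                                      float maxAngleDeg = 60.f)
{
    // Incircle radius r from the distance transform; curvature step k = r / 2.
    cv::Mat dist;
    cv::distanceTransform(blobMask, dist, cv::DIST_L2, 3);
    double r = 0;
    cv::minMaxLoc(dist, nullptr, &r);
    int n = (int)contour.size();
    int k = std::max(1, std::min(n / 4, (int)(r / 2)));  // clamp against wrap-around

    std::vector<int> hull;                    // hull vertices as contour indices
    cv::convexHull(contour, hull, false, false);

    std::vector<Fingertip> tips;
    for (int i : hull) {
        const cv::Point& p  = contour[i];
        const cv::Point& p1 = contour[((i - k) % n + n) % n];
        const cv::Point& p2 = contour[(i + k) % n];
        cv::Point2f v1 = p1 - p, v2 = p2 - p;
        float cosA = (v1.x * v2.x + v1.y * v2.y) /
                     (std::hypot(v1.x, v1.y) * std::hypot(v2.x, v2.y) + 1e-6f);
        float angle = (float)(std::acos(std::max(-1.f, std::min(1.f, cosA)))
                              * 180.0 / CV_PI);
        if (angle < maxAngleDeg) {
            // Orientation: from the midpoint of P1P2 toward the tip P.
            cv::Point2f mid = 0.5f * (cv::Point2f(p1) + cv::Point2f(p2));
            tips.push_back({p, cv::Point2f(p.x - mid.x, p.y - mid.y)});
        }
    }
    // Clustered hull vertices can yield duplicate tips; in practice nearby
    // candidates should be merged (omitted for brevity).
    return tips;
}
```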
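Finally, the rules of Table I map directly to code. The names and the [0, 360) orientation convention are ours; the paper's [-45, 45] "up" range becomes >= 315 or <= 45 under that convention:

```cpp
#include <cmath>

enum class HandGesture { None, Open, Close, PointLeft, PointRight, PointUp, Victory };

HandGesture classifyHand(int tipCount, float tipOrientDeg)
{
    switch (tipCount) {
        case 0: return HandGesture::Close;            // Hand Close
        case 1:                                       // Hand Point
            if (tipOrientDeg >=  45.f && tipOrientDeg <= 135.f) return HandGesture::PointLeft;
            if (tipOrientDeg >= 225.f && tipOrientDeg <= 315.f) return HandGesture::PointRight;
            if (tipOrientDeg >= 315.f || tipOrientDeg <=  45.f) return HandGesture::PointUp;
            return HandGesture::None;
        case 2: return HandGesture::Victory;          // Hand Pose Victory
        case 4:
        case 5: return HandGesture::Open;             // Hand Open
        default: return HandGesture::None;            // 3 tips: undefined in Table I
    }
}

// Head Tilting (Table I): the head ellipse deviates from the vertical
// axis by more than 30 degrees.
bool headTilting(float headOrientDeg) { return std::fabs(headOrientDeg) > 30.f; }
```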
III. EXPERIMENTS

A. Method analysis

A prototype of the proposed framework is implemented in C++ with the OpenCV library. It runs on a computer with a 1.86 GHz processor at a rate of 20 Hz. The webcam captures image sequences at a resolution of 320x240. Practical experiments show that our framework works well in environments with little noise (i.e., few objects whose color is similar to human skin) and balanced lighting conditions. Otherwise, an offline training phase is needed.

Here, we describe a representative experiment. First, the user's face is detected and used as the input for skin color learning. After that, the skin objects in the scene, including the user's head and left hand, are detected (Figure 7a). The user begins to move his left hand along a random trajectory with varying speed, sometimes very fast. Next, the left hand leaves the field of view, and then two hands appear (Figure 7b). The user moves both hands at the same time. Then, he opens his right hand, moves it around, moves it closer to and farther away from the camera, and opens and closes some fingers (Figures 7c, 7d). Although noise occasionally appears, the head and hands are successfully detected and tracked throughout the experiment, and so are the fingertips.

Figure 7. Frames from an experimental sequence

To recognize the gestures, rules that use the extracted object features are applied. Moreover, [4] requires three criteria to be satisfied to determine the specific point in time at which a gesture takes place. Our framework makes use of the two following criteria:

• The gesture must last for at least a predefined amount of time.

• The hand performing a gesture must have a velocity smaller than a predefined threshold. This follows from the observation that it is hard for people to move a hand and change its posture at the same time; moreover, the image of an object moving at high speed is often blurred.

The four test sequences consist of 10191 frames (about six minutes in total). In these videos, a user shows different gestures with the head and one or two hands visible. The average accuracy is 94.1% (Table II). Most of the failure cases are caused by poor skin detection.

TABLE II. ACCURACY IN CONDUCTED EXPERIMENTS

Sequence   Total frames   Failed frames   Accuracy
1          2990           128             95.9%
2          1535           27              98.2%
3          1834           101             94.5%
4          3704           342             90.8%

B. Applications

Using the extracted features, the user's gestures are recognized. The framework then interprets these gestures as input to control the computer (Figure 8). Different applications can be developed using whatever gestures are suitable. Here, we introduce some applications used to test our framework.

Figure 8. Gesture recognition and interpretation flowchart

1) Controlling the computer cursor

The framework is first applied to controlling the computer cursor using the user's right hand. The right hand is detected, and its motion is retrieved and interpreted as cursor movement; the gesture of the index finger moving up is interpreted as a mouse left-button-down event. This interface allows the user to control many programs. Due to limitations when interacting with the Windows operating system, it is not as smooth as using a mouse. However, we find it comfortable when operating many programs, such as playing Solitaire (Figure 9).

Figure 9. Controlling the cursor to play Solitaire

2) Playing racing games

Racing games such as Need for Speed (http://www.needforspeed.com/) attract a lot of gamers. Our application tries to bring a more realistic feeling to gamers. The most important action during the game is controlling the wheel. In this application, the user's two hands act as if holding a "virtual wheel" (Figure 10), turning it left or right to interact with the game. The interpretation is based on the relative position of the two detected hands: we calculate the angle of the line through the two hands with respect to the horizontal axis (sketched at the end of this subsection). Based on this, we compute how far the "virtual wheel" is turned and map the action to the game accordingly. The gesture of the right hand is further used to control other actions:

• Close hand: move forward.
• Open hand: stop.
• Victory posture: move backward.

Figure 10. Hand gestures for playing racing games

Experiments show good results; playing a racing game with this interface gives a more exciting and realistic feeling.
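A sketch of the "virtual wheel" computation; mapping the resulting angle and the right-hand gestures to game input events is platform glue and is omitted, and the function name and sign convention are ours:

```cpp
#include <opencv2/core.hpp>
#include <cmath>

// Steering angle in degrees of the line through the two tracked hand
// centers, relative to the horizontal axis; 0 means the hands are level.
float wheelAngleDeg(const cv::Point2f& leftHand, const cv::Point2f& rightHand)
{
    cv::Point2f d = rightHand - leftHand;
    // Negate y because image coordinates grow downward.
    return (float)(std::atan2(-d.y, d.x) * 180.0 / CV_PI);
}
```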
3) Playing first-person shooter games

The last application is playing Counter-Strike (http://www.counter-strike.com/), a well-known first-person shooter game that originated as a modification of Half-Life (http://www.gamespot.com/pc/action/halflife/index.html). We interpret the following actions, which allow the user to play the game without a mouse or keyboard (Figure 11):

• Look and fire: the right hand simulates these actions. The user holds the right hand in the shape of the in-game gun. The motion of the hand is interpreted as looking around in the game, and thumb down/up corresponds to the begin/stop firing actions.

• Reload: the extracted head orientation is used to recognize the head tilting gesture, which is interpreted as reloading the gun.

• Move forward: the left hand is used, with index finger up/down corresponding to the move forward/stop actions.

Figure 11. Hand gestures for playing a first-person shooter game

Using these gestures, the user can move around, fire, and reload. However, it is still difficult to aim at a specific point; improvement is needed for comfortable play.

IV. CONCLUSION AND FUTURE WORK

We have presented an efficient Human-Computer Interaction framework based on skin object detection, tracking, and feature extraction. The system is able to interpret many human gestures with real-time performance. We have also demonstrated several practical applications as a basic step toward a deployed system. In the future, we would like to increase performance using multithreaded programming to obtain smoother results. More techniques can also be employed, such as hand posture recognition and eye detection and tracking, to extract more features and support more gestures. This will allow us to fully control a computer with more varied and complicated commands.

ACKNOWLEDGMENT

This work is part of the KC.01/06-10 project supported by the Ministry of Science and Technology, 2009-2010.
REFERENCES

[1] W. T. Freeman and C. D. Weissman, "Television control by hand gestures," IEEE International Workshop on Automatic Face and Gesture Recognition, Zurich, Switzerland, 1995.
[2] L. Bretzner, I. Laptev, T. Lindeberg, S. Lenman, and Y. Sundblad, "A prototype system for computer vision based human computer interaction," Technical report CVAP251, ISRN KTH NA/P-01/09-SE, Stockholm, Sweden, April 2001.
[3] Z. Zhang, "Vision-based interaction with fingers and papers," Proc. International Symposium on the CREST Digital Archiving Project, pp. 83-106, Tokyo, Japan, May 23-24, 2003.
[4] X. Zabulis, H. Baltzakis, and A. A. Argyros, "Vision-based hand gesture recognition for human-computer interaction," in The Universal Access Handbook, Lawrence Erlbaum Associates (LEA), Series on Human Factors and Ergonomics, ISBN 978-0-8058-6280-5, pp. 34.1-34.30, June 2009.
[5] V. Vezhnevets, V. Sazonov, and A. Andreeva, "A survey on pixel-based skin color detection techniques," Proc. Graphicon, pp. 85-92, Moscow, Russia, September 2003.
[6] R. Hess and A. Fern, "Discriminatively trained particle filters for complex multi-object tracking," Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2009.
[7] Z. Li, J. Chen, and N. N. Schraudolph, "An improved mean-shift tracker with kernel prediction and scale optimisation targeting for low-frame-rate video tracking," 19th Intl. Conf. on Pattern Recognition (ICPR), Tampa, Florida, 2008.
[8] C. Huang, B. Wu, and R. Nevatia, "Robust object tracking by hierarchical association of detection responses," ECCV 2008, vol. 2, pp. 788-801.
[9] F. Yan, W. Christmas, and J. Kittler, "Layered data association using graph-theoretic formulation with applications to tennis ball tracking in monocular sequences," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, pp. 1814-1830, 2008.
[10] T. Starner, J. Weaver, and A. Pentland, "Real-time American sign language recognition using desk and wearable computer based video," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 12, pp. 1371-1375, December 1998.
[11] H. Fillbrandt, S. Akyol, and K.-F. Kraiss, "Extraction of 3D hand shape and posture from image sequences for sign language recognition," IEEE International Workshop on Analysis and Modeling of Faces and Gestures, pp. 181-186, October 2003.
[12] K. Oka, Y. Sato, and H. Koike, "Real-time tracking of multiple fingertips and gesture recognition for augmented desk interface systems," IEEE Computer Graphics and Applications, vol. 22, no. 6, pp. 64-71, November-December 2002.
[13] R. Y. Wang and J. Popović, "Real-time hand-tracking with a color glove," ACM Transactions on Graphics, vol. 28, no. 3, article 63, August 2009.
[14] A. A. Argyros and M. I. A. Lourakis, "Real time tracking of multiple skin-colored objects with a possibly moving camera," Proc. European Conference on Computer Vision (ECCV'04), Springer-Verlag, vol. 3, pp. 368-379, Prague, Czech Republic, May 11-14, 2004.
[15] P. Viola and M. J. Jones, "Rapid object detection using a boosted cascade of simple features," IEEE CVPR, 2001.
[16] R. Lienhart, A. Kuranov, and V. Pisarevsky, "Empirical analysis of detection cascades of boosted classifiers for rapid object detection," DAGM'03, 25th Pattern Recognition Symposium, Magdeburg, Germany, pp. 297-304, September 2003.
[17] F. Gasparini and R. Schettini, "Skin segmentation using multiple thresholding," Internet Imaging VII, vol. 6061 of Proceedings of SPIE, pp. 1-8, San Jose, CA, USA, January 2006.
[18] S. Suzuki and K. Abe, "Topological structural analysis of digitized binary images by border following," Computer Vision, Graphics, and Image Processing, vol. 30, no. 1, pp. 32-46, April 1985.
[19] Z. Zhang, "Parameter estimation techniques: a tutorial with application to conic fitting," Image and Vision Computing, vol. 15, pp. 59-76, 1996.
[20] J.-Y. Bouguet, "Pyramidal implementation of the Lucas Kanade feature tracker: description of the algorithm," Intel Corporation, Microprocessor Research Labs.
[21] "KLT: An implementation of the Kanade-Lucas-Tomasi feature tracker," http://www.ces.clemson.edu/~stb/klt/.