Active Hand Tracking

Jérôme Martin, Vincent Devin and James L. Crowley
GRAVIR - IMAG, INRIA Rhône-Alpes
655 av. de l'Europe, 38330 Montbonnot, FRANCE
E-mail: Jerome.Martin@imag.fr

Abstract

This paper describes a system which uses multiple visual processes to detect and track hands for gesture recognition and human-computer interaction. The system is based on an architecture in which a supervisor selects and activates visual processes. Each process provides a confidence factor which makes it possible for the system to dynamically reconfigure itself in response to events in the scene. Visual processes for hand tracking based on image differencing and normalized histogram matching are described. The result of hand detection is used by a recursive estimator (Kalman filter) to provide an estimate of the position and size of the hand. The resulting system provides robust and precise tracking which operates continuously at approximately 5 images per second on a 150 megahertz Silicon Graphics Indy.

1 Computer Vision for Perceptual User Interface

Krueger [7], followed by Wellner and Mackay [11], have shown how human gesture can be used to define natural modes of interaction with computer systems. The techniques described by Krueger and by Wellner were simple concept demonstrations; as such, they were fragile and worked only in constrained environments.

We are exploring methods for building perceptual user interfaces which are robust and reliable. Our primary means of obtaining reliability is the integration of multiple computer vision techniques in a system which dynamically reconfigures itself in response to its own estimates of the confidence in its results. In our systems, we employ computer vision techniques with low computational complexity, so that they can operate in real time. We favor techniques with clear mathematical foundations, for which it is possible to predict the proper operating conditions and to generate an on-line estimate of the probability of success. Such techniques have often been rejected in the past because they were not universal. We maintain that no individual vision technique will be universal; reliability and robustness to changes in operating conditions can be achieved by combining multiple complementary techniques. Such combination requires dynamically estimating the confidence in the output of each process and dynamically reconfiguring the system as operating conditions change.

In our work, two aspects of gesture are explored: hand tracking and gesture recognition. This paper deals with hand tracking; information on gesture recognition can be found in [8]. Our search for complementary processes has led us to use diverse detection techniques such as normalized color histograms, multi-dimensional histograms [10] and normalized energy cross-correlation [4]. Such techniques have been rejected in the past because they are known to fail under certain circumstances. In the context of a system which integrates multiple processes, such failure is not a problem.

Locating and normalizing a hand is a tracking process. Several methods can be used for tracking, each with different success and failure conditions. A reliable tracking system can be obtained by integrating and coordinating several complementary tracking processes. Integration and coordination are performed using a synchronous architecture in which a supervisor selects, activates and provides parameters for visual processes. The system includes a control part, described as a graph using combination operators and modifiers, and a decision part, which reorganizes the control in reaction to changes in the confidence factor.

The following section presents the visual process architecture used for the system and describes techniques for estimating a confidence factor for detection. A tracking process based on a recursive estimator is formulated as a zeroth-order Kalman filter. Visual processes based on image differencing and skin-color detection are described. The results of experimental measurements of precision, probability of error and computation time are presented.
2 Integration of Visual Processes for Tracking

The hand tracking system described in this paper is based on an architecture in which a supervisor activates and coordinates a number of reactive visual processes (see figure 1). The SERVP (Synchronous Ensemble of Reactive Visual Processes) architecture [2] was developed as a synchronous approach to the integration of processes for active vision. The SERVP architecture has been used in the construction of several systems, including appearance-based navigation [1], tracking faces for video communications [3], and detection and tracking of individuals [5].

For the work described in this paper, we use a software skeleton named Chord [6, 12], which is based on the SERVP architecture. The Chord system includes facilities for both synchronous processing on a single host and asynchronous processing on multiple hosts. Chord also includes operators for combining and managing processes.

[Figure 1. Synchronous Ensemble of Reactive Visual Processes: a supervisor (control and decision) drives the image processing actions performed by the visual processes.]

In order to control the execution and the relationships of processes, Chord defines a set of control operators and modifiers [6]: a sequential operator, a concurrent operator, a conditional sequential operator, a conditional concurrent operator, a watchdog, a loop, negation, and synchronous/asynchronous modifiers.

The decision part of the architecture is expressed as a set of rules which are executed by forward chaining and which react to commands as well as to messages from the visual processes. The supervisor receives messages from the visual processes concerning processing state and visual events. These messages are encoded as items in a working memory. Such items trigger rules which can modify process parameters as well as the control graph. The visual events used in the system described below are based on the confidence factor which each process generates as part of its output; other events are possible for the recognition of objects or actions. In our example, visual processes pass information to a tracking process which maintains an estimate of the center point and the size of the hand. This tracking process is implemented as a recursive estimator based on a Kalman filter. Figure 3 shows the current control graph.

2.1 Visual Processes

A visual process is graphically represented as a box with input and output ports, as shown in figure 2. Processes read data (images and parameters) from input ports, perform computations, and then write results to output ports. Processes react to commands such as initialization, starting and stopping execution, and termination. Processes can generate event messages based on the results of computation, including exception conditions which trigger the supervisor to modify the system configuration. A process can be run for one or more cycles; at the end of execution, the process communicates a message of success or failure to the supervisor.

[Figure 2. A generic model of a visual process: commands and data in; state, data and events out.]

[Figure 3. The control graph of our system in the Chord formalism: a loop over the sequence (seq) of Image Processing, Tracking and Decision.]

In the graph of figure 3, circles represent Chord operators. The graph executes a loop on the sequence of three visual processes: Image Processing, Tracking and Decision. The Tracking process, represented by a rounded box, is itself described by a Chord control graph, and is composed of low-level visual processes (image differencing, skin detection by color, and cross-correlation, followed by the recursive estimator process).
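Chord's actual operators and message syntax are defined in [6]; purely as an illustration of the execution model described above, the following Python sketch shows one way a supervisor could drive reactive processes under a loop(seq(...)) control graph. All names (VisualProcess, Supervisor, loop_seq) and the CF threshold are our own assumptions, not the Chord API.

    # Illustrative sketch of the SERVP/Chord execution model; the class and
    # function names are invented for this example and are not the Chord API.
    from dataclasses import dataclass, field
    from typing import Callable, Dict, List

    @dataclass
    class Result:
        success: bool              # success/failure message for the supervisor
        confidence: float          # confidence factor (CF) in [0, 1]
        data: dict = field(default_factory=dict)

    class VisualProcess:
        """A reactive visual process: read inputs, compute, write outputs."""
        def __init__(self, name: str, compute: Callable[[dict], Result]):
            self.name = name
            self.compute = compute

        def run(self, inputs: dict) -> Result:
            return self.compute(inputs)

    class Supervisor:
        """Reacts to process messages by forward chaining over a rule set."""
        def __init__(self, rules: List[Callable[[Dict, "Supervisor"], None]]):
            self.rules = rules
            self.working_memory: Dict = {}

        def handle(self, process: VisualProcess, result: Result) -> None:
            # Encode the message as working-memory items, then fire the rules.
            self.working_memory["PROCESS"] = process.name
            # The CF threshold below is an assumption for illustration.
            self.working_memory["EVENT"] = ("CF_low" if result.confidence < 0.5
                                            else "CF_ok")
            for rule in self.rules:
                rule(self.working_memory, self)

    def loop_seq(supervisor: Supervisor, processes: List[VisualProcess],
                 cycles: int) -> None:
        """loop(seq(...)): run the sequence of processes for several cycles."""
        shared: dict = {}
        for _ in range(cycles):
            for proc in processes:          # Image Processing, Tracking, Decision
                result = proc.run(shared)
                shared.update(result.data)
                supervisor.handle(proc, result)

A rule in this sketch is simply a function over working memory; section 3.1 gives the form of rule the real system uses to toggle between differencing processes.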
2.2 Estimating the Confidence Factor of Visual Processes

The primary visual event used for control in the system described below is the confidence factor, denoted CF. Confidence is represented by a numerical value between 0 (no confidence) and 1 (certainty). The CF is an estimate of the likelihood that a successful detection was achieved, and is computed as a probability using a pre-trained sample of correct detections. During system set-up, a large number of correct detections are identified for each process. The mean μ_s and the covariance matrix Σ_s of these examples define the parameters for likely correct detections. Given these parameters, the probability of observing a particular parameter vector given a correct detection can be estimated by a Gaussian probability law. We use an unnormalized probability as an indicator of the probability of a correct observation, given the observed vector Y. This is computed as:

CF_Y = e^{-\frac{1}{2}(Y - \mu_s)^T \Sigma_s^{-1} (Y - \mu_s)}

2.3 The Tracking Process

The tracking process is formulated as a recursive estimation process (see figure 4). Such a process maintains an estimate of a state vector and its uncertainty. Because a hand will often undergo very fast accelerations, we do not attempt to estimate hand velocities. Thus our tracking process is a zeroth-order estimation system, with a state vector composed of the position (x, y) and bounding box (w, h) of the hand.

[Figure 4. The tracking process is a zeroth-order recursive estimator for position and size: the prediction (X̂*, C_X*) is validated against the observation (Y, C_Y) and updated to produce the estimate (X̂, C_X).]

The covariance of the four parameters is a 4x4 covariance matrix. However, if we assume that errors in position are independent of errors in size, then the covariance matrix can be separated into two independent 2x2 covariance matrices.
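As a concrete reading of sections 2.2 and 2.3, the sketch below computes the confidence factor from a trained sample of correct detections and implements the zeroth-order predict/update cycle. The process-noise and measurement-noise parameters are assumptions; the paper does not give their values.

    # Sketch of the confidence factor (section 2.2) and the zeroth-order
    # recursive estimator (section 2.3). Noise parameters are assumed values.
    import numpy as np

    def train_cf(samples: np.ndarray):
        """Estimate mu_s and inverse Sigma_s from correct detections (N x d)."""
        mu = samples.mean(axis=0)
        sigma = np.cov(samples, rowvar=False)
        return mu, np.linalg.inv(sigma)

    def confidence_factor(y, mu, sigma_inv) -> float:
        """CF_Y = exp(-1/2 (Y - mu_s)^T Sigma_s^{-1} (Y - mu_s))."""
        d = y - mu
        return float(np.exp(-0.5 * d @ sigma_inv @ d))

    class ZerothOrderEstimator:
        """Tracks X = (x, y, w, h); velocities are deliberately not estimated."""
        def __init__(self, x0: np.ndarray, p0: float, q: float):
            self.x = x0.astype(float)   # state estimate (position and size)
            self.P = np.eye(4) * p0     # state covariance
            self.Q = np.eye(4) * q      # per-cycle growth of uncertainty

        def predict(self) -> None:
            # Zeroth order: the prediction is the previous state with
            # inflated uncertainty (no velocity model).
            self.P = self.P + self.Q

        def update(self, y: np.ndarray, R: np.ndarray) -> None:
            # Kalman update with an identity observation model.
            K = self.P @ np.linalg.inv(self.P + R)   # Kalman gain
            self.x = self.x + K @ (y - self.x)
            self.P = (np.eye(4) - K) @ self.P

In the system described here, the update would be applied only when the observation is validated by a sufficient confidence factor; that gate is left to the caller in this sketch. Under the independence assumption above, the 4x4 matrices could equally be kept as two 2x2 blocks.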
3 Visual Processes for Tracking Hands

Robust tracking can be obtained by driving several detection processes. This section briefly describes processes for detecting hands using image differencing and skin detection. More details can be found in [9].

3.1 Image differencing

Two different image differencing processes are included in our system. The first computes differences between successive images. The difference in luminance of pixels from two successive images is close to zero for pixels of the background; by choosing and maintaining an appropriate threshold, moving objects are detected within a static scene. As drawbacks, intrusive moving objects can also be detected, and a static hand leads to a failure of the detection. The second process computes the difference in luminance between the live image and a pre-stored background image. It always detects the hand, but also the arm and some shadows. Whereas differencing of two successive images detects only part of the hand, differencing from a background image provides a large region of interest. This is why our system toggles between these two processes depending on the confidence factor. Assuming that the differencing-from-successive-images process is the currently active process, a static hand leads to a small bounding box and consequently to a small confidence factor. An event is generated and the supervisor starts the differencing-from-background process. Such a decision can be formulated as a control rule:

    (EVENT CF_low)
    (PROCESS differencing_from_successive_images)
    => (stop differencing_from_successive_images)
       (start differencing_from_background)

In this rule, EVENT and PROCESS are messages, while start and stop are commands. When something is detected using the differencing-from-background process, the supervisor toggles back to the differencing-from-successive-images process.

3.2 Skin detection

Skin can easily be detected using a normalized color vector [10]. The color pixels are normalized by dividing out the luminance component. This removes the effect of changes in the orientation of the skin surface with respect to a light source. The remaining 2-vector, composed of (r, g), is nearly constant for skin regions. A 2-D histogram h(r, g) of the pixels from a region containing skin will show a strong peak at the skin color. We use a sample of N pixels, obtained automatically when another process succeeds in detection, to initialize the normalized color histogram. The histogram makes it possible to use table lookup to estimate the conditional probability of observing a color vector C⃗ = (r, g) given that the pixel corresponds to skin:

p(\vec{C} \mid \text{skin}) = \frac{1}{N} \, h(r, g)

Regions where this probability is above a low threshold are detected and described using connected components analysis, to provide possible parameters for hand tracking.
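A minimal sketch of the skin-detection process of section 3.2 follows, assuming 8-bit RGB input; the histogram resolution (32 bins per axis) and the detection threshold are illustrative choices, not values from the paper.

    # Sketch of normalized-color skin detection (section 3.2). BINS and the
    # threshold are illustrative assumptions.
    import numpy as np

    BINS = 32  # quantization of the (r, g) chrominance plane

    def chromaticity(image: np.ndarray) -> np.ndarray:
        """Divide out luminance: (R, G, B) -> (r, g) = (R, G) / (R + G + B)."""
        rgb = image.astype(float)
        lum = rgb.sum(axis=2) + 1e-6           # avoid division by zero
        return rgb[..., :2] / lum[..., None]   # shape (H, W, 2), values in [0, 1]

    def train_histogram(skin_pixels: np.ndarray) -> np.ndarray:
        """Build (1/N) h(r, g) from an (N, 2) array of skin chromaticities."""
        idx = np.clip((skin_pixels * BINS).astype(int), 0, BINS - 1)
        hist = np.zeros((BINS, BINS))
        np.add.at(hist, (idx[:, 0], idx[:, 1]), 1.0)
        return hist / len(skin_pixels)         # p(C | skin) by table lookup

    def skin_probability(image: np.ndarray, hist: np.ndarray) -> np.ndarray:
        """Look up p(C | skin) for every pixel of an RGB image."""
        rg = chromaticity(image)
        idx = np.clip((rg * BINS).astype(int), 0, BINS - 1)
        return hist[idx[..., 0], idx[..., 1]]

    def detect_skin(image, hist, threshold: float = 0.01) -> np.ndarray:
        """Binary skin mask; connected components are then extracted from it."""
        return skin_probability(image, hist) > threshold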
4 Experiments

This section presents experiments on the different visual processes. For these experiments, we use a database of 100 images from a sequence of images of a moving hand.

4.1 Evaluating precision

For each image of the database, the bounding box determined automatically is compared to a bounding box selected by hand, to measure the error in the position (x, y) of the center and in the size (w, h). Figures 5 and 6 give the errors in position and size for the visual processes: differencing with a static background image, successive-image differencing, skin-color detection, and the combination of background differencing and skin-color detection. The combined process consists of computing an approximate bounding box enclosing the hand using background differencing, and then performing skin detection within that bounding box.

These experiments show that the bounding box obtained by differencing successive images is very imprecise. Differencing between the live image and a static background image tends to be somewhat imprecise, but better than differencing of successive images. Skin detection using normalized color histograms and the combination of background differencing with skin-color detection are the most precise, with approximately the same results.

[Figure 5. Error of position in x and y (in pixels) for each visual process.]

[Figure 6. Error of size in width and height (in pixels) for each visual process.]

4.2 Evaluating computation time

In order to compare computation times, we made experiments on an entire image of size 192 by 144 pixels; results are shown in figure 7. Visual processes using image differencing are faster than processes using skin-color detection. The remaining method is the one combining image differencing and skin-color detection: in addition to the computation time of the image differencing, the skin-color detection is computed on a sub-image, so the overall computation time depends on the size of the sub-image.

[Figure 7. Computation time for each visual process (images of 192 by 144 pixels).]
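The combined process evaluated above (background differencing to obtain a region of interest, then skin detection restricted to that sub-image) could be sketched as follows. It reuses the hypothetical detect_skin from the previous sketch, and the differencing threshold is again an assumed value.

    # Sketch of the combined process of sections 4.1-4.2: background
    # differencing yields a rough bounding box, and skin detection runs
    # only inside it. The threshold is an assumed value.
    import numpy as np

    def luminance(image: np.ndarray) -> np.ndarray:
        """Approximate luminance as the mean of the RGB channels."""
        return image.astype(float).mean(axis=2)

    def background_roi(image, background, threshold: float = 20.0):
        """Bounding box of pixels differing from the stored background image."""
        moving = np.abs(luminance(image) - luminance(background)) > threshold
        ys, xs = np.nonzero(moving)
        if len(xs) == 0:
            return None                # nothing detected: a CF_low event
        return xs.min(), ys.min(), xs.max(), ys.max()

    def combined_detection(image, background, hist):
        """Run skin detection only inside the region found by differencing."""
        roi = background_roi(image, background)
        if roi is None:
            return None
        x0, y0, x1, y1 = roi
        sub = image[y0:y1 + 1, x0:x1 + 1]
        return detect_skin(sub, hist)  # from the skin-detection sketch above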
4.3 Execution of Visual Processes

In order to verify the control graph, we processed a sequence of images including hand movement, important changes in hand size, and appearance and disappearance of the hand. Figure 8 shows a small part of the total sequence.

[Figure 8. Part of the execution of the visual processes: the active detection processes (successive-images differencing, background-image differencing, skin detection) over time, with five numbered parts.]

In this figure, five parts have been numbered, corresponding to successive images in the sequence. In part (1), the successive-images-differencing process is executed first. Because there is no moving object in the scene, an event is generated. This event activates the background-image-differencing process, followed by the skin-color-detection process. In part (2), the hand has moved enough to be tracked by the successive-images-differencing process, giving a normal bounding box. Because the bounding box is small, the computation for skin detection is inexpensive. At stage (3), the hand is closed (during the training of the recursive estimator, the hand was always open); consequently, the bounding box does not correspond to the expected one. The successive-images-differencing process fails, and the background-image-differencing process is activated. During parts (4) and (5), the hand is moving with correct size, and the successive-images-differencing process is active.

5 Conclusion

The integration of complementary visual processes can produce a reliable and robust system. Integration and coordination are provided by the Chord architecture, an extension of the Synchronous Ensemble of Reactive Visual Processes model developed in the VAP project [5]. Coordination of visual processes requires signaling visual events to the supervisor. Such visual events correspond to the confidence factor reported by each visual process according to its results.

Experiments on the visual processes were performed in order to build the control graph according to the strengths and weaknesses of the visual processes. Those experiments lead to the conclusion that the region of interest can be reduced if the visual processes are called sequentially. In the near future, new visual processes will be added to the current system; active contours and cross-correlation can be used as tracking processes. This system is part of a larger gesture-recognition project [9].

6 Acknowledgments

This work has been partially supported by France Telecom CNET (Project COMEDI).

References

[1] C. Andersen, S. Jones, and J. Crowley. Appearance based processes for visual navigation. In 5th International Symposium on Intelligent Robotic Systems (SIRS'97), pages 227-236, Royal Institute of Technology, Stockholm, Sweden, July 1997.
[2] J. Crowley and J. Bedrune. Integration and control of reactive visual processes. In J.-O. Eklundh, editor, European Conference on Computer Vision (ECCV'94), volume 801 of Lecture Notes in Computer Science, Stockholm, Sweden, May 1994. Springer Verlag.
[3] J. Crowley and F. Bérard. Multi-modal tracking of faces for video communications. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR'97), pages 640-645, San Juan, Puerto Rico, June 1997.
[4] J. Crowley, F. Bérard, and J. Coutaz. Finger tracking as an input device for augmented reality. In International Workshop on Automatic Face and Gesture Recognition, Zurich, Switzerland, June 1995.
[5] J. L. Crowley and H. I. Christensen, editors. Vision as Process. ESPRIT Basic Research Series. Springer Verlag, 1995.
[6] S. Jones. Robust Task Achievement. PhD thesis, Institut National Polytechnique de Grenoble, GRAVIR - IMAG, May 1997.
[7] M. Krueger. Artificial Reality II. Addison-Wesley, 1991.
[8] J. Martin and J. Crowley. Un système visuel de reconnaissance de gestes (A visual system for gesture recognition). Submitted to RFIA'98.
[9] J. Martin and J. Crowley. An appearance-based approach to gesture recognition. In A. Del Bimbo, editor, International Conference on Image Analysis and Processing, number 1311 in Lecture Notes in Computer Science, Florence, Italy, September 17-19, 1997. Springer Verlag.
[10] B. Schiele and A. Waibel. Estimation of the head orientation based on a face-color intensifier. In 3rd International Symposium on Intelligent Robotic Systems (SIRS'95), July 10-14, 1995.
[11] P. Wellner, W. Mackay, and R. Gold. Computer-augmented environments: back to the real world. Special issue of Communications of the ACM, 36(7), 1993.
[12] B. Zoppis. Outils pour l'Intégration et le Contrôle en Vision et Robotique Mobile (Tools for Integration and Control in Vision and Mobile Robotics). PhD thesis, Institut National Polytechnique de Grenoble, GRAVIR - IMAG, June 1997.