Head Pose Estimation of Partially Occluded Faces
Markus T. Wenzel and Wolfram H. Schiffmann
University of Hagen (Germany), Department of Computer Science
Markus.Wenzel | Wolfram.Schiffmann@FernUni-Hagen.de
Abstract
This paper describes an algorithm which calculates the
approximate head pose of partially occluded faces without training or manual initialization. The presented approach works on low-resolution webcam images.
The algorithm is based on the observation that for
small depth rotations of a head the rotation angles can
be approximated linearly. It uses the CamShift (Continuously Adaptive Mean Shift) algorithm to track the user's head. With a pyramidal implementation of an iterative
Lucas-Kanade optical flow algorithm, a certain feature
point in the face is tracked. Pan and tilt of the head are
estimated from the shift of the feature point relative to
the center of the head. 3D position and roll are estimated
from the CamShift results.
1. Introduction
Immersive systems may display their graphical user interface on a Head Mounted Display
(HMD). In this setup head pose estimation provides a valuable Human-Computer Interface (HCI), but all head
pose estimation algorithms known to the authors assume a completely visible face. Therefore navigation
using head movement detection has not yet been presented as an HCI in this setting.
For head pose estimation, face detection in images is an important prerequisite, because it limits the computational effort to the area exhibiting the face. Extensive research on face detection has already been conducted, with different objectives and frameworks. For use in human-computer interfaces, real-time performance of both face detection and head pose estimation is mandatory.
This paper focuses on real-time head pose estimation in video images. At startup the algorithm detects
the head region using the temporal change between
subsequent images. From that head region a cutout
containing skin color is used to initialize the CamShift
algorithm, which tracks the head henceforth. In the same cutout region the mouth is searched for by a combination of heuristics giving its most probable location and a set of image filters responding to the typical light-dark intensity pattern characteristic of the mouth.
2. Related Work
Head pose estimation has become a broad area of research in recent years. Most researchers group
the various approaches with respect to the underlying method into model-based, appearance-based, and
feature-based approaches [4, 16].
Model-based algorithms achieve good results but require training. Among these are Active Appearance
Models (AAMs); see for example [9]. These are nonlinear parametric models derived from linear transformations of a shape model and an appearance model. Also,
Neural Network based algorithms can be trained to
distinguish between different persons or to distinguish
poses of one person's face. The basic research was done
by Rowley et al. [12]. Principal Component Analysis
(PCA) based algorithms in general need the broadest
training set. Their training is most often done in the way first proposed by Turk and Pentland [15], whose work made PCA applicable to images. As they are statistically inspired, PCA approaches are sometimes counted among the appearance-based approaches.
The main disadvantage of all AAMs, Neural Network approaches and PCA methods is their lack of universality. Their detection rate and accuracy decrease as soon as the head does not match the model. This will happen if the user puts on an HMD (Head Mounted Display). In the same manner, the performance of any Neural Network based vision system trained for a specific user decreases when confronted with unknown users.
Appearance-based approaches use filtering and image segmentation techniques to extract information
from the image. Furthermore, optical flow algorithms
can be considered as appearance-based approaches.
Among the widely used filters are edge detectors and in
recent times Gabor wavelets. No head pose estimation algorithms that are purely appearance-based are known to the authors. On the other hand, filtering and segmentation play significant roles in head pose estimation.
Gabor wavelets are widely known as good feature detectors. These are complex sinusoids with a Gaussian envelope. Some approaches use them in an appearance-based manner in a preprocessing step [10]. Krüger, in contrast, developed an approach in which the weights of a Gabor wavelet network directly represent the orientation of the face; Gabor-based head pose estimation is described in [16]. Its disadvantage, however, is the computational effort involved in fitting the network to the probe image. Moreover, such networks have to be trained and are very user-specific.
The algorithm proposed in this paper uses the Gabor wavelet transform for parts of the face in order to
search for the mouth.
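For reference, one common parameterization of a 2D Gabor kernel (the paper does not state which form it uses) is

$$g(x, y) = \exp\!\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right) \exp\!\left(i\left(2\pi\frac{x'}{\lambda} + \psi\right)\right), \quad x' = x\cos\theta + y\sin\theta, \quad y' = -x\sin\theta + y\cos\theta,$$

where $\theta$ sets the orientation, $\lambda$ the wavelength of the sinusoid, $\psi$ its phase offset, $\sigma$ the width of the Gaussian envelope, and $\gamma$ the spatial aspect ratio.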
The majority of feature-based algorithms use the
eyes as features as these are easy to detect by symmetry and because of their prominent appearance. The
nostrils often serve as trackable features as well, but
they become invisible as soon as the user tilts his head
downwards. The mouth can also be found easily, provided it is not covered by a mustache or a beard. Several authors use a set of these features to estimate a
3D head orientation.
Fitzpatrick shows another feature based approach
to head pose estimation without manual initialization
[5]. For feature detection and tracking he searches for
the cheapest paths across the face region, whereby the
cost of a path depends on the darkness of crossed pixels. The paths will therefore avoid dark regions. A pair
of avoided regions is supposed to be the pair of eyes.
The algorithm is thus dependent on the visibility of
the eyes. He then estimates head pose based mainly on
head outline and eye position.
Gorodnichy shows a way to track the tip of the nose
[7]. His approach can almost be called model free. He
uses the resemblance of the tip of the nose to a sphere
with diffuse reflection. This template is searched in the
image. He does not use it for head pose estimation, but
simply tracks the nose tip across the video images.
The idea presented in this paper resembles his approach in that an individual feature is tracked in the
images.
3. Pose Estimation with Partial Occlusion
The intent of this paper is to present an algorithm that is capable of adapting to every user and especially to faces partially occluded by an HMD (see Figure 1). No initial setup or training is needed.
Figure 1. User with HMD
The algorithm delivers approximations of head roll, tilt and pan (see Figure 2). The head position in the image plane is given
directly, and the z-axis translation can be estimated.
The pose of a face may be depicted as a 3D vector $\vec{d}$ with its origin between the eyes. For head pose estimation two pieces of information are needed: the 3D head position, given at the middle of the head, and the position of a feature of the face.
The pose is then a linear transformation of the vector $\vec{d}$ in Figure 2. The user may define a certain position to be his "straight ahead" direction. The vector obtained in this position is stored and used as a reference. Then, for small angles of depth rotation, the rotation angle can be estimated linearly. Head roll can be estimated from the rotation of an ellipse in the image plane approximating the head outline. This will be described below.
Estimating mixed rotations around all three axes is
still challenging, because every position is ambiguous.
This problem is not discussed in this paper, but remains open for further research.
The algorithm has two phases we call the initialization phase and the working phase (Figure 3).
Initialization extracts the face region from the video
sequence and finds a salient feature point to track. The
proposed method aims at finding the mouth as a prominent feature. If the mouth is invisible or cannot be detected, the algorithm will track another feature which,
in most cases, is the tip of the nose or one of the nostrils. In case of failure the working phase returns to the
initialization phase.
3.1. Initialization phase
The initialization phase needs head movement to detect the face of a user. Therefore, a state called motion
detection starts the initialization phase when motion
in the picture occurs. The initialization phase searches for the user's head.
Figure 2. Basic idea
In the estimated head region the algorithm searches for a salient trackable feature. It passes on the coordinates and the size of a region to CamShift. During the working phase the CamShift algorithm is used to track the biggest object in the picture that is of this region's dominant color.
Motion Detection. For the face detection step to
succeed, the head must have moved far enough. This
is ensured by calculating the difference between a base image $I(t_0)$ and subsequent images $I(t_1), I(t_2), \dots$ from the video stream until the difference $I(t_0) \ominus I(t_k)$ for a certain $k \geq 1$ exceeds a threshold $\tau$. In the prototype implementation, the $\ominus$-operator is implemented as a sequence of operations: first, an XOR image (pixel-wise and per channel) is built. Morphological opening (erosion then dilation) and closing (dilation then erosion) operations account for closing holes and deleting small outliers. From the remaining image only intensity information is kept. This intensity image represents the most significant motion with the brightest color. In this way thresholding further removes noise.
Face Detection. After the above operation, a contour search gives the largest remaining BLOBs (binary large objects) in the image. Given the assumption that only head movement occurs during initialization, the biggest BLOBs can be considered to represent
the head. The image is saved for later steps.
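The following sketch illustrates the motion detection and face detection steps, assuming OpenCV's Python bindings (the prototype used the C library [11]); the absolute difference stands in for the per-channel XOR, and the kernel size and threshold are illustrative values, not taken from the paper:

import cv2
import numpy as np

def detect_moving_head(base, current, tau=40):
    """Sketch of the paper's difference operator: frame difference,
    morphological open/close, thresholding, largest contour."""
    diff = cv2.absdiff(base, current)              # stand-in for pixel-wise XOR
    gray = cv2.cvtColor(diff, cv2.COLOR_BGR2GRAY)  # keep intensity only
    kernel = np.ones((5, 5), np.uint8)
    gray = cv2.morphologyEx(gray, cv2.MORPH_OPEN, kernel)   # delete outliers
    gray = cv2.morphologyEx(gray, cv2.MORPH_CLOSE, kernel)  # close holes
    _, mask = cv2.threshold(gray, tau, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None                   # not enough motion yet; try next frame
    biggest = max(contours, key=cv2.contourArea)
    return cv2.boundingRect(biggest)  # (x, y, w, h) of the head region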
The next step requires a skin color probe. Care must
be taken not to include the HMD in this extracted skin
color probe, because it may reflect arbitrary colors and
thus invalidate the results. We assume that a small region in the lower half of the moving part will mainly contain skin. The coordinates and the size of this color probe region are given by simple heuristic assumptions about face biometrics. They are relative to the size and location of the biggest BLOBs in the above image.
Figure 3. Details of algorithm
Skin color segmentation is done in the "normed green" color space. The algorithm calculates a normed green image from the color camera image by applying

$$g_{\text{norm}} = \frac{G}{R+G+B+1}$$

to every RGB (Red Green Blue) pixel of the image. A color histogram of the above color probe region is then calculated in the $g_{\text{norm}}$ image. Thresholding the $g_{\text{norm}}$ image within a tight band around the maximum color value of the histogram gives
the locations of skin regions. The coordinates and the
size of the color probe region are passed on to CamShift
as well.
Skin color tones are tightly clustered in the $r_{\text{norm}}/g_{\text{norm}}$ color space [13]. Experiments show that in the given context the $g_{\text{norm}}$ part is sufficient. The transformation into this space is inexpensive in terms of computational effort. It shows more reliable results than thresholding the hue channel in the HSI (Hue Saturation Intensity) color space.
Figure 4. Estimated head; search region for mouth
Figure 5. Filtering the mouth search region
Pixel-wise AND-concatenation of the biggest BLOBs image with the binary thresholded $g_{\text{norm}}$ image gives a refined picture of the moving head region. Figure 4 shows the result (white areas).
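A sketch of this normed-green segmentation, assuming NumPy arrays in OpenCV's BGR channel order; the bin count and the band width are illustrative choices, not taken from the paper:

import numpy as np

def skin_mask(image_bgr, probe_rect, band=8):
    """Threshold the normed-green channel within a tight band around
    the histogram peak of the skin color probe region."""
    b, g, r = [c.astype(np.float32) for c in np.moveaxis(image_bgr, -1, 0)]
    # g_norm = G / (R + G + B + 1), scaled to 0..255 for histogramming
    gnorm = (255.0 * g / (r + g + b + 1.0)).astype(np.uint8)
    x, y, w, h = probe_rect
    hist, _ = np.histogram(gnorm[y:y+h, x:x+w], bins=256, range=(0, 256))
    peak = int(np.argmax(hist))       # dominant normed-green value of skin
    return (gnorm >= peak - band) & (gnorm <= peak + band)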
Position and size of a bounding box of these segments serve as the starting point for the search for a salient trackable feature point. The mouth search region is found based on an initial heuristic estimation of its most probable height: we assume 2/5 of the height of the bounding box, located in its lower region. The width is that of the bounding box. The resulting initial mouth search region is displayed in Figure 4 (outermost box). This box is shrunk to a size containing mostly white pixels (innermost box in Figure 4).
Mouth Search. The mouth is a salient trackable feature, because it exhibits contrasting regions like edges
and corners. Simple heuristics give a rough estimate of
its position (the mouth search region). A near-frontal position of the user's face is assumed during initialization. Figure 4 shows the segments displaying the
head (white areas) and subsequent mouth search regions (grey rectangles, starting with the largest and
ending with the innermost).
A series of filters is then applied to the innermost region of the image. Figure 5 gives an example of a mouth
search region (top image in figure) and the series of intermediate results. Note that the size of this region is
variable.
The algorithm starts with a Sobel operator to enhance intensity changes (second from top). The Sobel
filter works with the first y derivative and an aperture
size of seven pixels.
Then the algorithm applies a horizontally oriented Gabor filter to the result. Its frequency matches the smooth vertical intensity gradient of the lips, so that higher and lower frequencies in the image are reduced. For the result see the middle picture.
Then a box three times higher than the Gabor phase width is shifted across the result image. The standard deviation of every possible box position codes the intensity of a corresponding pixel in a result image (second from bottom); Figure 6 illustrates this. Here the resulting image is normalized for better visibility of the results. The box is currently of fixed size, but can be made dependent on the size of the mouth search region.
A second fixed-size box is shifted across the image,
this time with the expected size of the mouth. The box
with the greatest sum of standard deviations (i.e. highest pixel intensities) is considered to be the most likely
mouth region. In Figure 5, bottom image, this region
is marked with a rectangle.
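The filter chain can be sketched as follows with OpenCV; only the Sobel aperture of seven pixels is fixed by the paper, so the Gabor parameters and box dimensions are illustrative guesses:

import cv2
import numpy as np

def find_mouth(region_gray, box_w=40, box_h=12):
    """Score each position in the mouth search region and return the
    center of the most likely mouth box."""
    # 1. Sobel filter, first derivative in y, aperture 7 (as in the paper).
    edges = cv2.Sobel(region_gray, cv2.CV_32F, dx=0, dy=1, ksize=7)
    # 2. Horizontally oriented Gabor filter tuned to the smooth vertical
    #    intensity gradient of the lips (parameters are guesses).
    gabor = cv2.getGaborKernel((21, 21), sigma=4.0, theta=np.pi / 2,
                               lambd=10.0, gamma=0.5)
    filtered = cv2.filter2D(edges, cv2.CV_32F, gabor)
    # 3. Standard deviation of a sliding box via E[x^2] - E[x]^2.
    mean = cv2.boxFilter(filtered, cv2.CV_32F, (box_w, box_h))
    mean_sq = cv2.boxFilter(filtered * filtered, cv2.CV_32F, (box_w, box_h))
    std = np.sqrt(np.maximum(mean_sq - mean * mean, 0))
    # 4. A mouth-sized box with the greatest summed standard deviation;
    #    a second (unnormalized) box filter sums locally, its maximum
    #    marks the most likely mouth region.
    score = cv2.boxFilter(std, cv2.CV_32F, (box_w, box_h), normalize=False)
    _, _, _, max_loc = cv2.minMaxLoc(score)
    return max_loc                    # (x, y) center of the mouth estimate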
The approach performs best for near-horizontal pictures of the mouth, but is still accurate for head roll of up to 10°. Head pan of up to 30° does not influence performance, and the same is true for head tilt.
3.2. Working phase
During the working phase the algorithm performs three tasks. Head tracking is done with CamShift. Feature tracking is accomplished with an iterative Lucas
Kanade optical flow algorithm in image pyramids. The
results of both are combined for head pose estimation.
Each task is described below.
Head Tracking. The head is tracked using the
CamShift algorithm. CamShift is able to track every object with a distinct hue value. We initialize it as
described above to track the moving object in the image.
CamShift is an adaptation of the Mean Shift algorithm for face tracking. Mean Shift is a non-parametric technique that climbs the gradient of a color distribution to find its mode (peak) [1].
CamShift augments the Mean Shift algorithm by adding a variably sized search region that gives rise to its name: Continuously Adaptive Mean Shift.
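A minimal working-phase tracking loop, assuming OpenCV's Python bindings (the prototype used the C library [11]). The paper speaks of back-projecting a hue histogram but initializes CamShift from the normed-green probe; the sketch uses the normed-green channel, which is our interpretation, and bin count and termination criteria are illustrative:

import cv2
import numpy as np

def normed_green(image_bgr):
    """g_norm = G / (R + G + B + 1), scaled to the 0..255 range."""
    b, g, r = [c.astype(np.float32) for c in np.moveaxis(image_bgr, -1, 0)]
    return (255.0 * g / (r + g + b + 1.0)).astype(np.uint8)

def track_head(video, track_window):
    """CamShift loop; `track_window` is the (x, y, w, h) color probe
    region handed over by the initialization phase."""
    ok, frame = video.read()
    x, y, w, h = track_window
    gn = normed_green(frame)
    # Histogram of the color probe, back-projected on every later frame.
    hist = cv2.calcHist([gn[y:y+h, x:x+w]], [0], None, [64], [0, 256])
    cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
    term = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1.0)
    while True:
        ok, frame = video.read()
        if not ok:
            break
        back = cv2.calcBackProject([normed_green(frame)], [0], hist,
                                   [0, 256], 1)
        # rot_rect holds center, axes lengths and rotation angle of the
        # tracked ellipse: roll and a z estimate follow from it.
        rot_rect, track_window = cv2.CamShift(back, track_window, term)
        yield rot_rect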
Figure 6. Shifting boxes across the image
The position of the tracked face is given by the centroid of the color distribution. It is calculated from the moments of the distribution, as given by the formulas below. Utilizing the size of the search region, an estimate of the z-axis position can be obtained.
Let $W$ be a search window with image points $I(x, y)$. The centroid of the distribution $\vec{\mu} = (\mu_x, \mu_y)^T$ is then given by

$$\mu_x = \frac{M_{10}}{M_{00}}; \qquad \mu_y = \frac{M_{01}}{M_{00}} \tag{1}$$

with $M_{ij}$ being the $i$-th moment in $x$ direction and $j$-th moment in $y$ direction. $M_{ij}$ is defined as

$$M_{ij} = \sum_x \sum_y x^i y^j I(x, y).$$

The orientation of the color distribution can be calculated using the zeroth to second moments. CamShift delivers the center of a rotated ellipse and its axes' lengths.
CamShift works by first back-projecting the hue histogram of the color probe onto the current image. Then it calculates the first and second moments of the color probability distribution. These give the lengths of the first and second axis and the ellipse's rotation angle.
For implementation details of CamShift, see [3]. The implementation used in the prototype application is included in the Intel OpenCV computer vision library [11]. Some of its shortcomings are explored in [1].
Feature Tracking. For feature tracking the algorithm uses an enhanced optical flow algorithm as proposed by Bouguet [2]. It can be used to track features in a video sequence.
The basis of the feature tracking approach is an iterative Lucas Kanade optical flow algorithm. It works on image pyramids to suit the task of robust feature tracking. The fundamental Lucas Kanade optical flow algorithm was proposed by Lucas and Kanade in [8].
Figure 7. Image pyramid, levels 0-3
Let $W \subset J$ be a window of size $(2w_x + 1) \times (2w_y + 1)$. Let its center be the coordinates $\vec{u} = (u_x, u_y)^T$ of a feature point in the preceding picture $J$. Let $I$ be the current picture.
Then optical flow algorithms try to find a window $W'$ in $I$ of equal size to $W$ that is most "similar" to $W$. Similarity may be measured by the least-squares error between the two images. In this case estimating optical flow means minimizing $\epsilon$ in

$$\epsilon(\vec{d}\,) = \epsilon(d_x, d_y) = \sum_{x=u_x-w_x}^{u_x+w_x} \; \sum_{y=u_y-w_y}^{u_y+w_y} \big( I(x, y) - J(x + d_x, y + d_y) \big)^2 \tag{2}$$

by finding the optimal image velocity $\vec{d}$ concerning the image point $\vec{u} = (u_x, u_y)^T$.
The Lucas Kanade algorithm estimates $\vec{d}$ based on discrete estimations of spatial intensity derivatives in the current image. Assuming small displacement vectors, a first-order Taylor series expansion about the feature point can substitute $J(x + d_x, y + d_y)$ in the above Equation 2, giving

$$\frac{\partial \epsilon(\vec{d}\,)}{\partial \vec{d}} \approx -2 \sum_{x=u_x-w_x}^{u_x+w_x} \; \sum_{y=u_y-w_y}^{u_y+w_y} \left( I(x,y) - J(x,y) - \begin{bmatrix} \frac{\partial J}{\partial x} & \frac{\partial J}{\partial y} \end{bmatrix} \vec{d}\, \right) \begin{bmatrix} \frac{\partial J}{\partial x} & \frac{\partial J}{\partial y} \end{bmatrix} \tag{3}$$

Note that this is only valid for small displacement vectors $\vec{d}$.
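Setting this derivative to zero yields a closed-form solution; the following system follows Bouguet's derivation [2] and is not spelled out in the paper. With $\delta I = I(x,y) - J(x,y)$, $J_x = \partial J/\partial x$ and $J_y = \partial J/\partial y$, Equation 3 leads to the $2 \times 2$ linear system

$$G\,\vec{d} = \vec{b}, \qquad G = \sum_{x,y} \begin{bmatrix} J_x^2 & J_x J_y \\ J_x J_y & J_y^2 \end{bmatrix}, \qquad \vec{b} = \sum_{x,y} \begin{bmatrix} \delta I \, J_x \\ \delta I \, J_y \end{bmatrix},$$

so a single Lucas Kanade step computes $\vec{d} = G^{-1}\vec{b}$, with the sums running over the window $W$.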
The iterative Lucas Kanade algorithm iterates over this step, taking $\vec{d}$ from the previous step as an initial guess for the next step. This introduces greater accuracy but alone cannot overcome the limitation to small displacement vectors.
Figure 8. Displacement of feature with respect to face center
Bouguet [2] suggests the usage of image pyramids to improve on this limitation. Image pyramids are pyramids of subsampled images with successively lowered resolution. For a 160×120 base image (zeroth pyramid level), the first level image has a size of 80×60, etc. Figure 7 shows the zeroth (top) to third pyramid levels of an image, created with a 5×5 Gaussian filter.
The iterative Lucas Kanade in image pyramids conducts iterative Lucas Kanade steps in each level of an image pyramid, starting at the highest level, i.e. with the lowest resolution. The result of the lower-resolution level is propagated to the next higher-resolution pyramid level and used as an additive offset.
This enables iterative Lucas Kanade to track optical flow with large displacements and great accuracy.
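OpenCV exposes Bouguet's pyramidal tracker as calcOpticalFlowPyrLK; a sketch of tracking a single feature point with it, where window size and pyramid depth are illustrative choices:

import cv2
import numpy as np

def track_feature(prev_gray, curr_gray, point):
    """Track one feature point from the previous to the current frame
    with the pyramidal iterative Lucas Kanade algorithm."""
    p0 = np.array([[point]], dtype=np.float32)   # shape (1, 1, 2)
    p1, status, err = cv2.calcOpticalFlowPyrLK(
        prev_gray, curr_gray, p0, None,
        winSize=(15, 15),   # integration window (2*wx+1, 2*wy+1)
        maxLevel=3,         # pyramid levels 0..3, as in Figure 7
        criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 0.03))
    if status[0][0] == 0:
        return None         # feature lost; return to initialization phase
    return tuple(p1[0][0])  # new (x, y) position of the feature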
Head Pose Estimation. Figure 8 sketches the idea underlying the depth rotation approximation. The center of the head (i.e. the centroid of the color probability distribution as given by CamShift) is marked with a green spot. The tracked feature is marked by a red spot. The arrow gives the relative shift of the red spot with respect to the "straight ahead" position of the mouth. From the current shift relative to the initial shift, which marks the straight ahead direction, the current depth rotation is estimated. Currently, only a linear approximation is calculated.
In the figure, the "straight ahead" mouth position is
marked for clarity. In the prototype immersive system,
all estimated rotation parameters and z translation are
pictured by compass windows (see Figure 9).
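The linear approximation itself is a small computation. A sketch, assuming the reference shift was stored when the user defined his "straight ahead" direction; the gains k_pan and k_tilt (degrees per pixel) are hypothetical calibration constants not given in the paper:

def estimate_pan_tilt(feature, head_center, ref_shift, k_pan=0.5, k_tilt=0.5):
    """Linearly approximate pan and tilt (in degrees) from the shift of
    the tracked feature relative to the head center, compared against
    the stored "straight ahead" reference shift."""
    shift_x = feature[0] - head_center[0]
    shift_y = feature[1] - head_center[1]
    pan = k_pan * (shift_x - ref_shift[0])    # horizontal shift -> pan
    tilt = k_tilt * (shift_y - ref_shift[1])  # vertical shift -> tilt
    return pan, tilt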
Figure 9. Demo Application with Compass Windows
4. Results
The algorithm performs at far more than camera frame rate on a 1,066 MHz notebook. The tests were conducted both on still images and live video sequences captured by a Logitech QuickCam 3000 and aimed at evaluating quality and speed of the algorithm's prototype implementation.
Assessment of Speed. All tests on still images were conducted by using two pictures I and J that were presented alternately to the different parts of the algorithm.
Concerning the initialization phase, no effort was made to optimize the performance of the implementation. Nevertheless, the initialization phase calculated about 25 pose estimations per second.
In the working phase, CamShift aims primarily at
high speed. With the described setup and given that
the area tracked covers about a quarter of the 160×120
picture, it performs at about 500 pictures per second.
The iterative Lucas Kanade feature tracker, tracking a single point in the image, performs at roughly 3000 pictures per second regardless of the displacement of the feature point.
When both algorithms were combined, a rate of
about 400 pictures per second was measured. Thus, real-time pose estimation can be achieved by means of commodity hardware.
Assessment of Quality. The initialization phase was
tested for quality with various image pairs showing the
face of the user in slightly altered positions (approximately 5° change in pan angle).
The initialization phase proved to be insensitive to lighting conditions. Long hair is also tolerable as long as certain thresholding values in motion detection are altered slightly. Most importantly, the usage of an HMD does not influence the results of initialization. The algorithm thus achieves the goals it is designed for.
Severely disturbing for initialization, however, are moving objects in the scene that are not the user's head, even more so if they are of skin color. The initialization phase will focus on whatever moves in the scene and try to detect a mouth in the moving parts. Due to the model-free nature of the implementation, no checks are currently possible to account for this.
We then tested the overall performance quality in a small demonstration application (see Figure 9). A panoramic picture is presented to the user via the HMD. By panning and nodding his head, he may alter the viewport onto the panorama.
During the working phase it is apparent that the linear approximation of the angle does not provide enough accuracy to feel fully immersed. Also, it is necessary to dampen the pose and orientation estimates. Both CamShift and the Lucas Kanade feature tracker sometimes pass on suddenly changing results that lead to a jittering impression of image movement on the output device.
Damping diminishes this impression, but with the downside of a delayed system reaction. Although the reaction lags behind head rotation by only a fifth of a second, this can be disturbing.
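The paper does not specify the damping scheme; a simple exponential moving average is one plausible sketch, with the smoothing factor alpha as a hypothetical tuning parameter:

def dampen(prev_smoothed, new_estimate, alpha=0.3):
    """Exponential moving average: a smaller alpha gives smoother output
    but a longer lag behind the true head motion."""
    return (1.0 - alpha) * prev_smoothed + alpha * new_estimate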
To overcome the restrictions in accuracy introduced by the linear approximation of rotation values, we consider tracking more than one feature in order to apply a weak-perspective projection model for head pose estimation.
Also, as mentioned before, the problem of estimating mixed rotations around two or three axes remains unsolved. With more feature points tracked, ambiguities might be overcome.
5. Further Work
Because reliable detection of the mouth region is
crucial to pose estimation, one task is to improve the
initialization phase of the algorithm. Incorporation of recent research results making use of methods resembling human eye movement (saccadic search, see [14]) promises more exact and more robust estimations. For those approaches to work, a model of the feature and consequently a training step is necessary. Current research suggests that user-specific training may not be mandatory. Therefore this seems to be a promising approach.
Also, a modification to CamShift would enable it to adapt to changes of the tracked color. This can improve tracking results. The methods suggested in [1, 6] promise better color segmentation in difficult lighting conditions and just-in-time adaptation to changing hue values of the tracked object.
6. Conclusion
The purpose of this paper was to show how a user’s
head pose can be tracked in a video stream when parts
of his face are occluded. We presented an algorithm
composed of two dedicated phases, initialization and
working phase, that achieved this goal.
During initialization, the face is detected from subsequent pictures by its motion. A color probe is drawn
and given to CamShift for tracking during working
phase. The mouth is searched for with a combination of
heuristics and filtering. Heuristics give the most probable region where the mouth is expected. The mouth is
then detected by applying a series of filters to this region.
In the working phase, head and feature position are tracked over time. The shift of the feature point relative to both the "straight ahead" feature position and the center of the head gives the user's head pose. The algorithm estimates six degrees of freedom (rotation and translation with respect to all three spatial dimensions). The algorithm is usable as an HCI component in immersive systems, as was shown in a demonstration application.
References
[1] J. G. Allen, R. Y. D. Xu, and J. S. Jin. Object tracking using CamShift algorithm and multiple quantized feature spaces. In Conferences in Research and Practice in Information Technology, volume 36, 2003.
[2] J.-Y. Bouguet. Pyramidal implementation of the Lucas Kanade feature tracker. OpenCV Documentation.
[3] G. R. Bradski. Computer vision face tracking as a component of a perceptual user interface. In Workshop on Applications of Computer Vision, pages 214–219, Princeton, NJ, Oct. 1998. Microcomputer Research Lab, Santa Clara, CA, Intel Corporation.
[4] L. M. Brown and Y.-L. Tian. Comparative study of coarse head pose estimation. Technical report, IBM T.J. Watson Research Center, Hawthorne, NY, 2002.
[5] P. Fitzpatrick. Head pose estimation without manual initialization. Technical report, AI Lab, MIT, Cambridge, USA, 2000.
[6] A. R. Francois. Real-time multi-resolution blob tracking. Technical Report IRIS-04-423, Institute for Robotics and Intelligent Systems, University of Southern California, July 2004.
[7] D. O. Gorodnichy. On importance of nose for face tracking. In Proceedings of the International Conference on Automatic Face and Gesture Recognition (FG2002), NRC 45854, pages 188–196, Washington, DC, USA, May 2002.
[8] B. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In Proc. of 7th International Joint Conference on Artificial Intelligence (IJCAI), pages 674–679, 1981.
[9] I. Matthews and S. Baker. Active appearance models revisited. International Journal of Computer Vision, 60(2):135–164, Nov. 2004.
[10] S. J. McKenna and S. Gong. Real-time face pose estimation. Real-Time Imaging, 4(5):333–347, 1998.
[11] OpenCV. Sourceforge.net OpenCV project homepage. http://sourceforge.net/projects/opencvlibrary.
[12] H. Rowley, S. Baluja, and T. Kanade. Neural network-based face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1):23–38, Jan. 1998.
[13] S. K. Singh, D. S. Chauhan, M. Vatsa, and R. Singh. A robust skin color based face detection algorithm. Tamkang Journal of Science and Engineering, 6(4):227–234, 2003.
[14] F. Smeraldi and J. Bigun. Retinal vision applied to facial features detection and face authentication. Pattern Recognition Letters, 23:463–475, 2002.
[15] M. Turk and A. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71–86, 1991.
[16] Y. Wei, L. Fradet, and T. Tan. Head pose estimation using Gabor eigenspace modeling. Technical report, National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Science, 2001.