Xi Zhao
13 September 2010
3D Face Analysis:
Landmarking, Expression Recognition and beyond
Prof. Bulent Sankur, Université Bogazici, Reviewer
Prof. Maurice Milgram, Université UMPC, Reviewer
Prof. Alice Caplier, Université INP, Examiner
Prof. Dimitris Samaras, Université Stony Brook, Examiner
Prof. Mohamed Daoudi, Université Telecom Lille, Examiner
Prof. Liming Chen, Ecole Centrale de Lyon, Thesis supervisor
Dr. Emmanuel Dellandréa, Ecole Centrale de Lyon, Thesis co-supervisor
1 Introduction
1.1 Research topic
1.2 Problems and objective
1.3 Our approach
1.4 Our contributions
1.5 Organization of the thesis
2 3D Face Landmarking
2.1 Introduction
2.2 Related works
2.2.1 Face landmarking in 2D
2.2.2 Face landmarking in 3D
2.2.3 Discussion
2.3 A 2.5D face landmarking method
2.3.1 Methodology
2.3.2 Experimental results
2.3.3 Conclusion
2.4 A 3D face landmarking method
2.4.1 Statistical facial feature model
2.4.2 Locating landmarks
2.4.3 Occlusion detection and classification
2.4.4 Experimentations
2.4.5 Conclusion
2.5 Conclusion on 3D face landmarking
2.12 Correlation meshes from two viewpoints. These meshes actually lie in a
four-dimensional space, where the first three dimensions are x, y, z
and the last one is the correlation value. In these figures, we display the
correlation values instead of z. (a) and (b) show the same correlation
mesh from two viewpoints, describing the similarity between texture
(intensity) instances from SFAM and the texture (intensity) of the given
face. (c) and (d) show the correlation mesh describing the similarity
between shape (range) instances from SFAM and the face shape (range). Red
corresponds to high correlation and blue to low correlation.
2.13 Different types of occlusion: a) occlusion in the mouth region, b)
occlusion in the ocular region, c) occlusion caused by glasses.
2.14 SFAM learnt from the FRGCv1 dataset: first variation modes of the
landmark configuration, local texture and local shape. The first mode of
morphology explains the landmark configuration variations in terms
of face size; the first mode of texture explains the intensity variation,
especially in the eye region; the first mode of shape explains the geometry
variation in the upper part of the face.
2.15 SFAM learnt from the Bosphorus dataset: variations of the first two
morphology modes. The first variation mode mostly explains the face
morphology changes along the vertical direction, while the second
variation mode explains the face morphology changes along the
horizontal direction.
2.16 SFAM learnt from the Bosphorus dataset: variations of the first two local
texture modes. The first variation mode mostly explains the facial
texture changes due to different skin colors, while the second variation
mode explains the facial texture changes in the eye and mouth regions.
2.17 SFAM learnt from the Bosphorus dataset: variations of the first two local
geometry modes. The first variation mode mostly explains the face
geometry changes in the lower part of the face, while the second variation
mode explains face geometry changes in the upper part of the face.
2.18 Cumulative error distribution of the precision for the 15 landmarks
using FRGCv1 (a) and FRGCv2 (b).
2.19 Landmark locating examples from the FRGC dataset.
2.20 Landmarking examples from the BU-3DFE dataset with expressions
of anger (a), disgust (b), fear (c), joy (d), sadness (e) and surprise (f).
2.21 Landmarking accuracy on different expressions with the BU-3DFE
dataset. (1: left corner of left eyebrow, 2: middle of left eyebrow, 3:
right corner of left eyebrow, 4: left corner of right eyebrow, 5: middle
of right eyebrow, 6: right corner of right eyebrow, 7: left corner of left
eye, 8: right corner of left eye, 9: left corner of right eye, 10: right
corner of right eye, 11: left nose saddle, 12: right nose saddle, 13: left
corner of nose, 14: nose tip, 15: right corner of nose, 16: left corner
of mouth, 17: middle of upper lip, 18: right corner of mouth, 19:
middle of lower lip).
3.18 Block diagram of the BBN for expression and AU recognition.
3.19 Examples of facial AUs.
3.20 Feature extraction.
3.21 LBP operator. The circular (8,1), (16,2), and (8,2) neighborhoods.
The pixel values are bilinearly interpolated whenever the sampling
point is not in the center of a pixel.
3.22 Multi-scale LBP extracted from the local texture and range map of a 3D
face scan. The first row shows LBP features extracted from texture
and the second row shows LBP features extracted from range. The
third row gives the (P,R) values of the corresponding columns.
3.23 Shape index computed on local grids of a face.
3.24 Flow chart of the automatic facial expression/AU recognition system.
3.25 ROC curves for the 16 AUs on the Bosphorus database. The area
under the ROC curve is given in brackets. (Part 1)
3.26 ROC curves for the 16 AUs on the Bosphorus database. (Part 2)
3.27 Two examples of local grid configuration (number and size).
This doctoral thesis is dedicated to the automatic analysis of 3D faces, including
landmark detection and facial expression recognition. Indeed, facial expression
plays an important role in verbal and non-verbal communication, as well as in
expressing emotions. Automatic facial expression recognition therefore offers many
opportunities and applications, and is in particular at the heart of "intelligent"
human-centered human-machine interfaces. Moreover, the automatic detection of
facial landmarks (corners of the mouth and eyes, ...) allows the localization of
facial elements, which is essential for many facial analysis methods such as face
segmentation and the feature extraction used, for example, for expression
recognition. The objective of this thesis is thus to develop approaches for
landmark detection on 3D faces and for facial expression recognition, in order to
finally propose a fully automatic solution for recognizing facial activity,
including expressions and action units.
In this work, we have proposed a Bayesian Belief Network (BBN) for recognizing
facial expressions as well as action units. A Statistical Facial feAture Model
(SFAM) has also been developed to locate the landmarks on which our BBN relies,
so that a fully automatic facial expression recognition system can be set up. Our
main contributions are the following. First, we have proposed a deformable partial
face model, named SFAM, based on principal component analysis. This model learns
both the global variations in the relative positions of the facial landmarks (face
configuration) and the local variations in terms of texture and shape around
each landmark. Different partial face instances can thus be produced by varying
the values of the model parameters. Secondly, we have developed a facial landmark
localization algorithm based on the minimization of an objective function
describing the correlation between the instances of the SFAM model and the query
faces. Thirdly, we have designed a Bayesian Belief Network (BBN) whose structure
describes the dependency relationships among subjects, expressions and facial
features. Facial expressions and action units are then modelled as the states of
the node corresponding to the expression variable and are recognized by
identifying the maximum belief over all states. We have also proposed a new
approach for inferring the BBN parameters using a facial feature model that can be
considered as an extension of SFAM. Finally, in order to enrich the information
used for 3D face analysis, and particularly for facial expression recognition, we
have also developed a 3D face descriptor, named SGAND, to characterize the
geometric properties of a point with respect to its neighbourhood in the point
cloud representing a 3D face.
The effectiveness of these methods has been evaluated on the FRGC, BU3DFE and
Bosphorus databases for landmark localization, as well as on the BU3DFE and
Bosphorus databases for the recognition of facial expressions and action units.
This Ph.D. thesis is dedicated to automatic facial analysis in 3D, including
facial landmarking and facial expression recognition. Indeed, facial expression plays
an important role both in verbal and non-verbal communication, and in expressing
emotions. Thus, automatic facial expression recognition has various purposes
and applications, and in particular is at the heart of "intelligent" human-centered
human/computer (robot) interfaces. Meanwhile, automatic landmarking provides
prior knowledge of the location of face landmarks, which is required by many face
analysis methods such as face segmentation and feature extraction used, for instance,
for expression recognition. The purpose of this thesis is thus to develop 3D
landmarking and facial expression recognition approaches, in order to finally propose
an automatic facial activity (facial expression and action unit) recognition solution.
In this work, we have proposed a Bayesian Belief Network (BBN) for recognizing
facial activities, such as facial expressions and facial action units. A Statistical
Facial feAture Model (SFAM) has also been designed to first automatically locate
face landmarks so that a fully automatic facial expression recognition system can
be formed by combining the SFAM and the BBN. The key contributions are the
following. First, we have proposed to build a morphable partial face model, named
SFAM, based on Principal Component Analysis. This model learns both
the global variations in face landmark configuration and the local ones in terms of
texture and local geometry around each landmark. Various partial face instances
can be generated from SFAM by varying the model parameters. Secondly, we have
developed a landmarking algorithm based on the minimization of an objective function
describing the correlation between model instances and query faces. Thirdly, we
have designed a Bayesian Belief Network with a structure describing the causal
relationships among subjects, expressions and facial features. Facial expressions or
action units are modelled as the states of the expression node and are recognized
by identifying the maximum belief among all states. We have also proposed a novel
method for BBN parameter inference using a statistical feature model that can be
considered as an extension of SFAM. Finally, in order to enrich the information used
for 3D face analysis, and particularly 3D facial expression recognition, we have also
developed a 3D face feature, named SGAND, to characterize the geometric properties
of a point on a 3D face mesh using its surrounding points.
The effectiveness of all these methods has been evaluated on the FRGC, BU3DFE
and Bosphorus datasets for facial landmarking, as well as on the BU3DFE and Bosphorus
datasets for facial activity (expression and action unit) recognition.
Chapter 1
Introduction
The human face contains important and rich visual information for identification and
communication, particularly for expressing emotion. Thus, analysing the human face
benefits a wide variety of applications, from public security to personal emotion
understanding, and from human-computer interfaces (HCI) to robotics.
A problem of interest in face analysis is facial expression recognition.
Indeed, traditional HCI that neglects facial expression excludes important
information that could prompt a computer or robot to initiate proactive and socially
appropriate behaviour during the communication process. This interaction between
humans and computers/robots is computer-based: it emphasizes the transmission of
explicit information from text, voice and gestures but ignores implicit information
about the user. However, studies on the human interaction paradigm suggest that facial
expression contributes more than 50 percent to the effect of the spoken message as
a whole, while the verbal part of a message contributes less than 10 percent.
Moreover, facial expression is one of the bases for understanding
the human emotional state, which is expressed through the contraction of facial muscles
resulting in changes in facial appearance and geometry. Therefore, facial expression
is an important cue for understanding emotions, and its automatic recognition is a
fundamental step towards "intelligent" human/computer interaction.
Among various aspects of face analysis, we mainly focus on face landmarking and
facial expression recognition problems in 3D. Indeed, landmarking, which consists
in automatically detecting points of interest on the face, is a fundamental step for
further processing and is an important part in an automatic facial analysis system,
particularly for facial expression recognition which is the second important contri-
bution of our work. Moreover, we have proposed contributions related to new 3D
face features, and to face tracking in 2D videos for people counting.
Since images and videos are easily accessible, 2D face analysis, including face
detection, tracking, recognition and expression recognition, has been at the heart
of many research works for several years. However, head pose and illumination
variations pose strong hurdles for these problems. Recently, the 3D face has emerged
as a promising solution in face processing and analysis. There are several reasons
why the 3D face has attracted so much interest. Firstly, 3D facial geometry is
invariant to pose and illumination conditions, so that exploiting 3D geometry can
tackle various problems encountered by 2D face analysis. Secondly, the 3D face
carries ample information on both geometry and texture. This helps to improve
recognition accuracy and the analysis of subtle facial motions. Finally, exploring
the relationship between the texture and geometry of 3D faces provides auxiliary
means for 2D face analysis, making it possible to reconstruct 3D shape from a 2D face.
While 3D faces are theoretically reputed to be insensitive to lighting condition
changes, they still need to be pose-normalized and correctly registered for further
face analysis. As most existing registration techniques assume that some 3D
face landmarks are available, their reliable localization is crucial. There exist
many landmarking methods in 2D, such as PDM, ASM, AAM and CLM, which
can be applied to the texture maps of 3D faces. Landmarking on 3D faces is then
achieved by mapping those 2D points to 3D face meshes. However, landmarking
directly on the texture of 3D faces still encounters the pose and lighting problems.
Moreover, 3D face scans synthesized by stereovision systems generally do not
include this kind of mapping. Thus, most existing 3D landmarking methods are
based on 3D geometry. They achieve high accuracy when locating shape-salient
landmarks such as the nose tip. Unfortunately, the accuracy decreases dramatically when
they attempt to locate other landmarks such as mouth corners or eyebrow corners,
which either lie on non-rigid regions of the face or do not have a salient local
shape. Consequently, the number of landmarks provided is limited and the locating
accuracy is not robust to conditions that change the face geometry, such as
expression and occlusion. Therefore, our objective in this work is to develop an
approach that is robust to expression and occlusion and makes use of both shape and
texture properties, so that a larger number of shape-salient and non-shape-salient
landmarks can be located.
Human face properties are carried by rigid parts corresponding to the skeleton and
by non-rigid parts corresponding to face muscles, whose configuration mainly depends
on emotion. Since the variations of these properties influence both face shape and
texture, the problem of interest becomes how to learn these variations and
apply them to face analysis. Statistical models have been widely used for this
purpose: they learn the major variations of the face by Principal Component Analysis
(PCA) and synthesize new face instances by combining the learnt variation modes.
This process can be understood as the construction of a face space whose bases are
the eigenvectors of the variations obtained by PCA, so that each new face instance
corresponds to a point in this space. Previous statistical models of 3D faces are
generally built on the whole face, such as the 3D Morphable Model. The drawback is
that they cannot be properly fitted to new face scans when faces are partially
occluded and thus have local shape deformations. In order to solve this problem, we
propose in this thesis to build statistical face models from local regions configured
by a global morphology, and apply them to face landmarking and expression recognition.
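The idea of a PCA face space can be illustrated with a minimal sketch. It is not the SFAM implementation: the function names (learn_pca_model, synthesize, project), the 95% variance threshold and the random toy data are assumptions made only for illustration. Each training sample is a flattened landmark configuration; a new instance is the mean plus a weighted combination of the learnt variation modes.

```python
import numpy as np

def learn_pca_model(samples, variance_kept=0.95):
    """Learn the mean and the principal variation modes of row-vector samples."""
    mean = samples.mean(axis=0)
    centered = samples - mean
    # The right singular vectors of the centered data are the eigenvectors
    # of the sample covariance matrix.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    eigenvalues = singular_values ** 2 / (samples.shape[0] - 1)
    # Keep the smallest number of modes reaching the requested variance share
    # (the 0.95 default is an illustrative assumption).
    ratio = np.cumsum(eigenvalues) / eigenvalues.sum()
    k = int(np.searchsorted(ratio, variance_kept) + 1)
    return mean, vt[:k].T, eigenvalues[:k]

def synthesize(mean, modes, b):
    """Generate a new instance x = mean + P b from mode coefficients b."""
    return mean + modes @ b

def project(mean, modes, x):
    """Coordinates of an instance x in the learnt space: b = P^T (x - mean)."""
    return modes.T @ (x - mean)

# Toy training set (hypothetical data): 50 configurations of 5 landmarks in 3D,
# flattened to 15-D vectors and driven by two hidden factors plus small noise.
rng = np.random.default_rng(0)
base = rng.normal(size=15)
directions = rng.normal(size=(2, 15))
coeffs = rng.normal(size=(50, 2)) * np.array([3.0, 1.0])
samples = base + coeffs @ directions + 0.01 * rng.normal(size=(50, 15))

mean, modes, eigenvalues = learn_pca_model(samples)

# Move two standard deviations along the first variation mode.
b = np.zeros(modes.shape[1])
b[0] = 2.0 * np.sqrt(eigenvalues[0])
instance = synthesize(mean, modes, b)
```

Projecting the synthesized instance back with project(mean, modes, instance) recovers the coefficients b, which is what makes each face instance a point in the learnt space.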
Specifically, for face landmarking, to overcome the drop in accuracy when locating
non-shape-salient landmarks, we consider both geometry and texture information
so that all landmarks can be characterized distinctively. The statistical models are
learnt from training faces showing all of the universal expressions in order to
capture expression variations. By doing so, face instances with expression can be
generated for landmarking on expressive faces. Moreover, our statistical models
are built from local regions, so that they remain applicable to partially occluded
faces by fitting on the unoccluded face regions.
els. By combining this graphical model with our automatic landmarking approach,
we are able to implement a fully automatic and efficient system for facial expression
The two main contributions of this PhD thesis concern 3D face landmarking and 3D
facial expression recognition. Moreover, we also contribute new 3D face features,
and face tracking in 2D videos containing drastic changes in face scale for people
counting.
3D face landmarking: Landmarking is essential for most face processing
and analysis methods. We aim to locate a large number of feature points on
3D faces under challenging conditions, i.e., expression and occlusion, so that
automatically detected landmarks can be used for facial expression analysis. We have
intentionally chosen landmarks that can be used for both face registration and facial
expression recognition. To do so, we have proposed a 3D statistical facial feature
model (SFAM), which learns by PCA both the global variations in 3D face morphology
and the local ones around each landmark in terms of local texture and local geometry.
By varying the control parameters of the corresponding sub-models in the
SFAM, we can thus generate different 3D partial face instances. The fitting of SFAM
to an input face is based on the optimization of an objective function describing the
correlation between model instances and faces. It also contains a set of parameters
modeling local occlusion, for which we proposed an automatic detection method.
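The correlation term that drives such a fitting can be illustrated with a small sketch. It is not the objective function of the thesis: it assumes a zero-mean normalized cross-correlation as the similarity measure, a random array standing in for a range map, and hypothetical function names. A local model instance (here a template) is compared with every candidate position, and the position with the highest correlation is retained.

```python
import numpy as np

def normalized_correlation(a, b):
    """Zero-mean normalized cross-correlation of two equally sized patches."""
    a = a.astype(float).ravel() - a.mean()
    b = b.astype(float).ravel() - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def correlation_map(image, template):
    """Correlate the template with every valid window of the image."""
    th, tw = template.shape
    h, w = image.shape
    out = np.empty((h - th + 1, w - tw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            window = image[r:r + th, c:c + tw]
            out[r, c] = normalized_correlation(window, template)
    return out

# Hypothetical data: a random "range map" and a local instance cut out of it.
rng = np.random.default_rng(1)
range_map = rng.normal(size=(40, 40))
template = range_map[12:19, 20:27].copy()

scores = correlation_map(range_map, template)
# Position (top-left corner of the window) with the highest correlation.
best = np.unravel_index(np.argmax(scores), scores.shape)
```

In this toy setting the best position is the one the template was cut from. In a full fitting procedure, the correlation values obtained for the texture and the shape instances of all landmarks would be combined into a single objective.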
Facial expression recognition: In order to flexibly add expression/action unit
classes and combine features from different representations (morphology, texture and
shape) to improve the recognition rate, we have designed a Bayesian Belief Network
(BBN) for 3D facial expression recognition. The structure of the BBN describes the
causal relationships among subject, expression and facial features. Facial expressions
or action units are modeled as the states of the expression node and are recognized
by identifying the maximum belief among all states, inferred from the feature
evidence on the target face. Different from the other graphical models for expression
Chapter 1. Introduction
about facial landmarking. Facial landmarking methods in both 2D and 3D are
presented in the related works section. Then, we propose our 2.5D facial landmarking
approach in the following section. Section 2.4 presents the SFAM with a fitting
algorithm for 3D face landmarking. An occlusion detection and classification method
is also proposed there so that this approach is applicable to landmarking on partially
occluded faces. We draw our conclusion on facial landmarking at the end of this
chapter.
Chapter 3 covers 3D facial expression recognition. We start the chapter by
introducing the development of facial expression analysis and the motivations for
facial expression recognition. Section 3.2 presents the emotion theories, facial
expression properties and facial expression interpretations. We then review the
state of the art in the classification of facial expressions in both 2D and 3D. In
section 3.4, we present a 3D facial expression recognition approach based on a local
geometry-based feature. This feature, named SGAND, is proposed in conjunction
with a pose estimation algorithm for 3D faces, since the face direction is required
for the feature extraction. Section 3.5 presents our graphical model, the BBN, for
recognizing facial expressions and AUs with a uniform structure. A fully automatic
facial expression recognition system is also presented in this section. Conclusions
are drawn in section 3.6.
Chapter 4 presents a minor contribution: a people counting system which is
based on face detection and tracking. The related works and the system framework
are introduced in sections 4.1 and 4.2. Section 4.3 presents our face tracker whereas
section 4.4 describes the people counter. Experimental results are given in section
4.5 and a conclusion is drawn in section 4.6.
Chapter 5 summarizes the main thesis results and contributions. Finally,
suggestions for further research are given.
Chapter 2
3D Face Landmarking
2.1 Introduction
depth (z) value for every point in the (x, y) plane. While 2.5D/3D face models are
theoretically reputed to be insensitive to lighting condition changes, they still need
to be pose-normalized and correctly registered for further face analysis, such as 3D
face matching [Lu et al. 2006, Zeng et al. 2008], tracking [Sun & Yin 2008],
recognition [Gokberk et al. 2008, Kakadiaris et al. 2007], and facial expression
analysis [Niese et al. 2008]. As most existing registration techniques assume that
some 2.5D/3D face landmarks are available, a reliable localization of these facial
landmarks is essential.
When automatically locating landmarks on faces, approaches generally face the
challenges of head pose, illumination, facial expression and occlusion. Head pose
variations not only influence the facial appearance in images or video sequences but
also cause self-occlusion, where some landmarks are hidden. Illumination changes,
including variations of lighting intensity and light source position, affect pixel
values either over the whole face or on parts of it. Facial expressions, due to
facial muscle contractions, cause non-rigid deformations of face texture and shape,
especially in the mimic parts of the face such as the mouth region. Occlusion is usually
caused by face clusters, hand gestures or accessories such as glasses and masks. Some
landmarks may become hidden, and the local texture and shape around other
landmarks may change. Although some approaches handle the
variations of head pose and illumination in their learning stages for better precision
and robustness, these two problems remain challenging in a 2D environment. Thus,
3D faces have gained interest since their processing and analysis can potentially
deal with pose and lighting variations. Indeed, landmarks on 3D faces can
be located by analysing the face mesh using curvature information or other
geometry-based features. However, expression and occlusion remain open problems
Face landmarking has been extensively studied on 2D face images. These approaches
can be divided into feature-based and structure-based categories.
Chapter 2. 3D Face Landmarking
Databases for 2D face analysis are dedicated to different research fields, mainly
face detection, face tracking, face recognition, and facial expression
recognition. It is hard to include all datasets here, so we only present
some representative ones. The MIT Face Database [Database a], the FERET Database
[Phillips et al. 1998] and the Yale Database [Database b] are mainly used for face
detection; DXM2VTS [Teferi & Bigun 2007] and A Video Database of Moving Faces
and People [O’Toole et al. 2005] for face tracking; FERET [Phillips et al. 1998] and
the Yale Database [Database b] for face recognition; the Cohn-Kanade AU-Coded Facial
Expression Database [Kanade et al. 2000], the PIE database [Sim et al. 2003] and the
JAFFE database [Lyons et al. 1998] for facial expression analysis. A detailed
description of these datasets and further datasets for 2D face analysis can be found
at the following link: http://www.face-rec.org/databases/.
• Prior knowledge based strategies: [Zhao et al. 2008] use prior geometric
distances between the eye regions as well as between the mouth region and the eye
regions. In this approach, eye centers are located by calculating the sum of
RGB component differences, and the mouth is assumed to lie in a region
located at a certain distance below the eye centers. The mouth center is then
located from the color distribution. A correct detection rate of 95.52% is reached
for locating eye centers and the mouth center. An automatic eye center is considered
correctly detected when the Euclidean distance (in pixels) between it and its
corresponding manual eye center is less than 0.25 of the Euclidean distance between
the two manual eye centers (d_rl). An automatic mouth center is considered
correctly detected when the Euclidean distance between it and the manual mouth
center is less than 0.12 of d_rl. [Talafova & Rozinaj 2007] search for landmarks in
assumed subregions of a face image using human skin chromaticity and face
morphological characteristics. The authors report an average locating error of 5
pixels without indicating the resolution of the test face images. [Wang et al. 2009]
use the horizontal and vertical projection curves of gray values for locating
the eyes, nose and mouth. The approach has been tested on the JAFFE dataset,
without a quantitative analysis of the landmarking results.
map, which is computed by correlating the DCT template with the DCT vector
of the test block. Tested on texture images from the FRGC dataset, at least
83% of outer eye corners, 90% of nose tips and 70% of mouth corners are
correctly detected. A detection is considered correct if the Euclidean distance
(in pixels) between an automatic landmark and its true position is
less than 0.1 of the inter-ocular distance. [Asbach et al. 2008] compare
potential landmarks with vertices on a normalized face mesh using SIFT and
SURF descriptions. They conclude that scale-invariant Harris interest points
with SURF descriptions are the most promising combination for locating
landmarks. [Celiktutan et al. 2008] model facial landmarks redundantly with four
different features, i.e., DCT, Non-negative Matrix Factorization, Independent
Component Analysis and Gabor wavelets. Matching scores are later fused for
selecting candidate points. Over 94.8% of inner eye corners, over 93.8% of outer
eye corners and over 89.7% of inner eyebrow corners are correctly detected
using a criterion of 0.1 of the inter-ocular distance for correct detection.
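The detection criteria quoted above all follow the same pattern: a landmark counts as correctly detected when its distance to the manual annotation is below a fraction of a reference distance (0.25 or 0.12 of d_rl, or 0.1 of the inter-ocular distance). The sketch below illustrates this pattern; the function name and the pixel coordinates are hypothetical and do not come from the cited works.

```python
import numpy as np

def detection_rate(predicted, manual, reference_distance, threshold):
    """Fraction of landmarks closer to their manual position than
    threshold * reference_distance."""
    predicted = np.asarray(predicted, dtype=float)
    manual = np.asarray(manual, dtype=float)
    errors = np.linalg.norm(predicted - manual, axis=-1)
    return float(np.mean(errors < threshold * reference_distance))

# Hypothetical pixel coordinates for a single face.
manual_eyes = np.array([[100.0, 120.0], [160.0, 120.0]])
auto_eyes = np.array([[104.0, 123.0], [180.0, 120.0]])
manual_mouth = np.array([[130.0, 190.0]])
auto_mouth = np.array([[133.0, 194.0]])

# Reference distance between the two manual eye centers.
d_rl = float(np.linalg.norm(manual_eyes[0] - manual_eyes[1]))

eye_rate = detection_rate(auto_eyes, manual_eyes, d_rl, 0.25)
mouth_rate = detection_rate(auto_mouth, manual_mouth, d_rl, 0.12)
```

With these toy values, d_rl is 60 pixels, so the eye threshold is 15 pixels and the mouth threshold 7.2 pixels: the first eye (error 5) and the mouth (error 5) are accepted, while the second eye (error 20) is rejected.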
Structure-based approaches are usually implemented by fitting a face model
composed of both face shape and texture features. Instead of extracting features from
texture or shape separately, these methods use texture and shape knowledge
simultaneously in the locating process. There exist many popular face models or
approaches, such as the Active Shape Model and the Active Appearance Model. Here we
review some representative studies.
• Active Shape Model (ASM): [Cootes et al. 1995] first proposed to represent
the shape of an object using landmark points and to learn shape variations
using Principal Component Analysis (PCA). Each landmark is associated with
• Other approaches combining local and global information: [Tu & Lien 2009]
use Singular Value Decomposition (SVD) to combine two related classes, shape
and texture, in a single eigenspace; the algorithm is named Direct Combined Model
(DCM). It estimates the facial shape directly by applying the significant
texture-to-shape correlations. Tested on 450 images with a resolution of 640×480
from their own database, they achieve an average error of 2 pixels for all 84
landmarks over images with frontal pose. This average error increases
to 8 pixels when the head pose varies up to 35 degrees. [Cristinacce & Cootes 2008]
build a Constrained Local Model (CLM) by learning global shape variations, local
texture variations around landmarks, and their correlation. Correlation meshes
describing the similarity between local instances and local regions around
landmarks are computed and used for optimizing a fitting function driven by shape
parameters. Their results show the efficiency of CLM compared to AAM.
The fitting of this model is improved by using an optimization in the form of
subspace constrained mean-shifts [Saragih et al. 2009]. [Kozakaya et al. 2008]
propose a weighted vector concentration approach, which integrates the global
shape vector and a locally normalized Histogram of Oriented Gradients (HOG)
descriptor. The global and local information are combined in landmarking
by solving a single weighted objective function. Tested on 1918 images
from the FERET dataset, they achieve a landmarking error between 0.03 and 0.07
for 14 landmarks. This error is measured as the average, over all tested images,
of the Euclidean distance (normalized by the inter-ocular distance) between an
automatic landmark and its manual equivalent. The mean error over all 14
landmarks is 0.05.
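The error measure used in the last study can be written down in a few lines. The sketch below is an illustration under assumptions: the array layout (images × landmarks × 2), the function name and the toy coordinates are hypothetical, and the inter-ocular distance is taken between two manually annotated eye landmarks.

```python
import numpy as np

def mean_normalized_error(predicted, manual, left_eye_idx, right_eye_idx):
    """Per-landmark mean Euclidean error, normalized per image by the
    inter-ocular distance. Arrays have shape (n_images, n_landmarks, 2)."""
    predicted = np.asarray(predicted, dtype=float)
    manual = np.asarray(manual, dtype=float)
    inter_ocular = np.linalg.norm(
        manual[:, left_eye_idx] - manual[:, right_eye_idx], axis=-1)
    errors = np.linalg.norm(predicted - manual, axis=-1) / inter_ocular[:, None]
    return errors.mean(axis=0)

# Hypothetical annotations: two images, three landmarks
# (left eye, right eye, mouth center), in pixels.
manual = np.array([[[0.0, 0.0], [50.0, 0.0], [25.0, 40.0]],
                   [[0.0, 0.0], [100.0, 0.0], [50.0, 80.0]]])
predicted = np.array([[[3.0, 4.0], [50.0, 0.0], [25.0, 45.0]],
                      [[0.0, 0.0], [100.0, 10.0], [50.0, 80.0]]])

per_landmark = mean_normalized_error(predicted, manual, 0, 1)
overall = float(per_landmark.mean())
```

Normalizing by the inter-ocular distance makes errors comparable across images of different resolution or face size, which a raw pixel error does not allow.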
In order to overcome the lighting problem, the above studies perform either
an intensity normalization process [Cootes et al. 1995, Cootes et al. 1998], extract
illumination-insensitive features, such as facial component contour and corner
[Wang et al. 2009], or include illumination variations in their learnt facial mod-
els [Cristinacce & Cootes 2008]. Meanwhile, in order to overcome the head pose
problem caused by the in-plane rotation and face scale, [Celiktutan et al. 2008]
and [Xu & Ma 2008] detect landmark candidates under multi-directions and multi-scales.
However, locating landmarks remains too challenging for 2D approaches when
dealing with faces that exhibit out-of-plane rotation or partial illumination. Theoretically,
2D landmarking methods can be applied to 3D faces by locating landmarks on 2D
texture maps and then using correspondence from the scanner systems to map those
points onto 3D face meshes. However, due to the aforementioned limitations, dedi-
cated landmarking approaches are necessary for locating landmarks on 3D faces.
[Yin et al. 2006] and Bosphorus database [Savran et al. 2008] are also used for test-
ing 3D landmarking methods.
outer corners of eyes (left and right) and nose tip with mean errors (absolute dis-
tances between automatic landmarks and manual equivalent) of 11.89mm, 12.11mm,
19.38mm, 20.46mm and 8.83mm. Because features related to 3D face meshes are
only based on the geometry information, it is hard to distinguish geometry-non-
salient points e.g., eyebrow corners, and therefore the points that can be located are
limited. Moreover, shape variations like expression, occlusion and self-occlusion can
easily handicap this branch of landmarking approaches.
Range maps from 3D faces can be considered as 2D images with pixel values corre-
sponding to the linear transformation of Z coordinates in 3D. Thus, besides calcu-
lating curvature and shape index, popular features in 2D, like Gabor wavelet, can
also be extracted from this representation. [Lu et al. 2004] and [Colbry et al. 2005]
first find nose tip by closest Z value to camera and then find other landmarks using
shape index within eye, mouth and chin regions. Tested on non-normalized 113
scans from their own database, the mean error for locating the five landmarks is
10mm. [Colbry & Stockman 2007] propose a canonical face depth map on the range
image and locate the nose tip and inner eye corners based on this representation.
[Colombo et al. 2006] and [Szeptycki et al. 2009] compute Gaussian (K) and Mean
(H) curvature for each point in range image and set threshold on curvature to
isolate candidate regions for nose tip and eye inner corners. The candidate land-
marks are further filtered according to the shape of the triangles they compose in
[Szeptycki et al. 2009]. Tested on 1600 face scans from FRGC dataset, over 99% of
the three landmarks are localized with a precision of 10mm in [Szeptycki et al. 2009].
HK curvature can also be used on range images of full 3D head scans [Li et al. 2002],
which locate six landmarks when point curvature properties fulfil a set of empirical
conditions. In [Segundo et al. 2007], nose tip, nose corners and eye corners are ini-
tially located using HK curvature and their positions are then corrected by finding
salient points on projection curves of range images. Over 99% of all these land-
marks are correctly detected. However, no specific criterion on good detection has
been provided. [D’House et al. 2007] use Gabor wavelets on range map for a coarse
detection of landmarks and then apply Iterative Closest Points (ICP) algorithm to
enhance the location precision. They achieved 99.37% of correct nose tip location
with a precision of 10mm on FRGC v1.0 database, but their accuracy on the outer
corners of the eyes is relatively lower. [Dibeklioglu et al. 2008] propose a Gaussian
Mixture Model-like statistical model (MoFA), describing the local gradient feature
distribution around each landmark. This model produces a likelihood map for each
landmark on new faces and the highest value in this map is located as landmark.
Tested on FRGC dataset, their approach achieves over 90%, 99%, 99% and 87% of
detection rates for outer eye corners, inner eye corners, nose tip and mouth corners
respectively with a precision of three pixels on texture maps (resolution of 480*640).
[Koudelka et al. 2005] develop an accurate approach by computing radial symmetry
maps, gradient and zero-crossing maps from range maps. Landmarks are chosen by
using a series of heuristic constraints. Over 97% of all five landmarks (nose tip, eye
inner corners, mouth center and sellion) are localized with a precision of 10mm on
FRGCv1 dataset. Compared with the first category, this branch of approaches can
localize more points with higher accuracy, because faces in range maps are normally
in frontal pose and facial landmarks can be represented with 2D features. Nevertheless,
the drawback of these methods is their sensitivity to face scale and head pose
variations. Moreover, they have difficulty locating points that are not geometrically salient
and points in non-rigid face regions in the presence of expressions.
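As an illustration of the HK-curvature analysis used by several of the methods above, the sketch below computes mean (H) and Gaussian (K) curvature on a range image with finite differences and thresholds them to isolate peak-like candidate regions. The function names, the threshold value and the sign convention (a peak pointing towards the camera has H < 0) are assumptions of this sketch and are not taken from the cited works.

```python
import numpy as np

def hk_curvature(z, spacing=1.0):
    """Mean (H) and Gaussian (K) curvature of a range image z(y, x)."""
    zy, zx = np.gradient(z, spacing)      # first derivatives along rows (y) and columns (x)
    zxy, zxx = np.gradient(zx, spacing)   # derivatives of zx along y and x
    zyy, _ = np.gradient(zy, spacing)     # derivative of zy along y
    g = 1.0 + zx**2 + zy**2
    K = (zxx * zyy - zxy**2) / g**2
    H = ((1 + zy**2) * zxx - 2 * zx * zy * zxy + (1 + zx**2) * zyy) / (2 * g**1.5)
    return H, K

def peak_candidates(H, K, k_min=0.5):
    """Boolean mask of elliptic, peak-like points (hypothetical threshold k_min)."""
    return (K > k_min) & (H < 0)
```

On a paraboloid z = -(x² + y²)/2, the apex has K = 1 and H = -1 and is therefore kept as a candidate, whereas a planar region (K = 0) is rejected.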
Due to the above reasons, a single face representation may not provide enough infor-
mation for localizing some landmarks consistently. However, the perfect matching
of range map and texture map from scanner systems ensures the combination cor-
rectness of multi-representation. Accumulating evidence derived from different face
representations has the potential to make the feature extraction richer and more
robust. [Boehnen & Russ 2004] compute the eye and mouth maps based on both
color and range information and select potential feature candidates of inner corner
of eyes, nose tip and sub-nasal. A 3D geometric-based confidence of candidates
in this category.
2.2.3 Discussion
Chapter 2. 3D Face Landmarking
instance, [Boehnen & Russ 2004] compute the eye and mouth maps based on both
color and range information. [Wang et al. 2002] use "point signature" representa-
tion coding 3D face mesh as well as Gabor jets of landmarks from 2D texture image.
In [Salah et al. 2007] [Jahanbin et al. 2008b], Gabor wavelet coefficients are used to
model local appearance in texture map and local shape in range map around each
landmark while [Lu & Jain 2006] propose to compute and fuse shape index response
(range) and cornerness response (texture) in local regions around seven landmarks.
As the combinations of candidate landmarks resulting from shape
and/or texture related descriptors are generally important, some authors
also propose to make use of structural relationships between landmarks,
for instance through heuristics [Nair & Cavallaro 2009], a 3D geometric-
based confidence [Boehnen & Russ 2004], an extended elastic bunch
graph [Jahanbin et al. 2008b], or a simple mean model constructed as the av-
erage 3D position of landmarks from a learning dataset [Lu & Jain 2005]. However,
there is no approved technique which best takes into account both configuration
relationships between landmarks and the local properties in terms of geometric
shape and/or texture around each landmark.
Few of the aforementioned studies address the issue of face landmarking in the
presence of facial expression or occlusion. [Nair & Cavallaro 2009] test their
3D Point Distribution Model to locate five landmarks (the two outer eye points, the
two inner eye points and the nose tip) under facial expressions with a locating accu-
racy ranging from 8.83 mm for nose tip to 20.46 mm for the right outer eye point.
However, these five landmarks are all located on face regions stable to facial expres-
sions. [Dibeklioglu et al. 2008] study 3D facial landmarking under expression, pose
and occlusion variations. However, only one landmark, the nose tip, was considered
in their work which is not sufficient for further accurate face analysis.
In this chapter, we address the facial landmarking problem in 3D with presence
of expression and occlusion, aiming at locating a sufficient number of landmarks
with good accuracy for other face analysis, especially for facial expression recogni-
tion. In order to do so, we propose a general learning-based framework for 3D face
landmarking which combines configuration relationships among the landmarks and
their local properties of texture and geometry. Based on this principle, we propose
two approaches in section 2.3 and section 2.4 for landmarking 2.5D faces and 3D
faces respectively. Statistical face models are trained by applying Principal Component
Analysis (PCA) to face landmark configurations, local texture and local shape
around each landmark from training faces. Two different fitting algorithms are pro-
posed to fit the face models to new faces so that landmark locations can be found
by searching the closest points to known landmarks on fitted models. Provided with
3D training faces with expressions, our models are able to learn the expression vari-
ations and generate instances with these variations so that the accuracy in fitting
faces with expression can be increased. Moreover, in order to overcome the occlusion
problem, a classification system that detects occluded faces and the type of
occlusion has been proposed, so that occlusion information can be taken into account
during the fitting process of our second approach presented in section 2.4.
In this section, we propose a statistical learning-based approach for 2.5D face
landmarking. Taking advantage of the rich information contained in 3D face data, our
model is built not only on facial texture but also on the geometry variations
from a training face set. Specifically, a variety of face shapes on texture maps
are analyzed and learnt as the global configuration of landmarks. Meanwhile,
variations in local texture and range are also learnt from the scale-free patches around
each landmark. Thus, the statistical model is made up of a global face shape model,
a texture model and a range model. New patch instances can then be synthesized by
varying the model parameters. When fitted for a best match to a new 3D face, this
statistical model delivers the location of the landmarks on the texture map of the
input 3D face. The fitting process is the optimization of the global shape in order to
reach the highest correlation in both texture and range between local patches from
the input face and instances synthesized from our texture and range models.
2.3.1 Methodology
3D face scans delivered by the current 3D imaging systems are usually noisy and
may contain holes and spikes, as shown in Fig. 2.2. In order to remove this noise,
we perform the following operations to enhance the quality of 3D face scans:
1. Median Cut: spikes are detected by checking the discontinuity of points and
are removed by the application of a median filter.
Although faces are scanned from the frontal view, there still remain variations
in head pose which disturb the learning of global shape variations and consequently
may also perturb the learning of local shape and texture variations. To compensate
for head pose, faces are first translated close to the origin of the coordinate system by
subtracting the gravity center of the point cloud. Then, Iterative Closest Point
(ICP) algorithm [Zhang 1994] is used to minimize the difference between two point
clouds of the new face and an arbitrarily selected face which holds a frontal and
straight pose, as illustrated in Fig. 2.3.
Figure 2.3: Point clouds of the preset face (red) and new face (blue) before ICP
alignment (a) and after ICP alignment (b).
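The two operations of this pose compensation can be sketched as below: centering a point cloud on its gravity center, and the least-squares rigid alignment that forms the inner step of ICP. The sketch assumes that point correspondences are already known; a complete ICP would alternate this step with a nearest-neighbour search until convergence. Function names are illustrative only.

```python
import numpy as np

def center_cloud(points):
    """Translate an (n, 3) point cloud so that its gravity center is at the origin."""
    return points - points.mean(axis=0)

def rigid_align(src, dst):
    """Rotation R and translation t minimizing sum ||R @ src_i + t - dst_i||^2.

    Assumes src[i] corresponds to dst[i]; ICP re-estimates these pairs at each iteration.
    """
    src_mean, dst_mean = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_mean).T @ (dst - dst_mean)      # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))         # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_mean - R @ src_mean
    return R, t
```

The aligned cloud is obtained as `src @ R.T + t`.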
Contrary to other methods which directly use texture and range maps for extracting
local information, we perform a local remeshing of the 3D point clouds. This is because
the distance between subjects and the scanner affects the face scale of 3D face
scans, which influences the 3D point cloud density and resolution of face in texture
and range maps. Directly sampling from the texture map is sensitive to the scale
variation and thus creates local patches covering different areas of face parts around
landmarks. Therefore, we create the scale-free local patches with uniform scale
among all faces to normalize face scale in local regions.
Uniform grids for texture and range respectively are created around each land-
mark with a fixed size (15*15 in this work, a compromise between accuracy and
efficiency) as shown in Fig. 2.5. The uniform grids bring two benefits.
Firstly, because the distances between subjects and the 3D scanner differ, the
number of points varies from scan to scan, which leads to variation in the density of
the point clouds of 3D faces. Resampling the local points normalizes this variation.
Secondly, there exists a natural correspondence between resampled points on grids centered
at a specific landmark of different faces, which makes the point-to-point correspondence
among faces easy and efficient to find.
Specifically, the centers of grids have the (x, y) values of their corresponding
landmark. The intervals of grids on X,Y dimensions are fixed to 1mm. The range
values (resp. intensity values) on the grids are interpolated from range values of
sampled points in the local regions (resp. the intensity values). The interpolation
methods used for range values and intensity values are different. Triangle-based
linear interpolation is used for the intensity values, which computes the current
intensity value based on the weighted distances from the point to three vertices of the
triangle covering the point. The Biharmonic Spline Interpolation [Sandwell 1987] is
used for the range values. The grids are then projected into 2D along Z direction
and then range and texture patches around the landmark can be obtained. This
process is repeated for all landmarks on a face.
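A minimal sketch of the grid construction and of the triangle-based linear interpolation is given below. It assumes that the 15*15 grid is centered on the landmark with a 1mm step and that the triangle enclosing each grid node is already known; the biharmonic spline used for range values is not reproduced here. Function names are hypothetical.

```python
import numpy as np

def make_grid(center_xy, size=15, step=1.0):
    """(size*size, 2) array of grid nodes centered on a landmark, spaced by `step` mm."""
    offsets = (np.arange(size) - (size - 1) / 2.0) * step
    gx, gy = np.meshgrid(center_xy[0] + offsets, center_xy[1] + offsets)
    return np.column_stack([gx.ravel(), gy.ravel()])

def barycentric_interpolate(p, tri_xy, tri_values):
    """Linear interpolation at point p inside a triangle with known vertex values."""
    tri_xy = np.asarray(tri_xy, dtype=float)
    a, b, c = tri_xy
    T = np.column_stack([b - a, c - a])
    w1, w2 = np.linalg.solve(T, np.asarray(p, dtype=float) - a)
    w0 = 1.0 - w1 - w2                      # barycentric weights sum to one
    return w0 * tri_values[0] + w1 * tri_values[1] + w2 * tri_values[2]
```

Each grid holds 15 × 15 = 225 nodes, so the 15 landmarks used in this work give 15 × 225 = 3375 interpolated points per modality.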
Intensities and range values on all patches are then concatenated into two vectors
G and Z as in eq.2.1 and eq.2.2, where m is the total number of points on all grids
(3375 here).
Figure 2.5: Creation of a uniform grid in a local region associated with the left corner
of the left eye, seen from two viewpoints (a) and (b). Circles are the points sampled
from the 3D face model, and the grid is composed of the interpolated points. The
interpolation is also performed for intensity values.
Then, all X vectors from the training faces are normalized using a Procrustes analysis
in order to remove 2D global variations [Cootes et al. 1995]:
2. Align the shapes to the approximate mean shape. First, calculate the centroid
of each shape. Then, move the centroid of every shape to the origin and normalize the
scale of each shape. Finally, rotate each shape to align it with the newest approximate
mean shape.
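The alignment procedure can be sketched as an iterative (generalized) Procrustes analysis. This is a simplified illustration under stated assumptions: the first shape serves as the initial mean estimate, a fixed number of iterations replaces a convergence test, and shapes are stored as (N, 2) arrays. None of these choices is taken from the original implementation.

```python
import numpy as np

def normalize_shape(shape):
    """Move the centroid to the origin and scale the shape to unit Frobenius norm."""
    centered = shape - shape.mean(axis=0)
    return centered / np.linalg.norm(centered)

def rotate_to(shape, ref):
    """Rotate `shape` (N, 2) to best match `ref` in the least-squares sense."""
    U, _, Vt = np.linalg.svd(shape.T @ ref)
    if np.linalg.det(U @ Vt) < 0:           # forbid reflections
        U[:, -1] *= -1
    return shape @ (U @ Vt)

def procrustes_align(shapes, n_iter=10):
    """Align all shapes to an iteratively re-estimated mean shape."""
    aligned = [normalize_shape(s) for s in shapes]
    mean = aligned[0]                        # assumed initial estimate of the mean
    for _ in range(n_iter):
        aligned = [rotate_to(s, mean) for s in aligned]
        mean = normalize_shape(np.mean(aligned, axis=0))
    return np.array(aligned), mean
```

Shapes that differ only by translation, scale and in-plane rotation become identical after this alignment, so only genuine shape variation is left for the subsequent analysis.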
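Principal Component Analysis (PCA) is then applied, retaining the principal
components that account for 95% of the variance. Taking PCA on the set {Xi } as an
example, the analysis is as follows, where N_x denotes the number of training vectors:
$$ \bar{X} = \frac{1}{N_x} \sum_{i=1}^{N_x} X_i \qquad (2.4) $$
$$ \Sigma = \frac{1}{N_x - 1} \sum_{i=1}^{N_x} (X_i - \bar{X})(X_i - \bar{X})^T \qquad (2.5) $$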
The same process is applied for the training sets of {Xi }, {Gi }, {Zi } to build
the following three linear models (eq.2.6-2.8).
X = X̄ + Px bx (2.6)
G = Ḡ + Pg bg (2.7)
Z = Z̄ + Pz bz (2.8)
where X̄, Ḡ, Z̄ are the mean 2D shape, mean normalized intensity and mean
range value respectively; Px , Pg , Pz are sets of modes of shape, intensity and range
value variation respectively; bx , bg , bz are sets of parameters of 2D shape, inten-
sity and range values respectively. The dimensions of Px , Pg , Pz are respectively
(2 ∗ N, nx ), (m, ng ), (m, nz ), where nx , ng and nz are the number of eigenvectors
preserved, N is the number of landmarks and m is the total number of points from
all grids around the landmarks.
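The construction of one such linear model can be sketched as follows. The sketch uses an SVD of the centered data instead of an explicit eigen-decomposition of Σ, which yields the same modes; the function names and the one-sample-per-row layout are assumptions.

```python
import numpy as np

def build_linear_model(samples, variance_kept=0.95):
    """Return (mean, P, eigenvalues) for samples stored one vector per row."""
    mean = samples.mean(axis=0)
    centered = samples - mean
    _, s, Vt = np.linalg.svd(centered, full_matrices=False)
    eigvals = s**2 / (len(samples) - 1)              # eigenvalues of the covariance matrix
    ratio = np.cumsum(eigvals) / eigvals.sum()
    n_modes = min(int(np.searchsorted(ratio, variance_kept)) + 1, len(eigvals))
    return mean, Vt[:n_modes].T, eigvals[:n_modes]   # P has one mode per column

def synthesize(mean, P, b):
    """Generate an instance from the parameters b, as in X = mean + P b."""
    return mean + P @ b
```

For a vector x, the parameters are obtained by projection, `b = P.T @ (x - mean)`, and the square roots of the returned eigenvalues give the standard deviations of the individual parameters.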
In 2D statistical models such as ASM and AAM [Dryden & Mardia 1998,
Cootes & C.J.Taylor 2004], the assumption that the control parameters in the model
follow the Gaussian distribution has been proved to be efficient in many cases. Thus,
following these studies, we assume that the parameters bi from PCA, where bi ∈ {bx , bz , bg }, are
independent and Gaussian distributed with zero mean and standard deviation σij ,
where j indexes the parameters of bi . Figures 2.6, 2.7 and 2.8 illustrate the first
two modes (j ∈ {1, 2}) at the two ends of their variation range (-3σij ,+3σij ), labelled
-3std and +3std, for 2D shape variation, texture variation and range
variation respectively.
Thus, the statistical model built here includes three sub-models: shape, texture
and range models. Similar to other statistical models, our model can be trained
with different training sets and thus can learn different variation modes. The more
diverse training faces are provided, the more comprehensive variations the model
includes. For example, if we provide a training set including faces with different
expressions and illumination conditions, the model is learnt with expression and
illumination variations. However, if the model is trained with neutral faces under a
single illumination condition, it only contains variations due to subject physiognomy.
Figure 2.6: First two modes of shape variation in 2D. Points represented by ’*’ are
current shapes while points represented by ’.’ are mean shape. The first variation
mode mostly explains the shape changes along the horizontal direction while the
second variation mode mostly explains the shape changes along the vertical direction.
Figure 2.7: First two modes of texture variation. The first variation mode mostly
explains the intensity changes in the eyebrow region and the mouth region, while
the second variation mode explains the intensity changes in the nose region.
Figure 2.8: First two modes of range variation. The first variation mode mostly
explains the range value changes in the lower part of face, while the second variation
mode explains the range value changes in the upper part of face.
Px , Pg , Pz in eq. 2.6-2.8 contain the variation modes of shape, texture and range.
Thus, given the parameters bx , the 2D shape can be generated by using eq. 2.6.
In order to transform it into a 2D image coordinate system, 3 more parameters
are required, namely a translation parameter (Cx ,Cy ), a scale parameter α and an
in-plane rotation parameter ρ as described in eq.2.9.
where X is the created shape instance and R(ρ) is the rotation matrix. The
shape transformation parameters and shape parameters (bx ) are concatenated into
a vector Θ = (b_x^T | C^T | α | ρ)^T .
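The generation of a shape instance from Θ can be sketched as below. Since eq. 2.9 is not reproduced here, the order of operations (rotate, scale, then translate) and the interleaved layout (x1, y1, x2, y2, ...) of the shape vector are assumptions of this sketch, as is the function name.

```python
import numpy as np

def shape_instance(theta, mean_shape, P_x):
    """2D landmark positions (N, 2) generated from theta = (b_x | Cx, Cy | alpha | rho)."""
    n_modes = P_x.shape[1]
    b_x = theta[:n_modes]
    cx, cy, alpha, rho = theta[n_modes:n_modes + 4]
    pts = (mean_shape + P_x @ b_x).reshape(-1, 2)    # shape in the model frame (eq. 2.6)
    R = np.array([[np.cos(rho), -np.sin(rho)],
                  [np.sin(rho),  np.cos(rho)]])      # in-plane rotation R(rho)
    return alpha * (pts @ R.T) + np.array([cx, cy])  # scale, then translate
```

With b_x set to zero, the instance is simply the mean shape placed in the image with the chosen translation, scale and rotation.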
Given a shape instance X and a preprocessed 3D face scan, the 2D points in X
can be mapped back into 3D space based on the correspondence from the scanner
system and vectors G and Z (eq. 2.1-2.2) can be obtained through the same process
as the one described above. They are further used to estimate bg and bz ,
as follows:
Given a new 2.5D face, the landmarking problem is how to best fit our learnt
statistical model on this face. This can be considered as an optimization problem
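$$ F_G = \sum_{i=1}^{N} \left\langle \frac{G_i}{\|G_i\|}, \frac{\hat{G}_i}{\|\hat{G}_i\|} \right\rangle $$
$$ F_Z = \sum_{i=1}^{N} \left\langle \frac{Z_i}{\|Z_i\|}, \frac{\hat{Z}_i}{\|\hat{Z}_i\|} \right\rangle $$
where ⟨·, ·⟩ is the inner product and ‖·‖ is the L2 norm. N is the number of
landmarks.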
1. The initial shape instance X̂^0 is created from the vector Θ^0 , where bx is set to zero
and C^T , S, ρ are preset.
3. Texture and range instances Ĝ^k , Ẑ^k are estimated following eq. 2.11.
4. The function f^k is computed following eq. 2.12, eq. 2.13 and eq. 2.14.
5. Taking Θ^k as the variables and f^k as the objective function value, the
optimization algorithm predicts Θ^{k+1} , which leads to a lower value of the objective
function.
6. X^{k+1} is computed following eq. 2.9 and compared with X^k to check the
convergence. If convergence is not reached, go to 2.
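The correlation terms evaluated in step 4 can be sketched as follows. The combination into a single value to be minimized (here the negated sum of the texture and range correlations) is an assumption, since the exact form of f is not reproduced in this text; function names are illustrative.

```python
import numpy as np

def patch_correlation(patches, instances):
    """Sum over landmarks of the normalized inner product between patch and instance."""
    total = 0.0
    for p, q in zip(patches, instances):
        total += np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))
    return total

def fitting_objective(G, G_hat, Z, Z_hat):
    """Assumed objective: lower values mean higher texture and range correlation."""
    return -(patch_correlation(G, G_hat) + patch_correlation(Z, Z_hat))
```

Because each term is normalized, multiplying a patch by a positive gain leaves its correlation unchanged, which is consistent with the tolerance to illumination conditions noted for this fitting process.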
In order to initialize C T and S, an Adaboost face detector [Viola & Jones 2002]
can be applied on the 2D texture maps and outputs a box containing faces. Thus,
these two parameters can be estimated by the center and length of the box respectively.
However, in our implementation, this initialization is performed thanks to
face masks obtained from the scanner system, which has the advantage of being accurate
and much simpler. ρ is preset to zero. In order to constrain the deformations
and to ensure that the shape instance is plausible, the b_x^j parameters are also limited
to the interval ±3σ_x^j . All b_x^j values falling outside this interval are replaced by
their closest boundary value.
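This constraint amounts to clipping each shape parameter to three standard deviations of its mode, which can be sketched as follows (the function name and the argument k are illustrative):

```python
import numpy as np

def clamp_shape_params(b_x, sigma_x, k=3.0):
    """Replace every b_x[j] outside [-k*sigma_x[j], +k*sigma_x[j]] by the closest bound."""
    b_x = np.asarray(b_x, dtype=float)
    sigma_x = np.asarray(sigma_x, dtype=float)
    return np.clip(b_x, -k * sigma_x, k * sigma_x)
```

For instance, with σ = (1, 2, 1), the parameters (0.5, -10, 4) become (0.5, -6, 3).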
Note that it is not necessary to perform a size normalization before and dur-
ing the fitting process since three parameters which project shape instances into
the image coordinate are optimized during this fitting process. Moreover, there
is no photometric normalization done before the model fitting since the objective
function computes the correlation between scale free patches and their estimated
instances. This process has more tolerance to illumination conditions compared to
those directly extracting features on the images for landmarking.
Database
The datasets we have used are FRGC v1.0 and v2.0 [Phillips et al. 2005]. The
first version of the FRGC dataset contains 953 face scans from 275 people, captured
under controlled illumination conditions and generally neutral expressions. However,
these 953 face scans have slight head pose variation and scale variation. The second
version of the FRGC database contains 4,007 face scans from 466 persons. These 3D
face scans were captured under different illumination conditions and contain various
facial expressions, such as happiness and surprise.
All faces have been manually labelled by our research group with 15 landmarks
as illustrated in Fig. 2.4. These manually labelled landmarks can be used as ground
truth for learning or quality assessment of automatic landmark location. In our
experiments, the whole FRGC v1.0 dataset is first cleaned by filtering out several
badly captured face models. It is then divided into two parts: the first
(452 faces) is used for training and the second (462 faces) for testing
our algorithm. Subjects in the training set are different from those in the testing
set. For comparison purposes, we also applied to this testing set the curvature
analysis-based method developed within our team [Szeptycki et al. 2009]. However,
only 9 landmarks can be used for comparison between these two techniques, as the
curvature analysis-based method cannot locate the other 6 landmarks, which do
not have prominent curvature properties. In order to assess the generality of our
statistical model, which is learnt from 3D face models from FRGC v1.0, 1400 faces
are randomly selected from the FRGC v2.0 dataset as an extended testing set. The
precision in all tests is measured as the mean locating error (Pr = di ), where
di is the 3D Euclidean distance between an automatically located landmark and its
corresponding manually labelled landmark.
Results
Fig. 2.9 displays the cumulative precision of all landmarks located by our model
on the testing set from the FRGC v1.0 dataset. As we can see, our model locates
97% of cases within a precision of 10mm and 100% within 20mm for all landmarks. Our
method achieves its best result for landmark 13 (see legend in Fig. 2.9), with
100% accuracy at a precision of 9mm, and its worst for landmark 7, which
reaches 100% accuracy only at a precision of 19mm. Fig. 2.10 shows the precision
curves obtained by the curvature analysis-based method [Szeptycki et al. 2009] on
the same testing set. As we can see from the figure, while the nose tip and inner
corners of the eyes, each having a prominent geometric feature, are better located by the
curvature analysis-based method, our statistical model displays better precision on
all the other landmarks.
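The evaluation measures used in this section can be sketched as follows: per-landmark locating errors as 3D Euclidean distances, and a cumulative precision curve giving the fraction of located landmarks whose error does not exceed each threshold. The array layout (faces × landmarks × 3) and the function names are assumptions.

```python
import numpy as np

def locating_errors(auto, manual):
    """3D Euclidean distances, shape (n_faces, n_landmarks), between located and manual landmarks."""
    return np.linalg.norm(np.asarray(auto) - np.asarray(manual), axis=-1)

def cumulative_precision(errors, thresholds):
    """Fraction of errors that are less than or equal to each threshold (in mm)."""
    errors = np.asarray(errors)
    return np.array([(errors <= t).mean() for t in thresholds])
```

The mean and standard deviation per landmark are obtained with `errors.mean(axis=0)` and `errors.std(axis=0)`, and one curve per landmark by passing a single column `errors[:, j]`.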
Figure 2.9: Precision curves for all landmarks located by our method
Figure 2.10: Precision curves for all landmarks located by the method in
[Szeptycki et al. 2009]
The first two rows in Table 2.1 show the mean and std of the locating error for each
landmark (di ) from our method, while the following rows are the results achieved by
the curvature analysis-based method. The database that has been used is FRGC
v1.0. The table is indexed by the landmark number referring to the legend in Fig. 2.9.
As we can see, the mean locating errors of all landmarks are less than 5mm. Notice
that, as mentioned previously, except for the nose tip, the mean and std of locating
errors from our method are smaller than the ones from the curvature analysis-based
method.
Table 2.1: Mean and standard deviation of locating errors (in mm) for all landmarks using FRGC v1.0
Landmark               1      2      3      4      5      6      7
Mean (ours)         4.15   3.11   2.98   2.50   3.30   4.38   3.28
Std (ours)          2.82   1.90   2.23   1.51   2.04   2.81   2.43
Mean (curvature)    8.76   3.85      -      -   3.84   7.16      -
Std (curvature)     4.24   2.02      -      -   2.03   3.46      -
Landmark               8      9     10     11     12     13     14     15
Mean (ours)         2.72   4.00   2.68   4.93   3.91   2.72   3.76   3.95
Std (ours)          1.57   3.61   1.85   3.76   2.50   1.51   2.07   2.56
Mean (curvature)       -   6.07   2.27   6.29   8.68      -      -   8.44
Std (curvature)        -   4.18   1.35   4.27   7.47      -      -   7.47
The first group of mean and std values is from this approach, and the second group
is from the method in [Szeptycki et al. 2009]. Both tests are done on the same testing data set.
When landmarking results are not available for a point, the symbol "-" is used.
Table 2.2 shows the experimental results on 1400 face models randomly selected
from FRGC v2.0 dataset. Recall that our statistical model was trained on selected
face models in FRGC v1.0 only having controlled illumination and neutral expression
while the 1400 face models randomly selected from FRGC v2.0 dataset display facial
expressions and drastic illumination changes. As we can see from the table, the
mean error over all landmarks only increases by about 1.3mm (from roughly 3.5mm
to 4.8mm) compared with the experimental results on face models from the FRGC v1.0 dataset.
The time for localizing landmarks on a face used by our algorithm (coded in
Matlab) varies from 18min to 25min on a desktop PC with Intel Pentium4 1.8GHz
and 1 GB of RAM. Two steps are time consuming: firstly, it takes over 1500 iterations
for the simplex algorithm to reach the convergence, which is more robust to local
Table 2.2: Mean and standard deviation of locating errors (in mm) for all landmarks using FRGC v2.0
Landmark       1      2      3      4      5      6      7
Mean        5.22   4.36   4.07   3.24   3.78   4.97   4.21
Std         3.14   2.21   2.32   1.67   1.91   2.83   2.55
Landmark       8      9     10     11     12     13     14     15
Mean        3.10   6.65   4.88   6.95   5.38   3.53   6.48   4.67
Std         1.64   4.50   2.52   4.24   3.14   1.86   3.16   2.99
Discussion
Table 2.3: Mean and deviation of locating errors for individual manually labeled
landmarks
Landmark       1      2      3      4      5      6      7
Mean        2.95   2.42   2.03   1.94   2.04   2.76   2.11
Std         1.48   1.05   1.38   0.85   1.07   1.58   1.64
Landmark       8      9     10     11     12     13     14     15
Mean        1.84   3.80   1.90   4.50   1.98   1.99   3.04   2.06
Std         0.81   1.98   1.04   2.15   1.10   1.19   1.53   1.31
There exist three major sources of errors in our experiments. Firstly, our method
requires an exact match between texture images and range ones. Although several
badly mismatched face scans have already been filtered out in FRGC v1.0, there
are still many face scans containing mismatches to a certain extent, especially in
FRGC v2.0. Secondly, the training set should contain the major variations of faces,
so that our learnt statistical model can synthesize instances as close as possible to
the testing faces, further leading to a better locating accuracy. In our last test,
variations in illumination and expression were not learnt during training. Lastly, as shown in
Table 2.3, manual labelling also leads to locating errors of landmarks which implies
a divergence for the global minimum of the objective function during the fitting
process. Thus, our approach could be improved with a better learning of the model
by using a training set containing more face variations, especially in expression and
lighting condition, and with higher manual landmarking accuracy.
2.3.3 Conclusion
from half of the faces available from FRGC v1.0 dataset and experimented on 1400
faces randomly selected from FRGC v2.0 with uncontrolled illumination and facial
expressions, our method reaches an average locating error of less than 7mm
for all 15 landmarks.
This approach is dedicated to 2.5D face landmarking. However, when the full
3D face information is available, this method can be improved by considering the
3D morphology as the global landmark configuration instead of the 2D shape on
texture maps of 3D faces. This is the purpose of the method we propose in the next
section.
Figure 2.11: Two sets of landmarks are manually labelled on FRGC (a), BU-3DFE
(b) and Bosphorus (c) datasets. Landmark set in (a) contains 15 landmarks, in-
cluding nose tip and corners, inner and outer eye corners, mouth corners; landmark
sets in (b) and (c) contain 19 landmarks, including corners and middles of eyebrows,
inner and outer eye corners, nose saddles, nose tip and corners, left and right mouth
corners and middles of upper and lower lips.
3D face landmarking algorithms, the set of our targeted landmarks can be easily
changed provided a learning dataset. Through a statistical learning process, the
local properties around landmarks along with their morphological relationships in
training faces can be encoded independently of their locations and their number.
To prove this, we have manually labelled two sets of landmarks on three differ-
ent datasets, namely FRGC, BU-3DFE and Bosphorus datasets, as illustrated in
Fig. 2.11. The landmark set for FRGC dataset is the same as the one described in
the previous section.
The local regions around labelled landmarks are remeshed according to a prin-
ciple similar to the one used for the creation of scale free local patches presented in
the previous section. Thus, points in local regions are first sampled and then interpolated
on the uniform grids with a resolution of 1mm.
Once a training 3D face has been preprocessed, 3D coordinates of all the landmarks,
called 3D morphology, are concatenated into a vector S, describing the configuration
relationships among local regions.
Two vectors G and Z are further generated, interpolated from local meshes as
in eq. (2.16) and (2.17) similarly to G, Z in the method presented in the previous
section. All Z vectors thus contain variations of local geometric shapes around
landmarks while G vectors describe local texture properties. Alternatively, other
local feature descriptors may also be computed from interpolated local 3D meshes
and used, such as HK curvature, shape index, etc. for local shape, and Gabor jets,
cornerness response, etc. for local texture.
G = (g1 , g2 , ..., gm )T (2.16)
Z = (z1 , z2 , ..., zm )T (2.17)
Principal Component Analysis (PCA) is then applied to the three vector sets
{Si }, {Gi }, {Zi } and 95% of the variations in landmark configurations (morphology)
as well as local texture and shape around each landmark are retained.
S = S̄ + Ps bs (2.18)
G = Ḡ + Pg bg (2.19)
Z = Z̄ + Pz bz (2.20)
where S̄, Ḡ and Z̄ are respectively the mean landmark configuration, mean
intensity and mean range value while Ps , Pg , Pz are respectively the three sets of
corresponding variation modes. bs , bg , bz are the sets of controlling parameters.
All the individual parameters respectively in bs , bz and bg are independent and
follow Gaussian distributions with a zero mean and a standard deviation σi . The
dimensions of Ps , Pg , Pz are respectively (3 ∗ N, ns ), (m, ng ), (m, nz ), where ns , ng ,
nz are the number of eigenvectors, N is the number of landmarks and m is the total
number of points from all grids around the landmarks.
where p(T |S, ψ), p(R|S, ψ) are the probabilities of the face texture and range
given a landmark configuration S and SFAM ψ. We assume the variable R and
T from different face representations are independent within a local face region.
p(S|ψ) is the probability of a given landmark configuration estimated by SFAM.
Probabilities p(T |S, ψ) and p(R|S, ψ) can be estimated using a Gibbs-Boltzmann
distribution as in eq. 2.22. This distribution has been widely used by PCA based sta-
tistical models in 2D, such as Constrained Local Model [Cristinacce & Cootes 2008],
and proved to be efficient. This assumption is quite reasonable and results from
the fact that the problem of 3D face landmarking is actually a Markov Random
Field (MRF) which consists in assigning to each vertex of a 3D facial scan a label
from a set of labels L. The set L encompasses all targeted landmarks (e.g., nose
tip, eye corners) and a null value labeling any vertex which is not the location of
any targeted landmark. Then, the theorem of the equivalence between MRFs and
Gibbs distributions by Hammersley and Clifford [Li 2009] implies that the problem
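$$ p(S|T,R,\psi) \propto \prod_{i=1}^{N} e^{-\alpha\eta_i} \prod_{i=1}^{N} e^{-\beta\gamma_i} \prod_{j=1}^{k} e^{-b_j^2/\lambda_j} $$
$$ \log p(S|T,R,\psi) \propto \sum_{i=1}^{N} -\alpha\eta_i + \sum_{i=1}^{N} -\beta\gamma_i - \sum_{j=1}^{k} \frac{b_j^2}{\lambda_j} \qquad (2.22) $$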
where N is the number of local regions, ηi and γi are the similarities between
instances and local regions, and α and β are weight factors. p(S|ψ) can be
considered as a penalty factor, following [Cootes et al. 1995], where k is the number of
landmark configuration (morphology) modes, bj is the jth parameter of bs in eq. 2.18 and λj
denotes the corresponding eigenvalue of the landmark configuration model.
We have extended the objective function to deal with face occlusion. Indeed,
in the presence of occlusion, each local region around a landmark i will be associ-
ated with a probability of being uncovered mi . The objective function is therefore
rewritten as follows:
f(b_s) = \sum_{i=1}^{N} m_i \alpha F_{G_i}(s_i) + \sum_{i=1}^{N} m_i \beta F_{Z_i}(s_i) - \sum_{j=1}^{k} \frac{b_j^2}{\lambda_j}    (2.23)
where FGi and FZi are defined in eq. 2.26. mi is the probability that the region
around the ith landmark is uncovered: it is 0 if the local region is fully occluded
and 1 if the local region is totally uncovered. si is the location of the ith landmark
given by the morphology model.
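As an illustration, the objective of eq. 2.23 can be evaluated as in the following sketch. It assumes that the correlation values FGi(si) and FZi(si) have already been read from the correlation meshes; the function and argument names are hypothetical.

```python
def objective(f_g, f_z, m, b, eigenvalues, alpha, beta):
    """Value of f(b_s): weighted texture and range similarities minus the shape penalty.

    f_g, f_z    : per-region texture and range correlations at the current landmarks
    m           : per-region visibility (0 = occluded, 1 = uncovered)
    b           : morphology parameters b_j
    eigenvalues : eigenvalues lambda_j of the landmark configuration model
    """
    texture_term = sum(mi * alpha * fg for mi, fg in zip(m, f_g))
    range_term = sum(mi * beta * fz for mi, fz in zip(m, f_z))
    penalty = sum(bj * bj / lam for bj, lam in zip(b, eigenvalues))
    return texture_term + range_term - penalty
```

A region with mi = 0 contributes nothing to the two similarity terms, which is how occluded regions are left out of the fitting.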
The values of α and β can be determined by computing the ratios of \sum_{i=1}^{N} F_{G_i} to
\sum_{j=1}^{k} b_j^2/\lambda_j and of \sum_{i=1}^{N} F_{Z_i} to \sum_{j=1}^{k} b_j^2/\lambda_j
separately when applied to verification faces with manually labelled landmarks.
In order to compute the FGi and FZ i factors in eq. 2.23, the correlation meshes
are calculated in order to describe the similarity between instances and local face
regions in both texture and shape modalities. It makes the optimization faster in
the fitting process since those two factors can be directly obtained from the meshes
instead of computing the objective function at each iteration.
The correlation meshes are illustrated in Fig. 2.12 and their computation is
described as follows:
1. Given a new face scan, the set S ′ of points on the face closest to the landmark
configuration S is computed;
3. Local regions around the points in S ′ are remeshed for both texture and range
maps by using grids with a size of 51*51, as in the section;
4. For each local region i, a sliding window method is performed with the same
size as the local grid size in SFAM (15*15). At each step j, a local range map
Z and texture map G are extracted to compute the normalized correlation
between them and ẑ, ĝ respectively, which are the corresponding local parts
in Ĝ and Ẑ (eq. 2.26). Then, the normalized correlations are set as the values
of the window center on the corresponding meshes.
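F_{G_i} = \left\langle \frac{g_j^i}{\|g_j^i\|}, \frac{\hat{g}_j}{\|\hat{g}_j\|} \right\rangle, \quad F_{Z_i} = \left\langle \frac{z_j^i}{\|z_j^i\|}, \frac{\hat{z}_j}{\|\hat{z}_j\|} \right\rangle    (2.26)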
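The sliding window computation of a correlation mesh can be sketched as follows for one local region and one modality. This is a simplified illustration on regular grids, with hypothetical function names; in the setting described above the region would be a 51*51 grid and the template a 15*15 instance patch.

```python
from math import sqrt


def normalized_correlation(patch, template):
    """Inner product of two flattened patches after L2 normalization (eq. 2.26)."""
    dot = sum(p * t for p, t in zip(patch, template))
    norm_p = sqrt(sum(p * p for p in patch))
    norm_t = sqrt(sum(t * t for t in template))
    if norm_p == 0.0 or norm_t == 0.0:
        return 0.0  # a flat zero patch carries no information
    return dot / (norm_p * norm_t)


def correlation_map(region, template):
    """Slide a square template over a region (lists of rows) and return the correlation
    at every valid window position; entry [r][c] belongs to the window whose top-left
    corner is (r, c), i.e. whose centre is offset by half the window size."""
    size = len(template)
    flat_template = [value for row in template for value in row]
    result = []
    for top in range(len(region) - size + 1):
        out_row = []
        for left in range(len(region[0]) - size + 1):
            patch = [region[top + r][left + c] for r in range(size) for c in range(size)]
            out_row.append(normalized_correlation(patch, flat_template))
        result.append(out_row)
    return result
```

Computing this map once per region lets the fitting step look up FGi and FZi directly instead of recomputing correlations at each iteration.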
Before landmarking a 3D face through the fitting algorithm presented here, the
occlusion algorithm described in section 2.4.3 is first applied to identify the occluded
local regions and thus to set the corresponding mi coefficient to zero. Therefore,
only the unoccluded local regions will take part in the following fitting process. The
algorithm works as follows:
1. Given a 3D face, its head pose is first compensated using ICP algorithm.
Figure 2.12: Correlation meshes from two viewpoints. These meshes actually lie in a
four-dimensional space, where the first three dimensions are x, y, z and the last one
is the correlation value. In these figures, we display the correlation values instead of z.
(a) and (b) show the same correlation mesh from two points of view, describing the
similarity between texture (intensity) instances from SFAM and the texture (intensity) of the
given face. (c) and (d) show the correlation mesh describing the similarity between shape
(range) instances from SFAM and the face shape (range). Red corresponds to
high correlation and blue to low correlation.
in eq. 2.23). The penalty factor (the third factor in eq. 2.23) is computed
2.4.3 Occlusion detection and classification
Face analysis in the presence of partial occlusions, due to diverse factors such as hair,
glasses, moustaches, scarves, etc., is a difficult problem. As far as 3D face landmarking
is concerned, we are only interested in occlusions which may occur in local regions
around landmarks. Thus, we have proposed a simple approach to classify the occlusion
type and assign a set of binary values to the local regions, corresponding to ’occluded’ or
’unoccluded’ states. Alternatively, we could have computed the probability of a local
region being occluded, or a measure indicating roughly how much of a local
region is occluded.
In order to perform occlusion detection, features are extracted from the range map,
since the presence of occlusion necessarily changes the face shape in the relevant
local regions. Given an input face scan, its closest points s′ to the mean
landmark configuration (eq. 2.18) are computed. Then, 51*51 grids are used to
remesh local regions around these points only for range values, as in section
For each local region i, a sliding window of the same size as the local regions
considered in SFAM is applied. At each step j, a local depth
map Zα is extracted and its local shape instance Zβ is calculated to further obtain
a similarity map LS as follows:
bαj = Pzi (Zαj − z̄i ) (2.27)
LS_j^i = \left\langle \frac{Z_{\alpha j}}{\|Z_{\alpha j}\|}, \frac{Z_{\beta j}}{\|Z_{\beta j}\|} \right\rangle    (2.29)
Pzi is the submatrix composed of the rows in Pz associated with local region i.
z̄i is the subvector composed of rows in z̄ also associated with local region i. ⟨·, ·⟩
is the inner product and ‖·‖ is the L2 norm. bβ is obtained by limiting bα within a
predefined boundary used to limit the possible deformations.
In case of occlusion, the local deformations are too large to be handled by the
model. Thus, the instances Zβ generated from this model are quite different from
the occluded local shape Zα , which leads to a low similarity value in eq. 2.29. The
LSji describes the possibility to synthesize the local regions by learnt geometry
variations. Therefore, this possibility decreases when a part of a face is occluded
and thus contains non facial shape. This information is used for occlusion detection
and classification.
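The local similarity of eq. 2.29 can be sketched as follows. Several points are assumptions made for illustration only: the modes are taken as orthonormal vectors so that the projection of eq. 2.27 is a set of inner products, the instance Zβ is rebuilt as the local mean plus the bounded modes, and the bound is set to three standard deviations (the square root of each eigenvalue). All names are hypothetical.

```python
from math import sqrt


def project(patch, mean, modes):
    """b_alpha: coefficients of the centred patch on each (orthonormal) mode."""
    centred = [p - m for p, m in zip(patch, mean)]
    return [sum(w * c for w, c in zip(mode, centred)) for mode in modes]


def reconstruct(mean, modes, b):
    """Shape instance generated by the model for parameters b."""
    instance = list(mean)
    for coeff, mode in zip(b, modes):
        instance = [v + coeff * w for v, w in zip(instance, mode)]
    return instance


def cosine(u, v):
    """Inner product of the two L2-normalized vectors."""
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(x * x for x in v))
    return sum(x * y for x, y in zip(u, v)) / norm if norm else 0.0


def local_similarity(patch, mean, modes, eigenvalues, limit=3.0):
    """Similarity between a local depth patch and its bounded model reconstruction."""
    b_alpha = project(patch, mean, modes)
    # b_beta: b_alpha limited to the assumed range of +/- limit standard deviations.
    b_beta = [max(-limit * sqrt(lam), min(limit * sqrt(lam), b))
              for b, lam in zip(b_alpha, eigenvalues)]
    return cosine(patch, reconstruct(mean, modes, b_beta))
```

A patch that the model can reproduce yields a similarity close to 1, whereas a patch deformed in a direction the model has not learnt yields a lower value.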
Once LS has been computed for all points in a local region, a histogram of
this similarity map is built. The histograms from all the local regions are then
concatenated into a single feature vector, labelled with the occlusion type: occluded
in the ocular region, occluded in the mouth region, occluded by glasses, or
unoccluded. The distances between feature vectors are measured by the Euclidean distance,
and the classification is performed by a simple K-NN classifier.
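The feature construction and classification described above can be sketched as follows. The number of histogram bins and the value range [0, 1] are assumptions, as they are not specified here, and the function names are hypothetical.

```python
from collections import Counter
from math import dist


def histogram(values, bins=10, low=0.0, high=1.0):
    """Count similarity values per bin; values outside [low, high] go to the end bins."""
    counts = [0] * bins
    width = (high - low) / bins
    for value in values:
        index = int((value - low) / width)
        counts[max(0, min(bins - 1, index))] += 1
    return counts


def occlusion_feature(similarity_maps, bins=10):
    """Concatenate the histograms of all local regions into one feature vector."""
    feature = []
    for region_values in similarity_maps:
        feature.extend(histogram(region_values, bins))
    return feature


def knn_classify(train_features, train_labels, query, k=5):
    """Majority label among the k training features closest to the query (Euclidean)."""
    ranked = sorted(zip(train_features, train_labels), key=lambda pair: dist(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

For instance, `knn_classify(features, labels, occlusion_feature(maps), k=5)` would return one of the four occlusion labels for a new scan.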
Since the occluded faces available in the dataset follow certain patterns, as
shown in Fig. 2.13, we have predefined, for each type of occlusion, a set of binary
values indicating the occlusion state of each local region. For example, for occlusion in the
mouth region, the set of binary values {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0 }
is used to initialize mi , where the first two zeros correspond to the nose corners and the last
four zeros correspond to the mouth corners and lip middles. The classification thus yields
the list of occluded local regions (mi in eq. 2.23).
2.4.4 Experimentations
The SFAM-based framework for 3D face landmarking described so far has
been evaluated on three datasets, namely the FRGC, BU-3DFE and Bosphorus
datasets which are described in subsection as well as the experimental setup.
Then, the results are given in the following subsections.
with the 2 highest-level expressions from all subjects, thus 1300 face scans in total.
The Bosphorus dataset contains 4666 face scans from 105 subjects. This dataset
contains not only many samples of the six universal facial expressions and many
AUs, but also 3D face scans under realistic occlusions like glasses, hands around
mouth and eye rubbing. Moreover, many male subjects have moustache and beard.
As illustrated in Fig. 2.11, 15 facial landmarks have been manually labelled
for the FRGC dataset and 19 for the BU-3DFE and Bosphorus datasets. They have
been used as ground truth for learning the SFAM model and testing our landmark
fitting algorithm. These three landmark sets share some common landmarks,
such as eye corners and mouth corners, and contain landmarks from both rigid and
non-rigid face regions.
We have made use of the Bosphorus dataset, which includes four kinds of occlusion, caused
respectively by hair, glasses, a hand near the mouth region and a hand near the ocular
region. An illustration of these types of occlusion can be found in Fig. 2.22.
Occlusion caused by glasses occurs in front of both eyes, with changes mainly in
local geometry. Occlusion caused by a hand near the ocular region generally occurs
in front of the right eye, with changes in both local geometry and local texture.
As occlusions caused by hair generally do not occur in the landmark regions, this
type of occlusion is excluded from our study. We consider for the experiments the
other three types of occlusion and an unoccluded neutral face from each subject.
We experimentally set K to five in the K-NN classifier and carried out a two-fold
cross-validation. Based on data availability, 347 face scans of 105 subjects have been
used, where each subject has at least two scans out of the four aforementioned
types and at most four scans of each type. In each round, the
face scans from about half of the subjects are used for training and the rest for testing,
so that the subjects used for training are different from those used for testing. After two
rounds, the scans of every subject have been used once for training and once for testing. The
confusion matrix is given in table 2.4. As we can see, an average classification accuracy
of up to 93.8% can be achieved, which has proved to be sufficient for further
’Eye’ represents the occlusion caused by hand near ocular regions; ’Mouth’ represents the occlusion
caused by hand near the mouth regions; ’Glasses’ represents the occlusion caused by glasses;
’Unoccluded’ represents neutral faces without occlusion.
We have made use of 452 face scans from FRGCv1 dataset to build our first SFAM
model which learns local properties of 15 regions and their configuration relation-
ships. The training face scans have limited illumination variations and do not con-
tain facial expressions. Fig. 2.14 illustrates the first mode of configuration, local
texture and local shape in the SFAM at their left and right boundaries (-3σ,3σ),
namely -3std and +3std.
Moreover, we have used the face scans from 11 subjects in the BU-3DFE dataset
and the first 32 subjects in the Bosphorus dataset to build our second and third SFAM
respectively, which capture global relationships and local properties from 19 landmarks.
Every subject used for training has 13 face scans in the case
of the BU-3DFE dataset (a neutral face scan and 2 face scans for each of the six
universal expressions, at intensity levels 3 and 4), and 7 face scans, including the 6 basic
expressions and the neutral one, in the case of the Bosphorus dataset. Fig. 2.15,
2.16 and 2.17 illustrate the third SFAM learnt from the Bosphorus dataset, showing
the first and second modes of configuration, local texture and local shape at their
left and right boundaries (-3σ,3σ), namely -3std and +3std.
Figure 2.14: SFAM learnt from FRGCv1 dataset: first variation modes on the
landmark configuration, local texture and local shape. The first mode of morphology
explains the landmark configuration variations in terms of face size; the first mode of
texture explains the intensity variation, especially in the eye region; the first mode of
shape explains the geometry variation in the upper part of the face.
Figure 2.15: SFAM learnt from Bosphorus dataset: variations of the first two
morphology modes. The first variation mode mostly explains the face morphology
changes along the vertical direction, while the second variation mode explains the
face morphology changes along the horizontal direction.
Figure 2.16: SFAM learnt from Bosphorus dataset: variations of the first two local
texture modes. The first variation mode mostly explains the facial texture changes
due to different skin color, while the second variation mode explains the facial texture
changes in the eye and mouth regions.
Figure 2.17: SFAM learnt from Bosphorus dataset: variations of the first two local
geometry modes. The first variation mode mostly explains the face geometry
changes in the lower part of the face, while the second variation mode explains face
geometry changes in the upper part of the face.
Figure 2.18: Cumulative error distribution of the precision for the 15 landmarks
using FRGCv1 (a) and FRGCv2 (b).
Table 2.5: Mean error and standard deviation (mm) associated with each of the 15
landmarks on the FRGC dataset
       lcle       rcle       ucle       lwcle      lcre       rcre       ucre
Mean   4.17/4.31  3.07/3.21  2.92/3.17  2.76/2.75  3.15/3.24  3.67/3.89  2.84/3.18
Std    2.13/2.05  1.42/1.44  1.39/1.66  1.21/1.31  1.56/1.43  1.90/2.04  1.45/1.63

       lwcre      lsn        nt         rsn        lcm        cul        cll        rcm
Mean   2.68/2.83  3.96/4.21  4.11/4.43  4.39/5.07  3.61/4.09  2.74/3.37  3.81/4.65  3.58/4.34
Std    1.21/1.38  1.65/1.71  2.20/2.56  1.85/2.36  1.92/2.32  1.42/1.89  1.97/3.41  1.99/2.50
The index of the landmarks is the abbreviation of the legend in Fig. 2.18. The left number in each
cell gives the result on FRGCv1 data while the right one gives the result on FRGCv2 data.
Using the learnt statistical models, the fitting algorithm for 3D face landmarking has
been evaluated on 3 different experimental setups. In all these experiments, errors
are calculated as the Euclidean distance between automatically located landmarks
and the corresponding manual ones (ground truth). We do not set a general criterion
or maximum allowed error to separate outliers in the following statistical results,
which means that almost all landmarking results are taken into consideration. Only a small
number of landmarking results (around 20 face scans) with mean errors over
20mm are excluded; these abnormally large errors are most likely due to
failures of the ICP alignment.
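The error measures used in these experiments can be sketched as follows. The function names are hypothetical, and the use of the population standard deviation is an assumption, since the estimator is not stated here.

```python
from math import dist
from statistics import mean, pstdev


def landmark_errors(predicted, ground_truth):
    """Euclidean distance (mm) between each located landmark and its manual counterpart."""
    return [dist(p, g) for p, g in zip(predicted, ground_truth)]


def drop_failures(per_scan_errors, limit_mm=20.0):
    """Discard scans whose mean landmark error exceeds limit_mm."""
    return [errors for errors in per_scan_errors if mean(errors) <= limit_mm]


def per_landmark_stats(per_scan_errors):
    """One (mean, std) pair per landmark, computed over all scans."""
    return [(mean(column), pstdev(column)) for column in zip(*per_scan_errors)]


def cumulative_distribution(errors, thresholds_mm):
    """Fraction of errors at or below each threshold, as plotted in a cumulative error curve."""
    return [sum(e <= t for e in errors) / len(errors) for t in thresholds_mm]
```

`per_landmark_stats` yields the kind of values reported in the tables of this section, while `cumulative_distribution` corresponds to the cumulative error curves.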
Using the first SFAM, the fitting algorithm has first been experimented on the
remaining part of the FRGCv1 dataset not used for training (roughly half), i.e. 462 face scans
from subjects different from those in training. We have then tested SFAM on 1500
face scans randomly selected from the FRGCv2 dataset, which contains illumination
variations and facial expressions. Fig. 2.18 shows the cumulative distribution of the
fitting accuracy for all 15 landmarks while Table 2.5 displays the mean and std of the
locating errors associated with each landmark. As we can see from the figure, most
landmarks are automatically located within 9mm precision in both tests. The mean errors
and the corresponding standard deviations indicate that landmarks in the upper face
region are located with better precision. The slight increase in mean error and
standard deviation in the second test is caused by the uncontrolled illumination and
facial expressions in the tested face scans. Fig. 2.19 illustrates some landmark locating
examples from these two experiments.
The third experiment has been carried out on the BU-3DFE dataset. Recall
that 143 face scans from the first five male subjects and six female subjects have
been used for training the second SFAM. 1157 face scans in total from the remaining
89 subjects are used for testing. Each testing subject has a neutral expression and
six basic facial expressions with intensity levels three and four. Fig. 2.20 illustrates
several locating examples with facial expressions. Fig. 2.21 shows the effect of expressions
on landmarking accuracy. As we can see from this figure, landmarks with less
deformation under expressions are better located, such as the eye corners, nose tip and nose corners.
Mouth corners and the middle of the lower lip are located with the worst precision
Figure 2.20: Landmarking examples from the BU-3DFE dataset with expressions of
anger (a), disgust (b), fear (c), joy (d), sadness (e) and surprise (f).
Table 2.6: Mean error and the corresponding standard deviation (mm) of the 19
automatically located landmarks on the face scans, all expressions included, from
the BU-3DFE dataset
       1     2     3     4     5     6     7     8     9
Mean   6.26  4.58  4.87  4.88  4.51  6.07  4.11  2.93  2.90
Std    3.72  2.82  2.99  2.97  2.77  3.35  1.89  1.40  1.36

       10    11    12    13    14    15    16    17    18    19
Mean   4.07  3.30  3.27  3.32  4.04  3.62  7.15  4.19  7.52  8.82
Std    2.00  1.70  1.56  1.94  1.99  1.91  4.64  2.34  4.75  7.12
and the greatest standard deviation in face scans expressing surprise because of the
significant deformation in this region induced by this emotional state. Table 2.6
summarizes the mean error along with the std of the landmarking algorithm with
all expressions. The mean errors for all 19 landmarks stay within 10mm while most
of the standard deviations are lower than 5mm. The locating accuracy of landmarks in
rigid face regions is comparable to that of the corresponding landmarks located in
The last experiment has tested the fitting algorithm using the third SFAM to
locate 19 landmarks on 3D face scans under occlusion from the Bosphorus dataset.
Fig. 2.22 illustrates several locating examples under occlusion. This experiment is
carried out on 292 face scans from all the subjects of the Bosphorus dataset that were
not used for training. In order to evaluate the efficiency of our proposed
occlusion classifier for landmarking, the fitting algorithm is compared between the
test with occlusion knowledge directly provided by the dataset and the test using
occlusion knowledge from our proposed occlusion detection and classification
algorithm (Table 2.7). The mean errors generally range from 6 to 11 mm, and more than
97% of the landmarks are located within 20mm precision in both configurations. Note that
this precision is considered as a criterion by some other works. Meanwhile, the mean
error and std increase on average for the latter test, which is due
to occlusion classification errors. However, these results remain acceptable, all the
more since this automatic approach can be generalized to datasets
without occlusion information.
The time needed by this algorithm (coded in Matlab) to localize the landmarks on a face
varies from 10min to 16min on a desktop PC with an Intel Pentium4 1.8GHz
and 1 GB of RAM. As in the previous algorithm, the simplex algorithm is used,
which is quite time consuming. The time consumed by the optimization in step
2 is around 2 to 3 minutes. The computation of the correlation meshes saves considerable
time, because the local interpolation is computed only once in step
3 instead of being recomputed in each iteration as in the previous algorithm.
Figure 2.22: Landmarking examples from the Bosphorus dataset with occlusion.
From left to right, faces are occluded in eye region, mouth region, by glasses and by
Table 2.7: Mean error and the corresponding standard deviation (mm) associated
with each of the 19 automatically located landmarks on the face scans from the
Bosphorus dataset under occlusion
       1           2          3          4          5          6
Mean   9.66/11.95  8.29/8.47  7.33/7.15  7.02/6.77  8.21/8.20  9.74/10.05
Std    6.08/8.85   3.92/4.39  3.41/3.36  3.23/3.38  4.27/4.45  5.23/6.08

       7           8          9          10         11         12
Mean   7.01/8.83   6.25/6.87  6.44/6.51  7.46/7.86  7.5/7.56   7.58/6.92
Std    3.77/6.37   3.42/4.21  3.08/3.58  3.56/4.73  3.60/3.88  3.63/4.02

       13          14         15         16         17         18         19
Mean   6.35/7.19   8.46/8.39  8.03/7.79  7.96/9.75  8.67/9.01  8.21/9.65  10.41/10.61
Std    3.11/2.99   3.64/3.64  3.31/3.36  4.18/6.28  4.84/4.93  4.25/4.97  5.37/5.61
The landmark indexes are as in Fig. 2.21. The left number in each cell represents the testing result
using occlusion information provided by the dataset while the right one displays the locating result
using occlusion information provided by our occlusion detection and classification algorithm. In
the latter tests, the knowledge of occlusion by hair (not considered by our occlusion detection and
classification algorithm) was provided by the dataset.
Fig. 2.23 illustrates several failure cases of landmarking under different conditions.
Cases a and b are mainly due to the large deformation of the mouth region
when faces display expressions. The morphology model in SFAM does not
contain a specific mode for the deformation of each expression; rather, it
learns variation modes from a mixture of expression and identity. Thus, when
fitting SFAM to a face with large morphology deformation, such as happiness or
surprise, the fitting algorithm sometimes cannot generate morphology instances
which approximate this extreme deformation. Cases c and d are mainly
due to the reduction of information in the fitting process when occlusion occurs. The
occluded local parts are not considered in the fitting algorithm, so that a smaller part
of the correlation meshes is used in the objective function. Thus, the prediction of the
morphology parameters uses less information and is neither as accurate nor as robust to
local minima as when no occlusion occurs. Moreover, the missing
values on occluded local correlation meshes introduce errors in the objective function,
as the weights α and β are determined when all local regions are considered.
Discussion
Compared to other 3D face landmarking algorithms in the literature, such as the ones
in [D’House et al. 2007] [Lu & Jain 2006] [Faltemier et al. 2008] [Xua et al. 2006]
[Colbry et al. 2005] [Jahanbin et al. 2008a] [Dibeklioglu et al. 2008], our SFAM-
based approach is a general 3D landmarking framework which encodes the configuration
relationships of the landmarks and their local properties in terms of texture
and shape through statistical learning instead of heuristic knowledge directly embedded
within the algorithm. Our algorithm is thus more flexible and enables locating
landmarks which are not necessarily shape prominent or texture salient.
Most existing works on 3D face landmarking in the literature have only been evaluated
on the FRGCv1 dataset. We can thus compare their results with the ones
achieved in our first experiment described in the previous subsection.
Using the same dataset and a heuristic guided statistical method, Dibeklioglu et
Figure 2.23: Some failure cases. a: failure case on face with surprise expression; b:
failure case on face with happy expression; c: failure case on face with occlusion in
mouth region; d: failure case on face with occlusion in eye region.
al. [Dibeklioglu et al. 2008] report an accuracy rate of around 99% for nose tips and
inner eye corners, and around 90% for the outer eye corners and mouth corners, within a
precision of around 19mm (a precision of 3 pixels on a face texture reduced at a rate of 8:1
in their paper; the 3D distance between adjacent pixels is 0.8mm to 1mm in FRGCv1).
Compared to this result, our technique locates more landmarks (15 instead
of 7) with better detection rates at the same precision.
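As a rough check of this precision, and assuming that the 3 pixel tolerance is measured on the 8:1 reduced texture so that one reduced pixel spans 8 original pixels, the conversion to millimetres would be:

```latex
3 \times 8 \times 0.8\,\mathrm{mm} = 19.2\,\mathrm{mm}, \qquad 3 \times 8 \times 1.0\,\mathrm{mm} = 24\,\mathrm{mm}
```

Under this assumption, the value of around 19mm corresponds to the lower end of the pixel spacing range.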
In [Lu & Jain 2006], Lu et al. located seven landmarks, namely the nose tip and the corners
of the eyes and mouth, on face scans from FRGCv1. The mean errors for these landmarks
are around 6.0mm to 10mm, while our technique displays locating errors for 15
landmarks of around 2mm to 5mm with a much smaller standard deviation. Using the
Bosphorus dataset with 3D face scans under occlusion, most mean errors of our
landmarks range from 6mm to 10mm with a much lower standard deviation.
In [Koudelka et al. 2005], Koudelka et al. located five landmarks, namely the inner
corners of the eyes, the sellion, the nose tip and the middle of the mouth, with a mean error of 3.57mm
over all five landmarks, and 97.22% of the landmarks are correctly detected
within a precision of 10mm. In our case, we reach a mean error of 3.43mm over all
15 landmarks and over 99% of the landmarks are correctly detected within a
precision of 10mm.
Comparing our two methods, the averages of the mean error and std over the
15 landmarks are 3.49mm and 2.34mm for the first method, and 3.43mm
and 1.68mm for the second method. Both results show a lower average error
and better reliability than the curvature analysis based method
[Szeptycki et al. 2009], which obtains 6.15mm and 4.05mm for seven landmarks that are
among the 15 landmarks of our methods.
To the best of our knowledge, there exists only one work in the literature
attempting to locate several landmarks on 3D face scans under facial expressions
[Nair & Cavallaro 2009]. In their study, a 3D point distribution model is proposed
to locate five landmarks, namely the two inner eye corners, the two outer eye
corners and the nose tip. Note that these landmarks lie in face regions that are rather
stable under expressions. Trained on 150 unnormalized face scans and tested on 2350
faces from the BU-3DFE dataset, their technique displays respectively a mean error
of 12.11mm, 11.89mm, 20.46mm, 19.38mm and 8.83mm for these five landmarks.
Using the same dataset and a comparable quantity of training faces (143 faces), we
obtain respectively a mean error of 4.11mm, 2.93mm, 2.90mm, 4.07mm and 4.04mm for
these five landmarks. Our technique also locates other landmarks in the mimic
face regions, on 1157 face scans displaying the two highest levels of expression
intensity in the whole dataset.
We have also studied the reproducibility and the corresponding precision of
manual landmarking for two reasons. First, manually labelled landmarks are used
as the ground truth for automatic landmarks. Because of subjective variance, manual
landmarks are not necessarily labelled at the precise location of the landmarks.
This imprecision may disturb the evaluation of the automatic landmarks. Thus, this
study provides a reference on the errors of the manual landmarks used for evaluation.
Secondly, this study also gives a reference on the variance of landmarking done
by humans, which serves as a comparison with machine variance. For these purposes,
11 subjects were asked to manually label the 15 landmarks as shown in Table 2.3.
The mean error of the 15 manual landmarks is 2.49mm with a std of 1.34mm. For
comparison, the second landmarking method achieves a mean error of 3.43mm with
a corresponding standard deviation of 1.68mm on the same dataset.
2.4.5 Conclusion
We have presented in this chapter two statistical model based methods for locating
landmarks on 3D face scans. Both methods rely on statistical models which learn
the variations in global landmark structure as well as local texture and range.
However, there are three major differences between the methods: firstly, the global landmark
configuration in the first method is defined on 2D texture images while the SFAM is a full
3D statistical model; secondly, the fitting algorithm of the former is similar to the
active shape model in 2D while the second one introduces correlation meshes in the
fitting; thirdly, combined with an occlusion detection algorithm, SFAM is able to
perform landmarking on partially occluded faces. Flowcharts of these two methods
are provided in fig 2.24 and fig 2.25 for a clear comparison. Experimental results
have demonstrated that, by considering both texture and geometry information, our
methods are able to locate a set of landmarks beyond those characterized by salient
shape with a better accuracy. Thus, SFAM has achieved better accuracy and
robustness than the previous models proposed in the literature
under severe conditions such as expression and occlusion.
In this chapter, only range and texture maps are used as simple descriptors of
local shape and texture around landmarks. In the future, the landmark location may
be improved by considering other descriptors such as HK curvature, shape index,
etc. for shape feature or Local Binary Pattern, Gabor filtering, etc. for texture
Chapter 3
3.1 Introduction
Emotional states, like many other internal physiological activities (see fig. 3.1), are
conveyed by facial expressions, which are generated by facial muscle contractions
and result in temporal facial deformations in facial geometry and/or texture.
Facial activities not only cause wrinkles, bulges and other kinds of appearance
deformation due to stretching and shrinking of the facial surface, which produce variations in
facial texture in the captured face data, but also modify the facial geometry.
Specifically, facial geometry here includes facial feature locations, such as distances between
landmarks (nose tip, inner and outer eye corners, mouth corners, ...) or feature
point displacements, and the geometrical shape of the face surface. Because 2D cameras
cannot capture 3D face information, face surface shape is seldom investigated by
2D approaches. Although facial muscle activities inherently change the facial
appearance in the three face representations, namely facial landmark configuration,
texture and surface shape, the consequences on them are not necessarily displayed
at the same level. For example, blinking causes obvious variations in texture and
shape in the eye region without displacing the eye corners, as shown in fig.3.3b; pulling
a lip corner deforms the local shape around the mouth corners but causes only subtle
variations in texture from certain views, as shown in fig.3.3c. Meanwhile, it is difficult and
challenging to detect certain facial activities using facial landmark configurations,
such as deepening the nasolabial furrow, sucking the lips inward or raising the chin, which are
not apparent from movements of facial points but rather noticeable from variations
in the other two representations, as shown in fig.3.3d,e,f.
Not only does the nature of the facial deformation carry the message, but the relative
timing and temporal evolution of an expression also convey important meaning. It
has been suggested that the dynamics of facial expression provide unique information
about emotion that is not available in static images. [Schmidt & Cohn 2001] have
shown that spontaneous smiles reach onsets faster than posed smiles and can have
multiple rises of the mouth corners. Moreover, they are accompanied by other muscle
activities that appear either simultaneously with mouth corner rises or follow
them within 1s. Generally an expression process can be segmented into 4 steps:
neutral, onset, apex and offset. The duration of typical muscle activities varies from
250ms to 5 seconds. Thus, the temporal dynamics of facial expressions are important
for evaluating expression intensity level and categorizing facial expressions or
muscle activities.
According to the two types of aforementioned emotion theories (discrete and
dimensional), we can distinguish two main streams in the analysis of facial expressions:
message-based approaches and sign-based approaches.
Message-based approaches are concerned with the message conveyed by facial
expressions. They directly associate specific facial patterns with emotions and clas-
sify expressions into a predefined number of discrete categories. The most commonly
used facial expression categories are the six basic emotions (fear, sadness, happiness,
anger, disgust, surprise), proposed by Ekman [Ekman & Friesen 1971].
Sign-based approaches aim at describing face deformation objectively rather than
inferring meaning underlying the appearance. Facial muscle activities are hereby ab-
stracted and coded by facial action units and then mapped into a variety of states in
emotional space by high-level decision making. To completely describe all possible
perceptible changes, the Facial Action Coding System (FACS) has been proposed
[Ekman & Friesen 1978]. It is a comprehensive and anatomically based system used
Chapter 3. 3D Facial Expression Recognition
to measure all visually discernible facial movements in terms of atomic facial
actions called Action Units (AUs). Over 7000 different AU combinations have been
observed and some of these combinations are mapped into basic emotions according
to Emotional FACS (EMFACS) rules and into various affective states according to the
FACS Affect Interpretation Database (FACSAID). For example, the combination of
AU1, AU2, AU5 and AU26 can be interpreted as the surprise expression while the
combination of AU6 and AU12 is interpreted as happiness. With the assistance of
a high-level decision-making process, AUs can be used to recognize spontaneous
facial expressions in the dimensional space rather than classifying them into the
universal expressions. A detailed interpretation of AU combinations to emotions
can be found in the appendix.
Over the past 20 years, facial expression recognition has attracted growing interest
within the computer vision community. Progress over the last five years has been
observed in two main aspects. New methods have been proposed to detect facial action
units in order to recognize more affective states as well as spontaneous expressions
besides the six universal expressions. Meanwhile, many studies have begun to consider 3D
faces for expression recognition.
ponents and facial feature points, including shapes and positions of face compo-
nents, as well as the location of facial feature points [Sohail & Bhattacharya 2007,
Tai & Chung 2007, Chang et al. 2009a, Ari et al. 2008]. In the case of 2D videos,
the position and shape of these components and/or landmarks are often detected
in the first frame and then tracked throughout the sequence [Obaid et al. 2009,
Gunes & Piccardi 2009, Brick et al. 2009]. The geometric features are easy to
extract and quite efficient; however, they ignore the texture information reflecting facial
texture variations and may not have enough discriminative power for identifying
subtle expressions and action units. Appearance features such as Gabor wavelets,
Haar features and Local Binary Pattern represent facial texture and transient varia-
tions due to wrinkles, bulges and furrows [Savran et al. 2010, Littlewort et al. 2006,
Bartlett et al. 2006, Koutlas & Fotiadis 2008, Tong et al. 2010, Uddin et al. 2009,
He et al. 2009, Yang et al. 2007, Zhao & Pietikainen 2007]. These features are very
informative but they exclude global configuration of facial components and may be
sensitive to illumination variations. Some studies adopt both global facial shape and
features extracted from texture. The advantage is that the two representations
compensate for each other in discriminative power. Good examples of
such a scheme are [Park et al. 2008, Park & Kim 2008, Mahoor et al. 2009],
which use the Active Appearance Model (AAM) to capture the characteristics of the facial
texture and the shape of facial expressions.
Generally these 2D recognizers face the challenges of illumination and head pose.
The appearance of facial expressions varies with the viewpoint of an observer. Thus,
head pose variation on face images includes the in-plane and out-of-plane rotations
as well as the face scale. The in-plane rotation occurs around the roll axis and can be
rectified by face alignment. Face scale is generally normalized by interpolation or
subsampling. However, it is difficult to handle out-of-plane rotations due to the
missing data caused by self-occlusion. Illumination also has a great influence on
face appearance in images. It has been observed that the face image modifications
caused by illumination changes can exceed the differences caused by expression and
identity factors. Although some lighting models have been proposed, this problem
is not yet completely solved, especially for expressions displayed on faces which are
3D faces, which contain not only facial texture but also facial surface shape, are
reputed to be insensitive to illumination and head pose. Since geometric properties
of the face, such as vertex normals or curvatures, can be computed, the Phong
reflection model can be used to rectify the texture variance caused by different lighting
conditions. Head pose can be normalized simply by multiplying each vertex by a rotation
matrix and adding a translation vector. Because the unit of the
3D face coordinate system is the millimetre instead of the pixel as in 2D, 3D face size
does not vary with the distance between the 3D scanner and the subject during the scan.
Thus, recognizing facial expression in 3D offers the ability to handle illumination and
head pose problems, contrary to 2D-based approaches. Moreover, because subjects can
be recorded with a less controlled head pose using a 3D scanner, spontaneous facial
expressions can be displayed on faces and analysed by 3D facial expression
recognizers. [Savran et al. 2010] compares the effectiveness of the 2D and 3D modalities for
detecting 25 AUs and demonstrates that the 3D modality generally performs better than
the 2D modality, especially for lower facial AUs, and that a fusion of both modalities
achieves the best performance. Specifically, Adaboost feature selection is applied on the
Gabor magnitude responses for each AU on both 2D images and 2D conformal maps of 3D
faces for comparison.
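The rigid pose normalization mentioned above (rotating every vertex and adding a translation vector) can be sketched as follows. This is only a minimal illustration in Python with NumPy; the function names `normalize_pose` and `rotation_about_z` are hypothetical, and how the rotation and translation are estimated is left open here.

```python
import numpy as np

def normalize_pose(vertices, rotation, translation):
    """Apply v' = R v + t to every row of an (N, 3) vertex array."""
    vertices = np.asarray(vertices, dtype=float)
    rotation = np.asarray(rotation, dtype=float)
    translation = np.asarray(translation, dtype=float)
    # Row vectors: (R v)^T = v^T R^T
    return vertices @ rotation.T + translation

def rotation_about_z(angle_rad):
    """Helper (illustrative): rotation matrix around the z axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
```

Because the coordinates are metric, no additional scale factor appears in this transform.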
Since a face in a static image can express an emotion, faces necessarily carry static
emotion properties. Thus, a majority of studies in the literature dealing with fa-
cial expressions considers static images. However, a facial expression also implies a
change of a visual pattern over time. This explains why more and more researchers
attempt to characterize the dynamic evolution of expressions in order to improve
the interpretation of facial activities [Hammal et al. 2007]. To do so, features
representing the temporal dynamics of facial expression are extracted. The speed of
a facial point displacement or the persistence of facial parameters over time can
be extracted [Chakraborty et al. 2009, Brick et al. 2009], either for action phase
segmentation or for recognition. In [Tong et al. 2007, Tong et al. 2010], the dynamic
The number of studies dealing with 3D facial expression recognition has recently
increased significantly, in particular thanks to the publication of 3D facial expression
databases. These databases allow researchers to develop
and tune their approaches, and then to compare their performance with that of the community.
Currently, there exist three public databases which contain 3D face scans for facial
expression analysis. The most widely used is the BU-3DFE database [Yin et al. 2006]
which contains face scans from 100 subjects displaying the six universal expressions
as well as the neutral one. Each expression is displayed with 4 intensity levels from
onset to apex. The BU-4DFE database [Yin et al. 2008] contains 606 facial expression
sequences in 3D captured from 101 subjects, with a total of approximately
60,600 frame models. For each subject, there are six model sequences showing six
prototypic facial expressions respectively. This is the only database which contains
3D video sequences displaying facial expressions. Finally, the Bosphorus database
[Savran et al. 2008] contains 105 subjects scanned with both the six universal ex-
pressions and facial action units. This is the only public database that contains
dedicated scans displaying action units in 3D.
Existing approaches on expression recognition based on 3D faces can be divided
into two categories: feature-based and model-based facial expression recognition.
These approaches are further detailed in the next subsections.
a recognition rate of 87.9%. Such distance-like features have been further explored
in [Tang & Huang 2008a], where fewer than 30 'best' features were automatically
selected from the candidate pool (all distances between 83 feature points). They achieve a
recognition rate of 94.7% with a requirement of one neutral scan from each subject.
Besides distance features, Tang and Huang [Tang & Huang 2008b] have also extracted
properties (the slope and length) of the line segments connecting 83 feature points,
to make up 96 distinguishing features for recognizing the six universal facial
expressions. They achieve a recognition rate of 87.1%. Landmark-based features are easy
to extract and invariant to head pose. However, their robustness to landmark precision
has not yet been investigated. The recognition performance may rely heavily on the
landmark location accuracy, which is difficult to achieve by automatic landmarking
methods. Moreover, as a face contains information related to both person identity
and expression, a normalization process is generally adopted to exclude the identity
information that may disturb the expression recognition process.
Another kind of geometrical features is surface shape-based features, which
are extracted from 3D face meshes and describe local shape properties. In
[Wang et al. 2006], principal curvatures, surface principal directions and steepness
of the surface have been calculated and further mapped into one of 12 primitive
features on each vertex. Histograms of these primitive features from manually
defined regions are extracted for classification. They achieve a recognition rate of
83.6%. However, 64 manually labeled landmarks are still required for defining the
face regions. In [Savran & Sankur 2009], least squares conformal maps and elastic
registration are used to map 3D faces into 2D images and to register the mapped faces
to a reference one automatically. 22 AUs are detected by estimating the
deformation between the registered face and the reference. The overall average correct
recognition rate is 91.4%.
Instead of directly extracting features, model based approaches make use of a generic
face model, generally deformable, as an intermediate [Ramanathan et al. 2006,
Mpiperis et al. 2008, Rosato et al. 2008, Venkatesh et al. 2009]. The expression
3.3.4 Discussion
Overall, when comparing the state-of-the-art methods for recognizing 3D facial
expressions, one can observe that most approaches:
• are generally not fully automatic since they require human intervention for
locating landmarks or for fitting the initial model, due to the lack of reliable
facial landmarking technology in 3D.
• are based on a single feature or face model and thus do not contain sufficient
information to describe a wide range of expressions or action units, which are
commonly investigated in the 2D environment as a promising alternative solution
for spontaneous expression recognition.
As mentioned in section 3.2.2, different facial actions deform different face
representations to different extents. Thus, we are convinced that features from
all face representations, including landmark location, facial texture and facial surface
shape/geometry, should be extracted and combined in order to characterize a wide
variety of facial expressions and action units comprehensively. In section 3.5, we will
present a unified probabilistic framework for both expression and AU recognition
problems, which aims at fusing the discriminative power of features from different
facial representations. Combined with the automatic landmarking methods we
proposed in the previous chapter, this framework is able to recognize facial expressions
efficiently in a fully automatic manner. Before this work, we first propose in the
next section a local geometry-based feature for describing the face shape deformation
caused by expressions, which can be fed to a classical classifier such as an SVM for
identifying the six universal expressions. Most existing geometry-based features
are either based on curvature computation, which is computationally complex and sensitive
to surface noise, or based on landmark configuration, which is easy to implement
but excludes the rich surface shape information. We therefore propose a feature which
can be easily computed and is effective for expression recognition.
Although features such as Gabor wavelets [Tong et al. 2010, He et al. 2009,
Chang et al. 2009b] or Local Binary Patterns (LBP) [Zhao & Pietikainen 2007,
Bai et al. 2009] have been widely used for recognizing facial expressions or action
units in the 2D environment, they cannot carry information related to the surface
deformation occurring on faces in the real 3D world since they do not integrate shape
information, and thus cannot accurately reflect complex and authentic facial
expressions. Moreover, the assumption of frontal images of faces under good illumination
generally required by approaches in 2D is unrealistic in 3D. Therefore, there is a
high demand for an efficient representation of facial expressions in 3D.
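\[ SI = \frac{1}{2} - \frac{1}{\pi} \arctan\left( \frac{\kappa_1 + \kappa_2}{\kappa_1 - \kappa_2} \right) \tag{3.1} \]
\[ H = \frac{\kappa_1 + \kappa_2}{2}, \qquad K = \kappa_1 \kappa_2 \tag{3.2} \]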
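As a minimal sketch, the shape index of eq. 3.1 and the mean and Gaussian curvatures of eq. 3.2 can be evaluated from the two principal curvatures as below. The function name `curvature_descriptors` is hypothetical. Two choices are assumptions of this sketch: the curvatures are reordered so that κ1 ≥ κ2, and `arctan2` is used so that umbilic points (κ1 = κ2) do not cause a division by zero; for a perfectly planar point (κ1 = κ2 = 0) the shape index is not defined and the returned value of 0.5 is only a placeholder.

```python
import numpy as np

def curvature_descriptors(k1, k2):
    """Return (shape index, mean curvature H, Gaussian curvature K)."""
    k1 = np.asarray(k1, dtype=float)
    k2 = np.asarray(k2, dtype=float)
    kmax = np.maximum(k1, k2)  # enforce kappa_1 >= kappa_2
    kmin = np.minimum(k1, k2)
    # arctan2(y, x) equals arctan(y / x) for x > 0 and stays finite for x == 0
    shape_index = 0.5 - np.arctan2(kmax + kmin, kmax - kmin) / np.pi
    mean_curvature = 0.5 * (kmax + kmin)
    gaussian_curvature = kmax * kmin
    return shape_index, mean_curvature, gaussian_curvature
```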
The primitive surface feature derives from the principal curvatures κ1, κ2, the
surface principal directions v1, v2 and the gradient magnitude ‖∇z‖ representing the
steepness of the surface around a vertex. Specifically, the local surface around a point
is estimated by locally approximating it with a smooth polynomial function,
\[ z(x, y) = \frac{1}{2} A x^2 + B x y + \frac{1}{2} C y^2 + D x^3 + E x^2 y + F x y^2 + G y^3 . \]
The Weingarten matrix for the local surface is
\[ W = \begin{pmatrix} A & B \\ B & C \end{pmatrix} = (\tilde{v}_1 \; \tilde{v}_2) \cdot \mathrm{diag}(\lambda_1, \lambda_2) \cdot (\tilde{v}_1 \; \tilde{v}_2)^T , \]
where λ1, λ2 are the eigenvalues and ṽ1, ṽ2 are the orthogonal eigenvectors in the local
coordinate system. v1, v2 are further computed by rotating ṽ1, ṽ2 into the global
coordinate system. The gradient magnitude ‖∇z‖ is computed from the smooth polynomial
function. Two thresholds are defined, namely TG and Tλ.
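The local fitting step can be illustrated with a least-squares fit of the cubic polynomial followed by an eigen-decomposition. This is a sketch under stated assumptions, not the implementation of [Wang et al. 2006]: the neighbouring points are assumed to be already expressed in a local frame whose origin is the vertex and whose z axis is the surface normal, the Weingarten matrix is taken to be the symmetric matrix of second-order coefficients [[A, B], [B, C]], and the name `fit_principal_curvatures` is hypothetical.

```python
import numpy as np

def fit_principal_curvatures(local_points):
    """Fit z = A/2 x^2 + B xy + C/2 y^2 + D x^3 + E x^2 y + F x y^2 + G y^3.

    Returns the eigenvalues (ascending) and eigenvectors (columns) of
    W = [[A, B], [B, C]] in the local coordinate system.
    """
    x, y, z = np.asarray(local_points, dtype=float).T
    design = np.column_stack([
        0.5 * x**2, x * y, 0.5 * y**2,       # second-order terms (A, B, C)
        x**3, x**2 * y, x * y**2, y**3,      # third-order terms (D, E, F, G)
    ])
    coeffs, *_ = np.linalg.lstsq(design, z, rcond=None)
    a, b, c = coeffs[:3]
    weingarten = np.array([[a, b], [b, c]])
    eigenvalues, eigenvectors = np.linalg.eigh(weingarten)
    return eigenvalues, eigenvectors
```

At least seven well-spread neighbours are needed for the fit to be determined; the local eigenvectors would still have to be rotated into the global frame to obtain v1, v2.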
Figure 3.4: Five basic visible-invariant surface types defined by shape index
[Yoshida et al. 2002].
Instead of using curvature to characterize the local surface, we directly sample the
peripheral vertices of a vertex and characterize the vertex by comparing their
geometrical relationship. We name this feature Surface Geometry feAture from poiNt
clouD (SGAND). From fig. 3.4 and 3.5, we can observe that the basic surface
types can be modeled by the geometrical relationship between the center part and
the peripheral parts. For example, the center part of peak surfaces is higher than
the peripheral parts while the center part of pit surfaces is lower; the center of
saddle ridge surfaces is lower than some peripheral parts and higher than others.
Here, we use the investigated vertex p to represent the center and eight clusters of
vertices around to represent the peripheral parts. Their relationships are detected
The radii of the cylinders C and of the circle S are respectively set to 2mm
and 7mm in fig. 3.7. We fix the radius of C because local surfaces with a size smaller
than 2mm can be considered as flat. The radius of circle S can vary so that
the vertex p can be characterized by different peripheral parts of the surface and thus be
more informative. Fig. 3.8 illustrates the variation of the radius of S. We compute and
compare the numbers of sampled vertices within the same cylinder above and below
the plane. Since these two sets of vertices always have the same density, SGAND is
invariant to face scale.
Figure 3.6: Classification rule of primitive 3D surface labels [Wang et al. 2006].
When the investigated vertex p and the main direction (the normal of plane M)
are fixed, the feature varies with the radius of the circle S over which the cylinders
C are distributed. Different vertices are sampled with varying radii, which may
change the binary values. The underlying reason for changing the radius is that
facial surfaces have different geometrical properties at different distances. For example,
the feature for the nose tip is always 0 because it is the highest vertex on the face
and this property does not change with the radius. However, the feature value of the
inner corner of the eyes definitely changes with the radius since the sampled regions
move across the nose saddle region and thus cause variations in the sampled surface
property. Fig. 3.9 displays our feature extracted from one face scan with various
radii of the circle S. Each color corresponds to a value of SGAND ranging from
0 to 255. We can see the SGAND distribution on faces and how this distribution is
affected by the radius of S.
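The description above can be turned into a small illustrative sketch. It is not the exact SGAND definition, since several details are assumptions made only for this example: the eight cylinders are placed at equal angles on the circle S with their axes parallel to the main direction, they are treated as infinitely long, a bit is set to 1 when more vertices lie above the plane M than below it, and the bit order follows the cylinder order. The function name `sgand` is hypothetical.

```python
import numpy as np

def sgand(points, p, normal, circle_radius=7.0, cylinder_radius=2.0):
    """Return an 8-bit code (0..255) for vertex p of an (N, 3) point cloud."""
    points = np.asarray(points, dtype=float)
    p = np.asarray(p, dtype=float)
    n = np.asarray(normal, dtype=float)
    n = n / np.linalg.norm(n)
    # Orthonormal basis (u, v) spanning the plane M through p
    helper = np.array([1.0, 0.0, 0.0]) if abs(n[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    u = np.cross(n, helper)
    u /= np.linalg.norm(u)
    v = np.cross(n, u)
    rel = points - p
    height = rel @ n                                   # signed distance to M
    in_plane = np.column_stack([rel @ u, rel @ v])     # projection onto M
    code = 0
    for k in range(8):
        angle = 2.0 * np.pi * k / 8
        centre = circle_radius * np.array([np.cos(angle), np.sin(angle)])
        inside = np.linalg.norm(in_plane - centre, axis=1) <= cylinder_radius
        above = np.count_nonzero(height[inside] > 0)
        below = np.count_nonzero(height[inside] < 0)
        if above > below:                              # assumed comparison rule
            code |= 1 << k
    return code
```

Under these assumptions a peak-like vertex such as the nose tip yields 0, in line with the text, and a pit-like vertex yields 255.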
Figure 3.7: Extraction of our proposed feature. a: frontal view, b: side view, c:
one cylinder for clarity. The green dot represents the investigated vertex which
is located in the nasal region of a face. A plane and eight cylinders are involved as
Unlike the primitive surface feature, which is extracted using a local coordinate
system, we always extract SGAND along the direction perpendicular to the plane M in the
3D coordinate system. This is more intuitive and matches human perception
since we look at faces along the gaze direction. Usually, the plane M is
perpendicular to the z axis of the coordinate system for frontal 3D faces. If a face
has another head pose, we need to define another direction for the plane M instead
of the z axis, one which varies with the head pose and indicates the frontal direction
of the face. Thus, we propose an automatic approach to estimate head pose and
find this direction in order to form the face planes for our feature extraction. This
Figure 3.9: Influence of the radius of circle S on our feature extracted from a neutral
face: a: 3mm, b: 5mm, c: 7mm, d: 9mm, e: 11mm, f: 13mm.
Pose estimation of a 3D facial model aims at finding how the 3D face surface is
embedded into the 3D coordinate system [Besl & Jain 1986]. A reliable pose
estimation plays an important role in face alignment and feature extraction. For
example, a plane perpendicular to the face frontal direction is required when
extracting SGAND. Methods have been proposed both
in 2D [Bailly et al. 2009] and in 3D. The 3D approaches either use range data
[Breitenstein et al. 2008], and thus apply to 2.5D faces, or require a training process
for a generic face model [Kinoshita et al. 2006]. Thus, we propose in the following
a fast and efficient pose estimation approach which is based on the face mesh and does
not require any training process. This method is therefore suitable
as a preprocessing step in 3D face analysis systems.
The basic idea is to use the vertices on the frontal side of a face to generate a
face plane by regression. With the normal of the plane and a direction from top to
bottom of faces, head pose in 3D coordinate system can be estimated and thus the
In order to find the frontal vertices on a face, we take the normals of vertices
into consideration because they are rotation invariant and represent the directions
of vertices on facial surface. The normal of a vertex is computed by averaging the
normals of surrounding triangle facets. fig. 3.10 illustrates a face with normals on
all vertices.
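Averaging the normals of the surrounding triangle facets can be sketched as follows. The function name `vertex_normals` is hypothetical, and the sketch assumes a triangle mesh given as an (N, 3) vertex array and an (F, 3) index array with consistently oriented, non-degenerate triangles.

```python
import numpy as np

def vertex_normals(vertices, faces):
    """Unit normal per vertex, as the normalized mean of adjacent facet normals."""
    vertices = np.asarray(vertices, dtype=float)
    faces = np.asarray(faces, dtype=int)
    a, b, c = (vertices[faces[:, i]] for i in range(3))
    facet_normals = np.cross(b - a, c - a)
    facet_normals /= np.linalg.norm(facet_normals, axis=1, keepdims=True)
    normals = np.zeros_like(vertices)
    for corner in range(3):
        # Accumulate each facet normal on the three vertices of that facet
        np.add.at(normals, faces[:, corner], facet_normals)
    lengths = np.linalg.norm(normals, axis=1, keepdims=True)
    # Vertices not referenced by any facet keep a zero normal
    return normals / np.where(lengths > 0, lengths, 1.0)
```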
\[ \arg\min \sum_{i=1}^{3} \sum_{x_j \in S_i} \| x_j - u_i \|^2 \tag{3.3} \]
1. Place three points into the space represented by the normal data that are
being clustered. These points represent the initial group centroids.
2. Assign each normal to the group that has the closest centroid.
3. When all normals have been assigned, recalculate the positions of the K centroids.
4. Repeat Steps 2 and 3 until the centroids stabilize. This produces a
separation of the normals into groups from which the metric to be minimized can
be calculated.
We have compared the clustering results between K-means and Mixture of Gaus-
sians, and display one example in fig. 3.11. We can observe that normals separated
by K-means are more symmetrical than the results obtained from Mixture of Gaus-
The clustering process outputs three normals u1, u2, u3 which are the centroids
of S1, S2, S3 respectively. In order to identify the mean normal which represents
the normals pointing to the front, we compute the inner product of each pair among
u1, u2, u3. Indeed, the angle between the left mean normal and the right mean
normal is the largest among the three angles formed by any pair of u1, u2, u3. Thus,
the pair with the minimum inner product corresponds to the left and right mean normals,
and the remaining centroid represents the frontal vertices.
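The clustering of normals and the selection of the frontal group can be sketched as below. This is a plain NumPy illustration, not the implementation used for the experiments: the farthest-point initialisation is an assumption chosen to make the example deterministic, and the function names `kmeans` and `frontal_cluster` are hypothetical.

```python
import numpy as np

def kmeans(data, k=3, max_iter=100):
    """Basic K-means on an (N, 3) array of unit normals."""
    data = np.asarray(data, dtype=float)
    # Assumed initialisation: first sample, then repeatedly the farthest sample
    centroids = [data[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(data - c, axis=1) for c in centroids], axis=0)
        centroids.append(data[np.argmax(d)])
    centroids = np.array(centroids)

    def assign(cents):
        dist = np.linalg.norm(data[:, None, :] - cents[None, :, :], axis=2)
        return np.argmin(dist, axis=1)

    for _ in range(max_iter):
        labels = assign(centroids)
        updated = np.array([
            data[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return centroids, assign(centroids)

def frontal_cluster(centroids):
    """Index of the centroid left over once the pair with the smallest
    inner product (left and right mean normals) has been removed."""
    pairs = [(0, 1), (0, 2), (1, 2)]
    dots = [float(centroids[i] @ centroids[j]) for i, j in pairs]
    i, j = pairs[int(np.argmin(dots))]
    return ({0, 1, 2} - {i, j}).pop()
```

The vertices whose label equals the returned index form the frontal group used for the plane fitting.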
After the group of frontal vertices has been obtained, a Principal
Component Analysis (PCA) is used to fit a linear regression that minimizes the
perpendicular distances from those points to a plane and a line. The process
is as follows. The coefficients D1, D2 of the first two principal components define
vectors that form a basis for the plane. The third principal component is orthogonal
to the first two, and its coefficient D3 defines the normal vector of the plane. The
plane passes through the mean point Pm of the group. Meanwhile, the coefficient
D1 of the first principal component gives the vertical direction of the face. Indeed,
the first component explains the most prominent variance in the data, which is the
vertex location variance along the top-bottom direction, as seen in the blue points in fig.
3.11. This direction is the best 1-D linear approximation to the data. In summary,
D3 and Pm form the face plane while D1 and Pm form the line.
Figure 3.11: Separation of vertices into 3 sets: left (red), frontal (blue), right (green).
a: K-means, b: Mixture of Gaussians
The plane can also be fitted to the group of frontal vertices by other methods,
such as the least squares method. However, the least squares method only
yields a face plane, whose normal can be used in our feature extraction.
In contrast, our method allows the estimation not only of the
face plane but also of the three head pose directions: yaw, pitch and roll.
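The PCA-based fit of the face plane and of the vertical line can be sketched with a singular value decomposition of the centred frontal vertices. The function name `fit_face_plane` is hypothetical; note that the sign of each principal direction returned by an SVD is arbitrary, so an additional convention (not shown here) would be needed to orient D1, D2 and D3.

```python
import numpy as np

def fit_face_plane(frontal_vertices):
    """Return (Pm, D1, D2, D3) for an (N, 3) array of frontal vertices.

    D1, D2 span the fitted plane, D3 is its normal, Pm is the mean point.
    """
    pts = np.asarray(frontal_vertices, dtype=float)
    mean_point = pts.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance
    _, _, vt = np.linalg.svd(pts - mean_point, full_matrices=False)
    d1, d2, d3 = vt
    return mean_point, d1, d2, d3
```

The plane is then the set of points x with D3 · (x − Pm) = 0, and the vertical line is Pm + s·D1.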
Our proposed SGAND has been designed for 3D face analysis including face detec-
tion, facial landmarking, face recognition and facial expression recognition. In this
subsection, we propose to make use of it for 3D facial expression recognition.
Figure 3.12: Feature extracted from faces with six universal expressions. a: anger,
b: disgust, c: fear, d: happiness, e: sadness, f: surprise.
After having extracted the features from a face, facial expressions can be rep-
resented by the distribution of the features over the facial region. Indeed, facial
expression is the consequence of human emotion and implies facial muscle acti-
vation that modifies the facial surface geometry. Such a variation results in the
distribution variations of SGAND, as illustrated in fig. 3.12. Thus, one can identify
facial expressions by using SGAND.
To find an explicit description of the fundamental structure of facial surface
details, we have investigated the statistical distributions of the feature for nine
expressive facial regions. As shown in fig. 3.13, 83 manually labelled landmarks are
defined on the facial surface, and accordingly, the nine expressive local regions are
constructed based on these points. Note that the nose region and the interiors of the eyes
are currently not included in the nine local regions. The nose region is widely accepted
as a rigid facial region whose surface shape does not vary with facial expression.
Thus, including the nose region is unnecessary since it provides no expression information.
Meanwhile, because of flaws in 3D face scan capture, the interiors of the eyes generally
contain holes and thus cannot accurately record the local shape.
In short, the selected nine local regions cover the most expressive facial areas. In
Figure 3.13: Nine selected facial regions labeled by colors other than blue.
where n_{im} is the number of vertices in the i-th local region having the feature value m
(from 0 to 255), and n_i is the total number of vertices in the i-th local region
($n_i = \sum_{m} n_{im}$). M = 256 is the number of bins of our feature value.
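A regional descriptor built from these counts can be sketched as follows. The sketch assumes that the descriptor of each region is the normalized histogram n_im / n_i and that the regional histograms are concatenated; both points are assumptions of this example, as are the function name `region_histograms` and the integer region labels.

```python
import numpy as np

def region_histograms(feature_values, region_labels, n_regions=9, n_bins=256):
    """Concatenate one normalized histogram of feature values per region.

    feature_values: integer SGAND values (0..n_bins-1), one per vertex
    region_labels:  integer region index (0..n_regions-1), one per vertex
    """
    feature_values = np.asarray(feature_values, dtype=int)
    region_labels = np.asarray(region_labels, dtype=int)
    hist = np.zeros((n_regions, n_bins))
    for i in range(n_regions):
        values = feature_values[region_labels == i]
        if values.size:  # n_i > 0
            hist[i] = np.bincount(values, minlength=n_bins) / values.size
    return hist.ravel()
```

Each non-empty region contributes a block of M values that sums to one.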
(P_i in eq. 3.6) corresponding to the six classes, where E is the number of expressions
and i indexes the radii of the sampling circle. The sets of probabilities from all
classifiers associated with the different radii are summed to obtain
the overall probability set, and the expression is recognized by choosing the one with
the maximum value in the overall probability set, as shown in eq. 3.7. Eq. 3.7 can be
regarded as a score fusion:
\[ X = \arg\max \left( \sum_{i=1}^{S} [P_{i1}, \ldots, P_{ij}, \ldots, P_{iE}] \right) \tag{3.7} \]
where X ∈ {anger, disgust, fear, happiness, sadness, surprise} and S is the number
of different radii.
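The fusion rule of eq. 3.7 amounts to summing the per-radius probability vectors and taking the most probable class, which can be sketched as follows. The function name `fuse_scores` and the class ordering in `EXPRESSIONS` are assumptions of this example; the probabilities are assumed to come from one classifier per radius.

```python
import numpy as np

EXPRESSIONS = ("anger", "disgust", "fear", "happiness", "sadness", "surprise")

def fuse_scores(probabilities):
    """probabilities: (S, E) array, one row of class probabilities per radius."""
    probabilities = np.asarray(probabilities, dtype=float)
    overall = probabilities.sum(axis=0)          # sum over the S radii
    return EXPRESSIONS[int(np.argmax(overall))], overall
```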
In this section, we present the experimental results obtained for the evaluation of our
approaches on pose orientation estimation and on our SGAND for facial expression
recognition. The database we have used in the tests is the BU-3DFE dataset.
For the evaluation of the pose estimation approach, we have used the neutral faces
and the faces with the six universal expressions at the two highest intensity levels
from all subjects, so that 1300 face scans have been tested in total.
In fig. 3.14, faces with six universal expressions are displayed with the estimated
planes (the black rectangles) and three directions (green, blue, red lines) from PCA.
We have further analysed quantitatively the test results on the 1300 facial models
with different numbers of vertices, poses and expressions. For evaluation, we have
manually selected the feature points of inner eye corners and nose corners of each
model and derived its orientation as the ground truth of pose orientation. We then
Figure 3.14: Results of pose estimation on faces with the six universal expressions.
a: anger, b: disgust, c: fear, d: happiness, f: sadness, e: surprise
compared the pose orientation estimated by our approach with the ground truth
orientation, and have considered the estimated pose correct if the difference
between the estimated and the ground truth pose orientations is less than 10°. The
correct pose estimation rates of face models are 94.36% for the normal of the face
planes (D3), displayed as the red lines in fig. 3.14, 98.24% for the vertical directions
(D1), displayed as the green lines, and 96.44% for the horizontal directions (D2),
displayed as the blue lines. The approach in [Breitenstein et al. 2008] achieves a correct
rate of 80.8% with the same criterion for correct estimation using their own dataset.
In [Seemann et al. 2004], a pose success rate of 75.2% for 10° has been achieved.
Compared with them, our method appears to be more accurate. However, no direct
comparison is possible because different datasets have been used.
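The correctness criterion used above (an angular difference below 10° between an estimated and a ground truth direction) can be sketched as follows. The function names `angle_deg` and `is_correct` are hypothetical, and the sketch assumes that both directions are oriented consistently; if the sign of an estimated axis is arbitrary, it would have to be resolved before the comparison.

```python
import numpy as np

def angle_deg(estimated, reference):
    """Angle in degrees between two 3D direction vectors."""
    e = np.asarray(estimated, dtype=float)
    r = np.asarray(reference, dtype=float)
    cosine = np.dot(e, r) / (np.linalg.norm(e) * np.linalg.norm(r))
    # Clip to guard against rounding slightly outside [-1, 1]
    return float(np.degrees(np.arccos(np.clip(cosine, -1.0, 1.0))))

def is_correct(estimated, reference, threshold_deg=10.0):
    """True when the angular error is below the threshold."""
    return angle_deg(estimated, reference) < threshold_deg
```

The correct estimation rate for one direction is then the fraction of scans for which `is_correct` holds.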
For the evaluation of facial expression recognition, we have used faces with the six
universal expressions at the two highest intensity levels from 60 subjects so that the
results can be compared with other works in the literature. Each face has been manually
labelled with 83 fiducial points for the face segmentation.
For these tests, we have set the radius of the circle S to 3, 5, 7, 9, 11 and 13mm
and have computed the expression descriptor E with each of these radii.
Because most of the faces are in a roughly frontal pose, we have used the Z axis as the
main direction and have extracted our features directly on the faces. Then, we have
trained an SVM for each E and have fixed its parameters for all rounds of the test.
Table 3.4 shows the confusion matrix of the average case for the test. The expressions
surprise, happiness, sadness and disgust are well identified with accuracies over 90%,
with surprise reaching a 100% recognition rate. However, anger
and fear have considerably lower recognition rates. Most anger expressions are confused
with sadness, and fear expressions are more likely to be misclassified as happiness.
The average recognition rate for all the six universal expressions is 75.3%.
We mainly compare our results with those in [Wang et al. 2006]. The main
purpose of the comparison is to show the performance of the proposed feature. The
scheme of our method is very similar to theirs so that the efficiency of the features
can be compared directly and fairly. [Wang et al. 2006] extracts another geometry-based
feature (primitive features) and computes histograms of features from different
Table 3.13 lists several recognition results in the literature using the same
database. Among them, [Tang & Huang 2008a] achieves the best average recognition
rate of 94.7% by selecting good features from all distances between 83
landmarks with the Adaboost algorithm. However, this study requires a neutral face from
each subject for distance normalization and is thus subject biased. The recognition
rates of the other works are reported between 83% and 90% using manual landmarks.
Figure 3.15: Failure cases for expression recognition using the proposed SGAND
features. The first and second rows show the misclassification of anger into sadness
for subjects 2 and 4. The third row shows the misclassification of fear into happiness
for subject 64. It can be observed that the distributions of extracted features under
different expressions are quite similar, which is the main reason for confusion.
3.4.6 Conclusion
In this section, we have discussed and analyzed the popular geometry-based features,
including HK curvatures, shape index and primitive surface features. Then, we have
proposed our geometry-based feature SGAND, which can be extracted from point
clouds of 3D faces and is invariant to face scale, contrary to the other approaches.
Indeed, instead of computing the principal curvatures, we describe the local
geometry by comparing the numbers of vertices within the sampled regions above and below
a plane defined by the center vertex and a frontal face direction. Thus, the extraction
of this feature is fast and easy to implement.
In order to extract the feature on faces under various poses, we have proposed
a pose estimation approach which estimates the frontal, vertical and horizontal
orientations of 3D faces. This approach first clusters the normals of the vertices on a face
to detect the vertices on the frontal side. Then, the directions are estimated by
a PCA-based regression. Thanks to this approach, our geometry-based feature can
be extracted under various head poses.
which use a similar scheme and experimental setup. However, more expressions
are better recognized by our approach, which is also computationally more efficient. Thus,
the experiments have highlighted the ability of our proposed feature to describe
facial local geometry efficiently.
However, the local geometry may not carry sufficient information to represent all
kinds of deformations caused by various expressions, and more information about the
local shape property may be necessary, such as texture for characterizing, for example,
bulges and furrows. Thus, in order to identify expressions and action units with high
precision, features from different facial representations should be considered. In the
next section, we will present our approach based on a Bayesian Belief Network for fusing
the features extracted from different face representations.
Existing 3D facial expression recognition systems mostly aim at identifying the six
universal expressions, using geometry-based features extracted from the face surface.
Line properties between landmarks, such as angles and distances, are often used
and can achieve rather good results [Tang & Huang 2008a, Soyel & Demirel 2008,
Tang & Huang 2008b, Wang et al. 2006, Hu et al. 2008a].
However, as we discussed previously, expressions are created by facial muscle
contractions and result in variations of landmark locations as well as of texture
and surface shape in expressive facial parts. By means of these variations, a wide range
of expressions other than the universal ones can be exhibited on a face, as well as
all the 44 basic facial action units (AUs). The geometry-based approaches exclude
information from other face representations and thus do not make use of the
comprehensive characteristics of facial appearance. Although experiments have proved
their good performance in recognizing the universal expressions, the geometry-based
features may not be rich enough to discriminate other subtle expressions or facial
action units.
Moreover, feature based approaches generally rely on a large number of precisely
located landmarks, either for feature extraction or for face segmentation. Their
performance thus highly depends on a landmark precision that cannot be achieved
by automatically located landmarks, so human intervention is generally required
in these approaches.
On the other hand, morphable facial models are built by learning the deformation
modes of the texture and geometry representations, and use the deformation
parameters as features for recognition [Ramanathan et al. 2006, Mpiperis et al. 2008,
Rosato et al. 2008, Venkatesh et al. 2009]. Two major problems arise. Firstly, the
deformation modes learnt from whole faces describe the major variations globally
and thus cannot properly reflect the local deformation patterns caused by AUs.
Secondly, the learnt variation modes are not necessarily consistent with the
variations among AUs and expressions, and thus may not synthesize expressions
accurately. In other words, AUs or expressions can only be approximated by
combining a set of variation modes, rather than being modeled by one specific mode
of the model. For certain expressions such as happiness or surprise, or action units
such as AU27 (mouth opening), the deformation is prominent and can thus be
approximated well enough to be distinguishable. However, for some of the more
moderate expressions and AUs, small variations in the parameters yield different
expressions.
Therefore, in order to characterize facial deformations comprehensively, features
from all three face representations (facial landmark locations or global geometry,
texture, and local geometry) should be considered. This raises the problem of how
to fuse the contribution of each feature efficiently. In this section, we propose
to use a Bayesian Belief Network (BBN) to solve this problem. Beliefs on the
expression node for the different expressions or AU states are inferred from the
network parameters of neighboring nodes. Statistical feature models (SFMs) are learnt
to estimate these parameters on the nodes corresponding to the subject and the facial
features. A distance-like feature is extracted to describe the global geometric
relationships between face components. Meanwhile, local information is extracted not
only from the raw facial texture and shape but also from other features such as the
shape index and LBP, so that subtle local deformations can be well characterized and
different kinds of expressions or AUs become more distinguishable. SFMs are learnt
for each type of feature and the parameters are estimated following a uniform process
in our BBN. This leads to a flexible system where any new feature can be modeled by
an SFM, and the corresponding knowledge directly "plugged" into the BBN.
Moreover, the BBN can be combined with the SFAM proposed in the
previous chapter to build a fully automatic expression and AU recognition system.
Indeed, the adopted features are extracted from local regions on important facial
parts, and our SFAM is able to locate landmarks in those regions automatically. We
thus use the SFAM as the first stage of the automatic system to locate feature
points, and then extract features around those landmarks for recognition. Because
we consider features from three face representations, our system is more robust to
landmarking errors than state-of-the-art approaches, as shown by the evaluation
of the system.
Graphical models have already been used in facial expression analysis in 2D.
A Dynamic Bayesian Network is developed in [Tong et al. 2007] to model the dy-
namic and semantic relationships among facial action units. The network has been
extended to a more sophisticated one in [Tong et al. 2010] which coherently rep-
resents head pose and action units. A Bayesian Belief Net aiming at describ-
ing the relationship between expression and facial action units is developed in
[Datcu & Rothkrantz 2004] for expression recognition.
However, the BBN we propose differs from them in three aspects:
2. The structure of our BBN is different. In [Tong et al. 2007], the learnt struc-
ture of the Bayesian Network explores the dynamic relationship among AUs.
In [Datcu & Rothkrantz 2004], the structure of BBN describes the relationship
between AUs and the six universal expressions. However, our BBN concen-
trates on describing the causal relationship among subject, expression and
facial features.
3. Because the objects and the structure of the graphical models are not the same,
the computation of their parameters is consequently different.
In the following sections, we will present our Bayesian Belief Network as well as
feature extractions adopted in this network.
In this subsection, we first introduce some background knowledge on BBNs and then
specify their use for facial expression and AU recognition. The belief computation
in the BBN is then presented. Since the BBN structure is designed in a unified way,
so that facial expressions and AUs are recognized with the same procedure, we will
use the term ’facial activity’ to denote the six universal expressions as well as
the AUs.
A Bayesian Belief Network [Duda et al. 2000] is a probabilistic graphical model with
the topology of a directed acyclic graph (DAG), as shown in fig. 3.16. It is made up
of a collection of nodes and directed edges, but without directed cycles, as shown
in fig. 3.17. Nodes represent a set of random variables and directed edges represent
their conditional dependencies.
In fig. 3.17, the ’belief’ of a variable on a node X (X = (x1 , x2 , ..., xn )) describes
the probability of its states given the evidence e (observations) on its
connected neighbor nodes. These nodes can be divided into parents (nodes pointing
directly to X via an edge) and children (nodes pointed to directly from X via
an edge) to compute the belief as:
\[ P(X \mid e) \propto P(e_{c} \mid X)\, P(X \mid e_{p}) \qquad (3.8) \]
The factor about parent nodes in eq. 3.8 is calculated from the conditional
probabilities of X under all combinations of the parents' states, together with the
probabilities of those states given the evidence, as in eq. 3.9:
\[ P(X \mid e_{p}) = \sum_{i,j,\ldots,k} P(X \mid p_1^{i}, p_2^{j}, \ldots)\, P(p_1^{i} \mid e_{p_1})\, P(p_2^{j} \mid e_{p_2}) \cdots \qquad (3.9) \]
where e_p is the evidence of the parents, p_1^i means the ith state of the first parent,
p_2^j means the jth state of the second parent, etc. P(p_1^i | e_{p_1}) is the probability
of the ith state (I states in total) of the first parent given its evidence e_{p_1}.
P(p_2^j | e_{p_2}) is the probability of the jth state (J states in total) of the second
parent, etc. P(X | e_p) is thus a sum of I * J * ... * K terms in total.
The factor about children nodes can be rewritten as:
\[ P(e_{c} \mid X) = P(e_{c_1}, e_{c_2}, \ldots, e_{c_{N_c}} \mid X) = \prod_{l=1}^{N_c} P(e_{c_l} \mid X) \qquad (3.10) \]
where ecl is the evidence or observation of the lth child node, Nc is the number of
children, P (ecl |X) is the probability of evidence knowing the X state.
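As an illustration of eq. 3.8 to eq. 3.10, the belief of X can be sketched numerically for the case of a single parent, which is the case used later in this section. This is a minimal sketch and not the implementation used in this work: the function name `belief`, the array layout and the toy numbers are assumptions made for illustration.

```python
import numpy as np

def belief(parent_cpt, parent_post, child_likelihoods):
    """Belief over the states of X for one parent and several children.

    parent_cpt:        (n_parent_states, n_x), row i is P(X | parent state i)
    parent_post:       (n_parent_states,), P(parent state i | parent evidence)
    child_likelihoods: (n_children, n_x), row l is P(e_cl | X)
    """
    # Parent factor: sum over the parent states (single-parent form of eq. 3.9).
    parent_factor = parent_post @ parent_cpt
    # Children factor: product over the children (eq. 3.10).
    child_factor = np.prod(child_likelihoods, axis=0)
    # Belief is proportional to the product of both factors (eq. 3.8).
    unnormalised = parent_factor * child_factor
    return unnormalised / unnormalised.sum()

# Toy example (hypothetical numbers): 2 parent states, 3 states of X, 2 children.
parent_cpt = np.full((2, 3), 1.0 / 3.0)
parent_post = np.array([0.5, 0.5])
child_likelihoods = np.array([[0.2, 0.5, 0.3],
                              [0.1, 0.6, 0.3]])
b = belief(parent_cpt, parent_post, child_likelihoods)
best_state = int(np.argmax(b))
```

With a uniform parent factor, as in this toy example, the belief is driven by the children factor alone.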
Figure 3.17: The proposed Bayesian Belief Network. We infer belief of states in
node X, which represents facial activity (expression or AU), from its parent node
S, which represents 3D face scans and its children nodes F 1 ,F 2 ,..., FNf which
represent facial features (landmark displacement, raw local texture and range around
landmarks, etc...)
The structure of our BBN is illustrated in fig. 3.17. The node X represents the
facial activity variable and has as many states as the kinds of facial expressions or
AUs that are to be recognized, such as six states for the six universal expressions
or 16 states for the 16 AUs mentioned later in this section. The node S is X’s
parent, representing human subjects that we explore. It has as many states as the
number of subjects. X’s children F1 ,F2 , ... ,FNf represent the facial features that
are extracted to carry face information.
Since there is only one parent for the node X, the factor P (X|ep ) in eq. 3.8 can
be expressed as:
\[ P(X \mid e_{p}) = \sum_{i=1}^{N_s} P(X \mid p_S^{i})\, P(p_S^{i} \mid e_{p_S}) \qquad (3.11) \]
where N_s is the total number of subjects that we explore, P(p_S^i | e_{p_S}) is the prior
probability of the ith subject and P(X | p_S^i) is the conditional probability of X given
the state of the ith subject. When all tested subjects perform the same number of
expressions (as is the case for the available face databases), P(X | p_S^i) and
P(p_S^i | e_{p_S}) follow a uniform distribution. Thus, P(X | e_p) also follows a uniform
distribution. In other cases, P(p_S^i | e_{p_S}) can be computed with face recognition
approaches, while P(X | p_S^i) can be computed either from the expression probability
distribution in databases or, in a realistic situation, from the frequency with which
each expression appears on the subject's face over a period of time in daily life.
Therefore, for a given face κ, eq. 3.8 can be rewritten as follows:
\[ P(X \mid e_{\kappa}) \propto \prod_{l=1}^{N_c} P(e_{c_l} \mid X) \qquad (3.12) \]
where eκ refers to observations from the face κ. Thus, the belief for each expression
state is computed from eκ and the state holding the highest belief is considered as
the most probable expression (or AU) of the face κ, as in eq. 3.13.
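The decision rule described above can be written compactly as follows. This is only a sketch that follows from eq. 3.12; the symbols \hat{x}_\kappa (the selected state) and x_i (the ith state of X) are introduced here for illustration.

```latex
\hat{x}_{\kappa}
  = \arg\max_{x_i} P(X = x_i \mid e_{\kappa})
  = \arg\max_{x_i} \prod_{l=1}^{N_c} P(e_{c_l} \mid X = x_i)
```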
Our BBN is derived from the Bayesian Belief Network in [Duda et al. 2000], a
general example of which was presented above. However, the method to
obtain P (ecl |X) has to be designed for our specific problem. In our case, we propose
to use a statistical feature model (SFM) to estimate P (ecl |X), as described in the
following.
To obtain the beliefs for the X node, we need to estimate P (ecl |X) for each child node,
which is computed with a statistical feature model (SFM). SFMs are
built for all features in a uniform manner. Specifically, given a training set for the
feature F_l, we divide it into N_e (the number of expressions or AUs) subsets containing
the corresponding faces. For each subset i_x, Principal Component Analysis (PCA)
is applied to learn the variation modes of the feature under the i_x-th expression:
\[ F_l^{i_x} = \bar{F}_l^{i_x} + P_l^{i_x}\, b_l^{i_x} \qquad (3.14) \]
where F̄_l^{i_x} is the feature mean, P_l^{i_x} is the set of eigenvectors resulting from PCA,
and b_l^{i_x} is a set of parameters which are assumed to follow Gaussian distributions
with zero mean and standard deviation σ_{lj}, where j indexes the parameters
of b_l^{i_x}. A feature instance F̂_l^κ can be generated from the above equation by using
the feature F_l^κ to estimate the best parameters b_l^{i_x}:
\[ b_l^{i_x} = (P_l^{i_x})^{\top} (F_l^{\kappa} - \bar{F}_l^{i_x}) \qquad (3.15) \]
We bound each parameter in b_l^{i_x} to ±0.5σ_{lj} to form b̂_l^{i_x}, in
order to constrain the instance deformations and thus increase their separability.
F̂_l^κ is computed by substituting b̂_l^{i_x} into eq. 3.14.
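The construction of a statistical feature model and the generation of a bounded instance can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code used in this work: the function names `fit_sfm` and `sfm_instance`, the use of an SVD to obtain the PCA eigenvectors, the number of retained modes and the random toy data are all assumptions.

```python
import numpy as np

def fit_sfm(train, n_modes):
    """Fit a PCA model to the (n_samples, dim) training features of ONE class."""
    mean = train.mean(axis=0)
    # The right singular vectors of the centred data are the PCA eigenvectors.
    _, s, vt = np.linalg.svd(train - mean, full_matrices=False)
    P = vt[:n_modes].T                              # (dim, n_modes), orthonormal columns
    sigma = s[:n_modes] / np.sqrt(len(train) - 1)   # std of each parameter
    return mean, P, sigma

def sfm_instance(feature, mean, P, sigma, bound=0.5):
    """Project a feature on the model, bound the parameters, and rebuild an instance."""
    b = P.T @ (feature - mean)                           # parameter estimate (eq. 3.15)
    b_hat = np.clip(b, -bound * sigma, bound * sigma)    # limit to +/- 0.5 sigma
    return mean + P @ b_hat                              # instance (eq. 3.14)

# Toy data (hypothetical): 40 training vectors of dimension 6 for one class.
rng = np.random.default_rng(0)
train = rng.normal(size=(40, 6))
mean, P, sigma = fit_sfm(train, n_modes=3)
instance = sfm_instance(rng.normal(size=6), mean, P, sigma)
```

One model of this kind would be fitted per feature and per expression or AU class, so that a test feature yields one instance for every state of X.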
The probability P (ecl |X) can be considered as the probability of matching the
feature Flκ with its instances F̂lκ knowing the expression state X, which follows a
Gibbs distribution.
Inserting the Gibbs distribution into eq. 3.12 and taking logarithm gives:
\[ \log p(X \mid e_{\kappa}) = \log\Big( \prod_{l=1}^{N_c} P(e_{c_l} \mid X) \Big) + c = \sum_{l=1}^{N_c} A_l Q_l + c \qquad (3.17) \]
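The summation in eq. 3.17 can be sketched as a score accumulated over the child nodes, followed by the selection of the state with the highest score. The sketch below rests on assumptions that are not fixed by the text at this point: Q_l is taken to be a normalised cross-correlation between a feature and its instance, the weights A_l default to 1, and the names `ncc` and `classify` are hypothetical.

```python
import numpy as np

def ncc(a, b):
    """Normalised cross-correlation between two 1-D feature vectors."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def classify(features, instances, weights=None):
    """features[l]: observed vector of child l; instances[l][x]: its instance for state x."""
    n_children = len(features)
    n_states = len(instances[0])
    weights = np.ones(n_children) if weights is None else np.asarray(weights, dtype=float)
    scores = np.zeros(n_states)
    for l in range(n_children):
        for x in range(n_states):
            # Accumulate A_l * Q_l for every candidate state of X.
            scores[x] += weights[l] * ncc(features[l], instances[l][x])
    return int(np.argmax(scores)), scores

# Toy example (hypothetical): 2 child nodes, 2 candidate states.
features = [np.array([1.0, 2.0, 3.0, 4.0]), np.array([0.0, 1.0, 0.0, 1.0])]
instances = [
    [np.array([4.0, 3.0, 2.0, 1.0]), np.array([1.0, 2.0, 3.0, 5.0])],
    [np.array([1.0, 0.0, 1.0, 0.0]), np.array([0.0, 2.0, 0.0, 2.0])],
]
state, scores = classify(features, instances)
```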
Through the above process, eq. 3.13 can be evaluated using eq. 3.17. A block
diagram illustrating the recognition process using the BBN is given in fig. 3.18.
Figure 3.18: Block diagram of the BBN for expression and AU recognition.
Two strategies for recognizing facial expressions can be drawn: detection of affects
(emotions) and detection of facial muscle actions (AUs). The first one infers what
underlies a displayed face, such as the six universal expressions, while the second
one aims at describing objectively the facial appearance mostly by FACS. Both rely
on the representation and analysis of facial deformations. Our approach for this
Facial activities, including expressions and AUs, are both consequences of facial
muscle activity; the difference between them lies in the muscles involved and the
intensity of their contraction. AUs describe facial deformation locally, at a low
level, while facial expressions can be considered as combinations of AUs at a
high level over the whole face. Some combinations of AUs correspond to
basic expressions according to decision-making rules. For instance, the combination
of AU4, AU5, AU7 and AU24 corresponds to anger. Thus, we are convinced that
a good characterization of AUs at a low level can also be effective for representing
the six universal expressions at a higher level. In the following paragraphs, facial
representations are derived mainly by analysing AUs. However, these representations
also apply to facial expressions.
In total, 16 facial AUs are analyzed in this work, chosen according to the
availability of 3D data. They are AU2, AU4, AU7, AU9, AU10, AU12, AU14, AU17,
AU18, AU22, AU24, AU26, AU27, AU28, AU34 and AU43, illustrated in fig. 3.19. More
details on AUs and their combination rules for recognizing emotions can be found
in the appendix.
From fig. 3.19, we can observe that the variations of facial appearance occur
in three face representations: facial morphology, facial texture and facial geometry.
Specifically, facial morphology consists in a set of reproducible landmarks located
on different facial parts. Facial texture contains the unique lines, patterns, and
spots apparent in a face skin whereas facial geometry contains facial surface shape
information delivered by a face surface mesh. Facial variations caused by AU or
expression have an influence on these representations to different extents. For ex-
ample, AU7 and AU43 change the texture in the eye region significantly without
moving corners of the eyes. AU24 changes the local geometry and texture in mouth
region mostly while having less influence on landmark location. However, most
AUs influence all three face representations simultaneously and notably, such as
AU4, AU10, AU22, AU26, AU27, etc. AUs normally occur locally and change the
appearance in the regions where the corresponding muscles are located. However,
some AUs can influence appearance in regions other than where they occur. For
instance, AU10 raises the upper lip while deepening the nasolabial furrow between the
nose and the eyes. Thus, the description scheme we use is based on local regions that
are distributed over the important facial parts where most AUs occur, including the
eyebrows, the eyes, the nose and the mouth, as detailed in the next subsection.
where N is the number of landmarks and m is the number of vertices in all local
To represent facial morphology, S is used to compute a distance
feature L and a point displacement feature D. Eleven distances between the involved
landmarks are computed and then concatenated into the feature vector L. The distances
are shown as green lines in fig. 3.20 and their textual descriptions are
given in table 3.6.
Table 3.6: Distances between some strategic facial landmarks on the 3D facial
expression model. The distance index refers to fig. 3.20.
the need for providing a neutral face in conjunction with the face to be recognized
with expression, which is unrealistic in a real application.
\[ D = S - \bar{S}_{\mathrm{neutral}} \qquad (3.21) \]
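The two morphology features can be sketched as follows: a vector of Euclidean distances between pairs of landmarks for L, and the difference to a mean neutral configuration for D, as in eq. 3.21. This is a sketch under assumptions: S is taken to be an (N, 3) array of landmark coordinates, the landmark pairs below are hypothetical placeholders and not the 11 distances of this work, and the function names are illustrative.

```python
import numpy as np

def distance_feature(landmarks, pairs):
    """Vector of Euclidean distances between the given landmark index pairs."""
    idx = np.asarray(pairs)
    return np.linalg.norm(landmarks[idx[:, 0]] - landmarks[idx[:, 1]], axis=1)

def displacement_feature(landmarks, mean_neutral):
    """Point displacement D = S - mean neutral S, flattened into one vector."""
    return (landmarks - mean_neutral).ravel()

# Toy example (hypothetical coordinates, in mm): 3 landmarks, 2 distances.
S = np.array([[0.0, 0.0, 0.0],
              [3.0, 4.0, 0.0],
              [0.0, 0.0, 2.0]])
S_neutral_mean = np.zeros((3, 3))
pairs = [(0, 1), (0, 2)]          # placeholder pairs, not those of table 3.6
L = distance_feature(S, pairs)
D = displacement_feature(S, S_neutral_mean)
```

Because the mean neutral configuration is learnt once from training data, no neutral scan of the test subject is needed at recognition time.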
The LBP operator, a powerful texture measure used widely in 2D face analysis,
extracts information which is invariant to local gray-scale variations of the image
with low computational complexity. Multi-Scale LBP [Shan & Gritti 2008] is an im-
proved facial representation compared to standard LBP (eq. 3.22). We have adopted
multi-scale LBP features for three reasons: first, LBP describes local property of
images, which is consistent with the local deformations that correspond to AUs;
second, the variance in the apparent AU magnitude is large since some are quite
notable while some are subtle, thus it is necessary to analyze them under different
scales; third, LBP is efficient and easy to compute.
\[ LBP_{P,R}(x, y) = \sum_{p=0}^{P-1} s(g_p - g_c)\, 2^{p} \qquad (3.22) \]
Figure 3.21: LBP Operator. The circular (8,1), (16,2), and (8,2) neighborhoods.
The pixel values are bilinearly interpolated whenever the sampling point is not in
the center of a pixel.
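The basic operator of eq. 3.22 can be sketched for a single pixel as follows, with bilinear interpolation of the sampling points that do not fall on pixel centres. This is a minimal sketch and not the extraction code used in this work: it assumes s(z) = 1 for z >= 0 and 0 otherwise, numbers the neighbours counter-clockwise starting from the right, does not implement the uniform-pattern (U2) mapping, and uses hypothetical function names.

```python
import math
import numpy as np

def bilinear(image, y, x):
    """Bilinearly interpolate the image at the real-valued position (y, x)."""
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1 = min(y0 + 1, image.shape[0] - 1)
    x1 = min(x0 + 1, image.shape[1] - 1)
    dy, dx = y - y0, x - x0
    return ((1 - dy) * (1 - dx) * image[y0, x0] + (1 - dy) * dx * image[y0, x1]
            + dy * (1 - dx) * image[y1, x0] + dy * dx * image[y1, x1])

def lbp_code(image, yc, xc, P=8, R=1.0):
    """LBP code of the pixel (yc, xc) with P neighbours on a circle of radius R."""
    gc = image[yc, xc]
    code = 0
    for p in range(P):
        angle = 2.0 * math.pi * p / P
        # Rounding removes tiny trigonometric errors at axis-aligned positions.
        y = round(yc - R * math.sin(angle), 9)
        x = round(xc + R * math.cos(angle), 9)
        if bilinear(image, y, x) >= gc:   # s(g_p - g_c) = 1
            code += 1 << p                # weight 2^p
    return code

# Toy 3x3 patches (hypothetical grey values).
bright_ring = np.array([[9.0, 9.0, 9.0], [9.0, 5.0, 9.0], [9.0, 9.0, 9.0]])
dark_ring = np.array([[5.0, 5.0, 5.0], [5.0, 9.0, 5.0], [5.0, 5.0, 5.0]])
right_only = np.array([[1.0, 1.0, 1.0], [1.0, 5.0, 9.0], [1.0, 1.0, 1.0]])
codes = [lbp_code(img, 1, 1) for img in (bright_ring, dark_ring, right_only)]
```

A multi-scale variant would repeat the same computation with several radii R and concatenate the resulting codes.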
In our case, LBP features are computed at scales 1 to 5 for
all points on the local grids, on both the texture maps (LBP^{U2}_{(16,1)}t,
LBP^{U2}_{(16,2)}t, ..., LBP^{U2}_{(16,5)}t) and the range maps (LBP^{U2}_{(16,1)}r,
LBP^{U2}_{(16,2)}r, ..., LBP^{U2}_{(16,5)}r). The superscript U2 indicates
that the definition relates to uniform patterns with a U value of at most 2 (refer
to [Chan et al. 2007] for details). Fig. 3.22 illustrates the extraction of LBP features
at different scales on both local texture and range maps. Finally, the values
for each (P,R) pair on the local grids are concatenated into a vector to build 10 LBP
feature vectors: (LBPt 1-5, LBPr 1-5).
Figure 3.22: Multi-Scale LBP extracted from local texture and range map on a 3D
face scan. In the first row are LBP features extracted from texture and in the second
row are LBP features extracted from range. In the third row are the (P,R) values
of the corresponding columns.
By combining the BBN with SFAM, a fully automatic 3D facial expression recognition
system can be realized. It consists of 4 main stages, as shown in fig. 4.3: offline
SFAM construction, offline BBN training, online landmarking and feature extraction,
and finally online facial expression/AU recognition. The SFAM is trained
using a small set of faces with all kinds of expressions or AUs. A set of statistical
feature models is also trained, one for each of these classes and each feature.
During online recognition, faces are first landmarked by the SFAM; then
a variety of features are extracted and used as evidence by the BBN to compute
the belief of the states of the facial activity node X. Specifically, feature instances
are generated from the trained feature models and used to compute the
posterior probability of each extracted feature. The output of the system is the
expression or AU whose state has the highest belief, which is computed from the
probabilities on both parent and children nodes. Of course, this system is also
applicable with manual landmarks. In this case, the landmarking step is skipped and
features are extracted directly around the manual landmarks.
Figure 3.24: Flow chart of the automatic facial expression/AU recognition system
We present in this section the experiments conducted to evaluate the performance
and efficiency of our facial expression/AU recognition approach based on
statistical feature models merged by a BBN. To do so, we have compared the
performance of our BBN against other popular classifiers, i.e. the Support Vector
Machine (SVM) and the Sparse Representation Classifier (SRC), on identifying the six
universal expressions. Then, in order to demonstrate the flexibility and robustness
of the BBN, we have tested it on the recognition of 16 AUs. Finally, we have tested
the expression recognition scheme which combines the SFAM and the BBN in order to
recognize the six universal expressions in a fully automatic manner.
In the experiments on facial AU recognition, face scans displaying 16 AUs from
60 subjects of the Bosphorus database [Savran et al. 2008] have been used; the AUs
are AU2, AU4, AU7, AU9, AU10, AU12, AU14, AU17, AU18, AU22, AU24, AU26,
AU27, AU28, AU34 and AU43. Thus, 60*16=960 3D face scans have been involved
in these tests. Note that these acted AUs occur singly and are not FACS coded.
The FACS coded version of the database will soon be available.
In the experiments on facial expression recognition, the face scans at the two
highest intensity levels of each expression have been used for each subject of the
BU3DFE database [Yin et al. 2006]. For both the tests using manual landmarks and
those using automatic landmarks, we have used the data of 60 subjects. Some of the
subjects differ between the two tests, because the face scans of a group of subjects
are used to build the SFAM, which then provides the automatic landmarks on the
remaining face scans. In fact, the SFAM has been trained using the data of 11 subjects,
with scans displaying the six universal expressions at the two highest intensity
levels as well as neutral scans. The trained SFAM has then been used to locate 19
landmarks on the scans of the other 89 subjects.
All tests in AU and facial expression recognition have followed a 10-fold person-
independent cross-validation process. Thus, 60 subjects have been partitioned into
two subsets in each round (totally 10 rounds): one with 54 subjects for training and
the other with 6 subjects for testing. This experiment setup guarantees that each
subject appears once in testing set and 9 times in training set and any subject used
for testing does not appear in the training set because the partition is based on the
subjects rather than the individual expressions.
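The person-independent partition described above can be sketched as follows: subjects, not individual scans, are distributed over the folds, so that no subject appears in both the training and the testing set of a round. The function name `person_independent_folds`, the shuffling seed and the integer subject identifiers are assumptions made for illustration.

```python
import random

def person_independent_folds(subject_ids, n_folds=10, seed=0):
    """Yield (train_subjects, test_subjects) pairs, one pair per fold."""
    ids = sorted(set(subject_ids))
    random.Random(seed).shuffle(ids)
    # Deal the shuffled subjects into n_folds groups of (almost) equal size.
    folds = [ids[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        test = set(folds[k])
        train = [s for s in ids if s not in test]
        yield train, sorted(test)

# 60 subjects and 10 folds give 54 training and 6 testing subjects per round.
splits = list(person_independent_folds(range(60)))
```

All scans of a subject then follow that subject into the training or the testing set of the round.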
In the test for AU recognition, we have defined the states of the X in the BBN
corresponding to the aforementioned 16 AUs.
Table 3.8: Average positive rates (PR) and Average false-alarm rates (FAR) of AUs.
The results are given in Table 3.8 in terms of average positive rates and average
false-alarm rates for all AUs. Indeed, recognizing each AU_i can be considered as
a two-class classification between AU_i and non-AU_i. The positive rate
is defined as PR = TP/(TP + FN) and the false-alarm rate as FAR = FP/(TP + FP),
where TP stands for "True Positive", FN for "False Negative" and FP for "False
Positive" (see Table 3.9 for details).
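The two rates can be computed per AU in a one-versus-rest manner, as sketched below. The sketch assumes PR = TP/(TP + FN) and FAR = FP/(TP + FP); the numerators are an assumption about the intended definitions, and the function name and the toy labels are hypothetical.

```python
def one_vs_rest_rates(true_labels, predicted_labels, target):
    """Positive rate and false-alarm rate of one class against all the others."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    pr = tp / (tp + fn) if tp + fn else 0.0
    far = fp / (tp + fp) if tp + fp else 0.0
    return pr, far

# Toy labels (hypothetical), five test scans.
truth = ["AU2", "AU2", "AU4", "AU4", "AU7"]
preds = ["AU2", "AU4", "AU4", "AU4", "AU4"]
pr_au4, far_au4 = one_vs_rest_rates(truth, preds, "AU4")
pr_au2, far_au2 = one_vs_rest_rates(truth, preds, "AU2")
```

Averaging these per-AU rates over the folds gives figures of the kind reported in Table 3.8.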
Among the 16 AUs, 7 of them (AU10 , AU18, AU22, AU26, AU27, AU2, AU43)
have an average PR over 90%, while 4 of them (AU14, AU24, AU7, AU4) have
average PR below 80%. Meanwhile, AU24 has the highest FAR, which suggests
that it is easily confused with other AUs, which is also the case for AU34 and AU4
having a FAR above 20%. On the contrary, AU43, AU27, AU22 having a FAR below
5% are relatively clearly identified. Globally, our BBN achieves an overall average
PR for all 16 AUs of 85.6% with an overall average FAR of 13.6%.
by plotting the true-positive rates against the false-positive rates. Notice that the
values at the left end of the ROC curves correspond to the positive rates in table 3.8,
because the decision threshold in use is 1 after score normalization. Specifically,
the highest score of an AU is always transformed into 1, and we choose the
state which has this score as the predicted AU. Fig. 3.25 and fig. 3.26 show the ROC
curves for the 16 AUs. A ROC curve with a greater area under it
indicates better recognition. Thus, we can see that AU43, AU27 and AU10 are among
those best recognized, which is consistent with the results in table 3.8.
In [Savran & Sankur 2009], 22 AUs are detected automatically by estimating
the deformation between the registered face and the reference. On the same
dataset, they achieve an average PR of 91.1%. In [Sun et al. 2008], 7 AUs are
considered and an AU combination on their own database is performed, achieving
a PR of 89.1%. In [Tong et al. 2010], the authors use a Dynamic Bayesian Network
to learn the relationships between AUs on the 2D Cohn-Kanade database in order to
enhance the recognition performance obtained with Gabor features and an AdaBoost
classifier. They achieve an 85.8% PR on 14 AUs. Our approach achieves an average PR
of 85.6% for 16 AUs, which is comparable to the result of the highly optimized 2D
method [Tong et al. 2010].
In order to evaluate the performance of our BBN, we have compared it with two
other classifiers, the Support Vector Machine (SVM) [Chang & Lin 2001] and the
Sparse Representation Classifier (SRC) [Wright et al. 2009]. All tests have followed
a 10-fold cross validation process. The face scans in level 3 and level 4 are tested
separately and the final recognition rate is obtained by averaging the results from
two intensity levels for all three approaches.
For the classification tests using SVM, a multi-class SVM has been trained
for each feature extracted from each level of expression (30 SVMs in total).
Parameters have been tuned empirically to obtain the best performance for each of
them. The output of each SVM is a set of probabilities describing how likely the
face is to belong to each expression class according to the tested feature. These
probabilities (15 in total per level) have been added together and the testing faces
have been labeled according to the maximum probability score.
For the classification tests using SRC, 30 SRCs have been trained following
the principle of the approach proposed in [Wright et al. 2009], with an l1-norm
minimization via orthogonal matching pursuit. Parameters have also been set
empirically to obtain the best performance. The SRC output is a set of distances
between the testing feature and its six approximations, which are generated from the
set of coefficients associated with each class. These distances (15 in total per
expression intensity level) have been added together and the testing faces have been
labelled according to the minimum distance.
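The score-level fusion used for the two baseline classifiers can be sketched as follows: per-feature class probabilities are summed and the maximum is taken for SVM, while per-feature distances are summed and the minimum is taken for SRC. The function names and the toy scores are assumptions for illustration; the classifiers themselves are not reproduced here.

```python
import numpy as np

def fuse_probabilities(prob_per_feature):
    """(n_features, n_classes) class probabilities -> index of the largest summed score."""
    return int(np.argmax(np.sum(prob_per_feature, axis=0)))

def fuse_distances(dist_per_feature):
    """(n_features, n_classes) distances -> index of the smallest summed distance."""
    return int(np.argmin(np.sum(dist_per_feature, axis=0)))

# Toy scores (hypothetical): 3 features and 3 classes for SVM, 2 features for SRC.
svm_probs = np.array([[0.6, 0.3, 0.1],
                      [0.2, 0.5, 0.3],
                      [0.1, 0.6, 0.3]])
src_dists = np.array([[0.4, 0.9, 0.2],
                      [0.5, 0.3, 0.1]])
svm_label = fuse_probabilities(svm_probs)
src_label = fuse_distances(src_dists)
```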
Table 3.10: Average recognition rates for the six universal expressions with different
features configurations (Morphology, Texture and Geometry) and different classifiers
using manual landmarks. The standard deviations over 10 fold tests are the values
in the brackets.
Table 3.10 shows the performance in terms of average recognition rates for the BBN
with different setups of children nodes, as well as the comparison with the other
classifiers. The first row contains the results where the BBN has only two children
nodes for inference, i.e. the L and D features extracted from the morphology
representation M. The second and third rows contain the results where the BBN adopts
features from the texture representation T and the geometry representation Ge
respectively, i.e. G, LBPt 1-5 and Z, LBPr 1-5, SI. The following rows contain the
results for different combinations of these features. We can see that SVM performs
better in the tests on each single representation (M, T, Ge). However, the BBN is
comparable with it in the tests on two representations with manual landmarks, and
finally outperforms SVM on all three representations, with an average recognition
rate of 89.2% and the lowest standard deviation, 3.6%. These tests therefore indicate
that the BBN is more effective than the score-level fusion strategy with SVM and SRC
when adopting features from all three representations. Moreover, the BBN uses a
uniform process without parameter tuning for building the SFMs and estimating their
parameters, which avoids the manual parameter tuning needed to optimize the
performance of SVM and SRC.
We have also evaluated the influence of the size and the number of
local grids in the feature extraction process. Using the same data as in the previous
test, we have first extracted the features from the same 19 local grids enlarged to a
25mm*25mm size, as shown in fig. 3.27a; we have then extracted the features from
32 selected local grids with the same 15mm*15mm size as in the previous
test, as shown in fig. 3.27b. The average recognition rate for faces sampled
on 19 grids of size 25mm*25mm is 90.3%, and the average recognition rate
for faces sampled on 32 grids of size 15mm*15mm is 89.3%.
These results (90.3% vs 89.3% vs 89.2%) suggest that using the grids at the 19
locations with a size of 15mm*15mm is sufficient and has a lower computational
burden.
For the fully automatic 3D facial expression recognition, the SFAM has first been used
to locate 19 landmarks automatically and then features have been extracted around
these landmarks.
The results are given in Table 3.11 with child node setups for the BBN
similar to those in Table 3.10. We can see that the recognition rate increases with the
number of child nodes, and finally reaches 84.9% when adopting all children nodes,
corresponding to the 15 features. Unlike the results based on manual landmarks,
we cannot observe a notable contribution of M to the recognition rates (0% vs
3.1%) in the last row, which may be due to the inaccuracy of the automatic landmarks.
Besides, when using only M, an obvious decrease in the recognition rate is observed
when moving from manual landmarks to automatic ones. This confirms the claim
that landmark-based features rely heavily on localization accuracy and thus
Table 3.11: Recognition rates for 6 universal expressions with different features con-
figurations (Morphology, Texture and Geometry) using both manual and automatic
landmarks. The left column is results based on manual landmarks (m) and the right
column is results based on automatic landmarks (a).
Table 3.12: Confusion Matrix of the expression recognition. Left value on each
cell is the result based on manual landmarks and right value is the result based on
automatic landmarks.
Table 3.12 contains the average recognition rates for the six universal expressions
based on manual landmarks (first value in each cell) and obtained by the fully
automatic approach (second value in each cell), using the combination of all features
(M+T+Ge). The average recognition rate is 89.2% with manual landmarks and 84.9% with
automatic ones. The decrease is mainly due to localization errors of the automatic
landmarks. Most of the expressions are identified with high accuracy in both
tests, while anger and fear have comparatively lower recognition rates. Anger is
most often misclassified as sadness because their confusion, even for humans, is much
larger than for other expressions. Faces with sadness are more easily
misclassified as anger in the tests on automatic landmarks. The case of fear is
different: the motions of this expression are moderate compared to happiness or
surprise, for example, and are thus more difficult to discriminate.
Discussion
Table 3.13 presents a comparison with typical results from the literature. While most
other works are dedicated to the recognition of the six universal expressions in 3D,
our classification scheme based on the BBN and statistical feature models performs the
recognition of both expressions and AUs with a uniform structure. It is also found
that the proposed BBN outperforms most of the other methods while requiring
no parameter tuning and fewer constraints, such as a large number of landmarks or
a neutral face from each subject. Indeed, our approach ranks second
in the literature, the first rank being held by [Tang & Huang 2008a].
However, their method requires a neutral face from each subject for distance
normalization, which introduces subject bias.
Concerning the fully automatic expression recognition, our results are also of
good quality since they rank second in the literature. Compared
with [Mpiperis et al. 2008], our approach has two advantages. Firstly, the building
and fitting of the SFAM can be easily implemented. Secondly, the recognition by the
BBN is efficient not only in terms of accuracy but also in terms of computational
cost. The normalized cross-correlations between each feature and its instances are
computed within 0.24s per child node on average on a desktop PC with
an Intel Core2 E4400@2.00GHz CPU.
3.5.5 Conclusion
Table 3.13: Comparison of the results from different facial expression recognition
Unlike the graphical models built for 2D facial expression analysis, the proposed
BBN has a flexible topology that allows knowledge carried by new features to be
integrated by adding new children nodes to the X node. By defining the states of the
X node, we can change the facial expressions or AUs to be recognized. Furthermore,
we have proposed a novel parameter estimation method for the BBN which
evaluates the similarity between features and their instances generated from
statistical feature models. Our experiments have shown that the BBN is more effective
than score-level fusion approaches using SVM or SRC when employing features from
all three representations. Meanwhile, the BBN is easy to apply when fusing information
from a group of features for recognition, since it does not require any parameter
tuning procedure. Overall, our approach has achieved an average positive rate of
85.6% for 16 AUs and an average recognition rate of 89.2% for the six universal
expressions. Furthermore, thanks to the features extracted from three representations,
it is robust to landmarking errors, which allows it to be implemented as an automatic
FER approach. A recognition rate of 84.9% has been achieved for recognizing the six
universal expressions automatically. Compared to other existing 3D FER approaches,
our method offers the advantages of good performance and implementation simplicity,
with the ability to be fully automatic.
In this chapter, we have proposed two approaches for analysing 3D facial expressions.
In the first approach, a new feature named SGAND has been proposed to describe
local facial geometry by comparing the number of sampled peripheral vertices above
and below a face plane around a vertex. A head pose estimation method has been
elaborated in conjunction with the feature so that it can be extracted under various
head poses. SGAND has been evaluated for the purpose of recognizing the six universal
expressions.
The results demonstrate the efficiency of SGAND when classifying disgust, happiness,
sadness and surprise. However, the other two universal expressions are not
classified satisfactorily. There are two conceivable directions to improve this approach:
• Feature extraction process: currently, we use a binary value obtained from the
numbers of local sampled points on the two sides of the plane to describe the
local surface. It is enough to describe the bending trend of the local surface,
which is more like a qualitative analysis. However, this binary value is not
sufficient to analyse the surface bending quantitatively. Thus, in order to
represent the local surface characteristic more precisely, more values will be
set according to the distances of sampled vertices to the plane. Moreover, a
lookup table will be created to map the value arrays to the typical surface
Chapter 3. 3D Facial Expression Recognition
the BBN achieves good results for recognizing both expressions (second rank in the
literature) and facial AUs. Tested on automatically located landmarks, the BBN
shows its robustness to landmark locating errors. In the future, we plan to build
a probabilistic latent semantic space of AUs and recognize spontaneous expressions
based on this space.
Figure 3.25: ROC curves for the 16 AUs on the Bosphorus database. The area under
the ROC curve is given in brackets. (Part 1; panels include AU14 (97.7) and AU17 (98.2))
Figure 3.26: ROC curves for the 16 AUs on the Bosphorus database. (Part 2; panels
include AU34 (97.1) and AU43 (99.7))
Figure 3.27: Two examples of local grid configuration (number and size). (Panels a and b)
Chapter 4
4.1 Introduction
scene to estimate the number of people inside. Chan et al. [Chan et al. 2008] segmented
the crowd with a motion model and extracted features from each segment. The
correspondence between features and the number of people was learned by Gaussian
Process regression. [Dalal & Triggs 2005] showed that locally normalized Histograms of
Oriented Gradient (HOG) computed on a dense overlapping grid can serve as a successful
feature in a pedestrian detector. The speed of the HOG-based pedestrian detector
was increased significantly in [Cui et al. 2008], which makes the detector usable
in practical applications. However, a HOG-based pedestrian detector is not
applicable to our study because the full body of a pedestrian is not always visible
in our collected datasets. In the second case, authors either count tracked
people at a defined counting line or count people trajectories obtained from tracking.
In [Kim et al. 2002], a tracking region was partitioned off from the scene with a
counting line on its edge. People were tracked by motion prediction combined with
background subtraction and counted at the line. Another approach consists in obtaining
feature trajectories in the scene with the Kanade-Lucas-Tomasi (KLT) tracker, and then
clustering trajectories with similar movement to represent one moving
object [Rabaud & Belongie 2006]. Methods of this kind are generally able to count
a large number of people in a homogeneous crowd. From the state of the art, it
appears that most people counting approaches rely on the assumption that all
moving objects in a scene are humans, and therefore miscount other moving objects.
In [Schlögl et al. 2003], a model of humans is defined based on average people
size. In [Harasse et al. 2005], a skin color model is used to detect humans. These are
among the first attempts to elaborate more accurate people counting systems, but
they still lack accuracy. In order to avoid this kind of miscounting, the basic idea
of our approach is to use the most discriminant human feature: the face.
Chapter 4. A minor contribution: People Counting based on Face
In this work, we address the problem of counting people moving toward the camera
in an enclosed space such as the entrance of a supermarket, bank or bus, where lighting
conditions are relatively stable and people are generally facing the camera. For
these scenes, we propose an approach that presents several improvements compared
to the literature. The first improvement is the use of a face detector to ensure that
counted objects are people. Second, in order to deal with drastic changes of face
scale in our scenes, a scale-invariant Kalman filter is proposed. It is further combined
with a kernel-based object tracking algorithm to handle face occlusions. Finally,
we propose a strategy to count people by automatically classifying face trajectories,
which are characterized by an angle histogram of neighboring points. Two Earth
Mover’s Distance based classifiers are used to discriminate true trajectories from
false ones. The advantages are twofold. On the one hand, trajectories can be filtered
in order to reject false trajectories caused by false face detections and thus
improve counting accuracy. On the other hand, the automatic classification of the
trajectories avoids the manual and empirical elaboration of rules for counting
people in a given scene.
Fig. 4.1 shows the framework of this system. It combines a face detection module,
a face tracking module and a counting module. By synchronizing periodically with
the face tracker (every 5 frames), the face detector can initialize tracking for new
faces as soon as it detects them, and verify the faces being tracked. Moreover, the
synchronization results reveal three kinds of events: new faces appear in the video,
faces disappear temporarily because of occlusion, and faces leave the scene. After
the faces leave, their trajectories are sent to the counter for further analysis.
In our work, we use the face detector of [Tsishkou et al. 2004], which is based on
the Viola-Jones detector [Viola & Jones 2002]. The overall form of the detection process
is that of a degenerate decision tree, called a "cascade". Because the overwhelming
majority of sub-windows are negative for face detection, the cascade attempts to
Existing trackers usually track objects without large scale changes. However, our
system faces the difficulty of drastic face scale changes in the scene. Thus, we
improve the original Kalman filter to track objects more accurately in this
situation. Face occlusion is another problem we aim to solve. Using the face position
predicted by the Kalman filter, we can continue tracking the occluded faces
In scenes where faces move towards a camera, an expansion of the face scale
is inevitable. As a consequence, faces seem to move faster in the video when they
are near the camera. This phenomenon may distort the evaluation of movements
and introduce process noise into the Kalman filter. As shown in fig. 4.2a, the red
line is the trajectory of a face moving towards the camera O, and the red points are
its positions in the image sequence at equal time intervals. After projection into the
camera coordinate system, the movement changes from uniform motion (points along the
trajectory) to variable motion (points on the X axis). This variable motion requires a
more complicated movement model than the linear one commonly used in Kalman
filters, which is hard to develop.
However, the complexity can be reduced as follows. We consider that face movements
with changing scale in a video are "2.5D" movements through image planes, i.e.
planes perpendicular to the camera optical axis. Face scales carry some information on
the distance between faces and the camera. Based on [Azarbayejani & Pentland 1995],
which presented a 3D central projection model to recover the 3D positions of tracked
objects, we propose a scale-invariant Kalman filter as in (4.1) and (4.2), taking
advantage of the fact that a face has a constant size in the real world but different
sizes in different image planes. In our Kalman filter, face movements are projected
into a fixed image plane using face scales, and thus the "2.5D" tracking problem can
be simplified into a "2D" tracking problem, as shown in fig. 4.2b.
\[ J_k M_k = J_k H X_k + V_k \qquad (4.2) \]
where
\[ J_k = \frac{S_x}{S_k}, \qquad p(V) \sim N(0, R), \qquad
H = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \end{bmatrix}, \qquad
M_k = \begin{bmatrix} Z_{x_k} & Z_{y_k} \end{bmatrix}^T . \]
(X, Y) is the face location, and (v_x, v_y), (a_x, a_y) are the velocity and
acceleration of the face movement. A and H are the process model and the measurement
model of the Kalman filter. T is the time interval between two consecutive frames, and
k is the frame index. W_k is the process noise, a white Gaussian noise with diagonal
variance Q. M_k is the measurement of the face location. V_k is the measurement noise,
a white Gaussian noise with diagonal variance R. S is the face scale and S_x is the
face scale in the fixed image plane, such as the plane x in fig. 4.2b. In our
implementation, S_x is set to 20 pixels, which is the lower bound of the face scale
for our face detector.
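The projection into the fixed image plane can be illustrated with a small simulation.
The sketch below is not the thesis implementation: it assumes an ideal pinhole camera,
positions measured relative to the image centre, and a face moving at constant 3D
speed. The helpers `project_to_reference_plane` and `simulate_approach`, and all
numeric values except the reference scale of 20 pixels, are hypothetical. It only shows
the effect of the factor J_k = S_x / S_k on the measurements, not the full Kalman filter.

```python
S_X = 20.0  # face scale in the fixed image plane, in pixels (value from the text)


def project_to_reference_plane(z, scale, s_ref=S_X):
    """Scale a measured position z = (zx, zy) by J = s_ref / scale.

    Assumption: z is expressed relative to the image centre.
    """
    j = s_ref / scale
    return (j * z[0], j * z[1])


def simulate_approach(n_frames=10, focal=1000.0, face_width=0.2):
    """Toy pinhole simulation of a face walking towards the camera.

    The face moves uniformly in 3D: its lateral position grows and its depth
    shrinks by a constant amount per frame. Returns (position, scale) pairs
    as they would be observed in the image.
    """
    observations = []
    for t in range(n_frames):
        lateral = 0.1 * t       # metres, uniform lateral motion
        depth = 10.0 - 0.8 * t  # metres, camera at depth 0
        x_img = focal * lateral / depth
        scale = focal * face_width / depth
        observations.append(((x_img, 0.0), scale))
    return observations


observations = simulate_approach()
raw = [z[0] for z, _ in observations]
normalised = [project_to_reference_plane(z, s)[0] for z, s in observations]

# Frame-to-frame steps: growing in the image, constant after normalisation.
raw_steps = [b - a for a, b in zip(raw, raw[1:])]
norm_steps = [b - a for a, b in zip(normalised, normalised[1:])]
```

In this toy setting the normalised steps are constant, so a linear motion model is
adequate again, whereas the raw image steps grow as the face approaches the camera.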
Each tracked face is assigned a Kalman filter and a kernel-based tracker. For
each frame, the Kalman filter first predicts the face position for the tracker. Then, a
coarse-to-fine tracking process is performed by the tracker, which hands back a measured
face position and scale to the Kalman filter for the measurement update. In case of
tracking failure due to occlusion or pose variation, the predicted face position and the
previous scale are given back to the Kalman filter. This process is illustrated in fig. 4.3.
Color-based features can be used for tracking non-rigid objects and remain
consistent when face scales change. They also tolerate more changes in pose than
edge and texture features. To reduce the influence of lighting changes, we use the
chromatic colors defined in (4.3):
\[ r = \frac{R}{R + G + B}, \qquad g = \frac{G}{R + G + B} \qquad (4.3) \]
Detected faces are represented by a kernel-based 2-D color histogram, which
consists of 200 bins in each axis. The value of each bin u is calculated as in (4.4).
\[ q_u = C \sum_{i=1}^{n} k(\|x_i^*\|^2)\, \delta[b(x_i^*) - u], \qquad
C = \frac{1}{\sum_{i=1}^{n} k(\|x_i^*\|^2)} \qquad (4.4) \]
where p and q are the histograms of the face model and the face candidate respectively,
each with a dimension of 400 ∗ 1; n is the number of bins in the color histogram. The p
of the face model is initially computed when the face is first detected and is updated
whenever a face is detected close to the tracked position. The face detector scans
every five frames to balance accuracy and efficiency.
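As an illustration of the chromatic representation and the kernel-weighted histogram,
here is a minimal sketch in plain Python. Several points are assumptions rather than
facts from the text: the chromatic coordinates are taken as r = R/(R+G+B) and
g = G/(R+G+B), the kernel profile is an Epanechnikov profile, and the "400 ∗ 1"
dimension is read as two 200-bin marginal histograms (r then g) concatenated. The
function names are hypothetical.

```python
BINS = 200  # bins per chromatic axis (value from the text)


def chromatic(rgb):
    """Return the (r, g) chromatic coordinates of an (R, G, B) pixel."""
    total = float(sum(rgb))
    if total == 0.0:
        return (1.0 / 3.0, 1.0 / 3.0)  # assumption: treat black as neutral grey
    return (rgb[0] / total, rgb[1] / total)


def epanechnikov(sq_dist):
    """Kernel profile k(||x||^2); assumed, the text does not name the kernel."""
    return 1.0 - sq_dist if sq_dist < 1.0 else 0.0


def kernel_histogram(patch):
    """Kernel-weighted chromatic histogram of a face patch.

    patch is a list of rows of (R, G, B) tuples. Pixels near the patch centre
    weigh more than pixels near the border. Returns 2 * BINS values: the r
    marginal followed by the g marginal, each summing to 1.
    """
    height, width = len(patch), len(patch[0])
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    ry, rx = max(cy, 1.0), max(cx, 1.0)  # half-sizes used to normalise offsets
    hist = [0.0] * (2 * BINS)
    norm = 0.0
    for y, row in enumerate(patch):
        for x, rgb in enumerate(row):
            sq = ((y - cy) / ry) ** 2 + ((x - cx) / rx) ** 2
            w = epanechnikov(sq)
            if w == 0.0:
                continue
            r, g = chromatic(rgb)
            hist[min(int(r * BINS), BINS - 1)] += w
            hist[BINS + min(int(g * BINS), BINS - 1)] += w
            norm += w
    return [h / norm for h in hist] if norm > 0.0 else hist
```

Because r and g are ratios, a pixel and a uniformly brighter copy of it fall into the
same bins, which is the lighting tolerance the chromatic colors are chosen for.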
For each tracked face, the position predicted by the Kalman filter is used to locate
the center of a sub-image whose size varies dynamically with the face scale. A coarse
scan is first performed in this sub-image to approximate the face position.
The location with the maximum similarity ρ is used to initialize a fine tracking
Then, the kernel-based tracking algorithm is used to move the face location
iteratively to reach the maximum similarity between the face model and the face
candidate. A dynamic threshold εk is set and updated every frame, as shown in
(4.6). If the maximum of ρ is above this threshold εk, we consider that the face has
been measured at the position where ρ reaches its maximum.
Chapter 4. A minor contribution: People Counting based on Face
algorithm, the face positions are always predicted, whether the face is occluded or
not. If the face is really occluded, we assume that it is at the predicted position.
The prediction and the assumption are maintained until the face appears again or
until the face has not been detected for 20 consecutive frames.
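The occlusion rule above amounts to a small amount of per-track bookkeeping, sketched
below. The class `FaceTrack` and its fields are hypothetical, and the sketch simplifies
the text by treating any frame without a measurement as a frame in which the face was
not detected.

```python
MAX_MISSED = 20  # consecutive frames without detection before a track is closed


class FaceTrack:
    """Minimal bookkeeping for one tracked face (hypothetical helper)."""

    def __init__(self, position):
        self.trajectory = [position]
        self.missed = 0
        self.active = True

    def update(self, predicted, measured=None):
        """Advance the track by one frame.

        predicted: position predicted by the Kalman filter.
        measured:  position measured in this frame, or None when the face is
                   occluded or lost.
        """
        if not self.active:
            return
        if measured is not None:
            self.trajectory.append(measured)
            self.missed = 0
        else:
            # Occlusion assumption: the face is at the predicted position.
            self.trajectory.append(predicted)
            self.missed += 1
            if self.missed >= MAX_MISSED:
                self.active = False  # the trajectory can go to the counter
```

A track that is re-measured before the limit resets its counter, so short occlusions
leave a continuous trajectory filled in with predicted positions.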
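\[ J_u = C \sum_{i} \delta[b(\kappa_i) - u], \qquad
\kappa_i = \begin{cases}
\theta, & \text{if } x_i > x_{i-1} \\
\pi + \theta, & \text{if } x_i < x_{i-1} \ \&\ y_i > y_{i-1} \\
-\pi + \theta, & \text{if } x_i < x_{i-1} \ \&\ y_i < y_{i-1}
\end{cases} \qquad (4.7) \]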
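The case analysis on κ can be sketched in a few lines of Python. The sketch assumes
that θ is the arctangent of the slope between two neighboring trajectory points, in
which case κ coincides with `math.atan2(dy, dx)`. The number of bins is not given in
the text, so `n_bins` is a hypothetical parameter, as is the function name.

```python
import math


def angle_histogram(points, n_bins=8):
    """Normalised histogram of movement directions along a trajectory.

    points: list of (x, y) face positions in temporal order.
    """
    hist = [0.0] * n_bins
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dx, dy = x1 - x0, y1 - y0
        if dx == 0 and dy == 0:
            continue  # no movement, hence no direction
        kappa = math.atan2(dy, dx)  # in (-pi, pi], covers the three cases
        u = int((kappa + math.pi) / (2 * math.pi) * n_bins)
        hist[min(u, n_bins - 1)] += 1.0
    total = sum(hist)
    # Dividing by the total plays the role of the normalisation constant C.
    return [h / total for h in hist] if total else hist
```

Using `atan2` also handles purely vertical or horizontal steps, which the three cases
of (4.7) do not list explicitly.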
In this section, we present some experimental results of our scale invariant Kalman
filter, face detection and tracking algorithm, and people counting application. They
are carried out on different video sequences with multiple faces appearing at different
We compared our Kalman filter with the original Kalman filter on 3 videos, in which
a single face moves towards the camera and its scale increases. The frontal face is
always shown in the video, and we manually mark the nose tip as the ground truth
position of the face, as in fig. 4.4. Each video was divided into two parts according
to face scale. Face scales in the first part vary from the minimum face scale in the
whole sequence to around 0.6 of the maximum face scale, and face scales in the second
part range from around 0.6 of the maximum to the maximum. ω is the ratio of our
Kalman filter’s error to the original Kalman filter’s error, defined as (4.8):
\[ \omega = \frac{E_s}{E_o}, \qquad
E = \sum_{i \in I} \sqrt{(X_i - X_{GT_i})^2 + (Y_i - Y_{GT_i})^2} \qquad (4.8) \]
where E_s and E_o are the errors E of our Kalman filter and of the original Kalman
filter, I is the part of the test video considered, (X, Y) is the location state of the
face in the Kalman filter, and (X_GT, Y_GT) is the ground truth of the face location.
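For concreteness, the ratio ω can be computed as follows. This is a sketch that
assumes E is the sum of Euclidean distances between estimated and ground-truth
positions over the frames of one video part; the function names are hypothetical.

```python
import math


def tracking_error(estimates, ground_truth):
    """Sum of Euclidean distances between estimated and true positions."""
    return sum(math.hypot(x - gx, y - gy)
               for (x, y), (gx, gy) in zip(estimates, ground_truth))


def error_ratio(ours, original, ground_truth):
    """omega: error of the scale-invariant filter over that of the original."""
    return tracking_error(ours, ground_truth) / tracking_error(original, ground_truth)
```

A value of ω below 1 means that the scale-invariant filter is closer to the ground
truth than the original filter on that part of the video.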
Table 4.1 shows the comparison between the two Kalman filters. Relative to the
original filter, the error of our Kalman filter decreases as the face scale increases:
ω is lower in the second part of every video. In other words, the advantage of our
Kalman filter grows when face scales increase.
Video    Video Part 1                   Video Part 2
         Face Scale Range    ω          Face Scale Range    ω
1        45∼89               1          89∼117              0.67
2        41∼69               0.83       69∼88               0.5
3        33∼65               0.92       65∼92               0.67
We evaluated the robustness of the framework when multiple faces appear and
occlusion happens. Fig. 4.5 shows the results of tracking multiple faces. Three faces
moved together in this video and were detected and tracked separately. Because
of a partial face occlusion, the trajectory of the first detected face is not smooth in
the last frame.
Fig. 4.6 shows the results when the tracked face undergoes a total occlusion.
The tracked face had been detected and tracked for several frames before it was
totally occluded by another face. Our tracking algorithm overcame the occlusion
and continued to track the face until it appeared again.
We tested the people counting application on our database, which contains 5 videos
(6345 frames in total) recorded by cameras installed in the corridor of our
building and at the entrance of our conference room, as in the scenes of fig. 4.4 and
4.5. In these videos, people passed either individually or in groups more than 100
times. All videos were processed at a size of 320*240 pixels. Other databases, such as
CAVIAR, cannot readily be used since we require frontal faces detectable by the face
detector.
In order to train and test the K-NN classifier and the mean-trajectory classifier,
we process our dataset to obtain trajectories. To get more false trajectories and
balance the two classes, we tune the detector to produce more false detections. In this
way, 105 true trajectories and 56 false trajectories are obtained. For the K-NN
classifier, we randomly choose n true trajectories and n false trajectories for training
and use the other trajectories for testing. For the mean-trajectory classifier, we also
choose n trajectories for training and use the other trajectories for testing. For each
pair of K and n in the first classifier and each pair of n and threshold T in the second
classifier, we test the classification rate 20 times and keep 10 consecutive results
with the higher scores for evaluation. Results are shown in fig. 4.7. The best counting
accuracy we reached is 93%, obtained with the 1-nearest neighbor classification algorithm.
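The two classifiers can be sketched on top of a histogram distance. The sketch below
makes several assumptions that the text does not confirm: the Earth Mover's Distance is
computed between normalised 1-D histograms with a ground distance of |u - v| between
bins (angles are not treated as circular), and the mean-trajectory classifier accepts a
trajectory when its distance to the mean histogram of the true training trajectories is
below the threshold T. All function names are hypothetical.

```python
def emd_1d(p, q):
    """Earth Mover's Distance between two normalised 1-D histograms.

    With ground distance |u - v|, the EMD equals the sum of absolute
    differences between the two cumulative histograms.
    """
    carried, cost = 0.0, 0.0
    for pu, qu in zip(p, q):
        carried += pu - qu
        cost += abs(carried)
    return cost


def knn_classify(query, training, k=1):
    """training: list of (histogram, label) pairs, label True for a true trajectory."""
    nearest = sorted(training, key=lambda item: emd_1d(query, item[0]))[:k]
    votes = sum(1 for _, label in nearest if label)
    return votes * 2 > len(nearest)


def mean_trajectory_classify(query, true_histograms, threshold):
    """Accept the query if it is close to the mean of the true histograms."""
    n = len(true_histograms)
    mean = [sum(h[u] for h in true_histograms) / n for u in range(len(query))]
    return emd_1d(query, mean) < threshold
```

With k = 1 the first function reduces to the 1-nearest neighbor rule reported above;
only trajectories classified as true would then be counted as people.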
4.6 Conclusion
We have presented in this chapter a novel video-based people counting system that
integrates several improvements compared to the literature. The detection of faces
validates that counted objects are human. A scale-invariant Kalman filter is then
proposed to deal with drastic changes in face scale. Moreover, its combination
with a kernel-based object tracking algorithm enhances the robustness of face tracking
under head pose variations and face occlusions. Finally, we have proposed a
strategy for counting people based on the automatic classification of potential face
trajectories. They are characterized by an angle histogram, and the similarities
between histograms are evaluated by the Earth Mover’s Distance. Thus, not only can bad
trajectories be filtered out to enhance the system’s counting accuracy, but the
automatic classification also avoids the manual and empirical elaboration of rules
for counting in a given scene. Our approach has been validated by our experimental
results, which have demonstrated good performance on these different aspects and
a final people counting accuracy of around 93%.
In our future work, we plan to extend our system to other, more complex
contexts, such as outdoor scenes where illumination changes drastically. Moreover, the
classification of face trajectories will be improved to better fit different contexts
through online learning and the adoption of a more robust classifier.
Chapter 5
5.1 Contributions
This research work mainly addresses the problem of 3D face analysis, including facial
landmarking and facial expression recognition. The approaches we have proposed
for these purposes can also be combined to build a fully automatic facial expression
recognition system.
The contributions in this thesis are discussed as follows.
partial face model (SFAM) which learns variations in 3D shape as well as local
texture and local geometry. The fitting is performed thanks to the minimization of
an objective function describing the similarity between a query face and SFAM with
consideration of partial occlusion, thus enabling landmarking on partially occluded
faces. The optimization of the objective function is accelerated by pre-computing
correlation meshes. Moreover, an occlusion detection method has been proposed
to detect the local regions occluded and give a set of occlusion parameters for the
objective function.
Experimental results have demonstrated that by considering both texture and
geometry information, our method is able to locate a set of landmarks beyond those
characterized by salient shape with a better accuracy. Thus, SFAM has reached
a better landmarking ability than the previous models proposed in the literature
in terms of accuracy and robustness when encountering severe conditions such as
expression and occlusion.
Chapter 5. Conclusion and Future Works
or AUs appearing on faces and features extracted from face appearance. Moreover,
it has a flexible topology that allows knowledge carried by new features to be
integrated and facial activity to be expressed as expressions or AUs. Thus, it can be
applied to
both expression recognition and AU recognition problems. By combining BBN with
SFAM, a fully automatic facial expression recognition system is elaborated.
The experimental results have demonstrated the efficiency of BBN compared
with SVM and SRC to fuse features from different face representations. Using a
uniform structure, the BBN achieves good results for recognizing both expressions
(second rank in the literature) and facial AUs. Tested on automatically located
landmarks by SFAM, the BBN shows its robustness to landmark locating errors.
Moreover, in order to enrich the information used for 3D face analysis, we have also
proposed in this Ph.D. work a new 3D facial feature, named SGAND, to characterize
face geometry. Indeed, pose-invariant features for 3D faces can be a
shortcut for face analysis because using this kind of feature avoids the face
alignment procedure. However, most pose-invariant features, such as shape index and HK
curvature, are sensitive to face scale because they are extracted from face meshes,
which vary with face scale. In contrast, the SGAND feature we have elaborated
to characterize surface properties relies only on point clouds instead of
face meshes. Thus, this feature is insensitive to scale, easy to implement and quick
to compute. It relies on the comparison of the numbers of vertices above and below
face planes with a preset direction in sampled local regions of 3D faces. In order
to compute this direction, a head pose estimation method has been developed in
conjunction with the feature so that the feature can be extracted under various head
poses. As experiments have shown, the SGAND feature has been applied successfully
to the recognition of the six universal expressions.
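The above/below comparison can be illustrated with a toy example. The sketch below is
not the SGAND definition itself: it assumes the plane passes through the vertex under
consideration with the preset direction as its normal, samples the k nearest points of
the cloud as the local region, and returns a single binary value. The function name
`sgand_like` and the parameter `k` are hypothetical.

```python
def sgand_like(cloud, index, direction, k=8):
    """Binary bending cue at cloud[index] from its k nearest neighbours.

    The plane passes through the vertex and has normal `direction`. Returns 1
    if more neighbours lie above the plane than below it, else 0.
    """
    cx, cy, cz = cloud[index]
    others = [p for i, p in enumerate(cloud) if i != index]
    others.sort(key=lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2 + (p[2] - cz) ** 2)
    above = below = 0
    for x, y, z in others[:k]:
        side = ((x - cx) * direction[0] + (y - cy) * direction[1]
                + (z - cz) * direction[2])
        if side > 0:
            above += 1
        elif side < 0:
            below += 1
    return 1 if above > below else 0
```

Since only the counts on each side of the plane matter, multiplying all coordinates by
a common positive factor leaves the value unchanged, which is consistent with the scale
insensitivity claimed for the feature.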
Finally, our last contribution concerning face analysis deals with people counting
based on face tracking. Existing people counting systems rely on the assumption
that detected or tracked objects are humans. Some of them use a preliminary
people model to verify this assumption, such as the ratio of height to width of
objects. Unfortunately, these methods suffer from inaccuracy when validating humans.
Thus, we have proposed a method that makes use of the face, the most discriminative
human feature, to accomplish this validation. This approach is composed
of a face detector and a face tracker that collaborate to detect and track faces in
2D videos. The face detector is a cascade of Adaboost classifiers and the tracker is a
combination of a Kalman filter and the kernel-based object tracking algorithm. The
tracker is improved for use in scenarios where people move towards the camera.
In this case, the face scale varies drastically, which introduces errors into a
tracking process based on the traditional Kalman filter. Thus, we have designed a
scale-invariant Kalman filter which tracks faces in an image plane onto which the face
trajectories are projected, so that 2.5D face movements (face movements with scale
changes) can be normalized to 2D face movements. Face trajectories from the tracker are
described by a histogram of moving directions and classified using a K-NN classifier.
In this way, bad trajectories caused by false face detections and tracking fragments
can be filtered out, so that only correct face trajectories are used for people
counting. Our approach has been validated by our experimental results, which have
demonstrated good performance on these different aspects; a people counting accuracy
of around 93% has been reached.
Extensions of this work that we envisage are presented in the following paragraphs.
In this thesis, local range and texture maps have been used as simple features to
represent local shape and texture around a landmark. In the future, the landmark
location may be improved by extracting other features such as our proposed SGAND
feature, HK curvature, shape index, etc. for shape feature, and Local Binary Pat-
tern, Gabor filtering, etc. for texture property within our statistical landmarking
Another improvement may concern the constraints applied to instances generated
by SFAM during the fitting process. Indeed, SFAM parameters (b_i) are empirically
limited to constrain possible deformations. We plan to add a process to set
the boundaries of their variation range according to the face properties available in
the training data.
Chapter 5. Conclusion and Future Works
fication, expression and illumination. More recently TensorFace has been proposed
for a multi-linear analysis to model explicitly the multiple modes of variations in
these factors and their inter-relationships [Jia & Gong 2005]. Thus, in the future, we
will investigate 3D TensorFace for joint facial expression recognition
and face recognition.
Chapter 6
FACS describes all visually distinguishable muscular activities that produce mo-
mentary changes in facial appearance on the basis of 44 unique AUs, as well as
several categories of head and eye positions and movements. Each AU has a numeric
code. They are sorted into three categories: upper face AUs, lower face AUs
and miscellaneous AUs. The first category includes AUs named Inner Brow Raiser
(1), Outer Brow Raiser (2), Brow Lowerer (4), Upper Lid Raiser (5), Cheek Raiser
(6), Lid Tightener (7), Eye Closure (43), Blink (45), Wink (46). The second
category includes AUs named Nose Wrinkler (9), Upper Lip Raiser (10), Nasolabial
Fold Deepener (11), Lip Corner Puller (12), Cheek Puffer (13), Sharp Lip Puller,
Dimpler (14), Lip Corner Depressor (15), Lower Lip Depressor (16), Chin Raiser
(17), Lip Puckerer (18), Lip Stretcher (20), Lip Funneler (22), Lip Tightener (23),
Lip Presser (24), Lips Part (25), Jaw Drop (26), Mouth Stretch (27) and Lip Suck
(28). The third category includes Lips Toward Each Other (8), Tongue Show (19),
Neck Tightener (21), Jaw Thrust (29), Jaw Sideways (30), Jaw Clencher (31), Lip
Bite (32), Blow (33), Puff (34), Cheek Suck (35), Tongue Bulge (36), Lip Wipe (37),
Nostril Dilator (38), Nostril Compressor (39). It is crucial to note that while FACS is
anatomically based, there is not a one-to-one correspondence between muscle groups
and AUs, since a given muscle may act in different ways and thus produce different
Chapter 6. Appendix: FACS and used Action Units
6.1 AU Examples
In total, 16 facial AUs are analyzed in Chapter 3. They are AU2, AU4, AU7,
AU9, AU10, AU12, AU14, AU17, AU18, AU22, AU24, AU26, AU27, AU28, AU34 and
AU43. In order to analyze their characteristics, we illustrate these
AUs here and explain them.
AU2 Outer Brow Raiser: The muscle that underlies AU2 originates in the fore-
head and is attached to the skin in the area around the brows. In AU2 the action
is upwards, pulling the eyebrows and the adjacent skin in the outer portion of the
forehead upwards towards the hairline. It produces an arched shape to the eyebrows
and causes the lateral portion of the eye cover fold to be stretched upwards.
AU4 Brow Lowerer: Three muscle strands underlie AU4. One strand runs
obliquely in the forehead. Another strand emerges from the root of the nose. A
third strand runs from the glabella to the medial corner of the eyebrow. It lowers
the eyebrow and pushes the eye cover fold downwards and may narrow the eye
aperture. Meanwhile, it pulls the eyebrows closer together and produces vertical
wrinkles between the eyebrows as well as an oblique wrinkle or bulge running from
the middle of the forehead down to the inner corner of the brow.
AU7 Lid Tightener: The muscle that circles the eye orbit is the basis for AU7.
This muscle runs in and near the eyelids. When it is contracted, AU7 pulls both
upper and lower eyelids and some adjacent skin below the eye together and towards
the inner eye corner. It tightens eyelids and narrows eye aperture. It raises the
lower lid so it covers more of the eyeball than is usually covered. Meanwhile, the
raising of the skin below the lower eyelid causes a bulge to appear in the lower lid.
AU9 Nose Wrinkler: The muscle underlying AU9 reaches from the area near
the root of the nose downward to a point adjacent to the nostril wings. When
contracted, this muscle pulls skin from the area below the nostril wings upwards
towards the root of the nose. It pulls the skin along the sides of the nose upwards
towards the root of the nose causing wrinkles to appear. It also lowers the medial
portion of the eyebrows and pulls the center of the upper lip upwards as well as
narrows the eye aperture.
AU10 Upper Lip Raiser: The muscle underlying AU10 emerges from the center
of the infraorbital triangle and attaches in the area of the nasolabial furrow. In
AU10 the skin above the upper lip is pulled upwards and towards the cheek, pulling
the upper lip up. It raises the upper lip: the center of the upper lip is drawn straight
up and the outer portions of the upper lip are drawn up but not as high as the center.
It pushes the infraorbital triangle up, widens the nostril wings and deepens the
nasolabial furrow.
AU12 Lip Corner Puller: The muscle underlying AU 12 emerges high up in the
lower face by the cheek bones and attaches at the corner of the lips. In AU12, the
direction of the action is to pull the lip corners up towards the cheek bone in an
oblique direction. It pulls the corners of the lips back and upward and deepens the
nasolabial furrow by pulling it laterally and up. In a strong action, it bags the skin
below the lower eyelid, narrows the eye aperture and produces crow’s feet at eye
AU14 Dimpler: The muscle underlying AU 14 emerges far back in the cheek
bones and attaches in the center portion of the lips. In AU14 the skin beyond the
lip corners is pulled inwards towards the lip corners, which are themselves drawn
somewhat towards the ears. It tightens the corners of the mouth, pulling the corners
somewhat inwards, and narrowing the lip corners. It also produces wrinkles and/or
a bulge at the lip corner and pulls the skin below the lip corners and the chin boss
up towards the lip corners, flattening and stretching the chin boss skin.
AU17 Chin Raiser: The muscle underlying AU 17 emerges from an area below
the lower lip and attaches far down the chin. In AU 17 the skin of the chin is pushed
upwards, pushing up the lower lip. It pushes the chin boss and the lower lip upward
and may cause wrinkles to appear on the chin boss. It gives the mouth an
inverted-U shape.
AU18 Lip Puckerer: The muscle relevant to AU18 is located above and below
the upper and lower lips. AU 18 draws the lips medially, pursing or puckering them,
causing the lips to protrude. It pushes the lips of the mouth forward and pulls
medially and de-elongates the mouth opening, making the mouth opening smaller
and rounder, and the lips appear tight. It makes short wrinkles on the skin above
the upper lip and also may cause wrinkles on the skin below the lower lip, and
wrinkles in the lips themselves.
AU22 Lip Funneler: It is based on the outer strands of the muscle that runs
around the mouth. It pulls in medially on the lip corners and makes lips funnel
outwards taking on the shape as though the person were saying the word flirt. It
exposes the teeth, gums and more of the red parts of the lips.
AU24 Lip Presser: It is based on the inner portion of the muscle orbiting the
mouth within the lips. The lips are pulled in medially and pressed together. It
lowers the upper lip and raises the lower lip to a small extent, without pushing up
the chin boss. It tightens and narrows the lips and may cause small lines or wrinkles
to appear on the upper lip and a bulging of the skin above and/or below the lips.
AU26 Jaw Drop: It describes the limited opening of the oral cavity (i.e., teeth
parting) that can be produced by relaxing the muscle that closes the jaw. In AU26,
the mandible is lowered by relaxation so that separation of the teeth can at least
be inferred. The mouth appears as if the jaw has dropped or fallen, with no sign of
the jaw being pulled open or of the lips being stretched by opening the jaw wide.
AU27 Mouth Stretch: AU 27 describes the forced opening and stretching of
the mouth by muscles that act in opposition to the muscles that close the jaw. It pulls
down the mandible and opens the mouth quite far, changing the shape of the mouth
opening from an oval with the long axis in the horizontal plane to one with the long
axis in the vertical direction. It flattens and stretches the cheeks and changes the
shape of the skin on the chin boss and the appearance under the chin.
AU28 Lips Suck: It involves the orbital muscles surrounding the mouth and
lips. In AU28 the lips are pulled into the mouth. This movement can involve only
the upper or lower lip. It sucks the red parts of the lips and the adjacent skin into
the mouth, covering the teeth and causing the red parts to disappear. It stretches the
skin above and below the lips and flattens the chin boss.
AU34 Puff: The cheeks puff out as air is forced into the mouth, but the lips
remain closed keeping the air in.
AU43 Eye Closure: The same muscle, which when contracted raises the upper
eyelid and when partially relaxed lets it droop, allows the eye to close when totally
relaxed. In AU43, the eyelid droops down reducing the eye aperture and more
surface of the upper eyelid is exposed than usual.
Some AU combinations can be converted into emotions using high-level decision
rules. Fig. 6.2 reproduces Table 10-1 of the FACS Investigator’s Guide
[Ekman et al. 2002], which lists some prototypes and major variants of the AU
combinations corresponding to the six universal emotions.
Excluded from the table are dozens of minor variants for each of the emotions,
AU combinations for variations in the intensity of each emotion, and AU combina-
tions for blends of two or more emotions [Ekman et al. 2002].
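As a rough illustration of how such decision rules might be applied, the following Python sketch checks whether a set of detected AUs contains any known AU combination for an emotion. The rule table `EMOTION_RULES` and the function `match_emotions` are hypothetical names introduced here, and the AU sets are illustrative placeholders based on commonly cited prototypes; they are not a transcription of Table 10-1, and AU intensities are ignored.

```python
from typing import Dict, FrozenSet, Iterable, List

# Illustrative rules only (assumption): these AU sets are placeholders,
# NOT the content of Table 10-1 in the FACS Investigator's Guide.
# Each emotion maps to one or more AU combinations (variants).
EMOTION_RULES: Dict[str, List[FrozenSet[int]]] = {
    "happiness": [frozenset({6, 12})],
    "surprise": [frozenset({1, 2, 5, 26}), frozenset({1, 2, 5, 27})],
    "sadness": [frozenset({1, 4, 15})],
}


def match_emotions(active_aus: Iterable[int]) -> List[str]:
    """Return every emotion for which at least one variant is fully active."""
    active = set(active_aus)
    return [
        emotion
        for emotion, variants in EMOTION_RULES.items()
        # A variant fires when all of its AUs are among the detected AUs.
        if any(variant <= active for variant in variants)
    ]


if __name__ == "__main__":
    # AU6 + AU12 are present, so the happiness rule fires; AU25 is ignored.
    print(match_emotions([6, 12, 25]))
```

A subset test is used so that additional AUs in the input do not prevent a match; a stricter system could require an exact match or take AU intensity into account.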
The results obtained during my PhD study have been the subject of five publications
in international conferences and one in a national conference. Moreover, two journal
papers have been submitted.
International Conferences:
National Conferences:
