
Towards Pose Invariant Face Recognition in the Wild

Jian Zhao1,2∗† Yu Cheng3 Yan Xu4 Lin Xiong4 Jianshu Li1 Fang Zhao1
Karlekar Jayashree4 Sugiri Pranata4 Shengmei Shen4 Junliang Xing5 Shuicheng Yan1,6 Jiashi Feng1
1 National University of Singapore   2 National University of Defense Technology   3 Nanyang Technological University
4 Panasonic R&D Center Singapore   5 National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
6 Qihoo 360 AI Institute

Abstract

Pose variation is one key challenge in face recognition.


As opposed to current techniques for pose invariant face recognition, which either directly extract pose invariant features for recognition, or first normalize profile face images to frontal pose before feature extraction, we argue that it is more desirable to perform both tasks jointly to allow them to benefit from each other. To this end, we propose a Pose Invariant Model (PIM) for face recognition in the wild, with three distinct novelties. First, PIM is a novel and unified deep architecture, containing a Face Frontalization sub-Net (FFN) and a Discriminative Learning sub-Net (DLN), which are jointly learned from end to end. Second, FFN is a well-designed dual-path Generative Adversarial Network (GAN) which simultaneously perceives global structures and local details, incorporated with an unsupervised cross-domain adversarial training and a "learning to learn" strategy for high-fidelity and identity-preserving frontal view synthesis. Third, DLN is a generic Convolutional Neural Network (CNN) for face recognition with our enforced cross-entropy optimization strategy for learning discriminative yet generalized feature representations. Qualitative and quantitative experiments on both controlled and in-the-wild benchmarks demonstrate the superiority of the proposed model over the state-of-the-arts.

[Figure 1 appears here.] Figure 1: Pose invariant face recognition in the wild. Col. 1 & 6: distinct identities under different poses with other unconstrained factors (different expressions and lighting conditions); Col. 2 & 5: recovered frontal faces with our proposed PIM model; Col. 3 & 4: learned facial representations with our proposed PIM model. PIM can learn pose-invariant representations and recover photorealistic frontal faces effectively. The representations are extracted from the penultimate layer of PIM.

1. Introduction

Face recognition has been a key problem in computer vision for decades. Even though (near-) frontal¹ face recognition seems to be solved under constrained conditions, the more general problem of face recognition in the wild still needs more study, as demanded by many practical applications. For example, in surveillance scenarios, free-walking people do not always keep their faces frontal to the cameras. Most face images captured in the wild are contaminated by unconstrained factors like extreme pose, bad illumination, and large expression. Among them, the one that arguably harms face recognition performance the most is pose variation. The performance of most face recognition models degrades by over 10% from frontal-frontal to frontal-profile verification [24]. In contrast, humans can recognize faces with large pose variation without a significant accuracy drop. In this work, we aim to close this gap between human performance and automatic models for recognizing unconstrained faces with large pose variations.
∗ Homepage: https://zhaoj9014.github.io/.
† Work done in part during an internship at Panasonic R&D Center Singapore.
¹ "Near frontal" faces are almost equally visible for both sides and their yaw angles are within 10° from the frontal view.

Recent studies [10, 19] discovered that the human brain has a face-processing neural system that consists of several connected regions. The neurons in some of these regions perform face normalization (i.e., profile to frontal) and others are tuned to identify the synthesized frontal faces, making face recognition robust to pose variation. This intriguing function of the primate brain inspires us to develop a novel and unified deep neural network, termed Pose Invariant Model (PIM), which jointly learns face frontalization and discriminative representations end-to-end so that the two tasks mutually boost each other to achieve pose-invariant face recognition. PIM takes face images of arbitrary poses, together with other potential distracting factors (e.g., bad illumination or different expressions), as inputs. It outputs facial representations that are invariant to pose variation and meanwhile preserve discriminativeness across different identities. As shown in Fig. 1, our proposed PIM can learn pose-invariant representations and effectively recover frontal faces.

In particular, PIM includes a Face Frontalization sub-Net (FFN) and a Discriminative Learning sub-Net (DLN) to learn the representations. The FFN contains a carefully designed dual-path Generative Adversarial Network (GAN) that simultaneously recovers global facial structures and local details. Besides, FFN introduces unsupervised cross-domain adversarial training and a "learning to learn" strategy with the siamese discriminator for achieving stronger generalizability and high-fidelity, identity-preserving frontal face generation. Cross-domain adversarial training is applied when training the generator to promote features that are indistinguishable w.r.t. the shift between source (training) and target (test) domains. In this way, the generalizability of FFN can be significantly improved even when only a few training samples from the target domain are available. The discriminator in FFN introduces dynamic convolution to implement "learning to learn" for more efficient adaption, and a siamese architecture featuring a pair-wise training scheme to encourage the generator to produce photorealistic frontal faces without identity information loss. We introduce the other branch of the discriminator as the "learner", which predicts the dynamic convolutional parameters of the first branch from a single sample. DLN is a generic Convolutional Neural Network (CNN) for face recognition with our proposed enforced cross-entropy optimization strategy. Such a strategy reduces the intra-class distance while increasing the inter-class distance, so that the learned facial representations are discriminative yet generalizable.

We conduct extensive qualitative and quantitative experiments on various benchmarks, including both controlled and in-the-wild datasets. The results demonstrate the effectiveness of PIM on recognizing faces with extreme poses and its consistent superiority over the state-of-the-arts on all the benchmarks.

Our contributions are summarized as follows.

• We present a deep architecture unifying face frontalization and recognition in a mutually boosting way. It inherits the merits of existing pose invariant face recognition methods.
• We design a novel face frontalization network for photorealistic face frontalization that generalizes well across multiple domains and quickly adapts to new application samples.
• We develop effective and novel training strategies for the frontalization network, the recognition network, and the whole deep architecture, which generate powerful face representations.
• Our deep architecture for pose invariant face recognition significantly outperforms the state-of-the-arts on three large benchmarks.

2. Related Work

Face Frontalization Face frontalization or normalization is a challenging task due to its ill-posed nature. Traditional methods address this problem through 2D/3D local texture warping [14, 38], statistical modeling [21], and deep learning based methods [18, 37]. For instance, Hassner et al. [14] used a single, unmodified 3D surface to approximate the shape of all input faces, which is effective for face frontalization but suffers a large performance drop for profile and near-profile² faces due to severe texture loss and artifacts. Sagonas et al. [21] proposed to perform joint frontal view reconstruction and landmark detection by solving a constrained low-rank minimization problem. Kan et al. [18] used Stacked Progressive Auto-Encoders (SPAE) to rotate a profile face to frontal. Despite encouraging results, the synthesized faces lack fine details and tend to be blurry and unreal under large poses. The quality of images synthesized by current methods is still far from satisfactory for recognizing faces with large pose variation.

² Faces with yaw angle greater than 60°.

Pose Invariant Representation Learning Conventional approaches often leverage robust local descriptors [8, 2, 7] and metric learning [4, 33] to tackle pose variance. In contrast, deep learning methods often handle pose variance through a single pose-agnostic model or several pose-specific models with pooling operations and specific loss functions [6, 34]. For instance, the VGG-Face model [20] adopts the VGG architecture [27]. The DeepFace model [30, 31] uses a deep CNN coupled with 3D alignment. FaceNet [23] utilizes the Inception architecture. DeepID2+ [29] and DeepID3 [28] extend the FaceNet [23] model by including joint Bayesian metric learning and multi-task learning. However, such data-driven methods heavily rely on well-annotated data, and collecting labeled data covering all variations is expensive and even impractical.

Our proposed PIM shares a similar idea with Two-Pathway GAN (TP-GAN) [17] and Disentangled Representation learning GAN (DR-GAN) [32]. TP-GAN considers photorealistic and identity preserving frontal view synthesis, and DR-GAN considers both face frontalization and representation learning in a unified network. Our proposed model differs from them in the following aspects: 1) PIM aims to jointly learn face frontalization and pose invariant representations end-to-end to allow them to mutually boost each other for addressing the large pose variance issue in unconstrained face recognition, whereas TP-GAN only tries to recover a frontal view from profile face images; 2) TP-GAN [17] and DR-GAN [32] suffer from poor generalizability and great optimization difficulties which limit their effectiveness in unconstrained face recognition, while our PIM architecture effectively overcomes these issues by introducing unsupervised cross-domain adversarial training, a "learning to learn" strategy using the siamese discriminator with dynamic convolution, and an enforced cross-entropy optimization strategy. Detailed experimental comparisons are provided in Sec. 4.

3. Pose Invariant Model

As shown in Fig. 2 (a), the proposed Pose Invariant Model (PIM) consists of a Face Frontalization sub-Net (FFN) and a Discriminative Learning sub-Net (DLN) that jointly normalize faces and learn face representations end-to-end. We now present each component in detail.

3.1. Face Frontalization Sub-Net

3.1.1 Domain Invariant Dual-Path Generator

A photorealistic frontal face image is important for representing a face identity. A natural scheme is thus to generate this reference face from face images of arbitrary poses. Since the convolutional filters are usually shared across all spatial locations, merely using a single-path generator cannot learn filters that are powerful enough for both sketching a rotated face structure and precisely recovering local textures. To address this issue, we propose a dual-path generator, as inspired by [17, 38], where one path aims to infer the global sketch and the other attends to local facial details, as shown in Fig. 2 (b).

In particular, the global path generator Gθg (with learnable parameters θg) consists of a transition-down encoder G^E_θg and a transition-up decoder G^D_θg. The local path generator Gθl also has an auto-encoder architecture, which contains four identical sub-networks that learn separately to frontalize the following four center-cropped local patches: left eye, right eye, nose and mouth. These patches are acquired by an off-the-shelf landmark detection model. Given an input face image I, to effectively integrate information from the global and local paths, we first align the feature maps f^l predicted by Gθl to a single feature map according to a pre-estimated landmark location template, which is further concatenated with the feature map f^g from the global path and then fed to the following convolution layers to generate the final frontalized face image I′. We also concatenate a Gaussian random noise z at the bottleneck layer of the dual-path generator to model variations of factors other than pose, which may also help recover invisible details.

Formally, let the input profile face image with four landmark patches be collectively denoted as Itr. Then the predicted face is I′ = Gθ(Itr). The key requirements for the FFN include two aspects. 1) The recovered frontal face image I′ should visually resemble a real one and preserve the identity information as well as local textures. 2) It should be hardly possible for an algorithm to identify the domain of origin of the observation I′, regardless of the underlying gap between the source domain (with ample annotated data) and the target domain (with rare annotated data).

To this end, we propose to learn the parameters {θg, θl_i} (here i = 1, ..., 4 indexes the four local path models) by minimizing the following composite loss:

    L_Gθ = −L_adv + λ0 L_ece − λ1 L_domain + λ2 L_pixel + λ3 L_sym + λ4 L_TV,    (1)

where L_adv is the adversarial loss for adding realism to the synthetic images and alleviating artifacts, L_ece is the enforced cross-entropy loss for preserving the identity information, L_domain is the cross-domain adversarial loss for domain adaption and generalization capacity enhancement, L_pixel is the pixel-wise ℓ1 loss for encouraging multi-scale image content consistency, L_sym is the symmetry loss for alleviating the self-occlusion issue, L_TV is the total variation loss for reducing spiky artifacts, and {λk} (k = 0, ..., 4) are weighting parameters among the different losses.
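For concreteness, the composite objective of Eqn. (1) can be assembled as a simple weighted sum once the individual terms are available. The sketch below (Python/PyTorch-style code, used throughout this rewrite purely for illustration; the authors report a TensorFlow implementation) uses the λ0–λ4 values reported later in Sec. 4 and treats each loss term as an already-computed scalar.

```python
def generator_loss(l_adv, l_ece, l_domain, l_pixel, l_sym, l_tv,
                   lambdas=(5e-3, 0.1, 0.3, 5e-2, 5e-4)):
    """Composite FFN generator objective of Eqn. (1).

    Each l_* argument is a scalar loss computed elsewhere; lambdas holds
    (lambda_0, ..., lambda_4). The adversarial and cross-domain terms enter
    with a negative sign because the generator is trained to fool the
    corresponding classifiers.
    """
    l0, l1, l2, l3, l4 = lambdas
    return (-l_adv + l0 * l_ece - l1 * l_domain
            + l2 * l_pixel + l3 * l_sym + l4 * l_tv)
```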
In order to enhance the generalizability of the FFN and reduce the over-fitting that hinders the practical application of most previous GAN-based models [17, 32], we adopt L_domain to promote the emergence of features encoded by Gθg and Gθl_i that are indistinguishable w.r.t. the shift between the source (training, Itr) and target (testing, Ite) domains. Let I_i denote the images from both source and target domains, y_i ∈ {0, 1} indicate which domain I_i is from, and r_i = G^E_θ(I_i) denote the representations. The cross-domain adversarial loss is defined as follows:

    L_domain = −(1/N) Σ_i { y_i log[C_φ(r_i)] + (1 − y_i) log[1 − C_φ(r_i)] },    (2)

where φ denotes the learnable parameters of the domain classifier. Minimizing L_domain can reduce the domain discrepancy and help the generator achieve similar face frontalization performance across different domains, even if training samples from the target domain are limited. Such adapted representations are obtained by augmenting the encoders of Gθg and Gθl_i with a few standard layers acting as the domain classifier C_φ, and a new gradient reversal layer that reverses the gradient when optimizing the encoders (i.e., gradient update as in Fig. 2 (b)), as inspired by [11].
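In practice, the domain classifier and the gradient reversal layer described above can be implemented with a few lines of autograd code. The following is a minimal PyTorch sketch (our choice for illustration only; the paper's implementation is in TensorFlow), with the feature dimension and classifier layer sizes as illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient sign in the backward
    pass, so minimizing the domain loss w.r.t. the encoder maximizes domain
    confusion (the gradient reversal layer in the spirit of [11])."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

class DomainClassifier(nn.Module):
    """A few standard layers playing the role of C_phi on top of the encoder
    representation r_i; the sizes here are assumptions."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 1))

    def forward(self, r):
        # The reversal layer sits between the encoder feature and C_phi.
        return self.net(GradReverse.apply(r))

def domain_loss(classifier, r, domain_label):
    """Eqn. (2): binary cross-entropy over source (0) / target (1) domains."""
    logits = classifier(r).squeeze(1)
    return F.binary_cross_entropy_with_logits(logits, domain_label.float())
```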

[Figure 2 appears here. (a) Overview of the proposed PIM framework. (b) Dual-path generator architecture of the FFN.]

Figure 2: Pose Invariant Model (PIM) for face recognition in the wild. The PIM contains a Face Frontalization sub-Net (FFN) and a Discriminative Learning sub-Net (DLN) that jointly learn end-to-end. FFN is a dual-path (i.e., simultaneously perceiving global structures and local details) GAN augmented by (1) unsupervised cross-domain (i.e., Itr and Ite) adversarial training and (2) a siamese discriminator with a "learning to learn" strategy, whose convolutional parameters (i.e., W_d) are dynamically predicted by the "learner" D_L of the discriminator and transferred to D_M. DLN is a generic Convolutional Neural Network (CNN) for face recognition optimized by the proposed enforced cross-entropy optimization. It takes in the frontalized face images from FFN and outputs learned pose invariant facial representations.

L_pixel is introduced to enforce the multi-scale content consistency between the final frontalized face and the corresponding ground truth, defined as L_pixel = ||I′ − I_GT||_1 / |I_GT|, where |I_GT| is the size of I_GT. Since symmetry is an inherent feature of human faces, L_sym is introduced within the Laplacian space to exploit this prior information and impose a symmetry constraint on the recovered frontal view, alleviating the self-occlusion issue:

    L_sym = (1 / ((W/2) · H)) Σ_{i=1}^{W/2} Σ_{j=1}^{H} |I′_{i,j} − I′_{W−(i−1),j}|,    (3)

where W and H denote the width and height of the final recovered frontal face image I′, respectively.

The standard L_TV is introduced as a regularization term on the synthesized results to reduce spiky artifacts:

    L_TV = Σ_{i=1}^{W} Σ_{j=1}^{H} sqrt( (I′_{i,j+1} − I′_{i,j})² + (I′_{i+1,j} − I′_{i,j})² ).    (4)
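As a reference for how the three image-level terms can be evaluated on a batch of synthesized faces, the sketch below follows Eqns. (3) and (4) and the pixel-wise ℓ1 loss. It assumes image tensors of shape (N, C, H, W), uses means instead of sums so the scale does not depend on image size, and, for brevity, computes the symmetry term directly on pixel intensities rather than in the Laplacian space used by the paper.

```python
import torch

def pixel_loss(pred, gt):
    """Pixel-wise l1 loss normalized by the image size (L_pixel)."""
    return (pred - gt).abs().mean()

def symmetry_loss(pred):
    """Horizontal symmetry prior of Eqn. (3): compare the left half of the
    recovered frontal face against the mirrored right half."""
    w = pred.shape[-1]
    left = pred[..., : w // 2]
    right_mirrored = torch.flip(pred[..., w - w // 2:], dims=[-1])
    return (left - right_mirrored).abs().mean()

def tv_loss(pred, eps=1e-8):
    """Total variation regularizer of Eqn. (4), discouraging spiky artifacts."""
    dh = pred[..., 1:, :] - pred[..., :-1, :]      # vertical differences
    dw = pred[..., :, 1:] - pred[..., :, :-1]      # horizontal differences
    # Crop both difference maps to a common (H-1, W-1) region before pairing.
    return torch.sqrt(dh[..., :, :-1] ** 2 + dw[..., :-1, :] ** 2 + eps).mean()
```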

3.1.2 Dynamic Convolutional Discriminator

To increase the realism of the synthesized images and thereby benefit face recognition, we need to narrow the gap between the distributions of the synthetic and real images. Ideally, the generator should be able to generate images indistinguishable from real ones for a sufficiently powerful discriminator. Meanwhile, since the training sample size in this scenario is usually small, we need to develop a sample-efficient discriminator. To this end, we propose a "learning to learn" strategy using a siamese adversarial pixel-wise discriminator with dynamic convolution, as shown in Fig. 2 (a). This siamese architecture implements a pair-wise training scheme where each sample from the generator consists of two frontalized faces with the same identity, and the corresponding real sample consists of two distinct frontal faces of the same person.

Different from conventional CNN based discriminators, we construct the second branch of the discriminator as the "learner" D_L that dynamically predicts the suitable convolutional parameters of the first branch D_M from a single sample. Formally, consider a particular convolutional layer in D_M. Given an input tensor (i.e., feature maps from the previous layer) x_in ∈ R^{w×h×c_in} and kernel weights W ∈ R^{k×k×c_in×c_out}, where k is the kernel size, the output x_out ∈ R^{w′×h′×c_out} of the convolutional layer can be computed as x_out = W ∗ x_in, where ∗ denotes the convolution operation.

Inspired by [3], we perform the following factorization, which is analogous to Singular Value Decomposition (SVD):

    x_out = U′ ∗ (W_d) ∗_{c_in} U ∗ x_in,    (5)

where U ∈ R^{1×1×c_in×c_in}, U′ ∈ R^{1×1×c_in×c_out}, W_d ∈ R^{k×k×c_in} is the dynamic convolution kernel predicted by D_L, and ∗_{c_in} denotes independent filtering of the c_in channels. Under the factorization of Eqn. (5), the number of parameters to learn by D_L is significantly decreased from k × k × c_in × c_out to k × k × c_in, allowing them to grow only linearly with the number of input feature map channels.
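The factorization in Eqn. (5) can be realized as a 1×1 convolution, followed by a depth-wise convolution whose k×k×c_in kernel W_d is regressed by the learner branch D_L, followed by another 1×1 convolution. The PyTorch sketch below is a minimal illustration of this idea under an assumed (N, C, H, W) tensor layout; how D_L actually produces W_d from a single sample is not shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FactorizedDynamicConv(nn.Module):
    """x_out = U' * (W_d *_{c_in} (U * x_in)), following Eqn. (5).

    U and U' are ordinary 1x1 convolutions; the k x k x c_in kernel W_d is
    predicted per sample by the learner D_L, so only k*k*c_in parameters are
    dynamic instead of k*k*c_in*c_out.
    """
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.k = k
        self.U = nn.Conv2d(c_in, c_in, kernel_size=1)
        self.U_prime = nn.Conv2d(c_in, c_out, kernel_size=1)

    def forward(self, x, w_d):
        # w_d: dynamic kernel of shape (c_in, k, k), shared across the batch here.
        c_in = x.shape[1]
        x = self.U(x)
        # Depth-wise (channel-independent) filtering with the dynamic kernel.
        x = F.conv2d(x, w_d.view(c_in, 1, self.k, self.k),
                     padding=self.k // 2, groups=c_in)
        return self.U_prime(x)

# Usage sketch: w_d stands in for the kernel the learner D_L would predict.
layer = FactorizedDynamicConv(c_in=64, c_out=128, k=3)
x = torch.randn(2, 64, 32, 32)
w_d = torch.randn(64, 3, 3)
y = layer(x, w_d)                      # -> shape (2, 128, 32, 32)
```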
We use the same architecture as the global-path encoder for both D_M and D_L, which learn separately without weight sharing, while the two generator blocks in Fig. 2 (a) share their weights. The feature maps from D_M and D_L are further concatenated and fed into a fully connected bottleneck layer to compute L_adv, which serves as a supervision to push the synthesized image to reside in the manifold of photorealistic frontal view images, prevent blur effects, and produce visually pleasing results. In particular, L_adv is defined as

    L_adv = −(1/N) Σ_i { y_i log[D_{M←L}(I_M, I_L)] + (1 − y_i) log[1 − D_{M←L}(I_M, I_L)] },    (6)

where D_{M←L} denotes the siamese discriminator with dynamic convolution, (I_M, I_L) denotes the pair of face images fed to D_{M←L}, and y is the binary label indicating whether the pair is synthesized or real.

3.2. Discriminative Learning Sub-Net

The DLN is a generic CNN for face recognition trained by our proposed enforced cross-entropy optimization strategy for learning discriminative yet generalizable facial representations. This strategy reduces the intra-class distance while increasing the inter-class distance. Moreover, it helps improve the robustness of the learned representations and address the potential over-fitting issue.

DLN takes the frontalized face images I′ from the FFN as input, and outputs the learned pose invariant facial representations f = M_ψ(I′), which are further utilized for face verification and identification. Here M_ψ denotes the DLN model parameterized by ψ. We define every column vector of the weights of the last fully connected layer of DLN as an anchor vector a, which represents the center of each identity in the feature space. Thus, the decision boundary can be derived where the feature vector has the same distance (cosine metric) to several anchor vectors (cluster centers), i.e., a_i^T f = a_j^T f.

However, in such cases, the samples close to the decision boundary can be wrongly classified with high confidence. A simple yet effective solution is to reduce the intra-class distance while increasing the inter-class distance of the feature vectors, through which the hard samples will be adjusted and re-allocated to the correct decision area. To achieve this goal, we propose to impose a selective attenuation factor as a regularization term on the confidence scores (predictions) of the genuine samples:

    p_i = exp[τ_t · (a_i^T f)] / Σ_j exp[τ_t · (a_j^T f)],    (7)

where p_i denotes the predicted confidence score w.r.t. the i-th identity, τ_t denotes the selective attenuation factor, and a and f are ℓ2 normalized to achieve boundary equilibrium during network training. In particular, τ_t in Eqn. (7) is updated by τ_{t+1} = τ_t (1 − n/B)^α, where n denotes the batch index, B denotes the total batch number, and α is the diversity ratio. Selective attenuation on the confidence scores of genuine samples in turn increases the corresponding classification losses, which narrows the decision boundary and controls the intra-class affinity and inter-class distance.

The predictions of Eqn. (7) are used to compute the multi-class cross-entropy objective function for updating the network parameters (i.e., gradient update as in Fig. 2 (a)), which is an enforced optimization scheme:

    L_ece = −(1/N) Σ_i { l_i log(p) + (1 − l_i) log(1 − p) },    (8)

where l_i is the face identity ground truth.
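A compact sketch of the enforced cross-entropy optimization of Eqns. (7) and (8) is given below: features and anchors are ℓ2-normalized, the cosine logits are scaled by the selective attenuation factor τ_t, and τ_t is decayed once per batch. It is written in PyTorch for illustration; the anchor matrix simply plays the role of the last fully connected layer's weights, the initial τ value is an arbitrary assumption, and a standard multi-class cross-entropy is used in place of the per-class binary form of Eqn. (8).

```python
import torch
import torch.nn.functional as F

def update_tau(tau_t, batch_idx, total_batches, alpha=0.9):
    """tau_{t+1} = tau_t * (1 - n / B) ** alpha, applied once per batch."""
    return tau_t * (1.0 - batch_idx / total_batches) ** alpha

def enforced_cross_entropy(features, anchors, labels, tau):
    """Eqns. (7)-(8): cosine logits a_i^T f between l2-normalized features and
    identity anchors, attenuated by tau before the cross-entropy."""
    f = F.normalize(features, dim=1)      # (N, d)
    a = F.normalize(anchors, dim=1)       # (num_ids, d)
    logits = tau * (f @ a.t())            # (N, num_ids)
    return F.cross_entropy(logits, labels)

# Usage sketch: anchors correspond to the columns of the last FC layer of DLN.
features = torch.randn(8, 256)
anchors = torch.randn(1000, 256)          # one anchor per identity
labels = torch.randint(0, 1000, (8,))
tau = update_tau(tau_t=30.0, batch_idx=3, total_batches=100)
loss = enforced_cross_entropy(features, anchors, labels, tau)
```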

4. Experiments

We evaluate PIM qualitatively and quantitatively under both controlled and in-the-wild settings for pose-invariant face recognition. For qualitative evaluation, we show visualized results of face frontalization on the Multi-PIE [12] and LFW [16] benchmark datasets. For quantitative evaluation, we evaluate face recognition performance using the learned facial representations with a cosine distance metric on the Multi-PIE [12] and CFP [24] benchmark datasets.

Implementation Details Throughout the experiments, the size of the RGB face images from the training domain (Itr), the testing domain (Ite), and the FFN prediction (I′) is fixed as 128×128; the sizes of the four RGB local patches (i.e., left/right eye, nose and mouth) are fixed as 40×40, 40×40, 32×40 and 48×32, respectively; the dimensionality of the Gaussian random noise z is fixed as 100; the diversity ratio α and the constraint factors λi, i ∈ {0, 1³, 2, 3, 4}, are empirically fixed as 0.9, 5×10⁻³, 0.1, 0.3, 5×10⁻² and 5×10⁻⁴, respectively; the dropout ratio is fixed as 0.7; and the weight decay, batch size and learning rate are fixed as 5×10⁻⁴, 10 and 2×10⁻⁴, respectively. We use off-the-shelf OpenPose [25] for landmark detection⁴. We initialize the DLN with the ResNet-50 [15] and Light CNN-29 [35] architectures as our two baselines, which are pre-trained on MS-Celeb-1M [13] and fine-tuned on the target dataset. We initialize D_M and D_L with the same architecture as the global-path encoder and pre-train D_L on MS-Celeb-1M [13]. The proposed network is implemented on the publicly available TensorFlow [1] platform and is trained using Adam (β1 = 0.5) on three NVIDIA GeForce GTX TITAN X GPUs with 12 GB of memory.

³ Cross-domain adversarial training is optional; if there is no need to do domain adaptation, simply set λ1 = 0.
⁴ For profile face images with large yaw angles, OpenPose [25] may fail to locate both eyes. In such cases, we use the detected eye after center cropping as the input left/right eye patch.
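For quick reference, the hyper-parameters listed above can be collected in one place. The snippet below simply restates the reported values; the optimizer helper uses PyTorch's Adam for illustration (the authors train with TensorFlow), with β2 left at its default since only β1 is reported.

```python
import torch

CONFIG = {
    "image_size": (128, 128),
    "patch_sizes": {"left_eye": (40, 40), "right_eye": (40, 40),
                    "nose": (32, 40), "mouth": (48, 32)},
    "noise_dim": 100,
    "alpha": 0.9,                                  # diversity ratio
    "lambdas": [5e-3, 0.1, 0.3, 5e-2, 5e-4],       # lambda_0 ... lambda_4
    "dropout": 0.7,
    "weight_decay": 5e-4,
    "batch_size": 10,
    "learning_rate": 2e-4,
}

def make_optimizer(params):
    """Adam with beta1 = 0.5 as reported; beta2 kept at its default 0.999."""
    return torch.optim.Adam(params, lr=CONFIG["learning_rate"],
                            betas=(0.5, 0.999),
                            weight_decay=CONFIG["weight_decay"])
```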
4.1. Evaluations on the Multi-PIE Benchmark

The CMU Multi-PIE [12] dataset is the largest multi-view face recognition benchmark, containing 754,204 images of 337 identities captured from 15 viewpoints under 20 illumination conditions. We conduct experiments under two settings. Setting-1 concentrates on pose, illumination and minor expression variations. It only uses the images in session one, which contains 250 identities. The images with 11 poses within ±90° and 20 illumination levels of the first 150 identities are used for training. For testing, one frontal view with neutral expression and illumination (i.e., ID07) is used as the gallery image for each of the remaining 100 identities, and the other images are used as probes. Setting-2 concentrates on pose, illumination and session variations. It uses the images with neutral expression from all four sessions, which contain 337 identities. The images with 11 poses within ±90° and 20 illumination levels of the first 200 identities are used for training. For testing, one frontal view with neutral illumination is used as the gallery image for each of the remaining 137 identities, and the other images are used as probes.

Table 1: Component analysis: rank-1 recognition rates (%) under Multi-PIE [12] Setting-1. b1 and b2 denote ResNet-50 [15] and Light CNN-29 [35], respectively. PIM1 and PIM2 use ResNet-50 [15] and Light CNN-29 [35] as backbone architectures, respectively.

Method          ±90°   ±75°   ±60°   ±45°   ±30°   ±15°
b1              18.80  63.80  92.20  98.30  99.20  99.40
b2              33.00  76.10  95.20  97.90  99.20  99.80
w/o L_pixel     60.60  82.30  89.60  93.70  98.50  98.60
w/o Gθl_i       66.80  89.30  95.60  98.20  99.30  99.80
w/o Dφ          66.90  90.00  96.50  98.00  99.20  99.80
w/o dyn conv    69.80  90.70  96.80  98.10  99.40  99.80
w/o L_domain    71.10  90.80  97.10  98.30  99.30  99.80
w/o L_sym       72.30  90.40  96.80  98.20  99.30  99.80
PIM1            71.60  92.50  97.00  98.60  99.30  99.40
PIM2            75.00  91.20  97.70  98.30  99.40  99.80

4.1.1 Component Analysis

We first investigate different architectures and loss function combinations of PIM to see their respective roles in pose invariant face recognition. We compare eight variants of PIM, i.e., different DLN architectures (ResNet-50 [15] vs. Light CNN-29 [35]), w/o L_pixel, w/o the local-path generator Gθl_i, w/o the siamese discriminator Dφ (D_L is removed), w/o dynamic convolution (siamese discriminator without sharing weights), w/o cross-domain adversarial training L_domain, and w/o L_sym, in each case.

Averaged rank-1 recognition rates under Setting-1 are compared in Tab. 1. The results on the profile images serve as our baselines (i.e., b1 and b2). The results of the middle-panel variants are all based on Light CNN-29 [35]. By comparing the results from the top and bottom panels, we observe that our PIM is not restricted to the DLN architecture used, since similar improvements (e.g., 52.80% vs. 42.00% under ±90°) can be achieved with our joint face frontalization and discriminative representation learning framework. The pixel loss, the dual-path generator and the "learning to learn" strategy using the siamese discriminator with dynamic convolution in the FFN contribute the most to improving the face recognition performance, especially for large pose cases. Although less apparent, the cross-domain adversarial training and the symmetry loss also help improve the recognition performance. Cross-domain adversarial training is crucial for enhancing the generalization capacity of PIM on Multi-PIE [12] as well as other benchmark datasets. Fig. 3 illustrates the perceptual performance of these variants. As expected, the inference results without the pixel loss, the local-path generator or the "learning to learn" strategy using the siamese discriminator with dynamic convolution deviate seriously from the true appearance. The synthesis without cross-domain adversarial training tends to present inferior generalizability, while that without the symmetry loss sometimes shows factitious asymmetrical effects.

4.1.2 Intermediate Results Visualization

Most previous works on face frontalization and pose invariant representation learning are dedicated to addressing problems within a pose range of ±60°, since it is commonly believed that with a pose larger than 60° it is difficult for a model to generate faithful frontal images or learn discriminative yet generalizable facial representations. However, with enough training data and proper architecture and objective function design of the proposed PIM, it is in fact feasible to recover high-fidelity and identity-preserving frontal faces under very large poses and to learn pose invariant representations for face recognition in the wild.

The intermediate results of recovered frontal-view face images and learned facial representations are visualized in Fig. 1. We observe that the frontalized faces present compelling perceptual quality across poses larger than 60°, and the learned representations are discriminative and pose invariant.

[Figure 3 appears here; rows include the Profile input, PIM, the ablated variants (e.g., w/o dyn conv, w/o L_domain, w/o L_pixel, w/o L_sym), and the ground truth (GT).] Figure 3: Component analysis. Synthesized results of PIM and its variants.

4.1.3 Face Recognition Comparison

Table 2: Rank-1 recognition rates (%) across views, minor expressions and illuminations under Multi-PIE [12] Setting-1. "-" means the result is not reported. b1 and b2 denote ResNet-50 [15] and Light CNN-29 [35], respectively. PIM1 and PIM2 use ResNet-50 [15] and Light CNN-29 [35] as backbone architectures, respectively.

Method          ±90°   ±75°   ±60°   ±45°   ±30°   ±15°
b1              18.80  63.80  92.20  98.30  99.20  99.40
b2              33.00  76.10  95.20  97.90  99.20  99.80
CPF [37]        -      -      -      71.65  81.05  89.45
Hassner [14]    -      -      44.81  74.68  89.59  96.78
FV [26]         24.53  45.51  68.71  80.33  87.21  93.30
HPN [9]         29.82  47.57  61.24  72.77  78.26  84.23
FIP 40 [39]     31.37  49.10  69.75  85.54  92.98  96.30
c-CNN [36]      47.26  60.66  74.38  89.02  94.05  96.97
TP-GAN [17]     64.03  84.10  92.93  98.58  99.85  99.78
PIM1            71.60  92.50  97.00  98.60  99.30  99.40
PIM2            75.00  91.20  97.70  98.30  99.40  99.80

Tab. 2 shows the face recognition performance comparison of our PIM with the two baselines and other state-of-the-arts under Setting-1. Regardless of the adopted DLN architecture, PIM consistently achieves the best performance across all poses (except for being comparable with TP-GAN [17] under ±30°), especially for large yaw angles. In particular, PIM (Light CNN-29 [35]) outperforms TP-GAN [17] and c-CNN Forest [36] by 10.97% and 27.74% under ±90°, respectively. Note that TP-GAN [17] adopts Light CNN-29 [35] as the feature extractor, which has the same architecture as our DLN, and c-CNN Forest [36] is an ensemble of three models, while our PIM has a more effective and efficient joint training scheme and a much simpler network architecture.

Table 3: Rank-1 recognition rates (%) across views, illuminations and sessions under Multi-PIE [12] Setting-2. "-" means the result is not reported. b1 and b2 denote ResNet-50 [15] and Light CNN-29 [35], respectively. PIM1 and PIM2 use ResNet-50 [15] and Light CNN-29 [35] as backbone architectures, respectively.

Method          ±90°   ±75°   ±60°   ±45°   ±30°   ±15°
b1              15.50  55.10  85.90  97.10  98.40  98.60
b2              27.10  68.70  91.40  97.70  98.60  99.10
FIP [39]        -      -      45.90  64.10  80.70  90.70
MVP [40]        -      -      60.10  72.90  83.70  92.80
CPF [37]        -      -      61.90  79.90  88.50  95.00
DR-GAN [32]     -      -      83.20  86.20  90.10  94.00
TP-GAN [17]     64.64  77.43  87.72  95.38  98.06  98.68
PIM1            81.30  92.70  96.60  97.30  98.40  98.80
PIM2            86.50  95.00  98.10  98.50  99.00  99.30

Tab. 3 shows the face recognition comparison of our PIM with the two baselines and other state-of-the-arts under Setting-2. Similar to the observation under Setting-1, PIM consistently achieves the best performance across all poses. In particular, PIM (Light CNN-29 [35]) outperforms TP-GAN [17] by 21.86% under ±90°, and outperforms TP-GAN [17] and DR-GAN [32] by 10.38% and 14.90% under ±60°, respectively. This well verifies the superiority of our proposed cross-domain adversarial training, the "learning to learn" strategy using the siamese discriminator with dynamic convolution, and the enforced cross-entropy optimization strategy in improving the overall recognition performance.

4.2. Evaluations on the CFP Benchmark

The CFP [24] dataset aims to evaluate the strength of face verification approaches across pose, more specifically, between frontal views (yaw angle < 10°) and profile views (yaw angle > 60°). CFP contains 7,000 images of 500 subjects, where each subject has 10 frontal and 4 profile face images. The data are randomly organized into 10 splits, each containing an equal number of frontal-frontal and frontal-profile pairs, with 350 genuine and 350 impostor pairs, respectively. Evaluation systems report the mean and standard deviation of accuracy, Equal Error Rate (EER) and Area Under Curve (AUC) over the 10 splits for both the frontal-frontal and frontal-profile face verification settings.
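Given this protocol, the per-split accuracy, EER and AUC can be computed from cosine similarities between the learned representations of each pair. The sketch below uses NumPy and scikit-learn (a tooling choice of this rewrite, not the authors'), and picks the accuracy threshold on the same split for simplicity, which may differ from the official evaluation scripts.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def cosine_scores(feats_a, feats_b):
    """Cosine similarity between paired representations (one row per pair)."""
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    return np.sum(a * b, axis=1)

def verification_metrics(scores, labels):
    """Accuracy at the best threshold, EER and AUC for one CFP split."""
    fpr, tpr, thresholds = roc_curve(labels, scores)
    eer_idx = np.argmin(np.abs(fpr - (1.0 - tpr)))   # point where FPR ~= FNR
    eer = (fpr[eer_idx] + (1.0 - tpr[eer_idx])) / 2.0
    acc = max(np.mean((scores >= t) == labels.astype(bool)) for t in thresholds)
    return acc, eer, roc_auc_score(labels, scores)
```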
Table 4: Face recognition performance (%) comparison on CFP [24]. The results are averaged over the 10 testing splits.

                           Frontal-Profile                            Frontal-Frontal
Method                     Acc          EER          AUC              Acc          EER          AUC
FV+DML [24]                58.47±3.51   38.54±1.59   65.74±2.02       91.18±1.34   8.62±1.19    97.25±0.60
LBP+Sub-SML [24]           70.02±2.14   29.60±2.11   77.98±1.86       83.54±2.40   16.00±1.74   91.70±1.55
HoG+Sub-SML [24]           77.31±1.61   22.20±1.18   85.97±1.03       88.34±1.33   11.45±1.35   94.83±0.80
FV+Sub-SML [24]            80.63±2.12   19.28±1.60   88.53±1.58       91.30±0.85   8.85±0.74    96.87±0.39
Deep Features [24]         84.91±1.82   14.97±1.98   93.00±1.55       96.40±0.69   3.48±0.67    99.43±0.31
Triplet Embedding [22]     89.17±2.35   8.85±0.99    97.00±0.53       96.93±0.61   2.51±0.81    99.68±0.16
Chen et al. [5]            91.97±1.70   8.00±1.68    97.70±0.82       98.41±0.45   1.54±0.43    99.89±0.06
Light CNN-29 [35]          92.47±1.44   8.71±1.80    97.77±0.76       99.64±0.32   0.57±0.40    99.92±0.15
PIM (Light CNN-29 [35])    93.10±1.01   7.69±1.29    97.65±0.62       99.44±0.36   0.86±0.49    99.92±0.10
Human                      94.57±1.10   5.02±1.07    98.92±0.46       96.24±0.67   5.34±1.79    98.19±1.13

Tab. 4 compares the face recognition performance of our PIM (Light CNN-29 [35]) with other state-of-the-arts on the CFP [24] benchmark dataset. The results on the original images serve as our baseline. PIM achieves performance comparable to humans under the frontal-profile setting and outperforms human performance under the frontal-frontal setting. In particular, for frontal-frontal cases, PIM gives stably similar saturated performance to b (Light CNN-29 [35]), both of which reduce the EER of human performance by around 5.00%.
For the more challenging frontal-profile cases, PIM consistently outperforms the baseline and the other state-of-the-arts. In particular, PIM reduces the EER by 1.02% compared with b (Light CNN-29 [35]) and improves the accuracy by 1.13% over the 2nd-best. This shows that the facial representations learned by PIM are discriminative and robust even under extreme pose variations.

4.3. Evaluations on the LFW Benchmark

LFW [16] contains 13,233 face images of 5,749 identities. The images were obtained by trawling the Internet, followed by face centering, scaling and cropping based on bounding boxes provided by an automatic face locator. The LFW data have large in-the-wild variabilities, e.g., in-plane rotations, non-frontal poses, low resolution, non-frontal illumination, varying expressions and imperfect localization.

As a demonstration of our model's superior generalizability to in-the-wild face images, we qualitatively compare the intermediate face frontalization results of our PIM (Light CNN-29 [35]) with TP-GAN [17], DR-GAN [32], and the approach from Hassner et al. [14], which are the state-of-the-arts aiming to generate photorealistic and identity preserving frontal views from profiles. As shown in Fig. 4, the predictions of TP-GAN [17] suffer from severe texture loss and artifacts, and the predictions of DR-GAN [32] and the method by Hassner et al. [14] deviate seriously from the true appearance, for both near-frontal (the top two rows) and profile (the bottom three rows) cases. Comparatively, PIM can faithfully recover high-fidelity frontal-view face images with finer local details and global face shapes. This well verifies that the unsupervised cross-domain adversarial training can effectively advance generalizability and reduce over-fitting, and that the "learning to learn" strategy using a siamese discriminator with dynamic convolution contributes to the perceptually natural and photorealistic synthesized results. Moreover, the joint learning scheme of face frontalization and discriminative representation also helps, since the two sub-nets leverage each other during end-to-end training to achieve a final win-win outcome.

[Figure 4 appears here. Columns: LFW, PIM (Ours), TP-GAN, DR-GAN, Hassner et al.] Figure 4: Comparison of face frontalization on LFW [16].

5. Conclusion

We proposed a novel Pose Invariant Model (PIM) to address the challenge of face recognition with large pose variations. PIM unifies a Face Frontalization sub-Net (FFN) and a Discriminative Learning sub-Net (DLN) for pose invariant recognition in an end-to-end deep architecture. The FFN introduces unsupervised cross-domain adversarial training and a "learning to learn" strategy to provide high-fidelity frontal reference face images for effective face representation learning by the DLN. Comprehensive experiments demonstrate the superiority of PIM over the state-of-the-arts. We plan to apply PIM to other domain adaption and transfer learning applications in the future.

Acknowledgement

The work of Jian Zhao was partially supported by China Scholarship Council (CSC) grant 201503170248. The work of Junliang Xing was partially supported by the National Science Foundation of China 61672519. The work of Jiashi Feng was partially supported by NUS startup R-263-000-C08-133, MOE Tier-I R-263-000-C21-112, NUS IDS R-263-000-C67-646 and ECRA R-263-000-C87-133.

References

[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. Tensorflow: A system for large-scale machine learning. In OSDI, 2016.
[2] T. Ahonen, A. Hadid, and M. Pietikainen. Face description with local binary patterns: Application to face recognition. TPAMI, 28(12):2037–2041, 2006.
[3] L. Bertinetto, J. F. Henriques, J. Valmadre, P. Torr, and A. Vedaldi. Learning feed-forward one-shot learners. In NIPS, pages 523–531, 2016.
[4] D. Chen, X. Cao, F. Wen, and J. Sun. Blessing of dimensionality: High-dimensional feature and its efficient compression for face verification. In CVPR, pages 3025–3032, 2013.
[5] J.-C. Chen, J. Zheng, V. M. Patel, and R. Chellappa. Fisher vector encoded deep convolutional features for unconstrained face verification. In ICIP, pages 2981–2985, 2016.
[6] W. Chen, T.-Y. Liu, Y. Lan, Z.-M. Ma, and H. Li. Ranking measures and loss functions in learning to rank. In NIPS, pages 315–323, 2009.
[7] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, pages 886–893, 2005.
[8] J. G. Daugman. Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. JOSA A, 2(7):1160–1169, 1985.
[9] C. Ding and D. Tao. Pose-invariant face recognition with homography-based normalization. PR, 66:144–152, 2017.
[10] W. A. Freiwald and D. Y. Tsao. Functional compartmentalization and viewpoint generalization within the macaque face-processing system. Science, 330(6005):845–851, 2010.
[11] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. JMLR, 17(59):1–35, 2016.
[12] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. Multi-pie. JIVC, 28(5):807–813, 2010.
[13] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao. Ms-celeb-1m: A dataset and benchmark for large-scale face recognition. In ECCV, pages 87–102. Springer, 2016.
[14] T. Hassner, S. Harel, E. Paz, and R. Enbar. Effective face frontalization in unconstrained images. In CVPR, pages 4295–4304, 2015.
[15] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
[16] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, 2007.
[17] R. Huang, S. Zhang, T. Li, and R. He. Beyond face rotation: Global and local perception gan for photorealistic and identity preserving frontal view synthesis. arXiv preprint arXiv:1704.04086, 2017.
[18] M. Kan, S. Shan, H. Chang, and X. Chen. Stacked progressive auto-encoders (spae) for face recognition across poses. In CVPR, pages 1883–1890, 2014.
[19] S. Ohayon, W. A. Freiwald, and D. Y. Tsao. What makes a cell face selective? the importance of contrast. JN, 74(3):567–581, 2012.
[20] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In BMVC, 2015.
[21] C. Sagonas, Y. Panagakis, S. Zafeiriou, and M. Pantic. Robust statistical face frontalization. In ICCV, pages 3871–3879, 2015.
[22] S. Sankaranarayanan, A. Alavi, C. D. Castillo, and R. Chellappa. Triplet probabilistic embedding for face verification and clustering. In BTAS, pages 1–8, 2016.
[23] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In CVPR, pages 815–823, 2015.
[24] S. Sengupta, J.-C. Chen, C. Castillo, V. M. Patel, R. Chellappa, and D. W. Jacobs. Frontal to profile face verification in the wild. In WACV, pages 1–9, 2016.
[25] T. Simon, H. Joo, I. Matthews, and Y. Sheikh. Hand keypoint detection in single images using multiview bootstrapping. In CVPR, 2017.
[26] K. Simonyan, O. M. Parkhi, A. Vedaldi, and A. Zisserman. Fisher vector faces in the wild. In BMVC, 2013.
[27] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[28] Y. Sun, D. Liang, X. Wang, and X. Tang. Deepid3: Face recognition with very deep neural networks. arXiv preprint arXiv:1502.00873, 2015.
[29] Y. Sun, X. Wang, and X. Tang. Deeply learned face representations are sparse, selective, and robust. In CVPR, pages 2892–2900, 2015.
[30] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Deepface: Closing the gap to human-level performance in face verification. In CVPR, pages 1701–1708, 2014.
[31] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Web-scale training for face identification. In CVPR, pages 2746–2754, 2015.
[32] L. Tran, X. Yin, and X. Liu. Disentangled representation learning gan for pose-invariant face recognition. In CVPR, 2017.
[33] K. Q. Weinberger and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. JMLR, 10(Feb):207–244, 2009.
[34] Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative feature learning approach for deep face recognition. In ECCV, pages 499–515, 2016.
[35] X. Wu, R. He, Z. Sun, and T. Tan. A light cnn for deep face representation with noisy labels. arXiv preprint arXiv:1511.02683, 2015.
[36] C. Xiong, X. Zhao, D. Tang, K. Jayashree, S. Yan, and T.-K. Kim. Conditional convolutional neural network for modality-aware face recognition. In ICCV, pages 3667–3675, 2015.
[37] J. Yim, H. Jung, B. Yoo, C. Choi, D. Park, and J. Kim. Rotating your face using multi-task deep neural network. In CVPR, pages 676–684, 2015.
[38] X. Zhu, Z. Lei, J. Yan, D. Yi, and S. Z. Li. High-fidelity pose and expression normalization for face recognition in the wild. In CVPR, pages 787–796, 2015.
[39] Z. Zhu, P. Luo, X. Wang, and X. Tang. Deep learning identity-preserving face space. In ICCV, pages 113–120, 2013.
[40] Z. Zhu, P. Luo, X. Wang, and X. Tang. Multi-view perceptron: a deep model for learning face identity and view representations. In NIPS, pages 217–225, 2014.
