3D Face Reconstruction With Dense Landmarks
Microsoft
Abstract
Landmarks often play a key role in face analysis, but many aspects of identity or expression cannot be represented by sparse landmarks alone. Thus, in order to reconstruct faces more accurately, landmarks are often combined with additional signals like depth images or techniques like differentiable rendering. Can we keep things simple by just using more landmarks? In answer, we present the first method that accurately predicts 10× as many landmarks as usual, covering the whole head, including the eyes and teeth. This is accomplished using synthetic training data, which guarantees perfect landmark annotations. By fitting a morphable model to these dense landmarks, we achieve state-of-the-art results for monocular 3D face reconstruction in the wild. We show that dense landmarks are an ideal signal for integrating face shape information across frames by demonstrating accurate and expressive facial performance capture in both monocular and multi-view scenarios. This approach is also highly efficient: we can predict dense landmarks and fit our 3D face model at over 150 FPS on a single CPU thread.
1 Introduction
Landmarks are points in correspondence across all faces, like the tip of the nose or the corner of the eye. They often play a role in face-related computer vision, e.g., being used to extract facial regions of interest [35], or helping to constrain 3D model fitting [27, 80].
Unfortunately, many aspects of facial identity or expression cannot be encoded by a typical
sparse set of 68 landmarks alone. For example, without landmarks on the cheeks, we cannot
tell whether or not someone has high cheek-bones. Likewise, without landmarks around the
outer eye region, we cannot tell if someone is softly closing their eyes, or scrunching up
their face.
In order to reconstruct faces more accurately, previous work has therefore used additional signals beyond color images, such as depth images [65] or optical flow [14].
However, these signals may not be available or reliable to compute. Instead, given color images alone, others have approached the problem using analysis-by-synthesis: minimizing a photometric error [27] between a generative 3D face model and an observed image using differentiable rendering [19, 28]. Unfortunately, these approaches
Figure 1: Given a single image (top), we first robustly and accurately predict 703 landmarks (middle).
To aid visualization, we draw lines between landmarks. We then fit our 3D morphable face model to
these landmarks to reconstruct faces in 3D (bottom).
are limited by the approximations that must be made in order for differentiable rendering to be computationally feasible. In reality, faces are not purely Lambertian [24], and many important illumination effects are not explained using spherical harmonics alone [19], e.g., ambient occlusion or shadows cast by the nose.
Faced with this complexity, wouldn't it be great if we could just use more landmarks? We present the first method that predicts over 700 landmarks both accurately and robustly.
Instead of only the frontal "hockey-mask" portion of the face, our landmarks cover the entire head, including the ears, eyeballs, and teeth. As shown in Figure 1, these landmarks provide a rich signal for both facial identity and expression. Even with as few as 68, it is hard for humans to precisely annotate landmarks that are not aligned with a salient image feature. That is why we use synthetic training data, which guarantees consistent annotations. Furthermore, instead of representing each landmark as just a 2D coordinate, we predict each one as a random variable: a 2D circular Gaussian with position and uncertainty [39]. This allows our predictor to express uncertainty about certain landmarks, e.g., occluded landmarks on the back of the head.
Since our dense landmarks represent points of correspondence across all faces, we can perform 3D face reconstruction by fitting a morphable face model [7] to them. Although previous approaches have fit models to landmarks in a similar way [78], we are the first to show that landmarks are the only signal required to achieve state-of-the-art results for monocular face reconstruction in the wild.
The probabilistic nature of our predictions also makes them ideal for fitting a 3D model over a temporal sequence, or across multiple views. An optimizer can discount uncertain landmarks and rely on more certain ones. We demonstrate this with accurate and expressive results for both multi-view and monocular facial performance capture. Finally, we show that predicting dense landmarks and then fitting a model can be highly efficient by demonstrating real-time facial performance capture at over 150 FPS on a single CPU thread.
Figure 2: Compared to a typical sparse set of 68 facial landmarks (a), our dense landmarks
(b) cover the entire head in great detail, including ears, eyes, and teeth. These dense landmarks are
better at encoding facial identity and subtle expressions.
[Figure: Method overview. 1) Dense landmark prediction: a CNN predicts each landmark in L as a 2D position μ and uncertainty σ. 2) 3D model fitting: face model parameters Φ are optimized over iterations to minimize E(L, Φ).]
2 Related work

Optimization-based 3D face reconstruction. Traditionally, high-quality markerless reconstruction of face geometry is achieved with multi-view stereo [5, 57], followed by optical flow based alignment, and then optimisation using geometric and temporal priors [6, 10, 51]. While such methods produce detailed results, each step takes hours to complete. They also suffer from drift and other issues due to their reliance on optical flow and multi-view stereo [16]. Finally, manual supervision is required, e.g., to establish an initial alignment.
If only a single image is available, dense photometric [19, 66], depth [65], or optical flow [14] constraints are commonly used to recover face shape and motion. However, these methods still rely on sparse landmarks for initializing the optimization close to the dense constraint's basin of convergence, and coping with fast head motion [80]. In contrast, we argue that dense landmarks alone are sufficient for precise fitting.
Dense landmark prediction. While sparse landmark prediction is a mainstay of the field [13], few methods directly predict dense landmarks or correspondences. This is because annotating a face with dense landmarks is a highly ambiguous task, so either synthetic data [72], pseudo-labels made with model-fitting [17, 26, 79], or semi-automatic refinement of training data [37, 38] are used. Another issue with predicting dense landmarks is that heatmaps, the de facto technique for predicting landmarks [12, 13], rise in computational complexity with the number of landmarks. While a few previous methods have predicted dense frontal-face landmarks via cascade regression [37] or direct regression [17, 30, 38], we are the first to accurately and robustly predict over 700 landmarks covering the whole head, including eyes and teeth.
Some methods choose to predict dense correspondences as an image instead, where each pixel corresponds to a fixed point in a UV-unwrapping of the face [2, 26] or body [31, 61]. Such a parameterization suffers from several drawbacks. How does one handle self-occluded portions of the face, e.g., the back of the head? Furthermore, what occurs at UV-island boundaries? If a pixel is half-nose and half-cheek, to which does it correspond? Instead, we choose to discretize the face into dense landmarks. This lets us predict parts of the face that are self-occluded, or lie outside image bounds. Having a fixed set of correspondences also benefits the model-fitter, making it more amenable to running in real time.
Figure 4: Examples of our synthetic training data. Without the perfectly consistent annotations provided by synthetic data, dense landmark prediction would not be possible.
3 Method
In recent years, methods for 3D face reconstruction have become more and more complicated, involving differentiable rendering and complex neural network training strategies. We show instead that success can be found by keeping things simple. Our approach consists of two stages: First, we predict probabilistic dense 2D landmarks L using a traditional convolutional neural network (CNN). Then, we fit a 3D face model, parameterized by Φ, to the 2D landmarks by minimizing an energy function E(Φ; L). Images themselves are not part of this optimization; the only data used are 2D landmarks.
The main difference between our work and previous approaches is the number and quality of landmarks. No one before has predicted so many 2D landmarks, and so accurately. This lets us achieve high quality 3D face reconstruction results by fitting a 3D model to these landmarks alone.
3.1 Landmark prediction
Synthetic training data. Our results are only possible because we use synthetic training data. While a human can consistently label a face image with, e.g., 68 landmarks, it would be almost impossible for them to annotate an image with dense landmarks. How would it be possible to consistently annotate occluded landmarks on the back of the head, or multiple landmarks over a largely featureless patch of skin, e.g., the forehead? In previous work, pseudo-labelled real images with dense correspondences are obtained by fitting a 3DMM to images [2], but the resulting label consistency heavily depends on the quality of the 3D fitting. Using synthetic data has the advantage of guaranteeing perfectly consistent labels. We rendered a training dataset of 100k images using the method of Wood et al. [72] with some minor modifications: we include expression-dependent wrinkle texture maps [1] for more realistic skin appearance, and additional clothing, accessory, and hair assets. See Figure 4 for some examples.
Figure 5: When parts of the face are occluded by e.g. hair or clothing, the corresponding landmarks are
predicted with high uncertainty (red), compared to those visible (green).
Figure 6: We implemented two versions of our approach: one for processing multi-view
recordings offline (a), and one for real-time facial performance capture (b).
3.2 3D model fitting

We fit our 3D face model to the dense landmarks by minimizing an energy E(Φ; L) that is a sum of several terms, defined below. E_landmarks is the only term that encourages the 3D model to explain the observed 2D landmarks. The other terms use prior knowledge to regularize the fit.
Part of the beauty of our approach is how naturally it scales to multiple images and cameras. In this section we present the general form of our method, suitable for F frames over C cameras, i.e., multi-view performance capture.
3D face model. We use the face model described in [72], comprising N = 7,667 vertices and K = 4 skeletal joints (the head, neck, and two eyes). Vertex positions are determined by the mesh-generating function M(β, ψ, θ) : R^(|β|+|ψ|+|θ|) → R^(3N), which takes parameters β ∈ R^|β| for identity, ψ ∈ R^|ψ| for expression, and θ ∈ R^(3K+3) for skeletal pose (including root joint translation).

M(β, ψ, θ) = L(T(β, ψ), θ, J(β); W)

where L(V, θ, J; W) is a standard linear blend skinning (LBS) function [41] that rotates vertex positions V ∈ R^(3N) about joint locations J ∈ R^(3K) by local joint rotations in θ, with per-vertex weights W ∈ R^(K×N). The face mesh and joint locations in the bind pose are determined by T(β, ψ) : R^(|β|+|ψ|) → R^(3N) and J(β) : R^|β| → R^(3K) respectively. See Wood et al. [72] for more details.
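To make the pieces of this function concrete, the sketch below shows how M(β, ψ, θ) = L(T(β, ψ), θ, J(β); W) could be assembled under simplifying assumptions: linear identity and expression bases for T(β, ψ), a linear joint regressor for J(β), and a pose already split into per-joint rotation matrices plus a root translation. The `model` container and its field names are hypothetical stand-ins; the actual bases belong to the face model of Wood et al. [72].

```python
import numpy as np

def mesh(beta, psi, joint_rots, root_t, model):
    """Minimal sketch of M(beta, psi, theta) = L(T(beta, psi), theta, J(beta); W).

    joint_rots: (K, 3, 3) per-joint rotation matrices; root_t: (3,) root translation.
    A full implementation would convert pose parameters and walk the kinematic tree.
    """
    # Bind-pose mesh T(beta, psi): template plus linear identity/expression offsets.
    verts = (model["template"]                                          # (N, 3)
             + np.tensordot(beta, model["identity_basis"], axes=1)      # (|beta|, N, 3)
             + np.tensordot(psi, model["expression_basis"], axes=1))    # (|psi|, N, 3)
    # Bind-pose joint locations J(beta).
    joints = model["joint_regressor"] @ verts                           # (K, 3)

    # Linear blend skinning: each vertex is a weight-blended rigid transform
    # of the bind-pose vertex about the joints.
    skinned = np.zeros_like(verts)
    for k in range(joints.shape[0]):
        rotated = (verts - joints[k]) @ joint_rots[k].T + joints[k]
        skinned += model["weights"][k][:, None] * rotated               # W is (K, N)
    return skinned + root_t
```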
Cameras are described by a world-to-camera rigid transform X = [R T] ∈ R^(3×4), comprising rotation R and translation T, and a pinhole camera projection matrix Π ∈ R^(3×3). Thus, the image-space projection of the jth landmark in the ith camera is x_ij = Π_i X_i M_j. In the monocular case, X can be ignored.
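As a minimal illustration of x_ij = Π_i X_i M_j, the snippet below projects 3D model landmarks into one camera. It assumes points are given in world space and that Π holds the pinhole intrinsics.

```python
import numpy as np

def project(points_world, R, T, K):
    """Pinhole projection of 3D points into one camera.

    R (3x3) and T (3,) form the world-to-camera transform X = [R | T];
    K is the 3x3 pinhole projection matrix Pi.
    """
    cam = points_world @ R.T + T          # apply X
    uv = cam @ K.T                        # apply Pi
    return uv[:, :2] / uv[:, 2:3]         # perspective divide to image space
```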
Parameters Φ are optimized to minimize E. The main parameters of interest control the face, but we also optimize camera parameters if they are unknown:

Φ = { β, Ψ ∈ R^(F×|ψ|), Θ ∈ R^(F×|θ|)  (face);   R ∈ R^(C×3), T ∈ R^(C×3), f ∈ R^C  (cameras) }
Figure 7: We encourage the optimizer to avoid face mesh self-intersections by penalizing skin vertices that enter the convex hulls of the eyeballs or teeth parts; E_intersect encourages these skin vertices to remain outside these convex shapes (shown with and without E_intersect).
Facial identity β is shared over a sequence of F frames, but expression Ψ and pose Θ vary per frame. For each of our C cameras we have six degrees of freedom for rotation R and translation T, and a single focal length parameter f. In the monocular case, we only optimize focal length.
E_landmarks encourages the 3D model to explain the predicted 2D landmarks:

E_landmarks = Σ_{i,j,k}^{F,C,|L|} ‖x_ijk − μ_ijk‖² / (2σ_ijk²)    (2)

where, for the kth landmark seen by the jth camera in the ith frame, [μ_ijk, σ_ijk] is the 2D location and uncertainty predicted by our dense landmark CNN, and x_ijk = Π_j X_j M(β, ψ_i, θ_i)_k is the 2D projection of that landmark on our 3D model. The similarity of Equation 2 to Loss_μ in Equation 1 is no accident: treating landmarks as 2D random variables during both prediction and model-fitting allows our approach to elegantly handle uncertainty, taking advantage of landmarks the CNN is confident in, and discounting those it is uncertain about.
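The sketch below evaluates this data term as written in Equation 2; the tensor shapes are assumptions chosen for illustration.

```python
import torch

def landmark_energy(proj_xy, mu, sigma):
    """Sketch of E_landmarks (Eq. 2): squared reprojection error, down-weighted
    by the CNN's predicted uncertainty sigma.

    proj_xy: (F, C, L, 2) projections x_ijk of the model's landmarks,
    mu:      (F, C, L, 2) predicted 2D landmark means,
    sigma:   (F, C, L)    predicted per-landmark uncertainties.
    """
    sq_err = ((proj_xy - mu) ** 2).sum(dim=-1)        # ||x - mu||^2
    return (sq_err / (2.0 * sigma ** 2)).sum()        # confident landmarks dominate
```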
E_identity penalizes unlikely face shape by maximizing the relative log-likelihood of shape parameters β under a multivariate Gaussian Mixture Model (GMM) of G components fit to a library of 3D head scans [72]: E_identity = −log(p(β)), where p(β) = Σ_{i=1}^{G} γ_i N(β | ν_i, Σ_i). ν_i and Σ_i are the mean and covariance matrix of the ith component, and γ_i is the weight of that component.
E_expression = ‖ψ‖² and E_joints = ‖θ_{i : i∈[2,K]}‖² encourage the optimizer to explain the data with as little expression and joint rotation as possible. We do not penalize global translation or rotation by ignoring the root joint θ_1.
E_temporal = Σ_{i=2,j,k}^{F,C,|L|} ‖x_{i,j,k} − x_{i−1,j,k}‖² reduces jitter by encouraging face mesh vertices x to remain still between neighboring frames i−1 and i.
E_intersect encourages the optimizer to find solutions without intersections between the skin and eyeballs or teeth (Figure 7). Please refer to the supplementary material for further details.
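For completeness, the remaining priors reduce to a few lines each; the sketch below writes them out under assumed tensor shapes. The full energy presumably sums all of the terms above, but the relative weighting is not reproduced in this text.

```python
import torch

def regularizers(psi, joint_params, landmark_seq):
    """Sketches of E_expression, E_joints and E_temporal under the notation above.

    psi: (|psi|,) expression parameters; joint_params: (K, 3) per-joint rotation
    parameters with row 0 the root joint; landmark_seq: (F, L, 2) projected
    landmarks per frame (assumed shapes for illustration).
    """
    e_expression = (psi ** 2).sum()                                  # ||psi||^2
    e_joints = (joint_params[1:] ** 2).sum()                         # ignore root joint theta_1
    e_temporal = ((landmark_seq[1:] - landmark_seq[:-1]) ** 2).sum() # jitter between frames
    return e_expression, e_joints, e_temporal
```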
3.3 Implementation
We implemented two versions of our system: one for processing multi-camera recordings offline, and one for real-time facial performance capture.

Our offline system produces the best quality results without constraints on compute. We predict 703 landmarks with a ResNet 101 [36]. To extract a facial Region-of-Interest (ROI) from an image we run a full-head probabilistic landmark CNN on multi-scale sliding windows, and select the window with the lowest uncertainty. When fitting our 3DMM, we use PyTorch [49] to minimize E(Φ) with L-BFGS [44], optimizing all parameters across all frames simultaneously.
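A minimal sketch of this offline fitting loop, assuming the energy E(Φ) is implemented as a differentiable PyTorch function of the parameter tensors; this mirrors the described setup rather than the actual released code.

```python
import torch

def fit_offline(energy_fn, params, num_iters=100):
    """Minimize E(Phi) with L-BFGS over all parameters jointly.

    energy_fn maps the list of parameter tensors to a scalar energy; params holds
    identity, per-frame expression/pose and camera tensors with requires_grad=True.
    """
    opt = torch.optim.LBFGS(params, max_iter=num_iters, line_search_fn="strong_wolfe")

    def closure():
        opt.zero_grad()
        energy = energy_fn(params)
        energy.backward()
        return energy

    opt.step(closure)   # L-BFGS re-evaluates the closure internally
    return params
```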
For our real-time system, we trained a lightweight dense landmark model with a MobileNet V2 architecture [55]. To compensate for a reduction in network capacity, we predict 320 landmarks rather than 703, and modify the ROI strategy: aligning the face so it appears upright with the eyes a fixed distance apart. This makes the CNN's job easier for frontal faces at the expense of profile ones.
Real-time model fitting. We use the Levenberg-Marquardt algorithm to optimize our model-fitting energy. Camera and identity parameters are only fit occasionally. For the majority of frames we fit pose and expression parameters only. We rewrite the energy E in terms of the vector of residuals, r, as E(Φ) = ‖r(Φ)‖² = Σ_i r_i(Φ)². Then at each iteration k of our optimization, we compute r(Φ_k) and the Jacobian, J(Φ_k) = ∂r(Φ)/∂Φ |_{Φ=Φ_k}, and use these to solve the symmetric, positive-semi-definite linear system (JᵀJ + λ diag(JᵀJ)) δ_k = −Jᵀr via Cholesky decomposition. We then apply the update rule Φ_{k+1} = Φ_k + δ_k. In practice we do not actually form the residual vector r nor the Jacobian matrix J.
Instead, for performance reasons, we directly compute the quantities JᵀJ and Jᵀr as we visit each term r_i(Φ_k) of the energy. Most of the computational cost is incurred in evaluating these products for the landmark data term, as expected. However, the Jacobian of landmark term residuals is not fully dense. Each individual landmark depends on its own subset of expression parameters, and is invariant to other expression parameters. We performed a static analysis of the sparsity of each landmark term with respect to parameters, ∂r_i/∂Φ_j, and we use this set of i, j indices to reduce the cost of our outer products from O(|Φ|²) to O(m_i²), where m_i is the sparsified dimensionality of ∂r_i/∂Φ. We further enhance the sparsity by ignoring any components of the Jacobian with an absolute value below a certain empirically-determined threshold.
By exploiting sparsity in this way, the landmark term residuals and their derivatives become very cheap to evaluate. This formulation avoids the correspondence problem usually seen with depth images [60], which requires a more expensive optimization. In addition, adding more landmarks does not significantly increase the cost of optimization. It therefore becomes possible to implement a very detailed and well-regularized fitter with a relatively small compute burden, simply by adding a sufficient number of landmarks. The cost of the Cholesky solve for the update δ_k is independent of the number of landmarks.
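The following sketch illustrates one such damped Gauss-Newton (Levenberg-Marquardt) step with per-residual accumulation of JᵀJ and Jᵀr, exploiting a precomputed sparsity pattern. Variable names and the iteration interface are illustrative assumptions, and the damping assumes JᵀJ has a positive diagonal.

```python
import numpy as np

def lm_step(residual_blocks, num_params, lm_lambda=1e-3):
    """One Levenberg-Marquardt update accumulated term by term.

    residual_blocks yields (r_i, jac_i, idx_i): the residual value, its Jacobian
    over only the parameters it depends on, and those parameter indices
    (the sparsity pattern from the static analysis).
    """
    JTJ = np.zeros((num_params, num_params))
    JTr = np.zeros(num_params)

    # Accumulate J^T J and J^T r directly, using only the parameters each
    # residual actually touches: O(m_i^2) instead of O(|Phi|^2) per term.
    for r_i, jac_i, idx_i in residual_blocks:
        JTJ[np.ix_(idx_i, idx_i)] += np.outer(jac_i, jac_i)
        JTr[idx_i] += jac_i * r_i

    # Damped normal equations (J^T J + lambda diag(J^T J)) delta = -J^T r,
    # solved via Cholesky factorization.
    A = JTJ + lm_lambda * np.diag(np.diag(JTJ))
    L = np.linalg.cholesky(A)
    delta = np.linalg.solve(L.T, np.linalg.solve(L, -JTr))
    return delta
```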
Method             Common NME   Challenging NME   Private FR10%
LAB [73]           2.98         5.19              0.83
AWING [70]         2.72         4.52              0.33
ODN [77]           3.56         6.67              -
3FabRec [11]       3.36         5.74              0.17
Wood et al. [72]   3.09         4.86              0.50
LUVLi [40]         2.76         5.16              -
ours (L2)          3.30         5.12              0.33
ours (GNLL)        3.03         4.80              0.17
Figure 8: Left: results on 300W dataset, lower is better. Note competitive performance of our model
(despite being evaluated across-dataset) and importance of GNLL loss. Right: sample predictions (top
row) with label-translated results (bottom row).
4 Evaluation
4.1 Landmark accuracy
We measure the accuracy of a ResNet 101 dense landmark model on the 300W [54] dataset. For benchmark purposes only, we employ label translation [72] to deal with systematic inconsistencies between our 703 predicted dense landmarks and the 68 sparse landmarks labelled as ground truth (see Figure 8). While previous work [72] used label translation to evaluate a synthetically-trained sparse landmark predictor, we use it to evaluate a dense landmark predictor.
We use the standard normalized mean error (NME) and failure rate (FR_10%) error metrics [54]. Our model's results in Figure 8 are competitive with the state of the art, despite being trained with synthetic data alone. Note that these results provide a conservative estimate of our method's accuracy, as the translation network may introduce error, especially for rarely seen expressions.
Ablation study. We measured the importance of predicting each landmark as a random variable rather than as a 2D coordinate. We trained two landmark prediction models, one with our proposed GNLL loss (Equation 1), and one with a simpler L2 loss on landmark coordinates only. Results in Figure 8 confirm that including uncertainty in landmark regression results in better accuracy.
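Equation 1 itself is not reproduced in this text, so the sketch below shows a standard Gaussian negative log-likelihood for 2D circular-Gaussian landmarks of the kind described: a position term mirroring Equation 2 plus a log-σ penalty that stops the network from being uncertain everywhere. Names and shapes are assumptions.

```python
import torch

def gnll_loss(pred_mu, pred_log_sigma, target_xy):
    """Standard isotropic 2D Gaussian NLL for probabilistic landmark regression.

    pred_mu: (B, L, 2) predicted positions, pred_log_sigma: (B, L) predicted
    log-uncertainties (predicting log sigma keeps sigma positive),
    target_xy: (B, L, 2) ground-truth landmark coordinates.
    """
    sigma = torch.exp(pred_log_sigma)                          # (B, L)
    sq_err = ((pred_mu - target_xy) ** 2).sum(dim=-1)          # (B, L)
    loss_mu = sq_err / (2.0 * sigma ** 2)                      # position term ("Loss_mu")
    loss_sigma = 2.0 * pred_log_sigma                          # uncertainty penalty
    return (loss_mu + loss_sigma).mean()
```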
Qualitative comparisons between our real-time dense landmark model (MobileNet V2) and MediaPipe Attention Mesh [30], a publicly available dense landmark method designed for mobile devices, are shown in Figure 9. Our method is more robust, perhaps due to the consistency and diversity of our synthetic training data. See the supplementary material for additional qualitative results, including landmark predictions on the Challenging subset of 300W.
4.2 3D face reconstruction
Quantitatively, we compare our offline approach with recent methods on two benchmarks: the NoW Challenge [56] and the MICC dataset [3].
Figure 9: We compare our real-time landmark CNN (MobileNet V2) with MediaPipe Attention Mesh [30], a publicly available method for dense landmark prediction. Our approach is more robust to challenging expressions and illumination.
The NoW Challenge [56] provides a standard evaluation protocol for measuring the accuracy and robustness of 3D face reconstruction in the wild. It consists of 2054 images of 100 subjects along with a 3D head scan for each subject which serves as ground truth. We undertake the challenge in two ways: single view, where we fit our face model to each image separately, and multi-view, where we fit a per-subject face model to all images of a particular subject. As shown in Figure 10, we achieve state of the art results.
The MICC dataset [3] consists of 3D face scans and videos of 53 subjects. The videos were recorded in three environments: a "cooperative" laboratory environment, an indoor environment, and an outdoor environment. We follow Deng et al. [18], and evaluate our method in two ways: single view, where we estimate one face shape per frame in a video, and average the resulting face meshes, and multi-view, where we fit a single face model to all frames in a video jointly. As shown in Table 1, we achieve state of the art results.
Note that many previous methods are incapable of aggregating face shape information across multiple views. The fact ours can benefit from multiple views highlights the flexibility of our hybrid model-fitting approach.
Ablation studies. We conducted an experiment to measure the importance of landmark quantity for 3D face reconstruction. We trained three landmark CNNs, predicting 703, 320, and 68 landmarks respectively, and used these on the NoW Challenge (validation set). As shown in Figure 11, fitting with more landmarks results in more accurate 3D face reconstruction.
Figure 10: Results for the NoW Challenge [56], shown as cumulative error curves (percentage vs. error in mm) for single-view and multi-view reconstruction. We outperform the state of the art on both single- and multi-view 3D face reconstruction.

Method              Median   Mean   Std   (error in mm)
Single view
Deng et al. [18]    1.11     1.41   1.21
RingNet [56]        1.21     1.53   1.31
3DFAv2 [32]         1.23     1.57   1.39
DECA [25]           1.09     1.38   1.18
Dib et al. [20]     1.26     1.57   1.31
ours                1.02     1.28   1.08
Multi-view
Bai et al. [4]      1.08     1.35   1.15
ours                0.81     1.01   0.84

Table 1: Results on the MICC dataset [3], following the single and multi-frame evaluation protocol of Deng et al. [18]. We achieve state-of-the-art results.
We also measured the importance of the predicted uncertainty σ in model fitting. We fit our model to 703 landmark predictions on the NoW validation set, but using fixed rather than predicted σ. Figure 11 (bottom row of table) shows that fitting without σ leads to worse results.
Qualitative comparisons between our work and several publicly available methods [18, 25, 32, 56, 59] can be found in Figure 13.
Multi-view facial performance capture. Good synthetic training data requires a database of expression parameters from which to sample. We acquired such a database by conducting markerless facial performance capture for 108 subjects. We recorded each subject in our 17-camera capture studio, and processed each recording with our offline multi-view model fitter. For a 520 frame sequence it takes 3 minutes to predict dense landmarks for all images, and a further 9 minutes to optimize face model parameters. See Figure 12 for some of the 125,000 frames of expression data captured with our system. As the system which is used to create the database is then subsequently re-trained with it, we produced several databases in this manner until no further improvement in accuracy was seen. While previous work may be capable of tracking a detailed face mesh across each recording [6, 10, 51], it often requires manual intervention in the form of initial alignment, or clean-up in difficult regions such
Number of landmarks   Median   Mean   Std   (error in mm)
68                    1.10     1.38   1.16
320                   1.00     1.24   1.02
703                   0.95     1.17   0.97
703 (without σ)       1.02     1.26   1.03
Figure 11: Ablation studies on the NoW [56] validation set confirm that denser is better: model fitting
with more landmarks leads to more accurate results. In addition, we see that fitting without using σ
leads to worse results.
Figure 12: We demonstrate the robustness and reliability of our method by using it to collect a
massive database of 125,000 facial expressions, fully automatically.
as the eyes or corners of the mouth [16]. Further processing would also be required to convert the results into face model expression parameters.
Real-time monocular performance capture. See the last two columns of Figure 13 for a comparison between our offline and real-time systems for monocular 3D model-fitting. While our offline system produces the best possible results by using a large CNN and optimizing over all frames simultaneously, our real-time system can still produce accurate and expressive results fitting frame-to-frame. Please refer to the supplementary material for a video. Running on a single CPU thread (i5-11600K), our real-time system spends 6.5 ms processing a frame (150 FPS), of which 4.1 ms is spent predicting dense landmarks and 2.3 ms is spent fitting our face model.
5 Limitations and future work
Our method depends entirely on accurate landmarks. As shown in Figure 14, if landmarks are poorly predicted, the resulting model fit suffers. We plan to address this by improving our synthetic training data. Additionally, since our model does not include tongue articulation we cannot recover tongue movement.
Heatmaps have dominated the landmark prediction literature for some time [12, 13]. We were pleasantly surprised to find that directly regressing 2D landmark coordinates with unspecialized architectures works well and eliminates the need for computationally-costly heatmap generation. In addition, we were surprised that predicting σ helps accuracy. We look forward to further investigating direct probabilistic
RingNet [56], Deng et al. [18], 3DFAv2 [32], MGCNet [59], DECA [25], ours (offline), ours (real-time)
Figure 13: Compared to previous recent monocular 3D face reconstruction methods, ours better
captures gaze, expressions like winks and sneers, and subtleties of facial identity. In addition, our
method can run in real time with only a minor loss of fidelity.
Figure 14: Limitations remain. Poorly localized landmarks result in bad fits, and we are incapable of
tracking the tongue.
Supplementary material
E_intersect. Since our 3D face model contains separate parts for the teeth and eyeballs, intersections can occur during model-fitting. Though they are uncommon, we encourage the optimizer to avoid face mesh self-intersections by penalizing skin vertices that enter the convex hulls of the eyeballs or teeth parts.

E_intersect = E_eyeballs + E_teeth
We attach a sphere of fixed radius to each eyeball center. E_eyeballs penalizes each eyelid skin vertex that falls inside its corresponding eyeball sphere by the squared distance between that vertex and the sphere's exterior surface. Since this is trivial to implement, we omit the details.
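A minimal sketch of such a penalty, assuming eyelid vertex positions, an eyeball center, and a fixed radius as inputs:

```python
import torch

def eyeball_energy(eyelid_verts, eye_center, eye_radius):
    """Sketch of E_eyeballs: penalize eyelid skin vertices inside the eyeball sphere.

    The penalty is the squared distance from an offending vertex to the sphere's
    surface; vertices outside the sphere contribute nothing.
    """
    dist = torch.linalg.norm(eyelid_verts - eye_center, dim=-1)   # distance to center
    penetration = torch.clamp(eye_radius - dist, min=0.0)         # > 0 only inside the sphere
    return (penetration ** 2).sum()
```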
Unfortunately, the teeth cannot be well-represented with a simple primitive like a sphere, so E_teeth is more complicated. Instead, we represent the upper and lower teeth parts each with a convex hull of J planes defined by normal vector n̂_j and distance to origin p_j. Let's say I represents a set of lip vertices we wish to keep outside one of these convex hulls:

E_teeth = Σ_{i∈I} D_i²

where D_i measures the distance the ith skin vertex is inside the convex hull, and d_{i,j} measures the internal distance between the ith skin vertex x_i and the jth plane.
Figure 15: To extract a distortion-free Region-of-Interest (ROI) of the head we run a full-head sparse
landmark CNN (a) on multi-scale sliding windows across the full image (b), take the landmarks from
the most confident window (c), inscribe an expanded square around them, and use it as our ROI (d)
for our dense landmark CNN.
Figure 16: When running in real time, we extract ROIs using an affine transform. The eyes
and mouth, shown as green points in (a) are remapped every frame to a fixed triangle in ROI space (b).
The resulting rotated ROI rectangle is shown in red in (a).
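One way to realize this remapping is to solve a 2×3 affine transform from the three detected points to the fixed triangle; the sketch below does this with a small linear solve. The target triangle coordinates are a design choice and are not specified in the text.

```python
import numpy as np

def roi_affine(eye_left, eye_right, mouth, target_triangle):
    """Solve the affine transform mapping detected eye and mouth points (2,)
    to a fixed triangle (3, 2) in ROI space, as described for Figure 16."""
    src = np.stack([eye_left, eye_right, mouth])        # (3, 2) image points
    src_h = np.hstack([src, np.ones((3, 1))])           # homogeneous (3, 3)
    # Solve src_h @ A.T = target_triangle for the 2x3 affine matrix A.
    A = np.linalg.solve(src_h, target_triangle).T
    return A   # apply as: roi_xy = A @ [x, y, 1]
```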
[Figure 17 plot: cumulative error (percentage vs. error in mm) on the NoW validation set for offline single-view (703 ldmks.), offline multi-view (703 ldmks.), real-time single-view (320 ldmks.), and real-time multi-view (320 ldmks.) variants. Single view, ours (offline): median 0.95 mm, mean 1.17 mm, std 0.97 mm.]
Figure 17: A comparison between the offline and real-time versions of our systems on the NoW
Challenge [56] validation set. Our real-time system achieves similar quality results to our offline
system.
Figure 18: Dense landmark predictions for all of the images in the 300W Challenging dataset. Note the accuracy of our landmark model in challenging scenarios including extreme expressions, occlusion, pose variation, lighting variation, and poor image quality.
RingNet [56] Deng et al. [18] 3DFAv2 [32] MGCNet [59] DECA [25] Ours
Figure 19: Further qualitative comparisons between our approach and publicly available recent previous
methods for monocular 3D face reconstruction.
References
[1]Alexander, O., Fyffe, G., Busch, J., Yu, X., Ichikari, R., Jones, A., Debevec, P.,
Jimenez, J., Danvoye, E., Antionazzi, B., et al.: Digital Ira: Creating a real-time
photoreal digital actor. In: SIGGRAPH (2013)
[2] Alp Güler, R., Trigeorgis, G., Antonakos, E., Snape, P., Zafeiriou, S., Kokkinos, I.: DenseReg: Fully Convolutional Dense Shape Regression In-the-Wild. In: CVPR (2017)
[3]Bagdanov, A.D., Del Bimbo, A., Masi, I.: The Florence 2D/3D Hybrid Face Dataset. In:
Workshop on Human Gesture and Behavior Understanding, ACM (2011)
[4] Bai, Z., Cui, Z., Liu, X., Tan, P.: Riggable 3D Face Reconstruction via In-Network Optimization. In: CVPR (2021)
[5] Beeler, T., Bickel, B., Beardsley, P., Sumner, B., Gross, M.: High-Quality Single-Shot Capture of Facial Geometry. ACM Trans. Graph. (2010)
[6]Beeler, T., Hahn, F., Bradley, D., Bickel, B., Beardsley, P., Gotsman, C., Sumner, R.W.,
Gross, M.: High-quality passive facial performance capture using anchor frames.
ACM Trans. Graph. (2011)
[7] Blanz, V., Vetter, T.: A morphable model for the synthesis of 3D faces. In: Computer graphics and interactive techniques (1999)
[8] Blanz, V., Vetter, T.: Face Recognition Based on Fitting a 3D Morphable Model. TPAMI (2003)
[9]Bogo, F., Kanazawa, A., Lassner, C., Gehler, P., Romero, J., Black, M.J.: Keep it
SMPL: Automatic estimation of 3D human pose and shape from a single image. In:
Computer Vision – ECCV 2016, Lecture Notes in Computer Science, Springer
International Publishing (Oct 2016)
[10] Bradley, D., Heidrich, W., Popa, T., Sheffer, A.: High Resolution Passive Facial
Performance Capture. ACM Trans. Graph. 29(4) (2010)
[11] Browatzki, B., Wallraven, C.: 3FabRec: Fast Few-shot Face alignment by Reconstruction. In: CVPR (2020)
[12] Bulat, A., Sanchez, E., Tzimiropoulos, G.: Subpixel Heatmap Regression for Facial
Landmark Localization. In: BMVC (2021)
[13] Bulat, A., Tzimiropoulos, G.: How far are we from solving the 2D & 3D Face
Alignment problem? (and a dataset of 230,000 3D facial landmarks). In: ICCV (2017)
[14] Cao, C., Chai, M., Woodford, O., Luo, L.: Stabilized real-time face tracking via a learned dynamic rigidity prior. ACM Transactions on Graphics (2018)
[15] Chandran, P., Bradley, D., Gross, M., Beeler, T.: Semantic Deep Face Models. In: International Conference on 3D Vision (3DV) (2020)
[16] Cong, M., Lan, L., Fedkiw, R.: Local geometric indexing of high resolution data for facial reconstruction from sparse markers. CoRR abs/1903.00119 (2019), URL http://arxiv.org/abs/1903.00119
[17] Deng, J., Guo, J., Ververas, E., Kotsia, I., Zafeiriou, S.: RetinaFace: Single-shot Multi-
level Face Localisation in the Wild. In: CVPR (2020)
[18] Deng, Y., Yang, J., Xu, S., Chen, D., Jia, Y., Tong, X.: Accurate 3D Face
Reconstruction With Weakly-Supervised Learning: From Single Image to Image Set.
In: CVPR Workshops (2019)
[19] Dib, A., Bharaj, G., Ahn, J., Thébault, C., Gosselin, P., Romeo, M., Chevallier, L.: Practical face reconstruction via differentiable ray tracing. Computer Graphics Forum 40(2) (2021)
[20] Dib, A., Thebault, C., Ahn, J., Gosselin, P.H., Theobalt, C., Chevallier, L.: Towards
High Fidelity Monocular Face Reconstruction with Rich Reflectance using Self-
supervised Learning and Ray Tracing. In: CVPR (2021)
[21] Dou, P., Kakadiaris, I.A.: Multi-view 3D face reconstruction with deep recurrent
neural networks. Image and Vision Computing (2018)
[22] Dou, P., Shah, S.K., Kakadiaris, I.A.: End-to-end 3D face reconstruction with deep
neural networks. In: CVPR (2017)
[23] Falcon, W., et al.: PyTorch Lightning. GitHub (2019). https://github.com/PyTorchLightning/pytorch-lightning
[24] Feng, Y., Feng, H., Black, M.J., Bolkart, T.: Learning an animatable detailed 3D face
model from in-the-wild images. ACM Transactions on Graphics (ToG) (2021)
[25] Feng, Y., Feng, H., Black, M.J., Bolkart, T.: Learning an Animatable Detailed 3D Face
Model from In-the-Wild Images. ACM Transactions on Graphics (ToG) (2021)
[26] Feng, Y., Wu, F., Shao, X., Wang, Y., Zhou, X.: Joint 3D Face Reconstruction and Dense Alignment with Position Map Regression Network. In: ECCV (2018)
[27] Garrido, P., Zollhöfer, M., Casas, D., Valgaerts, L., Varanasi, K., Pérez, P., Theobalt, C.: Reconstruction of Personalized 3D Face Rigs from Monocular Video. ACM Trans. Graph. 35(3) (2016)
[28] Genova, K., Cole, F., Maschinot, A., Sarna, A., Vlasic, D., Freeman, W.T.: Unsupervised Training for 3D Morphable Model Regression. In: CVPR (2018)
[29] Gerig, T., Morel-Forster, A., Blumer, C., Egger, B., Luthi, M., Schönborn, S., Vetter, T.: Morphable face models - an open framework. In: Automatic Face & Gesture Recognition (FG), IEEE (2018)
[30] Grishchenko, I., Ablavatski, A., Kartynnik, Y., Raveendran, K., Grundmann, M.:
Attention Mesh: High-fidelity Face Mesh Prediction in Real-time. In: CVPR
Workshops (2020)
[31] Güler, R.A., Neverova, N., Kokkinos, I.: DensePose: Dense human pose estimation in the wild. In: CVPR (2018)
[32] Guo, J., Zhu, X., Yang, Y., Yang, F., Lei, Z., Li, S.Z.: Towards Fast, Accurate and
Stable 3D Dense Face Alignment. In: ECCV (2020)
[33] Guo, Y., Cai, J., Jiang, B., Zheng, J., et al.: CNN-based real-time dense face reconstruction with inverse-rendered photo-realistic face images. TPAMI (2018)
[34] Han, S., Liu, B., Cabezas, R., Twigg, C.D., Zhang, P., Petkau, J., Yu, T.H., Tai, C.J., Akbay, M., Wang, Z., et al.: MEgATrack: monochrome egocentric articulated hand-tracking for virtual reality. ACM Transactions on Graphics (TOG) 39(4), 87–1 (2020)
[35] Hassner, T., Harel, S., Paz, E., Enbar, R.: Effective Face Frontalization in Unconstrained Images. In: CVPR (2015)
[36] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition.
In: CVPR (2016)
[37] Jeni, L.A., Cohn, J.F., Kanade, T.: Dense 3D face alignment from 2D videos in real-time. In: Automatic Face and Gesture Recognition (FG), IEEE (2015)
[38] Kartynnik, Y., Ablavatski, A., Grishchenko, I., Grundmann, M.: Real-time facial surface geometry from monocular video on mobile GPUs. In: CVPR Workshops (2019)
[39] Kendall, A., Gal, Y.: What uncertainties do we need in bayesian deep learning for computer vision? Advances in Neural Information Processing Systems 30 (2017)
[40] Kumar, A., Marks, T.K., Mou, W., Wang, Y., Jones, M., Cherian, A., Koike-Akino, T., Liu, X., Feng, C.: LUVLi face alignment: Estimating landmarks' location, uncertainty, and visibility likelihood. In: CVPR (2020)
[41] Lewis, J.P., Cordner, M., Fong, N.: Pose Space Deformation: A Unified Approach to Shape Interpolation and Skeleton-Driven Deformation. In: SIGGRAPH (2000)
[42] Li, T., Bolkart, T., Black, M.J., Li, H., Romero, J.: Learning a model of facial shape and
expression from 4D scans. ACM Transactions on Graphics, (Proc. SIGGRAPH Asia)
(2017)
[43] Li, Y., Yang, S., Zhang, S., Wang, Z., Yang, W., Xia, S.T., Zhou, E.: Is 2D Heatmap Representation Even Necessary for Human Pose Estimation? (2021)
[44] Liu, D.C., Nocedal, J.: On the limited memory BFGS method for large scale optimization. Mathematical Programming 45(1), 503–528 (1989)
[45] Liu, F., Zhu, R., Zeng, D., Zhao, Q., Liu, X.: Disentangling features in 3D face shapes
for joint face reconstruction and recognition. In: CVPR (2018)
[46] Liu, Y., Jourabloo, A., Ren, W., Liu, X.: Dense Face Alignment. In: ICCV Workshops
(2017)
[47] Loshchilov, I., Hutter, F.: Decoupled Weight Decay Regularization. In: ICLR (2019)
[48] Morales, A., Piella, G., Sukno, F.M.: Survey on 3d face reconstruction from
uncalibrated images. Computer Science Review 40, 100400 (2021)
[49] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin,
Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison,
M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S.: Pytorch:
An imperative style, high-performance deep learning library. In: NeurIPS (2019)
[50] Piotraschke, M., Blanz, V.: Automated 3D face reconstruction from multiple images
using quality measures. In: CVPR (2016)
[51] Popa, T., South-Dickinson, I., Bradley, D., Sheffer, A., Heidrich, W.: Globally Consistent Space-Time Reconstruction. Comput. Graph. Forum (2010)
[52] Richardson, E., Sela, M., Kimmel, R.: 3D face reconstruction by learning from synthetic data. In: 3DV, IEEE (2016)
[53] Richardson, E., Sela, M., Or-El, R., Kimmel, R.: Learning detailed face reconstruction from a single image. In: CVPR (2017)
[54] Sagonas, C., Antonakos, E., Tzimiropoulos, G., Zafeiriou, S., Pantic, M.: 300 faces In-
the-wild challenge: Database and results. Image and Vision Computing (IMAVIS)
(2016)
[55] Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: MobilenetV2:
Inverted residuals and linear bottlenecks. In: CVPR (2018)
[56] Sanyal, S., Bolkart, T., Feng, H., Black, M.: Learning to Regress 3D Face Shape and Expression from an Image without 3D Supervision. In: CVPR (2019)
[57] Seitz, S.M., Curless, B., Diebel, J., Scharstein, D., Szeliski, R.: A Comparison and Evaluation of Multi-View Stereo Reconstruction Algorithms. CVPR (2006)
[58] Sela, M., Richardson, E., Kimmel, R.: Unrestricted facial geometry reconstruction using image-to-image translation. In: ICCV (2017)
[59] Shang, J., Shen, T., Li, S., Zhou, L., Zhen, M., Fang, T., Quan, L.: Self-supervised monocular 3D face reconstruction by occlusion-aware multi-view geometry consistency. In: ECCV (2020)
[60] Taylor, J., Bordeaux, L., Cashman, T., Corish, B., Keskin, C., Sharp, T., Soto, E., Sweeney, D., Valentin, J., Luff, B., et al.: Efficient and precise interactive hand tracking through joint, continuous optimization of pose and correspondences. ACM Transactions on Graphics (ToG) (2016)
[61] Taylor, J., Shotton, J., Sharp, T., Fitzgibbon, A.: The vitruvian manifold: Inferring dense correspondences for one-shot human pose estimation. In: CVPR (2012)
[62] Tewari, A., Bernard, F., Garrido, P., Bharaj, G., Elgharib, M., Seidel, H.P., Pérez, P., Zollhöfer, M., Theobalt, C.: FML: Face Model Learning from Videos. In: CVPR (2019)
[63] Tewari, A., Zollhöfer, M., Garrido, P., Bernard, F., Kim, H., Pérez, P., Theobalt, C.: Self-supervised Multi-level Face Model Learning for Monocular Reconstruction at over 250 Hz. In: CVPR (2018)
[64] Tewari, A., Zollhofer, M., Kim, H., Garrido, P., Bernard, F., Perez, P., Theobalt, C.:
Mofa: Model-based deep convolutional face autoencoder for unsupervised monocular
reconstruction. In: ICCV Workshops (2017)
[65] Thies, J., Zollhöfer, M., Nießner, M., Valgaerts, L., Stamminger, M., Theobalt, C.: Real-time expression transfer for facial reenactment. ACM Trans. Graph. (Oct 2015)
[66] Thies, J., Zollhöfer, M., Stamminger, M., Theobalt, C., Nießner, M.: Face2Face: Real-Time Face Capture and Reenactment of RGB Videos. In: CVPR (2016)
[67] Tran, L., Liu, F., Liu, X.: Towards high-fidelity nonlinear 3D face morphable model. In: CVPR (2019)
[68] Tran, L., Liu, X.: Nonlinear 3D face morphable model. In: CVPR (2018)
[69] Tuan Tran, A., Hassner, T., Masi, I., Medioni, G.: Regressing robust and discriminative 3D morphable models with a very deep neural network. In: CVPR (2017)
[70] Wang, X., Bo, L., Fuxin, L.: Adaptive Wing Loss for Robust Face Alignment via Heatmap Regression. In: ICCV (2019)
[71] Wightman, R.: Pytorch image models. https://github.com/rwightman/
pytorch-image-models (2019), doi:10.5281/zenodo.4414861
[72] Wood, E., Baltrušaitis, T., Hewitt, C., Dziadzio, S., Johnson, M., Estellers, V., Cashman, T.J., Shotton, J.: Fake It Till You Make It: Face analysis in the wild using synthetic data alone (2021)
[73] Wu, W., Qian, C., Yang, S., Wang, Q., Cai, Y., Zhou, Q.: Look at Boundary: A Boundary-Aware Face Alignment Algorithm. In: CVPR (2018)
[74] Yi, H., Li, C., Cao, Q., Shen, X., Li, S., Wang, G., Tai, Y.W.: MMFace: A Multi-Metric Regression Network for Unconstrained Face Reconstruction. In: CVPR (2019)
[75] Yoon, J.S., Shiratori, T., Yu, S.I., Park, H.S.: Self-supervised adaptation of high-fidelity face models for monocular performance tracking. In: CVPR (2019)
[76] Zhou, Y., Deng, J., Kotsia, I., Zafeiriou, S.: Dense 3D Face Decoding over 2500FPS: Joint Texture & Shape Convolutional Mesh Decoders. In: CVPR (2019)
[77] Zhu, M., Shi, D., Zheng, M., Sadiq, M.: Robust Facial Landmark Detection via
Occlusion-Adaptive Deep Networks. In: CVPR (2019)
[78] Zhu, X., Lei, Z., Liu, X., Shi, H., Li, S.Z.: Face Alignment Across Large Poses: A 3D
Solution. In: CVPR (2016)
[79] Zhu, X., Lei, Z., Liu, X., Shi, H., Li, S.Z.: Face alignment across large poses: A 3d
solution. In: Proceedings of the IEEE conference on computer vision and pattern
recognition, pp. 146–155 (2016)
[80] Zollhöfer, M., Thies, J., Garrido, P., Bradley, D., Beeler, T., Pérez, P., Stamminger, M., Nießner, M., Theobalt, C.: State of the art on monocular 3D face reconstruction, tracking, and applications. Computer Graphics Forum 37(2) (2018)