
HAvatar: High-fidelity Head Avatar via Facial Model Conditioned Neural Radiance Field

Published: 30 November 2023

Abstract

The problem of modeling an animatable 3D human head avatar under lightweight setups is of significant importance but has not been well solved. Existing 3D representations either perform well in the realism of portrait image synthesis or in the accuracy of expression control, but not both. To address this problem, we introduce a novel hybrid explicit-implicit 3D representation, the Facial Model Conditioned Neural Radiance Field, which integrates the expressiveness of NeRF with the prior information of a parametric template. At the core of our representation, a synthetic-renderings-based condition method is proposed to fuse the prior information from the parametric model into the implicit field without constraining its topological flexibility. In addition, based on the hybrid representation, we overcome the inconsistent-shape issue present in existing methods and improve animation stability. Moreover, by adopting an overall GAN-based architecture with an image-to-image translation network, we achieve high-resolution, realistic, and view-consistent synthesis of dynamic head appearance. Experiments demonstrate that our method achieves state-of-the-art performance for 3D head avatar animation compared with previous methods.

1 Introduction

Animatable 3D human head avatar modeling is of great significance in many applications such as VR/AR games and telepresence. There are two key factors for a lifelike personalized virtual character: the accuracy of facial expression control and the realism of portrait image synthesis. Though existing solutions [Lombardi et al. 2018, 2021; Ma et al. 2021] are able to reconstruct high-quality dynamic human heads, they typically depend on complicated dense-view capture systems and may even rely on hundreds of cameras. By leveraging learning-based techniques, researchers have shifted interest to exploring the possibility of automatically modeling human head avatars, with accurate controllability and high-fidelity appearance, under lightweight setups.
First, to establish a controllable personalized head character, the most straightforward way is to directly learn a globally parameter-conditioned neural head avatar from image sequences, but such a method [Gafni et al. 2021] has limited generalization ability in expression control. To improve control robustness, other works [Zheng et al. 2022; Grassal et al. 2022] attempt to leverage parametric templates [Li et al. 2017] to help regulate avatar modeling during the training stage. However, the explicit surface prior from the parametric model constrains the expressive power for parts with complex topology (e.g., glasses).
Second, for high-fidelity human head avatar modeling, recent implicit-surface-based methods [Yenamandra et al. 2021; Zheng et al. 2022; Grassal et al. 2022] recover more texture details than conventional methods [Yang et al. 2020; Cao et al. 2014; Li et al. 2017; Wang et al. 2022b] with limited-resolution texture representations. Nevertheless, the quality of the recovered appearance is still far from satisfactory. Built on the expressive neural radiance field (NeRF) [Mildenhall et al. 2020], NeRFace [Gafni et al. 2021] is able to generate more promising dynamic appearance results. However, with its MLP backbone trained in an auto-decoding fashion, it tends to overfit the training sequences, leading to obviously inconsistent shapes across frames and unnatural head shaking in the test phase.
Combining the expressiveness of NeRF with the prior information of the parametric template is a promising way to achieve fine-grained expression control and realistic portrait synthesis. Recent work [Athar et al. 2022] establishes a deformable-mesh-guided dynamic NeRF for head avatar modeling. However, the prominent challenge in coupling geometry models with NeRF comes from the difficulty of establishing reliable dense correspondences between the real-world subject and the fitted parametric template. Due to the limited expressiveness of the morphable model, it is hard for the deformed mesh to perfectly align with a real-world head of high geometric and topological diversity. As a result of this misalignment, the spatial sampling points in the neural radiance field tend to establish ambiguous correspondences with the mesh surface, leading to blurry or unstable rendering.
In this article, we introduce a novel Parametric Model-conditioned Neural Radiance Field for human head avatars. Inspired by the effective rendering-to-video translation architecture adopted by Kim et al. [2018], we extend the synthetic-rendering-based condition to 3D head control by integrating it with a triplane-based neural volumetric representation [Chan et al. 2022]. The dynamic head character is conditioned by axis-aligned feature planes generated from orthogonal renderings of the textured fitted parametric model in the canonical space. We leverage a powerful convolutional network to learn reasonable correspondences between the canonical synthetic renderings and the observed head appearance, hence avoiding the ambiguous correspondences determined by Euclidean distance. On the one hand, such a synthetic-rendering-based condition introduces the prior of the fully controllable 3D facial model into the neural representation to achieve fine-grained and consistent expression control. On the other hand, orthogonal renderings supply a rough 3D description while avoiding excessive restriction from the coarse geometry of the model mesh, so our head avatar is capable of describing complex topological structures. Considering that the dynamic content mainly comes from facial expressions, we utilize a facial parametric model rather than a full-head model in practice, leaving only the facial region benefiting from the model's prior.
Moreover, while retaining the powerful appearance expressiveness of NeRF [Mildenhall et al. 2020], our method also overcomes the inconsistent-shape issue that commonly occurs in NeRF-based modeling methods [Gafni et al. 2021]. Based on our synthetic-renderings-based orthogonal-plane representation, we utilize learnable embeddings to modulate the plane feature generators rather than conditioning the MLP decoder in an auto-decoding fashion as in NeRFace [Gafni et al. 2021]. By modulating the convolutional kernels and normalizing the feature generation, the embeddings regulate the whole feature volume to avoid overfitting, leading to a consistent head shape in the animation. Our experiments show that, with per-frame embeddings modulating the convolutional generators, the shape consistency and animation stability of our head avatar are significantly improved.
Finally, our method inherits the advantages of NeRF [Mildenhall et al. 2020], which intrinsically supports differentiable rendering and maintains multiview consistency. Thanks to this strength, we further integrate NeRF-based volume rendering with neural rendering and optimize the whole architecture end-to-end with image observations to recover facial details. Specifically, by leveraging the effective image-to-image translation networks commonly used in portrait video synthesis research [Zakharov et al. 2019; Xu et al. 2020; Thies et al. 2019; Chen et al. 2020; Kim et al. 2018], we translate the rendered 3D-aware feature map into RGB images. Training the overall network in an adversarial manner, our solution is the first to achieve high-resolution and view-consistent photo-realistic synthesis for a 3D head avatar, as illustrated in Figure 1.
Given monocular or sparse-view videos, after fitting per-frame 3D facial models with an off-the-shelf tracker, our approach is able to learn a high-fidelity and view-consistent personalized 3D head avatar, including hair, accessories, and torso, under full control of head poses and facial expressions. Meanwhile, we also optimize a linear blend skinning (LBS) weight field, which decouples the motions of the head and the torso via a backward warping. At test time, given a single-view driving video, pose and expression parameters are extracted to deform the facial model, and our method can faithfully recover the entire head appearance under novel expressions, poses, and viewpoints.
In summary, the main contributions of this article include:
We propose a novel facial model conditioned NeRF for personalized 3D head avatars, which is built on an orthogonal synthetic-renderings-based feature volume. Our representation enables flexible topology and accurate control over head motion and facial expressions.
Benefiting from our hybrid representation, we develop a new strategy of modulating the generators with conditional embeddings, which handles the inconsistent-shape issue present in existing NeRF-based avatar modeling methods and significantly improves animation stability.
We are the first to achieve high-resolution, realistic, and view-consistent synthesis of dynamic head appearance by adopting an overall GAN-based architecture that combines our efficient avatar representation with an image-to-image translation module.
Besides learning head avatars from monocular videos, we also present head avatar modeling from multiview videos (using six cameras), and experiments demonstrate the superior performance of our approach compared with other modified SOTA methods.

2 Related Works

Our method draws inspiration from explicit parametric facial models, synthetic-renderings-based 2D facial avatars, and implicit 3D head avatars, so we divide this section into three parts.

2.1 Explicit Parametric Facial Models

Parametric modeling of 3D faces has been intensively studied over the past two decades. In the form of explicit meshes, parametric face models are compact, controllable, and easy to animate. The pioneering work of Blanz and Vetter [1999] builds a 3D morphable model to represent facial shape, expression, and appearance. More recently, parametric face models have become more expressive by exploiting more powerful modeling techniques, including multilinear or nonlinear models [Vlasic et al. 2006; Neumann et al. 2013; Brunton et al. 2014; Li et al. 2010; Tewari et al. 2018; Tran et al. 2019; Tran and Liu 2018; Li et al. 2020] and articulated control of expression [Li et al. 2017]. To model detailed expression-dependent deformations, recent state-of-the-art methods [Feng et al. 2021; Danecek et al. 2022] further learn additional displacement maps conditioned on image inputs. Moreover, learning-based generative models such as GAN [Karras et al. 2017] or StyleGAN [Karras et al. 2020] are also used in existing models [Gecer et al. 2021; Cheng et al. 2019; Nagano et al. 2018; Wang et al. 2022b; Nagano et al. 2019; Lattas et al. 2021; Luo et al. 2021] to enhance the accuracy of facial texture or geometry modeling. Despite this remarkable progress, these parametric models capture only the relatively coarse geometry and appearance of the facial region with explicit mesh representations, which limits the realism of the reconstruction and animation approaches [Grassal et al. 2022; Cao et al. 2016; Hu et al. 2017] built upon them. Instead of solely relying on explicit face models, our approach proposes a controllable hybrid explicit-implicit representation for photo-realistic rendering of 3D faces.

2.2 Synthetic-renderings-based 2D Facial Avatar

To utilize the explicit facial model to represent the entire dynamic human head, some methods [Kim et al. 2018; Koujan et al. 2020; Doukas et al. 2021; Thies et al. 2019, 2020] combine classical rendering and learned image synthesis to establish a 2D avatar from a monocular video. Deep Video Portraits [Kim et al. 2018] presented impressive full-head reenactment and photo-realistic image results based on an image2image translation framework. Head2Head [Koujan et al. 2020; Doukas et al. 2021] further improved temporal coherency with a sequential, video-based rendering network. Instead of using the raw texture of the fitted coarse facial model, Deferred Neural Rendering [Thies et al. 2019, 2020] extended the idea by rendering local features embedded on the mesh surface. Though the rendering-to-video architecture shows impressive performance in video portrait synthesis, it does not establish a 3D representation of the full-head appearance.

2.3 Implicit 3D Head Avatar

In the past three years, it has become an emerging trend to model 3D scenes and objects in an implicit fashion, following the success of implicit representations [Mildenhall et al. 2020; Yariv et al. 2020]. Based on these, some works [Chan et al. 2022; Wang et al. 2022a; Kellnhofer et al. 2021; Mihajlovic et al. 2022] have explored reconstructing high-fidelity, view-consistent 3D appearance for static portraits, or modeling dynamic scenes with head movements [Park et al. 2021b, 2021a]. As for animatable personalized head characters, many methods [Lombardi et al. 2019, 2021; Wang et al. 2021; Gafni et al. 2021; Grassal et al. 2022; Zheng et al. 2022; Athar et al. 2022] have attempted to build implicit-representation-based personalized full-head avatars. Based on dense multiview capture systems, some approaches [Lombardi et al. 2019, 2021; Wang et al. 2021; Cao et al. 2021] are able to generate facial avatars with impressive subtle details and highly flexible controllability for immersive metric telepresence. Though the recent work of Cao et al. [2022] supports creating authentic avatars from a phone scan, it relies on a prior model pretrained on a large-scale multiview-video dataset captured with a complicated system. The high cost of data acquisition limits broad application. Under lightweight camera settings and based on an implicit surface representation, I M Avatar [Zheng et al. 2022] improved generalization to novel expressions by incorporating skinning fields within an implicit morphing-based model, but showed blurry, unsatisfying appearance results. NeRFace [Gafni et al. 2021] showed state-of-the-art reenactment results with a parameter-controlled neural radiance field, but struggled to extrapolate to unseen expressions. Recently, RigNeRF [Athar et al. 2022] proposed to maintain a canonical neural radiance field with a backward deformation field guided by the parametric model mesh, but it suffers from the ambiguous correspondences determined by Euclidean distance. Besides, NeRF-based head avatar modeling methods [Gafni et al. 2021; Hong et al. 2022; Guo et al. 2021] tend to generate frame-wise inconsistent shapes. The problem originates from the unavoidable noise in the estimation of expressions and head poses: similar input expressions may correspond to slightly different observed appearances, causing unstable canonical shape recovery. By skillfully incorporating the synthetic renderings of a parametric model into the neural radiance field, our approach achieves both expressive appearance and robust full-head control, and further addresses the inconsistent shape by modulating feature generation with learnable embeddings.

3 Overview

The overview of our proposed method is illustrated in Figure 2. Given monocular or sparse-view videos, we estimate a per-frame facial parametric model \(\mathbf {M_t}\) from the image sequence \(\mathbf {{I}}_t,t=1,\dots ,T\) . Our method conditions the NeRF on orthogonal synthetic renderings of the model to describe the expression-related head appearance in the canonical space \(H_C\) , which supports arbitrary topology and precise expression control. Besides, per-frame learnable embeddings are utilized to modulate plane feature generation to address the expression-shape coupling issue (Section 4.1). Based on a learned LBS weight field, the canonical appearance volume \(H_C\) is warped into the observed space H using the estimated head pose, resulting in decoupled motions of the head and the body (Section 4.2). With an image-to-image translation network transferring the volumetrically rendered 2D feature maps to the final RGB images, our method achieves high-resolution, photo-realistic, and view-consistent portrait image synthesis (Section 4.3). The overall framework is trained in an adversarial manner with image observations, and the established head avatar can be applied to 4D reconstruction of the training sequence or novel full-head reenactment (Section 4.4).
Fig. 1.
Fig. 1. Our method is able to synthesize high-resolution, photo-realistic, and view-consistent head images, achieving fine-grained control over head poses and facial expressions.
Fig. 2.
Fig. 2. The overview of our parametric-model-based Neural Head Avatar.

3.1 Recap: NeRFace

NeRFace [Gafni et al. 2021] was the first to extend NeRF [Mildenhall et al. 2020] to describe expression-related dynamic head appearance. Based on the classical backbone of eight fully connected layers, NeRFace additionally inputs the low-dimensional expressions of the morphable model to condition the neural scene representation network for dynamically changing content. By employing the estimated head pose to transform the rays into the canonical space shared by all frames, the canonical head appearance volume \(H_C\) can be formulated as:
\begin{equation} H_C(\mathbf {x_c}, \mathbf {\gamma _t}, \mathbf {\delta _t}) = (\mathbf {c}, \sigma), \end{equation}
(1)
where the implicit function maps the position \(\mathbf {x_c}\) in canonical space to a density \(\sigma\) and a color feature \(\mathbf {c}\) , under the control of the facial expression parameters \(\mathbf {\delta _t}\) , as well as per-frame embeddings \(\mathbf {\gamma _t}\) that compensate for missing tracking information.
NeRFace relies on global expression blendshape parameters to represent diverse expression-related appearances. However, simply learning the mapping from global conditional vectors to appearances with only a short video sequence easily overfits. Hence, though NeRFace is good at faithfully reconstructing the training sequences, without awareness of the underlying 3D structure of the human face, it struggles to generalize to unseen expressions.
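For concreteness, the following minimal PyTorch sketch shows an Eq. (1)-style canonical field: an MLP conditioned on a positionally encoded point, a global expression vector, and a per-frame embedding. The layer widths, encoding frequencies, and input dimensions are illustrative assumptions rather than NeRFace's exact configuration.

```python
import torch
import torch.nn as nn

def positional_encoding(x, num_freqs=10):
    """Standard NeRF-style sinusoidal encoding of 3D points."""
    freqs = 2.0 ** torch.arange(num_freqs, device=x.device) * torch.pi
    angles = x[..., None] * freqs                       # (..., 3, num_freqs)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    return enc.flatten(start_dim=-2)                    # (..., 3 * 2 * num_freqs)

class ExpressionConditionedNeRF(nn.Module):
    """H_C(x_c, gamma_t, delta_t) -> (color feature c, density sigma), cf. Eq. (1)."""
    def __init__(self, expr_dim=76, embed_dim=32, width=256, feat_dim=3):
        super().__init__()
        in_dim = 3 * 2 * 10 + expr_dim + embed_dim      # encoded point + conditions
        layers, d = [], in_dim
        for _ in range(8):                              # eight fully connected layers
            layers += [nn.Linear(d, width), nn.ReLU(inplace=True)]
            d = width
        self.backbone = nn.Sequential(*layers)
        self.head = nn.Linear(width, feat_dim + 1)      # color/feature + density

    def forward(self, x_c, delta_t, gamma_t):
        h = torch.cat([positional_encoding(x_c), delta_t, gamma_t], dim=-1)
        out = self.head(self.backbone(h))
        c, sigma = out[..., :-1], torch.relu(out[..., -1])
        return c, sigma
```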

4 Method

4.1 Parametric Model-conditioned NeRF

To introduce the facial structure prior into NeRF [Mildenhall et al. 2020], we propose the Parametric Face Model-conditioned Neural Radiance Field. Our definition of \(H_C\) is reformulated as:
\begin{equation} H_C(\mathbf {x_c}, M_t, \mathbf {\gamma _t}, \mathbf {p_t}) = (\mathbf {c}, \sigma), \end{equation}
(2)
where we utilize the tracked deformed mesh model \(M_t\) in zero pose to condition the implicit function, together with the head pose \(\mathbf {p_t}\) to describe pose-related non-rigid deformation.

4.1.1 Synthetic-renderings-based Feature Volume.

Figure 3 illustrates the architecture of our NeRF-based representation. The head avatar, embedded in a neural network, is conditioned by model-related local features rather than a global vector for better generalization and precision. Specifically, the orthogonal synthetic renderings of the facial model are leveraged to generate the feature volume for the canonical head appearance.
Fig. 3.
Fig. 3. Parametric model conditioned volumetric representation for canonical head appearance.
We orthogonally render the 3D facial model in zero pose and integrate the renderings similarly to the tri-plane-based neural representation [Chan et al. 2022]. Considering the special structure of the human head, we abandon the horizontal plane and utilize the front-view and two side-view planes to characterize the head avatar in canonical space. Instead of sharing one StyleGAN-based backbone to generate all the feature planes, our method utilizes two separate 2D generators to output the feature maps individually.1 It is also feasible to condition one StyleGAN-based backbone with synthetic renderings to generate all feature planes, but we empirically found that utilizing two separate 2D generators accelerates convergence. As shown in Figure 4, the synthetic renderings are introduced to the generators to condition the plane feature generation in an explicit manner, achieving fine-grained controllability. With convolutional encoders extracting image features from the renderings, the extracted multi-resolution features are injected into the generators for spatial-wise feature fusion. In practice, we generate the front-view plane feature \(F_{front}\) from the front-view orthogonal renderings and leverage both left- and right-view renderings to obtain the side-view plane feature \(F_{side}\) . Concretely, the deformed mesh \(M_t\) is rendered as a normal map, a texture map, and a mask map in each view. For the experiments reported in this work, each generator produces a \(128 \times 128 \times 64\) feature image.
Fig. 4.
Fig. 4. The architecture of orthogonal plane feature generator network.
Based on the generated plane feature images \(F_{front}\) and \(F_{side}\) , for any 3D point in the canonical space, we retrieve its feature vectors via orthogonal projection and bilinear interpolation. All the sampled feature vectors, as well as the positional encoding of the coordinate, are concatenated into the point feature \(\mathbf {f}\) , which is fed into an additional lightweight MLP module with two hidden layers of 128 units. Finally, a scalar density \(\sigma\) and a 64-channel color feature \(\mathbf {c}\) are predicted for the query point. Indeed, the combination of orthogonal plane features and a lightweight MLP shifts the burden of scene representation learning onto the plane feature generation. Hence, we can rely on a powerful and efficient 2D convolutional network, rather than a large MLP backbone, to extract condition information from the synthetic renderings and characterize the dynamic head appearance.
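The sketch below illustrates the sampling and decoding step described above, assuming canonical coordinates normalized to [-1, 1] and a particular axis convention for the front/side projections (both assumptions); it is not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def sample_plane(plane, coords_2d):
    """Bilinearly sample a (1, C, H, W) feature plane at 2D points in [-1, 1]."""
    grid = coords_2d.view(1, -1, 1, 2)                       # (1, N, 1, 2)
    feat = F.grid_sample(plane, grid, mode='bilinear', align_corners=True)
    return feat.view(plane.shape[1], -1).t()                 # (N, C)

class PlaneConditionedDecoder(nn.Module):
    """Lightweight MLP decoder: plane features + positional encoding -> (sigma, c)."""
    def __init__(self, plane_dim=64, pos_dim=60, hidden=128, feat_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * plane_dim + pos_dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, feat_dim + 1))

    def forward(self, x_c, pos_enc, f_front, f_side):
        # Orthogonal projections (axis convention is an assumption):
        # the front view drops depth (z), the side view drops the lateral axis (x).
        front_feat = sample_plane(f_front, x_c[:, [0, 1]])   # (N, 64)
        side_feat = sample_plane(f_side, x_c[:, [2, 1]])     # (N, 64)
        f = torch.cat([front_feat, side_feat, pos_enc], dim=-1)
        out = self.mlp(f)
        sigma, c = torch.relu(out[:, 0]), out[:, 1:]
        return sigma, c
```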
As shown in Figure 5, with the 3D hint from the facial model, our representation improves the quality of view-consistent image synthesis. Using the front, left, and right elevations is a succinct yet efficient description of a 3D human head: it contains a full observation of the primary parts of the head while getting rid of the constraint from the coarse geometry of the mesh model. Adding more planes would lead to information redundancy and unnecessary memory consumption.
Fig. 5.
Fig. 5. We show the novel viewpoint synthesis results of monocular-video-based avatars. By introducing a 3D prior into NeRF, our approach improves the robustness of image synthesis under large rotations.

4.1.2 Conditional Learnable Embeddings.

Though our proposed representation is competent for generating expression-related canonical head appearance, there is still an unsolved problem: the misalignment between the tracked facial model and the ground-truth observation, which may lead to frame-wise inconsistent shape. We tackle it by introducing additional conditional embeddings into our representation to distinguish similar expressions at different frames.
To account for the misalignment challenge, the previous method [Gafni et al. 2021] also provides a per-frame learnable embedding to the neural head avatar, which contributes to better reconstruction of the training sequences but cannot eliminate the unnatural head shaking when the avatar is driven by time-varying expressions in the test phase. This is because it conditions the MLP backbone with embeddings in an auto-decoding fashion [Park et al. 2019], causing the embeddings to overfit the training dataset. Thanks to our representation, which conditions the scene with orthogonal synthetic renderings, we instead condition the plane feature generators with the learnable embeddings, which are fed into a mapping network to modulate the convolutional kernels of the networks in the manner of StyleGAN2 [Karras et al. 2020]. The embeddings essentially serve as a normalization of the overall features and concentrate on maximizing global similarity instead of overfitting per-frame local details during training. Hence, our conditioning manner produces a latent space with better interpolation performance and learns a consistent, expression-independent head shape. As shown in Figure 6, apart from the reasonable expression-related deformation around the cheek, our animation results hardly present shape shaking, proving that our conditional embeddings improve animation stability.
Fig. 6.
Fig. 6. Comparison of the two different conditional embeddings. Different from conditioning the learnable embeddings in an auto-decoding fashion (marked as ①), we utilize them to modulate the generators of the orthogonal plane features (marked as ②) and prevent embeddings from overfitting to the training dataset. The middle column shows the canonical appearance of the avatar. By only changing the expression (the rightmost column), we illustrate the corresponding rendered appearance and the error map of the generated mask between target cases and the base case.
Specifically, the per-frame embedding is first input to a shared mapping network to yield an intermediate latent code, which then modulates the convolutional layers of all the separate generators. By constraining the variance of the learnable embeddings, the generators are encouraged to rely mainly on the synthetic renderings for prediction, with the per-frame embedding accounting only for the variability resulting from tracking error.
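As a hedged sketch of this conditioning scheme, the snippet below implements a StyleGAN2-style modulated convolution driven by a latent code produced by a small mapping network; the embedding and pose dimensions (32 and 6) and the network widths are placeholder assumptions, not the paper's exact values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModulatedConv2d(nn.Module):
    """StyleGAN2-style convolution whose kernels are modulated by a latent code."""
    def __init__(self, in_ch, out_ch, style_dim, k=3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.01)
        self.to_style = nn.Linear(style_dim, in_ch)

    def forward(self, x, w):
        b, c, h, w_ = x.shape
        style = self.to_style(w).view(b, 1, c, 1, 1) + 1.0      # per-sample scales
        weight = self.weight[None] * style                      # modulate
        demod = torch.rsqrt(weight.pow(2).sum(dim=[2, 3, 4]) + 1e-8)
        weight = weight * demod.view(b, -1, 1, 1, 1)            # demodulate
        # Grouped-conv trick: fold the batch into the channel dimension.
        x = x.view(1, b * c, h, w_)
        weight = weight.view(b * weight.shape[1], c, *weight.shape[-2:])
        out = F.conv2d(x, weight, padding=1, groups=b)
        return out.view(b, -1, h, w_)

# A shared mapping network turns the per-frame embedding (and head pose) into the
# intermediate latent code that modulates every generator layer.
mapping = nn.Sequential(nn.Linear(32 + 6, 256), nn.LeakyReLU(0.2),
                        nn.Linear(256, 256))
```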

4.1.3 Pose-related Non-rigid Deformation.

Though our solution is able to handle the skeletal motion of the head, which will be introduced in the next section, there still exist pose-related non-rigid deformations caused by head movements in the canonical space, especially in the neck region. To describe them, similar to the handling of the per-frame learnable embeddings, the estimated head poses are also fed into the mapping network to condition the avatar generation.

4.2 Head Motion Decoupling Module

In this section, we explain how to handle the rigid skeletal deformation driven by head poses. The straightforward treatment in NeRFace [Gafni et al. 2021], in which the estimated head poses serve as camera poses, leads to identical motion of the head and body, which is unrealistic. To render images that agree with the ground-truth observation, the relative movement between the head and torso needs to be considered. As shown in Figure 7, the canonical appearance volume \(H_C\) should be warped to an observed posed appearance volume H with the rigid deformation T:
\begin{equation} H(\mathbf {x}, M_t, \mathbf {p_t}, \mathbf {\gamma _t}) = H_C(T(\mathbf {x}, \mathbf {p_t}), M_t, \mathbf {\gamma _t}, \mathbf {p_t}). \end{equation}
(3)
Specifically, we compute the head rigid deformation T as inverse linear blend skinning that maps points from the posed space to the shared canonical space:
\begin{equation} T(\mathbf {x}, \mathbf {p_t}) = w_p(\mathbf {x})(R_{head}\mathbf {x}+t_{head}) + (1 - w_p(\mathbf {x}))(R_{torso}\mathbf {x}+t_{torso}), \end{equation}
(4)
where \(w_p\) represents the blend weight, \(R_{head}\) and \(t_{head}\) denote the head rotation and translation derived from the estimated head pose \(\mathbf {p_t}\) , and \(R_{torso}\) and \(t_{torso}\) denote the torso movement, which is static by default. To avoid the overfitting caused by learning backward skinning [Chen et al. 2021; Zheng et al. 2022], following HumanNeRF [Weng et al. 2022], we solve for the weight volume in canonical space and derive \(w_p\) as:
\begin{equation} w_p(\mathbf {x}) = \frac{w_c(R_{head}\mathbf {x}+t_{head})}{w_c(R_{head}\mathbf {x}+t_{head}) + (1-w_c(\mathbf {x}))} . \end{equation}
(5)
Concretely, we set up a 3D convolutional network \(W_c\) that takes a constant random vector as input and generates the canonical weight volume \(w_c(x)\) at a limited resolution, which can be resampled via trilinear interpolation. With the optimized motion decoupling module, our method can separate out the head movement and stabilize the torso motion.
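A minimal sketch of Equations (4) and (5) follows, assuming a static torso (identity transform) and a canonical weight volume whose values lie in [0, 1]; the grid-sampling conventions and coordinate normalization are assumptions rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def sample_weight_volume(w_c_vol, pts):
    """Trilinearly sample the canonical weight volume w_c at 3D points in [-1, 1].

    w_c_vol: (1, 1, D, H, W) volume produced by the 3D conv network W_c
             (assumed to output values in [0, 1], e.g. via a sigmoid).
    pts:     (N, 3) coordinates.
    """
    grid = pts.view(1, -1, 1, 1, 3)
    w = F.grid_sample(w_c_vol, grid, mode='bilinear', align_corners=True)
    return w.view(-1)                                          # (N,)

def warp_to_canonical(x, R_head, t_head, w_c_vol):
    """Inverse LBS (Eqs. 4-5) with a static torso, i.e. an identity torso transform."""
    x_head = x @ R_head.t() + t_head                           # head-rigid candidate
    w_head = sample_weight_volume(w_c_vol, x_head)             # w_c(R x + t)
    w_obs = sample_weight_volume(w_c_vol, x)                   # w_c(x)
    w_p = w_head / (w_head + (1.0 - w_obs) + 1e-8)             # Eq. (5)
    x_c = w_p[:, None] * x_head + (1.0 - w_p[:, None]) * x     # Eq. (4), torso static
    return x_c
```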
Fig. 7.
Fig. 7. Decomposition of head movement. The heatmap in LBS weight volume illustrates that the head (red region) moves according to the pose vector p and the torso (blue region) is hardly affected by the head pose.

4.3 Photo-realistic 3D-aware Portrait Synthesis

Although the aforementioned hybrid NeRF-based representation is more expressive than existing methods, relying on pixel supervision alone ( \(\mathit {MSE}\) / \(\mathit {l}_1\) RGB loss) can hardly yield high-frequency details in the rendered images. Hence, we incorporate the 3D representation into an image2image translation architecture and train the overall network jointly in an adversarial manner to enhance facial details and recover realistic portrait images.
Based on the established appearance volume, volume rendering is implemented using two-pass importance sampling as in Mildenhall et al. [2020]. To retain more 3D-aware information for the subsequent module to generate view-consistent images, similar to previous works [Niemeyer and Geiger 2021; Chan et al. 2022; Gu et al. 2021; Hong et al. 2022], we predict a low-resolution \(128 \times 128 \times 64\) feature map from a given camera pose via volumetric rendering, instead of directly rendering an RGB image. However, unlike these methods, which leverage an up-sampling super-resolution (SR) module, our approach uses a UNet-style image2image translation network to transfer the raw feature maps to the final RGB images. The down-sampling encoding process in the UNet helps the 2D network learn global portrait features, which contributes to view-consistent image generation.
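The feature-map rendering step can be sketched with the standard NeRF compositing quadrature, accumulating 64-channel features instead of RGB colors; the coarse/fine two-pass sampling is omitted here for brevity.

```python
import torch

def composite_features(sigma, feat, z_vals):
    """Volume-render per-sample features into one feature vector per ray.

    sigma:  (R, S)      densities along each ray
    feat:   (R, S, 64)  color features along each ray
    z_vals: (R, S)      sample depths
    """
    deltas = z_vals[:, 1:] - z_vals[:, :-1]
    deltas = torch.cat([deltas, 1e10 * torch.ones_like(deltas[:, :1])], dim=-1)
    alpha = 1.0 - torch.exp(-sigma * deltas)                     # opacity per sample
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=-1)
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=-1)
    weights = alpha * trans                                      # (R, S)
    pixel_feat = (weights[..., None] * feat).sum(dim=1)          # (R, 64)
    silhouette = weights.sum(dim=1)                              # used by the mask loss
    return pixel_feat, silhouette
```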
Our architecture is presented in Figure 8 and includes two main modifications. First, we incorporate skip connections in the decoder, which map each intermediate feature image to an RGB image and integrate the previous output with the next output through addition. Second, we represent the output image as wavelet (WT) coefficients following Gal et al. [2021], and the RGB image is generated through an inverse wavelet transform (IWT). This design choice helps reduce the number of parameters and speeds up network computation.
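To make the wavelet output head concrete, the sketch below shows one possible inverse 2D Haar transform that assembles a full-resolution RGB image from four half-resolution sub-bands per color channel; the sub-band ordering and normalization are assumptions and may differ from the exact implementation of the authors and Gal et al.

```python
import torch

def inverse_haar_2d(wavelet):
    """Inverse 2D Haar transform: (B, 12, H/2, W/2) sub-bands -> (B, 3, H, W) image.

    The 12 channels are assumed to hold the LL, LH, HL, HH sub-bands of the
    three color channels; this ordering is an assumption.
    """
    b, _, h2, w2 = wavelet.shape
    ll, lh, hl, hh = torch.chunk(wavelet, 4, dim=1)        # each (B, 3, H/2, W/2)
    tl = (ll + lh + hl + hh) / 2.0                         # top-left pixel of each 2x2 block
    tr = (ll - lh + hl - hh) / 2.0                         # top-right pixel
    bl = (ll + lh - hl - hh) / 2.0                         # bottom-left pixel
    br = (ll - lh - hl + hh) / 2.0                         # bottom-right pixel
    img = wavelet.new_zeros(b, 3, 2 * h2, 2 * w2)
    img[:, :, 0::2, 0::2] = tl
    img[:, :, 0::2, 1::2] = tr
    img[:, :, 1::2, 0::2] = bl
    img[:, :, 1::2, 1::2] = br
    return img
```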
Fig. 8.
Fig. 8. The architecture of image2image translation network.
Joint training of the overall network guides the NeRF module to provide sufficient and appropriate information for the image2image translation module, raising its 3D awareness and regularizing time- and view-inconsistent tendencies. In the next section, we explain the training procedure and the loss functions in detail.

4.4 Network Training and Avatar Re-animation

4.4.1 Training Strategy.

Given the tracked facial models of the training sequence and the segmented mask images, we employ a two-stage training procedure to optimize the neural head avatar, consisting of pretraining the NeRF-based appearance volume and an overall joint training. First, we train only the volume renderer part, i.e., the parametric model conditioned NeRF along with the motion decoupling module, to preliminarily establish the 3D representation. The first-stage training objective is composed of two components, an RGB reconstruction loss and a mask loss:
\begin{equation} \mathcal {L}_{nerf} = \lambda _{rgb} \mathcal {L}_{rgb}+\lambda _{mask} \mathcal {L}_{mask} . \end{equation}
(6)
For ease of notation, we drop the subscript (t) of all variables in this subsection.
RGB Reconstruction loss: We additionally set up a single linear layer to convert the 64-channel color feature output by the MLP decoder to a three-channel RGB value and compute the pixel color via volume rendering [Mildenhall et al. 2020]. The main supervision is \(\mathcal {L}_{rgb}\) , which measures the mean squared error between the rendered and true pixel colors:
\begin{equation} \mathcal {L}_{rgb} = \sum _{r\in R} \left\Vert \hat{C_r} - C(r|M, \mathbf {p}, \mathbf {\gamma })\right\Vert _2^2 , \end{equation}
(7)
where R is the set of rays in each batch, \(\hat{C_r}\) the ground-truth pixel color, and \(C(r | M, \mathbf {p}, \mathbf {\gamma })\) the corresponding reconstructed color determined by the parametric model ( \(M, \mathbf {p}\) ), the conditional variables ( \(\mathbf {\gamma }\) ), and the network (H) via the volume rendering function.
Silhouette Mask loss: Additionally, we utilize the foreground mask, which can be easily obtained with the BgMatting [Lin et al. 2020] algorithm, to provide supervision:
\begin{equation} \mathcal {L}_{mask} = \sum _{r\in R} BCE(\hat{S_r} , S(r \mid M, \mathbf {p}, \mathbf {\gamma })), \end{equation}
(8)
where \(BCE(\cdot)\) is the binary cross-entropy loss computed between the rendered silhouette mask value \(S(r \mid M, \mathbf {p}, \mathbf {\gamma })\) and the ground-truth mask \(\hat{S_r}\) .
Next, we train the whole network end-to-end in an adversarial manner with a discriminator [Gal et al. 2021], using the non-saturating GAN loss [Goodfellow et al. 2014] with R1 regularization [Mescheder et al. 2018], denoted \(\mathcal {L}_{adv}\) . On top of that, two additional loss terms, an \(\mathit {l}_{1}\) -norm reconstruction loss \(\mathcal {L}_{recon}\) and a perceptual loss \(\mathcal {L}_{percep}\) , are utilized to penalize the distance between the synthesized image and the ground-truth image:
\begin{equation} \mathcal {L}_{total} = \lambda _{recon} \mathcal {L}_{recon}+\lambda _{percep} \mathcal {L}_{percep}+\lambda _{adv} \mathcal {L}_{adv} \end{equation}
(9)
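A hedged sketch of the two training objectives follows; the λ weights are placeholders (the text does not specify them here), and the perceptual distance is passed in as an arbitrary callable (e.g., a VGG-feature metric).

```python
import torch
import torch.nn.functional as F

# Placeholder weights; the paper does not list the exact values here.
LAMBDA = dict(rgb=1.0, mask=0.1, recon=1.0, percep=1.0, adv=0.01)

def stage1_loss(pred_rgb, gt_rgb, pred_silhouette, gt_mask):
    """Eq. (6): pixel MSE over the sampled rays plus a BCE silhouette loss."""
    l_rgb = F.mse_loss(pred_rgb, gt_rgb)
    l_mask = F.binary_cross_entropy(pred_silhouette.clamp(1e-5, 1 - 1e-5), gt_mask)
    return LAMBDA['rgb'] * l_rgb + LAMBDA['mask'] * l_mask

def stage2_generator_loss(fake_img, real_img, d_fake_logits, percep_fn):
    """Eq. (9): l1 reconstruction + perceptual + non-saturating adversarial loss.

    percep_fn is any differentiable perceptual distance (e.g. VGG features).
    """
    l_recon = F.l1_loss(fake_img, real_img)
    l_percep = percep_fn(fake_img, real_img)
    l_adv = F.softplus(-d_fake_logits).mean()          # non-saturating GAN loss
    return (LAMBDA['recon'] * l_recon + LAMBDA['percep'] * l_percep
            + LAMBDA['adv'] * l_adv)

def r1_penalty(d_real_logits, real_img):
    """R1 regularization on the discriminator; real_img must have requires_grad=True."""
    grad, = torch.autograd.grad(d_real_logits.sum(), real_img, create_graph=True)
    return grad.pow(2).flatten(1).sum(1).mean()
```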

4.4.2 Full Head Re-animation.

After network training, the neural head avatar can be used to faithfully reconstruct the 4D training sequence and be observed from novel viewpoints. As shown in Figure 9, facial reenactment is achieved by transferring expression and pose information from the actor to the avatar. Specifically, given a monocular source video, we only need to extract pose and expression parameters from the estimated parametric model for each frame and combine these parameters with our pre-established avatar-specific facial model to generate the sequence of deformed mesh models serving as the network input. As for the conditional embedding vector, we use the average of all learned embeddings and keep it fixed during the test phase. Finally, the photo-realistic head appearance, which shares the same identity as the modeled avatar but carries the novel poses and expressions of the actor in the source video, is generated.
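The parameter transfer can be summarized as follows; all tensor shapes and names below are hypothetical stand-ins used only to illustrate which quantities come from the actor and which stay fixed to the avatar.

```python
import torch

# Hypothetical shapes: 1000 training frames with 32-D learnable embeddings,
# and a source-video tracker providing per-frame expression and pose vectors.
learned_embeddings = torch.randn(1000, 32)          # optimized during training
mean_embedding = learned_embeddings.mean(dim=0)     # fixed conditioning at test time

source_expression = torch.randn(76)                 # from the source-video tracker
source_head_pose = torch.randn(6)                   # rotation + translation

# The avatar keeps its own identity/shape parameters; only expression and pose
# are transferred from the actor (all dimensions here are illustrative).
identity_params = torch.randn(150)
driving_inputs = {
    'identity': identity_params,
    'expression': source_expression,
    'head_pose': source_head_pose,
    'embedding': mean_embedding,
}
```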
Fig. 9.
Fig. 9. Re-animation. For an established head avatar, we implement re-animation by transferring the pose and expression parameters from a facial model estimated from a source video to the avatar facial model.

4.4.3 Implementation Details.

We use the Adam optimizer to train our networks, with a learning rate of \(1\times 10^{-3}\) for the image-to-image translation module and \(5\times 10^{-4}\) for all other modules. We use 80 samples (64 from coarse sampling and 16 from fine sampling) per ray. The first stage of training takes about 12 hours, and the joint training takes about 36 hours on two NVIDIA 3090 GPUs, while rendering a color image at a resolution of 512×512 typically takes 0.15 seconds on one NVIDIA 3090 GPU.
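A minimal sketch of the optimizer setup with the two stated learning rates; the module definitions below are placeholders standing in for the actual networks.

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for the actual networks.
image2image_net = nn.Conv2d(64, 3, 1)
nerf_and_plane_modules = nn.ModuleList([nn.Linear(128, 65), nn.Conv2d(9, 64, 3)])

# Adam with the two learning rates stated above: 1e-3 for the image-to-image
# translation module, 5e-4 for everything else.
optimizer = torch.optim.Adam([
    {'params': image2image_net.parameters(), 'lr': 1e-3},
    {'params': nerf_and_plane_modules.parameters(), 'lr': 5e-4},
])
```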

5 Experiments

Dataset and Metrics. We separate the evaluation and comparison into two parts: monocular-video-based and multi-view-video-based experiments. Our monocular dataset contains the public sequence from I M Avatar [Zheng et al. 2022] and a self-made sequence captured with a phone. We collect multi-view sequences with six cameras focusing on the frontal face. All images are cropped and scaled to 512×512. We compute the foreground masks with BgMatting [Lin et al. 2020] and estimate the per-frame parametric facial model FaceVerse [Wang et al. 2022b] using the released code. Note that we also track eye gaze and additionally draw the position of the pupils on top of the RGB renderings. With each sequence split into training and testing frames, we train the networks using the training frames from all viewpoints and test animation quality on the testing frames. For quantitative evaluation, we use two standard metrics: peak signal-to-noise ratio (PSNR) and learned perceptual image patch similarity (LPIPS).
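For reference, PSNR can be computed directly from the mean squared error, while LPIPS is typically evaluated with the publicly released learned metric; the snippet below is a generic sketch, not the authors' evaluation code.

```python
import torch

def psnr(pred, gt, max_val=1.0):
    """Peak signal-to-noise ratio between two images with values in [0, max_val]."""
    mse = torch.mean((pred - gt) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)

# LPIPS is usually computed with the `lpips` package (shown here as an assumption
# about the evaluation setup, not as part of the paper's code):
#   import lpips
#   lpips_fn = lpips.LPIPS(net='alex')
#   score = lpips_fn(pred * 2 - 1, gt * 2 - 1)   # the metric expects inputs in [-1, 1]
```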

5.1 Comparisons

We mainly compare our method with state-of-the-art 3D head avatar modeling methods: NeRFace [Gafni et al. 2021], I M Avatar (IMA) [Zheng et al. 2022], RigNeRF [Athar et al. 2022], and Neural Head Avatar (NHA) [Grassal et al. 2022]. For the monocular setting, we also compare with the 2D facial reenactment method Head2Head++ (H2H++) [Doukas et al. 2021]. We conduct the comparison on the dataset of Zheng et al. [2022] and on our own data. For IMA, NHA, and H2H++, the released data preprocessing code is used to process the monocular videos, and we use our tracked data for NeRFace.2 As the code of RigNeRF is not open source, we re-implement it and train it with the data tracked by IMA's preprocessing code. To validate the expressiveness of our synthetic-rendering-based NeRF, we also provide a NeRF baseline (SynR-NeRF) without the image2image translation module.
Qualitative results are presented in Figure 10. For IMA and NHA, the texture heavily relies on shape reconstruction, and the appearance recovery is inferior. NeRFace cannot separate the head motion and is more prone to generating unstable results for unseen expressions, as it lacks the structural prior of the parametric model. H2H++ suffers from unrealistic image artifacts, especially when dealing with challenging head poses. As RigNeRF is built on a backward deformation field guided by a coarse 3DMM mesh, for unconstrained areas such as the mouth interior it tends to establish ambiguous correspondences and generate blurry appearance. Compared with the above approaches, our SynR-NeRF baseline is capable of full-head control and accurate reconstruction of the expressions and head poses, and our full pipeline moreover recovers high-frequency details. The quantitative results in Table 1 further demonstrate the superiority of our method. Note that, instead of focusing on pixel-wise similarity, our full pipeline further strengthens detail generation and increases perceptual similarity, as shown by the gap in LPIPS scores between our method and the others. We also illustrate the comparison on monocular-based animation in Figure 12. In this experiment, we utilize part of a video from the IMA dataset to drive an established head avatar. When dealing with novel expressions and poses that differ markedly from the training data, our approach shows significantly superior performance and robustness.
Table 1. Quantitative Evaluation for Monocular-Video Datasets

Method           | case 1 PSNR↑ | case 1 LPIPS↓ | case 2 PSNR↑ | case 2 LPIPS↓
NeRFace          | 26.47        | 0.221         | 24.45        | 0.164
IMA              | 25.59        | 0.208         | 23.90        | 0.166
NHA              | 19.54        | 0.154         | 18.21        | 0.158
RigNeRF          | 27.12        | 0.202         | 27.92        | 0.118
H2H++            | 24.39        | 0.258         | 27.12        | 0.154
Ours (SynR-NeRF) | 27.24        | 0.125         | 28.78        | 0.109
Ours             | 27.58        | 0.070         | 28.476       | 0.058

Case 1 refers to the top two rows of Figure 10, and case 2 refers to the bottom two rows.
Fig. 10.
Fig. 10. Comparison with the state-of-the-art methods on monocular video datasets. From left to right: ground truth images, I M Avatar [Zheng et al. 2022], Neural Head Avatar [Grassal et al. 2022], Head2Head++ [Doukas et al. 2021], NeRFace [Gafni et al. 2021], RigNeRF [Athar et al. 2022], our NeRF baseline, and ours. The results demonstrate the superior performance of our method in terms of realistic appearance recovery and fine-grained expression control.
Fig. 11.
Fig. 11. Comparison with other methods on multi-view video datasets. From left to right: ground truth images, Neural Head Avatar [Grassal et al. 2022], NeRFace [Gafni et al. 2021], our NeRF baseline, and ours. The results prove our ability to represent topology-varying objects (glasses) and recover view-consistent, high-fidelity appearance.
Fig. 12.
Fig. 12. Comparison with other methods for the task of head animation. From left to right: actor images, I M Avatar [Zheng et al. 2022], Neural Head Avatar [Grassal et al. 2022], Head2Head++ [Doukas et al. 2021], NeRFace [Gafni et al. 2021], RigNeRF [Athar et al. 2022], and ours. Results demonstrate the generalization of our method to novel expressions and poses.
For the multi-view setting, to the best of our knowledge, there is no existing method focusing on sparse-view-based head avatar modeling. To this end, similar to our extension to the multiview scenario, we extend NeRFace and NHA to NeRFace-MV and NHA-MV by adopting multiview parametric face model tracking and optimizing the avatar according to multi-view image evidence.3 Compared with monocular data, multi-view observations help model a more complete 3D head avatar, but they also cause more obvious misalignment between the estimated mesh models and the images due to the limited expressiveness of the parametric model, which raises more challenges for high-quality appearance recovery. Figure 11 illustrates the qualitative results from two different views, demonstrating that our method achieves fine-grained expression control and generates a view-consistent appearance. NeRFace tends to produce view-inconsistent artifacts, and NHA struggles to describe topology-varying parts such as glasses. The numeric results in Table 2 show that our method achieves higher accuracy in both metrics. We present the monocular-based animation results in Figure 12, which demonstrate our better performance on 3D reenactment.
Table 2. Quantitative Evaluation for Sparse-View-Video Datasets

Method           | case 1 PSNR↑ | case 1 LPIPS↓ | case 2 PSNR↑ | case 2 LPIPS↓
NeRFace-MV       | 21.77        | 0.239         | 19.90        | 0.247
NHA-MV           | 16.39        | 0.238         | 14.96        | 0.216
Ours (SynR-NeRF) | 22.66        | 0.122         | 20.06        | 0.150
Ours             | 23.83        | 0.078         | 21.65        | 0.095

Case 1 refers to the top two rows of Figure 11, and case 2 refers to the bottom two rows.

5.2 Ablation Study

Synthetic-rendering-based condition. Two modified baselines were implemented for this ablation study. The first, named “ExprPlanes-NeRF,” replaces our synthetic-rendering-based condition with the implicit vector-based condition used in NeRFace, with all else being equal. The second, named “ExprMLP-NeRF,” further replaces the orthogonal-planes-based neural representation with the deep MLP backbone used in NeRFace. To evaluate the effectiveness of our synthetic-rendering-based volumetric representation, we optimized a head avatar on a monocular video dataset using these two variants; the results are presented in Figure 13 and Table 3. Comparing “ExprPlanes-NeRF” with “ExprMLP-NeRF,” we find that applying the orthogonal-planes representation alone does not significantly improve performance. However, by using synthetic renderings as an explicit condition, our method achieves more accurate expression control.
Table 3. Ablation Study on Our Orthogonal Synthetic-rendering

                               | SynR-NeRF (Ours) | ExprPlanes-NeRF | ExprMLP-NeRF
Backbone: MLP                  |                  |                 | ✓
Backbone: Orthogonal planes    | ✓                | ✓               |
Condition: Expression vector   |                  | ✓               | ✓
Condition: Synthetic rendering | ✓                |                 |
PSNR↑                          | 26.05            | 22.93           | 22.10
LPIPS↓                         | 0.1516           | 0.1683          | 0.1780
Fig. 13.
Fig. 13. Ablation study on our orthogonal synthetic-rendering-based volumetric representation.
Image-to-image Translation Module. The results in Section 5.1 have shown that the image translation module effectively enhances fine-level details. We implement two further baselines for this ablation study to separately validate the choice of the 2D neural rendering network and the strategy of joint training: (1) We replace the image translation network with the up-sampling SR module used by Chan et al. [2022] and Niemeyer and Geiger [2021] and train the whole pipeline end-to-end. However, when attempting to train this network with adversarial loss functions, we empirically find it hard to maintain stable training. We argue that, without the encoder part, the up-sampling SR module alone is not suitable for a person-specific dataset with insufficient diversity; instead, \(\mathit {l}_1\) loss and perceptual loss are adopted in this experiment. (2) We independently train the image translation module with a GAN loss to super-resolve the images rendered from a frozen, pretrained SynR-NeRF. Experiments are conducted on a multi-view sequence, using five views for training and leaving one view for evaluation. As shown in Figure 14 and Table 4, the up-sampling SR-based baseline fails to generate fine details. The separate-training baseline operates primarily in image space and introduces undesirable inconsistent artifacts when dealing with the complex distribution of multi-view images. By training the whole framework end-to-end, our pipeline guarantees realistic detail generation.
Table 4. Ablation Study on Image Translation Module

          | Ours              | baseline 1   | baseline 2
2D module | Image Translation | Up-Sample SR | Image Translation
Training  | End-to-End        | End-to-End   | Separate
FID↓      | 7.67              | 8.62         | 17.34
Fig. 14.
Fig. 14. Ablation study on the Image Translation Module. Using the image translation module instead of the up-sample SR module contributes to recovering fine-scale details, and the joint training strategy further helps eliminate image-space artifacts. Please zoom in and also refer to our video for clearer comparisons.
Zero-posed Orthogonal Mesh Rendering. In our method, we orthogonally render the 3D facial model in zero pose to create a canonical feature volume for conditioning. In this part, we introduce a baseline called “Posed Rendering,” which renders the pose-dependent meshes to condition the orthogonal feature planes. The results in Figure 15 and Table 5 indicate that “Posed Rendering” performs worse on the testing set. We attribute this to the coupling of expressions and poses, which creates false correlations between the facial appearance and the face location in the renderings. The plane feature generators have to memorize the diverse potential locations in the input image space, which reduces performance. In contrast, our method orthogonally renders zero-posed meshes, enabling the generators to concentrate on extracting expression-related information from the renderings and to achieve fine-grained control.
Table 5. Ablation Study on Orthogonal Mesh Rendering

       | Posed Rendering | Ours (SynR-NeRF)
PSNR↑  | 27.67           | 29.79
LPIPS↓ | 0.1324          | 0.1219
Fig. 15.
Fig. 15. Ablation study on orthogonal mesh rendering. Besides the generated portrait images, we also show the front-view orthogonal renderings in the top left corner of the picture.
Conditional Learnable Embeddings. One strength of our method is that it solves the expression-shape coupling issue present in previous NeRF-based head avatar methods, which we attribute to our strategy of modulating the plane feature generators, as described in Section 4.1.2. For evaluation, similar to Figure 6, we implement a baseline in which the NeRF is conditioned in an auto-decoding fashion by feeding the learnable embeddings into the MLP decoder. To fully expose the expression-shape coupling issue, we fix the head pose and transfer only the expression to animate the avatar. The results are presented in the video. NeRFace and our modified baseline both show obvious jitter, while our method produces stable animation results and better appearance quality.
Pose Condition. As mentioned in Section 4.1.3, by introducing pose vectors into the condition, our method is able to describe pose-related non-rigid deformation in the canonical space. Figure 16 illustrates two cases. In the first, without the pose condition, artifacts are visible in the side view when the head turns around; our method, which includes the pose condition, eliminates these artifacts and enhances view consistency. In the second, we demonstrate the ability of our method to describe simple pose-related movements of hair.
Fig. 16.
Fig. 16. Ablation study on pose condition.

6 Discussion and Conclusion

Limitation. Although our approach is able to synthesize high-quality 3D-aware portrait images, the proxy shapes produced by our method are not competitive with state-of-the-art alternative approaches [Zheng et al. 2022; Grassal et al. 2022], as shown in Figure 19 and Figure 17. While this is not important for the photo-realistic, stable, view-consistent head image synthesis application we consider in this article, other applications may benefit from reconstructing more accurate morphable geometry.
Fig. 17.
Fig. 17. The proxy shapes of multiview-based avatars produced by our method. We visualize the 3D shape by running the marching cubes algorithm [Lorensen and Cline 1987] on the density output of our implicit radiance field to produce a surface mesh.
Fig. 18.
Fig. 18. Failure cases caused by out-of-distribution head poses and extreme expressions that cannot be expressed by the facial model. (a) Actor image. (b) Animation result. (c) Estimated facial model and keypoints (red: detected landmarks; green: projected 3D keypoints of the model). (d) Orthogonal front-view synthetic rendering. (e) Synthesized orthogonal front-view result. Note that the eyebrows of our result in (e) align well with the rendering in (d), but the parametric model cannot fully express the narrow eyebrows of the actor image.
Fig. 19.
Fig. 19. Comparison with EG3D on novel view synthesis.
Compared to surface-based avatar modeling methods [Zheng et al. 2022], our method struggles with out-of-distribution head poses. Additionally, because our method relies on a parametric model to control facial expressions, it is challenging to handle extreme expressions that cannot be expressed by the facial model, as depicted in Figure 18. Furthermore, while our method can capture simple pose-related deformation of long hair, it faces difficulties in dealing with challenging topology-varying cases caused by large hair movements. Special treatment of the hair region is an important direction for future work.
Conclusion. We introduce a novel modeling method that is the first to achieve high-resolution, photo-realistic, and view-consistent portrait synthesis for controllable human head avatars. By integrating the parametric face model with the neural radiance field, it offers expressive representation power for both topology and appearance, as well as fine-grained control over head poses and facial expressions. Utilizing learnable embeddings to modulate the feature generators, our method further stabilizes animation results. Besides monocular-video-based avatar modeling, we also present high-fidelity head avatars based on a sparse-view capture system. Compared to existing methods, the appearance quality and animation stability of our head avatar are significantly improved.

Footnotes

1
Removing the texture map is also feasible for person-specific avatar modeling, but we empirically find that adding texture renderings accelerates convergence.
2
For fair comparisons, we additionally take the position of the pupils, besides the expression parameters, as input.
3
As IMA relies on DECA [Feng et al. 2021] for tracking, which cannot be straightforwardly adapted to the multi-view setting, we do not include it in the multi-view experiments.
4
We set a learnable feature map in UV space that is shared across all frames and utilize a U-Net with 7 layers; the numbers of channels used are 64, 128, 256, 512, and 512. For each frame, the U-Net takes the shared learnable feature map and the per-frame UV normal map as input and generates an expression-related feature map in the UV parameterization. Specifically, the size of both the shared UV feature map and the generated expression-related UV feature map is 256×256×64.

Appendix

A Comparison with EG3D

EG3D [Chan et al. 2022] is a state-of-the-art generative model for HD 3D portraits; hence, we compare it with our method to demonstrate the ability for novel view synthesis. After fitting the pretrained EG3D model to a single reference frame [Roich et al. 2021], we render the reconstructed 3D head from different views. As Figure 19 shows, our method models a more vivid head avatar and produces more convincing novel view synthesis, which benefits from jointly learning from the temporal observations of the person-specific video data. Besides, EG3D performs worse on poses that are rare in the FFHQ dataset. As for geometry, since we only utilize monocular-view observations and do not apply any depth or density regularization, the geometry of our head avatar is noisier than EG3D's result.

B Comparison with UV-based NeRF Baseline

To validate the effectiveness of our adopted synthetic-rendering-based condition, we implement a mesh-conditioned baseline (UV-NeRF) that encodes a feature map defined on the UV parameterization4 instead of our orthogonal plane features. For each sample point, the local feature, obtained from the nearest surface vertices, serves as the input to the MLP decoder. This baseline network is trained under the same settings as our network. The experiment is conducted on a monocular dataset, and the results are presented in Figure 20. Not surprisingly, though UV-NeRF can accurately reconstruct the expression and reproduce a reliable facial appearance, it generates unrealistic artifacts at the edge region of the model mesh; notice the ambiguous appearance in the mouth and the sharp seam around the neck. Our synthetic-rendering-based condition fully utilizes the powerful convolutional network to learn reasonable correspondences between the facial model and the entire head appearance, synthesizing consistent and stable images.
Fig. 20.
Fig. 20. Ablation study on model condition manner. Representing the dynamic feature on the mesh surface causes unrealistic artifacts at the edge region of the model mesh. Notice the ambiguous appearance in the mouth and the sharp seam around the neck.

C Ablation Studies on the Textured Mesh Rendering Condition

To further validate our synthetic-rendering-based condition in detail, we implement two baselines that generate the feature volume directly from latent codes. The first, called “VectorPlane,” uses the expression parameters as input to the feature generation backbone, while the second, “VectorPlane(ExprMod),” feeds the expression parameters to a mapping network that modulates the convolutional kernels of the networks. Both baselines modulate the feature generation backbone with pose vectors and per-frame latent codes. As shown in Figure 21 and Table 6, learning the mapping from global vectors to appearances tends to overfit the training sequences, and the ability of expression control degrades for out-of-distribution expressions.
Table 6. Ablation Studies on the Textured Mesh Rendering Condition

       | VectorPlane (ExprMod) | VectorPlane | woTexture | SynR-NeRF
PSNR↑  | 26.99                 | 27.53       | 28.40     | 30.20
LPIPS↓ | 0.1446                | 0.1358      | 0.1268    | 0.1243
Fig. 21.
Fig. 21. Ablation studies on the textured mesh rendering condition. The last row visualizes the synthetic renderings of the front view.
We now analyze the differences between the two conditioning methods. Two similar expressions may be too close in parameter space for a global-vector-conditioned feature generator to distinguish them. Our synthetic-rendering-based conditioning preserves the spatial alignment between the mesh renderings and the feature planes, so local spatial changes in the rendering image space are reflected in the feature volume. Additionally, as we align individual local features at the pixel level of the renderings with the global context of the entire appearance, the model is more likely to infer plausible results for unseen expressions.
Besides, we also implement a baseline, named “woTexture,” that utilizes only normal and mask renderings to condition the feature volume generation. Removing the texture rendering is feasible for person-specific avatar modeling and does not significantly affect the robustness of expression control. However, despite the similar numerical results, the visualization results of our method exhibit more detailed appearance around the eyes. We hypothesize that the texture rendering contains more high-frequency information around the eye region, as shown in the last row of Figure 21, which may help the network effectively learn the dynamic appearance around the eyes.

D Comparison with 2D Re-enactment-based Baseline

By omitting the NeRF module, we implement a 2D re-enactment method that utilizes only our image2image translation module. In this setting, the image2image translation network takes the renderings of the fitted 3DMM in the observed view as input and generates the corresponding 2D images. This 2D re-enactment method has limitations. First, it cannot establish an avatar from a multi-view dataset, because it cannot differentiate between camera poses and head poses. Second, when applied to monocular videos, it is sensitive to the location of the facial model in the image space, as shown in Figure 22, and is prone to generating artifacts or distorted faces, particularly at the edges of the image.
Fig. 22.
Fig. 22. Comparison with a 2D re-enactment baseline on monocular video datasets. The results demonstrate the superior performance of our method in terms of realistic appearance recovery and robust expression/pose control.
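For reference, the 2D re-enactment baseline reduces to a single translation pass per frame, with no 3D sampling in between; the sketch below is illustrative, and both callables are placeholders rather than our actual modules.

    import torch

    def reenact_frame(translator, render_3dmm, fitted_3dmm, pose, expression, camera):
        """One frame of the 2D re-enactment baseline (sketch, placeholder callables).

        translator:  image-to-image translation network producing an RGB frame
        render_3dmm: rasterizer for the fitted facial model in the observed view
        """
        with torch.no_grad():
            cond = render_3dmm(fitted_3dmm, pose, expression, camera)  # (1, C, H, W) renderings
            return translator(cond)                                    # (1, 3, H, W) output image

Because the conditioning lives entirely in the 2D image space of a single view, a change of camera pose and a change of head pose produce indistinguishable inputs, which is exactly the limitation noted above.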

E Multi-view Setting

When fitting the per-frame 3DMM, we use the landmarks detected in the multi-view images captured at the same instant as supervision and additionally estimate the scale parameter of the 3DMM. For NeRF optimization, we simply leverage the multi-view images of the same frame to supervise the appearance, with all loss terms and the training strategy identical to the monocular setting.

F Data Preprocessing

We optimize the shape and texture parameters using the first few frames of the video, and these parameters remain fixed for the remaining frames. For each frame's 3DMM fitting, we optimize the pose, expression, and illumination parameters, initialized from the previous frame's fitting results. Once the 3DMM is tracked, we use PyTorch3D [Ravi et al. 2020] to render the per-frame synthetic renderings onto the orthogonal planes. The tracking code mainly comes from the open-source FaceVerse project. Besides, to perform eye gaze tracking, we segment the dark area within the eye region in the given frame. The centroid of the dark area is considered the pupil, and we calculate the pupil's relative position inside the eye based on the detected landmarks surrounding the eye. Finally, we mark the pupil as a small dot in the front-view orthogonal renderings.
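A minimal sketch of the gaze-tracking step described above is given below, using OpenCV-style operations; the threshold value, the landmark layout, and the function name are illustrative assumptions.

    import cv2
    import numpy as np

    def track_pupil(frame_gray, eye_landmarks, dark_thresh=40):
        """Locate the pupil inside one eye region and return its relative position.

        frame_gray:    (H, W) grayscale frame
        eye_landmarks: (K, 2) 2D landmarks surrounding the eye, in pixel coordinates
        dark_thresh:   intensity threshold for the dark pupil/iris area (assumed value)
        """
        # Restrict to the eye region spanned by the surrounding landmarks.
        x0, y0 = eye_landmarks.min(axis=0).astype(int)
        x1, y1 = eye_landmarks.max(axis=0).astype(int)
        eye = frame_gray[y0:y1, x0:x1]

        # Segment the dark area and take its centroid as the pupil center.
        mask = (eye < dark_thresh).astype(np.uint8)
        m = cv2.moments(mask, binaryImage=True)
        if m["m00"] == 0:                       # eye closed or segmentation failed
            return None
        cx, cy = m["m10"] / m["m00"], m["m01"] / m["m00"]

        # Relative position inside the eye box, in [0, 1] x [0, 1].
        return np.array([cx / max(x1 - x0, 1), cy / max(y1 - y0, 1)])

The relative position is then used to draw the pupil as a small dot in the front-view orthogonal rendering.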

G Inference Time

Table 7 shows the detailed time consumption during inference. Rendering a 512 × 512 color image takes 0.15 seconds on one NVIDIA 3090 GPU; the most time-intensive part is the volume rendering of the NeRF module.
Stage   Data IO   Feature Plane Generation   NeRF Module   2D Image Module   Total
Time    47.4 ms   22.3 ms                    58 ms         24.5 ms           152.2 ms
Table 7. Detailed Time Consumption during Inference
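The per-stage numbers in Table 7 can be measured with straightforward per-module timing; the sketch below assumes a CUDA pipeline, and the stage callables in the usage comment are placeholders for the corresponding modules.

    import time
    import torch

    def time_stage(fn, *args, warmup=3, iters=20):
        """Average wall-clock time of one pipeline stage in milliseconds.
        CUDA kernels launch asynchronously, so synchronize around the timed region."""
        for _ in range(warmup):
            fn(*args)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            fn(*args)
        torch.cuda.synchronize()
        return (time.perf_counter() - start) / iters * 1000.0

    # Usage (stage callables are placeholders for the corresponding modules):
    # for name, fn, args in [("Feature Plane Generation", gen_planes, (renderings,)),
    #                        ("NeRF Module", nerf_render, (planes, rays)),
    #                        ("2D Image Module", translator, (feature_image,))]:
    #     print(name, f"{time_stage(fn, *args):.1f} ms")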

References

[1]
Chen Cao, Hongzhi Wu, Yanlin Weng, Tianjia Shao, and Kun Zhou. 2016. Real-time facial animation with image-based dynamic avatars. ACM Trans. Graph. 35, 4 (2016).
[2]
ShahRukh Athar, Zexiang Xu, Kalyan Sunkavalli, Eli Shechtman, and Zhixin Shu. 2022. RigNeRF: Fully controllable neural 3D portraits. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR’22).
[3]
Volker Blanz and Thomas Vetter. 1999. A morphable model for the synthesis of 3D faces. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques.
[4]
Alan Brunton, Timo Bolkart, and Stefanie Wuhrer. 2014. Multilinear wavelets: A statistical shape space for human faces. In Proceedings of the European Conference on Computer Vision (ECCV’14).
[5]
Chen Cao, Vasu Agrawal, Fernando De La Torre, Lele Chen, Jason Saragih, Tomas Simon, and Yaser Sheikh. 2021. Real-time 3D neural facial animation from binocular video. ACM Trans. Graph. 40, 4 (2021), 1–17.
[6]
Chen Cao, Tomas Simon, Jin Kyu Kim, Gabe Schwartz, Michael Zollhoefer, Shun-Suke Saito, Stephen Lombardi, Shih-En Wei, Danielle Belko, Shoou-I Yu, Yaser Sheikh, and Jason Saragih. 2022. Authentic volumetric avatars from a phone scan. ACM Trans. Graph. 41, 4 (2022), 1–19.
[7]
Chen Cao, Yanlin Weng, Shun Zhou, Yiying Tong, and Kun Zhou. 2014. FaceWarehouse: A 3D facial expression database for visual computing. IEEE Trans. Visualiz. Comput. Graph. 20, 3 (2014), 413–425.
[8]
Eric R. Chan, Connor Z. Lin, Matthew A. Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas J. Guibas, Jonathan Tremblay, Sameh Khamis, Tero Karras, and Gordon Wetzstein. 2022. Efficient geometry-aware 3D generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’22). 16123–16133.
[9]
Xu Chen, Yufeng Zheng, Michael J. Black, Otmar Hilliges, and Andreas Geiger. 2021. SNARF: Differentiable forward skinning for animating non-rigid neural implicit shapes. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV’21). 11594–11604.
[10]
Zhuo Chen, Chaoyue Wang, Bo Yuan, and Dacheng Tao. 2020. PuppeteerGAN: Arbitrary portrait animation with semantic-aware appearance transformation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’20). 13518–13527.
[11]
Shiyang Cheng, Michael Bronstein, Yuxiang Zhou, Irene Kotsia, Maja Pantic, and Stefanos Zafeiriou. 2019. MeshGAN: Non-linear 3D morphable models of faces. arXiv preprint arXiv:1903.10384 (2019).
[12]
Radek Danecek, Michael J. Black, and Timo Bolkart. 2022. EMOCA: Emotion driven monocular face capture and animation. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR’22).
[13]
Michail Christos Doukas, Mohammad Rami Koujan, Viktoriia Sharmanska, Anastasios Roussos, and Stefanos Zafeiriou. 2021. Head2Head++: Deep facial attributes re-targeting. IEEE Trans. Biomet., Behav., Ident. Sci. 3, 1 (2021), 31–43.
[14]
Yao Feng, Haiwen Feng, Michael J. Black, and Timo Bolkart. 2021. Learning an animatable detailed 3D face model from in-the-wild images. ACM Trans. Graph. 40, 4, Article 88 (2021), 13 pages. DOI:
[15]
Guy Gafni, Justus Thies, Michael Zollhofer, and Matthias Nießner. 2021. Dynamic neural radiance fields for monocular 4D facial avatar reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’21). 8649–8658.
[16]
Rinon Gal, Dana Cohen Hochberg, Amit Bermano, and Daniel Cohen-Or. 2021. SWAGAN: A style-based wavelet-driven generative model. ACM Trans. Graph. 40, 4 (2021), 1–11.
[17]
Baris Gecer, Stylianos Ploumpis, Irene Kotsia, and Stefanos Zafeiriou. 2021. Fast-GANFIT: Generative adversarial network for high fidelity 3D face reconstruction. IEEE Trans. Pattern Anal. Mach. Intell. 44, 9 (2021), 4879–4893.
[18]
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. Adv. Neural Inf. Process. Syst. 27 (2014).
[19]
Philip-William Grassal, Malte Prinzler, Titus Leistner, Carsten Rother, Matthias Nießner, and Justus Thies. 2022. Neural head avatars from monocular RGB videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’22). 18653–18664.
[20]
Jiatao Gu, Lingjie Liu, Peng Wang, and Christian Theobalt. 2021. StyleNeRF: A style-based 3D-aware generator for high-resolution image synthesis. arXiv preprint arXiv:2110.08985 (2021).
[21]
Yudong Guo, Keyu Chen, Sen Liang, Yongjin Liu, Hujun Bao, and Juyong Zhang. 2021. AD-NeRF: Audio driven neural radiance fields for talking head synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV’21).
[22]
Yang Hong, Bo Peng, Haiyao Xiao, Ligang Liu, and Juyong Zhang. 2022. HeadNeRF: A real-time NeRF-based parametric head model. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR’22).
[23]
Liwen Hu, Shunsuke Saito, Lingyu Wei, Koki Nagano, Jaewoo Seo, Jens Fursund, Iman Sadeghi, Carrie Sun, Yen-Chun Chen, and Hao Li. 2017. Avatar digitization from a single image for real-time rendering. ACM Trans. Graph. 36, 6 (2017), 1–14.
[24]
Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. 2017. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196 (2017).
[25]
Tero Karras, Samuli Laine, and Timo Aila. 2021. A style-based generator architecture for generative adversarial networks. IEEE Trans. Pattern Anal. Mach. Intell. 43, 12 (2021), 4217–4228.
[26]
Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. 2020. Analyzing and improving the image quality of StyleGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’20). 8110–8119.
[27]
Petr Kellnhofer, Lars Jebe, Andrew Jones, Ryan Spicer, Kari Pulli, and Gordon Wetzstein. 2021. Neural lumigraph rendering. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR’21).
[28]
Hyeongwoo Kim, Pablo Garrido, Ayush Tewari, Weipeng Xu, Justus Thies, Matthias Niessner, Patrick Pérez, Christian Richardt, Michael Zollhöfer, and Christian Theobalt. 2018. Deep video portraits. ACM Trans. Graph. 37, 4 (2018), 1–14.
[29]
Mohammad Rami Koujan, Michail Christos Doukas, Anastasios Roussos, and Stefanos Zafeiriou. 2020. Head2Head: Video-based neural head synthesis. In Proceedings of the 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG’20). IEEE, 16–23.
[30]
Alexandros Lattas, Stylianos Moschoglou, Stylianos Ploumpis, Baris Gecer, Abhijeet Ghosh, and Stefanos Zafeiriou. 2021. AvatarMe++: Facial shape and BRDF inference with photorealistic rendering-aware GANs. IEEE Trans. Pattern Anal. Mach. Intell. 44, 12 (2021), 9269–9284.
[31]
Hao Li, Thibaut Weise, and Mark Pauly. 2010. Example-based facial rigging. ACM Trans. Graph. 29, 4, Article 32 (2010), 6 pages.
[32]
Ruilong Li, Karl Bladin, Yajie Zhao, Chinmay Chinara, Owen Ingraham, Pengda Xiang, Xinglei Ren, Pratusha Prasad, Bipin Kishore, Jun Xing, et al. 2020. Learning formation of physically-based face attributes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’20). 3410–3419.
[33]
Tianye Li, Timo Bolkart, Michael J. Black, Hao Li, and Javier Romero. 2017. Learning a model of facial shape and expression from 4D scans. ACM Trans. Graph. 36, 6, Article 194 (2017), 17 pages.
[34]
Shanchuan Lin, Andrey Ryabtsev, Soumyadip Sengupta, Brian Curless, Steve Seitz, and Ira Kemelmacher-Shlizerman. 2020. Real-time high-resolution background matting. arXiv preprint (2020).
[35]
Stephen Lombardi, Jason Saragih, Tomas Simon, and Yaser Sheikh. 2018. Deep appearance models for face rendering. ACM Trans. Graph. 37, 4 (2018), 1–13.
[36]
Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel Schwartz, Andreas Lehrmann, and Yaser Sheikh. 2019. Neural volumes: Learning dynamic renderable volumes from images. arXiv preprint arXiv:1906.07751 (2019).
[37]
Stephen Lombardi, Tomas Simon, Gabriel Schwartz, Michael Zollhoefer, Yaser Sheikh, and Jason Saragih. 2021. Mixture of volumetric primitives for efficient neural rendering. ACM Trans. Graph. 40, 4 (2021), 1–13.
[38]
William E. Lorensen and Harvey E. Cline. 1987. Marching cubes: A high resolution 3D surface construction algorithm. ACM SIGGRAPH Comput. Graph. 21, 4 (1987), 163–169.
[39]
Huiwen Luo, Koki Nagano, Han-Wei Kung, Qingguo Xu, Zejian Wang, Lingyu Wei, Liwen Hu, and Hao Li. 2021. Normalized avatar synthesis using StyleGAN and perceptual refinement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’21). 11662–11672.
[40]
Shugao Ma, Tomas Simon, Jason Saragih, Dawei Wang, Yuecheng Li, Fernando De La Torre, and Yaser Sheikh. 2021. Pixel Codec avatars. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’21). 64–73.
[41]
Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. 2018. Which training methods for GANs do actually converge? In Proceedings of the International Conference on Machine Learning. PMLR, 3481–3490.
[42]
Marko Mihajlovic, Aayush Bansal, Michael Zollhoefer, Siyu Tang, and Shunsuke Saito. 2022. KeypointNeRF: Generalizing image-based volumetric avatars using relative spatial encoding of keypoints. In Proceedings of the 17th European Conference on Computer Vision (ECCV’22). Springer, 179–197.
[43]
Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. 2020. NeRF: Representing scenes as neural radiance fields for view synthesis. In Proceedings of the European Conference on Computer Vision (ECCV’20). Springer, 405–421.
[44]
Koki Nagano, Huiwen Luo, Zejian Wang, Jaewoo Seo, Jun Xing, Liwen Hu, Lingyu Wei, and Hao Li. 2019. Deep face normalization. ACM Trans. Graph. 38, 6 (2019), 1–16.
[45]
Koki Nagano, Jaewoo Seo, Jun Xing, Lingyu Wei, Zimo Li, Shunsuke Saito, Aviral Agarwal, Jens Fursund, and Hao Li. 2018. paGAN: Real-time avatars using dynamic textures. ACM Trans. Graph. 37, 6 (2018), 1–12.
[46]
Thomas Neumann, Kiran Varanasi, Stephan Wenger, Markus Wacker, Marcus Magnor, and Christian Theobalt. 2013. Sparse localized deformation components. ACM Trans. Graph. 32, 6, Article 179 (2013), 10 pages.
[47]
Michael Niemeyer and Andreas Geiger. 2021. GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’21).
[48]
Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. 2019. DeepSDF: Learning continuous signed distance functions for shape representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’19). 165–174.
[49]
Keunhong Park, Utkarsh Sinha, Jonathan T. Barron, Sofien Bouaziz, Dan B. Goldman, Steven M. Seitz, and Ricardo Martin-Brualla. 2021a. Nerfies: Deformable neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV’21). 5865–5874.
[50]
Keunhong Park, Utkarsh Sinha, Peter Hedman, Jonathan T. Barron, Sofien Bouaziz, Dan B. Goldman, Ricardo Martin-Brualla, and Steven M. Seitz. 2021b. HyperNeRF: A higher-dimensional representation for topologically varying neural radiance fields. arXiv preprint arXiv:2106.13228 (2021).
[51]
Nikhila Ravi, Jeremy Reizenstein, David Novotny, Taylor Gordon, Wan-Yen Lo, Justin Johnson, and Georgia Gkioxari. 2020. Accelerating 3D deep learning with PyTorch3D. arXiv preprint (2020).
[52]
Daniel Roich, Ron Mokady, Amit H. Bermano, and Daniel Cohen-Or. 2021. Pivotal tuning for latent-based editing of real images. ACM Trans. Graph. 42, 1, Article 6 (2023), 13 pages.
[53]
Ayush Tewari, Michael Zollhöfer, Pablo Garrido, Florian Bernard, Hyeongwoo Kim, Patrick Pérez, and Christian Theobalt. 2018. Self-supervised multi-level face model learning for monocular reconstruction at over 250 Hz. In Proceedings of the Computer Vision and Pattern Recognition (CVPR’18).
[54]
Justus Thies, Mohamed Elgharib, Ayush Tewari, Christian Theobalt, and Matthias Nießner. 2020. Neural voice puppetry: Audio-driven facial reenactment. In Proceedings of the European Conference on Computer Vision (ECCV’20). Springer, 716–731.
[55]
Justus Thies, Michael Zollhöfer, and Matthias Nießner. 2019. Deferred neural rendering: Image synthesis using neural textures. ACM Trans. Graph. 38, 4 (2019), 1–12.
[56]
Luan Tran, Feng Liu, and Xiaoming Liu. 2019. Towards high-fidelity nonlinear 3D face morphable model. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR’19).
[57]
Luan Tran and Xiaoming Liu. 2018. Nonlinear 3D face morphable model. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR’18).
[58]
Daniel Vlasic, Matthew Brand, Hanspeter Pfister, and Jovan Popovic. 2006. Face transfer with multilinear models. In Proceedings of the SIGGRAPH Conference.
[59]
Daoye Wang, Prashanth Chandran, Gaspard Zoss, Derek Bradley, and Paulo Gotardo. 2022a. MoRF: Morphable radiance fields for multiview neural head modeling. In Proceedings of the ACM SIGGRAPH Conference. 1–9.
[60]
Lizhen Wang, Zhiyuan Chen, Tao Yu, Chenguang Ma, Liang Li, and Yebin Liu. 2022b. FaceVerse: A fine-grained and Detail-controllable 3D face morphable model from a Hybrid dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’22). 20333–20342.
[61]
Ziyan Wang, Timur Bagautdinov, Stephen Lombardi, Tomas Simon, Jason Saragih, Jessica Hodgins, and Michael Zollhofer. 2021. Learning compositional radiance fields of dynamic human heads. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’21). 5704–5713.
[62]
Chung-Yi Weng, Brian Curless, Pratul P. Srinivasan, Jonathan T. Barron, and Ira Kemelmacher-Shlizerman. 2022. Humannerf: Free-viewpoint rendering of moving people from monocular video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’22). 16210–16220.
[63]
Sicheng Xu, Jiaolong Yang, Dong Chen, Fang Wen, Yu Deng, Yunde Jia, and Xin Tong. 2020. Deep 3D portrait from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’20). 7710–7720.
[64]
Haotian Yang, Hao Zhu, Yanru Wang, Mingkai Huang, Qiu Shen, Ruigang Yang, and Xun Cao. 2020. FaceScape: A large-scale high quality 3D face dataset and detailed riggable 3D face prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’20). 601–610.
[65]
Lior Yariv, Yoni Kasten, Dror Moran, Meirav Galun, Matan Atzmon, Basri Ronen, and Yaron Lipman. 2020. Multiview neural surface reconstruction by disentangling geometry and appearance. Adv. Neural Inf. Process. Syst. 33 (2020), 2492–2502.
[66]
Tarun Yenamandra, Ayush Tewari, Florian Bernard, Hans-Peter Seidel, Mohamed Elgharib, Daniel Cremers, and Christian Theobalt. 2021. i3DMM: Deep implicit 3D morphable model of human heads. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’21). 12803–12813.
[67]
Egor Zakharov, Aliaksandra Shysheya, Egor Burkov, and Victor Lempitsky. 2019. Few-shot adversarial learning of realistic neural talking head models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV’19). 9459–9468.
[68]
Yufeng Zheng, Victoria Fernández Abrevaya, Xu Chen, Marcel C. Bühler, Michael J. Black, and Otmar Hilliges. 2022. IM Avatar: Implicit morphable head avatars from videos. In Proceedings of the Computer Vision and Pattern Recognition (CVPR’22).



