1 Introduction
Animatable 3D human head avatar modeling is of great significance in many applications such as VR/AR games and telepresence. There are two key factors for a lifelike virtual personal character: the accuracy of facial expression control and the realism of portrait image synthesis. Though existing solutions [Lombardi et al. 2018, 2021; Ma et al. 2021] are able to reconstruct high-quality dynamic human heads, they typically depend on complicated dense-view capture systems and may even rely on hundreds of cameras. By leveraging learning-based techniques, researchers have shifted their interest to exploring the possibility of automatically modeling human head avatars, with accurate controllability and high-fidelity appearance, under a lightweight setup.
First, to establish a controllable personalized head character, the most straightforward way is to directly learn a global-parameter-conditioned neural head avatar from image sequences, but such a method [Gafni et al. 2021] has limited generalization ability in expression control. To improve control robustness, other works [Zheng et al. 2022; Grassal et al. 2022] attempt to leverage parametric templates [Li et al. 2017] to help regulate the avatar modeling during the training stage. However, the explicit surface prior from the parametric model constrains the expressive power for complex-topology parts (e.g., glasses).
Second, for high-fidelity human head avatar modeling, recent implicit-surface-based methods [Yenamandra et al. 2021; Zheng et al. 2022; Grassal et al. 2022] recover more texture details than conventional methods [Yang et al. 2020; Cao et al. 2014; Li et al. 2017; Wang et al. 2022b] with limited-resolution texture representations. Nevertheless, the quality of the recovered appearance is still far from satisfactory. Built on the expressive neural radiance field (NeRF) [Mildenhall et al. 2020], NeRFace [Gafni et al. 2021] is able to generate more promising dynamic appearance results. However, based on an MLP backbone, it is trained in an auto-decoding fashion and tends to overfit the training sequences, leading to obvious shape inconsistency across frames and unnatural head shaking in the test phase.
Combining the expressiveness of NeRF with the prior information from the parametric template is a promising way to achieve fine-grained expression control and realistic portrait synthesis. Recent work [Athar et al. 2022] establishes a deformable-mesh-guided dynamic NeRF for head avatar modeling. However, the prominent challenge in coupling geometry models with NeRF comes from the difficulty of establishing reliable dense correspondences between the real-world subject and the fitted parametric template. Due to the limited expressiveness of the morphable model, it is hard for the deformed mesh to perfectly align with real-world heads, which exhibit high diversity in geometry and topology. As a result of this obvious misalignment, the spatial sampling points in the neural radiance field tend to establish ambiguous correspondences with the mesh surface, leading to blurry or unstable rendering performance.
In this article, we introduce a novel Parametric Model-conditioned Neural Radiance Field for Human Head Avatar. Inspired by the effective rendering-to-video translation architecture adopted by Kim et al. [2018], we extend the synthetic-rendering-based condition for 3D head control by integrating it with a triplane-based neural volumetric representation [Chan et al. 2022]. The dynamic head character is conditioned on axis-aligned feature planes generated from orthogonal renderings of the textured fitted parametric model in the canonical space. We leverage a powerful convolutional network to learn reasonable correspondences between the canonical synthetic renderings and the observed head appearance, hence avoiding the ambiguous correspondences determined by Euclidean distance. On the one hand, such a synthetic-rendering-based condition introduces the prior of the fully controllable 3D facial model into the neural representation to achieve fine-grained and consistent expression control. On the other hand, orthogonal renderings supply rough 3D descriptions and avoid excessive restriction from the coarse geometry of the model mesh, so our head avatar is capable of describing complex topological structures. Considering that the dynamic content mainly comes from facial expressions, we utilize a facial parametric model rather than a full-head model in practice, leaving only the facial region to benefit from the model's prior.
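To make this conditioning concrete, below is a minimal PyTorch-style sketch of how orthogonal renderings of the fitted model could be encoded into axis-aligned feature planes and queried for 3D sample points. The module name, channel sizes, and the shallow convolutional encoder are illustrative assumptions, not the exact architecture used in this work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OrthoPlaneGenerator(nn.Module):
    """Hypothetical sketch: encode three orthogonal renderings of the fitted
    facial model (rendered in canonical space) into axis-aligned feature planes."""

    def __init__(self, in_ch=3, feat_ch=32):
        super().__init__()
        # Shared convolutional encoder per orthogonal rendering
        # (the real generator is presumably much deeper, e.g., U-Net-like).
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1),
        )

    def forward(self, renders):
        # renders: (B, 3, in_ch, H, W) -- xy, xz, yz orthogonal views
        B, V, C, H, W = renders.shape
        planes = self.encoder(renders.view(B * V, C, H, W))
        return planes.view(B, V, -1, H, W)       # (B, 3, feat_ch, H, W)

def sample_triplane(planes, pts):
    """Bilinearly sample the three planes at projected 3D points and fuse.
    planes: (B, 3, C, H, W); pts: (B, N, 3) in [-1, 1] canonical coordinates."""
    coords = [pts[..., [0, 1]], pts[..., [0, 2]], pts[..., [1, 2]]]
    feats = []
    for i, uv in enumerate(coords):
        grid = uv.unsqueeze(2)                                    # (B, N, 1, 2)
        f = F.grid_sample(planes[:, i], grid, align_corners=True) # (B, C, N, 1)
        feats.append(f.squeeze(-1).permute(0, 2, 1))              # (B, N, C)
    return sum(feats)  # fused per-point feature, later decoded to density and color
```

Because the planes are produced from renderings in the canonical space, the correspondence between a sample point and the model surface is learned by the convolutional encoder rather than fixed by a nearest-surface lookup.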
Moreover, while retaining the powerful appearance expressiveness of NeRF [Mildenhall et al. 2020], our method also overcomes the inconsistent-shape issue that commonly occurs in NeRF-based modeling methods [Gafni et al. 2021]. Based on our synthetic-rendering-based orthogonal-plane representation, we utilize learnable embeddings to modulate the plane feature generators rather than conditioning the MLP decoder in an auto-decoding fashion like NeRFace [Gafni et al. 2021]. By modulating the convolutional kernels and normalizing the feature generation, the embeddings are able to regulate the whole feature volume and avoid overfitting, leading to a consistent head shape during animation. Our experiments show that, with per-frame embeddings modulating the convolutional generators, the shape consistency and animation stability of our head avatar are significantly improved.
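As a rough illustration of this strategy, the snippet below sketches a StyleGAN2-style modulated convolution in which a per-frame embedding scales the kernel per input channel and the kernel is then demodulated (normalized). The layer sizes and the linear mapping from embedding to scales are assumptions for the sake of the example, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModulatedConv2d(nn.Module):
    """Sketch of an embedding-modulated convolution: a per-frame embedding scales
    the kernel per input channel, and the kernel is re-normalized (demodulated)
    so the conditioning regulates features without changing their magnitude."""

    def __init__(self, in_ch, out_ch, embed_dim, k=3):
        super().__init__()
        self.out_ch = out_ch
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.1)
        self.to_scale = nn.Linear(embed_dim, in_ch)   # embedding -> per-channel scales
        self.pad = k // 2

    def forward(self, x, embed):
        # x: (B, in_ch, H, W); embed: (B, embed_dim) per-frame embedding
        B, C, H, W = x.shape
        scale = self.to_scale(embed).view(B, 1, C, 1, 1) + 1.0
        w = self.weight.unsqueeze(0) * scale                       # (B, out, in, k, k)
        demod = torch.rsqrt((w ** 2).sum(dim=(2, 3, 4)) + 1e-8)    # (B, out)
        w = w * demod.view(B, self.out_ch, 1, 1, 1)
        # Grouped convolution applies a distinct modulated kernel per batch element.
        out = F.conv2d(x.view(1, B * C, H, W),
                       w.view(B * self.out_ch, C, *w.shape[-2:]),
                       padding=self.pad, groups=B)
        return out.view(B, self.out_ch, H, W)
```

Stacking such layers in the plane generators lets a single per-frame embedding influence the entire feature volume, rather than only the input of an MLP decoder as in auto-decoding conditioning.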
Finally, our method inherits the advantages of NeRF [Mildenhall et al. 2020], which intrinsically supports differentiable rendering and maintains multiview consistency. Thanks to this strength, we further integrate NeRF-based volume rendering with neural rendering and optimize the whole architecture end-to-end with image observations to recover facial details. Specifically, by leveraging the effective image-to-image translation networks commonly used in portrait video synthesis [Zakharov et al. 2019; Xu et al. 2020; Thies et al. 2019; Chen et al. 2020; Kim et al. 2018], we translate the rendered 3D-aware feature map into RGB images. By training the overall network in an adversarial manner, our solution is the first to achieve high-resolution and view-consistent photo-realistic synthesis for a 3D head avatar, as illustrated in Figure.
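A condensed sketch of this last stage is shown below: per-sample features are alpha-composited by standard NeRF volume rendering into a 3D-aware feature map, which a small image-to-image translator then maps to RGB. The tiny translator architecture and feature dimensions are placeholders; the actual network is presumably a deeper translation module trained with an adversarial loss.

```python
import torch
import torch.nn as nn

def volume_render_features(sigma, feats, deltas):
    """Alpha-composite per-sample features along each ray.
    sigma: (R, S) densities, feats: (R, S, C) features, deltas: (R, S) spacings."""
    alpha = 1.0 - torch.exp(-sigma * deltas)                               # (R, S)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=1),
        dim=1)[:, :-1]                                                     # accumulated transmittance
    weights = alpha * trans                                                # (R, S)
    return (weights.unsqueeze(-1) * feats).sum(dim=1)                      # (R, C)

class FeatureToImage(nn.Module):
    """Placeholder image-to-image translator from the rendered feature map to RGB."""
    def __init__(self, feat_ch=32, out_ch=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(feat_ch, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, out_ch, 3, padding=1), nn.Tanh(),
        )
    def forward(self, feat_map):      # (B, feat_ch, H, W) -> (B, 3, H, W)
        return self.net(feat_map)
```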
Given monocular or sparse-view videos, after fitting per-frame 3D facial models with an off-the-shelf tracker, our approach is able to learn a high-fidelity and view-consistent personalized 3D head avatar, including hair, accessories, and torso, under full control of head poses and facial expressions. Meanwhile, we also optimize a linear blend skinning (LBS) weight field, which decouples the motions of the head and the torso via backward warping. At test time, given a single-view driving video, pose and expression parameters are extracted to deform the facial model, and our method can faithfully recover the entire head appearance under novel expressions, poses, and viewpoints.
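As a rough sketch of this head/torso decoupling, the function below uses a learned per-point blend weight to interpolate between the inverse of the head's rigid transform and the (static) torso, warping observed-space samples back into the canonical space. The single-bone formulation and the small weight-field MLP are simplifying assumptions for illustration only.

```python
import torch
import torch.nn as nn

class LBSWeightField(nn.Module):
    """Hypothetical weight field: maps an observed-space point to a scalar in
    [0, 1] indicating how strongly it follows the head's rigid motion."""
    def __init__(self, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )
    def forward(self, x):              # x: (N, 3)
        return self.mlp(x)             # (N, 1)

def backward_warp(x_obs, R_head, t_head, weight_field):
    """Warp observed-space samples into the canonical space.
    x_obs: (N, 3); R_head: (3, 3) rotation, t_head: (3,) translation of the head."""
    w = weight_field(x_obs)                          # (N, 1) blend weight
    # Inverse rigid transform of the head; torso is treated as identity.
    x_head = (x_obs - t_head) @ R_head               # equivalent to R^T (x - t)
    return w * x_head + (1.0 - w) * x_obs            # blended canonical position
```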
In summary, the main contributions of this article include:
– We propose a novel facial-model-conditioned NeRF for personalized 3D head avatars, which is built on an orthogonal synthetic-renderings-based feature volume. Our representation enables flexible topology and accurate control over head motion and facial expressions.
– Benefiting from our hybrid representation, we develop a new strategy of modulating generators with conditional embeddings to handle the inconsistent-shape issue present in existing NeRF-based avatar modeling methods and significantly improve animation stability.
– We are the first to achieve high-resolution, realistic, and view-consistent synthesis of dynamic head appearance by adopting an overall GAN-based architecture that combines our efficient avatar representation with an image-to-image translation module.
– Besides learning the head avatar from monocular videos, we also present head avatar modeling from multiview videos (using six cameras), and experiments demonstrate the superior performance of our approach compared with other modified SOTA methods.
3 Overview
The overview of our proposed method is illustrated in Figure 2. Given monocular or sparse-view videos, we estimate a per-frame facial parametric model \(\mathbf{M}_t\) from the image sequences \(\mathbf{I}_t, t=1,\dots,T\). Our method conditions the NeRF on orthogonal synthetic renderings of the model to describe the expression-related head appearance in the canonical space \(H_C\), which supports arbitrary topology and precise expression control. Besides, per-frame learnable embeddings are utilized to modulate the plane feature generation to address the expression-shape coupling issue (Section 4.1). Based on the learned LBS weight field, the canonical appearance volume \(H_C\) is warped into the observed space \(H\) using the estimated head pose, resulting in decoupled motions of the head and the body (Section 4.2). With an image-to-image translation network transferring the volumetrically rendered 2D feature maps to final RGB images, our method achieves high-resolution, photo-realistic, and view-consistent portrait image synthesis (Section 4.3). The overall framework is trained in an adversarial manner with image observations, and the established head avatar can be applied to 4D reconstruction of the training sequence or to novel full-head reenactment (Section 4.4).
3.1 Recap: NeRFace
NeRFace [Gafni et al. 2021] first extends NeRF [Mildenhall et al. 2020] to describe expression-related dynamic head appearance. Based on the classical backbone of eight fully connected layers, NeRFace additionally takes the low-dimensional expression parameters of the morphable model as input to condition the neural scene representation network on dynamically changing content. By employing the estimated head pose to transform the rays into the canonical space shared by all frames, the head canonical appearance volume
\(H_C\) can be formulated as
\[
H_C: (\mathbf{x_c}, \delta_t, \gamma_t) \rightarrow (\mathbf{c}, \sigma),
\]
where the implicit function maps the position in canonical space \(\mathbf{x_c}\) to density \(\sigma\) and color feature \(\mathbf{c}\), under the control of the facial expression parameters \(\delta_t\), as well as per-frame embeddings \(\gamma_t\) that compensate for missing tracking information.
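The following is a minimal sketch of this kind of globally conditioned NeRF MLP, which simply concatenates the expression vector and the per-frame embedding to the positionally encoded canonical point; the layer widths, encoding frequencies, and parameter dimensions are assumptions for illustration rather than NeRFace's exact configuration.

```python
import torch
import torch.nn as nn

def positional_encoding(x, n_freqs=10):
    """Standard NeRF-style sinusoidal encoding of 3D points."""
    freqs = 2.0 ** torch.arange(n_freqs, device=x.device) * torch.pi
    enc = [x]
    for f in freqs:
        enc += [torch.sin(f * x), torch.cos(f * x)]
    return torch.cat(enc, dim=-1)

class ConditionedNeRF(nn.Module):
    """Globally conditioned dynamic NeRF in the spirit of NeRFace: expression
    parameters and a per-frame latent are concatenated to the encoded position."""
    def __init__(self, expr_dim=76, embed_dim=32, width=256, n_freqs=10):
        super().__init__()
        in_dim = 3 * (2 * n_freqs + 1) + expr_dim + embed_dim
        layers, d = [], in_dim
        for _ in range(8):                        # eight fully connected layers
            layers += [nn.Linear(d, width), nn.ReLU()]
            d = width
        self.trunk = nn.Sequential(*layers)
        self.head = nn.Linear(width, 1 + 3)       # density + color feature

    def forward(self, x_c, expr, embed):
        # x_c: (N, 3); expr, embed: expanded to one copy per sample point
        h = torch.cat([positional_encoding(x_c), expr, embed], dim=-1)
        out = self.head(self.trunk(h))
        return out[..., :1], out[..., 1:]         # sigma, color feature
```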
NeRFace relies on global expression blendshape parameters to represent diverse expression-related appearances. However, simply learning the mapping from global conditional vectors to appearances from only a short video sequence makes it prone to overfitting. Hence, though NeRFace faithfully reconstructs the training sequences, it lacks awareness of the underlying 3D structure of the human face and struggles to generalize to unseen expressions.
6 Discussion and Conclusion
Limitation. Although our approach is able to synthesize high-quality 3D-aware portrait images, the proxy shapes produced by our method are not competitive with state-of-the-art alternative approaches [Zheng et al. 2022; Grassal et al. 2022], as shown in Figure 19 and Figure 17. While this is not important for the photo-realistic, stable, view-consistent head image synthesis application we consider in this article, other applications may benefit from reconstructing more accurate morphable geometry.
Compared to surface-based avatar modeling methods [Zheng et al. 2022], our method struggles with out-of-distribution head poses. Additionally, because our method relies on a parametric model to control facial expressions, it is challenging to handle extreme expressions that cannot be expressed by the facial model, as depicted in Figure 18. Furthermore, while our method can capture simple pose-related deformation of long hair, it has difficulty with challenging topology-varying cases caused by large hair movements. Special treatment of the hair region is an important problem for future work.
Conclusion. We introduce a novel modeling method that is the first to achieve high-resolution, photo-realistic, and view-consistent portrait synthesis for controllable human head avatars. By integrating the parametric face model with the neural radiance field, it has expressive representation power for both topology and appearance, as well as fine-grained control over head poses and facial expressions. Utilizing learnable embeddings to modulate the feature generators, our method further stabilizes animation results. Besides monocular-video-based avatar modeling, we also present high-fidelity head avatars based on a sparse-view capture system. Compared to existing methods, the appearance quality and animation stability of our head avatar are significantly improved.