URAvatar: Universal Relightable Gaussian Codec Avatars

SA '24: SIGGRAPH Asia 2024 Conference Papers, Tokyo, Japan. Published: 03 December 2024. DOI: 10.1145/3680528.3687653

Abstract

We present a new approach to creating photorealistic and relightable head avatars from a phone scan with unknown illumination. The reconstructed avatars can be animated and relit in real time with the global illumination of diverse environments. Unlike existing approaches that estimate parametric reflectance parameters via inverse rendering, our approach directly models learnable radiance transfer that incorporates global light transport in an efficient manner for real-time rendering. However, learning such a complex light transport that can generalize across identities is non-trivial. A phone scan in a single environment lacks sufficient information to infer how the head would appear in general environments. To address this, we build a universal relightable avatar model represented by 3D Gaussians. We train on hundreds of high-quality multi-view human scans with controllable point lights. High-resolution geometric guidance further enhances the reconstruction accuracy and generalization. Once trained, we finetune the pretrained model on a phone scan using inverse rendering to obtain a personalized relightable avatar. Our experiments establish the efficacy of our design, outperforming existing approaches while retaining real-time rendering capability.

1 Introduction

Photorealistic head avatars are fundamental to enabling communication in virtual environments [Lombardi et al. 2018; Ma et al. 2021]. To establish coherent presence in such environments, avatars have to be illuminated to match the particular environment that they are in. Consider a virtual interaction in a room with natural light shining in from a side window. If the avatars in the scene are uniformly lit, or lit as if they were in a room with ceiling fluorescent light, the incongruence between environment and avatars will interfere with–and likely break–the sense of presence.
Fig. 1: URAvatar. Our approach enables the creation of drivable and relightable photorealistic head avatars from a single phone scan (left). The reconstructed avatars can be driven consistently across identities under different illuminations in real time (right).
Fig. 2: Method Overview. (a) We employ a large relightable corpus of multi-view facial performances to train a cross-identity decoder \(\mathcal {D}\) that can generate volumetric avatar representations. (b) Given a single phone scan of an unseen identity, we reconstruct the head pose, geometry, and albedo texture, and fine-tune our pretrained relightable prior model. (c) Our final model provides disentangled control over relighting, gaze and neck control.
The challenge is that human heads are among the most complex objects to relight accurately. Light interacts with the head in varied ways, scattering in the skin, reflecting in the eyes and teeth, getting trapped in hair strands, and so on. This complexity is compounded by the diversity among human beings in facial structure, skin types, eye colors, accessories, and hair types. Traditionally, measuring scattering and reflectance properties to build authentic relightable avatars has required detailed scans in multi-light capture systems [Bi et al. 2021; Debevec et al. 2000; Ghosh et al. 2011; Saito et al. 2024]. Such capture systems are costly and require specialists to build. The scans themselves are time-consuming and inconvenient. To truly build virtual communities that the majority of people can access, we require the means to quickly and effortlessly create relightable avatars, across the span of human diversity.
Recent approaches have attempted to drastically reduce the capture data to as little as a single input image [Lattas et al. 2020; Yamaguchi et al. 2018] or a monocular video [Bharadwaj et al. 2023; Wang et al. 2023a]. Yet, there remains a clear fidelity gap between the studio-captured avatars and the ones from lightweight inputs. In this paper, our goal is to achieve comparable relightable quality to those studio-captured avatars from just a single phone scan.
To close the quality gap, we introduce URAvatar (pronounced “your avatar”), a Universal Relightable Avatar prior learned from hundreds of individuals captured with a multi-view and multi-light capture system in an end-to-end manner. URAvatar uses a set of 3D Gaussians [Kerbl et al. 2023] to represent the intricate geometry of human heads and hair, and builds a prior on the joint distribution of identity, expressions, and illumination. This enables the modeling of a relightable and drivable avatar with high-fidelity details from under-constrained input as shown in fig. 1. Unlike existing approaches that learn priors based on parametric BRDFs [Lattas et al. 2021; Li et al. 2020; Smith et al. 2020; Yamaguchi et al. 2018], we build our relightable appearance prior based on learnable radiance transfer [Saito et al. 2024] that incorporates global light transport as a result of multi-bounce scattering and reflection. This way, we can efficiently relight avatars with global illumination under various environments without expensive ray tracing. Moreover, the model can be directly supervised to reproduce the ground-truth images without being restricted by the expressiveness of the chosen BRDF model. For consistent drivability across identities, we balance between the explicitness of control and the scalability of training. In particular, we choose to explicitly model eye gaze and neck rotation in the form of linear blend skinning, as they can be reliably tracked. On the other hand, facial expressions, including complex tongue motions, are all learned as latent codes in a self-supervised manner [Lombardi et al. 2018; Xu et al. 2023].
Once trained, we finetune the avatar with an input phone scan of a new person. We reduce the domain gap between the pretrained model and the phone scan by estimating the albedo in screen space and unwrapping it to UV space for identity conditioning. Then, we estimate illumination by regression and refine it via inverse rendering. Finally, the weights of the prior model are updated to best explain the phone scan via inverse rendering. Our carefully designed finetuning strategy ensures that the relightability is retained from the prior, while recovering essential person-specific details.
To measure the fidelity of our approach, we collect ground-truth relighting data under various continuous illumination conditions with a capture dome that consists of multiple LED screens. This allows us to quantitatively compare the synthesis and real-world observations given known natural illumination. Our experiments show that our approach outperforms prior methods by a large margin, and clearly demonstrates the efficacy of our prior-based relighting that accounts for global light transport in real time.
Our principal contributions are:
(1) We introduce a universal relightable avatar prior model learned from hundreds of dynamic performance captures with a multi-view and multi-light system.
(2) We build a drivable head avatar from a phone scan that can be rendered and relit with global light transport in real time.
(3) We present a capture system and an evaluation protocol to measure the accuracy of relighting under continuous illumination.

2 Related Work

Authentic avatars for mass adoption must satisfy the following criteria: they must be drivable, relightable, and lightweight enough for anyone to create. In what follows, we discuss prior works based on these criteria.

2.1 Drivable Avatars

In computer graphics, controlling facial expressions of avatars has been primarily driven by visual effects and games. To enable consistent control across identities, anatomically motivated FACS action units [Friesen and Ekman 1978] are widely used as the basis of blendshapes [Lewis et al. 2014]. However, this basis is often insufficient to capture person-specific variations, and tends to require additional correctives [Li et al. 2013]. Data-driven approaches construct linear [Blanz and Vetter 2023; Pighin et al. 2006], multi-linear [Cao et al. 2013; Vlasic et al. 2006], and non-linear [Ranjan et al. 2018; Tran and Liu 2018] bases from captured 3D data. The FLAME model [Li et al. 2017] also incorporates linear blend skinning (LBS) for jaw and neck motions. These approaches lack fine-grained subtle expressions as well as tongue and eye motions. Deep appearance models [Lombardi et al. 2018] propose a self-supervised method to discover the expression latent space using variational autoencoders (VAEs). This approach allows the driving of authentic facial expressions of users in a fully data-driven manner. Later, LatentAvatar [Xu et al. 2023] shows that a similar construction is possible for less constrained setups. Cao et al. [2022] learn the latent expression space across multiple identities to enable semantically consistent driving while retaining person-specific expressions. While this approach works well for relatively small deformations, large deformations caused by articulations, such as neck or eye motions, lead to undesired artifacts. To address this, our work combines the latent expression codes with explicit eye models [Li et al. 2022; Schwartz et al. 2020] and neck articulations via LBS [Li et al. 2017], further enhancing the drivability and fidelity.

2.2 Lightweight Avatar Generation

Early works on photorealistic human digitization required dedicated reconstruction pipelines and capture systems for individual components including hair [Echevarria et al. 2014; Luo et al. 2013; Nam et al. 2019], faces [Debevec et al. 2000; Ghosh et al. 2011], eyes [Bérard et al. 2014], and inner mouth [Wu et al. 2016]. While these approaches are non-trivial to scale to a large number of identities, Ichim et al. [2015] show the promise of reconstructing personalized avatars from a phone scan. Follow-up works support more diverse hair styles [Cao et al. 2016] or enable reconstruction from a single image [Hu et al. 2017; Nagano et al. 2018]. However, these approaches tend to lack photorealism due to the limited expressiveness of the underlying morphable models and/or mesh representations. More recently, neural fields [Xie et al. 2022], including NeRF [Mildenhall et al. 2021], show remarkable progress on modeling complex geometry and appearance. This is also extended to avatar reconstructions from casually captured video data [Athar et al. 2023; 2022; Gafni et al. 2021; Grassal et al. 2022; Zheng et al. 2021; Zielonka et al. 2023]. While NeRF and its variants are slow to render, neural rendering approaches based on Mixture of Volumetric Primitives [Cao et al. 2022; Lombardi et al. 2021] or Neural Deferred Rendering [Thies et al. 2019; Wang et al. 2023b] show the ability to render 3D avatars in real time for interactive applications. Despite impressive progress on modeling authentic avatars from lightweight inputs, the common limitation of these approaches is that the illumination is baked into the appearance model and avatars cannot be relit under different environments.

2.3 Avatar Relighting

Relighting is a critical property to enable photorealistic composition of avatars into a scene. Debevec et al. [2000] show that one-light-at-a-time (OLAT) capture can be used to recover reflectance fields. Follow-up works further support dynamic relighting [Peers et al. 2007] and accelerate the acquisition process by leveraging spherical gradient illumination [Fyffe et al. 2016; Ghosh et al. 2011; Ma et al. 2007]. Despite high-fidelity outputs, it remains non-trivial to widely adopt such a system. While the work by Sengupta et al. [2021] reduces the hardware requirement to a single camera and a single monitor, it supports neither reanimation nor novel-view synthesis. While portrait relighting approaches [Kim et al. 2024; Meka et al. 2019; 2020; Pandey et al. 2021; Sun et al. 2019; Tewari et al. 2020; Wang et al. 2020; Yeh et al. 2022; Zhang et al. 2021] support relighting in screen space, due to the lack of temporal or 3D information, they tend to produce flickering when changing view or expressions. Some approaches [Mei et al. 2024; Tan et al. 2022] use 3D-aware GANs to synthesize relightable faces, but they have limited animatability. Another common approach for relighting is to estimate skin reflectance properties such as albedo, roughness, and surface normals from multi-view images [Li et al. 2020; Liu et al. 2022] or a single image [Chen et al. 2019; Lattas et al. 2020; Lin et al. 2023; Yamaguchi et al. 2018]. Once estimated, the avatars can be relit with path tracing or real-time shaders. However, these approaches are limited to skin regions and non-trivial to unify for the entire head due to complex scattering/reflectance properties of different components including hair, eyes, teeth, and skin. Incorporating intrinsic decomposition into the image formation process of GANs also achieves relighting in the wild [Deng et al. 2024; Ranjan et al. 2023]. Optimization-based approaches from a phone scan demonstrate personalized relightable avatar reconstruction [Bharadwaj et al. 2023; Wang et al. 2023a; Zheng et al. 2023]. While they show promising results, fine-grained driving remains challenging and photorealism under novel illumination is still limited due to insufficient prior knowledge about light transport. Recent neural relighting approaches show remarkable progress in terms of photorealism [Bi et al. 2021; Saito et al. 2024; Xu et al. 2024; Yang et al. 2023]. MEGANE [Li et al. 2023] learns a relightable appearance model across multiple identities. URHand [Chen et al. 2024] enables the instant personalization of a pretrained relightable hand prior. VRMM [Yang et al. 2024] concurrently proposes a multi-identity relightable avatar model based on MVP [Lombardi et al. 2021] and a linear lighting model [Yang et al. 2023]. However, it remains a challenge to faithfully capture geometric details such as hair strands using meshes [Chen et al. 2024] or Mixture of Volumetric Primitives [Li et al. 2023; Lombardi et al. 2021; Yang et al. 2024]. In this work, we base our geometric and appearance representation on 3D Gaussians [Kerbl et al. 2023] and learnable radiance transfer [Saito et al. 2024; Sloan et al. 2002], respectively. Our approach enables, for the first time, the learning of a universal relightable prior that natively supports real-time relighting with global light transport under various illuminations. In addition, our approach enables personalization from a phone capture with unknown illumination.
Fig. 3: Network architecture. Our expression encoder, \({\mathcal {E}}_{\text{exp}}\), takes a 1024×1024 positional map of face geometry as input and encodes it into an expression latent code with the map size of 4×4. Our downsampling block consists of a convolutional layer with a kernel size of 4 and stride of 2, followed by a leaky ReLU activation function. Similarly, our upsampling block is composed of a transposed convolutional layer with a kernel size of 4 and stride of 2, followed by a leaky ReLU activation function. Our identity encoder, \({\mathcal {E}}_{\text{id}}\), is a U-Net-like architecture that takes the mean texture and geometry of a subject as input, producing a multi-scale feature pyramid as the ID conditioning data. The produced feature maps are then added to the corresponding layer of the decoder to produce the guide mesh and its Gaussian parameters. Our decoder consists of 7 upsampling blocks that take a map with a size of 4×4 as input and output 1024×1024 Gaussian parameter maps.

3 Method

As our geometry and appearance representations are based on Relightable Gaussian Codec Avatars [Saito et al. 2024], we first describe the foundation of 3D Gaussians and learnable radiance transfer. We then discuss how we extend Gaussian Codec Avatars to build a universal relightable prior with multi-identity training data. Finally, we provide the details of our finetuning approach to create a personalized relightable model from a phone scan using the universal relightable prior.

3.1 Preliminaries: Relightable 3D Gaussians

Avatars are represented as a collection of 3D Gaussians, denoted as gk = {tk, qk, sk, ok, ck}. The parameters include a translation vector \(\mathbf {t}_k \in \mathbb {R}^3\), rotation parameterized by a unit quaternion \(\mathbf {q}_k \in \mathbb {R}^4\), scale factors \(\mathbf {s}_k \in \mathbb {R}_+^3\) along three orthogonal axes, an opacity value \(o_k \in \mathbb {R}_+\), and a color \(\mathbf {c}_k \in \mathbb {R}_+^3\). Following 3D Gaussian Splatting [Kerbl et al. 2023], the Gaussians can be efficiently rendered at high resolution in real time.
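For concreteness, the per-Gaussian parameters can be collected in a small container. The following is a minimal sketch (field names and the covariance helper are ours, following the standard 3D Gaussian Splatting parameterization, not code from the paper):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Gaussian3D:
    """Per-Gaussian parameters g_k = {t_k, q_k, s_k, o_k, c_k} (names are illustrative)."""
    t: np.ndarray  # translation, shape (3,)
    q: np.ndarray  # rotation as a unit quaternion (w, x, y, z), shape (4,)
    s: np.ndarray  # positive scale factors along three orthogonal axes, shape (3,)
    o: float       # opacity, > 0
    c: np.ndarray  # RGB color (outgoing radiance), shape (3,)

    def covariance(self) -> np.ndarray:
        """Sigma = R diag(s)^2 R^T, the anisotropic covariance used for splatting."""
        w, x, y, z = self.q / np.linalg.norm(self.q)
        R = np.array([
            [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
            [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
            [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
        ])
        return R @ np.diag(self.s ** 2) @ R.T
```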
We base our appearance model on learnable radiance transfer [Saito et al. 2024]. To model the appearance under different illuminations, precomputed radiance transfer (PRT) [Sloan et al. 2002; Wang et al. 2009] decomposes the integral of the rendering equation into the product of extrinsic illumination L(ω) and intrinsic radiance transfer T(p, ω, ωo). Saito et al. further extend PRT by directly learning the parameters of the transfer function from multi-view and multi-light capture data, and decomposing T(p, ω, ωo) into diffuse terms (independent of the viewing direction) and specular terms:
\begin{align} \begin{aligned} \mathbf {c}(\mathbf {p}, \boldsymbol {\omega }^o) &=\int _{\mathbb {S}^2} L(\boldsymbol {\omega }) \cdot T(\mathbf {p}, \boldsymbol {\omega },\boldsymbol {\omega }^o) d \boldsymbol {\omega }, \\ &= \int _{\mathbb {S}^2} L(\boldsymbol {\omega }) \cdot \left(T^\textrm {diffuse}(\mathbf {p}, \boldsymbol {\omega }) + T^\textrm {specular}(\mathbf {p}, \boldsymbol {\omega }, \boldsymbol {\omega }^o) \right) d \boldsymbol {\omega } \text{,} \end{aligned} \end{align}
(1)
where c is the outgoing radiance at the position p along ωo, and L is the incoming radiance. In particular, the outgoing radiance for each Gaussian ck is decomposed into view-independent diffuse and view-dependent specular terms, represented as \(\mathbf {c}_k = \mathbf {c}^\text{diffuse}_{k} + \mathbf {c}^\text{specular}_{k}\). The diffuse color is calculated through the integration of the incoming radiance and the intrinsic radiance transfer, both of which are parameterized by spherical harmonics (SH) [Sloan et al. 2002]:
\begin{align} \mathbf {c}^\text{diffuse}_{k} = \boldsymbol {\rho }_k\odot \sum _{i=1}^{(n+1)^2}{\mathbf {L}_{i}\odot \mathbf {d}_{k}^{i}} \text{,} \end{align}
(2)
where Li denotes the i-th element in n-th order spherical harmonics (SH) coefficients of the incident lights, \(\mathbf {d}^{i}_{k}\) represents the i-th element in n-th order SH coefficients of the learnable radiance transfer function, and ρk is the base albedo color. These terms are modeled individually for RGB channels. Inspired by Wang et al. [2009], the specular reflection is represented as spherical Gaussians Gs(ω; a, σ) with the central direction of the lobe a and roughness σ:
\begin{align} \mathbf {c}^{\textrm {specular}}_{k}(\boldsymbol {\omega }^o_k) &= v_k(\boldsymbol {\omega }^o_k) \int _{\mathbb {S}^2} \mathbf {L}(\boldsymbol {\omega })G_s(\boldsymbol {\omega }; \mathbf {a}_k, \sigma _{k}) \mathrm{d} \boldsymbol {\omega }, \end{align}
(3)
\begin{align} \mathbf {a}_k &= 2(\boldsymbol {\omega }^o_k \cdot \mathbf {n}_k)\mathbf {n}_k - \boldsymbol {\omega }^o_k \text{.} \end{align}
(4)
Here, \(v_k(\boldsymbol {\omega }^o_k) \in (0,1)\) is a learnable view-dependent visibility term that accounts for Fresnel reflection, occlusion, and geometric attenuation, \(\boldsymbol {\omega }^o_k\in \mathbb {R}^3\) is the viewing direction evaluated at the Gaussian center, and nk is a view-dependent normal for each Gaussian.
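As a concrete illustration of Eqs. (2)–(4), the sketch below evaluates the shading of a single Gaussian, assuming the environment light is already projected onto SH coefficients for the diffuse term and given as directional samples for the specular term. The lobe sharpness 1/σ and the Monte Carlo integration are our assumptions, not the paper's exact implementation.

```python
import numpy as np

def diffuse_color(albedo, L_sh, d_sh):
    """Eq. (2): c_diffuse = rho ⊙ Σ_i L_i ⊙ d_i, with per-RGB-channel SH coefficients.
    albedo: (3,); L_sh, d_sh: ((n+1)^2, 3) SH coefficients of light and learned transfer."""
    return albedo * np.sum(L_sh * d_sh, axis=0)

def spherical_gaussian(w, axis, sigma):
    """Spherical Gaussian lobe G_s(w; a, sigma); mapping roughness to sharpness 1/sigma is an assumption."""
    return np.exp((np.dot(w, axis) - 1.0) / max(sigma, 1e-4))

def specular_color(view_dir, normal, visibility, sigma, light_dirs, light_rgb):
    """Eqs. (3)-(4): reflect the view direction about the normal (Eq. 4) and integrate the
    sampled environment against the spherical Gaussian lobe (uniform Monte Carlo estimate).
    light_dirs: (S, 3) unit directions; light_rgb: (S, 3) radiance samples."""
    a = 2.0 * np.dot(view_dir, normal) * normal - view_dir  # mirror reflection direction
    weights = np.array([spherical_gaussian(w, a, sigma) for w in light_dirs])
    return visibility * 4.0 * np.pi * (weights[:, None] * light_rgb).mean(axis=0)

def gaussian_color(albedo, L_sh, d_sh, view_dir, normal, visibility, sigma, light_dirs, light_rgb):
    """c_k = c_diffuse + c_specular for one Gaussian."""
    return (diffuse_color(albedo, L_sh, d_sh)
            + specular_color(view_dir, normal, visibility, sigma, light_dirs, light_rgb))
```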

3.2 Universal Relightable Prior Model

Inspired by prior work [Cao et al. 2022], we employ an identity-conditioned hypernetwork [Ha et al. 2017] to generate person-specific avatars. In particular, the hypernetwork takes identity features as input, and produces a subset of person-specific network weights for each subject’s avatar decoder. This decoder produces relightable 3D Gaussians corresponding to the input head state (facial expression, gaze direction, and neck rotation), and input lighting environment and viewpoint. We show the overview in fig. 2 (a).

3.2.1 Identity-conditioned Hypernetwork.

To allow extraction of high-frequency person-specific details, our identity encoder \({\mathcal {E}}_{\text{id}}\) takes identity features in the form of a mean albedo texture map Tmean and a mean geometry map Gmean unwrapped in a 1024² UV space as input, and produces ‘untied’ bias maps \(\Theta ^{\text{id}}_{\text{g}}, \Theta ^{\text{id}}_{\text{fi}}, \Theta ^{\text{id}}_{\text{fv}}\). These bias maps are injected at various levels of the decoding architecture, described below. Our hypernetwork also produces an expression-agnostic opacity \(\lbrace o_k\rbrace _{k=1}^{M}\) and albedo \(\lbrace \boldsymbol {\rho }_k\rbrace _{k=1}^{M}\) for the 3D Gaussians. Formally, the identity encoder is defined as:
\begin{align} \Theta ^{\text{id}}_{\text{g}}, \Theta ^{\text{id}}_{\text{fi}}, \Theta ^{\text{id}}_{\text{fv}}, \lbrace o_k, \boldsymbol {\rho }_k\rbrace _{k=1}^{M} = {\mathcal {E}}_{\text{id}} ({\boldsymbol {T}}_{\text{mean}}, {\boldsymbol {G}}_{\text{mean}}; \Phi _{\text{id}}) \text{.} \end{align}
(5)

3.2.2 Expression Encoder.

We use a variational autoencoder to model a shared latent distribution of facial expressions across identities. To avoid the domain shift between studio and phone-captured textures, our expression encoder \({\mathcal {E}}_{\text{exp}}\) takes only the difference of geometry maps as input, i.e., ΔGexp = Gexp − Gmean, where \({\boldsymbol {G}}_\text{exp}\) is the current and \({\boldsymbol {G}}_\text{mean}\) is the mean geometry. To preserve subtle facial expression details when using geometry-only inputs, we use high-quality tracking (Sec. 3.2.5) to generate these maps. We then generate a universal expression latent code \({\mathbf {z}}\in \mathbb {R}^{256}\) as follows:
\begin{align} \boldsymbol {\mu },\boldsymbol {\sigma } &= {\mathcal {E}}_{\text{exp}} (\Delta {\boldsymbol {G}}_{\text{exp}} ; \Phi _{\text{exp}}) \text{,} \end{align}
(6)
\begin{align} {\mathbf {z}}&= \boldsymbol {\mu } + \boldsymbol {\sigma } \cdot {\mathcal {N}}(0,1) \text{,} \end{align}
(7)
where \({\mathcal {N}}(0,1)\) is the unit normal distribution. Since the expression latent code is trained in an end-to-end manner with multi-identity data, once the model is trained, the same expression code can be applied to different identities for driving.
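A minimal PyTorch sketch of the reparameterization in Eqs. (6)–(7) and of the KL term used later in Eq. (15); parameterizing σ via the log-variance is our assumption:

```python
import torch

def sample_expression_code(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """z = mu + sigma * eps with eps ~ N(0, I); differentiable w.r.t. the encoder outputs."""
    sigma = torch.exp(0.5 * logvar)
    eps = torch.randn_like(sigma)
    return mu + sigma * eps

def kl_divergence(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """KL(q(z|x) || N(0, I)), averaged over the batch."""
    return (-0.5 * (1.0 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1)).mean()
```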

3.2.3 Avatar Decoder.

Building upon the foundations laid by previous work [Cao et al. 2022; Li et al. 2017; Lombardi et al. 2021; Saito et al. 2024], we parameterize and anchor Gaussians on a guide mesh and model the facial expressions in a canonical space. We further expand the geometry to encompass the shoulder region and use predefined linear blend skinning to model neck rotations. A geometry decoder \({\mathcal {D}}_{\text{g}}\) produces the vertices \(\lbrace \hat{{\mathbf {t}}}_k\rbrace _{k=1}^{M}\) of this extended guide mesh:
\begin{align} \lbrace \hat{{\mathbf {t}}}_k\rbrace _{k=1}^{M} = \mathcal {D}_{\text{g}}(\mathbf {z}, \mathbf {e}_{\lbrace l,r\rbrace }, \mathbf {r}_{\text{n}}; \Theta ^{\text{id}}_{\text{g}}, \Phi _{\text{g}}) \text{,} \end{align}
(8)
where z is the expression code, \(\mathbf {e}_{\lbrace l,r\rbrace } \in \mathbb {R}^3\) are eye gaze direction vectors, \({\mathbf {r}}_{\text{n}} \in \mathbb {R}^3\) denotes the axis-angle neck rotation relative to the head, and \(\Theta ^{\text{id}}_{\text{g}}\) is an identity-specific bias from eq. (5).
We split the appearance model into view-independent and view-dependent components. Our view-independent face relightable Gaussian decoder, \({\mathcal {D}}_{\text{fi}}\), takes expression code, gaze vectors, and neck rotation as input. It is further conditioned with the identity ‘untied’ bias map \(\Theta ^{\text{id}}_{\text{fi}}\) derived from the identity encoder, resulting in the output of view-independent attributes for each 3D Gaussian:
\begin{align} \lbrace \mathbf {\delta t}_k, \mathbf {q}_k, \mathbf {s}_k, \mathbf {d}^{\textrm {c}}_{k}, \mathbf {d}^{\textrm {m}}_{k}, \sigma _{k}\rbrace _{k=1}^{M} = \mathcal {D}_{\text{fi}}(\mathbf {z}, \mathbf {e}_{\lbrace l,r\rbrace }, \mathbf {r}_{\text{n}}; \Theta ^{\text{id}}_{\text{fi}}, \Phi _{\text{fi}}) \text{.} \end{align}
(9)
Here, \(\mathbf {d}^{\textrm {c}}_{k}\) and \(\mathbf {d}^{\textrm {m}}_{k}\) are color and monochrome SH coefficients [Saito et al. 2024], respectively, and σk is the roughness parameter defined in eq. (3). Our view-dependent face relightable Gaussian decoder, \({\mathcal {D}}_{\text{fv}}\), incorporates the view direction to the head center, ωo, as an additional input, subsequently generating view-dependent delta normal and visibility terms for each Gaussian:
\begin{align} \lbrace \mathbf {\delta n}_k, v_k\rbrace _{k=1}^{M} = \mathcal {D}_{\text{fv}}(\mathbf {z}, \mathbf {e}_{\lbrace l,r\rbrace }, \mathbf {r}_{\text{n}}, \boldsymbol {\omega }_o; \Theta ^{\text{id}}_{\text{fv}}, \Phi _{\text{fv}}). \end{align}
(10)
Here, Φfi and Φfv represent the learnable parameters of each respective decoder. The final Gaussian position is a composite of the guide mesh’s vertex positions, as derived from eq. (8), and the delta position from eq. (9), taking the form \({\mathbf {t}}_k = \hat{{\mathbf {t}}}_k + \mathbf {\delta t}_k\). The view-dependent surface normal of each Gaussian, nk, is a composition of the mesh normal, derived from the guide mesh, and the delta normal decoded from eq. (10), taking the form \({\mathbf {n}}_k = \hat{{\mathbf {n}}}_k + \mathbf {\delta n}_k\).
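The composition of decoder outputs into final Gaussian positions and normals is a simple sum; a sketch is shown below (renormalizing the summed normal is our assumption, not stated in the paper):

```python
import torch
import torch.nn.functional as F

def compose_gaussians(t_hat, delta_t, n_hat, delta_n):
    """t_k = t_hat_k + delta_t_k and n_k = n_hat_k + delta_n_k for all M Gaussians.
    t_hat, delta_t, n_hat, delta_n: tensors of shape (M, 3)."""
    t = t_hat + delta_t
    n = F.normalize(n_hat + delta_n, dim=-1)  # renormalization is our assumption
    return t, n
```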

3.2.4 Universal Relightable Explicit Eye Model.

We adapt previous works [Saito et al. 2024; Schwartz et al. 2020] to model the eyes with a universal relightable explicit eye model. We use the same network architecture as the encoder \({\mathcal {E}}_{\text{id}}\) for each identity’s eye encoder \({\mathcal {E}}_{\text{eye}}\), defined as follows:
\begin{align} \Theta ^{\text{eye}}_{\text{ei}}, \Theta ^{\text{eye}}_{\text{ea}} = {\mathcal {E}}_{\text{eye}} ({\boldsymbol {T}}_{\text{eye}}, {\boldsymbol {G}}_{\text{eye}}; \Phi _{\text{eye}}) \text{,} \end{align}
(11)
where Teye and Geye are the cropped eye regions from the mean texture and mean geometry maps Tmean and Gmean. The output of this encoder is the ‘untied’ bias map of each level of the eye’s decoder.
During the phone capture fine-tuning stage, we observed that the networks were unable to preserve the prior knowledge acquired during the pre-training stage and failed to reproduce eye glints after fine-tuning, due to the limited observation of the eye regions from phone captures. Consequently, we propose a unified specular visibility decoder \({\mathcal {D}}_{\text{ev}}\), which does not require any identity conditioning as input. This design is intended to encourage the network to learn an eye reflection model that can easily generalize to unseen identities. Therefore, our universal relightable eye decoder is defined as follows:
\begin{align} \lbrace \mathbf {q}_k, \mathbf {s}_k, o_k,\mathbf {d}^{\textrm {c}}_{k}, \mathbf {d}^{\textrm {m}}_{k}, \sigma _{k}\rbrace _{k=1}^{M_e} &= \mathcal {D}_{\text{ei}}(\mathbf {e}_{\lbrace l,r\rbrace }; \Theta ^{\text{eye}}_{\text{ei}}, \Phi _{\text{ei}}) \text{,} \end{align}
(12)
\begin{align} \lbrace \boldsymbol {\rho }_k\rbrace _{k=1}^{M_e} &= \mathcal {D}_{\text{ea}}(\mathbf {e}_{\lbrace l,r\rbrace }, \boldsymbol {\omega }_o; \Theta ^{\text{eye}}_{\text{ea}}, \Phi _{\text{ea}}) \text{,} \end{align}
(13)
\begin{align} \lbrace v_k\rbrace _{k=1}^{M_e} &= \mathcal {D}_{\text{ev}}(\mathbf {e}_{\lbrace l,r\rbrace }, \boldsymbol {\omega }_o; \Phi _{\text{ev}}) \text{.} \end{align}
(14)

3.2.5 High-Quality Tracking Geometry.

We first track each subject independently using a high-quality template head mesh. This mesh is subsequently used to supervise our geometry decoder’s output \(\hat{{\mathbf {t}}}_k\). We observed that this form of geometry supervision plays a crucial role in preventing 3D Gaussians from getting trapped in local minima during the early stages of universal relightable prior model training. To obtain this tracked mesh, we modify the Pixel Codec Avatar (PiCA) [Ma et al. 2021] architecture to extend coverage to the upper body, specifically the head, neck, and shoulders. For each identity in our training dataset, we therefore fit a personalized pixel codec avatar using inverse rendering to reconstruct all the fully-lit, multi-view input images. Subsequently, we extract the geometry for each frame from PiCA’s geometry branch. Leveraging the dense position map-based geometry representation and per-pixel color decoding, we can reconstruct high-quality geometry for each frame. For monocular phone scans, we follow a pipeline similar to Cao et al. [2022], but replace the coarse base geometry with the aforementioned high-resolution template mesh and photometric refinement [Ma et al. 2021].

3.2.6 Conditioning Albedo Texture Acquisition.

We use an off-the-shelf portrait relighting method [Kim et al. 2024] to estimate the illumination and the albedo of input images of the face. The estimated illumination is used as an initialization for the environment light fitting in Sec. 3.3.1. To maximize dataset consistency, we use the same algorithm to extract albedo in both studio and phone settings. These albedo images are then unwrapped onto the high-quality tracked mesh to obtain the mean albedo texture \({\boldsymbol {T}}_\text{mean}\). This albedo texture is used as the conditioning input to the identity encoder in eq. (5). During experiments, we found that utilizing this mean albedo texture, in contrast to a simple color transformation [Cao et al. 2022], assists in more effectively de-lighting the face under varying illumination conditions and offers a more consistent conditional input for our identity encoder.

3.2.7 Training Losses.

Given multi-view video data of a person illuminated with known point-light patterns, we follow previous works [Kerbl et al. 2023; Saito et al. 2024] and optimize all trainable network parameters Φ with the following loss function:
\begin{align} \mathcal {L} = \mathcal {L}_{\mathrm{rec}} + \mathcal {L}_{\mathrm{reg}} + \lambda _{\mathrm{kl}} \mathcal {L}_{\mathrm{kl}} \text{,} \end{align}
(15)
where \(\mathcal {L}_{\mathrm{rec}}\) is the reconstruction loss, consisting of L1 and SSIM losses on rendered images and an L2 loss on the high-quality tracked geometry, \(\mathcal {L}_{\mathrm{reg}}\) is the set of loss functions regularizing the scale of Gaussians, negative color values, and the scale, opacity, and visibility of eye Gaussians, and \(\mathcal {L}_{\mathrm{kl}}\) is the KL-divergence loss on the expression code z. Except for the geometry loss, for which we use a relative weight of 0.1, we use the same relative weights as Saito et al. [2024]. Please refer to Saito et al. [2024] for the details of each loss function.
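A sketch of Eq. (15) under the weighting described above; the simplified box-filter SSIM, the value of λ_kl, and the argument names are stand-ins, not the paper's exact settings:

```python
import torch
import torch.nn.functional as F

def ssim(x, y, C1=0.01 ** 2, C2=0.03 ** 2, win=11):
    """Simplified SSIM with a uniform (box) window; x, y: images of shape (N, C, H, W)."""
    mu_x = F.avg_pool2d(x, win, 1, win // 2)
    mu_y = F.avg_pool2d(y, win, 1, win // 2)
    var_x = F.avg_pool2d(x * x, win, 1, win // 2) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, win, 1, win // 2) - mu_y ** 2
    cov_xy = F.avg_pool2d(x * y, win, 1, win // 2) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * cov_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2)
    return (num / den).mean()

def total_loss(render, gt, geom_pred, geom_gt, reg_terms, mu, logvar,
               w_geom=0.1, lambda_kl=1e-3):
    """L = L_rec + L_reg + lambda_kl * L_kl (Eq. 15); lambda_kl here is a placeholder value."""
    l_rec = (render - gt).abs().mean() \
            + (1.0 - ssim(render, gt)) \
            + w_geom * (geom_pred - geom_gt).pow(2).mean()  # L2 on high-quality tracked geometry
    l_reg = sum(reg_terms)  # Gaussian-scale / negative-color / eye-Gaussian regularizers
    l_kl = (-0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1)).mean()
    return l_rec + l_reg + lambda_kl * l_kl
```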
Fig. 4: Visualization of the effect of our fitted environment lights, and the comparison to the ground-truth environment lights.
Fig. 5: Ablation study on unified eye specular visibility decoder and high-resolution tracked mesh.

3.3 Personalized Avatar from Phone Scan

With the pre-trained universal relightable prior model, we further fine-tune it using phone capture data to generate a personalized relightable avatar. During the fine-tuning stage, we initially freeze the pre-trained encoders and decoders, optimizing the environment lights to achieve a low reconstruction loss on the face. Subsequently, we freeze the environment lights and fine-tune the encoders and decoders for a few iterations to improve the likeness of the captured subject. See fig. 2 (b) for an overview. For preprocessing the phone capture data, we apply the geometry tracking method detailed in Sec. 3.2.5 and albedo texture acquisition in Sec. 3.2.6 to obtain the tracked template mesh and its corresponding unwrapped mean albedo texture for each phone capture.

3.3.1 Fitting Environment Lighting.

We parameterize the environment illumination at the time of phone capture as N point lights with position and RGB intensity, \(\lbrace {\mathbf {l}}^{\text{pos}}_{i}, {\mathbf {l}}^{\text{int}}_{i}\rbrace ^{N}_{i=1}\). In this paper, we use N = 512 distant point lights that we assume to be uniformly distributed on the sphere. During fine-tuning, we freeze the parameters of the encoders and decoders, optimizing only the intensities \(\lbrace {\mathbf {l}}^{\text{int}}_{i}\rbrace ^{N}_{i=1}\) of the lights to minimize the L1 loss between the rendered image If and the ground truth \(\hat{I_f}\) in the face region:
\begin{align} {\mathcal {L}}_{\text{fit-light}} = \left\Vert I_f - \hat{I_f}\right\Vert _1 \text{.} \end{align}
(16)
We show the optimized environment lights of our method in fig. 4. Compared to the data captured by the phone, our refined lights accurately reconstruct the primary light source within the space. This adjustment brings the color space of the avatar more in line with the observed data.
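The light-fitting stage amounts to optimizing only the point-light intensities against Eq. (16) with all network weights frozen. The sketch below assumes a hypothetical render_fn and frame dictionaries standing in for the real rendering pipeline; the initialization and learning rate are our assumptions.

```python
import torch

def fit_environment_lights(decoder, render_fn, frames, num_lights=512, iters=3000, lr=1e-2):
    """Optimize only the RGB intensities of N distant point lights (Eq. 16); every network
    parameter stays frozen. render_fn and the frame dicts are placeholders for the pipeline."""
    intensities = torch.nn.Parameter(0.1 * torch.ones(num_lights, 3))  # init is our assumption
    for p in decoder.parameters():
        p.requires_grad_(False)
    opt = torch.optim.Adam([intensities], lr=lr)
    for _ in range(iters):
        frame = frames[torch.randint(len(frames), (1,)).item()]
        rendered = render_fn(decoder, frame, intensities.clamp(min=0.0))
        # Eq. (16): L1 between render and ground truth, restricted to the face region.
        loss = ((rendered - frame["image"]).abs() * frame["face_mask"]).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return intensities.detach().clamp(min=0.0)
```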

3.3.2 Fine-tuning Encoders and Decoders.

After fitting the environment lights, we freeze them and finetune all the encoder and decoder parameters Φid, Φexp, Φg, Φfi, Φfv, Φeye, Φei, Φea, and Φev. The loss function employed during the fine-tuning stage mirrors that utilized in the pre-training stage, with the inclusion of two additional terms:
\begin{align} \mathcal {L}_{\text{finetune}} = \mathcal {L}_{\mathrm{rec}} + \mathcal {L}_{\mathrm{reg}} + \lambda _{\mathrm{kl}} \mathcal {L}_{\mathrm{kl}} + {\mathcal {L}}_{\text{alpha}} + {\mathcal {L}}_{\text{vgg}} \text{,} \end{align}
(17)
where \({\mathcal {L}}_\text{alpha}\) is the alpha loss between the rendered Gaussians’ alpha and the alpha matte predicted from the input images. Due to the blurriness inherent in the phone capture data, we incorporate the VGG loss [Johnson et al. 2016] between the rendered images and ground truth, denoted as \({\mathcal {L}}_\text{vgg}\), during the fine-tuning stage to retain more details from the phone capture.
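The two additional fine-tuning terms can be sketched as follows; the choice of VGG-16 layers, the L1 form of the alpha loss, and the pretrained weights are our assumptions (the paper only states that a perceptual VGG loss [Johnson et al. 2016] and an alpha loss are used):

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

class VGGLoss(torch.nn.Module):
    """Perceptual loss: L1 distance between VGG-16 feature maps of the render and the phone frame."""
    def __init__(self, layers=(3, 8, 15, 22)):  # relu1_2, relu2_2, relu3_3, relu4_3 (our choice)
        super().__init__()
        self.vgg = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features.eval()
        for p in self.vgg.parameters():
            p.requires_grad_(False)
        self.layers = set(layers)

    def forward(self, render, gt):
        loss, x, y = 0.0, render, gt
        for i, layer in enumerate(self.vgg):
            x, y = layer(x), layer(y)
            if i in self.layers:
                loss = loss + F.l1_loss(x, y)
        return loss

def alpha_loss(rendered_alpha, predicted_matte):
    """L_alpha: agreement between the accumulated Gaussian alpha and an off-the-shelf matting result."""
    return F.l1_loss(rendered_alpha, predicted_matte)
```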

3.4 Training details

We follow previous work [Cao et al. 2022] and use a network architecture based on U-Net [Ronneberger et al. 2015] for our identity encoder. The ID encoder consists of a series of convolutional layers that encode the identity conditioning data into a set of feature maps, which are then added to each layer’s output in the decoder. Please refer to fig. 3 for network details.
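As a concrete reading of the blocks described in fig. 3, a minimal PyTorch sketch of the downsampling and upsampling units is shown below; the negative slope of the leaky ReLU and the padding are our assumptions:

```python
import torch.nn as nn

def down_block(c_in: int, c_out: int) -> nn.Sequential:
    """Downsampling block (fig. 3): 4x4 convolution with stride 2, then leaky ReLU (halves resolution)."""
    return nn.Sequential(nn.Conv2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
                         nn.LeakyReLU(0.2, inplace=True))

def up_block(c_in: int, c_out: int) -> nn.Sequential:
    """Upsampling block (fig. 3): 4x4 transposed convolution with stride 2, then leaky ReLU (doubles resolution)."""
    return nn.Sequential(nn.ConvTranspose2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
                         nn.LeakyReLU(0.2, inplace=True))
```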
For the pre-training of the universal relightable prior model, we use the Adam optimizer [Kingma and Ba 2015] with a learning rate of 5 × 10⁻⁴. We use 64 NVIDIA A100 GPUs with a batch size of 128 for 400k iterations, which takes 5 days to converge. For the personalized avatar finetuning stage, we set the learning rate to 10⁻⁴ and finetune the pre-trained networks using 8 NVIDIA A100 GPUs with a batch size of 16 for 13k iterations, where the first 3k iterations are for environment light fitting and the last 10k iterations are for encoder and decoder finetuning. Including the preprocessing stages, the personalization stage takes approximately 3 hours.
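For reference, the hyperparameters quoted above can be collected in a small configuration sketch; only the listed values come from the text, while the optimizer construction is a generic placeholder:

```python
import torch

# Hyperparameters quoted in Sec. 3.4.
PRETRAIN = dict(lr=5e-4, gpus=64, batch_size=128, iterations=400_000)        # ~5 days
PERSONALIZE = dict(lr=1e-4, gpus=8, batch_size=16,
                   light_fit_iters=3_000, finetune_iters=10_000)             # ~3 hours incl. preprocessing

def make_optimizer(params, stage="pretrain"):
    cfg = PRETRAIN if stage == "pretrain" else PERSONALIZE
    return torch.optim.Adam(params, lr=cfg["lr"])
```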

4 Experiments

4.1 Datasets

Studio Captures. Our studio capture setup is similar to the capture system presented by Cao et al. [2022] and Saito et al. [2024], where we obtain calibrated and synchronized multi-view images at a resolution of 4,096 × 2,668 pixels through the use of 110 cameras and 460 white LED lights operating at 90 Hz. Participants are instructed to perform a predetermined set of various facial expressions, sentences, and gaze motions for approximately 144,000 frames. To observe diverse illumination patterns while maintaining stable facial tracking, we employ time-multiplexed illumination [Bi et al. 2020]. Specifically, full-on illumination is interleaved every third frame to facilitate tracking, while the remaining frames utilize either grouped or randomly selected sets of 5 lights. In total, this capture script was used to record data from 345 participants. Of these, 342 were used to train our universal relightable model, while the remaining three were reserved for evaluation.
Phone Captures. We captured 10 people under five alternating illumination conditions via mobile phone. Each video is 5–10 minutes long per subject. Participants performed a series of actions, including head rotations, changing eye gaze, making various facial expressions, and regular daily speech. They were captured in an LED wall cylinder with a diameter of 4.7 m and a height of 3 m. The LED panels have a maximum brightness exceeding 1,500 cd/m² and a pixel pitch of 2.84 mm to realistically light the participants under multiple real-world lighting environments. To estimate the ground-truth environment map for each lighting condition, we first geometrically calibrate the pose of all LED panels, so that we can ray-trace environment maps at arbitrary positions in the capture volume, including the head’s center of mass. Additionally, we characterize the LED panels’ response using a Ricoh Theta Z1 360° camera placed at the center of the capture volume, such that we can predict how a particular displayed lighting environment affects a participant in terms of incident illumination. For this, we display a ramp of red, green, blue and white lighting environments, one at a time, e.g., from 0% red (black) to 100% red, and capture a linear HDR image for each condition using a five-photo exposure bracket of raw images. The capture with white LED light is used only for validation. We further optimize a learnable per-camera 3 × 3 color correction matrix to account for differences in color spaces between cameras. For evaluation, we select one environment as input to build a personalized avatar for each subject. Note that we do not use the ground-truth illumination for our avatar reconstruction. We then relight the reconstructed avatar in other environments, and compare our rendered avatar with the ground-truth data. We randomly select 100 frames to cover sufficient variation of head rotation and expressions.
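The per-camera color correction can be implemented as a learnable 3 × 3 matrix applied to linear RGB pixels; below is a sketch under assumed shapes and identity initialization (camera names are hypothetical):

```python
import torch

class ColorCorrection(torch.nn.Module):
    """One learnable 3x3 color correction matrix per camera, applied to linear RGB images."""
    def __init__(self, camera_ids):
        super().__init__()
        self.ccm = torch.nn.ParameterDict(
            {cam: torch.nn.Parameter(torch.eye(3)) for cam in camera_ids})

    def forward(self, image: torch.Tensor, cam: str) -> torch.Tensor:
        # image: (H, W, 3) linear RGB; left-multiply each pixel by the camera's matrix.
        return torch.einsum("hwc,dc->hwd", image, self.ccm[cam])

corrector = ColorCorrection(["cam00", "cam01"])  # hypothetical camera ids
```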
Fig. 6: Qualitative comparison of environment relighting between FLARE [Bharadwaj et al. 2023] and our approach in an unseen test environment.

4.2 Comparisons

Method   MAE ↓    MSE ↓    SSIM ↑   LPIPS ↓
FLARE    0.0327   0.0068   0.8849   0.1722
Ours     0.0136   0.0025   0.9524   0.0605

Table 1: Comparison of environment relighting for phone-captured identities.
In Tab. 1, we show the quantitative evaluation of our method compared to FLARE, the current state-of-the-art relightable avatar reconstruction method from a phone scan [Bharadwaj et al. 2023]. We report mean-absolute-error (MAE), mean-squared-error (MSE), SSIM and LPIPS as the metrics for the face region. Unlike our approach, FLARE uses meshes as a shape representation and a parametric BRDF model (i.e., Lambertian for diffuse, and Cook and Torrance [1982] for specular) with only the first bounce considered. Our approach achieves significantly lower error, and the qualitative comparison in fig. 6 further confirms the substantial quality improvement. This illustrates the efficacy of our geometry and appearance representation as well as our universal relightable prior model.

4.3 Evaluation

We evaluate our key design choice of using albedo as an identity feature instead of a simple color-transformed mean texture [Cao et al. 2022]. Figure 7 shows that illumination-agnostic albedo greatly reduces artifacts in the zero-shot reconstruction case. In contrast, the simple color transformation leads to more “baked-in” artifacts in the reconstructed avatar. We also conducted an ablation study on the unified eye specular visibility decoder, and the results are shown in fig. 5a. While the baseline model loses glints due to overfitting to inaccurate ground-truth gaze, our unified decoder faithfully preserves eye glints. Furthermore, fig. 5b illustrates the efficacy of the proposed high-resolution tracked mesh over the coarse mesh tracking used in prior work [Cao et al. 2022; Saito et al. 2024] on a novel identity. We also show additional qualitative results in fig. 8. This shows that our approach generalizes to a wide range of identities, illuminations, and expressions.
Fig. 7: Comparison with and without albedo conditioning data.
Fig. 8: Relighting results on multiple identities with different expressions.

5 Discussion and Conclusion

The relighting quality of our work could be degraded in several cases. Our model relies on the universal prior learned from studio-captured data, and variations not covered by the training corpus may lead to suboptimal generalization. For example, as the captured subjects in the training data all wear gray T-shirts, the relighting of clothing is less accurate than for the head area (see the bottom row in fig. 6). Also, while our light estimation is robust, any inaccuracy in the light estimation leads to ‘baked-in’ artifacts. In particular, estimating high-frequency illumination remains a challenge, which mainly affects the quality of eye relighting. Future work could address this by incorporating a strong illumination prior [Gardner et al. 2022; Lyu et al. 2023]. Last but not least, the current personalization process requires multiple preprocessing steps and test-time finetuning, and is hence not instant. Enabling instant personalization with full relighting capability remains exciting future work.
We presented URAvatar, a novel framework to create photorealistic relightable head avatars driven by gaze, neck rotation, and a latent facial expression code in real time, with faithful diffuse scattering and specular highlights. Our experiments show that building a strong generalizable prior of global light transport for dynamic facial expressions from multi-identity studio data is now possible with a sufficiently expressive geometric representation (3D Gaussians) and appearance representation (learnable radiance transfer). We also evaluated the fidelity of our reconstruction on unseen natural illuminations by building a new virtual environment capture system consisting of LED display panels, and showed the efficacy of our method compared to prior works. Our work, for the first time, enables the faithful reconstruction of relightable avatars with global light transport from a single phone scan, unlocking the possibility of virtual teleportation with authentic avatars as a communication tool.

Supplemental Material

MP4 File
supplemental video

References

[1]
ShahRukh Athar, Zhixin Shu, and Dimitris Samaras. 2023. FLAME-in-NeRF: Neural control of radiance fields for free view face animation. In International Conference on Automatic Face and Gesture Recognition (FG).
[2]
ShahRukh Athar, Zexiang Xu, Kalyan Sunkavalli, Eli Shechtman, and Zhixin Shu. 2022. RigNeRF: Fully controllable neural 3D portraits. In Conference on Computer Vision and Pattern Recognition (CVPR). 20364–20373.
[3]
Pascal Bérard, Derek Bradley, Maurizio Nitti, Thabo Beeler, and Markus H Gross. 2014. High-quality capture of eyes. Transactions on Graphics (TOG) 33, 6 (2014), 223:1–12.
[4]
Shrisha Bharadwaj, Yufeng Zheng, Otmar Hilliges, Michael J. Black, and Victoria Fernandez Abrevaya. 2023. FLARE: Fast learning of Animatable and Relightable Mesh Avatars. Transactions on Graphics (TOG) 42, 6 (2023), 204:1–15.
[5]
Sai Bi, Stephen Lombardi, Shunsuke Saito, Tomas Simon, Shih-En Wei, Kevyn McPhail, Ravi Ramamoorthi, Yaser Sheikh, and Jason Saragih. 2021. Deep relightable appearance models for animatable faces. Transactions on Graphics (TOG) 40, 4 (2021), 89:1–15.
[6]
Sai Bi, Zexiang Xu, Kalyan Sunkavalli, Miloš Hašan, Yannick Hold-Geoffroy, David Kriegman, and Ravi Ramamoorthi. 2020. Deep reflectance volumes: Relightable reconstructions from multi-view photometric images. In European Conference on Computer Vision (ECCV). 294–311.
[7]
Volker Blanz and Thomas Vetter. 2023. A morphable model for the synthesis of 3D faces. In Seminal Graphics Papers: Pushing the Boundaries, Volume 2. 157–164.
[8]
Chen Cao, Tomas Simon, Jin Kyu Kim, Gabe Schwartz, Michael Zollhoefer, Shunsuke Saito, Stephen Lombardi, Shih-En Wei, Danielle Belko, Shoou-I Yu, Yaser Sheikh, and Jason Saragih. 2022. Authentic volumetric avatars from a phone scan. Transactions on Graphics (TOG) 41, 4 (2022), 163:1–19.
[9]
Chen Cao, Yanlin Weng, Shun Zhou, Yiying Tong, and Kun Zhou. 2013. FaceWarehouse: A 3D facial expression database for visual computing. IEEE Transactions on Visualization and Computer Graphics 20, 3 (2013), 413–425.
[10]
Chen Cao, Hongzhi Wu, Yanlin Weng, Tianjia Shao, and Kun Zhou. 2016. Real-time facial animation with image-based dynamic avatars. Transactions on Graphics (TOG) 35, 4 (2016), 126:1–12.
[11]
Anpei Chen, Zhang Chen, Guli Zhang, Kenny Mitchell, and Jingyi Yu. 2019. Photo-realistic facial details synthesis from single image. In International Conference on Computer Vision (ICCV). 9429–9439.
[12]
Zhaoxi Chen, Gyeongsik Moon, Kaiwen Guo, Chen Cao, Stanislav Pidhorskyi, Tomas Simon, Rohan Joshi, Yuan Dong, Yichen Xu, Bernardo Pires, He Wen, Lucas Evans, Bo Peng, Julia Buffalini, Autumn Trimble, Kevyn McPhail, Melissa Schoeller, Shoou-I Yu, Javier Romero, Michael Zollhöfer, Yaser Sheikh, Ziwei Liu, and Shunsuke Saito. 2024. URHand: Universal Relightable Hands. In Conference on Computer Vision and Pattern Recognition (CVPR).
[13]
Robert L Cook and Kenneth E. Torrance. 1982. A reflectance model for computer graphics. Transactions on Graphics (TOG) 1, 1 (1982), 7–24.
[14]
Paul Debevec, Tim Hawkins, Chris Tchou, Haarm-Pieter Duiker, Westley Sarokin, and Mark Sagar. 2000. Acquiring the reflectance field of a human face. In SIGGRAPH. 145–156.
[15]
Boyang Deng, Yifan Wang, and Gordon Wetzstein. 2024. LumiGAN: Unconditional Generation of Relightable 3D Human Faces. In International Conference on 3D Vision (3DV). 302–312.
[16]
Jose I Echevarria, Derek Bradley, Diego Gutierrez, and Thabo Beeler. 2014. Capturing and stylizing hair for 3D fabrication. Transactions on Graphics (TOG) 33, 4 (2014), 125:1–11.
[17]
E Friesen and Paul Ekman. 1978. Facial action coding system: a technique for the measurement of facial movement. Palo Alto 3, 2 (1978), 5.
[18]
Graham Fyffe, Paul Graham, Borom Tunwattanapong, Abhijeet Ghosh, and Paul Debevec. 2016. Near-Instant Capture of High-Resolution Facial Geometry and Reflectance. In Computer Graphics Forum, Vol. 35. 353–363.
[19]
Guy Gafni, Justus Thies, Michael Zollhofer, and Matthias Nießner. 2021. Dynamic neural radiance fields for monocular 4D facial avatar reconstruction. In Conference on Computer Vision and Pattern Recognition (CVPR). 8649–8658.
[20]
James Gardner, Bernhard Egger, and William Smith. 2022. Rotation-equivariant conditional spherical neural fields for learning a natural illumination prior. Advances in Neural Information Processing Systems 35 (2022), 26309–26323.
[21]
Abhijeet Ghosh, Graham Fyffe, Borom Tunwattanapong, Jay Busch, Xueming Yu, and Paul Debevec. 2011. Multiview face capture using polarized spherical gradient illumination. Transactions on Graphics (TOG) 30, 6 (2011), 129:1–10.
[22]
Philip-William Grassal, Malte Prinzler, Titus Leistner, Carsten Rother, Matthias Nießner, and Justus Thies. 2022. Neural head avatars from monocular RGB videos. In Conference on Computer Vision and Pattern Recognition (CVPR). 18653–18664.
[23]
David Ha, Andrew Dai, and Quoc V Le. 2017. Hypernetworks. In International Conference on Learning Representations (ICLR).
[24]
Liwen Hu, Shunsuke Saito, Lingyu Wei, Koki Nagano, Jaewoo Seo, Jens Fursund, Iman Sadeghi, Carrie Sun, Yen-Chun Chen, and Hao Li. 2017. Avatar digitization from a single image for real-time rendering. Transactions on Graphics (TOG) 36, 6 (2017), 195:1–14.
[25]
Alexandru Eugen Ichim, Sofien Bouaziz, and Mark Pauly. 2015. Dynamic 3D avatar creation from hand-held video input. Transactions on Graphics (TOG) 34, 4 (2015), 45:1–14.
[26]
Justin Johnson, Alexandre Alahi, and Li Fei-Fei. 2016. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision (ECCV). 694–711.
[27]
Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 2023. 3D Gaussian splatting for real-time radiance field rendering. Transactions on Graphics (TOG) 42, 4 (2023), 139:1–14.
[28]
Hoon Kim, Minje Jang, Wonjun Yoon, Jisoo Lee, Donghyun Na, and Sanghyun Woo. 2024. SwitchLight: Co-design of Physics-driven Architecture and Pre-training Framework for Human Portrait Relighting. In Conference on Computer Vision and Pattern Recognition (CVPR).
[29]
Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).
[30]
Alexandros Lattas, Stylianos Moschoglou, Baris Gecer, Stylianos Ploumpis, Vasileios Triantafyllou, Abhijeet Ghosh, and Stefanos Zafeiriou. 2020. AvatarMe: Realistically Renderable 3D Facial Reconstruction “in-the-wild”. In Conference on Computer Vision and Pattern Recognition (CVPR). 760–769.
[31]
Alexandros Lattas, Stylianos Moschoglou, Stylianos Ploumpis, Baris Gecer, Abhijeet Ghosh, and Stefanos Zafeiriou. 2021. AvatarMe++: Facial shape and BRDF inference with photorealistic rendering-aware GANs. IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 12 (2021), 9269–9284.
[32]
John P Lewis, Ken Anjyo, Taehyun Rhee, Mengjie Zhang, Frederic H Pighin, and Zhigang Deng. 2014. Practice and theory of blendshape facial models. Eurographics State-of-the-Art Reports 1, 8 (2014), 2.
[33]
Gengyan Li, Abhimitra Meka, Franziska Mueller, Marcel C Buehler, Otmar Hilliges, and Thabo Beeler. 2022. EyeNeRF: a hybrid representation for photorealistic synthesis, animation and relighting of human eyes. Transactions on Graphics (TOG) 41, 4 (2022), 166:1–16.
[34]
Hao Li, Jihun Yu, Yuting Ye, and Chris Bregler. 2013. Realtime facial animation with on-the-fly correctives. Transactions on Graphics (TOG) 32, 4 (2013), 42:1–10.
[35]
Junxuan Li, Shunsuke Saito, Tomas Simon, Stephen Lombardi, Hongdong Li, and Jason Saragih. 2023. MEGANE: Morphable Eyeglass and Avatar Network. In Conference on Computer Vision and Pattern Recognition (CVPR). 12769–12779.
[36]
Ruilong Li, Kalle Bladin, Yajie Zhao, Chinmay Chinara, Owen Ingraham, Pengda Xiang, Xinglei Ren, Pratusha Bhuvana Prasad, Bipin Kishore, Jun Xing, and Hao Li. 2020. Learning Formation of Physically-Based Face Attributes. In Conference on Computer Vision and Pattern Recognition (CVPR). 3407–3416.
[37]
Tianye Li, Timo Bolkart, Michael J Black, Hao Li, and Javier Romero. 2017. Learning a model of facial shape and expression from 4D scans. Transactions on Graphics (TOG) 36, 6 (2017), 194:1–17.
[38]
Connor Lin, Koki Nagano, Jan Kautz, Eric Chan, Umar Iqbal, Leonidas Guibas, Gordon Wetzstein, and Sameh Khamis. 2023. Single-shot implicit morphable faces with consistent texture parameterization. In SIGGRAPH Conference Proceedings. 83:1–12.
[39]
Shichen Liu, Yunxuan Cai, Haiwei Chen, Yichao Zhou, and Yajie Zhao. 2022. Rapid Face Asset Acquisition with Recurrent Feature Alignment. Transactions on Graphics (TOG) 41, 6 (2022), 214:1–17.
[40]
Stephen Lombardi, Jason Saragih, Tomas Simon, and Yaser Sheikh. 2018. Deep appearance models for face rendering. Transactions on Graphics (TOG) 37, 4 (2018), 68:1–13.
[41]
Stephen Lombardi, Tomas Simon, Gabriel Schwartz, Michael Zollhoefer, Yaser Sheikh, and Jason Saragih. 2021. Mixture of volumetric primitives for efficient neural rendering. Transactions on Graphics (TOG) 40, 4 (2021), 59:1–13.
[42]
Linjie Luo, Hao Li, and Szymon Rusinkiewicz. 2013. Structure-aware hair capture. Transactions on Graphics (TOG) 32, 4 (2013), 76:1–12.
[43]
Linjie Lyu, Ayush Tewari, Marc Habermann, Shunsuke Saito, Michael Zollhöfer, Thomas Leimküehler, and Christian Theobalt. 2023. Diffusion Posterior Illumination for Ambiguity-aware Inverse Rendering. Transactions on Graphics (TOG) 42, 6 (2023).
[44]
Shugao Ma, Tomas Simon, Jason Saragih, Dawei Wang, Yuecheng Li, Fernando De La Torre, and Yaser Sheikh. 2021. Pixel codec avatars. In Conference on Computer Vision and Pattern Recognition (CVPR). 64–73.
[45]
Wan-Chun Ma, Tim Hawkins, Pieter Peers, Charles-Felix Chabert, Malte Weiss, and Paul E Debevec. 2007. Rapid Acquisition of Specular and Diffuse Normal Maps from Polarized Spherical Gradient Illumination. Rendering Techniques 2007, 9 (2007), 10.
[46]
Yiqun Mei, Yu Zeng, He Zhang, Zhixin Shu, Xuaner Zhang, Sai Bi, Jianming Zhang, HyunJoon Jung, and Vishal M Patel. 2024. Holo-Relighting: Controllable Volumetric Portrait Relighting from a Single Image. In Conference on Computer Vision and Pattern Recognition (CVPR). 4263–4273.
[47]
Abhimitra Meka, Christian Haene, Rohit Pandey, Michael Zollhöfer, Sean Fanello, Graham Fyffe, Adarsh Kowdle, Xueming Yu, Jay Busch, Jason Dourgarian, et al. 2019. Deep reflectance fields: high-quality facial reflectance field inference from color gradient illumination. Transactions on Graphics (TOG) 38, 4 (2019), 77:1–12.
[48]
Abhimitra Meka, Rohit Pandey, Christian Haene, Sergio Orts-Escolano, Peter Barnum, Philip Davidson, Daniel Erickson, Yinda Zhang, Jonathan Taylor, Sofien Bouaziz, et al. 2020. Deep relightable textures: volumetric performance capture with neural rendering. Transactions on Graphics (TOG) 39, 6 (2020), 259:1–21.
[49]
Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. 2021. NeRF: Representing scenes as neural radiance fields for view synthesis. Commun. ACM 65, 1 (2021), 99–106.
[50]
Koki Nagano, Jaewoo Seo, Jun Xing, Lingyu Wei, Zimo Li, Shunsuke Saito, Aviral Agarwal, Jens Fursund, and Hao Li. 2018. paGAN: real-time avatars using dynamic textures. Transactions on Graphics (TOG) 37, 6 (2018), 258:1–12.
[51]
Giljoo Nam, Chenglei Wu, Min H Kim, and Yaser Sheikh. 2019. Strand-accurate multi-view hair capture. In Conference on Computer Vision and Pattern Recognition (CVPR). 155–164.
[52]
Rohit Pandey, Sergio Orts Escolano, Chloe Legendre, Christian Haene, Sofien Bouaziz, Christoph Rhemann, Paul Debevec, and Sean Fanello. 2021. Total relighting: learning to relight portraits for background replacement. Transactions on Graphics (TOG) 40, 4 (2021), 43:1–21.
[53]
Pieter Peers, Naoki Tamura, Wojciech Matusik, and Paul Debevec. 2007. Post-production facial performance relighting using reflectance transfer. Transactions on Graphics (TOG) 26, 3 (2007), 52–es.
[54]
Frédéric Pighin, Jamie Hecker, Dani Lischinski, Richard Szeliski, and David H Salesin. 2006. Synthesizing realistic facial expressions from photographs. In SIGGRAPH Courses.
[55]
Anurag Ranjan, Timo Bolkart, Soubhik Sanyal, and Michael J Black. 2018. Generating 3D faces using convolutional mesh autoencoders. In European Conference on Computer Vision (ECCV). 704–720.
[56]
Anurag Ranjan, Kwang Moo Yi, Jen-Hao Rick Chang, and Oncel Tuzel. 2023. FaceLit: Neural 3D Relightable Faces. In Conference on Computer Vision and Pattern Recognition (CVPR). 8619–8628.
[57]
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI. 234–241.
[58]
Shunsuke Saito, Gabriel Schwartz, Tomas Simon, Junxuan Li, and Giljoo Nam. 2024. Relightable Gaussian Codec Avatars. In Conference on Computer Vision and Pattern Recognition (CVPR).
[59]
Gabriel Schwartz, Shih-En Wei, Te-Li Wang, Stephen Lombardi, Tomas Simon, Jason Saragih, and Yaser Sheikh. 2020. The eyes have it: An integrated eye and face model for photorealistic facial animation. Transactions on Graphics (TOG) 39, 4 (2020), 91:1–15.
[60]
Soumyadip Sengupta, Brian Curless, Ira Kemelmacher-Shlizerman, and Steven M Seitz. 2021. A light stage on every desk. In International Conference on Computer Vision (ICCV). 2420–2429.
[61]
Peter-Pike Sloan, Jan Kautz, and John Snyder. 2002. Precomputed Radiance Transfer for Real-Time Rendering in Dynamic, Low-Frequency Lighting Environments. Transactions on Graphics (TOG) 21, 3 (2002), 527–536.
[62]
William AP Smith, Alassane Seck, Hannah Dee, Bernard Tiddeman, Joshua B Tenenbaum, and Bernhard Egger. 2020. A morphable face albedo model. In Conference on Computer Vision and Pattern Recognition (CVPR). 5011–5020.
[63]
Tiancheng Sun, Jonathan T Barron, Yun-Ta Tsai, Zexiang Xu, Xueming Yu, Graham Fyffe, Christoph Rhemann, Jay Busch, Paul Debevec, and Ravi Ramamoorthi. 2019. Single image portrait relighting. Transactions on Graphics (TOG) 38, 4 (2019), 79:1–12.
[64]
Feitong Tan, Sean Fanello, Abhimitra Meka, Sergio Orts-Escolano, Danhang Tang, Rohit Pandey, Jonathan Taylor, Ping Tan, and Yinda Zhang. 2022. VoLux-GAN: A generative model for 3D face synthesis with HDRI relighting. In ACM SIGGRAPH 2022 Conference Proceedings. 1–9.
[65]
Ayush Tewari, Mohamed Elgharib, Gaurav Bharaj, Florian Bernard, Hans-Peter Seidel, Patrick Pérez, Michael Zollhofer, and Christian Theobalt. 2020. StyleRig: Rigging StyleGAN for 3D control over portrait images. In Conference on Computer Vision and Pattern Recognition (CVPR). 6142–6151.
[66]
Justus Thies, Michael Zollhöfer, and Matthias Nießner. 2019. Deferred neural rendering: Image synthesis using neural textures. Transactions on Graphics (TOG) 38, 4 (2019), 66:1–12.
[67]
Luan Tran and Xiaoming Liu. 2018. Nonlinear 3D face morphable model. In Conference on Computer Vision and Pattern Recognition (CVPR). 7346–7355.
[68]
Daniel Vlasic, Matthew Brand, Hanspeter Pfister, and Jovan Popovic. 2006. Face transfer with multilinear models. In SIGGRAPH Courses.
[69]
Jiaping Wang, Peiran Ren, Minmin Gong, John Snyder, and Baining Guo. 2009. All-frequency rendering of dynamic, spatially-varying reflectance. In ACM SIGGRAPH Asia. 131:1–10.
[70]
Lizhen Wang, Xiaochen Zhao, Jingxiang Sun, Yuxiang Zhang, Hongwen Zhang, Tao Yu, and Yebin Liu. 2023b. StyleAvatar: Real-time Photo-realistic Portrait Avatar from a Single Video. In SIGGRAPH 2023 Conference Proceedings.
[71]
Yifan Wang, Aleksander Holynski, Xiuming Zhang, and Xuaner Zhang. 2023a. Sunstage: Portrait reconstruction and relighting using the sun as a light stage. In Conference on Computer Vision and Pattern Recognition (CVPR). 20792–20802.
[72]
Zhibo Wang, Xin Yu, Ming Lu, Quan Wang, Chen Qian, and Feng Xu. 2020. Single image portrait relighting via explicit multiple reflectance channel modeling. Transactions on Graphics (TOG) 39, 6 (2020), 220:1–13.
[73]
Chenglei Wu, Derek Bradley, Pablo Garrido, Michael Zollhöfer, Christian Theobalt, Markus H Gross, and Thabo Beeler. 2016. Model-based teeth reconstruction. Transactions on Graphics (TOG) 35, 6 (2016), 220:1–13.
[74]
Yiheng Xie, Towaki Takikawa, Shunsuke Saito, Or Litany, Shiqin Yan, Numair Khan, Federico Tombari, James Tompkin, Vincent Sitzmann, and Srinath Sridhar. 2022. Neural fields in visual computing and beyond. In Computer Graphics Forum, Vol. 41. 641–676.
[75]
Yingyan Xu, Prashanth Chandran, Sebastian Weiss, Markus Gross, Gaspard Zoss, and Derek Bradley. 2024. Artist-Friendly Relightable and Animatable Neural Heads. In Conference on Computer Vision and Pattern Recognition (CVPR). 2457–2467.
[76]
Yuelang Xu, Hongwen Zhang, Lizhen Wang, Xiaochen Zhao, Han Huang, Guojun Qi, and Yebin Liu. 2023. LatentAvatar: Learning Latent Expression Code for Expressive Neural Head Avatar. In SIGGRAPH Conference Proceedings.
[77]
Shugo Yamaguchi, Shunsuke Saito, Koki Nagano, Yajie Zhao, Weikai Chen, Kyle Olszewski, Shigeo Morishima, and Hao Li. 2018. High-fidelity facial reflectance and geometry inference from an unconstrained image. Transactions on Graphics (TOG) 37, 4 (2018), 162:1–14.
[78]
Haotian Yang, Mingwu Zheng, Wanquan Feng, Haibin Huang, Yu-Kun Lai, Pengfei Wan, Zhongyuan Wang, and Chongyang Ma. 2023. Towards Practical Capture of High-Fidelity Relightable Avatars. In SIGGRAPH Asia 2023 Conference Proceedings.
[79]
Haotian Yang, Mingwu Zheng, Chongyang Ma, Yu-Kun Lai, Pengfei Wan, and Haibin Huang. 2024. VRMM: A volumetric relightable morphable head model. In ACM SIGGRAPH Conference Papers.
[80]
Yu-Ying Yeh, Koki Nagano, Sameh Khamis, Jan Kautz, Ming-Yu Liu, and Ting-Chun Wang. 2022. Learning to relight portrait images via a virtual light stage and synthetic-to-real adaptation. Transactions on Graphics (TOG) 41, 6 (2022), 231:1–21.
[81]
Xiuming Zhang, Sean Fanello, Yun-Ta Tsai, Tiancheng Sun, Tianfan Xue, Rohit Pandey, Sergio Orts-Escolano, Philip Davidson, Christoph Rhemann, Paul Debevec, et al. 2021. Neural light transport for relighting and view synthesis. Transactions on Graphics (TOG) 40, 1 (2021), 9:1–17.
[82]
Yufeng Zheng, Victoria Fernández Abrevaya, Xu Chen, Marcel C. Buhler, Michael J. Black, and Otmar Hilliges. 2021. I M Avatar: Implicit Morphable Head Avatars from Videos. In Conference on Computer Vision and Pattern Recognition (CVPR). 13535–13545.
[83]
Yufeng Zheng, Wang Yifan, Gordon Wetzstein, Michael J Black, and Otmar Hilliges. 2023. PointAvatar: Deformable point-based head avatars from videos. In Conference on Computer Vision and Pattern Recognition (CVPR). 21057–21067.
[84]
Wojciech Zielonka, Timo Bolkart, and Justus Thies. 2023. Instant volumetric head avatars. In Conference on Computer Vision and Pattern Recognition (CVPR). 4574–4584.
