Our goal is to obtain a drivable, photorealistic, and geometrically detailed avatar of a real human in any type of clothing solely learned from multi-view RGB video. More precisely, given a skeleton motion and virtual camera view as input, we want to synthesize highly realistic renderings of the human in motion as well as the high-fidelity and deforming geometry in real time. An overview of our method is shown in Figure
1. Next, we define the problem setting (Section
3.1). Then, we describe the main challenges of space canonicalization that current methods are facing followed by our proposed space mapping, which alleviates the inherent ambiguities (Section
3.2). Given this novel space canonicalization strategy, we show how this UTTS can be efficiently parameterized with a tri-plane representation leading to real-time performance during rendering and geometry recovery (Section
3.3). Last, we introduce our supervision and training strategy (Section
3.4).
3.1 Problem Setting
Input Assumptions. We assume that a segmented multi-view video of a human actor, captured with C calibrated and synchronized RGB cameras, as well as a static 3D template are given.
\(\mathbf {I}_{f,c} \in \mathbb {R}^{H \times W \times 3}\) denotes frame
f of camera
c where
W and
H are the width and height of the image, respectively. We then extract the skeletal pose
\(\boldsymbol {\theta }_f \in \mathbb {R}^P\) for each frame
f using markerless motion capture [TheCaptury
2020]. Here,
P denotes the number of
degrees of freedom (
DoFs). A skeletal motion from frame
\(f-k\) to
f is denoted as
\(\boldsymbol {\theta }_{\bar{f}} \in \mathbb {R}^{kP}\) and
\(\hat{\boldsymbol {\theta }}_{\bar{f}}\) is the translation-normalized equivalent, i.e., obtained by displacing the root motion such that the translation of frame
f is zero. During training, our model takes the skeletal motion as input and the multi-view videos as supervision, while at inference, our method only requires a skeletal motion and a virtual camera.
Static Scene Representation. Recent progress in neural scene representation learning has shown great success in terms of geometry reconstruction [Wang et al.
2021,
2023a] and view synthesis [Mildenhall et al.
2021] of
static scenes by employing neural fields. Inspired by NeuS [Wang et al.
2021], we also represent the human geometry and appearance as neural fields
\(\mathcal {F}_\mathrm{sdf}\) and
\(\mathcal {F}_\mathrm{col}\):
where
\(\mathbf {x}_i \in \mathbb {R}^3\) is a point along the camera ray
\(r(t_i,\mathbf {o}, \mathbf {d}) = \mathbf {o} + t_i \mathbf {d}\) with origin
\(\mathbf {o}\) and direction
\(\mathbf {d}\).
\(p(\cdot)\) is a positional encoding [Mildenhall et al.
2021] to better model and synthesize higher frequency details. The SDF field stores the SDF
\(s_i\) and a respective shape code
\(\mathbf {q}_i\) for every point
\(\mathbf {x}_i\) in global space. Note that the normal at point
\(\mathbf {x}_i\) can be computed as
\(\mathbf {n}_i =\frac{\partial s_i}{\partial \mathbf {x}_i}\). Moreover, the color field encodes the color
\(\mathbf {c}_i\), and as it is conditioned on the viewing direction
\(\mathbf {d}\), it can also encode view-dependent appearance changes. In practice, both fields are parameterized as
multi-layer perceptrons (
MLPs) with learnable weights
\(\Gamma\) and
\(\Psi\).
To render the color of a ray (pixel), volume rendering is performed, which accumulates the color
\(\mathbf {c}_i\) and the density
\(\alpha _i\) along the ray as
Here, the density
\(\alpha _i\) is a function of the SDF. For an unbiased SDF estimate, the conversion from SDF to density can be defined as
where
z is a trainable parameter whose reciprocal approaches 0 when training converges. For a detailed derivation, we refer to the original work [Wang et al.
2021]. The scene geometry and appearance can then be supervised solely by comparing the obtained pixel color with the ground truth color, typically using an L1 loss. Important for us, this representation allows the modeling of fine geometric details and appearance while only requiring multi-view imagery. However, for now, this representation only supports static scenes and requires multiple hours of training (even for a single frame).
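For intuition, a minimal sketch of this NeuS-style rendering for a single ray could look as follows; the exact discretization and the naming of the scale parameter are assumptions based on the cited work rather than a definitive implementation.

```python
import torch

def render_ray(sdf, color, inv_std):
    """NeuS-style volume rendering of one ray (illustrative sketch).

    sdf:     (S,)   SDF values at S ordered samples along the ray
    color:   (S, 3) predicted colors at the samples
    inv_std: ()     trainable scale; its reciprocal shrinks toward 0 as training converges
    """
    cdf = torch.sigmoid(sdf * inv_std)                                  # logistic CDF of the SDF
    alpha = ((cdf[:-1] - cdf[1:]) / (cdf[:-1] + 1e-5)).clamp(min=0.0)   # unbiased discrete opacity
    trans = torch.cumprod(torch.cat([alpha.new_ones(1), 1.0 - alpha + 1e-7])[:-1], dim=0)
    weights = alpha * trans                                             # per-sample weights
    rgb = (weights[:, None] * color[:-1]).sum(dim=0)                    # composited pixel color
    acc = weights.sum()                                                 # accumulated opacity, used for the mask loss
    return rgb, acc
```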
Problem Setting. Instead, we want to learn a dynamic, controllable, and efficient human representation
\(\mathcal {H}_\mathrm{sdf}\) and
\(\mathcal {H}_\mathrm{col}\):
which is conditioned on the skeletal motion of the human as well. Note that the SDF, shape feature, and color are now functions of the skeletal motion, as indicated by the subscript
\((\cdot)_f\). Previous work [Liu et al.
2021] has shown that naively adding the motion as a function input to the field leads to blurred and unrealistic results. Many works [Liu et al.
2021; Peng et al.
2021b, 2021a] have therefore tried to transform points into a canonical 3D pose space and then query the neural field in this canonical space. This has been shown to improve quality; however, these methods typically parameterize the field in this space with an MLP, leading to slow runtimes.
Tri-planes [Chan et al.
2022] offer an efficient alternative and have been applied to generative tasks, however, mostly for convex surfaces such as faces, where the mapping onto the planes introduces little ambiguity. Using them to represent the complex, articulated, and dynamic structure of humans in clothing requires additional attention since, if not handled carefully, the mapping onto the tri-plane can lead to so-called mapping collisions, where multiple points in global space map onto the same tri-plane locations. Thus, in the remainder of this section, we first introduce our UTTS, which effectively reduces these collisions (Section
3.2). Then, we explain how the tri-plane can be conditioned on the skeletal motion using an efficient encoding of surface dynamics into texture space, which is then decoded into the tri-plane features leveraging a 3D-aware convolutional architecture [Wang et al.
2023b] (Section
3.3). Last, we describe our supervision and training strategy (Section
3.4).
3.2 Undeformed Tri-Plane Texture Space
Intuitively, our idea is that one of the tri-planes, i.e., the surface plane, corresponds to the surface of a skeletal motion-conditioned deformable human mesh model, while the other two planes, i.e., the perpendicular planes, are perpendicular to the first one and to each other. Next, we define the deformable and skeletal motion-dependent surface model of the human.
Motion-dependent and Deformable Human Model. We assume a person-specific, rigged and skinned triangular mesh with
N vertices
\(\mathbf {M} \in \mathbb {R}^{N \times 3}\) is given and the vertex connectivity remains fixed. The triangular mesh
\(\mathbf {M}\) is obtained from a 3D scanner [Treedys
2020] and down-sampled to around
\(5{,}000\) vertices to strike a balance between quality and efficiency. Now, we denote the deformable and motion-dependent human model as
where
\(\Omega \in \mathbb {R}^{W}\) are the learnable network weights and
\(\mathbf {V}_{\bar{f}} \in \mathbb {R}^{N \times 3}\) are the posed and non-rigidly deformed vertex positions. Important for us, this function has to satisfy two properties: (1) It has to be a function of the skeletal motion. (2) It has to be able to capture non-rigid surface deformations.
We found that the representation of Habermann et al. [
2021] meets these requirements, and we, thus, leverage it for our task. In their formulation, the human geometry is first non-rigidly deformed in a canonical pose as
where
\(\mathbf {M}_v \in \mathbb {R}^{3}\) and
\(\mathbf {Y}_v \in \mathbb {R}^{3}\) denote the undeformed and deformed template vertices in the rest pose.
\(\mathcal {N}_{\mathrm{vn},v} \subset \mathbb {N}\) denotes the set of indices of the embedded graph nodes that are connected to template mesh vertex \(v\).
\(\mathbf {G}_k \in \mathbb {R}^{3}\),
\(\mathbf {A}_k \in \mathbb {R}^{3}\), and
\(\mathbf {T}_k \in \mathbb {R}^{3}\) indicate the rest positions, rotation Euler angles, and translations of the embedded graph nodes. Specifically, the connectivity of the embedded graph
\(\mathbf {G}_k\) can be obtained by simplifying the deformable template mesh
\(\mathbf {M}\) with quadric edge collapse decimation in MeshLab [Cignoni et al.
2008].
\(R(\cdot)\) denotes the function that converts the Euler angle to a rotation matrix. Similar to [Sorkine and Alexa
2007], we compute the weight applied to the neighboring vertices
\(\mathbf {w}_{v,k} \in \mathbb {R}\) based on geodesic distances.
To model higher-frequency deformations, an additional per-vertex displacement
\(\mathbf {D}_i \in \mathbb {R}^{3}\) is added. The embedded graph parameters
\(\mathbf {A},\mathbf {T}\), and per-vertex displacements
\(\mathbf {D}\) are further functions of translation-normalized skeletal motion implemented as two graph convolutional networks
where the skeletal motion is encoded according to [Habermann et al.
2021]. For more details, we refer to the original work.
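As an illustration of this canonical-pose deformation, the following PyTorch sketch blends per-node rigid transformations with geodesic-distance-based weights and adds the per-vertex displacements. Whether the displacements are applied before or after the graph deformation, and the exact argument layout, are assumptions of this sketch.

```python
import torch

def embedded_deformation(M, G, R, T, D, nbr_idx, nbr_w):
    """Sketch of an embedded-graph deformation in the style of Habermann et al. [2021].

    M: (N, 3) template vertices in the rest pose
    G: (K, 3) rest positions of the embedded-graph nodes
    R: (K, 3, 3) node rotations, i.e., R(A_k) already converted from Euler angles
    T: (K, 3) node translations
    D: (N, 3) per-vertex displacements (regressed by a GCN in the method)
    nbr_idx: (N, J) indices of the J graph nodes influencing each vertex
    nbr_w:   (N, J) weights w_{v,k} based on geodesic distances (rows sum to 1)
    """
    Gn = G[nbr_idx]                                    # (N, J, 3) node rest positions per vertex
    Rn = R[nbr_idx]                                    # (N, J, 3, 3)
    Tn = T[nbr_idx]                                    # (N, J, 3)
    local = M[:, None, :] - Gn                         # vertex relative to each node
    rotated = torch.einsum('njab,njb->nja', Rn, local) # rotate by the node rotations
    blended = (nbr_w[..., None] * (rotated + Gn + Tn)).sum(dim=1)  # (N, 3)
    # Adding D after the graph deformation is an assumption of this sketch.
    return blended + D
```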
Finally, the deformed vertices
\(\mathbf {Y}_{v}\) in the rest pose can be posed using
Dual Quaternion (
DQ) skinning
\(\mathcal {S}\) [Kavan et al.
2007], which defines the motion-dependent deformable model
Note that Equation (
12) is (1) solely a function of the skeletal motion and (2) can account for non-rigid deformations by means of training the weights
\(\Omega\) and, thus, this formulation satisfies our initial requirements.
Non-rigid Space Canonicalization. Next, we introduce our non-rigid space canonicalization function (see also Figure
2)
which takes the deformable template and a point
\(\mathbf {x}\) in global space and maps it to the so-called UTTS, denoted as
\(\bar{\mathbf {x}}\), as explained in the following. Given the point \(\mathbf {x}\) in global space, let \(\mathbf {p}\) denote its closest point on the posed and deformed template \(\mathbf {V}_{\bar{f}}\), and let \(\lbrace \mathbf {v}_a, \mathbf {v}_b, \mathbf {v}_c \rbrace\) be the vertices of the non-degenerate triangle on which \(\mathbf {p}\) lies. This closest point can lie on a face, an edge, or a vertex of the mesh. In the following, we discuss these different cases, where the goal is to find the 2D texture coordinate of
\(\mathbf {p}\) as well as the distance between
\(\mathbf {x}\) and
\(\mathbf {p}\), which then defines the 3D coordinate
\(\bar{\mathbf {x}}\) in UTTS.
(1) Face. If the closest point lies on the triangular surface, the projected position
\(\mathbf {p}\) and the 2D texture coordinate
\(\mathbf {u}\) are computed as
where \(\mathbf {n}_f\) denotes the face normal of the closest triangle, and \(\mathbf {u}_a\), \(\mathbf {u}_b\), and \(\mathbf {u}_c\) denote the texture coordinates of the triangle vertices.
(2) Edge. For global points
\(\mathbf {x}\) mapping onto the edge
\((\mathbf {v}_a, \mathbf {v}_b)\), the projected position
\(\mathbf {p}\) and the 2D texture coordinate
\(\mathbf {u}\) are defined as
(3) Vertex. If the global point
\(\mathbf {x}\) maps onto a vertex
\(\mathbf {v}_a\), the projected position
\(\mathbf {p}\) and the 2D texture coordinate
\(\mathbf {u}\) are defined as
Given the projected position
\(\mathbf {p}\), we compute the signed distance between
\(\mathbf {x}\) and
\(\mathbf {p}\) as
which will be required in the following.
So far, we can now canonicalize points from global 3D space to our UTTS space, and we denote the canonical point \((\mathbf {u},d)^T\) or \((u_x,u_y,d)^T\) simply as \(\bar{\mathbf {x}}\). Note that \(\mathbf {u}=(u_x, u_y)\) denotes the point on our surface plane and \((u_x,d)\) and \((u_y,d)\) correspond to the points on the perpendicular planes. These coordinates can now be used to query the features on the respective tri-planes.
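For the most common case (1), where the closest point lies inside a triangle, a sketch of this mapping could look as follows; the concrete sign convention of the distance and the barycentric computation are illustrative assumptions.

```python
import torch

def utts_face_case(x, va, vb, vc, ua, ub, uc):
    """Map a global point x to UTTS coordinates (u_x, u_y, d), assuming its
    closest point lies inside the triangle (v_a, v_b, v_c); a sketch only.

    x: (3,) global sample; va, vb, vc: (3,) triangle vertices;
    ua, ub, uc: (2,) texture coordinates of the triangle vertices.
    """
    n = torch.linalg.cross(vb - va, vc - va)
    n = n / n.norm()                                   # face normal
    d = torch.dot(x - va, n)                           # signed distance to the triangle plane
    p = x - d * n                                      # projected position on the plane

    # Barycentric coordinates of p (valid since p is assumed to lie in the face).
    def area(a, b, c):
        return 0.5 * torch.linalg.cross(b - a, c - a).norm()
    full = area(va, vb, vc)
    wa = area(p, vb, vc) / full
    wb = area(va, p, vc) / full
    wc = area(va, vb, p) / full

    u = wa * ua + wb * ub + wc * uc                    # interpolated texture coordinate
    return torch.cat([u, d.view(1)])                   # (u_x, u_y, d) in UTTS
```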
Concerning mapping collisions, we highlight that case (1), where a point maps onto a triangle, is a bijection and, thus, the concatenated tri-plane features are unique, which is our goal. Only in cases (2) and (3) can the aforementioned collisions happen, since the 2D texture coordinate on the surface is no longer unique for points with the same distance to a point on a mesh edge or to the mesh vertex itself. However, the occurrence of these cases highly depends on how far away from the deformable surface samples are still being taken. By constraining the maximum distance to
\(d_\mathrm{max}\), which effectively means we only draw samples close to the deformable surface, we found that cases (2) and (3) happen less frequently. However, when the deformable model is not well aligned, this introduces an error by design, as no samples are drawn near the real surface in regions covered by the human. Therefore, we gradually deform the surface along the SDF field to account for such cases and iteratively reduce
\(d_\mathrm{max}\). In the limit, this strategy reduces the mapping collisions, improves sampling efficiency, and ensures that the sampled points do not miss the real surface. More details about this will be explained in Section
3.4 and our supplemental material.
3.3 Efficient and Motion-Dependent Tri-Plane Encoding
So far, we are able to map points in the global space to our UTTS space; however, as mentioned earlier, we want to ensure that the tri-planes contain skeletal motion-aware features. Thus, we propose a 3D-aware convolutional motion encoder:
which takes several 2D textures as input, encoding the position
\(\mathbf {T}_{\mathrm{p},f}\), velocity
\(\mathbf {T}_{\mathrm{v},f}\), acceleration
\(\mathbf {T}_{\mathrm{a},f}\), texture coordinate
\(\mathbf {T}_{\mathrm{u},f}\), and normal
\(\mathbf {T}_{\mathrm{n},f}\) of the deforming human mesh surface, which we root normalize, i.e., we subtract the skeletal root translation from the mesh vertex positions (Equation (
12)) and scale them to a range of
\([-1, 1]\). Note that the individual texel values for
\(\mathbf {T}_{\mathrm{v},f}\),
\(\mathbf {T}_{\mathrm{a},f}\),
\(\mathbf {T}_{\mathrm{u},f}\) and
\(\mathbf {T}_{\mathrm{n},f}\) can be simply computed using inverse texture mapping. The first 3 textures, i.e., position
\(\mathbf {T}_{\mathrm{p},f}\), velocity
\(\mathbf {T}_{\mathrm{v},f}\), and acceleration
\(\mathbf {T}_{\mathrm{a},f}\), encode the dynamics of the deforming surface, which can be computed on the fly from the skinned template of the current and previous frames. The texture coordinate map
\(\mathbf {T}_{\mathrm{u},f}\) encodes a unique ID for each texel covered by a triangle in the UV-atlas. The normal textures
\(\mathbf {T}_{\mathrm{n},f}\) are adopted to emphasize the surface orientation. All textures have a resolution of
\(256 \times 256\). Here,
\(\mathbf {g}_f\) is a
global motion code (
GMC), which is obtained by encoding the translation normalized motion vector
\(\hat{\boldsymbol {\theta }}_{\bar{f}}\) through a shallow MLP. Notably, the GMC provides awareness of global skeletal motion and, thus, is able to encode global shape and appearance effects, which may be hard to encode through the above texture inputs.
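The surface dynamics inputs can be derived from the posed template of the current and previous frames; the sketch below illustrates this with simple finite differences on the position textures and a hypothetical shallow MLP for the GMC (its width and code dimension are assumptions).

```python
import torch
import torch.nn as nn

class GlobalMotionEncoder(nn.Module):
    """Hypothetical shallow MLP producing the global motion code (GMC)."""
    def __init__(self, motion_dim: int, code_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(motion_dim, 256), nn.ReLU(),
            nn.Linear(256, code_dim),
        )

    def forward(self, theta_hat):          # (B, k*P) translation-normalized motion
        return self.net(theta_hat)         # (B, code_dim)

def dynamics_textures(T_p_f, T_p_f1, T_p_f2):
    """Velocity and acceleration textures as finite differences of the root-
    normalized position textures of frames f, f-1, f-2 (an assumption of this
    sketch); each texture is (B, 3, 256, 256)."""
    T_v = T_p_f - T_p_f1                       # first-order difference
    T_a = T_p_f - 2.0 * T_p_f1 + T_p_f2        # second-order difference
    return T_v, T_a
```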
Given the motion textures and the GMC, we first adopt three separate convolutional layers to generate the coarse-level/initial features for each plane of the tri-plane. Inspired by the design of Wang et al. [
2023b], we adopt a 5-layer UNet with roll-out convolutions to fuse the features from different planes, which enhances the spatial consistency in the feature space. Moreover, we concatenate the GMC channel-wise to the bottleneck feature maps to provide an awareness of the global skeletal motion. Please refer to the supplemental materials for more details regarding the network architectures of the 3D-aware convolutional motion encoder and the global motion encoder.
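A minimal sketch of the roll-out idea, loosely following Wang et al. [2023b], is shown below: the three plane feature maps are concatenated along the width axis so that a single 2D convolution can exchange information across planes. The actual 5-layer UNet and the channel-wise GMC injection at the bottleneck are omitted here.

```python
import torch
import torch.nn as nn

class RollOutBlock(nn.Module):
    """Sketch of a roll-out convolution block: the three plane feature maps are
    laid out side by side along the width axis and processed jointly."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, plane_x, plane_y, plane_z):                # each (B, C, H, W)
        rolled = torch.cat([plane_x, plane_y, plane_z], dim=3)   # (B, C, H, 3W)
        rolled = torch.relu(self.conv(rolled))                   # cross-plane information exchange
        return torch.chunk(rolled, 3, dim=3)                     # split back into three planes
```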
Finally, our motion encoder outputs three orthogonal skeletal motion-dependent tri-planes
\(\mathbf {P}_{x,f}\),
\(\mathbf {P}_{y,f}\), and
\(\mathbf {P}_{z,f}\). The tri-plane feature for a sample
\(\bar{\mathbf {x}}_i\) in UTTS space can be obtained by querying the planes
\(\mathbf {P}_{x,f},\mathbf {P}_{y,f},\mathbf {P}_{z,f}\) at
\(\mathbf {u} = (u_x, u_y)\),
\((u_x,d)\), and
\((u_y,d)\), respectively, thanks to our UTTS mapping. The final motion-dependent tri-plane feature
\(\mathbf {F}_{i,f}\) for a sample
\(\bar{\mathbf {x}}_i\) can then be obtained by concatenating the three individual features from each plane. Finally, our initial human representation in Equations (
6) and (
7) can be re-defined with the proposed efficient motion-dependent tri-plane as
Here,
\(\mathbf {t}_f\) is the global position of the character, accounting for the fact that appearance can change depending on the global position in space due to non-uniform lighting conditions. In practice, the above functions are parameterized by two shallow (4-layer) MLPs with a width of 256, since most of the capacity is in the tri-plane motion encoder whose evaluation time is
independent of the number of samples along a ray. Thus, evaluating a single sample
i can be efficiently performed, leading to real-time performance.
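The query step can be sketched as follows with bilinear look-ups into the three planes at the UTTS coordinates; the normalization of the coordinates to [-1, 1] (as required by grid_sample) is an assumption of this sketch.

```python
import torch
import torch.nn.functional as F

def sample_triplane(P_x, P_y, P_z, x_bar):
    """Query motion-dependent tri-plane features for UTTS samples (a sketch).

    P_x, P_y, P_z: (1, C, H, W) feature planes (surface plane and the two
                   perpendicular planes)
    x_bar:         (M, 3) UTTS coordinates (u_x, u_y, d), assumed normalized to [-1, 1]
    """
    u_x, u_y, d = x_bar[:, 0], x_bar[:, 1], x_bar[:, 2]
    coords = [
        torch.stack([u_x, u_y], dim=-1),   # surface plane queried at u
        torch.stack([u_x, d], dim=-1),     # first perpendicular plane
        torch.stack([u_y, d], dim=-1),     # second perpendicular plane
    ]
    feats = []
    for plane, uv in zip((P_x, P_y, P_z), coords):
        grid = uv.view(1, -1, 1, 2)                          # (1, M, 1, 2)
        f = F.grid_sample(plane, grid, align_corners=True)   # (1, C, M, 1)
        feats.append(f[0, :, :, 0].t())                      # (M, C)
    return torch.cat(feats, dim=-1)                          # concatenated feature F_{i,f}
```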
3.4 Supervision and Training Strategy
First, we pre-train the deformable mesh model (Equation (
8)) according to [Habermann et al.
2021]. Then, the training of our human representation proceeds in three stages: field pre-training, SDF-driven surface refinement, and field fine-tuning. Additionally, a real-time optimization approach is proposed to generate detailed and temporally coherent triangular meshes during inference. In Table
1, we illustrate the model components’ status in different stages, indicating whether they are activated or frozen during training. We refer to the supplemental materials for the implementation details of the loss terms.
Field Pre-Training. Given the initial deformed mesh, we train the SDF (Equation (
19)) and color field (Equation (
20)) using the following losses:
Here, \(\mathcal {L}_\mathrm{col}\) and \(\mathcal {L}_\mathrm{mask}\) denote an L1 color loss and a mask loss, ensuring that the rendered color for a ray matches the ground truth one and that the accumulated transmittance along the ray coincides with the ground truth masks. Moreover, the Eikonal loss
\(\mathcal {L}_\mathrm{eik}\) [Gropp et al.
2020] regularizes the network predictions for the SDF value. Last, we introduce a seam loss
\(\mathcal {L}_\mathrm{seam}\), which samples points along texture seams on the mesh. For a single point on the seam, the two corresponding UV coordinates in the 3D texture space are computed, and both are randomly displaced along the third dimension, resulting in two samples, where the loss ensures that the SDF network predicts the same value for both samples. This ensures that the SDF prediction on a texture seam is consistent. More details about the seam loss are provided in the supplemental document.
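A sketch of the seam loss under these definitions is given below; the choice of an L1 penalty and the uniform sampling of the displacement are assumptions.

```python
import torch

def seam_loss(sdf_net, seam_uv_a, seam_uv_b, d_max: float):
    """Sketch of the seam consistency loss: a point on a texture seam has two UV
    coordinates; both are displaced by the same random offset along the third
    (distance) dimension, and the SDF predictions are encouraged to agree.

    sdf_net:   assumed callable mapping (M, 3) UTTS points to (M,) SDF values
    seam_uv_a: (M, 2) first UV coordinate of each sampled seam point
    seam_uv_b: (M, 2) second UV coordinate of the same seam points
    """
    d = (torch.rand(seam_uv_a.shape[0], 1, device=seam_uv_a.device) * 2.0 - 1.0) * d_max
    sample_a = torch.cat([seam_uv_a, d], dim=-1)
    sample_b = torch.cat([seam_uv_b, d], dim=-1)
    return (sdf_net(sample_a) - sdf_net(sample_b)).abs().mean()
```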
SDF-driven Surface Refinement. Once the training of the SDF and color fields has converged, we refine the pre-trained deformable mesh model, i.e., the learnable embedded graph, to better align it with the SDF, using the following loss terms:
The SDF loss
\(\mathcal {L}_\mathrm{sdf}\) ensures that the SDF queried at the template vertex positions is equal to zero, thus dragging the mesh toward the implicit surface estimate of the network. Although this term could also backpropagate into the mapping directly, i.e., into the morphable clothed human body model, we found that network training is more stable when keeping the mapping fixed according to the initial deformed mesh.
\(\mathcal {L}_\mathrm{reg}\) denotes the Laplacian loss that penalizes the Laplacian difference between the updated posed template vertices and the posed template vertices before the surface refinement.
\(\mathcal {L}_\mathrm{zero}\) denotes a smoothing term that pushes the Laplacian of the template vertices toward zero. As flipped faces would lead to abrupt changes in the UV parameterization, we adopt a face normal consistency loss \(\mathcal {L}_\mathrm{normal}\) to prevent face flipping, which is computed from the cosine similarity of neighboring face normals. Moreover, as degenerate faces would lead to numerical errors in the UV mapping, we adopt a face stretching loss \(\mathcal {L}_\mathrm{area}\), which is computed from the deviation of the edge lengths within each face. Again, more details about the individual loss terms can be found in the supplemental document. Importantly, the better SDF-aligned template allows us to adaptively lower the maximum distance
\(d_\mathrm{max}\) for the tri-plane dimension orthogonal to the UV layout
without missing the real surface while also reducing mapping collisions.
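The central SDF alignment term can be sketched as follows, where the UTTS mapping is evaluated with respect to the initial deformed mesh and only the query points carry gradients; the absolute-value penalty is an assumption.

```python
import torch

def sdf_alignment_loss(sdf_net, utts_map_fixed, posed_vertices):
    """Sketch of the SDF term: the SDF queried at the posed, deformed template
    vertices should vanish, pulling the explicit mesh onto the implicit surface.

    sdf_net:        assumed callable mapping UTTS points (N, 3) to SDF values (N,)
    utts_map_fixed: assumed callable mapping global points to UTTS points; computed
                    w.r.t. the initial deformed mesh and kept fixed, as described in
                    the text, while gradients still flow through the query points
    posed_vertices: (N, 3) vertices of the deformable model being refined
    """
    x_bar = utts_map_fixed(posed_vertices)
    return sdf_net(x_bar).abs().mean()   # L1 penalty on the SDF values (assumed)
```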
Field Finetuning. Since the deformable mesh is now updated, the implicit field surrounding it has to be updated accordingly. Therefore, in our last stage, we once more refine the SDF and color field using the following losses
while lowering the distance
\(d_\mathrm{max}\), which effectively reduces mapping collisions. This time, we also add a patch-based perceptual loss \(\mathcal {L}_\mathrm{perc}\) [Zhang et al. 2018] and a Laplacian pyramid loss \(\mathcal {L}_\mathrm{lap}\) [Bojanowski et al. 2018]. We found that this helps improve the level of detail in appearance and geometry.
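For the patch-based perceptual term, a sketch using the publicly available LPIPS implementation of Zhang et al. [2018] could look as follows; the VGG backbone and the input range are assumptions, and the Laplacian pyramid term is omitted.

```python
import torch
import lpips  # LPIPS perceptual metric of Zhang et al. [2018]

lpips_vgg = lpips.LPIPS(net='vgg')  # backbone choice is an assumption of this sketch

def patch_perceptual_loss(pred_patches, gt_patches):
    """Perceptual loss on randomly cropped 128x128 patches; inputs are
    (B, 3, 128, 128) tensors, assumed to be scaled to [-1, 1]."""
    return lpips_vgg(pred_patches, gt_patches).mean()
```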
Real-time Mesh Optimization. At test time, we propose a real-time mesh optimization, which embosses the template mesh with fine-grained and motion-aware geometric details from the implicit field. We subdivide the original template once using edge subdivision, i.e., splitting each edge in half, to obtain a higher-resolution output. Then, we update the subdivided template mesh along the implicit field, i.e., by evaluating the SDF value at each vertex and displacing the vertex along its normal by the magnitude of the SDF value. Due to our efficient SDF evaluation leveraging the tri-planar representation, this optimization is very efficient, allowing real-time generation of high-quality and consistent clothed human geometry.
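The embossing step can be sketched as follows; using the vertex normal as a proxy for the SDF gradient direction and the sign convention (positive SDF outside, outward-pointing normals) are assumptions of this sketch.

```python
import torch

def emboss_mesh(sdf_net, utts_map, vertices, vertex_normals):
    """Sketch of the test-time mesh refinement: each (subdivided) template vertex
    is pushed along its normal by the locally evaluated SDF value, embossing the
    implicit detail onto the explicit mesh.

    vertices, vertex_normals: (N, 3); sdf_net and utts_map are assumed callables
    as in the previous sketches.
    """
    with torch.no_grad():
        s = sdf_net(utts_map(vertices))              # (N,) signed distances at the vertices
    # Move toward the zero level set; the normal approximates the SDF gradient here.
    return vertices - s[:, None] * vertex_normals
```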
Implementation Details. Our approach is implemented in PyTorch [Paszke et al. 2019] with custom CUDA kernels. Specifically, we implement the skeletal computations, the rasterization-based ray sample filter, and the mapping as custom CUDA kernels; the remaining components are implemented in PyTorch. We train our method on a single Nvidia A100 graphics card using ground truth images with a resolution of
\(1285 \times 940\). The Field Pre-Training stage is trained for 600K iterations using a learning rate of 5e-4 scheduled with a cosine decay scheduler, which takes around 2 days. Here, we set the distance
\(d_\mathrm{max}=4\,\mathrm{cm}\). We perform a random sampling of
\(4{,}096\) rays from the foreground pixels of the ground truth images. Along each of these rays, we take 64 samples for ray marching. The loss terms supervising the Field Pre-Training stage (Equation (21)) are weighted as 1.0, 0.1, 0.1, and 1.0 in the order of appearance in the equation. The SDF-driven Surface Refinement stage is trained for 200K iterations using a learning rate of 1e-5, which takes around 0.4 days. Here, the losses (Equation (
22)) are weighted with 1.0, 0.15, 0.005, 0.005, and 5.0, again in the order of appearance in the equation. Last, the Field Finetuning stage is trained for 300K iterations with a learning rate of 2e-4, decayed with a cosine decay scheduler, which takes around 1.4 days. Here, we set the distance \(d_\mathrm{max}=2\,\mathrm{cm}\).
Similar to the Field Pre-Training stage, we again randomly sample 4,096 rays from the foreground pixels and take 64 samples per ray for ray marching. Moreover, we randomly crop patches with a resolution of \(128 \times 128\) for evaluating the perceptual-related losses, i.e., \(\mathcal {L}_\mathrm{lap}\) and \(\mathcal {L}_\mathrm{perc}\). This time, the losses (Equation (
23)) are weighted with 1.0, 0.1, 0.1, 1.0, 1.0, 0.5.