Our goal is to obtain a drivable, photorealistic, and geometrically detailed avatar of a real human in any type of clothing solely learned from multi-view RGB video. More precisely, given a skeleton motion and virtual camera view as input, we want to synthesize highly realistic renderings of the human in motion as well as the high-fidelity and deforming geometry in real time. An overview of our method is shown in Figure
1. Next, we define the problem setting (Section
3.1). Then, we describe the main challenges of space canonicalization that current methods are facing followed by our proposed space mapping, which alleviates the inherent ambiguities (Section
3.2). Given this novel space canonicalization strategy, we show how this UTTS can be efficiently parameterized with a tri-plane representation leading to real-time performance during rendering and geometry recovery (Section
3.3). Last, we introduce our supervision and training strategy (Section
3.4).
3.1 Problem Setting
Input Assumptions. We assume that a segmented multi-view video of a human actor, captured with C calibrated and synchronized RGB cameras, as well as a static 3D template are given.
\(\mathbf {I}_{f,c} \in \mathbb {R}^{H \times W \times 3}\) denotes frame
f of camera
c where
W and
H are the width and height of the image, respectively. We then extract the skeletal pose
\(\boldsymbol {\theta }_f \in \mathbb {R}^P\) for each frame
f using markerless motion capture [TheCaptury
2020]. Here,
P denotes the number of
degrees of freedom (
DoFs). A skeletal motion from frame
\(f-k\) to
f is denoted as
\(\boldsymbol {\theta }_{\bar{f}} \in \mathbb {R}^{kP}\) and
\(\hat{\boldsymbol {\theta }}_{\bar{f}}\) is the translation-normalized equivalent, i.e., obtained by displacing the root motion such that the translation of frame
f is zero. During training, our model takes the skeletal motion as input and the multi-view videos as supervision, while at inference, our method only requires a skeletal motion and a virtual camera.
Static Scene Representation. Recent progress in neural scene representation learning has shown great success in terms of geometry reconstruction [Wang et al.
2021,
2023a] and view synthesis [Mildenhall et al.
2021] of
static scenes by employing neural fields. Inspired by NeuS [Wang et al.
2021], we also represent the human geometry and appearance as neural fields
\(\mathcal {F}_\mathrm{sdf}\) and
\(\mathcal {F}_\mathrm{col}\):
where
\(\mathbf {x}_i \in \mathbb {R}^3\) is a point along the camera ray
\(r(t_i,\mathbf {o}, \mathbf {d}) = \mathbf {o} + t_i \mathbf {d}\) with origin
\(\mathbf {o}\) and direction
\(\mathbf {d}\).
\(p(\cdot)\) is a positional encoding [Mildenhall et al.
2021] to better model and synthesize higher frequency details. The SDF field stores the SDF
\(s_i\) and a respective shape code
\(\mathbf {q}_i\) for every point
\(\mathbf {x}_i\) in global space. Note that the normal at point
\(\mathbf {x}_i\) can be computed as
\(\mathbf {n}_i =\frac{\partial s_i}{\partial \mathbf {x}_i}\). Moreover, the color field encodes the color
\(\mathbf {c}_i\), and as it is conditioned on the viewing direction
\(\mathbf {d}\), it can also encode view-dependent appearance changes. In practice, both fields are parameterized as
multi-layer perceptrons (
MLPs) with learnable weights
\(\Gamma\) and
\(\Psi\).
To render the color of a ray (pixel), volume rendering is performed, which accumulates the color
\(\mathbf {c}_i\) and the density
\(\alpha _i\) along the ray as
Here, the density
\(\alpha _i\) is a function of the SDF. For an unbiased SDF estimate, the conversion from SDF to density can be defined as
where
z is a trainable parameter whose reciprocal approaches 0 when training converges. For a detailed derivation, we refer to the original work [Wang et al.
2021]. The scene geometry and appearance can then be supervised solely by comparing the obtained pixel color with the ground truth color, typically using an L1 loss. Important for us, this representation allows the modeling of fine geometric details and appearance while only requiring multi-view imagery. However, for now, this representation only supports static scenes and requires multiple hours of training (even for a single frame).
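For intuition, a minimal sketch of this NeuS-style rendering for a single ray could look as follows; the exact discretization and the naming of the scale parameter are assumptions based on the cited work rather than a definitive implementation.

```python
import torch

def render_ray(sdf, color, inv_std):
    """NeuS-style volume rendering of one ray (illustrative sketch).

    sdf:     (S,)   SDF values at S ordered samples along the ray
    color:   (S, 3) predicted colors at the samples
    inv_std: ()     trainable scale; its reciprocal shrinks toward 0 as training converges
    """
    cdf = torch.sigmoid(sdf * inv_std)                                  # logistic CDF of the SDF
    alpha = ((cdf[:-1] - cdf[1:]) / (cdf[:-1] + 1e-5)).clamp(min=0.0)   # unbiased discrete opacity
    trans = torch.cumprod(torch.cat([alpha.new_ones(1), 1.0 - alpha + 1e-7])[:-1], dim=0)
    weights = alpha * trans                                             # per-sample weights
    rgb = (weights[:, None] * color[:-1]).sum(dim=0)                    # composited pixel color
    acc = weights.sum()                                                 # accumulated opacity, used for the mask loss
    return rgb, acc
```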
Problem Setting. Instead, we want to learn a dynamic, controllable, and efficient human representation
\(\mathcal {H}_\mathrm{sdf}\) and
\(\mathcal {H}_\mathrm{col}\):
which is conditioned on the skeletal motion of the human as well. Note that the SDF, shape feature, and color are now functions of the skeletal motion, as indicated by the subscript
\((\cdot)_f\). Previous work [Liu et al.
2021] has shown that naively adding the motion as a function input to the field leads to blurred and unrealistic results. Many works [Liu et al.
2021; Peng et al.
2021b, 2021a] have therefore tried to transform points into a canonical 3D pose space and then query the neural field in this canonical space. This has been shown to improve quality; however, these methods typically parameterize the field in this space with an MLP, leading to slow runtimes.
Tri-planes [Chan et al.
2022] offer an efficient alternative and have been applied to generative tasks, however, mostly for convex surfaces such as faces, where the mapping onto the planes introduces little ambiguity. Using them to represent the complex, articulated, and dynamic structure of humans in clothing requires additional attention since, if not handled carefully, the mapping onto the tri-plane can lead to so-called mapping collisions, where multiple points in global space map onto the same tri-plane locations. Thus, in the remainder of this section, we first introduce our UTTS, which effectively reduces these collisions (Section
3.2). Then, we explain how the tri-plane can be conditioned on the skeletal motion using an efficient encoding of surface dynamics into texture space, which is then decoded into the tri-plane features leveraging a 3D-aware convolutional architecture [Wang et al.
2023b] (Section
3.3). Last, we describe our supervision and training strategy (Section
3.4).
3.2 Undeformed Tri-Plane Texture Space
Intuitively, our idea is that one of the tri-planes, i.e., the surface plane, corresponds to the surface of a skeletal motion-conditioned deformable human mesh model, while the other two planes, i.e., the perpendicular planes, are perpendicular to the first one and to each other. Next, we define the deformable and skeletal motion-dependent surface model of the human.
Motion-dependent and Deformable Human Model. We assume a person-specific, rigged and skinned triangular mesh with
N vertices
\(\mathbf {M} \in \mathbb {R}^{N \times 3}\) is given and the vertex connectivity remains fixed. The triangular mesh
\(\mathbf {M}\) is obtained from a 3D scanner [Treedys
2020] and down-sampled to around
\(5{,}000\) vertices to strike a balance between quality and efficiency. Now, we denote the deformable and motion-dependent human model as
where
\(\Omega \in \mathbb {R}^{W}\) are the learnable network weights and
\(\mathbf {V}_{\bar{f}} \in \mathbb {R}^{N \times 3}\) are the posed and non-rigidly deformed vertex positions. Important for us, this function has to satisfy two properties: (1) It has to be a function of the skeletal motion. (2) It has to be able to capture non-rigid surface deformations.
We found that the representation of Habermann et al. [
2021] meets these requirements, and we, thus, leverage it for our task. In their formulation, the human geometry is first non-rigidly deformed in a canonical pose as
where
\(\mathbf {M}_v \in \mathbb {R}^{3}\) and
\(\mathbf {Y}_v \in \mathbb {R}^{3}\) denote the undeformed and deformed template vertices in the rest pose.
\(\mathcal {N}_{\mathrm{vn},v} \subset \mathbb {N}\) denotes the set of indices of the embedded graph nodes that are connected to template mesh vertex \(v\).
\(\mathbf {G}_k \in \mathbb {R}^{3}\),
\(\mathbf {A}_k \in \mathbb {R}^{3}\), and
\(\mathbf {T}_k \in \mathbb {R}^{3}\) indicate the rest positions, rotation Euler angles, and translations of the embedded graph nodes. Specifically, the connectivity of the embedded graph
\(\mathbf {G}_k\) can be obtained by simplifying the deformable template mesh
\(\mathbf {M}\) with quadric edge collapse decimation in MeshLab [Cignoni et al.
2008].
\(R(\cdot)\) denotes the function that converts the Euler angle to a rotation matrix. Similar to [Sorkine and Alexa
2007], we compute the weight applied to the neighboring vertices
\(\mathbf {w}_{v,k} \in \mathbb {R}\) based on geodesic distances.
To model higher-frequency deformations, an additional per-vertex displacement
\(\mathbf {D}_i \in \mathbb {R}^{3}\) is added. The embedded graph parameters
\(\mathbf {A},\mathbf {T}\), and per-vertex displacements
\(\mathbf {D}\) are further functions of translation-normalized skeletal motion implemented as two graph convolutional networks
where the skeletal motion is encoded according to [Habermann et al.
2021]. For more details, we refer to the original work.
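As an illustration of this canonical-pose deformation, the following PyTorch sketch blends per-node rigid transformations with geodesic-distance-based weights and adds the per-vertex displacements. Whether the displacements are applied before or after the graph deformation, and the exact argument layout, are assumptions of this sketch.

```python
import torch

def embedded_deformation(M, G, R, T, D, nbr_idx, nbr_w):
    """Sketch of an embedded-graph deformation in the style of Habermann et al. [2021].

    M: (N, 3) template vertices in the rest pose
    G: (K, 3) rest positions of the embedded-graph nodes
    R: (K, 3, 3) node rotations, i.e., R(A_k) already converted from Euler angles
    T: (K, 3) node translations
    D: (N, 3) per-vertex displacements (regressed by a GCN in the method)
    nbr_idx: (N, J) indices of the J graph nodes influencing each vertex
    nbr_w:   (N, J) weights w_{v,k} based on geodesic distances (rows sum to 1)
    """
    Gn = G[nbr_idx]                                    # (N, J, 3) node rest positions per vertex
    Rn = R[nbr_idx]                                    # (N, J, 3, 3)
    Tn = T[nbr_idx]                                    # (N, J, 3)
    local = M[:, None, :] - Gn                         # vertex relative to each node
    rotated = torch.einsum('njab,njb->nja', Rn, local) # rotate by the node rotations
    blended = (nbr_w[..., None] * (rotated + Gn + Tn)).sum(dim=1)  # (N, 3)
    # Adding D after the graph deformation is an assumption of this sketch.
    return blended + D
```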
Finally, the deformed vertices
\(\mathbf {Y}_{v}\) in the rest pose can be posed using
Dual Quaternion (
DQ) skinning
\(\mathcal {S}\) [Kavan et al.
2007], which defines the motion-dependent deformable model
Note that Equation (
12) is (1) solely a function of the skeletal motion and (2) can account for non-rigid deformations by means of training the weights
\(\Omega\) and, thus, this formulation satisfies our initial requirements.
Non-rigid Space Canonicalization. Next, we introduce our non-rigid space canonicalization function (see also Figure
2)
which takes the deformable template and a point
\(\mathbf {x}\) in global space and maps it to the so-called UTTS, denoted as
\(\bar{\mathbf {x}}\), as explained in the following. Given the point \(\mathbf {x}\) in global space, let \(\mathbf {p}\) denote its closest point on the posed and deformed template \(\mathbf {V}_{\bar{f}}\), and let \(\lbrace \mathbf {v}_a, \mathbf {v}_b, \mathbf {v}_c \rbrace\) be the vertices of the non-degenerate triangle on which \(\mathbf {p}\) lies. This closest point can lie on a face, an edge, or a vertex of the mesh. In the following, we discuss these different cases, where the goal is to find the 2D texture coordinate of
\(\mathbf {p}\) as well as the distance between
\(\mathbf {x}\) and
\(\mathbf {p}\), which then defines the 3D coordinate
\(\bar{\mathbf {x}}\) in UTTS.
(1) Face. If the closest point lies on the triangular surface, the projected position
\(\mathbf {p}\) and the 2D texture coordinate
\(\mathbf {u}\) are computed as
where \(\mathbf {n}_f\) denotes the face normal of the closest triangle, and \(\mathbf {u}_a\), \(\mathbf {u}_b\), and \(\mathbf {u}_c\) denote the texture coordinates of the triangle vertices.
(2) Edge. For global points
\(\mathbf {x}\) mapping onto the edge
\((\mathbf {v}_a, \mathbf {v}_b)\), the projected position
\(\mathbf {p}\) and the 2D texture coordinate
\(\mathbf {u}\) are defined as
(3) Vertex. If the global point
\(\mathbf {x}\) maps onto a vertex
\(\mathbf {v}_a\), the projected position
\(\mathbf {p}\) and the 2D texture coordinate
\(\mathbf {u}\) are defined as
Given the projected position
\(\mathbf {p}\), we compute the signed distance between
\(\mathbf {x}\) and
\(\mathbf {p}\) as
which will be required in the following.
So far, we can now canonicalize points from global 3D space to our UTTS space, and we denote the canonical point \((\mathbf {u},d)^T\) or \((u_x,u_y,d)^T\) simply as \(\bar{\mathbf {x}}\). Note that \(\mathbf {u}=(u_x, u_y)\) denotes the point on our surface plane and \((u_x,d)\) and \((u_y,d)\) correspond to the points on the perpendicular planes. These coordinates can now be used to query the features on the respective tri-planes.
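For the most common case (1), where the closest point lies inside a triangle, a sketch of this mapping could look as follows; the concrete sign convention of the distance and the barycentric computation are illustrative assumptions.

```python
import torch

def utts_face_case(x, va, vb, vc, ua, ub, uc):
    """Map a global point x to UTTS coordinates (u_x, u_y, d), assuming its
    closest point lies inside the triangle (v_a, v_b, v_c); a sketch only.

    x: (3,) global sample; va, vb, vc: (3,) triangle vertices;
    ua, ub, uc: (2,) texture coordinates of the triangle vertices.
    """
    n = torch.linalg.cross(vb - va, vc - va)
    n = n / n.norm()                                   # face normal
    d = torch.dot(x - va, n)                           # signed distance to the triangle plane
    p = x - d * n                                      # projected position on the plane

    # Barycentric coordinates of p (valid since p is assumed to lie in the face).
    def area(a, b, c):
        return 0.5 * torch.linalg.cross(b - a, c - a).norm()
    full = area(va, vb, vc)
    wa = area(p, vb, vc) / full
    wb = area(va, p, vc) / full
    wc = area(va, vb, p) / full

    u = wa * ua + wb * ub + wc * uc                    # interpolated texture coordinate
    return torch.cat([u, d.view(1)])                   # (u_x, u_y, d) in UTTS
```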
Concerning mapping collisions, we highlight that case (1), where a point maps onto a triangle, is a bijection and, thus, the concatenated tri-plane features are unique, which is our goal. Only in cases (2) and (3) can the aforementioned collisions happen, since the 2D texture coordinate on the surface is no longer unique for points with the same distance to a point on a mesh edge or to the mesh vertex itself. However, the occurrence of these cases highly depends on how far away from the deformable surface samples are still being taken. By constraining the maximum distance to
\(d_\mathrm{max}\), which effectively means we only draw samples close to the deformable surface, we found that cases (2) and (3) happen less frequently. However, when the deformable model is not well aligned, this introduces an error by design, as no samples are drawn near the real surface in regions covered by the human. Therefore, we gradually deform the surface along the SDF field to account for such cases and iteratively reduce
\(d_\mathrm{max}\). In the limit, this strategy reduces the mapping collisions, improves sampling efficiency, and ensures that the sampled points do not miss the real surface. More details about this will be explained in Section
3.4 and our supplemental material.
3.3 Efficient and Motion-Dependent Tri-Plane Encoding
So far, we are able to map points in the global space to our UTTS space; however, as mentioned earlier, we want to ensure that the tri-planes contain skeletal motion-aware features. Thus, we propose a 3D-aware convolutional motion encoder:
which takes several 2D textures as input, encoding the position
\(\mathbf {T}_{\mathrm{p},f}\), velocity
\(\mathbf {T}_{\mathrm{v},f}\), acceleration
\(\mathbf {T}_{\mathrm{a},f}\), texture coordinate
\(\mathbf {T}_{\mathrm{u},f}\), and normal
\(\mathbf {T}_{\mathrm{n},f}\) of the deforming human mesh surface, which we root normalize, i.e., we subtract the skeletal root translation from the mesh vertex positions (Equation (
12)) and scale them to a range of
\([-1, 1]\). Note that the individual texel values for
\(\mathbf {T}_{\mathrm{v},f}\),
\(\mathbf {T}_{\mathrm{a},f}\),
\(\mathbf {T}_{\mathrm{u},f}\) and
\(\mathbf {T}_{\mathrm{n},f}\) can be simply computed using inverse texture mapping. The first 3 textures, i.e., position
\(\mathbf {T}_{\mathrm{p},f}\), velocity
\(\mathbf {T}_{\mathrm{v},f}\), and acceleration
\(\mathbf {T}_{\mathrm{a},f}\), encode the dynamics of the deforming surface, which can be computed on the fly from the skinned template of the current and previous frames. The texture coordinate map
\(\mathbf {T}_{\mathrm{u},f}\) encodes a unique ID for each texel covered by a triangle in the UV-atlas. The normal textures
\(\mathbf {T}_{\mathrm{n},f}\) are adopted to emphasize the surface orientation. All textures have a resolution of
\(256 \times 256\). Here,
\(\mathbf {g}_f\) is a
global motion code (
GMC), which is obtained by encoding the translation normalized motion vector
\(\hat{\boldsymbol {\theta }}_{\bar{f}}\) through a shallow MLP. Notably, the GMC provides awareness of global skeletal motion and, thus, is able to encode global shape and appearance effects, which may be hard to encode through the above texture inputs.
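The surface dynamics inputs can be derived from the posed template of the current and previous frames; the sketch below illustrates this with simple finite differences on the position textures and a hypothetical shallow MLP for the GMC (its width and code dimension are assumptions).

```python
import torch
import torch.nn as nn

class GlobalMotionEncoder(nn.Module):
    """Hypothetical shallow MLP producing the global motion code (GMC)."""
    def __init__(self, motion_dim: int, code_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(motion_dim, 256), nn.ReLU(),
            nn.Linear(256, code_dim),
        )

    def forward(self, theta_hat):          # (B, k*P) translation-normalized motion
        return self.net(theta_hat)         # (B, code_dim)

def dynamics_textures(T_p_f, T_p_f1, T_p_f2):
    """Velocity and acceleration textures as finite differences of the root-
    normalized position textures of frames f, f-1, f-2 (an assumption of this
    sketch); each texture is (B, 3, 256, 256)."""
    T_v = T_p_f - T_p_f1                       # first-order difference
    T_a = T_p_f - 2.0 * T_p_f1 + T_p_f2        # second-order difference
    return T_v, T_a
```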
Given the motion textures and the GMC, we first adopt three separate convolutional layers to generate the coarse-level/initial features for each plane of the tri-plane. Inspired by the design of Wang et al. [
2023b], we adopt a 5-layer UNet with roll-out convolutions to fuse the features from different planes, which enhances the spatial consistency in the feature space. Moreover, we concatenate the GMC channel-wise to the bottleneck feature maps to provide an awareness of the global skeletal motion. Please refer to the supplemental materials for more details regarding the network architectures of the 3D-aware convolutional motion encoder and the global motion encoder.
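A minimal sketch of the roll-out idea, loosely following Wang et al. [2023b], is shown below: the three plane feature maps are concatenated along the width axis so that a single 2D convolution can exchange information across planes. The actual 5-layer UNet and the channel-wise GMC injection at the bottleneck are omitted here.

```python
import torch
import torch.nn as nn

class RollOutBlock(nn.Module):
    """Sketch of a roll-out convolution block: the three plane feature maps are
    laid out side by side along the width axis and processed jointly."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, plane_x, plane_y, plane_z):                # each (B, C, H, W)
        rolled = torch.cat([plane_x, plane_y, plane_z], dim=3)   # (B, C, H, 3W)
        rolled = torch.relu(self.conv(rolled))                   # cross-plane information exchange
        return torch.chunk(rolled, 3, dim=3)                     # split back into three planes
```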
Finally, our motion encoder outputs three orthogonal skeletal motion-dependent tri-planes
\(\mathbf {P}_{x,f}\),
\(\mathbf {P}_{y,f}\), and
\(\mathbf {P}_{z,f}\). The tri-plane feature for a sample
\(\bar{\mathbf {x}}_i\) in UTTS space can be obtained by querying the planes
\(\mathbf {P}_{x,f},\mathbf {P}_{y,f},\mathbf {P}_{z,f}\) at
\(\mathbf {u} = (u_x, u_y)\),
\((u_x,d)\), and
\((u_y,d)\), respectively, thanks to our UTTS mapping. The final motion-dependent tri-plane feature
\(\mathbf {F}_{i,f}\) for a sample
\(\bar{\mathbf {x}}_i\) can then be obtained by concatenating the three individual features from each plane. Finally, our initial human representation in Equations (
6) and (
7) can be re-defined with the proposed efficient motion-dependent tri-plane as
Here,
\(\mathbf {t}_f\) is the global position of the character, accounting for the fact that appearance can change depending on the global position in space due to non-uniform lighting conditions. In practice, the above functions are parameterized by two shallow (4-layer) MLPs with a width of 256, since most of the capacity is in the tri-plane motion encoder whose evaluation time is
independent of the number of samples along a ray. Thus, evaluating a single sample
i can be efficiently performed, leading to real-time performance.
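The query step can be sketched as follows with bilinear look-ups into the three planes at the UTTS coordinates; the normalization of the coordinates to [-1, 1] (as required by grid_sample) is an assumption of this sketch.

```python
import torch
import torch.nn.functional as F

def sample_triplane(P_x, P_y, P_z, x_bar):
    """Query motion-dependent tri-plane features for UTTS samples (a sketch).

    P_x, P_y, P_z: (1, C, H, W) feature planes (surface plane and the two
                   perpendicular planes)
    x_bar:         (M, 3) UTTS coordinates (u_x, u_y, d), assumed normalized to [-1, 1]
    """
    u_x, u_y, d = x_bar[:, 0], x_bar[:, 1], x_bar[:, 2]
    coords = [
        torch.stack([u_x, u_y], dim=-1),   # surface plane queried at u
        torch.stack([u_x, d], dim=-1),     # first perpendicular plane
        torch.stack([u_y, d], dim=-1),     # second perpendicular plane
    ]
    feats = []
    for plane, uv in zip((P_x, P_y, P_z), coords):
        grid = uv.view(1, -1, 1, 2)                          # (1, M, 1, 2)
        f = F.grid_sample(plane, grid, align_corners=True)   # (1, C, M, 1)
        feats.append(f[0, :, :, 0].t())                      # (M, C)
    return torch.cat(feats, dim=-1)                          # concatenated feature F_{i,f}
```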
3.4 Supervision and Training Strategy
First, we pre-train the deformable mesh model (Equation (
8)) according to [Habermann et al.
2021]. Then, the training of our human representation proceeds in three stages: field pre-training, SDF-driven surface refinement, and field fine-tuning. Additionally, a real-time optimization approach is proposed to generate detailed and temporally coherent triangular meshes during inference. In Table
1, we illustrate the model components’ status in different stages, indicating whether they are activated or frozen during training. We refer to the supplemental materials for the implementation details of the loss terms.
Field Pre-Training. Given the initial deformed mesh, we train the SDF (Equation (
19)) and color field (Equation (
20)) using the following losses:
Here, \(\mathcal {L}_\mathrm{col}\) and \(\mathcal {L}_\mathrm{mask}\) denote an L1 color loss and a mask loss, ensuring that the rendered color for a ray matches the ground truth one and that the accumulated transmittance along the ray coincides with the ground truth masks. Moreover, the Eikonal loss
\(\mathcal {L}_\mathrm{eik}\) [Gropp et al.
2020] regularizes the network predictions for the SDF value. Last, we introduce a seam loss
\(\mathcal {L}_\mathrm{seam}\), which samples points along texture seams on the mesh. For a single point on the seam, the two corresponding UV coordinates in the 3D texture space are computed, and both are randomly displaced along the third dimension, resulting in two samples, where the loss ensures that the SDF network predicts the same value for both samples. This ensures that the SDF prediction on a texture seam is consistent. More details about the seam loss are provided in the supplemental document.
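A sketch of the seam loss under these definitions is given below; the choice of an L1 penalty and the uniform sampling of the displacement are assumptions.

```python
import torch

def seam_loss(sdf_net, seam_uv_a, seam_uv_b, d_max: float):
    """Sketch of the seam consistency loss: a point on a texture seam has two UV
    coordinates; both are displaced by the same random offset along the third
    (distance) dimension, and the SDF predictions are encouraged to agree.

    sdf_net:   assumed callable mapping (M, 3) UTTS points to (M,) SDF values
    seam_uv_a: (M, 2) first UV coordinate of each sampled seam point
    seam_uv_b: (M, 2) second UV coordinate of the same seam points
    """
    d = (torch.rand(seam_uv_a.shape[0], 1, device=seam_uv_a.device) * 2.0 - 1.0) * d_max
    sample_a = torch.cat([seam_uv_a, d], dim=-1)
    sample_b = torch.cat([seam_uv_b, d], dim=-1)
    return (sdf_net(sample_a) - sdf_net(sample_b)).abs().mean()
```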
SDF-driven Surface Refinement. Once the training of the SDF and color fields has converged, we refine the pre-trained deformable mesh model, i.e., the learnable embedded graph, to better align it with the SDF, using the following loss terms:
The SDF loss
\(\mathcal {L}_\mathrm{sdf}\) ensures that the SDF queried at the template vertex positions is equal to zero, thus dragging the mesh toward the implicit surface estimate of the network. Although this term could also backpropagate into the mapping directly, i.e., into the morphable clothed human body model, we found that network training is more stable when keeping the mapping fixed according to the initial deformed mesh.
\(\mathcal {L}_\mathrm{reg}\) denotes the Laplacian loss that penalizes the Laplacian difference between the updated posed template vertices and the posed template vertices before the surface refinement.
\(\mathcal {L}_\mathrm{zero}\) denotes a smoothing term that pushes the Laplacian of the template vertices toward zero. As flipped faces would lead to abrupt changes in the UV parameterization, we adopt a face normal consistency loss \(\mathcal {L}_\mathrm{normal}\) to prevent face flipping, which is computed from the cosine similarity of neighboring face normals. Moreover, as degenerate faces would lead to numerical errors in the UV mapping, we adopt a face stretching loss \(\mathcal {L}_\mathrm{area}\), which is computed from the deviation of the edge lengths within each face. Again, more details about the individual loss terms can be found in the supplemental document. Importantly, the better SDF-aligned template allows us to adaptively lower the maximum distance
\(d_\mathrm{max}\) for the tri-plane dimension orthogonal to the UV layout
without missing the real surface while also reducing mapping collisions.
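The central SDF alignment term can be sketched as follows, where the UTTS mapping is evaluated with respect to the initial deformed mesh and only the query points carry gradients; the absolute-value penalty is an assumption.

```python
import torch

def sdf_alignment_loss(sdf_net, utts_map_fixed, posed_vertices):
    """Sketch of the SDF term: the SDF queried at the posed, deformed template
    vertices should vanish, pulling the explicit mesh onto the implicit surface.

    sdf_net:        assumed callable mapping UTTS points (N, 3) to SDF values (N,)
    utts_map_fixed: assumed callable mapping global points to UTTS points; computed
                    w.r.t. the initial deformed mesh and kept fixed, as described in
                    the text, while gradients still flow through the query points
    posed_vertices: (N, 3) vertices of the deformable model being refined
    """
    x_bar = utts_map_fixed(posed_vertices)
    return sdf_net(x_bar).abs().mean()   # L1 penalty on the SDF values (assumed)
```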
Field Finetuning. Since the deformable mesh is now updated, the implicit field surrounding it has to be updated accordingly. Therefore, in our last stage, we once more refine the SDF and color field using the following losses
while lowering the distance
\(d_\mathrm{max}\), which effectively reduces mapping collisions. This time, we also add a patch-based perceptual loss \(\mathcal {L}_\mathrm{perc}\) [Zhang et al. 2018] and a Laplacian pyramid loss \(\mathcal {L}_\mathrm{lap}\) [Bojanowski et al. 2018]. We found that this helps improve the level of detail in appearance and geometry.
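For the patch-based perceptual term, a sketch using the publicly available LPIPS implementation of Zhang et al. [2018] could look as follows; the VGG backbone and the input range are assumptions, and the Laplacian pyramid term is omitted.

```python
import torch
import lpips  # LPIPS perceptual metric of Zhang et al. [2018]

lpips_vgg = lpips.LPIPS(net='vgg')  # backbone choice is an assumption of this sketch

def patch_perceptual_loss(pred_patches, gt_patches):
    """Perceptual loss on randomly cropped 128x128 patches; inputs are
    (B, 3, 128, 128) tensors, assumed to be scaled to [-1, 1]."""
    return lpips_vgg(pred_patches, gt_patches).mean()
```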
Real-time Mesh Optimization. At test time, we propose a real-time mesh optimization, which embosses the template mesh with fine-grained and motion-aware geometric details from the implicit field. We subdivide the original template once using edge subdivision, i.e., splitting each edge in half, to obtain a higher-resolution output. Then, we update the subdivided template mesh along the implicit field, i.e., by evaluating the SDF value at each vertex and displacing the vertex along its normal by the magnitude of the SDF value. Due to our efficient SDF evaluation leveraging the tri-planar representation, this optimization is very efficient, allowing real-time generation of high-quality and consistent clothed human geometry.
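The embossing step can be sketched as follows; using the vertex normal as a proxy for the SDF gradient direction and the sign convention (positive SDF outside, outward-pointing normals) are assumptions of this sketch.

```python
import torch

def emboss_mesh(sdf_net, utts_map, vertices, vertex_normals):
    """Sketch of the test-time mesh refinement: each (subdivided) template vertex
    is pushed along its normal by the locally evaluated SDF value, embossing the
    implicit detail onto the explicit mesh.

    vertices, vertex_normals: (N, 3); sdf_net and utts_map are assumed callables
    as in the previous sketches.
    """
    with torch.no_grad():
        s = sdf_net(utts_map(vertices))              # (N,) signed distances at the vertices
    # Move toward the zero level set; the normal approximates the SDF gradient here.
    return vertices - s[:, None] * vertex_normals
```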
Implementation Details. Our approach is implemented in PyTorch [Paszke et al. 2019] with custom CUDA kernels. Specifically, we implement the skeletal computations, the rasterization-based ray sample filter, and the mapping as custom CUDA kernels; the remaining components are implemented in PyTorch. We train our method on a single Nvidia A100 graphics card using ground truth images with a resolution of
\(1285 \times 940\). The Field Pre-Training stage is trained for 600K iterations using a learning rate of 5e-4 scheduled with a cosine decay scheduler, which takes around 2 days. Here, we set the distance
\(d_\mathrm{max}=4\,\mathrm{cm}\). We perform a random sampling of
\(4{,}096\) rays from the foreground pixels of the ground truth images. Along each of these rays, we take 64 samples for ray marching. The loss terms supervising the Field Pre-Training stage (Equation (21)) are weighted as 1.0, 0.1, 0.1, and 1.0 in the order of appearance in the equation. The SDF-driven Surface Refinement stage is trained for 200K iterations using a learning rate of 1e-5, which takes around 0.4 days. Here, the losses (Equation (
22)) are weighted with 1.0, 0.15, 0.005, 0.005, and 5.0, again in the order of appearance in the equation. Last, the Field Finetuning stage is trained for 300K iterations with a learning rate of 2e-4, decayed with a cosine decay scheduler, which takes around 1.4 days. Here, we set the distance \(d_\mathrm{max}=2\,\mathrm{cm}\).
Similar to the Field Pre-Training stage, we again randomly sample 4,096 rays from the foreground pixels and take 64 samples per ray for ray marching. Moreover, we randomly crop patches with a resolution of \(128 \times 128\) for evaluating the perceptual-related losses, i.e., \(\mathcal {L}_\mathrm{lap}\) and \(\mathcal {L}_\mathrm{perc}\). This time, the losses (Equation (
23)) are weighted with 1.0, 0.1, 0.1, 1.0, 1.0, 0.5.