
TriHuman: A Real-time and Controllable Tri-plane Representation for Detailed Human Geometry and Appearance Synthesis

Published: 09 October 2024

Abstract

Creating controllable, photorealistic, and geometrically detailed digital doubles of real humans solely from video data is a key challenge in Computer Graphics and Vision, especially when real-time performance is required. Recent methods attach a neural radiance field (NeRF) to an articulated structure, e.g., a body model or a skeleton, to map points into a pose canonical space while conditioning the NeRF on the skeletal pose. These approaches typically parameterize the neural field with a multi-layer perceptron (MLP), leading to slow runtimes. To address this drawback, we propose TriHuman, a novel human-tailored, deformable, and efficient tri-plane representation, which achieves real-time performance, state-of-the-art pose-controllable geometry synthesis, and photorealistic rendering quality. At the core, we non-rigidly warp global ray samples into our undeformed tri-plane texture space, which effectively addresses the problem of global points being mapped to the same tri-plane locations. We then show how such a tri-plane feature representation can be conditioned on the skeletal motion to account for dynamic appearance and geometry changes. Our results demonstrate a clear step toward higher quality in terms of geometry and appearance modeling of humans as well as runtime performance.

1 Introduction

Digitizing real humans and creating their virtual double is a long-standing and challenging problem in Graphics and Vision with many applications in the movie industry, gaming, telecommunication, and VR/AR. Ideally, the virtual double should be controllable, should contain highly-detailed and dynamic geometry, and respective renderings should look photoreal while computations should be real-time capable. However, so far, creating high-quality and photoreal digital characters requires a tremendous amount of work from experienced artists, takes a lot of time, and is extremely expensive. Thus, simplifying the character creation by learning it directly from multi-view video data and making it more efficient has become an active research area in recent years, especially with the advent of deep scene representations.
Recent works [Liu et al. 2021; Wang et al. 2022; Li et al. 2022] incorporate neural radiance fields (NeRFs) into the modeling of humans due to their capability of representing rich appearance details. These methods typically map points from a global space, or posed space, into a canonical space by transforming 3D points using the piece-wise rigid transform of nearby bones or surface points of a naked human body model or skeleton. The canonical point and some type of pose conditioning are fed into an MLP, parameterizing the NeRF, to obtain the per-point density and color, which is then volume rendered to obtain the final pixel color. However, most methods have to perform multiple MLP evaluations per ray, which makes real-time performance impossible.
To overcome this, we present TriHuman, which is the first real-time method for controllable character synthesis that jointly models detailed, coherent, and motion-dependent surface deformations of arbitrary types of clothing as well as photorealistic motion- and view-dependent appearance. Given a skeleton motion and camera configuration as input, our method regresses a detailed motion-dependent geometry and view- and motion-dependent appearance, while for training, it only requires multi-view video.
At the technical core, we represent human geometry and appearance as a signed distance field (SDF) and color field in global space, which can be volume-rendered into an image. To overcome the limited runtime performance of previous methods, we investigate in this work how the efficient tri-plane representation [Chan et al. 2022] can be leveraged to improve runtime performance while maintaining high quality. Importantly, the tri-plane representation typically works well for convex shapes like faces as there are only a few points in global space mapping to the same point on the tri-planes. However, humans with their clothing are articulated and deformable, which makes it more challenging to prevent tri-plane mapping collisions, i.e., distinct global points mapping to the same tri-plane locations. To overcome this, we map global points into an undeformed tri-plane texture space (UTTS) using a deformable human model [Habermann et al. 2021]. Intuitively, one of the tri-planes coincides with the 2D UV map of the deformable model, while the other two planes are perpendicular to the first one and each other. We show that this reduces the mapping collisions when projecting points onto the planes and, thus, leads to better results. Another challenge is to condition the tri-plane features on the skeletal motion in order to obtain an animatable representation. Here, we propose an efficient 2D motion texture conditioning that encodes the surface dynamics of the deformable model in conjunction with a 3D-aware convolutional architecture [Wang et al. 2023b] in order to generate tri-plane features that effectively encode the skeletal motion. Last, these features are decoded to an SDF value and a color value using a shallow MLP, and unbiased volume rendering [Wang et al. 2021] is performed to generate the final pixel color.
To evaluate our method, we found that most existing datasets contain limited skeletal pose-variation and camera views. Moreover, they lack ground truth 3D data for evaluating the accuracy of the recovered human geometry. To address this, we propose a new dataset and extend existing datasets consisting of dense multi-view captures using 120 cameras of human performances comprising significantly higher pose variations than current benchmarks. The dataset further provides skeletal pose annotations, foreground segmentations, and, most importantly, 4D ground truth reconstructions. We demonstrate state-of-the-art results on this novel and significantly more challenging benchmark compared with previous works (see Figure 1). In summary, our contributions are:
A novel controllable human avatar representation enabling highly detailed and skeletal motion-dependent geometry and appearance synthesis at real time frame rates while supporting arbitrary types of apparel.
A mapping that transforms global points into the UTTS, greatly reducing tri-plane collisions.
A skeletal motion-dependent tri-plane network architecture encoding the surface dynamics, which allows the tri-plane representation to be skeletal motion conditioned.
A new benchmark dataset of dense multi-view videos of multiple people performing various challenging motions, which improves over existing datasets in terms of scale and annotation quality.

2 Related Works

Recently, neural scene representations [Sitzmann et al. 2019; Mildenhall et al. 2021; Oechsle et al. 2021; Wang et al. 2021; Yariv et al. 2021; 2020; Niemeyer et al. 2020] have achieved great success in multifarious vision and graphics applications, including novel view synthesis [Yu et al. 2021; Hedman et al. 2021; Fridovich-Keil et al. 2022; Müller et al. 2022; Chen et al. 2022], generative modeling [Schwarz et al. 2020; Niemeyer and Geiger 2021; Chan et al. 2021], surface reconstruction [Wang et al. 2021; Oechsle et al. 2021; Yariv et al. 2021], and many more. While the above works mainly focus on static scenes, recent efforts [Tretschk et al. 2021; Park et al. 2021; Pumarola et al. 2021; Deng et al. 2020] have been devoted to extending neural scene / implicit representations for modeling dynamic scenes or articulated objects. With a special focus on dynamic human modeling, existing works can be categorized according to their space canonicalization strategy, which will be introduced in the ensuing paragraphs.
Piece-wise Rigid Mapping. Reconstructing the 3D human has attracted increasing attention in recent years. A popular line of research [Alldieck et al. 2018a; 2018b; Xiang et al. 2020] utilizes a parametric body model such as SMPL [Loper et al. 2015] to represent a human body with clothing deformations, which produces an animatable 3D model. With the emergence of neural scene representations [Sitzmann et al. 2019; Mildenhall et al. 2021], a series of works [Gafni et al. 2021; Peng et al. 2021b; Su et al. 2021; Saito et al. 2021; Bhatnagar et al. 2020] combine scene representation networks with parametric models [Loper et al. 2015; Blanz and Vetter 1999] to reconstruct dynamic humans. With a special focus on human body modeling, some methods [Weng et al. 2020; Chen et al. 2021; Noguchi et al. 2021; Wang et al. 2022; Bergman et al. 2022] transform points from a global space, or posed space, into a canonical space by mapping 3D points using piece-wise rigid transformations. For instance, Chen et al. [2021] extend NeRFs to dynamic scenes by introducing explicit pose-guided deformation with SMPL [Loper et al. 2015] to achieve a mapping from the observation space to a constant canonical space. Instead of learning rigid transformations from the full parametric model, NARF [Noguchi et al. 2021] considers only the rigid transformation of the most relevant object part for each 3D point. ENARF-GAN [Noguchi et al. 2022] further extends NARF to achieve efficient and unsupervised training from unposed image collections. To accelerate the neural volume rendering, InstantAvatar [Jiang et al. 2023] incorporates Instant-NGP [Müller et al. 2022] to learn a canonical shape and appearance, which derives a continuous deformation field via an efficient articulation module [Chen et al. 2023]. Remelli et al. [2022] model the motion-aware appearance of the clothed human as cubic volumetric primitives in texture space. As the inferred geometry from NeRF often lacks detail, ARAH [Wang et al. 2022] builds an articulated signed-distance-field (SDF) representation to better model the geometry of clothed humans, where an efficient joint root-finding algorithm is introduced for the mapping from observation space to canonical space. However, piece-wise rigid mapping has limited capability to represent complex geometry such as loose clothing.
Piece-wise Rigid and Learned Residual Deformation. Recently, an improved deformable NeRF representation [Liu et al. 2021; Peng et al. 2021a; Xu et al. 2021; Zhang et al. 2021; Gao et al. 2024; Pang et al. 2024; Li et al. 2024; Hu et al. 2024; Shao et al. 2024] has become a common paradigm for dynamic human modeling, by unwarping different poses to a shared canonical space with piece-wise rigid transformations and learned residual deformations [Tretschk et al. 2021; Park et al. 2021; Pumarola et al. 2021; Zhan et al. 2023]. For instance, Liu et al. [2021] employs an inverse skinning transformation [Lewis et al. 2000] to deform the posed space to the canonical pose space, accompanied by a predicted residual deformation for each pose. Similarly, Weng et al. [2022]; Peng et al. [2021a] propose to optimize a human representation in a canonical T-pose, relying on a motion field consisting of skeletal rigid and non-rigid deformations; Gao et al. [2024] further proposes to model the residual deformation by leveraging geometry features and relative displacement. Recently, TAVA [Li et al. 2022] incorporates a deformation model that captures non-linear pose-dependent deformations, which is anchored in an LBS formulation. InstantNVR [Geng et al. 2023] applies the multi-resolution hash encoding to the transformed point and regresses a residual to obtain the canonical space, though the model does not preserve the motion-aware surface details. While such a residual deformation can typically compensate for smaller misalignments and wrinkle deformations, we found that it typically fails to handle clothing types and deformations that significantly deviate from the underlying articulated structure.
Modeling Surface Deformation. Notably, some recent efforts [Habermann et al. 2021] have been devoted to modeling both coarse and fine dynamic deformation by introducing a parametric human representation with explicit space-time coherent mesh geometry and high-quality dynamic textures. However, they still face challenges in capturing fine-scale details due to the complexity of the optimization process involved in deforming meshes with sparse supervision. Bagautdinov et al. [2021] and Xiang et al. [2021] adopt an auto-encoder in the texture space of a deformable template mesh to model motion-aware clothing and human appearance. However, they assume the availability of registered meshes. Similarly, Xiang et al. [2023] models dynamic clothing and human appearance in the texture space of deformable template meshes. Although Xiang et al. [2023] achieves high-quality rendering, it requires additional input from multiple RGB-D cameras. Alternatively, the prevailing implicit representation methods offer a more flexible human representation. Habermann et al. [2023] propose to condition NeRF on a densely deforming template to enable the tracking of loose clothing and further refine the template deformations. However, their method requires multiple MLP evaluations per ray sample, resulting in slower computation. Additionally, the recovered surface quality is compromised since they model the scene as a density field rather than an SDF. Recently, DELIFFAS [Kwon et al. 2023] achieves real-time rendering of dynamic characters through a surface light field attached to the deformable template mesh. Nevertheless, similar to prior methods, the generated geometry is of lower quality and lacks delicate surface details.

3 Methodology

Our goal is to obtain a drivable, photorealistic, and geometrically detailed avatar of a real human in any type of clothing solely learned from multi-view RGB video. More precisely, given a skeleton motion and virtual camera view as input, we want to synthesize highly realistic renderings of the human in motion as well as the high-fidelity and deforming geometry in real time. An overview of our method is shown in Figure 2. Next, we define the problem setting (Section 3.1). Then, we describe the main challenges of space canonicalization that current methods are facing, followed by our proposed space mapping, which alleviates the inherent ambiguities (Section 3.2). Given this novel space canonicalization strategy, we show how the UTTS can be efficiently parameterized with a tri-plane representation, leading to real-time performance during rendering and geometry recovery (Section 3.3). Last, we introduce our supervision and training strategy (Section 3.4).
Fig. 1.
Fig. 1. TriHuman renders photorealistic images of the virtual human and also generates detailed and topology-consistent clothed human geometry given the skeletal motion and virtual camera view as input. Importantly, our method runs in real-time due to our efficient human representation and can be solely supervised on multi-view imagery during training.
Fig. 2.
Fig. 2. Overview. Given skeletal motion and a virtual camera view as input, our method generates highly realistic renderings of the human in the specified pose and view. Moreover, it generates high-quality geometry represented as consistent meshes. To this end, a rough motion-dependent and deforming human mesh is first regressed. Then, we render motion textures from the human mesh, which are passed through a 3D-aware convolutional architecture to generate a motion-conditioned tri-plane. Ray samples in observation space are mapped into a 3D texture cube, which is then used to sample a feature from the tri-plane. This feature is passed to a small MLP predicting color and density. Finally, volume rendering and our proposed run-time mesh optimization generate the images and geometry. Note that our method is solely supervised by multi-view imagery.

3.1 Problem Setting

Input Assumptions. We assume a segmented multi-view video of a human actor using C calibrated and synchronized RGB cameras as well as a static 3D template is given. \(\mathbf {I}_{f,c} \in \mathbb {R}^{H \times W}\) denotes frame f of camera c where W and H are the width and height of the image, respectively. We then extract the skeletal pose \(\boldsymbol {\theta }_f \in \mathbb {R}^P\) for each frame f using markerless motion capture [TheCaptury 2020]. Here, P denotes the number of degrees of freedom (DoFs). A skeletal motion from frame \(f-k\) to f is denoted as \(\boldsymbol {\theta }_{\bar{f}} \in \mathbb {R}^{kP}\) and \(\hat{\boldsymbol {\theta }}_{\bar{f}}\) is the translation normalized equivalent, i.e., by displacing the root motion such that the translation of frame f is zero. During training, our model takes the skeletal motion as input and the multi-view videos as supervision, while at inference, our method only requires a skeletal motion and a virtual camera.
Static Scene Representation. Recent progress in neural scene representation learning has shown great success in terms of geometry reconstruction [Wang et al. 2021; 2023a] and view synthesis [Mildenhall et al. 2021] of static scenes by employing neural fields. Inspired by NeuS [Wang et al. 2021], we also represent the human geometry and appearance as neural fields \(\mathcal {F}_\mathrm{sdf}\) and \(\mathcal {F}_\mathrm{col}\):
\begin{equation} \mathcal {F}_\mathrm{sdf}(p(\mathbf {x}_i); \Gamma)= s_i, \mathbf {q}_i \end{equation}
(1)
\begin{equation} \mathcal {F}_\mathrm{col}(\mathbf {q}_i, s_i, \mathbf {n}_i, p(\mathbf {d}); \Psi)= \mathbf {c}_i \end{equation}
(2)
where \(\mathbf {x}_i \in \mathbb {R}^3\) is a point along the camera ray \(r(t_i,\mathbf {o}, \mathbf {d}) = \mathbf {o} + t_i \mathbf {d}\) with origin \(\mathbf {o}\) and direction \(\mathbf {d}\). \(p(\cdot)\) is a positional encoding [Mildenhall et al. 2021] to better model and synthesize higher frequency details. The SDF field stores the SDF value \(s_i\) and a respective shape code \(\mathbf {q}_i\) for every point \(\mathbf {x}_i\) in global space. Note that the normal at point \(\mathbf {x}_i\) can be computed as \(\mathbf {n}_i =\frac{\partial s_i}{\partial \mathbf {x}_i}\). Moreover, the color field encodes the color \(\mathbf {c}_i\), and as it is conditioned on the viewing direction \(\mathbf {d}\), it can also encode view-dependent appearance changes. In practice, both fields are parameterized as multi-layer perceptrons (MLPs) with learnable weights \(\Gamma\) and \(\Psi\).
To render the color of a ray (pixel), volume rendering is performed, which accumulates the color \(\mathbf {c}_i\) and the density \(\alpha _i\) along the ray as
\begin{equation} \mathbf {c} = \sum ^{R}_{i=1} T_i \alpha _i \mathbf {c}_i, \quad T_i = \prod ^{i-1}_{j=1} (1 - \alpha _j). \end{equation}
(3)
Here, the density \(\alpha _i\) is a function of the SDF. For an unbiased SDF estimate, the conversion from SDF to density can be defined as
\begin{equation} \alpha _i = \mathrm{max} \left(\frac{\Phi (s_i) - \Phi (s_{i+1})}{\Phi (s_i)}, 0 \right) \end{equation}
(4)
\begin{equation} \Phi (s_i) = (1 + e^{-z s_i})^{-1}, \end{equation}
(5)
where z is a trainable parameter whose reciprocal approaches 0 as training converges. For a detailed derivation, we refer to the original work [Wang et al. 2021]. The scene geometry and appearance can then be solely supervised by comparing the obtained pixel color with the ground truth color, typically using an L1 loss. Importantly for us, this representation allows the modeling of fine geometric details and appearance while only requiring multi-view imagery. However, for now, this representation only allows for the modeling of static scenes and requires multiple hours of training (even for a single frame).
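For illustration, the following is a minimal PyTorch sketch of this SDF-based volume rendering (Equations (3)-(5)); the tensor shapes, the epsilon stabilizers, and the function names are our own assumptions rather than the exact implementation.

```python
import torch

def sdf_to_alpha(sdf, z):
    """Convert per-sample SDF values to opacities (Eqs. (4)-(5)).
    sdf: [R, S] SDF values for R rays with S samples each.
    z:   trainable scalar controlling the sharpness of the sigmoid."""
    phi = torch.sigmoid(z * sdf)                        # Phi(s_i) = (1 + exp(-z s_i))^-1
    alpha = ((phi[:, :-1] - phi[:, 1:]) / (phi[:, :-1] + 1e-6)).clamp(min=0.0)
    return alpha                                        # [R, S-1]

def volume_render(alpha, color):
    """Accumulate per-sample colors along each ray (Eq. (3)).
    alpha: [R, S-1], color: [R, S-1, 3]."""
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-6], dim=-1), dim=-1
    )[:, :-1]                                           # T_i = prod_{j<i} (1 - alpha_j)
    weights = trans * alpha                             # [R, S-1]
    return (weights[..., None] * color).sum(dim=1)      # [R, 3] final pixel colors
```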
Problem Setting. Instead, we want to learn a dynamic, controllable, and efficient human representation \(\mathcal {H}_\mathrm{sdf}\) and \(\mathcal {H}_\mathrm{col}\):
\begin{equation} \mathcal {H}_\mathrm{sdf}(\boldsymbol {\theta }_{\bar{f}}, p(\mathbf {x}_i); \Gamma)= s_{i,f}, \mathbf {q}_{i,f} \end{equation}
(6)
\begin{equation} \mathcal {H}_\mathrm{col}(\boldsymbol {\theta }_{\bar{f}}, \mathbf {q}_{i,f}, s_{i,f}, \mathbf {n}_{i,f}, p(\mathbf {d}); \Psi)= \mathbf {c}_{i,f}, \end{equation}
(7)
which is conditioned on the skeletal motion of the human as well. Note that the SDF, shape feature, and color are now functions of the skeletal motion, indicated by the subscript \((\cdot)_f\). Previous work [Liu et al. 2021] has shown that naively adding the motion as a function input to the field leads to blurred and unrealistic results. Many works [Liu et al. 2021; Peng et al. 2021b; 2021a] have therefore tried to transform points into a canonical 3D pose space and then query the neural field in this canonical space. This has been shown to improve quality; however, these methods typically parameterize the field in this space with an MLP, leading to slow runtimes.
Tri-planes [Chan et al. 2022] offer an efficient alternative and have been applied to generative tasks, though mostly for convex surfaces such as faces, where the mapping onto the planes introduces little ambiguity. Using them to represent the complex, articulated, and dynamic structure of humans in clothing requires additional attention since, if not handled carefully, the mapping onto the tri-plane can lead to so-called mapping collisions, where multiple points in global space map onto the same tri-plane locations. Thus, in the remainder of this section, we first introduce our UTTS, which effectively reduces these collisions (Section 3.2). Then, we explain how the tri-plane can be conditioned on the skeletal motion using an efficient encoding of surface dynamics into texture space, which is then decoded into the tri-plane features leveraging a 3D-aware convolutional architecture [Wang et al. 2023b] (Section 3.3). Last, we describe our supervision and training strategy (Section 3.4).

3.2 Undeformed Tri-Plane Texture Space

Intuitively, our idea is that one of the tri-planes, i.e., the surface plane, corresponds to the surface of a skeletal motion-conditioned deformable human mesh model, while the other two planes, i.e., the perpendicular planes, are perpendicular to the first one and to each other. Next, we define the deformable and skeletal motion-dependent surface model of the human.
Motion-dependent and Deformable Human Model. We assume a person-specific, rigged and skinned triangular mesh with N vertices \(\mathbf {M} \in \mathbb {R}^{N \times 3}\) is given and the vertex connectivity remains fixed. The triangular mesh \(\mathbf {M}\) is obtained from a 3D scanner [Treedys 2020] and down-sampled to around \(5{,}000\) vertices to strike a balance between quality and efficiency. Now, we denote the deformable and motion-dependent human model as
\begin{equation} \mathcal {V}(\boldsymbol {\theta }_{\bar{f}};\Omega) = \mathbf {V}_{\bar{f}} \end{equation}
(8)
where \(\Omega \in \mathbb {R}^{W}\) are the learnable network weights and \(\mathbf {V}_{\bar{f}} \in \mathbb {R}^{N \times 3}\) are the posed and non-rigidly deformed vertex positions. Important for us, this function has to satisfy two properties: (1) It has to be a function of skeletal motion. (2) It has to have the capability of capturing non-rigid surface deformation.
We found that the representation of Habermann et al. [2021] meets these requirements, and we, thus, leverage it for our task. In their formulation, the human geometry is first non-rigidly deformed in a canonical pose as
\begin{equation} \mathbf {Y}_{v} = \mathbf {D}_v + \sum _{k \in \mathcal {N}_{\mathrm{vn},v}} \mathbf {w}_{v,k}(R(\mathbf {A}_k)(\mathbf {M}_v - \mathbf {G}_k) + \mathbf {G}_k + \mathbf {T}_k) \end{equation}
(9)
where \(\mathbf {M}_v \in \mathbb {R}^{3}\) and \(\mathbf {Y}_v \in \mathbb {R}^{3}\) denote the undeformed and deformed template vertices in the rest pose. \(\mathcal {N}_{\mathrm{vn},v}\) denotes the set of indices of embedded graph nodes that are connected to template mesh vertex v. \(\mathbf {G}_k \in \mathbb {R}^{3}\), \(\mathbf {A}_k \in \mathbb {R}^{3}\), and \(\mathbf {T}_k \in \mathbb {R}^{3}\) indicate the rest position, rotation Euler angles, and translation of embedded graph node k. Specifically, the connectivity of the embedded graph can be obtained by simplifying the deformable template mesh \(\mathbf {M}\) with quadric edge collapse decimation in Meshlab [Cignoni et al. 2008]. \(R(\cdot)\) denotes the function that converts Euler angles to a rotation matrix. Similar to [Sorkine and Alexa 2007], we compute the weights \(\mathbf {w}_{v,k} \in \mathbb {R}\) applied to the neighboring nodes based on geodesic distances.
To model higher-frequency deformations, an additional per-vertex displacement \(\mathbf {D}_v \in \mathbb {R}^{3}\) is added. The embedded graph parameters \(\mathbf {A},\mathbf {T}\) and per-vertex displacements \(\mathbf {D}\) are further functions of the translation-normalized skeletal motion, implemented as two graph convolutional networks
\begin{align} \mathcal {F_\mathrm{eg}}(\hat{\boldsymbol {\theta }}_{\bar{f}}; \Omega _\mathrm{eg}) &= \mathbf {A}, \mathbf {T} \end{align}
(10)
\begin{align} \mathcal {F_\mathrm{delta}}(\hat{\boldsymbol {\theta }}_{\bar{f}}; \Omega _\mathrm{delta}) &= \mathbf {D} \end{align}
(11)
where the skeletal motion is encoded according to [Habermann et al. 2021]. For more details, we refer to the original work.
Finally, the deformed vertices \(\mathbf {Y}_{v}\) in the rest pose can be posed using Dual Quaternion (DQ) skinning \(\mathcal {S}\) [Kavan et al. 2007], which defines the motion-dependent deformable model
\begin{equation} \mathcal {S}\left(\boldsymbol {\theta },\mathbf {Y}\right) = \mathbf {V}_{\bar{f}} = \mathcal {V}(\boldsymbol {\theta }_{\bar{f}};\Omega). \end{equation}
(12)
Note that Equation (12) is (1) solely a function of the skeletal motion and (2) can account for non-rigid deformations by means of training the weights \(\Omega\) and, thus, this formulation satisfies our initial requirements.
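The following PyTorch sketch illustrates the embedded deformation of Equation (9), assuming the node-to-vertex assignments and geodesic weights are precomputed; the Euler angle convention and the tensor names are illustrative, and the subsequent DQ skinning of Equation (12) is omitted.

```python
import torch

def euler_to_matrix(angles):
    """Convert per-node Euler angles [K, 3] (assumed x-y-z order) to rotation matrices [K, 3, 3]."""
    x, y, z = angles.unbind(-1)
    cx, sx, cy, sy, cz, sz = x.cos(), x.sin(), y.cos(), y.sin(), z.cos(), z.sin()
    one, zero = torch.ones_like(x), torch.zeros_like(x)
    Rx = torch.stack([one, zero, zero, zero, cx, -sx, zero, sx, cx], -1).view(-1, 3, 3)
    Ry = torch.stack([cy, zero, sy, zero, one, zero, -sy, zero, cy], -1).view(-1, 3, 3)
    Rz = torch.stack([cz, -sz, zero, sz, cz, zero, zero, zero, one], -1).view(-1, 3, 3)
    return Rz @ Ry @ Rx

def embedded_deformation(M, D, G, A, T, nn_idx, nn_w):
    """Eq. (9): deform rest-pose vertices with an embedded graph plus per-vertex displacements.
    M: [N, 3] template vertices, D: [N, 3] displacements,
    G/A/T: [K, 3] node rest positions / Euler angles / translations,
    nn_idx: [N, J] indices of the J graph nodes driving each vertex,
    nn_w:   [N, J] geodesic-based weights w_{v,k}."""
    R = euler_to_matrix(A)                        # [K, 3, 3]
    Gk, Tk, Rk = G[nn_idx], T[nn_idx], R[nn_idx]  # gather per-vertex neighbors: [N, J, 3(,3)]
    local = torch.einsum('njab,njb->nja', Rk, M[:, None] - Gk) + Gk + Tk
    return D + (nn_w[..., None] * local).sum(dim=1)   # [N, 3] deformed rest-pose vertices Y
```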
Non-rigid Space Canonicalization. Next, we introduce our non-rigid space canonicalization function (see also Figure 3)
\begin{equation} \mathcal {M}(\mathcal {V}(\boldsymbol {\theta }_{\bar{f}};\Omega), \mathbf {x}) = \bar{\mathbf {x}}, \end{equation}
(13)
which takes the deformable template and a point \(\mathbf {x}\) in global space and maps it to the so-called UTTS, denoted as \(\bar{\mathbf {x}}\), as explained in the following. Given the point \(\mathbf {x}\) in global space, let \(\mathbf {p}\) be its closest point on the posed and deformed template \(\mathbf {V}_{\bar{f}}\), located on the non-degenerate triangle with vertices \(\lbrace \mathbf {v}_a, \mathbf {v}_b, \mathbf {v}_c \rbrace\); the closest point can lie on a face, an edge, or a vertex of the mesh. In the following, we discuss these three cases, where the goal is to find the 2D texture coordinate of \(\mathbf {p}\) as well as the distance between \(\mathbf {x}\) and \(\mathbf {p}\), which together define the 3D coordinate \(\bar{\mathbf {x}}\) in UTTS.
Fig. 3.
Fig. 3. Illustration of the UTTS mapping from a 3D perspective (A) and a 2D perspective (B). Each spatial sample in the observation space undergoes a non-rigid transformation into the UTTS space via non-rigid canonicalization.
(1) Face. If the closest point lies on the triangular surface, the projected position \(\mathbf {p}\) and the 2D texture coordinate \(\mathbf {u}\) is computed as
\begin{equation} \begin{aligned}\mathbf {p} &= \mathbf {x} - (\mathbf {n}_f \cdot (\mathbf {x} - \mathbf {v}_a)) \mathbf {n}_f\\ \lambda _a &= \frac{\Vert (\mathbf {v}_c - \mathbf {v}_b) \times (\mathbf {p} - \mathbf {v}_b) \Vert _{2}}{\Vert (\mathbf {v}_c - \mathbf {v}_b) \times (\mathbf {v}_a - \mathbf {v}_b) \Vert _{2}}\\ \lambda _b &= \frac{\Vert (\mathbf {v}_a - \mathbf {v}_c) \times (\mathbf {p} - \mathbf {v}_c) \Vert _{2}}{\Vert (\mathbf {v}_c - \mathbf {v}_b) \times (\mathbf {v}_a - \mathbf {v}_b) \Vert _{2}}\\ \mathbf {u} &= \lambda _a \mathbf {u}_a + \lambda _b \mathbf {u}_b + (1 - \lambda _a - \lambda _b) \mathbf {u}_c \end{aligned} \end{equation}
(14)
where \(\mathbf {n}_f\) denotes the normal of the closest face, and \(\mathbf {u}_a\), \(\mathbf {u}_b\), and \(\mathbf {u}_c\) indicate the texture coordinates of the triangle vertices.
(2) Edge. For global points \(\mathbf {x}\) mapping onto the edge \((\mathbf {v}_a, \mathbf {v}_b)\), the projected position \(\mathbf {p}\) and the 2D texture coordinate \(\mathbf {u}\) are defined as
\begin{equation} \begin{aligned}\lambda &= \frac{(\mathbf {v}_b - \mathbf {v}_a) \cdot (\mathbf {x} - \mathbf {v}_a)}{\Vert \mathbf {v}_b - \mathbf {v}_a \Vert _{2}^{2}}\\ \mathbf {p} &= \mathbf {v}_a + \lambda (\mathbf {v}_b - \mathbf {v}_a) \\ \mathbf {u} &= (1 - \lambda)\mathbf {u}_a + \lambda \mathbf {u}_b. \end{aligned} \end{equation}
(15)
(3) Vertex. If the global point \(\mathbf {x}\) maps onto a vertex \(\mathbf {v}_a\), the projected position \(\mathbf {p}\) and the 2D texture coordinate \(\mathbf {u}\) are defined as
\begin{equation} \begin{aligned}\mathbf {p} &= \mathbf {v}_a \\ \mathbf {u} &= \mathbf {u}_a. \end{aligned} \end{equation}
(16)
Given the projected position \(\mathbf {p}\), we compute the signed distance between \(\mathbf {x}\) and \(\mathbf {p}\) as
\begin{equation} d = sgn((\mathbf {x} - \mathbf {p}) \cdot \mathbf {n}_f) \times \Vert \mathbf {x} - \mathbf {p} \Vert _{2}, \end{equation}
(17)
which will be required in the following.
We can now canonicalize points from global 3D space to our UTTS space, and we denote the canonical point \((\mathbf {u},d)^T\) or \((u_x,u_y,d)^T\) simply as \(\bar{\mathbf {x}}\). Note that \(\mathbf {u}=(u_x, u_y)\) denotes the point on our surface plane, while \((u_x,d)\) and \((u_y,d)\) correspond to the points on the perpendicular planes. These coordinates can now be used to query the features on the respective tri-planes.
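To make the mapping concrete, the sketch below implements case (1) for a single sample in PyTorch; a complete implementation would additionally handle the edge and vertex cases and would run batched inside a CUDA kernel (see the implementation details in Section 3.4). Function and variable names are our own.

```python
import torch

def map_to_utts_face(x, va, vb, vc, ua, ub, uc):
    """Case (1) of the UTTS mapping (Eqs. (14) and (17)): the closest point lies inside a triangle.
    x: [3] global sample; va/vb/vc: [3] posed triangle vertices; ua/ub/uc: [2] their UV coordinates.
    Returns the canonical coordinate (u_x, u_y, d)."""
    n = torch.linalg.cross(vb - va, vc - va)
    n = n / n.norm().clamp(min=1e-8)                                # face normal
    p = x - torch.dot(n, x - va) * n                                # project x onto the triangle plane
    denom = torch.linalg.cross(vc - vb, va - vb).norm().clamp(min=1e-8)
    lam_a = torch.linalg.cross(vc - vb, p - vb).norm() / denom      # barycentric weights
    lam_b = torch.linalg.cross(va - vc, p - vc).norm() / denom
    u = lam_a * ua + lam_b * ub + (1.0 - lam_a - lam_b) * uc        # interpolated texture coordinate
    d = torch.sign(torch.dot(x - p, n)) * (x - p).norm()            # signed distance to the surface
    return torch.stack([u[0], u[1], d])
```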
Concerning mapping collisions, we highlight that case (1), where a point maps onto a triangle, is a bijection and, thus, the concatenated tri-plane features are unique, which was our goal. Only in cases (2) and (3) can the aforementioned collisions happen since the 2D texture coordinate on the surface is no longer unique for points with the same distance to a point on a mesh edge or to a vertex. However, how often these cases occur highly depends on how far away from the deformable surface samples are taken. By constraining the maximum distance to \(d_\mathrm{max}\), which effectively means we only draw samples close to the deformable surface, we found that cases (2) and (3) happen less frequently. However, when the deformable model is not well aligned, this introduces an error by design, as surface points are not sampled in regions covered by the human. Therefore, we gradually deform the surface along the SDF field to account for such cases and iteratively reduce \(d_\mathrm{max}\). In the limit, this strategy reduces the mapping collisions, improves sampling efficiency, and ensures that the sampled points do not miss the real surface. More details are provided in Section 3.4 and our supplemental material.

3.3 Efficient and Motion-Dependent Tri-Plane Encoding

So far, we are able to map points in the global space to our UTTS space; however, as mentioned earlier, we want to ensure that the tri-planes contain skeletal motion-aware features. Thus, we propose a 3D-aware convolutional motion encoder:
\begin{equation} \mathcal {E}(\mathbf {T}_{\mathrm{p},f}, \mathbf {T}_{\mathrm{v},f}, \mathbf {T}_{\mathrm{a},f}, \mathbf {T}_{\mathrm{u},f}, \mathbf {T}_{\mathrm{n},f}, \mathbf {g}_f; \Phi) = \mathbf {P}_{x,f},\mathbf {P}_{y,f},\mathbf {P}_{z,f}, \end{equation}
(18)
which takes several 2D textures as input, encoding the position \(\mathbf {T}_{\mathrm{p},f}\), velocity \(\mathbf {T}_{\mathrm{v},f}\), acceleration \(\mathbf {T}_{\mathrm{a},f}\), texture coordinate \(\mathbf {T}_{\mathrm{u},f}\), and normal \(\mathbf {T}_{\mathrm{n},f}\) of the deforming human mesh surface, which we root normalize, i.e., we subtract the skeletal root translation from the mesh vertex positions (Equation (12)) and scale them to a range of \([-1, 1]\). Note that the individual texel values for \(\mathbf {T}_{\mathrm{v},f}\), \(\mathbf {T}_{\mathrm{a},f}\), \(\mathbf {T}_{\mathrm{u},f}\) and \(\mathbf {T}_{\mathrm{n},f}\) can be simply computed using inverse texture mapping. The first 3 textures, i.e., position \(\mathbf {T}_{\mathrm{p},f}\), velocity \(\mathbf {T}_{\mathrm{v},f}\), and acceleration \(\mathbf {T}_{\mathrm{a},f}\), encode the dynamics of the deforming surface, which can be computed on the fly from the skinned template of the current and previous frames. The texture coordinate map \(\mathbf {T}_{\mathrm{u},f}\) encodes a unique ID for each texel covered by a triangle in the UV-atlas. The normal textures \(\mathbf {T}_{\mathrm{n},f}\) are adopted to emphasize the surface orientation. All textures have a resolution of \(256 \times 256\). Here, \(\mathbf {g}_f\) is a global motion code (GMC), which is obtained by encoding the translation normalized motion vector \(\hat{\boldsymbol {\theta }}_{\bar{f}}\) through a shallow MLP. Notably, the GMC provides awareness of global skeletal motion and, thus, is able to encode global shape and appearance effects, which may be hard to encode through the above texture inputs.
Given the motion textures and the GMC, we first adopt three separate convolutional layers to generate the coarse initial features for each plane of the tri-plane. Inspired by the design of Wang et al. [2023b], we adopt a 5-layer UNet with roll-out convolutions to fuse the features from the different planes, which enhances the spatial consistency in the feature space. Moreover, we concatenate the GMC channel-wise to the bottleneck feature maps to provide an awareness of the global skeletal motion. Please refer to the supplemental materials for more details regarding the network architectures of the 3D-aware convolutional motion encoder and the global motion encoder.
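As a rough illustration of the interface of the motion encoder in Equation (18), the simplified sketch below maps the stacked motion textures and the GMC to three feature planes; it uses plain 2D convolutions and a single fusion block as a stand-in for the 5-layer roll-out UNet of Wang et al. [2023b], and all channel counts are assumptions.

```python
import torch
import torch.nn as nn

class MotionTriplaneEncoder(nn.Module):
    """Simplified stand-in for the motion encoder of Eq. (18): stacked 2D motion textures
    plus a global motion code (GMC) are mapped to three motion-dependent feature planes."""

    def __init__(self, tex_channels=3 + 3 + 3 + 2 + 3, gmc_dim=64, feat=32):
        super().__init__()
        # one stem per plane producing the coarse/initial plane features
        self.stems = nn.ModuleList(
            [nn.Conv2d(tex_channels, feat, 3, padding=1) for _ in range(3)]
        )
        # shared fusion block across planes with the GMC injected channel-wise
        self.fuse = nn.Sequential(
            nn.Conv2d(3 * feat + gmc_dim, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 3 * feat, 3, padding=1),
        )

    def forward(self, textures, gmc):
        # textures: [B, 14, 256, 256] = position, velocity, acceleration, uv, and normal maps
        # gmc:      [B, gmc_dim] global motion code
        planes = [stem(textures) for stem in self.stems]     # 3 x [B, feat, 256, 256]
        x = torch.cat(planes, dim=1)
        g = gmc[:, :, None, None].expand(-1, -1, x.shape[2], x.shape[3])
        x = x + self.fuse(torch.cat([x, g], dim=1))          # residual fusion across planes
        return torch.chunk(x, 3, dim=1)                      # P_x, P_y, P_z
```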
Finally, our motion encoder outputs three orthogonal skeletal motion-dependent tri-planes \(\mathbf {P}_{x,f}\), \(\mathbf {P}_{y,f}\), and \(\mathbf {P}_{z,f}\). The tri-plane feature for a sample \(\bar{\mathbf {x}}_i\) in UTTS space can be obtained by querying the planes \(\mathbf {P}_{x,f},\mathbf {P}_{y,f},\mathbf {P}_{z,f}\) at \(\mathbf {u} = (u_x, u_y)\), \((u_x,d)\), and \((u_y,d)\), respectively, thanks to our UTTS mapping. The final motion-dependent tri-plane feature \(\mathbf {F}_{i,f}\) for a sample \(\bar{\mathbf {x}}_i\) can then be obtained by concatenating the three individual features from each plane. Finally, our initial human representation in Equations (6) and (7) can be re-defined with the proposed efficient motion-dependent tri-plane as
\begin{equation} \mathcal {H}_\mathrm{sdf}(\mathbf {F}_{i,f}, \mathbf {g}_f, p(\bar{\mathbf {x}}_i); \Gamma) = s_{i,f}, \mathbf {q}_{i,f} \end{equation}
(19)
\begin{equation} \mathcal {H}_\mathrm{col}(\mathbf {q}_{i,f}, s_{i,f}, \mathbf {n}_{i,f}, p(\mathbf {d}), \mathbf {t}_f; \Psi) = \mathbf {c}_{i,f}. \end{equation}
(20)
Here, \(\mathbf {t}_f\) is the global position of the character, accounting for the fact that appearance can change depending on the global position in space due to non-uniform lighting conditions. In practice, the above functions are parameterized by two shallow (4-layer) MLPs with a width of 256, since most of the capacity is in the tri-plane motion encoder whose evaluation time is independent of the number of samples along a ray. Thus, evaluating a single sample i can be efficiently performed, leading to real-time performance.
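The following sketch shows how a UTTS sample could be decoded into its concatenated tri-plane feature with PyTorch's grid_sample, assuming the UTTS coordinates have been normalized to [-1, 1] and the plane axes follow the sampling grid's (x, y) convention; names are illustrative.

```python
import torch
import torch.nn.functional as F

def sample_triplane_feature(P_x, P_y, P_z, utts_coords):
    """Query the three motion-dependent planes for UTTS samples (u_x, u_y, d).
    P_*: [B, C, H, W] feature planes; utts_coords: [B, R, 3] normalized to [-1, 1].
    Returns the concatenated feature [B, R, 3C] fed to the shallow SDF/color MLPs."""
    u_x, u_y, d = utts_coords.unbind(-1)

    def grid(a, b):                                    # build a [B, R, 1, 2] sampling grid
        return torch.stack([a, b], dim=-1).unsqueeze(2)

    feats = [
        F.grid_sample(P_x, grid(u_x, u_y), align_corners=True),   # surface plane
        F.grid_sample(P_y, grid(u_x, d), align_corners=True),     # perpendicular planes
        F.grid_sample(P_z, grid(u_y, d), align_corners=True),
    ]
    # each is [B, C, R, 1] -> concatenate to [B, R, 3C]
    return torch.cat([f.squeeze(-1).permute(0, 2, 1) for f in feats], dim=-1)
```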

3.4 Supervision and Training Strategy

First, we pre-train the deformable mesh model (Equation (8)) according to [Habermann et al. 2021]. Then, the training of our human representation proceeds in three stages: field pre-training, SDF-driven surface refinement, and field fine-tuning. Additionally, a real-time optimization approach is proposed to generate detailed and temporally coherent triangular meshes during inference. In Table 1, we illustrate the model components’ status in different stages, indicating whether they are activated or frozen during training. We refer to the supplemental materials for the implementation details of the loss terms.
Table 1.
Components | Embed. Def. \(\mathcal {V}(\cdot)\) | Triplane Gen. \(\mathcal {E}(\cdot)\) | SDF/Color MLP \(\mathcal {H}_\mathrm{sdf},\mathcal {H}_\mathrm{col}\)
Field Pre-train. | frozen | trained | trained
Surface Ref. | trained | frozen | frozen
Field Finetune | frozen | trained | trained
Real-time Opt. | frozen | frozen | frozen
Table 1. Training Status for Each Component
We provide the status for each component in the different training stages. The status of each component is either trained, indicating that the weights will be updated, or frozen, denoting that the weights are fixed and will not be optimized.
Field Pre-Training. Given the initial deformed mesh, we train the SDF (Equation (19)) and color field (Equation (20)) using the following losses:
\begin{equation} \mathcal {L}_\mathrm{col} + \mathcal {L}_\mathrm{mask} + \mathcal {L}_\mathrm{eik} + \mathcal {L}_\mathrm{seam}. \end{equation}
(21)
Here, the \(\mathcal {L}_\mathrm{col}\) and \(\mathcal {L}_\mathrm{mask}\) denote an L1 color and mask loss, ensuring that the rendered color for a ray matches the ground truth one, and that accumulated transmittance along the ray coincides with the ground truth masks. Moreover, the Eikonal loss \(\mathcal {L}_\mathrm{eik}\) [Gropp et al. 2020] regularizes the network predictions for the SDF value. Last, we introduce a seam loss \(\mathcal {L}_\mathrm{seam}\), which samples points along texture seams on the mesh. For a single point on the seam, the two corresponding UV coordinates in the 3D texture space are computed, and both are randomly displaced along the third dimension, resulting in two samples, where the loss ensures that the SDF network predicts the same value for both samples. This ensures that the SDF prediction on a texture seam is consistent. More details about the seam loss are provided in the supplemental document.
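A minimal sketch of the seam loss is given below, assuming the paired UV coordinates of seam points have been precomputed and h_sdf denotes the SDF-value branch of Equation (19); all names are illustrative.

```python
import torch

def seam_loss(h_sdf, uv_a, uv_b, d_max):
    """Seam consistency: a point on a texture seam has two UV coordinates (uv_a, uv_b).
    Both are displaced by the same random offset along the third UTTS dimension, and the
    SDF field h_sdf must predict the same value for both resulting samples.
    uv_a, uv_b: [S, 2] paired seam coordinates; d_max: sampling range along d."""
    d = (torch.rand(uv_a.shape[0], 1, device=uv_a.device) * 2.0 - 1.0) * d_max
    x_a = torch.cat([uv_a, d], dim=-1)   # [S, 3] UTTS samples on one side of the seam
    x_b = torch.cat([uv_b, d], dim=-1)   # [S, 3] the same points parameterized on the other side
    return (h_sdf(x_a) - h_sdf(x_b)).abs().mean()
```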
SDF-driven Surface Refinement. Once the SDF and color field training are converged, we then refine the pre-trained deformable mesh model, i.e., the learnable embedded graph, to better align with the SDF, using the following loss terms:
\begin{equation} \mathcal {L}_\mathrm{sdf} + \mathcal {L}_\mathrm{reg} + \mathcal {L}_\mathrm{zero} + \mathcal {L}_\mathrm{normal}+\mathcal {L}_\mathrm{area}. \end{equation}
(22)
The SDF loss \(\mathcal {L}_\mathrm{sdf}\) ensures that the SDF queried at the template vertex positions is equal to zero, thus dragging the mesh toward the implicit surface estimate of the network. Though this term could also backpropagate into the mapping directly, i.e., into the morphable clothed human body model, we found network training is more stable when keeping the mapping fixed according to the initial deformed mesh. \(\mathcal {L}_\mathrm{reg}\) denotes the Laplacian loss that penalizes the difference between the Laplacian of the updated posed template vertices and that of the posed template vertices before the surface refinement. \(\mathcal {L}_\mathrm{zero}\) denotes a smoothing term that pushes the Laplacian of the template vertices toward zero. As face flipping would lead to abrupt changes in the UV parameterization, we adopt a face normal consistency loss \(\mathcal {L}_\mathrm{normal}\), computed from the cosine similarity of neighboring face normals, to prevent it. Moreover, as degenerate faces would lead to numerical errors in the UV mapping, we adopt a face stretching loss \(\mathcal {L}_\mathrm{area}\), computed from the deviation of the edge lengths within each face. Again, more details about the individual loss terms can be found in the supplemental document. Importantly, the better SDF-aligned template allows us to adaptively lower the maximum distance \(d_\mathrm{max}\) for the tri-plane dimension orthogonal to the UV layout without missing the real surface, while also reducing mapping collisions.
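For illustration, the sketch below combines the SDF loss and the two Laplacian terms, assuming a precomputed sparse uniform Laplacian matrix of the template and a frozen SDF field that returns one value per vertex; the normal consistency and face stretching terms are omitted, and the weights follow the implementation details given later in this section.

```python
import torch

def refinement_losses(h_sdf, verts, verts_before, L):
    """Sketch of the main surface-refinement terms of Eq. (22).
    verts:        [N, 3] current posed template vertices (optimized),
    verts_before: [N, 3] posed vertices before refinement (constant),
    L:            [N, N] sparse uniform Laplacian matrix of the template."""
    loss_sdf = h_sdf(verts).abs().mean()                      # drag vertices onto the zero level set
    lap = torch.sparse.mm(L, verts)
    lap_before = torch.sparse.mm(L, verts_before)
    loss_reg = (lap - lap_before).norm(dim=-1).mean()         # keep local shape close to the initial one
    loss_zero = lap.norm(dim=-1).mean()                       # push the Laplacian toward zero (smoothness)
    return loss_sdf + 0.15 * loss_reg + 0.005 * loss_zero
```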
Field Finetuning. Since the deformable mesh is now updated, the implicit field surrounding it has to be updated accordingly. Therefore, in our last stage, we once more refine the SDF and color field using the following losses
\begin{equation} \mathcal {L}_\mathrm{col} + \mathcal {L}_\mathrm{mask} + \mathcal {L}_\mathrm{eik} + \mathcal {L}_\mathrm{seam} + \mathcal {L}_\mathrm{lap} + \mathcal {L}_\mathrm{perc}. \end{equation}
(23)
while lowering the distance \(d_\mathrm{max}\), which effectively reduces mapping collisions. This time, we also add a patch-based perceptual loss \(\mathcal {L}_\mathrm{perc}\) [Zhang et al. 2018] and a Laplacian pyramid loss \(\mathcal {L}_\mathrm{lap}\) [Bojanowski et al. 2018]. We found that this helps to improve the level of detail in appearance and geometry.
Real-time Mesh Optimization. At test time, we propose a real-time mesh optimization, which embosses the template mesh with fine-grained and motion-aware geometry details from the implicit field. We subdivide the original template once using edge subdivision, i.e., cutting the edges into half, to obtain a higher resolution output. Then, we update the subdivided template mesh along the implicit field, i.e., by evaluating the SDF value and displacing the vertex along the normal by the magnitude of the SDF value. Due to our efficient SDF evaluation leveraging our tri-planar representation, this optimization is very efficient, allowing real-time generation of high-quality and consistent clothed human geometry.
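A sketch of this test-time embossing step is shown below; it assumes the SDF is positive outside the surface and that h_sdf returns one SDF value per vertex, and the optional iteration count is our own addition.

```python
import torch

def emboss_mesh(h_sdf, verts, vert_normals, steps=1):
    """Displace each (subdivided) template vertex along its normal by the queried SDF value
    so the mesh snaps to the implicit surface.
    verts, vert_normals: [N, 3]; h_sdf(verts) returns per-vertex SDF values [N]."""
    for _ in range(steps):
        s = h_sdf(verts)                            # signed distance at the current vertices
        verts = verts - s[:, None] * vert_normals   # move toward the zero level set
    return verts
```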
Implementation Details. Our approach is implemented in PyTorch [Paszke et al. 2019] with custom CUDA kernels. Specifically, we implement the skeletal computation, the rasterization-based ray sample filter, and the mapping as custom CUDA kernels; the remaining components are implemented in PyTorch. We train our method on a single Nvidia A100 graphics card using ground truth images with a resolution of \(1285 \times 940\). The Field Pre-Training stage is trained for 600K iterations using a learning rate of 5e-4 scheduled with a cosine decay scheduler, which takes around 2 days. Here, we set the maximum distance \(d_\mathrm{max}=4\,cm\). We randomly sample \(4{,}096\) rays from the foreground pixels of the ground truth images and take 64 samples along each ray for ray marching. The loss terms supervising the Field Pre-Training stage (Equation (21)) are weighted as 1.0, 0.1, 0.1, and 1.0 in their order of appearance in the equation. The SDF-driven Surface Refinement stage is trained for 200K iterations using a learning rate of 1e-5, which takes around 0.4 days. Here, the losses (Equation (22)) are weighted with 1.0, 0.15, 0.005, 0.005, and 5.0, again in their order of appearance. Last, the Field Finetuning stage is trained for 300K iterations with a learning rate of 2e-4, decayed with a cosine decay scheduler, which takes around 1.4 days. Here, we set \(d_\mathrm{max}=2\,cm\).
Similar to the Field Pre-Training stage, we again randomly sample 4,096 rays from the foreground pixels and take 64 samples per ray for ray marching. Moreover, we randomly crop patches with a resolution of \(128 \times 128\) for evaluating the perceptual losses, i.e., \(\mathcal {L}_\mathrm{lap}\) and \(\mathcal {L}_\mathrm{perc}\). This time, the losses (Equation (23)) are weighted with 1.0, 0.1, 0.1, 1.0, 1.0, and 0.5.

4 Dataset

Our new dataset comprises three subjects wearing different types of apparel. For each subject, we recorded separate training and testing sequences in which the person performs various motions such as boxing, jumping, or jogging. We assume there is no other person or object in the capture volume during recording. Furthermore, hand-cloth interaction is also avoided throughout the recording process. All of the sequences are recorded with a multi-camera system consisting of 120 synchronized and calibrated cameras at a frame rate of 25 fps. The training sequences typically contain \(30{,}000\) frames, and the testing sequences around \(7{,}000\) frames. Notably, to assess the model's generalization ability to novel camera views, we hold out four cameras for testing, following the protocol of the DynaCap dataset [Habermann et al. 2021].
For all frames, we provide skeletal pose tracking using markerless motion capture [TheCaptury 2020], foreground segmentations using background matting [Sengupta et al. 2020], and ground truth 3D geometry, which we obtained with the state-of-the-art implicit surface reconstruction method [Wang et al. 2023a]. Note that we use the ground truth geometry solely to evaluate our method.
Moreover, we reprocessed three subjects from the DynaCap [Habermann et al. 2021] dataset, which is publicly available. More specifically, we improve the foreground segmentations and also provide ground truth geometry for each frame.
To the best of our knowledge, there is no other dataset available that has similar specifications, i.e., very long sequences for individual subjects in conjunction with 3D ground truth meshes. Compared with the datasets [Peng et al. 2021b; Alldieck et al. 2018b] adopted by previous works, our training and testing sequences for individual subjects are two orders of magnitude longer and contain a significantly wider variety of poses. We will further elaborate on how the greater variety of training poses improves the model’s generalization ability to unseen poses in the supplemental materials. Thus, we believe this dataset can further stimulate research in this important direction.

5 Experiments

We first provide qualitative results of our approach concerning geometry synthesis and appearance synthesis (Section 5.1). Then, we compare our method with prior works focusing on the same task (Section 5.2). Last, we ablate our major design choices, both, quantitatively and qualitatively (Section 5.3).

5.1 Qualitative Results

For a qualitative evaluation of our method, we selected six subjects wearing different types of apparel, ranging from loose clothing such as dresses to tighter clothing such as short pants and T-shirts. Three of the subjects are from our newly acquired dataset, and the remaining three subjects are from the DynaCap dataset [Habermann et al. 2021].
Geometry Synthesis. Figure 4 presents the reconstructed geometry generated from training and testing motions. For subjects wearing various types of apparel, our model allows for high-fidelity geometry reconstruction, including large-scale clothing dynamics and wrinkle details. Note that the wrinkle details are dynamically changing as a function of the skeletal motion, which can be best observed in our supplemental video. Importantly, our model excels in generating detailed geometry in real time and yields consistent performance for both training poses and novel poses. Moreover, our reconstructed geometry is represented as a consistent triangular mesh, making it well-suited for applications such as consistent texture editing (Section 6.2).
Fig. 4.
Fig. 4. Qualitative geometry results. We show the geometry synthesis results of our method for training and novel skeletal motions. Note that in both cases, our method generates high-fidelity geometry in real-time. This can be especially observed in the clothing areas where dynamic wrinkles are forming as a function of the skeletal motion.
Image Synthesis. Additionally, we show the qualitative results of our method for image synthesis in Figure 5. Our model yields highly photorealistic renderings of the entire human in real time for both novel views and novel poses, which significantly deviate from the ones seen during training. Notably, view-dependent appearance effects, delicate clothing wrinkles, and the dynamics of loose clothing are also synthesized realistically. Again, we refer to the supplemental video for more results.
Fig. 5.
Fig. 5. Qualitative image synthesis results. We show the results of our method in terms of image synthesis. Note that the novel view synthesis results are rendered from the camera views that are excluded during the training phase. The novel pose synthesis results are generated with skeletal motions sampled from the testing sequences. Our method achieves photorealistic renderings of virtual humans in real time. Please note how the appearance dynamically changes, given different views and poses.
These results demonstrate the versatility and capability of our approach in terms of high-quality geometry recovery and synthesis as well as photorealistic appearance modeling, enabling novel view and pose synthesis.

5.2 Comparisons

Competing Methods. We compare our method with two types of previous approaches, including (1) NA [Liu et al. 2021] and TAVA [Li et al. 2022], which adopt a piece-wise rigid mapping with learned residual deformations; (2) HDHumans [Habermann et al. 2023] and DDC [Habermann et al. 2021], which model motion-aware surface deformation. Note that only our method and DDC support real-time inference, while other approaches require multiple seconds per frame. The comparisons are conducted on two subjects from the DynaCap dataset [Habermann et al. 2021], one wearing a loose type of apparel, referred to as Loose Clothing, and the other one wearing a tight type of apparel, referred to as Tight Clothing.
Metrics. In the following, we explain the individual metrics for quantitative comparisons:
For assessing the geometry quality, we provide measurements of the Chamfer distance, which computes the discrepancy between the pseudo ground truth obtained using an implicit surface reconstruction method [Wang et al. 2023a] and the reconstructed shape results. A lower Chamfer distance means a closer alignment between two shapes, indicating a higher quality reconstruction. We average the per-frame Chamfer distance over every 10th frame.
To evaluate the quality of image synthesis, we employ the widely used Peak Signal-to-Noise Ratio (PSNR) metric. However, PSNR alone only captures the low-level error between images and has severe limitations when it comes to assessing the perceptual quality of images. Thus, PSNR may not accurately reflect the quality as perceived by the human eye. Consequently, we report the learned perceptual image patch similarity (LPIPS) metric [Zhang et al. 2018], which better reflects human perception. We follow the test split from the DynaCap [Habermann et al. 2021] dataset having 4 test camera views. Similar to the geometry quality, the metrics for image synthesis quality are averaged over every 10th frame and over all of the test cameras.
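For reference, minimal PyTorch implementations of the PSNR and (brute-force) Chamfer metrics are sketched below; LPIPS relies on the learned network of Zhang et al. [2018] and is therefore not reproduced here. Names and shapes are our own assumptions.

```python
import torch

def psnr(pred, gt, max_val=1.0):
    """Peak Signal-to-Noise Ratio between two images in [0, max_val], shape [H, W, 3]."""
    mse = torch.mean((pred - gt) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two point sets a: [N, 3] and b: [M, 3].
    A brute-force reference; practical evaluations typically use a KD-tree or batched GPU kernels."""
    d = torch.cdist(a, b)                       # [N, M] pairwise Euclidean distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()
```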
Geometry. In Tables 2 and 3, we conduct a quantitative evaluation of our method and competing approaches to assess their performance in terms of geometry synthesis for training and testing motions. For NA and TAVA, we employed Marching Cubes to extract per-frame reconstructions from the learned NeRF representation. However, these recovered geometries exhibit a significant amount of noise due to the lack of geometry regularization and piece-wise rigid modeling during learning. As a result, these methods demonstrate inferior performance compared with our approach, both, visually and quantitatively. Compared with NA and TAVA, DDC yields better performance as it models the space-time coherent template deformation. However, DDC relies solely on image-based supervision to learn the deformations, which only yields fixed wrinkles derived from the base template and struggles to track the dynamic wrinkle patterns. In contrast, HDHumans [Habermann et al. 2023] outperforms DDC in terms of the overall surface quality with the inclusion of the NeRF, while it falls short in real-time reconstruction.
Table 2.
Table 2. Quantitative View Synthesis Comparison
Table 3.
Table 3. Quantitative Pose Synthesis Comparison
In addition, we qualitatively compare the generated geometry of our approach with other works, as shown in Figure 7. Note that our method achieves the highest geometric details among all methods while also achieving real-time performance. This is consistent for, both, training and novel skeletal motions. We refer to the supplemental video to better see the dynamic deformations, which our method is able to recover.
Fig. 6.
Fig. 6. Qualitative image synthesis comparison. Here, we qualitatively evaluate the image synthesis quality of our method and others [Liu et al. 2021; Li et al. 2022; Habermann et al. 2023; 2021]. Note that the visual quality of our method is better or comparable to current offline approaches [Liu et al. 2021; Li et al. 2022; Habermann et al. 2023] while showing superior quality compared with other real-time methods [Habermann et al. 2021].
Fig. 7.
Fig. 7. Qualitative geometry comparison. Here, we qualitatively compare the generated geometry with other works [Liu et al. 2021; Li et al. 2022; Habermann et al. 2023; 2021]. Each row of the generated geometry is followed by its corresponding error map. Note that our method achieves the highest geometric details among all methods while also achieving real-time performance. This is consistent for, both, training and novel skeletal motions.
Novel View Synthesis. We quantitatively evaluate the novel view synthesis quality of the different approaches in Table 2. Among the real-time methods, our approach outperforms the competing method DDC [Habermann et al. 2021] by a substantial margin in terms of PSNR and LPIPS. The difference in PSNR is less pronounced, as this metric is less sensitive to blurry results and does not faithfully reflect the realism perceived by humans. In the comparison with non-real-time methods, which is biased in their favor, our method still clearly outperforms previous works in terms of PSNR, further verifying the effectiveness of our approach. The LPIPS score of our approach is inferior to HDHumans [Habermann et al. 2023]. We speculate that their density-based formulation might help to achieve slightly better image quality compared with the SDF-based representation that we use. Moreover, they have a significantly higher computational budget, which should also be considered here, as their method takes multiple seconds per frame while we achieve real-time performance. In summary, Table 2 quantitatively confirms our method's strong view synthesis performance. Even though the comparison is biased toward non-real-time methods like NA, TAVA, and HDHumans, the overall superiority of our approach, including its PSNR performance, reinforces the validity of our method.
We also qualitatively compare our approach with previous works in terms of novel view synthesis. As shown in Figure 6, the visual quality of our method is better than or on par with current offline approaches, including NA, TAVA, and HDHumans, while showing superior quality compared with the real-time method DDC. Specifically, the view synthesis results of TAVA are very blurry and contain obvious visual artifacts, as this method inherently struggles to handle more challenging datasets like ours and the DynaCap dataset. NA shows reasonable performance on subjects wearing tight clothing. However, for loose clothing, it becomes obvious that their method cannot correctly handle the skirt region since the residual deformation network fails to account for it. The results of HDHumans are less blurry compared with the aforementioned methods, however, at the cost of real-time performance. DDC can capture medium-frequency wrinkles well but lacks finer details. In contrast, our method is able to achieve high-fidelity synthesis with sharper details in real time.
Novel Pose Synthesis. The same tendency can be observed when comparing with other works in terms of novel pose synthesis, as shown in Table 3 and Figure 6. Again, our method achieves the best perceptual results due to the high-quality synthesis of our approach. In terms of PSNR, our proposed method achieves the best results among both real-time and non-real-time methods. Habermann et al. [2023] still achieve the best LPIPS score, while their method is limited to non-real-time synthesis. While DDC is a real-time method, our approach clearly outperforms DDC in terms of synthesis quality.

5.3 Ablation Studies

We quantitatively ablate our design choices on the novel view synthesis task in Table 4. A qualitative ablation study is also performed for novel views in terms of image and geometry, as shown in Figure 8. For an ablation on the novel pose synthesis, we refer to the supplemental material.
Table 4.
Training Poses (Loose Clothing)
Methods | PSNR \(\uparrow\) | LPIPS \(\downarrow\) | Cham. \(\downarrow\)
w/skin. mesh | 27.82 | 30.36 | 3.768
w/o map opt. | 30.21 | 29.20 | 1.714
w/can. tri-plane | 30.54 | 23.89 | 1.807
w/MLP | 30.09 | 25.47 | 2.008
2D Feat + D | 31.12 | 23.92 | 1.521
w/o GMC SDF | 31.57 | 16.71 | 1.532
w/o GMC | 31.16 | 17.19 | 1.595
Ours | 31.68 | 16.14 | 1.488
Table 4. Ablation Study
We quantitatively evaluate our design choices for the novel view synthesis and geometry generation on a subject wearing a loose type of apparel. Note that our final design achieves the best quantitative results in all metrics.
Fig. 8.
Fig. 8. Ablation study. We qualitatively evaluate our individual design choices for novel views in terms of image and geometry synthesis. Note that each row of the generated geometry is followed by its corresponding error map. Our proposed method consistently outperforms the baselines.
In the following, we first compare our design choices of using non-rigid space canonicalization and our proposed UTTS space to alternative baselines.
Skinning-based Deformation Only. As shown in Table 4, applying pure skinning-based deformation to the template mesh, i.e., setting \(\mathbf {Y}_v=\mathbf {M}_v\) in Equation (9) without a non-rigid residual (i.e., w/skin. mesh), leads to a significant performance degradation in terms of synthesis quality (evaluated by PSNR and LPIPS) and geometry quality (evaluated by the Chamfer distance). The qualitative results in Figure 8 also confirm the performance drop with pure skinning-based deformation. The reason is that the mapping into UTTS becomes less accurate, since skinning alone cannot account for non-rigidly deforming cloth areas, leading to mapping collisions and wrongly mapped points. This confirms that our design choice of accounting for non-rigid deformations within the mapping procedure via a deformable model, which is also gradually refined throughout training, is superior to piecewise rigid, i.e., skinning-only, transformations.
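For reference, the following minimal PyTorch sketch shows one standard instantiation of such a piecewise rigid, skinning-only deformation (linear blend skinning). The tensor names are illustrative and do not follow the paper's exact notation or skinning variant.

```python
import torch


def linear_blend_skinning(verts, weights, bone_transforms):
    """Pose template vertices using skinning only, i.e., without a non-rigid residual.

    verts:           (V, 3)    template vertices in the rest pose
    weights:         (V, B)    per-vertex skinning weights (rows sum to 1)
    bone_transforms: (B, 4, 4) rigid bone transforms of the current pose
    Returns posed vertices of shape (V, 3).
    """
    verts_h = torch.cat([verts, torch.ones(verts.shape[0], 1)], dim=-1)   # (V, 4)
    blended = torch.einsum("vb,bij->vij", weights, bone_transforms)       # (V, 4, 4)
    posed_h = torch.einsum("vij,vj->vi", blended, verts_h)                # (V, 4)
    return posed_h[:, :3]
```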
SDF-driven Surface Refinement. Next, we evaluate the impact of our second training phase, where we update the learnable parameters of the deformable human model to better fit the SDF (see SDF-driven Surface Refinement in Section 3.4). As evident from the ablation study, our SDF-driven surface refinement is beneficial to both synthesis and geometry quality. As mentioned earlier, the better the deformable model approximates the true surface, the smaller the distance \(d_\mathrm{max}\) (see the discussion at the end of Section 3.2) can become, reducing the cases of points mapping onto edges and vertices. Discarding the SDF-driven surface refinement (i.e., w/o map opt.) results in a performance drop, as shown in Table 4 and Figure 8.
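A minimal sketch of one way such a refinement objective can be written is shown below, pulling the deformable template toward the zero level set of the learned SDF. The function names, the simple closeness regularizer, and the weighting are illustrative assumptions rather than the exact losses of Section 3.4.

```python
import torch


def surface_refinement_loss(sdf_net, deformed_verts, ref_verts=None, lambda_reg=1.0):
    """Encourage the deformable template to lie on the zero level set of the SDF.

    sdf_net:        callable mapping (N, 3) points to (N,) signed distances
    deformed_verts: (V, 3) vertices of the posed, deformable template
    ref_verts:      optional (V, 3) reference vertices for a simple closeness prior
    """
    # Data term: the refined surface should track the SDF zero level set.
    data_term = sdf_net(deformed_verts).abs().mean()

    # Optional regularizer: stay close to the surface before refinement.
    reg_term = torch.tensor(0.0)
    if ref_verts is not None:
        reg_term = (deformed_verts - ref_verts).square().sum(dim=-1).mean()

    return data_term + lambda_reg * reg_term
```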
Tri-plane in Canonical Pose Space. Next, we evaluate the design of our UTTS space, which maps global points into a 3D texture space. A popular alternative in the literature is a canonical unposed 3D space, i.e., the character is usually in a T-pose. For this ablation (referred to as w/can. tri-plane), we therefore placed the tri-plane into this canonical unposed 3D space and evaluated the performance. To warp a sample from the global space to the canonical space, we first project the sample onto the closest face of the posed template mesh surface. Then, we determine the canonical position of the observation-space sample by offsetting the canonical counterpart of the projected position along the canonical face normal. As seen from Figure 8 and Table 4, our proposed approach, which parameterizes the dynamic human in the UTTS space, performs significantly better, both qualitatively and quantitatively. Furthermore, in the supplemental materials, we elaborate on how the proposed UTTS parameterization improves tri-plane usage and reduces tri-plane feature collisions compared with a parameterization in canonical space.
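The sketch below illustrates how this baseline canonicalization can be implemented with trimesh. The projection-plus-normal-offset logic follows the description above, while the function and variable names are illustrative and not part of our released code.

```python
import numpy as np
import trimesh


def warp_to_canonical(samples, posed_mesh, canon_mesh):
    """Map global ray samples into the canonical (unposed) space of this baseline.

    samples:    (N, 3) sample points in global/posed space
    posed_mesh: trimesh.Trimesh of the posed template
    canon_mesh: trimesh.Trimesh of the canonical (T-posed) template, same topology
    Returns (N, 3) canonical sample positions.
    """
    # 1) Project each sample onto the closest face of the posed template mesh.
    surf_pts, dists, tri_ids = trimesh.proximity.closest_point(posed_mesh, samples)

    # Sign the offset by which side of the posed surface the sample lies on.
    signs = np.sign(np.einsum("ij,ij->i",
                              samples - surf_pts,
                              posed_mesh.face_normals[tri_ids]))

    # 2) Express the projection in barycentric coordinates of the posed face.
    bary = trimesh.triangles.points_to_barycentric(
        posed_mesh.triangles[tri_ids], surf_pts)

    # 3) Evaluate the same barycentric coordinates on the canonical mesh and offset
    #    along the canonical face normal by the signed point-to-surface distance.
    canon_surf = np.einsum("nij,ni->nj", canon_mesh.triangles[tri_ids], bary)
    return canon_surf + (signs * dists)[:, None] * canon_mesh.face_normals[tri_ids]
```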
Next, we compare our motion-dependent tri-plane encoder against several baselines to evaluate its effectiveness.
MLP-only. To evaluate the importance of our tri-planar motion encoder (see Section 3.3), we compare it to a pure MLP-based representation (i.e., w/MLP). Here, we remove the tri-plane features and instead feed the GMC and the position-encoded UTTS coordinate directly as input to the SDF and color MLPs. As shown in Table 4 and Figure 8, this design clearly falls short in terms of visual quality and geometry recovery. This can be explained by the fact that the MLP's representation capability is insufficient to model the challenging dynamics of the human body and clothing. A deeper MLP-based architecture could help here, however, at the cost of real-time performance, since the MLP has to be evaluated for every sample along every ray.
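A minimal sketch of this baseline is shown below; the layer widths, the number of frequencies, the GMC dimension, and the activation are illustrative choices and not the exact architecture used in our experiments.

```python
import torch
import torch.nn as nn


def positional_encoding(x, num_freqs=6):
    """Sin/cos encoding of (N, 3) coordinates -> (N, 6 * num_freqs) features."""
    freqs = (2.0 ** torch.arange(num_freqs, dtype=x.dtype, device=x.device)) * torch.pi
    angles = x[..., None] * freqs                        # (N, 3, num_freqs)
    return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(start_dim=-2)


class MLPOnlyBaseline(nn.Module):
    """'w/MLP' baseline: no tri-plane features, only encoded UTTS coordinates + GMC."""

    def __init__(self, gmc_dim=64, num_freqs=6, hidden=256):
        super().__init__()
        self.num_freqs = num_freqs
        in_dim = 6 * num_freqs + gmc_dim
        self.sdf_mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Softplus(beta=100),
            nn.Linear(hidden, hidden), nn.Softplus(beta=100),
            nn.Linear(hidden, 1))                        # one signed distance per sample

    def forward(self, utts_xyz, gmc):
        # utts_xyz: (N, 3) UTTS sample coordinates, gmc: (N, gmc_dim) motion code
        feat = torch.cat([positional_encoding(utts_xyz, self.num_freqs), gmc], dim=-1)
        return self.sdf_mlp(feat)
```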
2D Feature and Pose-encoded D. To assess the necessity of the tri-plane for our task, we conduct an ablation experiment that replaces the motion-dependent tri-plane features with 2D features and a positionally encoded distance, termed 2D Feat + D. To achieve this, we adopt the original UNet architecture for generating 2D features from motion textures. Notably, similar to our 3D-aware convolutional motion encoder, the GMC is channel-wise concatenated to the bottleneck feature maps. As illustrated in Table 4 and Figure 8, our motion-dependent tri-plane representation exhibits superior appearance and geometry accuracy because the d-dimension of our motion-dependent tri-plane can encode motion-aware features by indexing the respective feature planes (UD/VD), while 2D Feat + D only allows the UV-plane to be motion-aware.
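To make the difference concrete, the sketch below queries a motion-dependent tri-plane: all three planes (UV, UD, VD) are indexed, so features can also vary along the d-axis. The concatenation-based fusion and the [-1, 1] coordinate convention are assumptions of this sketch rather than a restatement of our exact implementation.

```python
import torch
import torch.nn.functional as F


def sample_triplane(planes, uvd):
    """Query motion-dependent tri-plane features at UTTS coordinates.

    planes: dict with 'uv', 'ud', and 'vd' feature maps, each of shape (1, C, R, R)
    uvd:    (N, 3) coordinates in [-1, 1], interpreted as (u, v, d)
    Returns fused features of shape (N, 3 * C).
    """
    u, v, d = uvd[:, 0], uvd[:, 1], uvd[:, 2]
    feats = []
    for key, grid in (("uv", torch.stack([u, v], dim=-1)),
                      ("ud", torch.stack([u, d], dim=-1)),
                      ("vd", torch.stack([v, d], dim=-1))):
        # grid_sample takes (1, 1, N, 2) query locations and returns (1, C, 1, N).
        sampled = F.grid_sample(planes[key], grid.view(1, 1, -1, 2),
                                mode="bilinear", align_corners=True)
        feats.append(sampled.squeeze(0).squeeze(1).t())  # (N, C)
    return torch.cat(feats, dim=-1)
```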
Global Motion Code. We conduct two ablations to demonstrate the effectiveness of the GMC. The first ablation removes the GMC from the SDF MLP input features, referred to as w/o GMC SDF. The second ablation eliminates the global motion code from both the tri-plane bottleneck features and the SDF MLP input features, termed w/o GMC. The results in Table 4 and Figure 8 indicate that removing the GMC from the SDF input (w/o GMC SDF) leads to a minor drop in performance, while additionally removing it from the tri-plane bottleneck (w/o GMC) causes a more significant drop due to the lack of global motion awareness.
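As a small illustration of the channel-wise concatenation of the GMC to the bottleneck feature maps, a sketch is given below; the tensor shapes are assumptions.

```python
import torch


def inject_gmc(bottleneck, gmc):
    """Channel-wise concatenate a global motion code to convolutional bottleneck features.

    bottleneck: (B, C, H, W) bottleneck feature maps of the motion encoder
    gmc:        (B, G) global motion code
    Returns feature maps of shape (B, C + G, H, W).
    """
    B, _, H, W = bottleneck.shape
    gmc_map = gmc.view(B, -1, 1, 1).expand(-1, -1, H, W)  # broadcast the code over space
    return torch.cat([bottleneck, gmc_map], dim=1)
```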

6 Applications

In this section, we introduce two applications built upon TriHuman: the TriHuman Viewer, a real-time interactive system designed for inspecting and generating highly detailed clothed humans (Section 6.1), and consistent texture editing supported by TriHuman (Section 6.2).

6.1 TriHuman Viewer

Building upon TriHuman, we introduce an interactive real-time system, the TriHuman Viewer (Figure 9), that enables users to inspect and generate high-fidelity clothed human geometry and renderings, given skeletal motion and camera poses as inputs. We refer to the supplemental video and document for more details regarding the supported interactions and the runtime of each algorithmic component.
Fig. 9.
Fig. 9. UI of the TriHuman Viewer. The TriHuman Viewer offers a real-time interface that enables users to examine the rendering and geometry of training and validation motions. Furthermore, it empowers users to customize camera positions and skeletal DOFs to create novel-view renderings as well as novel-motion geometries and renderings.

6.2 Texture Editing

As highlighted in Section 3, TriHuman can generate detailed geometry with consistent triangulation, opening up new possibilities for a broad spectrum of downstream applications. Here, we use consistent texture editing as an illustrative example of such applications.
Figure 10 presents the results for consistent texture editing, which is achieved through the following steps: First, we select an image with an alpha channel, which serves as an edit to the texture map of the character's template mesh. Next, we render the texture color and the alpha value by rasterizing the textured template mesh. Finally, we obtain the consistent texture editing result by alpha-blending the neural-rendered character imagery with the rasterized texture. Notably, thanks to the high-fidelity and consistent geometry generated by TriHuman, the rendered edits follow the wrinkle deformations of the clothing. Moreover, the edited result effectively retains the occlusions resulting from different poses.
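A minimal sketch of the final compositing step is shown below; the rasterization of the textured, deformed template that produces edit_rgb and edit_alpha is omitted, and the names are illustrative.

```python
import torch


def composite_texture_edit(neural_rgb, edit_rgb, edit_alpha):
    """Alpha-blend a rasterized texture edit over the neural-rendered character.

    neural_rgb: (H, W, 3) neural rendering of the clothed human
    edit_rgb:   (H, W, 3) rasterized color of the edited texture region
    edit_alpha: (H, W, 1) rasterized alpha of the edit (0 where no edit is visible)
    """
    return edit_alpha * edit_rgb + (1.0 - edit_alpha) * neural_rgb
```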
Fig. 10.
Fig. 10. Consistent texture editing. The flowers in the leftmost column can be seamlessly integrated into the clothed human rendering, faithfully adapting to the clothed human’s deformations and preserving occlusions caused by various poses.

7 Limitations and Future Work

While our approach enables controllable, high-quality, and real-time synthesis of human appearance and geometry, there are some limitations, which we hope to see addressed in the future.
First, our model is currently not capable of generating re-lightable human appearance since we do not decompose appearance into material and lighting. However, since the geometry reconstructed by our method is highly accurate, it becomes possible to incorporate re-lightability into our model to enhance the realism and visual coherence of the reconstructed human body in various applications and environments. Second, we currently represent the human surface as an SDF and an explicit mesh model. However, for the hair region, such a representation might not be ideal. Future work could consider a hybrid density- and SDF-based representation accounting for the different body parts and regions that may prefer one representation over the other. Third, our model is person- and outfit-specific and does not generalize across identities. A possible avenue for future work could be to leverage transfer learning approaches, where models pre-trained on large-scale datasets are fine-tuned or adapted to specific identities. Moreover, our method does not support controllable facial expression rendering due to the absence of facial tracking in our dataset, which could be addressed in the future by incorporating facial tracking into the capture pipeline. Last, like all existing methods, our method cannot model surface dynamics induced by external forces such as wind. A promising future direction would be introducing physical constraints into the training of the geometry and appearance generation models.

8 Conclusion

We introduced TriHuman, a novel approach for controllable, real-time, and high-fidelity synthesis of space-time coherent geometry and appearance, learned solely from multi-view video data. Our method excels in reconstructing and generating virtual humans with challenging loose clothing at exceptional quality. The key ingredient of our approach is a deformable and pose-dependent tri-plane representation, which enables real-time yet superior performance. A differentiable, mesh-based mapping function reduces the ambiguity of the transformation from global space to canonical space. The results on our new benchmark dataset with challenging motions demonstrate significant progress toward more lifelike and higher-resolution digital avatars, which are of great importance for the emerging realm of virtual reality (VR). We anticipate that the proposed model, together with the new benchmark dataset, can serve as a robust foundation for future research in this domain.

Supplemental Material

Supplemental Document: Supplementary Information for TriHuman: A Real-time and Controllable Tri-plane Representation for Detailed Human Geometry and Appearance Synthesis
Supplemental Video: Supplementary Information for TriHuman: A Real-time and Controllable Tri-plane Representation for Detailed Human Geometry and Appearance Synthesis

References

[1]
Thiemo Alldieck, Marcus Magnor, Weipeng Xu, Christian Theobalt, and Gerard Pons-Moll. 2018a. Detailed human avatars from monocular video. In International Conference on 3D Vision (3DV). IEEE, 98–109.
[2]
Thiemo Alldieck, Marcus Magnor, Weipeng Xu, Christian Theobalt, and Gerard Pons-Moll. 2018b. Video based reconstruction of 3d people models. In IEEE Conf. Comput. Vis. Pattern Recog. 8387–8397.
[3]
Timur Bagautdinov, Chenglei Wu, Tomas Simon, Fabián Prada, Takaaki Shiratori, Shih-En Wei, Weipeng Xu, Yaser Sheikh, and Jason Saragih. 2021. Driving-signal aware full-body avatars. ACM Trans. Graph. 40, 4 (2021), 1–17.
[4]
Alexander Bergman, Petr Kellnhofer, Wang Yifan, Eric Chan, David Lindell, and Gordon Wetzstein. 2022. Generative neural articulated radiance fields. Adv. Neural Inform. Process. Syst. 35 (2022), 19900–19916.
[5]
Bharat Lal Bhatnagar, Cristian Sminchisescu, Christian Theobalt, and Gerard Pons-Moll. 2020. Loopreg: Self-supervised learning of implicit surface correspondences, pose and shape for 3d human mesh registration. Adv. Neural Inform. Process. Syst. 33 (2020), 12909–12922.
[6]
Volker Blanz and Thomas Vetter. 1999. A morphable model for the synthesis of 3D faces. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques. 187–194.
[7]
Piotr Bojanowski, Armand Joulin, David Lopez-Paz, and Arthur Szlam. 2018. Optimizing the Latent Space of Generative Networks. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018 (Proceedings of Machine Learning Research, Vol. 80), Jennifer G. Dy and Andreas Krause (Eds.). PMLR, 599–608.
[8]
Eric R. Chan, Connor Z. Lin, Matthew A. Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas J. Guibas, Jonathan Tremblay, Sameh Khamis, Tero Karras, and Gordon Wetzstein. 2022. Efficient geometry-aware 3D generative adversarial networks. In IEEE Conf. Comput. Vis. Pattern Recog. IEEE, 16123–16133.
[9]
Eric R. Chan, Marco Monteiro, Petr Kellnhofer, Jiajun Wu, and Gordon Wetzstein. 2021. pi-gan: Periodic implicit generative adversarial networks for 3d-aware image synthesis. In IEEE Conf. Comput. Vis. Pattern Recog. 5799–5809.
[10]
Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and Hao Su. 2022. Tensorf: Tensorial radiance fields. In Eur. Conf. Comput. Vis. Springer, 333–350.
[11]
Jianchuan Chen, Ying Zhang, Di Kang, Xuefei Zhe, Linchao Bao, Xu Jia, and Huchuan Lu. 2021. Animatable neural radiance fields from monocular rgb videos. CoRR abs/2106.13629 (2021). arXiv:2106.13629 https://arxiv.org/abs/2106.13629
[12]
Xu Chen, Tianjian Jiang, Jie Song, Max Rietmann, Andreas Geiger, Michael J Black, and Otmar Hilliges. 2023. Fast-SNARF: A fast deformer for articulated neural fields. IEEE Trans. Pattern Anal. Mach. Intell. 45, 10 (2023), 11796–11809.
[13]
Paolo Cignoni, Marco Callieri, Massimiliano Corsini, Matteo Dellepiane, Fabio Ganovelli, and Guido Ranzuglia. 2008. MeshLab: an open-source mesh processing tool. In Eurographics Italian Chapter Conference. The Eurographics Association.
[14]
Boyang Deng, John P. Lewis, Timothy Jeruzalski, Gerard Pons-Moll, Geoffrey Hinton, Mohammad Norouzi, and Andrea Tagliasacchi. 2020. Nasa neural articulated shape approximation. In Eur. Conf. Comput. Vis. Springer, 612–628.
[15]
Sara Fridovich-Keil, Alex Yu, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. 2022. Plenoxels: Radiance fields without neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5501–5510.
[16]
Guy Gafni, Justus Thies, Michael Zollhofer, and Matthias Nießner. 2021. Dynamic neural radiance fields for monocular 4d facial avatar reconstruction. In IEEE Conf. Comput. Vis. Pattern Recog. 8649–8658.
[17]
Qingzhe Gao, Yiming Wang, Libin Liu, Lingjie Liu, Christian Theobalt, and Baoquan Chen. 2024. Neural novel actor: Learning a generalized animatable neural representation for human actors. IEEE Trans. Vis. Comput. Graph. 30, 8 (2024), 5719–5732.
[18]
Chen Geng, Sida Peng, Zhen Xu, Hujun Bao, and Xiaowei Zhou. 2023. Learning neural volumetric representations of dynamic humans in minutes. In IEEE Conf. Comput. Vis. Pattern Recog. 8759–8770.
[19]
Amos Gropp, Lior Yariv, Niv Haim, Matan Atzmon, and Yaron Lipman. 2020. Implicit geometric regularization for learning shapes. In Int. Conf. Machine Learning. 3789–3799.
[20]
Marc Habermann, Lingjie Liu, Weipeng Xu, Gerard Pons-Moll, Michael Zollhoefer, and Christian Theobalt. 2023. Hdhumans: A hybrid approach for high-fidelity digital humans. Proceedings of the ACM on Computer Graphics and Interactive Techniques 6, 3 (2023), 1–23.
[21]
Marc Habermann, Lingjie Liu, Weipeng Xu, Michael Zollhoefer, Gerard Pons-Moll, and Christian Theobalt. 2021. Real-time deep dynamic characters. ACM Trans. Graph. 40, 4 (2021), 1–16.
[22]
Peter Hedman, Pratul P. Srinivasan, Ben Mildenhall, Jonathan T. Barron, and Paul Debevec. 2021. Baking neural radiance fields for real-time view synthesis. In IEEE Conf. Comput. Vis. Pattern Recog. 5875–5884.
[23]
Liangxiao Hu, Hongwen Zhang, Yuxiang Zhang, Boyao Zhou, Boning Liu, Shengping Zhang, and Liqiang Nie. 2024. GaussianAvatar: Towards realistic human avatar modeling from a single video via animatable 3D gaussians. In IEEE Conf. Comput. Vis. Pattern Recog.
[24]
T. Jiang, X. Chen, J. Song, and O. Hilliges. 2023. InstantAvatar: Learning avatars from monocular video in 60 seconds. In IEEE Conf. Comput. Vis. Pattern Recog. 16922–16932.
[25]
Ladislav Kavan, Steven Collins, Jiří Žára, and Carol O’Sullivan. 2007. Skinning with dual quaternions. In Proceedings of the 2007 Symposium on Interactive 3D Graphics and Games. ACM, 39–46.
[26]
Youngjoong Kwon, Lingjie Liu, Henry Fuchs, Marc Habermann, and Christian Theobalt. 2023. DELIFFAS: Deformable light fields for fast avatar synthesis. In Adv. Neural Inform. Process. Syst., 36 (2023), 40944–40962.
[27]
JP Lewis, Matt Cordner, and Nickson Fong. 2000. Pose space deformation: A unified approach to shape interpolation and skeleton-driven deformation. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques. 165–172.
[28]
Ruilong Li, Julian Tanke, Minh Vo, Michael Zollhöfer, Jürgen Gall, Angjoo Kanazawa, and Christoph Lassner. 2022. Tava: Template-free animatable volumetric actors. In Eur. Conf. Comput. Vis. Springer, 419–436.
[29]
Zhe Li, Zerong Zheng, Lizhen Wang, and Yebin Liu. 2024. Animatable gaussians: Learning pose-dependent gaussian maps for high-fidelity human avatar modeling. In IEEE Conf. Comput. Vis. Pattern Recog.
[30]
Lingjie Liu, Marc Habermann, Viktor Rudnev, Kripasindhu Sarkar, Jiatao Gu, and Christian Theobalt. 2021. Neural actor: Neural free-view synthesis of human actors with pose control. ACM Trans. Graph. 40, 6 (2021), 1–16.
[31]
Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. 2015. SMPL: A skinned multi-person linear model. ACM Trans. Graph. 34, 6 (2015), 1–16.
[32]
Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. 2021. Nerf: Representing scenes as neural radiance fields for view synthesis. Commun. ACM 65, 1 (2021), 99–106.
[33]
Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. 2022. Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph. 41, 4 (2022), 1–15.
[34]
Michael Niemeyer and Andreas Geiger. 2021. Giraffe: Representing scenes as compositional generative neural feature fields. In IEEE Conf. Comput. Vis. Pattern Recog.11453–11464.
[35]
Michael Niemeyer, Lars Mescheder, Michael Oechsle, and Andreas Geiger. 2020. Differentiable volumetric rendering: Learning implicit 3d representations without 3d supervision. In IEEE Conf. Comput. Vis. Pattern Recog. 3504–3515.
[36]
Atsuhiro Noguchi, Xiao Sun, Stephen Lin, and Tatsuya Harada. 2021. Neural articulated radiance field. In Int. Conf. Comput. Vis. 5762–5772.
[37]
Atsuhiro Noguchi, Xiao Sun, Stephen Lin, and Tatsuya Harada. 2022. Unsupervised learning of efficient geometry-aware neural articulated representations. In Eur. Conf. Comput. Vis. Springer, 597–614.
[38]
Michael Oechsle, Songyou Peng, and Andreas Geiger. 2021. Unisurf: Unifying neural implicit surfaces and radiance fields for multi-view reconstruction. In Int. Conf. Comput. Vis. 5589–5599.
[39]
Haokai Pang, Heming Zhu, Adam Kortylewski, Christian Theobalt, and Marc Habermann. 2024. Ash: Animatable gaussian splats for efficient and photoreal human rendering. In IEEE Conf. Comput. Vis. Pattern Recog. 1165–1175.
[40]
Keunhong Park, Utkarsh Sinha, Jonathan T. Barron, Sofien Bouaziz, Dan B. Goldman, Steven M Seitz, and Ricardo Martin-Brualla. 2021. Nerfies: Deformable neural radiance fields. In Int. Conf. Comput. Vis. 5865–5874.
[41]
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Adv. Neural Inform. Process. Syst., 32 (2019), 8024–8035.
[42]
Sida Peng, Junting Dong, Qianqian Wang, Shangzhan Zhang, Qing Shuai, Xiaowei Zhou, and Hujun Bao. 2021a. Animatable neural radiance fields for modeling dynamic human bodies. In Int. Conf. Comput. Vis. 14314–14323.
[43]
Sida Peng, Yuanqing Zhang, Yinghao Xu, Qianqian Wang, Qing Shuai, Hujun Bao, and Xiaowei Zhou. 2021b. Neural body: Implicit neural representations with structured latent codes for novel view synthesis of dynamic humans. In IEEE Conf. Comput. Vis. Pattern Recog. 9054–9063.
[44]
Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. 2021. D-nerf: Neural radiance fields for dynamic scenes. In IEEE Conf. Comput. Vis. Pattern Recog. 10318–10327.
[45]
Edoardo Remelli, Timur M. Bagautdinov, Shunsuke Saito, Chenglei Wu, Tomas Simon, Shih-En Wei, Kaiwen Guo, Zhe Cao, Fabian Prada, Jason M. Saragih, and Yaser Sheikh. 2022. Drivable volumetric avatars using texel-aligned features. In SIGGRAPH’22: Special Interest Group on Computer Graphics and Interactive Techniques Conference, Vancouver, BC, Canada, August 7-11, 2022. ACM, 56:1–56:9.
[46]
Shunsuke Saito, Jinlong Yang, Qianli Ma, and Michael J. Black. 2021. SCANimate: Weakly supervised learning of skinned clothed avatar networks. In IEEE Conf. Comput. Vis. Pattern Recog. 2886–2897.
[47]
Katja Schwarz, Yiyi Liao, Michael Niemeyer, and Andreas Geiger. 2020. Graf: Generative radiance fields for 3d-aware image synthesis. Adv. Neural Inform. Process. Syst. 33 (2020), 20154–20166.
[48]
Soumyadip Sengupta, Vivek Jayaram, Brian Curless, Steven M. Seitz, and Ira Kemelmacher-Shlizerman. 2020. Background matting: The world is your green screen. In IEEE Conf. Comput. Vis. Pattern Recog. 2291–2300.
[49]
Zhijing Shao, Zhaolong Wang, Zhuang Li, Duotun Wang, Xiangru Lin, Yu Zhang, Mingming Fan, and Zeyu Wang. 2024. Splattingavatar: Realistic real-time human avatars with mesh-embedded gaussian splatting. In IEEE Conf. Comput. Vis. Pattern Recog. 1606–1616.
[50]
Vincent Sitzmann, Michael Zollhöfer, and Gordon Wetzstein. 2019. Scene representation networks: Continuous 3d-structure-aware neural scene representations. In Adv. Neural Inform. Process. Syst., 32 (2019), 1119–1130.
[51]
Olga Sorkine and Marc Alexa. 2007. As-rigid-as-possible surface modeling. In Symposium on Geometry Processing, Vol. 4. Citeseer, 109–116.
[52]
Shih-Yang Su, Frank Yu, Michael Zollhöfer, and Helge Rhodin. 2021. A-NeRF: Articulated neural radiance fields for learning human shape, appearance, and pose. In Adv. Neural Inform. Process. Syst., 34 (2021), 12278–12291.
[53]
TheCaptury. 2020. The Captury. Retrieved from http://www.thecaptury.com/
[54]
Treedys. 2020. Treedys. Retrieved from https://www.treedys.com/
[55]
Edgar Tretschk, Ayush Tewari, Vladislav Golyanik, Michael Zollhöfer, Christoph Lassner, and Christian Theobalt. 2021. Non-rigid neural radiance fields: Reconstruction and novel view synthesis of a dynamic scene from monocular video. In Int. Conf. Comput. Vis. 12959–12970.
[56]
Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. 2021. NeuS: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. Adv. Neural Inform. Process. Syst. 34 (2021), 27171–27183.
[57]
Shaofei Wang, Katja Schwarz, Andreas Geiger, and Siyu Tang. 2022. Arah: Animatable volume rendering of articulated human sdfs. In Eur. Conf. Comput. Vis. Springer, 1–19.
[58]
Tengfei Wang, Bo Zhang, Ting Zhang, Shuyang Gu, Jianmin Bao, Tadas Baltrusaitis, Jingjing Shen, Dong Chen, Fang Wen, Qifeng Chen, and Baining Guo. 2023b. RODIN: A generative model for sculpting 3D digital avatars using diffusion. In IEEE Conf. Comput. Vis. Pattern Recog. 4563–4573.
[59]
Yiming Wang, Qin Han, Marc Habermann, Kostas Daniilidis, Christian Theobalt, and Lingjie Liu. 2023a. Neus2: Fast learning of neural implicit surfaces for multi-view reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 3295–3306.
[60]
Chung-Yi Weng, Brian Curless, and Ira Kemelmacher-Shlizerman. 2020. Vid2actor: Free-viewpoint animatable person synthesis from video in the wild. CoRR abs/2012.12884 (2020). arXiv:2012.12884 https://arxiv.org/abs/2012.12884
[61]
Chung-Yi Weng, Brian Curless, Pratul P. Srinivasan, Jonathan T. Barron, and Ira Kemelmacher-Shlizerman. 2022. Humannerf: Free-viewpoint rendering of moving people from monocular video. In IEEE Conf. Comput. Vis. Pattern Recog. 16210–16220.
[62]
Donglai Xiang, Fabian Prada, Timur Bagautdinov, Weipeng Xu, Yuan Dong, He Wen, Jessica Hodgins, and Chenglei Wu. 2021. Modeling clothing as a separate layer for an animatable human avatar. ACM Trans. Graph. 40, 6 (2021), 1–15.
[63]
Donglai Xiang, Fabian Prada, Zhe Cao, Kaiwen Guo, Chenglei Wu, Jessica Hodgins, and Timur Bagautdinov. 2023. Drivable avatar clothing: Faithful full-body telepresence with dynamic clothing driven by sparse rgb-d input. In SIGGRAPH Asia 2023 Conference Papers. 1–11.
[64]
Donglai Xiang, Fabian Prada, Chenglei Wu, and Jessica Hodgins. 2020. Monoclothcap: Towards temporally coherent clothing capture from monocular rgb video. In International Conference on 3D Vision (3DV). IEEE, 322–332.
[65]
Hongyi Xu, Thiemo Alldieck, and Cristian Sminchisescu. 2021. H-nerf: Neural radiance fields for rendering and temporal reconstruction of humans in motion. Adv. Neural Inform. Process. Syst. 34 (2021), 14955–14966.
[66]
Lior Yariv, Jiatao Gu, Yoni Kasten, and Yaron Lipman. 2021. Volume rendering of neural implicit surfaces. Adv. Neural Inform. Process. Syst. 34 (2021), 4805–4815.
[67]
Lior Yariv, Yoni Kasten, Dror Moran, Meirav Galun, Matan Atzmon, Basri Ronen, and Yaron Lipman. 2020. Multiview neural surface reconstruction by disentangling geometry and appearance. Adv. Neural Inform. Process. Syst. 33 (2020), 2492–2502.
[68]
Alex Yu, Ruilong Li, Matthew Tancik, Hao Li, Ren Ng, and Angjoo Kanazawa. 2021. Plenoctrees for real-time rendering of neural radiance fields. In Int. Conf. Comput. Vis. 5752–5761.
[69]
Fangneng Zhan, Lingjie Liu, Adam Kortylewski, and Christian Theobalt. 2023. General neural gauge fields. In Int. Conf. Learn. Represent.
[70]
Jiakai Zhang, Xinhang Liu, Xinyi Ye, Fuqiang Zhao, Yanshun Zhang, Minye Wu, Yingliang Zhang, Lan Xu, and Jingyi Yu. 2021. Editable free-viewpoint video using a layered neural representation. ACM Trans. Graph. 40, 4 (2021), 1–18.
[71]
Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. 2018. The unreasonable effectiveness of deep features as a perceptual metric. In IEEE Conf. Comput. Vis. Pattern Recog.586–595.
