Monocular Real-Time Volumetric Performance Capture

{ruilongl, yxiu, zenghuan}@usc.edu,
{shunsuke.saito16, olszewski.kyle}@gmail.com, hao@hao-li.com
1 Introduction
Videoconferencing using a single camera is still the most common approach for
face-to-face communication over long distances, despite recent advances in virtual
and augmented reality and 3D displays that allow for far more immersive and
compelling interaction. The reason for this is simple: convenience. Though the
technology exists to obtain high-fidelity digital representations of one’s specific
appearance that can be rendered from arbitrary viewpoints, existing methods
to capture and stream this data [7, 10, 16, 53, 69] require cumbersome capture
technology, such as a large number of calibrated cameras or depth sensors, and
the expert knowledge to install and deploy these systems. Videoconferencing,
on the other hand, simply requires a single video camera, such as those found
on common consumer devices, e.g. laptops and smartphones. Thus, if we can
capture a complete model of a person’s unique appearance and motion from a
single consumer-grade camera, we can bridge the gap preventing novice users
from engaging in immersive communication in virtual environments.
However, successful reconstruction of not only the geometry but also the
texture of a person from a single viewpoint poses significant challenges due to
depth ambiguity, changing topology, and severe occlusions. To address these
challenges, data-driven approaches using high-capacity deep neural networks have
been employed, demonstrating significant advances in the fidelity and robustness
of human modeling [47, 61, 71, 86]. In particular, Pixel-Aligned Implicit Function
(PIFu) [61] achieves fully-textured reconstructions of clothed humans with a very
high resolution that is infeasible with voxel-based approaches. On the other hand,
the main limitation of PIFu is that the subsequent reconstruction process is
not fast enough for real-time applications: given an input image, PIFu densely
evaluates 3D occupancy fields, from which the underlining surface geometry
is extracted using the Marching Cubes algorithm [41]. After the surface mesh
reconstruction, the texture on the surface is inferred in a similar manner. Finally,
the colored meshes are rendered from arbitrary viewpoints. The whole process
takes tens of seconds per object when using a 256³ resolution. Our goal is to
achieve such fidelity and robustness with reconstruction and rendering efficient
enough for real-time applications.
To this end, we introduce a novel surface reconstruction algorithm, as well
as a direct rendering method that does not require extracting surface meshes
for rendering. The newly introduced surface localization algorithm progressively
queries 3D locations in a coarse-to-fine manner to construct 3D occupancy fields
with a smaller number of points to be evaluated. We empirically demonstrate
that our algorithm retains the accuracy of the original reconstruction, while
being two orders of magnitude faster than the brute-force baseline. Additionally,
combined with the proposed surface reconstruction algorithm, our implicit texture
representation enables direct novel-view synthesis without geometry tessellation
or texture mapping, which halves the time required for rendering. As a result,
we enable 15 fps processing with a 256³ spatial resolution for volumetric
performance capture.
In addition, we present a key enhancement to the training method of [61] to
further improve the quality and efficiency of reconstruction. To suppress failure
cases caused by hard examples that rarely occur during training due to the unbalanced
data distribution with respect to viewing angles, poses, and clothing styles, we
introduce an adaptive data sampling algorithm inspired by the Online Hard Example
Mining (OHEM) method [64]. We incrementally update the sampling probability based on the
current prediction accuracy to train more frequently with hard examples without
manually selecting these samples. We find this automatic sampling approach
highly effective for reducing artifacts, resulting in state-of-the-art accuracy.
2 Related Work
3 Method
In this section, we describe the overall pipeline of our algorithm for real-time
volumetric capture (Fig. 2). Given a live stream of RGB images, our goal is to
obtain the complete 3D geometry of the performing subject in real-time with
the full textured surface, including unseen regions. To achieve an accessible
solution with minimal requirements, we process each frame independently, as
tracking-based solutions are prone to accumulating errors and are sensitive to
initialization, causing drift and instability [49, 88]. Although recent approaches
have demonstrated that the use of anchor frames [3,10] can alleviate drift, ad-hoc
engineering is still required to handle common but extremely challenging scenarios
such as changing the subject.
For each frame, we first apply real-time segmentation of the subject from the
background. The segmented image is then fed into our enhanced Pixel-Aligned
Implicit Function (PIFu) [61] to predict continuous occupancy fields where the
underlying surface is defined as a 0.5-level set. Once the surface is determined,
texture inference on the surface geometry is also performed using PIFu, allowing
for rendering from any viewpoint for various applications. As this deep learning
framework with effective 3D shape representation is the core building block of
the proposed system, we review it in Sec. 3.1, describe our enhancements to it,
and point out the limitations on its surface inference and rendering speed. At
the heart of our system, we develop a novel acceleration framework that enables
real-time inference and rendering from novel viewpoints using PIFu (Sec. 3.2).
Finally, we improve the robustness of the system by sampling hard
examples on the fly to efficiently suppress failure modes in a manner inspired by
Online Hard Example Mining [64] (Sec. 3.3).
3.1 Pixel-Aligned Implicit Function (PIFu)

Unlike approaches in which 3D shapes are directly regressed using voxels, where the
target space is explicitly discretized [24, 71], the Pixel-Aligned Implicit Function
(PIFu) models a function O(P) that queries any 3D point and predicts its binary
occupancy in normalized device coordinates P = (P_x, P_y, P_z) ∈ R^3. Notably, with
this approach no discretization is needed to infer 3D shapes, allowing reconstruction
at arbitrary resolutions.
PIFu first samples an image feature from a fully convolutional image encoder g_O(I)
using a differentiable sampling function Φ(P_xy, g_O(I)) (following [61], we use a
bilinear sampling function [25] for Φ). Given the sampled image feature, a function
parameterized by another neural network f_O estimates the occupancy of a queried
point P as follows:

O(P) = f_O(\Phi(P_{xy}, g_O(I)), P_z) = \begin{cases} 1 & \text{if } P \text{ is inside the surface} \\ 0 & \text{otherwise.} \end{cases} \qquad (1)
Texture inference is modeled analogously: a second implicit function T, given a
surface point P and the image, predicts the RGB color C at that point. The advantage
of this representation is that texture inference can be performed on any surface
geometry, including occluded regions, without requiring a shared 2D parameterization
[35, 78]. We use an L1 loss on the sampled point colors.
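To make Eq. (1) concrete, the following is a minimal PyTorch-style sketch of a pixel-aligned occupancy query; the module structure, feature dimension, and layer sizes are illustrative assumptions rather than the exact configuration of [61] or of our network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelAlignedOccupancy(nn.Module):
    """Sketch of a PIFu-style occupancy query (Eq. 1).

    `encoder` plays the role of g_O and `mlp` the role of f_O; the feature
    dimension and layer widths are illustrative."""

    def __init__(self, encoder: nn.Module, feat_dim: int = 256):
        super().__init__()
        self.encoder = encoder
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 1, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Sigmoid(),      # occupancy in [0, 1]
        )

    def forward(self, image: torch.Tensor, points: torch.Tensor) -> torch.Tensor:
        # image:  (B, 3, H, W) segmented input frame I
        # points: (B, N, 3) query points P in normalized device coordinates
        feat_map = self.encoder(image)                        # g_O(I): (B, C, H', W')
        grid = points[:, :, None, :2]                         # P_xy for grid_sample
        feat = F.grid_sample(feat_map, grid, mode='bilinear',
                             align_corners=True)              # Phi(P_xy, g_O(I))
        feat = feat.squeeze(-1).permute(0, 2, 1)              # (B, N, C)
        z = points[:, :, 2:3]                                 # P_z
        return self.mlp(torch.cat([feat, z], dim=-1)).squeeze(-1)   # (B, N)
```

The texture function T can be sketched with the same pattern, using a separate image encoder and a 3-channel color head.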
Furthermore, we make several modifications to the original implementation of [61]
to further improve accuracy and efficiency. For shape inference, instead of the
stacked hourglass [50], we use HRNetV2-W18-Small-v2 [68] as the backbone, which
demonstrates superior accuracy with less computation and fewer parameters. We also
use conditional batch normalization [9, 11, 45] to condition the MLPs on the sampled
image features, instead of concatenating these features with the queried depth value,
which further improves accuracy without increasing computational overhead.
Additionally, inspired by an ordinal depth regression approach [12], we found that
representing the depth P_z as a soft one-hot vector more effectively propagates
depth information, resulting in faster convergence.
For texture inference, we detect the visible surface from the reconstruction and
directly use the color from the corresponding pixel, as these regions do not require
any inference, further improving the realism of free viewpoint rendering. We
provide additional ablation studies to validate our design choices in the appendix.
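As an illustration of this conditioning scheme, the snippet below sketches a conditional-batch-normalization layer in PyTorch; the way the modulation parameters are predicted and the dimensions involved are assumptions for exposition, not the exact design of our network.

```python
import torch
import torch.nn as nn

class ConditionalBatchNorm1d(nn.Module):
    """Sketch of a CBN layer for the occupancy MLP: the pixel-aligned image
    feature predicts per-channel scale (gamma) and shift (beta) that modulate
    the normalized activations, instead of being concatenated with the input."""

    def __init__(self, num_features: int, cond_dim: int):
        super().__init__()
        self.bn = nn.BatchNorm1d(num_features, affine=False)
        self.gamma = nn.Linear(cond_dim, num_features)
        self.beta = nn.Linear(cond_dim, num_features)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x:    (B * N, num_features) hidden activations of the queried points
        # cond: (B * N, cond_dim)     sampled pixel-aligned image features
        return self.bn(x) * self.gamma(cond) + self.beta(cond)
```

Modulating the normalized activations with γ and β predicted from the pixel-aligned feature lets the feature influence every MLP layer without enlarging the per-point input, which is consistent with the negligible overhead noted above.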
Inference for Human Reconstruction. In [61], the entire digitization pipeline starts
with the dense evaluation of the occupancy fields in 3D, from which the surface
mesh is extracted using Marching Cubes [41]. Then, to obtain the fully textured
mesh, the texture inference module is applied to the vertices on the surface mesh.
While the implicit shape representation allows us to reconstruct 3D shapes with
an arbitrary resolution, the evaluation in the entire 3D space is prohibitively
slow, requiring tens of seconds to process a single frame. Thus, acceleration by
at least two orders of magnitude is crucial for real-time performance.
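For reference, the brute-force baseline amounts to the sketch below: at a resolution of 256, roughly 256³ ≈ 16.8 million points must be passed through the network before Marching Cubes can run, which is what makes dense evaluation so slow. Here `query_fn` is a placeholder for the PIFu occupancy evaluation.

```python
import numpy as np
from skimage import measure

def reconstruct_brute_force(query_fn, resolution=256, chunk=2 ** 18):
    """Dense-evaluation baseline: query the occupancy at every node of a
    regular grid (resolution**3 points, i.e. ~16.8M at 256) and extract the
    0.5-level set with Marching Cubes [41].  `query_fn` maps an (N, 3) array
    of normalized coordinates to (N,) occupancy values."""
    lin = np.linspace(-1.0, 1.0, resolution, dtype=np.float32)
    grid = np.stack(np.meshgrid(lin, lin, lin, indexing='ij'), -1).reshape(-1, 3)
    occ = np.concatenate([query_fn(grid[i:i + chunk])
                          for i in range(0, len(grid), chunk)])
    occ = occ.reshape(resolution, resolution, resolution)
    verts, faces, _, _ = measure.marching_cubes(occ, level=0.5)
    return verts, faces
```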
Fig. 3. Comparison of surface reconstruction methods. The plot shows the trade-off
between retention of the accuracy of the original reconstruction (i.e., IoU) and
speed. The acceleration factor is the number of points evaluated by the brute-force
baseline divided by the number evaluated by each method. Note that the thresholds
used for the octree reconstructions are 0.05, 0.08, 0.12, 0.2, 0.3, and 0.4 from
left to right in the plot.
Fig. 4. Overview of our surface localization algorithm. The dashed and solid lines
denote the true surface and the reconstructed surface, respectively. The nodes that
are not used for the time-consuming network evaluation are shaded grey.
A commonly used acceleration strategy is to subdivide grids if the maximum absolute
deviation among the neighboring coarse grid values is larger than a threshold. While
this approach allows for control over the trade-off between reconstruction accuracy
and acceleration, we found that it either excessively evaluates unnecessary points to
achieve accurate reconstruction or suffers from impaired reconstruction quality in
exchange for higher acceleration. To address this, we introduce a surface localization
algorithm that hierarchically and precisely determines the boundary nodes.
Given the occupancy prediction at the coarser level, we first binarize the occupancy
values with a threshold of 0.5 and apply interpolation (bilinear for 2D, trilinear
for 3D) to tentatively assign occupancy values to the grid points at the current
level (Fig. 4(a)). Then, we extract the boundary candidates, i.e., the grid points
whose interpolated values are neither 0 nor 1. To cover sufficiently large regions,
we apply a dilation operation that incorporates the 1-ring neighbors of these boundary
candidates (Fig. 4(b)). The selected nodes are evaluated with the network and their
occupancy values are updated. Note that if we terminated at this point and moved on
to the next level, true boundary candidates could be culled, as in the aforementioned
acceleration approaches. Thus, as an additional step, we detect conflict nodes by
comparing the binarized values of the interpolation and the network prediction for
the boundary candidates. The key observation is that there must be a missing surface
region wherever the predicted and interpolated values are inconsistent. The nodes
adjacent to the conflict nodes are evaluated with the network iteratively until all
conflicts are resolved (Fig. 4(c)).
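The sketch below illustrates this localization scheme on a dense grid. It is a simplified, single-pass rendition under stated assumptions: `query_fn` stands in for the occupancy network evaluated at integer grid indices, CPU tensors are assumed, and the iterative conflict resolution is only indicated in a comment.

```python
import torch
import torch.nn.functional as F
from scipy import ndimage

def localize_surface(query_fn, occ_coarse, num_levels=3):
    """Simplified sketch of the coarse-to-fine surface localization (Fig. 4).

    occ_coarse: densely evaluated occupancy grid (D, D, D) at the coarsest level.
    query_fn(idx, res): placeholder for the network; evaluates the occupancy at
    integer grid indices idx (K, 3) of a res**3 grid and returns (K,) values.
    Only boundary candidates (and, in the full method, the neighbours of
    conflict nodes) are sent to the expensive network."""
    occ = occ_coarse
    for _ in range(num_levels):
        res = occ.shape[0] * 2
        # (a) binarize the coarser level and tentatively interpolate it
        binary = (occ > 0.5).float()[None, None]
        occ = F.interpolate(binary, size=(res,) * 3, mode='trilinear',
                            align_corners=True)[0, 0]
        # (b) boundary candidates: interpolated values that are neither 0 nor 1,
        #     dilated by one ring to cover sufficiently large regions
        boundary = (occ > 1e-4) & (occ < 1 - 1e-4)
        boundary = torch.from_numpy(ndimage.binary_dilation(boundary.numpy()))
        idx = boundary.nonzero()                      # (K, 3) nodes to evaluate
        pred = query_fn(idx, res)
        # (c) conflict nodes: the binarized prediction disagrees with the
        #     binarized interpolation, indicating a missed surface region.
        #     The full method re-evaluates their neighbours until no conflict
        #     remains; a single pass is shown here for brevity.
        conflict = (pred > 0.5) != (occ[idx[:, 0], idx[:, 1], idx[:, 2]] > 0.5)
        occ[idx[:, 0], idx[:, 1], idx[:, 2]] = pred
        # ... while conflict.any(): evaluate neighbours of idx[conflict] ...
    return occ
```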
Fig. 5. Overview of our mesh-free rendering: starting from the initial state,
(a) shadow region detection, (b) boundary evaluation, and (c) surface point
extraction. The dashed and solid lines denote the true surface and the
reconstructed surface, respectively.
Fig. 4 shows that the octree-based reconstruction with binarization [45] and the
subdivision with a higher threshold suffer from inaccurate surface localization.
While the subdivision approach with a lower threshold can prevent inaccurate
reconstruction, an excessive number of nodes must be evaluated. In contrast, our
approach not only extracts the surface accurately but also effectively reduces the
number of nodes to be evaluated (see the number of blue-colored nodes).
Fig. 6. Qualitative evaluation of the OHEM sampling (error maps range from 0 cm to
2.5 cm; columns show the input, reconstruction and error without and with point-ohem,
the point-ohem weight, and reconstruction and error without and with item-ohem). The
proposed sampling effectively selects challenging regions, resulting in significantly
more robust reconstruction.
discarded for the network evaluation (Fig. 5(a)). Once shadow nodes are marked, we
evaluate with the network the remaining nodes whose interpolated value is 0.5 and
update the occupancy values (Fig. 5(b)). Finally, we apply binarization to the
current occupancy values and perform the argmax operation again along the z axis to
obtain the updated nearest-point indices. For the pixels with O_max(q) = 1, we take
the nodes with indices i_max(q) − 1 and i_max(q) as surface points and compute the
3D coordinates of the surface point P(q) by interpolating these two nodes by the
predicted occupancy value (Fig. 5(c)). A novel-view image R is then rendered as
follows:

R(q) = \begin{cases} T(P(q), I) & \text{if } O_{max}(q) = 1 \\ B & \text{otherwise,} \end{cases} \qquad (3)
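Because part of the preceding derivation is not reproduced in this excerpt, the sketch below only illustrates the final compositing step of Eq. (3) under simplifying assumptions: it takes a view-aligned occupancy volume whose first axis is the viewing direction, finds the nearest surface crossing per pixel via an argmax over the binarized occupancy, interpolates the surface point, and queries a placeholder texture function. The names `texture_fn`, `background`, and `z_vals` are illustrative.

```python
import torch

def render_novel_view(occ_grid, texture_fn, background, z_vals):
    """Hedged sketch of the mesh-free compositing of Eq. (3).

    occ_grid:   (D, H, W) view-aligned occupancy, z along the first axis.
    texture_fn: placeholder for T; maps (K, 3) surface points to (K, 3) RGB.
    background: (H, W, 3) background image B.
    z_vals:     (D,) depth value of each occupancy slice."""
    binary = (occ_grid > 0.5).float()
    o_max = binary.max(dim=0).values              # O_max(q)
    i_max = binary.argmax(dim=0)                  # i_max(q): first occupied slice
    hit = o_max > 0.5                             # pixels where the surface is seen
    ys, xs = hit.nonzero(as_tuple=True)

    i_in = i_max[hit]                             # node just inside the surface
    i_out = (i_in - 1).clamp(min=0)               # node just outside the surface
    o_in = occ_grid[i_in, ys, xs]
    o_out = occ_grid[i_out, ys, xs]
    # interpolate the 0.5 crossing between the two nodes
    t = ((0.5 - o_out) / (o_in - o_out).clamp(min=1e-6)).clamp(0.0, 1.0)
    z = z_vals[i_out] + t * (z_vals[i_in] - z_vals[i_out])

    # surface points P(q) in normalized device coordinates
    H, W = hit.shape
    x = xs.float() / (W - 1) * 2 - 1
    y = ys.float() / (H - 1) * 2 - 1
    colors = texture_fn(torch.stack([x, y, z], dim=-1))

    image = background.clone()                    # R(q) = B where O_max(q) = 0
    image[ys, xs] = colors                        # R(q) = T(P(q), I) elsewhere
    return image
```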
Although hard example mining has been applied to tasks such as learning image
descriptors [65], training image classifiers [42], and object detection [64], each
of these employs a mining strategy specific to its task, so it is non-trivial to
extend these algorithms to other problems. In contrast, our formulation is general
and can be applied to any problem domain, as it requires no domain-specific knowledge.
Given a dataset M, a common approach for supervised learning is to define an
objective function L_m per data sample m and reduce the error within a mini-batch
using optimizers (e.g., SGD, Adam [30]). Assuming a uniform distribution for data
sampling, we are minimizing the following function L w.r.t. the variables (i.e.,
network weights) over the course of iterative optimization:

L = \frac{1}{\|M\|} \sum_{m \in M} L_m. \qquad (4)
Now suppose the dataset is implicitly clustered into S classes denoted as {M_i},
based on various attributes (e.g., poses, illumination). Eq. 4 can then be written as:

L = \frac{1}{\|M\|} \sum_i \left( \sum_{m \in M_i} L_m \right) = \sum_i P_i \cdot \left( \frac{1}{\|M_i\|} \sum_{m \in M_i} L_m \right), \qquad (5)

where P_i = \|M_i\| / \|M\| is the sampling probability of the cluster M_i among all
the data samples. As shown in Eq. 5, the objective function of each cluster is
weighted by its probability P_i. This indicates that hard examples with lower
probability are outweighed by the majority of the training data, resulting in poor
reconstruction. On the other hand, if we modify the sampling probability of the data
samples in each cluster to be proportional to the inverse of the class probability,
P_i^{-1}, we can effectively penalize errors on hard examples by removing this bias.
In our problem setting, the goal is to define the sampling probability per target
image, P_im, and per 3D point, P_pt, or alternatively to define their inverses
directly. Note that the inverse probability must be positive and bounded. Assuming
that the prediction accuracy is correlated with the class probability, we approximate
the probability of occurrence of each image by an accuracy measurement as
P_im ∼ IoU, where the IoU is computed from the n_O points sampled for each image.
Similarly, we use the binary cross-entropy (BCE) loss to approximate the original
probability of sampling points. Based on these approximations, we model the inverses
of the probabilities as follows:

P_{im}^{-1} = \exp(-\mathrm{IoU}/\alpha_i + \beta_i), \qquad P_{pt}^{-1} = \frac{1}{\exp(-L_{BCE}/\alpha_p) + \beta_p}, \qquad (6)
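A minimal sketch of how the inverse probabilities of Eq. (6) can be turned into sampling weights is given below; α_i, β_i, α_p, β_p are the hyper-parameters of Eq. (6) (their values are not specified in this excerpt), and normalizing the weights into probabilities is our illustrative choice.

```python
import numpy as np

def ohem_sampling_weights(iou_per_image, bce_per_point,
                          alpha_i, beta_i, alpha_p, beta_p):
    """Adaptive sampling weights following Eq. (6): images with low IoU and
    points with high BCE loss receive larger (inverse-probability) weights."""
    w_image = np.exp(-np.asarray(iou_per_image) / alpha_i + beta_i)          # P_im^{-1}
    w_point = 1.0 / (np.exp(-np.asarray(bce_per_point) / alpha_p) + beta_p)  # P_pt^{-1}
    # normalize so the weights can be used directly as sampling probabilities
    return w_image / w_image.sum(), w_point / w_point.sum()

# e.g. draw the next mini-batch of images with numpy.random.choice
# (the hyper-parameter values below are placeholders, not tuned settings):
# p_img, _ = ohem_sampling_weights(iou, bce, 0.1, 0.0, 0.1, 0.01)
# batch_ids = np.random.choice(len(iou), size=24, p=p_img)
```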
4 Results
We train our networks using NVIDIA GV100 GPUs with 512 × 512 input images. During
inference, we use a Logitech C920 webcam on a desktop system equipped with 62 GB of
RAM, a 6-core Intel i7-5930K processor, and two GV100s. One GPU performs geometry and
color inference, while the other performs surface reconstruction; the two can run in
parallel in an asynchronous manner when processing multiple frames. The overall
latency of our system is 0.25 seconds on average.
We evaluate our proposed algorithms on the RenderPeople [57] and BUFF
datasets [82], and on self-captured performances. In particular, as public datasets
of 3D clothed humans in motion are highly limited, we use the BUFF dataset [82]
for quantitative comparison and evaluation and report the average error measured
by the Chamfer distance and point-to-surface (P2S) distance from the prediction
to the ground truth. We provide implementation details, including the training
dataset and real-time segmentation module, in the appendix.
In Fig. 1, we demonstrate our real-time performance capture and rendering
from a single RGB camera. Because both the reconstructed geometry and texture
inference for unseen regions are plausible, we can obtain novel-view renderings in
real-time from a wide range of poses and clothing styles. We provide additional
results with various poses, illuminations, viewing angles, and clothing in the
appendix and supplemental video.
4.1 Evaluation
Fig. 3 shows a comparison of surface reconstruction algorithms. The surface
localization based on a binarized octree [45] does not guarantee the same
reconstruction as the brute-force baseline, potentially losing some body parts.
The octree-based reconstruction with a threshold shows the trade-off between
performance and accuracy. Our method achieves the best acceleration without any
hyperparameters, retaining the original reconstruction accuracy while accelerating
surface reconstruction from 30 seconds to 0.14 seconds (7 fps). By combining it
with our mesh-free rendering technique, we require only 0.06 seconds per frame
(15 fps) for novel-view rendering at a volumetric resolution of 256³, enabling
the first real-time volumetric performance capture from a monocular video.
In Tab. 1 and Fig. 6, we evaluate the effectiveness of the proposed Online
Hard Example Mining algorithm quantitatively and qualitatively. Using the
same training setting, we train our model with and without the point-ohem and
item-ohem sampling. Fig. 6 shows the reconstruction results and error maps from
the worst 5 results in the training set. The point-ohem successfully improves
the fidelity of reconstruction by focusing on the regions with high error (see the
point-ohem weight in Fig. 6). Similarly, the item-ohem automatically places more
supervision on hard images with less frequent clothing styles or poses, which we aim
to capture as accurately as the more common ones. As a result,
the overall reconstruction quality is significantly improved, compared with the
original implementation of [61], achieving state-of-the-art accuracy (Tab. 1).
4.2 Comparison
In Tab. 1 and Fig. 7, we compare our method with the state-of-the-art 3D human
reconstruction algorithms from RGB input. Note that we train PIFu [61] using the
same training data with the other settings identical to [61] for a fair comparison,
while we use the public pretrained models for VIBE [31] and DeepHuman [86]
due to the custom datasets required by each method and their dependency on
external modules such as the SMPL [40] model. Although a template-based
regression approach [31] achieves robust 3D human estimations from images in
the wild, the lack of fidelity and details severely impairs the authenticity of the
performances. Similarly, a volumetric performance capture based on voxels [86]
suffers from a lack of fidelity due to the limited resolution. While an implicit shape
representation [61] achieves high-resolution reconstruction, the reconstructions
become less plausible for infrequent poses and the inference speed (30 seconds) is
too slow for real-time applications, both of which we address in this paper. We
also qualitatively compare our reconstruction with the state-of-the-art real-time
performance capture using a pre-captured template [18] (Fig. 8). While the
reconstructed geometries are comparable, our method can render performances
with dynamic textures that reflect lively expressions, unlike a tracking method
using a fixed template. Our approach is also agnostic to topology changes, and
can thus handle very challenging scenarios such as changing clothing (Fig. 1).
5 Conclusion
We have demonstrated that volumetric reconstruction and rendering of humans from a
single input image can be achieved at near real-time speed without sacrificing the
final image quality. Our novel progressive surface localization
method allows us to vastly reduce the number of points queried during surface
reconstruction, giving us a speedup of two orders of magnitude without reducing
the final surface quality. Furthermore, we demonstrate that directly rendering
novel viewpoints of the captured subject is possible without explicitly extracting
a mesh or performing naive, computationally intensive volumetric rendering,
allowing us to obtain real-time rendering performance with the reconstructed
surface. Finally, our Online Hard Example Mining technique allows us to find and
learn the appropriate response to challenging input examples, thereby making it
feasible to train our networks with a tractable amount of data while attaining
high-quality results with large appearance and motion variations.
While we demonstrate our approach on human subjects and performances,
our acceleration techniques are straightforward to implement and generalize to
any object or topology. We thus believe this will be a critical building block to
virtually teleport anything captured by a commodity camera anywhere.
6 Acknowledgement
This research was funded in part by the ONR YIP grant N00014-17-S-FO14, the
CONIX Research Center, a Semiconductor Research Corporation (SRC) program
sponsored by DARPA, the Andrew and Erna Viterbi Early Career Chair, the U.S.
Army Research Laboratory (ARL) under contract number W911NF-14-D-0005,
Adobe, and Sony.
References
1. Alp Güler, R., Neverova, N., Kokkinos, I.: Densepose: Dense human pose estimation
in the wild. In: Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition. pp. 7297–7306 (2018)
2. Anguelov, D., Srinivasan, P., Koller, D., Thrun, S., Rodgers, J., Davis, J.: SCAPE:
shape completion and animation of people. ACM Transactions on Graphics 24(3),
408–416 (2005)
3. Beeler, T., Hahn, F., Bradley, D., Bickel, B., Beardsley, P., Gotsman, C., Sumner,
R.W., Gross, M.: High-quality passive facial performance capture using anchor
frames. ACM Transactions on Graphics (TOG) 30(4), 75 (2011)
4. Bogo, F., Kanazawa, A., Lassner, C., Gehler, P., Romero, J., Black, M.J.: Keep it
SMPL: Automatic estimation of 3D human pose and shape from a single image. In:
European Conference on Computer Vision. pp. 561–578 (2016)
5. Cao, C., Weng, Y., Zhou, S., Tong, Y., Zhou, K.: Facewarehouse: A 3d facial
expression database for visual computing. IEEE Transactions on Visualization and
Computer Graphics 20(3), 413–425 (2013)
6. Chen, Z., Zhang, H.: Learning implicit fields for generative shape modeling. In:
IEEE Conference on Computer Vision and Pattern Recognition. pp. 5939–5948
(2019)
7. Collet, A., Chuang, M., Sweeney, P., Gillett, D., Evseev, D., Calabrese, D., Hoppe,
H., Kirk, A., Sullivan, S.: High-quality streamable free-viewpoint video. ACM
Transactions on Graphics 34(4), 69 (2015)
8. De Aguiar, E., Stoll, C., Theobalt, C., Ahmed, N., Seidel, H.P., Thrun, S.:
Performance capture from sparse multi-view video. ACM Transactions on Graphics
27(3), 98 (2008)
9. De Vries, H., Strub, F., Mary, J., Larochelle, H., Pietquin, O., Courville, A.C.:
Modulating early visual processing by language. In: Advances in Neural Information
Processing Systems. pp. 6594–6604 (2017)
10. Dou, M., Khamis, S., Degtyarev, Y., Davidson, P., Fanello, S.R., Kowdle, A.,
Escolano, S.O., Rhemann, C., Kim, D., Taylor, J., et al.: Fusion4d: Real-time
performance capture of challenging scenes. ACM Transactions on Graphics 35(4),
114 (2016)
11. Dumoulin, V., Belghazi, I., Poole, B., Mastropietro, O., Lamb, A., Arjovsky, M.,
Courville, A.: Adversarially learned inference. arXiv preprint arXiv:1606.00704
(2016)
12. Fu, H., Gong, M., Wang, C., Batmanghelich, K., Tao, D.: Deep ordinal regression
network for monocular depth estimation. In: Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition. pp. 2002–2011 (2018)
13. Furukawa, Y., Ponce, J.: Accurate, dense, and robust multiview stereopsis. IEEE
Transactions on Pattern Analysis and Machine Intelligence 32(8), 1362–1376 (2010)
14. Gong, K., Liang, X., Zhang, D., Shen, X., Lin, L.: Look into person: Self-supervised
structure-sensitive learning and a new benchmark for human parsing. In: Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 932–940
(2017)
15. Guan, P., Weiss, A., Balan, A.O., Black, M.J.: Estimating human shape and pose
from a single image. In: IEEE International Conference on Computer Vision. pp.
1381–1388 (2009)
16. Guo, K., Lincoln, P., Davidson, P., Busch, J., Yu, X., Whalen, M., Harvey, G.,
Orts-Escolano, S., Pandey, R., Dourgarian, J., et al.: The relightables: Volumetric
performance capture of humans with realistic relighting. ACM Trans. Graph. 38(6)
34. Lassner, C., Romero, J., Kiefel, M., Bogo, F., Black, M.J., Gehler, P.V.: Unite
the people: Closing the loop between 3d and 2d human representations. In: IEEE
Conference on Computer Vision and Pattern Recognition. pp. 6050–6059 (2017)
35. Lazova, V., Insafutdinov, E., Pons-Moll, G.: 360-degree textures of people in clothing
from a single image. In: International Conference on 3D Vision (3DV) (sep 2019)
36. Li, T., Bolkart, T., Black, M.J., Li, H., Romero, J.: Learning a model of facial
shape and expression from 4d scans. ACM Transactions on Graphics (TOG) 36(6),
194 (2017)
37. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P.,
Zitnick, C.L.: Microsoft coco: Common objects in context. In: ECCV. pp. 740–755
(2014)
38. Liu, S., Zhang, Y., Peng, S., Shi, B., Pollefeys, M., Cui, Z.: Dist: Rendering deep
implicit signed distance function with differentiable sphere tracing. arXiv preprint
arXiv:1911.13225 (2019)
39. Liu, Y., Stoll, C., Gall, J., Seidel, H.P., Theobalt, C.: Markerless motion capture of
interacting characters using multi-view image segmentation. In: IEEE Conference
on Computer Vision and Pattern Recognition. pp. 1249–1256 (2011)
40. Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J.: SMPL: A skinned
multi-person linear model. ACM Transactions on Graphics 34(6), 248 (2015)
41. Lorensen, W.E., Cline, H.E.: Marching cubes: A high resolution 3d surface
construction algorithm. ACM siggraph computer graphics 21(4), 163–169 (1987)
42. Loshchilov, I., Hutter, F.: Online batch selection for faster training of neural
networks. arXiv preprint arXiv:1511.06343 (2015)
43. Matusik, W., Buehler, C., Raskar, R., Gortler, S.J., McMillan, L.: Image-based
visual hulls. In: ACM SIGGRAPH. pp. 369–374 (2000)
44. Mehta, D., Sridhar, S., Sotnychenko, O., Rhodin, H., Shafiei, M., Seidel, H.P., Xu,
W., Casas, D., Theobalt, C.: VNect: Real-time 3D Human Pose Estimation with a
Single RGB Camera. ACM Transactions on Graphics 36(4), 44:1–44:14 (2017)
45. Mescheder, L., Oechsle, M., Niemeyer, M., Nowozin, S., Geiger, A.: Occupancy net-
works: Learning 3d reconstruction in function space. arXiv preprint arXiv:1812.03828
(2018)
46. Mixamo: (2018), https://www.mixamo.com/
47. Natsume, R., Saito, S., Huang, Z., Chen, W., Ma, C., Li, H., Morishima, S.: Siclope:
Silhouette-based clothed people. In: CVPR. pp. 4480–4490 (2019)
48. Newcombe, R.A., Fox, D., Seitz, S.M.: DynamicFusion: Reconstruction and tracking
of non-rigid scenes in real-time. In: IEEE Conference on Computer Vision and
Pattern Recognition. pp. 343–352 (2015)
49. Newcombe, R.A., Izadi, S., Hilliges, O., Molyneaux, D., Kim, D., Davison, A.J.,
Kohi, P., Shotton, J., Hodges, S., Fitzgibbon, A.: Kinectfusion: Real-time dense
surface mapping and tracking. In: Mixed and augmented reality (ISMAR), 2011
10th IEEE international symposium on. pp. 127–136 (2011)
50. Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose
estimation. In: European Conference on Computer Vision. pp. 483–499 (2016)
51. Niemeyer, M., Mescheder, L., Oechsle, M., Geiger, A.: Differentiable volumetric
rendering: Learning implicit 3d representations without 3d supervision. arXiv
preprint arXiv:1912.07372 (2019)
52. Oechsle, M., Mescheder, L., Niemeyer, M., Strauss, T., Geiger, A.: Texture fields:
Learning texture representations in function space. In: The IEEE International
Conference on Computer Vision (ICCV) (October 2019)
53. Orts-Escolano, S., Rhemann, C., Fanello, S., Chang, W., Kowdle, A., Degtyarev,
Y., Kim, D., Davidson, P.L., Khamis, S., Dou, M., et al.: Holoportation: Virtual 3d
72. Vlasic, D., Baran, I., Matusik, W., Popović, J.: Articulated mesh animation from
multi-view silhouettes. ACM Transactions on Graphics 27(3), 97 (2008)
73. Vlasic, D., Peers, P., Baran, I., Debevec, P., Popović, J., Rusinkiewicz, S., Matusik,
W.: Dynamic shape capture using multi-view photometric stereo. ACM Transactions
on Graphics 28(5), 174 (2009)
74. Waschbüsch, M., Würmlin, S., Cotting, D., Sadlo, F., Gross, M.: Scalable 3D video
of dynamic scenes. The Visual Computer 21(8), 629–638 (2005)
75. Wu, C., Stoll, C., Valgaerts, L., Theobalt, C.: On-set performance capture of
multiple actors with a stereo camera. ACM Transactions on Graphics 32(6), 161
(2013)
76. Xiang, D., Joo, H., Sheikh, Y.: Monocular total capture: Posing face, body, and
hands in the wild. In: Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition. pp. 10965–10974 (2019)
77. Xu, W., Chatterjee, A., Zollhöfer, M., Rhodin, H., Mehta, D., Seidel, H.P., Theobalt,
C.: Monoperfcap: Human performance capture from monocular video. ACM
Transactions on Graphics 37(2), 27:1–27:15 (2018)
78. Yamaguchi, S., Saito, S., Nagano, K., Zhao, Y., Chen, W., Olszewski, K.,
Morishima, S., Li, H.: High-fidelity facial reflectance and geometry inference from
an unconstrained image. ACM Transactions on Graphics 37(4), 162 (2018)
79. Ye, G., Liu, Y., Hasler, N., Ji, X., Dai, Q., Theobalt, C.: Performance capture of
interacting characters with handheld kinects. European Conference on Computer
Vision pp. 828–841 (2012)
80. Yu, T., Zheng, Z., Guo, K., Zhao, J., Dai, Q., Li, H., Pons-Moll, G., Liu, Y.:
Doublefusion: Real-time capture of human performances with inner body shapes
from a single depth sensor. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition. pp. 7287–7296 (2018)
81. Zeiler, M.D.: Adadelta: an adaptive learning rate method. arXiv preprint
arXiv:1212.5701 (2012)
82. Zhang, C., Pujades, S., Black, M.J., Pons-Moll, G.: Detailed, accurate, human
shape estimation from clothed 3d scan sequences. In: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition. pp. 4191–4200 (2017)
83. Zhang, P., Siu, K., Zhang, J., Liu, C.K., Chai, J.: Leveraging depth cameras and
wearable pressure sensors for full-body kinematics and dynamics capture. ACM
Transactions on Graphics (TOG) 33(6), 221 (2014)
84. Zhang, S.H., Li, R., Dong, X., Rosin, P., Cai, Z., Han, X., Yang, D., Huang, H.,
Hu, S.M.: Pose2seg: Detection free human instance segmentation. In: Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 889–898
(2019)
85. Zheng, Z., Yu, T., Li, H., Guo, K., Dai, Q., Fang, L., Liu, Y.: Hybridfusion: real-time
performance capture using a single depth sensor and sparse imus. In: Proceedings
of the European Conference on Computer Vision (ECCV). pp. 384–400 (2018)
86. Zheng, Z., Yu, T., Wei, Y., Dai, Q., Liu, Y.: Deephuman: 3d human reconstruction
from a single image. In: The IEEE International Conference on Computer Vision
(ICCV) (October 2019)
87. Zhou, K., Gong, M., Huang, X., Guo, B.: Data-parallel octrees for surface
reconstruction. IEEE transactions on visualization and computer graphics 17(5),
669–681 (2010)
88. Zollhöfer, M., Nießner, M., Izadi, S., Rehmann, C., Zach, C., Fisher, M., Wu, C.,
Fitzgibbon, A., Loop, C., Theobalt, C., et al.: Real-time non-rigid reconstruction
using an rgb-d camera. ACM Transactions on Graphics 33(4), 156 (2014)
A1 Implementation Details
A1.1 Datasets
[Shape inference network diagram: the 512 × 512 × 3 input frame is encoded by the
HRNet shape image encoder into a 128 × 128 × 256 feature map; the sampled
pixel-aligned features condition the MLP through conditional batch normalization
(CBN) parameters γ_{i,1}, γ_{i,2}, β_{i,1}, β_{i,2} for i = 1, ..., 5, and the query
depth Z is encoded as a soft one-hot vector.]
In this encoding, P′_z = 0.5 · (P_z + 1.0) and N = 64 in our experiments. We term
this multi-channel depth representation soft one-hot depth (SoftZ). Figure 10 and
Table A1.3 demonstrate the faster convergence and more accurate reconstruction
achieved with the proposed depth representation.
[Fig. 10: MSE loss over training epochs (0–5), comparing the 1-D depth baseline with
the proposed soft one-hot depth representation.]
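The exact SoftZ formulation is given by the equation referenced above, which is not reproduced in this excerpt; the sketch below shows one plausible realization that distributes the normalized depth P′_z over the two nearest of N = 64 uniformly spaced bins by linear interpolation.

```python
import torch

def soft_one_hot_depth(p_z: torch.Tensor, n_bins: int = 64) -> torch.Tensor:
    """Possible SoftZ encoding: spread the normalized depth over the two
    nearest of `n_bins` uniformly spaced bins by linear interpolation.
    p_z: (...,) depth values in [-1, 1]; returns (..., n_bins)."""
    p = 0.5 * (p_z + 1.0)                                  # P'_z in [0, 1]
    pos = p.clamp(0.0, 1.0) * (n_bins - 1)                 # fractional bin index
    lower = pos.floor().long().clamp(max=n_bins - 2)
    frac = (pos - lower.to(pos.dtype)).unsqueeze(-1)
    soft = torch.zeros(*p_z.shape, n_bins, dtype=p_z.dtype, device=p_z.device)
    soft.scatter_(-1, lower.unsqueeze(-1), 1.0 - frac)     # weight on the lower bin
    soft.scatter_(-1, (lower + 1).unsqueeze(-1), frac)     # weight on the upper bin
    return soft
```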
Training procedure. We use RMSProp [70] and Adam [30] for surface reconstruction and
texture inference, respectively, with a learning rate of 1e-3. Since the batch
normalization layers in HRNet and CBN benefit from large batch sizes, we use a batch
size of 24 for both surface reconstruction and texture inference. The number of
sampled points per image is 4096 in every training batch. We first train the surface
reconstruction network for 5 epochs with a constant learning rate, then fix it and
train only the texture inference network for 5 more epochs. Training the surface
reconstruction and texture inference networks takes 3 days each on a single NVIDIA
GV100 GPU.
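The setup above can be summarized in a few lines of PyTorch; the two modules below are mere placeholders for the actual surface-reconstruction and texture-inference networks.

```python
import torch
import torch.nn as nn

# Placeholders for the surface-reconstruction and texture-inference networks.
shape_net = nn.Linear(1, 1)    # stands in for the HRNet + CBN shape network
texture_net = nn.Linear(1, 1)  # stands in for the texture network

shape_opt = torch.optim.RMSprop(shape_net.parameters(), lr=1e-3)
texture_opt = torch.optim.Adam(texture_net.parameters(), lr=1e-3)

BATCH_SIZE = 24          # images per batch (large batches help the BN/CBN layers)
POINTS_PER_IMAGE = 4096  # 3D points sampled per image in each training batch
EPOCHS_SHAPE = 5         # train the shape network first with a constant lr ...
EPOCHS_TEXTURE = 5       # ... then freeze it and train the texture network
```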
A2 Additional Results
We evaluate the robustness of our algorithm under different lighting conditions,
viewpoints, and clothing topologies in Figure 11. We also provide additional
qualitative results from a video sequence (see Figure 16) and from internet photos
(see Figure 17). Further video reconstruction results can be found in the
supplemental video.
A2.1 Limitations
As our training data contains only a single person per image, the presence of
multiple people confuses the network (see Figure 12). Modeling multiple subjects
[27,39] is essential to understanding social interaction for a truly believable virtual
4 https://www.remove.bg/
Fig. 11. We qualitatively evaluate the robustness of our approach by demonstrating the
consistency of reconstruction with different lighting conditions, viewpoints and surface
topology.
Fig. 13. Sampled BUFF benchmark. We apply K-Medoids clustering to each sequence of
the BUFF dataset to construct the test set. Sufficient pose variations in the BUFF
dataset are covered with K = 10.