Accelerating 3D Deep Learning with PyTorch3D

Nikhila Ravi Jeremy Reizenstein David Novotny Taylor Gordon


Wan-Yen Lo Justin Johnson∗ Georgia Gkioxari∗
Facebook AI Research
arXiv:2007.08501v1 [cs.CV] 16 Jul 2020

Abstract

Deep learning has significantly improved 2D image recognition. Extending into
3D may advance many new applications including autonomous vehicles, virtual
and augmented reality, authoring 3D content, and even improving 2D recognition.
However, despite growing interest, 3D deep learning remains relatively underexplored.
We believe that some of this disparity is due to the engineering challenges
involved in 3D deep learning, such as efficiently processing heterogeneous data and
reframing graphics operations to be differentiable. We address these challenges by
introducing PyTorch3D, a library of modular, efficient, and differentiable operators
for 3D deep learning. It includes a fast, modular differentiable renderer for meshes
and point clouds, enabling analysis-by-synthesis approaches. Compared with other
differentiable renderers, PyTorch3D is more modular and efficient, allowing users
to more easily extend it while also gracefully scaling to large meshes and images.
We compare the PyTorch3D operators and renderer with other implementations and
demonstrate significant speed and memory improvements. We also use PyTorch3D
to improve the state-of-the-art for unsupervised 3D mesh and point cloud prediction
from 2D images on ShapeNet. PyTorch3D is open-source and we hope it will help
accelerate research in 3D deep learning.

1 Introduction
Over the past decade, deep learning has significantly advanced the ability of AI systems to process
2D image data. We can now build high-performing systems for tasks such as object [49, 27, 52, 19]
and scene [64, 65] classification, object detection [48], semantic [32] and instance [20] segmentation,
and human pose estimation [2]. These systems can operate on complex image data and have been
deployed in countless real-world settings. Though successful, these methods suffer from a common
shortcoming: they process 2D snapshots and ignore the true 3D nature of the world.
Extending deep learning into 3D can unlock many new applications. Recognizing objects in 3D point
clouds [46, 47] can enhance the sensing abilities of autonomous vehicles, or enable new augmented
reality experiences. Predicting depth [10, 9], or 3D shape [8, 11, 59, 41] can lift 2D images into 3D.
Generative models [61, 62, 39] might one day aid artists in authoring 3D content. Image-based tasks
like view synthesis can be improved with 3D representations given only 2D supervision [53, 60, 38].
Despite growing interest, 3D deep learning remains relatively underexplored.
We believe that some of this disparity is due to the significant engineering challenges involved in
3D deep learning. One such challenge is heterogeneous data. 2D images are almost universally
represented by regular pixel grids. In contrast, 3D data are stored in a variety of structured formats
including voxel grids [8, 55], point clouds [46, 11], and meshes [59, 14] which can exhibit per-
element heterogeneity. For example, meshes may differ in their number of vertices and faces, and
their topology. Such heterogeneity makes it difficult to efficiently implement batched operations on 3D
data using the tensor-centric primitives provided by standard deep learning toolkits like PyTorch [43]
and TensorFlow [1]. A second key challenge is differentiability. The computer graphics community
has developed many methods for efficiently processing 3D data. However, to be embedded in deep
learning pipelines, each operation must be revisited to also efficiently compute gradients. Some
operations, such as camera transformations, admit gradients trivially via automatic differentiation.
Others, such as mesh rendering, must be reformulated via differentiable relaxation [34, 25, 31, 7].

∗ Equal contribution.
Preprint. Under review.

We address these challenges by introducing PyTorch3D, a library of modular, efficient, and well-tested
operators for 3D deep learning built on PyTorch [43]. All operators are fast and differentiable, and
many are implemented with custom CUDA kernels to improve efficiency and minimize memory
usage. We provide reusable data structures for managing batches of point clouds and meshes, allowing
all PyTorch3D operators to support batches of heterogeneous data.
One key feature of PyTorch3D is a modular and efficient differentiable rendering engine for meshes
and point clouds. Differentiable rendering projects 3D data to 2D images, enabling analysis-by-
synthesis [16] and inverse rendering [36, 44] approaches where 3D predictions can be made using
only image-level supervision [34]. Compared to other recent differentiable renderers [25, 31], ours
is both more modular and more scalable. We achieve modularity by decomposing the rendering
pipeline into stages (rasterization, lighting, shading, blending) which can easily be replaced with new
user-defined components, allowing users to adapt our renderer to their needs. We improve efficiency
via two-stage rasterization, and by limiting the number of primitives that influence each pixel.
We compare the optimized PyTorch3D operators with naïve PyTorch implementations and with
those provided in other open-source packages, demonstrating improvements in speed and memory
usage by up to 10×. We also showcase the flexibility of PyTorch3D by experimenting on the task of
unsupervised shape prediction on the ShapeNet [6] dataset. With our modular differentiable renderer
and efficient 3D operators we improve the state-of-the-art for unsupervised 3D mesh and point cloud
prediction from 2D images while maintaining high computational throughput.
PyTorch3D is open-source² and will evolve over time. We hope that it will be a valuable tool to the
community and help accelerate research in 3D deep learning.

2 Related Work
3D deep learning libraries. There are a number of toolkits for 3D deep learning: [12] focuses on
learning on graphs, [58] provides differentiable graphics operators, and [22] collects commonly used 3D
functions. However, they do not provide support for heterogeneous batching of 3D data, crucial for
large-scale learning, or modularity for differentiable rendering, crucial for exploration. PyTorch3D
introduces data structures that support batches of 3D data with varying sizes and topologies. This key
abstraction allows our 3D operators, including rendering, to operate on large heterogeneous batches.
Differentiable renderers. OpenDR [34] and NMR [25] perform traditional rasterization in the
forward pass and compute approximate gradients in the backward pass. More recently, SoftRas [31]
and DIB-R [7] propose differentiable renderers by viewing rasterization as a probabilistic process
where each pixel’s color depends on multiple mesh faces. Differentiable ray tracing methods, such as
Redner [30] and Mitsuba2 [40], give more photorealistic images at the expense of increased compute.
Differentiable point cloud rendering is explored in [21] which uses ray termination probabilities
and stores points in a voxel grid which limits resolution. DSS [63] renders each point as a disk.
SynSin [60] also splats a per-point sphere from a soft z-buffer of fixed length. Most recently,
Pulsar [29] uses an unlimited z-buffer for rendering but uses the first few points for gradient propagation.
Differentiable rendering is an active research area. PyTorch3D introduces a modular renderer, inspired
by [31], by redesigning and exposing intermediates computed during rasterization. Unlike other
differentiable renderers, users can easily customize the rendering pipeline with PyTorch shaders.
3D shape prediction. In Section 4 we experiment with unsupervised 3D shape prediction using the
differentiable silhouette and textured renderers for meshes and point clouds in PyTorch3D. There is a
vast line of work on 3D shape prediction, including two-view and multiview methods [51, 18], model-
based approaches [13, 5, 4, 35, 3, 33], and recent supervised deep methods that predict voxels [8, 55],
meshes [17, 59, 54, 14], point clouds [11] and implicit functions [37, 42, 50]. Differentiable renderers
allow for unsupervised shape prediction via 2D re-projection losses [56, 25, 24, 57, 31].

² https://pytorch3d.org/

[Figure 1 panels: (a) Chamfer, (b) Graph Conv, (c) KNN (varying |P|), (d) KNN (varying K)]
Figure 1: Benchmarks for our 3D operators with batch size 32. (a) L_cham(P, Q) for point clouds
with |P| = 1000 and heterogeneous and varying |Q|. (b) Graph convolution on heterogeneous mesh
batches with 128-dimensional features. (c,d) Our KNN vs Faiss [23] between homogeneous batches
of 3D point clouds P and Q with |P| = |Q|. In (c) K = 1, and in (d) |P| = |Q| = 50k; in both our
memory usage matches [23]. (a) and (b) are forward and backward; (c) and (d) are forward only.

3 PyTorch3D: Functionality and Performance

This section describes the core features of PyTorch3D. For a 3D deep learning library to be effective,
3D operators need to be efficient when handling complex 3D data. We benchmark the speed and
memory usage of key PyTorch3D operators, comparing to pure PyTorch and existing open-source
implementations. We show that PyTorch3D achieves speedups up to 10×.
3D data structures. Working with minibatches of data is crucial in deep learning both for stable
optimization and computational efficiency. However, operating on batches of 3D meshes and point
clouds is challenging due to heterogeneity: meshes may have varying numbers of vertices and faces,
and point clouds may have varying numbers of points. To overcome this challenge, PyTorch3D
provides data structures to manage batches of meshes and point clouds which allow conversion
between different tensor-based representations (list, packed, padded) needed for various operations.
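As a rough illustration of the three representations, the sketch below converts a heterogeneous batch of point clouds from a list into packed and padded forms using plain Python lists. The helper names `to_packed` and `to_padded` are hypothetical and are not the PyTorch3D API; the library's data structures hold tensors rather than lists.

```python
# Hypothetical helpers for illustration only; not the PyTorch3D API.

def to_packed(clouds):
    """Concatenate all points into one flat list and record where each cloud starts."""
    packed, first_idx = [], []
    for cloud in clouds:
        first_idx.append(len(packed))
        packed.extend(cloud)
    return packed, first_idx


def to_padded(clouds, pad=(0.0, 0.0, 0.0)):
    """Pad every cloud to the longest length and keep the true lengths alongside."""
    lengths = [len(cloud) for cloud in clouds]
    max_len = max(lengths)
    padded = [list(cloud) + [pad] * (max_len - len(cloud)) for cloud in clouds]
    return padded, lengths


# "List" representation: one entry per cloud, each with its own number of points.
clouds = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 1.0, 0.0), (0.0, 2.0, 0.0), (0.0, 3.0, 0.0)],
]
packed, first_idx = to_packed(clouds)
padded, lengths = to_padded(clouds)
```

The packed form wastes no memory but needs the start indices to recover each element, while the padded form is rectangular and needs the true lengths so that padding can be ignored.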
Implementation Details. We benchmark with meshes from ShapeNetCoreV1 [6] using homoge-
neous and heterogeneous batches. We form point clouds by sampling uniformly from mesh surfaces.
All results are averaged over 5 random batches and 10 runs per batch, and are run on a V100 GPU.

3.1 3D operators

We report time and memory usage for a representative set of popular 3D operators, namely Chamfer
loss, graph convolution and K nearest neighbors. Other 3D operators in PyTorch3D follow similar
trends. We compare to PyTorch and state-of-the-art open-source libraries.
Chamfer loss is a common metric that quantifies agreement between point clouds P and Q. Formally,
$$
L_{\mathrm{cham}}(P, Q) = |P|^{-1} \sum_{(p,q) \in \Lambda_{P,Q}} \|p - q\|^2 + |Q|^{-1} \sum_{(q,p) \in \Lambda_{Q,P}} \|q - p\|^2 \tag{1}
$$

where $\Lambda_{P,Q} = \{(p, \arg\min_{q} \|p - q\|) : p \in P\}$ is the set of pairs (p, q) such that q ∈ Q is the nearest
neighbor of p ∈ P . A (homogeneously) batched implementation is straightforward in PyTorch, but is
inefficient since it requires forming a pairwise distance matrix with B × |P | × |Q| elements (where
B is the batch size). PyTorch3D avoids this inefficiency (and supports heterogeneity) by using our
efficient KNN to compute neighbors. Figure 1a compares ours against the naïve approach with
B = 32, |P | = 1000, and varying |Q|. The naïve approach runs out of memory for |Q| > 10k, while
ours scales to large point clouds and reduces time and memory use by more than 12×.
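Equation 1 can be checked on small inputs with a brute-force reference. The sketch below is the naïve quadratic-cost formulation in plain Python, not the KNN-based PyTorch3D implementation; the function names are illustrative.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def chamfer(P, Q):
    """Brute-force Chamfer loss of Equation 1 between two point clouds."""
    # Mean squared distance from each p in P to its nearest neighbor in Q.
    p_to_q = sum(min(sq_dist(p, q) for q in Q) for p in P) / len(P)
    # Mean squared distance from each q in Q to its nearest neighbor in P.
    q_to_p = sum(min(sq_dist(q, p) for p in P) for q in Q) / len(Q)
    return p_to_q + q_to_p


# The two clouds may have different sizes.
P = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
Q = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (3.0, 0.0, 0.0)]
loss = chamfer(P, Q)
```

For these inputs the first term is (0 + 1)/2 and the second is (0 + 1 + 4)/3, so the loss is 1/2 + 5/3. Every point is compared against every point of the other cloud, which is the |P| × |Q| cost that a nearest-neighbor search avoids.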
Graph convolution [26] is commonly used for processing 3D meshes [59, 14]. Given feature vectors
$f_v$ for each vertex v, it computes new features $f'_v = W_0 f_v + \sum_{u \in N(v)} W_1 f_u$, where N(v) are
the neighbors of v in the mesh and $W_0$, $W_1$ are learned weight matrices. PyTorch3D implements
graph convolution via a fused CUDA kernel for gather+scatter_add. Figure 1b shows that this
improves speed and memory use by up to 30% compared against a pure PyTorch implementation.
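The gather and scatter-add pattern behind the graph convolution update can be sketched in plain Python as follows. This is an unfused reference under the assumption that the mesh is given as a list of undirected edges; it does not reflect the CUDA kernel, and the names are illustrative.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def graph_conv(feats, edges, W0, W1):
    """Compute f'_v = W0 f_v + sum over neighbors u of W1 f_u."""
    # Transform every vertex feature once with W1; these are the neighbor messages.
    msgs = [matvec(W1, f) for f in feats]
    out = [matvec(W0, f) for f in feats]
    # Gather + scatter-add: each undirected edge (u, v) adds u's message to v and v's to u.
    for u, v in edges:
        out[v] = [a + b for a, b in zip(out[v], msgs[u])]
        out[u] = [a + b for a, b in zip(out[u], msgs[v])]
    return out


# A path graph 0 - 1 - 2 with 2-dimensional features.
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
edges = [(0, 1), (1, 2)]
W0 = [[1.0, 0.0], [0.0, 1.0]]
W1 = [[0.5, 0.0], [0.0, 0.5]]
out = graph_conv(feats, edges, W0, W1)
```

Because the update only walks the edge list, meshes with different numbers of vertices and faces can share one packed vertex array as long as the edge indices are offset accordingly.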

(a) PyTorch3D rendering pipeline (b) Problems with differentiability in rendering
Figure 2: (a) The modular rendering pipeline in PyTorch3D and (b) The z- & xy-discontinuities in
traditional rasterization and the soft formulations [31] which enable differentiability.

K Nearest Neighbors for D-dimensional points are used in Chamfer loss, normal estimation, and
other point cloud operations. We implement exact KNN with custom CUDA kernels that natively
handle heterogeneous batches. Our implementation is tuned for D ≤ 4 and K ≤ 32, and uses
template metaprogramming to individually optimize each (D, K) pair. We compare against Faiss [23],
a fast GPU library for KNN that targets a different portion of the design space: it does not handle
batching, is optimized for high-dimensional descriptors (D ≈ 128), and scales to billions of points.
Figures 1c and 1d show that we outperform Faiss by up to 5× for batched 3D problems.
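For reference, exact KNN over a heterogeneous batch can be written as a brute-force search in a few lines of Python. This sketch only pins down the expected output; it is not the templated CUDA implementation described above, and the function names are hypothetical.

```python
import heapq


def knn(P, Q, k):
    """For each p in P, return the indices of its k nearest points in Q, nearest first."""
    result = []
    for p in P:
        # Pair each squared distance with the index of the point in Q.
        dists = [(sum((a - b) ** 2 for a, b in zip(p, q)), j) for j, q in enumerate(Q)]
        result.append([j for _, j in heapq.nsmallest(k, dists)])
    return result


def batched_knn(batch_P, batch_Q, k):
    """Apply knn per batch element; elements may contain different numbers of points."""
    return [knn(P, Q, min(k, len(Q))) for P, Q in zip(batch_P, batch_Q)]


batch_P = [[(0.0, 0.0)], [(0.0, 0.0), (5.0, 5.0)]]
batch_Q = [[(1.0, 0.0), (0.0, 0.5), (3.0, 3.0)], [(4.0, 4.0), (0.0, 1.0)]]
neighbors = batched_knn(batch_P, batch_Q, k=2)
```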

3.2 Differentiable mesh renderer

A renderer inputs scene information (camera, geometry, materials, lights, textures) and outputs an
image. A differentiable renderer can also propagate gradients backward from rendered images to
scene information [34], allowing rendering to be embedded into deep learning pipelines [25, 31].
PyTorch3D includes a differentiable renderer that operates on heterogeneous batches of triangle
meshes. Our renderer follows three core design principles: differentiability, meaning that it computes
gradients with respect to all inputs; efficiency, meaning that it runs quickly and scales to large meshes
and images; and modularity, meaning that users can easily replace components of the renderer to
customize its functionality to their use case and experiment with alternate formulations.
As shown in Figure 2a, our renderer has two main components: the rasterizer selects the faces affect-
ing each pixel, and the shader computes pixel colors. Through careful design of these components,
we improve efficiency and modularity compared to prior differentiable renderers [25, 31, 7].
Rasterizer. The rasterizer first uses a camera to transform meshes from world to view coordinates.
Cameras are Python objects and compute gradients via autograd; this aids modularity, as users can
easily implement new camera models other than our provided orthographic and perspective cameras.
Next, the core rasterization algorithm finds triangles that intersect each pixel. In traditional rasteriza-
tion, each pixel is influenced only by its nearest face along the z-axis. As shown in Figure 2b, this
can cause step changes in pixel color as faces move along the z-axis (due to occlusion) and in the
xy-plane (due to face boundaries). Following [31] we soften these nondifferentiabilities by blending
the influence of multiple faces for each pixel, and decaying a face’s influence toward its boundary.
Our rasterizer departs from [31] in three ways to improve efficiency and modularity. First, in [31],
pixels are influenced by every face they intersect in the xy-plane; in contrast we constrain pixels to
be influenced by only the nearest K faces along the z-axis, computed using per-pixel priority queues.
Similar to traditional z-buffering, this lets us quickly discard many faces for each pixel, improving
efficiency. We show in Section 4 that this modification does not harm downstream task performance.
Second, [31] naïvely compares each pixel with each face. We improve efficiency using a two-pass
approach similar to [28], first working on image tiles to eliminate faces before moving to pixels.
Third, [31] fuses rasterization and shading into a monolithic CUDA kernel. We decouple these, and
as shown in Figure 2a our rasterizer returns Fragment data about the K nearest faces to each pixel:
face ID, barycentric coordinates of the pixel in the face, and (signed) pixel-to-face distances along
the z-axis and in the xy-plane. This allows shaders to be implemented separately from the rasterizer,
significantly improving modularity. This change also improves efficiency, as cached Fragment data
can be used to avoid costly recomputation of face-pixel intersections in the backward pass.

[Figure 3 panels: (a) Silhouette homogeneous, (b) Silhouette heterogeneous, (c) Texture heterogeneous]
Figure 3: Benchmarks for silhouette and textured rendering for PyTorch3D and SoftRas [31]. We use
a batch size of 8, two image sizes (64 & 256) and two values for the number of faces per pixel K (10
& 50, for PyTorch3D only). All benchmarks are for forward and backward.

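The idea of keeping only the K nearest faces per pixel with a priority queue can be sketched for a single pixel as follows. This is a minimal Python illustration under the assumption that the faces overlapping the pixel and their depths are already known; the function name and input layout are hypothetical and do not describe the CUDA rasterizer.

```python
import heapq


def k_nearest_faces(candidates, k):
    """Keep the k faces with smallest depth from (z_depth, face_id) pairs for one pixel."""
    heap = []  # max-heap on depth via negated z; never holds more than k entries
    for z, face_id in candidates:
        if len(heap) < k:
            heapq.heappush(heap, (-z, face_id))
        elif z < -heap[0][0]:
            # The new face is nearer than the farthest face kept so far: swap it in.
            heapq.heapreplace(heap, (-z, face_id))
    # Sort by negated depth, descending, so the nearest face comes first.
    return [face_id for _, face_id in sorted(heap, reverse=True)]


candidates = [(5.0, 0), (1.0, 1), (3.0, 2), (0.5, 3), (4.0, 4)]
nearest = k_nearest_faces(candidates, k=3)
```

The queue never grows beyond K entries, so per-pixel storage is bounded regardless of how many faces overlap the pixel.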
Shaders consume the Fragment data produced by the rasterizer, and compute pixel values of the
rendered image. They typically involve two stages: first computing K values for the pixel (one for
each face identified by the Fragment data), then blending them to give a final pixel value.
Shaders are Python objects, and Fragment data are stored in PyTorch tensors. Shaders can thus
work with Fragment data using standard PyTorch operators, and compute gradients via autograd.
This design is highly modular, as users can easily implement new shaders to customize the renderer.
For example, Algorithm 1 implements the silhouette renderer from [31] using a two-line shader:
dists is a tensor of shape B × H × W × K giving signed distances in the xy-plane from each pixel
to its K nearest faces (part of Fragment data), and sigma is a hyperparameter. This is simpler
than [31] where silhouette rendering is one path in a monolithic CUDA kernel and gradients are
manually computed. Similarly, Algorithm 2 implements the softmax blending algorithm from [31]
for textured rendering. dists, zbuf are part of the Fragment data, and top_k_colors_per_pixel
is the output from the shader. zfar, znear, sigma, gamma are hyper-parameters defined by the user.

Algorithm 1: Silhouette blending
    prob = (-dists / sigma).sigmoid()
    alpha = 1 - (1 - prob).prod(dim=-1)

Algorithm 2: Softmax blending
    prob = (-dists / sigma).sigmoid()
    zinv = (zfar - zbuf) / (zfar - znear)
    # keepdim=True so the per-pixel maximum broadcasts against the K dimension
    zinv_max = torch.max(zinv, dim=-1, keepdim=True).values
    weights = prob * ((zinv - zinv_max) / gamma).exp()
    weights = weights / weights.sum(dim=-1, keepdim=True)
    # weights: B x H x W x K; colors: B x H x W x K x C
    image = (weights[..., None] * top_k_colors_per_pixel).sum(dim=-2)
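To make the two blending rules concrete, the sketch below evaluates them for a single pixel in plain Python, so the arithmetic can be followed by hand. It assumes the sign convention that xy-distances are negative when the pixel lies inside a face, which is what the sigmoid of the negated distance suggests; the function names and the numeric values are illustrative only.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def silhouette_alpha(dists, sigma):
    """Algorithm 1 for one pixel: probability that at least one face covers it."""
    keep = 1.0
    for d in dists:
        keep *= 1.0 - sigmoid(-d / sigma)
    return 1.0 - keep


def softmax_blend(dists, zbuf, colors, sigma, gamma, znear, zfar):
    """Algorithm 2 for one pixel: depth-aware weighted average of K face colors."""
    probs = [sigmoid(-d / sigma) for d in dists]
    zinv = [(zfar - z) / (zfar - znear) for z in zbuf]
    zinv_max = max(zinv)  # subtracting the maximum keeps the exponent non-positive
    weights = [p * math.exp((zi - zinv_max) / gamma) for p, zi in zip(probs, zinv)]
    total = sum(weights)
    return [sum(w * c[ch] for w, c in zip(weights, colors)) / total for ch in range(3)]


# A pixel exactly on a face boundary (distance 0) gets silhouette value 0.5.
alpha_edge = silhouette_alpha([0.0], sigma=1.0)

# A red face in front of a blue face, both covering the pixel equally.
rgb = softmax_blend(
    dists=[-1.0, -1.0], zbuf=[1.0, 2.0],
    colors=[(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)],
    sigma=1.0, gamma=1e-2, znear=0.1, zfar=10.0,
)
```

With a small gamma the nearer red face dominates the result, while a larger gamma would mix more of the occluded blue face into the pixel.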
Shaders can implement complex effects using the Fragment data from the rasterizer. Face IDs can be
used to fetch per-face data like normals, colors, or texture coordinates; barycentric coordinates can be
used to interpolate data over the face; xy and z distances can be used to blend the influence of faces
in different ways. Crucially, all texturing, lighting, and blending logic can be written using PyTorch
operators, and differentiated using autograd. We provide a variety of shaders implementing silhouette
rendering, flat, Gouraud [15], and Phong [45] shading with per-vertex colors or texture coordinates,
and which blend colors using hard assignment (similar to [25]) or softmax blending (like [31]).
Performance. In Figure 3 we benchmark the speed and memory usage of our renderer against
SoftRas [31]. We implement shaders to reproduce their silhouette rendering and textured mesh
rendering using per-vertex textures and Gouraud shading. Ours is significantly faster, especially
for large meshes, higher-resolution images, and heterogeneous batches: for textured rendering of
heterogeneous batches of meshes with mean 50k faces each at 256 × 256, our renderer is more than 4×
faster than [31]. Our renderer uses more GPU memory than [31] since we explicitly store Fragment
data. However, our absolute memory use (≈ 2GB for texture at $256^2$) is small compared to modern
GPU capacity (32GB for V100); we believe our improved modularity offsets our memory use.

[Figure 4 panels: (a) Alpha homogeneous, (b) Alpha heterogeneous, (c) Norm heterogeneous]
Figure 4: Benchmarks for PyTorch3D's point cloud renderer with Alpha and Norm weighted
compositing. We use a batch size of 8, two image sizes (64 & 256) and three values for the number of
points per pixel K (10, 50, 150). All benchmarks are for forward and backward.

3.3 Differentiable point cloud renderer

PyTorch3D also provides an efficient and modular point cloud renderer following the same design
as the mesh renderer. It is similarly factored into a rasterizer that finds the K-nearest points to
each pixel along the z-direction, and shaders written in PyTorch that consume fragment data from
the rasterizer to compute pixel colors. We provide shaders for silhouette and textured point cloud
rendering, and users can easily implement custom shaders to customize the rendering pipeline. Like
the mesh renderer, the point cloud renderer natively supports heterogeneous batches of points.
Our point cloud renderer uses a similar strategy as our mesh renderer for overcoming the non-
differentiabilities discussed in Figure 2b. Each point is splatted to a circular region in screen-space
whose opacity decreases away from the region’s center. The value of each pixel is computed by
blending information for the K-nearest points in the z-axis whose splatted regions overlap the pixel.
In our experiments we consider two blending methods: Alpha-compositing and Normalized weighted
sums. Suppose a pixel is overlapped by the splats from K points with opacities $\alpha_1, \ldots, \alpha_K \in [0, 1]$
sorted in increasing z-order, and the points are associated with feature vectors $f_1, \ldots, f_K \in \mathbb{R}^D$.
Features might be boolean (for silhouette rendering), RGB colors (for textured rendering), or neural
features [21, 60]. The blending methods compute features $f_{\mathrm{Alpha}}, f_{\mathrm{Norm}} \in \mathbb{R}^D$ for the pixel:

$$
f_{\mathrm{Alpha}} = \sum_{i=1}^{K} \alpha_i \Big( \prod_{j=1}^{i-1} (1 - \alpha_j) \Big) f_i
\qquad
f_{\mathrm{Norm}} = \Big( \sum_{i=1}^{K} \alpha_i f_i \Big) \Big/ \Big( \sum_{i=1}^{K} \alpha_i \Big) \tag{2}
$$

Alpha-compositing uses the depth ordering of points so that nearer points contribute more, while
Norm ignores the depth order. Both blending functions are differentiable and can propagate gradients
from pixel features backward to both point features and opacities. They can be implemented with a
few lines of PyTorch code similar to Algorithms 1 and 2.
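Equation 2 can be traced by hand for one pixel with the following plain-Python sketch. The function names are illustrative, the inputs are assumed to be already sorted nearest first, and the PyTorch version used in the renderer is not shown here.

```python
def alpha_composite(alphas, feats):
    """f_Alpha: each point is weighted by its opacity times the transmittance in front of it."""
    out = [0.0] * len(feats[0])
    transmittance = 1.0  # running product of (1 - alpha_j) over nearer points j < i
    for a, f in zip(alphas, feats):
        w = a * transmittance
        out = [o + w * x for o, x in zip(out, f)]
        transmittance *= 1.0 - a
    return out


def norm_weighted_sum(alphas, feats):
    """f_Norm: opacity-weighted average that ignores depth order."""
    total = sum(alphas)
    dim = len(feats[0])
    return [sum(a * f[d] for a, f in zip(alphas, feats)) / total for d in range(dim)]


alphas = [0.5, 0.5]
feats = [[1.0, 0.0], [0.0, 1.0]]
f_alpha = alpha_composite(alphas, feats)
f_norm = norm_weighted_sum(alphas, feats)
```

With two points of opacity 0.5, alpha-compositing weights the nearer point by 0.5 and the farther one by 0.25, while the normalized sum weights both by 0.5. The running transmittance is the inner cumulative product that Norm does not need.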
We benchmark our point cloud renderer by sampling points from the surface of random ShapeNet
meshes, then rendering silhouettes using our two blending functions. We vary the point cloud size,
points per pixel (K = 10, 50, 150), and image size (64, 256). Results are shown in Figure 4.
Our renderer is efficient: rendering a batch of 8 point clouds with 200k points each to a batch
of 256 × 256 images with K = 50 points per pixel takes about 75ms and uses just over 1GB of
GPU memory, making it feasible to use the renderer as a differentiable layer when training neural
networks. Comparing Figures 4a and 4b shows similar performance when rendering homogeneous and
heterogeneous batches of comparable size. Comparing Figures 4b and 4c shows that both blending
methods have similar memory usage, but Norm is up to 25% faster for large K since it omits the
inner cumulative product. Point cloud rendering is generally more efficient than mesh rendering since
it requires fewer computations per primitive during rasterization.

[Figure 5 panels: (a) Silhouette Mesh Rendering, showing Input, Sphere FC, Sphere GCN, High Res
Sphere GCN and Voxel GCN; (b) Textured Mesh Rendering, showing Input, Sphere GCN and Voxel GCN
with Flat, Phong and Gouraud shading]
Figure 5: Mesh predictions via (a) silhouette and (b) textured rendering on ShapeNet test. For (a),
we show the input image, the prediction and an additional view for each model. For (b), we show the
input image, the predicted shapes and the predicted textures for each shader and model - the predicted
shapes across shaders are very similar, so we show one per model.

4 Experiments
Extending supervised learning into 3D is challenging due to the difficulty of obtaining 3D annotations.
Extracting 3D via weakly or unsupervised approaches can unlock exciting applications, such as
novel view synthesis, 3D content creation for AR/VR and more. Differentiable rendering makes
3D inference via 2D supervision possible. In this section, we experiment with unsupervised 3D
shape prediction using PyTorch3D. At test time, models predict an object’s 3D shape (point cloud or
mesh) from a single RGB image. During training they receive no 3D supervision, instead relying
on re-projection losses via differentiable rendering. We compare to SoftRas [31] and demonstrate
superior shape prediction and speed. The efficiency of our renderer allows us to scale to larger images
and more complex meshes, setting a new state-of-the-art for unsupervised 3D shape prediction.
Dataset. We experiment on ShapeNetCoreV1 [6], using the rendered images and train/test splits from
[8]. Images are 137×137, and portray instances from 13 object categories from various viewpoints.
There are roughly 840K train and 210K test images; we reserve 5% of train images for validation.
Metrics. We follow [59, 14] for evaluating 3D meshes. We sample 10k points uniformly at random
from the surface of predicted and ground-truth meshes. These are compared using Chamfer distance,
normal consistency, and F1τ score for various distance thresholds τ . Refer to [14] for more details.
To fairly extend evaluation to point cloud models, we predict 10k points per cloud.
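As a hedged illustration of the F1 metric at a distance threshold τ, the sketch below computes precision, recall and their harmonic mean between two small point sets. It assumes the usual definition (a point counts as correct if its nearest neighbor in the other set is closer than τ); the exact protocol, including any scaling, is the one described in [14] and is not reproduced here.

```python
import math


def f1_score(pred, gt, tau):
    """F1 between point sets: harmonic mean of precision and recall at threshold tau."""
    def nearest(p, cloud):
        return min(math.dist(p, q) for q in cloud)

    # Precision: fraction of predicted points within tau of some ground-truth point.
    precision = sum(nearest(p, gt) < tau for p in pred) / len(pred)
    # Recall: fraction of ground-truth points within tau of some predicted point.
    recall = sum(nearest(g, pred) < tau for g in gt) / len(gt)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.05), (5.0, 0.0, 0.0)]
score = f1_score(pred, gt, tau=0.1)
```

The function returns a fraction in [0, 1]; the tables report percentages, so the value would be multiplied by 100 for comparison.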

4.1 Mesh prediction with differentiable rendering


In this task, we predict 3D object meshes from 2D image inputs with 2D silhouette supervision.
Following [31], we use a 2-view training setup: for each object image in the minibatch we include
its corresponding view under a random known transformation. At test time, all models take as input
a single image and directly predict the 3D object mesh in camera coordinates. Inspired by recent
advances in supervised shape prediction [59, 14], we explore the following model designs:
Sphere FC. Closely following [31], this model learns to deform an initial sphere template with 642
vertices and 1280 faces. The input image is encoded via a CNN backbone followed by two fully
connected layers, each of 1024 dimensions, which cast offset predictions for each vertex.
Sphere GCN. Inspired by recent advances in [59, 14], this model uses graph convolutions. Each
vertex pools its features from the output of the backbone indexed by its 2D projection. A set of graph
convs on the mesh graph predict vertex offsets. Similar to Sphere FC, this model deforms a sphere
mesh. Unlike Sphere FC, image to feature alignment is preserved. PyTorch3D’s scale efficiency
allows us to train a High Res variant which deforms an even larger sphere (2562 verts, 5120 faces).
Voxel GCN. The above models predict shapes homeomorphic to spheres, and are not able to capture
varying shape topologies. Recently, Mesh R-CNN [14] shows that coarse voxel predictions capture
instance-specific object topologies. These low fidelity voxel predictions perform poorly as they
can’t reconstruct fine structures or smooth surfaces, but when refined with 3D mesh supervision,
they become state-of-the-art. To demonstrate PyTorch3D's flexibility with varying shape topologies
and heterogeneous batches, we train a model which makes voxel predictions and refines them via a
sequence of graph convs. At train time, this model uses coarse $48^3$ voxel supervision similar to [14],
but unlike [14] it does not use 3D mesh supervision.

Model                3D Superv. Renderer         Metrics                                 Mesh Size
Net                  Vox  Mesh  Size  Engine     Ch. (↓)  Nrml   F1^0.1  F1^0.3  F1^0.5  |V|        |F|
Sphere FC            ✗    ✗     64    SoftRas    1.475    0.691  25.5    68.4    82.3    642        1280
Sphere FC            ✗    ✗     64    PyTorch3D  0.989    0.696  26.4    69.9    83.5    642        1280
Sphere FC            ✗    ✗     128   SoftRas    0.346    0.700  26.1    70.6    85.2    642        1280
Sphere FC            ✗    ✗     128   PyTorch3D  0.313    0.699  27.6    72.5    86.6    642        1280
Sphere GCN           ✗    ✗     64    SoftRas    0.316    0.713  24.4    70.2    85.8    642        1280
Sphere GCN           ✗    ✗     64    PyTorch3D  0.296    0.703  24.8    71.3    86.5    642        1280
Sphere GCN           ✗    ✗     128   SoftRas    0.301    0.709  26.1    71.9    86.5    642        1280
Sphere GCN           ✗    ✗     128   PyTorch3D  0.293    0.709  26.6    72.6    86.9    642        1280
High Res Sphere GCN  ✗    ✗     128   PyTorch3D  0.281    0.696  26.7    73.8    87.8    2562       5120
Voxel GCN            ✓    ✗     64    SoftRas    0.293    0.656  24.5    71.1    87.2    1947±923   3895±1851
Voxel GCN            ✓    ✗     64    PyTorch3D  0.267    0.675  26.1    73.3    88.5    1932±935   3866±1873
Voxel GCN            ✓    ✗     128   SoftRas    0.276    0.675  26.2    72.6    87.9    1918±928   3837±1860
Voxel GCN            ✓    ✗     128   PyTorch3D  0.277    0.687  26.2    73.4    87.8    1951±949   3903±1901
Voxel Only [14]      ✓    ✗     n/a   n/a        0.916    0.595  7.70    33.1    54.9    2433±925   4877±1856
Mesh R-CNN [14]      ✓    ✓     n/a   n/a        0.171    0.713  35.1    82.6    93.2    2292±902   4598±1812
Table 1: Mesh reconstruction via silhouette rendering on ShapeNet test with PyTorch3D and
SoftRas [31]. We compare to state-of-the-art Mesh R-CNN [14], trained with voxel & mesh supervision,
and its Voxel Only variant, trained with voxel supervision. We highlight best metrics for models
trained without any 3D supervision (blue) and with voxel supervision (red).

Model       Renderer       Metrics
Net         Size  Shading  Chamfer (↓)  Normal  F1^0.1  F1^0.3  F1^0.5  L1^fg (↓)  L1^bg (↓)
Sphere GCN  64    Flat     0.315        0.689   24.1    70.9    86.3    0.0217     0.0031
Sphere GCN  64    Phong    0.309        0.703   24.1    70.2    85.9    0.0173     0.0023
Sphere GCN  64    Gouraud  0.302        0.702   24.4    70.9    86.6    0.0180     0.0024
Voxel GCN   64    Flat     0.270        0.678   26.4    73.0    88.2    0.0128     0.0020
Voxel GCN   64    Phong    0.272        0.694   25.7    72.4    87.9    0.0126     0.0021
Voxel GCN   64    Gouraud  0.302        0.687   24.3    69.8    86.2    0.0128     0.0021
Table 2: Mesh and texture reconstruction via textured rendering on ShapeNet test. L1^fg & L1^bg
measure the reconstruction accuracy of the foreground object and background, respectively.

All models minimize the objective $L = L_s + \lambda_l L_l + \lambda_e L_e$. $L_s$ is the negative intersection over
union between rendered and ground truth 2D silhouette, as in [31]. $L_l$ and $L_e$ are Laplacian and edge
length mesh regularizers, respectively, ensuring that meshes are smooth. We set $\lambda_l = 19$ and $\lambda_e = 0.2$.
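The silhouette term and the weighted objective can be sketched as below. The soft IoU used here (elementwise product as intersection, p + t − p·t as union) is an assumption based on a common soft-silhouette formulation, and the regularizer values passed to `objective` are made-up numbers; only the weights 19 and 0.2 come from the text.

```python
def neg_iou(pred, target):
    """Negative soft IoU between two flattened silhouettes with values in [0, 1]."""
    inter = sum(p * t for p, t in zip(pred, target))
    union = sum(p + t - p * t for p, t in zip(pred, target))
    # Assumes at least one nonzero pixel, so that the union is positive.
    return -inter / union


def objective(L_s, L_l, L_e, lam_l=19.0, lam_e=0.2):
    """Total loss: silhouette term plus weighted Laplacian and edge length regularizers."""
    return L_s + lam_l * L_l + lam_e * L_e


rendered = [1.0, 1.0, 0.0, 0.0]
ground_truth = [1.0, 0.0, 0.0, 0.0]
L_s = neg_iou(rendered, ground_truth)
# Hypothetical regularizer values, for illustration only.
total = objective(L_s, L_l=0.01, L_e=0.1)
```

A perfect silhouette match gives a silhouette term of −1, so lower values of the objective correspond to better overlap and smoother meshes.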
All model variants use a ResNet50 [19] backbone, initialized with ImageNet weights. We follow the
training schedule from [14]; we use Adam with a constant learning rate of $10^{-4}$ for 25 epochs. We
use a batch size of 64 across 8 V100 GPUs (8 images per GPU). Sphere GCN and Voxel GCN use a
sequence of 3 graph convs, each with 512 dimensions. The voxel head in Voxel GCN predicts $48^3$
voxels via a 4 layer CNN, identical to [14]. For all models, the inputs are 137×137 images from [8].
We report performance using our mesh renderer and compare to SoftRas [31].
Table 1 shows our results. We compare to the supervised state-of-the-art Mesh R-CNN [14] and
its Voxel Only variant, the latter being a direct comparison to Voxel GCN as both use the same 3D
supervision – coarse voxels but no meshes. We render at 64×64, similar to SoftRas [31], and push to
even higher rendering resolution at 128×128. Figure 5a compares the models qualitatively.
From Table 1 we observe: (a) Compared to SoftRas, PyTorch3D achieves on par or better performance
across models. This validates the design of the PyTorch3D renderer and proves that rendering the
K closest faces, instead of all, does not hurt performance, (b) Even though Sphere FC & Sphere
GCN deform the same sphere, Sphere GCN is superior across renderers. Figure 5a backs this claim
qualitatively. Unlike Sphere GCN, Sphere FC is sensitive to rendering size (row 4 vs 6), (c) High
Res Sphere GCN significantly outperforms Sphere GCN both quantitatively (row 10 vs 11) and
qualitatively (Figure 5a) showing the advantages of PyTorch3D's scale efficiency, (d) Voxel GCN
significantly outperforms Voxel Only, both trained with the same 3D supervision. As mentioned
in [14], Voxel Only performs poorly, since voxel predictions are coarse and fail to capture fine shapes.
Voxel GCN improves all metrics and reconstructs fine structures with complex topologies as shown
in Figure 5a. Here, PyTorch3D's efficiency results in a 2× training speedup compared to SoftRas.

Renderer    Metrics
Size  K     Ch. (↓)  F1^0.1  F1^0.3  F1^0.5
64    20    0.738    12.8    50.6    68.0
64    50    0.451    18.1    62.9    79.0
64    100   0.289    23.4    73.3    87.1
64    150   0.289    23.9    73.5    87.3
128   20    0.623    13.7    53.1    71.2
128   50    0.398    19.2    65.6    81.3
128   100   0.272    25.3    75.2    88.0
128   150   0.280    25.1    74.6    87.8
Table 4: Point cloud reconstruction via silhouette rendering on ShapeNet test for different
rendering resolutions and K.

[Figure 6 panels: (a) Input, Point Align; (b) Input, Alpha, Norm]
Figure 6: Point cloud predictions via (a) silhouette and (b) textured rendering on ShapeNet test.

Renderer           Metrics
Size  K    Blend   Ch. (↓)  F1^0.1  F1^0.3  F1^0.5  L1^fg (↓)  L1^bg (↓)
128   100  Alpha   0.275    25.2    74.7    87.8    0.0178     0.0051
128   100  Norm    0.268    25.7    75.4    88.1    0.0187     0.0027
Table 5: Point cloud and texture reconstruction via textured rendering on ShapeNet test with two
blending functions.

Net          3D superv  Ch. (↓)  F1^τ  F1^2τ
PSG [11]     ✓          0.593    48.6  69.8
Point Align  ✗          0.647    61.0  74.6
Table 6: Comparison of our unsupervised point cloud model, Point Align, to PSG [11] under the non
scale-normalized metric (Table 1 in [14]).

Varying K As described in Section 3, the PyTorch3D renderer exposes the K nearest faces per
pixel. We experiment with different values of K for Sphere GCN & Voxel GCN at a 128×128
resolution in Table 3. We observe that K=50 results in best performance for both models (Chamfer
0.293 for Sphere GCN & 0.277 for Voxel GCN). More interestingly, a small value for K(=20) works
well for smaller meshes (Sphere GCN) but results in a performance drop for larger meshes (Voxel
GCN) with the same output image size. This is expected since for the same K, for a larger mesh,
fewer faces per pixel are rendered in proportion to the mesh size. Finally, increasing K does not
improve models further. This empirically validates our design of rendering a fixed finite number of
faces per pixel.

Net          K     Ch.(↓)   Nrml    F1^0.1   F1^0.3   F1^0.5
Sphere GCN   20    0.294    0.697   27.2     72.2     86.8
Sphere GCN   50    0.293    0.709   26.6     72.6     86.9
Sphere GCN   100   0.294    0.708   26.9     72.3     86.9
Sphere GCN   150   0.314    0.716   25.2     71.1     86.1
Voxel GCN    20    0.317    0.642   24.1     68.4     85.3
Voxel GCN    50    0.277    0.687   26.2     73.4     87.8
Voxel GCN    100   0.282    0.669   25.5     71.6     87.3
Voxel GCN    150   0.285    0.674   25.9     72.2     87.2

Table 3: Varying K for mesh reconstruction.
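The role of K discussed under Varying K can be illustrated with a small NumPy sketch. This is not PyTorch3D's rasterizer: it assumes a hypothetical dense array `zbuf` holding the depth of every face at every pixel, with `np.inf` marking faces that do not cover the pixel, and the function name `k_nearest_faces` is made up for illustration. It only shows what "keeping the K nearest faces per pixel" means.

```python
import numpy as np

def k_nearest_faces(zbuf, k):
    """Return indices and depths of the k nearest faces per pixel.

    zbuf: (H, W, F) depths; np.inf where a face does not cover the pixel.
    Returns (idx, depth), both (H, W, k), sorted near to far. idx is -1
    in slots where fewer than k faces cover the pixel.
    """
    k = min(k, zbuf.shape[-1])
    # Sort faces by depth at each pixel and keep the first k.
    order = np.argsort(zbuf, axis=-1)[..., :k]
    depth = np.take_along_axis(zbuf, order, axis=-1)
    # Mark empty slots (infinite depth) with index -1.
    idx = np.where(np.isfinite(depth), order, -1)
    return idx, depth
```

With a fixed K, a mesh with more faces leaves a larger fraction of the faces overlapping each pixel outside the kept set, which matches the observation that K=20 hurts the larger Voxel GCN meshes more than the smaller Sphere GCN meshes.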
Textured rendering In addition to shapes, we reconstruct object textures by extending the above
models to predict per vertex (r, g, b) values using textured rendering. The models are trained with
an additional L1 loss between the rendered and the ground truth image. Table 2 shows our analysis.
Simultaneous shape and texture prediction is harder, yet our models achieve high reconstruction
quality for both shapes (compared to Table 1) and textures. Figure 5b shows qualitative results.

4.2 Point cloud prediction with differentiable rendering


To show the effectiveness of the point cloud renderer, we train unsupervised point cloud models.
Our model, called Point Align, deforms 10k points sampled from a sphere by pooling backbone
features and predicting per point offsets. Table 4 shows our analysis. Point Align slightly improves
shape metrics compared to meshes (Tables 4 vs 1). As with meshes, a finite K(=100) performs best
while increasing K does not improve models further. We reconstruct texture by predicting additional
(r, g, b) values per point via textured rendering (Table 5). Figure 6 shows qualitative results. Finally,
we compare Point Align to the supervised PSG [11] baseline in Table 6; we significantly improve F1
and show slightly worse Chamfer, which is expected as PSG directly minimizes Chamfer with 3D
supervision, while our model is not trained to minimize any 3D metric and uses no supervision.

5 Broader Impact
In this paper we introduce PyTorch3D, a library for 3D research which is fully differentiable to enable
easy inclusion into existing deep learning pipelines, modular for fast experimentation and extension,
optimized and efficient to allow scaling to large 3D data sizes, and with heterogeneous batching
capabilities to support the variable shape topologies encountered in real world use cases.
Similar to PyTorch or TensorFlow, which provide fundamental tools for deep learning, PyTorch3D is a
library of building blocks optimized for 3D deep learning. Frameworks like PyTorch and PyTorch3D
provide a platform for solving a plethora of AI problems including semantic or synthesis tasks.
Data-driven solutions to such problems necessitate the community’s caution regarding their potential
impact, especially when these models are being deployed in the real world. In this work, our goal
is to provide students, researchers and engineers with the best possible tools to accelerate research
at the intersection of 3D and deep learning. To this end, we are committed to supporting and developing
PyTorch3D in adherence with the needs of the academic, research and engineering community.

Appendix

A Sampling Batches for Benchmarks


The batches of 3D data used in the benchmarks in Section 3 were sampled from ShapeNetCoreV1
using a uniform sampling strategy. For meshes, homogeneous batches (σ = 0; σ here denotes the
variance in size for elements within the batch) consist of one mesh with the specified number of faces,
repeated B times with B being the batch size. For heterogeneous batches (σ > 0), we sample B
values from a uniform distribution with the specified µ and σ in the number of faces per mesh. For
each value, we find the mesh in the dataset with number of faces closest to the desired value.
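The mesh batch construction above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the benchmark code: `face_counts` is a hypothetical array with the number of faces of every mesh in the dataset, the function name is made up, and σ is interpreted as the standard deviation of the uniform distribution (the text calls it the variance in size, so the exact parameterization is an assumption).

```python
import numpy as np

def sample_mesh_batch(face_counts, batch_size, mu, sigma, rng):
    """Pick dataset indices for one benchmark batch.

    face_counts: (N,) number of faces of every mesh in the dataset.
    sigma == 0 gives a homogeneous batch (one mesh repeated B times);
    sigma > 0 gives a heterogeneous batch.
    ASSUMPTION: sigma is the standard deviation of the uniform
    distribution, so targets are drawn from
    [mu - sqrt(3) * sigma, mu + sqrt(3) * sigma].
    """
    face_counts = np.asarray(face_counts)
    if sigma == 0:
        targets = np.full(batch_size, float(mu))
    else:
        half_width = np.sqrt(3.0) * sigma
        targets = rng.uniform(mu - half_width, mu + half_width, size=batch_size)
    # For each target, choose the mesh whose face count is closest.
    return np.abs(face_counts[None, :] - targets[:, None]).argmin(axis=1)
```

A call such as `sample_mesh_batch(face_counts, 32, 1000, 0, np.random.default_rng(0))` returns the same index 32 times, which corresponds to the homogeneous case.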
For point cloud operators, we first sample a random mesh from the dataset and then uniformly sample
points from the surface of the mesh. For homogeneous batches, the same number of points is sampled
B times. For heterogeneous batches, B values for the number of points in each point cloud are sampled
from a uniform distribution with the specified µ and σ.
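Uniformly sampling points from a mesh surface, as used for the point cloud benchmarks, is commonly done by picking faces in proportion to their area and then drawing barycentric coordinates. The NumPy sketch below shows that standard technique; it is an assumed illustration with a made-up function name, not the sampling code used for the benchmarks.

```python
import numpy as np

def sample_points_from_mesh(verts, faces, num_points, rng):
    """Sample points uniformly (by area) from a triangle mesh surface.

    verts: (V, 3) float array, faces: (F, 3) int array of vertex indices.
    Returns a (num_points, 3) array of surface points.
    """
    v0, v1, v2 = (verts[faces[:, i]] for i in range(3))
    areas = 0.5 * np.linalg.norm(np.cross(v1 - v0, v2 - v0), axis=1)
    # Choose faces with probability proportional to their area.
    face_idx = rng.choice(len(faces), size=num_points, p=areas / areas.sum())
    # Uniform barycentric coordinates via the square-root trick.
    u = np.sqrt(rng.uniform(size=(num_points, 1)))
    v = rng.uniform(size=(num_points, 1))
    w0, w1, w2 = 1.0 - u, u * (1.0 - v), u * v
    return w0 * v0[face_idx] + w1 * v1[face_idx] + w2 * v2[face_idx]
```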

B Experiments: Unsupervised shape prediction


In Section 4, we experiment with unsupervised 3D shape prediction from a single image using the
PyTorch3D renderers. Predicting a 3D shape from an input image is ambiguous as infinitely many 3D shapes
can explain a 2D image. In the absence of ground truth supervision, to resolve shape ambiguity we
assume 2 views of the object at train time. This setup is also assumed in SoftRas [31].
The 2-view training setup is shown in Figure 7(a). When constructing minibatches, for every image
input sampled, we assume an additional view under known rotation R and translation t. The predicted
shape from the input is then transformed by (R, t) and its rendered output is compared against the
ground truth silhouette. At test time, the model takes as input a single image and predicts the object’s
3D shape in camera coordinates, as shown in Figure 7(b). In the case of textured rendering, the
textured views are additionally compared at train time. We do the same for point clouds.
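The geometric step of the 2-view setup, moving the predicted shape into the second view before rendering, can be sketched in NumPy. The column-vector convention x' = R x + t is an assumption made here for illustration (a library may use a row-vector convention instead), and `rotation_y` is a hypothetical helper used only to build an example rotation.

```python
import numpy as np

def transform_to_second_view(points, R, t):
    """Map (N, 3) points from the input camera frame to the second view.

    ASSUMPTION: column-vector convention, x' = R @ x + t.
    """
    return points @ R.T + t

def rotation_y(angle):
    """Hypothetical helper: rotation by `angle` radians about the y axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
```

During training, the transformed points (or vertices) would then be rendered and the resulting silhouette compared against the ground truth silhouette of the second view.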

B.1 Unsupervised mesh prediction

In the case of meshes, we experiment with three architectures for the 3D shape network, namely
Sphere FC, Sphere GCN and Voxel GCN. The first two deform an initial sphere template and use
no 3D shape supervision. The third deforms shape topologies cast by the voxel head and tests the
effectiveness of differentiable rendering for varying shapes and connectivities as predicted by a neural
network. These topologies are different and more diverse than human-defined or simple genus 0
shapes, allowing us to stress test our renderer. Voxel GCN uses coarse voxel supervision to train
the voxel head which learns to predict instance-specific, yet coarse, shapes. Note that Voxel GCN is
identical to Mesh R-CNN [14], but unlike Mesh R-CNN it uses no mesh supervision. The details of
the architectures for all three models are shown in Figure 8. For each model, we show the network
modules as well as the shapes of the intermediate batched tensor outputs. Note that for Voxel GCN,
the shapes of the meshes vary within the batch due to the heterogeneity from the voxel predictions.

[Figure 7 panels: (a) Train time, (b) Inference]
Figure 7: System overview for unsupervised shape prediction. (a) Shows the 2-view training setup.
During training, an additional view of known rotation R and translation t of the input view is assumed.
The predicted shape is transformed by (R, t) and its rendered silhouette is compared against the
ground truth one. (b) Shows the system at inference time. The input is a single RGB image and the
output is the predicted shape in camera coordinates.

[Figure 8 panels: (a) Sphere FC, (b) Sphere GCN, (c) Voxel GCN]


Figure 8: 3D mesh network architectures used in our experiments.

Losses As stated in Section 4, all models optimize the same objective
$\mathcal{L} = \mathcal{L}_s + \lambda_l \mathcal{L}_l + \lambda_e \mathcal{L}_e$. $\mathcal{L}_s$ is
the negative intersection over union between rendered and ground truth 2D silhouette from [31]

$$\mathcal{L}_s = 1 - \frac{S_{pred} \cdot S_{gt}}{S_{pred} + S_{gt} - S_{pred} \cdot S_{gt}} \qquad (3)$$

where $S_{pred}$ and $S_{gt}$ are the predicted and ground truth silhouettes. To enforce smoothness in the
predicted shapes we also use shape regularizers. We use an edge length regularizer $\mathcal{L}_e$ that
minimizes the length of the edges in the predicted mesh, identical to Mesh R-CNN [14]. We also use a
Laplacian regularizer $\mathcal{L}_l$, defined as follows

$$\mathcal{L}_l = \lVert L \cdot v \rVert_1 \qquad (4)$$

where $L$ is the Laplacian matrix of shape $V \times V$ and $v$ are the vertices of shape $V \times 3$. We use
PyTorch3D’s implementation of this loss that handles heterogeneous batches efficiently. At train time,
the loss is averaged across vertices and elements in the batch. This is unlike SoftRas [31], where it is
summed across vertices and across elements in the batch.
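The three loss terms can be written out as a short NumPy sketch for a single mesh. It is meant only to make Equations (3) and (4) concrete and is not PyTorch3D's batched, differentiable implementation. Several details are assumptions: the products and sums in Equation (3) are taken over pixels, the Laplacian uses uniform weights, and the edge term is the mean squared edge length. All function names are hypothetical.

```python
import numpy as np

def silhouette_iou_loss(s_pred, s_gt, eps=1e-6):
    """Equation (3): 1 - IoU of soft silhouettes with values in [0, 1].

    ASSUMPTION: numerator and denominator are summed over pixels.
    """
    inter = (s_pred * s_gt).sum()
    union = (s_pred + s_gt - s_pred * s_gt).sum()
    return 1.0 - inter / (union + eps)

def unique_edges(faces):
    """Undirected edges (E, 2) of a triangle mesh with faces (F, 3)."""
    e = np.concatenate([faces[:, [0, 1]], faces[:, [1, 2]], faces[:, [2, 0]]])
    return np.unique(np.sort(e, axis=1), axis=0)

def laplacian_loss(verts, faces):
    """Equation (4) with a uniform Laplacian, averaged over vertices."""
    num_verts = len(verts)
    edges = unique_edges(faces)
    adj = np.zeros((num_verts, num_verts))
    adj[edges[:, 0], edges[:, 1]] = 1.0
    adj[edges[:, 1], edges[:, 0]] = 1.0
    deg = np.maximum(adj.sum(axis=1, keepdims=True), 1.0)
    lap = adj / deg - np.eye(num_verts)  # (V, V): neighbor mean minus vertex
    return np.abs(lap @ verts).sum() / num_verts

def edge_length_loss(verts, faces):
    """Mean squared edge length (ASSUMPTION: squared, target length 0)."""
    edges = unique_edges(faces)
    diff = verts[edges[:, 0]] - verts[edges[:, 1]]
    return (diff ** 2).sum(axis=1).mean()
```

The full objective would then combine the terms as `silhouette_iou_loss(...) + lambda_l * laplacian_loss(...) + lambda_e * edge_length_loss(...)`, with the weights chosen by the user.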
More discussion on Table 1 and Figure 5a Table 1 compares performance for different shape
networks with PyTorch3D and SoftRas and under two rendering resolutions. We replicate the original
experiments in SoftRas [31] which use a Sphere FC model to deform a sphere of 642 vertices and
1280 faces and render at 64×64. While PyTorch3D performs better under that setting, we see that
the absolute performance for both SoftRas (chamfer 1.475) and PyTorch3D (chamfer 0.989) is not
good enough. Going to higher rendering resolutions improves models (chamfer 0.346 with SoftRas
vs 0.313 with PyTorch3D). Sphere GCN, a more geometry-aware network architecture, performs
much better for all rendering resolutions and all rendering engines (chamfer 0.301 with SoftRas
vs chamfer 0.293 with PyTorch3D at 128×128). This is also evident in Figure 5a where we show
predictions for all models at a 128×128 rendering resolution. Sphere FC is able to capture a global
pose and appearance but fails to capture instance-specific shape details. Sphere GCN, which deforms
the same sized sphere template, is able to reconstruct instance-specific shapes, such as chair legs and
tables, much more accurately. We take advantage of PyTorch3D’s scale efficiency and use a larger
sphere where we immediately see improved reconstruction quality (chamfer 0.281 with PyTorch3D).
However, Sphere FC and Sphere GCN can only make predictions homeomorphic to spheres. This
forces them to produce multiple face intersections and irregularly shaped faces in order to capture the complex
shape topologies. Our Voxel GCN variant tests the ability of our renderer to make predictions of
any genus by refining shape topologies predicted by a neural network. The reconstruction quality
improves for Voxel GCN which captures holes and complex shapes while maintaining regular shaped
faces. More importantly, Voxel GCN improves on the Voxel Only baseline, which also predicts voxels
but performs no further refinement. Even though both models are trained with the same supervision,
Voxel GCN achieves a chamfer of 0.267 compared to 0.916 for Voxel Only.
More visualizations We show additional mesh predictions via silhouette rendering in Figure 9 and
via textured rendering in Figure 10. Our texture model is simple; we predict (r, g, b) texture values
per vertex. The shaders interpolate the vertex textures into face textures based on their algorithm and
the blending accumulates the face textures for each pixel. More sophisticated texture models could
involve directly sampling vertex textures from the input image or using GANs for natural looking
texture predictions. Despite its simplicity, our texture model does well at reconstructing textures.
Textures are more accurate for Voxel GCN mainly because of the regular shaped and sized faces in
the predicted mesh. This is in contrast to Sphere GCN, where faces can intersect and largely vary
in size in an effort to capture the complex object shape, which in turn affects the interpolated face
textures. In addition, simple shading functions, such as flat shading, visibly lead to worse textures, as
expected. More sophisticated shaders, like Phong and Gouraud, lead to better texture reconstructions.

B.2 Unsupervised point cloud prediction

Expanding on the brief description of the unsupervised point cloud prediction model in the main
paper, here we provide more details and analyze our results further. Our training and inference
procedures are similar to meshes; we follow the same 2-view train setup while at test time our model
only takes as input a single RGB image (Figure 7). Our Point Align model is similar to Sphere GCN.
We start from an initial point cloud of 10k points sampled randomly and uniformly from the surface
of a sphere. Each point samples features from the backbone using Vert Align. Because point clouds
have no connectivity patterns, a sequence of fully connected layers replaces Graph Conv, resulting
in per point offset predictions (and (r, g, b) values, in the case of textured models). The network
architecture is shown in Figure 11.
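The initial point cloud described above can be generated with a standard trick: normalizing samples from an isotropic Gaussian gives points distributed uniformly on the sphere surface. The NumPy sketch below shows this technique; whether the experiments used exactly this sampler is an assumption, and the function name is hypothetical.

```python
import numpy as np

def sample_sphere_points(num_points, rng, radius=1.0):
    """Uniform random points on a sphere surface via normalized Gaussians."""
    x = rng.normal(size=(num_points, 3))
    return radius * x / np.linalg.norm(x, axis=1, keepdims=True)
```

For example, `sample_sphere_points(10_000, np.random.default_rng(0))` yields a (10000, 3) array; the predicted per point offsets would then be added to these points to obtain the output cloud.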
Losses Point Align is trained solely with $\mathcal{L} = \mathcal{L}_s$ for silhouette rendering and an additional $L_1$ loss
between the rendered and ground truth image for textured rendering. There are no shape regularizers.
Blending In the case of textured rendering, we experiment with two blending (or compositing)
functions, Alpha and Norm.
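The appendix does not spell out the two blending functions. The sketch below assumes the usual forms: front-to-back alpha compositing, and a weighted sum whose weights are normalized to sum to one. It operates on a single pixel with K points sorted from near to far, uses NumPy rather than the library's compositors, and the function names are illustrative.

```python
import numpy as np

def alpha_composite(alphas, feats):
    """Front-to-back alpha compositing for one pixel.

    alphas: (K,) opacities sorted near to far; feats: (K, C) features.
    """
    # Transmittance reaching point k: product of (1 - alpha_j) for j < k.
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    return ((alphas * trans)[:, None] * feats).sum(axis=0)

def norm_weighted_sum(alphas, feats, eps=1e-10):
    """Weighted average of features with weights alphas / sum(alphas)."""
    return (alphas[:, None] * feats).sum(axis=0) / max(alphas.sum(), eps)
```

Under these assumed definitions, alpha compositing lets nearer points occlude farther ones, while the normalized sum ignores depth order and averages all K contributions.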
More discussion on Table 4 Our point cloud evaluation in Table 4 is directly comparable to that of
meshes in Table 1; we use the same number of points, 10k, to compute chamfer and F1 for meshes (by
sampling points from the mesh surface) and point clouds (directly using the points in the cloud). From
the comparison with meshes, we observe that our unsupervised point cloud model leads to slightly
better reconstruction quality than meshes (chamfer 0.272 for Point Align vs 0.281 for High Res
Sphere GCN). This indicates that our point cloud renderer is effective at predicting shapes. The quality
of our reconstructions is shown in Figure 12 which provides more shape and texture predictions by
our Point Align model with an alpha and norm compositor.
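The Chamfer and F1 metrics used in this comparison can be sketched for two point sets as follows. The definitions are assumptions modeled on common practice: Chamfer is the sum of the two mean squared nearest-neighbor distances, and F1 at threshold τ is the harmonic mean of the fractions of predicted and ground truth points that lie within τ of the other set, reported as a percentage. The brute-force distance matrix is only suitable for small clouds; 10k points would call for a chunked or GPU nearest-neighbor search.

```python
import numpy as np

def nearest_sq_dists(a, b):
    """Squared distance from each point in a (N, 3) to its nearest in b (M, 3)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return d2.min(axis=1)

def chamfer_and_f1(pred, gt, tau):
    """Return (chamfer, f1_percent) between two point sets.

    ASSUMPTION: squared distances for Chamfer, Euclidean distances
    compared against tau for precision and recall.
    """
    d_pred_to_gt = nearest_sq_dists(pred, gt)
    d_gt_to_pred = nearest_sq_dists(gt, pred)
    chamfer = d_pred_to_gt.mean() + d_gt_to_pred.mean()
    precision = (np.sqrt(d_pred_to_gt) < tau).mean()
    recall = (np.sqrt(d_gt_to_pred) < tau).mean()
    if precision + recall == 0:
        return chamfer, 0.0
    return chamfer, 100.0 * 2 * precision * recall / (precision + recall)
```

For meshes, `pred` would hold points sampled from the predicted surface; for point clouds, the predicted points are used directly, which is what makes the two evaluations comparable.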

[Figure 9 columns: Input, Sphere FC, Sphere GCN, High Res Sphere GCN, Voxel GCN]
Figure 9: Shape reconstruction predictions via silhouette mesh rendering on ShapeNet test. We show
the input image (left). For each model, we show the object shape prediction made by the model and an
additional random view.
[Figure 10 panels: columns Input, Sphere GCN, Voxel GCN (repeated twice); rows Flat, Phong, Gouraud]
Figure 10: Shape and texture predictions via textured mesh rendering on the ShapeNet test set. We
show the input image (left). For each shading function, we show the shape prediction and its textured
rendering (with an additional view) as predicted by the model.

Figure 11: The Point Align network architecture used for unsupervised point cloud predictions in our
experiments.

[Figure 12 panels: columns Input, Alpha, Norm (repeated twice); rows Point Align]
Figure 12: Point cloud and texture predictions via textured rendering on the ShapeNet test set. For
each example, we show the input image and for each compositor (Alpha, Norm) we show the shape
(top row) and texture (bottom row) prediction as well as an additional view.

References
[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving,
M. Isard, et al. Tensorflow: A system for large-scale machine learning. In OSDI, 2016.
[2] R. Alp Güler, N. Neverova, and I. Kokkinos. Densepose: Dense human pose estimation in the
wild. In CVPR, 2018.
[3] V. Blanz and T. Vetter. Face recognition based on fitting a 3d morphable model. TPAMI, 2003.
[4] R. A. Brooks. Model-based three-dimensional interpretations of two-dimensional images.
TPAMI, 1983.
[5] R. A. Brooks, R. Greiner, and T. O. Binford. The acronym model-based vision system. In IJCAI,
1979.
[6] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva,
S. Song, H. Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint
arXiv:1512.03012, 2015.
[7] W. Chen, H. Ling, J. Gao, E. Smith, J. Lehtinen, A. Jacobson, and S. Fidler. Learning to predict
3d objects with an interpolation-based differentiable renderer. In NeurIPS, 2019.
[8] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese. 3D-R2N2: A unified approach for single
and multi-view 3d object reconstruction. In ECCV, 2016.
[9] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common
multi-scale convolutional architecture. In ICCV, 2015.
[10] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a
multi-scale deep network. In NeurIPS, 2014.
[11] H. Fan, H. Su, and L. J. Guibas. A point set generation network for 3d object reconstruction
from a single image. In CVPR, 2017.
[12] M. Fey and J. E. Lenssen. Fast graph representation learning with PyTorch Geometric. In ICLR
Workshop on Representation Learning on Graphs and Manifolds, 2019.
[13] M. A. Fischler and R. A. Elschlager. The representation and matching of pictorial structures.
IEEE Transactions on computers, 1973.
[14] G. Gkioxari, J. Malik, and J. Johnson. Mesh R-CNN. In ICCV, 2019.
[15] H. Gouraud. Continuous shading of curved surfaces. IEEE transactions on computers, 100(6):
623–629, 1971.
[16] U. Grenander. Lectures in Pattern Theory I, II and III: Pattern Analysis, Pattern Synthesis and
Regular Structures. 1976-1981.

[17] T. Groueix, M. Fisher, V. G. Kim, B. C. Russell, and M. Aubry. A papier-mâché approach to
learning 3d surface generation. In CVPR, 2018.
[18] R. Hartley and A. Zisserman. Multiple view geometry in computer vision. Cambridge university
press, 2003.
[19] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR,
2016.
[20] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In ICCV, 2017.
[21] E. Insafutdinov and A. Dosovitskiy. Unsupervised learning of shape and pose with differentiable
point clouds. In NeurIPS, 2018.
[22] K. M. Jatavallabhula, E. Smith, J.-F. Lafleche, C. F. Tsang, A. Rozantsev, W. Chen, and
T. Xiang. Kaolin: A PyTorch library for accelerating 3d deep learning research. arXiv preprint
arXiv:1911.05063, 2019.
[23] J. Johnson, M. Douze, and H. Jégou. Billion-scale similarity search with gpus. arXiv preprint
arXiv:1702.08734, 2017.
[24] A. Kanazawa, S. Tulsiani, A. A. Efros, and J. Malik. Learning category-specific mesh recon-
struction from image collections. In ECCV, 2018.
[25] H. Kato, Y. Ushiku, and T. Harada. Neural 3D mesh renderer. In CVPR, 2018.
[26] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks.
In ICLR, 2017.
[27] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional
neural networks. In NeurIPS, 2012.
[28] S. Laine and T. Karras. High-performance software rasterization on gpus. In Proceedings of the
ACM SIGGRAPH Symposium on High Performance Graphics, 2011.
[29] C. Lassner. Fast differentiable raycasting for neural rendering using sphere-based representations.
arXiv preprint arXiv:2004.07484, 2020.
[30] T.-M. Li, M. Aittala, F. Durand, and J. Lehtinen. Differentiable monte carlo ray tracing through
edge sampling. ACM Transactions on Graphics (TOG), 2018.
[31] S. Liu, W. Chen, T. Li, and H. Li. Soft rasterizer: Differentiable rendering for unsupervised
single-view mesh reconstruction. In ICCV, 2019.
[32] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation.
In CVPR, 2015.
[33] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. SMPL: A skinned
multi-person linear model. ACM transactions on graphics (TOG), 2015.
[34] M. M. Loper and M. J. Black. OpenDR: An approximate differentiable renderer. In ECCV,
2014.
[35] D. G. Lowe et al. Fitting parameterized three-dimensional models to images. TPAMI, 1991.
[36] S. R. Marschner and D. P. Greenberg. Inverse rendering for computer graphics. Citeseer, 1998.
[37] L. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, and A. Geiger. Occupancy networks:
Learning 3d reconstruction in function space. In CVPR, 2019.
[38] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng. Nerf: Rep-
resenting scenes as neural radiance fields for view synthesis. arXiv preprint arXiv:2003.08934,
2020.
[39] C. Nash, Y. Ganin, S. Eslami, and P. W. Battaglia. Polygen: An autoregressive generative model
of 3d meshes. arXiv preprint arXiv:2002.10880, 2020.

[40] M. Nimier-David, D. Vicini, T. Zeltner, and W. Jakob. Mitsuba 2: a retargetable forward and
inverse renderer. ACM Transactions on Graphics (TOG), 2019.
[41] D. Novotny, N. Ravi, B. Graham, N. Neverova, and A. Vedaldi. C3dpo: Canonical 3d pose
networks for non-rigid structure from motion. In Proceedings of the IEEE International
Conference on Computer Vision, 2019.
[42] J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove. Deepsdf: Learning continuous
signed distance functions for shape representation. In CVPR, 2019.
[43] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin,
N. Gimelshein, L. Antiga, et al. PyTorch: An imperative style, high-performance deep learning
library. In NeurIPS, 2019.
[44] G. Patow and X. Pueyo. A survey of inverse rendering problems. In Computer graphics forum,
2003.
[45] B. T. Phong. Illumination for computer generated pictures. Communications of the ACM, 18(6):
311–317, 1975.
[46] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d
classification and segmentation. In CVPR, 2017.
[47] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. Pointnet++: Deep hierarchical feature learning on
point sets in a metric space. In NeurIPS, 2017.
[48] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with
region proposal networks. In NeurIPS, 2015.
[49] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy,
A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition
Challenge. IJCV, 2015.
[50] S. Saito, Z. Huang, R. Natsume, S. Morishima, A. Kanazawa, and H. Li. PIFu: Pixel-aligned
implicit function for high-resolution clothed human digitization. In ICCV, 2019.
[51] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspon-
dence algorithms. IJCV, 2002.
[52] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image
recognition. In ICLR, 2015.
[53] V. Sitzmann, M. Zollhöfer, and G. Wetzstein. Scene representation networks: Continuous
3d-structure-aware neural scene representations. In NeurIPS, 2019.
[54] E. J. Smith, S. Fujimoto, A. Romero, and D. Meger. Geometrics: Exploiting geometric structure
for graph-encoded objects. In ICML, 2019.
[55] M. Tatarchenko, A. Dosovitskiy, and T. Brox. Octree generating networks: Efficient convolu-
tional architectures for high-resolution 3d outputs. In ICCV, 2017.
[56] S. Tulsiani, T. Zhou, A. A. Efros, and J. Malik. Multi-view supervision for single-view
reconstruction via differentiable ray consistency. In CVPR, 2017.
[57] S. Tulsiani, R. Tucker, and N. Snavely. Layer-structured 3d scene inference via view synthesis.
In ECCV, 2018.
[58] J. Valentin, C. Keskin, P. Pidlypenskyi, A. Makadia, A. Sud, and S. Bouaziz. Tensorflow
graphics: Computer graphics meets deep learning. 2019.
[59] N. Wang, Y. Zhang, Z. Li, Y. Fu, W. Liu, and Y.-G. Jiang. Pixel2Mesh: Generating 3D mesh
models from single RGB images. In ECCV, 2018.
[60] O. Wiles, G. Gkioxari, R. Szeliski, and J. Johnson. SynSin: End-to-end view synthesis from a
single image. In CVPR, 2020.

[61] J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum. Learning a probabilistic latent space
of object shapes via 3d generative-adversarial modeling. In NeurIPS, 2016.
[62] G. Yang, X. Huang, Z. Hao, M.-Y. Liu, S. Belongie, and B. Hariharan. Pointflow: 3d point
cloud generation with continuous normalizing flows. In ICCV, 2019.
[63] W. Yifan, F. Serena, S. Wu, C. Öztireli, and O. Sorkine-Hornung. Differentiable surface splatting
for point-based geometry processing. ACM Transactions on Graphics (TOG), 2019.
[64] F. Yu, A. Seff, Y. Zhang, S. Song, T. Funkhouser, and J. Xiao. LSUN: Construction of
a large-scale image dataset using deep learning with humans in the loop. arXiv preprint
arXiv:1506.03365, 2015.
[65] B. Zhou, A. Khosla, A. Lapedriza, A. Torralba, and A. Oliva. Places: An image database for
deep scene understanding. TPAMI, 2017.
