Accelerating 3D Deep Learning with PyTorch3D
Abstract
1 Introduction
Over the past decade, deep learning has significantly advanced the ability of AI systems to process
2D image data. We can now build high-performing systems for tasks such as object [49, 27, 52, 19]
and scene [64, 65] classification, object detection [48], semantic [32] and instance [20] segmentation,
and human pose estimation [2]. These systems can operate on complex image data and have been
deployed in countless real-world settings. Though successful, these methods suffer from a common
shortcoming: they process 2D snapshots and ignore the true 3D nature of the world.
Extending deep learning into 3D can unlock many new applications. Recognizing objects in 3D point
clouds [46, 47] can enhance the sensing abilities of autonomous vehicles, or enable new augmented
reality experiences. Predicting depth [10, 9], or 3D shape [8, 11, 59, 41] can lift 2D images into 3D.
Generative models [61, 62, 39] might one day aid artists in authoring 3D content. Image-based tasks
like view synthesis can be improved with 3D representations given only 2D supervision [53, 60, 38].
Despite growing interest, 3D deep learning remains relatively underexplored.
We believe that some of this disparity is due to the significant engineering challenges involved in
3D deep learning. One such challenge is heterogeneous data. 2D images are almost universally
represented by regular pixel grids. In contrast, 3D data are stored in a variety of structured formats
including voxel grids [8, 55], point clouds [46, 11], and meshes [59, 14] which can exhibit per-
element heterogeneity. For example, meshes may differ in their number of vertices and faces, and
their topology. Such heterogeneity makes it difficult to efficiently implement batched operations on 3D
data using the tensor-centric primitives provided by standard deep learning toolkits like PyTorch [43].
∗ Equal contribution.
2 Related Work
3D deep learning libraries. There are a number of toolkits for 3D deep learning. [12] focuses on
learning on graphs, [58] provides differentiable graphics operators, [22] collects commonly used 3D
functions. However, they do not provide support for heterogeneous batching of 3D data, crucial for
large-scale learning, or modularity for differentiable rendering, crucial for exploration. PyTorch3D
introduces data structures that support batches of 3D data with varying sizes and topologies. This key
abstraction allows our 3D operators, including rendering, to operate on large heterogeneous batches.
Differentiable renderers. OpenDR [34] and NMR [25] perform traditional rasterization in the
forward pass and compute approximate gradients in the backward pass. More recently, SoftRas [31]
and DIB-R [7] propose differentiable renderers by viewing rasterization as a probabilistic process
where each pixel’s color depends on multiple mesh faces. Differentiable ray tracing methods, such as
Redner [30] and Mitsuba2 [40], give more photorealistic images at the expense of increased compute.
Differentiable point cloud rendering is explored in [21], which uses ray termination probabilities
and stores points in a voxel grid, limiting resolution. DSS [63] renders each point as a disk.
SynSin [60] also splats a per-point sphere from a soft z-buffer of fixed length. Most recently,
Pulsar [29] uses an unlimited z-buffer for rendering but only uses the first few points for gradient propagation.
Differentiable rendering is an active research area. PyTorch3D introduces a modular renderer, inspired
by [31], by redesigning and exposing intermediates computed during rasterization. Unlike other
differentiable renderers, users can easily customize the rendering pipeline with PyTorch shaders.
3D shape prediction. In Section 4 we experiment with unsupervised 3D shape prediction using the
differentiable silhouette and textured renderers for meshes and point clouds in PyTorch3D. There is a
vast line of work on 3D shape prediction, including two-view and multiview methods [51, 18], model-
based approaches [13, 5, 4, 35, 3, 33], and recent supervised deep methods that predict voxels [8, 55],
meshes [17, 59, 54, 14], point clouds [11] and implicit functions [37, 42, 50]. Differentiable renderers
allow for unsupervised shape prediction via 2D re-projection losses [56, 25, 24, 57, 31].
2 https://pytorch3d.org/
(a) Chamfer (b) Graph Conv (c) KNN (varying |P|) (d) KNN (varying K)
Figure 1: Benchmarks for our 3D operators with batch size 32. (a) L_cham(P, Q) for point clouds
with |P| = 1000 and heterogeneous, varying |Q|. (b) Graph convolution on heterogeneous mesh
batches with 128-dimensional features. (c, d) Our KNN vs Faiss [23] between homogeneous batches
of 3D point clouds P and Q with |P| = |Q|. In (c) K = 1, and in (d) |P| = |Q| = 50k; in both our
memory usage matches [23]. (a) and (b) are forward and backward; (c) and (d) are forward only.
This section describes the core features of PyTorch3D. For a 3D deep learning library to be effective,
3D operators need to be efficient when handling complex 3D data. We benchmark the speed and
memory usage of key PyTorch3D operators, comparing to pure PyTorch and existing open-source
implementations. We show that PyTorch3D achieves speedups up to 10×.
3D data structures. Working with minibatches of data is crucial in deep learning both for stable
optimization and computational efficiency. However operating on batches of 3D meshes and point
clouds is challenging due to heterogeneity: meshes may have varying numbers of vertices and faces,
and point clouds may have varying numbers of points. To overcome this challenge, PyTorch3D
provides data structures to manage batches of meshes and point clouds which allow conversion
between different tensor-based representations (list, packed, padded) needed for various operations.
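As a usage illustration, the sketch below shows the intended pattern with the Meshes and Pointclouds classes from pytorch3d.structures and their list/packed/padded accessors; the tensor sizes are purely illustrative.

    import torch
    from pytorch3d.structures import Meshes, Pointclouds

    # A heterogeneous batch: two meshes with different numbers of vertices and faces.
    verts_list = [torch.rand(30, 3), torch.rand(55, 3)]
    faces_list = [torch.randint(0, 30, (50, 3)), torch.randint(0, 55, (90, 3))]
    meshes = Meshes(verts=verts_list, faces=faces_list)

    verts_packed = meshes.verts_packed()    # (30 + 55, 3): concatenated, for per-vertex ops
    verts_padded = meshes.verts_padded()    # (2, 55, 3): zero-padded, for batched tensor ops
    verts_again  = meshes.verts_list()      # back to one tensor per mesh

    # Point clouds with varying numbers of points are handled analogously.
    clouds = Pointclouds(points=[torch.rand(100, 3), torch.rand(250, 3)])
    points_packed = clouds.points_packed()  # (350, 3)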
Implementation Details. We benchmark with meshes from ShapeNetCoreV1 [6] using homoge-
neous and heterogeneous batches. We form point clouds by sampling uniformly from mesh surfaces.
All results are averaged over 5 random batches and 10 runs per batch, and are run on a V100 GPU.
3.1 3D operators
We report time and memory usage for a representative set of popular 3D operators, namely Chamfer
loss, graph convolution and K nearest neighbors. Other 3D operators in PyTorch3D follow similar
trends. We compare to PyTorch and state-of-the-art open-source libraries.
Chamfer loss is a common metric that quantifies agreement between point clouds P and Q. Formally,
L_{\mathrm{cham}}(P, Q) = |P|^{-1} \sum_{(p,q)\in\Lambda_{P,Q}} \|p - q\|^2 + |Q|^{-1} \sum_{(q,p)\in\Lambda_{Q,P}} \|q - p\|^2 \qquad (1)
where Λ_{P,Q} = {(p, argmin_q ‖p − q‖) : p ∈ P} is the set of pairs (p, q) such that q ∈ Q is the nearest
neighbor of p ∈ P. A (homogeneously) batched implementation is straightforward in PyTorch, but is
inefficient since it requires forming a pairwise distance matrix with B × |P| × |Q| elements (where
B is the batch size). PyTorch3D avoids this inefficiency (and supports heterogeneity) by using our
efficient KNN to compute neighbors. Figure 1a compares ours against the naïve approach with
B = 32, |P| = 1000, and varying |Q|. The naïve approach runs out of memory for |Q| > 10k, while
ours scales to large point clouds and reduces time and memory use by more than 12×.
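To make the inefficiency concrete, here is a minimal pure-PyTorch sketch of the naïve homogeneous-batch Chamfer loss of Eq. 1; it materializes the full B × |P| × |Q| distance matrix, which is exactly the memory bottleneck described above (PyTorch3D's chamfer_distance instead relies on the KNN kernel and also accepts variable-length clouds).

    import torch

    def chamfer_naive(p, q):
        # p: (B, N, 3), q: (B, M, 3) homogeneous batches of point clouds.
        d2 = torch.cdist(p, q) ** 2                 # (B, N, M) squared pairwise distances
        p_to_q = d2.min(dim=2).values.mean(dim=1)   # |P|^-1 * sum of nearest-neighbor distances
        q_to_p = d2.min(dim=1).values.mean(dim=1)   # |Q|^-1 * sum in the other direction
        return p_to_q + q_to_p                      # (B,) per-example Chamfer loss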
Graph convolution [26] is commonly used for processing 3D meshes [59, 14]. Given feature vectors
f_v for each vertex v, it computes new features f'_v = W_0 f_v + Σ_{u∈N(v)} W_1 f_u, where N(v) are
the neighbors of v in the mesh and W_0, W_1 are learned weight matrices. PyTorch3D implements
graph convolution via a fused CUDA kernel for gather+scatter_add. Figure 1b shows that this
improves speed and memory use by up to 30% compared against a pure PyTorch implementation.
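For reference, the gather + scatter-add formulation can be written in a few lines of plain PyTorch on the packed vertex representation; this is a sketch of the computation that the fused kernel accelerates, not the kernel itself.

    import torch

    def graph_conv(verts_feats, edges, w0, w1):
        # verts_feats: (V, Din) packed vertex features
        # edges: (E, 2) packed undirected edges, each edge stored once
        # w0, w1: (Din, Dout) learned weight matrices
        msgs = verts_feats @ w1                              # W1 * f_u messages
        agg = torch.zeros_like(msgs)
        agg.index_add_(0, edges[:, 0], msgs[edges[:, 1]])    # gather neighbors, scatter-add to v
        agg.index_add_(0, edges[:, 1], msgs[edges[:, 0]])    # both directions of each edge
        return verts_feats @ w0 + agg                        # W0 * f_v + sum over neighbors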
(a) PyTorch3D rendering pipeline (b) Problems with differentiability in rendering
Figure 2: (a) The modular rendering pipeline in PyTorch3D and (b) The z- & xy-discontinuities in
traditional rasterization and the soft formulations [31] which enable differentiability.
K Nearest Neighbors for D-dimensional points are used in Chamfer loss, normal estimation, and
other point cloud operations. We implement exact KNN with custom CUDA kernels that natively
handle heterogeneous batches. Our implementation is tuned for D ≤ 4 and K ≤ 32, and uses
template metaprogramming to individually optimize each (D, K) pair. We compare against Faiss [23],
a fast GPU library for KNN that targets a different portion of the design space: it does not handle
batching, is optimized for high-dimensional descriptors (D ≈ 128), and scales to billions of points.
Figures 1c and 1d show that we outperform Faiss by up to 5× for batched 3D problems.
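For context, an exact brute-force KNN for small D can be written directly in PyTorch as below; this reference version materializes the full pairwise distance matrix, which our custom kernels (and, for heterogeneous batches, pytorch3d.ops.knn_points) avoid.

    import torch

    def knn_bruteforce(p, q, K):
        # p: (B, N, D), q: (B, M, D) homogeneous batches of D-dimensional points.
        # Returns squared distances and indices of the K nearest neighbors in q for each point in p.
        d2 = torch.cdist(p, q) ** 2                     # (B, N, M) pairwise squared distances
        dists, idx = d2.topk(K, dim=-1, largest=False)  # (B, N, K)
        return dists, idx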
A renderer inputs scene information (camera, geometry, materials, lights, textures) and outputs an
image. A differentiable renderer can also propagate gradients backward from rendered images to
scene information [34], allowing rendering to be embedded into deep learning pipelines [25, 31].
PyTorch3D includes a differentiable renderer that operates on heterogeneous batches of triangle
meshes. Our renderer follows three core design principles: differentiability, meaning that it computes
gradients with respect to all inputs; efficiency, meaning that it runs quickly and scales to large meshes
and images; and modularity, meaning that users can easily replace components of the renderer to
customize its functionality to their use case and experiment with alternate formulations.
As shown in Figure 2a, our renderer has two main components: the rasterizer selects the faces affect-
ing each pixel, and the shader computes pixel colors. Through careful design of these components,
we improve efficiency and modularity compared to prior differentiable renderers [25, 31, 7].
Rasterizer. The rasterizer first uses a camera to transform meshes from world to view coordinates.
Cameras are Python objects and compute gradients via autograd; this aids modularity, as users can
easily implement new camera models other than our provided orthographic and perspective cameras.
Next, the core rasterization algorithm finds triangles that intersect each pixel. In traditional rasteriza-
tion, each pixel is influenced only by its nearest face along the z-axis. As shown in Figure 2b, this
can cause step changes in pixel color as faces move along the z-axis (due to occlusion) and in the
xy-plane (due to face boundaries). Following [31] we soften these nondifferentiabilities by blending
the influence of multiple faces for each pixel, and decaying a face’s influence toward its boundary.
Our rasterizer departs from [31] in three ways to improve efficiency and modularity. First, in [31],
pixels are influenced by every face they intersect in the xy-plane; in contrast we constrain pixels to
be influenced by only the nearest K faces along the z-axis, computed using per-pixel priority queues.
Similar to traditional z-buffering, this lets us quickly discard many faces for each pixel, improving
efficiency. We show in Section 4 that this modification does not harm downstream task performance.
Second, [31] naïvely compares each pixel with each face. We improve efficiency using a two-pass
approach similar to [28], first working on image tiles to eliminate faces before moving to pixels.
Third, [31] fuses rasterization and shading into a monolithic CUDA kernel. We decouple these, and
as shown in Figure 2a our rasterizer returns Fragment data about the K nearest faces to each pixel:
face ID, barycentric coordinates of the pixel in the face, and (signed) pixel-to-face distances along
the z-axis and in the xy-plane. This allows shaders to be implemented separately from the rasterizer,
significantly improving modularity. This change also improves efficiency, as cached Fragment data
can be used to avoid costly recomputation of face-pixel intersections in the backward pass.

(a) Silhouette homogeneous (b) Silhouette heterogeneous (c) Texture heterogeneous
Figure 3: Benchmarks for silhouette and textured rendering for PyTorch3D and SoftRas [31]. We use
a batch size of 8, two image sizes (64 & 256) and two values for the number of faces per pixel, K = 10
& 50 (for PyTorch3D only). All benchmarks are for forward and backward.
Shaders consume the Fragment data produced by the rasterizer, and compute pixel values of the
rendered image. They typically involve two stages: first computing K values for the pixel (one for
each face identified by the Fragment data), then blending them to give a final pixel value.
Shaders are Python objects, and Fragment data are stored in PyTorch tensors. Shaders can thus work
with Fragment data using standard PyTorch operators, and compute gradients via autograd. This design
is highly modular, as users can easily implement new shaders to customize the renderer. For example,
Algorithm 1 implements the silhouette renderer from [31] using a two-line shader: dists is a tensor
of shape B × H × W × K giving signed distances in the xy-plane from each pixel to its K nearest faces
(part of the Fragment data), and sigma is a hyperparameter. This is simpler than [31], where silhouette
rendering is one path in a monolithic CUDA kernel and gradients are manually computed. Similarly,
Algorithm 2 implements the softmax blending algorithm from [31] for textured rendering. dists and
zbuf are part of the Fragment data, and top_k_colors_per_pixel is the output from the shader.
zfar, znear, sigma and gamma are hyperparameters defined by the user.

Algorithm 1: Silhouette blending
    prob = (-dists / sigma).sigmoid()
    alpha = 1 - (1 - prob).prod(dim=-1)

Algorithm 2: Softmax blending
    prob = (-dists / sigma).sigmoid()
    zinv = (zfar - zbuf) / (zfar - znear)
    zinv_max = torch.max(zinv, dim=-1).values
    weights = prob * ((zinv - zinv_max) / gamma).exp()
    weights = weights / weights.sum(dim=-1)
    image = (weights * top_k_colors_per_pixel).sum(dim=-2)
Shaders can implement complex effects using the Fragment data from the rasterizer. Face IDs can be
used to fetch per-face data like normals, colors, or texture coordinates; barycentric coordinates can be
used to interpolate data over the face; xy and z distances can be used to blend the influence of faces
in different ways. Crucially, all texturing, lighting, and blending logic can be written using PyTorch
operators, and differentiated using autograd. We provide a variety of shaders implementing silhouette
rendering, flat, Gouraud [15], and Phong [45] shading with per-vertex colors or texture coordinates,
and which blend colors using hard assignment (similar to [25]) or softmax blending (like [31]).
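As a usage illustration, the following sketch assembles a silhouette renderer from these components. The class and argument names (FoVPerspectiveCameras, RasterizationSettings, MeshRasterizer, MeshRenderer, SoftSilhouetteShader, BlendParams) follow recent PyTorch3D releases; exact signatures may differ between versions.

    import torch
    from pytorch3d.renderer import (
        FoVPerspectiveCameras, RasterizationSettings, MeshRasterizer,
        MeshRenderer, SoftSilhouetteShader, BlendParams,
    )

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    cameras = FoVPerspectiveCameras(device=device)
    raster_settings = RasterizationSettings(
        image_size=256,        # H = W = 256
        blur_radius=1e-4,      # controls the soft xy-boundary of each face
        faces_per_pixel=50,    # K nearest faces kept per pixel
    )
    renderer = MeshRenderer(
        rasterizer=MeshRasterizer(cameras=cameras, raster_settings=raster_settings),
        shader=SoftSilhouetteShader(blend_params=BlendParams(sigma=1e-4, gamma=1e-4)),
    )
    # silhouettes = renderer(meshes)  # (B, H, W, 4); the alpha channel holds the soft silhouette

Swapping in a different shader (for example a Phong shader with per-vertex colors) or a custom camera only changes the corresponding argument, which is the modularity the design aims for.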
Performance. In Figure 3 we benchmark the speed and memory usage of our renderer against
SoftRas [31]. We implement shaders to reproduce their silhouette rendering and textured mesh
rendering using per-vertex textures and Gouraud shading. Ours is significantly faster, especially
for large meshes, higher-resolution images, and heterogeneous batches: for textured rendering of
heterogeneous batches of meshes with mean 50k faces each at 256 × 256, our renderer is more than 4×
faster than [31]. Our renderer uses more GPU memory than [31] since we explicitly store Fragment
data. However, our absolute memory use (≈ 2GB for textured rendering at 256 × 256) is small compared to modern
GPU capacity (32GB for V100); we believe our improved modularity offsets our memory use.
(a) Alpha homogeneous (b) Alpha heterogeneous (c) Norm heterogeneous
Figure 4: Benchmarks for PyTorch3D's point cloud renderer with Alpha and Norm weighted compositing.
We use a batch size of 8, two image sizes (64 & 256) and three values for the number of points
per pixel K (10, 50, 150). All benchmarks are for forward and backward.
PyTorch3D also provides an efficient and modular point cloud renderer following the same design
as the mesh renderer. It is similarly factored into a rasterizer that finds the K-nearest points to
each pixel along the z-direction, and shaders written in PyTorch that consume fragment data from
the rasterizer to compute pixel colors. We provide shaders for silhouette and textured point cloud
rendering, and users can easily implement custom shaders to customize the rendering pipeline. Like
the mesh renderer, the point cloud renderer natively supports heterogeneous batches of points.
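A corresponding usage sketch for the point cloud pipeline, again using class names from recent PyTorch3D releases (PointsRasterizationSettings, PointsRasterizer, PointsRenderer, AlphaCompositor) whose exact signatures may vary:

    import torch
    from pytorch3d.renderer import (
        FoVOrthographicCameras, PointsRasterizationSettings,
        PointsRasterizer, PointsRenderer, AlphaCompositor,
    )

    cameras = FoVOrthographicCameras()
    raster_settings = PointsRasterizationSettings(
        image_size=256,       # output image resolution
        radius=0.01,          # screen-space radius of each splatted point
        points_per_pixel=50,  # K nearest points kept per pixel
    )
    renderer = PointsRenderer(
        rasterizer=PointsRasterizer(cameras=cameras, raster_settings=raster_settings),
        compositor=AlphaCompositor(),  # or NormWeightedCompositor() for Norm blending
    )
    # images = renderer(pointclouds)  # pointclouds: a Pointclouds batch carrying per-point features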
Our point cloud renderer uses a similar strategy as our mesh renderer for overcoming the non-
differentiabilities discussed in Figure 2b. Each point is splatted to a circular region in screen-space
whose opacity decreases away from the region’s center. The value of each pixel is computed by
blending information for the K-nearest points in the z-axis whose splatted regions overlap the pixel.
In our experiments we consider two blending methods: Alpha-compositing and Normalized weighted
sums. Suppose a pixel is overlapped by the splats from K points with opacities α_1, ..., α_K ∈ [0, 1]
sorted in increasing z-order, and the points are associated with feature vectors f_1, ..., f_K ∈ R^D.
Features might be boolean (for silhouette rendering), RGB colors (for textured rendering), or neural
features [21, 60]. The blending methods compute features f_Alpha, f_Norm ∈ R^D for the pixel:

f_{\mathrm{Alpha}} = \sum_{i=1}^{K} \alpha_i \Big( \prod_{j=1}^{i-1} (1 - \alpha_j) \Big) f_i, \qquad f_{\mathrm{Norm}} = \sum_{i=1}^{K} \alpha_i f_i \Big/ \sum_{i=1}^{K} \alpha_i \qquad (2)
Alpha-compositing uses the depth ordering of points so that nearer points contribute more, while
Norm ignores the depth order. Both blending functions are differentiable and can propagate gradients
from pixel features backward to both point features and opacities. They can be implemented with a
few lines of PyTorch code similar to Algorithms 1 and 2.
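A minimal PyTorch sketch of the two blending functions in Eq. 2, assuming opacities sorted by increasing depth and per-point features already gathered per pixel (the tensor names are illustrative):

    import torch

    def alpha_composite(alphas, feats):
        # alphas: (..., K) opacities sorted by increasing z; feats: (..., K, D) per-point features
        trans = torch.cumprod(1.0 - alphas, dim=-1)                   # prod_{j<=i} (1 - alpha_j)
        trans = torch.cat([torch.ones_like(trans[..., :1]),           # shift to an exclusive product
                           trans[..., :-1]], dim=-1)
        return (alphas * trans).unsqueeze(-1).mul(feats).sum(dim=-2)  # (..., D)

    def norm_weighted_sum(alphas, feats, eps=1e-10):
        weights = alphas / alphas.sum(dim=-1, keepdim=True).clamp(min=eps)
        return weights.unsqueeze(-1).mul(feats).sum(dim=-2)           # (..., D)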
We benchmark our point cloud renderer by sampling points from the surface of random ShapeNet
meshes, then rendering silhouettes using our two blending functions. We vary the point cloud size,
points per pixel (K = 10, 50, 150), and image size (64, 256). Results are shown in Figure 4.
Our renderer is efficient: rendering a batch of 8 point clouds with 200k points each to a batch
of 256 × 256 images with K = 50 points per pixel takes about 75ms and uses just over 1GB of
GPU memory, making it feasible to use the renderer as a differentiable layer when training neural
networks. Comparing Figures 4a and 4b shows similar performance when rendering homogeneous and
heterogeneous batches of comparable size. Comparing Figures 4b and 4c shows that both blending
methods have similar memory usage, but Norm is up to 25% faster for large K since it omits the
inner cumulative product. Point cloud rendering is generally more efficient than mesh rendering since
it requires fewer computations per primitive during rasterization.
Figure 5: Qualitative results on ShapeNet test. (a) Shape predictions for Sphere FC, Sphere GCN,
High Res Sphere GCN and Voxel GCN, alongside the input image. (b) Textured predictions for
Sphere GCN and Voxel GCN with Flat, Phong and Gouraud shading.
4 Experiments
Extending supervised learning into 3D is challenging due to the difficulty of obtaining 3D annotations.
Extracting 3D structure via weakly supervised or unsupervised approaches can unlock exciting applications, such as
novel view synthesis, 3D content creation for AR/VR, and more. Differentiable rendering makes
3D inference via 2D supervision possible. In this section, we experiment with unsupervised 3D
shape prediction using PyTorch3D. At test time, models predict an object’s 3D shape (point cloud or
mesh) from a single RGB image. During training they receive no 3D supervision, instead relying
on re-projection losses via differentiable rendering. We compare to SoftRas [31] and demonstrate
superior shape prediction and speed. The efficiency of our renderer allows us to scale to larger images
and more complex meshes, setting a new state-of-the-art for unsupervised 3D shape prediction.
Dataset. We experiment on ShapeNetCoreV1 [6], using the rendered images and train/test splits from
[8]. Images are 137×137, and portray instances from 13 object categories from various viewpoints.
There are roughly 840K train and 210K test images; we reserve 5% of train images for validation.
Metrics. We follow [59, 14] for evaluating 3D meshes. We sample 10k points uniformly at random
from the surface of predicted and ground-truth meshes. These are compared using Chamfer distance,
normal consistency, and F1^τ score for various distance thresholds τ. Refer to [14] for more details.
To fairly extend evaluation to point cloud models, we predict 10k points per cloud.
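For concreteness, a sketch of how these metrics can be computed with PyTorch3D operators is shown below; it assumes knn_points returns squared distances and uses the standard F1 definition (harmonic mean of precision and recall at threshold τ), so details may differ slightly from the exact evaluation code of [14].

    import torch
    from pytorch3d.ops import sample_points_from_meshes, knn_points

    def f1_at_tau(pred_points, gt_points, tau):
        # pred_points: (B, N, 3), gt_points: (B, M, 3) point samples; tau: distance threshold
        d_pred_to_gt = knn_points(pred_points, gt_points, K=1).dists[..., 0].sqrt()  # (B, N)
        d_gt_to_pred = knn_points(gt_points, pred_points, K=1).dists[..., 0].sqrt()  # (B, M)
        precision = (d_pred_to_gt < tau).float().mean(dim=1)
        recall = (d_gt_to_pred < tau).float().mean(dim=1)
        return 2 * precision * recall / (precision + recall).clamp(min=1e-8)

    # pred_pts = sample_points_from_meshes(pred_meshes, num_samples=10000)
    # gt_pts = sample_points_from_meshes(gt_meshes, num_samples=10000)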
Net | Voxel superv. | Mesh superv. | Render size | Engine | Ch. (↓) | Nrml | F1^0.1 | F1^0.3 | F1^0.5 | Verts | Faces
Sphere FC | ✗ | ✗ | 64 | SoftRas | 1.475 | 0.691 | 25.5 | 68.4 | 82.3 | 642 | 1280
Sphere FC | ✗ | ✗ | 64 | PyTorch3D | 0.989 | 0.696 | 26.4 | 69.9 | 83.5 | 642 | 1280
Sphere FC | ✗ | ✗ | 128 | SoftRas | 0.346 | 0.700 | 26.1 | 70.6 | 85.2 | 642 | 1280
Sphere FC | ✗ | ✗ | 128 | PyTorch3D | 0.313 | 0.699 | 27.6 | 72.5 | 86.6 | 642 | 1280
Sphere GCN | ✗ | ✗ | 64 | SoftRas | 0.316 | 0.713 | 24.4 | 70.2 | 85.8 | 642 | 1280
Sphere GCN | ✗ | ✗ | 64 | PyTorch3D | 0.296 | 0.703 | 24.8 | 71.3 | 86.5 | 642 | 1280
Sphere GCN | ✗ | ✗ | 128 | SoftRas | 0.301 | 0.709 | 26.1 | 71.9 | 86.5 | 642 | 1280
Sphere GCN | ✗ | ✗ | 128 | PyTorch3D | 0.293 | 0.709 | 26.6 | 72.6 | 86.9 | 642 | 1280
High Res Sphere GCN | ✗ | ✗ | 128 | PyTorch3D | 0.281 | 0.696 | 26.7 | 73.8 | 87.8 | 2562 | 5120
Voxel GCN | ✓ | ✗ | 64 | SoftRas | 0.293 | 0.656 | 24.5 | 71.1 | 87.2 | 1947±923 | 3895±1851
Voxel GCN | ✓ | ✗ | 64 | PyTorch3D | 0.267 | 0.675 | 26.1 | 73.3 | 88.5 | 1932±935 | 3866±1873
Voxel GCN | ✓ | ✗ | 128 | SoftRas | 0.276 | 0.675 | 26.2 | 72.6 | 87.9 | 1918±928 | 3837±1860
Voxel GCN | ✓ | ✗ | 128 | PyTorch3D | 0.277 | 0.687 | 26.2 | 73.4 | 87.8 | 1951±949 | 3903±1901
Voxel Only [14] | ✓ | ✗ | n/a | n/a | 0.916 | 0.595 | 7.70 | 33.1 | 54.9 | 2433±925 | 4877±1856
Mesh R-CNN [14] | ✓ | ✓ | n/a | n/a | 0.171 | 0.713 | 35.1 | 82.6 | 93.2 | 2292±902 | 4598±1812
Table 1: Mesh reconstruction via silhouette rendering on ShapeNet test with PyTorch3D and SoftRas [31].
We compare to state-of-the-art Mesh R-CNN [14], trained with voxel & mesh supervision, and its Voxel
Only variant, trained with voxel supervision. We highlight best metrics for models trained without any
3D supervision (blue) and with voxel supervision (red).
Net | Render size | Shading | Chamfer (↓) | Normal | F1^0.1 | F1^0.3 | F1^0.5 | L1^fg (↓) | L1^bg (↓)
Sphere GCN | 64 | Flat | 0.315 | 0.689 | 24.1 | 70.9 | 86.3 | 0.0217 | 0.0031
Sphere GCN | 64 | Phong | 0.309 | 0.703 | 24.1 | 70.2 | 85.9 | 0.0173 | 0.0023
Sphere GCN | 64 | Gouraud | 0.302 | 0.702 | 24.4 | 70.9 | 86.6 | 0.0180 | 0.0024
Voxel GCN | 64 | Flat | 0.270 | 0.678 | 26.4 | 73.0 | 88.2 | 0.0128 | 0.0020
Voxel GCN | 64 | Phong | 0.272 | 0.694 | 25.7 | 72.4 | 87.9 | 0.0126 | 0.0021
Voxel GCN | 64 | Gouraud | 0.302 | 0.687 | 24.3 | 69.8 | 86.2 | 0.0128 | 0.0021
Table 2: Mesh and texture reconstruction via textured rendering on ShapeNet test. L1^fg & L1^bg
measure the reconstruction accuracy of the foreground object and background, respectively.
and heterogeneous batches, we train a model which makes voxel predictions and refines them via a
sequence of graph convs. At train time, this model uses coarse 48^3 voxel supervision similar to [14],
but unlike [14] it does not use 3D mesh supervision.
All models minimize the objective L = L_s + λ_l L_l + λ_e L_e. L_s is the negative intersection over
union between the rendered and ground truth 2D silhouettes, as in [31]. L_l and L_e are Laplacian and edge
length mesh regularizers, respectively, ensuring that meshes are smooth. We set λ_l = 19 and λ_e = 0.2.
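A sketch of this objective, using the mesh regularizers available in pytorch3d.loss (mesh_laplacian_smoothing, mesh_edge_loss) and one common formulation of the negative-IoU silhouette term; the exact loss formulations in [31] and [14] may differ in detail.

    import torch
    from pytorch3d.loss import mesh_laplacian_smoothing, mesh_edge_loss

    def silhouette_neg_iou(pred, gt, eps=1e-6):
        # pred, gt: (B, H, W) soft silhouettes in [0, 1]
        inter = (pred * gt).sum(dim=(1, 2))
        union = (pred + gt - pred * gt).sum(dim=(1, 2))
        return (1.0 - inter / union.clamp(min=eps)).mean()

    def training_loss(pred_silh, gt_silh, meshes, lambda_l=19.0, lambda_e=0.2):
        Ls = silhouette_neg_iou(pred_silh, gt_silh)
        Ll = mesh_laplacian_smoothing(meshes)   # Laplacian smoothness regularizer
        Le = mesh_edge_loss(meshes)             # edge-length regularizer
        return Ls + lambda_l * Ll + lambda_e * Le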
All model variants use a ResNet50 [19] backbone, initialized with ImageNet weights. We follow the
training schedule from [14]; we use Adam with a constant learning rate of 10^-4 for 25 epochs. We
use a batch size of 64 across 8 V100 GPUs (8 images per GPU). Sphere GCN and Voxel GCN use a
sequence of 3 graph convs, each with 512 dimensions. The voxel head in Voxel GCN predicts 48^3
voxels via a 4-layer CNN, identical to [14]. For all models, the inputs are 137×137 images from [8].
We report performance using our mesh renderer and compare to SoftRas [31].
Table 1 shows our results. We compare to the supervised state-of-the-art Mesh R-CNN [14] and
its Voxel Only variant, the latter being a direct comparison to Voxel GCN as both use the same 3D
supervision – coarse voxels but no meshes. We render at 64×64, similar to SoftRas [31], and push to
even higher rendering resolution at 128×128. Figure 5a compares the models qualitatively.
From Table 1 we observe: (a) Compared to SoftRas, PyTorch3D achieves on-par or better performance
across models. This validates the design of the PyTorch3D renderer and shows that rendering the
K closest faces, instead of all faces, does not hurt performance. (b) Even though Sphere FC & Sphere
GCN deform the same sphere, Sphere GCN is superior across renderers. Figure 5a backs this claim
qualitatively. Unlike Sphere GCN, Sphere FC is sensitive to rendering size (row 4 vs 6). (c) High
Res Sphere GCN significantly outperforms Sphere GCN both quantitatively (row 10 vs 11) and
qualitatively (Figure 5a), showing the advantages of PyTorch3D's scale efficiency. (d) Voxel GCN
significantly outperforms Voxel Only, both trained with the same 3D supervision. As mentioned
in [14], Voxel Only performs poorly, since voxel predictions are coarse and fail to capture fine shapes.
Voxel GCN improves all metrics and reconstructs fine structures with complex topologies, as shown
in Figure 5a. Here, PyTorch3D's efficiency results in a 2× training speedup compared to SoftRas.
Varying K. As described in Section 3, the PyTorch3D renderer exposes the K nearest faces per
pixel. We experiment with different values of K for Sphere GCN & Voxel GCN at a 128×128 resolution
in Table 3. We observe that K=50 results in the best performance for both models (Chamfer 0.293
for Sphere GCN & 0.277 for Voxel GCN). More interestingly, a small value of K (=20) works well
for smaller meshes (Sphere GCN) but results in a performance drop for larger meshes (Voxel GCN)
with the same output image size. This is expected since, for the same K, a larger mesh has fewer faces
per pixel rendered in proportion to its size. Finally, increasing K does not improve models
further. This empirically validates our design of rendering a fixed finite number of faces per pixel.

Net | K | Ch. (↓) | Nrml | F1^0.1 | F1^0.3 | F1^0.5
Sphere GCN | 20 | 0.294 | 0.697 | 27.2 | 72.2 | 86.8
Sphere GCN | 50 | 0.293 | 0.709 | 26.6 | 72.6 | 86.9
Sphere GCN | 100 | 0.294 | 0.708 | 26.9 | 72.3 | 86.9
Sphere GCN | 150 | 0.314 | 0.716 | 25.2 | 71.1 | 86.1
Voxel GCN | 20 | 0.317 | 0.642 | 24.1 | 68.4 | 85.3
Voxel GCN | 50 | 0.277 | 0.687 | 26.2 | 73.4 | 87.8
Voxel GCN | 100 | 0.282 | 0.669 | 25.5 | 71.6 | 87.3
Voxel GCN | 150 | 0.285 | 0.674 | 25.9 | 72.2 | 87.2
Table 3: Varying K for mesh reconstruction.

Render size | K | Ch. (↓) | F1^0.1 | F1^0.3 | F1^0.5
64 | 20 | 0.738 | 12.8 | 50.6 | 68.0
64 | 50 | 0.451 | 18.1 | 62.9 | 79.0
64 | 100 | 0.289 | 23.4 | 73.3 | 87.1
64 | 150 | 0.289 | 23.9 | 73.5 | 87.3
128 | 20 | 0.623 | 13.7 | 53.1 | 71.2
128 | 50 | 0.398 | 19.2 | 65.6 | 81.3
128 | 100 | 0.272 | 25.3 | 75.2 | 88.0
128 | 150 | 0.280 | 25.1 | 74.6 | 87.8
Table 4: Point cloud reconstruction via silhouette rendering on ShapeNet test for different rendering
resolutions and K.

Render size | K | Blend | Ch. (↓) | F1^0.1 | F1^0.3 | F1^0.5 | L1^fg (↓) | L1^bg (↓)
128 | 100 | Alpha | 0.275 | 25.2 | 74.7 | 87.8 | 0.0178 | 0.0051
128 | 100 | Norm | 0.268 | 25.7 | 75.4 | 88.1 | 0.0187 | 0.0027
Table 5: Point cloud and texture reconstruction via textured rendering on ShapeNet test with two
blending functions.

Net | 3D superv. | Ch. (↓) | F1^τ | F1^2τ
PSG [11] | ✓ | 0.593 | 48.6 | 69.8
Point Align | ✗ | 0.647 | 61.0 | 74.6
Table 6: Comparison of our unsupervised point cloud model, Point Align, to PSG [11] under the non
scale-normalized metric (Table 1 in [14]).

Figure 6: Point cloud predictions via (a) silhouette and (b) textured rendering on ShapeNet test.
Textured rendering. In addition to shapes, we reconstruct object textures by extending the above
models to predict per vertex (r, g, b) values using textured rendering. The models are trained with
an additional L1 loss between the rendered and the ground truth image. Table 2 shows our analysis.
Simultaneous shape and texture prediction is harder, yet our models achieve high reconstruction
quality for both shapes (compared to Table 1) and textures. Figure 5b shows qualitative results.
5 Broader Impact
In this paper we introduce PyTorch3D, a library for 3D research which is fully differentiable to enable
easy inclusion into existing deep learning pipelines, modular for fast experimentation and extension,
optimized and efficient to allow scaling to large 3D data sizes, and with heterogeneous batching
capabilities to support the variable shape topologies encountered in real world use cases.
Similar to PyTorch or TensorFlow, which provide fundamental tools for deep learning, PyTorch3D is a
library of building blocks optimized for 3D deep learning. Frameworks like PyTorch and PyTorch3D
provide a platform for solving a plethora of AI problems including semantic or synthesis tasks.
Data-driven solutions to such problems necessitate the community’s caution regarding their potential
impact, especially when these models are being deployed in the real world. In this work, our goal
is to provide students, researchers and engineers with the best possible tools to accelerate research
at the intersection of 3D and deep learning. To this end, we are committed to support and develop
PyTorch3D in adherence with the needs of the academic, research and engineering community.
Appendix
In the case of meshes, we experiment with three architectures for the 3D shape network, namely
Sphere FC, Sphere GCN and Voxel GCN. The first two deform an initial sphere template and use
no 3D shape supervision. The third deforms shape topologies cast by the voxel head and tests the
effectiveness of differentiable rendering for varying shapes and connectivities as predicted by a neural
network. These topologies are different and more diverse than human-defined or simple genus 0
shapes, allowing us to stress test our renderer. Voxel GCN uses coarse voxel supervision to train
the voxel head which learns to predict instance-specific, yet coarse, shapes. Note that Voxel GCN is
identical to Mesh R-CNN [14], but unlike Mesh R-CNN it uses no mesh supervision. The details of
the architectures for all three models are shown in Figure 8. For each model, we show the network
modules as well as the shapes of the intermediate batched tensor outputs. Note that for Voxel GCN,
the shapes of the meshes vary within the batch due to the heterogeneity from the voxel predictions.
(a) Train time (b) Inference
Figure 7: System overview for unsupervised shape prediction. (a) Shows the 2-view training setup.
During training, an additional view with known rotation R and translation t relative to the input view is assumed.
The predicted shape is transformed by (R, t) and its rendered silhouette is compared against the
ground truth one. (b) Shows the system at inference time. The input is a single RGB image and the
output is the predicted shape in camera coordinates.
Figure 8: Network architectures and intermediate tensor shapes for (a) Sphere FC, (b) Sphere GCN and (c) Voxel GCN.
The first rows of Table 1 correspond to the experiments in SoftRas [31], which use a Sphere FC model to deform a sphere of 642 vertices and
1280 faces and render at 64×64. While PyTorch3D performs better under that setting, we see that
the absolute performance for both SoftRas (chamfer 1.475) and PyTorch3D (chamfer 0.989) is not
good enough. Going to higher rendering resolutions improves models (chamfer 0.346 with SoftRas
vs 0.313 with PyTorch3D). Sphere GCN, a more geometry-aware network architecture, performs
much better for all rendering resolutions and all rendering engines (chamfer 0.301 with SoftRas
vs chamfer 0.293 with PyTorch3D at 128×128). This is also evident in Figure 5a where we show
predictions for all models at a 128×128 rendering resolution. Sphere FC is able to capture a global
pose and appearance but fails to capture instance-specific shape details. Sphere GCN, which deforms
the same sized sphere template, is able to reconstruct instance-specific shapes, such as chair legs and
tables, much more accurately. We take advantage of PyTorch3D’s scale efficiency and use a larger
sphere where we immediately see improved reconstruction quality (chamfer 0.281 with PyTorch3D).
However, Sphere FC and Sphere GCN can only make predictions homeomorphic to spheres. This
forces them to produce face intersections and irregularly shaped faces in order to capture complex
shape topologies. Our Voxel GCN variant tests the ability of our renderer to make predictions of
any genus by refining shape topologies predicted by a neural network. The reconstruction quality
improves for Voxel GCN which captures holes and complex shapes while maintaining regular shaped
faces. More importantly, Voxel GCN improves the Voxel Only baseline, which also predicts voxels
but performs no further refinement. Even though both models are trained with the same supervision,
Voxel GCN achieves a chamfer of 0.267 compared to 0.916 for Voxel Only.
More visualizations. We show additional mesh predictions via silhouette rendering in Figure 9 and
via textured rendering in Figure 10. Our texture model is simple; we predict (r, g, b) texture values
per vertex. The shaders interpolate the vertex textures into face textures based on their algorithm and
the blending accumulates the face textures for each pixel. More sophisticated texture models could
involve directly sampling vertex textures from the input image or using GANs for natural looking
texture predictions. Despite its simplicity, our texture model does well at reconstructing textures.
Textures are more accurate for Voxel GCN mainly because of the regularly shaped and sized faces in
the predicted mesh. This is in contrast to Sphere GCN, where faces can intersect and vary greatly
in size in an effort to capture the complex object shape, which in turn affects the interpolated face
textures. In addition, simple shading functions, such as flat shading, visibly lead to worse textures, as
expected. More sophisticated shaders, like Phong and Gouraud, lead to better texture reconstructions.
Expanding on the brief description of the unsupervised point cloud prediction model in the main
paper, here we provide more details and analyze our results further. Our training and inference
procedures are similar to meshes; we follow the same 2-view train setup while at test time our model
only takes as input a single RGB image (Figure 7). Our Point Align model is similar to Sphere GCN.
We start from an initial point cloud of 10k points sampled randomly and uniformly from the surface
of a sphere. Each point samples features from the backbone using Vert Align. A sequence of fully
connected layers replaces graph convs (point clouds do not have connectivity patterns), resulting
in per-point offset predictions (and (r, g, b) values in the case of textured models). The network
architecture is shown in Figure 11.
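A sketch of this refinement head is shown below; it uses vert_align from pytorch3d.ops to sample image-aligned features at the point locations, and the layer sizes and the concatenation of point coordinates are illustrative assumptions rather than the exact architecture of Figure 11.

    import torch
    import torch.nn as nn
    from pytorch3d.ops import vert_align

    class PointAlignHead(nn.Module):
        """Refine an initial point cloud using image-aligned features and an MLP."""
        def __init__(self, feat_dim, hidden_dim=512):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(feat_dim + 3, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, 3),  # per-point xyz offsets
            )

        def forward(self, img_feats, points):
            # img_feats: (B, C, H, W) backbone features; points: (B, P, 3) current positions
            point_feats = vert_align(img_feats, points)   # (B, P, C) sampled features
            offsets = self.mlp(torch.cat([point_feats, points], dim=-1))
            return points + offsets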
Losses. Point Align is trained solely with L = L_s for silhouette rendering and an additional L1 loss
between the rendered and ground truth image for textured rendering. There are no shape regularizers.
Blending. In the case of textured rendering, we experiment with two blending (or compositing)
functions, Alpha and Norm.
More discussion on Table 4. Our point cloud evaluation in Table 4 is directly comparable to that of
meshes in Table 1; we use the same number of points, 10k, to compute chamfer and F1 for meshes (by
sampling points from the mesh surface) and point clouds (directly using the points in the cloud). From
the comparison with meshes, we observe that our unsupervised point cloud model leads to slightly
better reconstruction quality than meshes (chamfer 0.272 for Point Align vs 0.281 for High Res
Sphere GCN). This demonstrates that our point cloud renderer is effective for shape prediction. The quality
of our reconstructions is shown in Figure 12, which provides more shape and texture predictions by
our Point Align model with the Alpha and Norm compositors.
Figure 9: Shape reconstruction predictions via silhouette mesh rendering on ShapeNet test. We show
the input image (left) and, for each model (Sphere FC, Sphere GCN, High Res Sphere GCN, Voxel GCN),
the object shape prediction made by the model and an additional random view.
Figure 10: Shape and texture predictions via textured mesh rendering on the ShapeNet test set. We
show the input image (left). For each shading function (Flat, Phong, Gouraud), we show the shape
prediction and its textured rendering (with an additional view) as predicted by Sphere GCN and Voxel GCN.
Figure 11: The Point Align network architecture used for unsupervised point cloud predictions in our
experiments.
References
[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving,
M. Isard, et al. Tensorflow: A system for large-scale machine learning. In OSDI, 2016.
[2] R. Alp Güler, N. Neverova, and I. Kokkinos. Densepose: Dense human pose estimation in the
wild. In CVPR, 2018.
[3] V. Blanz and T. Vetter. Face recognition based on fitting a 3d morphable model. TPAMI, 2003.
[5] R. A. Brooks, R. Creiner, and T. O. Binford. The acronym model-based vision system. In IJCAI,
1979.
Figure 12: Point cloud and texture predictions via textured rendering on the ShapeNet test set. For
each example, we show the input image and, for each compositor (Alpha, Norm), the shape (top row)
and texture (bottom row) prediction as well as an additional view.
[7] W. Chen, H. Ling, J. Gao, E. Smith, J. Lehtinen, A. Jacobson, and S. Fidler. Learning to predict
3d objects with an interpolation-based differentiable renderer. In NeurIPS, 2019.
[8] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese. 3D-R2N2: A unified approach for single
and multi-view 3d object reconstruction. In ECCV, 2016.
[9] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common
multi-scale convolutional architecture. In ICCV, 2015.
[10] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a
multi-scale deep network. In NeurIPS, 2014.
[11] H. Fan, H. Su, and L. J. Guibas. A point set generation network for 3d object reconstruction
from a single image. In CVPR, 2017.
[12] M. Fey and J. E. Lenssen. Fast graph representation learning with PyTorch Geometric. In ICLR
Workshop on Representation Learning on Graphs and Manifolds, 2019.
[13] M. A. Fischler and R. A. Elschlager. The representation and matching of pictorial structures.
IEEE Transactions on computers, 1973.
[14] G. Gkioxari, J. Malik, and J. Johnson. Mesh R-CNN. In ICCV, 2019.
[15] H. Gouraud. Continuous shading of curved surfaces. IEEE transactions on computers, 100(6):
623–629, 1971.
[16] U. Grenander. Lectures in Pattern Theory I, II and III: Pattern Analysis, Pattern Synthesis and
Regular Structures. 1976-1981.
[17] T. Groueix, M. Fisher, V. G. Kim, B. C. Russell, and M. Aubry. A papier-mâché approach to
learning 3d surface generation. In CVPR, 2018.
[18] R. Hartley and A. Zisserman. Multiple view geometry in computer vision. Cambridge university
press, 2003.
[19] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR,
2016.
[20] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In ICCV, 2017.
[21] E. Insafutdinov and A. Dosovitskiy. Unsupervised learning of shape and pose with differentiable
point clouds. In NeurIPS, 2018.
[22] K. M. Jatavallabhula, E. Smith, J.-F. Lafleche, C. F. Tsang, A. Rozantsev, W. Chen, and
T. Xiang. Kaolin: A PyTorch library for accelerating 3d deep learning research. arXiv preprint
arXiv:1911.05063, 2019.
[23] J. Johnson, M. Douze, and H. Jégou. Billion-scale similarity search with gpus. arXiv preprint
arXiv:1702.08734, 2017.
[24] A. Kanazawa, S. Tulsiani, A. A. Efros, and J. Malik. Learning category-specific mesh recon-
struction from image collections. In ECCV, 2018.
[25] H. Kato, Y. Ushiku, and T. Harada. Neural 3D mesh renderer. In CVPR, 2018.
[26] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks.
In ICLR, 2017.
[27] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional
neural networks. In NeurIPS, 2012.
[28] S. Laine and T. Karras. High-performance software rasterization on gpus. In Proceedings of the
ACM SIGGRAPH Symposium on High Performance Graphics, 2011.
[29] C. Lassner. Fast differentiable raycasting for neural rendering using sphere-based representations.
arXiv preprint arXiv:2004.07484, 2020.
[30] T.-M. Li, M. Aittala, F. Durand, and J. Lehtinen. Differentiable monte carlo ray tracing through
edge sampling. ACM Transactions on Graphics (TOG), 2018.
[31] S. Liu, W. Chen, T. Li, and H. Li. Soft rasterizer: Differentiable rendering for unsupervised
single-view mesh reconstruction. In ICCV, 2019.
[32] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation.
In CVPR, 2015.
[33] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. SMPL: A skinned
multi-person linear model. ACM transactions on graphics (TOG), 2015.
[34] M. M. Loper and M. J. Black. OpenDR: An approximate differentiable renderer. In ECCV,
2014.
[35] D. G. Lowe et al. Fitting parameterized three-dimensional models to images. TPAMI, 1991.
[36] S. R. Marschner and D. P. Greenberg. Inverse rendering for computer graphics. Citeseer, 1998.
[37] L. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, and A. Geiger. Occupancy networks:
Learning 3d reconstruction in function space. In CVPR, 2019.
[38] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng. Nerf: Rep-
resenting scenes as neural radiance fields for view synthesis. arXiv preprint arXiv:2003.08934,
2020.
[39] C. Nash, Y. Ganin, S. Eslami, and P. W. Battaglia. Polygen: An autoregressive generative model
of 3d meshes. arXiv preprint arXiv:2002.10880, 2020.
[40] M. Nimier-David, D. Vicini, T. Zeltner, and W. Jakob. Mitsuba 2: a retargetable forward and
inverse renderer. ACM Transactions on Graphics (TOG), 2019.
[41] D. Novotny, N. Ravi, B. Graham, N. Neverova, and A. Vedaldi. C3dpo: Canonical 3d pose
networks for non-rigid structure from motion. In Proceedings of the IEEE International
Conference on Computer Vision, 2019.
[42] J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove. Deepsdf: Learning continuous
signed distance functions for shape representation. In CVPR, 2019.
[43] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin,
N. Gimelshein, L. Antiga, et al. PyTorch: An imperative style, high-performance deep learning
library. In NeurIPS, 2019.
[44] G. Patow and X. Pueyo. A survey of inverse rendering problems. In Computer graphics forum,
2003.
[45] B. T. Phong. Illumination for computer generated pictures. Communications of the ACM, 18(6):
311–317, 1975.
[46] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d
classification and segmentation. In CVPR, 2017.
[47] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. Pointnet++: Deep hierarchical feature learning on
point sets in a metric space. In NeurIPS, 2017.
[48] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with
region proposal networks. In NeurIPS, 2015.
[49] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy,
A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition
Challenge. IJCV, 2015.
[50] S. Saito, Z. Huang, R. Natsume, S. Morishima, A. Kanazawa, and H. Li. PIFu: Pixel-aligned
implicit function for high-resolution clothed human digitization. In ICCV, 2019.
[51] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspon-
dence algorithms. IJCV, 2002.
[52] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image
recognition. In ICLR, 2015.
[53] V. Sitzmann, M. Zollhöfer, and G. Wetzstein. Scene representation networks: Continuous
3d-structure-aware neural scene representations. In NeurIPS, 2019.
[54] E. J. Smith, S. Fujimoto, A. Romero, and D. Meger. Geometrics: Exploiting geometric structure
for graph-encoded objects. In ICML, 2019.
[55] M. Tatarchenko, A. Dosovitskiy, and T. Brox. Octree generating networks: Efficient convolu-
tional architectures for high-resolution 3d outputs. In ICCV, 2017.
[56] S. Tulsiani, T. Zhou, A. A. Efros, and J. Malik. Multi-view supervision for single-view
reconstruction via differentiable ray consistency. In CVPR, 2017.
[57] S. Tulsiani, R. Tucker, and N. Snavely. Layer-structured 3d scene inference via view synthesis.
In ECCV, 2018.
[58] J. Valentin, C. Keskin, P. Pidlypenskyi, A. Makadia, A. Sud, and S. Bouaziz. Tensorflow
graphics: Computer graphics meets deep learning. 2019.
[59] N. Wang, Y. Zhang, Z. Li, Y. Fu, W. Liu, and Y.-G. Jiang. Pixel2Mesh: Generating 3D mesh
models from single RGB images. In ECCV, 2018.
[60] O. Wiles, G. Gkioxari, R. Szeliski, and J. Johnson. SynSin: End-to-end view synthesis from a
single image. In CVPR, 2020.
[61] J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum. Learning a probabilistic latent space
of object shapes via 3d generative-adversarial modeling. In NeurIPS, 2016.
[62] G. Yang, X. Huang, Z. Hao, M.-Y. Liu, S. Belongie, and B. Hariharan. Pointflow: 3d point
cloud generation with continuous normalizing flows. In ICCV, 2019.
[63] W. Yifan, F. Serena, S. Wu, C. Öztireli, and O. Sorkine-Hornung. Differentiable surface splatting
for point-based geometry processing. ACM Transactions on Graphics (TOG), 2019.
[64] F. Yu, A. Seff, Y. Zhang, S. Song, T. Funkhouser, and J. Xiao. LSUN: Construction of
a large-scale image dataset using deep learning with humans in the loop. arXiv preprint
arXiv:1506.03365, 2015.
[65] B. Zhou, A. Khosla, A. Lapedriza, A. Torralba, and A. Oliva. Places: An image database for
deep scene understanding. TPAMI, 2017.