Large-Scale Point Cloud Semantic Segmentation With Superpoint Graphs
(a) RGB point cloud (b) Geometric partition (c) Superpoint graph (d) Semantic segmentation
Figure 1: Visualization of individual steps in our pipeline. An input point cloud (a) is partitioned into geometrically simple shapes, called superpoints (b). Based on this preprocessing, a superpoint graph (SPG) is constructed by linking nearby superpoints with superedges carrying rich attributes (c). Finally, superpoints are transformed into compact embeddings, processed with graph convolutions to make use of contextual information, and classified into semantic labels (d).
[...] In particular, we improve mean per-class intersection over union (mIoU) by 11.9 points for the Semantic3D reduced test set, by 8.8 points for the Semantic3D full test set, and by up to 12.4 points for the S3DIS dataset.

2. Related Work

The classic approach to large-scale point cloud segmentation is to classify each point or voxel independently using handcrafted features derived from their local neighborhood [48]. The solution is then spatially regularized using graphical models [37, 24, 34, 44, 21, 2, 38, 35, 49] or structured optimization [27]. Clustering as preprocessing [16, 13] or postprocessing [47] has been used by several frameworks to improve the accuracy of the classification.

Deep Learning on Point Clouds. Several approaches going beyond naive volumetric processing of point clouds have been proposed recently, notably set-based [39, 40], tree-based [42, 23], and graph-based [45]. However, very few methods with deep learning components have been demonstrated to be able to segment large-scale point clouds. PointNet [39] can segment large clouds with a sliding window approach, which constrains contextual information to a small area only. Engelmann et al. [10] improve on this by increasing the context scope with multi-scale windows or by directly considering neighboring window positions on a voxel grid. SEGCloud [46] handles large clouds by voxelization, followed by interpolation back to the original resolution and post-processing with a conditional random field (CRF). None of these approaches is able to consider fine details and long-range contextual information simultaneously. In contrast, our pipeline partitions point clouds in an adaptive way according to their geometric complexity and allows the deep learning architecture to use both fine details and interactions over long distances.

Graph Convolutions. A key step of our approach is using graph convolutions to spread contextual information. Formulations that are able to deal with graphs of variable sizes can be seen as a form of message passing over graph edges [12]. Of particular interest are models supporting continuous edge attributes [45, 36], which we use to represent interactions. In image segmentation, convolutions on graphs built over superpixels have been used for postprocessing: Liang et al. [32, 31] traverse such graphs in a sequential node order based on unary confidences to improve the final labels. We update graph nodes in parallel and exploit edge attributes for informative context modeling. Xu et al. [50] convolve information over graphs of object detections to infer their contextual relationships; our work infers relationships implicitly to improve segmentation results. Qi et al. [41] also rely on graph convolutions on 3D point clouds. However, we process large point clouds instead of small RGBD images, with nodes embedded in 3D instead of 2D, in a novel, richly attributed graph. Finally, we note that graph convolutions also bear functional similarity to deep learning formulations of CRFs [51], which we discuss further in Section 3.4.

3. Method

The main obstacle that our framework tries to overcome is the size of LiDAR scans. Indeed, they can reach hundreds of millions of points, making direct deep learning approaches intractable. The proposed SPG representation allows us to split the semantic segmentation problem into three distinct problems of different scales, shown in Figure 2, which can in turn be solved by methods of corresponding complexity:

1. Geometrically homogeneous partition: The first step of our algorithm is to partition the point cloud into geometrically simple yet meaningful shapes, called superpoints. This unsupervised step takes the whole point cloud as input, and therefore must be computationally very efficient. The SPG can be easily computed from this partition.

2. Superpoint embedding: Each node of the SPG corresponds to a small part of the point cloud corresponding to a geometrically simple primitive, which we assume to be semantically homogeneous. Such primitives can be reliably represented by downsampling small point clouds to at most hundreds of points. This small size allows us to utilize recent point cloud embedding methods such as PointNet [39].
Figure 2: Illustration of our framework on a toy scan of a table and a chair. We perform geometric partitioning on the point
cloud (a), which allows us to build the superpoint graph (b). Each superpoint is embedded by a PointNet network. The
embeddings are then refined in GRUs by message passing along superedges to produce the final labeling (c).
3. Contextual segmentation: The graph of superpoints is orders of magnitude smaller than any graph built on the original point cloud. Deep learning algorithms based on graph convolutions can then be used to classify its nodes, using rich edge features to facilitate long-range interactions.

The SPG representation allows us to perform end-to-end learning of the last two, trainable steps. We describe each step of our pipeline in the following subsections.

3.1. Geometric Partition with a Global Energy

In this subsection, we describe our method for partitioning the input point cloud into parts of simple shape. Our objective is not to retrieve individual objects such as cars or chairs, but rather to break down the objects into simple parts, as seen in Figure 3. However, as the clusters are geometrically simple, one can expect them to be semantically homogeneous as well, i.e. not to cover objects of different classes. Note that this step of the pipeline is purely unsupervised and makes no use of class labels beyond validation.

We follow the global energy model described by [13] for its computational efficiency. Another advantage is that the segmentation is adaptive to the local geometric complexity. In other words, the segments obtained can be large simple shapes such as roads or walls, as well as much smaller components such as parts of a car or a chair.

Let us consider the input point cloud C as a set of n 3D points. Each point i ∈ C is defined by its 3D position p_i and, if available, other observations o_i such as color or intensity. For each point, we compute a set of d_g geometric features f_i ∈ R^{d_g} characterizing the shape of its local neighborhood. In this paper, we use the three dimensionality values proposed by [9] (linearity, planarity and scattering) as well as the verticality feature introduced by [13]. We also compute the elevation of each point, defined as the z coordinate of p_i normalized over the whole input cloud.

The global energy proposed by [13] is defined with respect to the 10-nearest-neighbor adjacency graph G_nn = (C, E_nn) of the point cloud (note that this is not the SPG). The geometrically homogeneous partition is defined as the constant connected components of the solution of the following optimization problem:

  arg min_{g ∈ R^{d_g}}  Σ_{i ∈ C} ‖g_i − f_i‖²  +  µ Σ_{(i,j) ∈ E_nn} w_{i,j} [g_i − g_j ≠ 0],   (1)

where [·] is the Iverson bracket. The edge weight w ∈ R_+^{|E|} is chosen to be linearly decreasing with respect to the edge length. The factor µ is the regularization strength and determines the coarseness of the resulting partition.

The problem defined in Equation 1 is known as the generalized minimal partition problem, and can be seen as a continuous-space version of the Potts energy model, or an ℓ0 variant of the graph total variation. Since the minimized functional is nonconvex and noncontinuous, the problem cannot realistically be solved exactly for large point clouds. However, the ℓ0-cut pursuit algorithm introduced by [26] is able to quickly find an approximate solution with a few graph-cut iterations. In contrast to other optimization methods such as α-expansion [6], the ℓ0-cut pursuit algorithm does not require selecting the size of the partition in advance. The constant connected components S = {S_1, ..., S_k} of the solution of Equation 1 define our geometrically simple elements, and are referred to as superpoints (i.e. sets of points) in the rest of this paper.
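To make the per-point features concrete, the following is a minimal NumPy sketch of the dimensionality features (linearity, planarity, scattering) and a verticality proxy, computed from the eigenvalues of each local neighborhood's covariance matrix. The neighborhood size and the verticality approximation are our assumptions; [13] defines verticality through an eigenvalue-weighted unary vector, and the elevation feature is simply the normalized z coordinate.

```python
import numpy as np
from scipy.spatial import cKDTree

def dimensionality_features(points, k=10):
    """Per-point linearity, planarity, scattering and a verticality proxy
    from the local covariance eigenvalues (lambda1 >= lambda2 >= lambda3).
    A sketch of the geometric features used for the partition."""
    idx = cKDTree(points).query(points, k=k)[1]
    feats = np.empty((len(points), 4))
    for i, nbrs in enumerate(idx):
        w, v = np.linalg.eigh(np.cov(points[nbrs].T))
        l3, l2, l1 = w                            # eigh returns ascending order
        feats[i, 0] = (l1 - l2) / (l1 + 1e-12)    # linearity
        feats[i, 1] = (l2 - l3) / (l1 + 1e-12)    # planarity
        feats[i, 2] = l3 / (l1 + 1e-12)           # scattering
        # Verticality proxy: deviation of the surface normal (eigenvector of
        # the smallest eigenvalue) from the vertical axis.
        feats[i, 3] = 1.0 - abs(v[:, 0][2])
    return feats
```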
3.2. Superpoint Graph Construction

In this subsection, we describe how we compute the SPG as well as its key features. The SPG is a structured representation of the point cloud, defined as an oriented attributed graph G = (S, E, F) whose nodes are the set of superpoints S and whose edges E (referred to as superedges) represent the adjacency between superpoints. The superedges are annotated by a set of d_f features: F ∈ R^{E×d_f} characterizes the adjacency relationship between superpoints.

We define G_vor = (C, E_vor) as the symmetric Voronoi adjacency graph of the complete input point cloud, as defined by [20]. Two superpoints S and T are adjacent if there is at least one edge in E_vor with one end in S and one end in T:

  E = {(S, T) ∈ S² | ∃(i, j) ∈ E_vor ∩ (S × T)}.   (2)

Important spatial features associated with a superedge (S, T) are obtained from the set of offsets δ(S, T) = {p_i − p_j | (i, j) ∈ E_vor ∩ (S × T)} for edges in E_vor linking both superpoints.

Superedge features can also be derived by comparing the shape and size of the adjacent superpoints. To this end, we compute |S| as the number of points comprised in a superpoint S, as well as the shape features length(S) = λ1, surface(S) = λ1 λ2, and volume(S) = λ1 λ2 λ3, derived from the eigenvalues λ1, λ2, λ3 of the covariance of the positions of the points comprised in each superpoint, sorted in decreasing order. Table 1 lists the different superedge features used in this paper. Note that the break of symmetry in the edge features makes the SPG a directed graph.

Table 1: List of the d_f = 13 superedge features characterizing the adjacency between two superpoints S and T.

Feature name      | Size | Description
mean offset       | 3    | mean_{m ∈ δ(S,T)} δ_m
offset deviation  | 3    | std_{m ∈ δ(S,T)} δ_m
centroid offset   | 3    | mean_{i ∈ S} p_i − mean_{j ∈ T} p_j
length ratio      | 1    | log(length(S) / length(T))
surface ratio     | 1    | log(surface(S) / surface(T))
volume ratio      | 1    | log(volume(S) / volume(T))
point count ratio | 1    | log(|S| / |T|)
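To illustrate Table 1, here is a sketch of the d_f = 13 feature vector for one ordered superpoint pair (S, T). The argument layout is our assumption, with `offsets` holding the δ(S, T) vectors and `shape_S`, `shape_T` the (length, surface, volume) triplets derived from the covariance eigenvalues:

```python
import numpy as np

def superedge_features(pts_S, pts_T, offsets, shape_S, shape_T):
    """Concatenate the 13 superedge features of Table 1 (3+3+3+3+1)."""
    eps = 1e-12
    return np.concatenate([
        offsets.mean(axis=0),                      # mean offset
        offsets.std(axis=0),                       # offset deviation
        pts_S.mean(axis=0) - pts_T.mean(axis=0),   # centroid offset
        np.log((np.asarray(shape_S) + eps)
               / (np.asarray(shape_T) + eps)),     # length/surface/volume ratios
        [np.log(len(pts_S) / len(pts_T))],         # point count ratio
    ])
```

Note that swapping S and T flips the sign of the offsets and ratios, which is exactly the break of symmetry that makes the SPG a directed graph.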
3.3. Superpoint Embedding

The goal of this stage is to compute a descriptor for every superpoint S_i by embedding it into a vector z_i of fixed dimensionality d_z. Note that each superpoint is embedded in isolation; the contextual information required for its reliable classification is provided only in the following stage, by means of graph convolutions.

Several deep learning-based methods have been proposed for this purpose recently. We choose PointNet [39] for its remarkable simplicity, efficiency, and robustness. In PointNet, input points are first aligned by a Spatial Transformer Network [19], independently processed by multi-layer perceptrons (MLPs), and finally max-pooled to summarize the shape.

In our case, input shapes are geometrically simple objects, which can be reliably represented by a small number of points and embedded by a rather compact PointNet. This is important to limit the memory needed when evaluating many superpoints on current GPUs. In particular, we subsample superpoints on the fly down to n_p = 128 points to maintain efficient computation in batches and facilitate data augmentation. Superpoints of fewer than n_p points are sampled with replacement, which in principle does not affect the evaluation of PointNet due to its max-pooling. However, we observed that including very small superpoints of fewer than n_minp = 40 points in training harms the overall performance. Thus, the embedding of such superpoints is set to zero so that their classification relies solely on contextual information.

In order for PointNet to learn the spatial distribution of different shapes, each superpoint is rescaled to the unit sphere before embedding. Points are represented by their normalized position p′_i, observations o_i, and geometric features f_i (since these are already precomputed from the partitioning step). Furthermore, the original metric diameter of the superpoint is concatenated as an additional feature after PointNet max-pooling, in order to stay covariant with shape sizes.
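The following is a minimal PyTorch sketch of such a compact embedding network. The point-wise widths follow the architecture figure in Appendix A; the Spatial Transformer Network is omitted for brevity, the unit rescaling uses the bounding-box diagonal as the diameter, and the head widths are our assumptions:

```python
import torch
import torch.nn as nn

class SuperpointEmbedder(nn.Module):
    def __init__(self, d_in=11, d_z=32):
        super().__init__()
        widths = [d_in, 64, 64, 128, 128, 256]
        self.pointwise = nn.Sequential(*[m for a, b in zip(widths, widths[1:])
                                         for m in (nn.Linear(a, b), nn.ReLU())])
        self.head = nn.Sequential(nn.Linear(256 + 1, 64), nn.ReLU(),
                                  nn.Linear(64, d_z))

    def forward(self, pts):                # pts: (n_p, d_in), columns 0:3 = xyz
        xyz = pts[:, :3]
        diameter = (xyz.max(0).values - xyz.min(0).values).norm()
        xyz = (xyz - xyz.mean(0)) / diameter.clamp(min=1e-9)  # unit rescaling
        x = self.pointwise(torch.cat([xyz, pts[:, 3:]], dim=1))
        x = x.max(dim=0).values               # max-pool summarizes the shape
        x = torch.cat([x, diameter.view(1)])  # re-inject the metric diameter
        return self.head(x)
```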
3.4. Contextual Segmentation

The final stage of the pipeline is to classify each superpoint S_i based on its embedding z_i and its local surroundings within the SPG. Graph convolutions are naturally suited to this task. In this section, we explain the propagation model of our system.

Our approach builds on the ideas from Gated Graph Neural Networks [30] and Edge-Conditioned Convolutions (ECC) [45]. The general idea is that superpoints refine their embedding according to pieces of information passed along superedges. Concretely, each superpoint S_i maintains a hidden state in a Gated Recurrent Unit (GRU) [8]. The hidden state is initialized with the embedding z_i and is then processed over several iterations (time steps) t = 1, ..., T. At each iteration t, a GRU takes its hidden state h_i^(t) and an incoming message m_i^(t) as input, and computes its new hidden state h_i^(t+1). The incoming message m_i^(t) to superpoint i is computed as a weighted sum of the hidden states h_j^(t) of neighboring superpoints j. The actual weighting for a superedge (j, i) depends on its attributes F_{ji,·}, listed in Table 1. In particular, it is computed from the attributes by a multi-layer perceptron Θ, the so-called Filter Generating Network. Formally:

[...]

... an edge-specific weight vector and perform element-wise multiplication as in Equation 7 (ECC-VV). Channel mixing, albeit in an edge-unspecific fashion, is postponed to Equation 5. Finally, let us remark that Θ is shared over time iterations and that self-loops as proposed in [45] are not necessary, due to the existence of hidden states in GRUs.
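Since Equations 3–8 are not reproduced above, the following PyTorch sketch shows the overall propagation loop under our interpretation: hidden states initialized with z_i, messages formed by ECC-VV-style element-wise weighting of neighbors' states by Θ's output, aggregation over incoming superedges, and a GRU update. The mean aggregation and the MLP widths are assumptions of this sketch:

```python
import torch
import torch.nn as nn

class SPGPropagation(nn.Module):
    def __init__(self, d_hidden=32, d_edge=13):
        super().__init__()
        # Filter-generating network Theta: superedge attributes -> weights.
        self.theta = nn.Sequential(nn.Linear(d_edge, 64), nn.ReLU(),
                                   nn.Linear(64, d_hidden))
        self.gru = nn.GRUCell(d_hidden, d_hidden)

    def forward(self, z, edge_index, edge_attr, T=10):
        h = z
        src, dst = edge_index              # superedges as ordered pairs (j, i)
        w = self.theta(edge_attr)          # Theta is shared across iterations
        ones = torch.ones(dst.size(0), 1)
        for _ in range(T):
            msg = w * h[src]                                   # ECC-VV weighting
            agg = torch.zeros_like(h).index_add_(0, dst, msg)
            cnt = torch.zeros(h.size(0), 1).index_add_(0, dst, ones)
            h = self.gru(agg / cnt.clamp(min=1), h)            # refine state
        return h
```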
Training. [...] training with Kullback-Leibler [25] divergence loss. It performed slightly worse in our initial experiments, though.

Naive training on large SPGs may approach the memory limits of current GPUs. We circumvent this issue by randomly subsampling the sets of superpoints at each iteration and training on induced subgraphs, i.e. graphs composed of subsets of nodes and the original edges connecting them. Specifically, graph neighborhoods of order 3 are sampled to select at most 512 superpoints per SPG with more than n_minp points, as smaller superpoints are not embedded. Note that as the induced graph is a union of small neighborhoods, relationships over many hops may still be formed and learned. This strategy also doubles as data augmentation and a strong regularization, together with the randomized sampling of point clouds described in Section 3.3. Additional data augmentation is performed by randomly rotating superpoints around the vertical axis and jittering point features by Gaussian noise N(0, 0.01) truncated to [−0.05, 0.05].
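A sketch of the induced-subgraph sampling described above; `adjacency` maps each superpoint to its SPG neighbors, and the number of seeds is our assumption:

```python
import random

def sample_induced_subgraph(adjacency, max_nodes=512, order=3, n_seeds=8):
    """Select superpoints by order-3 breadth-first neighborhoods around
    random seeds, capped at max_nodes; the subgraph induced by this set
    keeps all original superedges between selected nodes."""
    selected = set()
    for seed in random.sample(list(adjacency), min(n_seeds, len(adjacency))):
        frontier = {seed}
        selected.add(seed)
        for _ in range(order):
            frontier = {n for f in frontier for n in adjacency[f]} - selected
            selected |= frontier
        if len(selected) >= max_nodes:
            break
    return set(list(selected)[:max_nodes])
```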
Testing. In modern deep learning frameworks, testing can be made very memory-efficient by discarding layer activations as soon as the follow-up layers have been computed. In practice, we were able to label full SPGs at once. To compensate for the randomness due to the subsampling of point clouds in PointNets, we average logits obtained over 10 runs with different seeds.

4. Experiments

We evaluate our pipeline on the two currently largest point cloud segmentation benchmarks, Semantic3D [14] and Stanford Large-Scale 3D Indoor Spaces (S3DIS) [3], on both of which we set the new state of the art. Furthermore, we perform an ablation study of our pipeline in Section 4.3. Even though the two data sets are quite different in nature (large outdoor scenes for Semantic3D, smaller indoor scans for S3DIS), we use nearly the same model for both. The deep model is rather compact, and 6 GB of GPU memory is enough for both testing and training. We refer to Appendix A for precise details on hyperparameter selection, architecture configuration, and training procedure.

Performance is evaluated using three metrics: per-class intersection over union (IoU), per-class accuracy (Acc), and overall accuracy (OA), defined as the proportion of correctly classified points. We stress that the metrics are computed on the original point clouds, not on superpoints.
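For clarity, a minimal sketch of these metrics computed from point-wise predictions on the original cloud:

```python
import numpy as np

def segmentation_metrics(pred, gt, n_classes):
    """Per-class IoU, per-class accuracy, overall accuracy and mIoU from
    integer label arrays; mIoU is the unweighted mean over classes."""
    conf = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(conf, (gt, pred), 1)                       # confusion matrix
    tp = np.diag(conf).astype(float)
    iou = tp / (conf.sum(0) + conf.sum(1) - tp + 1e-12)  # per-class IoU
    acc = tp / (conf.sum(1) + 1e-12)                     # per-class accuracy
    oa = tp.sum() / conf.sum()                           # overall accuracy
    return iou, acc, oa, iou.mean()
```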
4.1. Semantic3D

Semantic3D [14] is the largest available LiDAR dataset, with over 3 billion points from a variety of urban and rural scenes. Each point has RGB and intensity values (the latter of which we do not use). The dataset consists of 15 training scans and 15 test scans with withheld labels. We also evaluate on the reduced set of 4 subsampled scans, as is common in past work.

In Table 2, we provide the results of our algorithm compared to other recent state-of-the-art algorithms, and in Figure 3, we provide qualitative results of our framework. Our framework improves significantly on the state of the art of semantic segmentation for this data set, i.e. by nearly 12 mIoU points on the reduced set and by nearly 9 mIoU points on the full set. In particular, we observe a steep gain on the "artefact" class. This can be explained by the ability of the partitioning algorithm to detect artifacts due to their singular shape, while they are hard to capture using snapshots, as suggested by [5]. Furthermore, these small objects are often merged with the road when performing spatial regularization.

4.2. Stanford Large-Scale 3D Indoor Spaces

The S3DIS dataset [3] consists of 3D RGB point clouds of six floors from three different buildings, split into individual rooms. We evaluate our framework following the two dominant strategies found in previous works. As advocated by [39, 10], we perform 6-fold cross validation with micro-averaging, i.e. computing metrics once over the merged predictions of all test folds. Following [46], we also report the performance on the fifth fold only (Area 5), corresponding to a building not present in the other folds. Since some classes in this data set cannot be partitioned purely using geometric features (such as boards or paintings on walls), we concatenate the color information o to the geometric features f for the partitioning step.

The quantitative results are displayed in Table 3, with qualitative results in Figure 3 and in Appendix D. S3DIS is a difficult dataset, with hard-to-retrieve classes such as white boards on white walls and columns within walls. From the quantitative results, we can see that our framework performs better than other methods on average. Notably, doors are correctly classified at a higher rate than by other approaches, as long as they are open, as illustrated in Figure 3. Indeed, doors are geometrically similar to walls, but their position with respect to the door frame allows our network to retrieve them correctly. On the other hand, the partition merges white boards with walls, depriving the network of the opportunity to even learn to classify them: the IoU of boards for a theoretically perfect classification of superpoints (as in Section 4.3) is only 51.3.

Computation Time. In Table 4, we report the computation time over the different steps of our pipeline for inference on Area 5, measured on a 4 GHz CPU and a GTX 1080 Ti GPU. While the bulk of time is spent on the CPU for partitioning and SPG computation, we show that voxelization as pre-processing, detailed in Appendix A, leads to a significant speed-up as well as improved accuracy.
Table 2: Intersection over union metric for the different classes of the Semantic3D dataset. OA is the global accuracy, while mIoU refers to the unweighted average of the IoU of each class.

Reduced test set (78,699,329 points):
Method        | OA   | mIoU | man-made terrain | natural terrain | high vegetation | low vegetation | buildings | hardscape | scanning artefact | cars
TMLC-MSR [15] | 86.2 | 54.2 | 89.8 | 74.5 | 53.7 | 26.8 | 88.8 | 18.9 | 36.4 | 44.7
DeePr3SS [29] | 88.9 | 58.5 | 85.6 | 83.2 | 74.2 | 32.4 | 89.7 | 18.5 | 25.1 | 59.2
SnapNet [5]   | 88.6 | 59.1 | 82.0 | 77.3 | 79.7 | 22.9 | 91.1 | 18.4 | 37.3 | 64.4
SegCloud [46] | 88.1 | 61.3 | 83.9 | 66.0 | 86.0 | 40.5 | 91.1 | 30.9 | 27.5 | 64.3
SPG (Ours)    | 94.0 | 73.2 | 97.4 | 92.6 | 87.9 | 44.0 | 93.2 | 31.0 | 63.5 | 76.2

Full test set (2,091,952,018 points):
TMLC-MS [15]  | 85.0 | 49.4 | 91.1 | 69.5 | 32.8 | 21.6 | 87.6 | 25.9 | 11.3 | 55.3
SnapNet [5]   | 91.0 | 67.4 | 89.6 | 79.5 | 74.8 | 56.1 | 90.9 | 36.5 | 34.3 | 77.2
SPG (Ours)    | 92.9 | 76.2 | 91.5 | 75.6 | 78.3 | 71.7 | 94.4 | 56.8 | 52.9 | 88.4
[Figure 3 legend — Semantic3D classes: road, grass, tree, bush, buildings, hardscape, artefacts, cars. S3DIS classes: ceiling, floor, wall, column, beam, window, door, table, chair, bookcase, sofa, board, clutter, unlabelled.]

(a) RGB point cloud (b) Geometric partitioning (c) Prediction (d) Ground truth

Figure 3: Example visualizations on both datasets. The colors in (b) are chosen randomly for each element of the partition.
4.3. Ablation Studies

To better understand the influence of various design choices made in our framework, we compare it to several baselines and perform an ablation study. Due to the lack of public ground truth for the test sets of Semantic3D, we evaluate on S3DIS with 6-fold cross validation and show a comparison of different models to our Best model in Table 5.

Performance Limits. The contribution of contextual segmentation can be bounded both from below and above. The lower bound (Unary) is estimated by training PointNet with d_z = 13 but otherwise the same architecture, denoted as PointNet13, to directly predict class logits, without SPG and GRUs. The upper bound (Perfect) corresponds to assigning each superpoint its ground truth label, and thus sets the limit of performance due to the geometric partition. We can see that contextual segmentation is able to win roughly 22 mIoU points over unaries, confirming its importance. Nevertheless, the learned model still has room of up to 26 mIoU points for improvement.
Table 3: Results on the S3DIS dataset on fold "Area 5" (top) and micro-averaged over all 6 folds (bottom). Intersection over union is shown split per class.

Method                | OA    | mAcc  | mIoU  | ceiling | floor | wall  | beam | column | window | door  | chair | table | bookcase | sofa  | board | clutter
A5 PointNet [39]      | –     | 48.98 | 41.09 | 88.80 | 97.33 | 69.80 | 0.05 | 3.92  | 46.26 | 10.76 | 52.61 | 58.93 | 40.28 | 5.85  | 26.38 | 33.22
A5 SEGCloud [46]      | –     | 57.35 | 48.92 | 90.06 | 96.05 | 69.86 | 0.00 | 18.37 | 38.35 | 23.12 | 75.89 | 70.40 | 58.42 | 40.88 | 12.96 | 41.60
A5 SPG (Ours)         | 86.38 | 66.50 | 58.04 | 89.35 | 96.87 | 78.12 | 0.0  | 42.81 | 48.93 | 61.58 | 84.66 | 75.41 | 69.84 | 52.60 | 2.10  | 52.22
PointNet [39] in [10] | 78.5  | 66.2  | 47.6  | 88.0  | 88.7  | 69.3  | 42.4 | 23.1  | 47.5  | 51.6  | 42.0  | 54.1  | 38.2  | 9.6   | 29.4  | 35.2
Engelmann et al. [10] | 81.1  | 66.4  | 49.7  | 90.3  | 92.1  | 67.9  | 44.7 | 24.2  | 52.3  | 51.2  | 47.4  | 58.1  | 39.0  | 6.9   | 30.0  | 41.9
SPG (Ours)            | 85.5  | 73.0  | 62.1  | 89.9  | 95.1  | 76.4  | 62.8 | 47.1  | 55.3  | 68.4  | 73.5  | 69.2  | 63.2  | 45.9  | 8.7   | 52.9
References

[1] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11):2274–2282, 2012.
[2] A. Anand, H. S. Koppula, T. Joachims, and A. Saxena. Contextually guided semantic labeling and search for three-dimensional point clouds. The International Journal of Robotics Research, 32(1):19–34, 2013.
[3] I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. Brilakis, M. Fischer, and S. Savarese. 3D semantic parsing of large-scale indoor spaces. In CVPR, 2016.
[4] L. J. Ba, R. Kiros, and G. E. Hinton. Layer normalization. CoRR, abs/1607.06450, 2016.
[5] A. Boulch, B. L. Saux, and N. Audebert. Unstructured point cloud semantic labeling using deep segmentation networks. In Eurographics Workshop on 3D Object Retrieval, volume 2, 2017.
[6] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(11):1222–1239, 2001.
[7] S. Chandra and I. Kokkinos. Fast, exact and multi-scale inference for semantic image segmentation with deep Gaussian CRFs. In ECCV, 2016.
[8] K. Cho, B. van Merriënboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In EMNLP, 2014.
[9] J. Demantké, C. Mallet, N. David, and B. Vallet. Dimensionality based scale selection in 3D lidar point clouds. International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, XXXVIII-5/W12:97–102, 2011.
[10] F. Engelmann, T. Kontogianni, A. Hermans, and B. Leibe. Exploring spatial context for 3D semantic segmentation of point clouds. In ICCV, 3DRMS Workshop, 2017.
[11] R. Gadde, V. Jampani, M. Kiefel, D. Kappler, and P. Gehler. Superpixel convolutional networks using bilateral inceptions. In ECCV, 2016.
[12] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message passing for quantum chemistry. In ICML, pages 1263–1272, 2017.
[13] S. Guinard and L. Landrieu. Weakly supervised segmentation-aided classification of urban scenes from 3D LiDAR point clouds. In ISPRS 2017, 2017.
[14] T. Hackel, N. Savinov, L. Ladicky, J. D. Wegner, K. Schindler, and M. Pollefeys. Semantic3D.net: A new large-scale point cloud classification benchmark. arXiv preprint arXiv:1704.03847, 2017.
[15] T. Hackel, J. D. Wegner, and K. Schindler. Fast semantic segmentation of 3D point clouds with strongly varying density. ISPRS Annals of Photogrammetry, Remote Sensing & Spatial Information Sciences, 3(3), 2016.
[16] H. Hu, D. Munoz, J. A. Bagnell, and M. Hebert. Efficient 3-D scene analysis from streaming data. In ICRA, 2013.
[17] G. Huang, Z. Liu, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017.
[18] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[19] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In NIPS, pages 2017–2025, 2015.
[20] J. W. Jaromczyk and G. T. Toussaint. Relative neighborhood graphs and their relatives. Proceedings of the IEEE, 80(9):1502–1517, 1992.
[21] B.-S. Kim, P. Kohli, and S. Savarese. 3D scene understanding by voxel-CRF. In ICCV, 2013.
[22] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[23] R. Klokov and V. S. Lempitsky. Escape from cells: Deep Kd-networks for the recognition of 3D point cloud models. CoRR, abs/1704.01222, 2017.
[24] H. S. Koppula, A. Anand, T. Joachims, and A. Saxena. Semantic labeling of 3D point clouds for indoor scenes. In NIPS, pages 244–252, 2011.
[25] S. Kullback and R. A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86, 1951.
[26] L. Landrieu and G. Obozinski. Cut pursuit: Fast algorithms to learn piecewise constant functions on general weighted graphs. SIAM Journal on Imaging Sciences, 10(4):1724–1766, 2017.
[27] L. Landrieu, H. Raguet, B. Vallet, C. Mallet, and M. Weinmann. A structured regularization framework for spatially smoothing semantic labelings of 3D point clouds. ISPRS Journal of Photogrammetry and Remote Sensing, 132:102–118, 2017.
[28] M. Larsson, F. Kahl, S. Zheng, A. Arnab, P. H. S. Torr, and R. I. Hartley. Learning arbitrary potentials in CRFs with gradient descent. CoRR, abs/1701.06805, 2017.
[29] F. J. Lawin, M. Danelljan, P. Tosteberg, G. Bhat, F. S. Khan, and M. Felsberg. Deep projective 3D semantic segmentation. arXiv preprint arXiv:1705.03428, 2017.
[30] Y. Li, D. Tarlow, M. Brockschmidt, and R. S. Zemel. Gated graph sequence neural networks. In ICLR, 2016.
[31] X. Liang, L. Lin, X. Shen, J. Feng, S. Yan, and E. P. Xing. Interpretable structure-evolving LSTM. In CVPR, pages 2175–2184, 2017.
[32] X. Liang, X. Shen, J. Feng, L. Lin, and S. Yan. Semantic object parsing with graph LSTM. In ECCV, pages 125–143, 2016.
[33] G. Lin, C. Shen, A. van den Hengel, and I. D. Reid. Efficient piecewise training of deep structured models for semantic segmentation. In CVPR, 2016.
[34] Y. Lu and C. Rasmussen. Simplified Markov random fields for efficient semantic labeling of 3D point clouds. In IROS, pages 2690–2697, 2012.
[35] A. Martinovic, J. Knopp, H. Riemenschneider, and L. Van Gool. 3D all the way: Semantic segmentation of urban scenes from start to end in 3D. In CVPR, 2015.
[36] F. Monti, D. Boscaini, J. Masci, E. Rodolà, J. Svoboda, and M. M. Bronstein. Geometric deep learning on graphs and manifolds using mixture model CNNs. In CVPR, pages 5425–5434, 2017.
[37] D. Munoz, J. A. Bagnell, N. Vandapel, and M. Hebert. Contextual classification with functional max-margin Markov networks. In CVPR, 2009.
[38] J. Niemeyer, F. Rottensteiner, and U. Soergel. Contextual classification of lidar data and building object detection in urban areas. ISPRS Journal of Photogrammetry and Remote Sensing, 87:152–165, 2014.
[39] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In CVPR, 2017.
[40] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In NIPS, 2017.
[41] X. Qi, R. Liao, J. Jia, S. Fidler, and R. Urtasun. 3D graph neural networks for RGBD semantic segmentation. In ICCV, pages 5209–5218, 2017.
[42] G. Riegler, A. O. Ulusoy, and A. Geiger. OctNet: Learning deep 3D representations at high resolutions. In CVPR, 2017.
[43] A. G. Schwing and R. Urtasun. Fully connected deep structured networks. CoRR, abs/1503.02351, 2015.
[44] R. Shapovalov, D. Vetrov, and P. Kohli. Spatial inference machines. In CVPR, 2013.
[45] M. Simonovsky and N. Komodakis. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In CVPR, 2017.

Appendix

A. Model Details

Voxelization. We pre-process input point clouds with voxelization subsampling, computing per-voxel mean positions and observations over a regular 3D grid (5 cm bins for Semantic3D and 3 cm bins for the S3DIS dataset). The resulting semantic segmentation is interpolated back to the original point cloud in a nearest-neighbor fashion. Voxelization helps decrease the computation time and memory requirement, and improves the accuracy of the semantic segmentation by acting as a form of geometric and radiometric denoising as well (Table 4 in the main paper). The quality of further steps is practically not affected, as superpoints are usually strongly subsampled for embedding during learning and inference anyway (Section 3.3 in the main paper).
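A minimal NumPy sketch of this voxelization step, assuming `values` stacks the per-point observations as rows; the helper name is ours:

```python
import numpy as np

def voxelize(points, values, bin_size=0.05):
    """Average positions and observations per occupied voxel of a regular
    3D grid (e.g. 5 cm bins for Semantic3D, 3 cm for S3DIS)."""
    keys = np.floor(points / bin_size).astype(np.int64)
    _, inv, counts = np.unique(keys, axis=0, return_inverse=True,
                               return_counts=True)
    pooled_pts = np.zeros((len(counts), points.shape[1]))
    pooled_val = np.zeros((len(counts), values.shape[1]))
    np.add.at(pooled_pts, inv, points)   # sum per occupied voxel
    np.add.at(pooled_val, inv, values)
    return pooled_pts / counts[:, None], pooled_val / counts[:, None]
```

At inference, per-voxel predictions are propagated back to the original points by nearest-neighbor lookup.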
and inference anyway (Section 3.3 in the main paper).
[41] X. Qi, R. Liao, J. Jia, S. Fidler, and R. Urtasun. 3D graph
neural networks for RGBD semantic segmentation. In ICCV, Geometric Partition. We set regularization strength µ =
pages 5209–5218, 2017. 2 0.8 for Semantic3D and µ = 0.03 for S3DIS, which strikes
[42] G. Riegler, A. O. Ulusoy, and A. Geiger. OctNet: Learning a balance between semantic homogeneity of superpoints
deep 3D representations at high resolutions. In CVPR, 2017. and the potential for their successful discrimination (S3DIS
1, 2 is composed of smaller semantic parts than Semantic3D).
[43] A. G. Schwing and R. Urtasun. Fully connected deep struc- In addition to five geometric features f (linearity, planarity,
tured networks. CoRR, abs/1503.02351, 2015. 5 scattering, verticality, elevation), we use color information
o for clustering in S3DIS due to some classes being geo-
[44] R. Shapovalov, D. Vetrov, and P. Kohli. Spatial inference
machines. In CVPR, 2013. 2 metrically indistinguishable, such as boards or doors.
[45] M. Simonovsky and N. Komodakis. Dynamic edge- MLP(64,64,128,128,256)
STN
conditioned filters in convolutional neural networks on
graphs. In CVPR, 2017. 1, 2, 4, 5, 11 MLP(256,64,32)
np × dp
np × d p
maxpool
10
PointNet. [...] everywhere. Input points have d_p = 11-dimensional features for Semantic3D (position p_i, color o_i, geometric features f_i), with 3 additional ones for S3DIS (room-normalized spatial coordinates, as in past work [39]).

Segmentation Network. We use embedding dimensionality d_z = 32 and T = 10 iterations. ECC-VV is used for Semantic3D (there are only 15 point clouds, even though the number of points is large), while ECC-MV is used for S3DIS (large number of point clouds). The filter-generating network Θ is an MLP with 4 layers (widths 32, 128, 64, and 32 or 32² for ECC-VV or ECC-MV, respectively) with ReLUs. Batch Normalization is used only after the third parametric layer. No bias is used in the last layer. Superedges have d_f = 13-dimensional features, normalized by mean subtraction and scaling to unit variance based on the whole training set.

Training. We train using Adam [22] with initial learning rate 0.01 and batch size 2, i.e. effectively up to 1024 superpoints per batch. For Semantic3D, we train for 500 epochs with a stepwise learning rate decay of 0.7 at epochs 350, 400, and 450. For S3DIS, we train for 250 epochs with steps at 200 and 230. We clip gradients within [−1, 1].
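The schedule above, written out as a training loop for the S3DIS setting; the model, data loader and cross-entropy supervision are assumptions of this sketch:

```python
import torch
import torch.nn as nn

def train(model, loader, epochs=250, milestones=(200, 230)):
    opt = torch.optim.Adam(model.parameters(), lr=0.01)
    sched = torch.optim.lr_scheduler.MultiStepLR(opt, list(milestones),
                                                 gamma=0.7)  # stepwise decay 0.7
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for inputs, labels in loader:   # batch size 2, up to 1024 superpoints
            loss = criterion(model(inputs), labels)
            opt.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_value_(model.parameters(), 1.0)  # [-1, 1]
            opt.step()
        sched.step()
```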
B. CRF-ECC

In this section, we describe our adaptation of the CRF-RNN mean field inference by Zheng et al. [51] for post-processing PointNet embeddings in the SPG, denoted as unary potentials U_i here.

The original work proposed a dense CRF with pairwise potentials Ψ defined to be a mixture of m Gaussian kernels as Ψ_ij = µ Σ_m w_m K_m(F_ij), where µ is the label compatibility matrix, w are parameters, and K are fixed Gaussian kernels applied on edge features.

We replace this definition of the pairwise term with a Filter Generating Network Θ [45] parameterized with weights W_e, which generalizes the message passing and compatibility transform steps of Zheng et al. Furthermore, we use the superedge connectivity E instead of assuming a complete graph. The pseudo-code is listed in Algorithm 1. Its output are marginal probability distributions Q. In practice, we run the inference for T = 10 iterations.

Algorithm 1 CRF-ECC
  Q_i ← softmax(U_i)
  while not converged do
    Q̂_i ← Σ_{j | (j,i) ∈ E} Θ(F_{ji,·}; W_e) Q_j
    Q̆_i ← U_i − Q̂_i
    Q_i ← softmax(Q̆_i)
  end while
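A PyTorch sketch of Algorithm 1, using element-wise (ECC-VV-style) pairwise weighting; whether Θ produces vectors or matrices here is our assumption:

```python
import torch
import torch.nn.functional as F

def crf_ecc(unary, edge_index, edge_attr, theta, T=10):
    """Mean-field-style inference: unary potentials U_i from PointNet,
    pairwise messages generated by the filter network theta over superedges."""
    src, dst = edge_index
    q = F.softmax(unary, dim=1)
    w = theta(edge_attr)                  # Theta(F_ji ; W_e), one row per edge
    for _ in range(T):                    # fixed iteration count in practice
        pairwise = torch.zeros_like(q).index_add_(0, dst, w * q[src])
        q = F.softmax(unary - pairwise, dim=1)
    return q                              # marginal distributions Q
```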
C. Extended Ablation Studies

In this section, we present an additional set of experiments to validate our design choices, and present their results in Table 6.

a) Spatial Transformer Network. While the STN makes the superpoint embedding orientation-invariant, the relationships with surrounding objects are still captured by superedges, which are orientation-variant. In practice, the STN helps by 4 mIoU points.

b) Geometric Features. Geometric features f_i are computed in the geometric partition step and can therefore be used in the following learning step for free. While PointNets could be expected to learn similar features from the data, this is hampered by superpoint subsampling, and therefore their explicit use helps (+4 mIoU).

c) Sampling Superpoints. The main effect of subsampling SPGs is regularization by data augmentation. Too small a sample size leads to disregarding contextual information (−4 mIoU), while too large a size leads to overfitting (−2 mIoU). A lower memory requirement at training is an extra benefit. There is no subsampling at test time.

d) Long-range Context. We observe that limiting the range of context information in the SPG harms the performance. Specifically, capping distances in G_vor to 1 m (as used in PointNet [39]) or 5 m (as used in SegCloud [46], which furthermore divides the inference into cubes without overlap, possibly causing inconsistencies across boundaries) worsens the performance of our method (even more so on our Semantic3D validation set).

e) Input Gate. We evaluate the effect of input gating (IG) for GRUs as well as LSTM units. While an LSTM unit achieves a higher score than a GRU (−3 mIoU), the proposed IG reverses this situation in favor of the GRU (+1 mIoU). Unlike the standard input gate of an LSTM, which controls the information flow from the hidden state and input to the cell, our IG controls the input even before it is used to compute all other gates (a minimal sketch follows this section).

f) Regularization Strength µ. We investigate the balance between superpoints' discriminative potential and their homogeneity, controlled by the parameter µ. We observe that the system is able to perform reasonably over a range of SPG sizes.

g) Superpoint Sizes. We include a breakdown of superpoint sizes for µ = 0.03 in relation to the hyperparameters n_minp = 40 and n_p = 128, showing that 93% of points are in embedded superpoints, and 79% in superpoints that are subsampled.

Superedge Features. Finally, in Table 7 we evaluate the empirical importance of individual superedge features by removing them from Best. Although no single feature is crucial, the most important being offset deviation (+3 mIoU), we remind the reader that without any superedge features the network [...]
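A minimal interpretation of the input gate discussed in e) above: the incoming message is gated as a function of the hidden state and the message itself, before the GRU computes its other gates; the exact parameterization is our assumption.

```python
import torch
import torch.nn as nn

class InputGatedGRUCell(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.gate = nn.Linear(2 * d, d)   # input gate from (hidden, message)
        self.gru = nn.GRUCell(d, d)

    def forward(self, m, h):
        g = torch.sigmoid(self.gate(torch.cat([h, m], dim=-1)))
        return self.gru(g * m, h)         # gate the input, then GRU update
```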
Table 6: Results of the extended ablation studies (mIoU on S3DIS, 6-fold cross validation).

a) Spatial transf.      | no: 58.1 | yes: 62.1
b) Geometric features   | no: 58.4 | yes: 62.1
c) Max superpoints      | 256: 57.9 | 512: 62.1 | 1024: 60.4
d) Superedge limit      | 1 m: 61.0 | 5 m: 61.3 | ∞: 62.1
e) Input gate           | LSTM: 61.0 | LSTM+IG: 61.0 | GRU: 57.5 | GRU+IG: 62.1
f) Regularization µ     | 0.01      | 0.02      | 0.03      | 0.04
   # superpoints        | 785,010   | 385,091   | 251,266   | 186,108
   perfect mIoU         | 90.6      | 88.2      | 86.6      | 85.2
   mIoU                 | 59.1      | 59.2      | 62.1      | 58.8
g) Superpoint size      | 1–40      | 40–128    | 128–1000  | ≥ 1000
   proportion of points | 7%        | 14%       | 27%       | 52%

D. Video Illustration

We provide a video illustrating our method and qualitative results on the S3DIS dataset, which can be viewed at https://youtu.be/Ijr3kGSU_tU.