
Large-scale Point Cloud Semantic Segmentation with Superpoint Graphs

Loic Landrieu¹*, Martin Simonovsky²*

¹ Université Paris-Est, LASTIG MATIS IGN, ENSG
² Université Paris-Est, Ecole des Ponts ParisTech
loic.landrieu@ign.fr, martin.simonovsky@enpc.fr

arXiv:1711.09869v2 [cs.CV] 28 Mar 2018

Abstract

We propose a novel deep learning-based framework to tackle the challenge of semantic segmentation of large-scale point clouds of millions of points. We argue that the organization of 3D point clouds can be efficiently captured by a structure called superpoint graph (SPG), derived from a partition of the scanned scene into geometrically homogeneous elements. SPGs offer a compact yet rich representation of contextual relationships between object parts, which is then exploited by a graph convolutional network. Our framework sets a new state of the art for segmenting outdoor LiDAR scans (+11.9 and +8.8 mIoU points for both Semantic3D test sets), as well as indoor scans (+12.4 mIoU points for the S3DIS dataset).

1. Introduction

Semantic segmentation of large 3D point clouds presents numerous challenges, the most obvious one being the scale of the data. Another hurdle is the lack of clear structure akin to the regular grid arrangement in images. These obstacles have likely prevented Convolutional Neural Networks (CNNs) from achieving on irregular data the impressive performance attained for speech processing or images.

Previous attempts at using deep learning for large 3D data tried to replicate successful CNN architectures used for image segmentation. For example, SnapNet [5] converts a 3D point cloud into a set of virtual 2D RGBD snapshots, the semantic segmentation of which can then be projected onto the original data. SegCloud [46] uses 3D convolutions on a regular voxel grid. However, we argue that such methods do not capture the inherent structure of 3D point clouds, which results in limited discrimination performance. Indeed, converting point clouds to a 2D format comes with a loss of information and requires surface reconstruction, a problem arguably as hard as semantic segmentation. Volumetric representations of point clouds are inefficient and tend to discard small details.

Deep learning architectures specifically designed for 3D point clouds [39, 45, 42, 40, 10] display good results, but are limited by the size of inputs they can handle at once.

We propose a representation of large 3D point clouds as a collection of interconnected simple shapes coined superpoints, in spirit similar to superpixel methods for image segmentation [1]. As illustrated in Figure 1, this structure can be captured by an attributed directed graph called the superpoint graph (SPG). Its nodes represent simple shapes, while edges describe their adjacency relationship, characterized by rich edge features.

The SPG representation has several compelling advantages. First, instead of classifying individual points or voxels, it considers entire object parts as wholes, which are easier to identify. Second, it is able to describe in detail the relationship between adjacent objects, which is crucial for contextual classification: cars are generally above roads, ceilings are surrounded by walls, etc. Third, the size of the SPG is defined by the number of simple structures in a scene rather than the total number of points, which is typically several orders of magnitude smaller. This allows us to model long-range interactions which would otherwise be intractable without strong assumptions on the nature of the pairwise connections. Our contributions are as follows:

• We introduce superpoint graphs, a novel point cloud representation with rich edge features encoding the contextual relationship between object parts in 3D point clouds.

• Based on this representation, we are able to apply deep learning on large-scale point clouds without major sacrifice in fine details. Our architecture consists of PointNets [39] for superpoint embedding and graph convolutions for contextual segmentation. For the latter, we introduce a novel, more efficient version of Edge-Conditioned Convolutions [45] as well as a new form of input gating in Gated Recurrent Units [8].

• We set a new state of the art on two publicly available datasets: Semantic3D [14] and S3DIS [3]. In particular, we improve mean per-class intersection over union (mIoU) by 11.9 points for the Semantic3D reduced test set, by 8.8 points for the Semantic3D full test set, and by up to 12.4 points for the S3DIS dataset.

* Both authors contributed equally to this work.

(a) RGB point cloud (b) Geometric partition (c) Superpoint graph (d) Semantic segmentation

Figure 1: Visualization of individual steps in our pipeline. An input point cloud (a) is partitioned into geometrically simple shapes, called superpoints (b). Based on this preprocessing, a superpoint graph (SPG) is constructed by linking nearby superpoints by superedges with rich attributes (c). Finally, superpoints are transformed into compact embeddings, processed with graph convolutions to make use of contextual information, and classified into semantic labels.

2. Related Work

The classic approach to large-scale point cloud segmentation is to classify each point or voxel independently using handcrafted features derived from their local neighborhood [48]. The solution is then spatially regularized using graphical models [37, 24, 34, 44, 21, 2, 38, 35, 49] or structured optimization [27]. Clustering as preprocessing [16, 13] or postprocessing [47] has been used by several frameworks to improve the accuracy of the classification.

Deep Learning on Point Clouds. Several different approaches going beyond naive volumetric processing of point clouds have been proposed recently, notably set-based [39, 40], tree-based [42, 23], and graph-based [45]. However, very few methods with deep learning components have been demonstrated to be able to segment large-scale point clouds. PointNet [39] can segment large clouds with a sliding window approach, thereby constraining contextual information within a small area only. Engelmann et al. [10] improve on this by increasing the context scope with multi-scale windows or by directly considering neighboring window positions on a voxel grid. SEGCloud [46] handles large clouds by voxelizing followed by interpolation back to the original resolution and post-processing with a conditional random field (CRF). None of these approaches is able to consider fine details and long-range contextual information simultaneously. In contrast, our pipeline partitions point clouds in an adaptive way according to their geometric complexity and allows the deep learning architecture to use both fine detail and interactions over long distances.

Graph Convolutions. A key step of our approach is using graph convolutions to spread contextual information. Formulations that are able to deal with graphs of variable sizes can be seen as a form of message passing over graph edges [12]. Of particular interest are models supporting continuous edge attributes [45, 36], which we use to represent interactions. In image segmentation, convolutions on graphs built over superpixels have been used for post-processing: Liang et al. [32, 31] traverse such graphs in a sequential node order based on unary confidences to improve the final labels. We update graph nodes in parallel and exploit edge attributes for informative context modeling. Xu et al. [50] convolve information over graphs of object detections to infer their contextual relationships. Our work infers relationships implicitly to improve segmentation results. Qi et al. [41] also rely on graph convolutions on 3D point clouds. However, we process large point clouds instead of small RGBD images, with nodes embedded in 3D instead of 2D, in a novel, rich-attributed graph. Finally, we note that graph convolutions also bear functional similarity to deep learning formulations of CRFs [51], which we discuss in Section 3.4.

3. Method

The main obstacle that our framework tries to overcome is the size of LiDAR scans. Indeed, they can reach hundreds of millions of points, making direct deep learning approaches intractable. The proposed SPG representation allows us to split the semantic segmentation problem into three distinct problems of different scales, shown in Figure 2, which can in turn be solved by methods of corresponding complexity:

1. Geometrically homogeneous partition: The first step of our algorithm is to partition the point cloud into geometrically simple yet meaningful shapes, called superpoints. This unsupervised step takes the whole point cloud as input, and therefore must be computationally very efficient. The SPG can be easily computed from this partition.

2. Superpoint embedding: Each node of the SPG corresponds to a small part of the point cloud corresponding to a geometrically simple primitive, which we assume to be semantically homogeneous. Such primitives can be reliably represented by downsampling small point clouds to at most hundreds of points. This small size allows us to utilize recent point cloud embedding methods such as PointNet [39].

(a) Input point cloud (b) Superpoint graph (c) Network architecture

Figure 2: Illustration of our framework on a toy scan of a table and a chair. We perform geometric partitioning on the point cloud (a), which allows us to build the superpoint graph (b). Each superpoint is embedded by a PointNet network. The embeddings are then refined in GRUs by message passing along superedges to produce the final labeling (c).

3. Contextual segmentation: The graph of superpoints is orders of magnitude smaller than any graph built on the original point cloud. Deep learning algorithms based on graph convolutions can then be used to classify its nodes using rich edge features facilitating long-range interactions.

The SPG representation allows us to perform end-to-end learning of the last two, trainable steps. We describe each step of our pipeline in the following subsections.

3.1. Geometric Partition with a Global Energy

In this subsection, we describe our method for partitioning the input point cloud into parts of simple shape. Our objective is not to retrieve individual objects such as cars or chairs, but rather to break down the objects into simple parts, as seen in Figure 3. However, the clusters being geometrically simple, one can expect them to be semantically homogeneous as well, i.e. not to cover objects of different classes. Note that this step of the pipeline is purely unsupervised and makes no use of class labels beyond validation.

We follow the global energy model described by [13] for its computational efficiency. Another advantage is that the segmentation is adaptive to the local geometric complexity. In other words, the segments obtained can be large simple shapes such as roads or walls, as well as much smaller components such as parts of a car or a chair.

Let us consider the input point cloud C as a set of n 3D points. Each point i ∈ C is defined by its 3D position p_i, and, if available, other observations o_i such as color or intensity. For each point, we compute a set of d_g geometric features f_i ∈ R^{d_g} characterizing the shape of its local neighborhood. In this paper, we use the three dimensionality values proposed by [9]: linearity, planarity and scattering, as well as the verticality feature introduced by [13]. We also compute the elevation of each point, defined as the z coordinate of p_i normalized over the whole input cloud.

The global energy proposed by [13] is defined with respect to the 10-nearest neighbor adjacency graph G_nn = (C, E_nn) of the point cloud (note that this is not the SPG). The geometrically homogeneous partition is defined as the constant connected components of the solution of the following optimization problem:

$$\arg\min_{g \in \mathbb{R}^{d_g}} \; \sum_{i \in C} \lVert g_i - f_i \rVert^2 + \mu \sum_{(i,j) \in E_{nn}} w_{i,j} \, [g_i - g_j \neq 0], \qquad (1)$$

where [·] is the Iverson bracket. The edge weight $w \in \mathbb{R}_+^{|E|}$ is chosen to be linearly decreasing with respect to the edge length. The factor µ is the regularization strength and determines the coarseness of the resulting partition.

The problem defined in Equation 1 is known as the generalized minimal partition problem, and can be seen as a continuous-space version of the Potts energy model, or an ℓ0 variant of the graph total variation. As the minimized functional is nonconvex and noncontinuous, the problem cannot realistically be solved exactly for large point clouds. However, the ℓ0-cut pursuit algorithm introduced by [26] is able to quickly find an approximate solution with a few graph-cut iterations. In contrast to other optimization methods such as α-expansion [6], the ℓ0-cut pursuit algorithm does not require selecting the size of the partition in advance. The constant connected components S = {S_1, ..., S_k} of the solution of Equation 1 define our geometrically simple elements, and are referred to as superpoints (i.e. sets of points) in the rest of this paper.
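To make the feature computation concrete, the sketch below derives the three dimensionality values of [9] from the eigenvalues of the local 10-nearest-neighbor covariance of each point. It is an illustrative sketch rather than the authors' implementation: the verticality feature of [13] is omitted for brevity, and the function name and the exact eigenvalue ratios are our assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def geometric_features(points, k=10):
    """Per-point linearity, planarity, scattering and elevation (sketch).

    points: (n, 3) array. Eigenvalues lam1 >= lam2 >= lam3 of the local
    covariance yield the dimensionality values after [9]; elevation is the
    z coordinate normalized over the whole cloud, as in Section 3.1.
    """
    nn = NearestNeighbors(n_neighbors=k + 1).fit(points)
    _, idx = nn.kneighbors(points)                 # idx[:, 0] is the point itself
    z = points[:, 2]
    elevation = (z - z.min()) / (z.max() - z.min() + 1e-9)
    feats = np.empty((len(points), 4))
    for i, neigh in enumerate(idx):
        lam = np.linalg.eigvalsh(np.cov(points[neigh].T))[::-1] + 1e-9
        feats[i] = ((lam[0] - lam[1]) / lam[0],    # linearity
                    (lam[1] - lam[2]) / lam[0],    # planarity
                    lam[2] / lam[0],               # scattering
                    elevation[i])
    return feats
```

These per-point features f_i are the quantities that the partition energy of Equation 1 fits piecewise-constantly, and they are reused later as PointNet inputs (Section 3.3), which is why computing them once up front pays off.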

Feature name        Size  Description
mean offset         3     mean_{m ∈ δ(S,T)} δ_m
offset deviation    3     std_{m ∈ δ(S,T)} δ_m
centroid offset     3     mean_{i ∈ S} p_i − mean_{j ∈ T} p_j
length ratio        1     log(length(S) / length(T))
surface ratio       1     log(surface(S) / surface(T))
volume ratio        1     log(volume(S) / volume(T))
point count ratio   1     log(|S| / |T|)

Table 1: List of d_f = 13 superedge features characterizing the adjacency between two superpoints S and T.

3.2. Superpoint Graph Construction

In this subsection, we describe how we compute the SPG as well as its key features. The SPG is a structured representation of the point cloud, defined as an oriented attributed graph G = (S, E, F) whose nodes are the set of superpoints S and whose edges E (referred to as superedges) represent the adjacency between superpoints. The superedges are annotated by a set of d_f features, F ∈ R^{E×d_f}, characterizing the adjacency relationship between superpoints.

We define G_vor = (C, E_vor) as the symmetric Voronoi adjacency graph of the complete input point cloud, as defined by [20]. Two superpoints S and T are adjacent if there is at least one edge in E_vor with one end in S and one end in T:

$$E = \left\{ (S, T) \in \mathcal{S}^2 \mid \exists\, (i, j) \in E_{vor} \cap (S \times T) \right\}. \qquad (2)$$

Important spatial features associated with a superedge (S, T) are obtained from the set of offsets δ(S, T) for edges in E_vor linking both superpoints:

$$\delta(S, T) = \left\{ (p_i - p_j) \mid (i, j) \in E_{vor} \cap (S \times T) \right\}. \qquad (3)$$

Superedge features can also be derived by comparing the shape and size of the adjacent superpoints. To this end, we compute |S| as the number of points comprised in a superpoint S, as well as shape features length(S) = λ_1, surface(S) = λ_1 λ_2, volume(S) = λ_1 λ_2 λ_3 derived from the eigenvalues λ_1, λ_2, λ_3 of the covariance of the positions of the points comprised in each superpoint, sorted by decreasing value. In Table 1, we describe the list of the different superedge features used in this paper. Note that the break of symmetry in the edge features makes the SPG a directed graph.
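For illustration, the following sketch assembles the 13 superedge features of Table 1 for a single superedge (S, T). It is a minimal sketch under the definitions above; the helper names are ours, and the exact treatment of degenerate covariances in the released code may differ.

```python
import numpy as np

def superedge_features(P_s, P_t, offsets):
    """Features of Table 1 for a superedge (S, T) (sketch).

    P_s, P_t: (n, 3) arrays of points of superpoints S and T;
    offsets: (m, 3) array of p_i - p_j over E_vor edges linking S and T.
    Shape terms follow Section 3.2: length = lam1, surface = lam1*lam2,
    volume = lam1*lam2*lam3 from covariance eigenvalues in decreasing order.
    """
    def shape(P):
        lam = np.linalg.eigvalsh(np.cov(P.T))[::-1] + 1e-9
        return lam[0], lam[0] * lam[1], lam[0] * lam[1] * lam[2]

    ls, ss, vs = shape(P_s)
    lt, st, vt = shape(P_t)
    return np.concatenate([
        offsets.mean(axis=0),                    # mean offset (3)
        offsets.std(axis=0),                     # offset deviation (3)
        P_s.mean(axis=0) - P_t.mean(axis=0),     # centroid offset (3)
        np.log([ls / lt, ss / st, vs / vt,       # length/surface/volume ratios
                len(P_s) / len(P_t)]),           # point count ratio
    ])                                           # 13 features in total
```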
3.3. Superpoint Embedding

The goal of this stage is to compute a descriptor for every superpoint S_i by embedding it into a vector z_i of fixed dimensionality d_z. Note that each superpoint is embedded in isolation; the contextual information required for its reliable classification is provided only in the following stage by means of graph convolutions.

Several deep learning-based methods have been proposed for this purpose recently. We choose PointNet [39] for its remarkable simplicity, efficiency, and robustness. In PointNet, input points are first aligned by a Spatial Transformer Network [19], independently processed by multi-layer perceptrons (MLPs), and finally max-pooled to summarize the shape.

In our case, input shapes are geometrically simple objects, which can be reliably represented by a small number of points and embedded by a rather compact PointNet. This is important to limit the memory needed when evaluating many superpoints on current GPUs. In particular, we subsample superpoints on-the-fly down to n_p = 128 points to maintain efficient computation in batches and facilitate data augmentation. Superpoints of less than n_p points are sampled with replacement, which in principle does not affect the evaluation of PointNet due to its max-pooling. However, we observed that including very small superpoints of less than n_minp = 40 points in training harms the overall performance. Thus, the embedding of such superpoints is set to zero so that their classification relies solely on contextual information.

In order for PointNet to learn the spatial distribution of different shapes, each superpoint is rescaled to the unit sphere before embedding. Points are represented by their normalized position p′_i, observations o_i, and geometric features f_i (since these are already available, precomputed from the partitioning step). Furthermore, the original metric diameter of the superpoint is concatenated as an additional feature after PointNet max-pooling in order to stay covariant with shape sizes.
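A minimal sketch of this input preparation is given below, assuming numpy arrays for positions, observations, and geometric features; the exact normalization and feature layout of the released code may differ.

```python
import numpy as np

def prepare_superpoint(points, obs, feats, n_p=128, n_minp=40, rng=None):
    """Resample and normalize one superpoint for PointNet (sketch of Sec. 3.3).

    Samples n_p points (with replacement if the superpoint is small),
    recenters and rescales positions to the unit sphere, and returns the
    metric diameter to be re-injected after max-pooling. Superpoints below
    n_minp return None; their embedding is set to zero.
    """
    rng = rng if rng is not None else np.random.default_rng()
    if len(points) < n_minp:
        return None                                    # zero embedding downstream
    sel = rng.choice(len(points), n_p, replace=len(points) < n_p)
    p = points[sel] - points[sel].mean(axis=0)         # center
    radius = np.linalg.norm(p, axis=1).max() + 1e-9
    diameter = 2.0 * radius                            # kept as extra feature
    p = p / radius                                     # rescale to unit sphere
    x = np.concatenate([p, obs[sel], feats[sel]], axis=1)
    return x, diameter
```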

3.4. Contextual Segmentation

The final stage of the pipeline is to classify each superpoint S_i based on its embedding z_i and its local surroundings within the SPG. Graph convolutions are naturally suited to this task. In this section, we explain the propagation model of our system.

Our approach builds on the ideas of Gated Graph Neural Networks [30] and Edge-Conditioned Convolutions (ECC) [45]. The general idea is that superpoints refine their embedding according to pieces of information passed along superedges. Concretely, each superpoint S_i maintains its state hidden in a Gated Recurrent Unit (GRU) [8]. The hidden state is initialized with embedding z_i and is then processed over several iterations (time steps) t = 1 ... T. At each iteration t, a GRU takes its hidden state h_i^{(t)} and an incoming message m_i^{(t)} as input, and computes its new hidden state h_i^{(t+1)}. The incoming message m_i^{(t)} to superpoint i is computed as a weighted sum of the hidden states h_j^{(t)} of neighboring superpoints j. The actual weighting for a superedge (j, i) depends on its attributes F_{ji,·}, listed in Table 1. In particular, it is computed from the attributes by a multi-layer perceptron Θ, the so-called Filter Generating Network. Formally:

$$h_i^{(t+1)} = (1 - u_i^{(t)}) \odot q_i^{(t)} + u_i^{(t)} \odot h_i^{(t)}, \quad q_i^{(t)} = \tanh(x_{1,i}^{(t)} + r_i^{(t)} \odot h_{1,i}^{(t)}), \quad u_i^{(t)} = \sigma(x_{2,i}^{(t)} + h_{2,i}^{(t)}), \quad r_i^{(t)} = \sigma(x_{3,i}^{(t)} + h_{3,i}^{(t)}) \qquad (4)$$

$$(h_{1,i}^{(t)}, h_{2,i}^{(t)}, h_{3,i}^{(t)})^T = \rho(W_h h_i^{(t)} + b_h), \quad (x_{1,i}^{(t)}, x_{2,i}^{(t)}, x_{3,i}^{(t)})^T = \rho(W_x x_i^{(t)} + b_x) \qquad (5)$$

$$x_i^{(t)} = \sigma(W_g h_i^{(t)} + b_g) \odot m_i^{(t)} \qquad (6)$$

$$m_i^{(t)} = \mathrm{mean}_{j \mid (j,i) \in E} \; \Theta(F_{ji,\cdot}\,; W_e) \odot h_j^{(t)} \qquad (7)$$

$$h_i^{(1)} = z_i, \qquad y_i = W_o \, (h_i^{(1)}, \dots, h_i^{(T+1)})^T, \qquad (8)$$

where ⊙ is element-wise multiplication, σ(·) the sigmoid function, and W_· and b_· are trainable parameters shared among all GRUs. Equation 4 lists the standard GRU rules [8] with its update gate u_i^{(t)} and reset gate r_i^{(t)}. To improve stability during training, in Equation 5 we apply Layer Normalization [4], defined as ρ(a) := (a − mean(a))/(std(a) + ε), separately to the linearly transformed input x_i^{(t)} and the transformed hidden state h_i^{(t)}, with ε being a small constant. Finally, the model includes three interesting extensions in Equations 6–8, which we detail below.

Input Gating. We argue that the GRU should possess the ability to down-weight (parts of) an input vector based on its hidden state. For example, the GRU might learn to ignore its context if its class state is highly certain, or to direct its attention to only specific feature channels. Equation 6 achieves this by gating message m_i^{(t)} by the hidden state before using it as input x_i^{(t)}.

Edge-Conditioned Convolution. ECC plays a crucial role in our model as it can dynamically generate filtering weights for any value of continuous attributes F_{ji,·} by processing them with a multi-layer perceptron Θ. In the original formulation [45] (ECC-MV), Θ regresses a weight matrix to perform the matrix-vector multiplication Θ(F_{ji,·}; W_e) h_j^{(t)} for each edge. In this work, we propose a lightweight variant with lower memory requirements and fewer parameters, which is beneficial for datasets with few but large point clouds. Specifically, we regress only an edge-specific weight vector and perform element-wise multiplication as in Equation 7 (ECC-VV). Channel mixing, albeit in an edge-unspecific fashion, is postponed to Equation 5. Finally, let us remark that Θ is shared over time iterations and that self-loops as proposed in [45] are not necessary due to the existence of hidden states in GRUs.

State Concatenation. Inspired by DenseNet [17], we concatenate hidden states over all time steps and linearly transform them to produce segmentation logits y_i in Equation 8. This allows exploiting the dynamics of hidden states due to the increasing receptive field for the final classification.

Relation to CRFs. In image segmentation, post-processing of convolutional outputs using Conditional Random Fields (CRFs) is widely popular. Several inference algorithms can be formulated as (recurrent) network layers amenable to end-to-end learning [51, 43], possibly with general pairwise potentials [33, 7, 28]. While our method of information propagation shares both these characteristics, our GRUs operate on a d_z-dimensional intermediate feature space, which is richer and less constrained than low-dimensional vectors representing beliefs over classes, as also discussed in [11]. Such enhanced access to information is motivated by the desire to learn a powerful representation of context which goes beyond belief compatibilities, as well as the desire to be able to discriminate our often relatively weak unaries (superpixel embeddings). We empirically evaluate these claims in Section 4.3.
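The update rules above can be compacted into a few lines of PyTorch. The sketch below implements one iteration of Equations 4–8 with the ECC-VV variant and input gating; it is a minimal illustration under our own naming (SPGConvStep, edge_index, edge_attr, and the Θ architecture are assumptions), not the released implementation.

```python
import torch
import torch.nn as nn

class SPGConvStep(nn.Module):
    """One message-passing iteration of Equations 4-8 (sketch).

    h: (N, d) hidden states; edge_index: (2, E) long tensor of (source j,
    target i) superedge pairs; edge_attr: (E, d_f) superedge features.
    """
    def __init__(self, d=32, d_f=13):
        super().__init__()
        self.theta = nn.Sequential(nn.Linear(d_f, 64), nn.ReLU(),
                                   nn.Linear(64, d))       # filter-generating net
        self.w_g = nn.Linear(d, d)                         # input gate (Eq. 6)
        self.w_x = nn.Linear(d, 3 * d)                     # Eq. 5
        self.w_h = nn.Linear(d, 3 * d)
        self.ln_x = nn.LayerNorm(3 * d)                    # rho: LayerNorm
        self.ln_h = nn.LayerNorm(3 * d)

    def forward(self, h, edge_index, edge_attr):
        j, i = edge_index
        msg = self.theta(edge_attr) * h[j]                 # Eq. 7, ECC-VV weighting
        m = torch.zeros_like(h).index_add_(0, i, msg)
        deg = torch.zeros(h.size(0), 1, device=h.device).index_add_(
            0, i, torch.ones(i.size(0), 1, device=h.device))
        m = m / deg.clamp(min=1)                           # mean over neighbors
        x = torch.sigmoid(self.w_g(h)) * m                 # Eq. 6: input gating
        x1, x2, x3 = self.ln_x(self.w_x(x)).chunk(3, dim=-1)   # Eq. 5
        h1, h2, h3 = self.ln_h(self.w_h(h)).chunk(3, dim=-1)
        u = torch.sigmoid(x2 + h2)                         # update gate
        r = torch.sigmoid(x3 + h3)                         # reset gate
        q = torch.tanh(x1 + r * h1)                        # Eq. 4
        return (1 - u) * q + u * h                         # new hidden state
```

In the full model, such a step is applied T = 10 times with shared weights, and the sequence of hidden states (h^{(1)}, ..., h^{(T+1)}) is concatenated and linearly mapped to class logits as in Equation 8.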
3.5. Further Details

Adjacency Graphs. In this paper, we use two different adjacency graphs between points of the input clouds: G_nn in Section 3.1 and G_vor in Section 3.2. Indeed, different definitions of adjacency have different advantages. Voronoi adjacency is more suited to capture long-range relationships between superpoints, which is beneficial for the SPG. Nearest-neighbor adjacency tends not to connect objects separated by a small gap. This is desirable for the global energy but tends to produce an SPG with many small connected components, decreasing embedding quality. Fixed-radius adjacency should be avoided in general as it handles the variable density of LiDAR scans poorly.

Training. While the geometric partitioning step is unsupervised, superpoint embedding and contextual segmentation are trained jointly in a supervised way with a cross-entropy loss. Superpoints are assumed to be semantically homogeneous and, consequently, assigned a hard ground-truth label corresponding to the majority label among their contained points. We also considered using soft labels corresponding to normalized histograms of point labels and training with a Kullback-Leibler [25] divergence loss. It performed slightly worse in our initial experiments, though.

Naive training on large SPGs may approach the memory limits of current GPUs. We circumvent this issue by randomly subsampling the sets of superpoints at each iteration and training on induced subgraphs, i.e. graphs composed of subsets of nodes and the original edges connecting them. Specifically, graph neighborhoods of order 3 are sampled to select at most 512 superpoints per SPG with more than n_minp points, as smaller superpoints are not embedded. Note that as the induced graph is a union of small neighborhoods, relationships over many hops may still be formed and learned. This strategy also doubles as data augmentation and a strong regularization, together with the randomized sampling of point clouds described in Section 3.3. Additional data augmentation is performed by randomly rotating superpoints around the vertical axis and jittering point features by Gaussian noise N(0, 0.01) truncated to [−0.05, 0.05].
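The sketch below illustrates one plausible implementation of this subgraph sampling; the seeding order and the stopping criterion are our guesses rather than the authors' exact procedure.

```python
import random

def sample_induced_subgraph(adjacency, embeddable, max_nodes=512, order=3):
    """Sample an induced SPG subgraph for one training step (sketch).

    adjacency: dict mapping each superpoint id to a set of neighbor ids;
    embeddable: set of superpoints with more than n_minp points.
    Grows order-3 graph neighborhoods around random seeds until up to
    max_nodes embeddable superpoints are selected, then keeps all original
    superedges between selected nodes (the induced subgraph).
    """
    seeds = list(embeddable)
    random.shuffle(seeds)
    selected = set()
    for seed in seeds:
        if len(selected & embeddable) >= max_nodes:
            break
        frontier = {seed}
        for _ in range(order):                         # expand 3 hops
            frontier |= {n for f in frontier for n in adjacency[f]}
        selected |= frontier
    edges = [(u, v) for u in selected for v in adjacency[u] if v in selected]
    return selected, edges
```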
Testing. In modern deep learning frameworks, testing can be made very memory-efficient by discarding layer activations as soon as the follow-up layers have been computed. In practice, we were able to label full SPGs at once. To compensate for the randomness due to subsampling of point clouds in PointNets, we average logits obtained over 10 runs with different seeds.

4. Experiments

We evaluate our pipeline on the two currently largest point cloud segmentation benchmarks, Semantic3D [14] and Stanford Large-Scale 3D Indoor Spaces (S3DIS) [3], on both of which we set the new state of the art. Furthermore, we perform an ablation study of our pipeline in Section 4.3. Even though the two datasets are quite different in nature (large outdoor scenes for Semantic3D, smaller indoor scans for S3DIS), we use nearly the same model for both. The deep model is rather compact, and 6 GB of GPU memory is enough for both testing and training. We refer to Appendix A for precise details on hyperparameter selection, architecture configuration, and training procedure.

Performance is evaluated using three metrics: per-class intersection over union (IoU), per-class accuracy (Acc), and overall accuracy (OA), defined as the proportion of correctly classified points. We stress that the metrics are computed on the original point clouds, not on superpoints.

4.1. Semantic3D

Semantic3D [14] is the largest available LiDAR dataset with over 3 billion points from a variety of urban and rural scenes. Each point has RGB and intensity values (the latter of which we do not use). The dataset consists of 15 training scans and 15 test scans with withheld labels. We also evaluate on the reduced set of 4 subsampled scans, as is common in past work.

In Table 2, we provide the results of our algorithm compared to other recent state-of-the-art algorithms, and in Figure 3, we provide qualitative results of our framework. Our framework improves significantly on the state of the art of semantic segmentation for this dataset, i.e. by nearly 12 mIoU points on the reduced set and by nearly 9 mIoU points on the full set. In particular, we observe a steep gain on the "artefact" class. This can be explained by the ability of the partitioning algorithm to detect artifacts due to their singular shape, while they are hard to capture using snapshots, as suggested by [5]. Furthermore, these small objects are often merged with the road when performing spatial regularization.

4.2. Stanford Large-Scale 3D Indoor Spaces

The S3DIS dataset [3] consists of 3D RGB point clouds of six floors from three different buildings split into individual rooms. We evaluate our framework following the two dominant strategies found in previous works. As advocated by [39, 10], we perform 6-fold cross validation with micro-averaging, i.e. computing metrics once over the merged predictions of all test folds. Following [46], we also report the performance on the fifth fold only (Area 5), corresponding to a building not present in the other folds. Since some classes in this dataset cannot be partitioned purely using geometric features (such as boards or paintings on walls), we concatenate the color information o to the geometric features f for the partitioning step.

The quantitative results are displayed in Table 3, with qualitative results in Figure 3 and in Appendix D. S3DIS is a difficult dataset with hard-to-retrieve classes such as white boards on white walls and columns within walls. From the quantitative results we can see that our framework performs better than other methods on average. Notably, doors are correctly classified at a higher rate than by other approaches, as long as they are open, as illustrated in Figure 3. Indeed, doors are geometrically similar to walls, but their position with respect to the door frame allows our network to retrieve them correctly. On the other hand, the partition merges white boards with walls, depriving the network of the opportunity to even learn to classify them: the IoU of boards for a theoretically perfect classification of superpoints (as in Section 4.3) is only 51.3.

Computation Time. In Table 4, we report the computation time over the different steps of our pipeline for the inference on Area 5, measured on a 4 GHz CPU and a GTX 1080 Ti GPU. While the bulk of the time is spent on the CPU for partitioning and SPG computation, we show that voxelization as pre-processing, detailed in Appendix A, leads to a significant speed-up as well as improved accuracy.
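As a concrete illustration of this pre-processing step, the sketch below performs grid subsampling by per-voxel averaging with numpy; the actual pipeline additionally interpolates predictions back to the full cloud in a nearest-neighbor fashion, which is not shown.

```python
import numpy as np

def voxelize(points, values, voxel=0.05):
    """Grid subsampling by per-voxel averaging (sketch of Appendix A).

    points: (n, 3) coordinates; values: (n, d) observations (e.g. RGB).
    Returns one averaged point (and value) per occupied voxel of edge
    length `voxel` meters (5 cm for Semantic3D, 3 cm for S3DIS).
    """
    keys = np.floor(points / voxel).astype(np.int64)
    _, inv, counts = np.unique(keys, axis=0, return_inverse=True,
                               return_counts=True)
    sums_p = np.zeros((len(counts), 3))
    np.add.at(sums_p, inv, points)                 # accumulate positions per voxel
    sums_v = np.zeros((len(counts), values.shape[1]))
    np.add.at(sums_v, inv, values)                 # accumulate observations
    return sums_p / counts[:, None], sums_v / counts[:, None]
```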

reduced test set: 78,699,329 points

Method          OA     mIoU   man-made  natural  high   low    build-  hard-  scanning  cars
                               terrain   terrain  veg.   veg.   ings    scape  artefact
TMLC-MSR [15]   86.2   54.2   89.8      74.5     53.7   26.8   88.8    18.9   36.4      44.7
DeePr3SS [29]   88.9   58.5   85.6      83.2     74.2   32.4   89.7    18.5   25.1      59.2
SnapNet [5]     88.6   59.1   82.0      77.3     79.7   22.9   91.1    18.4   37.3      64.4
SegCloud [46]   88.1   61.3   83.9      66.0     86.0   40.5   91.1    30.9   27.5      64.3
SPG (Ours)      94.0   73.2   97.4      92.6     87.9   44.0   93.2    31.0   63.5      76.2

full test set: 2,091,952,018 points

TMLC-MS [15]    85.0   49.4   91.1      69.5     32.8   21.6   87.6    25.9   11.3      55.3
SnapNet [5]     91.0   67.4   89.6      79.5     74.8   56.1   90.9    36.5   34.3      77.2
SPG (Ours)      92.9   76.2   91.5      75.6     78.3   71.7   94.4    56.8   52.9      88.4

Table 2: Intersection over union metric for the different classes of the Semantic3D dataset. OA is the global accuracy, while mIoU refers to the unweighted average of the IoU of each class.

[Figure 3 class legends: Semantic3D — road, grass, tree, bush, buildings, hardscape, artefacts, cars; S3DIS — ceiling, floor, wall, column, beam, window, door, table, chair, bookcase, sofa, board, clutter, unlabelled.]

(a) RGB point cloud (b) Geometric partitioning (c) Prediction (d) Ground truth

Figure 3: Example visualizations on both datasets. The colors in (b) are chosen randomly for each element of the partition.

Method                  OA      mAcc    mIoU    ceiling  floor  wall   beam  column  window  door   chair  table  bookcase  sofa   board  clutter
A5  PointNet [39]       –       48.98   41.09   88.80    97.33  69.80  0.05  3.92    46.26   10.76  52.61  58.93  40.28     5.85   26.38  33.22
A5  SEGCloud [46]       –       57.35   48.92   90.06    96.05  69.86  0.00  18.37   38.35   23.12  75.89  70.40  58.42     40.88  12.96  41.60
A5  SPG (Ours)          86.38   66.50   58.04   89.35    96.87  78.12  0.0   42.81   48.93   61.58  84.66  75.41  69.84     52.60  2.10   52.22
PointNet [39] in [10]   78.5    66.2    47.6    88.0     88.7   69.3   42.4  23.1    47.5    51.6   42.0   54.1   38.2      9.6    29.4   35.2
Engelmann et al. [10]   81.1    66.4    49.7    90.3     92.1   67.9   44.7  24.2    52.3    51.2   47.4   58.1   39.0      6.9    30.0   41.9
SPG (Ours)              85.5    73.0    62.1    89.9     95.1   76.4   62.8  47.1    55.3    68.4   73.5   69.2   63.2      45.9   8.7    52.9

Table 3: Results on the S3DIS dataset on fold "Area 5" (top) and micro-averaged over all 6 folds (bottom). Intersection over union is shown split per class.

4.3. Ablation Studies

To better understand the influence of various design choices made in our framework, we compare it to several baselines and perform an ablation study. Due to the lack of public ground truth for the test sets of Semantic3D, we evaluate on S3DIS with 6-fold cross validation and show a comparison of different models to our Best model in Table 5.

Performance Limits. The contribution of contextual segmentation can be bounded both from below and above. The lower bound (Unary) is estimated by training PointNet with d_z = 13 but otherwise the same architecture, denoted as PointNet13, to directly predict class logits, without SPG and GRUs. The upper bound (Perfect) corresponds to assigning each superpoint its ground truth label, and thus sets the limit of performance due to the geometric partition. We can see that contextual segmentation is able to win roughly 22 mIoU points over unaries, confirming its importance. Nevertheless, the learned model still has room of up to 26 mIoU points for improvement, while about 12 mIoU points are forfeited to the semantic inhomogeneity of superpoints.

Step                  Full cloud   2 cm     3 cm     4 cm
Voxelization          0            40       24       16
Feature computation   439          194      88       43
Geometric partition   3428         1013     447      238
SPG computation       3800         958      436      252
Inference             10 × 24      10 × 11  10 × 6   10 × 5
Total                 7907         2315     1055     599
mIoU 6-fold           54.1         60.2     62.1     57.1

Table 4: Computation time in seconds for the inference on S3DIS Area 5 (68 rooms, 78,649,682 points) for different voxel sizes.

Model        mAcc   mIoU
Best         73.0   62.1
Perfect      92.7   88.2
Unary        50.8   40.0
iCRF         51.5   40.7
CRF−ECC      65.6   55.3
GRU13        69.1   58.5
NoInputGate  68.6   57.5
NoConcat     69.3   57.7
NoEdgeFeat   50.1   39.9
ECC−VV       70.2   59.4

Table 5: Ablation study and comparison to various baselines on S3DIS (6-fold cross validation).

CRFs. We compare the effect of our GRU+ECC-based network to CRF-based regularization. As a baseline (iCRF), we post-process Unary outputs by CRF inference over SPG connectivity with a scalar transition matrix, as described by [13]. Next (CRF−ECC), we adapt the CRF-RNN framework of Zheng et al. [51] to general graphs with edge-conditioned convolutions (see Appendix B for details) and train it with PointNet13 end-to-end. Finally (GRU13), we modify Best to use PointNet13. We observe that iCRF barely improves accuracy (+1 mIoU), which is to be expected, since the partitioning step already encourages spatial regularity. CRF−ECC does better (+15 mIoU) due to end-to-end learning and the use of edge attributes, though it is still below GRU13 (+18 mIoU), which performs more complex operations and does not enforce normalization of the embedding. Nevertheless, the 32 channels used in Best instead of the 13 used in GRU13 provide even more freedom for feature representation (+22 mIoU).

Ablation. We explore the advantages of several design choices by individually removing them from Best in order to compare the framework's performance with and without them. In NoInputGate we remove input gating in the GRU; in NoConcat we only consider the last hidden state in the GRU for output as $y_i = W_o \, h_i^{(T+1)}$ instead of the concatenation of all steps; in NoEdgeFeat we perform homogeneous regularization by setting all superedge features to scalar 1; and in ECC−VV we use the proposed lightweight formulation of ECC. We can see that each of the first two choices accounts for about 5 mIoU points. Next, without edge features our method falls back even below iCRF to the level of Unary, which validates their design and the overall motivation for the SPG. ECC−VV decreases the performance on the S3DIS dataset by 3 mIoU points, whereas it improved the performance on Semantic3D by 2 mIoU. Finally, we invite the reader to Appendix C for further ablations.

5. Conclusion

We presented a deep learning framework for performing semantic segmentation of large point clouds based on a partition into simple shapes. We showed that SPGs allow us to use effective deep learning tools, which would not be able to handle the data volume otherwise. Our method significantly improves on the state of the art on two publicly available datasets. Our experimental analysis suggested that future improvements can be made in both partitioning and learning deep contextual classifiers.

The source code in PyTorch as well as the trained models are available at https://github.com/loicland/superpoint_graph.

References

[1] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11):2274–2282, 2012.
[2] A. Anand, H. S. Koppula, T. Joachims, and A. Saxena. Contextually guided semantic labeling and search for three-dimensional point clouds. The International Journal of Robotics Research, 32(1):19–34, 2013.
[3] I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. Brilakis, M. Fischer, and S. Savarese. 3D semantic parsing of large-scale indoor spaces. In CVPR, 2016.
[4] L. J. Ba, R. Kiros, and G. E. Hinton. Layer normalization. CoRR, abs/1607.06450, 2016.
[5] A. Boulch, B. L. Saux, and N. Audebert. Unstructured point cloud semantic labeling using deep segmentation networks. In Eurographics Workshop on 3D Object Retrieval, volume 2, 2017.
[6] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(11):1222–1239, 2001.
[7] S. Chandra and I. Kokkinos. Fast, exact and multi-scale inference for semantic image segmentation with deep Gaussian CRFs. In ECCV, 2016.
[8] K. Cho, B. van Merriënboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In EMNLP, 2014.
[9] J. Demantké, C. Mallet, N. David, and B. Vallet. Dimensionality based scale selection in 3D lidar point clouds. International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, XXXVIII-5/W12:97–102, 2011.
[10] F. Engelmann, T. Kontogianni, A. Hermans, and B. Leibe. Exploring spatial context for 3D semantic segmentation of point clouds. In ICCV, 3DRMS Workshop, 2017.
[11] R. Gadde, V. Jampani, M. Kiefel, D. Kappler, and P. Gehler. Superpixel convolutional networks using bilateral inceptions. In ECCV, 2016.
[12] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message passing for quantum chemistry. In ICML, pages 1263–1272, 2017.
[13] S. Guinard and L. Landrieu. Weakly supervised segmentation-aided classification of urban scenes from 3D LiDAR point clouds. In ISPRS 2017, 2017.
[14] T. Hackel, N. Savinov, L. Ladicky, J. D. Wegner, K. Schindler, and M. Pollefeys. Semantic3D.net: A new large-scale point cloud classification benchmark. arXiv preprint arXiv:1704.03847, 2017.
[15] T. Hackel, J. D. Wegner, and K. Schindler. Fast semantic segmentation of 3D point clouds with strongly varying density. ISPRS Annals of Photogrammetry, Remote Sensing & Spatial Information Sciences, 3(3), 2016.
[16] H. Hu, D. Munoz, J. A. Bagnell, and M. Hebert. Efficient 3-D scene analysis from streaming data. In ICRA, 2013.
[17] G. Huang, Z. Liu, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017.
[18] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[19] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In NIPS, pages 2017–2025, 2015.
[20] J. W. Jaromczyk and G. T. Toussaint. Relative neighborhood graphs and their relatives. Proceedings of the IEEE, 80(9):1502–1517, 1992.
[21] B.-S. Kim, P. Kohli, and S. Savarese. 3D scene understanding by voxel-CRF. In ICCV, 2013.
[22] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[23] R. Klokov and V. S. Lempitsky. Escape from cells: Deep Kd-networks for the recognition of 3D point cloud models. CoRR, abs/1704.01222, 2017.
[24] H. S. Koppula, A. Anand, T. Joachims, and A. Saxena. Semantic labeling of 3D point clouds for indoor scenes. In NIPS, pages 244–252, 2011.
[25] S. Kullback and R. A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86, 1951.
[26] L. Landrieu and G. Obozinski. Cut pursuit: Fast algorithms to learn piecewise constant functions on general weighted graphs. SIAM Journal on Imaging Sciences, 10(4):1724–1766, 2017.
[27] L. Landrieu, H. Raguet, B. Vallet, C. Mallet, and M. Weinmann. A structured regularization framework for spatially smoothing semantic labelings of 3D point clouds. ISPRS Journal of Photogrammetry and Remote Sensing, 132:102–118, 2017.
[28] M. Larsson, F. Kahl, S. Zheng, A. Arnab, P. H. S. Torr, and R. I. Hartley. Learning arbitrary potentials in CRFs with gradient descent. CoRR, abs/1701.06805, 2017.
[29] F. J. Lawin, M. Danelljan, P. Tosteberg, G. Bhat, F. S. Khan, and M. Felsberg. Deep projective 3D semantic segmentation. arXiv preprint arXiv:1705.03428, 2017.
[30] Y. Li, D. Tarlow, M. Brockschmidt, and R. S. Zemel. Gated graph sequence neural networks. In ICLR, 2016.
[31] X. Liang, L. Lin, X. Shen, J. Feng, S. Yan, and E. P. Xing. Interpretable structure-evolving LSTM. In CVPR, pages 2175–2184, 2017.
[32] X. Liang, X. Shen, J. Feng, L. Lin, and S. Yan. Semantic object parsing with graph LSTM. In ECCV, pages 125–143, 2016.
[33] G. Lin, C. Shen, A. van den Hengel, and I. D. Reid. Efficient piecewise training of deep structured models for semantic segmentation. In CVPR, 2016.
[34] Y. Lu and C. Rasmussen. Simplified Markov random fields for efficient semantic labeling of 3D point clouds. In IROS, pages 2690–2697, 2012.
[35] A. Martinovic, J. Knopp, H. Riemenschneider, and L. Van Gool. 3D all the way: Semantic segmentation of urban scenes from start to end in 3D. In CVPR, 2015.

[36] F. Monti, D. Boscaini, J. Masci, E. Rodolà, J. Svoboda, and M. M. Bronstein. Geometric deep learning on graphs and manifolds using mixture model CNNs. In CVPR, pages 5425–5434, 2017.
[37] D. Munoz, J. A. Bagnell, N. Vandapel, and M. Hebert. Contextual classification with functional max-margin Markov networks. In CVPR, 2009.
[38] J. Niemeyer, F. Rottensteiner, and U. Soergel. Contextual classification of lidar data and building object detection in urban areas. ISPRS Journal of Photogrammetry and Remote Sensing, 87:152–165, 2014.
[39] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In CVPR, 2017.
[40] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In NIPS, 2017.
[41] X. Qi, R. Liao, J. Jia, S. Fidler, and R. Urtasun. 3D graph neural networks for RGBD semantic segmentation. In ICCV, pages 5209–5218, 2017.
[42] G. Riegler, A. O. Ulusoy, and A. Geiger. OctNet: Learning deep 3D representations at high resolutions. In CVPR, 2017.
[43] A. G. Schwing and R. Urtasun. Fully connected deep structured networks. CoRR, abs/1503.02351, 2015.
[44] R. Shapovalov, D. Vetrov, and P. Kohli. Spatial inference machines. In CVPR, 2013.
[45] M. Simonovsky and N. Komodakis. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In CVPR, 2017.
[46] L. P. Tchapmi, C. B. Choy, I. Armeni, J. Gwak, and S. Savarese. SEGCloud: Semantic segmentation of 3D point clouds. arXiv preprint arXiv:1710.07563, 2017.
[47] M. Weinmann, S. Hinz, and M. Weinmann. A hybrid semantic point cloud classification-segmentation framework based on geometric features and semantic rules. PFG–Journal of Photogrammetry, Remote Sensing and Geoinformation Science, 85(3):183–194, 2017.
[48] M. Weinmann, A. Schmidt, C. Mallet, S. Hinz, F. Rottensteiner, and B. Jutzi. Contextual classification of point cloud data by exploiting individual 3D neighborhoods. ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, II-3/W4:271–278, 2015.
[49] D. Wolf, J. Prankl, and M. Vincze. Fast semantic segmentation of 3D point clouds using a dense CRF with learned parameters. In ICRA, 2015.
[50] D. Xu, Y. Zhu, C. B. Choy, and L. Fei-Fei. Scene graph generation by iterative message passing. In CVPR, pages 3097–3106, 2017.
[51] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. S. Torr. Conditional random fields as recurrent neural networks. In ICCV, 2015.

Appendix

A. Model Details

Voxelization. We pre-process input point clouds with voxelization subsampling by computing per-voxel mean positions and observations over a regular 3D grid (5 cm bins for Semantic3D and 3 cm bins for the S3DIS dataset). The resulting semantic segmentation is interpolated back to the original point cloud in a nearest-neighbor fashion. Voxelization helps to decrease the computation time and memory requirements, and improves the accuracy of the semantic segmentation by acting as a form of geometric and radiometric denoising as well (Table 4 in the main paper). The quality of further steps is practically not affected, as superpoints are usually strongly subsampled for embedding during learning and inference anyway (Section 3.3 in the main paper).

Geometric Partition. We set regularization strength µ = 0.8 for Semantic3D and µ = 0.03 for S3DIS, which strikes a balance between semantic homogeneity of superpoints and the potential for their successful discrimination (S3DIS is composed of smaller semantic parts than Semantic3D). In addition to the five geometric features f (linearity, planarity, scattering, verticality, elevation), we use color information o for clustering in S3DIS due to some classes being geometrically indistinguishable, such as boards or doors.

[Figure 4 diagram: STN → shared MLP(64, 64, 128, 128, 256) → maxpool → metric diameter appended (257 × 1) → MLP(256, 64, 32) → d_z × 1.]

Figure 4: The PointNet embedding of n_p d_p-dimensional samples of a superpoint into a d_z-dimensional vector.

PointNet. We use a simplified shallow and narrow PointNet architecture with just a single Spatial Transformer Network (STN), see Figure 4. We set n_p = 128 and n_minp = 40. Input points are processed by a sequence of MLPs (widths 64, 64, 128, 128, 256) and max-pooled to a single vector of 256 features. The scalar metric diameter is appended and the result further processed by a sequence of MLPs (widths 256, 64, d_z = 32). A residual matrix Φ ∈ R^{2×2} is regressed by the STN and (I + Φ) is used to transform the XY coordinates of input points as the first step. The architecture of the STN is a "small PointNet" with 3 MLPs (widths 64, 64, 128) before max pooling and 3 MLPs after (widths 128, 64, 4). Batch Normalization [18] and ReLUs are used everywhere.
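A minimal PyTorch rendering of this architecture is sketched below; it assumes batched superpoints of identical size, omits the STN branch for brevity, and should be read as an illustration of the stated layer widths rather than the released model.

```python
import torch
import torch.nn as nn

def mlp(*widths):
    """Stack of Linear + BatchNorm + ReLU layers of the given widths."""
    layers = []
    for a, b in zip(widths[:-1], widths[1:]):
        layers += [nn.Linear(a, b), nn.BatchNorm1d(b), nn.ReLU()]
    return nn.Sequential(*layers)

class CompactPointNet(nn.Module):
    """Sketch of the compact PointNet of Appendix A (d_p = 11, d_z = 32)."""
    def __init__(self, d_p=11, d_z=32):
        super().__init__()
        self.point_mlp = mlp(d_p, 64, 64, 128, 128, 256)   # shared per-point MLPs
        self.head = mlp(256 + 1, 256, 64)                  # after diameter concat
        self.out = nn.Linear(64, d_z)

    def forward(self, x, diameter):
        # x: (B, n_p, d_p) batched superpoints; diameter: (B,) metric diameters.
        B, n, d = x.shape
        f = self.point_mlp(x.reshape(B * n, d)).reshape(B, n, 256)
        g = f.max(dim=1).values                            # max-pool over points
        g = torch.cat([g, diameter.unsqueeze(1)], dim=1)   # append diameter (257)
        return self.out(self.head(g))                      # (B, d_z) embedding
```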

Input points have d_p = 11-dimensional features for Semantic3D (position p_i, color o_i, geometric features f_i), with 3 additional ones for S3DIS (room-normalized spatial coordinates, as in past work [39]).

Segmentation Network. We use embedding dimensionality d_z = 32 and T = 10 iterations. ECC-VV is used for Semantic3D (there are only 15 point clouds, even though the number of points is large), while ECC-MV is used for S3DIS (large number of point clouds). The filter-generating network Θ is an MLP with 4 layers (widths 32, 128, 64, and 32 or 32² for ECC-VV or ECC-MV, respectively) with ReLUs. Batch Normalization is used only after the third parametric layer. No bias is used in the last layer. Superedges have d_f = 13-dimensional features, normalized by mean subtraction and scaling to unit variance based on the whole training set.

Training. We train using Adam [22] with initial learning rate 0.01 and batch size 2, i.e. effectively up to 1024 superpoints per batch. For Semantic3D, we train for 500 epochs with stepwise learning rate decay of 0.7 at epochs 350, 400, and 450. For S3DIS, we train for 250 epochs with steps at 200 and 230. We clip gradients within [−1, 1].

B. CRF-ECC

In this section, we describe our adaptation of the CRF-RNN mean field inference by Zheng et al. [51] for post-processing PointNet embeddings in the SPG, denoted as unary potentials U_i here.

The original work proposed a dense CRF with pairwise potentials Ψ defined as a mixture of m Gaussian kernels, $\Psi_{ij} = \mu \sum_m w_m K_m(F_{ij})$, where µ is a label compatibility matrix, w are parameters, and K are fixed Gaussian kernels applied on edge features.

We replace this definition of the pairwise term with a filter-generating network Θ [45] parameterized with weights W_e, which generalizes the message passing and compatibility transform steps of Zheng et al. Furthermore, we use superedge connectivity E instead of assuming a complete graph. The pseudo-code is listed in Algorithm 1. Its output are marginal probability distributions Q. In practice, we run the inference for T = 10 iterations.

Algorithm 1 CRF-ECC
    Q_i ← softmax(U_i)
    while not converged do
        Q̂_i ← Σ_{j | (j,i) ∈ E} Θ(F_{ji,·}; W_e) Q_j
        Q̆_i ← U_i − Q̂_i
        Q_i ← softmax(Q̆_i)
    end while
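Assuming Θ regresses a flattened K × K compatibility matrix per superedge (the ECC-MV form), the loop of Algorithm 1 can be sketched in PyTorch as follows, with a fixed iteration count standing in for the convergence test.

```python
import torch

def crf_ecc(unary, edge_index, edge_attr, theta, n_iter=10):
    """Mean-field inference of Algorithm 1 (sketch).

    unary: (N, K) class potentials U_i; edge_index: (2, E) long tensor of
    (source j, target i) superedges; theta: a filter-generating network
    assumed to map (E, d_f) superedge features to flattened K*K matrices.
    Returns marginal probability distributions Q of shape (N, K).
    """
    N, K = unary.shape
    j, i = edge_index
    q = torch.softmax(unary, dim=-1)
    for _ in range(n_iter):
        w = theta(edge_attr).view(-1, K, K)               # per-edge compatibility
        msg = torch.bmm(w, q[j].unsqueeze(-1)).squeeze(-1)
        q_hat = torch.zeros_like(q).index_add_(0, i, msg)  # sum over (j, i) in E
        q = torch.softmax(unary - q_hat, dim=-1)           # local update
    return q
```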
C. Extended Ablation Studies

In this section, we present an additional set of experiments to validate our design choices; their results are shown in Table 6.

a) Spatial Transformer Network. While the STN makes superpoint embedding orientation invariant, the relationships with surrounding objects are still captured by superedges, which are orientation variant. In practice, the STN helps by 4 mIoU points.

b) Geometric Features. Geometric features f_i are computed in the geometric partition step and can therefore be used in the following learning step for free. While PointNets could be expected to learn similar features from the data, this is hampered by superpoint subsampling, and therefore their explicit use helps (+4 mIoU).

c) Sampling Superpoints. The main effect of subsampling SPGs is regularization by data augmentation. Too small a sample size leads to disregarding contextual information (−4 mIoU), while too large a size leads to overfitting (−2 mIoU). Lower memory requirements at training time are an extra benefit. There is no subsampling at test time.

d) Long-range Context. We observe that limiting the range of context information in the SPG harms the performance. Specifically, capping distances in G_vor to 1 m (as used in PointNet [39]) or 5 m (as used in SegCloud [46]) worsens the performance of our method (even more so on our Semantic3D validation set). (Furthermore, SegCloud divides the inference into cubes without overlap, possibly causing inconsistencies across boundaries.)

e) Input Gate. We evaluate the effect of input gating (IG) for GRUs as well as LSTM units. While an LSTM unit achieves a higher score than a GRU (−3 mIoU), the proposed IG reverses this situation in favor of the GRU (+1 mIoU). Unlike the standard input gate of LSTM, which controls the information flow from the hidden state and input to the cell, our IG controls the input even before it is used to compute all other gates.

f) Regularization Strength µ. We investigate the balance between superpoints' discriminative potential and their homogeneity, controlled by parameter µ. We observe that the system is able to perform reasonably over a range of SPG sizes.

g) Superpoint Sizes. We include a breakdown of superpoint sizes for µ = 0.03 in relation to hyperparameters n_minp = 40 and n_p = 128, showing that 93% of points are in embedded superpoints, and 79% in superpoints that are subsampled.

Superedge Features. Finally, in Table 7 we evaluate the empirical importance of individual superedge features by removing them from Best. Although no single feature is crucial (the most important being offset deviation, +3 mIoU), we remind the reader that without any superedge features the network performs distinctly worse (NoEdgeFeat, −22 mIoU).

a) Spatial transf.        no       yes
   mIoU                   58.1     62.1
b) Geometric features     no       yes
   mIoU                   58.4     62.1
c) Max superpoints        256      512      1024
   mIoU                   57.9     62.1     60.4
d) Superedge limit        1 m      5 m      ∞
   mIoU                   61.0     61.3     62.1
e) Input gate             LSTM     LSTM+IG  GRU      GRU+IG
   mIoU                   61.0     61.0     57.5     62.1
f) Regularization µ       0.01     0.02     0.03     0.04
   # superpoints          785,010  385,091  251,266  186,108
   perfect mIoU           90.6     88.2     86.6     85.2
   mIoU                   59.1     59.2     62.1     58.8
g) Superpoint size        1–40     40–128   128–1000  ≥ 1000
   proportion of points   7%       14%      27%      52%

Table 6: Ablation study of design decisions on S3DIS (6-fold cross validation). Our choices in bold.

Model                    mAcc   mIoU
Best                     73.0   62.1
no mean offset           72.5   61.8
no offset deviation      71.7   59.3
no centroid offset       74.5   61.2
no len/surf/vol ratios   71.2   60.7
no point count ratio     72.7   61.7

Table 7: Ablation study of superedge features on S3DIS (6-fold cross validation).

[Figure 5: histogram plot; y-axis: number of points (×10⁶); x-axis: size of superpoints on a log scale, with ticks at 40, 128, 1,000, and 10,000.]

Figure 5: Histogram of points contained in superpoints of different size (in log scale) on the full S3DIS dataset. The embedding threshold n_minp and subsampling threshold n_p are marked in red.

D. Video Illustration

We provide a video illustrating our method and qualitative results on the S3DIS dataset, which can be viewed at https://youtu.be/Ijr3kGSU_tU.

