


This paper has been accepted for publication in the International Journal of Robotics Research.

DOI: 10.1177/0278364919863090

Please cite our work as:

R. Dubé, A. Cramariuc, D. Dugas, H. Sommer, M. Dymczyk, J. Nieto, R. Siegwart, C. Cadena. “SegMap: Segment-based
mapping and localization using data-driven descriptors”. The International Journal of Robotics Research.

bibtex:
@article{segmap2019dube,
  author  = {Renaud Dub\'e and Andrei Cramariuc and Daniel Dugas
             and Hannes Sommer and Marcin Dymczyk and Juan Nieto
             and Roland Siegwart and Cesar Cadena},
  title   = {{SegMap}: Segment-based mapping and localization using
             data-driven descriptors},
  journal = {The International Journal of Robotics Research},
  doi     = {10.1177/0278364919863090}
}



SegMap: Segment-based mapping and localization using
data-driven descriptors
Renaud Dubé 1,2,*, Andrei Cramariuc 1,*, Daniel Dugas 1, Hannes Sommer 1,2, Marcin Dymczyk 1,2,
Juan Nieto 1, Roland Siegwart 1, and Cesar Cadena 1

Abstract
Precisely estimating a robot’s pose in a prior, global map is a fundamental capability for mobile robotics, e.g.
autonomous driving or exploration in disaster zones. This task, however, remains challenging in unstructured, dynamic
environments, where local features are not discriminative enough and global scene descriptors only provide coarse
information. We therefore present SegMap: a map representation solution for localization and mapping based on the
extraction of segments in 3D point clouds. Working at the level of segments offers increased invariance to view-point and
local structural changes, and facilitates real-time processing of large-scale 3D data. SegMap exploits a single compact
data-driven descriptor for performing multiple tasks: global localization, 3D dense map reconstruction, and semantic
information extraction. The performance of SegMap is evaluated in multiple urban driving and search and rescue
experiments. We show that the learned SegMap descriptor has superior segment retrieval capabilities, compared to
state-of-the-art handcrafted descriptors. In consequence, we achieve a higher localization accuracy and a 6% increase
in recall over the state-of-the-art. These segment-based localizations allow us to reduce the open-loop odometry drift
by up to 50%. SegMap is available open-source along with easy-to-run demonstrations.

Keywords
Global localization, place recognition, simultaneous localization and mapping (SLAM), LiDAR, 3D point clouds,
segmentation, 3D reconstruction, convolutional neural network (CNN), auto-encoder

1 Introduction

Mapping and localization are fundamental competencies for mobile robotics and have been well-studied topics over the last couple of decades (Cadena et al. (2016)). Being able to map an environment and later localize within it unlocks a multitude of applications that include autonomous driving, rescue robotics, service robotics, warehouse automation or automated goods delivery, to name a few. Robotic technologies undoubtedly have the potential to disrupt those applications within the next years. In order to allow for the successful deployment of autonomous robotic systems in such real-world environments, several challenges need to be overcome: mapping, localization and navigation in difficult conditions, for example crowded urban spaces, tight indoor areas or harsh natural environments. Reliable, prior-free global localization lies at the core of this challenge. Knowing the precise pose of a robot is necessary to guarantee reliable, robust and, most importantly, safe operation of mobile platforms, and also allows for multi-agent collaborations.

The problem of mapping and global localization has been well covered by the research community. On the one hand, a large body of algorithms use cameras and visual cues to perform place recognition. Relying purely on appearance has, however, significant limitations. In spite of tremendous progress within this field, state-of-the-art algorithms still struggle with changing seasons, weather or even day-night variations (Lowry et al. (2016)). On the other hand, several approaches address the variability of appearance by relying instead on the 3D structure extracted from LiDAR data, which is expected to be more consistent across the aforementioned changes. Current LiDAR-based Simultaneous Localization and Mapping (SLAM) systems, however, mostly use the 3D structure for local odometry estimation and map tracking, but fail to perform global localization without any prior on the pose of the robot (Hess et al. (2016)).

There exist several approaches that propose to use 3D point clouds for global place recognition. Some of them make use of various local features (Rusu et al. (2009); Salti et al. (2014)), which make it possible to establish correspondences between a query scan and a map and subsequently estimate a 6-Degree-of-Freedom (DoF) pose. The performance of those systems is limited, as local features are often not discriminative enough and not repeatable given the changes in the environment. Consequently, matching them is not always reliable and also incurs a large computational cost given the number of processed features. Another group of approaches relies on global descriptors of 3D LiDAR scans (Yin et al. (2018)) that permit finding a correspondence in the map. Global descriptors, however, are view-point dependent, especially when designed only for rotational invariance rather than translational invariance. Furthermore, a global scan descriptor is more prone to failures under dynamic scenes, e.g. parked cars, which can be important for reliable global localization in crowded, urban scenarios.

1 Autonomous Systems Lab (ASL), ETH Zurich, Switzerland
2 Sevensense Robotics AG, Zurich, Switzerland
* The authors contributed equally to this work.

Corresponding authors:
Renaud Dubé and Andrei Cramariuc
Email: renaud.dube@sevensense.ch and crandrei@ethz.ch



Figure 1. An illustration of the SegMap approach. The red and orange paths represent the trajectories of two robots driving simultaneously in opposite directions through an intersection. In white we show the local segments extracted from the robots' vicinity and characterized using our compact data-driven descriptor. Correspondences are then made with the target segments, resulting in a successful localization depicted with green vertical lines. A reconstruction of the target segments is illustrated below, where colors represent semantic information (cars in red, buildings in light blue, and others in green), all possible by leveraging the same compact representation. We take advantage of the semantic information by performing localization only against static objects, improving robustness against dynamic changes. Both the reconstruction and semantic classification are computed by leveraging the same descriptors used for global prior-free localization.

We therefore present SegMap*: a unified approach for map representation in the localization and mapping problem for 3D LiDAR point clouds. SegMap is formed on the basis of partitioning point clouds into sets of descriptive segments (Dubé et al. (2017a)), as illustrated in Figure 2. The segment-based localization combines the advantages of global scan descriptors and local features: it offers reliable matching of segments and delivers accurate 6-DoF global localizations in real-time. The 3D segments are obtained using efficient region-growing techniques which are able to repeatedly form similar partitions of the point clouds (Dubé et al. (2018b)). This partitioning provides the means for compact, yet discriminative features to efficiently represent the environment. During localization, global data associations are identified by segment descriptor retrieval, leveraging the repeatable and descriptive nature of segment-based features. This helps satisfy strict computational, memory and bandwidth constraints, and therefore makes the approach appropriate for real-time use in both multi-robot and long-term applications.

Figure 2. Exemplary segments extracted from 3D LiDAR data collected in a rural environment. These segments were extracted with an incremental Euclidean distance-based region-growing algorithm and represent, among others, vehicles, vegetation and parts of buildings (Dubé et al. (2018b)).

Previous work on segment-based localization considered hand-crafted features and provided only a sparse representation (Dubé et al. (2017a); Tinchev et al. (2018)). These features lack the ability to generalize to different environments and offer very limited insights into the underlying 3D structure. In this work, we overcome these shortcomings by introducing a novel data-driven segment descriptor which offers high retrieval performance, even under variations in view-point, and which generalizes well to unseen environments. Moreover, as segments typically represent meaningful and distinct elements that make up the environment, a scene can be effectively summarized by only a handful of descriptors. The resulting reconstructions, as depicted in Figure 1, can be built at no extra cost in descriptor computation or bandwidth usage. They can be used by robots for navigating around obstacles and visualized to improve situational awareness of remote operators. Moreover, we show that semantic labeling can be executed through classification in the descriptor space. This information can, for example, lead to increased robustness to changes in the environment by rejecting inherently dynamic classes.

To the best of our knowledge, this is the first work on robot localization that is able to leverage the extracted features for reconstructing environments in three dimensions and for retrieving semantic information. This reconstruction is, in our opinion, a very interesting capability for real-world, large-scale applications with limited memory and communication bandwidth. To summarize, this paper presents the following contributions:

• A data-driven 3D segment descriptor that improves localization performance.
• A novel technique for reconstructing the environment based on the same compact features used for localization.
• An extensive evaluation of the SegMap approach using real-world, multi-robot automotive and disaster scenario datasets.

In relation to the Robotics: Science and System conference paper (Dubé et al. (2018a)), we make the following additional contributions:

• A comparison of the accuracy of our localization output with the results of a recently published technique based on data-driven global 3D scan descriptors (Yin et al. (2018)).
• An evaluation of trajectory estimates obtained by combining our place recognition approach with a state-of-the-art 3D LiDAR-based SLAM technique (Zhang and Singh (2014)).
• A triplet loss descriptor training technique and its comparison to the previously introduced classification-based approach.
• A particularly lightweight variant of our SegMap descriptor that can be deployed on platforms with limited computational resources.

* SegMap is available open-source along with easy-to-run demonstrations at www.github.com/ethz-asl/segmap. A video demonstration is available at https://youtu.be/CMk4w4eRobg



The remainder of the paper is structured as follows: Section 2 provides an overview of the related work in the fields of localization and learning-based descriptors for 3D point clouds. The SegMap approach and our novel descriptor that enables reconstruction of the environment are detailed in Section 3 and Section 4. The method is evaluated in Section 5, and finally Sections 6 and 7 conclude with a short discussion and ideas on future works.

2 RELATED WORK

This section first introduces state-of-the-art approaches to localization in 3D point clouds. Data-driven techniques using 3D data which are relevant to the present work are then presented.

Localization in 3D point clouds  Detecting loop-closures from 3D data has been tackled with different approaches. We have identified three main trends: (i) approaches based on local features, (ii) approaches based on global descriptors, and (iii) approaches based on planes or objects.

A significant number of works propose to extract local features from keypoints and perform matching on the basis of these features. Bosse and Zlot (2013) extract keypoints directly from the point clouds and describe them with a 3D Gestalt descriptor. Keypoints then vote for their nearest neighbors in a vote matrix which is eventually thresholded for recognizing places. A similar approach has been used in Gawel et al. (2016). Apart from such Gestalt descriptors, a number of alternative local feature descriptors exist which can be used in similar frameworks. This includes features such as Fast Point Feature Histogram (FPFH) (Rusu et al. (2009)) and SHOT (Salti et al. (2014)). Alternatively, Zhuang et al. (2013) transform the local scans into bearing-angle images and extract Speeded Up Robust Features (SURFs) from these images. A strategy based on 3D spatial information is employed to order the scenes before matching the descriptors. A similar technique by Steder et al. (2010) first transforms the local scans into a range image. Local features are extracted and compared to the ones stored in a database, employing the Euclidean distance for matching keypoints. This work is extended in Steder et al. (2011) by using Normal-Aligned Radial Features (NARF) descriptors and a bag-of-words approach for matching.

Using global descriptors of the local point cloud for place recognition is also proposed in (Röhling et al. (2015); Granström et al. (2011); Magnusson et al. (2009); Cop et al. (2018)). Röhling et al. (2015) propose to describe each local point cloud with a 1D histogram of point heights, assuming that the sensor keeps a constant height above the ground. The histograms are then compared using the Wasserstein metric for recognizing places. Granström et al. (2011) describe point clouds with rotation-invariant features such as volume, nominal range, and range histogram. Distances are computed for feature vectors and cross-correlation for histogram features, and an AdaBoost classifier is trained to match places. Finally, Iterative Closest Point (ICP) is used for computing the relative pose between point clouds. In another approach, Magnusson et al. (2009) split the cloud into overlapping grids and compute shape properties (spherical, linear, and several types of planar) of each cell, and combine them into a matrix of surface shape histograms. Similar to other works, these descriptors are compared for recognizing places. Recently, Cop et al. (2018) proposed to leverage LiDAR intensity information with a global point cloud descriptor. A two-stage approach is adopted such that, after retrieving places based on global descriptor retrieval, a local keypoint-based geometric verification step estimates localization transformations. The authors demonstrated that using intensity information can reduce the computational timings. However, the complete localization pipeline operates at a frequency one order of magnitude lower than most LiDAR sensor frequencies.

While local keypoint features often lack descriptive power, global descriptors can struggle with variations in view-point. Therefore other works have also proposed to use 3D shapes or objects for the place recognition task. Fernández-Moral et al. (2013), for example, propose to perform place recognition by detecting planes in 3D environments. The planes are accumulated in a graph and an interpretation tree is used to match sub-graphs. A final geometric consistency test is conducted over the planes in the matched sub-graphs. The work is extended in Fernández-Moral et al. (2016) to use the covariance of the plane parameters instead of the number of points in planes for matching. This strategy is only applied to small, indoor environments and assumes a plane model which is no longer valid in unstructured environments. A somewhat analogous, seminal work on object-based loop-closure detection in indoor environments using RGB-D cameras is presented by Finman et al. (2015). Although presenting interesting ideas, their work can only handle a small number of well-segmented objects in small-scale environments. Similarly, Bowman et al. (2017) proposed a novel SLAM solution in which semantic information and local geometric features are jointly incorporated into a probabilistic framework. Such semantic-based approaches have significant potential, for example robustness to stark changes in point of view, but require the presence of human-known objects in the scene.

We therefore aim for an approach which does not rely on assumptions about the environment being composed of simplistic geometric primitives such as planes, or a rich library of objects. This allows for a more general, scalable solution.

Learning with 3D point clouds  In recent years, Convolutional Neural Networks (CNNs) have become the state-of-the-art method for generating learning-based descriptors, due to their ability to find complex patterns in data (Krizhevsky et al. (2012)). For 3D point clouds, methods based on CNNs achieve impressive performance in applications such as object detection (Engelcke et al. (2017); Maturana and Scherer (2015); Riegler et al. (2017); Li et al. (2016); Wu et al. (2015); Wohlhart and Lepetit (2015); Qi et al. (2017); Fang et al. (2015)), semantic segmentation (Riegler et al. (2017); Li et al. (2016); Qi et al. (2017); Tchapmi et al. (2017); Graham et al. (2018); Wu et al. (2018)), 3D object generation (Wu et al. (2016)), and LiDAR-based local motion estimation (Dewan et al. (2018); Velas et al. (2018)).



Recently, a handful of works proposing the use of CNNs for localization in 3D point clouds have been published. First, Zeng et al. (2017) propose extracting data-driven 3D keypoint descriptors (3DMatch) which are robust to changes in view-point. Although impressive retrieval performance is demonstrated using an RGB-D sensor in indoor environments, it is not clear whether this method is applicable in real-time in large-scale outdoor environments. A different approach based on 3D CNNs was proposed in Ye et al. (2017) for performing localization in semi-dense maps generated with visual data. Recently, Yin et al. (2017) introduced a semi-handcrafted global descriptor for performing place recognition and rely on an ICP step for estimating the 6-DoF localization transformations. This method will be used as a baseline solution in Section 5.8 when evaluating the precision of our localization transformations. Elbaz et al. (2017) propose describing local subsets of points using a deep neural network autoencoder. The authors state, however, that the implementation has not been optimized for real-time operation and no timings have been provided. In contrast, our work presents a data-driven segment-based localization method that can operate in real-time and that enables map reconstruction and semantic extraction capabilities.

To achieve this reconstruction capability, the architecture of our descriptor was inspired by autoencoders, in which an encoder network compresses the input to a small dimensional representation and a decoder network attempts to decompress the representation back into the original input. The compressed representation can be used as a descriptor for performing 3D object classification (Brock et al. (2016)). Brock et al. (2016) also present successful results using variational autoencoders for reconstructing voxelized 3D data. Different configurations of encoding and decoding networks have also been proposed for achieving localization and for reconstructing and completing 3D shapes and environments (Guizilini and Ramos (2017); Dai et al. (2017); Varley et al. (2017); Ricao Canelhas et al. (2017); Elbaz et al. (2017); Schönberger et al. (2018)).

While autoencoders present the interesting opportunity of simultaneously accomplishing both compression and feature extraction tasks, optimal performance at both is not guaranteed. As will be shown in Section 5.4, these two tasks can have conflicting goals when robustness to changes in point of view is desired. In this work, we combine the advantages of the encoding-decoding architecture of autoencoders with a technique proposed by Parkhi et al. (2015). The authors address the face recognition problem by first training a CNN to classify people in a training set and afterwards using the second-to-last layer as a descriptor for new faces. Other alternative training techniques include, for example, the use of contrastive loss (Bromley et al. (1994)) or triplet loss (Weinberger et al. (2006)), the latter being evaluated in Section 5.4. We use the resulting segment descriptors in the context of SLAM to achieve better performance, as well as significantly compressed maps that can easily be stored, shared, and reconstructed.

3 The SegMap approach

This section presents our SegMap approach to localization and mapping in 3D point clouds. It is composed of five core modules: segment extraction, description, localization, map reconstruction, and semantics extraction. These modules are detailed in this section and together allow single- and multi-robot systems to create a powerful unified representation which can conveniently be transferred.

Segmentation  The stream of point clouds generated by a 3D sensor is first accumulated in a dynamic voxel grid†. Point cloud segments are then extracted in a section of radius R around the robot. In this work we consider two types of incremental segmentation algorithms (Dubé et al. (2018b)). The first one starts by removing points corresponding to the ground plane, which acts as a separator for clustering together the remaining points based on their Euclidean distances. The second algorithm computes local normals and curvatures for each point and uses these to extract flat or planar-like surfaces. Both methods are used to incrementally grow segments by using only newly active voxels as seeds, which are either added to existing segments, form new segments, or merge existing segments together‡. This results in a handful of local segments, which are individually associated to a set of past observations, i.e. S_i = {s_1, s_2, ..., s_n}. Each observation s_j ∈ S_i is a 3D point cloud representing a snapshot of the segment as points are added to it. Note that s_n represents the latest observation of a segment and is considered complete when no further measurements are collected, e.g. when the robot has moved away.

† In our experiments, we consider two techniques for estimating the local motion by registering successive LiDAR scans: one which uses ICP and one based on LOAM (Zhang and Singh (2014)).
‡ For more information on these segmentation algorithms, the reader is encouraged to consult our prior work (Dubé et al. (2018b)).
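To make the clustering idea concrete, the following is a minimal, non-incremental sketch of Euclidean distance-based segmentation on a ground-free point cloud, using a kd-tree flood fill. It is a simplified stand-in for the incremental algorithms of Dubé et al. (2018b); the distance threshold and minimum segment size are illustrative values, not the parameters used in our experiments.

import numpy as np
from scipy.spatial import cKDTree

def euclidean_segments(points, max_dist=0.2, min_size=100):
    # Cluster a ground-free point cloud (N x 3) by flood-filling
    # neighbours closer than max_dist; returns a list of index arrays.
    tree = cKDTree(points)
    unvisited = np.ones(len(points), dtype=bool)
    segments = []
    for seed in range(len(points)):
        if not unvisited[seed]:
            continue
        queue, members = [seed], [seed]
        unvisited[seed] = False
        while queue:
            idx = queue.pop()
            for nb in tree.query_ball_point(points[idx], r=max_dist):
                if unvisited[nb]:
                    unvisited[nb] = False
                    queue.append(nb)
                    members.append(nb)
        if len(members) >= min_size:
            segments.append(np.asarray(members))
    return segments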
Description  Compact features are then extracted from these 3D segment point clouds using the data-driven descriptor presented in Section 4. A global segment map is created online by accumulating the segment centroids and corresponding descriptors. In order for the global map to most accurately represent the latest state of the world, we only keep the descriptor associated with the last and most complete observation.

Localization  In the next step, candidate correspondences are identified between global and local segments using k-Nearest Neighbors (k-NN) in feature space. The approximate k nearest descriptors are retrieved through an efficient query in a kd-tree. Localization is finally performed by verifying the largest subset of candidate correspondences for geometrical consistency on the basis of the segment centroids. Specifically, the centroids of the corresponding local and global segments must have the same geometric configuration up to a small jitter in their position, to compensate for slight variations in segmentation. In the experiments presented in Section 5.9, this is achieved using an incremental recognition strategy which uses caching of correspondences for faster geometric verifications (Dubé et al. (2018b)).

When a large enough geometrically consistent set of correspondences is identified, a 6-DoF transformation between the local and global maps is estimated.



[Figure 3 diagram — descriptor extractor: input 32×32×16 voxel grid → Conv|32|3×3×3 → MaxPool|2×2×2 → Conv|64|3×3×3 → MaxPool|2×2×2 → Conv|64|3×3×3 → FC|512 (with the original scale, 3×1, as additional input) → BN | Dropout 0.5 → FC|64 descriptor (64×1). Reconstruction head: FC|8192 → Deconv|32|3×3×3 → Deconv|32|3×3×3 → Deconv|1|3×3×3 → Sigmoid → 32×32×16. Classification head: BN | Dropout 0.5 → FC|N → Softmax → Class.]

Figure 3. The descriptor extractor is composed of three convolutional and two fully connected layers. The 3D segments are
compressed to a representation of dimension 64 × 1 which can be used for localization, map reconstruction and semantic extraction.
Right of the descriptor we illustrate the classification and reconstruction layers which are used for training. In the diagram
the convolutional (Conv), deconvolutional (Deconv), fully connected (FC) and batch normalization (BN) layers are abbreviated
respectively. As parameters the Conv and Deconv layers have the number of filters and their sizes, FC layers have the number of
nodes, max pool layers have the size of the pooling operation, and dropout layers have the ratio of values to drop. Unless otherwise
specified, Rectified Linear Unit (ReLU) activation functions are used for all layers.

This transformation is fed to an incremental pose-graph SLAM solver which in turn estimates, in real-time, the trajectories of all robots (Dubé et al. (2017b)).
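For illustration, the sketch below shows one way to implement this retrieval-and-verification flow in Python: descriptors are matched with a kd-tree, and the largest set of candidate correspondences whose pairwise centroid distances agree is kept. The pairwise consistency test is a simplified stand-in for the incremental, cached strategy of Dubé et al. (2018b); k and the tolerance are illustrative values.

import numpy as np
from scipy.spatial import cKDTree

def find_candidates(local_desc, global_desc, k=8):
    # k-NN retrieval in descriptor space: for each local segment,
    # return the indices of the k closest global segments.
    tree = cKDTree(global_desc)
    _, idx = tree.query(local_desc, k=k)
    return np.atleast_2d(idx)

def largest_consistent_set(cands, local_c, global_c, tol=0.4):
    # Grow, from every candidate pair, a group of correspondences whose
    # local and global centroid distances agree up to tol, and return
    # the largest group (a simplified pairwise consistency test).
    pairs = [(i, j) for i in range(len(cands)) for j in cands[i]]
    best = []
    for seed in pairs:
        group = [seed]
        for p in pairs:
            if any(p[0] == q[0] for q in group):
                continue  # at most one correspondence per local segment
            if all(abs(np.linalg.norm(local_c[p[0]] - local_c[q[0]])
                       - np.linalg.norm(global_c[p[1]] - global_c[q[1]]))
                   < tol for q in group):
                group.append(p)
        if len(group) > len(best):
            best = group
    return best

A localization would be accepted only if the consistent set is large enough; in the experiments of Section 5.8 we require a minimum of 7 geometrically consistent correspondences.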
Reconstruction  Thanks to our autoencoder-like descriptor extractor architecture, the compressed representation can at any time be used to reconstruct an approximate map, as illustrated in Figure 12. As the SegMap descriptor can conveniently be transmitted over wireless networks with limited bandwidth, any agent in the network can reconstruct and leverage this 3D information. More details on these reconstruction capabilities are given in Section 4.3.

Semantics  The SegMap descriptor also contains semantically relevant information without the training process having enforced this property on the descriptor. This can, for example, be used to discern between static and dynamic objects in the environment to improve the robustness of the localization task. In this work we present an experiment where the network is able to distinguish between three different semantic classes: vehicles, buildings, and others (see Section 4.4).

4 The SegMap Descriptor

In this section we present our main contribution: a data-driven descriptor for 3D segment point clouds which allows for localization, map reconstruction and semantic extraction. The descriptor extractor's architecture and the processing steps for inputting the point clouds to the network are introduced. We then describe our technique for training this descriptor to accomplish tasks of both segment retrieval and map reconstruction. We finally show how the descriptor can further be used to extract semantic information from the point cloud.

4.1 Descriptor extractor architecture

The architecture of the descriptor extractor is presented in Figure 3. Its input is a 3D binary voxel grid of fixed dimension 32 × 32 × 16, which was determined empirically to offer a good balance between descriptiveness and the size of the network. The description part of the CNN is composed of three 3D convolutional layers with max pool layers placed in between and two fully connected layers. Unless otherwise specified, ReLU activation functions are used for all layers. The original scale of the input segment is passed as an additional parameter to the first fully connected layer to increase robustness to voxelization at different aspect ratios. The descriptor is obtained by taking the activations of the extractor's last fully connected layer. This architecture was selected by grid search over various configurations and parameters.
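A sketch of this extractor in TensorFlow/Keras, the library used in our implementation, is given below. The layer sizes follow Figure 3, but the padding, strides, and the exact point where the scale parameter is injected are assumptions of this sketch rather than details of the released code.

import tensorflow as tf
from tensorflow.keras import layers

def build_segmap_network(n_classes):
    voxels = tf.keras.Input(shape=(32, 32, 16, 1), name="voxel_grid")
    scale = tf.keras.Input(shape=(3,), name="original_scale")

    x = layers.Conv3D(32, 3, padding="same", activation="relu")(voxels)
    x = layers.MaxPool3D(2)(x)
    x = layers.Conv3D(64, 3, padding="same", activation="relu")(x)
    x = layers.MaxPool3D(2)(x)
    x = layers.Conv3D(64, 3, padding="same", activation="relu")(x)
    x = layers.Flatten()(x)
    x = layers.Concatenate()([x, scale])  # inject original segment scale
    x = layers.Dense(512, activation="relu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.5)(x)
    descriptor = layers.Dense(64, activation="relu", name="descriptor")(x)

    # Classification head (used for training only, removed at deployment).
    c = layers.BatchNormalization()(descriptor)
    c = layers.Dropout(0.5)(c)
    logits = layers.Dense(n_classes, name="class_logits")(c)

    # Reconstruction head: decode the descriptor back to an occupancy grid.
    d = layers.Dense(8192, activation="relu")(descriptor)
    d = layers.Reshape((8, 8, 4, 32))(d)
    d = layers.Conv3DTranspose(32, 3, strides=2, padding="same",
                               activation="relu")(d)
    d = layers.Conv3DTranspose(32, 3, strides=2, padding="same",
                               activation="relu")(d)
    recon = layers.Conv3DTranspose(1, 3, padding="same",
                                   activation="sigmoid", name="recon")(d)
    return tf.keras.Model([voxels, scale], [descriptor, logits, recon])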
4.2 Segment alignment and scaling

A pre-processing stage is required in order to input the 3D segment point clouds for description. First, an alignment step is applied such that segments extracted from the same objects are similarly presented to the descriptor network. This is performed by applying a 2D Principal Components Analysis (PCA) of all points located within a segment. The segment is then rotated so that the x-axis of its frame of reference, from the robot's perspective, aligns with the eigenvector corresponding to the largest eigenvalue. We choose to solve the ambiguity in direction by rotating the segment so that the lower half section along the y-axis of its frame of reference contains the highest number of points. From the multiple alignment strategies we evaluated, the presented strategy worked best.

The network's input voxel grid is applied to the segment so that its center corresponds to the centroid of the aligned segment. By default the voxels have minimum side lengths of 0.1 m. These can individually be increased to exactly fit segments having one or more dimensions larger than the grid. Whereas maintaining the aspect ratio while scaling can potentially offer better retrieval performance, this individual scaling with a minimum side length better avoids large errors caused by aliasing. We also found that this scaling method offers the best reconstruction performance, with only a minimal impact on the retrieval performance when the original scale of the segments is passed as a parameter to the network.
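The following minimal sketch illustrates the alignment and scaling procedure under assumed conventions: a 2D PCA on the horizontal coordinates, a rotation bringing the dominant eigenvector onto the x-axis, a flip that places the majority of points in the lower y half, and voxelization with per-axis voxel sides of at least 0.1 m.

import numpy as np

GRID = np.array([32, 32, 16])
MIN_VOXEL = 0.1  # metres

def align_and_voxelize(points):
    # points: (N, 3) segment point cloud -> (32, 32, 16) binary grid
    # plus the per-axis scale that is passed to the network.
    pts = points - points.mean(axis=0)
    # 2D PCA on x-y: rotate the largest eigenvector onto the x-axis.
    eigval, eigvec = np.linalg.eigh(np.cov(pts[:, :2].T))
    major = eigvec[:, np.argmax(eigval)]
    angle = -np.arctan2(major[1], major[0])
    c, s = np.cos(angle), np.sin(angle)
    pts[:, :2] = pts[:, :2] @ np.array([[c, -s], [s, c]]).T
    # Resolve the 180-degree ambiguity: most points in the lower y half.
    if np.sum(pts[:, 1] < 0) < np.sum(pts[:, 1] > 0):
        pts[:, :2] *= -1.0  # rotate by 180 degrees about the z-axis
    # Per-axis voxel size: at least MIN_VOXEL, stretched to fit the grid.
    extent = pts.max(axis=0) - pts.min(axis=0)
    voxel = np.maximum(extent / GRID, MIN_VOXEL)
    idx = np.clip(((pts - pts.min(axis=0)) / voxel).astype(int), 0, GRID - 1)
    grid = np.zeros(GRID, dtype=np.float32)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return grid, voxel * GRID  # grid and the scale of the fitted box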



4.3 Training the SegMap descriptor

In order to achieve both a high retrieval performance and reconstruction capabilities, we propose a customized learning technique. The two desired objectives are imposed on the network by the softmax cross entropy loss L_c for retrieval and the reconstruction loss L_r. We propose to simultaneously apply both losses to the descriptor and to this end define a combined loss function L which merges the contributions of both objectives:

    L = L_c + α L_r    (1)

where the parameter α weighs the relative importance of the two losses. The value α = 200 was empirically found to not significantly impact the performance of the combined network, as opposed to training separately with either of the losses. Weights are initialized based on Xavier's initialization method (Glorot and Bengio (2010)) and trained using the Adaptive Moment Estimation (ADAM) optimizer (P. and L. (2015)) with a learning rate of 10^-4. In comparison to Stochastic Gradient Descent (SGD), ADAM maintains separate learning rates for each network parameter, which facilitates training the network with two separate objectives simultaneously. Regularization is achieved using dropout (Srivastava et al. (2014)) and batch normalization (Ioffe and Szegedy (2015)).

Classification loss L_c  For training the descriptor to achieve better retrieval performance, we use a learning technique similar to the N-ways classification problem proposed by Parkhi et al. (2015). Specifically, we organize the training data into N classes where each class contains all observations of a segment or of multiple segments that belong to the same object or environment part. Note that these classes are solely used for training the descriptor and are not related to the semantics presented in Section 4.4. As seen in Figure 3, we then append a classification layer to the descriptor and teach the network to associate a score to each of the N predictors for each segment sample. These scores are compared to the true class labels using the softmax cross entropy loss:

    L_c = - Σ_{i=1}^{N} y_i log( e^{l_i} / Σ_{k=1}^{N} e^{l_k} )    (2)

where y is the one-hot encoded vector of the true class labels and l is the layer output.

Given a large number of classes and a small descriptor dimensionality, the network is forced to learn descriptors that better generalize and prevent overfitting to specific segment samples. Note that when deploying the system in a new environment the classification layer is removed, as its output is no longer relevant. The activations of the previous fully connected layer are then used as a descriptor for segment retrieval through k-NN.

Reconstruction loss L_r  As depicted in Figure 3, map reconstruction is achieved by appending a decoder network and training it simultaneously with the descriptor extractor and classification layer. This decoder is composed of one fully connected and three deconvolutional layers with a final sigmoid output. Note that no weights are shared between the descriptor and the decoder networks. Furthermore, only the descriptor extraction needs to be run in real-time on the robotic platforms, whereas the decoding part can be executed any time a reconstruction is desired.

As proposed by Brock et al. (2016), we use a specialized form of the binary cross entropy loss, which we denote by L_r:

    L_r = - Σ_{x,y,z} [ γ t_xyz log(o_xyz) + (1 - γ)(1 - t_xyz) log(1 - o_xyz) ]    (3)

where t and o respectively represent the target segment and the network's output, and γ is a hyperparameter which weighs the relative importance of false positives and false negatives. This parameter addresses the fact that only a minority of voxels are activated in the voxel grid. In our experiments, the voxel grids used for training were on average only 3% occupied and we found γ = 0.9 to yield good results.
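For reference, the two losses and their combination can be written in a few lines of TensorFlow; γ and α follow the text, while the function names, tensor shapes and the reduction over the batch are our own assumptions.

import tensorflow as tf

ALPHA, GAMMA = 200.0, 0.9

def classification_loss(labels, logits):
    # Softmax cross entropy over the N training classes, Eq. (2);
    # labels are one-hot encoded with the same shape as logits.
    return tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits))

def reconstruction_loss(target, output, eps=1e-7):
    # Class-weighted binary cross entropy over the voxel grid, Eq. (3);
    # target and output are (batch, 32, 32, 16) occupancy grids.
    output = tf.clip_by_value(output, eps, 1.0 - eps)
    per_voxel = -(GAMMA * target * tf.math.log(output)
                  + (1.0 - GAMMA) * (1.0 - target) * tf.math.log(1.0 - output))
    return tf.reduce_mean(tf.reduce_sum(per_voxel, axis=[1, 2, 3]))

def combined_loss(labels, logits, target, output):
    # Eq. (1): L = L_c + alpha * L_r.
    return (classification_loss(labels, logits)
            + ALPHA * reconstruction_loss(target, output))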
[Figure 4 diagram: SegMap descriptor (64×1) → Dropout 0.5 → FC|64 → FC|3 → {Vehicle, Building, Other}]
Figure 4. A simple fully connected network that can be appended to the SegMap descriptor (depicted in Figure 3) in order to extract semantic information. In our experiments, we train this network to distinguish between vehicles, buildings, and other objects.

4.4 Knowledge transfer for semantic extraction

As can be observed from Figure 1, segments extracted by the SegMap approach for localization and map reconstruction often represent objects or parts of objects. It is therefore possible to assign semantic labels to these segments and use this information to improve the performance of the localization process. As depicted in Figure 4, we transfer the knowledge embedded in our compact descriptor by training a semantic extraction network on top of it. This last network is trained with labeled data using the softmax cross entropy loss while freezing the weights of the descriptor network. In this work, we choose to train this network to distinguish between three different semantic classes: vehicles, buildings, and others. Section 5.9 shows that this information can be used to increase the robustness of the localization algorithm to changes in the environment and to yield smaller map sizes. This is achieved by rejecting segments associated with potentially dynamic objects, such as vehicles, from the list of segment candidates.
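A minimal Keras sketch of this semantic head, assuming the 64-dimensional descriptor as input and integer class labels, could look as follows; only the head is trained, with the descriptor weights kept frozen.

import tensorflow as tf
from tensorflow.keras import layers

# The descriptor network stays frozen; only this small head is trained.
semantic_head = tf.keras.Sequential([
    layers.InputLayer(input_shape=(64,)),    # SegMap descriptor
    layers.Dropout(0.5),
    layers.Dense(64, activation="relu"),
    layers.Dense(3, activation="softmax"),   # vehicle, building, other
])
semantic_head.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])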
4.5 SegMini

Finally, we propose a lightweight version of the SegMap descriptor which is specifically tailored for resource-limited platforms. SegMini has the same architecture as SegMap (see Figure 3), with the exception that the number of filters in the convolutional layers and the size of the dense layers are halved. Without compromising much on the descriptor retrieval performance, this model leads to a computational speedup of 2x on GPU and 6x on CPU (Section 5.3).



Figure 5. An illustration of the SegMap reconstruction capabilities. The segments are extracted from sequence 00 of the KITTI dataset and represent, from top to bottom respectively, vehicles, buildings, and other objects. For each segment pair, the reconstruction is shown to the right of the original. The network manages to accurately reconstruct the segments despite the high compression to only 64 values. Note that the voxelization effect is more visible on buildings, as larger segments necessitate larger voxels to keep the input dimension fixed.

5 EXPERIMENTS

This section presents the experimental validation of our approach. We first present a procedure for generating training data and detail the performance of the SegMap descriptor for localization, reconstruction and semantics extraction. We finally evaluate the complete SegMap solution in multiple real-world experiments.

5.1 Experiment setup and implementation

All experiments were performed on a system equipped with an Intel i7-6700K processor and an NVIDIA GeForce GTX 980 Ti GPU. The CNN models were developed and executed in real-time using the TensorFlow library. The libnabo library is used for descriptor retrieval with fast k-NN search in low dimensional space (Elseberg et al. (2012)). The incremental optimization back-end is based on the iSAM2 implementation from Kaess et al. (2012).

5.2 Training data

The SegMap descriptor is trained using real-world data from the KITTI odometry dataset (Geiger et al. (2012)). Sequences 05 and 06 are used for generating training and testing data, whereas sequence 00 is solely used for validation of the descriptor performance. In addition, end-to-end experiments are done using sequences 00 and 08, as they feature long tracks with multiple overlapping areas in the trajectories. For each sequence, segments are extracted using an incremental Euclidean distance-based region growing technique (Dubé et al. (2018b)). This algorithm extracts point clouds representing parts of objects or buildings which are separated after removing the ground plane (see Figure 5). The training data is filtered by removing segments with too few observations, or training classes (as described in Section 4.3) with too few samples. In this manner, 3300, 1750, 810 and 2400 segments are respectively generated from sequences 00, 05, 06 and 08, with an average of 12 observations per segment over the whole dataset.

5.2.1 Data augmentation  To further increase robustness by reducing sensitivity to rotation and view-point changes in the descriptor extraction process, the dataset is augmented through various transformations at the beginning of each training epoch. Each segment is rotated at different angles to the alignment described in Section 4.2 to simulate different view-points. In order to simulate the effect of occlusion, for each segment we remove all points which fall on one side of a randomly generated slicing plane that does not remove more than 50% of the points. Finally, random noise is simulated by randomly removing up to 10% of the points in the segment. Note that these two data augmentation steps are performed prior to voxelization.
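As a sketch of this augmentation, assuming segments stored as (N, 3) arrays, the three steps could be implemented as follows; the rotation around the vertical axis, the 50% slicing limit and the 10% removal ratio follow the text, while the plane sampling scheme is our own choice.

import numpy as np

rng = np.random.default_rng()

def augment_segment(points):
    # Random rotation about the z-axis to simulate different view-points.
    a = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(a), np.sin(a)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    pts = (points - points.mean(axis=0)) @ rot.T

    # Occlusion: drop points on one side of a random plane through the
    # centroid, rejecting planes that would remove more than 50%.
    for _ in range(20):
        normal = rng.normal(size=3)
        normal /= np.linalg.norm(normal)
        keep = pts @ normal >= 0.0
        if keep.sum() >= 0.5 * len(pts):
            pts = pts[keep]
            break

    # Noise: randomly remove up to 10% of the remaining points.
    n_drop = rng.integers(0, int(0.1 * len(pts)) + 1)
    if n_drop:
        drop = rng.choice(len(pts), n_drop, replace=False)
        pts = np.delete(pts, drop, axis=0)
    return pts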
5.2.2 Ground-truth generation  In the following step, we use GPS readings in order to identify ground truth correspondences between segments extracted in areas where the vehicle performed multiple visits. Only segment pairs with a maximum distance between their centroids of 3.0 m are considered. We compute the 3D convex hull of each segment observation s_1 and s_2 and create a correspondence when the following condition, inspired by the Jaccard index, holds:

    Volume(Conv(s_1) ∩ Conv(s_2)) / Volume(Conv(s_1) ∪ Conv(s_2)) ≥ p    (4)

In our experiments we found p = 0.3 to generate a sufficient number of correspondences while preventing false labelling. The procedure is performed on sequences 00, 05, and 06, generating 150, 260, and 320 ground truth correspondences respectively. We use two-thirds of the correspondences for augmenting the training data and one-third for creating validation samples. Finally, the ground-truth correspondences extracted from sequence 00 are used in Section 5.4 for evaluating the retrieval performance.
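Evaluating condition (4) exactly requires intersecting two convex polyhedra. Purely for illustration, the sketch below approximates the intersection and union volumes by Monte Carlo sampling inside the joint bounding box; this is an assumed implementation, not the one used to generate our correspondences.

import numpy as np
from scipy.spatial import ConvexHull, Delaunay

def hull_overlap_ok(s1, s2, p=0.3, n_samples=20000, seed=0):
    # Approximate Eq. (4) for two segments s1, s2 given as (N, 3) arrays:
    # volume(C1 and C2) / volume(C1 or C2) >= p, with C1, C2 convex hulls.
    h1 = Delaunay(s1[ConvexHull(s1).vertices])
    h2 = Delaunay(s2[ConvexHull(s2).vertices])
    lo = np.minimum(s1.min(axis=0), s2.min(axis=0))
    hi = np.maximum(s1.max(axis=0), s2.max(axis=0))
    samples = np.random.default_rng(seed).uniform(lo, hi, size=(n_samples, 3))
    in1 = h1.find_simplex(samples) >= 0  # point-in-hull tests
    in2 = h2.find_simplex(samples) >= 0
    union = np.count_nonzero(in1 | in2)
    if union == 0:
        return False
    return np.count_nonzero(in1 & in2) / union >= p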
5.3 Training the models

The descriptor extractor and the decoding part of the reconstruction network are trained using all segments extracted from drives 05 and 06. Training lasts three to four hours on the GPU and produces the classification and scaled reconstruction losses depicted in Figure 6. The total loss of the model is the sum of the two losses as described in Section 4.3. We note that for classification the validation loss follows the training loss before converging, with corresponding accuracies of 41% and 43%, respectively.


In other words, 41% of the validation samples were correctly assigned to one of the N = 2500 classes. This accuracy is expected given the large quantity of classes and the challenging task of discerning between multiple training samples with similar semantic meaning but few distinctive features, e.g. flat walls. Note that we achieve a very similar classification loss L_c when training with and without the L_r component of the combined loss L. On a GPU the SegMap descriptor takes on average 0.8 ms to compute, while the SegMini descriptor takes 0.3 ms. On the CPU the performance gain is more significant, as it takes 245 ms for a SegMap descriptor as opposed to only 41 ms for SegMini, which is a 6x improvement in efficiency.

[Figure 6 plots: classification loss L_c (left) and reconstruction loss L_r (right) versus training epoch (0–256), train and test curves.]
Figure 6. The classification loss L_c (left) and the reconstruction loss L_r (right) components of the total loss L, when training the descriptor extractor along with the reconstruction and classification networks. The depicted reconstruction loss has already been scaled by α.

[Figure 7 plot: ROC curves, True Positive Rate vs. False Positive Rate, for SegMap (area = 0.84), Classification (area = 0.85), Triplet (area = 0.91), SegMini (area = 0.87), Autoencoder (area = 0.63), and Eigen (area = 0.67).]
Figure 7. ROC curves for the descriptors considered in this work. This evaluation is performed using ground-truth correspondences extracted from sequence 00 of the KITTI odometry dataset (Geiger et al. (2012)). Note that the ROC is not an optimal measure of the quality of the retrieval performance, since it only considers a single threshold for all segment pairs and does not look at the relative ordering of matches on a per query basis.

5.4 Descriptor retrieval performance

We evaluate the retrieval performance of the SegMap descriptor against state-of-the-art methods as well as other networks trained with different secondary goals. First, our descriptor is compared with eigenvalue-based point cloud features (Weinmann et al. (2014)). We also evaluate the effect of training only for the classification task (Classification) or of training only for the reconstruction one (Autoencoder). Additionally, we compare classification-based learning with a triplet loss solution (Schroff et al. (2015)), where during training we enforce segments from the same sequence to have a minimal Euclidean distance. We use a per-batch hard mining strategy and the best performing variant of the triplet loss as proposed by Hermans et al. (2017). We finally evaluate the SegMini model introduced in Section 4.5.
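For reference, the batch-hard variant of the triplet loss (Hermans et al. (2017)) used in this comparison can be sketched as follows; the margin value and the masking details are assumptions of this sketch, and each anchor is assumed to have at least one positive in the batch.

import tensorflow as tf

def batch_hard_triplet_loss(labels, descriptors, margin=0.5):
    # For each anchor, take the hardest positive (same class, furthest)
    # and the hardest negative (other class, closest) within the batch.
    d = tf.norm(tf.expand_dims(descriptors, 1)
                - tf.expand_dims(descriptors, 0), axis=-1)  # pairwise dists
    same = tf.equal(tf.expand_dims(labels, 1), tf.expand_dims(labels, 0))
    pos_mask = tf.cast(same, tf.float32) - tf.eye(tf.shape(labels)[0])
    neg_mask = tf.cast(tf.logical_not(same), tf.float32)

    hardest_pos = tf.reduce_max(d * pos_mask, axis=1)
    # Push non-negatives to a large value before taking the minimum.
    big = tf.reduce_max(d) + 1.0
    hardest_neg = tf.reduce_min(d + big * (1.0 - neg_mask), axis=1)
    return tf.reduce_mean(tf.maximum(hardest_pos - hardest_neg + margin, 0.0))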
The retrieval performance of the aforementioned descriptors is depicted in Figure 7. The Receiver Operating Characteristic (ROC) curves are obtained by generating 45M labeled pairs of segment descriptors from sequence 00 of the KITTI odometry dataset (Geiger et al. (2012)). Using ground-truth correspondences, a positive sample is created for each possible segment observation pair. For each positive sample, a thousand negative samples are generated by randomly sampling segment pairs whose centroids are further than 20 m apart. The positive-to-negative sample ratio is representative of our localization problem, given that a map created from KITTI sequence 00 contains around a thousand segments. The ROC curves are finally obtained by varying the threshold applied on the L2 norm between the two segment descriptors. We note that training with triplet loss offers the best ROC performance on these datasets, as it imposes the most consistent separation margin across all segments.

The ROC is not the best evaluation metric for this retrieval task, because it evaluates the quality of classification for a single threshold across all segments. As introduced in Section 3, correspondences are made between segments from the local and global maps by using k-NN retrieval in feature space. The varying parameter is the number of neighbours that is retrieved and not a threshold on the feature distances, which only matter in a relative fashion on a per query basis. In order to avoid false localizations, the aim is to reduce the number k of neighbours that need to be considered. Therefore, as a segment grows with time, it is critical that its descriptor converges as quickly as possible towards the descriptor of the corresponding segment in the target map, which in our case is extracted from the last and most complete observation (see Section 3). This behaviour is evaluated in Figure 8a, which relates the number of neighbours which need to be considered to find the correct association, as a function of segment completeness. We note that the SegMap descriptor offers competitive retrieval performance at every stage of the growing process. In practice this is important since it allows closing challenging loops such as the one presented in Figure 1.

Interestingly, the autoencoder has the worst performance at the early growing stages whereas good performance is observed at later stages. This is in accordance with the capacity of autoencoders to precisely describe the geometry of a segment, without explicitly aiming at gaining a robust representation in the presence of occlusions or changes in view-point.



Although the triplet loss training method offers the best ROC performance, Figure 8a suggests that training with the secondary goal of classification yields considerably better results at the later stages of growing. The poor performance of the triplet loss method, especially for very similar segments, could be caused by the hard mining amplifying the noise in the dataset. After a certain point the ordering of matches becomes irrelevant, because the goal is to minimize the number of retrieved neighbours, and retrieving too many is computationally unfeasible for later stages of the process. Therefore, although the purely classification-based model performs slightly better for very early observations of a segment, this gain in performance does not matter. The proposed SegMap descriptor achieves the best performance for very complete segments, where matches are most likely to happen, and maintains a comparable performance across very partial observations. A more detailed plot of the retrieval performance of SegMap and SegMini is presented in Figure 8b, which also shows the variance in the retrieval accuracy.

[Figure 8 plots: median k-neighbours needed (log scale) vs. segment completeness [%], for (a) all methods and (b) SegMap and SegMini in more detail.]
Figure 8. This figure presents how quickly descriptors extracted from incrementally grown segments contain relevant information that can be used for localization. The x-axis represents the completeness of a segment until all its measurements have been accumulated (here termed complete, see Section 3). In (a) the log-scaled y-axis represents the median of how many neighbours in the target map need to be considered in order to retrieve the correct target segment (the lower the better). Similarly, (b) presents the same results in more detail for the proposed models. The SegMap descriptor offers, over the majority of the growing process, one order of magnitude better retrieval performance than the hand-crafted baseline descriptor.
Table 1. Average ratio of corresponding points within one voxel distance between original and reconstructed segments. Statistics for SegMap and the autoencoder baseline using different descriptor sizes.

                        Descriptor size
                     16      32      64      128
    Autoencoder     0.87    0.91    0.93    0.94
    SegMap          0.86    0.89    0.91    0.92

5.5 Reconstruction performance

In addition to offering high retrieval performance, the SegMap descriptor allows us to reconstruct 3D maps using the decoding CNN described in Section 4.3. Some examples of the resulting reconstructions are illustrated in Figure 5, for various objects captured during sequence 00 of the KITTI odometry dataset. Experiments done at a larger scale are presented in Figure 14, where buildings of a powerplant and a foundry are reconstructed by fusing data from multiple sensors.

Since most segments only sparsely model real-world surfaces, they occupy on average only 3% of the voxel grid. To obtain a visually relevant comparison metric, we calculate for both the original segment and its reconstruction the ratio of points having a corresponding point in the other segment, within a distance of one voxel. The tolerance of one voxel means that the shape of the original segment must be preserved, while not focusing on reconstructing each individual point. Results calculated for different descriptor sizes are presented in Table 1, in comparison with the purely reconstruction-focused baseline. The SegMap descriptor with a size of 64 has on average 91% correspondences between the points in the original and reconstructed segments, and is only slightly outperformed by the autoencoder baseline. Contrastingly, the significantly higher retrieval performance of the SegMap descriptor makes it a clear all-rounder choice for achieving both localization and map reconstruction.
retrieval performance than the hand-crafted baseline descriptor. for achieving both localization and map reconstruction.
Overall, the reconstructions are well recognizable despite
the high compression ratio. In Figure 12, we note that
in view-point. Although the triplet loss training method the quantization error resulting from the voxelization step
offers the best ROC performance, Figure 8a suggests that mostly affects larger segments that have been downscaled
training with the secondry goal of classification yields to fit into the voxel grid. To mitigate this problem, one can
considerably better results at the later stages of growing. adopt a natural approach to representing this information
The poor performance of the triplet loss method especially in 3D space, which is to calculate the isosurface for a
for very similar segments could be caused by the hard given probability threshold. This can be computed using the
mining amplifying the noise in the dataset. After a certain “marching cubes” algorithm, as presented by Lorensen and
point the ordering of matches becomes irrelevant, because Cline (1987). The result is a triangle-mesh surface, which can
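A minimal sketch of this meshing step using scikit-image, assuming the decoder output is a 32 × 32 × 16 occupancy-probability grid and an illustrative threshold of 0.5:

import numpy as np
from skimage import measure

def reconstruct_mesh(probability_grid, voxel_size, level=0.5):
    # Extract the isosurface of the decoded occupancy grid as a
    # triangle mesh (vertices in metres, faces as vertex indices).
    verts, faces, normals, _ = measure.marching_cubes(
        probability_grid, level=level, spacing=tuple(voxel_size))
    return verts, faces, normals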



Figure 9. A visual comparison between (left) the original point cloud, (middle) the reconstruction point cloud, and (right) the reconstruction mesh, for 3 segments.

[Figure 10 plot: correctly localized queries [%] vs. distance threshold [m], for SegMap and LocNet.]
Figure 10. Cumulative distribution of position errors on the KITTI 00 odometry sequence, comparing SegMap with the state-of-the-art data-driven LocNet approach presented in Yin et al. (2017, 2018). Our proposed method retrieves a full 6-DoF pose while LocNet uses global scan descriptors to obtain the nearest pose of the target map. SegMap retrieves poses for a larger number of scans and the returned estimates are more accurate. The results saturate at about 52% as not all query positions overlap with the target map, with only 65% of them being within a radius of 50 m from the map.

5.6 Semantic extraction performance

For training the semantic extractor network (Figure 4), we manually labeled the last observation of all 1750 segments extracted from KITTI sequence 05. The labels are then propagated to each observation of a segment, for a total of 20k labeled segment observations. We use 70% of the samples for training the network and 30% for validation. Given the low complexity of the semantic extraction network and the small amount of labeled samples, training takes only a few minutes. We achieve an accuracy of 89% and 85% on the training and validation data respectively. Note that our goal is not to improve over other semantic extraction methods (Li et al. (2016); Qi et al. (2017)), but rather to illustrate that our compressed representation can additionally be used for discarding dynamic elements of the environment and for reducing the map size (Section 5.9.1).

5.7 6-DoF pose retrieval performance

In this section, we demonstrate how the advantageous properties of SegMap, particularly the descriptor retrieval performance, translate to state-of-the-art global localization results. We therefore compare our approach to a global localization method, LocNet (Yin et al. (2017, 2018)). It uses rotation-invariant, data-driven descriptors that yield reliable matching of 3D LiDAR scans. LocNet retrieves a nearest-neighbor database scan and returns its pose; its output is thus limited to the poses already present in the target map. Therefore, it works reliably in environments with well-defined trajectories (e.g. roads), but fails to return a precise location within large traversable areas such as squares or hallways. In contrast, SegMap uses segment correspondences to estimate an accurate 6-DoF pose that includes orientation, which cannot be retrieved directly using the rotation-invariant LocNet descriptors.

Figure 10 presents the evaluation of both methods on the KITTI 00 odometry sequence (4541 scans). We use the first 3000 LiDAR scans and their ground-truth poses to create a map, against which we then localize using the last 1350 scans. SegMap demonstrates a superior performance, both by successfully localizing about 6% more scans and by returning more accurate localized poses. Note that only 65% of the query positions were taken within a distance of 50 m of the target map, therefore limiting the maximum possible saturation. We believe that robust matching of segments, a principle of our method, helps to establish reliable correspondences with the target map, particularly for queries further away from the mapped areas. This state-of-the-art localization performance is further complemented by a compact map representation, with reconstruction and semantic labeling capabilities.

5.8 A complete mapping and localization system

So far, we have only evaluated SegMap as a stand-alone global localization system, demonstrating the performance of segment descriptors and the 6-DoF pose retrieval. Such global localization systems, however, are commonly used in conjunction with odometry and mapping algorithms. To prove the qualities of SegMap in such a scenario, we have combined it with a state-of-the-art LiDAR odometry and mapping system, LOAM (Zhang and Singh (2014)). Our implementation is based on a publicly available version of LOAM and achieves similar odometry performance results on KITTI as the ones reported by other works, such as Velas et al. (2018). We use a loosely coupled approach, where LOAM is used to undistort the scans and provide an odometry estimate between frames, in real-time. The scans from LOAM are used to build a local map from which segments are extracted and attached to a pose-graph, together with the odometry measurements. Loop closures can then be added in real-time as constraints in the graph, to correct the drifting odometry.



[Figure 11 plots: ground truth, LOAM, and LOAM + SegMap trajectories (X [m] vs. Y [m]), together with translation error [%] and rotation error [deg/m] over path length [m], for (a) KITTI odometry sequence 00 and (b) KITTI odometry sequence 08.]
Figure 11. The trajectories for KITTI odometry sequences a) 00 and b) 08 for LOAM and the combination of LOAM and SegMap. In addition we show translation and rotation errors for the two approaches, using the standard KITTI evaluation method (Geiger et al. (2012)).

In all experiments, we use a local map with a radius of 50 m around the robot. When performing segment retrieval, we consider 64 neighbours and require a minimum of seven correspondences, which are altogether geometrically consistent, to output a localization. These parameters were chosen empirically, using the information presented in Figures 7 and 8 as a reference.
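The following sketch illustrates the retrieval principle under simplified assumptions (synthetic descriptors and centroids, one candidate per query, and a pairwise distance-consistency test standing in for the full geometric verification); the values 64 and 7 quoted above are the parameters actually used, while all other numbers here are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    map_desc = rng.normal(size=(1000, 64))          # 64-D map descriptors
    map_cent = rng.uniform(0, 100, size=(1000, 3))  # segment centroids [m]
    query_desc = map_desc[:10] + 0.01 * rng.normal(size=(10, 64))
    query_cent = map_cent[:10] + np.array([5.0, 2.0, 0.0])  # rigid shift

    def knn(query, database, k):
        d = np.linalg.norm(database[None, :, :] - query[:, None, :], axis=2)
        return np.argsort(d, axis=1)[:, :k]

    # Candidate correspondences: keep only the best neighbour here.
    cand = [(q, knn(query_desc[q:q+1], map_desc, 1)[0, 0]) for q in range(10)]

    # Geometric consistency: centroid-to-centroid distances are preserved
    # by a rigid transform, so compare them pairwise between query and map.
    def consistent(c, tol=0.5):
        good = []
        for a in range(len(c)):
            votes = sum(
                abs(np.linalg.norm(query_cent[c[a][0]] - query_cent[c[b][0]])
                    - np.linalg.norm(map_cent[c[a][1]] - map_cent[c[b][1]])) < tol
                for b in range(len(c)) if b != a)
            if votes >= 6:  # pairwise-consistent with at least 6 others
                good.append(c[a])
        return good

    matches = consistent(cand)
    print(len(matches), "consistent correspondences")  # >= 7 -> localization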
descriptor extraction. The descriptors are transmitted to a
Our evaluations on KITTI sequences 00 and 08
separate thread which localizes the robots through descriptor
(Figure 11) demonstrate that global localization results from
retrieval and geometric verification, and runs the pose-
SegMap help correct for the drift of the odometry estimates.
graph optimization. In all experiments, sufficient global
The trajectories outputted by the system combining SegMap
associations need to be made, in real-time, for co-registration
and LOAM, follow more precisely the ground-truth poses
of the trajectories and merging of the maps. Moreover
provided by the benchmark, compared to the open-loop
in a centralized setup it might be crucial to limit the
solution. We also show how global localizations reduce
transmitted data over a wireless network with potentially
both translational and rotational errors. Particularly over
limited bandwidth.
longer paths SegMap is able to reduce the drift in the
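For reference, a simplified, translation-only reading of this relative-error metric can be sketched as follows; the official KITTI benchmark (Geiger et al. (2012)) evaluates full SE(3) poses and rotation as well, so this is only meant to convey how errors are normalized by path length:

    import numpy as np

    def relative_translation_error(gt_xy, est_xy, seg_len=200.0):
        # Distance travelled along the ground-truth trajectory.
        steps = np.linalg.norm(np.diff(gt_xy, axis=0), axis=1)
        cum = np.concatenate([[0.0], np.cumsum(steps)])
        errors = []
        for i in range(len(gt_xy)):
            # First index whose travelled distance exceeds i's by seg_len.
            j = np.searchsorted(cum, cum[i] + seg_len)
            if j >= len(gt_xy):
                break
            gt_delta = gt_xy[j] - gt_xy[i]
            est_delta = est_xy[j] - est_xy[i]
            errors.append(np.linalg.norm(est_delta - gt_delta) / seg_len)
        return 100.0 * np.mean(errors)  # percent, as in Figure 11

    # Toy check: a straight trajectory with 1% simulated drift along y.
    gt = np.stack([np.arange(0, 1000.0), np.zeros(1000)], axis=1)
    est = gt + np.stack([np.zeros(1000), 0.01 * np.arange(0, 1000.0)], axis=1)
    print(relative_translation_error(gt, est))  # ~1.0 [%]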
5.9 Multi-robot experiments

We evaluate the SegMap approach on three large-scale multi-robot experiments: one in an urban-driving environment and two in search and rescue scenarios. In both the indoor and outdoor scenarios we use the same model, which was trained on the KITTI sequences 05 and 06 as described in Section 5.3.

The experiments are run on a single machine, with a multi-thread approach simulating a centralized system. One thread per robot accumulates the 3D measurements, extracts segments, and performs the descriptor extraction. The descriptors are transmitted to a separate thread which localizes the robots through descriptor retrieval and geometric verification, and runs the pose-graph optimization. In all experiments, sufficient global associations need to be made, in real-time, for co-registration of the trajectories and merging of the maps. Moreover, in a centralized setup it might be crucial to limit the data transmitted over a wireless network with potentially limited bandwidth.
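A toy version of this centralized architecture, with one producer thread per robot and a single consumer thread standing in for the localization back-end, could look as follows; message contents and sizes are placeholders, not the actual SegMap structures:

    import queue
    import threading

    desc_queue = queue.Queue()

    def robot_worker(robot_id, n_clouds=3):
        # Stand-in for: accumulate scans, segment, describe each cloud.
        for cloud_idx in range(n_clouds):
            descriptors = [[0.0] * 64] * 40     # ~40 segments, 64 floats
            desc_queue.put((robot_id, cloud_idx, descriptors))
        desc_queue.put((robot_id, None, None))  # end-of-stream marker

    def localization_thread(n_robots):
        finished = 0
        while finished < n_robots:
            robot_id, cloud_idx, desc = desc_queue.get()
            if cloud_idx is None:
                finished += 1
                continue
            # Here: descriptor retrieval, geometric verification, and
            # pose-graph optimization would run on the received batch.
            print(f"robot {robot_id}, cloud {cloud_idx}: "
                  f"{len(desc)} descriptors")

    workers = [threading.Thread(target=robot_worker, args=(i,))
               for i in range(5)]
    central = threading.Thread(target=localization_thread, args=(5,))
    central.start()
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    central.join()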

Figure 12. Visualization of segment reconstructions, as point clouds (left), and as surface meshes (right), generated from sequence
00 of the KITTI dataset. The quantization of point cloud reconstructions is most notable in the large wall segments (blue) visible in
the background. Equivalent surface mesh representations do not suffer from this issue.

5.9.1 Multi-robot SLAM in urban scenario  In order to simulate a multi-robot setup, we split sequence 00 of the KITTI odometry dataset into five sequences, which are simultaneously played back on a single computer for a duration of 114 seconds. In this experiment, the semantic information extracted from the SegMap descriptors is used to reject segments classified as vehicles from the retrieval process.

With this setup, 113 global associations were discovered, allowing us to link all the robot trajectories and create a common representation. We note that performing ICP between the associated point clouds would refine the localization transformation by, on average, only 0.13 ± 0.06 m, which is on the order of our voxelization resolution. However, this would require the original point cloud data to be kept in memory and transmitted to the central computer. Future work could consider refining the transformations by performing ICP on the reconstructions.

Localization and map reconstruction was performed at an average frequency of 10.5 Hz, and segment description was responsible for 30% of the total runtime, with an average duration of 28.4 ms per local cloud. A section of the target map which has been reconstructed from the descriptors is depicted in Figure 1.

Table 2 presents the results of this experiment. The required bandwidth is estimated by considering that each point is defined by three 32-bit floats and that 288 additional bits are required to link each descriptor to the trajectories. We only consider the useful data and ignore any transfer overhead. The final map of the KITTI sequence 00 contains 1341 segments, out of which 284 were classified as vehicles. A map composed of all the raw segment point clouds would be 16.8 MB, whereas using our descriptor it is reduced to only 386.2 kB. This compression ratio of 43.5x can be increased to 55.2x if one decides to remove vehicles from the map. This shows that our approach can be used for mapping much larger environments.

Table 2. Statistics resulting from the three experiments.

Statistic                                          KITTI   Powerplant   Foundry
Duration (s)                                         114          850      1086
Number of robots                                       5            3         2
Number of segmented local clouds                     557          758       672
Average number of segments per cloud                42.9         37.0      45.4
Bandwidth for transmitting local clouds (kB/s)    4814.7       1269.2     738.1
Bandwidth for transmitting segments (kB/s)        2626.6        219.4     172.2
Bandwidth for transmitting descriptors (kB/s)       60.4          9.5       8.1
Final map size with the SegMap descriptor (kB)     386.2        181.3     121.2
Number of successful localizations                   113           27        85
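As a back-of-envelope check of these figures, the compression ratios follow directly from the reported map sizes; the vehicle-removal number is reproduced here under the assumption that the descriptor map shrinks proportionally to the number of removed segments:

    # Sanity check of the map-size numbers above. The raw map size and
    # descriptor map size are the reported values; the proportional
    # shrinking for vehicle removal is our own reading of the 55.2x.
    raw_map_mb = 16.8          # all raw segment point clouds, KITTI 00
    segmap_kb = 386.2          # same map stored as SegMap descriptors
    segments, vehicles = 1341, 284

    ratio = raw_map_mb * 1000.0 / segmap_kb
    print(f"compression: {ratio:.1f}x")            # -> ~43.5x

    segmap_no_veh_kb = segmap_kb * (segments - vehicles) / segments
    print(f"without vehicles: "
          f"{raw_map_mb * 1000.0 / segmap_no_veh_kb:.1f}x")  # -> ~55.2x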
5.9.2 Multi-robot SLAM in disaster environments  For the two following experiments, we use data collected by Unmanned Ground Vehicles (UGVs) equipped with multiple motor encoders, an Xsens MTI-G Inertial Measurement Unit (IMU) and a rotating 2D SICK LMS-151 LiDAR. First, three UGVs were deployed at the decommissioned Gustav Knepper powerplant: a large two-floor utility building measuring 100 m long by 25 m wide. The second mission took place at the Phoenix-West foundry, in a semi-open building made of steel. A section measuring 100 m by 40 m was mapped using two UGVs. The buildings are shown in Figure 13.

For these two experiments, we used an incremental smoothness-based region growing algorithm which extracts plane-like segments (Dubé et al. (2018b)). The resulting SegMap reconstructions are shown in Figure 14 and detailed statistics are presented in Table 2. Although these planar segments have a very different nature than the ones used for training the descriptor extractor, multiple localizations were made in real-time, so that consistent maps could be reconstructed in both experiments. Note that these search and rescue experiments were performed with sensors without a full 360° field of view. Nevertheless, SegMap allowed robots to localize in areas visited in opposite directions.
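To illustrate the family of algorithms involved, the following sketch implements the simpler Euclidean-distance flavour of region growing (the smoothness-based variant used here additionally thresholds on surface normals); all parameters and data are synthetic:

    import numpy as np

    def euclidean_segments(points, radius=0.3, min_size=10):
        # Grow clusters by repeatedly adding points within `radius` of
        # points already in the cluster.
        unvisited = set(range(len(points)))
        segments = []
        while unvisited:
            seed = unvisited.pop()
            cluster, frontier = {seed}, [seed]
            while frontier:
                p = frontier.pop()
                near = [q for q in unvisited
                        if np.linalg.norm(points[q] - points[p]) < radius]
                for q in near:
                    unvisited.discard(q)
                    cluster.add(q)
                    frontier.append(q)
            if len(cluster) >= min_size:
                segments.append(sorted(cluster))
        return segments

    # Two well-separated blobs -> two segments.
    rng = np.random.default_rng(2)
    pts = np.vstack([rng.normal(0, 0.1, (50, 3)),
                     rng.normal(5, 0.1, (50, 3))])
    print([len(s) for s in euclidean_segments(pts)])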
6 DISCUSSION AND FUTURE WORK

While our proposed method works well in the demonstrated experiments, it is limited by the ability to only observe the geometry of the surrounding structure. This can be problematic in some man-made environments, which are repetitive and can lead to perceptual aliasing, influencing both the descriptor and the geometric consistency verification.

This could be addressed by detecting such aliasing instances and dealing with them explicitly, through, for example, an increase in the constraints of the geometric verification. On the other hand, featureless environments, such as flat fields or straight corridors, are equally challenging. As even LiDAR-based odometry methods struggle to maintain an accurate estimate of pose, these environments do not allow for reliable segment extraction. In these cases the map will drift until a more distinct section is reached that can be loop-closed, thus allowing for a partial correction of the previously built pose-graph. In different environments the two segmentation algorithms will have varying performance, with the Euclidean-distance-based one working better in outdoor scenarios, while the curvature-based one is more suited to indoor scenarios. A future approach would be to run the two segmentation strategies in parallel, thus allowing them to compensate for each other's shortcomings and enabling robots to navigate in multiple types of environments during the same mission.

Figure 13. Buildings of the Gustav Knepper powerplant (left) and the Phoenix-West foundry (right).

In order to address some of the aforementioned drawbacks, in future work we would like to extend the SegMap approach to different sensor modalities and different point cloud segmentation algorithms. For example, integrating information from camera images, such as color, into the descriptor learning could mitigate the lack of descriptiveness of features extracted from segments with little distinct geometric structure. In addition, color and semantic information from camera images could not only be used to improve the descriptor, but also to enhance the robustness of the underlying segmentation process. Considering the real-time constraints of the system, to note with respect to future work are the additional computational expenses introduced by processing and combining more data modalities.

Furthermore, whereas the present work performs segment description in a discrete manner, it would be interesting to investigate incremental updates of learning-based descriptors that could make the description process more efficient, such as the voting scheme proposed by Engelcke et al. (2017). Instead of using a feed-forward network, one could also consider a structure that leverages temporal information in the form of recurrence, in order to better describe segments based on their evolution in time. Moreover, it could be of interest to learn the usefulness of segments as a precursory step to localization, based on their distinctiveness and semantic attributes.

Figure 14. This figure illustrates a reconstruction of the buildings of the Gustav Knepper powerplant (top) and the Phoenix-West foundry (bottom). The point clouds are colored by height and the estimated robot trajectories are depicted with colored lines.
7 CONCLUSION
This paper presented SegMap: a segment-based approach for map representation in localization and mapping with 3D sensors. In essence, the robots' surroundings are decomposed into a set of segments, and each segment is represented by a distinctive, low-dimensional, learning-based descriptor. Data associations are identified by segment descriptor retrieval and matching, made possible by the repeatable and descriptive nature of segment-based features. We have shown that the descriptive power of SegMap outperforms hand-crafted features as well as the evaluated data-driven baseline solutions. Our experiments indicate that SegMap offers competitive localization performance in comparison to the state-of-the-art LocNet method. Additionally, we have combined our localization approach with LOAM, a LiDAR-based local motion estimator, and have demonstrated that the output of SegMap helps correct the drift of the open-loop odometry estimate. Finally, we have introduced SegMini: a light-weight version of our SegMap descriptor which can more easily be deployed on platforms with limited computational power.

In addition to enabling global localization, the SegMap descriptor allows us to reconstruct a map of the environment and to extract semantic information. The ability to reconstruct the environment while achieving a high compression rate is one of the main features of SegMap. This allows us to perform both SLAM and 3D reconstruction with LiDARs, at large scale and with low communication bandwidth between the robots and a central computer. These capabilities have been demonstrated through multiple experiments with real-world data in urban driving and search and rescue scenarios. The reconstructed maps could allow performing navigation tasks such as, for instance, multi-robot global path planning or increasing situational awareness.

ACKNOWLEDGMENTS

This work was supported by the European Union's Seventh Framework Program for research, technological development and demonstration under the TRADR project No. FP7-ICT-609763, from the EU H2020 research project under grant agreement No 688652, the Swiss State Secretariat for Education, Research and Innovation (SERI) No 15.0284, and by the Swiss National Science Foundation through the National Center of Competence in Research Robotics (NCCR). The authors would like to thank Abel Gawel, Mark Pfeiffer, Mattia Gollub, Helen Oleynikova, Philipp Krüsi, Igor Gilitschenski and Elena Stumm for their valuable collaboration and support.

References

Michael Bosse and Robert Zlot. Place recognition using keypoint voting in large 3D lidar datasets. In IEEE International Conference on Robotics and Automation (ICRA), 2013.
Sean L Bowman, Nikolay Atanasov, Kostas Daniilidis, and George J Pappas. Probabilistic data association for semantic SLAM. In IEEE International Conference on Robotics and Automation (ICRA), pages 1722–1729. IEEE, 2017.
Andrew Brock, Theodore Lim, JM Ritchie, and Nick Weston. Generative and discriminative voxel modeling with convolutional neural networks. In Workshop on 3D Deep Learning, NIPS, 2016.
Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. Signature verification using a "siamese" time delay neural network. In Advances in Neural Information Processing Systems, pages 737–744, 1994.
C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira, I. Reid, and J.J. Leonard. Past, present, and future of simultaneous localization and mapping: Towards the robust-perception age. IEEE Transactions on Robotics, 32(6):1309–1332, 2016.
Konrad Cop, Paulo Borges, and Renaud Dubé. DELIGHT: An efficient descriptor for global localisation using lidar intensities. In IEEE International Conference on Robotics and Automation (ICRA), 2018.
Angela Dai, Charles Ruizhongtai Qi, and Matthias Niessner. Shape completion using 3D-encoder-predictor CNNs and shape synthesis. In IEEE Conference on Computer Vision and Pattern Recognition, July 2017.
Ayush Dewan, Tim Caselitz, and Wolfram Burgard. Learning a local feature descriptor for 3D lidar scans. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 2018.
Renaud Dubé, Daniel Dugas, Elena Stumm, Juan Nieto, Roland Siegwart, and Cesar Cadena. SegMatch: Segment-based place recognition in 3D point clouds. In IEEE International Conference on Robotics and Automation (ICRA), pages 5266–5272. IEEE, 2017a.
Renaud Dubé, Abel Gawel, Hannes Sommer, Juan Nieto, Roland Siegwart, and Cesar Cadena. An online multi-robot SLAM system for 3D lidars. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1004–1011. IEEE, 2017b.
Renaud Dubé, Andrei Cramariuc, Daniel Dugas, Juan Nieto, Roland Siegwart, and Cesar Cadena. SegMap: 3D segment mapping using data-driven descriptors. In Robotics: Science and Systems (RSS), 2018a.
Renaud Dubé, Mattia G Gollub, Hannes Sommer, Igor Gilitschenski, Roland Siegwart, Cesar Cadena, and Juan Nieto. Incremental segment-based localization in 3D point clouds. IEEE Robotics and Automation Letters, 3(3):1832–1839, 2018b.
Gil Elbaz, Tamar Avraham, and Anath Fischer. 3D point cloud registration for localization using a deep neural network auto-encoder. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2472–2481. IEEE, 2017.
J. Elseberg, S. Magnenat, R. Siegwart, and A. Nüchter. Comparison of nearest-neighbor-search strategies and implementations for efficient shape registration. Journal of Software Engineering for Robotics, 3(1):2–12, 2012.
Martin Engelcke, Dushyant Rao, Dominic Zeng Wang, Chi Hay Tong, and Ingmar Posner. Vote3Deep: Fast object detection in 3D point clouds using efficient convolutional neural networks. In IEEE International Conference on Robotics and Automation (ICRA), pages 1355–1361. IEEE, 2017.
Yi Fang, Jin Xie, Guoxian Dai, Meng Wang, Fan Zhu, Tiantian Xu, and Edward Wong. 3D deep shape descriptor. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2319–2328, 2015.
Eduardo Fernández-Moral, Walterio Mayol-Cuevas, Vicente Arevalo, and Javier Gonzalez-Jimenez. Fast place recognition with plane-based maps. In IEEE International Conference on Robotics and Automation (ICRA), 2013.
Eduardo Fernández-Moral, Patrick Rives, Vicente Arévalo, and Javier González-Jiménez. Scene structure registration for localization and mapping. Robotics and Autonomous Systems, 75:649–660, 2016.
Ross Finman, Liam Paull, and John J Leonard. Toward object-based place recognition in dense RGB-D maps. In ICRA Workshop on Visual Place Recognition in Changing Environments, 2015.
Abel Gawel, Titus Cieslewski, Renaud Dubé, Mike Bosse, Roland Siegwart, and Juan Nieto. Structure-based vision-laser matching. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 182–188. IEEE, 2016.
Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In IEEE Conference on Computer Vision and Pattern Recognition, 2012.
Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, volume 9, pages 249–256, 2010.
Benjamin Graham, Martin Engelcke, and Laurens van der Maaten. 3D semantic segmentation with submanifold sparse convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 18–22, 2018.
Karl Granström, Thomas B Schön, Juan I Nieto, and Fabio T Ramos. Learning to close loops from range data. The International Journal of Robotics Research, 30(14):1728–1754, 2011.
Vitor Guizilini and Fabio Ramos. Learning to reconstruct 3D structures for occupancy mapping. In Robotics: Science and Systems (RSS), 2017.
Alexander Hermans, Lucas Beyer, and Bastian Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.
Wolfgang Hess, Damon Kohler, Holger Rapp, and Daniel Andor. Real-time loop closure in 2D lidar SLAM. In IEEE International Conference on Robotics and Automation (ICRA), pages 1271–1278. IEEE, 2016.
Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
M. Kaess, H. Johannsson, R. Roberts, V. Ila, J. J. Leonard, and F. Dellaert. iSAM2: Incremental smoothing and mapping using the Bayes tree. The International Journal of Robotics Research, 31(2):216–235, 2012.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
Bo Li, Tianlei Zhang, and Tian Xia. Vehicle detection from 3D lidar using fully convolutional network. In Robotics: Science and Systems (RSS), 2016.
William E Lorensen and Harvey E Cline. Marching cubes: A high resolution 3D surface construction algorithm. In ACM SIGGRAPH Computer Graphics, volume 21, pages 163–169. ACM, 1987.
Stephanie Lowry, Niko Sunderhauf, Paul Newman, John J Leonard, David Cox, Peter Corke, and Michael J Milford. Visual place recognition: A survey. IEEE Transactions on Robotics, 2016.
Martin Magnusson, Henrik Andreasson, Andreas Nüchter, and Achim J Lilienthal. Automatic appearance-based loop detection from three-dimensional laser data using the normal distributions transform. Journal of Field Robotics, 26(11-12):892–914, 2009.
Daniel Maturana and Sebastian Scherer. VoxNet: A 3D convolutional neural network for real-time object recognition. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2015.
D. P. Kingma and J. L. Ba. ADAM: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, et al. Deep face recognition. In BMVC, volume 1, page 6, 2015.
Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, July 2017.
Daniel Ricao Canelhas, Erik Schaffernicht, Todor Stoyanov, Achim J Lilienthal, and Andrew J Davison. Compressed voxel-based mapping using unsupervised learning. Robotics, 6(3):15, 2017.
Gernot Riegler, Ali Osman Ulusoy, and Andreas Geiger. OctNet: Learning deep 3D representations at high resolutions. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
Timo Röhling, Jennifer Mack, and Dirk Schulz. A fast histogram-based similarity measure for detecting loop closures in 3-D lidar data. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2015.
Radu Bogdan Rusu, Nico Blodow, and Michael Beetz. Fast point feature histograms (FPFH) for 3D registration. In IEEE International Conference on Robotics and Automation (ICRA), pages 3212–3217, 2009.
Samuele Salti, Federico Tombari, and Luigi Di Stefano. SHOT: Unique signatures of histograms for surface and texture description. Computer Vision and Image Understanding, 125:251–264, 2014.
Johannes L Schönberger, Marc Pollefeys, Andreas Geiger, and Torsten Sattler. Semantic visual localization. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. In IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.
Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
Bastian Steder, Giorgio Grisetti, and Wolfram Burgard. Robust place recognition for 3D range data based on point features. In IEEE International Conference on Robotics and Automation (ICRA), 2010.
Bastian Steder, Michael Ruhnke, Slawomir Grzonka, and Wolfram Burgard. Place recognition in 3D scans using a combination of bag of words and point feature based relative pose estimation. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2011.
Lyne P. Tchapmi, Christopher B. Choy, Iro Armeni, JunYoung Gwak, and Silvio Savarese. SEGCloud: Semantic segmentation of 3D point clouds. In International Conference on 3D Vision (3DV), 2017.
Georgi Tinchev, Simona Nobili, and Maurice Fallon. Seeing the wood for the trees: Reliable localization in urban and natural environments. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018.
J. Varley, C. DeChant, A. Richardson, J. Ruales, and P. Allen. Shape completion enabled robotic grasping. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2442–2447, 2017.
Martin Velas, Michal Spanel, Michal Hradis, and Adam Herout. CNN for IMU assisted odometry estimation using velodyne lidar. In IEEE International Conference on Autonomous Robot Systems and Competitions (ICARSC), pages 71–77. IEEE, 2018.
Kilian Q Weinberger, John Blitzer, and Lawrence K Saul. Distance metric learning for large margin nearest neighbor classification. In Advances in Neural Information Processing Systems, pages 1473–1480, 2006.
Martin Weinmann, Boris Jutzi, and Clément Mallet. Semantic 3D scene interpretation: A framework combining optimal neighborhood size selection with relevant features. ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 2(3):181, 2014.
Paul Wohlhart and Vincent Lepetit. Learning descriptors for object recognition and 3D pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3109–3118, 2015.
Bichen Wu, Alvin Wan, Xiangyu Yue, and Kurt Keutzer. SqueezeSeg: Convolutional neural nets with recurrent CRF for real-time road-object segmentation from 3D lidar point cloud. In IEEE International Conference on Robotics and Automation (ICRA), pages 1887–1893. IEEE, 2018.
Jiajun Wu, Chengkai Zhang, Tianfan Xue, Bill Freeman, and Josh Tenenbaum. Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In Advances in Neural Information Processing Systems, pages 82–90, 2016.
Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3D ShapeNets: A deep representation for volumetric shapes. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1912–1920, 2015.
Yawei Ye, Titus Cieslewski, Antonio Loquercio, and Davide Scaramuzza. Place recognition in semi-dense maps: Geometric and learning-based approaches. In Proceedings of the British Machine Vision Conference (BMVC), 2017.
Huan Yin, Yue Wang, Li Tang, Xiaqing Ding, and Rong Xiong. LocNet: Global localization in 3D point clouds for mobile robots. arXiv preprint arXiv:1712.02165, 2017.
Huan Yin, Li Tang, Xiaqing Ding, Yue Wang, and Rong Xiong. LocNet: Global localization in 3D point clouds for mobile vehicles. In Proceedings of the IEEE Intelligent Vehicles Symposium (IV), pages 728–733. IEEE, 2018.
Andy Zeng, Shuran Song, Matthias Nießner, Matthew Fisher, Jianxiong Xiao, and Thomas Funkhouser. 3DMatch: Learning local geometric descriptors from RGB-D reconstructions. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
Ji Zhang and Sanjiv Singh. LOAM: Lidar odometry and mapping in real-time. In Robotics: Science and Systems (RSS), 2014.
Yan Zhuang, Nan Jiang, Huosheng Hu, and Fei Yan. 3-D-laser-based scene measurement and place recognition for mobile robots in dynamic indoor environments. IEEE Transactions on Instrumentation and Measurement, 62(2):438–450, 2013.
