SegMap: Segment-Based Mapping and Localization Using Data-Driven Descriptors
This paper has been accepted for publication in the International Journal of Robotics Research.
DOI: 10.1177/0278364919863090
R. Dubé, A. Cramariuc, D. Dugas, H. Sommer, M. Dymczyk, J. Nieto, R. Siegwart, C. Cadena. “SegMap: Segment-based
mapping and localization using data-driven descriptors”. The International Journal of Robotics Research.
bibtex:
@article{segmap2019dube,
  author  = {Renaud Dub\'e and Andrei Cramariuc and Daniel Dugas and Hannes Sommer and Marcin Dymczyk and Juan Nieto and Roland Siegwart and Cesar Cadena},
  title   = {{SegMap}: Segment-based mapping and localization using data-driven descriptors},
  journal = {The International Journal of Robotics Research},
  year    = {2019},
  doi     = {10.1177/0278364919863090}
}

arXiv:1909.12837v1 [cs.RO] 27 Sep 2019
Abstract
Precisely estimating a robot’s pose in a prior, global map is a fundamental capability for mobile robotics, e.g.
autonomous driving or exploration in disaster zones. This task, however, remains challenging in unstructured, dynamic
environments, where local features are not discriminative enough and global scene descriptors only provide coarse
information. We therefore present SegMap: a map representation solution for localization and mapping based on the
extraction of segments in 3D point clouds. Working at the level of segments offers increased invariance to view-point and
local structural changes, and facilitates real-time processing of large-scale 3D data. SegMap exploits a single compact
data-driven descriptor for performing multiple tasks: global localization, 3D dense map reconstruction, and semantic
information extraction. The performance of SegMap is evaluated in multiple urban driving and search and rescue
experiments. We show that the learned SegMap descriptor has superior segment retrieval capabilities, compared to
state-of-the-art handcrafted descriptors. In consequence, we achieve a higher localization accuracy and a 6% increase
in recall over state-of-the-art. These segment-based localizations allow us to reduce the open-loop odometry drift by
up to 50%. SegMap is open-source available along with easy to run demonstrations.
Keywords
Global localization, place recognition, simultaneous localization and mapping (SLAM), LiDAR, 3D point clouds,
segmentation, 3D reconstruction, convolutional neural network (CNN), auto-encoder
3D LiDAR-based SLAM technique (Zhang and Singh (2014)).
• A triplet loss descriptor training technique and its comparison to the previously introduced classification-based approach.
• A particularly lightweight variant of our SegMap descriptor that can be deployed on platforms with limited computational resources.

The remainder of the paper is structured as follows: Section 2 provides an overview of the related work in the fields of localization and learning-based descriptors for 3D point clouds. The SegMap approach and our novel descriptor that enables reconstruction of the environment are detailed in Section 3 and Section 4. The method is evaluated in Section 5, and finally Sections 6 and 7 conclude with a short discussion and ideas on future works.

2 RELATED WORK

This section first introduces state-of-the-art approaches to localization in 3D point clouds. Data-driven techniques using 3D data which are relevant to the present work are then presented.

Localization in 3D point clouds. Detecting loop-closures from 3D data has been tackled with different approaches. We have identified three main trends: (i) approaches based on local features, (ii) global descriptors, and (iii) approaches based on planes or objects.

A significant number of works propose to extract local features from keypoints and perform matching on the basis of these features. Bosse and Zlot (2013) extract keypoints directly from the point clouds and describe them with a 3D Gestalt descriptor. Keypoints then vote for their nearest neighbors in a vote matrix which is eventually thresholded for recognizing places. A similar approach has been used in Gawel et al. (2016). Apart from such Gestalt descriptors, a number of alternative local feature descriptors exist which can be used in similar frameworks, including Fast Point Feature Histograms (FPFH) (Rusu et al. (2009)) and SHOT (Salti et al. (2014)). Alternatively, Zhuang et al. (2013) transform the local scans into bearing-angle images and extract Speeded Up Robust Features (SURF) from these images. A strategy based on 3D spatial information is employed to order the scenes before matching the descriptors. A similar technique by Steder et al. (2010) first transforms the local scans into a range image. Local features are extracted and compared to the ones stored in a database, employing the Euclidean distance for matching keypoints. This work is extended in Steder et al. (2011) by using Normal-Aligned Radial Features (NARF) descriptors and a bag-of-words approach for matching.

Using global descriptors of the local point cloud for place recognition is also proposed in (Röhling et al. (2015); Granström et al. (2011); Magnusson et al. (2009); Cop et al. (2018)). Röhling et al. (2015) propose to describe each local point cloud with a 1D histogram of point heights, assuming that the sensor keeps a constant height above the ground. The histograms are then compared using the Wasserstein metric for recognizing places. Granström et al. (2011) describe point clouds with rotation-invariant features such as volume, nominal range, and range histogram. Distances are computed for feature vectors and cross-correlation for histogram features, and an AdaBoost classifier is trained to match places. Finally, Iterative Closest Point (ICP) is used for computing the relative pose between point clouds. In another approach, Magnusson et al. (2009) split the cloud into overlapping grids and compute shape properties (spherical, linear, and several types of planar) of each cell, combining them into a matrix of surface shape histograms. Similar to other works, these descriptors are compared for recognizing places. Recently, Cop et al. (2018) proposed to leverage LiDAR intensity information with a global point cloud descriptor. A two-stage approach is adopted such that, after retrieving places based on global descriptors, a local keypoint-based geometric verification step estimates localization transformations. The authors demonstrated that using intensity information can reduce the computational timings. However, the complete localization pipeline operates at a frequency one order of magnitude lower than most LiDAR sensor frequencies.

While local keypoint features often lack descriptive power, global descriptors can struggle with variations in view-point. Therefore other works have also proposed to use 3D shapes or objects for the place recognition task. Fernández-Moral et al. (2013), for example, propose to perform place recognition by detecting planes in 3D environments. The planes are accumulated in a graph and an interpretation tree is used to match sub-graphs. A final geometric consistency test is conducted over the planes in the matched sub-graphs. The work is extended in Fernández-Moral et al. (2016) to use the covariance of the plane parameters instead of the number of points in planes for matching. This strategy is only applied to small, indoor environments and assumes a plane model which is no longer valid in unstructured environments. A somewhat analogous, seminal work on object-based loop-closure detection in indoor environments using RGB-D cameras is presented by Finman et al. (2015). Although presenting interesting ideas, their work can only handle a small number of well-segmented objects in small-scale environments. Similarly, Bowman et al. (2017) proposed a novel SLAM solution in which semantic information and local geometric features are jointly incorporated into a probabilistic framework. Such semantic-based approaches have significant potential, for example robustness to stark changes in point of view, but require the presence of human-known objects in the scene.

We therefore aim for an approach which does not rely on assumptions about the environment being composed of simplistic geometric primitives such as planes, or on a rich library of objects. This allows for a more general, scalable solution.

Learning with 3D point clouds. In recent years, Convolutional Neural Networks (CNNs) have become the state-of-the-art method for generating learning-based descriptors, due to their ability to find complex patterns in data (Krizhevsky et al. (2012)). For 3D point clouds, methods based on CNNs achieve impressive performance in applications such as object detection (Engelcke et al. (2017); Maturana and Scherer (2015); Riegler et al. (2017); Li et al. (2016); Wu et al. (2015); Wohlhart and Lepetit (2015); Qi et al. (2017); Fang et al. (2015)), semantic segmentation (Riegler et al. (2017); Li et al. (2016); Qi et al. (2017);
Tchapmi et al. (2017); Graham et al. (2018); Wu et al. (2018)), 3D object generation (Wu et al. (2016)), and LiDAR-based local motion estimation (Dewan et al. (2018); Velas et al. (2018)).

Recently, a handful of works proposing the use of CNNs for localization in 3D point clouds have been published. First, Zeng et al. (2017) propose extracting data-driven 3D keypoint descriptors (3DMatch) which are robust to changes in view-point. Although impressive retrieval performance is demonstrated using an RGB-D sensor in indoor environments, it is not clear whether this method is applicable in real-time in large-scale outdoor environments. A different approach based on 3D CNNs was proposed in Ye et al. (2017) for performing localization in semi-dense maps generated with visual data. Recently, Yin et al. (2017) introduced a semi-handcrafted global descriptor for performing place recognition and rely on an ICP step for estimating the 6-DoF localization transformations. This method will be used as a baseline solution in Section 5.8 when evaluating the precision of our localization transformations. Elbaz et al. (2017) propose describing local subsets of points using a deep neural network autoencoder. The authors state, however, that the implementation has not been optimized for real-time operation, and no timings have been provided. In contrast, our work presents a data-driven segment-based localization method that can operate in real-time and that enables map reconstruction and semantic extraction capabilities.

To achieve this reconstruction capability, the architecture of our descriptor was inspired by autoencoders, in which an encoder network compresses the input to a small-dimensional representation and a decoder network attempts to decompress the representation back into the original input. The compressed representation can be used as a descriptor for performing 3D object classification (Brock et al. (2016)). Brock et al. (2016) also present successful results using variational autoencoders for reconstructing voxelized 3D data. Different configurations of encoding and decoding networks have also been proposed for achieving localization and for reconstructing and completing 3D shapes and environments (Guizilini and Ramos (2017); Dai et al. (2017); Varley et al. (2017); Ricao Canelhas et al. (2017); Elbaz et al. (2017); Schönberger et al. (2018)).

While autoencoders present the interesting opportunity of simultaneously accomplishing both compression and feature extraction tasks, optimal performance at both is not guaranteed. As will be shown in Section 5.4, these two tasks can have conflicting goals when robustness to changes in point of view is desired. In this work, we combine the advantages of the encoding-decoding architecture of autoencoders with a technique proposed by Parkhi et al. (2015). The authors address the face recognition problem by first training a CNN to classify people in a training set, and afterwards using the second-to-last layer as a descriptor for new faces. Other alternative training techniques include, for example, the use of contrastive loss (Bromley et al. (1994)) or triplet loss (Weinberger et al. (2006)), the latter being evaluated in Section 5.4. We use the resulting segment descriptors in the context of SLAM to achieve better performance, as well as significantly compressed maps that can easily be stored, shared, and reconstructed.

3 The SegMap approach

This section presents our SegMap approach to localization and mapping in 3D point clouds. It is composed of five core modules: segment extraction, description, localization, map reconstruction, and semantics extraction. These modules are detailed in this section and together allow single- and multi-robot systems to create a powerful unified representation which can conveniently be transferred.

Segmentation. The stream of point clouds generated by a 3D sensor is first accumulated in a dynamic voxel grid†. Point cloud segments are then extracted in a section of radius R around the robot. In this work we consider two types of incremental segmentation algorithms (Dubé et al. (2018b)). The first one starts by removing points corresponding to the ground plane, which acts as a separator for clustering together the remaining points based on their Euclidean distances. The second algorithm computes local normals and curvatures for each point and uses these to extract flat or planar-like surfaces. Both methods are used to incrementally grow segments by using only newly active voxels as seeds, which are either added to existing segments, form new segments, or merge existing segments together‡. This results in a handful of local segments, each individually associated with a set of past observations, i.e. Si = {s1, s2, ..., sn}. Each observation sj ∈ Si is a 3D point cloud representing a snapshot of the segment as points are added to it. Note that sn represents the latest observation of a segment and is considered complete when no further measurements are collected, e.g. when the robot has moved away.

Description. Compact features are then extracted from these 3D segment point clouds using the data-driven descriptor presented in Section 4. A global segment map is created online by accumulating the segment centroids and corresponding descriptors. In order for the global map to most accurately represent the latest state of the world, we only keep the descriptor associated with the last and most complete observation.

Localization. In the next step, candidate correspondences are identified between global and local segments using k-Nearest Neighbors (k-NN) in feature space. The approximate k nearest descriptors are retrieved through an efficient query in a kd-tree. Localization is finally performed by verifying the largest subset of candidate correspondences for geometrical consistency on the basis of the segment centroids. Specifically, the centroids of the corresponding local and global segments must have the same geometric configuration up to a small jitter in their positions, to compensate for slight variations in segmentation. In the experiments presented in Section 5.9, this is achieved using an incremental recognition strategy which uses caching of correspondences for faster geometric verifications (Dubé et al. (2018b)).

† In our experiments, we consider two techniques for estimating the local motion by registering successive LiDAR scans: one which uses ICP and one based on LOAM (Zhang and Singh (2014)).
‡ For more information on these segmentation algorithms, the reader is encouraged to consult our prior work (Dubé et al. (2018b)).
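To make the retrieval and verification steps concrete, the following is a minimal sketch of the localization idea described above: a kd-tree k-NN query in descriptor space followed by a greedy pairwise-distance consistency check on segment centroids. It assumes numpy/scipy, simplifies the incremental, cached verification of Dubé et al. (2018b), and all function names are illustrative rather than the SegMap API.

```python
import numpy as np
from scipy.spatial import cKDTree

def candidate_correspondences(local_desc, global_desc, k=64):
    """k-NN search in descriptor space (one row per segment descriptor).

    Assumes at least k segments exist in the global map."""
    tree = cKDTree(global_desc)
    _, idx = tree.query(local_desc, k=k)
    # Pair every local segment i with each of its k nearest global segments j.
    return [(i, j) for i in range(len(local_desc)) for j in idx[i]]

def largest_consistent_subset(pairs, local_centroids, global_centroids, eps=0.4):
    """Greedily grow the largest set of correspondences whose local and global
    centroids have pairwise-consistent distances, up to a jitter eps (meters).
    This O(n^3) loop stands in for the incremental, cached verification."""
    best = []
    for seed in pairs:
        subset = [seed]
        for p in pairs:
            if p is seed:
                continue
            consistent = all(
                abs(np.linalg.norm(local_centroids[p[0]] - local_centroids[q[0]])
                    - np.linalg.norm(global_centroids[p[1]] - global_centroids[q[1]])) < eps
                for q in subset)
            if consistent:
                subset.append(p)
        if len(subset) > len(best):
            best = subset
    return best  # localize only if len(best) exceeds a threshold (e.g. 7)
```

A 6-DoF transformation can then be estimated from the matched centroid pairs, as described in the Localization module above.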
Figure 3. The descriptor extractor is composed of three convolutional and two fully connected layers. The 3D segments are compressed to a representation of dimension 64 × 1 which can be used for localization, map reconstruction and semantic extraction. To the right of the descriptor we illustrate the classification and reconstruction layers which are used for training. In the diagram, the convolutional (Conv), deconvolutional (Deconv), fully connected (FC) and batch normalization (BN) layers are abbreviated as indicated. As parameters, the Conv and Deconv layers list the number of filters and their sizes, FC layers the number of nodes, max pool layers the size of the pooling operation, and dropout layers the ratio of values to drop. Unless otherwise specified, Rectified Linear Unit (ReLU) activation functions are used for all layers.
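As a companion to Figure 3, here is a sketch of the extractor in PyTorch (not the original implementation). The filter counts, the 512- and 64-node FC layers, the appended 3-value scale, and the training heads follow the figure; kernel sizes of the deconvolutions, padding, pooling placement, and the decoder reshape are our assumptions chosen to make the tensor shapes consistent.

```python
import torch
import torch.nn as nn

class SegMapDescriptor(nn.Module):
    def __init__(self, n_classes=2):  # n_classes: N in Figure 3
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(1, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),                      # 32x32x16 -> 16x16x8
            nn.Conv3d(32, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),                      # 16x16x8 -> 8x8x4
            nn.Conv3d(64, 64, 3, padding=1), nn.ReLU())
        self.fc1 = nn.Linear(64 * 8 * 8 * 4 + 3, 512)  # +3 for the scale input
        self.bn = nn.BatchNorm1d(512)
        self.drop = nn.Dropout(0.5)
        self.fc2 = nn.Linear(512, 64)             # the 64-D SegMap descriptor
        self.classifier = nn.Sequential(          # classification head (training)
            nn.BatchNorm1d(64), nn.Dropout(0.5), nn.Linear(64, n_classes))
        self.decoder_fc = nn.Linear(64, 32 * 8 * 8 * 4)   # the "FC | 8192" block
        self.decoder = nn.Sequential(             # reconstruction head (training)
            nn.ConvTranspose3d(32, 32, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose3d(32, 32, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose3d(32, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, voxels, scale):
        # voxels: (B, 1, 32, 32, 16) binary grid; scale: (B, 3) original extent.
        x = self.encoder(voxels).flatten(1)
        h = torch.relu(self.fc1(torch.cat([x, scale], dim=1)))
        descriptor = self.fc2(self.drop(self.bn(h)))
        logits = self.classifier(descriptor)      # softmax applied in the loss
        rec = self.decoder(self.decoder_fc(descriptor).view(-1, 32, 8, 8, 4))
        return descriptor, logits, rec
```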
When a large enough geometrically consistent set of correspondences is identified, a 6-DoF transformation between the local and global maps is estimated. This transformation is fed to an incremental pose-graph SLAM solver which in turn estimates, in real-time, the trajectories of all robots (Dubé et al. (2017b)).

Reconstruction. Thanks to our autoencoder-like descriptor extractor architecture, the compressed representation can at any time be used to reconstruct an approximate map, as illustrated in Figure 12. As the SegMap descriptor can conveniently be transmitted over wireless networks with limited bandwidth, any agent in the network can reconstruct and leverage this 3D information. More details on these reconstruction capabilities are given in Section 4.3.

Semantics. The SegMap descriptor also contains semantically relevant information, without the training process having enforced this property on the descriptor. This can, for example, be used to discern between static and dynamic objects in the environment to improve the robustness of the localization task. In this work we present an experiment where the network is able to distinguish between three different semantic classes: vehicles, buildings, and others (see Section 4.4).

4 The SegMap Descriptor

In this section we present our main contribution: a data-driven descriptor for 3D segment point clouds which allows for localization, map reconstruction and semantic extraction. The descriptor extractor's architecture and the processing steps for inputting the point clouds to the network are introduced. We then describe our technique for training this descriptor to accomplish tasks of both segment retrieval and map reconstruction. We finally show how the descriptor can further be used to extract semantic information from the point cloud.

4.1 Descriptor extractor architecture

The architecture of the descriptor extractor is presented in Figure 3. Its input is a 3D binary voxel grid of fixed dimension 32 × 32 × 16, which was determined empirically to offer a good balance between descriptiveness and the size of the network. The description part of the CNN is composed of three 3D convolutional layers with max pool layers placed in between and two fully connected layers. Unless otherwise specified, ReLU activation functions are used for all layers. The original scale of the input segment is passed as an additional parameter to the first fully connected layer to increase robustness to voxelization at different aspect ratios. The descriptor is obtained by taking the activations of the extractor's last fully connected layer. This architecture was selected by grid search over various configurations and parameters.

4.2 Segment alignment and scaling

A pre-processing stage is required in order to input the 3D segment point clouds for description. First, an alignment step is applied such that segments extracted from the same objects are similarly presented to the descriptor network. This is performed by applying a 2D Principal Components Analysis (PCA) to all points located within a segment. The segment is then rotated so that the x-axis of its frame of reference, from the robot's perspective, aligns with the eigenvector corresponding to the largest eigenvalue. We choose to resolve the ambiguity in direction by rotating the segment so that the lower half section along the y-axis of its frame of reference contains the highest number of points. Of the multiple alignment strategies we evaluated, the presented strategy worked best.

The network's input voxel grid is applied to the segment so that its center corresponds to the centroid of the aligned segment. By default the voxels have minimum side lengths of 0.1 m. These can individually be increased to exactly fit segments having one or more dimensions larger than the grid. Whereas maintaining the aspect ratio while scaling could potentially offer better retrieval performance, this individual scaling with a minimum side length better avoids large errors caused by aliasing. We also found that this scaling method offers the best reconstruction performance, with only a minimal impact on the retrieval performance when the original scale of the segments is passed as a parameter to the network.
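The following numpy sketch illustrates this pre-processing under our reading of Section 4.2; the exact PCA convention, the centering details, and the helper names are assumptions for illustration.

```python
import numpy as np

GRID = np.array([32, 32, 16])
MIN_VOXEL = 0.1  # minimum voxel side length in meters

def align_and_voxelize(points):
    """points: (N, 3) segment point cloud in the robot's frame."""
    # 2D PCA on the horizontal (x, y) coordinates of the segment.
    xy = points[:, :2] - points[:, :2].mean(axis=0)
    eigval, eigvec = np.linalg.eigh(np.cov(xy.T))
    major = eigvec[:, np.argmax(eigval)]
    # Rotate so the x-axis aligns with the principal eigenvector.
    angle = -np.arctan2(major[1], major[0])
    c, s = np.cos(angle), np.sin(angle)
    rot = points.copy()
    rot[:, :2] = xy @ np.array([[c, -s], [s, c]]).T
    # Resolve the 180-degree ambiguity: keep more points in the lower y half.
    if np.sum(rot[:, 1] < 0) < np.sum(rot[:, 1] > 0):
        rot[:, :2] = -rot[:, :2]
    # Per-axis voxel size: at least 0.1 m, individually grown to fit the segment.
    extent = rot.max(axis=0) - rot.min(axis=0)
    voxel = np.maximum(extent / GRID, MIN_VOXEL)
    # Center the grid on the segment centroid and mark occupied voxels.
    idx = np.floor((rot - rot.mean(axis=0)) / voxel + GRID / 2).astype(int)
    idx = np.clip(idx, 0, GRID - 1)
    grid = np.zeros(GRID, dtype=np.float32)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return grid, extent  # extent is the 3-value scale passed to the network
```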
Figure 4. A simple fully connected network that can be appended to the SegMap descriptor (depicted in Figure 3) in order to extract semantic information. In our experiments, we train this network to distinguish between vehicles, buildings, and other objects.

To achieve both the retrieval and reconstruction capabilities, we propose a customized learning technique. The two desired objectives are imposed on the network by the softmax cross-entropy loss Lc for retrieval and the reconstruction loss Lr. We propose to simultaneously apply both losses to the descriptor and to this end define a combined loss function L which merges the contributions of both objectives:

L = Lc + α Lr    (1)
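A PyTorch sketch of Equation (1); the choice of binary cross-entropy over the voxel grid for Lr and the default weight α are illustrative assumptions, not values stated above.

```python
import torch
import torch.nn.functional as F

def combined_loss(class_logits, labels, reconstruction, target_grid, alpha=1.0):
    """L = L_c + alpha * L_r, applied jointly to the descriptor (Equation 1)."""
    l_c = F.cross_entropy(class_logits, labels)                # softmax cross-entropy, retrieval
    l_r = F.binary_cross_entropy(reconstruction, target_grid)  # voxel-grid reconstruction
    return l_c + alpha * l_r
```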
Figure 5. An illustration of the SegMap reconstruction capabilities. The segments are extracted from sequence 00 of the KITTI
dataset and represent, from top to bottom respectively, vehicles, buildings, and other objects. For each segment pair, the
reconstruction is shown to the right of the original. The network manages to accurately reconstruct the segments despite the high
compression to only 64 values. Note that the voxelization effect is more visible on buildings as larger segments necessitate larger
voxels to keep the input dimension fixed.
[Plot residue removed. Recoverable labels: training losses Lc and Lr; a comparison of SegMap against LocNet over a distance threshold [m]; and trajectory (X [m], Y [m]) and error-versus-path-length [m] plots, with ground truth, for LOAM and LOAM+SegMap on (a) KITTI odometry sequence 00 and (b) KITTI odometry sequence 08.]
Figure 11. The trajectories for KITTI odometry sequences (a) 00 and (b) 08 for LOAM and for the combination of LOAM and SegMap. In addition, we show translation and rotation errors for the two approaches, using the standard KITTI evaluation method (Geiger et al. (2012)).
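For reference, here is a simplified numpy sketch of the path-length-based error evaluation behind such plots. It follows the spirit of the KITTI protocol (Geiger et al. (2012)) but, unlike the official devkit, ignores the rotational alignment of the sub-trajectories.

```python
import numpy as np

def relative_translation_error(est_xyz, gt_xyz, path_length=200.0):
    """est_xyz, gt_xyz: (N, 3) positions sampled along the trajectory.

    Returns the mean translation drift over sub-paths of the given length,
    normalized by that length (a simplified, translation-only metric)."""
    dists = np.concatenate(([0.0], np.cumsum(
        np.linalg.norm(np.diff(gt_xyz, axis=0), axis=1))))
    errors = []
    for i in range(len(gt_xyz)):
        # First pose at least `path_length` meters further along the path.
        j = int(np.searchsorted(dists, dists[i] + path_length))
        if j >= len(gt_xyz):
            break
        drift = (est_xyz[j] - est_xyz[i]) - (gt_xyz[j] - gt_xyz[i])
        errors.append(np.linalg.norm(drift) / path_length)
    return float(np.mean(errors)) if errors else float("nan")
```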
added in real-time as constraints in the graph, to correct the drifting odometry. This results in a real-time, LiDAR-only, end-to-end pipeline that produces segment-based maps of the environment, with loop-closures.

In all experiments, we use a local map with a radius of 50 m around the robot. When performing segment retrieval we consider 64 neighbours and require a minimum of 7 correspondences, which are altogether geometrically consistent, to output a localization. These parameters were chosen empirically using the information presented in Figures 7 and 8 as a reference.

Our evaluations on KITTI sequences 00 and 08 (Figure 11) demonstrate that global localization results from SegMap help correct the drift of the odometry estimates. The trajectories output by the system combining SegMap and LOAM follow the ground-truth poses provided by the benchmark more precisely than the open-loop solution. We also show how global localizations reduce both translational and rotational errors. Particularly over longer paths, SegMap is able to reduce the drift in the trajectory estimate by up to a factor of two, considering both translation and rotation errors. For shorter paths, the drift only improves marginally or remains the same, as local errors are more dependent on the quality of the odometry estimate. We believe that our evaluation showcases not only the performance of SegMap, but also the general benefits stemming from global localization algorithms.

5.9 Multi-robot experiments

We evaluate the SegMap approach on three large-scale multi-robot experiments: one in an urban-driving environment and two in search and rescue scenarios. In both indoor and outdoor scenarios we use the same model, which was trained on the KITTI sequences 05 and 06 as described in Section 5.3.

The experiments are run on a single machine, with a multi-thread approach simulating a centralized system. One thread per robot accumulates the 3D measurements, extracts segments, and performs the descriptor extraction. The descriptors are transmitted to a separate thread which localizes the robots through descriptor retrieval and geometric verification, and runs the pose-graph optimization. In all experiments, sufficient global associations need to be made, in real-time, for co-registration of the trajectories and merging of the maps. Moreover, in a centralized setup it might be crucial to limit the data transmitted over a wireless network with potentially limited bandwidth (a minimal sketch of this setup is given below).

5.9.1 Multi-robot SLAM in urban scenario. In order to simulate a multi-robot setup, we split sequence 00 of the KITTI odometry dataset into five sequences, which are simultaneously played back on a single computer for a duration of 114 seconds.
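Below is a minimal sketch of the centralized setup described above: one producer thread per robot pushing segment descriptors into a shared queue, and a single consumer thread performing retrieval and pose-graph optimization. The helper functions are hypothetical stand-ins, not the SegMap implementation.

```python
import queue
import threading

descriptor_queue = queue.Queue()

def segment_and_describe(cloud):
    """Stand-in for incremental segmentation + CNN description of one cloud."""
    return [[0.0] * 64]  # one dummy 64-D descriptor per cloud

def robot_worker(robot_id, clouds):
    """Per-robot thread: accumulate measurements, extract and describe segments."""
    for cloud in clouds:
        for descriptor in segment_and_describe(cloud):
            descriptor_queue.put((robot_id, descriptor))

def central_localizer(n_items):
    """Central thread: descriptor retrieval, geometric verification, pose graph."""
    for _ in range(n_items):
        robot_id, descriptor = descriptor_queue.get()
        # ... k-NN retrieval, geometric verification, pose-graph update ...
        descriptor_queue.task_done()

robots = [threading.Thread(target=robot_worker, args=(i, [object()] * 10))
          for i in range(5)]
central = threading.Thread(target=central_localizer, args=(50,))
for t in robots + [central]:
    t.start()
for t in robots + [central]:
    t.join()
```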
Figure 12. Visualization of segment reconstructions, as point clouds (left), and as surface meshes (right), generated from sequence
00 of the KITTI dataset. The quantization of point cloud reconstructions is most notable in the large wall segments (blue) visible in
the background. Equivalent surface mesh representations do not suffer from this issue.
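As a side note on how such surface meshes can be produced from the reconstructed occupancy grids, here is a sketch using the marching cubes algorithm (Lorensen and Cline (1987)) as implemented in scikit-image; the 0.5 iso-level is an assumption.

```python
import numpy as np
from skimage import measure

def grid_to_mesh(occupancy_grid, voxel_size):
    """occupancy_grid: (X, Y, Z) array of predicted occupancies in [0, 1];
    voxel_size: per-axis voxel side lengths in meters."""
    verts, faces, normals, _ = measure.marching_cubes(occupancy_grid, level=0.5)
    return verts * np.asarray(voxel_size), faces, normals  # vertices in meters
```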
In this experiment, the semantic information extracted from the SegMap descriptors is used to reject segments classified as vehicles from the retrieval process.

With this setup, 113 global associations were discovered, allowing us to link all the robot trajectories and create a common representation. We note that performing ICP between the associated point clouds would refine the localization transformation by, on average, only 0.13 ± 0.06 m, which is in the order of our voxelization resolution. However, this would require the original point cloud data to be kept in memory and transmitted to the central computer. Future work could consider refining the transformations by performing ICP on the reconstructions.

Localization and map reconstruction was performed at an average frequency of 10.5 Hz, and segment description was responsible for 30% of the total runtime with an average duration of 28.4 ms per local cloud. A section of the target map which has been reconstructed from the descriptors is depicted in Figure 1.

Table 2 presents the results of this experiment. The required bandwidth is estimated by considering that each point is defined by three 32-bit floats and that 288 additional bits are required to link each descriptor to the trajectories (with the 64-dimensional descriptor stored as 32-bit floats, this amounts to 64 × 4 B + 36 B = 292 B per segment). We only consider the useful data and ignore any transfer overhead. The final map of the KITTI sequence 00 contains 1341 segments, out of which 284 were classified as vehicles. A map composed of all the raw segment point clouds would be 16.8 MB, whereas using our descriptor it is reduced to only 386.2 kB. This compression ratio of 43.5x can be increased to 55.2x if one decides to remove vehicles from the map. This shows that our approach can be used for mapping much larger environments.

Table 2. Statistics resulting from the three experiments.

Statistic                                          KITTI    Powerplant  Foundry
Duration (s)                                       114      850         1086
Number of robots                                   5        3           2
Number of segmented local clouds                   557      758         672
Average number of segments per cloud               42.9     37.0        45.4
Bandwidth for transmitting local clouds (kB/s)     4814.7   1269.2      738.1
Bandwidth for transmitting segments (kB/s)         2626.6   219.4       172.2
Bandwidth for transmitting descriptors (kB/s)      60.4     9.5         8.1
Final map size with the SegMap descriptor (kB)     386.2    181.3       121.2
Number of successful localizations                 113      27          85

5.9.2 Multi-robot SLAM in disaster environments. For the two following experiments, we use data collected by Unmanned Ground Vehicles (UGVs) equipped with multiple motor encoders, an Xsens MTi-G Inertial Measurement Unit (IMU), and a rotating 2D SICK LMS-151 LiDAR. First, three UGVs were deployed at the decommissioned Gustav Knepper powerplant: a large two-floor utility building measuring 100 m long by 25 m wide. The second mission took place at the Phoenix-West foundry, in a semi-open building made of steel. A section measuring 100 m by 40 m was mapped using two UGVs. The buildings are shown in Figure 13.

For these two experiments, we used an incremental smoothness-based region-growing algorithm which extracts plane-like segments (Dubé et al. (2018b)). The resulting SegMap reconstructions are shown in Figure 14 and detailed statistics are presented in Table 2. Although these planar segments have a very different nature than the ones used for training the descriptor extractor, multiple localizations have been made in real-time so that consistent maps could be reconstructed in both experiments. Note that these search and rescue experiments were performed with sensors without a full 360° field of view. Nevertheless, SegMap allowed robots to localize in areas visited in opposite directions.

6 DISCUSSION AND FUTURE WORK

While our proposed method works well in the demonstrated experiments, it is limited by the ability to only observe the geometry of the surrounding structure. This can be problematic in some man-made environments, which are repetitive and can lead to perceptual aliasing, influencing both the descriptor and the geometric consistency verification. This could be addressed by detecting such aliasing instances and
Wolfgang Hess, Damon Kohler, Holger Rapp, and Daniel Andor. Real-time loop closure in 2D lidar SLAM. In IEEE International Conference on Robotics and Automation (ICRA), pages 1271–1278. IEEE, 2016.

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.

M. Kaess, H. Johannsson, R. Roberts, V. Ila, J. J. Leonard, and F. Dellaert. iSAM2: Incremental smoothing and mapping using the Bayes tree. The International Journal of Robotics Research, 31(2):216–235, 2012.

Diederik P. Kingma and Jimmy L. Ba. ADAM: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

Bo Li, Tianlei Zhang, and Tian Xia. Vehicle detection from 3D lidar using fully convolutional network. In Robotics: Science and Systems (RSS), 2016.

William E Lorensen and Harvey E Cline. Marching cubes: A high resolution 3D surface construction algorithm. In ACM SIGGRAPH Computer Graphics, volume 21, pages 163–169. ACM, 1987.

Stephanie Lowry, Niko Sunderhauf, Paul Newman, John J Leonard, David Cox, Peter Corke, and Michael J Milford. Visual place recognition: A survey. IEEE Transactions on Robotics, 2016.

Martin Magnusson, Henrik Andreasson, Andreas Nüchter, and Achim J Lilienthal. Automatic appearance-based loop detection from three-dimensional laser data using the normal distributions transform. Journal of Field Robotics, 26(11-12):892–914, 2009.

Daniel Maturana and Sebastian Scherer. VoxNet: A 3D convolutional neural network for real-time object recognition. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2015.

Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, et al. Deep face recognition. In BMVC, volume 1, page 6, 2015.

Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, July 2017.

Daniel Ricao Canelhas, Erik Schaffernicht, Todor Stoyanov, Achim J Lilienthal, and Andrew J Davison. Compressed voxel-based mapping using unsupervised learning. Robotics, 6(3):15, 2017.

Gernot Riegler, Ali Osman Ulusoy, and Andreas Geiger. OctNet: Learning deep 3D representations at high resolutions. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.

Timo Röhling, Jennifer Mack, and Dirk Schulz. A fast histogram-based similarity measure for detecting loop closures in 3-D lidar data. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2015.

Radu Bogdan Rusu, Nico Blodow, and Michael Beetz. Fast point feature histograms (FPFH) for 3D registration. In IEEE International Conference on Robotics and Automation (ICRA), pages 3212–3217, 2009.

Samuele Salti, Federico Tombari, and Luigi Di Stefano. SHOT: Unique signatures of histograms for surface and texture description. Computer Vision and Image Understanding, 125:251–264, 2014.

Johannes L Schönberger, Marc Pollefeys, Andreas Geiger, and Torsten Sattler. Semantic visual localization. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.

Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. In IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.

Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.

Bastian Steder, Giorgio Grisetti, and Wolfram Burgard. Robust place recognition for 3D range data based on point features. In IEEE International Conference on Robotics and Automation (ICRA), 2010.

Bastian Steder, Michael Ruhnke, Slawomir Grzonka, and Wolfram Burgard. Place recognition in 3D scans using a combination of bag of words and point feature based relative pose estimation. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2011.

Lyne P. Tchapmi, Christopher B. Choy, Iro Armeni, JunYoung Gwak, and Silvio Savarese. SEGCloud: Semantic segmentation of 3D point clouds. In International Conference on 3D Vision (3DV), 2017.

Georgi Tinchev, Simona Nobili, and Maurice Fallon. Seeing the wood for the trees: Reliable localization in urban and natural environments. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018.

J. Varley, C. DeChant, A. Richardson, J. Ruales, and P. Allen. Shape completion enabled robotic grasping. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2442–2447, 2017.

Martin Velas, Michal Spanel, Michal Hradis, and Adam Herout. CNN for IMU assisted odometry estimation using velodyne lidar. In IEEE International Conference on Autonomous Robot Systems and Competitions (ICARSC), pages 71–77. IEEE, 2018.

Kilian Q Weinberger, John Blitzer, and Lawrence K Saul. Distance metric learning for large margin nearest neighbor classification. In Advances in Neural Information Processing Systems, pages 1473–1480, 2006.

Martin Weinmann, Boris Jutzi, and Clément Mallet. Semantic 3D scene interpretation: A framework combining optimal neighborhood size selection with relevant features. ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 2(3):181, 2014.

Paul Wohlhart and Vincent Lepetit. Learning descriptors for object recognition and 3D pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3109–3118, 2015.

Bichen Wu, Alvin Wan, Xiangyu Yue, and Kurt Keutzer. SqueezeSeg: Convolutional neural nets with recurrent CRF for real-time road-object segmentation from 3D lidar point cloud. In IEEE International Conference on Robotics and Automation (ICRA), pages 1887–1893. IEEE, 2018.

Jiajun Wu, Chengkai Zhang, Tianfan Xue, Bill Freeman, and Josh Tenenbaum. Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In Advances in Neural Information Processing Systems, pages 82–90, 2016.

Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3D ShapeNets: A deep representation for volumetric shapes. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1912–1920, 2015.

Yawei Ye, Titus Cieslewski, Antonio Loquercio, and Davide Scaramuzza. Place recognition in semi-dense maps: Geometric and learning-based approaches. In Proceedings of the British Machine Vision Conference (BMVC), 2017.

Huan Yin, Yue Wang, Li Tang, Xiaqing Ding, and Rong Xiong. LocNet: Global localization in 3D point clouds for mobile robots. arXiv preprint arXiv:1712.02165, 2017.

Huan Yin, Li Tang, Xiaqing Ding, Yue Wang, and Rong Xiong. LocNet: Global localization in 3D point clouds for mobile vehicles. In Proceedings of the IEEE Intelligent Vehicles Symposium (IV), pages 728–733. IEEE, 2018.

Andy Zeng, Shuran Song, Matthias Nießner, Matthew Fisher, Jianxiong Xiao, and Thomas Funkhouser. 3DMatch: Learning local geometric descriptors from RGB-D reconstructions. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.

Ji Zhang and Sanjiv Singh. LOAM: Lidar odometry and mapping in real-time. In Robotics: Science and Systems (RSS), 2014.

Yan Zhuang, Nan Jiang, Huosheng Hu, and Fei Yan. 3-D-laser-based scene measurement and place recognition for mobile robots in dynamic indoor environments. IEEE Transactions on Instrumentation and Measurement, 62(2):438–450, 2013.