1. Introduction
Simultaneous Localization and Mapping (SLAM) plays a vital role in the implementation of autonomous navigation systems, as it enables an agent to estimate its current pose without prior knowledge of its surrounding environment. In more detail, SLAM estimates the robot’s position while incrementally building an environmental representation (i.e., a map). Its rise in popularity has closely followed the growing variety of applications for autonomous systems, which include, but are not limited to, autonomous cars [
1], unmanned aerial vehicles, commonly known as drones [
2], underwater robotics [
3], space exploration [
4], and indoor navigation [
5,
6,
7]. The localization and mapping process leverages a combination of one or more sensors [
8,
9], which can be classified as
exteroceptive, i.e., aimed at perceiving the surrounding environment (e.g., camera, LiDAR), and
proprioceptive, which can sense the position, orientation, and movement of the agent in space (e.g., IMU, encoder, GPS). The fusion of multiple sensing modalities is made necessary by the impossibility of leveraging GPS technology as a stand-alone sensing technique in several use cases [
10], since autonomous systems need high positioning accuracy and a consistent signal. Moreover, despite the advent of high-precision GNSS systems (e.g., Real-Time Kinematics—RTK), centimeter-level positioning accuracy is guaranteed only under specific conditions, and the signal can be lost under unfavorable atmospheric conditions or infrastructural characteristics (e.g., indoors, or in the presence of tall buildings causing urban canyoning effects) [
11]. The sensor information is subsequently processed through an algorithmic pipeline that commonly approaches the SLAM problem in a probabilistic fashion. The current state-of-the-art SLAM methodologies can be grouped into two major categories, namely,
filter-based methods and
optimization-based methods [
12,
13].
Filter-based methods can be considered direct derivations of Bayesian filtering [
14], as they consist of two-step iterative processes, namely, the following:
The prediction step, where the new state is estimated according to an evolution model and a control input;
The correction step, where the current observations are matched against the previously observed features.
Between the prediction and the correction step, the acquired observations must be associated with the already existing partial environmental reconstruction. If one or more observations are not matched with any of the pre-existing map features, new features are initialized. Some SLAM proposals model the map features as uniquely identifiable (known data association). In this case, the data association problem is addressed deterministically. However, a probabilistic approach is needed whenever the map features are not uniquely identifiable (unknown data association). In this work, we address the unknown data association case for higher generality.
The earliest SLAM proposals belong to the filter-based category and are based on the Kalman Filter (KF) [
15], such as the Extended Kalman Filter (EKF) [
16] and its variations, among which we also mention the Unscented Kalman Filter (UKF) [
17] and the Information Filter (IF) [
18]. The main limitations of such approaches are related to handling nonlinearities, the Gaussian modeling of uncertainty, and the growth of the state dimension as new features are observed. These problems are tackled in Particle Filter (PF)-based methods [
19] such as FastSLAM 1.0 [
20] and FastSLAM 2.0 [
21], where a proposal distribution is sampled through a set of weighted hypotheses, called
particles. The weight of each particle is proportional to the correspondence between the prediction it holds and the observed state values. In filter-based methods, the map can be represented as an Occupancy Grid or a Landmark-Based Map [
22]. Occupancy Grids model the environment as an array of cells, i.e., uniformly sized portions of space (squares or cubes). Each cell holds the probability of being free or occupied. On the other hand, Landmark-Based Maps consist of a set of points, commonly called landmarks, whose position is fixed in space. The robot’s pose estimation is based on the measurement of its distance from the observed landmarks. In this case, each landmark l is identified by a 3D vector containing its coordinates with respect to the map origin and its numerical identifier. While Occupancy Grids have higher information content, Landmark-Based Maps provide a more compact environmental representation, reducing memory consumption [
23,
24].
Optimization-based methods model the map as a factor graph, which can encode information at a geometric level (e.g., points, lines, planes) and at a semantic level (instances of objects stored in a previously defined database).
Further insights on the main state-of-the-art SLAM proposals available are given in review works such as [
12,
13,
25,
26,
27,
28,
29]. While filter-based SLAM is more commonly employed for
online SLAM, where the latest robot pose is estimated jointly with the map, optimization-based SLAM is mainly leveraged to solve the
full SLAM problem, where the whole trajectory is estimated [
12].
This work focuses on particle filter-based algorithms, specifically FastSLAM 2.0, based on Landmark-Based Maps. FastSLAM 2.0 leverages a Rao–Blackwellized particle filter where the robot pose and map estimations are decoupled. In more detail, the robot pose is first predicted according to the motion model and the control inputs, then adjusted considering the latest associated observations. On the other hand, each landmark position is estimated using an Extended Kalman Filter. Further details on FastSLAM 2.0 are discussed in
Section 2.
One of the main limitations of SLAM approaches is their high computational complexity, and particle filter-based methods are no exception. In addition, SLAM must often run on onboard devices and guarantee real-time pose estimation. In this context, parallel programming and devices enabling massive parallelization, such as Graphics Processing Units (GPUs), can enable many SLAM methodologies to meet real-time constraints [
30,
31,
32,
33,
34]. PF-based approaches can particularly benefit from acceleration via parallel programming due to their modular structure. In fact, the particles are processed independently and, within each particle, the landmark poses are also independently estimated. In this context, the most commonly adopted acceleration modality in the literature is particle-wise parallelization, as in [
35,
36,
37,
38,
39]. Some proposals focus the acceleration efforts on specific steps of the pipeline, such as data association [
40,
41], resampling [
42,
43,
44], and particle weight computation [
37]. As for FastSLAM 2.0, a few works provide parallelized implementations on edge devices. An accelerated implementation for FPGA is available in [
45], while a parallelized implementation of the algorithm for the OMAP4430 architecture is available in [
46]. To the best of our knowledge, the only parallelized implementation of FastSLAM 2.0 on a GPU device is discussed in [
39], where a particle-wise parallelization for monocular inertial SLAM is proposed; in that work, the data association step, based on the FAST Corner Detection method, is carried out on the CPU.
The parallelization process requires identifying the parameters that affect the algorithm’s computational complexity in order to distribute the computation efficiently among the individual multiprocessors. In some cases, such as in FastSLAM 2.0, the complexity is determined by multiple parameters, which may depend on the use case scenario of interest (i.e., the context), offering multiple possibilities in the parallelization design. In addition, the specific characteristics of the hardware architecture in use significantly impact the performance in terms of latency. However, the state-of-the-art does not provide guidelines for selecting the parallelization modality, but only single implementations specifically designed for a given architecture. Consequently, the optimal parallelization modality still needs to be determined empirically, potentially requiring multiple re-designs. Therefore, in this paper, we propose a context-adaptable design of FastSLAM 2.0 for GPU based on the CUDA programming paradigm. The proposed design enables the implementation of different parallelization modalities depending on the context. By context, we mean the values assumed by the set of parameters determining the algorithm’s complexity, which, in turn, depend on the use case of interest. Combined with the flexible design of FastSLAM 2.0, we provide an evaluation methodology that enables the assessment of the optimal parallelization modality depending on the context and the hardware architecture in use.
The design proposed in this contribution parallelizes all the functional blocks of FastSLAM 2.0, including the data association step, through the Joint Compatibility Branch and Bound (JCBB) methodology [
47]. JCBB enables simultaneous handling of multiple non-uniquely identifiable observations, leveraging a branch-and-bound approach in combination with a joint compatibility test. To the best of our knowledge, this is the first parallelization of FastSLAM 2.0 tackling unknown data association, and no other parallelized version of JCBB is available.
In addition, we provide parallelized implementations of multiple resampling approaches, with the goal of accommodating the needs of the broadest possible variety of navigation scenarios.
Our methodology was validated by deploying FastSLAM 2.0 onto a General Purpose GPU (GPGPU) interfaced with a simulation environment that included a model of an IMU and a generic range sensor.
The rest of this paper is organized as follows.
Section 2 provides theoretical background and discusses the state-of-the-art regarding the data association and the resampling problems. In more detail,
Section 2.1 discusses the elements of the FastSLAM 2.0 method,
Section 2.2 tackles the data association problem, and the resampling problem is discussed in
Section 2.3.
Section 3 describes the hardware and software setup:
Section 3.1 provides the specification of the hardware,
Section 3.2 and
Section 3.3 summarize the main characteristics of the CUDA programming paradigm and the PyCUDA library, and
Section 3.4 addresses the memory access management.
Section 4 describes our evaluation methodology, including the simulation environment (
Section 4.1), the partitioning of the pipeline into functional blocks (
Section 4.2), the analysis of the parameter dependencies (
Section 4.3), and the time gain evaluation (
Section 4.4).
Section 5 tackles the main characteristics of the functional blocks’ parallelization.
Section 6 is dedicated to the experimental results and the related discussion, and
Section 7 draws the conclusions of this contribution and discusses future developments.
3. Hardware and Software Setup
This section describes the hardware and software setup leveraged for the deployment of the proposed design and the validation of the evaluation methodology described in
Section 4. The proposed parallelization of FastSLAM 2.0 is based on the CUDA programming paradigm and uses PyCUDA [
63] for Python interfacing. The performance was tested on a high-end laptop equipped with an Intel Core i7-10750H CPU [
63] and an NVIDIA GeForce GTX 1650 Ti [
63].
3.1. Hardware Setup
The hardware used to deploy and validate our design was a high-end laptop hosting an Intel Core i7-10750H CPU and an NVIDIA GeForce GTX 1650 Ti. The specifications of the CPU architecture are summarized in
Table 1.
The NVIDIA GeForce GTX 1650 Ti is based on the Turing architecture and features 1024 CUDA cores optimized for parallel processing tasks. Designed primarily for gaming laptops and compact desktops, this graphics processing unit (GPU) integrates several features that enhance its suitability for algorithm acceleration in computational research. The main specifications of the NVIDIA GeForce GTX 1650 Ti are gathered in
Table 2.
3.2. CUDA Programming Paradigm
High-performance computing has gained increasing interest in the past few years, and modern Graphics Processing Units (GPUs) are at the center of this trend due to the possibility of heavily parallelizing computation in large-scale applications.
In this context, NVIDIA developed CUDA [
64], a programming model and software environment that allows programmers to write scalable parallel code, organized into kernels, in a C-like programming language [
65]. In more detail, a kernel is a small program that runs on the GPU rather than the CPU and executes a portion of the overall computation. In CUDA, a kernel is executed by multiple threads on the GPU. Threads are the basic execution units: they all execute the same kernel code, but on different data. A group of threads cooperatively executing the same kernel is called a block. Blocks are mapped to individual streaming multiprocessors (SMs) and are divided into subsets of 32 threads called warps. Warps within the same block are scheduled in no guaranteed order. The execution context of each warp is maintained on-chip. While the number of threads per warp is constant across all GPU architectures (i.e., 32), the maximum number of threads per block and the maximum number of blocks in the grid depend on the specific device.
The threads belonging to the same block can cooperate by sharing data through shared memory. Shared memory is a small, fast memory space shared among the threads of a block. It is divided into 32 banks, which can be accessed simultaneously by the threads of a warp. In contrast, global device memory is a larger, slower memory space shared among all GPU threads: while it provides a larger storage capacity, it has a considerably higher latency than shared memory.
When a kernel is launched, each block is assigned to a single streaming multiprocessor, a processing unit on the GPU that executes a block of threads. Each multiprocessor has its own memory, registers, and execution units. The threads in a block are executed concurrently on the multiprocessor, allowing for efficient parallel processing of data. However, to ensure correct execution, it is necessary to synchronize the threads. This is achieved through barriers, which force threads to wait until all threads in the same block have reached a certain point before proceeding to the next instruction. A set of blocks executed on the GPU is called a grid. The blocks in a grid run independently, which means that block-level synchronization is not contemplated within the CUDA programming paradigm; this independence allows the GPU to distribute the blocks across its multiprocessors, each of which executes a block of threads concurrently.
When a CUDA kernel is launched, the following steps occur: the host (CPU) launches a kernel, specifying the number of blocks and threads per block; the GPU schedules the blocks and assigns them to available multiprocessors; each block is executed on a single multiprocessor, where the threads in the block are executed concurrently; the threads in a block cooperate with each other, sharing data and synchronizing their execution; and the blocks in a grid are executed independently.
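To make these concepts concrete, the following minimal sketch (an illustration of ours, not code from the proposed implementation) defines a trivial kernel and launches it with an explicit choice of blocks and threads per block; the kernel, array size, and scaling factor are arbitrary.

```python
# Minimal PyCUDA sketch of the launch sequence described above: the host
# compiles a kernel, chooses the number of blocks and threads per block,
# and each thread derives its global index from its block and thread IDs.
import numpy as np
import pycuda.autoinit                      # creates the CUDA context
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void scale(float *data, float factor, int n)
{
    // Global index of this thread within the whole grid.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)                            // guard against excess threads
        data[idx] *= factor;
}
""")
scale = mod.get_function("scale")

n = 4096
data = np.random.rand(n).astype(np.float32)
threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block   # ceil(n / threads)

# cuda.InOut copies the array to the device before the launch and back after it.
scale(cuda.InOut(data), np.float32(2.0), np.int32(n),
      block=(threads_per_block, 1, 1), grid=(blocks, 1))
```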
3.3. PyCUDA
PyCUDA [
63] is an open-source toolkit that supports Runtime Code Generation for GPUs. It allows the programmer to adopt a scripting-based approach when programming GPUs. Every feature of the CUDA runtime is accessible through PyCUDA, while memory allocation and resource management are handled automatically, as in a high-level programming language. However, it is possible to deallocate memory manually in applications with tight memory budgets. PyCUDA creates GPU binaries from the CUDA source code provided by the programmer. This architecture enables an edit–run–repeat working style, which is convenient for exploratory prototyping as well as for full-scale code, thanks to the use of the GPU. Therefore, the Python code running on the CPU executes control and communication tasks that require high abstraction, while the C-like code running on the GPU solves low-level, throughput-oriented problems.
In our implementation, PyCUDA enables the launching of the CUDA kernels and the memory transfer from host to device.
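As a hypothetical example of these host-side tasks (the kernel and variable names are ours), the snippet below allocates device memory explicitly, copies an array from host to device, launches a kernel, copies the result back, and finally frees the allocation manually.

```python
# Hypothetical example of explicit memory management with PyCUDA;
# the kernel and the variable names are illustrative only.
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void add_offset(float *values, float offset, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        values[i] += offset;
}
""")
add_offset = mod.get_function("add_offset")

values = np.zeros(1024, dtype=np.float32)

values_gpu = cuda.mem_alloc(values.nbytes)      # allocate device memory
cuda.memcpy_htod(values_gpu, values)            # host -> device transfer

add_offset(values_gpu, np.float32(0.5), np.int32(values.size),
           block=(256, 1, 1), grid=(4, 1))      # kernel launch

cuda.memcpy_dtoh(values, values_gpu)            # device -> host transfer
values_gpu.free()                               # manual deallocation (optional)
```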
3.4. Memory Access Management
Our parallelization of FastSLAM 2.0 is designed to run on a heterogeneous architecture, where the CPU and the GPU, which can be defined as
host and
device, respectively, are equipped with two separate memories and execute commands asynchronously. The parallelized FastSLAM 2.0 is interfaced with a simulator, described in
Section 4.1, that runs on the CPU and feeds the algorithm with synthetic sensor data. The CPU can also allocate memory on the GPU to store variables and launch kernels. All the FastSLAM 2.0 pipeline operations leverage either the GPU global memory or the on-chip memory (shared memory and local thread memory). The final results are eventually copied back to the host (CPU). A schema of the memory management is provided in
Figure 3.
The proposed implementation of FastSLAM 2.0 optimizes memory access to accelerate processing time. This is achieved by limiting access to the global device memory whenever possible in favor of shared memory when cooperative data manipulation is needed, as well as local thread memory.
Shared memory in a GPU is fast, on-chip memory accessible by all threads within the same thread block. It enables efficient data sharing among threads and significantly improves the performance of parallel algorithms by allowing data reuse and reducing redundant calculations.
While shared memory is much faster than global memory, offering latency and bandwidth comparable to registers, it is limited in size, and its amount varies with the GPU architecture. The GPU on which we validate our approach provides 48 kilobytes of shared memory per block. All threads within the same block can read from and write to shared memory. Shared memory is divided into banks; the bank size and the number of banks depend on the specific GPU architecture. If multiple threads of the same warp simultaneously access different addresses within the same bank, bank conflicts arise and degrade performance. Bank conflicts are commonly avoided through CUDA programming good practices such as padding and careful offset selection, ensuring that the threads of a warp access different banks. Furthermore, conflicts can arise when thread divergence occurs. Therefore, it is of utmost importance to correctly synchronize threads, ensuring that read and write operations on the shared memory are performed coherently.
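The padding technique mentioned above can be illustrated with a short kernel sketch (our example, not part of the implementation): adding one extra column to a shared-memory tile shifts the elements of a column onto different banks, so that column-wise accesses by the threads of a warp do not collide.

```python
# Illustrative tile-transpose kernel showing shared-memory padding; bounds
# checks are omitted and the matrix width is assumed to be a multiple of TILE.
import pycuda.autoinit  # noqa: F401  (creates the CUDA context)
from pycuda.compiler import SourceModule

mod = SourceModule("""
#define TILE 32

__global__ void transpose_tile(const float *in, float *out, int width)
{
    // Without the "+ 1" padding, the threads of a warp reading a column of
    // tile[][] would all hit the same bank and their accesses would serialize.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];

    __syncthreads();         // all shared-memory writes must complete first

    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    out[ty * width + tx] = tile[threadIdx.x][threadIdx.y];
}
""")
# Launch with block=(TILE, TILE, 1) and one block per 32x32 tile of the matrix.
```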
4. Evaluation Methodology
This section discusses the evaluation methodology we propose to adapt and optimize the deployment of FastSLAM 2.0 onto the hardware architecture in use. Firstly, we analyze the algorithm’s dependencies. Then, we divide the algorithm pipeline into functional blocks with the same dependencies. Subsequently, each parameter is thresholded to limit the processing time. The parallelized design of each functional block contemplates one or more parallelization modalities, depending on which parameters affect its complexity (i.e., the number of particles, the number of existing landmarks, the number of observations, and the number of matched observations), as further described in
Section 5.
Dependency analysis and parameter thresholding are widely employed in the parallelization process of different types of SLAM algorithms. In [
66,
67], such a methodology is leveraged to study the performance of graph-based SLAM on embedded systems; in [
68], it was used to evaluate an efficient implementation of EKF-SLAM, and in [
39], it was used for FastSLAM 2.0.
Our design takes a comprehensive approach by exploring multiple parallelization methodologies to achieve the best performance across different contexts. We leverage a dependency analysis to define multiple possible grid configurations, which are later deployed on the hardware architecture of interest. The evaluation is carried out by comparing each parallelization modality to the sequential implementation according to the performance indexes described in
Section 4.4. The performance analysis under different test cases (i.e., contexts) is carried out in the form of a sensitivity analysis, which means that the variation of the index of interest is observed as an effect of the variation of a single parameter, while the others are kept constant.
In the state-of-the-art, the parallelization modality is not always disclosed, making it rather impractical to apply our validation methodology to such proposals with the available information. However, whenever such information is available, the parallelization usually happens particle-wise (i.e., the number of concurrent operations is equal to the number of particles). Due to the adaptability of our design, we can test multiple modalities, including, among others, particle-wise parallelization. Furthermore, we wish to remark that, unlike the state-of-the-art proposals, it is not within our scope to prove the optimality of a single implementation under the widest variety of scenarios, but rather to provide a flexible design that can be easily adapted to reach optimality under various conditions and with the support of various hardware architectures.
4.1. Simulation Environment
The simulation environment we leverage to benchmark our design models the robot as a rigid body, i.e., it is characterized by its Cartesian coordinates and its orientation, and it moves in a bi-dimensional space populated by landmarks. The landmarks are modeled as points in space and are defined by their Cartesian coordinates and a label. The robot is considered equipped with a generic range sensor and an inertial sensor. The sensors are characterized by a measurement noise modeled as zero-mean Gaussian white noise.
As for the inertial sensor, the linear and rotational speeds are generated with the following:
\[ \tilde{v}_t = v_t + \varepsilon_v, \qquad \tilde{\omega}_t = \omega_t + \varepsilon_\omega, \qquad \varepsilon_v \sim \mathcal{N}(0,\sigma_v^2),\; \varepsilon_\omega \sim \mathcal{N}(0,\sigma_\omega^2), \]
and the range and bearing signals coming from the range sensor are computed as
\[ \tilde{r}_t = \sqrt{(x_l - x_t)^2 + (y_l - y_t)^2} + \varepsilon_r, \qquad \tilde{\varphi}_t = \operatorname{atan2}(y_l - y_t,\, x_l - x_t) - \theta_t + \varepsilon_\varphi, \]
where \((x_t, y_t, \theta_t)\) is the robot pose, \((x_l, y_l)\) are the coordinates of the observed landmark, and \(\varepsilon_r \sim \mathcal{N}(0,\sigma_r^2)\), \(\varepsilon_\varphi \sim \mathcal{N}(0,\sigma_\varphi^2)\).
The simulation environment runs on CPU and is schematized in
Figure 4.
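For illustration only, the measurement models above can be reproduced with a few lines of NumPy; the noise standard deviations below are arbitrary placeholders, not the values used in our experiments.

```python
# Illustrative NumPy reproduction of the simulated sensor models.
import numpy as np

rng = np.random.default_rng(0)

def noisy_odometry(v, omega, sigma_v=0.05, sigma_w=0.01):
    """Linear and rotational speeds corrupted by zero-mean Gaussian noise."""
    return v + rng.normal(0.0, sigma_v), omega + rng.normal(0.0, sigma_w)

def range_bearing(pose, landmarks, sigma_r=0.1, sigma_phi=0.01):
    """Noisy range and bearing from the robot pose (x, y, theta) to each landmark."""
    x, y, theta = pose
    dx = landmarks[:, 0] - x
    dy = landmarks[:, 1] - y
    r = np.hypot(dx, dy) + rng.normal(0.0, sigma_r, len(landmarks))
    phi = np.arctan2(dy, dx) - theta + rng.normal(0.0, sigma_phi, len(landmarks))
    phi = (phi + np.pi) % (2 * np.pi) - np.pi     # wrap bearings to (-pi, pi]
    return np.column_stack((r, phi))
```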
4.2. Functional Block Partitioning
As mentioned in
Section 4, we analyze the algorithm based on the instruction and operation order, as in [
39]. We then divide the pipeline into functional blocks as summarized in
Figure 5. The first functional block, dedicated to the
Particle Initialization step, initializes the particle poses with random numbers and the landmarks as null values. The output of the
Particle Initialization block constitutes the input of the
Particle Prediction functional block, together with the inertial sensor signals from the simulator. This functional block is supposed to run only at the first iteration of the algorithm. The predicted poses and the unlabeled observations retrieved from the simulator are the input of the
Data Association functional block. The associated observations outputted by the
Data Association block and the pose prediction are fed to the
Proposal Adjustment block, which outputs the updated robot pose estimate. The output of this functional block and the associated observations are then fed to the
Landmarks Update and Weights Computation block. The corrected landmark estimates, the corrected particle pose, and the newly assigned weights feed the
Resampling functional block, which is responsible for copying the selected particles to the new particle set and re-initializing the particle weights. The output of the
Resampling blocks, together with the new signals coming from the inertial sensors, will constitute the input for the
Particle Prediction and
Data Association functional blocks for the next iteration.
All the functional blocks are composed of a single kernel except for the
Data Association functional block, which is subdivided into multiple kernels. Further implementation details are discussed in
Section 5.
4.3. Algorithm Dependencies and Definition of Test Cases
After the algorithm has been analyzed and divided into functional blocks, we can identify their dependencies. By
dependencies, we mean the variables that affect the computational complexity of each block. The functional blocks discussed in
Section 4.2 coincide with the steps described in
Section 2.1. Their dependencies are summarized in
Table 3.
Based on the dependency analysis, we can finally determine the parallelization possibilities for each functional block. In particular, the parallelization modalities are reported below:
Particle Initialization: The parallelization is performed based on the particles.
Particle Prediction: The parallelization is performed based on the particles.
Data Association: The parallelization is performed based on the particles, on the previously observed landmarks, and jointly on the particles and the landmarks. Although the complexity of the data association step also depends on the number of observations, the necessity of building feasible joint hypotheses (i.e., each observation is associated with at most one landmark) forces the developer to tackle the observations sequentially.
Proposal Adjustment: The parallelization is performed based on the particles, on the matched observations, and jointly on the particles and the matched observations.
Landmark Estimation and Weight Update: The parallelization is performed based on the particles, on the total observations, and jointly on the particles and the observations.
Importance Resampling: The parallelization is performed based on the particles.
Such parallelization modalities are then mapped onto the grid combination options available in the GPU programming paradigm (a minimal sketch of these launch configurations is given after the list):
Sequential: one block, one thread (1B1T);
One dimension, thread-based: one block, N threads;
One dimension, block-based: M blocks, one thread;
Two dimensions: M blocks, N threads.
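As a minimal illustration (with arbitrary example values of M and N), these four options translate into the following PyCUDA launch parameters:

```python
# Illustrative mapping of the four parallelization modalities onto CUDA launch
# configurations; M (blocks) and N (threads per block) are example values only.
M, N = 128, 256    # e.g., M particles assigned to blocks, N landmarks to threads

configurations = {
    "sequential":   dict(block=(1, 1, 1), grid=(1, 1)),   # one block, one thread
    "thread_based": dict(block=(N, 1, 1), grid=(1, 1)),   # one block, N threads
    "block_based":  dict(block=(1, 1, 1), grid=(M, 1)),   # M blocks, one thread
    "two_dim":      dict(block=(N, 1, 1), grid=(M, 1)),   # M blocks, N threads
}

# Any PyCUDA kernel can then be launched with the selected modality, e.g.:
#   kernel(arguments, **configurations["two_dim"])
```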
The sensitivity analysis is carried out by varying one of the dependencies and keeping all the others constant. We perform such evaluation for all the dependencies in each step, except for the number of observations in the Data Association functional block. This is because JCBB builds the association hypotheses incrementally, which implies a sequential execution on the observations. Consequently, it is impossible to assess the benefit of parallelization based on this parameter.
Each dependency is assigned a threshold, as shown in
Table 4. These thresholds are defined with the sole purpose of bounding the execution time. However, further limitations may arise depending on the memory availability of each specific device and the memory requirements of each kernel. For the sake of completeness of our sensitivity analysis, we apply the aforementioned thresholds whenever possible. However, in an actual use case scenario, it is necessary to find an optimal trade-off between the memory availability of the hardware device, the desired precision of the estimation (proportional to the number of particles), and the map’s scale.
4.4. Performance Assessment
The sensitivity analysis described in the previous sections for the performance assessment is based on two indexes: the Elapsed Time and the Time Gain. The Elapsed Time is computed for each functional block B, for each context C (i.e., the combination of parameters based on which the sensitivity analysis is performed), and for each grid combination G as the average of \(N_{rep}\) repeated executions, as follows:
\[ ET(B, C, G) = \frac{1}{N_{rep}} \sum_{i=1}^{N_{rep}} t_i(B, C, G), \]
and the Time Gain is defined as the ratio between the Elapsed Time of the sequential grid combination (1B1T) and that of the most advantageous grid combination, as
\[ TG(B, C) = \frac{ET(B, C, 1B1T)}{\min_{G} ET(B, C, G)}. \]
The Elapsed Time is expressed in milliseconds ([ms]), while the Time Gain is dimensionless.
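As an illustration of how these indexes can be measured in practice (a sketch with hypothetical names, not the code of the proposed implementation), CUDA events exposed by PyCUDA can time each kernel launch, and the results can be averaged over repeated executions:

```python
# Sketch of measuring the Elapsed Time and the Time Gain with CUDA events.
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda

def elapsed_time_ms(kernel, args, block, grid, repetitions=100):
    """Average GPU execution time of one functional block, in milliseconds."""
    start, end = cuda.Event(), cuda.Event()
    times = []
    for _ in range(repetitions):
        start.record()
        kernel(*args, block=block, grid=grid)
        end.record()
        end.synchronize()                    # wait for the kernel to finish
        times.append(start.time_till(end))   # elapsed time in ms
    return float(np.mean(times))

def time_gain(et_sequential, et_by_configuration):
    """Ratio between the sequential (1B1T) time and the best configuration."""
    return et_sequential / min(et_by_configuration.values())
```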
5. Deployment
This section discusses the design details of each parallelized functional block.
Figure 6 shows how the memory transfer and the interaction between CPU and GPU are managed across the entire pipeline. The dashed rectangles correspond to data, while the colored rectangles contain instructions (CUDA kernels and their launchers).
In more detail, the CPU allocates memory on the GPU and initializes a vector of random states leveraged in the random number generation operations. Then, each functional block is implemented in the form of one or more kernels that start their execution on the GPU once a dedicated launcher is executed. The interaction among the kernels follows what is described in
Figure 5. Once all the operations are terminated, the pose estimates and the map estimates are copied back to the host memory to be fed as input to the
Particle Prediction Launcher at the next iteration.
The pseudocode relevant to the parallelization of the functional blocks in the following sections is available in
Appendix A.2,
Appendix A.3,
Appendix A.4,
Appendix A.5 and
Appendix A.6. This section is aimed at discussing the design peculiarities of our approach; for the methodological and mathematical foundations of FastSLAM 2.0, please refer to
Section 2.
5.1. Particle Initialization
The Particle Initialization is parallelized based on the number of particles. Therefore, a monodimensional parallelization is performed. Algorithm A8 iterates over each particle with a stride corresponding to the total number of launched threads (a kernel sketch of this stride-based iteration is given at the end of this subsection). Regarding the mapping of this monodimensional parallelization onto the grid composition possibilities, it is possible to launch the kernel with the following:
Sequential configuration: one block and one thread.
Thread-based configuration: one block and as many threads as particles.
Block-based configuration: as many blocks as particles and one thread.
Two-dimension configuration: a number of blocks equal to the upper bound of the ratio between the number of particles and 1024, and 1024 threads per block.
In this case, the memory limitation is determined by the amount of global memory available in the GPU.
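The grid-stride pattern mentioned above, which allows a single kernel to support all four launch configurations, can be sketched as follows (an illustration of ours; in the actual design the poses are drawn from per-thread random states rather than set to fixed values):

```python
# Illustrative grid-stride kernel for particle initialization: with one thread
# the loop visits every particle sequentially; with as many threads as
# particles, each thread handles exactly one of them.
import pycuda.autoinit  # noqa: F401
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void init_particles(float *poses, float *weights, int n_particles)
{
    int stride = blockDim.x * gridDim.x;
    for (int p = blockIdx.x * blockDim.x + threadIdx.x;
         p < n_particles; p += stride)
    {
        poses[3 * p + 0] = 0.0f;             // x (random in the actual design)
        poses[3 * p + 1] = 0.0f;             // y
        poses[3 * p + 2] = 0.0f;             // heading
        weights[p] = 1.0f / n_particles;     // uniform initial weight
    }
}
""")
```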
5.2. Particle Prediction
The Particle Prediction is parallelized based on the number of particles. Therefore, we perform a monodimensional parallelization. Algorithm A9 iterates over each particle with a stride corresponding to the total number of launched threads (see the sketch at the end of this subsection). Regarding the mapping of this monodimensional parallelization onto the grid composition possibilities, it is possible to launch the kernel with the following:
Sequential configuration: one block and one thread.
Thread-based configuration: one block and as many threads as particles.
Block-based configuration: as many blocks as particles and one thread.
Two-dimension configuration: a number of blocks equal to the upper bound of the ratio between the number of particles and 1024, and 1024 threads per block.
As in the Particle Initialization step, the memory limitation is determined by the amount of global memory available in the GPU.
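For illustration, the same grid-stride structure applied to the prediction step could look as follows, assuming a simple velocity motion model with pre-sampled noisy inputs (the actual motion model is the one described in Section 2):

```python
# Illustrative prediction kernel assuming a velocity motion model; the noisy
# per-particle speeds are assumed to be sampled beforehand.
import pycuda.autoinit  # noqa: F401
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void predict_particles(float *poses, const float *noisy_v,
                                  const float *noisy_w, float dt, int n_particles)
{
    int stride = blockDim.x * gridDim.x;
    for (int p = blockIdx.x * blockDim.x + threadIdx.x;
         p < n_particles; p += stride)
    {
        float theta = poses[3 * p + 2];
        poses[3 * p + 0] += noisy_v[p] * dt * cosf(theta);   // x
        poses[3 * p + 1] += noisy_v[p] * dt * sinf(theta);   // y
        poses[3 * p + 2] += noisy_w[p] * dt;                 // heading
    }
}
""")
```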
5.3. Data Association
The
Data Association step is subdivided into three kernels, as summarized in
Figure 7. The pseudocode for the data association step is available in
Appendix A.4 for better readability. Though a single-kernel implementation is possible, the high amount of required shared memory would heavily limit the map’s scale and the precision of the estimation (proportional to the number of particles). Therefore, with the goal of achieving a trade-off between the performance increase given by the usage of the faster on-chip memory and the limitation given by its small size, we privilege the shared memory usage for those variables that are read and written more than once. In particular, the variables related to the individual compatibility assessment would need to be stored in the shared memory together with the expected observations. However, the high dimensionality of the former variables (highlighted in Algorithm A10) and the reduced number of times they need to be accessed made a multiple-kernel implementation preferable, limiting the shared memory usage to the expected observations. Therefore, the JCBB operations are divided as follows. A first kernel, available in Algorithm A10, is in charge of computing, for each possible observation–landmark pair, the quantity derived in Section Joint Compatibility Branch and Bound and of performing an individual compatibility test. The resulting matrix is stored in the global memory and constitutes the exploration tree where branch-and-bound is carried out. Each element of the cost matrix represents the Mahalanobis distance between observation i and map feature j; if the individual compatibility test is not passed, the corresponding element is set to infinity.
Therefore, all the feasible hypotheses are explored, while all the non-feasible matches are excluded. An intermediate kernel, called Problem Preparation and shown in Algorithm A9, arranges some auxiliary parameters to optimize tree exploration by excluding unmatchable observations (i.e., observations for which no landmark passed the individual compatibility test) and initializes new labels for the newly observed landmarks. The rows of the cost matrix corresponding to non-matchable observations are thus not explored.
Eventually, the Branch and Bound kernel explores the feasible combinations and assesses the most promising hypotheses, as shown in Algorithm A12. Every time a new observation is considered, a joint compatibility test is performed for each feasible partial hypothesis (i.e., the most promising partial hypothesis outputted by the previous iteration in addition to the Mahalanobis distance between the new observation and each not-yet-matched map feature). If the compatibility test is passed, the lower bound for the partial hypothesis is computed. Otherwise, the cost is set to infinity. As a last step, the minimum cost for the partial hypothesis is computed through parallel reduction and the most promising combination is selected.
The complexity of the data association step depends on the number of particles, the number of previously observed landmarks, and the number of observations, but, as mentioned in
Section 4.3, the parallelization is performed on the particles and the landmarks. The choice is motivated by the consideration that JCBB builds the association hypotheses incrementally, implying that the observations must be iterated sequentially. Furthermore, in an average scenario, the number of observations is much lower than the number of landmarks already present in the map. This implies that the most significant benefit is given by particle-wise and landmark-wise parallelization. Therefore, monodimensional and bi-dimensional parallelizations are performed. An external for loop iterates over the particles and an inner loop iterates over the landmarks. Consequently, the different parallelization modalities are achieved according to the following grid configurations:
Sequential configuration: one block and one thread.
Thread-based configuration: one block and as many threads as landmarks, i.e., parallel on the landmarks and sequential on the particles and the observations.
Block-based configuration: as many blocks as particles and one thread, i.e., parallel on the particles and sequential on the landmarks and the observations.
Two-dimension configuration: as many blocks as particles and as many threads as landmarks, i.e., parallel on the particles and the landmarks and sequential on the observations.
The three kernels store their intermediate outputs in the global memory. Due to the high dimensionality of these outputs, the main limitation in terms of memory availability is the size of the device’s global memory.
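As an illustration of the first kernel only (a sketch of ours for a single particle, with assumed data layouts), the squared Mahalanobis distance of each observation–landmark pair can be computed as follows, with rejected pairs set to infinity; the innovations and their 2x2 covariances are assumed to be precomputed.

```python
# Illustrative individual-compatibility kernel: one thread per
# (observation, landmark) pair of a single particle.
import pycuda.autoinit  # noqa: F401
from pycuda.compiler import SourceModule

mod = SourceModule("""
#include <math_constants.h>

__global__ void individual_compatibility(const float *innov,  // n_obs*n_lm*2
                                         const float *S,      // n_obs*n_lm*4
                                         float *cost,         // n_obs*n_lm
                                         int n_obs, int n_lm, float chi2_gate)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // landmark index
    int i = blockIdx.y;                              // observation index
    if (i >= n_obs || j >= n_lm) return;

    int p = i * n_lm + j;
    float nr = innov[2 * p], nb = innov[2 * p + 1];  // innovation (range, bearing)
    float a = S[4 * p],     b = S[4 * p + 1];        // 2x2 innovation covariance
    float c = S[4 * p + 2], d = S[4 * p + 3];
    float det = a * d - b * c;

    // Squared Mahalanobis distance nu^T S^-1 nu for the 2x2 case.
    float d2 = (d * nr * nr - (b + c) * nr * nb + a * nb * nb) / det;

    // Pairs failing the individual chi-square gate get an infinite cost.
    cost[p] = (d2 < chi2_gate) ? d2 : CUDART_INF_F;
}
""")
# Example launch: grid=((n_lm + 255) // 256, n_obs), block=(256, 1, 1).
```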
5.4. Proposal Adjustment
The Proposal Adjustment functional block depends on the number of particles and matched observations. Therefore, monodimensional and bi-dimensional parallelizations are possible. This is achieved by assigning particles to blocks and observations to threads. The choice is motivated by the necessity of collaborative data management within the same particle. Therefore, an external loop iterates over the particles, and an internal loop over the observations. Consequently, the different parallelization modalities are achieved according to the following grid configurations:
Sequential configuration: one block and one thread.
Thread-based configuration: one block and as many threads as matched observations, i.e., parallel on the observations and sequential on the particles.
Block-based configuration: as many blocks as particles and one thread, i.e., parallel on the particles and sequential on the observations.
Two-dimension configuration: as many blocks as particles and as many threads as matched observations, i.e., parallel on the particles and the observations.
Algorithm 4 summarizes the proposal adjustment step. Each line corresponds to a block that is further expanded in Algorithms A13–A17 in
Appendix A.5.
Algorithm 4 Proposal Adjustment
1: for each particle, with the corresponding stride, do
2:    Initialization
3:    Preparation for adjustment
4:    computation
5:    correction computation
6:    Correction application
7: end for
Each block carries out a separate operation on the shared memory, which is why threads are synchronized at the end of each block. In particular, the shared memory variables, highlighted with comments in the corresponding algorithms, are related to the cumulative computation of the correction applied to the robot’s pose estimate based on the matched observations. In Algorithm A16, the innovation is computed a second time to accommodate the configurations in which each block tackles multiple observations sequentially, since the innovation values would otherwise be overwritten in the threads’ local memory.
The cumulative prefix sums in Algorithms A16 and A17 are carried out through stride-like parallel reduction to avoid bank conflicts as in [
69].
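For reference, a simplified inclusive scan within a single block is sketched below (a Hillis–Steele variant shown for brevity; the implementation follows the work-efficient, bank-conflict-free scheme of [69]):

```python
# Simplified single-block inclusive prefix sum, for illustration only.
import pycuda.autoinit  # noqa: F401
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void inclusive_scan(float *data, int n)
{
    extern __shared__ float temp[];
    int tid = threadIdx.x;
    temp[tid] = (tid < n) ? data[tid] : 0.0f;
    __syncthreads();

    // Hillis-Steele scan: at each step every element adds the value
    // located "offset" positions to its left.
    for (int offset = 1; offset < blockDim.x; offset <<= 1) {
        float val = (tid >= offset) ? temp[tid - offset] : 0.0f;
        __syncthreads();          // all reads complete before any write
        temp[tid] += val;
        __syncthreads();
    }
    if (tid < n) data[tid] = temp[tid];
}
""")
inclusive_scan = mod.get_function("inclusive_scan")
# Launch with one block covering n, e.g.:
#   inclusive_scan(data_gpu, np.int32(n), block=(256, 1, 1), grid=(1, 1),
#                  shared=256 * 4)
```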
This functional block significantly uses the shared memory. Therefore, the main limitation in memory availability resides in the shared memory per block.
5.5. Landmark Estimation
The Landmark Estimation functional block depends on the number of particles and matched observations. Therefore, monodimensional and bi-dimensional parallelizations are feasible. In more detail, the particles are assigned to blocks, and threads are assigned to the observations. The choice is motivated by the fact that the reinitialization of the weights depends on all the matched observations within each particle. Therefore, the external for loop iterates over the particles and an inner loop iterates over the observations. Consequently, multiple parallelization modalities are achievable with the following grid configurations:
Sequential configuration: one block and one thread.
Thread-based configuration: one block and as many threads as observations, i.e., parallel on the observations and sequential on the particles.
Block-based configuration: as many blocks as particles and one thread, i.e., parallel on the particles and sequential on the observations.
Two-dimension configuration: as many blocks as particles and as many threads as observations, i.e., parallel on the particles and the observations.
The landmark estimation is carried out on a single kernel. However, as in the previous case, the algorithm is split into different sections for better readability. A summary of the landmark estimation step is available in Algorithm 5. Please refer to
Appendix A.6 for more details.
The main limitation in memory availability is the GPU’s global memory, as the kernel uses a limited amount of shared memory.
Algorithm 5 Landmark Estimation
1: for each particle, with the corresponding stride, do
2:    for each observation, with the corresponding stride, do
3:       Retrieve observation
4:       if new landmark then
5:          Landmark Initialization
6:       else
7:          Landmark Update
8:       end if
9:    end for
10:   Weight Update
11: end for
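For reference, the mathematics behind the Landmark Update branch (line 7) can be sketched in host-side NumPy form; this is an illustration of the standard per-landmark EKF update, not the CUDA kernel of the implementation.

```python
# Standard EKF update of one landmark estimate, shown for illustration.
import numpy as np

def ekf_landmark_update(mu, Sigma, z, z_hat, H, R):
    """Update the landmark mean mu and covariance Sigma given a matched
    observation z, its prediction z_hat, the measurement Jacobian H, and the
    measurement noise covariance R; also return the measurement likelihood
    used to scale the particle weight."""
    innovation = z - z_hat
    S = H @ Sigma @ H.T + R                      # innovation covariance
    K = Sigma @ H.T @ np.linalg.inv(S)           # Kalman gain
    mu_new = mu + K @ innovation
    Sigma_new = (np.eye(len(mu)) - K @ H) @ Sigma
    likelihood = np.exp(-0.5 * innovation @ np.linalg.solve(S, innovation)) \
        / np.sqrt(np.linalg.det(2.0 * np.pi * S))
    return mu_new, Sigma_new, likelihood
```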
5.6. Particle Resampling
The Particle Resampling is parallelized based on the number of particles. Therefore, a monodimensional parallelization is performed. Algorithm 6 shows a general template for the resampling step. Each tested resampling variation adopts a different selection strategy, as reported in Line 4. The particle weights are preliminarily copied onto the shared memory to optimize memory access. Our implementation iterates over each particle with a stride corresponding to the total number of launched threads. Therefore, it is possible to launch the kernel with the following:
Sequential configuration: one block and one thread.
Thread-based configuration: one block and as many threads as particles.
Block-based configuration: as many blocks as particles and one thread.
Two-dimension configuration: a number of blocks equal to the upper bound of the ratio between the number of particles and 1024, and 1024 threads per block.
Algorithm 6 Importance Resampling
1: Retrieve W                                        ▹ vector of weights, Shared Memory
2: Compute the cumulative prefix sum of W            ▹ traditional methods only, Shared Memory
3: for each particle, with the corresponding stride, do
4:    Choose the particle to be copied               ▹ selection strategy
5:    Retrieve the pose of the chosen particle       ▹ to Thread
6:    Transfer the pose of the chosen particle       ▹ to Global Memory
7:    Retrieve the map features of the chosen particle   ▹ to Thread
8:    Transfer the map features of the chosen particle   ▹ to Global Memory
9: end for
The cumulative prefix sum is computed with parallel reduction as in
Section 5.4 for the traditional methods. As for rejection resampling, the upper bound on the particle weights is computed as their maximum, meaning that one collective operation is introduced. This is because we do not assume that the particle weights are normalized, which makes an a priori upper bound hard to formulate. On the other hand, a work-efficient parallel reduction based on [
69,
70] is implemented for the maximum computation. Unlike the inclusive prefix sum, which requires iterating over the particles twice, such an implementation requires iterating only once, increasing the efficiency of our implementation of rejection resampling in comparison with traditional methods.
In the case of resampling, the memory limitation is determined by the amount of shared memory available in the GPU.
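The collective maximum mentioned above can be sketched as a classic tree-style reduction in shared memory (our illustration; per-block results would still need to be combined for large particle sets):

```python
# Illustrative block-wide maximum of the particle weights via parallel
# reduction; blockDim.x is assumed to be a power of two.
import pycuda.autoinit  # noqa: F401
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void max_weight(const float *w, float *block_max, int n)
{
    extern __shared__ float sdata[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < n) ? w[i] : 0.0f;
    __syncthreads();

    // Halve the number of active threads at each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] = fmaxf(sdata[tid], sdata[tid + s]);
        __syncthreads();
    }
    if (tid == 0) block_max[blockIdx.x] = sdata[0];
}
""")
```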
7. Conclusions
In this paper, we proposed a context-adaptable design of FastSLAM 2.0 based on the CUDA parallel computing platform. The proposed design can accommodate different parallelization modalities without the need for a complete re-implementation. Multiple resampling methodologies have been included in the design to meet the needs of the widest possible variety of use case scenarios. In addition, we provided a methodology to assess the optimal parallelization modality depending on the GPU architecture in use and the specifications of the contemplated navigation use case. Such specifications define the set of parameters determining the algorithm’s computational complexity and, as a direct consequence, its execution latency. The algorithm was divided into functional blocks; the parameters affecting their computational complexity were identified and leveraged in the parallelization design, which uses shared memory to enhance collaborative data management. The methodology was validated on a high-end GPU, where all the parallelization modalities were tested under different parameter sets through a sensitivity analysis. As a further contribution, our design includes the parallelization of the data association step, implemented via the Joint Compatibility Branch and Bound (JCBB) methodology.
This flexible implementation of FastSLAM 2.0 and its corresponding selection methodology are designed to facilitate optimization in the deployment process of SLAM onto GPGPU, while enabling real-time robot localization and environmental mapping. Therefore, future work will explore the interfacing of the proposed design with a real environmental sensing setup within a real navigation scenario.
In addition, further adaptations of the proposed design will be contemplated in the scope of future work. Firstly, our design is meant to be compatible with a wide variety of hardware architectures where the CPU (host) and the GPU (device) do not share the same memory space and where, within the device, the global device memory and the on-chip shared memory are also physically separate. Examples of GPU memory technologies meeting these characteristics include, but are not limited to, Video Random Access Memory (VRAM), Graphics Double Data Rate (GDDR), and High-Bandwidth Memory (HBM). Future developments of our contribution may involve extending the proposed method to Unified Memory Architectures (UMAs), typical of devices such as the NVIDIA Jetson Nano, which are commonly used for robotics applications. In such devices, the host and the device share the same physical memory rather than separate memory spaces, a characteristic that would require a redesign of the memory management criteria.
Furthermore, another possible extension of the proposed design is to contemplate 3D pose estimation. As described in
Section 4.1, our design models the robot as a rigid body moving across a planar space and the landmarks as points. While such a simplification is feasible in many use cases (e.g., autonomous cars, ground robots) and is indeed preferable to avoid adding unnecessary computational complexity, some scenarios might need pose estimation in tri-dimensional space (e.g., aerial drones, underwater robots). Though the necessary changes would not affect the evaluation methodology and the overall parallelization design, they would require an expansion of the state variables, increasing the memory occupancy. As a direct consequence, the hardware architecture in use might support a smaller map (i.e., fewer landmarks) and/or a smaller number of particles.