Abstract
Navigation in unknown 3D environments aims to progressively find an efficient path to a given goal in unseen scenarios. A key challenge is how to explore the environment quickly and effectively. End-to-end learning approaches can extract geometric features directly from RGB images, but they are unsuitable for large environments because exhaustive exploration leads to an exponentially growing search space. Active Neural SLAM (ANS) introduces a Neural SLAM module that maximizes exploration coverage to tackle the active SLAM task. However, ANS still frequently revisits explored areas due to inappropriate local target selection. In this paper, we propose a Neural Attention-based Visual SLAM (NAVS) model for exploring unknown 3D environments. Spatial attention helps the agent quickly identify obstacles (such as a tea table and floor of similar color), and channel attention raises the priority of unknown regions in short-term goal selection to avoid repeated exploration. The experimental results show that our model builds a more accurate map than ANS and other baseline methods in less running time. In terms of relative coverage, NAVS achieves a 0.5\(\%\) improvement over ANS overall and a 1.1\(\%\) improvement over ANS in large environments.
1 Introduction
Autonomous navigation is an important module for intelligent robots to accomplish the task of path finding. In unknown environments, robots usually need to traverse the environment exhaustively to build a map before navigating [1]. With the development of deep learning, it is possible to achieve autonomous navigation without building a map in advance [2]. A challenge for autonomous navigation in unknown environments is how to explore the unseen scenario quickly and effectively. The robot needs to identify its current position and build the map directly from the camera data, without any manual assistance.
Chen et al. [3] used end-to-end learning to realize robot navigation in unknown environments. Their work shows that a learned approach can extract geometric shapes directly from RGB images and is more flexible than traditional navigation methods. However, the method may not be effective for exhaustive exploration or long-term planning in large environments, since the search space grows exponentially with the distance to the goal; the high cost of map matching, pose estimation, and path planning over this large search space limits its practical application [4].
Chaplot et al. [5] presented an autonomous exploration model called Active Neural SLAM (ANS). Its modular and hierarchical strategy reduces the search space during training and achieves better performance and sampling efficiency than end-to-end learning. However, some problems remain. First, the model may have difficulty recognizing whether the floor is an obstacle if it has a color similar to the furniture. Second, the agent may frequently revisit explored areas due to improper local goal selection. Figure 1 (the blue line) shows that ANS produces many errors in the predicted map, and the agent wanders around in the explored room.
Recognition difficulties can arise when the environment is complex or when there are similar features that are difficult to distinguish. Attention mechanisms can help the robot to clearly identify targets [6, 7]. In addition, more sophisticated decision-making strategies can help robots navigate more efficiently and reduce wandering. Hierarchical planning and reinforcement learning are examples of decision-making strategies that can take into account both global and local information to help robots make more informed and efficient navigation decisions [8].
To address the above challenges, we propose both spatial and channel attention mechanisms to improve the efficiency of map matching and navigation. We add these attention mechanisms to the ANS framework proposed by Chaplot et al. [5], using the CBAM (Convolutional Block Attention Module) [9] to better capture semantic information in the environment.
-
We use a spatial attention mechanism to increase the weight of the main obstacles when processing RGB images, which helps the robot to identify obstacles (such as a similarly colored floor or tea table) and predict its position.
-
We use a channel attention mechanism to increase the weight of channels containing unknown space during navigation. It gives exploration of unvisited regions a higher priority and avoids visiting explored areas as much as possible.

In summary, our contributions are listed as follows:
-
Our work proposes a deep learning SLAM (Simultaneous Localization and Mapping) strategy to explore unknown 3D environments. It increases the priority of unknown regions to avoid repeatedly exploring visited areas.
-
We present a mechanism combining spatial attention and channel attention to improve matching efficiency and prediction accuracy. This method quickly identifies obstacles and reduces traversal time.
-
We implement our model on the Habitat simulator platform [10, 11] and evaluate its performance on the Gibson datasets [12]. The experimental results demonstrate that our model builds a more accurate map than ANS and other baseline methods in less runtime. The exploration coverage of our model is higher than that of the baseline methods in unknown environments, measured with the summary metric SPL (Success weighted by (normalized inverse) Path Length) [13].
The structure of the paper is arranged as follows. We study the related work in Sect. 2. Our model and detailed design are presented in Sects. 3 and 4. The experimental evaluation and comparison are conducted in Sect. 5, and we conclude this paper in Sect. 6.
2 Related Work
In active SLAM, we need to maximize both environment exploration (to better understand the environment) and environment exploitation (to revisit previously traversed regions for closed-loop detection) for better map estimation and localization. There are several approaches to active SLAM which we address and summarize in our research.
The geometric approaches treat active SLAM as the task of selecting the optimal path and trajectory for the robot. These methods reduce the uncertainty in pose estimation and mapping to achieve efficient autonomous navigation in unknown environments. They enable frontier-based exploration as well as conventional path planners (e.g., RRT* (Rapidly-exploring Random Tree) [14], D* (D-star) [15], and A* (A-star) [16]).
Information theory (IT)-based methods formulate exploration as a joint entropy minimization problem. DRRT [17] (Distance Rapidly-exploring Random Tree) uses distance as the cost function, while ERRT [18] (Entropy Rapidly-exploring Random Tree) uses the entropy change over the traveled distance as the cost function. These algorithms aim to improve exploration efficiency in path planning by considering the information gain and uncertainty reduction in the environment.
Frontier-based explorations characterize the boundary as a set of unknown points, where each point has at least one known spatial neighbor [19]. A hybrid control-switching exploration method with a particle-filter SLAM backend uses frontier-based exploration with A* as the global planner and the Dynamic Window Approach (DWA) [20] reactive algorithm as the local planner [21]. Another work uses graph-based SLAM optimization (NF-iSAM) [22] as the SLAM backend for autonomous exploration with frontier exploration.
In contrast to traditional path planners (RRT*, D*, A*), the dynamics-based approaches formulate active SLAM as the problem of selecting a sequence of control inputs that generate collision-free trajectories and cover as much area as possible [23]. These methods minimize the uncertainty in state estimation, which improves localization and mapping in the environment. To select the optimal target location, they employ a Linear Quadratic Regulator (LQR) [24] or Model Predictive Control (MPC) [25] to choose the optimal future trajectory, balancing the need to explore new regions against the need to exploit visited regions.
Several deep learning-based methods have been used in SLAM research. Direct methods use deep neural networks to estimate the robot's pose and construct a map directly from raw sensor data, while indirect methods extract features from sensor data and use them to estimate the robot's pose and the map [26]. End-to-end methods learn the entire SLAM pipeline from raw sensor data to the final map [27]. Hybrid methods combine deep learning with traditional SLAM approaches [28]. While traditional multi-view learning mechanisms cannot handle multiple views with variable features, deep learning-based methods use multi-view image stitching to combine images taken from different viewpoints into an image with a broader field of view [29, 30].
Some studies define the active SLAM problem as an approximation of a partially observable Markov decision process (POMDP) [31, 32]. They also seek to find a sequence of actions that minimizes the map uncertainty [33, 34]. Wang et al. used a depth network and an optical flow network to improve scale accuracy in unsupervised monocular visual odometry [35]. Savva et al. [10] used simple convolutional layers with a gated recurrent unit (GRU) [36] layer as a reinforcement learning policy, and Mirowski et al. [37] added ResNet18 [38] to extract image information. Chen et al. [3] addressed the problem with an end-to-end RL policy. Chaplot et al. [5] took advantage of both classical and learning-based approaches to analyze paths by adding the Proximal Policy Optimization (PPO) [39] policy and the Fast Marching Method (FMM) [40]. In follow-up studies, Chaplot et al. used self-supervised learning to improve decision making in active learning [41, 42], implemented goal-oriented navigation while retaining the semantic information in images [43, 44], and used topological representations of space or transformer-based spatial planning for navigation [45, 46]. Injecting an attention network into the neural network enables a greater focus on useful images and rapid prediction of objects, augmenting depth-map representations in detail [47, 48].
3 Overall Design
We propose a modular model called Neural Attention-based Visual SLAM (NAVS) for navigation and mapping. The model combines traditional navigation strategies and deep reinforcement learning with spatial and channel attention mechanisms to address the above challenges. The agent maximizes its reward from the environment through a reinforcement learning policy, simulating the motion of a robot in an unknown indoor environment. As shown in Fig. 2, the model consists of three main components: SLAM, Global Policy, and Local Policy.
3.1 SLAM Module
The SLAM module predicts the current map of the environment and the pose of the agent. The CBAM module increases the weight of the main obstacles when processing RGB images, which helps the agent identify obstacles (such as a similarly colored floor or tea table) and predict its position. As shown in Fig. 3, the SLAM module takes RGB observations as input and outputs the predicted map. The module spatially transforms the egocentric map of the previous time step into the coordinate system of the current time step, according to the change in sensor reading \(x_t\). The agent compares the current egocentric map with the transformed previous one and predicts the change in relative pose. It then transforms the current egocentric map into a part of the global map according to the change in the pose estimate.
To ensure continuity in building the map, we use the agent's map and pose from the previous time step to infer the current values, as established by Eq. (1), where M denotes the global spatial map, m denotes the egocentric map, X denotes the pose of the agent, and \(d_x\) denotes the change in the agent's pose. The Mapper module predicts the egocentric 2D spatial map from the agent's observation, as given by Eq. (2), where \(S_t\) denotes the input image of the sensor, V denotes the field of view, and \(\theta _M\) is the hyperparameter of the Mapper module.
We collect the change in sensor readings \(p_{t-1}=f_{ST}(m_{t-1}|d_x)\) and use this parameter to predict the relative pose change of the agent, \(dx_t=f_{PE}(p_{t-1} |\theta _P)\), where \(\theta _P\) is the hyperparameter of the pose estimator. Eventually, we can predict the agent's current pose as \(x_t=x_{t-1}+dx_t\).
We use a spatial transformation to convert the current egocentric map into the global frame, and we merge the past map with the observed map into the current global map. The global map and the agent's pose can be described by Eqs. (3) and (4), where \(\theta _p\) is the hyperparameter for neural network training.
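As a concrete illustration, the following is a minimal PyTorch sketch of this spatial-transform-and-fuse step. It assumes the egocentric prediction has already been placed on a canvas the size of the global map, and it fuses maps with an elementwise maximum; the function name, coordinate conventions, and fusion rule are illustrative rather than the exact NAVS implementation.

```python
import math

import torch
import torch.nn.functional as F

def register_egocentric_map(global_map, ego_canvas, pose):
    """Warp the current egocentric map into the global frame and fuse.

    global_map, ego_canvas: tensors of shape (1, C, H, W); ego_canvas is
    the egocentric prediction placed on a canvas the size of the global
    map. pose = (x, y, theta): agent position in normalized [-1, 1] map
    coordinates and heading in radians.
    """
    x, y, theta = pose
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    # F.affine_grid expects the target-to-source mapping, i.e. the
    # inverse of the agent pose (R, t): inverse = (R^T, -R^T t).
    inv = torch.tensor([[ cos_t, sin_t, -(cos_t * x + sin_t * y)],
                        [-sin_t, cos_t,   sin_t * x - cos_t * y ]])
    grid = F.affine_grid(inv.unsqueeze(0), global_map.size(),
                         align_corners=False)
    warped = F.grid_sample(ego_canvas, grid, align_corners=False)
    return torch.max(global_map, warped)  # keep strongest evidence per cell
```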
3.2 Global Policy
The Global Policy module generates long-term goals for the agent from the current predicted environment map and the agent's pose. It uses the PPO policy to set long-term goals, taking the maximized exploration area of the environment as the reward. The advantage function of this policy is shown in Eq. (5).
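Following the standard PPO formulation [39], the advantage is

\[ A(s_t, a_t) = Q(s_t, a_t) - V(s_t), \]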
where \(s_t\) denotes the state, \(a_t\) denotes the action, \(Q(s_t,a_t)\) denotes the state action value function, \(A(s_t,a_t)\) is the advantage of taking action \(a_t\) in state \(s_t\), and \(V(s_t)\) denotes the state value function. The objective function is shown in Eqs. (6) and (7).
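Eqs. (6) and (7) take the standard clipped-surrogate form of PPO [39], with the probability ratio

\[ r_t(\theta ) = \frac{\pi _{\theta }(a_t|s_t)}{\pi _{\theta _\textrm{old}}(a_t|s_t)}, \]

\[ L^{PPO}(\theta ) = \mathbb {E}_t\left[ \min \left( r_t(\theta )\, A(s_t, a_t),\ \textrm{clip}\left( r_t(\theta ),\, 1-\epsilon ,\, 1+\epsilon \right) A(s_t, a_t) \right) \right] , \]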
where \(\theta \) denotes the policy parameter, \(\pi _{\theta }(a_t|s_t)\) denotes the action probability of a given state under the current policy, \(\pi _{\theta _\textrm{old}}(a_t|s_t)\) denotes the action probability of a given state under the old policy, \(\epsilon \) denotes a hyperparameter, and \(L^{PPO}(\theta )\) is the clipped surrogate objective. The value function is trained by minimizing the squared error between the predicted value and the actual return, as shown in Eq. (8).
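In the standard squared-error form (the \(1/N\) averaging over samples is our assumption), Eq. (8) reads

\[ L^{value}(\theta ) = \frac{1}{N}\sum _{i=1}^{N}\left( V(s_i, \theta ) - R_i \right) ^2, \]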
where \(L^{value}(\theta )\) is the loss of value, \(V(s_i, \theta )\) is the predicted value, and \(R_i\) is the actual cumulative reward obtained in the episode.
The Global Policy module takes \(h_t\in [0,1]^{4\times M\times M}\) as input. The first two channels of \(h_t\) contain the spatial map, the third channel indicates the agent's current pose estimate, and the fourth channel indicates the visited region. The \(h_t\) can be described by a channel-wise stack of these maps.
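A minimal formalization of this stack for Eq. (9), with illustrative channel names:

\[ h_t = \left[ m_t^{\textrm{obstacle}},\; m_t^{\textrm{explored}},\; p_t,\; v_t \right] \in [0,1]^{4\times M\times M}. \]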
Before the Global Policy module, the CBAM module increases the weight of channels containing unknown space. This gives higher priority to exploring unvisited regions and avoids revisiting explored areas as much as possible.
3.3 Path Planning
The agent translates long-term goals into short-term goals using a path planning approach. The path planner takes the long-term goal (\(g_t^l\)), the spatial obstacle map (\(m_t\)), and the agent's pose estimate (\(x_t\)) as inputs and generates the short-term goal (\(g_t^s\)), i.e. \(g_t^s=\ f_{Plan}\left( g_t^l,\ m_t,\ x_t\right) \). The planner uses the current environment map to compute the shortest path from the agent's current location to the long-term goal with the FMM method, and we choose the point 0.25 m from the agent along this path as the short-term goal. Unexplored areas are assumed to be open space.
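A minimal sketch of this planning step with the scikit-fmm library (which the public ANS implementation also uses); the map resolution, function names, and the greedy descent along the distance field are illustrative assumptions.

```python
import numpy as np
import skfmm  # scikit-fmm

def short_term_goal(obstacle_map, agent_cell, goal_cell,
                    step_m=0.25, cell_m=0.05):
    """Pick the cell ~0.25 m along the geodesic path to the long-term goal.

    obstacle_map: 2D bool array, True where blocked; unexplored cells are
    assumed free, matching the assumption above. agent_cell, goal_cell:
    (row, col) indices. A sketch; names and resolution are illustrative.
    """
    phi = np.ma.MaskedArray(np.ones(obstacle_map.shape), mask=obstacle_map)
    phi[goal_cell] = 0.0                 # zero level set at the goal
    dist = skfmm.distance(phi, dx=1).filled(np.inf)  # geodesic distances
    pos = agent_cell
    for _ in range(int(round(step_m / cell_m))):     # walk ~0.25 m downhill
        r, c = pos
        nbrs = [(r + dr, c + dc)
                for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0)
                and 0 <= r + dr < dist.shape[0]
                and 0 <= c + dc < dist.shape[1]]
        pos = min(nbrs, key=lambda p: dist[p])
    return pos  # short-term goal on the shortest path
```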
3.4 Local Policy
The Local Policy module transforms the coordinates of the short-term goal into a relative distance and angle from the agent's position. It generates the sequence of actions that navigates the agent to the short-term goal according to the current observations. The module takes the current RGB observation (\(s_t\)) and the short-term goal (\(g_t^s\)) as input and outputs a navigation action \(a_t\ ={\ \pi }_L\left( s_t,{\ g}_t^s|\theta _L\right) \), where \(\theta _L\) is a parameter of the Local Policy module. The input RGB images are feature-optimized by the CBAM module, as in the SLAM module.
4 Methods
The goal of the attentional mechanism is to selectively focus on goals and ignore distractors as much as possible. In Neural Attention-based Visual SLAM, we use deep neural networks in a modular and hierarchical way. Meanwhile, we add various attention mechanisms into the deep neural network to adjust the different feature weights and optimize the performance.
4.1 Spatial Attention Mechanism in Mapping
In computer vision, it is not easy to finely identify objects in an image, because texture features, object vestiges, light interference, and other factors make identification more difficult. However, we do not need such detailed processing to construct an obstacle map. We only need to recognize the outlines of obstacles and walls, not detailed patterns or placed objects. The spatial attention module we use is shown in Fig. 4.
We focus on targeting large pieces of furniture and walls, rather than the decorations and small household items in the room. The key to accurate mapping is to clearly identify the contours of large objects. We set up a weight mask layer for the input images using the formula based on Eq. (10), where \(\sigma \) is the sigmoid operation and \(7\times 7\) is the size of the convolution kernel. Spatial regions of large objects receive higher weights.
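Eq. (10) follows the CBAM spatial attention map of Woo et al. [9], which pools the feature map \(F\) along the channel axis and convolves the concatenated result:

\[ M_s(F) = \sigma \left( f^{7\times 7}\left( \left[ \textrm{AvgPool}(F);\ \textrm{MaxPool}(F) \right] \right) \right) . \]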
The more accurate obstacle map produced with spatial attention also helps the agent predict its location more accurately.
4.2 Channel Attention Mechanism in Navigation
Each feature map carries useful features. As shown in Fig. 5, the channel attention mechanism can assign different weights along the channel dimension.
In the Global Policy module, we have four channel dimensions for the spatial map and agent location. We reduce the weight of the mapping channel and increase the weight of the channels with unknown space during navigation. As shown in Eq. (11), where \(\sigma \) is the sigmoid operation, we identify the channels carrying the key features of different tasks. We find that the improvement is more obvious when interference from unimportant features is reduced.
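Eq. (11) follows the CBAM channel attention map [9], which applies a shared MLP to average- and max-pooled descriptors of the feature map \(F\):

\[ M_c(F) = \sigma \left( \textrm{MLP}\left( \textrm{AvgPool}(F)\right) + \textrm{MLP}\left( \textrm{MaxPool}(F)\right) \right) . \]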
4.3 CBAM Module in Active SLAM
As shown in Fig. 6, the CBAM module infers the attention map sequentially along two independent dimensions (channel and spatial) and multiplies it with the input feature map to perform adaptive feature optimization. The overhead of the module is negligible, so it can be integrated into the CNN architecture for joint training.
Considering the size of the whole system, we use a pre-trained ResNet18 network. To avoid changing the ResNet18 structure, we do not add the CBAM module inside the network's residual blocks. Instead, we add CBAM modules after the first convolutional layer and after the last convolutional layer of the ResNet18 network, as sketched below. This allows the pre-trained ResNet18 parameters to be reused, accelerating convergence on the goal task. Table 1 shows the network structure of the ResNet18 network with the CBAM module.
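A minimal PyTorch sketch of this wrapping, assuming a CBAM implementation following Woo et al. [9] is supplied as `cbam_cls`; the layer grouping and names are illustrative.

```python
import torch.nn as nn
from torchvision import models

class ResNet18WithCBAM(nn.Module):
    """Pre-trained ResNet18 with CBAM applied outside the residual blocks.

    `cbam_cls` is assumed to implement CBAM [9] (channel attention
    followed by spatial attention) for a given channel count.
    """
    def __init__(self, cbam_cls):
        super().__init__()
        resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        self.stem = nn.Sequential(resnet.conv1, resnet.bn1,
                                  resnet.relu, resnet.maxpool)
        self.cbam_in = cbam_cls(channels=64)    # after the first conv layer
        self.body = nn.Sequential(resnet.layer1, resnet.layer2,
                                  resnet.layer3, resnet.layer4)
        self.cbam_out = cbam_cls(channels=512)  # after the last conv layer

    def forward(self, x):
        x = self.cbam_in(self.stem(x))
        return self.cbam_out(self.body(x))
```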
The Mapper module in the SLAM module consists of 2 CBAM modules, 1 ResNet18 convolutional layer, 2 fully connected layers with a dropout of 0.5, and 3 deconvolutional layers. We use 3 convolutional layers and 3 fully connected layers for the pose estimator. The Global Policy module uses 5 convolutional layers and 3 fully connected layers. The Local Policy module consists of 2 CBAM modules, 1 ResNet18 convolutional layer, 1 fully connected layer, and a recurrent GRU (Gated Recurrent Unit) layer. The network structure of the Mapper module is shown in Fig. 7.
5 Results
5.1 Experimental Setup
Datasets. We use the Gibson datasets to evaluate our models. The Gibson datasets are 3D reconstructions of real-world environments captured with different camera setups and are widely adopted as navigation benchmark scenarios. They contain 572 buildings composed of 1447 floors, covering a total area of 211,000 m\(^2\). Each building comes with a certain level of clutter and navigation complexity, as shown in Fig. 8. The agent encounters problems such as obstacle occlusion and lighting variations while exploring, just as in a real environment. These scenes provide significant diversity in size and complexity, allowing us to evaluate the efficiency and performance of our model in environments of different sizes.
We choose the indoor-scenario part of the Gibson dataset for point goal navigation in our experiments. We train models on the Gibson training set, which has 72 scenes consisting mostly of indoor scenes, and validate them on the Gibson Val set, which has 14 different indoor scenes. We train the model on specific datasets and test it on unseen scenarios. This simulates agents performing active SLAM tasks in unseen environments and helps us evaluate the generalization ability of the model.
Implementation. We deploy the Habitat simulator platform on an Alibaba Cloud server with 4 NVIDIA V100 GPUs. The operating system is Ubuntu 18.04 Server Edition. We use the sensor and actuation noise models in Habitat to simulate the real world. Habitat-Sim is deployed to process the Gibson datasets, render scenes, and simulate robot interaction. Meanwhile, we apply Habitat-Lab to define the navigation tasks, train the map matching models, and evaluate benchmark performance with standard metrics.
We conduct experiments with a camera frame size of \(128\times 128\) for the Local Policy and \(256\times 256\) for the Global Policy. We use Adam optimization with a learning rate of 0.0001 for the Local Policy and SLAM module, and \(2.5 \times 10^{-5}\) for the Global Policy. We use the SLAM module with a batch size of 72 and a memory size of 300000. The experience loss coefficient is set to 1.0, the projection loss coefficient is set to 1.0, and the pose loss coefficient is set to 10000.0. The value loss coefficient is set to 0.5, and the entropy coefficient is set to 0.001. We enable the pose estimator instead of Generalized Advantage Estimation (GAE). These specific parameter configurations played a crucial role in shaping the learning behavior and overall performance of the agent.
Baseline. In active SLAM, perception is used not only for map construction and localization, but also for proactively choosing the next action to maximize some goal. Traditional SLAM approaches such as ORB-SLAM, which rely only on sensor data, may not provide enough information to guide these proactive decisions. In addition, traditional SLAM approaches typically divide perception, mapping, and localization into separate steps, making it difficult to achieve real-time, tightly integrated proactive decision making. Therefore, we use some end-to-end Reinforcement Learning methods and Active Neural SLAM as baselines.
RL + 3LConv: The model described by Savva et al., which uses a 3-layer convolutional network with a GRU as the RL policy.

RL + Res18: The model uses a pre-trained ResNet18 network with a GRU as the RL policy.

RL + Res18 + AuxDepth: This baseline is proposed by Mirowski et al. The model uses the convolutional layers from ResNet18 with a deconvolutional layer for depth prediction, and adds 3 convolutional layers and a GRU.

RL + Res18 + ProjDepth: This baseline is proposed by Chen et al. The model projects the predicted depth before passing it to the GRU, based on the architecture of RL + Res18 + AuxDepth.
Active Neural SLAM: We implement the Active Neural SLAM model as our comparative method. The model exploits the strengths of both classical and learning-based navigation methods. Our evaluation indicates that the re-trained model is almost identical to the pre-trained model of Chaplot et al.
Our Models (NAVS): We also report the results of our Neural Attention-based Visual SLAM with channel attention (NAVSca) only, with spatial attention (NAVSsa) only, and with both attention techniques (NAVS).
The baseline models and our Neural Attention-based Visual SLAM (NAVS) algorithms are implemented in the Habitat simulator with the same datasets in our experiments.
5.2 Evaluation Metrics
We use the summary measure called SPL to report the average absolute coverage (Cov) and relative coverage (% Cov). In our exploration setting, the SPL value equals the average relative coverage, and \(p_i\) equals the average absolute coverage. The SPL is shown in Eq. (12), where N is the number of episodes, \(l_i\) is the shortest-path distance from the agent's starting position to the goal, \(p_i\) is the length of the path actually taken by the agent in the episode, and \(S_i\) is the binary indicator of success.
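With these definitions, Eq. (12) is the standard SPL of Anderson et al. [13]:

\[ \textrm{SPL} = \frac{1}{N}\sum _{i=1}^{N} S_i\, \frac{l_i}{\max (p_i,\, l_i)} . \]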
We calculate the relative improvement of the relative coverage in the medium term using the average relative difference. The rate of relative improvement is given by Eq. (13), where i and n are timesteps, RC is the relative coverage, and \(\Delta RC\) is the change of relative coverage.
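A form consistent with this description, assuming the average is taken over the per-timestep relative differences between NAVS and the ANS baseline:

\[ \Delta RC = \frac{1}{n-i}\sum _{t=i}^{n} \frac{RC_t^{\textrm{NAVS}} - RC_t^{\textrm{ANS}}}{RC_t^{\textrm{ANS}}} . \]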
5.3 Experimental Evaluation
We measure the performance of the model in map matching, navigation, and operation, and we provide detailed comparisons of the data and visualized images. These experiments offer a comprehensive evaluation of the NAVS model and demonstrate its strengths and limitations.
5.3.1 The Performance of Matching the Map
In Fig. 9, we observe the predicted maps of ANS, NAVSca and NAVSsa at about 900 timesteps. NAVSca is the model that adds only channel attention. The exploration paths of NAVS overlap much less than those of ANS when the agent is in an explored area. This is because channel attention marks the channels that contain location information; when generating goal points, the agent gives preference to unexplored areas. This means that goals do not appear frequently in the same area, so the agent avoids visiting explored areas as much as possible.
NAVSsa is the model that adds only spatial attention. The performance of NAVSsa is more stable and accurate than that of ANS, as shown in Fig. 9. The trajectory results show that NAVSsa can match the real map while traversing less space, because its spatial attention increases the weight of large obstacles. The agent almost exclusively detects large obstacles for mapping and pose estimation, so it can build accurate maps without interference from decorations.
The experiment uses the setup of a point goal navigation task in Habitat. It is difficult to quantitatively assess the accuracy of the predicted maps because Habitat-Sim cannot provide the agent with the category and location of objects in the environment: an obstacle predicted at an incorrect location that happens to coincide with a real obstacle is still judged as correct in this task, even though it is actually offset. Nevertheless, the superior performance of NAVSsa allows us to clearly determine the gap between it and ANS, and the mapping accuracy of NAVSsa is about 80%.
Figure 10 compares the predicted maps of NAVS and ANS at about 900 timesteps. NAVS adds channel attention and spatial attention as the CBAM module. In the figure, ANS shows significant mapping and pose errors as cumulative errors increase: the predicted maps and pose estimates deviate from the real data, and the agent explores the same area several times. These are the problems we want to solve in active SLAM. The visualization shows the performance improvements due to the CBAM module in NAVS. Compared to ANS, the agent's pose estimate is closer to the real position and the predicted map has fewer errors. In the same area, the agent avoids visiting explored areas as much as possible. As a result, the agent can explore more area while traversing less space and build accurate maps.
We also run only the Global Policy module and the SLAM module, which causes the agent not to move toward the goal point but to keep turning in place. We compare the predicted maps of ANS and NAVS over 1000 timesteps of in-place rotation in Fig. 11a and b.
With the same camera parameters, the scanning range of an agent at a fixed position should be identical. Under identical objective conditions, NAVS better extracts feature information from the highly weighted obstacle images and effectively builds accurate prediction maps. This means that the NAVS agent can explore more area faster and more efficiently in large environments. As a result, NAVS matches the map better than ANS in active SLAM.
5.3.2 The Performance of Navigation
In Table 2, we report the average results of all models in 14 different unknown scenes of the Gibson Val dataset, for a total of 994 episodes. As a baseline, the average absolute and relative coverage of ANS is 32.701 m\(^2\)/0.948. NAVSca and NAVSsa achieve average absolute and relative coverages of 31.542 m\(^2\)/0.943 and 31.653 m\(^2\)/0.948, respectively; both fall noticeably short of ANS in large environments and are about the same as ANS in small environments. NAVS achieves an average absolute and relative coverage of 31.981 m\(^2\)/0.953. In relative coverage, NAVS achieves an improvement of 0.5% over ANS overall and an improvement of 1.1% over ANS in large environments.
In Fig. 12, we plot the change in relative coverage (% Cov) in different Gibson Val scenarios as the timestep progresses. NAVSca has the highest growth rate of relative coverage, but it does not achieve the best relative coverage because errors in the predicted map affect exploration. On the whole, the relative coverage of NAVS is the highest of all models. NAVS has higher relative coverage than ANS at all times and a relative improvement of about 20% over ANS between episode lengths 300 and 600 in large environments.
We make box plots to observe the distribution of the coverage of each model, as shown in Fig. 13a and b. There are a total of 274 episodes in the large scenes, and 710 episodes in the small scenes.
The box consists of a rectangle and two whiskers. The lower and upper boundaries of the rectangle indicate the first quartile (Q1) and the third quartile (Q3) of the data, respectively, and the height of the rectangle indicates the interquartile range (IQR) between these two quartiles. Inside the box, a horizontal line segment indicates the median (Q2). Above and below the box, the whiskers extend to the maximum and minimum values. Outside the whiskers, outliers that deviate from the regular distribution are shown as separate small circles.
As can be seen in Fig. 13a, NAVS has a smaller box height. This indicates that the distribution of results for NAVS is more stable and concentrated in many experiments. We can determine that the exploration results of NAVS are better than those of ANS by comparing the median line, the lower boundary of the box, and the minimum value. In Fig. 13b, all models perform well in small scenes. However, we notice that the number of outliers with SPL below 0.7 is much lower for NAVS than for ANS. This means that NAVS is more robust compared to ANS.
5.3.3 The Performance of Operation
In Table 3, we report the runtime and FPS (Frames Per Second) of all models over 71 episodes of the Gibson Val datasets, taken from the log files. The runtime of ANS is 5 h, 34 min and 20 s at 49 FPS. The runtime of NAVSca is 6 h, 26 min and 52 s at 42 FPS, and the runtime of NAVSsa is 1 h, 54 min and 35 s at 144 FPS.
We consider that the channel attention mechanism marks different weights for the same channel at different stages. In mapping, it marks a high weight on map information and a low weight on pose information. In navigation, it is the opposite. That leads to a mismatch of channel information during running. As a result, NAVSca takes longer to run than ANS. On the other hand, NAVSsa improves the weight of obstacles and drastically reduces the amount of data to be processed in the input. The agent can perform accurate mapping and pose estimation faster using the processed RGB images. As a result, NAVSsa can complete the task at higher FPS and in less runtime.
In Table 3, the runtime of NAVS is 5 h, 34 min and 7 s, about 1 h less than NAVSca, even though the spatial attention mechanism results in a mismatch of channel weights. This means that we can mark important goal information in mismatched channels to reduce the runtime. Even with mismatched channels, NAVS still operates at the efficiency level of ANS.
As shown in Fig. 14a and b, we plot the comparison of rewards in the test phase from the log files. In active SLAM, the models maximize the explored area of the environment as the reward for the Global Policy.
In Fig. 14a, all models get higher rewards in the early stages, but NAVS maintains the highest reward acquisition as the timesteps increase. This means that each step of NAVS explores the largest area after many areas have already been explored. NAVS also has the highest and most consistent rewards compared to ANS across all episodes, as shown in Fig. 14b, which proves that NAVS explores the most area over all episodes.
From the above experiments, the CBAM module that incorporates multiple attention mechanisms can achieve a more stable overall improvement for active SLAM models. We consider that it may be more important for active SLAM to simplify and clarify the object goals in images.
6 Conclusion
Navigating in unknown 3D environments suffers from large deviations in predicted maps and low positioning accuracy. In this paper, we propose a modular and hierarchical navigation model that combines a classical navigation policy, a deep reinforcement learning approach, and attention mechanisms. This makes the agent focus more on identifying large furniture during mapping and more on pose estimation during navigation. Experiments show that our model outperforms previous methods: the agent's predicted map matches the contours of the room with fewer prediction errors, and trajectories overlap less, improving the exploration rate of the room. The model can be further studied for use in realistic home environments. In the future, we will extend the model to more complex semantic tasks, such as semantic goal navigation.
References
Khan A, Zhang C (2017) End-to-end navigation in unknown environments using neural networks. arXiv preprint arXiv:1707.07385
Savinov N, Dosovitskiy A, Koltun V (2018) Semi-parametric topological memory for navigation. arXiv preprint arXiv:1803.00653
Chen T, Gupta S, Gupta A (2019) Learning exploration policies for navigation. arXiv preprint arXiv:1903.01959
Gupta S, Davidson J, Levine S, Sukthankar R, Malik J (2017) Cognitive mapping and planning for visual navigation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2616–2625
Chaplot DS, Gandhi D, Gupta S, Gupta A, Salakhutdinov R (2020) Learning to explore using active neural slam. arXiv preprint arXiv:2004.05155
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł., Polosukhin I (2017) Attention is all you need. Adv Neural Inf Process Syst 30
Guo M-H, Xu T-X, Liu J-J, Liu Z-N, Jiang P-T, Mu T-J, Zhang S-H, Martin RR, Cheng M-M, Hu S-M (2022) Attention mechanisms in computer vision: a survey. Comput Vis Med 8(3):331–368
Chaplot DS, Gandhi DP, Gupta A, Salakhutdinov RR (2020) Object goal navigation using goal-oriented semantic exploration. Adv Neural Inf Process Syst 33:4247–4258
Woo S, Park J, Lee J-Y, Kweon IS (2018) Cbam: Convolutional block attention module. In: Proceedings of the European conference on computer vision (ECCV), pp 3–19
Savva M, Kadian A, Maksymets O, Zhao Y, Wijmans E, Jain B, Straub J, Liu J, Koltun V, Malik J, et al. (2019) Habitat: a platform for embodied ai research. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 9339–9347
Szot A, Clegg A, Undersander E, Wijmans E, Zhao Y, Turner J, Maestre N, Mukadam M, Chaplot DS, Maksymets O et al (2021) Habitat 2.0: training home assistants to rearrange their habitat. Adv Neural Inf Process Syst 34:251–266
Xia F, Zamir AR, He Z, Sax A, Malik J, Savarese S (2018) Gibson env: real-world perception for embodied agents. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 9068–9079
Anderson P, Chang A, Chaplot DS, Dosovitskiy A, Gupta S, Koltun V, Kosecka J, Malik J, Mottaghi R, Savva M, et al. (2018) On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757
Noreen I, Khan A, Habib Z (2016) Optimal path planning using rrt* based approaches: a survey and future directions. Int J Adv Comput Sci Appl 7(11)
Stentz A (1994) The D* algorithm for real-time planning of optimal traverses
Du J, Carlone L, Ng MK, Bona B, Indri M (2011) A comparative study on active slam and autonomous exploration with particle filters. In: 2011 IEEE/ASME international conference on advanced intelligent mechatronics (AIM), pp 916–923, IEEE
Shome R, Solovey K, Dobson A, Halperin D, Bekris KE (2020) DRRT*: scalable and informed asymptotically-optimal multi-robot motion planning. Auton Robots 44(3–4):443–467
Lindqvist B, Agha-Mohammadi A-A, Nikolakopoulos G (2021) Exploration-RRT: a multi-objective path planning and exploration framework for unknown and unstructured environments. In: 2021 IEEE/RSJ international conference on intelligent robots and systems (IROS), pp 3429–3435, IEEE
Gao W, Booker M, Adiwahono A, Yuan M, Wang J, Yun YW (2018) An improved frontier-based approach for autonomous exploration. In: 2018 15th international conference on control, automation, robotics and vision (ICARCV), pp. 292–297. IEEE
Missura M, Bennewitz M (2019) Predictive collision avoidance for the dynamic window approach. In: 2019 international conference on robotics and automation (ICRA), pp 8620–8626, IEEE
Mammolo D (2019) Active slam in crowded environments. Master’s thesis, Autonomous Systems Lab, ETH Zurich
Huang Q, Pu C, Fourie D, Khosoussi K, How JP, Leonard JJ (2021) Nf-isam: incremental smoothing and mapping via normalizing flows. In: 2021 IEEE international conference on robotics and automation (ICRA), pp 1095–1102, IEEE
Ahmed MF, Masood K, Fremont V (2022) Active slam: a review on last decade. arXiv preprint arXiv:2212.11654
Dean S, Mania H, Matni N, Recht B, Tu S (2020) On the sample complexity of the linear quadratic regulator. Found Comput Math 20(4):633–679
Chen Y, Huang S, Fitch R (2020) Active slam for mobile robots with area coverage and obstacle avoidance. IEEE/ASME Transact Mechatron 25(3):1182–1192
Yang N, Wang R, Stuckler J, Cremers D (2018) Deep virtual stereo odometry: leveraging deep depth prediction for monocular direct sparse odometry. In: Proceedings of the European conference on computer vision (ECCV), pp 817–833
Mohanty V, Agrawal S, Datta S, Ghosh A, Sharma VD, Chakravarty D (2016) Deepvo: a deep learning approach for monocular visual odometry. arXiv preprint arXiv:1611.06069
Mac TT, Lin C-Y, Huan NG, Duc L, Nhat PCH, Hai HH (2021) Hybrid slam-based exploration of a mobile robot for 3D scenario reconstruction and autonomous navigation. Acta Polytech Hung 18:197–212
Zhu C, Miao D (2019) Semi-supervised one-pass multi-view learning with variable features and views. Neural Process Lett 50:189–226
Yan N, Mei Y, Xu L, Yu H, Sun B, Wang Z, Chen Y (2023) Deep learning on image stitching with multi-viewpoint images: a survey. Neural Process Lett 1–36
Martinez-Cantin R, De Freitas N, Brochu E, Castellanos J, Doucet A (2009) A bayesian exploration-exploitation approach for optimal online sensing and planning with a visually guided mobile robot. Auton Robots 27:93–103
Kollar T, Roy N (2008) Trajectory optimization using reinforcement learning for map exploration. Int J Robot Res 27(2):175–196
Stachniss C, Grisetti G, Burgard W (2005) Information gain-based exploration using rao-blackwellized particle filters. In: Robotics: science and systems, vol 2, pp 65–72
Carlone L, Du J, Kaouk Ng M, Bona B, Indri M (2014) Active slam and exploration with particle filters using kullback-leibler divergence. J Intell Robot Syst 75:291–311
Wang Z, Shen M, Chen Q (2023) Eliminating scale ambiguity of unsupervised monocular visual odometry. Neural Process Lett 1–22
Cho K, Van Merriënboer B, Bahdanau D, Bengio Y (2014) On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259
Mirowski P, Pascanu R, Viola F, Soyer H, Ballard AJ, Banino A, Denil M, Goroshin R, Sifre L, Kavukcuoglu K et al. (2016) Learning to navigate in complex environments. arXiv preprint arXiv:1611.03673
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
Schulman J, Wolski F, Dhariwal P, Radford A, Klimov O (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347
Sethian JA (1996) A fast marching level set method for monotonically advancing fronts. Proc Natl Acad Sci 93(4):1591–1595
Chaplot DS, Parisotto E, Salakhutdinov R (2018) Active neural localization. arXiv preprint arXiv:1801.08214
Chaplot DS, Dalal M, Gupta S, Malik J, Salakhutdinov RR (2021) Seal: self-supervised embodied active learning using exploration and 3D consistency. Adv Neural Inf Process Syst 34:13086–13098
Chaplot DS, Gandhi DP, Gupta A, Salakhutdinov RR (2020) Object goal navigation using goal-oriented semantic exploration. Adv Neural Inf Process Syst 33:4247–4258
Chaplot DS, Jiang H, Gupta S, Gupta A (2020) Semantic curiosity for active visual learning. In: Computer vision–ECCV 2020: 16th European conference, Glasgow, UK, 2020, proceedings, Part VI 16, pp 309–326. Springer
Chaplot DS, Pathak D, Malik J (2021) Differentiable spatial planning using transformers. In: International conference on machine learning, pp 1484–1495, PMLR
Chaplot DS, Salakhutdinov R, Gupta A, Gupta S (2020) Neural topological slam for visual navigation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 12875–12884
Wei J, Pan S, Gao W, Zhao T (2022) Triaxial squeeze attention module and mutual-exclusion loss based unsupervised monocular depth estimation. Neural Process Lett 54(5):4375–4390
Gao Z, Li E, Wang Z, Yang G, Lu J, Ouyang B, Xu D, Liang Z (2021) Object reconstruction based on attentive recurrent network from single and multiple images. Neural Process Lett 53:653–670
Acknowledgements
This work was supported by the National Natural Science Foundation of China (NSFC, No.61702320).
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.