1. Introduction
Unmanned Aerial Vehicles (UAVs) are currently used in a wide range of applications. The Visual Inertial Navigation System (VINS) has become the dominant method for enabling autonomous UAV navigation. VINS fuses measurements from visual sensors with Inertial Measurement Units (IMUs) to estimate the position, velocity, and attitude of a UAV in real time, allowing the UAV to adapt quickly to different situations and accomplish its navigation objectives effectively [1,2]. Achieving high-precision positioning is a fundamental requirement for UAVs to accomplish complex tasks. However, when Visual Inertial Odometry (VIO) is used for navigation, cumulative errors arise that degrade navigation accuracy and impede UAV applications across different domains [3]. Loop closure detection (LCD) can significantly reduce the pose drift accumulated by VIO and improve navigation accuracy by determining whether the system has revisited previously traversed places and applying the appropriate corrections [4].
Commonly employed LCD techniques extract features from images and seek correspondences between image pairs. These approaches can be broadly classified into two categories: those based on deep learning and those based on Bag-of-Words (BoW) models. This paper adopts the BoW model as the framework of the proposed method for four reasons. First, UAVs serve a wide range of applications and sometimes lack the opportunity for pre-training before executing their tasks, which calls for a flexible and readily deployable approach. Second, our research focuses on small UAVs, which have limited capacity to carry heavy computational equipment; integrating high-performance computing devices into small UAVs is difficult because of their size and weight constraints. Third, the BoW model is simple to implement. It does not require the complex network training inherent in many deep learning methods, which keeps development and debugging straightforward and ensures that the system can be efficiently developed and maintained. Finally, deep learning models frequently behave as black boxes, complicating debugging: problems can originate from data quality, model architecture, or training procedures, and diagnosing them often requires sophisticated tools and extensive experiments. The BoW model therefore provides a more transparent and manageable alternative that suits the computational and operational constraints of small UAVs.
The BoW concept was initially developed in natural language processing [5] and subsequently applied to VINS. BoW model-based LCD determines loop closures by comparing the consistency of “words” in two images: images are expressed as vectors, and similarity scores between images are computed from vector norms [6]. BoW model-based LCD is widely adopted by mainstream open-source VINS, yielding significant gains in detection performance and localization accuracy [7,8]. The key to BoW model-based LCD lies in extracting suitable feature points that cluster and match well. However, in regions with sparse texture, the number of extractable feature points may drop sharply or even vanish [9]. In UAV applications, the high speed of the aircraft makes it difficult to track, extract, and match feature points in scenes with lighting changes and weak texture. Furthermore, the limited observation viewpoints offered by the flight path amplify the difficulty of identifying loop closures. Previous efforts to address this issue through scene reconstruction have often been hindered by slow reconstruction speeds, which severely limits the error correction capability of the LCD process and makes it impractical for real-world applications [10]. This study proposes using Neural Radiance Fields (NeRF) [11] for rapid, high-quality scene reconstruction, expanding the image sequence by generating novel views to increase the number and improve the quality of available images, thereby raising the LCD detection rate.
This study proposes a BoW model-based LCD method utilizing NeRF, a neural rendering technique developed in recent years that overcomes the performance limitations of traditional 3D reconstruction methods through differentiable rendering and neural networks. NeRF represents scenes with neural implicit fields and has achieved significant success in synthesizing high-quality views and reconstructing 3D scenes in a wide range of settings. Its core idea is to use a multi-layer perceptron to learn the volumetric information of a 3D scene from input 2D images. The advantages of this technique lie in its ability to generate high-quality 3D reconstructions and in its efficiency and flexibility compared with traditional 3D modeling methods. NeRF has demonstrated broad application potential in multiple domains. In computer vision, it is used for high-precision 3D scene reconstruction and novel view synthesis [12]. In virtual and augmented reality, it enhances immersive experiences [13]. In film and game production, it generates high-quality visual effects and scenes [14]. With ongoing technological advances, NeRF has also been applied to fields such as robot navigation, with some research integrating NeRF within Simultaneous Localization And Mapping (SLAM) systems [15] and other work using pre-trained NeRF maps for localization and trajectory optimization [16]. Leveraging NeRF's view synthesis capability, this study makes the following contributions:
- 1.
Adopting the fast neural radiance field Instant Neural Graphics Primitives (Instant-NGP) [17] as the scene reconstruction tool, we propose a method that uses the reconstructed scene to obtain virtual viewpoint images near the flight trajectory through a specific sampling strategy. This increases the number of observation angles and expands the set of loop closure candidate images; by providing more diverse scene information, the proposed method improves the success rate and accuracy of LCD.
- 2.
We design a similarity factor-based method to construct BoW vectors with word frequency weights, using cosine similarity and dynamic weight assignment to obtain a comprehensive similarity score for loop closure determination. In particular, the similarity between new words in the virtual images and existing words is taken into account to reduce the probability of false positives caused by introducing virtual images, preventing the system from applying incorrect corrections based on erroneous loop information.
Section 2 reviews existing LCD techniques in VINS, along with their advantages and limitations, and surveys the current development and strengths of NeRF, which forms the basis of the proposed approach.
Section 3 explains the proposed BoW model-based LCD method using NeRF in detail, focusing on the selection of virtual view poses and the construction of word frequency weight vectors.
Section 4 describes the experimental procedure and presents the results, which demonstrate the effectiveness of the proposed approach. Section 5 discusses the results in the context of VINS research, summarizes the strengths and weaknesses of the proposed method, and outlines directions for further research.
2. Related Work
Deep learning-based LCD approaches generally demonstrate higher robustness in complicated situations. Nevertheless, the current mainstream LCD approaches in VINS still depend heavily on BoW models because of constraints on real-time performance and processing capacity. Both categories encounter difficulties in extracting feature points when observation viewpoints are limited, surfaces are low-textured, or illumination conditions vary. NeRF's ability to synthesize new perspectives and reconstruct scenes effectively provides a way to tackle these problems.
2.1. Deep Learning-Based Loop Closure Detection Methods
The application of deep learning models such as Convolutional Neural Networks (CNNs) and autoencoders in LCD has garnered significant attention, prompting numerous related attempts by researchers.
Chen applied a CNN to a place recognition dataset spanning 70 km [18], constructing a confusion matrix for matching. Hou compared the performance of CNN-extracted features with traditional descriptors in loop detection [19], finding that CNN features performed better when the operating environment experienced lighting changes. Ma proposed the Local Relative Orientation (LRO) matching algorithm to compute correspondences between image pairs [20], demonstrating significant robustness in scenarios with viewpoint changes and dynamic objects. Hao used ResNet to extract global image features and combined sequence image features into the descriptor of the current frame [21], an approach better suited to large-scale scenes. Sunderhauf found that intermediate-layer feature encodings in CNNs are robust to conditions such as weather and lighting [22], while top-layer encodings are robust to viewpoint changes.
Nevertheless, the features obtained by CNN-based algorithms are usually high-dimensional and require significant processing resources, so various dimensionality reduction techniques have been explored. For example, Luo employed T-distributed Stochastic Neighbor Embedding (TSNE) to reduce the dimensionality of the high-dimensional features extracted by the Visual Geometry Group 16 (VGG16) network [23], thereby removing redundant data. Sunderhauf employed a binary locality-sensitive hashing algorithm to reduce the dimensionality of image information while preserving 95% of the place recognition performance [22].
2.2. BoW Model-Based Loop Closure Detection Methods
BoW-based LCD methods treat image features as “words”. Initially, keypoints are extracted from images and descriptors are generated using a feature extraction algorithm. Subsequently, clustering algorithms are employed to construct a bag of visual words from these descriptors, enabling vectorized representation of images and computation of similarity between images.
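As an illustration of this pipeline, the following sketch builds a small visual vocabulary and converts an image into a normalized word-frequency vector. It assumes OpenCV ORB descriptors and scikit-learn's MiniBatchKMeans; production BoW libraries such as DBoW2 instead use hierarchical vocabulary trees, so this is only a conceptual example.

```python
# Conceptual sketch of a BoW vocabulary and image vectorization
# (assumes OpenCV and scikit-learn; not the DBoW2 implementation used in VINS).
import cv2
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_vocabulary(image_paths, k=500):
    """Cluster ORB descriptors from training images into k visual words."""
    orb = cv2.ORB_create()
    all_desc = []
    for path in image_paths:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, desc = orb.detectAndCompute(img, None)
        if desc is not None:
            all_desc.append(desc.astype(np.float32))
    vocab = MiniBatchKMeans(n_clusters=k, random_state=0)
    vocab.fit(np.vstack(all_desc))
    return vocab

def bow_histogram(gray_img, vocab):
    """Represent one image as a normalized word-frequency vector."""
    orb = cv2.ORB_create()
    _, desc = orb.detectAndCompute(gray_img, None)
    if desc is None:
        return np.zeros(vocab.n_clusters, dtype=np.float32)
    words = vocab.predict(desc.astype(np.float32))
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(np.float32)
    return hist / max(hist.sum(), 1.0)
```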
Lopez proposed an LCD method based on FAST keypoints and BRIEF descriptors [24], which uses a tree-structured dictionary to speed up the search. However, it cannot maintain accuracy and consistency under rotation and scale changes, making it unsuitable for drone applications. Labbe presented RTAB-Map, a software package built around a memory-management-driven LCD algorithm [25] that efficiently uses a restricted set of locations for LCD and systematically revisits all locations as needed. Garcia implemented an incrementally updated hierarchical binary BoW for LCD [26], allowing efficient real-time search, insertion, and deletion of new visual words and thus improved real-time performance. Tsintotas proposed an incremental BoW model for LCD [5] that encodes traversed paths using a small number of distinct visual words obtained from the feature tracking procedure. Other studies integrated point-line features into loop closure detection by proposing the Line Band Descriptor (LBD) and a data-dependent point-line feature-based LCD method [27], which detects loops by considering data dependencies and computing similarity.
BoW methods utilize vectors to represent images, calculate image similarity, and identify loops. These methods can be integrated with image sequence and semantic information to improve reliability. Deep learning-based methods construct visual descriptions using deep learning models, which provide superior accuracy and resilience. However, these methods are limited in their use due to constraints in device resource allocation and dataset needs. BoW-based approaches continue to be the prevailing approach, whereas deep learning-based methods are still in a phase of development and experimentation.
2.3. Neural Radiance Fields
NeRFs have gained significant traction in recent years because they represent 3D scenes without requiring extensive storage. As a widely adopted scene representation technique, NeRF has been successful in generating novel views of scenes [11,28,29,30]. Most NeRF works assume that camera poses are already known; thus, Colmap is frequently employed in the NeRF literature to estimate camera-intrinsic and camera-extrinsic parameters. Some studies refine camera poses using the NeRF photometric loss [31,32]; however, this requires long training times. In response, Instant-NGP can train a NeRF rapidly by using multi-resolution hash encoding and the Compute Unified Device Architecture (CUDA) platform [33]. Several studies have focused on constructing maps within SLAM systems [34,35] or on merging NeRF with SLAM systems [36,37], showing commendable performance.
This research uses NeRF to reconstruct the scene of the VINS operating environment and to generate virtual images along the system's flight trajectory, thereby extending the set of candidate frames and increasing the probability of detecting loop closures.
3. Method
This paper proposes a BoW model-based LCD method using NeRF; the overall method is outlined in Figure 1. The red block at the top shows the workflow of the VINS system, of which the proposed LCD method is an important part. The general workflow is as follows. First, feature points are extracted from keyframes; then, the camera-intrinsic and camera-extrinsic parameters of the original camera images are estimated using Colmap and, together with the images, fed into Instant-NGP for scene reconstruction. After applying slight pose offsets, virtual viewpoints are selected and the corresponding virtual images are rendered. The virtual images are then filtered and, together with the original images, form the loop closure candidate frames. The cosine similarity between the loop closure candidate frames and the current frame is computed, and dynamic weights are assigned on this basis to calculate a comprehensive score for loop closure determination. The method consists of three modules: keyframe feature point extraction, NeRF-based virtual image construction and filtering, and loop closure determination based on cosine similarity computed from word frequency weight vectors. Each module is detailed in the following sections.
3.1. Keyframe Feature Point Extraction
The first step of the BoW model for LCD is feature extraction. After downsampling the images returned by VIO, keyframes are selected and keypoints are extracted. FAST keypoints are detected by rapidly comparing the central pixel with its 16 surrounding pixels using Equation (1), as shown in Figure 2 [38]. Binary assignment and encoding are performed using Equations (2) and (3) to yield BRIEF descriptors [38]. The encoded descriptors and keypoints are stored in a database for subsequent feature matching.
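A minimal sketch of this extraction step is shown below; it assumes OpenCV with the contrib xfeatures2d module (which provides the BRIEF extractor) and is not the exact implementation used in VINS-Mono or ORB-SLAM3.

```python
# Sketch of FAST keypoint detection + BRIEF description
# (assumes opencv-contrib-python for the xfeatures2d BRIEF extractor).
import cv2

def extract_fast_brief(gray_img, fast_threshold=20):
    """Detect FAST corners and describe them with binary BRIEF descriptors."""
    fast = cv2.FastFeatureDetector_create(threshold=fast_threshold)
    brief = cv2.xfeatures2d.BriefDescriptorExtractor_create()
    keypoints = fast.detect(gray_img, None)
    keypoints, descriptors = brief.compute(gray_img, keypoints)
    return keypoints, descriptors  # descriptors: N x 32 uint8 (256-bit BRIEF)
```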
3.2. Construction and Selection of Virtual Images Based on NeRF
After extracting keyframes’ feature points, the construction of virtual images can be initiated. The process is as follows.
3.2.1. Colmap Estimates Camera Poses and Instant-NGP Scene Reconstruction
Colmap is a structure-from-motion system that performs sparse reconstruction using the scene information provided by the images to obtain the camera-intrinsic and camera-extrinsic parameters, which are essential for generating virtual images. Colmap obtains the required parameters by minimizing the bundle adjustment loss function of Equation (4) [39]. In the equation, P_c represents the camera parameters, X_k denotes the point parameters, π is the projection function, ρ_j is the loss function, and x_j is the projected point.
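For illustration, the following sketch evaluates a simplified version of this reprojection cost for a single pinhole camera with a Huber robust loss; the notation and projection model are assumptions, and Colmap of course optimizes the full multi-view problem internally.

```python
# Simplified sketch of the bundle-adjustment reprojection cost minimized by Colmap
# (single camera, pinhole projection, Huber loss; illustrative notation only).
import numpy as np

def project(K, R, t, X):
    """Project a 3D point X into the image using intrinsics K and pose (R, t)."""
    x_cam = R @ X + t
    x_img = K @ (x_cam / x_cam[2])   # normalize depth, then apply intrinsics
    return x_img[:2]

def reprojection_cost(K, R, t, points_3d, observations, huber_delta=1.0):
    """Sum of robust (Huber) reprojection residuals over all observations."""
    cost = 0.0
    for X, x_obs in zip(points_3d, observations):
        r = np.linalg.norm(project(K, R, t, X) - x_obs)
        if r <= huber_delta:
            cost += 0.5 * r ** 2
        else:
            cost += huber_delta * (r - 0.5 * huber_delta)
    return cost
```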
NeRF is a scene reconstruction technique that learns a 3D representation of a scene from a collection of images with known camera viewpoints. Its input consists of a spatial position x = (x, y, z) and a viewing direction d = (θ, φ), and its output is the volume density and RGB value at that point under that pose. The pixel color is computed by integrating along sampled rays using volume rendering, as described in Equation (5) [11], where t_n is the near bound of the sampled ray, t_f is the far bound, σ is the volume density, T(t) denotes the accumulated transmittance along the ray from t_n to t, r(t) = o + td is the camera ray with near and far bounds t_n and t_f, c(r(t), d) is the color at the sample point, and d is the viewing direction; the expected color C(r) of the ray is obtained by integrating T(t)σ(r(t))c(r(t), d) from t_n to t_f. Instant-NGP builds on the NeRF framework with multi-resolution hash encoding, resulting in a significant speed improvement.
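The discretized quadrature used to evaluate this integral in practice can be sketched as follows, assuming per-sample densities and colors along one ray (variable names are illustrative, not taken from the paper).

```python
# Sketch of the discretized volume-rendering quadrature from the NeRF paper.
import numpy as np

def render_ray(sigmas, colors, t_vals):
    """Composite sampled densities/colors along one ray into a pixel color.

    sigmas: (N,) volume density at each sample
    colors: (N, 3) RGB at each sample
    t_vals: (N,) sample depths between the near and far bounds
    """
    deltas = np.diff(t_vals, append=t_vals[-1] + 1e10)       # spacing between samples
    alphas = 1.0 - np.exp(-sigmas * deltas)                   # opacity of each segment
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alphas[:-1])))  # accumulated transmittance
    weights = trans * alphas
    return (weights[:, None] * colors).sum(axis=0)            # expected pixel color
```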
In practical NeRF training, two main approaches are currently used to estimate the camera-intrinsic and camera-extrinsic parameters of the images: inertial visual bundle adjustment and visual bundle adjustment [40,41]. In this study, the results of three camera parameter estimation schemes, including Colmap, are compared in Section 4.2. Each image's estimated camera-intrinsic and camera-extrinsic parameters, the pixel dimensions of the input images, the scene size for rendering, and the sharpness values are passed to Instant-NGP together with the original camera images to swiftly reconstruct the 3D scene.
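As one possible way to package these inputs, the sketch below writes a transforms.json file for Instant-NGP; the field names follow the convention of the repository's colmap2nerf.py script and should be treated as an assumption to be checked against the Instant-NGP version in use.

```python
# Sketch of packaging estimated intrinsics/extrinsics into a transforms.json
# for Instant-NGP (field names assumed from the colmap2nerf.py convention).
import json
import math

def write_transforms(frames, fl_x, fl_y, cx, cy, w, h, out_path="transforms.json"):
    """frames: list of (image_path, 4x4 camera-to-world numpy array, sharpness)."""
    data = {
        "fl_x": fl_x, "fl_y": fl_y, "cx": cx, "cy": cy,
        "w": w, "h": h,
        "camera_angle_x": 2.0 * math.atan(w / (2.0 * fl_x)),
        "aabb_scale": 4,  # controls the rendered scene size
        "frames": [
            {"file_path": p, "sharpness": s, "transform_matrix": m.tolist()}
            for p, m, s in frames
        ],
    }
    with open(out_path, "w") as f:
        json.dump(data, f, indent=2)
```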
3.2.2. Virtual View Construction
Upon completing Instant-NGP model training, the desired virtual camera pose can be selected, and its corresponding observed scene image can be rendered using the trained Instant-NGP model.
- 1.
Coordinate Definition and Transformation
Colmap employs a right-down-front coordinate system, whereas Instant-NGP uses a right-up-back coordinate system. Therefore, when converting the camera poses estimated by Colmap to the Instant-NGP coordinate system, only the Y-axis and Z-axis components need to be negated, as sketched below.
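A minimal sketch of this axis flip, assuming a 4 × 4 camera-to-world matrix, is given here; for world-to-camera poses the corresponding columns rather than rows would be negated.

```python
# Sketch of the Colmap (right-down-front) to Instant-NGP (right-up-back) axis flip,
# assuming a 4x4 camera-to-world pose matrix.
import numpy as np

def colmap_to_ngp(c2w):
    """Negate the Y and Z axes of a 4x4 camera-to-world pose matrix."""
    out = np.asarray(c2w, dtype=float).copy()
    out[1, :] *= -1.0   # flip Y: down -> up
    out[2, :] *= -1.0   # flip Z: front -> back
    return out
```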
- 2.
Selection of Virtual View Poses
To effectively synthesize virtual images during subsequent view rendering, the position and orientation of the virtual views must be selected. For the position, in order to make full use of the scene information, sampling is conducted within a sphere of radius 2 cm centered at each original image's capture position. Within this range, sampling follows the principles below:
The distance between sampling points should be neither smaller than a constant lower bound d_min nor larger than a constant upper bound d_max, so that the virtual-view sampling points are neither too dense nor too sparse.
When the distance between adjacent original image capture positions is less than 2 cm, no virtual views are generated, to prevent adjacent sampling intervals from overlapping when the vehicle moves slowly, which would cause misalignment of the virtual images.
The nearest neighbor search is implemented using a KD-tree. First, k random three-dimensional points are generated within the 2 cm sphere around the sampling point. The Euclidean distance between one of these random points p_i = (x_i, y_i, z_i) and the sampling point q = (x_q, y_q, z_q) is calculated as d(p_i, q) = sqrt((x_i − x_q)² + (y_i − y_q)² + (z_i − z_q)²).
Next, all random points are sorted along the x-coordinate, and the median point is selected as the root node; the left subtree contains points whose x-coordinates are smaller than the median, and the right subtree contains points whose x-coordinates are larger. For each subtree, the next dimension (y, then z) is recursively selected for partitioning until all dimensions are processed or the subtree contains only one point, at which point the KD-tree construction is complete. The search then starts from the root node; during backtracking, the distance between the target point and each visited node is calculated and the current nearest neighbor is updated. The distance constraints d_min and d_max are applied to the obtained point set, followed by pairwise distance checks, until all sampling points satisfy the conditions.
Next, the camera's orientation must be determined. To enrich the scene information, a large overlapping field of view between the virtual view and the original image is required when selecting the pose. In this study, a small perturbation is added to each element of the rotation matrix R, with the perturbation drawn from the range [−ε, ε], where ε is the maximum perturbation amplitude. A sketch of the pose sampling step is given after this paragraph.
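The sketch below illustrates this pose sampling step. The constants d_min, d_max, the number of retained points, and the perturbation bound eps are illustrative placeholders, and the orientation perturbation is implemented here as a small random Euler-angle rotation rather than the element-wise perturbation of R described above.

```python
# Sketch of virtual-pose sampling: positions inside a 2 cm sphere thinned with a
# KD-tree so pairwise distances respect [d_min, d_max], plus a small random
# rotation perturbation. All numeric values are placeholders, not the paper's.
import numpy as np
from scipy.spatial import cKDTree
from scipy.spatial.transform import Rotation

def sample_positions(center, k=50, radius=0.02, d_min=0.004, d_max=0.015, n_keep=6):
    """Draw candidates in the sphere and keep those satisfying the distance bounds."""
    dirs = np.random.normal(size=(k, 3))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    candidates = center + dirs * np.random.uniform(0.0, radius, size=(k, 1))
    kept = []
    for p in candidates:
        if kept:
            d, _ = cKDTree(np.asarray(kept)).query(p)   # distance to nearest accepted point
            if d < d_min or d > d_max:
                continue
        kept.append(p)
        if len(kept) == n_keep:
            break
    return np.asarray(kept)

def perturb_rotation(R_wc, eps=0.05):
    """Apply a small random rotation (Euler angles in radians) to the orientation."""
    dR = Rotation.from_euler("xyz", np.random.uniform(-eps, eps, 3)).as_matrix()
    return dR @ R_wc
```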
- 3.
Virtual View Rendering
After the pose selection process, the chosen pose information can be input into the trained Instant-NGP model for rendering the virtual images.
Figure 3 illustrates an example of virtual view generation, where the yellow box represents the selected virtual camera pose. In this study, rendering is performed at half resolution (376 × 240) and then upsampled to the original size (752 × 480) using the Fast Super-Resolution Convolutional Neural Network (FSRCNN) [42]. Rendering a single frame takes approximately 600 ms, striking a balance between speed and quality.
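One possible implementation of the FSRCNN upsampling step uses OpenCV's contrib dnn_superres module, as sketched below; the model file path is a placeholder for a pretrained FSRCNN ×2 model.

```python
# Sketch of FSRCNN upsampling via OpenCV's contrib dnn_superres module;
# "FSRCNN_x2.pb" is a placeholder path to a pretrained FSRCNN model.
import cv2

def upsample_fsrcnn(img_half_res, model_path="FSRCNN_x2.pb"):
    """Upscale a half-resolution render (376x240) back to 752x480."""
    sr = cv2.dnn_superres.DnnSuperResImpl_create()
    sr.readModel(model_path)
    sr.setModel("fsrcnn", 2)          # model name and upscale factor
    return sr.upsample(img_half_res)
```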
Then, the rendered virtual view images are synchronized in time with the corresponding original images from the dataset and added as new topics to the rosbag of the original dataset, awaiting processing.
3.2.3. Quadtree Uniform Feature Point Extraction and Virtual Image Filtering
After the virtual images are generated, the system has to process several times more images than before. To alleviate the computational burden, the best candidate frame must be selected from the virtual images at each timestamp. To improve efficiency, the feature point extraction strategy is changed: the images are divided into four regions using a quadtree, as shown in Figure 4 [43]. Whether a region continues to be divided depends on whether the number of feature points in it exceeds a threshold t; if so, division continues, and otherwise it stops.
When the total number of extracted feature points exceeds a threshold T, all division stops. The threshold T is set based on the average grayscale value within the grid, as given by the following formula [44], where the mean grayscale value within the grid is used, the grid's width is W, and the height is H. Once all regions have stopped dividing, the similarity score between each of the n synthesized virtual images and its corresponding original image is calculated using vector norms [45]. The virtual image with the highest score becomes the final virtual candidate frame, and its feature points are then extracted using the method described in Section 3.1 for subsequent computation.
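The splitting logic can be sketched as follows; keypoints are assumed to be OpenCV KeyPoint objects, and the global stopping criterion based on T as well as the per-leaf keypoint selection used in ORB-SLAM-style implementations are omitted for brevity.

```python
# Sketch of quadtree-based region splitting for spreading feature points evenly:
# a region keeps subdividing while it holds more than t keypoints.
def quadtree_split(keypoints, x0, y0, x1, y1, t):
    """Return the leaf regions (and their keypoints) of the quadtree subdivision.

    keypoints: iterable of cv2.KeyPoint-like objects with a .pt attribute.
    """
    inside = [kp for kp in keypoints if x0 <= kp.pt[0] < x1 and y0 <= kp.pt[1] < y1]
    if len(inside) <= t or (x1 - x0) < 2 or (y1 - y0) < 2:
        return [((x0, y0, x1, y1), inside)]
    xm, ym = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    leaves = []
    for (a, b, c, d) in [(x0, y0, xm, ym), (xm, y0, x1, ym),
                         (x0, ym, xm, y1), (xm, ym, x1, y1)]:
        leaves.extend(quadtree_split(inside, a, b, c, d, t))
    return leaves
```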
3.3. Cosine Similarity Calculation Based on Term Frequency Weight Vectors and Loop Determination
After filtering the virtual images, term frequency weight vectors are constructed as descriptive vectors for each image, and cosine similarity calculations are performed. Loop closures are determined based on the comprehensive scores obtained from dynamic weight allocation.
3.3.1. Construction of Term Frequency Weight Vectors and Cosine Similarity Scoring
In BoW models, the Term Frequency–Inverse Document Frequency (TF-IDF) method is commonly used to construct term frequency vectors [44]. This method evaluates the weight of a word by combining term frequency (TF) and inverse document frequency (IDF). However, the expansion of the candidate set with virtual frames introduces many “synonyms”, i.e., words similar to those in the original images, when describing the scene. These synonyms may cause mismatches and degrade the position correction after LCD. Therefore, this study incorporates parameters that capture the similarity of “synonyms” when constructing the BoW vectors.
After the feature points and descriptors are clustered with the K-means method, a dictionary is obtained and BoW vectors can be used to represent the images. We then construct a k-dimensional BoW vector in which each element is the frequency weight of the corresponding word. This weight is calculated from the word's TF-IDF score, given in Equation (9): the TF term is the ratio of the number of times the word appears in the image to the total number of words in the image, and the IDF term depends on the total number of images in the database and the number of images in the database in which the word appears.
In Equation (8), the unit vector indicates the word ID; for example, when its second element is 1, the word ID is 2. The angle term is the angle between this vector and the normalized weighted expectation score vector of the k words that appear in the image, which is calculated as shown in Equation (10).
This approach holds that words with similar meanings, those close to the key vocabulary identified by TF-IDF, also carry a certain level of significance. Therefore, when new words are introduced from a virtual image, it is necessary to assess whether they are semantically similar to the important vocabulary by calculating their relevance to each word in the dictionary. This idea is realized by appending a fractional weighting factor to the TF-IDF score. In this weighting factor, the proportion of each word's TF-IDF score in the total score of all words is computed; subtracting this proportion from 1 and placing it in the denominator converts the weighting information into a number greater than 1, which ensures significant differences in scores in subsequent calculations and avoids all word weights being densely distributed in the interval [0,1]. Ultimately, the resulting term measures the “importance” of all words, and the cosine of the angle between the two normalized vectors expresses the similarity between a word and these important or unimportant words. “Synonymous” words similar to important words receive a higher weighting ratio, while those similar to unimportant words receive lower weights or may even be ignored.
After constructing the BoW vectors, a similarity measure must be chosen to assess the similarity between them. In BoW models, vector norms are commonly used to compute the Euclidean or Manhattan distance as a similarity measure; their calculation methods are shown in Equations (11) and (12), where x and y are two k-dimensional BoW vectors. The advantage of vector norms is that the calculation is simple and fast, which meets real-time requirements well; the drawback is susceptibility to noise, since fluctuations in the absolute values of the vector elements can affect the similarity result.
The cosine similarity used in our method mitigates the potential impact of noise through vector normalization [45]. In addition, in high-dimensional spaces such as BoW vectors, Euclidean distances between vectors tend to be very close to one another, which can render norm-based similarity measures ineffective. Cosine similarity instead focuses on directional differences, ignoring differences in absolute values and providing a better assessment of similarity. The cosine similarity between BoW vectors is calculated using Equation (13), where x_i and y_i represent the i-th elements of the BoW vectors x and y.
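For reference, the baseline TF-IDF weighting and the cosine similarity of Equation (13) can be sketched as follows; the similarity-factor weighting of Equations (8)-(10) is an additional term on top of this baseline and is not reproduced here.

```python
# Sketch of baseline TF-IDF weighted BoW vectors and cosine similarity;
# the paper's similarity-factor weighting is not reproduced.
import numpy as np

def tf_idf_vector(word_counts, doc_freq, n_images):
    """word_counts, doc_freq: length-k arrays; returns the weighted BoW vector."""
    tf = word_counts / max(word_counts.sum(), 1.0)
    idf = np.log(n_images / np.maximum(doc_freq, 1.0))
    return tf * idf

def cosine_similarity(x, y):
    """Cosine similarity between two k-dimensional BoW vectors (Equation (13))."""
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return float(x @ y / denom) if denom > 0 else 0.0
```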
3.3.2. Dynamic Weight Allocation and Loop Closure Determination
After the cosine similarity between the original-image candidate frame and the current frame, and between the virtual-image candidate frame at the same timestamp and the current frame, has been calculated, dynamic weights are allocated based on the two scores: the candidate frame that is more similar to the current frame receives a higher statistical weight. The comprehensive score is then computed from the dynamically adjusted weights and compared with a predefined threshold to determine whether a loop has occurred. Initially, the weights corresponding to the two images are initialized as in Equation (14), one for the original image and one for the virtual image.
The two similarity scores, namely the similarity between the original image and the current frame and the similarity between the virtual image and the current frame, are then compared, and the weights are updated using the following formula.
The step length parameters in the update can be adjusted manually by observing their effect on the dataset; specific values are used for adjustment in the experiments of this study. After the weight distribution is determined, the two weighted scores are summed to obtain the final comprehensive score, which is compared with the threshold r; if the comprehensive score exceeds r, the system is considered to have encountered a loop.
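The following sketch illustrates the dynamic weighting idea under simple assumptions (equal initial weights and a fixed step delta); the exact initialization and update rules of Equations (14) and (15) are not reproduced here.

```python
# Illustrative sketch of dynamic weight allocation and loop decision:
# the more similar candidate (original or virtual) receives a larger weight,
# shifted by a step delta, and the weighted sum is compared against r.
def loop_decision(s_orig, s_virt, r, delta=0.1):
    w_orig, w_virt = 0.5, 0.5                  # assumed equal initial weights
    if s_orig >= s_virt:
        w_orig, w_virt = w_orig + delta, w_virt - delta
    else:
        w_orig, w_virt = w_orig - delta, w_virt + delta
    score = w_orig * s_orig + w_virt * s_virt  # comprehensive similarity score
    return score > r, score
```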
4. Experiment
The Euroc dataset is used to evaluate the effectiveness of the proposed method [46]. The method developed in this study is compared with the conventional BoW model LCD method, with a focus on loop detection effectiveness and navigation localization accuracy.
4.1. Dataset and Server Information
The Euroc dataset is widely used in robot vision SLAM and is specifically designed for micro UAVs. It contains high-quality sensor data collected from real-world environments, including stereo camera images, IMU data, and ground truth information. Its environments range from industrial halls to office rooms, covering a variety of complex scenarios, which makes it an ideal choice for testing and validating SLAM algorithms. By providing diverse scenes and precisely synchronized sensor data, the dataset has played an important role in advancing research on UAV autonomous navigation and environmental perception. The available data in the dataset and the computer server used in this study are listed in Table 1. The MH_01_easy scene from the Euroc dataset is used in this study.
4.2. Experiment on Loop Closure Detection
This study first compared three methods for estimating camera poses: Colmap, visual bundle adjustment with maplab, and inertial visual bundle adjustment combining Vicon2gt with IMU data and OptiTrack. Scene reconstruction was then performed based on each estimation result, and the Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) of the reconstructed images were used to measure reconstruction quality. The experimental results are presented in Table 2. The comparison shows that reconstruction based on the Colmap estimates yields better results. The visual quality of the reconstructed scene relative to the original images can be observed more intuitively in Figure 5, which is consistent with the results in Table 2.
After reconstruction, virtual images are rendered, and the feasibility of using them as loop closure candidate frames is verified by feature matching with the original images. The experiments show that the rendered virtual images and the original images produce normal matches in the overlapping field of view, as shown in Figure 6. This indicates that feature points can be effectively extracted from virtual images and that these feature points represent the image information as well as those extracted from real images. Therefore, expanding the candidate frames with virtual images provides richer scene information.
Next, experiments are conducted to evaluate the effectiveness of LCD. The LCD performance is verified by counting the number of detected loop closures and their accuracy. The BoW model-based LCD methods used in VINS-Mono and ORB-SLAM3 [47] serve as the baselines. The proposed NeRF-based BoW LCD method is applied to both systems and evaluated against the original methods. In the BoW LCD methods of VINS-Mono and ORB-SLAM3, keyframes cannot be augmented and are obtained solely from camera captures. In the NeRF-based BoW LCD method, keyframes include both camera captures and virtual frames rendered by NeRF, which significantly improves the success rate of matching with the current frame. The scene data from Euroc are used as the experimental input. We recorded the number of loop closures detected by both systems before and after applying the proposed NeRF-based BoW LCD method and determined the accuracy of these detections against the ground truth data. The experimental results are shown in Table 3 and Table 4.
Figure 7 and Figure 8 present the LCD results of the two approaches in VINS-Mono and ORB-SLAM3. The vertical axis shows 1 when a loop closure is detected and 0 otherwise. The pink and yellow areas indicate additional loop closure frames detected by the NeRF-based BoW model LCD method, while the dark and purple areas represent loop closure frames detected by both methods.
The experiment demonstrates that, by applying the proposed NeRF-based BoW model LCD method, both systems detect more correct loop closures while maintaining detection accuracy. Taking one frame as an example, as shown in Figure 9, few feature points can be extracted from the upper middle area of the current frame because of overexposure. The virtual candidate frame on the right, observed from a virtual viewpoint, suffers less from this exposure, so more features are extracted from the upper part of the image and feature matching with the overexposed current frame succeeds. This indicates that the method increases the likelihood of detecting loop closures by increasing the number of features available for matching.
4.3. Navigation and Localization Experiment
The main purpose of LCD is to correct localization errors. This section therefore verifies whether the proposed NeRF-based BoW model LCD method can effectively improve the navigation accuracy of the system. We apply the proposed method and the commonly used BoW model LCD to the VINS-Mono and ORB-SLAM3 systems for visual–inertial navigation, obtain the navigation trajectories, and compare them with the ground truth. First, a comparative experiment is conducted on the VINS-Mono system; the experimental trajectories are shown in Figure 10.
To comprehensively evaluate trajectory accuracy, the Absolute Pose Error (APE) of the system is computed with the Evaluation of Odometry (EVO) tool as the indicator of navigation accuracy [48]. The results are shown in Table 5. The distribution of the APE over the image frame index is shown in Figure 11, and its statistics are shown in Figure 12 and Figure 13.
The color distribution of the APE along the trajectory of the VINS-Mono system is shown in Figure 14. Together with the color bar, it can be observed that the NeRF-based BoW model LCD significantly improves the positioning accuracy of the system: the maximum trajectory error is reduced by 24%, the minimum trajectory error by 50%, the mean error by 53%, and the root mean square error by 50%. This indicates that the 58 additional loop closures detected by the NeRF-based BoW model LCD lead to more accurate position corrections and thus a significant reduction in positioning error.
The above experiment demonstrates that the proposed method helps the VINS-Mono system achieve better LCD and navigation performance. Next, the proposed method is applied to the ORB-SLAM3 system; the experimental trajectories are shown in Figure 15.
Using the EVO tool to calculate the system's APE, the experimental results are shown in Table 6. The APE distribution over the image frame index for both methods is shown in Figure 16, and the APE statistics for the two methods are presented in Figure 17 and Figure 18.
The color distribution of the APE along the trajectory of the ORB-SLAM3 system is shown in Figure 19. Together with the color bar, it can be seen that the NeRF-based BoW model LCD significantly improves the positioning accuracy of the ORB-SLAM3 system: the maximum trajectory error is reduced to 9%, the minimum trajectory error to 9%, the mean error to 9.5%, and the root mean square error to 10%. This shows that the 67 additional loops detected by the NeRF-based BoW model LCD result in more accurate position corrections and significantly reduced positioning errors.
4.4. System Running Time Statistics and Algorithm Complexity Evaluation
Next, we will evaluate the system’s runtime and the complexity of the algorithm. The method proposed in this paper is primarily applied to small UAVs, which require high real-time performance. Therefore, the computational complexity of the method is assessed from two perspectives.
First, in Section 4.4.1, we evaluate the time complexity of the method, i.e., the time required for it to run. This evaluation analyzes the execution time of the algorithm for different input scales and, by combining experimental data with theoretical analysis, determines the time complexity of the method. The purpose of this evaluation is to ensure that the method can meet the real-time requirements of practical applications on small UAVs, guaranteeing that the system can respond and process data quickly during flight and thereby maintain stable navigation and positioning.
Second, in Section 4.4.2, we assess the space complexity of the method, i.e., the memory required during its operation. This evaluation considers the storage space needed by the algorithm for different input scales. By analyzing the space complexity, we can determine whether the method is feasible on resource-constrained small UAV platforms and optimize it in subsequent research to reduce memory usage and improve overall system performance.
Through the evaluations of these two aspects, we can comprehensively understand the runtime efficiency and resource requirements of the proposed method in practical applications. This understanding provides a critical basis for subsequent optimization and improvement, ensuring the efficient and reliable application of the method on small UAVs.
4.4.1. Statistics on LCD Method Running Time
This experiment evaluates the time complexity of the proposed NeRF-based LCD method by comparing the time required for the VINS-Mono and ORB-SLAM3 systems to run the MH_01_easy scenario with the proposed method and with their original LCD methods. Since the ORB-SLAM3 graphical interface does not report the runtime, the <chrono> library was used for time measurement: timing code was added to LoopClosing, the core loop detection function, to measure the total duration from the start to the end of the MH_01_easy run. In contrast, the VINS-Mono graphical interface provides ROS time as a reference, which directly reflects the total duration of the system's run. The experimental results are shown in Table 7.
The results indicate that the proposed method requires a longer processing time than the original method, which is expected because the system must perform more comparisons and process more images. When running the VINS-Mono system, the NeRF-based BoW LCD method required 12.7 s more than the original method, while ORB-SLAM3 required 9.23 s more; the running time of VINS-Mono and ORB-SLAM3 increased by 6.8% and 4.4%, respectively. Since this additional time is primarily spent on reading and filtering virtual images, it does not impact the real-time performance of the systems.
4.4.2. Statistics on Memory Required for LCD Operation
This experiment evaluates the space complexity of the proposed NeRF-based LCD method by comparing the Resident Set Size (RSS) of the VINS-Mono and ORB-SLAM3 systems when equipped with the proposed method versus their original LCD methods. RSS was chosen because it measures the total memory used by the process, including all shared libraries, providing an accurate assessment of the space complexity of the proposed method. Since both VINS-Mono and ORB-SLAM3 run on Ubuntu 18.04, the system monitoring and process management tool htop was used to collect the statistics. The results are shown in Table 8.
The experimental results show that, when running the VINS-Mono system, the NeRF-based BoW LCD method requires 5420 KB more memory than the original method, while ORB-SLAM3 requires 19,990 KB more; the memory usage of VINS-Mono and ORB-SLAM3 increased by 0.3% and 2.6%, respectively. The actual physical memory usage of both systems did not change significantly with either method.
4.5. Parametric Sensitivity Analysis
The effectiveness of the proposed method depends on a critical parameter, the LCD comprehensive score threshold r, whose setting directly affects the sensitivity and performance of LCD. Because images captured by cameras vary across scenarios, the value of r must be adjusted accordingly. Using the MH_01_easy scenario from the Euroc dataset as an example, a sensitivity test on r was conducted to evaluate the robustness and effectiveness of LCD under varying r values. The experimental results in Figure 20 illustrate the sensitivity of LCD performance to changes in r.
The optimal composite score thresholds r for the VINS-Mono and ORB-SLAM3 systems in LCD were found to be 2.15 and 1.55, respectively. Considering experimental costs, this experiment set the minimum interval for the threshold r at 0.05. Experimental results indicate that when r is below its maximum value, the number of detected loop closures is significantly higher than the number of correct loop closures. This implies that while lowering the threshold increases the number of detected loop closures, it also increases the number of false matches. False matches can adversely affect the position correction after LCD, so it is essential to minimize their occurrence. Conversely, a higher threshold reduces the number of detected loop closures due to the stringent conditions, which in turn decreases the frequency of position corrections, thereby impairing the system’s ability to improve navigation accuracy through LCD.
5. Conclusions
This study proposed an LCD method based on NeRF and the BoW model, which offers good real-time performance and high accuracy. It effectively reduces the difficulty of feature extraction and matching in the LCD process of the VINS system under dynamic scenes, weak-texture environments, and lighting changes.
By incorporating NeRF, the LCD process gains richer observation perspectives and more feature information. A frequency-weighted vector based on similarity factors is designed to describe images in the candidate frame sequence composed of both virtual and original images. The method measures the correlation between vocabulary words, accounts for the importance of visually similar words introduced by the virtual images, and formulates a corresponding dynamic weight allocation strategy to obtain comprehensive cosine similarity scores. With the LCD rate improved by 48% while maintaining accuracy, the mean positioning error of the VINS system is reduced by 53%.
Nevertheless, there are three constraints to be tackled in this approach:
- 1.
Although the Instant-NGP model can provide detailed scene information and additional observation perspectives, its training quality relies heavily on the quality of the data collected by the sensors. Drone shake directly degrades the quality of the training images and therefore the effectiveness of the model.
- 2.
Offline training of the Instant-NGP model requires extra storage capacity, and running the VINS system together with Instant-NGP also places demands on the system's operating memory.
- 3.
The NeRF-based BoW model LCD method improves the detection rate and accuracy of the VINS process. However, in environments with dynamic objects, its computational complexity increases, which may lower the detection success rate and lengthen the response time. Additional trials are also required to fine-tune the comprehensive score threshold used to identify loop closures, so as to enhance the stability of the LCD process and its applicability to other VINS systems.
In the future, as computing power increases, the speed of NeRF will also accelerate significantly. We will continue to research integrating NeRF for online mapping and integrating the method with other VINS systems, further improving LCD efficiency and navigation accuracy through real-time, high-quality image rendering and scene reconstruction.