1. Introduction
In today’s complex and ever-changing engineering environment, engineering tasks face a variety of challenges and potential hazards, and many must be carried out in unknown environments. Unmanned surface vessels (USVs) can accomplish different tasks by carrying different mission payloads, and their small size, low cost, and high mobility give them great advantages in many fields. Environmental reconnaissance using unmanned platforms can significantly improve engineering efficiency and reduce personnel risk [1,2,3]. With the continuous development of autonomous unmanned reconnaissance systems, high-precision carrier positioning and target ranging have become crucial parts of the engineering measurement process, and accurate measurement results provide reliable data support for engineering tasks.
Ranging technologies primarily include laser, ultrasonic, and radar methods. These active technologies are highly sensitive to the environment, offer low concealment, and often struggle to identify target types [4,5]. In contrast, visual ranging exploits rich image data to enhance target identification while addressing these issues. Simultaneous Localization and Mapping (SLAM), a crucial artificial intelligence technology, has been employed in autonomous driving, environmental modeling, and unmanned aerial vehicle navigation. The essence of SLAM-based ranging lies in using the camera to capture image information, analyzing image features, integrating intrinsic and extrinsic parameter information to estimate the camera pose in the scene, and subsequently determining distances within the scene.
In marine environments affected by wind and waves, the efficacy of conventional visual SLAM is limited, especially with monocular cameras, mainly because feature points are difficult to extract accurately from the dynamically changing sea surface. This leads to errors in estimating the position and attitude of the unmanned vessel, particularly during shaking, when the rapid loss of feature point information impedes continuous pose estimation and global trajectory mapping [2,6,7,8].
The main improvements of this paper are as follows:
- (1)
This paper introduces a SLAM system specifically designed for sea environments. The ship’s image region is acquired through object detection and segmentation, while stable features are meticulously filtered to extract more reliable static features;
- (2)
A novel key frame optimization and relocation method based on the target feature set is proposed. The retained feature set serves to constrain and preserve target feature information, enabling rapid and efficient map retrieval during shaking events, thereby enhancing the accuracy and robustness of the system under such conditions;
- (3)
To verify our algorithm, we independently collected monocular marine video data, GNSS (global navigation satellite system) location information, and AIS (Automatic Identification System) data from ships anchored in the East China Sea. The effectiveness and reliability of the algorithm are verified by analyzing the overall accuracy and the accuracy at shaking moments.
The subsequent parts of this article are organized as follows: Section 2 provides a detailed introduction to the related work, Section 3 presents the specific details of our technique for filtering out marine feature points around the target, Section 4 presents the experimental analysis of our system’s performance, and Section 5 concludes the paper.
2. Related Work
VSLAM can be divided into direct and indirect methods. Indirect methods select a certain number of representative points, called keypoints, from the collected images and then detect and match them in subsequent images to obtain the camera pose. This approach preserves the key information of the image while reducing the amount of computation, so it has been widely adopted.
The MonoSLAM system [9], introduced in 2007, pioneered real-time monocular visual SLAM, with advantages in cost effectiveness and efficiency; however, problems such as the loss of sparse feature points and the cumulative error of the EKF remain. The advent of PTAM [10] addressed the inability of real-time localization and mapping to work stably over long periods in large-scale environments. As the first SLAM algorithm to use nonlinear optimization in the back end, PTAM solves the problem of overly fast data growth in filter-based methods. The ORB-SLAM system, presented in 2015, addresses these issues and improves accuracy, stability, and real-time performance in static environments. It supports monocular, stereo, and depth cameras and runs three parallel threads: visual tracking, local mapping optimization, and loop closure detection through graph optimization [11].
SLAM systems based on point features rely heavily on the quality and quantity of point features. In weakly textured scenes it is difficult to detect enough feature points, so the robustness and accuracy of the system suffer and tracking may even fail. In addition, the number and quality of point feature matches degrade severely under fast camera movement, illumination changes, and other adverse conditions. To improve feature-based SLAM algorithms, the application of line features in SLAM systems has therefore attracted increasing attention [12,13]. The commonly used line feature extraction algorithm is LSD. Subsequent advances include the work of Pumarola et al. [14], who in 2017 proposed a point–line feature fusion system, PL-SLAM, which compensates for the lack of point features in low-texture scenes. In the same year, Mur-Artal et al. introduced ORB-SLAM2, extending compatibility to RGB-D cameras and enhancing tracking through RANSAC outlier removal, DBoW2 loop detection, and EPnP pose estimation [15]. In some environments, SLAM algorithms also exploit prominent planar features; however, most valid planar features exist only in artificial environments, and suitable planes can rarely be found in natural environments, which limits their scope of application. VSLAM methods combining point, line, and plane features improve localization accuracy and robustness in weakly textured scenes, under illumination changes, and during fast camera movement, but the introduction of line and plane features increases the time consumed by feature extraction and matching and reduces the efficiency of SLAM systems. Therefore, the VSLAM algorithm based on point features still occupies the mainstream position [16].
It is difficult to overcome image blur caused by fast camera motion or poor illumination by relying on the camera alone. Therefore, the fusion of SLAM with other sensors, such as GNSS and IMU, is now widely used. In 2019, the Hong Kong University of Science and Technology released VINS-Fusion [17], which integrates monocular and stereo cameras, an IMU, and GPS to achieve robust localization and, in particular, to mitigate IMU drift. Hongyu Fu et al. [18] combined ORB-SLAM with GNSS for multi-sensor fusion on unmanned ships in 2022.
As a popular tool for object recognition and segmentation, deep learning can help us accurately identify objects in images and their relationships with other objects [19]. The feature detection layers of classical CNNs learn through training data, avoiding explicit hand-crafted feature extraction and instead learning features implicitly from the training data. Loo et al. [20], Kin et al. [21], and other works have given detailed summaries. In 2017, Tateno et al. [22] proposed CNN-SLAM, a real-time SLAM system based on a CNN within the LSD-SLAM framework. In the same year, Godard et al. [23] proposed an unsupervised image depth estimation scheme; using stereo datasets to improve unsupervised learning and then using a single frame for pose estimation is another way to improve efficiency and performance.
Recent innovations include Wang et al.’s 2021 DSP-SLAM, which uses deep learning for real-time, accurate foreground object description [24], and Liu et al.’s 2022 RGB-D SLAM [25], which improves initial pose estimation and dynamic feature separation through motion consistency filters and graph cut optimization. Z. Xie et al. [26] combined YOLO with ORB-SLAM2 in 2022 as a new multi-sensor system. Theodorou et al. [27] combined ORB-SLAM3 with YOLOR in 2022 to improve feature classification and visual odometry stability, while Jia et al. [28] proposed LRD-SLAM, which integrates a fast CNN into ORB-SLAM2 to cope with dynamic environmental challenges. Ni et al. [29] also proposed an enhanced ORB-SLAM algorithm for monocular vision in dynamic environments, using reliability-based feature point selection and a novel frame difference method for dynamic object detection. Chen et al. [30] warned that dynamic feature point selection based solely on semantics or geometry may interfere with the accuracy of SLAM systems and argued for customized monocular SLAM algorithms for use in dynamic settings. In 2024, Zhi Zheng et al. [31] proposed a dynamic object-aware visual SLAM algorithm specifically for dynamic indoor environments; the method uses a U-Net semantic segmentation architecture in the tracking thread to detect potential moving objects, improving the robustness and reliability of SLAM systems by filtering out indoor moving objects. Based on ORB-SLAM3, Han Yu et al. [32] used the GCNv2-minor network instead of the ORB method to improve the reliability of feature extraction and matching and the accuracy of position estimation. In general, we summarize the characteristics and drawbacks of the previous work in Table 1.
SLAM systems combined with object recognition networks and SLAM systems designed for dynamic environments do show improvements in overall accuracy and robustness. However, these methods all have limitations. Those combined with target recognition networks screen and eliminate feature points only according to indoor targets. SLAM designed for dynamic environments removes dynamic feature points by removing semantic masks, but feature points that fail to match are confounded with those discarded by semantic segmentation, which degrades localization accuracy.
3. Method
3.1. Overall Framework
Our approach improves the monocular pipeline built on top of ORB-SLAM3. As shown in Figure 1, the core idea is to obtain the location of the target in the video with a deep learning detector, extract and screen the ORB feature points of the target, synchronously build a virtual feature information set of the target in the map, and update the related keyframes with the camera position information obtained from GNSS. This ensures the quality of the environmental feature information obtained in the working environment of the unmanned ship at sea. When the lighting changes or the camera itself wobbles, which often happens in the marine environment where the unmanned boat sails, we judge the duration of the wobble by combining the video with the GNSS information of the unmanned ship and match the wobbling video frames against the updated target feature set to re-establish the position of the target in the camera’s 3D map and in the new video.
As shown in Figure 1, the video information first passes through the deep learning recognition network. As mentioned above, environmental features derived from artificial objects are more reliable than those of the natural environment, and this is even more evident in the marine environment. In the object detection thread, we use the YOLOv5 algorithm for object detection and filter the environmental information by obtaining the positions of artificial marine features (mainly stationary vessels) in the image. YOLOv5 is a lightweight target detection network, which helps improve algorithm efficiency without affecting performance.
In the tracking thread, the feature point information of the target is fed into both keyframe extraction and the target feature set, which is associated with the GNSS position information of the camera frame. As the feature points are tracked across video frames, the feature set is constantly updated and the background information in the target box is filtered a second time. This step is the key to improving the SLAM system’s resistance to the sea environment; the subsequent positioning and mapping tasks are then completed. In this paper, while attending to overall accuracy and robustness, we focus on enabling the system to cope with the sea environment, especially when a large degree of shaking occurs.
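To make the data flow of Figure 1 concrete, the following Python sketch outlines one possible per-frame control flow under the assumptions stated in the comments; the helper functions and the `state` object are hypothetical placeholders for illustration, not the authors’ implementation.

```python
# Hypothetical per-frame control flow corresponding to Figure 1. The helper names
# (detect_ships, extract_orb_in_boxes, update_feature_set, track_frame,
# gnss_is_stable, relocalize_with_feature_set) are placeholders, not the paper's API.
def process_frame(frame, gnss_fix, state):
    boxes = detect_ships(frame)                       # YOLOv5 target (vessel) regions
    feats = extract_orb_in_boxes(frame, boxes)        # ORB features inside the boxes only
    state.feature_set = update_feature_set(state.feature_set, feats)

    pose, ok = track_frame(state, feats)              # ORB-SLAM3-style tracking
    if not ok and gnss_is_stable(gnss_fix, state):    # tracking loss attributed to shaking
        pose, ok = relocalize_with_feature_set(frame, state.feature_set)

    if ok and state.needs_keyframe(pose):
        state.insert_keyframe(pose, gnss_fix)         # keyframes carry synchronized GNSS
    return pose, ok
```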
3.2. Object Region Extraction Based on Deep Learning
YOLO (You Only Look Once) was introduced by Redmon et al. [33] as the first single-stage detector for deep learning-based image detection. The primary advantage of employing the YOLO algorithm in this study is its rapid detection and recognition speed: the base version can process images at up to 45 FPS on a GPU, while the fast version achieves up to 155 FPS. The detection process of YOLO is illustrated in Figure 1: the entire image is divided into S × S grid cells by a single neural network; each grid cell determines whether the center of a target object is located within it and uses this information to ascertain the category and confidence of the detected object; grid cells with low-confidence predictions are then filtered out by a threshold, and Non-Maximum Suppression (NMS) is applied to eliminate overlapping bounding boxes, yielding the final target categories and confidence scores. This end-to-end approach characterizes the YOLO algorithm’s object detection capability. In this paper, a ship target recognition model is developed using the YOLOv5 architecture, whose backbone, neck, and head components are depicted in Figure 2. During camera movement relative to the target, horizontal motion affects the position of the detection box, while forward and backward movements affect its scale. The YOLO framework is used for target detection and identification, providing the coordinate information of the detection box.
As shown in Figure 3, the YOLO network performs well for vessel target extraction at different angles and scales. At the same time, as a lightweight network, its efficiency is also guaranteed, and it can meet the real-time needs of unmanned boat operation. In this paper, the YOLOv5 network is used to extract the position information of relevant artificial objects in the video.
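As an illustration of this step, the sketch below loads a generic YOLOv5 model through `torch.hub` and keeps only vessel-like detections; the publicly released `yolov5s` weights and the COCO "boat" class stand in for the paper’s custom ship-detection model, whose trained weights are not available here.

```python
import torch

# Generic YOLOv5 model via torch.hub; the paper's own ship model would replace 'yolov5s'.
model = torch.hub.load('ultralytics/yolov5', 'yolov5s')
model.conf = 0.4  # confidence threshold for discarding low-confidence grid predictions

def detect_ships(frame_bgr):
    """Return [x1, y1, x2, y2, conf, cls] boxes for vessel-like targets in one frame."""
    results = model(frame_bgr[..., ::-1])        # OpenCV BGR frame -> RGB
    boxes = results.xyxy[0].cpu().numpy()        # (N, 6): x1, y1, x2, y2, conf, cls
    return [b for b in boxes if results.names[int(b[5])] == 'boat']
```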
3.3. Object-Based Feature Set Establishment and Matching
We extract the approximate area of the ship target through the neural network, but the detected box does not fit the ship tightly, and the region still contains part of the natural background. Directly using this region for feature extraction would therefore still pick up background feature points such as water, islands, and sky. We consequently design a screening method based on the target feature set: over a certain length of the time series, stable feature points are selected by repeatedly updating and screening the feature set, and the stable feature points of the same target are assembled into a feature set in preparation for subsequent relocation.
In contrast to the time-varying texture features of the sea surface, targets such as ships, islands, and ports offer more stable feature points in videos. After ORB feature descriptors are extracted, the Hamming distance quantifies the similarity between feature points: a smaller Hamming distance indicates greater similarity. Using the YOLOv5 target recognition model, frame $t$ yields a set of target feature points
$$P_t = \{p_1, p_2, \ldots, p_m\},$$
where $P_t$ represents the feature set of target $T$ in frame $t$. Let the set of all feature points extracted in the subsequent frame be
$$Q_{t+1} = \{q_1, q_2, \ldots, q_n\},$$
where $Q_{t+1}$ is the feature set extracted within the target recognition box of the image and $q_j$ represents a feature point in that frame. The Hamming distance between a feature point $p_i$ in $P_t$ and all feature points in $Q_{t+1}$ is calculated, and the nearest-neighbor distance $d_1$ and the second-nearest-neighbor distance $d_2$ are selected. If the ratio of the two distances satisfies $d_1 / d_2 < k$ for a set value $k$, then $p_i$ and its nearest neighbor $q_j$ are regarded as successfully matched. The set of newly and successfully matched target feature points is
$$M = \{(p_i, q_j) \mid p_i \in P_t,\; q_j \in Q_{t+1},\; d_1 / d_2 < k\},$$
where, among the $m$ feature points of $P_t$, the feature points appearing in $M$ have been verified.
Bidirectional matching is then used for screening: each feature point $p_i$ in $P_t$ must find a unique corresponding feature point in $Q_{t+1}$, and after traversing $Q_{t+1}$, each feature point $q_j$ must likewise find a unique corresponding feature point in $P_t$. If the matching point of $p_i$ is $q_j$ and the matching point of $q_j$ is $p_i$, the pair is identified as a correct match; otherwise, it is judged to be a false match and removed.
The selected matching pairs are sorted by Hamming distance to obtain the minimum Hamming distance. If the Hamming distance of a matched pair exceeds a set multiple of this minimum, the pair is judged to be a false match and removed, providing a further screening of the feature point pairs. An empirical threshold is then used to identify feature points that appear in a sufficient number of matched pairs as trusted feature points. If a trusted feature point stays close in position across consecutive frames and has accumulated enough frames, it is added to the feature point set of target $T$ (Algorithm 1). The procedure is summarized in the following pseudocode:
Algorithm 1: Improved Feature Matching.
Input: Target set $P_t$; feature points of the subsequent frame $Q_{t+1}$
Output: Updated target set $P_t$
# Initialization: Matching Pairs = Empty Set; Mismatch Pairs = Empty Set; Trusted Features Set = Empty Set

# 1. Bidirectional Matching Filtering
For each feature point $p_i$ in $P_t$:
    Find the corresponding feature point $q_j$ in $Q_{t+1}$
    If feature point $q_j$ in $Q_{t+1}$ also uniquely matches to $p_i$ in $P_t$:
        Add ($p_i$, $q_j$) to Matching Pairs
    Else:
        Add ($p_i$, $q_j$) to Mismatch Pairs
    End if
End for
For each feature point $q_j$ in $Q_{t+1}$:
    Find the corresponding feature point $p_i$ in $P_t$
    If feature point $p_i$ in $P_t$ also uniquely matches to $q_j$ in $Q_{t+1}$:
        Add ($p_i$, $q_j$) to Matching Pairs
    Else:
        Add ($p_i$, $q_j$) to Mismatch Pairs
    End if
End for
# 2. Sort Matching Pairs by Hamming Distance
Sort(Matching Pairs, by Hamming Distance)

# 3. Hamming Distance Multiplier Filtering
For each ($p_i$, $q_j$) in Matching Pairs:
    If Hamming Distance($p_i$, $q_j$) exceeds (N × Minimum Hamming Distance):
        Add ($p_i$, $q_j$) to Mismatch Pairs
    Else:
        Keep ($p_i$, $q_j$) in Filtered Matching Pairs
    End if
End for
# 4. Accumulate Trusted Feature Points
For each ($p_i$, $q_j$) in Filtered Matching Pairs:
    If $q_j$ appears in a sufficient number of matching pairs:
        Add $q_j$ to Trusted Features Set
    End if
End for
# 5. Accumulate Trusted Feature Points from Continuous Frames
For each trusted feature point:
    If it is close in position across consecutive frames and the accumulated frame count ≥ Experience Threshold:
        Add the trusted feature point to target $T$’s feature set $P_t$
    End if
End for
Return Target set $P_t$
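For concreteness, the following is a minimal Python/OpenCV sketch of the core of Algorithm 1 (ratio test, bidirectional check, and minimum-distance-multiple filtering). The values of `ratio_k` and `dist_mult` are illustrative, and the accumulation of trusted points over consecutive frames is omitted.

```python
import cv2

def match_target_features(desc_prev, desc_curr, ratio_k=0.7, dist_mult=2.0):
    """Bidirectional ratio-test matching of ORB descriptors (Hamming distance),
    followed by the minimum-Hamming-distance-multiple filter of Algorithm 1."""
    bf = cv2.BFMatcher(cv2.NORM_HAMMING)

    def ratio_matches(d1, d2):
        out = {}
        for pair in bf.knnMatch(d1, d2, k=2):
            if len(pair) < 2:
                continue
            m, n = pair
            if m.distance < ratio_k * n.distance:   # nearest vs. second-nearest neighbor
                out[m.queryIdx] = m
        return out

    fwd = ratio_matches(desc_prev, desc_curr)       # P_t -> Q_{t+1}
    bwd = ratio_matches(desc_curr, desc_prev)       # Q_{t+1} -> P_t

    # Keep only mutually consistent (bidirectional) matches.
    mutual = [m for q, m in fwd.items()
              if m.trainIdx in bwd and bwd[m.trainIdx].trainIdx == q]
    if not mutual:
        return []

    # Reject pairs whose distance exceeds a multiple of the minimum Hamming distance.
    d_min = max(min(m.distance for m in mutual), 1.0)
    return [m for m in mutual if m.distance <= dist_mult * d_min]
```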
3.4. Local Map Optimization and Relocation Based on Feature Set
During camera movement in the recording process, parts of the scene may move out of the camera’s field of view, causing previously tracked feature points to gradually disappear. The swaying of the sea surface due to wind and waves accelerates the loss of tracking points, with greater swaying amplitudes correlating negatively with point retention. When the number of feature points drops below a set threshold, indicating reduced scene stability, a new keyframe is selected. Careful consideration is given to the spacing of keyframes to prevent excessive density, which could slow down processing and impact depth estimation accuracy. Hence, keyframes are selected based on significant camera motion. The initial keyframe, K1, establishes the global coordinate system, with feature points set as L0. Subsequent frames, like K2 and K3, are chosen to ensure a sufficient number of matching points (L1 and L2, respectively) between successive keyframes.
The keyframe initialization steps are as follows:
- (1)
Select keyframes K1, K2, and K3 that meet the criteria, extract their targets, and establish feature point sets, ensuring correspondence with L1 and L2;
- (2)
Set K1 as the global coordinate system, screen matched feature points between K1, K2, and K3 using improved RANSAC, and calculate their pose changes;
- (3)
Derive rotation matrix R and translation matrix T of K3 relative to the global coordinate using essential matrix E (K1 and K3);
- (4)
Perform triangulation using the pose change matrix and corresponding feature points to obtain 3D coordinates, inversely deriving K3’s position relative to K2;
We then construct the reprojection error in these keyframes using the projected positions of the scene points, as sketched below.
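The sketch below illustrates the triangulation and reprojection-error computation used in these initialization steps, assuming pinhole intrinsics `K` and keyframe poses given as world-to-camera rotations and translations; it is illustrative rather than the authors’ implementation.

```python
import cv2
import numpy as np

def triangulate_and_reproject(K, R1, t1, R2, t2, pts1, pts2):
    """Triangulate matched points of two keyframes and return the mean reprojection error.
    K: 3x3 intrinsics; (R, t): world-to-camera poses; pts1, pts2: Nx2 pixel coordinates."""
    P1 = K @ np.hstack([R1, t1.reshape(3, 1)])
    P2 = K @ np.hstack([R2, t2.reshape(3, 1)])
    X_h = cv2.triangulatePoints(P1, P2, pts1.T.astype(float), pts2.T.astype(float))
    X = (X_h[:3] / X_h[3]).T                                   # Nx3 scene points

    def mean_error(P, pts):
        proj = P @ np.vstack([X.T, np.ones(len(X))])           # project scene points
        proj = (proj[:2] / proj[2]).T
        return np.linalg.norm(proj - pts, axis=1).mean()

    return X, 0.5 * (mean_error(P1, pts1) + mean_error(P2, pts2))
```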
After initialization, pose estimation proceeds with the following steps:
- (1)
Assess each image for keyframe suitability, iterating if not identified;
- (2)
Calculate initial pose changes, triangulate 3D scene points, and add selected keyframes to the running window;
- (3)
Optimize initial values using reprojection error within the window, iterating back to step 1.
By optimizing the initial pose estimate (specifically, the initial estimate of the pose change), the reprojection error is minimized, thereby enhancing both the accuracy and stability of the pose estimation process. When the reprojection error falls below a predefined threshold, the pose information can be retrieved directly. The pose estimation itself computes the position and orientation matrix using the least squares method.
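One possible realization of this least-squares pose estimation with reprojection-error refinement is sketched below, using OpenCV’s RANSAC PnP followed by Levenberg-Marquardt refinement; this is an assumed stand-in for the solver, not the authors’ exact implementation.

```python
import cv2
import numpy as np

def estimate_keyframe_pose(K, scene_pts, image_pts):
    """Initial pose from 3D-2D correspondences, refined by minimizing reprojection error.
    scene_pts: Nx3 map points; image_pts: Nx2 pixel observations; K: 3x3 intrinsics."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        scene_pts.astype(np.float64), image_pts.astype(np.float64), K, None)
    if not ok or inliers is None:
        return None
    idx = inliers[:, 0]
    rvec, tvec = cv2.solvePnPRefineLM(          # iterative reprojection-error refinement
        scene_pts[idx].astype(np.float64), image_pts[idx].astype(np.float64),
        K, None, rvec, tvec)
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec                              # world-to-camera rotation and translation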
If the feature points extracted from the target region in a new image frame fail to match the target feature point set or if the re-input image frame lacks the target region (i.e., no feature points are present), it is assumed that the current tracking target has been lost, resulting in feature point tracking failure. In cases where the synchronized GNSS position information and image saturation remain relatively stable, it can be inferred that the tracking loss is due to shaking, and tracking should be reinitiated.
Sea surface jitter and periodic movements can cause the loss of map tracking, with the potential for fewer or even no matched points. To address the issue of swaying caused by periodic sea surface movement, the relocation frame—specifically, the selected key frame—must satisfy the following conditions:
- (1)
The time interval should not exceed 1 s;
- (2)
The number of feature points within the target type must be at least L1.
In the third step, the feature points of the lost frame are matched against the existing feature point set to ensure that the similarity between the key frame and the target set exceeds the threshold H1.
As shown in Figure 4, the newly input frame is matched against the updated target feature set, which is equivalent to a blank image containing only feature information. A new target feature set $P'_T$ is matched to the stored set $P_T$ when the change in GNSS position is smaller than the set threshold P and the above relocation-frame criteria are met. This frame is then used to set a new keyframe and to establish a connection with the feature set, on the basis of which the new keyframe initialization task is completed.
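The relocation conditions above amount to a simple gating check; the sketch below is illustrative, with L1, H1, and P treated as user-chosen thresholds whose numerical values are not fixed by the text.

```python
import numpy as np

def is_valid_relocation_frame(dt, n_target_feats, set_similarity, gnss_prev, gnss_curr,
                              min_feats_L1, sim_thresh_H1, gnss_thresh_P):
    """Gate a candidate relocation keyframe using the conditions of Section 3.4."""
    time_ok = dt <= 1.0                                              # interval of at most 1 s
    feats_ok = n_target_feats >= min_feats_L1                        # enough target features
    match_ok = set_similarity > sim_thresh_H1                        # similarity to the feature set
    gnss_ok = np.linalg.norm(np.subtract(gnss_curr, gnss_prev)) < gnss_thresh_P
    return time_ok and feats_ok and match_ok and gnss_ok
```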
3.5. Front-End Construction of Multi-Sensor Fusion
Compared with tight coupling, the loose coupling fusion method offers simpler sensor combinations and data processing algorithms, resulting in lower computational costs [34]. This method allows sensor combinations to be tailored to different application scenarios and requirements, and it adapts readily to environments where signals from individual sensors may be weak or intermittent. This study uses the loose coupling method to optimize GNSS measurements at keyframes, integrating data from GNSS and visual odometry.
GNSS provides absolute outdoor position information, which complements monocular vision well and improves the robustness and accuracy of visual map construction. Since keyframes and GNSS data are collected with different initial times and acquisition frequencies, the neighborhood linear interpolation method is used to align them:
$$G(t_k) = G(t_i) + \frac{t_k - t_i}{t_j - t_i}\bigl(G(t_j) - G(t_i)\bigr),$$
where $G(t_k)$ is the GNSS position corresponding to the keyframe time $t_k$, and $G(t_i)$ and $G(t_j)$ are the GNSS positions obtained at the nearest times $t_i$ and $t_j$ on the adjacent sides of $t_k$.
The similarity transformation parameters $\{s, R, t\}$ between the geographic coordinate system and the visual world coordinate system are used as initial values to construct a least squares problem. Based on the constraints of all aligned GNSS positions and keyframe positions, the globally optimal similarity transformation parameters are solved. First, three points are randomly selected by the RANSAC algorithm, and the similarity transformation parameters are initially solved with Horn’s three-point method [30]. Then, more reliable initial values of the similarity transformation are screened out according to the number of inliers and the overall error. After finding the initial value, the best similarity transformation parameters are sought to minimize the error. In our experience, the number of iterations should be kept below 20 to balance real-time performance against error.
The function to be optimized is
$$\min_{s,\,R,\,t} \sum_{k} \left\| s\,R\,C_k + t - G_k \right\|^2,$$
where $C_k = -R_k^{\top} T_k$ is the coordinate of the keyframe center in the world coordinate system; $R_k$ and $T_k$ are the rotation matrix and translation vector of the keyframe; $C_k$ thus gives the position of the keyframe relative to the starting point using only the video information; $G_k$ is the position obtained by GNSS corresponding to the keyframe; and $\{s, R, t\}$ are the similarity transformation parameters.
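For reference, the closed-form Umeyama alignment below recovers one such similarity transformation from paired keyframe centers and GNSS positions. It is offered as a sketch only; the paper’s pipeline uses Horn’s three-point initialization inside RANSAC followed by least-squares refinement.

```python
import numpy as np

def fit_similarity(C, G):
    """Closed-form similarity transform (scale s, rotation R, translation t) such that
    G_k ~= s * R @ C_k + t, with C the Nx3 visual keyframe centers and G the Nx3
    aligned GNSS positions (Umeyama alignment)."""
    C, G = np.asarray(C, float), np.asarray(G, float)
    mu_c, mu_g = C.mean(axis=0), G.mean(axis=0)
    Cc, Gc = C - mu_c, G - mu_g
    U, D, Vt = np.linalg.svd(Gc.T @ Cc / len(C))       # cross-covariance SVD
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1                                   # enforce a proper rotation
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / Cc.var(axis=0).sum()
    t = mu_g - s * R @ mu_c
    return s, R, t
```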
3.6. Target Ranging Based on Bundle Adjustment
During system motion, targets exhibit parallax between consecutive frames. The vision system provides parallax and relative pose information, which are used to compute distance based on a similar-triangle principle, as illustrated in Figure 5.
The ship travels from position A to position B. $P_A$ denotes the video frame information at position A, while $P_B$ denotes the video frame information at position B, typically comprising two keyframes containing the target. The sensor captures the video information at position A, which can be related to the target through the bundle (collinearity) equations [35,36]:
$$x - x_0 = -f\,\frac{a_1(X - X_A) + b_1(Y - Y_A) + c_1(Z - Z_A)}{a_3(X - X_A) + b_3(Y - Y_A) + c_3(Z - Z_A)}, \qquad
y - y_0 = -f\,\frac{a_2(X - X_A) + b_2(Y - Y_A) + c_2(Z - Z_A)}{a_3(X - X_A) + b_3(Y - Y_A) + c_3(Z - Z_A)},$$
where $(x_0, y_0, f)$ are the interior azimuth elements of the image; $(X_A, Y_A, Z_A)$ are the coordinates of the video frame (camera) at position A; $(X, Y, Z)$ are the coordinates of the target; $(x, y)$ is the pixel location of the target extracted by the YOLO network; and $a_i, b_i, c_i\ (i = 1, 2, 3)$ are the nine directional cosines composed of the three exterior azimuth elements of the image, obtained from the previously estimated position and attitude. Similarly, at position B we obtain an analogous pair of equations.
The simultaneous equations are solved using the least squares method to determine the target’s location, from which the distance is subsequently derived.
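A compact way to solve these simultaneous collinearity constraints is the linear (DLT) least-squares formulation sketched below, assuming pinhole intrinsics `K` and world-to-camera poses for positions A and B; a full bundle adjustment would further refine this linear solution.

```python
import numpy as np

def locate_target(K, poses, pixels):
    """Least-squares target position from the image observations at two (or more) stations.
    poses: list of (R, t) world-to-camera poses; pixels: matching (u, v) YOLO box centers."""
    A = []
    for (R, t), (u, v) in zip(poses, pixels):
        P = K @ np.hstack([R, np.asarray(t, float).reshape(3, 1)])
        A.append(u * P[2] - P[0])                      # each observation yields two equations
        A.append(v * P[2] - P[1])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    Xh = Vt[-1]
    X = Xh[:3] / Xh[3]                                 # target position in world coordinates
    R_a, t_a = poses[0]
    cam_a = -R_a.T @ np.asarray(t_a, float).reshape(3) # camera center at position A
    return X, float(np.linalg.norm(X - cam_a))         # target position and range from A
```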
4. Experiments and Results
The selection of dataset and evaluation criteria is crucial. Section 4.1 discusses the offshore vSLAM methodology under evaluation, Section 4.2 outlines the dataset employed for evaluation, and Section 4.3 defines the evaluation criteria.
4.1. Comparative Experiments
For comparative evaluation alongside the proposed method, and in accordance with the offshore-adapted vSLAM systems discussed in Section 2, YOLO_ORBSLAM [20], water segmentation SLAM [12], and U-net dynamic SLAM [31] were selected. In indoor settings, YOLO_ORBSLAM typically filters and removes feature points within detected object areas to enhance environmental mapping accuracy. In contrast to indoor scenarios, however, artificial features such as ships, lighthouses, and ports offer more reliable and stable feature points for mapping and distance measurement in marine environments. Therefore, the YOLO_ORBSLAM approach used in this study adopts a strategy that focuses on the detection region within the frame. Water segmentation SLAM, also based on the ORB-SLAM system, incorporates adaptive enhancements tailored to the water surface environment and USV characteristics. Similarly, the U-net method originally used to segment dynamic indoor objects should, in the environment discussed in this paper, separate the dynamic water surface; we therefore use the U-net network to remove the water surface when applying U-net dynamic environment SLAM. In the original water segmentation work, water is segmented using the HSV color space and turquoise water surfaces are detected; when applied to the open ocean, the color space must be adjusted according to the captured video sequences to segment the gray and white sea surface effectively. To evaluate mapping strategies in marine conditions, a series of video sequences capturing ships at sea was acquired; specifically, sequences in which the camera movement naturally followed the ship’s motion were chosen for detailed analysis.
4.2. Data
The video sequences were captured near Zhoushan City in the East China Sea, using cameras and GNSS deployed on the ship. GNSS time was synchronized with camera time. Ship positions for filming and measuring were obtained from AIS and GNSS data. The videos were recorded on a late April morning under favorable lighting conditions, with the water experiencing high tide and fluctuating approximately 1.5 m (Figure 6).
The distance between the camera and the main survey ship (the vessel most prominent in the video) ranged from 100 to 150 m. Throughout the experiment, the target vessel remained stationary. The study primarily assesses the distance between the measurement sensor and the main observation ship, Yihai157, using the enhanced fusion algorithm tailored for marine environments, and compares the results with the actual ship positions derived from GNSS and AIS data to evaluate measurement accuracy. The large swaying occurred between frames 480 and 700; shooting before and after the swaying was relatively stable, and the ranging error at the moments of waves was larger than at other moments.
Table 2 and Table 3 give the details of the video sequences, and Table 4 summarizes the experiment. Part of the video frames are shown in Figure 7.
4.3. Evaluation Criteria
In this paper, the localization performance and anti-shaking performance are evaluated. For related vSLAM work, localization and mapping performance can be evaluated by the root-mean-square Absolute Position Error (APE) of each trajectory. In this experiment, the restored trajectory is sampled according to the GPS and AIS position refresh rate, and the absolute distance error between each trajectory sampling point and the target ship Yihai157 is calculated. The overall ranging accuracy is evaluated by the average positioning error, while the map restoration performance under shaking is measured by the distance between the target ship and the sampling points of the map trajectory corresponding to the video segment in which shaking occurs.
Average positioning error:
$$E_{avg} = \frac{1}{N}\sum_{i=1}^{N} \frac{\left| d_i - \hat{d}_i \right|}{\hat{d}_i},$$
where $M$ is the trajectory map established by SLAM; $(x_i, y_i) \in M$ are the coordinates of the sampling points in the map; $d_i$ represents the distance between sampling point $i$ and the target point; and $\hat{d}_i$ is the corresponding reference distance derived from GNSS and AIS.
Error of the swaying sampling points:
$$E_{sway} = \frac{1}{\left| S \right|}\sum_{i \in S} \frac{\left| d_i - \hat{d}_i \right|}{\hat{d}_i},$$
where $S$ is the set of sampling points within the video segment in which shaking occurs.
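Assuming both metrics are relative ranging errors (consistent with the percentages reported in Section 4.5), they can be computed as in the short sketch below.

```python
import numpy as np

def ranging_errors(est_dists, ref_dists, sway_mask):
    """Average relative ranging error over all sampling points and over the points
    that fall inside the swaying video segment (sway_mask is a boolean array)."""
    est, ref = np.asarray(est_dists, float), np.asarray(ref_dists, float)
    rel = np.abs(est - ref) / ref
    return rel.mean(), rel[np.asarray(sway_mask, bool)].mean()
```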
4.4. Experimental Evaluation
Firstly, the mapping and ranging effects of the proposed system in marine video sequences are evaluated. The mapping phase of the experiment is conducted on a virtual machine running Ubuntu 16.04.7, while the ranging task is executed using the MATLAB 2024a compiler. Both tasks utilize an Intel (R) Xeon (R) Gold 5222 CPU @ 3.80 GHz with 35 GB of RAM, demonstrating real-time performance without NVIDIA GPU acceleration.
According to the target area obtained in Figure 8, the regional target features are extracted and the target feature set is established in preparation for relocation after large shaking. Figure 8 and Figure 9 show the preprocessing part of this experiment. In the figure, the top left shows the feature points at the start of the first frame, while the top right shows the feature points after a period of movement; it can be seen that the feature points in the background of the detection box have been filtered out. The bottom left is the video frame when the shaking occurs, and the bottom right is the end of the video. A further quantitative analysis of map construction accuracy shows that the maximum relative error of the enhanced algorithm proposed in this paper is 9.2% in oceanic environments, as shown in Table 4.
The performance of the mapping and ranging system proposed in this paper is evaluated according to the experimental evaluation criteria defined in Section 4.3. The average ranging error is maintained within 10%, demonstrating high accuracy for marine target ranging at short and medium ranges and thereby meeting engineering requirements.
4.5. Comparison with Other Algorithms
In the video sequence dataset collected for this experiment, various surface scenarios were captured with the vessels in a stationary anchorage state. The trajectory design ensured comprehensive coverage of the ship from bow to stern, and a moment of camera shake caused by natural wave fluctuations is included in the sequence. The target regions identified in Figure 8 and the preprocessing results in Figure 9, described in Section 4.4, form the basis of the following comparison.
In Figure 10, panel (a) illustrates the ship’s trajectory with a red line, while blue points represent the target feature points. Notably, the bridge and stern of the ship exhibit a higher concentration of feature points than the hull. The recovered trajectory shows two significant jumps, indicating substantial camera shake due to waves, which our proposed system effectively mitigated. For the YOLO_ORBSLAM2 (b) and water segmentation SLAM (c) mapping results, red indicates the recovered trajectory and blue marks the associated feature points. Some feature points in the water segmentation SLAM are influenced by wave splashes. YOLO_ORBSLAM2 experienced considerable drift after losing sight of the main vessel, while the water segmentation SLAM struggled to maintain a complete trajectory due to waves. The U-net dynamic environment segmentation SLAM (d) lost the ship at the shaking moments, although its water segmentation did not itself introduce larger errors.
We evaluated the performance of the proposed mapping and ranging system according to the experimental criteria established in Section 4.3. The maximum relative error reflects the pose recovery performance of the SLAM system during significant shaking, while the average relative distance error indicates the overall ranging accuracy, as shown in Figure 11.
The HSV-based water segmentation SLAM was initially designed to extract non-water artificial features from green water areas, where the distinction between green water and nearby artificial structures is clear. However, in non-coastal marine environments, the HSV segmentation may inadvertently include parts of grayish-white vessels, adversely affecting overall position estimation and repositioning accuracy. Given that water segmentation SLAM failed after significant shaking—potentially due to the similarity in color between the water and the vessels—we calculated the ranging accuracy only for the motion prior to the shaking.
The YOLO_ORBSLAM2 method shares similarities with our proposed approach, using the GNSS coupling method to optimize trajectory estimation in the xy-plane. During calmer periods, the differences between YOLO_ORBSLAM2 and our method were minimal, as the extracted target region predominantly contained artificial feature points, with background points having little influence on pose estimation. Although the average positioning error was low at mid-range distances, it increased with trajectory length, highlighting cumulative error concerns.
The U-net water removal SLAM method stops at the moment of shaking. When computing the overall average error, we therefore do not take the last shaking keyframe into account: its position estimate comes entirely from the underlying ORB-SLAM3 system and corresponds to only a single keyframe, and synchronizing this frame with GNSS information would make the analysis and comparison meaningless. This also explains how the swaying error in the middle part of Table 4 is calculated. However, before the shaking, while the ship is sailing stably, its accuracy is 4.4%, which indicates that reasonable water surface removal can meet the general requirements of surface SLAM.
Table 4 presents a comparison of accuracy across different ranging algorithms. Our method demonstrates superior performance in both shaking trajectory error (9.2%) and average error (4.7%). In contrast, YOLO_ORBSLAM2 and HSV_Kmeans_SLAM exhibited significantly higher errors, with shaking trajectory errors of 12.3% and 23.7% and average errors of 10.3% and 18.6%, respectively. These results underscore the advantages of our proposed method in terms of stability and accuracy in dynamic environments.
Modifications applied to these sequences are detailed in Section 4.2. The YOLO_ORBSLAM2 and water segmentation SLAM methods relied on their respective localization functionalities and the inherent capabilities of the ORB-SLAM system. Their mapping performance on the collected video sequences is summarized in Table 4 and Figure 12. YOLO_ORBSLAM2 repositioned slowly during vibrations, completing only 240 frames of the first video, and struggled to achieve effective localization during oscillations. Water segmentation SLAM, which focuses on separating the vessels from the water, was less affected by vibrations but exhibited significant mismatches in the estimated camera motion.
The HSV algorithm for water segmentation treats both the sky and the ocean as gray. Rough sea conditions may thus lead to the removal of parts of white vessels and large areas of sky, complicating the removal of certain spray particles. Compared with YOLO_ORBSLAM2, the method demonstrated lower trajectory mapping accuracy under relatively stable sea conditions. Additionally, its enhanced keyframe initialization function did not adequately adapt to environments with substantial surface oscillation.
In summary, the YOLO_ORBSLAM2 system is ill-suited for marine environments because it cannot mitigate significant oscillations. In our experiments on the collected video sequences, the ORB-SLAM system failed to map effectively during pronounced shaking, as the extracted feature points included elements of the sea surface. Using a monocular camera to detect marine targets, YOLO_ORBSLAM2 struggled to overcome the shaking and could not fully recover the camera’s position and orientation throughout the video sequence. The SLAM system with HSV-based water removal encountered issues with water segmentation thresholds: unlike in coastal green-blue water areas, it segmented the open marine environment inadequately, diminishing the accuracy of mapping and ranging and making it unsuitable for marine contexts. The U-net water segmentation method shows that correct segmentation can, to a certain extent, solve the problem of locating ship targets at sea, but unfortunately it stalls in the shaking environment. In contrast, our proposed improved fusion algorithm effectively addresses the challenges of low precision and severe shaking in marine environments, consistently achieving high-accuracy tracking over short to medium ranges.