1. Introduction
In today’s complex and ever-changing engineering environment, engineering tasks face a variety of challenges and potential hazards, and many must be carried out in unknown environments. Unmanned surface vessels (USVs) can accomplish different tasks by carrying different mission payloads, and their small size, low cost, and high mobility give them great advantages in many fields. Environmental reconnaissance using unmanned platforms can significantly improve engineering efficiency and reduce personnel risk [1,2,3]. With the continuous development of autonomous unmanned reconnaissance systems, high-precision carrier positioning and target ranging have become crucial parts of the engineering measurement process, and accurate measurement results provide reliable data support for engineering tasks.
Ranging technologies primarily include laser, ultrasonic, and radar methods. These active technologies are highly sensitive to the environment, offer low concealment, and often struggle to identify target types [4,5]. In contrast, visual ranging exploits rich image data to enhance target identification while addressing these issues. Simultaneous Localization and Mapping (SLAM), a crucial artificial intelligence technology, has been employed in autonomous driving, environmental modeling, and unmanned aerial vehicle navigation. The essence of SLAM-based ranging lies in using the camera to capture image information, analyzing image features, integrating intrinsic and extrinsic parameter information to estimate the camera pose in the scene, and subsequently determining distances within the scene.
In marine environments affected by wind and waves, the efficacy of conventional visual SLAM is limited, especially with monocular cameras, mainly because feature points are difficult to extract accurately from the dynamically changing sea surface. This leads to errors in estimating the position and attitude of the unmanned vessel, particularly during shaking, when the rapid loss of feature point information impedes continuous pose estimation and global trajectory mapping [2,6,7,8].
The main improvements of this paper are as follows:
- (1)
This paper introduces a SLAM system specifically designed for sea environments. The ship’s image region is acquired through object detection and segmentation, while stable features are meticulously filtered to extract more reliable static features;
- (2)
A novel key frame optimization and relocation method based on the target feature set is proposed. The retained feature set serves to constrain and preserve target feature information, enabling rapid and efficient map retrieval during shaking events, thereby enhancing the accuracy and robustness of the system under such conditions;
- (3)
To verify our algorithm, we independently collected monocular marine video data, GNSS (global navigation satellite system) location information, and AIS (Automatic Identification System) data from ships anchored in the East China Sea. The effectiveness and reliability of the algorithm are verified by analyzing the overall accuracy and the accuracy at shaking moments.
The subsequent parts of this article are organized as follows: Section 2 provides a detailed introduction to the related work, Section 3 presents the specific details of our technique for filtering out marine feature points around the target, Section 4 presents the experimental analysis of our system’s performance, and Section 5 concludes the paper.
2. Related Work
VSLAM can be divided into direct and indirect methods. Indirect methods select a certain number of representative points, called keypoints, from the collected images and then detect and match them in subsequent images to obtain the camera pose. This approach preserves the key information of the image while reducing the amount of computation, so it has been widely adopted.
The MonoSLAM system [9], introduced in 2007, pioneered real-time monocular visual SLAM, with advantages in cost effectiveness and efficiency; however, problems such as the loss of sparse feature points and the cumulative error of the EKF remain. The advent of PTAM [10] addressed the inability of real-time localization and mapping to work stably over long periods in large-scale environments. As the first SLAM algorithm to use nonlinear optimization in the back end, PTAM solves the problem of overly fast data growth in filter-based methods. The ORB-SLAM system, presented in 2015, addresses these issues and improves accuracy, stability, and real-time performance in static environments. It supports monocular, stereo, and depth cameras and runs three parallel threads: visual tracking, local mapping optimization, and loop closure detection through graph optimization [11].
SLAM systems based on point features rely heavily on the quality and quantity of point features. In weakly textured scenes it is difficult to detect enough feature points, so the robustness and accuracy of the system suffer and tracking may even fail. In addition, the number and quality of point feature matches degrade severely under fast camera movement, illumination changes, and other adverse conditions. To improve feature-based SLAM algorithms, the application of line features in SLAM systems has therefore attracted increasing attention [12,13]. The commonly used line feature extraction algorithm is LSD. Subsequent advances include the work of Pumarola et al. [14], who in 2017 proposed a point–line feature fusion system, PL-SLAM, which compensates for the lack of point features in low-texture scenes. In the same year, Mur-Artal et al. introduced ORB-SLAM2, extending compatibility to RGB-D cameras and enhancing tracking through RANSAC outlier removal, DBoW2 loop detection, and EPnP pose estimation [15]. In some environments, SLAM algorithms also exploit prominent planar features; however, most valid planar features exist only in artificial environments, and suitable planes can rarely be found in natural environments, which limits their scope of application. VSLAM methods combining point, line, and plane features improve localization accuracy and robustness in weakly textured scenes, under illumination changes, and during fast camera movement, but the introduction of line and plane features increases the time consumed by feature extraction and matching and reduces the efficiency of SLAM systems. Therefore, the VSLAM algorithm based on point features still occupies the mainstream position [16].
It is difficult to overcome image blur caused by fast camera motion or poor illumination by relying on the camera alone. Therefore, the fusion of SLAM with other sensors, such as GNSS and IMU, is now widely used. In 2019, the Hong Kong University of Science and Technology released VINS-Fusion [17], which integrates monocular and stereo cameras, an IMU, and GPS to achieve robust localization and, in particular, to mitigate IMU drift. Hongyu Fu et al. [18] combined ORB-SLAM with GNSS for multi-sensor fusion on unmanned ships in 2022.
As a popular tool for object recognition and segmentation, deep learning can help us accurately identify objects in images and their relationships with other objects [19]. The feature detection layers of classical CNNs learn through training data, avoiding explicit hand-crafted feature extraction and instead learning features implicitly from the training data. Loo et al. [20], Kin et al. [21], and other works have given detailed summaries. In 2017, Tateno et al. [22] proposed CNN-SLAM, a real-time SLAM system based on a CNN within the LSD-SLAM framework. In the same year, Godard et al. [23] proposed an unsupervised image depth estimation scheme; using stereo datasets to improve unsupervised learning and then using a single frame for pose estimation is another way to improve efficiency and performance.
Recent innovations include Wang et al.’s 2021 DSP-SLAM, which uses deep learning for real-time, accurate foreground object description [24], and Liu et al.’s 2022 RGB-D SLAM [25], which improves initial pose estimation and dynamic feature separation through motion consistency filters and graph cut optimization. Z. Xie et al. [26] combined YOLO with ORB-SLAM2 in 2022 as a new multi-sensor system. Theodorou et al. [27] combined ORB-SLAM3 with YOLOR in 2022 to improve feature classification and visual odometry stability, while Jia et al. [28] proposed LRD-SLAM, which integrates a fast CNN into ORB-SLAM2 to cope with dynamic environmental challenges. Ni et al. [29] also proposed an enhanced ORB-SLAM algorithm for monocular vision in dynamic environments, using reliability-based feature point selection and a novel frame difference method for dynamic object detection. Chen et al. [30] warned that dynamic feature point selection based solely on semantics or geometry may interfere with the accuracy of SLAM systems and argued for customized monocular SLAM algorithms for use in dynamic settings. In 2024, Zhi Zheng et al. [31] proposed a dynamic object-aware visual SLAM algorithm specifically for dynamic indoor environments; the method uses a U-Net semantic segmentation architecture in the tracking thread to detect potential moving objects, improving the robustness and reliability of SLAM systems by filtering out indoor moving objects. Based on ORB-SLAM3, Han Yu et al. [32] used the GCNv2-minor network instead of the ORB method to improve the reliability of feature extraction and matching and the accuracy of position estimation. In general, we summarize the characteristics and drawbacks of the previous work in Table 1.
SLAM systems combined with object recognition networks and SLAM systems designed for dynamic environments do show improvements in overall accuracy and robustness. However, these methods all have limitations. Those combined with target recognition networks screen and eliminate feature points only according to indoor targets. SLAM designed for dynamic environments removes dynamic feature points by removing semantic masks, but feature points that fail to match are confounded with those discarded by semantic segmentation, which degrades localization accuracy.
3. Method
3.1. Overall Framework
Our approach improves the monocular pipeline built on top of ORB-SLAM3. As shown in Figure 1, the core idea is to obtain the location of the target in the video with a deep learning detector, extract and screen the ORB feature points of the target, synchronously build a virtual feature information set of the target in the map, and update the related keyframes with the camera position information obtained from GNSS. This ensures the quality of the environmental feature information obtained in the working environment of the unmanned ship at sea. When the lighting changes or the camera itself wobbles, which often happens in the marine environment where the unmanned boat sails, we judge the duration of the wobble by combining the video with the GNSS information of the unmanned ship and match the wobbling video frames against the updated target feature set to re-establish the position of the target in the camera’s 3D map and in the new video.
As shown in Figure 1, the video information first passes through the deep learning recognition network. As mentioned above, environmental features derived from artificial objects are more reliable than those of the natural environment, and this is even more evident in the marine environment. In the object detection thread, we use the YOLOv5 algorithm for object detection and filter the environmental information by obtaining the positions of artificial marine features (mainly stationary vessels) in the image. YOLOv5 is a lightweight target detection network, which helps improve algorithm efficiency without affecting performance.
In the tracking thread, the feature point information of the target is fed into both keyframe extraction and the target feature set, which is associated with the GNSS position information of the camera frame. As the feature points are tracked across video frames, the feature set is constantly updated and the background information in the target box is filtered a second time. This step is the key to improving the SLAM system’s resistance to the sea environment; the subsequent positioning and mapping tasks are then completed. In this paper, while attending to overall accuracy and robustness, we focus on enabling the system to cope with the sea environment, especially when a large degree of shaking occurs.
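To make the data flow of Figure 1 concrete, the following Python sketch outlines one possible per-frame control flow under the assumptions stated in the comments; the helper functions and the `state` object are hypothetical placeholders for illustration, not the authors’ implementation.

```python
# Hypothetical per-frame control flow corresponding to Figure 1. The helper names
# (detect_ships, extract_orb_in_boxes, update_feature_set, track_frame,
# gnss_is_stable, relocalize_with_feature_set) are placeholders, not the paper's API.
def process_frame(frame, gnss_fix, state):
    boxes = detect_ships(frame)                       # YOLOv5 target (vessel) regions
    feats = extract_orb_in_boxes(frame, boxes)        # ORB features inside the boxes only
    state.feature_set = update_feature_set(state.feature_set, feats)

    pose, ok = track_frame(state, feats)              # ORB-SLAM3-style tracking
    if not ok and gnss_is_stable(gnss_fix, state):    # tracking loss attributed to shaking
        pose, ok = relocalize_with_feature_set(frame, state.feature_set)

    if ok and state.needs_keyframe(pose):
        state.insert_keyframe(pose, gnss_fix)         # keyframes carry synchronized GNSS
    return pose, ok
```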
3.2. Object Region Extraction Based on Deep Learning
YOLO (You Only Look Once) was introduced by Redmon et al. [33] as the first single-stage detector for deep learning-based image detection. The primary advantage of employing the YOLO algorithm in this study is its rapid detection and recognition speed: the base version can process images at up to 45 FPS on a GPU, while the fast version achieves up to 155 FPS. The detection process of YOLO is illustrated in Figure 1: the entire image is divided into S × S grid cells by a single neural network; each grid cell determines whether the center of a target object is located within it and uses this information to ascertain the category and confidence of the detected object; grid cells with low-confidence predictions are then filtered out by a threshold, and Non-Maximum Suppression (NMS) is applied to eliminate overlapping bounding boxes, yielding the final target categories and confidence scores. This end-to-end approach characterizes the YOLO algorithm’s object detection capability. In this paper, a ship target recognition model is developed using the YOLOv5 architecture, whose backbone, neck, and head components are depicted in Figure 2. During camera movement relative to the target, horizontal motion affects the position of the detection box, while forward and backward movements affect its scale. The YOLO framework is used for target detection and identification, providing the coordinate information of the detection box.
As shown in Figure 3, the YOLO network performs well for vessel target extraction at different angles and scales. At the same time, as a lightweight network, its efficiency is also guaranteed, and it can meet the real-time needs of unmanned boat operation. In this paper, the YOLOv5 network is used to extract the position information of relevant artificial objects in the video.
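As an illustration of this step, the sketch below loads a generic YOLOv5 model through `torch.hub` and keeps only vessel-like detections; the publicly released `yolov5s` weights and the COCO "boat" class stand in for the paper’s custom ship-detection model, whose trained weights are not available here.

```python
import torch

# Generic YOLOv5 model via torch.hub; the paper's own ship model would replace 'yolov5s'.
model = torch.hub.load('ultralytics/yolov5', 'yolov5s')
model.conf = 0.4  # confidence threshold for discarding low-confidence grid predictions

def detect_ships(frame_bgr):
    """Return [x1, y1, x2, y2, conf, cls] boxes for vessel-like targets in one frame."""
    results = model(frame_bgr[..., ::-1])        # OpenCV BGR frame -> RGB
    boxes = results.xyxy[0].cpu().numpy()        # (N, 6): x1, y1, x2, y2, conf, cls
    return [b for b in boxes if results.names[int(b[5])] == 'boat']
```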
3.3. Object-Based Feature Set Establishment and Matching
We extract the approximate area of the ship target through the neural network, but the detected box does not fit the ship tightly, and the region still contains part of the natural background. Directly using this region for feature extraction would therefore still pick up background feature points such as water, islands, and sky. We consequently design a screening method based on the target feature set: over a certain length of the time series, stable feature points are selected by repeatedly updating and screening the feature set, and the stable feature points of the same target are assembled into a feature set in preparation for subsequent relocation.
In contrast to the time-varying texture features of the sea surface, targets such as ships, islands, and ports offer more stable feature points in videos. After ORB feature descriptors are extracted, the Hamming distance quantifies the similarity between feature points: a smaller Hamming distance indicates greater similarity. Using the YOLOv5 target recognition model, frame $t$ yields a set of target feature points
$$P_t = \{p_1, p_2, \ldots, p_m\},$$
where $P_t$ represents the feature set of target $T$ in frame $t$. Let the set of all feature points extracted in the subsequent frame be
$$Q_{t+1} = \{q_1, q_2, \ldots, q_n\},$$
where $Q_{t+1}$ is the feature set extracted within the target recognition box of the image and $q_j$ represents a feature point in that frame. The Hamming distance between a feature point $p_i$ in $P_t$ and all feature points in $Q_{t+1}$ is calculated, and the nearest-neighbor distance $d_1$ and the second-nearest-neighbor distance $d_2$ are selected. If the ratio of the two distances satisfies $d_1 / d_2 < k$ for a set value $k$, then $p_i$ and its nearest neighbor $q_j$ are regarded as successfully matched. The set of newly and successfully matched target feature points is
$$M = \{(p_i, q_j) \mid p_i \in P_t,\; q_j \in Q_{t+1},\; d_1 / d_2 < k\},$$
where, among the $m$ feature points of $P_t$, the feature points appearing in $M$ have been verified.
Bidirectional matching is then used for screening: each feature point $p_i$ in $P_t$ must find a unique corresponding feature point in $Q_{t+1}$, and after traversing $Q_{t+1}$, each feature point $q_j$ must likewise find a unique corresponding feature point in $P_t$. If the matching point of $p_i$ is $q_j$ and the matching point of $q_j$ is $p_i$, the pair is identified as a correct match; otherwise, it is judged to be a false match and removed.
The selected matching pairs are sorted by Hamming distance to obtain the minimum Hamming distance. If the Hamming distance of a matched pair exceeds a set multiple of this minimum, the pair is judged to be a false match and removed, providing a further screening of the feature point pairs. An empirical threshold is then used to identify feature points that appear in a sufficient number of matched pairs as trusted feature points. If a trusted feature point stays close in position across consecutive frames and has accumulated enough frames, it is added to the feature point set of target $T$ (Algorithm 1). The procedure is summarized in the following pseudocode:
Algorithm 1: Improved Feature Matching.
Input: Target set $P_t$; feature points of the subsequent frame $Q_{t+1}$
Output: Updated target set $P_t$
# Initialization: Matching Pairs = Empty Set; Mismatch Pairs = Empty Set; Trusted Features Set = Empty Set

# 1. Bidirectional Matching Filtering
For each feature point $p_i$ in $P_t$:
    Find the corresponding feature point $q_j$ in $Q_{t+1}$
    If feature point $q_j$ in $Q_{t+1}$ also uniquely matches to $p_i$ in $P_t$:
        Add ($p_i$, $q_j$) to Matching Pairs
    Else:
        Add ($p_i$, $q_j$) to Mismatch Pairs
    End if
End for
For each feature point $q_j$ in $Q_{t+1}$:
    Find the corresponding feature point $p_i$ in $P_t$
    If feature point $p_i$ in $P_t$ also uniquely matches to $q_j$ in $Q_{t+1}$:
        Add ($p_i$, $q_j$) to Matching Pairs
    Else:
        Add ($p_i$, $q_j$) to Mismatch Pairs
    End if
End for
# 2. Sort Matching Pairs by Hamming Distance
Sort(Matching Pairs, by Hamming Distance)

# 3. Hamming Distance Multiplier Filtering
For each ($p_i$, $q_j$) in Matching Pairs:
    If Hamming Distance($p_i$, $q_j$) exceeds (N × Minimum Hamming Distance):
        Add ($p_i$, $q_j$) to Mismatch Pairs
    Else:
        Keep ($p_i$, $q_j$) in Filtered Matching Pairs
    End if
End for
# 4. Accumulate Trusted Feature Points
For each ($p_i$, $q_j$) in Filtered Matching Pairs:
    If $q_j$ appears in a sufficient number of matching pairs:
        Add $q_j$ to Trusted Features Set
    End if
End for
# 5. Accumulate Trusted Feature Points from Continuous Frames
For each trusted feature point:
    If it is close in position across consecutive frames and the accumulated frame count ≥ Experience Threshold:
        Add the trusted feature point to target $T$’s feature set $P_t$
    End if
End for
Return Target set $P_t$
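For concreteness, the following is a minimal Python/OpenCV sketch of the core of Algorithm 1 (ratio test, bidirectional check, and minimum-distance-multiple filtering). The values of `ratio_k` and `dist_mult` are illustrative, and the accumulation of trusted points over consecutive frames is omitted.

```python
import cv2

def match_target_features(desc_prev, desc_curr, ratio_k=0.7, dist_mult=2.0):
    """Bidirectional ratio-test matching of ORB descriptors (Hamming distance),
    followed by the minimum-Hamming-distance-multiple filter of Algorithm 1."""
    bf = cv2.BFMatcher(cv2.NORM_HAMMING)

    def ratio_matches(d1, d2):
        out = {}
        for pair in bf.knnMatch(d1, d2, k=2):
            if len(pair) < 2:
                continue
            m, n = pair
            if m.distance < ratio_k * n.distance:   # nearest vs. second-nearest neighbor
                out[m.queryIdx] = m
        return out

    fwd = ratio_matches(desc_prev, desc_curr)       # P_t -> Q_{t+1}
    bwd = ratio_matches(desc_curr, desc_prev)       # Q_{t+1} -> P_t

    # Keep only mutually consistent (bidirectional) matches.
    mutual = [m for q, m in fwd.items()
              if m.trainIdx in bwd and bwd[m.trainIdx].trainIdx == q]
    if not mutual:
        return []

    # Reject pairs whose distance exceeds a multiple of the minimum Hamming distance.
    d_min = max(min(m.distance for m in mutual), 1.0)
    return [m for m in mutual if m.distance <= dist_mult * d_min]
```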
3.4. Local Map Optimization and Relocation Based on Feature Set
During camera movement in the recording process, parts of the scene may move out of the camera’s field of view, causing previously tracked feature points to gradually disappear. The swaying of the sea surface due to wind and waves accelerates the loss of tracking points, with greater swaying amplitudes correlating negatively with point retention. When the number of feature points drops below a set threshold, indicating reduced scene stability, a new keyframe is selected. Careful consideration is given to the spacing of keyframes to prevent excessive density, which could slow down processing and impact depth estimation accuracy. Hence, keyframes are selected based on significant camera motion. The initial keyframe, K1, establishes the global coordinate system, with feature points set as L0. Subsequent frames, like K2 and K3, are chosen to ensure a sufficient number of matching points (L1 and L2, respectively) between successive keyframes.
The keyframe initialization steps are as follows:
- (1)
Select keyframes K1, K2, and K3 that meet the criteria, extract their targets, and establish feature point sets, ensuring correspondence with L1 and L2;
- (2)
Set K1 as the global coordinate system, screen matched feature points between K1, K2, and K3 using improved RANSAC, and calculate their pose changes;
- (3)
Derive rotation matrix R and translation matrix T of K3 relative to the global coordinate using essential matrix E (K1 and K3);
- (4)
Perform triangulation using the pose change matrix and corresponding feature points to obtain 3D coordinates, inversely deriving K3’s position relative to K2;
We then construct the reprojection error in these keyframes using the projected positions of the scene points, as sketched below.
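The sketch below illustrates the triangulation and reprojection-error computation used in these initialization steps, assuming pinhole intrinsics `K` and keyframe poses given as world-to-camera rotations and translations; it is illustrative rather than the authors’ implementation.

```python
import cv2
import numpy as np

def triangulate_and_reproject(K, R1, t1, R2, t2, pts1, pts2):
    """Triangulate matched points of two keyframes and return the mean reprojection error.
    K: 3x3 intrinsics; (R, t): world-to-camera poses; pts1, pts2: Nx2 pixel coordinates."""
    P1 = K @ np.hstack([R1, t1.reshape(3, 1)])
    P2 = K @ np.hstack([R2, t2.reshape(3, 1)])
    X_h = cv2.triangulatePoints(P1, P2, pts1.T.astype(float), pts2.T.astype(float))
    X = (X_h[:3] / X_h[3]).T                                   # Nx3 scene points

    def mean_error(P, pts):
        proj = P @ np.vstack([X.T, np.ones(len(X))])           # project scene points
        proj = (proj[:2] / proj[2]).T
        return np.linalg.norm(proj - pts, axis=1).mean()

    return X, 0.5 * (mean_error(P1, pts1) + mean_error(P2, pts2))
```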
After initialization, pose estimation proceeds with the following steps:
- (1)
Assess each image for keyframe suitability, iterating if not identified;
- (2)
Calculate initial pose changes, triangulate 3D scene points, and add selected keyframes to the running window;
- (3)
Optimize initial values using reprojection error within the window, iterating back to step 1.
By optimizing the initial pose estimate (specifically, the initial estimate of the pose change), the reprojection error is minimized, thereby enhancing both the accuracy and stability of the pose estimation process. When the reprojection error falls below a predefined threshold, the pose information can be retrieved directly. The pose estimation itself computes the position and orientation matrix using the least squares method.
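One possible realization of this least-squares pose estimation with reprojection-error refinement is sketched below, using OpenCV’s RANSAC PnP followed by Levenberg-Marquardt refinement; this is an assumed stand-in for the solver, not the authors’ exact implementation.

```python
import cv2
import numpy as np

def estimate_keyframe_pose(K, scene_pts, image_pts):
    """Initial pose from 3D-2D correspondences, refined by minimizing reprojection error.
    scene_pts: Nx3 map points; image_pts: Nx2 pixel observations; K: 3x3 intrinsics."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        scene_pts.astype(np.float64), image_pts.astype(np.float64), K, None)
    if not ok or inliers is None:
        return None
    idx = inliers[:, 0]
    rvec, tvec = cv2.solvePnPRefineLM(          # iterative reprojection-error refinement
        scene_pts[idx].astype(np.float64), image_pts[idx].astype(np.float64),
        K, None, rvec, tvec)
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec                              # world-to-camera rotation and translation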
If the feature points extracted from the target region in a new image frame fail to match the target feature point set or if the re-input image frame lacks the target region (i.e., no feature points are present), it is assumed that the current tracking target has been lost, resulting in feature point tracking failure. In cases where the synchronized GNSS position information and image saturation remain relatively stable, it can be inferred that the tracking loss is due to shaking, and tracking should be reinitiated.
Sea surface jitter and periodic movements can cause the loss of map tracking, with the potential for fewer or even no matched points. To address the issue of swaying caused by periodic sea surface movement, the relocation frame—specifically, the selected key frame—must satisfy the following conditions:
- (1)
The time interval should not exceed 1 s;
- (2)
The number of feature points within the target type must be at least L1.
In the third step, the feature points of the lost frame are matched against the existing feature point set to ensure that the similarity between the key frame and the target set exceeds the threshold H1.
As shown in Figure 4, the newly input frame is matched against the updated target feature set, which is equivalent to a blank image containing only feature information. A new target feature set $P'_T$ is matched to the stored set $P_T$ when the change in GNSS position is smaller than the set threshold P and the above relocation-frame criteria are met. This frame is then used to set a new keyframe and to establish a connection with the feature set, on the basis of which the new keyframe initialization task is completed.
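The relocation conditions above amount to a simple gating check; the sketch below is illustrative, with L1, H1, and P treated as user-chosen thresholds whose numerical values are not fixed by the text.

```python
import numpy as np

def is_valid_relocation_frame(dt, n_target_feats, set_similarity, gnss_prev, gnss_curr,
                              min_feats_L1, sim_thresh_H1, gnss_thresh_P):
    """Gate a candidate relocation keyframe using the conditions of Section 3.4."""
    time_ok = dt <= 1.0                                              # interval of at most 1 s
    feats_ok = n_target_feats >= min_feats_L1                        # enough target features
    match_ok = set_similarity > sim_thresh_H1                        # similarity to the feature set
    gnss_ok = np.linalg.norm(np.subtract(gnss_curr, gnss_prev)) < gnss_thresh_P
    return time_ok and feats_ok and match_ok and gnss_ok
```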
3.5. Front-End Construction of Multi-Sensor Fusion
Compared with tight coupling, the loose coupling fusion method offers simpler sensor combinations and data processing algorithms, resulting in lower computational costs [34]. This method allows sensor combinations to be tailored to different application scenarios and requirements, and it adapts readily to environments where signals from individual sensors may be weak or intermittent. This study uses the loose coupling method to optimize GNSS measurements at keyframes, integrating data from GNSS and visual odometry.
GNSS provides absolute outdoor position information, which complements monocular vision well and improves the robustness and accuracy of visual map construction. Since keyframes and GNSS data are collected with different initial times and acquisition frequencies, the neighborhood linear interpolation method is used to align them:
$$G(t_k) = G(t_i) + \frac{t_k - t_i}{t_j - t_i}\bigl(G(t_j) - G(t_i)\bigr),$$
where $G(t_k)$ is the GNSS position corresponding to the keyframe time $t_k$, and $G(t_i)$ and $G(t_j)$ are the GNSS positions obtained at the nearest times $t_i$ and $t_j$ on the adjacent sides of $t_k$.
The similarity transformation parameters $\{s, R, t\}$ between the geographic coordinate system and the visual world coordinate system are used as initial values to construct a least squares problem. Based on the constraints of all aligned GNSS positions and keyframe positions, the globally optimal similarity transformation parameters are solved. First, three points are randomly selected by the RANSAC algorithm, and the similarity transformation parameters are initially solved with Horn’s three-point method [30]. Then, more reliable initial values of the similarity transformation are screened out according to the number of inliers and the overall error. After finding the initial value, the best similarity transformation parameters are sought to minimize the error. In our experience, the number of iterations should be kept below 20 to balance real-time performance against error.
The function to be optimized is
$$\min_{s,\,R,\,t} \sum_{k} \left\| s\,R\,C_k + t - G_k \right\|^2,$$
where $C_k = -R_k^{\top} T_k$ is the coordinate of the keyframe center in the world coordinate system; $R_k$ and $T_k$ are the rotation matrix and translation vector of the keyframe; $C_k$ thus gives the position of the keyframe relative to the starting point using only the video information; $G_k$ is the position obtained by GNSS corresponding to the keyframe; and $\{s, R, t\}$ are the similarity transformation parameters.
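For reference, the closed-form Umeyama alignment below recovers one such similarity transformation from paired keyframe centers and GNSS positions. It is offered as a sketch only; the paper’s pipeline uses Horn’s three-point initialization inside RANSAC followed by least-squares refinement.

```python
import numpy as np

def fit_similarity(C, G):
    """Closed-form similarity transform (scale s, rotation R, translation t) such that
    G_k ~= s * R @ C_k + t, with C the Nx3 visual keyframe centers and G the Nx3
    aligned GNSS positions (Umeyama alignment)."""
    C, G = np.asarray(C, float), np.asarray(G, float)
    mu_c, mu_g = C.mean(axis=0), G.mean(axis=0)
    Cc, Gc = C - mu_c, G - mu_g
    U, D, Vt = np.linalg.svd(Gc.T @ Cc / len(C))       # cross-covariance SVD
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1                                   # enforce a proper rotation
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / Cc.var(axis=0).sum()
    t = mu_g - s * R @ mu_c
    return s, R, t
```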
3.6. Target Ranging Based on Bundle Adjustment
During system motion, targets exhibit parallax between consecutive frames. The vision system provides parallax and relative pose information, which are used to compute distance based on a similar-triangle principle, as illustrated in Figure 5.
The ship travels from position A to position B. $P_A$ denotes the video frame information at position A, while $P_B$ denotes the video frame information at position B, typically comprising two keyframes containing the target. The sensor captures the video information at position A, which can be related to the target through the bundle (collinearity) equations [35,36]:
$$x - x_0 = -f\,\frac{a_1(X - X_A) + b_1(Y - Y_A) + c_1(Z - Z_A)}{a_3(X - X_A) + b_3(Y - Y_A) + c_3(Z - Z_A)}, \qquad
y - y_0 = -f\,\frac{a_2(X - X_A) + b_2(Y - Y_A) + c_2(Z - Z_A)}{a_3(X - X_A) + b_3(Y - Y_A) + c_3(Z - Z_A)},$$
where $(x_0, y_0, f)$ are the interior azimuth elements of the image; $(X_A, Y_A, Z_A)$ are the coordinates of the video frame (camera) at position A; $(X, Y, Z)$ are the coordinates of the target; $(x, y)$ is the pixel location of the target extracted by the YOLO network; and $a_i, b_i, c_i\ (i = 1, 2, 3)$ are the nine directional cosines composed of the three exterior azimuth elements of the image, obtained from the previously estimated position and attitude. Similarly, at position B we obtain an analogous pair of equations.
The simultaneous equations are solved using the least squares method to determine the target’s location, from which the distance is subsequently derived.
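A compact way to solve these simultaneous collinearity constraints is the linear (DLT) least-squares formulation sketched below, assuming pinhole intrinsics `K` and world-to-camera poses for positions A and B; a full bundle adjustment would further refine this linear solution.

```python
import numpy as np

def locate_target(K, poses, pixels):
    """Least-squares target position from the image observations at two (or more) stations.
    poses: list of (R, t) world-to-camera poses; pixels: matching (u, v) YOLO box centers."""
    A = []
    for (R, t), (u, v) in zip(poses, pixels):
        P = K @ np.hstack([R, np.asarray(t, float).reshape(3, 1)])
        A.append(u * P[2] - P[0])                      # each observation yields two equations
        A.append(v * P[2] - P[1])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    Xh = Vt[-1]
    X = Xh[:3] / Xh[3]                                 # target position in world coordinates
    R_a, t_a = poses[0]
    cam_a = -R_a.T @ np.asarray(t_a, float).reshape(3) # camera center at position A
    return X, float(np.linalg.norm(X - cam_a))         # target position and range from A
```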
4. Experiments and Results
The selection of dataset and evaluation criteria is crucial. Section 4.1 discusses the offshore vSLAM methodology under evaluation, Section 4.2 outlines the dataset employed for evaluation, and Section 4.3 defines the evaluation criteria.
4.1. Comparative Experiments
For comparative evaluation alongside the proposed method, and in accordance with the offshore-adapted vSLAM systems discussed in Section 2, YOLO_ORBSLAM [20], water segmentation SLAM [12], and U-net dynamic SLAM [31] were selected. In indoor settings, YOLO_ORBSLAM typically filters and removes feature points within detected object areas to enhance environmental mapping accuracy. In contrast to indoor scenarios, however, artificial features such as ships, lighthouses, and ports offer more reliable and stable feature points for mapping and distance measurement in marine environments. Therefore, the YOLO_ORBSLAM approach used in this study adopts a strategy that focuses on the detection region within the frame. Water segmentation SLAM, also based on the ORB-SLAM system, incorporates adaptive enhancements tailored to the water surface environment and USV characteristics. Similarly, the U-net method originally used to segment dynamic indoor objects should, in the environment discussed in this paper, separate the dynamic water surface; we therefore use the U-net network to remove the water surface when applying U-net dynamic environment SLAM. In the original water segmentation work, water is segmented using the HSV color space and turquoise water surfaces are detected; when applied to the open ocean, the color space must be adjusted according to the captured video sequences to segment the gray and white sea surface effectively. To evaluate mapping strategies in marine conditions, a series of video sequences capturing ships at sea was acquired; specifically, sequences in which the camera movement naturally followed the ship’s motion were chosen for detailed analysis.
4.2. Data
The video sequences were captured near Zhoushan City in the East China Sea, using cameras and GNSS deployed on the ship. GNSS time was synchronized with camera time. Ship positions for filming and measuring were obtained from AIS and GNSS data. The videos were recorded on a late April morning under favorable lighting conditions, with the water experiencing high tide and fluctuating approximately 1.5 m (Figure 6).
The distance between the camera and the main survey ship (the vessel most prominent in the video) ranged from 100 to 150 m. Throughout the experiment, the target vessel remained stationary. The study primarily assesses the distance between the measurement sensor and the main observation ship, Yihai157, using the enhanced fusion algorithm tailored for marine environments, and compares the results with the actual ship positions derived from GNSS and AIS data to evaluate measurement accuracy. The large swaying occurred between frames 480 and 700; shooting before and after the swaying was relatively stable, and the ranging error at the moments of waves was larger than at other moments.
Table 2 and Table 3 give the details of the video sequences, and Table 4 summarizes the experiment. Part of the video frames are shown in Figure 7.
4.3. Evaluation Criteria
In this paper, the localization performance and anti-shaking performance are evaluated. For related vSLAM work, localization and mapping performance can be evaluated by the root-mean-square Absolute Position Error (APE) of each trajectory. In this experiment, the restored trajectory is sampled according to the GPS and AIS position refresh rate, and the absolute distance error between each trajectory sampling point and the target ship Yihai157 is calculated. The overall ranging accuracy is evaluated by the average positioning error, while the map restoration performance under shaking is measured by the distance between the target ship and the sampling points of the map trajectory corresponding to the video segment in which shaking occurs.
Average positioning error:
$$E_{avg} = \frac{1}{N}\sum_{i=1}^{N} \frac{\left| d_i - \hat{d}_i \right|}{\hat{d}_i},$$
where $M$ is the trajectory map established by SLAM; $(x_i, y_i) \in M$ are the coordinates of the sampling points in the map; $d_i$ represents the distance between sampling point $i$ and the target point; and $\hat{d}_i$ is the corresponding reference distance derived from GNSS and AIS.
Error of the swaying sampling points:
$$E_{sway} = \frac{1}{\left| S \right|}\sum_{i \in S} \frac{\left| d_i - \hat{d}_i \right|}{\hat{d}_i},$$
where $S$ is the set of sampling points within the video segment in which shaking occurs.
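Assuming both metrics are relative ranging errors (consistent with the percentages reported in Section 4.5), they can be computed as in the short sketch below.

```python
import numpy as np

def ranging_errors(est_dists, ref_dists, sway_mask):
    """Average relative ranging error over all sampling points and over the points
    that fall inside the swaying video segment (sway_mask is a boolean array)."""
    est, ref = np.asarray(est_dists, float), np.asarray(ref_dists, float)
    rel = np.abs(est - ref) / ref
    return rel.mean(), rel[np.asarray(sway_mask, bool)].mean()
```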
4.4. Experimental Evaluation
Firstly, the mapping and ranging effects of the proposed system in marine video sequences are evaluated. The mapping phase of the experiment is conducted on a virtual machine running Ubuntu 16.04.7, while the ranging task is executed using the MATLAB 2024a compiler. Both tasks utilize an Intel (R) Xeon (R) Gold 5222 CPU @ 3.80 GHz with 35 GB of RAM, demonstrating real-time performance without NVIDIA GPU acceleration.
According to the target area obtained in Figure 8, the regional target features are extracted and the target feature set is established in preparation for relocation after large shaking. Figure 8 and Figure 9 show the preprocessing part of this experiment. In the figure, the top left shows the feature points at the start of the first frame, while the top right shows the feature points after a period of movement; it can be seen that the feature points in the background of the detection box have been filtered out. The bottom left is the video frame when the shaking occurs, and the bottom right is the end of the video. A further quantitative analysis of map construction accuracy shows that the maximum relative error of the enhanced algorithm proposed in this paper is 9.2% in oceanic environments, as shown in Table 4.
The performance of the mapping and ranging system proposed in this paper is evaluated according to the experimental evaluation criteria defined in Section 4.3. The average ranging error is maintained within 10%, demonstrating high accuracy for marine target ranging at short and medium ranges and thereby meeting engineering requirements.
4.5. Comparison with Other Algorithms
In the video sequence dataset collected for this experiment, various surface scenarios were captured with the vessels in a stationary anchorage state. The trajectory design ensured comprehensive coverage of the ship from bow to stern, and a moment of camera shake caused by natural wave fluctuations is included in the sequence. The target regions identified in Figure 8 and the preprocessing results in Figure 9, described in Section 4.4, form the basis of the following comparison.
In Figure 10, panel (a) illustrates the ship’s trajectory with a red line, while blue points represent the target feature points. Notably, the bridge and stern of the ship exhibit a higher concentration of feature points than the hull. The recovered trajectory shows two significant jumps, indicating substantial camera shake due to waves, which our proposed system effectively mitigated. For the YOLO_ORBSLAM2 (b) and water segmentation SLAM (c) mapping results, red indicates the recovered trajectory and blue marks the associated feature points. Some feature points in the water segmentation SLAM are influenced by wave splashes. YOLO_ORBSLAM2 experienced considerable drift after losing sight of the main vessel, while the water segmentation SLAM struggled to maintain a complete trajectory due to waves. The U-net dynamic environment segmentation SLAM (d) lost the ship at the shaking moments, although its water segmentation did not itself introduce larger errors.
We evaluated the performance of the proposed mapping and ranging system according to the experimental criteria established in Section 4.3. The maximum relative error reflects the pose recovery performance of the SLAM system during significant shaking, while the average relative distance error indicates the overall ranging accuracy, as shown in Figure 11.
The HSV-based water segmentation SLAM was initially designed to extract non-water artificial features from green water areas, where the distinction between green water and nearby artificial structures is clear. However, in non-coastal marine environments, the HSV segmentation may inadvertently include parts of grayish-white vessels, adversely affecting overall position estimation and repositioning accuracy. Given that water segmentation SLAM failed after significant shaking—potentially due to the similarity in color between the water and the vessels—we calculated the ranging accuracy only for the motion prior to the shaking.
The YOLO_ORBSLAM2 method shares similarities with our proposed approach, using the GNSS coupling method to optimize trajectory estimation in the xy-plane. During calmer periods, the differences between YOLO_ORBSLAM2 and our method were minimal, as the extracted target region predominantly contained artificial feature points, with background points having little influence on pose estimation. Although the average positioning error was low at mid-range distances, it increased with trajectory length, highlighting cumulative error concerns.
The U-net water removal SLAM method stops at the moment of shaking. When computing the overall average error, we therefore do not take the last shaking keyframe into account: its position estimate comes entirely from the underlying ORB-SLAM3 system and corresponds to only a single keyframe, and synchronizing this frame with GNSS information would make the analysis and comparison meaningless. This also explains how the swaying error in the middle part of Table 4 is calculated. However, before the shaking, while the ship is sailing stably, its accuracy is 4.4%, which indicates that reasonable water surface removal can meet the general requirements of surface SLAM.
Table 4 presents a comparison of accuracy across different ranging algorithms. Our method demonstrates superior performance in both shaking trajectory error (9.2%) and average error (4.7%). In contrast, YOLO_ORBSLAM2 and HSV_Kmeans_SLAM exhibited significantly higher errors, with shaking trajectory errors of 12.3% and 23.7% and average errors of 10.3% and 18.6%, respectively. These results underscore the advantages of our proposed method in terms of stability and accuracy in dynamic environments.
Modifications applied to these sequences are detailed in Section 4.2. The YOLO_ORBSLAM2 and water segmentation SLAM methods relied on their respective localization functionalities and the inherent capabilities of the ORB-SLAM system. Their mapping performance on the collected video sequences is summarized in Table 4 and Figure 12. YOLO_ORBSLAM2 repositioned slowly during vibrations, completing only 240 frames of the first video, and struggled to achieve effective localization during oscillations. Water segmentation SLAM, which focuses on separating the vessels from the water, was less affected by vibrations but exhibited significant mismatches in the estimated camera motion.
The HSV algorithm for water segmentation treats both the sky and the ocean as gray. Rough sea conditions may thus lead to the removal of parts of white vessels and large areas of sky, complicating the removal of certain spray particles. Compared with YOLO_ORBSLAM2, the method demonstrated lower trajectory mapping accuracy under relatively stable sea conditions. Additionally, its enhanced keyframe initialization function did not adequately adapt to environments with substantial surface oscillation.
In summary, the YOLO_ORBSLAM2 system is ill-suited for marine environments because it cannot mitigate significant oscillations. In our experiments on the collected video sequences, the ORB-SLAM system failed to map effectively during pronounced shaking, as the extracted feature points included elements of the sea surface. Using a monocular camera to detect marine targets, YOLO_ORBSLAM2 struggled to overcome the shaking and could not fully recover the camera’s position and orientation throughout the video sequence. The SLAM system with HSV-based water removal encountered issues with water segmentation thresholds: unlike in coastal green-blue water areas, it segmented the open marine environment inadequately, diminishing the accuracy of mapping and ranging and making it unsuitable for marine contexts. The U-net water segmentation method shows that correct segmentation can, to a certain extent, solve the problem of locating ship targets at sea, but unfortunately it stalls in the shaking environment. In contrast, our proposed improved fusion algorithm effectively addresses the challenges of low precision and severe shaking in marine environments, consistently achieving high-accuracy tracking over short to medium ranges.