Article

SS-YOLO: A Lightweight Deep Learning Model Focused on Side-Scan Sonar Target Detection

by Na Yang 1, Guoyu Li 2, Shengli Wang 1, Zhengrong Wei 1,*, Hu Ren 1, Xiaobo Zhang 1 and Yanliang Pei 3
1 College of Ocean Science and Engineering, Shandong University of Science and Technology, Qingdao 266590, China
2 Qingdao Xiushan Mobile Mapping Co., Ltd., Qingdao 266590, China
3 First Institute of Oceanography of Ministry of Natural Resources, Qingdao 266061, China
* Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2025, 13(1), 66; https://doi.org/10.3390/jmse13010066
Submission received: 10 December 2024 / Revised: 30 December 2024 / Accepted: 30 December 2024 / Published: 2 January 2025
(This article belongs to the Special Issue Application of Deep Learning in Underwater Image Processing)

Abstract

As seabed exploration activities increase, side-scan sonar (SSS) is being used more widely. However, distortion and noise during the acoustic pulse’s travel through water can blur target details and cause feature loss in images, making target recognition more challenging. In this paper, we improve the YOLO model in two aspects: lightweight design and accuracy enhancement. The lightweight design is essential for reducing computational complexity and resource consumption, allowing the model to run more efficiently on edge devices with limited processing power and storage, and thus meeting our need to deploy SSS target detection algorithms on an unmanned surface vessel (USV) for real-time target detection. First, we replace the original complex convolutional method in the C2f module with a combination of partial convolution (PConv) and pointwise convolution (PWConv), reducing redundant computations and memory access while maintaining high accuracy. In addition, we add an adaptive scale spatial fusion (ASSF) module that uses 3D convolution to combine feature maps of different sizes, maximizing the extraction of invariant features across various scales. Finally, we use an improved multi-head self-attention (MHSA) mechanism in the detection head, replacing the original complex convolution structure, to enhance the model’s ability to focus on important features with low computational load. To validate the detection performance of the model, we conducted experiments on the combined side-scan sonar dataset (SSSD). The results show that our proposed SS-YOLO model achieves average accuracies of 92.4% (mAP 0.5) and 64.7% (mAP 0.5:0.95), outperforming the original YOLOv8 model by 4.4% and 3%, respectively. In terms of model complexity, the improved SS-YOLO model has 2.55 M parameters and 6.4 GFLOPs, significantly lower than those of the original YOLOv8 model and similar detection models.

1. Introduction

In modern marine surveying, side-scan sonar (SSS) technology is widely used [1,2,3]. With the increase in marine development activities, its applications have continued to expand, and it is now used in fields such as underwater target search, seabed geomorphological surveys, pipeline inspections, and marine environmental studies. Side-scan sonar equipment transmits acoustic pulses toward the seabed or underwater targets through sensors mounted on an unmanned surface vessel (USV) or an autonomous underwater vehicle (AUV). When the transmitted pulses encounter underwater objects or the seabed, they are reflected and generate scattered echoes [4]. The intensity of the scattered echoes is influenced by factors such as the shape and material of the target, while the return time difference is affected by the position of the target. SSS records the scattered echoes and uses them to generate a two-dimensional sonar image of the underwater surface, which indicates the position and status of the target.
In recent decades, manual feature extraction has been the primary method for target identification in SSS images. These approaches include support vector machine (SVM) [5,6], singular value decomposition (SVD) [7,8], and independent component analysis (ICA) [9,10], among others. They rely on image pixel features, image variations, grayscale thresholds, or known prior information, along with manually designed filters, to accomplish underwater target detection tasks. However, in addition to being laborious and resource-intensive, many of these approaches have drawbacks such as inadequate robustness, excessive complexity, and low detection accuracy [11].
Because of the remarkable performance and broad application of convolutional neural networks (CNNs) in standard optical images, scholars have recently begun to pay more attention to the use of deep learning techniques for target detection in SSS images. Kong et al. [12] proposed a model that uses a dual-path network and a fusion transition module for feature extraction. In addition, this model uses a dense connection technique to enhance multi-scale prediction, enabling object localization and classification under low signal-to-noise ratios and small effective sample sizes. Wang et al. [11] proposed the use of multi-scale convolution and attention mechanisms with global receptive fields to obtain multi-scale semantic features of sonar images and enhance the correlation between features. Zhang et al. [13] improved the YOLOv7 network by adding the Contextual Transformer (CoT) module in the backbone and the Coordinate Attention (CA) module in the neck to enhance the model’s feature extraction capabilities. Additionally, they reconstructed the combined network features based on the BiFPN structure. Wen et al. [14] used the Swin-Transformer for dynamic attention and global modeling, which enhanced the model’s attention to the desired regions inside SSS images containing large amounts of irrelevant information. They also incorporated a feature scaling factor into the model to address the uncertainty of geometric features in SSS target characteristics. In their subsequent research, Wen et al. [15] proposed a new YOLOv7 model that combines attention mechanisms and scaling factors for underwater SSS target detection. This model builds on their previous findings by incorporating scaling factors into the improved YOLOv7 architecture, addressing the issue of low detection accuracy caused by the uncertainty of geometric features in SSS targets. The new YOLOv7 model achieved a 2.58% increase in mAP compared to their earlier YOLOv7 model, and a 9.28% increase compared to the original YOLOv7 model.
Mittal et al. [16] evaluated the performance of several traditional lightweight object detection models on well-known datasets such as MS-COCO and PASCAL-VOC, demonstrating the effectiveness of these models on edge devices. They listed the applications of lightweight object detection models in areas such as autonomous driving, robotic vision, intelligent transportation, and industrial quality inspection. They also suggested that future research should continue to optimize lightweight object detection models to improve their practicality and performance on edge devices. Liu et al. [17] proposed a lightweight object detection algorithm for robots based on an improved YOLOv5. They introduced C3Ghost and GhostConv modules into the YOLOv5 backbone network to reduce model parameters while maintaining detection speed and accuracy. Additionally, DWConv modules were incorporated into the YOLOv5 neck network to further decrease model parameters and enhance feature fusion speed. Lang et al. [18] proposed a lightweight object detection framework called MSF-SNET for the real-time detection of airborne remotely sensed images in resource-limited situations. The framework uses SNET as a backbone network to reduce parameter and computational complexity while enhancing small object detection through multi-scale feature fusion. Zhang et al. [19] proposed MSFF-Net, a Multi-scale Spatiotemporal Feature Fusion Network, for video saliency prediction. The network uses 3D convolutions and a Bi-directional Temporal-Spatial Feature Pyramid (BiTSFP) to fully exploit spatiotemporal features. An Attention-Guided Fusion (AGF) mechanism and Frame-wise Attention (FA) module are introduced to adaptively learn fusion weights and emphasize useful frames. MSFF-Net outperforms existing methods in accuracy on DHF1K, Hollywood-2, and UCF-sports datasets.
In summary, the current challenges in target detection tasks for SSS images include a limited number of dataset samples, high model complexity, and poor detection accuracy. SSS images typically have low resolution, complex seabed backgrounds, and a variety of underwater targets, making object detection particularly difficult. Traditional object detection methods are often inefficient in these special environments, especially on edge devices, where limited computational resources and storage space exacerbate these issues. To address these challenges, lightweight object detection models and feature fusion techniques have become key research directions. A lightweight model greatly reduces computational complexity and resource consumption, making object detection on edge devices more efficient. Feature fusion can effectively integrate multi-scale information to compensate for the recognition difficulties caused by low resolution and a complex background. Therefore, combining lightweight object detection with feature fusion can ensure high detection accuracy while improving efficiency and real-time performance, making the model more suitable for real-time object detection tasks in resource-limited environments. Single-stage object detection models are better suited for embedding in hardware devices to enable real-time target recognition in SSS images because they typically offer higher accuracy, faster speed, and better real-time performance [20]. The YOLOv8 [21,22] network is used as the foundational model in this paper. It can be used for tasks such as instance segmentation, object detection, and image classification. The streamlined design of YOLOv8 makes it easily adjustable to a variety of hardware platforms, from cloud to edge devices, and suited to a wide range of applications [23].
The following are this study’s primary contributions:
(1) We combined self-collected SSS data with public datasets to create a new side-scan sonar dataset (SSSD). Through data augmentation, we expanded the original dataset, addressing the issue of a limited sample size and few target types in SSS data. This dataset can now be used for performance evaluation of various sonar target detection methods.
(2) To reduce the complexity of the convolution process, we introduced a combination of the partial convolution (PConv) and the pointwise convolution (PWConv) modules, replacing the original Bottleneck module in C2f. Under resource-constrained conditions, PConv performs convolution operations only on a portion of the input feature map, followed by PWConv, significantly reducing redundant computations and memory access while maintaining relatively high accuracy.
(3) The adaptive scale spatial fusion (ASSF) structure was designed to integrate information from multi-scale feature maps, addressing the feature fusion issue in low-resolution side-scan images. It processes and concatenates feature maps of different sizes, forming a composite view with unified dimensions. Then, 3D convolution is applied to extract scale sequence features, enhancing target recognition.
(4) The detection head incorporates an improved multi-head self-attention (MHSA) mechanism. This method not only fully leverages contextual information to improve detection accuracy but also ensures detection efficiency by using the attention mechanism without the feed-forward network (FFN) layer.

2. Data

The side-scan sonar dataset (SSSD) used in this study consists of two parts: one part is sourced from the publicly available sonar common target detection dataset (SCTD) [24], and the other part was collected by our team using a USV equipped with a multi-beam and SSS collection system, gathered in Weifang, China.

2.1. Data Collection

The proprietary data were collected using the SS900F (Qingdao Hydro-tech Marine Technology Co. Ltd., Qingdao, China) SSS device, which is mounted on the unmanned surface vessel, as shown in Figure 1. The imaging principle of this type of sonar is to transmit sound waves through a transmitting transducer, forming multiple narrow beams distributed vertically along the navigation direction within a certain space, and to record the echoes to obtain multiple channels of information.
The technical specifications of the SS900F SSS are also provided in Table 1. The data collection area is an operational wind farm, where the seafloor terrain is relatively flat with minor undulations. The majority of the surveyed sea area has a depth of less than 10 m. The seafloor features gentle undulations, with slope gradients ranging from 0.60‰ to 0.85‰ within the 0–5 m depth range, and approximately 0.23‰ within the 5–10 m depth range.
We used the specialized sonar processing software SonarWiz (v7.09.03) to process the raw data from each collected SSS line. First, the SSS data were imported under the correct coordinate system. Then, the data were filtered, gain-adjusted, and color-adjusted to achieve the desired visual effect. For filtering, a low-pass filter, selected through the “Filters” function in the SonarWiz menu, was used to remove background noise and high-frequency interference generated by the sensor itself. For gain adjustment, the “Automatic Gain Control” function under “Gain Control” was selected, allowing SonarWiz to automatically adjust the gain to optimize image quality. For color adjustment, “Color Palette” was selected and the default color settings were used to enhance image visualization. Two types of target objects were selected for target detection and classification: the submerged portions of the wind turbine pile foundation and the underwater artificial reef. The processed SSS images, as shown in Figure 2, were cropped to 640 × 640 pixels for each image containing a target object.
In this study, the processed SSS images of the wind turbine pile foundation and underwater artificial reef are combined with data from the publicly available SCTD to form a new dataset, named SSSD. SCTD contains three categories of data samples, with a total of 363 samples, including 57 aircraft, 34 humans, and 271 ships. The detection targets in the SSSD are ultimately divided into the following five categories: aircraft, human, ship, foundation, and reefs. The LabelImg image annotation tool was used to annotate the category and location of each target object in the images.

2.2. Data Post-Processing

To avoid long-tail effects during model training, we ensured a relative balance in the number of images for each target class when merging and forming the new dataset. The newly constructed dataset specifically addresses the issue of small sample sizes in the original dataset, resulting in a more balanced distribution of samples across classes, as shown in Figure 3. The original SCTD has only 357 images. After incorporating our experimental data, the total number of images in the SSSD was 682. The dataset was then randomly split into training, validation, and test sets using stratified sampling, so that each subset reflects the overall class distribution of the dataset: 80% of the images are assigned to the training set, 15% to the validation set, and the remaining 5% to the test set. During the stratified sampling process, samples are assigned to each subset at random within each class, without considering the specific state or characteristics of the images. This ensures adequate representation of each class in all subsets and avoids underrepresentation of any class. The training set is used to adjust the model weights through optimization algorithms such as backpropagation and gradient descent. The validation set is used to evaluate model performance and tune the hyperparameters: at the end of each training cycle, the model is evaluated on the validation set to determine whether there is overfitting or underfitting and to select the best hyperparameters. The test set, which contains data that the model has not seen during the training and validation phases, is used for the final evaluation of the model’s generalization ability and simulates the model’s performance in real-world applications.
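The stratified 80/15/5 split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `stratified_split`, the one-label-per-image input, and the toy class counts are all assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.80, 0.15, 0.05), seed=0):
    """Split sample indices into train/val/test while preserving class ratios.

    labels: list where labels[i] is the class name of image i
            (hypothetical input; one label per image is assumed).
    Returns three lists of indices.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)  # random assignment within each class
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test

# Toy labels; these class counts are illustrative, not the SSSD counts.
labels = ["ship"] * 100 + ["reef"] * 60 + ["human"] * 40
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))  # 160 30 10
```

Because the shuffle happens inside each class, every subset keeps roughly the same class proportions as the full dataset, regardless of the random seed.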
Given that the sample size of this training set is still small and may lead to overfitting problems, we performed data augmentation to expand the dataset. By using image augmentation techniques, we increased the diversity and quantity of the training data. This is crucial when dealing with limited datasets, as it allows the model to learn from a wider range of variations and conditions, thereby improving its generalization and accuracy in real-world scenarios. In this study, we applied various data augmentation techniques, such as adding random noise, adjusting image brightness, cutout, random rotation, cropping, translation, and mirroring to synchronously augment the original images and annotations. Each image was augmented by at least one method, and every original image was expanded into five images. We counted the total number of data samples in the training set after data augmentation. Since each image may contain multiple objects, the total number of data samples in the training set is greater than the number of images, with the total number of data samples being 5130. The sample quantity and size distribution are shown in Figure 3. Among them, the dataset includes 385 samples for aircraft, 190 for humans, 1280 for ships, 980 for foundations, and 2295 for reefs.
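To illustrate what augmenting images and annotations synchronously means, the sketch below mirrors an image and updates its bounding boxes in the same step. It assumes normalized YOLO-style labels (class, x_center, y_center, width, height); the annotation format actually used by the authors is not stated, and `mirror_horizontal` is a hypothetical helper.

```python
def mirror_horizontal(image, boxes):
    """Mirror an image left-right and update its boxes to match.

    image: 2D list of pixel rows (a stand-in for a real image array).
    boxes: list of (class_id, x_center, y_center, width, height), all
           normalized to [0, 1] (assumed YOLO-style label format).
    """
    flipped_image = [row[::-1] for row in image]
    # Only the horizontal center moves; the box size and vertical center stay put.
    flipped_boxes = [(c, 1.0 - x, y, w, h) for c, x, y, w, h in boxes]
    return flipped_image, flipped_boxes

image = [[0, 0, 0, 9],
         [0, 0, 0, 9]]
boxes = [(2, 0.875, 0.5, 0.25, 1.0)]  # a target hugging the right edge

new_image, new_boxes = mirror_horizontal(image, boxes)
print(new_image)  # [[9, 0, 0, 0], [9, 0, 0, 0]]
print(new_boxes)  # [(2, 0.125, 0.5, 0.25, 1.0)]
```

Geometric augmentations such as rotation, cropping, and translation need the same kind of coordinate update, whereas noise and brightness changes leave the annotations unchanged.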

3. Method

To deploy the sonar target detection algorithm on our self-developed autonomous vessel for real-time detection, this research focuses on making the model lightweight without compromising performance. The aim is to reduce the cost of model training and deployment, enabling the integration of the model on mobile and embedded devices. To effectively address this challenge, we selected the single-stage YOLOv8 model as the core of the algorithm and enhanced it specifically for underwater target detection.

3.1. The Network Structure

To meet various application objectives, YOLOv8 offers several network scales, including n, s, m, l, and x [25]. By modifying network scaling factors (hyperparameters such as widen_factor, deep_factor, and ratio), several different scales can be created. In particular, the widen_factor modifies the number of channels in each layer to alter the network’s width, and the deep_factor modifies the number of repeats of specific structures inside layers to alter the network’s depth. At the end of the network, the ratio is used to modify the number of channels in the feature maps [26]. Higher scaling factors can lead to deeper, wider networks that can learn more complicated characteristics, but they also increase the computational load. Considering that the SSS target detection task involves fewer categories, a smaller-scale network is sufficient to meet accuracy requirements while achieving faster speeds and lower memory access costs. Therefore, the model is improved based on YOLOv8n, by setting the hyperparameters widen_factor to 0.25, deep_factor to 0.33, and ratio to 2.0.
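As a rough illustration of how the three scaling factors act, the sketch below applies widen_factor, deep_factor, and ratio to an unscaled configuration. The base channel and repeat counts, the rounding to multiples of 8, and the helper names are assumptions chosen for the example, not values given in this paper.

```python
import math

def make_divisible(value, divisor=8):
    """Round a channel count up to the nearest multiple of `divisor`."""
    return math.ceil(value / divisor) * divisor

def scale_network(base_channels, base_repeats, widen_factor, deep_factor, ratio):
    """Return per-stage channel counts and block repeats after scaling."""
    # The last stage's channel count is first multiplied by `ratio`.
    stretched = base_channels[:-1] + [base_channels[-1] * ratio]
    channels = [make_divisible(c * widen_factor) for c in stretched]
    # Every stage keeps at least one repeated block.
    repeats = [max(round(n * deep_factor), 1) for n in base_repeats]
    return channels, repeats

# Hypothetical unscaled configuration (assumed for illustration only).
base_channels = [64, 128, 256, 512, 512]
base_repeats = [3, 6, 6, 3]

channels, repeats = scale_network(base_channels, base_repeats,
                                  widen_factor=0.25, deep_factor=0.33, ratio=2.0)
print(channels)  # [16, 32, 64, 128, 256]
print(repeats)   # [1, 2, 2, 1]
```

With the settings quoted above (0.25, 0.33, 2.0), the width shrinks to a quarter and the repeated blocks to about a third, which is where the small parameter count of the n-scale network comes from.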

3.2. Improved Algorithm

Due to the limitation of the hydroacoustic communication bandwidth, it is relatively difficult to transmit large-volume data over long distances between the USV and the shore base station. Thus, the USV needs to independently complete the data processing and target detection tasks during operation and transmit the real-time detection results to the shore operator. In order to achieve efficient underwater target detection on the USV platform, it is necessary to lighten the detection model to improve the detection speed and minimize the consumption of computational resources while ensuring high detection accuracy. Although the YOLOv8 algorithm is already highly efficient, its performance for low-resolution, long-range SSS target recognition is limited by its feature extraction capability [27]. Therefore, it is necessary to design lightweight detection models that can efficiently extract structural features. This will ensure that the models can accurately and quickly complete target detection tasks with limited computational resources, meeting the needs of AUVs to independently perform underwater detection.
Based on these considerations, we improved the original YOLOv8 network, as illustrated in Figure 4. First, to address the complexity of the convolution operations in the original C2f module and the repetitive design of the Bottleneck module, we introduced the partial convolution (PConv) [28] structure from FasterNet, which reduces computation and memory access by processing only part of the input channels. Second, we integrated an adaptive scale spatial fusion (ASSF) module into the neck of the model, which preserves the structural features of the targets to the greatest extent, addressing the problem of information loss and degradation during the intermediate propagation of SSS image features and thereby improving detection accuracy. Finally, we incorporated an improved multi-head self-attention (MHSA) mechanism into the detection head. This approach not only fully leverages contextual information to enhance detection precision, but also ensures detection efficiency by using an attention mechanism without an FFN layer.

3.2.1. Fast-C2f

The C2f module is a key component of the YOLOv8 network backbone. Through operations such as feature transformation, branch processing, and feature fusion, it extracts and transforms the input data’s features, generating output with stronger representational ability. These operations rely on a combination of multiple Bottleneck layers and complex convolutional operations. Although this improves the feature extraction capability of the network and allows it to better adapt to complex data tasks, it comes at the cost of significantly more parameters and higher computational complexity, leading to slower inference and higher computational overhead. This becomes especially problematic in contexts where real-time detection is critical, such as when deploying YOLOv8 on devices with limited resources, like edge devices or mobile platforms. Therefore, inspired by the FasterNet network design pattern, we introduced the partial convolution (PConv) and pointwise convolution (PWConv) modules to build a new FasterBlock module [28]. The new FasterBlock module is embedded into the original C2f structure of YOLOv8, replacing the original Bottleneck module and forming the new Fast-C2f structure. This approach reduces the number of parameters and speeds up the computation by reducing unnecessary convolution operations. Figure 4d depicts the Fast-C2f structure.
As shown in Figure 4d, the new PConv layer and PWConv layer act as the primary operators of the new FasterBlock module. Specifically, the PConv layer is followed by two PWConv layers to fuse information from different channels. The first PWConv layer fuses the features of all channels and doubles the number of channels, while the second PWConv layer restores the number of channels to the original. Between these two PWConv layers, batch normalization and activation layers are applied, along with shortcut connections for efficiently reusing the input feature maps.
PConv optimizes memory access costs and saves computational resources by reducing feature map redundancy. Because feature maps are highly similar across channels, PConv applies regular spatial feature extraction to only a subset of the input channels and leaves the other channels untouched. For continuous or regular memory access, let the total number of channels be $c$; the first or last $c_p$ consecutive channels are selected to represent the entire feature map for spatial feature extraction. The design of PConv and its combination with PWConv are shown in Figure 5.
When the input and output feature maps have the same number of channels, the FLOPs of PConv are
$$h \times w \times k^2 \times c_p^2 \tag{1}$$
The memory access of PConv is
$$h \times w \times 2c_p + k^2 \times c_p^2 \approx h \times w \times 2c_p \tag{2}$$
where $h$ and $w$ are the height and width of the feature map and $k$ is the kernel size. With a ratio of $r = c_p / c = 1/4$, the FLOPs of PConv in Equation (1) are only 1/16 of those of a regular Conv, and its memory access is only 1/4 of that of a regular Conv. PConv is followed by PWConv, and at this point, the effective receptive field on the input feature map forms a T-shaped Conv structure [29]. Compared to directly implementing a T-shaped convolution structure, this decomposed convolution fully exploits the redundancy between filters, further reducing computational demands. The FLOPs of the combined T-shaped Conv, which makes full use of the information from the remaining channels, are calculated as follows:
$$h \times w \times \left( k^2 \times c \times c_p + c \times (c - c_p) \right) \tag{3}$$
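The ratios quoted above can be checked numerically. The sketch below evaluates the PConv FLOPs $h \times w \times k^2 \times c_p^2$, the dominant memory access term $h \times w \times 2c_p$, and the T-shaped expression $h \times w \times (k^2 \times c \times c_p + c \times (c - c_p))$. The feature-map size and channel count are hypothetical, and counting the PWConv as a standard 1 × 1 convolution from $c$ to $c$ channels is an assumption of this sketch rather than a formula from the text.

```python
def regular_conv_flops(h, w, k, c):
    """FLOPs of a k x k convolution with c input and c output channels."""
    return h * w * k * k * c * c

def pconv_flops(h, w, k, c_p):
    """FLOPs of PConv: a k x k convolution over only c_p channels."""
    return h * w * k * k * c_p * c_p

def memory_access(h, w, channels):
    """Approximate memory access, keeping only the dominant h*w*2*channels term."""
    return h * w * 2 * channels

def t_shaped_flops(h, w, k, c, c_p):
    """T-shaped expression: h*w*(k^2*c*c_p + c*(c - c_p))."""
    return h * w * (k * k * c * c_p + c * (c - c_p))

def pconv_plus_pwconv_flops(h, w, k, c, c_p):
    """PConv followed by a 1x1 convolution from c to c channels (assumed)."""
    return pconv_flops(h, w, k, c_p) + h * w * c * c

h, w, k, c = 80, 80, 3, 64  # hypothetical feature-map size and channel count
c_p = c // 4                # r = c_p / c = 1/4

flops_ratio = pconv_flops(h, w, k, c_p) / regular_conv_flops(h, w, k, c)
memory_ratio = memory_access(h, w, c_p) / memory_access(h, w, c)
print(flops_ratio, memory_ratio)  # 0.0625 0.25

t_shaped = t_shaped_flops(h, w, k, c, c_p)
decomposed = pconv_plus_pwconv_flops(h, w, k, c, c_p)
print(t_shaped, decomposed)
```

Under these assumptions the decomposed PConv plus PWConv pair needs fewer FLOPs than the direct T-shaped convolution, which matches the claim that the decomposition reduces computational demands.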

3.2.2. Adaptive Scale Spatial Fusion

According to scale space theory, the scale space is built along an image’s scale axis, which denotes the range of possible scales for an object. Rather than simply modifying the image size, images of different scales are generated by applying Gaussian filtering with varying degrees of blurring to the original image. As a result, the larger the scale value, the blurrier the generated image [30]. The target’s structural characteristics in the image, however, remain unchanged despite the scale shift. Due to the complexity of the underwater environment, the acoustic signal suffers from deformation and noise pollution as it propagates through the water, which leads to low image illumination, blurred details, and feature loss, and thus increases the difficulty of target detection. Therefore, making full use of the invariance of the target’s structural features and fusing image features of different scales is particularly important to improve detection accuracy.
In low-resolution SSS images, although blurry targets may lose some detail, the structural characteristics of the targets remain unchanged. However, when using the original classical FPN for multi-scale feature fusion, the high-level features, during interaction and propagation with the low-level features, often experience information loss or degradation after passing through multiple intermediate scales, reducing the effectiveness of the feature fusion between non-adjacent layers [31,32].
Therefore, modifying the feature pyramid structure to fully utilize the high-level features with low resolution and high semantic information is crucial for target recognition in low-resolution SSS images. Drawing on the design concepts of the feature fusion module by Kang et al. [33], we introduced an improved adaptive scale spatial fusion module (ASSF). This module consists of two parts: the Triple Feature Encoding module (TFE) and the Scale Sequence Feature Fusion module (SSFF).
Figure 4e shows the structure of the TFE module. First, a convolution operation is used to adjust the number of feature channels so that the three feature maps of different sizes have the same channel count. Let the original middle-size channel count be 1C; after convolution, the channel count of the large-size feature map is adjusted from the original 0.5C to 1C, and the channel count of the small-size feature map is adjusted from the original 2C to 1C.
Then, the large-size feature map is downsampled and the small-size feature map is upsampled. For downsampling, a hybrid structure consisting of maximum pooling and average pooling is employed, which helps to reduce the number of parameters and the computational load on the network. For the small-size feature map, nearest neighbor interpolation is used for upsampling, which raises the low-resolution feature map to the target resolution and allows the model to better learn and predict complex features in the data. Finally, the three feature maps of large, middle, and small sizes, now with the same dimensions, are concatenated along the channel dimension as follows:
$$F_{TFE} = \mathrm{Concat}(F_l, F_m, F_s) \tag{4}$$
where $F_{TFE}$ denotes the feature maps output by the TFE module, and $F_l$, $F_m$, and $F_s$ stand for the large, middle, and small feature maps, in that order. Concatenation of $F_l$, $F_m$, and $F_s$ yields $F_{TFE}$, which has three times the channel number of $F_m$ and the same spatial dimensions.
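The resampling and concatenation steps of the TFE module can be sketched on toy data as follows. Feature maps are represented as lists of channels, each channel a 2D list; the channel-aligning convolution is assumed to have been applied already, and summing the max-pooled and average-pooled maps is an assumption, since the text does not state how the two pooling results are combined.

```python
def downsample_2x(fmap):
    """Halve the resolution using max pooling plus average pooling (2x2 windows).

    Summing the two pooled values is an assumption of this sketch.
    """
    out = []
    for i in range(0, len(fmap), 2):
        row = []
        for j in range(0, len(fmap[0]), 2):
            window = [fmap[i][j], fmap[i][j + 1],
                      fmap[i + 1][j], fmap[i + 1][j + 1]]
            row.append(max(window) + sum(window) / 4)
        out.append(row)
    return out

def upsample_2x(fmap):
    """Double the resolution with nearest neighbor interpolation."""
    out = []
    for row in fmap:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def tfe(large, middle, small):
    """Bring all maps to the middle resolution and concatenate the channels."""
    f_l = [downsample_2x(channel) for channel in large]
    f_s = [upsample_2x(channel) for channel in small]
    return f_l + middle + f_s  # concatenation along the channel dimension

# One channel per scale for brevity (C = 1): 4x4, 2x2, and 1x1 maps.
large = [[[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]]
middle = [[[1, 1], [1, 1]]]
small = [[[7]]]

f_tfe = tfe(large, middle, small)
print(len(f_tfe))                       # 3 channels, i.e. three times C
print(len(f_tfe[0]), len(f_tfe[0][0]))  # 2 2, the middle resolution
```

The output has three times the channel count of the middle map and the same spatial size, as stated for $F_{TFE}$.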
The original YOLOv8 structure predicts image content by constructing multi-scale feature maps, creating three scales: P3, P4, and P5. However, simply using summation or concatenation methods to fuse pyramid features cannot effectively leverage the correlations between feature maps of different scales. Therefore, to better merge the multi-scale feature maps, the Scale Sequence Feature Fusion module (SSFF) is proposed.
As shown in Figure 4f, the feature maps P3, P4, and P5 are first convolved with a series of Gaussian kernels of increasing standard deviation [34,35], so that the number of channels across the three feature maps is unified, as follows:
$$F_\sigma(i, j) = \sum_{\mu} \sum_{\nu} f(i - \mu, j - \nu) \times G_\sigma(\mu, \nu) \tag{5}$$
$$G_\sigma(x, y) = \frac{1}{2\pi\sigma^2} e^{-(x^2 + y^2)/(2\sigma^2)} \tag{6}$$
where $f$ represents a two-dimensional (2D) feature map, $F_\sigma$ is generated by smoothing $f$ through a series of convolutions with a 2D Gaussian filter of increasing standard deviation $\sigma$, and $G_\sigma$ is the 2D Gaussian filter used.
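A small numerical sketch of the Gaussian smoothing step is given below. It samples $G_\sigma$ on a finite grid, normalizes the kernel, and convolves it with a feature map using zero padding; the kernel radius, the zero padding, and the chosen values of $\sigma$ are assumptions made for illustration.

```python
import math

def gaussian_kernel(sigma, radius):
    """Sample G_sigma on a (2*radius+1)^2 grid and normalize it to sum to 1."""
    kernel = []
    for x in range(-radius, radius + 1):
        row = []
        for y in range(-radius, radius + 1):
            g = math.exp(-(x * x + y * y) / (2 * sigma ** 2))
            row.append(g / (2 * math.pi * sigma ** 2))
        kernel.append(row)
    total = sum(sum(row) for row in kernel)
    return [[g / total for g in row] for row in kernel]

def smooth(fmap, kernel):
    """Convolve a 2D map with the kernel, treating out-of-range pixels as 0."""
    radius = len(kernel) // 2
    h, w = len(fmap), len(fmap[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for u in range(-radius, radius + 1):
                for v in range(-radius, radius + 1):
                    if 0 <= i - u < h and 0 <= j - v < w:
                        acc += fmap[i - u][j - v] * kernel[u + radius][v + radius]
            out[i][j] = acc
    return out

# A single bright pixel in the middle of a 5x5 map.
impulse = [[0.0] * 5 for _ in range(5)]
impulse[2][2] = 1.0

# Peak response after smoothing with increasing sigma (values are illustrative).
peaks = [smooth(impulse, gaussian_kernel(sigma, radius=2))[2][2]
         for sigma in (0.5, 1.0, 2.0)]
print(peaks)
```

The peak value drops as $\sigma$ grows, which is the increasing blur along the scale axis that the scale sequence is built from.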
Next, because the output feature maps from the above Gaussian smoothing have different resolutions, nearest neighbor interpolation is used to resize P4 and P5 to the same resolution as P3. Each feature map is then given a level dimension using the unsqueeze function, converting it from a 3D tensor [height, width, channel] to a 4D tensor [depth, height, width, channel]. Finally, the P3, P4, and P5 feature maps are concatenated along the depth dimension to form a general view.
The concatenated general view is fed into the 3D convolution block, which consists of 3D convolution, 3D batch normalization, and the Leaky ReLU activation function. The 3D convolution uses a 3 × 3 × 3 (depth, height, width) kernel with suitable padding and a stride of 1. By extending the convolution across the added depth dimension, the scale sequence features of the general view are extracted and the spatial feature information of the SSS images is used efficiently. Thus, deformed SSS targets can be recognized more reliably. Compared with the method of adding attention, this method of directly performing convolution operations to fuse spatial scale features is more lightweight and faster in calculation.
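The tensor shapes involved in building the general view and applying the 3D convolution can be traced with a short shape-only sketch. The P3/P4/P5 resolutions (80, 40, and 20 for a 640 × 640 input), the channel count of 64, a padding of 1, and an unchanged output channel count are all assumptions for illustration; only the 3 × 3 × 3 kernel and the stride of 1 come from the text.

```python
def general_view_shape(p3_shape, p4_shape, p5_shape):
    """Shape of the stacked view after resizing P4 and P5 to the P3 resolution.

    Inputs are (height, width, channel); output is (depth, height, width, channel).
    """
    height, width, channel = p3_shape
    for shape in (p4_shape, p5_shape):
        # Channel counts must already be unified before stacking.
        if shape[2] != channel:
            raise ValueError("channel counts differ")
    return (3, height, width, channel)

def conv3d_output_shape(view_shape, kernel=3, padding=1, stride=1):
    """Output shape of a 3D convolution that keeps the channel count (assumed)."""
    depth, height, width, channel = view_shape

    def out(size):
        return (size + 2 * padding - kernel) // stride + 1

    return (out(depth), out(height), out(width), channel)

# Assumed feature-map shapes for a 640 x 640 input with 64 unified channels.
view = general_view_shape((80, 80, 64), (40, 40, 64), (20, 20, 64))
print(view)                       # (3, 80, 80, 64)
print(conv3d_output_shape(view))  # (3, 80, 80, 64)
```

With a padding of 1, the 3 × 3 × 3 kernel at stride 1 preserves the depth, height, and width, so the fused output stays aligned with the P3 resolution.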

3.2.3. DetectSA

In the original YOLOv8 network structure, the design of the detection head is relatively complicated, and it accounts for almost half of the parameters of YOLOv8. This is because YOLOv8 uses a decoupled head to achieve target identification. However, since the underwater SSS target detection task only requires the identification of a few types of targets, such a complicated detection head is not required. Therefore, we restructured the detection head of YOLOv8, replacing the original detection through multiple convolutional operations with a lightweight self-attention mechanism, which improves the real-time performance of the model while ensuring detection accuracy.
In order to help the neural network understand which locations and material require more attention, attention methods are frequently used in object detection. In the area of natural language processing, the transformer self-attention mechanism has acquired a lot of traction and shown competitive outcomes [13,36,37]. In fact, when dealing with image-related tasks, each pixel can be viewed as a three-dimensional vector, where the number of image channels represents the dimensions. Therefore, an image can be considered as a collection of vectors fed into the model. When self-attention mechanisms are introduced into the model, the model can independently determine the shape and type of the receptive field, making it more effective in capturing critical information.
However, in traditional attention mechanisms, since the relationship between each element and all other elements needs to be calculated, it results in a significant increase in computational cost and memory requirements. Therefore, we introduced a self-attention mechanism without the fully connected FFN to capture contextual information [38]. As shown in Figure 6, we employed 1 × 1 convolutions both before and after the (MHSA) [39] to reduce and then increase the dimensions, thereby reducing the memory consumption of the self-attention layer.
Specifically, the 1 × 1 convolution layer before the MHSA is used to reduce the feature dimension from c1 to c2, thereby reducing the computational load of the attention mechanism. MHSA is employed to capture long-range dependencies within the features, enhancing the ability to detect targets. The 1 × 1 convolution layer after MHSA is used to restore the original feature dimension by increasing the feature dimension from c2 back to c1. Additionally, a 1 × 1 convolution is added as a bypass to add the output of the self-attention module to the initial input, forming a residual connection [40].
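To give a feel for why the channel bottleneck helps, the following sketch counts approximate multiply-accumulate operations (MACs) for self-attention applied directly at width c1 versus the reduce, attend, restore, and bypass layout described above. The counting formula ignores softmax, normalization, and biases, and the feature-map size and channel widths are hypothetical values chosen for illustration, not figures from the paper.

```python
def attention_macs(num_tokens: int, dim: int) -> int:
    """Approximate multiply-accumulates of one self-attention layer.

    Counts the Q, K, V and output projections (4 * N * C^2) plus the
    score matrix and the weighted sum of values (2 * N^2 * C).
    """
    projections = 4 * num_tokens * dim * dim
    attention = 2 * num_tokens * num_tokens * dim
    return projections + attention


def pointwise_conv_macs(num_tokens: int, c_in: int, c_out: int) -> int:
    """Multiply-accumulates of a 1 x 1 convolution over all spatial positions."""
    return num_tokens * c_in * c_out


def bottleneck_attention_macs(num_tokens: int, c1: int, c2: int) -> int:
    """1 x 1 reduce (c1 -> c2), attention at c2, 1 x 1 restore, 1 x 1 bypass."""
    reduce_macs = pointwise_conv_macs(num_tokens, c1, c2)
    restore_macs = pointwise_conv_macs(num_tokens, c2, c1)
    bypass_macs = pointwise_conv_macs(num_tokens, c1, c1)
    return reduce_macs + attention_macs(num_tokens, c2) + restore_macs + bypass_macs


# Hypothetical sizes: a 20 x 20 feature map (400 tokens), c1 = 256, c2 = 64.
tokens, c1, c2 = 20 * 20, 256, 64
direct = attention_macs(tokens, c1)
bottleneck = bottleneck_attention_macs(tokens, c1, c2)
saving = 1 - bottleneck / direct   # fraction of MACs avoided by the bottleneck
```

Under these assumptions the projection cost shrinks with the square of the channel ratio, which is why even with the extra 1 × 1 convolutions the bottleneck layout needs fewer operations than attention at full width.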

4. Experiments and Analysis

To verify the superiority of our proposed algorithm and the completeness of the dataset, we first conducted ablation experiments on the SSSD for the individual improvements and then compared different models on the same dataset. The ablation experiments aim to confirm the effectiveness of each model improvement through qualitative and quantitative analysis, while the comparison experiments aim to show the advantages of the proposed SS-YOLO model over other target detection models in terms of both real-time performance and accuracy.

4.1. Evaluation Criteria

We employ several standard detection evaluation metrics to compare the performance of the detection models: precision (P), recall (R), average precision (AP), and mean average precision (mAP). The definitions and calculations of these metrics are as follows:
(1) Precision assesses the model’s ability to identify targets accurately: it is the proportion of detections judged positive by the model that are actually targets. In other words, precision reflects how many of the predicted positives are correct. A higher precision means that the model produces fewer false positives, i.e., fewer non-target instances are incorrectly identified as targets. Recall, on the other hand, assesses the model’s ability to find all targets: it is the proportion of all actual target samples that the model correctly identifies. Recall emphasizes complete coverage of targets, i.e., whether the model identifies as many of the existing targets as possible. A higher recall means that the model misses fewer real targets.
Thus, precision and recall capture the accuracy and the completeness of the model’s detection results, respectively, and there is usually a trade-off between them. A high precision indicates that the model reliably avoids false positives, while a high recall indicates that the model captures the targets more comprehensively.
$$P = \frac{TP}{TP + FP}$$
$$R = \frac{TP}{TP + FN}$$
True positives (TP) are actual positives that are correctly predicted as positive. True negatives (TN) are actual negatives that are correctly predicted as negative. False positives (FP) are actual negatives that are mistakenly predicted as positive. False negatives (FN) are actual positives that are mistakenly predicted as negative.
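As a quick numerical check of these two definitions, the sketch below computes precision and recall from raw counts. The counts are hypothetical and are not taken from the experiments; returning 0.0 for an empty denominator is a convention chosen here for convenience.

```python
def precision(tp: int, fp: int) -> float:
    """P = TP / (TP + FP); returns 0.0 when there are no positive predictions."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0


def recall(tp: int, fn: int) -> float:
    """R = TP / (TP + FN); returns 0.0 when there are no actual positives."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0


# Hypothetical counts for one category, not taken from the paper.
tp, fp, fn = 80, 20, 10
p = precision(tp, fp)   # 80 of the 100 detections are real targets
r = recall(tp, fn)      # 80 of the 90 real targets are found
```

Note that true negatives do not appear in either formula, which is why these metrics suit detection tasks where the background class is effectively unbounded.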
(2) The average precision (AP) for a particular target category is typically computed as the area under its precision-recall curve, so both the precision and the recall of the detection results are taken into account when evaluating performance on that category. The mean average precision (mAP), which is the average of the APs over all target categories, assesses the model’s overall detection performance on the whole dataset. A larger mAP value means that the model performs better across categories. AP therefore assesses detection ability for a single category, while mAP is a global measure of performance in a multi-category target detection task.
$$AP = \int_{0}^{1} P(r)\,dr$$
$$mAP = \frac{\sum_{i=1}^{N} AP_i}{N}$$
N is the number of detected categories.
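The AP integral and the mAP average can be approximated numerically as in the sketch below. It uses a plain trapezoidal rule over sampled (recall, precision) points, which is a simplification; detection toolkits commonly use an interpolated precision envelope instead. The sample curves are hypothetical.

```python
from typing import Sequence


def average_precision(recalls: Sequence[float], precisions: Sequence[float]) -> float:
    """Approximate the area under the precision-recall curve (trapezoidal rule)."""
    if len(recalls) != len(precisions):
        raise ValueError("recalls and precisions must have the same length")
    points = sorted(zip(recalls, precisions))
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area


def mean_average_precision(aps: Sequence[float]) -> float:
    """mAP: the arithmetic mean of the per-category AP values."""
    if not aps:
        raise ValueError("at least one category is required")
    return sum(aps) / len(aps)


# Hypothetical precision-recall samples for two categories.
ap_a = average_precision([0.0, 0.5, 1.0], [1.0, 0.8, 0.6])
ap_b = average_precision([0.0, 0.5, 1.0], [1.0, 0.6, 0.2])
m_ap = mean_average_precision([ap_a, ap_b])
```

With N = 2 categories, the mAP here is simply the midpoint of the two AP values.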

4.2. Experiment Setup

Model training, validation, and testing were performed on a computer running the Windows 10 operating system with two GPUs (NVIDIA Tesla T4, 16 GB), CUDA 12.1, Python 3.10.3, and the PyTorch 2.1.2 deep learning framework.
The model was trained from scratch, and the hyperparameters used in training are listed in Table 2.
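For readers who want to reproduce the setup, the Table 2 values could be written as a configuration dictionary along the following lines. The key names follow Ultralytics-style training arguments, and that mapping is an assumption: the paper lists only the values. In particular, the sketch reads the table's "final learning rate" of 0.1 as a fraction of the initial rate, which is how that framework treats its `lrf` argument.

```python
# Hyperparameters from Table 2, keyed by Ultralytics-style argument names.
# The key names are an assumption; the paper only lists the values.
train_config = {
    "imgsz": 960,            # image size (960, 960)
    "batch": 16,             # batch size
    "epochs": 200,
    "lr0": 0.01,             # initial learning rate
    "lrf": 0.1,              # Table 2 "final learning rate", read here as a fraction of lr0
    "weight_decay": 0.0005,
    "momentum": 0.937,       # SGD momentum
    "optimizer": "SGD",
    "pretrained": False,     # training starts from scratch
}

# Under the assumed convention, the learning rate at the end of training is lr0 * lrf.
final_lr = train_config["lr0"] * train_config["lrf"]
```

If the table instead reports an absolute final learning rate, the `lrf` entry would need to be adjusted accordingly.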

4.3. SSSD Result

To illustrate the contribution of each modification, we first conduct ablation experiments on the SSSD. Figure 7 shows the confusion matrices for the experimental results of the various models. In each confusion matrix, every column represents a true category and every row represents the category predicted by the model. By analyzing the elements of the confusion matrix, it is possible to determine which categories the model is prone to misclassify. Figure 7a shows that the original YOLOv8 model has relatively low classification performance in the “aircraft” and “reefs” categories, with accuracy rates of 0.57 and 0.83, respectively. For these two categories in particular, a high proportion of background is misclassified as targets. This indicates significant deficiencies in the model’s ability to distinguish these categories, especially in separating them from complex backgrounds. Figure 7b–d show the improvements obtained after introducing the different structures into the original model. As the model is optimized, the classification accuracy for the “aircraft” and “reefs” categories improves markedly. In the SS-YOLO model shown in Figure 7d, the classification accuracy for “aircraft” increases by 0.14, while that for “reefs” increases by 0.07. Furthermore, the proportion of background misclassified as “aircraft” and “reefs” drops significantly, indicating that the improved model is considerably more robust against complex backgrounds.
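The way such per-category rates are read off a confusion matrix can be sketched as follows, assuming the matrix is normalized by column so that each true category sums to one. The counts below are made up for illustration and are not the values behind Figure 7.

```python
from typing import List


def normalize_columns(counts: List[List[int]]) -> List[List[float]]:
    """Divide each entry by its column total (columns = true categories)."""
    n_rows, n_cols = len(counts), len(counts[0])
    totals = [sum(counts[r][c] for r in range(n_rows)) for c in range(n_cols)]
    return [
        [counts[r][c] / totals[c] if totals[c] else 0.0 for c in range(n_cols)]
        for r in range(n_rows)
    ]


# Made-up counts for illustration only; rows are predictions, columns are truth.
labels = ["aircraft", "reefs", "background"]
counts = [
    [60, 5, 6],    # predicted aircraft
    [0, 90, 9],    # predicted reefs
    [40, 5, 0],    # predicted background (missed targets)
]

rates = normalize_columns(counts)
# Diagonal entries: share of each true category that is classified correctly.
per_class_accuracy = {labels[i]: rates[i][i] for i in range(2)}
# Last column: how background regions are split among the target categories.
background_as_target = {labels[i]: rates[i][2] for i in range(2)}
```

The diagonal gives the per-category rate discussed in the text, while the background column shows which target categories absorb false detections.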
The above analysis shows that the MHSA attention mechanism enhances the model’s ability to extract global features, enabling it to better capture the global characteristics of targets such as “aircraft” and “reefs” and to avoid being misled by locally complex backgrounds. The ASSF module effectively integrates features at different scales, addressing the shortcomings of the original model in multi-scale object detection and, in particular, significantly improving the detection performance for small targets (such as “reefs”). Figure 7d shows a significant reduction in the proportion of background misclassified as targets, indicating that the improved SS-YOLO model is more robust in distinguishing between target and non-target regions, which improves detection reliability. The SS-YOLO structure also demonstrates greater consistency across multiple categories, reducing the mutual confusion between “reefs” and “foundation”, which suggests that the model has clearer decision boundaries between categories.
Furthermore, Figure 8 shows how the loss functions, precision, recall, and mAP metrics of the SS-YOLO model change on the training and validation sets as the number of training epochs increases. According to the loss curves, all three loss functions converge to a stable state on both the training and validation sets. Recall shows a consistent rising trend and converges to about 0.85, whereas precision oscillates notably at first and stabilizes at 0.82 after 60 epochs. The model’s mAP value eventually stabilizes at around 0.93.

4.3.1. Ablation Experiments

We conducted a series of ablation experiments, making sure that the training settings and hyperparameters remained constant, to assess the contribution of each modification to the network. The YOLOv8n model served as the baseline in these studies, and the suggested improvements were implemented one after the other for comparison.
We evaluated the baseline and all three enhanced detection networks on the same test dataset, with the detection results presented in Figure 9. For large targets with distinctive features, the YOLOv8 model in the first row of Figure 9a exhibits significant under-detection, which is likely attributable to the model’s insufficient ability to adapt to complex backgrounds. In Figure 9c, the YOLOv8 model in the first row produces a significantly larger detection range and duplicate detections, which suggests that the target has been misrecognized as multiple independent targets, resulting in overlapping detection boxes. For the small targets in Figure 9d, both the YOLOv8 model in the first row and the model integrating the Fast-C2f structure in the second row show lower confidence and more missed detections, underperforming in small object detection. These analyses show that the original YOLOv8 model has significant performance bottlenecks when processing SSS images with complex backgrounds, especially in small target detection, and struggles to meet the demand for high-precision detection.
Based on the above, the ASSF structure is introduced to improve the detection accuracy of the model, and its detection results are shown in the third row of Figure 9. These results show a significant improvement over the models in the first and second rows. With the ASSF structure, the model extracts target features from information at different scales more effectively, which improves the recognition of multi-scale targets and reduces the effect of background interference on target detection. For smaller targets in particular, the model captures the detailed information of the target area more accurately, which significantly improves the recognition confidence and reduces missed detections.
As shown in the fourth row of Figure 9, despite the reduction in the number of parameters in the detection head, the model’s ability to perceive the target area remains unaffected. In fact, the attention mechanism not only improves the recognition of small targets but also suppresses interference from background noise. Compared with the first and second rows, the model in the fourth row locates the target more accurately, and it maintains a detection effect similar to that of the third row while being more lightweight. In complex backgrounds especially, the boundary of the target becomes clearer and missed detections are significantly reduced. The attention mechanism allows the model to further improve detection accuracy by focusing on key regions without increasing the computational overhead. At the same time, the model adaptively allocates attention, so that even in low-resolution images it can still extract target features effectively, ensuring good detection performance and accuracy.
The quantitative results of the ablation experiments are detailed in Table 3. Integrating the Fast-C2f structure into the YOLOv8n model reduces its parameters from 3.15 million to 2.45 million and its FLOPs from 8.9 G to 7.1 G. This reduction in parameters and computational complexity yields a more lightweight model, which is crucial for deployment on resource-constrained edge devices: fewer parameters lower memory usage and shorten inference time, both of which are essential for real-time target detection. Accuracy is largely preserved, with mAP@0.5 changing from 0.88 to 0.874 and mAP@0.5:0.95 rising from 0.617 to 0.634. To offset the impact of parameter reduction on model accuracy, the ASSF structure and the DetectSA structure are then added in turn. With the ASSF structure, mAP@0.5 and mAP@0.5:0.95 are 1.8 and 2.5 percentage points higher than those of the YOLOv8n baseline. In summary, compared with the original model, the proposed model reduces the parameters from 3.15 million to 2.55 million and the FLOPs from 8.9 G to 6.4 G, while increasing mAP@0.5 by 4.4 percentage points and mAP@0.5:0.95 by 3 percentage points, indicating a significant boost in detection accuracy.
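The differences relative to the baseline can be recomputed directly from the rows of Table 3, as in the short sketch below. The dictionary simply transcribes the table; the helper name `delta` is introduced here for illustration.

```python
# Rows of Table 3: (params in millions, FLOPs in G, mAP@0.5, mAP@0.5:0.95).
ablation = {
    "YOLOv8n": (3.15, 8.9, 0.880, 0.617),
    "+ Fast-C2f": (2.45, 7.1, 0.874, 0.634),
    "+ Fast-C2f + ASSF": (2.69, 7.6, 0.898, 0.642),
    "+ Fast-C2f + ASSF + DetectSA": (2.55, 6.4, 0.924, 0.647),
}


def delta(variant: str, baseline: str = "YOLOv8n") -> tuple:
    """Element-wise difference between a variant row and the baseline row."""
    return tuple(round(v - b, 3) for v, b in zip(ablation[variant], ablation[baseline]))


for name in ablation:
    d_params, d_flops, d_map50, d_map5095 = delta(name)
    print(f"{name:32s} params {d_params:+.2f} M  FLOPs {d_flops:+.1f} G  "
          f"mAP@0.5 {100 * d_map50:+.1f} pts  mAP@0.5:0.95 {100 * d_map5095:+.1f} pts")
```

Expressing the accuracy changes in percentage points keeps them distinct from relative percentage changes, which would give different numbers.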

4.3.2. Contrast Experiments

We also tested two popular target detection algorithms and several other algorithms in the YOLO family on the SSSD, and compared their detection results with those of our proposed SS-YOLO algorithm to evaluate its detection performance in more detail. Table 4 presents the quantitative results of the comparative experiments.
The proposed SS-YOLO network demonstrates superior performance across multiple key metrics. Compared with both the traditional single-stage object detection algorithm SSD [41] and the two-stage detection algorithm Faster R-CNN [42], SS-YOLO has significantly fewer parameters and higher accuracy. SS-YOLO’s Params and FLOPs are only 2.55 M and 6.4 G, respectively, while SSD’s are 24.98 M and 137.94 G. Although SSD is a single-stage object detection algorithm, this much larger computational cost indicates that it is slower than our proposed network. Beyond speed, SS-YOLO also achieves higher detection precision. SSD’s mAP is only 0.827; its lower accuracy stems from an architecture that is simpler than those of the two-stage Faster R-CNN and our SS-YOLO and that sacrifices detection precision for higher speed [47]. Faster R-CNN’s mAP is 0.88, higher than SSD’s but still significantly lower than that of our proposed network. Additionally, in terms of parameter count and inference time, the computationally intensive Faster R-CNN does not meet the requirements for deployment on edge devices.
Compared with other algorithms in the YOLO series, our network also demonstrates significant performance advantages. SS-YOLO not only surpasses the other models in detection speed but also maintains relatively high accuracy. Compared with YOLOv5s [43] and YOLOv7t [44], SS-YOLO has significantly fewer parameters and FLOPs. YOLOv5s has 7.2 M Params and 16.5 G FLOPs, while YOLOv7t has 6.2 M Params and 13.9 G FLOPs, both more than double those of SS-YOLO. In terms of accuracy, SS-YOLO reaches a mAP of 0.924, while YOLOv5s and YOLOv7t reach 0.896 and 0.868, respectively, both noticeably lower. Although the YOLOv9s [46] network achieves slightly higher accuracy, with a mAP of 0.933, its Params and FLOPs are 7.1 M and 26.4 G, respectively, which indicates that it sacrifices speed to gain accuracy. Overall, SS-YOLO achieves a balance between accuracy and speed, offering clear advantages over the other models.
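One way to make the accuracy-versus-cost comparison concrete is to check which models SS-YOLO beats on parameters, FLOPs, and mAP@0.5 simultaneously. The sketch below does this with the figures quoted in this section and in Table 4; the dominance test itself is an illustrative device introduced here, not an analysis performed in the paper.

```python
from typing import Dict, List, Tuple

# (params in millions, FLOPs in G, mAP@0.5) as quoted in the text and Table 4.
models: Dict[str, Tuple[float, float, float]] = {
    "SSD": (24.98, 137.94, 0.827),
    "Faster R-CNN": (41.22, 91.1, 0.880),
    "YOLOv5s": (7.2, 16.5, 0.896),
    "YOLOv7t": (6.2, 13.9, 0.868),
    "YOLOv9s": (7.1, 26.4, 0.933),
    "SS-YOLO": (2.55, 6.4, 0.924),
}


def dominated_by(reference: str) -> List[str]:
    """Models that are no smaller, no cheaper, and no more accurate than the reference."""
    ref_params, ref_flops, ref_map = models[reference]
    return [
        name
        for name, (params, flops, m_ap) in models.items()
        if name != reference
        and params >= ref_params
        and flops >= ref_flops
        and m_ap <= ref_map
    ]


beaten_on_all_three = dominated_by("SS-YOLO")
# Parameter count of each model relative to SS-YOLO.
size_ratio = {name: round(vals[0] / models["SS-YOLO"][0], 2) for name, vals in models.items()}
```

Only YOLOv9s escapes this test, because its mAP@0.5 is higher; it does so at roughly four times the FLOPs, which matches the trade-off described above.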

5. Conclusions

This paper introduces a lightweight network designed for deployment on unmanned vessels to detect objects in SSS images. To achieve network lightweighting, we first reduced the complexity of the convolution process by combining PConv and PWConv to form a new FasterBlock module, replacing the original convolution block in C2f. This achieved a balance between detection accuracy and speed. To fully utilize the scale information of objects and address the deformation issue of targets in SSS images, we optimized the ASSF structure. After concatenating feature maps of different sizes, we extracted scale-sequential features based on 3D convolutions. To address the high complexity of the original YOLOv8 model’s detection head, we incorporated an MHSA attention mechanism without the FFN layer into the detection head. This increased the model’s sensitivity to features while reducing the parameters of the detection head, thus improving detection speed. We constructed a new SSSD and conducted both ablation and comparative experiments on it. The results show that our proposed SS-YOLO model achieves an excellent balance between detection accuracy and speed, demonstrating superior performance. In the future, we will focus on the deployment of the model on autonomous vessels to realize the real-time detection of underwater targets by unmanned vessels equipped with SSS equipment.

Author Contributions

Conceptualization, N.Y., Z.W. and G.L.; methodology, N.Y. and S.W.; software, N.Y.; validation, N.Y.; formal analysis, N.Y.; investigation, N.Y.; resources, N.Y.; data curation, Z.W., H.R. and N.Y.; writing—original draft preparation, N.Y.; writing—review and editing, Z.W., G.L., X.Z. and Y.P.; visualization, Y.P.; supervision, Z.W.; project administration, X.Z. and Y.P.; funding acquisition, Z.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Key R&D Program of Shandong Province, China (No. 2023CXPT054).

Data Availability Statement

Data are available on request due to restrictions, e.g., privacy or ethical. The data presented in this study are available on request from the corresponding author.

Acknowledgments

The first author would like to thank all the students and teachers who contributed to the data collection and processing for this study, and expresses gratitude to the corresponding author, Zhengrong Wei, and Kai Liu of the First Institute of Oceanography of Ministry of Natural Resources for their suggestions provided in revising this manuscript.

Conflicts of Interest

Author Mr. Guoyu Li was employed by the company Qingdao Xiushan Mobile Mapping Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Li, L.; Li, Y.; Yue, C.; Xu, G.; Wang, H.; Feng, X. Real-Time Underwater Target Detection for AUV Using Side Scan Sonar Images Based on Deep Learning. Appl. Ocean Res. 2023, 138, 103630.
  2. Grządziel, A. The Impact of Side-Scan Sonar Resolution and Acoustic Shadow Phenomenon on the Quality of Sonar Imagery and Data Interpretation Capabilities. Remote Sens. 2023, 15, 5599.
  3. Zhou, X.; Zhou, Z.; Wang, M.; Ning, B.; Wang, Y.; Zhu, P. Multi-Level Feature Enhancement Network for Object Detection in Sonar Images. J. Vis. Commun. Image Represent. 2024, 100, 104147.
  4. Yu, H.; Li, Z.; Li, D.; Shen, T. Bottom Detection Method of Side-Scan Sonar Image for AUV Missions. Complexity 2020, 2020, 8890410.
  5. Abu, A.; Diamant, R. A Statistically-Based Method for the Detection of Underwater Objects in Sonar Imagery. IEEE Sens. J. 2019, 19, 6858–6871.
  6. Febriawan, H.K.; Helmholz, P.; Parnum, I.M. Support Vector Machine and Decision Tree Based Classification of Side-Scan Sonar Mosaics Using Textural Features. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2019, XLII-2/W13, 27–34.
  7. Azimi-Sadjadi, M.R.; Klausner, N.; Kopacz, J. Detection of Underwater Targets Using a Subspace-Based Method with Learning. IEEE J. Ocean. Eng. 2017, 42, 869–879.
  8. He, J.; Chen, J.; Xu, H.; Ayub, M.S. Small Target Detection Method Based on Low-Rank Sparse Matrix Factorization for Side-Scan Sonar Images. Remote Sens. 2023, 15, 2054.
  9. Fakiris, E.; Papatheodorou, G.; Geraga, M.; Ferentinos, G. An Automatic Target Detection Algorithm for Swath Sonar Backscatter Imagery, Using Image Texture and Independent Component Analysis. Remote Sens. 2016, 8, 373.
  10. Zhu, B.; Wang, X.; Chu, Z.; Yang, Y.; Shi, J. Active Learning for Recognition of Shipwreck Target in Side-Scan Sonar Image. Remote Sens. 2019, 11, 243.
  11. Wang, Z.; Zhang, S.; Huang, W.; Guo, J.; Zeng, L. Sonar Image Target Detection Based on Adaptive Global Feature Enhancement Network. IEEE Sens. J. 2022, 22, 1509–1530.
  12. Kong, W.; Hong, J.; Jia, M.; Yao, J.; Cong, W.; Hu, H.; Zhang, H. YOLOv3-DPFIN: A Dual-Path Feature Fusion Neural Network for Robust Real-Time Sonar Target Detection. IEEE Sens. J. 2020, 20, 3745–3756.
  13. Zhang, F.; Zhang, W.; Cheng, C.; Hou, X.; Cao, C. Detection of Small Objects in Side-Scan Sonar Images Using an Enhanced YOLOv7-Based Approach. J. Mar. Sci. Eng. 2023, 11, 2155.
  14. Wen, X.; Zhang, F. Underwater Target Detection by Side-Scan Sonar Based on Yolov7-Attention. In Proceedings of the 2023 7th Asian Conference on Artificial Intelligence Technology (ACAIT), Quzhou, China, 3–5 November 2023; pp. 1536–1542.
  15. Wen, X.; Wang, J.; Cheng, C.; Zhang, F.; Pan, G. Underwater Side-Scan Sonar Target Detection: YOLOv7 Model Combined with Attention Mechanism and Scaling Factor. Remote Sens. 2024, 16, 2492.
  16. Mittal, P. A Comprehensive Survey of Deep Learning-Based Lightweight Object Detection Models for Edge Devices. Artif. Intell. Rev. 2024, 57, 242.
  17. Liu, G.; Hu, Y.; Chen, Z.; Guo, J.; Ni, P. Lightweight Object Detection Algorithm for Robots with Improved YOLOv5. Eng. Appl. Artif. Intell. 2023, 123, 106217.
  18. Huyan, L.; Bai, Y.; Li, Y.; Jiang, D.; Zhang, Y.; Zhou, Q.; Wei, J.; Liu, J.; Zhang, Y.; Cui, T. A Lightweight Object Detection Framework for Remote Sensing Images. Remote Sens. 2021, 13, 683.
  19. Zhang, Y.; Zhang, T.; Wu, C.; Tao, R. Multi-Scale Spatiotemporal Feature Fusion Network for Video Saliency Prediction. IEEE Trans. Multimed. 2024, 26, 4183–4193.
  20. Tang, Y.; Wang, L.; Jin, S.; Zhao, J.; Huang, C.; Yu, Y. AUV-Based Side-Scan Sonar Real-Time Method for Underwater-Target Detection. J. Mar. Sci. Eng. 2023, 11, 690.
  21. Jocher, G.; Qiu, J.; Chaurasia, A. Ultralytics YOLO 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 17 November 2024).
  22. Varghese, R.; Sambath, M. YOLOv8: A Novel Object Detection Algorithm with Enhanced Performance and Robustness. In Proceedings of the 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), Chennai, India, 18–19 April 2024; pp. 1–6.
  23. Hao, C.-Y.; Chen, Y.-C.; Chen, T.-T.; Lai, T.-H.; Chou, T.-Y.; Ning, F.-S.; Chen, M.-H. Synthetic Data-Driven Real-Time Detection Transformer Object Detection in Raining Weather Conditions. Appl. Sci. 2024, 14, 4910.
  24. Zhang, P.; Tang, J.; Zhong, H.; Ning, M.; Liu, D.; Wu, K. Self-Trained Target Detection of Radar and Sonar Images Using Automatic Deep Learning. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–14.
  25. GitHub—Ultralytics/Ultralytics: NEW—YOLOv8 🚀 in PyTorch > ONNX > OpenVINO > CoreML > TFLite. Available online: https://github.com/ultralytics/ultralytics (accessed on 27 September 2024).
  26. Liu, Z.; Rasika, D.; Abeyrathna, R.M.; Mulya Sampurno, R.; Massaki Nakaguchi, V.; Ahamed, T. Faster-YOLO-AP: A Lightweight Apple Detection Algorithm Based on Improved YOLOv8 with a New Efficient PDWConv in Orchard. Comput. Electron. Agric. 2024, 223, 109118.
  27. Wang, H.; Liu, C.; Cai, Y.; Chen, L.; Li, Y. YOLOv8-QSD: An Improved Small Object Detection Algorithm for Autonomous Vehicles Based on YOLOv8. IEEE Trans. Instrum. Meas. 2024, 73, 1–16.
  28. Chen, J.; Kao, S.; He, H.; Zhuo, W.; Wen, S.; Lee, C.-H.; Chan, S.-H.G. Run, Don’t Walk: Chasing Higher FLOPS for Faster Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023.
  29. Zhang, L.; Li, P.; Liu, X.; Yu, J.; Hu, G.; Yu, T. Dy-GNet: A Lightweight and Efficient 1DCNN-Based Network for Leakage Aperture Identification. Meas. Sci. Technol. 2024, 35, 056109.
  30. Park, H.-J.; Kang, J.-W.; Kim, B.-G. ssFPN: Scale Sequence (S2) Feature-Based Feature Pyramid Network for Object Detection. Sensors 2023, 23, 4432.
  31. Zhao, Z.; Pan, Y.; Guo, G.; Zhai, Y.; Liu, G. YOLO-AFPN: Marrying YOLO and AFPN for External Damage Detection of Transmission Lines. IET Gener. Transm. Distrib. 2024, 18, 1935–1946.
  32. Zhou, P.; Chen, J.; Tang, P.; Gan, J.; Zhang, H. A Multi-Scale Fusion Strategy for Side Scan Sonar Image Correction to Improve Low Contrast and Noise Interference. Remote Sens. 2024, 16, 1752.
  33. Kang, M.; Ting, C.-M.; Ting, F.F.; Phan, R.C.-W. ASF-YOLO: A Novel YOLO Model with Attentional Scale Sequence Fusion for Cell Instance Segmentation. Image Vis. Comput. 2024, 147, 105057.
  34. Liu, J.; Fan, X.; Jiang, J.; Liu, R.; Luo, Z. Learning a Deep Multi-Scale Feature Ensemble and an Edge-Attention Guidance for Image Fusion. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 105–119.
  35. Lindeberg, T. Scale-Space Theory in Computer Vision; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2013; ISBN 978-1-4757-6465-9.
  36. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2023, arXiv:1706.03762.
  37. Yang, Z.; Yang, D.; Dyer, C.; He, X.; Smola, A.; Hovy, E. Hierarchical Attention Networks for Document Classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA, 12–17 June 2016; pp. 1480–1489.
  38. Zhang, H.; Goodfellow, I.; Metaxas, D.; Odena, A. Self-Attention Generative Adversarial Networks. In Proceedings of the 36th International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 24 May 2019; pp. 7354–7363.
  39. Cordonnier, J.-B.; Loukas, A.; Jaggi, M. On the Relationship Between Self-Attention and Convolutional Layers. Available online: https://arxiv.org/abs/1911.03584v2 (accessed on 7 September 2024).
  40. Yu, H.; Wan, C.; Liu, M.; Chen, D.; Xiao, B.; Dai, X. Real-Time Image Segmentation via Hybrid Convolutional-Transformer Architecture Search. arXiv 2024, arXiv:2403.10413.
  41. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the Computer Vision—ECCV 2016, Amsterdam, The Netherlands, 11–14 October 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer International Publishing: Cham, Switzerland, 2016; pp. 21–37.
  42. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149.
  43. Jocher, G. YOLOv5 by Ultralytics 2020. Available online: https://github.com/ultralytics/yolov5 (accessed on 17 November 2024).
  44. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475.
  45. GitHub—Ultralytics/Ultralytics: Ultralytics YOLO11 🚀. Available online: https://github.com/ultralytics/ultralytics/tree/main (accessed on 17 November 2024).
  46. Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv 2024, arXiv:2402.13616.
  47. Qu, S.; Cui, C.; Duan, J.; Lu, Y.; Pang, Z. Underwater Small Target Detection under YOLOv8-LA Model. Sci. Rep. 2024, 14, 16108.
Figure 1. Unmanned surface vessel and SS900F SSS equipment.
Figure 2. Processed SSS images of the wind turbine pile foundation and underwater artificial reef (the image on the left is of the wind turbine pile foundation, and the image on the right is of the underwater artificial reef).
Figure 3. Distribution of the normalized width and heights of the objects in the combined dataset.
Figure 4. The network architecture diagram of SS-YOLO ((a–c) represent the Backbone, Neck, and Head parts of the YOLO network, respectively; (d–g) represent the proposed improved module structures within the overall network).
Figure 5. The convolutional structure design of PConv and the combination of PConv and PWConv form a T-shaped convolutional structure.
Figure 6. Memory-efficient self-attention mechanism.
Figure 7. The confusion matrices obtained from ablation experiments based on the SSSD. (a) YOLOv8; (b) YOLOv8 + Fast-C2f; (c) YOLOv8 + Fast-C2f + ASSF; (d) YOLOv8 + Fast-C2f + ASSF + DetectSA.
Figure 8. Training results on our SSSD, showing the changes in the loss functions, precision, recall, and mAP evaluation metrics.
Figure 9. Visual comparison of the detection results obtained from the ablation experiments on the SSSD: (a) YOLOv8; (b) YOLOv8 + Fast-C2f; (c) YOLOv8 + Fast-C2f + ASSF; (d) YOLOv8 + Fast-C2f + ASSF + Detect.
Table 1. The technical specifications of the SS900F SSS.
| Technical Specifications | Parameters |
| --- | --- |
| Operating Frequency | 900 kHz |
| Maximum Range | 75 m @ 900 kHz |
| Horizontal Beam Width | 0.2° @ 900 kHz |
| Vertical Beam Width | 50° |
| Along Track Resolution | 0.07 m @ 20 m; 0.17 m @ 50 m; 0.26 m @ 75 m |
Table 2. Hyperparameters of model training.
| Parameters | Configuration |
| --- | --- |
| image size | (960, 960) |
| batch size | 16 |
| epochs | 200 |
| initial learning rate | 0.01 |
| final learning rate | 0.1 |
| weight decay | 0.0005 |
| SGD momentum | 0.937 |
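As a rough illustration, the settings in Table 2 can be written as keyword arguments for a training call. The sketch below assumes the Ultralytics argument names (`imgsz`, `lr0`, `lrf`, and so on), which the table itself does not state. It also assumes that the "final learning rate" of 0.1 is a multiplier on the initial rate, as `lrf` is in that toolkit. The `model_cfg` and `data_cfg` paths are hypothetical placeholders.

```python
"""Table 2 hyperparameters expressed as keyword arguments (illustrative sketch)."""

# Values copied from Table 2. The argument names follow the Ultralytics
# training API; that mapping is an assumption, not something the table states.
TRAIN_ARGS = {
    "imgsz": 960,            # image size (960, 960)
    "batch": 16,             # batch size
    "epochs": 200,           # epochs
    "lr0": 0.01,             # initial learning rate
    "lrf": 0.1,              # final learning rate, read as a fraction of lr0 (assumption)
    "weight_decay": 0.0005,  # weight decay
    "momentum": 0.937,       # SGD momentum
    "optimizer": "SGD",      # assumed from the "SGD momentum" row
}


def final_learning_rate(args):
    """Learning rate at the end of training if lrf is a multiplier on lr0."""
    return args["lr0"] * args["lrf"]


def train(model_cfg, data_cfg):
    """Launch training; requires the third-party `ultralytics` package.

    `model_cfg` and `data_cfg` are hypothetical paths to a model YAML and a
    dataset YAML; neither file is described in the table.
    """
    from ultralytics import YOLO  # imported lazily so the rest runs without it

    return YOLO(model_cfg).train(data=data_cfg, **TRAIN_ARGS)


if __name__ == "__main__":
    lr_end = final_learning_rate(TRAIN_ARGS)
    print(f"final LR under the lrf assumption: {lr_end:.4f}")
```

If the 0.1 in Table 2 is instead an absolute learning rate, the `lrf` entry would need to be changed accordingly.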
Table 3. Quantitative results of the ablation experiments (bolded and underlined data are the best results for each parameter).
| Method | Params | FLOPs | mAP@0.5 | mAP@[0.5, 0.95] |
| --- | --- | --- | --- | --- |
| YOLOv8n | 3.15 M | 8.9 G | 0.88 | 0.617 |
| YOLOv8n + Fast-C2f | 2.45 M | 7.1 G | 0.874 | 0.634 |
| YOLOv8n + Fast-C2f + ASSF | 2.69 M | 7.6 G | 0.898 | 0.642 |
| YOLOv8n + Fast-C2f + ASSF + DetectSA | 2.55 M | 6.4 G | 0.924 | 0.647 |
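The ablation figures in Table 3 can be turned into changes relative to the YOLOv8n baseline with a few lines of arithmetic. The sketch below is only a reading aid: the helper names are illustrative, and the numbers are copied from the table. For the full model it gives a parameter reduction of about 19.0%, a FLOPs reduction of about 28.1%, and gains of 4.4 and 3.0 points in mAP@0.5 and mAP@[0.5, 0.95], in line with the improvements quoted in the abstract.

```python
# Rows of Table 3: (method, params in M, FLOPs in G, mAP@0.5, mAP@[0.5, 0.95]).
ABLATION = [
    ("YOLOv8n", 3.15, 8.9, 0.880, 0.617),
    ("YOLOv8n + Fast-C2f", 2.45, 7.1, 0.874, 0.634),
    ("YOLOv8n + Fast-C2f + ASSF", 2.69, 7.6, 0.898, 0.642),
    ("YOLOv8n + Fast-C2f + ASSF + DetectSA", 2.55, 6.4, 0.924, 0.647),
]


def relative_reduction(baseline, value):
    """Percentage by which `value` is smaller than `baseline`."""
    return 100.0 * (baseline - value) / baseline


def compare_to_baseline(rows):
    """Return the change of every later row relative to the first row."""
    _, p0, f0, m50_0, m5095_0 = rows[0]
    changes = []
    for name, params, flops, m50, m5095 in rows[1:]:
        changes.append({
            "method": name,
            "params_cut_pct": round(relative_reduction(p0, params), 1),
            "flops_cut_pct": round(relative_reduction(f0, flops), 1),
            # mAP gains are expressed in percentage points.
            "map50_gain_pts": round(100 * (m50 - m50_0), 1),
            "map5095_gain_pts": round(100 * (m5095 - m5095_0), 1),
        })
    return changes


if __name__ == "__main__":
    for change in compare_to_baseline(ABLATION):
        print(change)
```

Note that Fast-C2f alone lowers mAP@0.5 slightly (by 0.6 points) while raising mAP@[0.5, 0.95]; the mAP@0.5 gain appears once ASSF and DetectSA are added.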
Table 4. Quantitative results of the contrast experiments (bolded and underlined data are the best results for each parameter).
| Method | Params | FLOPs | P | R | mAP@0.5 | mAP@[0.5, 0.95] |
| --- | --- | --- | --- | --- | --- | --- |
| SSD [41] | 24.98 M | 137.94 G | 0.608 | 0.841 | 0.827 | 0.419 |
| Faster R-CNN [42] | 41.22 M | 91.1 G | 0.836 | 0.915 | 0.88 | 0.534 |
| YOLOv5s [43] | 7.2 M | 16.5 G | 0.905 | 0.902 | 0.896 | 0.628 |
| YOLOv7t [44] | 6.2 M | 13.9 G | 0.896 | 0.787 | 0.868 | 0.560 |
| YOLOv8n [45] | 3.15 M | 8.9 G | 0.877 | 0.835 | 0.88 | 0.617 |
| YOLOv9s [46] | 7.1 M | 26.4 G | 0.889 | 0.921 | 0.933 | 0.698 |
| SS-YOLO | 2.55 M | 6.4 G | 0.821 | 0.857 | 0.924 | 0.647 |
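One way to read Table 4 is to ask which models beat SS-YOLO on accuracy and what they cost in size and computation. The sketch below does this with values copied from the table; the function names are illustrative. Under this reading only YOLOv9s scores a higher mAP@0.5, at roughly 2.8 times the parameters and 4.1 times the FLOPs, while SS-YOLO has the smallest parameter count and FLOPs of all listed models.

```python
# Rows of Table 4: (method, params M, FLOPs G, P, R, mAP@0.5, mAP@[0.5, 0.95]).
CONTRAST = [
    ("SSD", 24.98, 137.94, 0.608, 0.841, 0.827, 0.419),
    ("Faster R-CNN", 41.22, 91.1, 0.836, 0.915, 0.880, 0.534),
    ("YOLOv5s", 7.2, 16.5, 0.905, 0.902, 0.896, 0.628),
    ("YOLOv7t", 6.2, 13.9, 0.896, 0.787, 0.868, 0.560),
    ("YOLOv8n", 3.15, 8.9, 0.877, 0.835, 0.880, 0.617),
    ("YOLOv9s", 7.1, 26.4, 0.889, 0.921, 0.933, 0.698),
    ("SS-YOLO", 2.55, 6.4, 0.821, 0.857, 0.924, 0.647),
]

PARAMS, FLOPS, MAP50 = 1, 2, 5  # column indices used below


def more_accurate_than(rows, reference):
    """List models with higher mAP@0.5 than `reference`, with their cost ratios."""
    ref = next(row for row in rows if row[0] == reference)
    result = []
    for row in rows:
        if row[MAP50] > ref[MAP50]:
            result.append((
                row[0],
                round(row[PARAMS] / ref[PARAMS], 1),  # parameter ratio
                round(row[FLOPS] / ref[FLOPS], 1),    # FLOPs ratio
            ))
    return result


def smallest(rows, column):
    """Name of the model with the smallest value in `column`."""
    return min(rows, key=lambda row: row[column])[0]


if __name__ == "__main__":
    print(more_accurate_than(CONTRAST, "SS-YOLO"))
    print(smallest(CONTRAST, PARAMS), smallest(CONTRAST, FLOPS))
```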
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Yang, N.; Li, G.; Wang, S.; Wei, Z.; Ren, H.; Zhang, X.; Pei, Y. SS-YOLO: A Lightweight Deep Learning Model Focused on Side-Scan Sonar Target Detection. J. Mar. Sci. Eng. 2025, 13, 66. https://doi.org/10.3390/jmse13010066


Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.