Article

Underwater Side-Scan Sonar Target Detection: YOLOv7 Model Combined with Attention Mechanism and Scaling Factor

1 School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an 710072, China
2 Marine Design & Research Institute of China, Shanghai 200011, China
* Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(13), 2492; https://doi.org/10.3390/rs16132492
Submission received: 2 April 2024 / Revised: 29 June 2024 / Accepted: 5 July 2024 / Published: 8 July 2024
(This article belongs to the Special Issue Advancement in Undersea Remote Sensing II)

Abstract

Side-scan sonar plays a crucial role in underwater exploration, and the autonomous detection of side-scan sonar images is vital for exploring unknown underwater environments. However, due to the complexity of the underwater environment, the small highlighted regions on targets, blurred feature details, and the difficulty of collecting side-scan sonar data, achieving high-precision autonomous target recognition in side-scan sonar images is challenging. This article addresses this problem by improving the You Only Look Once v7 (YOLOv7) model to achieve high-precision object detection in side-scan sonar images. Firstly, given that side-scan sonar images contain large areas of irrelevant information, this paper introduces the Swin-Transformer for dynamic attention and global modeling, which enhances the model’s focus on the target regions. Secondly, the Convolutional Block Attention Module (CBAM) is utilized to further improve feature representation and enhance the model’s accuracy. Lastly, to address the uncertainty of geometric features in side-scan sonar targets, this paper innovatively incorporates a feature scaling factor into the YOLOv7 model. The experiments first verify the necessity of attention mechanisms on a public dataset. Subsequent experiments on our side-scan sonar (SSS) image dataset show that the improved YOLOv7 model achieves 87.9% and 49.23% in mean average precision (mAP0.5) and (mAP0.5:0.95), respectively. These results are 9.28% and 8.41% higher than those of the baseline YOLOv7 model. The improved YOLOv7 algorithm proposed in this paper therefore has great potential for object detection and recognition in side-scan sonar images.

1. Introduction

In autonomous underwater exploration tasks, target detection technology based on side-scan sonar (SSS) imagery plays a crucial role [1,2,3]. In traditional methods, target detection in sonar images relies on pixel features [4], greyscale thresholds [5], a priori target information [6], and artificial filters [7]. However, due to the quality of marine sonar images, traditional methods struggle to identify good pixel features and greyscale thresholds. They also require a significant amount of manpower for discrimination, making them inefficient. In recent years, deep learning-based approaches have shown significant advantages and have seen extensive applications in the field of underwater target detection [8,9]. Nevertheless, the precision of deep learning detection algorithms has consistently failed to meet practical engineering standards. The multitude of spatial features and pronounced shadow effects present in target features within side-scan sonar images pose significant challenges to achieving high accuracy in target detection. Therefore, investigating methods to integrate the unique characteristics of side-scan sonar images into target detection algorithms holds paramount importance for enhancing the overall detection performance.
In the domain of target detection, models such as You Only Look Once (YOLO) [10], the Single-Shot MultiBox Detector (SSD) [11], and RetinaNet [12] have showcased substantial potential for accuracy. However, due to the unique attributes of sonar data, there has been limited focus on developing detection algorithms specifically tailored for sonar images. Because side-scan sonar images are formed by mapping acoustic signal intensity, underwater targets appear only as grey-scale bright spots of varying geometric size, and their geometric features are not obvious, which differs significantly from the imaging quality of traditional optical images. Even for the same underwater target, the features displayed in a side-scan sonar image can vary with differences in draught depth, position, and sonar angle. Current research therefore often integrates sonar-specific feature processing into existing detection models to enhance their resilience when handling sonar data.
In the domain of sonar image detection, the earliest strategy employed transfer learning, adapting weights trained on optical images to improve performance on sonar images [13,14,15]. Zacchini et al. [16] used mask-region convolutional neural networks (Mask R-CNN) to automatically identify and localize targets in forward-looking sonar images; although the network only performs target detection rather than target recognition, it demonstrates the feasibility of deep learning for forward-looking sonar. Yulin et al. [17] proposed a transfer-learning-based convolutional neural network for side-scan sonar wreck image recognition, which exploits the characteristics of a side-scan sonar wreck dataset and improves the network with reference to the Visual Geometry Group (VGG) model, greatly improving training efficiency and recognition accuracy while accelerating model convergence. This work demonstrates the feasibility of transfer learning for image target recognition in side-scan sonar. To further substantiate this, Du et al. [18] used a self-made submarine dataset and four traditional convolutional neural network (CNN) models for training and prediction, compared the prediction accuracy and computational performance of the four CNN models, and confirmed the efficiency and accuracy of transfer learning in side-scan sonar image recognition. On this basis, to improve the accuracy of transfer learning, Zhang et al. [19] improved traditional k-means clustering: the intersection over union (IoU) value is used as the distance function to cluster the labeling information of the forward-looking sonar training set, and an improved CoordConv feature extraction method assigns coordinate information to high-level features, which improves the accuracy of the network’s detection regression.
To address the difficulty of extracting target features from underwater sonar images, researchers have adapted networks that perform well on optical data, and many studies have explored their innovation and application in underwater environments [20,21]. Zhu et al. [22] introduced an automatic target recognition method for unmanned underwater vehicle (UUV) sonar. This approach uses a CNN to extract pertinent target features from sonar images, and the subsequent classification is performed by a support vector machine (SVM) trained with manually annotated data, establishing its proficiency in sonar image feature extraction. To improve the efficiency of sonar image feature extraction, Kong et al. [23] proposed an improved You Only Look Once v3 (YOLOv3) real-time detection algorithm, YOLOv3-DPFIN, to achieve accurate detection of noise-intensive multi-class sonar targets with minimal time consumption; a dual-path network (DPN) module and a fusion transition module are used for efficient feature extraction, and a dense connection method improves multi-scale prediction. Li et al. [24] used the YOLO model and introduced an effective feature encoder to enhance the feature map, applying channel-level sparsity regularization during training to improve inference speed. He et al. [25] proposed a sonar image detection method based on low-rank sparse matrix decomposition: the feature extraction and noise problems are formulated as matrix decomposition, an improved robust principal component analysis algorithm extracts the target, and a fast proximal gradient method is used for optimization. Building on this, to improve the robustness of detection models, Wang et al. [26] proposed a new convolutional neural network that uses a depthwise-separable residual model to extract sonar target regions at multiple scales, a multi-channel fusion method to enhance feature information, and an adaptive supervision function to classify targets, improving the generalization ability and robustness of side-scan target recognition. Song et al. [27] proposed a sonar target detection method based on a Markov model and a neural network model: the neural network fully extracts sonar image information, and a fully convolutional network (FCN) optimized by the Markov model is then used for sonar target segmentation. In the direction of network structure optimization, Fan et al. [28] built a 32-layer residual network to replace ResNet50/101 in Mask R-CNN, which reduced the number of network parameters while maintaining detection accuracy, replaced the stochastic gradient descent (SGD) optimizer with Adagrad (which adaptively adjusts the learning rate), and finally used 2500 sonar images for cross-training to validate the accuracy of the network model.
Existing methods often overlook the strong correlation between target shadows and target features. In this paper, we introduce the attention mechanism to target detection in side-scan sonar images, aiming to utilize the correlation between target shadows and target features fully. Additionally, there is a lack of research on the multi-spatial characteristics of target features and a deficiency of real side-scan sonar datasets to support such studies. Therefore, optimizing existing algorithms to address these key issues in side-scan sonar target detection is highly worthwhile. This paper utilizes a small sample dataset collected from sea experiments for practical training. Attention mechanisms and size scaling factors are introduced into the You Only Look Once v7 (YOLOv7) [29] model. Testing is conducted on both mainstream networks and the improved YOLOv7 network using public datasets and the collected dataset, further validating the crucial role of attention mechanisms and size scaling factors in side-scan sonar image target detection.
In summary, this paper makes the following key contributions:
  • A dataset collected from real sea experiments is used to address the challenge of limited generalizability in model training outcomes.
  • To enhance feature diversity and improve classification, a Swin-Transformer was integrated into the YOLOv7 backbone. This leverages the transformer’s ability to capture long-range dependencies and hierarchical features, boosting the model’s performance in detection tasks.
  • The existing Convolutional Block Attention Module (CBAM) structure is integrated into the prediction head of the YOLOv7 model, enhancing detection speed and accuracy for small targets in complex backgrounds.
  • An innovative scaling factor is introduced into the prediction head to address the problem of variable spatial feature geometry in side-scan sonar images.
The remainder of this paper is structured as follows. Section 1 provides a review of side-scan sonar target detection algorithms, while Section 2 introduces the framework of the side-scan sonar target detection model proposed in this paper. Section 3 details the methodology for obtaining the dataset used in the experiments, and Section 4 presents the experimental results. Finally, Section 5 concludes the paper and outlines future work.

2. Method

Sonar target detection algorithms necessitate a delicate balance between precision and speed. To effectively address this challenge, we have opted for the YOLOv7 [29] model as the core of our algorithm. However, we have enhanced this model to tailor it specifically for underwater sonar image target detection. To address challenges such as small target features, redundant backgrounds, and variable target sizes in side-scan sonar images, we introduce an attention mechanism into the feature backbone of the YOLOv7 model. Additionally, a size scaling factor module is incorporated into the feature recognition stage. Through meticulous optimizations of the YOLOv7 architecture, we have succeeded in achieving a solution that excels in both accuracy and rapid processing speeds.

2.1. YOLOv7’s Network Structure

YOLOv7 [29] stands out as a cutting-edge single-stage target detection algorithm. The algorithm comprises four key components: input, backbone, neck, and prediction. In the input stage, data augmentation techniques are used alongside the K-Means clustering algorithm to dynamically compute anchor boxes from the data. This adaptive approach enhances image diversity and mitigates overfitting. Throughout training, prediction boxes are generated from the predefined anchor boxes and compared with the ground truth to calculate the loss function, which is back-propagated to update the model’s parameters. The YOLOv7 algorithm integrates the efficient layer aggregation network (ELAN) and MaxPool (MP) structures within its backbone, depicted in Figure 1 and Figure 2. This strategic incorporation continually bolsters the network’s learning capacity while ensuring complete feature extraction from images, all without disrupting the original gradient path. Harnessing these advanced features, the YOLOv7 algorithm delivers exceptional accuracy and swift processing speeds, positioning it as an ideal choice for the specific demands of underwater sonar image target detection.
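To make the anchor computation step concrete, the following is a minimal sketch of clustering ground-truth box widths and heights into anchor sizes. It is an illustration only: the official YOLOv7 implementation refines anchors with an IoU-based fitness and a genetic step, and the kmeans_anchors helper, its parameters, and the random boxes below are illustrative assumptions rather than code from the paper.

```python
import numpy as np

def kmeans_anchors(wh, k=9, iters=100, seed=0):
    """Cluster (width, height) pairs of ground-truth boxes into k anchor sizes.

    wh: (N, 2) array of box widths and heights (e.g., in pixels).
    Returns k anchor (w, h) pairs sorted by area.
    """
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Assign every box to its nearest anchor (Euclidean distance in w-h space).
        dist = np.linalg.norm(wh[:, None, :] - centers[None, :, :], axis=-1)
        assign = dist.argmin(axis=1)
        # Move each anchor to the mean of the boxes assigned to it.
        for j in range(k):
            if np.any(assign == j):
                centers[j] = wh[assign == j].mean(axis=0)
    return centers[np.argsort(centers.prod(axis=1))]

# Example with 200 random boxes standing in for dataset label statistics.
boxes_wh = np.abs(np.random.randn(200, 2)) * 50 + 20
print(kmeans_anchors(boxes_wh, k=9))
```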
The loss function plays a crucial role in improving the accuracy of the model’s localization and recognition. In YOLOv7, the Complete Intersection over Union (CIoU) loss function [30] is utilized to achieve this objective. This loss function reduces localization and recognition errors by jointly considering the overlap, center distance, and aspect-ratio consistency between the predicted bounding boxes and the ground-truth boxes, ensuring that the model learns to predict accurate bounding boxes and object categories.
$$ L_{\mathrm{CIoU}} = 1 - \mathrm{IoU} + \frac{\rho^{2}\left(\mathbf{p}, \mathbf{p}^{\mathrm{gt}}\right)}{c^{2}} + \alpha M \tag{1} $$
where $\mathbf{p} = [x, y]^{T}$ and $\mathbf{p}^{\mathrm{gt}} = [x^{\mathrm{gt}}, y^{\mathrm{gt}}]^{T}$ are the center points of the anchor box $B$ and the ground-truth box $B^{\mathrm{gt}}$, $c$ denotes the diagonal length of the smallest enclosing box $C$, $\rho(\cdot)$ is the Euclidean distance, and $M$ is the consistency of the aspect ratio:
$$ M = \frac{4}{\pi^{2}} \left( \arctan\frac{w^{\mathrm{gt}}}{h^{\mathrm{gt}}} - \arctan\frac{w}{h} \right)^{2} \tag{2} $$
where $w$ and $h$ are the width and height of the boxes, and $\alpha$ is a positive trade-off weight:
$$ \alpha = \begin{cases} 0, & \mathrm{IoU} < 0.5 \\ \dfrac{M}{(1 - \mathrm{IoU}) + M}, & \mathrm{IoU} \geq 0.5 \end{cases} \tag{3} $$
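As a concrete reference for Equations (1)–(3), the snippet below is a minimal PyTorch sketch of the CIoU loss for boxes given in (x1, y1, x2, y2) format. It follows the equations as written above (including the IoU < 0.5 branch of Equation (3)); the official YOLOv7 code computes the same quantities with additional bookkeeping for anchors and grids, so treat this as an illustration rather than the paper's implementation.

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """CIoU loss (Eqs. 1-3) for boxes in (x1, y1, x2, y2) format."""
    # IoU between predicted and ground-truth boxes
    ix1 = torch.max(pred[:, 0], target[:, 0])
    iy1 = torch.max(pred[:, 1], target[:, 1])
    ix2 = torch.min(pred[:, 2], target[:, 2])
    iy2 = torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # rho^2: squared distance between centers; c^2: squared diagonal of the enclosing box C
    rho2 = ((pred[:, 0] + pred[:, 2]) - (target[:, 0] + target[:, 2])) ** 2 / 4 \
         + ((pred[:, 1] + pred[:, 3]) - (target[:, 1] + target[:, 3])) ** 2 / 4
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # M: aspect-ratio consistency (Eq. 2); alpha: its weight (Eq. 3)
    w_p, h_p = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w_t, h_t = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    m = (4 / math.pi ** 2) * (torch.atan(w_t / (h_t + eps)) - torch.atan(w_p / (h_p + eps))) ** 2
    alpha = torch.where(iou < 0.5, torch.zeros_like(iou), m / ((1 - iou) + m + eps))

    return 1 - iou + rho2 / c2 + alpha * m

# Toy usage: one predicted box against one ground-truth box.
pred = torch.tensor([[10.0, 10.0, 50.0, 60.0]])
gt = torch.tensor([[12.0, 14.0, 48.0, 58.0]])
print(ciou_loss(pred, gt))
```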

2.2. Improved Algorithm

In this paper, we enhance the detection accuracy of the YOLOv7 model on side-scan sonar images by incorporating attention mechanisms into both the network’s feature extraction backbone and its prediction head. Additionally, we introduce a novel size scaling factor in the prediction head to reduce the model’s missed-detection rate.

2.2.1. Swin-Transformer Module

Side-scan sonar images are derived from acoustic signals, and the detected targets typically occupy only a small geometric area. These images therefore contain large regions of irrelevant background, which severely impedes the speed of sonar image detection, hinders model convergence, and may lead to the learning of incorrect features. To address these challenges, this study integrates attention mechanisms into the YOLOv7 model to enhance its feature extraction capability and improve localization and object recognition accuracy.
Attention mechanisms are inspired by the human visual attention system, which allows for the efficient processing of image information by selectively focusing on relevant regions of interest, even with limited resources. Prominent attention mechanisms, such as efficient channel attention (ECA-Net) [31], Transformer [32], Coordinate Attention (CA) [33], Bottleneck Attention Module (BAM) [34], and CBAM [35], have been shown to enhance the performance of detection models [36,37].
However, global self-attention is computationally heavy and struggles with the diverse backgrounds of sonar images. The Swin-Transformer [38] structure addresses this by introducing a window module into the transformer framework, incorporating the localized characteristics of CNNs. In this work, the Swin-Transformer is integrated into both the feature extraction backbone and the neck of the YOLOv7 model, capitalizing on its strong feature extraction and feature-association capabilities for intricate image features. Its main processing steps are as follows:
  • The process begins with patch embedding, where the input image is divided into blocks and subsequently embedded into an embedding space.
  • Within each stage, the structure comprises patch merging and a series of blocks.
  • The primary function of the Patch Merging module is to reduce the image resolution at the outset of each stage.
  • The block component is predominantly constructed from LayerNorm (LN), multi-layer perceptron (MLP), window attention, and shifted window attention elements.
The main structure of Swin-Transformer is shown in Figure 3:
The internal structure of the Swin-Transformer block is shown in Figure 4.
  • The patch merging process groups each set of neighboring 2 × 2 pixels into a distinct patch. Pixels in corresponding positions (with the same color) within each patch are then gathered, resulting in four feature maps of depth C. These maps are concatenated in the depth dimension, and a fully connected layer operating on the depth dimension halves the concatenated depth from 4C to 2C.
  • The windows multi-head self-attention (W-MSA) module is formulated to minimize the computational load. The feature map is first partitioned into discrete windows, each measuring M × M, and self-attention is then executed within each window independently (a code sketch of this partition follows this list). The computational complexity of the MSA and W-MSA modules is given in Equation (4):
    $$ \begin{aligned} \Omega(\mathrm{MSA}) &= 4hwC^{2} + 2(hw)^{2}C \\ \Omega(\mathrm{W\text{-}MSA}) &= 4hwC^{2} + 2M^{2}hwC \end{aligned} \tag{4} $$
    where h and w are the height and width of the feature map, M is the window size, and C is the number of channels.
  • The SW-MSA module builds upon the foundation of W-MSA and introduces offsets to address the challenge of exchanging information between distinct windows.
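The window partition behind Equation (4) can be sketched in a few lines. The helper below, including its names and toy sizes, is a simplified illustration of the W-MSA idea (attention restricted to M × M windows) rather than the Swin-Transformer implementation; the second function simply evaluates the two FLOP counts of Equation (4) for comparison.

```python
import torch

def window_partition(x, window_size):
    """Split a (B, H, W, C) feature map into non-overlapping M x M windows.

    Assumes H and W are divisible by window_size; each returned window
    is attended to independently in W-MSA.
    """
    B, H, W, C = x.shape
    M = window_size
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, C)

def attention_flops(h, w, c, window_size=None):
    """FLOPs of global MSA (window_size=None) vs. W-MSA, following Eq. (4)."""
    if window_size is None:
        return 4 * h * w * c ** 2 + 2 * (h * w) ** 2 * c
    return 4 * h * w * c ** 2 + 2 * window_size ** 2 * h * w * c

x = torch.randn(1, 56, 56, 96)               # a typical early-stage feature map
print(window_partition(x, 7).shape)          # torch.Size([64, 49, 96])
print(attention_flops(56, 56, 96), attention_flops(56, 56, 96, window_size=7))
```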

2.2.2. Convolutional Block Attention Module

The Swin-Transformer contributes a window-based spatial attention mechanism to the YOLO detection model but lacks a channel attention mechanism. To fill this gap, we introduce CBAM [35], a lightweight attention module, into the neck structure of our network to enrich its feature representations. The CBAM module is divided into two distinct parts, channel attention and spatial attention, as illustrated in Figure 5.
  • Channel attention module: the channel attention module aggregates spatial information by global average pooling and global max pooling. Max pooling retains the strongest responses, enhancing sensitivity to target feature information and improving classification accuracy; average pooling summarizes the whole region, improving sensitivity to background information and reducing the weight of useless background. Given a feature map F of size H × W × C, where H is the height, W is the width, and C is the number of channels, global average pooling and max pooling are applied to F over the spatial domain to obtain two pooled descriptors, F_avg^c and F_max^c, as shown in Equation (5):
    $$ M_{c}(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) = \sigma\big(W_{1}(W_{0}(F^{c}_{\mathrm{avg}})) + W_{1}(W_{0}(F^{c}_{\mathrm{max}}))\big) \tag{5} $$
  • Spatial attention module: the spatial attention module focuses on meaningful locations in the image data. Given the input feature map F of size H × W × C, average pooling and max pooling are applied along the channel dimension to obtain two 2D maps, F_avg^s ∈ R^(1×H×W) and F_max^s ∈ R^(1×H×W), as shown in Equation (6):
    $$ M_{s}(F) = \sigma\big(f^{7\times 7}([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)])\big) = \sigma\big(f^{7\times 7}([F^{s}_{\mathrm{avg}}; F^{s}_{\mathrm{max}}])\big) \tag{6} $$
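For reference, Equations (5) and (6) translate almost directly into a small PyTorch module. The sketch below follows the standard CBAM formulation [35] with an assumed reduction ratio of 16 and a 7 × 7 spatial kernel; it illustrates the sequential channel-then-spatial weighting rather than the exact layer placement used in our network.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention M_c (Eq. 5): shared MLP over avg- and max-pooled descriptors."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),  # W0
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),  # W1
        )

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        return torch.sigmoid(avg + mx)

class SpatialAttention(nn.Module):
    """Spatial attention M_s (Eq. 6): 7x7 conv over channel-wise avg and max maps."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)
        mx, _ = torch.max(x, dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    """Applies channel attention, then spatial attention, to a feature map."""
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.ca = ChannelAttention(channels, reduction)
        self.sa = SpatialAttention(kernel_size)

    def forward(self, x):
        x = x * self.ca(x)
        return x * self.sa(x)

feat = torch.randn(2, 256, 40, 40)     # e.g., a neck feature map
print(CBAM(256)(feat).shape)           # torch.Size([2, 256, 40, 40])
```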

2.2.3. Scaling Factor Module

The same target, located at identical latitude and longitude coordinates, exhibits varying geometric features across different side-scan sonar images. These differences arise from factors such as variations in the draft depth, spatial distance, and towing speed of the side-scan sonar equipment. To address this challenge, we introduce a feature scaling factor, as illustrated in Figure 6.
In Figure 6, the scaling module resizes the feature map of the detected anchor-box target using up-sampling, bringing the target features to the same order of magnitude. This mitigates the problem of unfixed target sizes in side-scan sonar images, prevents the model from learning misleading size cues, and substantially improves detection accuracy. The scaling ratio of feature map A is determined by the distance between the center point of its anchor box and the image midline, denoted by L1, and half of the image width, L, giving L1/L. Similarly, the scaling ratio of feature map B is calculated from the distance L2 between the center point of its anchor box and the image midline, giving L2/L.
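One plausible reading of this scaling step is sketched below: the ratio L_i/L is computed from the anchor-box center and the track line, and the cropped target feature is then up-sampled with bilinear interpolation. The function name, the inversion of the ratio, and the cap on the scale factor are assumptions made for illustration, since the paper specifies the ratio but not the exact resizing rule applied inside the prediction head.

```python
import torch
import torch.nn.functional as F

def rescale_by_track_distance(feature_crop, box_cx, image_width, max_scale=4.0):
    """Rescale a cropped target feature map according to its across-track position.

    L   = half the image width (distance from track line to image edge)
    L_i = distance from the anchor-box center to the track line (image midline)
    The crop is up-sampled by an assumed factor of L / L_i, capped at max_scale.
    """
    half_width = image_width / 2.0                 # L
    dist = abs(box_cx - half_width)                # L_i
    ratio = max(dist / half_width, 1e-3)           # L_i / L, as in Figure 6
    factor = min(1.0 / ratio, max_scale)           # assumed up-sampling factor
    return F.interpolate(feature_crop, scale_factor=factor,
                         mode="bilinear", align_corners=False)

crop = torch.randn(1, 256, 8, 8)                   # hypothetical cropped target feature
print(rescale_by_track_distance(crop, box_cx=300, image_width=1000).shape)
```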

2.2.4. Improved YOLOv7 Architecture

Figure 7 depicts the enhanced architecture of the YOLOv7 model. In our approach, the Swin-Transformer structure is integrated into the backbone network to extract image features effectively and to strengthen the predictive capability of the feature network. This integration facilitates comprehensive feature extraction and accentuates the pertinent targets in sonar images within the feature representation, while adhering to the primary framework of YOLOv7. Furthermore, the CBAM module is applied to the prediction head, which suppresses background noise in the features and helps the model focus on small, bright points of interest. Finally, we introduce a feature size scaling factor to standardize the size of all features used for prediction, resulting in a significant enhancement in detection accuracy.

3. Dataset

To validate the effectiveness of the proposed method, two different datasets are employed. The superiority of the proposed algorithm is first confirmed on an established public dataset (SCTD), followed by an evaluation of the model’s generalization capabilities on a dataset collected at sea (SSS-VOC).

3.1. Public Dataset

The dataset was originally introduced by Zhang et al. [39]. It comprises a total of 366 side-scan sonar images, with 351 images containing a single target and 6 images containing two targets. The classification of the dataset’s targets and their corresponding labels are presented in Table 1.
The three types of datasets are shown in Figure 8.

3.2. SideScan Sonar Dataset

The current dataset for side-scan sonar is often collected within controlled experimental environments, leading to a disparity between these data and the actual conditions encountered during ocean exploration. Consequently, the models trained on such datasets exhibit limited ability to generalize in real-world ocean exploration scenarios. To address the issue of poor generalization observed in existing side-scan sonar recognition models, we have taken a different approach. We have curated a dataset by capturing side-scan sonar data directly from real ocean exploration activities. This dataset comprises side-scan sonar images that faithfully represent the authentic conditions of ocean exploration and search operations. This choice of data empowers us to train neural network models that exhibit an enhanced generalization performance when applied to practical engineering scenarios.

3.2.1. Data Acquisition

The sonar equipment used for the experiment is shown in Figure 9.
The side-scan sonar equipment used is the SS3060 from Hydratron, which operates at both a high and a low frequency. The parameters of the side-scan sonar are listed in Table 2.
The process begins by attaching a preconfigured sonar search target to a buoy, and securely positioning it at a designated location within the ocean. Subsequently, a side-scan sonar device is employed to establish the draft depth and is affixed to the ship’s side rigging. In the final step, the vessel navigates around the vicinity of the search target, varying the distances covered. This dynamic movement captures side-scan sonar images under authentic oceanic conditions, accounting for the influence of ocean currents and flow patterns.

3.2.2. Data Processing

During the experiment, dual channels of high frequency and low frequency are utilized for data acquisition.
The collected data are shown in Figure 10. The acquired side-scan sonar images have dimensions of 1000 × 1000 pixels. In a side-scan sonar image, the line in the middle is called the track line. It serves as the reference for measuring the distance to targets on either side, the target position, the altitude, and the measurement direction. The continuous curves on the left and right sides of the track line are the seabed lines, which show the pattern of the seabed’s unevenness; the spacing between a seabed line and the track line reflects changes in height above the seafloor. The image is divided into the water column area and the acoustic wave irradiation area. The water column area is where no echoes are received because of the short echo interval, while the acoustic wave irradiation area is where echoes are typically detected. Because the prefabricated models vary relatively little in size, the differences in their features within the sonar images are not pronounced, making target differentiation challenging. A subset of the acquired side-scan sonar images lacks discernible targets, and shadows caused by the marine environment exacerbate the complexity; such images do not contribute effectively to target feature extraction and have therefore been excluded from the model’s dataset. An example of a discarded side-scan sonar image is shown in Figure 11.
First, we introduce random rotation to enhance the diversity of image features. Edge contrast is then increased by unsharp masking (USM). Finally, the images are normalized to reduce the complexity of model training. The classification of targets and their corresponding labels in the SSS-VOC dataset is provided in Table 3. We categorize underwater objects into two groups: detection targets, which exhibit distinct shadow characteristics, and natural underwater clutter, which appears as bright spots in side-scan sonar images. The dataset was divided into training, validation, and test sets in a ratio of 0.7:0.15:0.15.
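The three preprocessing steps can be sketched as follows with OpenCV. The rotation range, USM strength, and Gaussian sigma are illustrative values chosen for the example, not parameters reported in the paper.

```python
import cv2
import numpy as np

def preprocess_sss_image(img, max_angle=15, usm_amount=1.5, usm_sigma=3):
    """Random rotation, unsharp masking (USM), and min-max normalization."""
    h, w = img.shape[:2]

    # 1. Random rotation about the image center to diversify image features.
    angle = np.random.uniform(-max_angle, max_angle)
    rot = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    img = cv2.warpAffine(img, rot, (w, h), borderMode=cv2.BORDER_REFLECT)

    # 2. Unsharp masking: original + amount * (original - blurred) sharpens edges.
    blurred = cv2.GaussianBlur(img, (0, 0), usm_sigma)
    img = cv2.addWeighted(img, 1 + usm_amount, blurred, -usm_amount, 0)

    # 3. Normalize intensities to [0, 1] to simplify training.
    img = img.astype(np.float32)
    return (img - img.min()) / (img.max() - img.min() + 1e-8)

sonar = (np.random.rand(1000, 1000) * 255).astype(np.uint8)   # stand-in image
out = preprocess_sss_image(sonar)
print(out.shape, out.dtype, out.min(), out.max())
```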
Within the SSS-VOC dataset, there are 156 images containing one target, 66 images containing two targets, 18 images containing three targets, and 2 images each containing four, six, and seven targets. In total, there are 376 targets, calculated as follows: 376 = 1 × 156 + 2 × 66 + 3 × 18 + 2 × ( 4 + 6 + 7 ) .

4. Experiment and Analysis

To comprehensively validate the superiority of our proposed algorithm and the integrity of the dataset, we commence with ablation experiments on the SCTD dataset. These experiments aim to confirm the efficacy of our model through qualitative analysis. Additionally, using the SSS-VOC dataset, we benchmark our algorithm against current state-of-the-art methods. Through quantitative and qualitative analyses, including further ablation experiments, we verify the effectiveness of our approach for side-scan sonar images in real marine environments.

4.1. Training

To validate the SSS-VOC dataset and enhance the model’s efficacy, we implemented the proposed model using the PyTorch framework on a system comprising an Intel(R) Core(TM) i9-10900F CPU @ 2.80 GHz, 24 GB of RAM, and two Nvidia GeForce RTX 3090 GPUs, running a 64-bit Windows 10 operating system with CUDA 11.3. The main training parameters are shown in Table 4.
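As a rough guide to how the values in Table 4 map onto a PyTorch training setup, the fragment below builds an SGD optimizer with the reported learning rate and weight decay. The momentum value and the cosine schedule are illustrative assumptions, and the single convolution is only a stand-in for the improved YOLOv7 network.

```python
import torch

# Stand-in module; the actual improved-YOLOv7 network follows Figure 7.
model = torch.nn.Conv2d(3, 16, kernel_size=3)

# Learning rate 0.01 and weight decay 0.0005 come from Table 4; momentum 0.9
# and cosine annealing over the 220 epochs are assumed, not reported values.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=0.0005)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=220)
```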

4.2. Evaluation Index of the Model

To compare the training results of different detection models, we use unified evaluation metrics: the average precision (AP) and the mean average precision (mAP), as defined in Equations (7)–(10),
$$ R = \frac{TP}{TP + FN} \tag{7} $$
$$ P = \frac{TP}{TP + FP} \tag{8} $$
$$ AP = \int_{0}^{1} P(R)\, dR \tag{9} $$
$$ mAP = \frac{1}{N} \sum_{i=1}^{N} AP_{i} \tag{10} $$
where R is the recall rate and P is the precision. TP (true positive) is the number of correctly detected positive samples, FP (false positive) is the number of negative samples incorrectly detected as positive, and FN (false negative) is the number of positive samples that are missed. N is the number of target categories.
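Equations (9) and (10) are commonly evaluated from a discrete precision-recall curve. The sketch below uses the all-point interpolation familiar from VOC/COCO tooling; the toy recall/precision arrays at the end are made up purely to show the call pattern.

```python
import numpy as np

def average_precision(recall, precision):
    """AP as the area under the precision-recall curve (Eq. 9)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Make precision monotonically non-increasing, then integrate over recall.
    p = np.maximum.accumulate(p[::-1])[::-1]
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(ap_per_class):
    """mAP as the mean of per-class AP values (Eq. 10)."""
    return float(np.mean(ap_per_class))

# Toy example: a short PR curve for one class, then mAP over two classes.
rec = np.array([0.1, 0.4, 0.7, 0.9])
prec = np.array([1.0, 0.9, 0.7, 0.5])
ap = average_precision(rec, prec)
print(round(ap, 3), mean_average_precision([ap, 0.6]))
```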

4.3. Public Dataset Result

We commence with ablation experiments on the public dataset to showcase the pivotal role played by the attention mechanism modules. The confusion matrices produced by the different models are depicted in Figure 12. Each row of a confusion matrix represents a predicted category, and the row total is the number of samples predicted to be in that category; each column represents the true category, and the column total is the number of samples in that category. Analyzing the confusion matrices, it becomes evident that the attention mechanism significantly enhances the model’s detection accuracy within each class, thereby reducing uncertainty and false detection rates. The Swin-Transformer structure primarily reduces false detections in which the background is mistakenly identified as a target, since it focuses on capturing the correlation between the background and the target. Meanwhile, the CBAM structure, integrated at the prediction head, places greater emphasis on minimizing false detections within the target categories. Additionally, the model training loss curves are visualized in Figure 13. The plots indicate that all three loss terms converge to a stable state. Precision exhibits significant oscillations initially but stabilizes around 0.9 after 130 epochs, while recall shows less fluctuation throughout training and converges to approximately 0.75.

4.3.1. Qualitative Results of Ablation Experiments in the SCTD Dataset

We subjected a subset of the same test dataset to all three detection networks, and the resulting detection outcomes are illustrated in Figure 14. In Figure 14a,d, the YOLOv7 model exhibits a notable issue with missing detections, suggesting that existing detection models encounter challenges in complex side-scan sonar image backgrounds. The incorporation of an attention mechanism not only mitigates the missing detection problem prevalent in existing models for side-scan sonar images but also enhances detection certainty, as observed in Figure 14b,c. This serves as evidence that the attention mechanism contributes to improving the model’s detection accuracy.

4.3.2. Quantitative Results in the SCTD Dataset

The quantitative evaluation scores of the experimental outcomes are detailed in Table 5. Among the compared existing models, each using the same input size, the Sparse R-CNN model attained the highest mean average precision at IoU 0.5 (mAP0.5) and mean average precision between IoU 0.5 and 0.95 (mAP0.5:0.95), recording 75.12% and 50.01%, respectively. Our improved model outperformed the rest, registering an mAP0.5 of 77.37% and an mAP0.5:0.95 of 51.92%, exceeding the best-performing existing model by 2.25% and 1.91%, respectively.

4.4. SSS-VOC Dataset Result

To validate the superior performance of our proposed model, qualitative experiments were conducted; the P–R curve and confusion matrix are shown in Figure 15, and the model training loss curves are visualized in Figure 16. The plots indicate that all three loss terms converge to a stable state. Precision exhibits significant oscillations, stabilizing around 0.8 after 120 epochs, while recall also fluctuates during training and eventually converges to approximately 0.82. These fluctuations may be attributed to the small amount of abnormal data present in the dataset.

4.4.1. Qualitative Results in the SSS-VOC Dataset

We conducted a comparison between our model and existing detection models on the same dataset, and the results are visualized in Figure 17. In Figure 17a, for the detection of target models, only the Sparse R-CNN model exhibited missed detections. For the detection of bright spots, only our proposed model missed a single target, whereas the other models missed two. This highlights the effectiveness of the CBAM module in handling small bright-spot targets through secondary feature extraction in both the spatial and channel dimensions. Regarding detection accuracy, our proposed model demonstrated high classification confidence. In Figure 17b, except for the SSD model, no model experienced missed detections. Moreover, our proposed model detected targets that were not marked by human annotators but met the geometric feature criteria, showcasing its generalization capability. In Figure 17c, during target detection, only our proposed model missed one target, while the other models missed two. This may be because existing models do not fully consider the intricate relationship between small targets and the background during CNN-based feature extraction, resulting in the exclusion of relevant features; our Swin-Transformer module addresses this by fully capturing the relationship between targets and the background. For objects appearing as small bright spots, our proposed model achieved both the highest detection rate and the highest confidence. In Figure 17d, the Sparse R-CNN model missed a target, and the Cascade R-CNN model produced a false detection. Only our proposed model and the SSD model detected all existing targets, with our model exhibiting higher confidence than the SSD model.

4.4.2. Quantitative Results in the SSS-VOC Dataset

The quantitative evaluation scores of the experimental outcomes are detailed in Table 6. Among the compared existing models, each using the same input size, the Cascade R-CNN model attained the highest mAP0.5 and mAP0.5:0.95 scores, recording 82.65% and 46.73%, respectively. Our improved model outperformed the rest, registering an mAP0.5 of 87.9% and an mAP0.5:0.95 of 49.23%, exceeding the best-performing existing model by 5.25% and 2.5%, respectively. These results underscore the precision of our optimized model and the significance of the incorporated attention mechanisms and scaling factor.

4.5. Ablation Experiment

To validate the effectiveness of our proposed optimization module, we performed ablation experiments on the SSS-VOC dataset.

4.5.1. Quantitative Results of Ablation Experiments

We conducted ablation experiments to assess the impact of integrating modules into the YOLOv7 model, and the quantitative results of these experiments are presented in Table 7. According to the data in Table 7, the incorporation of the Swin-Transformer structure boosts the mAP0.5 and mAP0.5:0.95 scores of the YOLOv7 model by 1.79% and 1.48%, respectively. Subsequently, the addition of the CBAM structure further improves these scores by 4.91% and 1.33%, respectively. Finally, the scaling factor enhances the scores by 2.58% and 5.6%. Overall, our enhanced model demonstrates improvements of 9.28% and 8.41% over the original model.

4.5.2. Qualitative Results of Ablation Experiments

A visual comparison of the ablation experiment outcomes is depicted in Figure 18. The comparison of Figure 18a–c shows that the models incorporating the attention modules raise the confidence of the YOLOv7 detections by 0.11 and 0.05, respectively, relative to the baseline YOLOv7 model. This demonstrates that integrating the attention structures into both the backbone network and the prediction head of YOLOv7 enables better focus on target feature extraction, improves feature integration, and reduces background interference, thereby enhancing detection confidence.
Moreover, the introduction of the size-scaling structure ensures that all target geometric features are predicted at a consistent scale. A comparison between Figure 18a,d illustrates that the size scaling factor enables the detection of targets missed by the YOLOv7 model, thereby reducing missed detections and underscoring the importance of size scaling in side-scan sonar target feature recognition.

5. Conclusions

This paper addresses the challenges of low precision and limited generalization capacity in side-scan sonar target recognition. We integrated the Swin-Transformer attention mechanism, the CBAM structure, and the scaling factor into the YOLOv7 detection model and validated the efficacy of these attention mechanisms on an established open dataset. The improved model was then further validated on a real-world ocean side-scan sonar image dataset. Our enhanced model demonstrates improvements in mAP0.5 and mAP0.5:0.95 of 9.28% and 8.41%, respectively, over the original model, effectively addressing real-world ocean engineering challenges. The comprehensive experimental validation establishes the proposed model’s accuracy and generalization capability. This approach holds promise for applications in UUVs: by enabling automatic detection and identification, the model can significantly enhance the efficiency of exploring uncharted targets on the seafloor, thereby boosting the intelligence and operational effectiveness of UUVs.
Despite these enhancements, the model remains architecturally complex and its detection speed is relatively slow. In future work, we aim to develop a neural network model that balances high precision with a lightweight design for side-scan sonar image target detection, meeting stringent accuracy requirements while ensuring efficient deployment.

Author Contributions

X.W.: Software, Methodology, Writing—original draft. F.Z.: Supervision, Funding acquisition. J.W.: Writing—review. C.C.: Investigation, Visualization. G.P.: Editing. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the National Key R&D Program of China (2023QYXX).

Data Availability Statement

Data available upon request from the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Li, L.; Li, Y.; Yue, C.; Xu, G.; Wang, H.; Feng, X. Real-time underwater target detection for AUV using side scan sonar images based on deep learning. Appl. Ocean. Res. 2023, 138, 103630. [Google Scholar] [CrossRef]
  2. Wu, M.; Wang, Q.; Rigall, E.; Li, K.; Zhu, W.; He, B.; Yan, T. ECNet: Efficient convolutional networks for side scan sonar image segmentation. Sensors 2019, 19, 2009. [Google Scholar] [CrossRef]
  3. Yu, Y.; Zhao, J.; Gong, Q.; Huang, C.; Zheng, G.; Ma, J. Real-time underwater maritime object detection in side-scan sonar images based on transformer-YOLOv5. Remote Sens. 2021, 13, 3555. [Google Scholar] [CrossRef]
  4. Chen, Z.; Wang, H.; Shen, J.; Dong, X. Underwater object detection by combining the spectral residual and three-frame algorithm. In Advances in Computer Science and Its Applications: CSA 2013; Springer: Berlin/Heidelberg, Germany, 2014; pp. 1109–1114. [Google Scholar]
  5. Villar, S.A.; Acosta, G.G.; Solari, F.J. OS-CFAR process in 2-D for object segmentation from Sidescan Sonar data. In Proceedings of the 2015 XVI Workshop on Information Processing and Control (RPIC), Cordoba, Argentina, 6–9 October 2015; pp. 1–6. [Google Scholar]
  6. Mukherjee, K.; Gupta, S.; Ray, A.; Phoha, S. Symbolic analysis of sonar data for underwater target detection. IEEE J. Ocean. Eng. 2011, 36, 219–230. [Google Scholar] [CrossRef]
  7. Midtgaard, Ø.; Hansen, R.E.; Sæbø, T.O.; Myers, V.; Dubberley, J.R.; Quidu, I. Change detection using synthetic aperture sonar: Preliminary results from the Larvik trial. In Proceedings of the OCEANS’11 MTS/IEEE KONA, Waikoloa, HI, USA, 19–22 September 2011; pp. 1–8. [Google Scholar]
  8. Long, H.; Shen, L.; Wang, Z.; Chen, J. Underwater Forward-Looking Sonar Images Target Detection via Speckle Reduction and Scene Prior. IEEE Trans. Geosci. Remote. Sens. 2023, 61, 5604413. [Google Scholar] [CrossRef]
  9. Szymak, P.; Piskur, P.; Naus, K. The effectiveness of using a pretrained deep learning neural networks for object classification in underwater video. Remote. Sens. 2020, 12, 3020. [Google Scholar] [CrossRef]
  10. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  11. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. Springer: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
  12. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  13. Zhou, J.; Sun, J.; Li, C.; Jiang, Q.; Zhou, M.; Lam, K.M.; Zhang, W.; Fu, X. HCLR-Net: Hybrid Contrastive Learning Regularization with Locally Randomized Perturbation for Underwater Image Enhancement. Int. J. Comput. Vis. 2024, 1–25. [Google Scholar]
  14. Zhuang, P.; Wu, J.; Porikli, F.; Li, C. Underwater image enhancement with hyper-laplacian reflectance priors. IEEE Trans. Image Process. 2022, 31, 5442–5455. [Google Scholar] [CrossRef]
  15. Zhuang, P.; Li, C.; Wu, J. Bayesian retinex underwater image enhancement. Eng. Appl. Artif. Intell. 2021, 101, 104171. [Google Scholar] [CrossRef]
  16. Zacchini, L.; Franchi, M.; Manzari, V.; Pagliai, M.; Secciani, N.; Topini, A.; Stifani, M.; Ridolfi, A. Forward-looking sonar CNN-based automatic target recognition: An experimental campaign with FeelHippo AUV. In Proceedings of the 2020 IEEE/OES Autonomous Underwater Vehicles Symposium (AUV), Virtual, 29 September–1 October 2020; pp. 1–6. [Google Scholar]
  17. Yulin, T.; Shaohua, J.; Gang, B.; Yonghou, Z.; Fan, L. The transfer learning with convolutional neural network method of side-scan sonar to identify wreck images. Acta Geod. Cartogr. Sin. 2021, 50, 260. [Google Scholar]
  18. Du, X.; Sun, Y.; Song, Y.; Sun, H.; Yang, L. A Comparative Study of Different CNN Models and Transfer Learning Effect for Underwater Object Classification in Side-Scan Sonar Images. Remote. Sens. 2023, 15, 593. [Google Scholar] [CrossRef]
  19. Zhang, H.; Tian, M.; Shao, G.; Cheng, J.; Liu, J. Target detection of forward-looking sonar image based on improved YOLOv5. IEEE Access 2022, 10, 18023–18034. [Google Scholar] [CrossRef]
  20. Lee, S.; Park, B.; Kim, A. Deep learning from shallow dives: Sonar image generation and training for underwater object detection. arXiv 2018, arXiv:1810.07990. [Google Scholar]
  21. Zhang, W.; Zhou, L.; Zhuang, P.; Li, G.; Pan, X.; Zhao, W.; Li, C. Underwater image enhancement via weighted wavelet visual perception fusion. IEEE Trans. Circuits Syst. Video Technol. 2023, 34, 2469–2483. [Google Scholar] [CrossRef]
  22. Zhu, P.; Isaacs, J.; Fu, B.; Ferrari, S. Deep learning feature extraction for target recognition and classification in underwater sonar images. In Proceedings of the 2017 IEEE 56th Annual Conference on Decision and Control (CDC), Melbourne, Australia, 12–15 December 2017; pp. 2724–2731. [Google Scholar]
  23. Kong, W.; Hong, J.; Jia, M.; Yao, J.; Cong, W.; Hu, H.; Zhang, H. YOLOv3-DPFIN: A dual-path feature fusion neural network for robust real-time sonar target detection. IEEE Sens. J. 2019, 20, 3745–3756. [Google Scholar] [CrossRef]
  24. Li, Z.; Chen, D.; Yip, T.L.; Zhang, J. Sparsity Regularization-Based Real-Time Target Recognition for Side Scan Sonar with Embedded GPU. J. Mar. Sci. Eng. 2023, 11, 487. [Google Scholar] [CrossRef]
  25. He, J.; Chen, J.; Xu, H.; Ayub, M.S. Small Target Detection Method Based on Low-Rank Sparse Matrix Factorization for Side-Scan Sonar Images. Remote Sens. 2023, 15, 2054. [Google Scholar] [CrossRef]
  26. Wang, Z.; Guo, J.; Huang, W.; Zhang, S. Side-scan sonar image segmentation based on multi-channel fusion convolution neural networks. IEEE Sens. J. 2022, 22, 5911–5928. [Google Scholar] [CrossRef]
  27. Song, Y.; Zhu, Y.; Li, G.; Feng, C.; He, B.; Yan, T. Side scan sonar segmentation using deep convolutional neural network. In Proceedings of the OCEANS 2017-Anchorage, Anchorage, Alaska, 18–21 September 2017; pp. 1–4. [Google Scholar]
  28. Fan, Z.; Xia, W.; Liu, X.; Li, H. Detection and segmentation of underwater objects from forward-looking sonar based on a modified Mask RCNN. Signal Image Video Process. 2021, 15, 1135–1143. [Google Scholar] [CrossRef]
  29. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  30. Zheng, Z.; Wang, P.; Ren, D.; Liu, W.; Ye, R.; Hu, Q.; Zuo, W. Enhancing geometric factors in model learning and inference for object detection and instance segmentation. IEEE Trans. Cybern. 2021, 52, 8574–8586. [Google Scholar] [CrossRef] [PubMed]
  31. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar]
  32. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
  33. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on cOmputer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar]
  34. Park, J.; Woo, S.; Lee, J.Y.; Kweon, I.S. Bam: Bottleneck attention module. arXiv 2018, arXiv:1807.06514. [Google Scholar]
  35. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  36. Zhang, M.; Yin, L. Solar cell surface defect detection based on improved YOLO v5. IEEE Access 2022, 10, 80804–80815. [Google Scholar] [CrossRef]
  37. Zhang, Z.; Yan, Z.; Jing, J.; Gu, H.; Li, H. Generating Paired Seismic Training Data with Cycle-Consistent Adversarial Networks. Remote. Sens. 2023, 15, 265. [Google Scholar] [CrossRef]
  38. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  39. Zhang, P.; Tang, J.; Zhong, H.; Ning, M.; Liu, D.; Wu, K. Self-trained target detection of radar and sonar images using automatic deep learning. IEEE Trans. Geosci. Remote Sens. 2021, 60, 4701914. [Google Scholar] [CrossRef]
Figure 1. ELAN structure.
Figure 2. MP structure.
Figure 3. Swin-Transformer.
Figure 4. Swin-Transformer blocks.
Figure 5. CBAM architecture.
Figure 6. Scaling factor. L is the range of the sonar. L1 and L2 are the distance from the anchor frame center to the center line.
Figure 7. Improved YOLOV7 architecture.
Figure 8. Examples of public dataset. (a) Aircraft; (b) Human; and (c) Ship.
Figure 9. The sonar detection for data collection.
Figure 10. High-frequency sonar image data collected from real experiments.
Figure 11. Discarded side-scan sonar image. (a) No target; and (b) complex background.
Figure 12. Confusion matrix on public dataset. (a) YOLOv7 (b) YOLOv7+Swin-Transformer (c) YOLOv7+Swin-Transformer+CBAM.
Figure 13. Result on public dataset. The loss function, precision, recall, and mAP evaluation metrics are included in the result picture.
Figure 14. Comparison of Ablation Experiments on a Public Dataset. In images (a–c), our proposed optimization module improves the accuracy of the model. In the image (d), our proposed optimization module improves the leakage rate of the model, yet the leakage problem still exists.
Figure 15. Result experiments on SSS-VOC dataset. (a) P–R curve (b) confusion-matrix.
Figure 16. Result on SSS-VOC dataset. The loss function, precision, recall, and mAP evaluation metrics are included in the result picture.
Figure 17. Comparison of different algorithms on the SSS-VOC dataset. (a,d) Demonstrate the accuracy advantage of the proposed model in this paper; (c) Demonstrate the low leakage rate of the proposed model in this paper; (b) Demonstrate that the proposed model in this paper has the possibility of misdetection.
Figure 18. Ablation experiments on SSS-VOC dataset; (a) original tag (b) YOLOv7 (c) YOLOv7+Attention (d) YOLOv7+Attention+ScalingFactor.
Table 1. Statistics of public dataset.
Categories | Training | Validation | Testing | Total
Aircraft | 39 | 9 | 9 | 57
Human | 24 | 5 | 6 | 35
Ship | 189 | 41 | 41 | 271
Total | 252 | 55 | 56 | 363
Table 2. Sonar parameters for the SS3060.
Parameter | High | Low
Frequency | 600 kHz | 300 kHz
Range | 50 m | 150 m
Vertical resolution | 1.25 cm | 2.5 cm
Maximum slope distance | 120 m | 230 m
Parallel beamwidth | 0.26° | 0.28°
Along-track resolution | 0.23 cm | 0.73 cm
Table 3. Statistics of SSS-VOC dataset.
Categories | Training | Validation | Testing | Total
Bright | 169 | 37 | 36 | 242
Target | 93 | 21 | 20 | 134
Total | 262 | 58 | 56 | 376
Table 4. Training Workstation.
Parameters | Configuration
Img size | (960, 960)
Batch size | 32
Epochs | 220
Learning rate | 0.01
Weight decay | 0.0005
Table 5. Quantitative assessment of the experimental results.
Method | Input | mAP50 (%) | mAP50–95 (%) | Precision | Recall | Time (s)
SSD | (608, 416) | 72.31 | 45.56 | 0.83 | 0.70 | 0.956
Sparse R-CNN | (608, 416) | 75.12 | 50.01 | 0.92 | 0.73 | 0.625
Cascade R-CNN | (608, 416) | 74.64 | 48.73 | 0.84 | 0.72 | 0.791
Improved YOLOv7 | (608, 416) | 77.37 | 51.92 | 0.95 | 0.75 | 0.506
Table 6. Quantitative assessment of the experimental results.
Method | Input | mAP50 (%) | mAP50–95 (%) | Precision | Recall | Time (s)
SSD | (960, 960) | 78.98 | 43.34 | 0.81 | 0.85 | 1.21
Sparse R-CNN | (960, 960) | 79.56 | 45.78 | 0.82 | 0.88 | 0.884
Cascade R-CNN | (960, 960) | 82.65 | 46.73 | 0.84 | 0.91 | 0.912
Improved YOLOv7 | (960, 960) | 87.9 | 49.23 | 0.88 | 0.92 | 0.818
Table 7. Quantitative results of the ablation experiments.
Method | Input | mAP50 (%) | mAP50–95 (%) | Precision | Recall | Time (s)
YOLOv7 | (960, 960) | 78.62 | 40.82 | 0.87 | 0.83 | 0.723
YOLOv7+Swin | (960, 960) | 80.41 | 42.3 | 0.85 | 0.86 | 0.754
YOLOv7+Swin+CBAM | (960, 960) | 85.32 | 43.63 | 0.86 | 0.89 | 0.816
YOLOv7+Swin+CBAM+Scaling | (960, 960) | 87.9 | 49.23 | 0.88 | 0.92 | 0.818
