1. Introduction
As a main detection approach for many underwater tasks—such as maritime emergency rescue, wreckage salvage, and military defense—side-scan sonar (SSS) can quickly search sizeable areas and obtain continuous two-dimensional images of the marine environment, even in low-visibility water [1,2]. The underwater search procedure usually adopted by engineers is to first scan the target sea area with sonar, then export the image after a global scan, and finally judge whether a target is present according to the experience of the sonar operator [3]. However, manual judgement is inefficient, time-consuming, resource-intensive, and overly reliant on experience. With the development of equipment such as unmanned ships and autonomous underwater vehicles (AUVs) [4], identifying sunken targets in SSS images accurately, quickly, and automatically has become increasingly important. To achieve automatic operation of AUVs, researchers have done a great deal of work on automatic target classification (ATC) in SSS images [5,6,7,8,9].
Seabed reverberation and the complex underwater environment introduce various types of noise into sonar images, such as speckle noise, Gaussian noise, and impulse noise, of which speckle noise is the most prominent [10]. Speckle noise [11], which appears as random grains of brighter and darker pixels in sonar images, leads to the loss of image detail, reduced contrast, and blurred edges, and therefore makes feature extraction from targets in sonar images more difficult. Traditional underwater sonar image classification methods, developed from optical image classification methods, usually include noise-reduction preprocessing, feature extraction, feature classification, and other steps [12,13]. The key module of sonar image classification is feature extraction, which usually has to be robust to noise. Traditional feature extraction methods can be divided into local feature descriptors and model-based methods. Local feature descriptors, which require no prior knowledge, extract shallow visual features, such as the Haar feature [14], Haar-like and local binary pattern (LBP) features [15], scale-invariant feature transform (SIFT) features [16,17], and oriented FAST and rotated BRIEF (ORB) features [18]. Model-based methods, which rely on prior knowledge or are data-driven, have also been proposed for feature extraction; they require strong consistency and similarity between the testing and training datasets. Myers [19] combined information from both highlight and shadow regions with multi-view templates to improve classification accuracy. In [20], the Hausdorff distance from synthetic shadows to the real object shadow was combined with highlight and scale information to produce a membership function, and the objects were then classified using both mono-view and multi-view analysis with the help of Dempster–Shafer information theory.
The extracted features are used to train classifiers, such as the hidden Markov model [21], the k-nearest neighbor model [22], the support vector machine (SVM) [23], and others, to realize underwater target recognition. Çelebi [24] used Markov random fields to detect potential mines in SSS images after compensating for illumination variations. The effectiveness and generality of such trained classifiers are limited by the poor quality of noisy sonar images and the specificity of handcrafted feature templates. Moreover, when the recognition task or the corresponding environment changes, the feature templates need to be adjusted and the classification models may also need to be redesigned, which is time-consuming and inconvenient.
In recent years, with the tremendous increase in computational power, convolutional neural networks (CNNs), as a representative deep learning method, have been widely used in computer vision and natural language processing. Unlike handcrafted features, CNNs, inspired by the human visual system, can learn features at different levels of abstraction and are therefore well suited to image understanding, especially image recognition and classification [25,26]. ATC of SSS images using deep learning (DL) methods has become a new trend. Over the past few years, CNNs have proven more effective than traditional image processing methods for SSS image classification [3,8,9,27,28,29]. Luo [9] proposed a shallow CNN for seabed classification that outperformed deeper CNNs in both classification accuracy and speed. In [3], Ye applied the pretrained VGG11 and ResNet18 to classify underwater targets in SSS images and presented a pre-processing method for training samples that is valuable for transfer learning. Huo [8] demonstrated that semisynthetic data can substantially benefit fine-tuning and that a fine-tuned pretrained VGG19 outperforms models trained from scratch. Qin [27] introduced generative adversarial networks (GANs) to augment a small dataset and improve the accuracy of sediment classification. Gerg [28] proposed a structural-prior-driven regularized deep learning method that outperformed other methods for synthetic aperture sonar image classification. Zhang [29] applied automatic deep learning (AutoDL) to sonar image classification, and their model achieved an excellent accuracy of 99.0% after 2.9 h of training. However, the following problems must be overcome when applying DL methods to SSS image classification.
One problem is that, owing to the scarcity of SSS images, DL-based models cannot be fully trained, which causes overfitting, i.e., poor generalization ability. To tackle the challenge of limited data, data enhancement and transfer learning methods have been adopted to improve the generalization ability of DL-based models.
Data enhancement methods can be categorized by the type of data synthesis as follows.
Data transformation rules, such as flipping, rotating, cropping, distorting, scaling, and adding noise, are applied to existing images to enhance the data. Inoue [30] randomly selected two images from the training set, processed them with basic data enhancement operations, and then formed a new sample by averaging the two processed images pixel-wise, taking the label of one of the original samples as the new label.
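For illustration, the averaging scheme of [30] can be sketched in a few lines; the function below is our own minimal rendering, not code from [30], and assumes both inputs are equally sized arrays.

```python
import numpy as np

def sample_pairing(img_a: np.ndarray, img_b: np.ndarray, label_a):
    """Average two augmented training images pixel-wise and reuse the
    label of the first image, following the scheme described in [30]."""
    assert img_a.shape == img_b.shape, "inputs must be equally sized"
    mixed = (img_a.astype(np.float32) + img_b.astype(np.float32)) / 2.0
    return mixed.astype(img_a.dtype), label_a
```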
Multiple samples are used to generate similar pseudo samples. An input optical image can be preprocessed and combined with sonar image features to create semisynthetic training data that enhance the dataset [8,31]. Style transfer with a pre-trained CNN has also been adopted to generate pseudo SSS images that can be added to the training set, achieving an improvement similar to that of the former method [32]. By changing the upsampling method of style transfer [33], the noise ratio can be tuned manually, and the generated pseudo SSS images become more similar to real SSS images.
Randomly generated samples with a distribution consistent with the training dataset can be created by generative adversarial networks (GANs), which are trained to learn an image translation from low-complexity ray-traced images to real sonar images [27,34]. Sung et al. [35] introduced a GAN-based method to translate actual sonar images into simulator-like sonar images in order to generate a large number of template images.
Meanwhile, transfer learning [36] can also efficiently relieve the pressure of limited datasets. Pre-trained CNNs, e.g., neural networks pre-trained on the ImageNet dataset, are commonly used [37,38,39] and can somewhat improve model performance when training with a small dataset.
In style transfer, the final synthesized images carry the noise of sonar images and the target contour features of optical images; a model trained with such synthesized images can therefore better extract features from a noisy background while identifying contour features. Accordingly, instead of synthesizing data, we utilize the different features of multi-domain images to guide the training of a classification model on a limited SSS dataset.
Another problem is that the complex characteristics of SSS images—such as blurred edges, strong noise, and targets of various shapes—make it difficult to extract useful features. Traditional image preprocessing methods may lose detail information, while models pre-trained on large optical datasets cannot entirely match SSS image features. Therefore, it is also important to make the model focus on useful features and extract as many available features as possible.
Inspired by inter-domain transfer learning methods [40,41] and the neural network architecture proposed by Google [42], the main contributions of this paper are as follows:
An automatic side-scan sonar image classification method is proposed, which combines multi-domain collaborative transfer learning (MDCTL) with a multi-scale repeated attention mechanism (MSRAM). The proposed MDCTL method transfers the parameters of the low-level feature extraction layers learned from SAR images and of the high-level feature representation layers learned from optical images, respectively, offering a new approach to transfer learning.
By combining the channel attention mechanism (CAM) and the spatial attention mechanism (SAM) [43,44], the MSRAM helps the model extract and focus on target features, so that more key features can be used for classification, giving the model higher classification accuracy as well as greater stability.
The proposed MDCTL method has been tested on a new SSS dataset, which adds 115 side-scan sonar images to the SeabedObjects-KLSG dataset. The new SSS dataset is available at https://github.com/HHUCzCz/-SeabedObjects-KLSG--II (accessed on 16 November 2021). Feature response maps and class activation heatmaps are used to demonstrate the effect of the proposed MDCTL method with MSRAM.
The remainder of this paper is organized as follows. Section 2 details the proposed SSS classification method with MDCTL and MSRAM. Section 3 verifies the proposed method through experiments. In Section 4, the advantages and limitations of the method are discussed. Finally, conclusions are given in Section 5.
3. Results
This section describes the learning process and experimental results of the MDCTL model for SSS image classification. Experiments were carried out on a computer running the Windows 10 operating system with an RTX 2070s GPU and 16 GB of memory. In all experiments, we used VGG19 as the deep convolutional architecture; as our baseline, we conventionally fine-tuned a pre-trained network of the same architecture without using any source dataset. To significantly reduce training time, several deep learning models pre-trained on ImageNet were downloaded from MATLAB Central.
In this section, comparative experiments and analysis are conducted to demonstrate the effectiveness and robustness of the proposed method. The experimental results of the proposed method are compared with those of commonly used fine-tuning methods to verify the performance improvement brought by different transfer learning and training methods. In addition, feature visualization and class activation heat map visualization are used to reveal the effects of transfer learning from multimodal datasets in multiple source domains.
3.1. Experimental Setup
3.1.1. Dataset Used
We conduct experiments on the SSS image dataset SeabedObjects-KLSG-II, which adds 102 shipwreck images and 4 airplane-wreckage images to the SeabedObjects-KLSG dataset. The dataset contains three main types of images—wreck, airplane, and seabed background—currently comprising 487 shipwreck images, 66 airplane images, and 583 seabed images. Some selected samples from the dataset are shown in Figure 7, where it can be seen that each type of image has various appearances. Basic data enhancement methods, including horizontal flipping, rotation, random cropping, and other operations, are used.
The transfer learning approach proposed in this paper exploits two other related datasets. The SAR dataset is selected from the MSTAR dataset, which was introduced in the mid-1990s by the US Defense Advanced Research Projects Agency (DARPA); its SAR imagery of a wide range of former Soviet military vehicles was acquired with high-resolution spotlight-mode synthetic aperture radar. The target categories in the SAR images are not related to those in SSS images; however, the similarity of image statistical characteristics between the two datasets deserves attention. Therefore, we trained the feature extraction module of VGG19 on the SAR dataset and transferred it to the SSS image classification network to improve the extraction of low-level detailed features from strong noise. The optical dataset is made up of optical aerial images, including ship and sea-surface images from the MAritime SATellite Imagery dataset (MASATI) and airplane images from the Dataset of Object Detection in Aerial Images (UCAS-AOD). This conventional optical dataset is used to train the feature mapping module of the optical image classification network, which is then transferred to the SSS image classification network. The feature mapping module contains the FC layers close to the outputs, where the majority of the parameters are concentrated.
3.1.2. Experimental Details
For each class of the SeabedObjects-KLSG-II dataset, 70% and 30% of the images were randomly selected as training and test samples, respectively. The numbers of training and testing samples are shown in Table 1. To eliminate the possible influence of sample partitioning on classifier performance, a hold-out scheme was used to randomly create 10 datasets for testing the classifier. To minimize the impact of random parameter initialization, 10 test repetitions were conducted on each dataset and the average was taken as the classification result for that dataset. The average of the results over the 10 datasets was taken as the final result.
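This evaluation protocol can be summarized by the following sketch (Python/scikit-learn; `train_and_score` is a caller-supplied placeholder that trains one model and returns its test accuracy, and the stratified split reflects the per-class 70/30 selection described above):

```python
import numpy as np
from sklearn.model_selection import train_test_split

def holdout_protocol(images, labels, train_and_score, n_splits=10, n_repeats=10):
    """10 random stratified 70/30 splits; 10 training runs per split;
    report the mean and variance of the per-split means (Section 3.1.2)."""
    split_means = []
    for seed in range(n_splits):
        x_tr, x_te, y_tr, y_te = train_test_split(
            images, labels, test_size=0.3, stratify=labels, random_state=seed)
        runs = [train_and_score(x_tr, y_tr, x_te, y_te) for _ in range(n_repeats)]
        split_means.append(np.mean(runs))
    return float(np.mean(split_means)), float(np.var(split_means))
```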
Due to the small size of the SSS image dataset, several basic data enhancement methods, including flipping, rotating, cropping, and stitching, were used, as shown in Figure 8. In practice, the target in an image acquired by side-scan sonar may lie at the edge of the image or even be mutilated. The enhanced dataset obtained by cropping and stitching more closely resembles actual sonar images, giving the model stronger generalization capability.
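A typical way to realize these operations is shown below (a hedged PyTorch/torchvision sketch; the rotation angle and crop scale are assumed values, and the crop-and-stitch operation, which has no off-the-shelf transform, is only indicated by a placeholder comment):

```python
import torchvision.transforms as T

augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),               # flipping
    T.RandomRotation(degrees=15),                # rotating (angle assumed)
    T.RandomResizedCrop(224, scale=(0.6, 1.0)),  # cropping to network input size
    # stitching of cropped patches would require a custom transform here
    T.ToTensor(),
])
```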
Before parameter transfer, we need to train the source domain models with the SAR image and grayscale optical image datasets. A VGG19 pretrained on ImageNet was used to significantly save time. Some training hyper-parameters were set as follows: the batch size was 16, the number of epochs was 15, and the initial learning rate was 0.001, multiplied by a decay factor of 0.1 after 10 epochs. The constraint coefficients of the source domain models were set according to Formula (4).
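Although no reference implementation accompanies the paper, the parameter transfer itself can be pictured with the following PyTorch sketch. It assumes that both source models are VGG19 instances, that the first two convolutional blocks (features[0..9], cf. Table 5) are transferred from the SAR model, that the two hidden FC layers are transferred from the optical model, and that the final FC layer is newly added for the three SSS classes; all names are illustrative.

```python
import torch.nn as nn
import torchvision.models as models

def build_mdctl_vgg19(sar_model, optical_model, num_classes=3):
    """Assemble the target network for MDCTL: low-level conv layers from
    the SAR-trained VGG19, FC layers from the optical-trained VGG19."""
    target = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
    # The first two convolutional blocks of VGG19 occupy features[0..9].
    for i in range(10):
        if hasattr(sar_model.features[i], "weight"):
            target.features[i].load_state_dict(sar_model.features[i].state_dict())
    # High-level feature mapping: the two hidden FC layers (classifier[0], [3])
    # come from the optical source model.
    for j in (0, 3):
        target.classifier[j].load_state_dict(optical_model.classifier[j].state_dict())
    # Newly added final FC layer for the three SSS classes (Section 3.1.2).
    target.classifier[6] = nn.Linear(4096, num_classes)
    return target
```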
In the training process, the training samples were first fed into the network to generate feature maps via the feature extraction modules of the backbone. The feature maps were then enhanced and fused by the RAM for a better representation, and mapped from feature space to label space by the FC layers. Afterwards, the loss was calculated between the predicted label vector and the true label to evaluate how well the model's parameters predicted the target category. Finally, the model parameters were updated using the stochastic gradient descent (SGD) algorithm.
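The corresponding training step, in hedged PyTorch form (the momentum value is an assumption, the learning rate follows the hyper-parameters given below, and `model` stands for the assembled backbone–MSRAM–FC network described in Section 2):

```python
import torch
import torch.nn as nn

def make_train_step(model, lr=1e-4, momentum=0.9):
    """Build one SGD training step for the assembled classification model."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)

    def train_step(images, labels):
        optimizer.zero_grad()
        logits = model(images)            # feature extraction, MSRAM fusion, FC mapping
        loss = criterion(logits, labels)  # compare prediction with the true label
        loss.backward()
        optimizer.step()                  # SGD parameter update
        return loss.item()

    return train_step
```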
Some training hyper-parameters were set as follows: the initial learning rate was 0.0001, the batch size was 16, the number of epochs was 20, and the dropout probability was 0.5. The learning-rate factors for the weights and biases of the newly added final fully connected layer were both set to 20 to accelerate its learning. For the method using a bag-of-features (BOF) model on SIFT features with an SVM classifier, the BOF vocabulary size was set to 300.
3.2. Network Model Evaluation Indicators
The criteria for assessing model performance are the average overall accuracy (OA), the variance of the OA, and the precision of each class.
The overall accuracy, which is the percentage of test samples that are correctly classified, represents the overall classification performance; the variance of the overall accuracy demonstrates the stability of the model over multiple tasks; and the analysis of per-class precision is necessary because of the class imbalance in the dataset. In addition, we also judge from the convergence curve of the model whether over-fitting occurs.
The OA is computed as

$$\mathrm{OA} = \frac{\sum_{i=1}^{t} N_{ii}}{N},$$

where $N_{ii}$ is the number of test samples that belong to class $i$ and were classified as class $i$ in the actual classification result, $t$ is the number of label categories in the test samples, and $N$ is the total number of test samples.
Taking the airplane target as an example, TP (true positive) indicates that the model predicts an airplane and the prediction is correct, while FP (false positive) indicates that the model predicts an airplane but the prediction is wrong. In short, precision,

$$\mathrm{Precision} = \frac{TP}{TP + FP},$$

is the proportion of correct predictions among all samples predicted as the given class.
The variance, obtained by comparing the results of multiple experiments with their mean value, i.e., $\sigma^2 = \frac{1}{n}\sum_{k=1}^{n}(\mathrm{OA}_k - \overline{\mathrm{OA}})^2$ over the $n$ runs, is used to measure the stability and robustness of the algorithm.
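These indicators are straightforward to compute from a confusion matrix; the sketch below (our own, with rows as true classes and columns as predictions) mirrors the definitions above:

```python
import numpy as np

def classification_metrics(conf):
    """OA and per-class precision from a confusion matrix
    (rows: true classes, columns: predicted classes)."""
    conf = np.asarray(conf, dtype=float)
    oa = np.trace(conf) / conf.sum()              # sum_i N_ii / N
    precision = np.diag(conf) / conf.sum(axis=0)  # TP / (TP + FP) per class
    return oa, precision

# Stability over repeated runs: np.var([oa_1, oa_2, ..., oa_n])
```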
3.3. Performance Analysis
The model is constructed using the training and validation sets produced by the 10-split hold-out strategy described in Section 3.1.2, and the final performance is measured according to each performance index on the test dataset.
As shown in Table 2, we compared state-of-the-art (SOTA) methods [3,8,9,27,28,29] with our method and listed their details and performance. The CNN models based on LeNet-5 [9] and GoogLeNet [27] are very lightweight and easy to train, but their performance is unsatisfactory for practical use. Various data enhancement methods are used in these SOTA methods, including semisynthetic data generation [3,8], despeckling [28], and extraction of derived classification datasets [29], which greatly improve classification accuracy but increase time consumption. For example, the effective FL-DARTS algorithm [29], which also uses radar and sonar datasets together, achieves classification performance close to that of our method, but the excessive complexity and training time of its automatic learning hinder its wider use in underwater tasks. Compared with these existing methods, our proposed transfer learning method offers a significant performance improvement and competitive classification speed with acceptable complexity and training time.
Table 3 shows quantitative results comparing different backbone networks on the SeabedObjects-KLSG test set for the target classification task. By comparison, we found that the VGG19 network exhibited good generalization performance after fine-tuning: the fine-tuned VGG19 achieved the highest overall accuracy and the highest precision for ship and seafloor classification. Although VGG16 achieved notably better airplane precision, it had the worst seafloor precision, implying a high false alarm rate. Compared with VGG16, VGG19 has three more convolutional layers, making it more suitable for combination with the proposed MSRAM, which works better with a deeper model structure. With a deeper network such as VGG19, the proposed MSRAM can combine more multi-scale features to improve the feature representation ability.
Ablation experiments on different transfer learning methods were conducted to verify the performance improvement and stability of transfer learning for the SSS image classification task; each method was run 10 times, and the average and variance of the overall accuracy were calculated. As can be seen from the results in Table 4 below, the model achieved a good improvement after transferring parameters from the SAR dataset alone, indicating that the similarity of low-level features between SAR and SSS images allows the model to learn feature extraction in advance. To confirm this, we used feature response visualization to observe the performance improvement owing to transfer learning from the SAR dataset. Transfer learning from the optical dataset likewise improved the overall accuracy but caused significant instability, as seen from the highest variance. Although transfer learning from both the SAR and optical datasets enabled further performance improvements, the model was still unstable compared with the baseline. MSRAM is therefore introduced to stabilize the feature extraction and mapping capabilities learned through multi-domain transfer learning. The method of MDCTL with MSRAM finally achieved the best classification accuracy and the lowest variance, eliminating performance fluctuations while maintaining optimum performance.
However, we found that the model had a poor ability to recognize and classify airplanes, which results from the class imbalance in the SSS image dataset. To investigate the effect of the proposed method on different target classes, we examined the precision of each class.
As can be seen from the boxplots in Figure 9 below, when training VGG19 directly from scratch, the classification of the airplane category is poor and unstable, with the best result not even reaching 65%, although the OA exceeds 90%. Direct fine-tuning and transfer learning improve the classification accuracy, but the degree of fluctuation worsens, indicating that model performance is still not stable enough. This may be because there are few airplane images: the scarce training set cannot meet the learning needs of the model, the model does not fully learn the target's detailed information, and when the posture of the airplane changes, the model fails to capture the key information. The proposed method combining MDCTL with MSRAM not only improves the accuracy in every category as well as overall, but also makes the classification model more stable.
3.4. Visualization
3.4.1. Feature Response Map Visualization
Given that edge features and detailed information of the target are better extracted by the convolutional layers close to the input, we visualized the first-convolutional-layer responses of four models: a VGG19 trained from scratch, a VGG19 pre-trained on ImageNet, a VGG19 transferred from the SAR classification model, and a model with MSRAM added after transfer learning. The details of the visualization method are shown in Figure 10.
Figure 11b–d shows how these methods progressively distinguish the highlighted and shadowed areas of the image from the background noise; Figure 11d further improves the extraction of detailed features from the wreck target compared with Figure 11c. As shown in Figure 11e, with the addition of MSRAM the detailed contours of the target highlight area and the shadow contours become clearer, while the seafloor highlight areas unrelated to the wreck are suppressed; that is, MSRAM makes the edge contour details of the target and the edge features of the shadows stand out distinctly.
3.4.2. Heat Maps Based on Grad-CAM
The VGG19 network can be considered a feature extraction module combined with a feature mapping module. As the feature extraction module can be transferred, we also applied transfer learning to the feature mapping module, that is, the block of fully connected layers. In the VGG19 model, the fully connected layers map the learned distributed feature representation to the sample label space and are therefore very sensitive to the structural information of the image, such as outlines; we thus transferred this part after training it on the optical image dataset, which has the same categories and target semantic information. We use Grad-CAM (gradient-weighted class activation mapping) to visualize the image areas that the fully connected layers focus on for different classes, i.e., the influence of the image's structural information on the classification results.
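For reference, the essential Grad-CAM computation can be sketched as follows (a minimal PyTorch rendering of the published algorithm, not the authors' code; `target_layer` would be the last convolutional layer of the VGG19 feature block):

```python
import torch.nn.functional as F

def grad_cam(model, image, class_idx, target_layer):
    """Weight the target layer's feature maps by the spatially averaged
    gradients of the class score, sum over channels, then apply ReLU."""
    store = {}
    h1 = target_layer.register_forward_hook(
        lambda m, i, o: store.update(feat=o))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gi, go: store.update(grad=go[0]))
    score = model(image.unsqueeze(0))[0, class_idx]
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()
    weights = store["grad"].mean(dim=(2, 3), keepdim=True)  # global average pooling
    cam = F.relu((weights * store["feat"]).sum(dim=1))      # weighted combination
    return cam / (cam.max() + 1e-8)                         # normalize to [0, 1]
```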
From the class activation heatmaps in Figure 12, it can be seen that aerial and sonar images of the same category have similar contour features, and the feature details that the model attends to during classification are similar. For example, in the airplane group, the features that positively influence the classification are concentrated on the edges of the wings on both sides of the airplane and on the junction between the wings and the fuselage. This demonstrates that, although the image modalities differ, images of the same object category carry the same contour information, which provides a certain gain when training the classification decision module.
The similarity of the airplane class activation heatmaps between the SSS and optical images shows that the feature mapping module focuses on consistent airplane features. Traditional classification networks can achieve target recognition and classification by focusing only on a few key features of the category, such as the wings or tail of the airplane. However, on small SSS datasets the model cannot learn all the key features of each category from sufficient samples, so adequate learning from each sample is necessary. The MDCTL proposed in this paper exploits the feature mapping module of the optical image classification model to learn as much of the key information required for recognition and classification as possible. Moreover, the MSRAM passes high-level spatial contour information to the front-end channel attention mechanism at different levels, enabling rich features to be acquired at key locations during the feature extraction stage. To verify the effectiveness of the proposed method, we examined the class activation heatmaps of the different methods, which are given in Figure 13.
Figure 13 illustrates that the classification results and accuracy for the selected airplane samples are greatly improved after using the MDCTL method, with attention also drawn to key locations. As shown in Figure 13d, after adding MSRAM, the model pays more attention to the comprehensive and holistic feature information of the target, which is exactly what we need.
3.5. Details in MDCTL
Due to the small size of the SSS image dataset in the target domain, the choice of the size of the multi-domain source datasets is critical. If the source domain dataset is too large, the model becomes overly adapted to the source task, making it harder to transfer to sonar image target classification; if it is too small, the transferred model does not achieve the desired results. To address this issue, we selected SAR and optical datasets of five sizes relative to the SSS dataset (0.5×, 1×, 1.5×, 3×, and 5×) for transfer learning; the experimental results are shown in Figure 14.
From Figure 14, it can be seen that the model performance curve peaks when the two datasets are about the same size and begins to degrade when the source domain dataset becomes larger than the target domain dataset. The reason is that when the source domain dataset outgrows the target domain dataset, the parameters of the trained model drift toward those suited to the source domain classification task. Therefore, to obtain the optimal result, only a subset of the source domain dataset equal in size to the target domain dataset is selected.
Table 5 shows the differences between the transfer learning variants and the number of convolutional layers used for parameter transfer from the SAR dataset. The highest accuracy is obtained by transferring the first two convolutional blocks and retraining their parameters. The first two convolutional blocks can be regarded as a feature extractor that, after pre-training on the SAR dataset, extracts more accurate and richer edge features from noisy and complex images. It is necessary to retrain the transferred module so that it adapts to the target task.
3.6. Applications for Detection
The proposed method aims to speed up the search for underwater targets with automated classification algorithms, and it can also be combined with a region proposal network (RPN) to detect objects in SSS images. To verify the effectiveness of the proposed method, we applied it to mine detection and compared its detection performance with several recent SOTA algorithms for underwater target detection [1,2]; the comparative results are shown in Table 6.
Our method was combined with an RPN, which is used in the detection head to generate proposal regions; the multi-scale structure of MSRAM matches the RPN well. The resulting detection method, transfer learning with MSRAM, can accurately locate target positions, as shown in Figure 15. Transfer learning from the SAR dataset with MSRAM outperformed the other SOTA methods in terms of mAP@0.5 and average IoU, although its computational complexity and detection speed still need improvement. Currently, only a small mine-class dataset—152 mine objects in total—is available; in the future, we will conduct more experiments and collect more samples to verify and improve our method.
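As a pointer for reproduction, a classification backbone can be attached to an RPN-based detector through torchvision's Faster R-CNN wrapper; the sketch below uses a plain ImageNet VGG19 feature extractor as a stand-in for the MSRAM-enhanced backbone, and the anchor sizes are assumed values rather than the settings used in our experiments:

```python
import torchvision
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.rpn import AnchorGenerator
from torchvision.ops import MultiScaleRoIAlign

backbone = torchvision.models.vgg19(weights="IMAGENET1K_V1").features
backbone.out_channels = 512   # channel count of VGG19's last conv block

anchors = AnchorGenerator(sizes=((32, 64, 128, 256),),
                          aspect_ratios=((0.5, 1.0, 2.0),))
roi_pooler = MultiScaleRoIAlign(featmap_names=["0"],
                                output_size=7, sampling_ratio=2)
detector = FasterRCNN(backbone, num_classes=2,  # mine + background
                      rpn_anchor_generator=anchors,
                      box_roi_pool=roi_pooler)
```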