1. Introduction
In remote sensing, Synthetic Aperture Radar (SAR), an active microwave imaging radar that can observe the Earth's surface day and night [1,2], plays a significant role in marine traffic monitoring. In recent years, many countries have developed their own spaceborne SAR systems, such as Germany's TerraSAR-X, China's Gaofen-3, and Canada's RADARSAT-2. Such efforts make object detection in SAR images an increasingly attractive topic.
Deep learning-based object detection for natural images has witnessed a growing number of publications [3,4,5,6,7,8,9,10], many of which divide detectors into one-stage and two-stage families. One-stage algorithms treat object detection as a regression problem and obtain bounding box coordinates and class probabilities directly from image pixels; typical examples are the You Only Look Once (YOLO) series [11,12,13,14], the Single Shot MultiBox Detector (SSD) [6], and RetinaNet [9]. Two-stage algorithms first generate region proposals as potential bounding boxes and construct a classifier to classify them; the bounding boxes are then refined through post-processing, and finally duplicate detections are eliminated. Typical two-stage algorithms are Fast R-CNN [7], Faster R-CNN [8], and Mask R-CNN [10]. In general, two-stage algorithms are more accurate than one-stage ones, but one-stage methods are faster and simpler to train.
Inspired by deep learning's great power in object detection, researchers have introduced deep learning into remote sensing image processing [15,16,17,18], of which SAR image processing is one of the most important fields. Ship detection with multi-scale features [19,20,21,22] has attracted more and more attention in recent years. Liu et al. [23] constructed a ship proposal generator to solve the multi-scale problem of ships in SAR images, achieving the highest recall and proposal quality. The serious missed-detection problem of small-scale ships in SAR images severely degrades detection performance; Kang et al. [22] addressed it by constructing a context-based convolutional neural network with multi-layer fusion, in which a high-resolution region proposal network (RPN) generates high-quality region proposals and a detection network with contextual features obtains useful contextual information. Fu et al. [24] semantically balanced features across different levels by proposing an attention-guided balanced pyramid, which can focus efficiently on small ships in complex scenes. Cui et al. [25] adopted an attention mechanism to focus on multi-scale ships, proposing a dense attention pyramid network (DAPN) whose convolutional block attention module uses channel and spatial attention to extract resolution and semantic information and highlight salient features. In addition, the studies in [26,27,28,29,30] proposed different methods for detecting multi-scale ships in SAR images and achieved satisfying detection results. Although the ship detection algorithms mentioned above significantly improved detection performance, their multi-scale feature fusion simply fuses the feature maps directly. In this way, the fused feature layers constrain each other, which is not appropriate for ships of different sizes. Relaxing this constraint of direct feature fusion is beneficial for improving the detection performance of multi-scale objects.
This paper describes the design and implementation of a multi-scale ship detection network that achieves excellent detection performance in SAR images. We first construct a CSPMRes2 (Cross Stage Partial network with Modified Res2Net) module for better feature extraction of ships. CSPMRes2 not only extracts multi-scale features but also models inter-channel relationships and captures long-range dependencies with precise positional information of the feature map. In addition, to overcome the shortcoming of fusing features directly, the fusion proportion of the feature maps is taken into account. We then construct a feature pyramid network architecture for multi-scale ship detection, namely FC-FPN (Feature Pyramid Network with Fusion Coefficients). The fusion coefficients in FC-FPN are set for each feature map participating in fusion and are learned during the training of the ship detection network. After fusing feature maps, we pass the output through a CSPMRes2 module to equip FC-FPN with powerful feature extraction capability. We also take the model size of the ship detection network into account and adopt YOLOv5 with a small model size (denoted YOLOv5s) as the detection framework. Finally, we construct MSSDNet by applying the CSPMRes2 module and the FC-FPN module to YOLOv5s. Benefiting from this design, the experimental results on the SSDD [20] and SARShip [31] datasets show that MSSDNet achieves a significant improvement in detection performance with a smaller model size and faster inference time. The contributions of this work are summarized below.
1. We construct MSSDNet with a small model size while achieving better speed and accuracy compared with the YOLOv5s baseline and other methods.
2. A CSPMRes2 module is proposed to extract multi-scale discriminative features; it not only possesses feature extraction capability in the 'scale' dimension but can also capture inter-channel relationships and obtain salient information with precise spatial location information.
3. We construct an FC-FPN module in which a learnable fusion coefficient is set for each feature map participating in fusion so that feature maps are fused adaptively, and we conduct fusion coefficient experiments to explore how the coefficients affect ship detection.
The rest of this paper is arranged as follows: Section 2 describes the proposed network. Section 3 analyzes the experimental results and compares the proposed network with other algorithms. Section 4 discusses some phenomena observed in the experimental results. Finally, Section 5 concludes the paper.
2. The Proposed Method
The sample matching method in the working pipeline of MSSDNet is based on the shape similarity between anchor boxes and the ground truth, which differs from the common approaches based on the Intersection over Union (IoU) between the ground truth and anchor boxes. The sample matching method is shown in Figure 1, where $w_g$ and $h_g$ represent the width and height of the ground truth, respectively, and $w_i$ and $h_i$ represent the widths and heights of the three anchor boxes, respectively. A SAR image is first resized to a fixed spatial resolution and then divided into $S \times S$ grid cells. Each grid cell sets anchor boxes with different aspect ratios. If the width and height of an object match an anchor box within an allowed range, that anchor box becomes responsible for detecting the object, while all other anchor boxes are treated as background. One object is allowed to match multiple anchor boxes. After sample matching, the bounding box is obtained by predicting the offsets between anchors and objects. The prediction of a bounding box has six components: class, x, y, w, h, and confidence. The class indicates which category the object belongs to; the x, y coordinates represent the center offset of the bounding box relative to its grid cell; w and h are the width and height of the bounding box; and x, y, w, h are normalized to [0, 1] according to the image size. The confidence score represents the probability that a bounding box contains an object; if there is no object in the bounding box, the confidence score should be zero. Furthermore, the IoU between the ground truth and the predicted box indicates how close the predicted box is to the ground truth: the closer they are, the more likely the predicted box contains an object. Thus, we set the confidence score of the predicted box equal to the IoU between the ground truth and the predicted box.
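The shape-based matching rule lends itself to a short sketch. Below is a minimal PyTorch illustration of the matching logic described above; the function name and the ratio threshold (4.0, YOLOv5's default `anchor_t`) are our own assumptions, not the paper's exact implementation.

```python
import torch

def match_anchors_by_shape(gt_wh, anchor_wh, ratio_thresh=4.0):
    """Shape-based sample matching (a sketch of the rule described above).

    gt_wh:     (G, 2) widths/heights of ground-truth boxes, i.e. (w_g, h_g)
    anchor_wh: (A, 2) widths/heights of the anchor boxes,   i.e. (w_i, h_i)
    Returns a (G, A) boolean mask; True means the anchor is responsible for
    that ground truth. One object may match several anchors.
    """
    # Width/height ratios between each ground truth and each anchor.
    r = gt_wh[:, None, :] / anchor_wh[None, :, :]    # (G, A, 2)
    # Shapes "match within an allowed range" when neither the ratio nor
    # its inverse exceeds the threshold.
    worst = torch.max(r, 1.0 / r).max(dim=2).values  # (G, A)
    return worst < ratio_thresh

# Example: one 30x20 ship against three anchor shapes.
gt = torch.tensor([[30.0, 20.0]])
anchors = torch.tensor([[10.0, 13.0], [33.0, 23.0], [156.0, 198.0]])
print(match_anchors_by_shape(gt, anchors))  # tensor([[ True,  True, False]])
```

Because the rule compares shapes rather than IoU, an object can be assigned to anchors in neighboring grid cells as well, which increases the number of positive samples per ground truth.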
An overview of the proposed MSSDNet is illustrated in Figure 2. Compared with the original YOLOv5s, we reconstruct the backbone by introducing our CSPMRes2 module and replace the FPN of YOLOv5s with the FC-FPN module. The CSPMRes2 module is responsible for better feature extraction, and FC-FPN fuses the feature maps adaptively. In the testing phase, we use the COCO metrics as the evaluation standard. We describe the key modules of MSSDNet in detail below.
2.1. CSPMRes2 Module
To increase the receptive field range of the feature maps, several MRes2 submodules are introduced into the CSPMRes2 module for feature extraction along the scale dimension, as shown in Figure 3. In the figure, the red block represents the MRes2 module, the pink block represents the convolution module, and the other colored blocks represent different feature maps. The input of the CSPMRes2 module is split into two branches along the channel dimension, $x = [x_0', x_0'']$. The former part, $x_0'$, goes through the MRes2 submodules, while the latter, $x_0''$, is linked directly to the end of the CSPMRes2 module. The outputs of the MRes2 submodules undergo a general convolution module to generate an output $x_T$; then $x_0''$ and $x_T$ are concatenated and passed through a final convolution module as the output of the CSPMRes2 module. The equations of forward propagation and weight updating of the CSPMRes2 module are shown in Equations (1) and (2), respectively, where $g$ denotes gradient information, $w$ denotes weights, and $\eta$ is the learning rate. We can see that the gradients of the MRes2 submodules are integrated separately, and the gradient of the bypassed $x_0''$ is also integrated separately. The CSPMRes2 module therefore not only possesses the characteristics of feature reuse but also reduces the amount of duplicate gradient information [32].
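The cross-stage-partial wiring described above can be sketched in a few lines. The following is a minimal PyTorch illustration, assuming a 50/50 channel split realized by two 1 × 1 convolutions (a common CSP variant); the class name and channel counts are hypothetical, and the MRes2 branch is abstracted as a generic `blocks` argument.

```python
import torch
import torch.nn as nn

class CSPBlockSketch(nn.Module):
    """Cross-stage-partial wrapper (a sketch of the CSPMRes2 wiring).

    The input is split along channels into x0' and x0''. Only x0' passes
    through the (MRes2) submodules; x0'' bypasses them and is concatenated
    back at the end, so its gradient path stays separate.
    """
    def __init__(self, channels, blocks):
        super().__init__()
        half = channels // 2
        self.split_conv1 = nn.Conv2d(channels, half, 1)   # produces x0'
        self.split_conv2 = nn.Conv2d(channels, half, 1)   # produces x0''
        self.blocks = blocks                               # MRes2 submodules
        self.transition = nn.Conv2d(half, half, 1)         # conv giving x_T
        self.out_conv = nn.Conv2d(2 * half, channels, 1)   # conv after concat

    def forward(self, x):
        x0_a = self.split_conv1(x)   # x0': goes through the MRes2 branch
        x0_b = self.split_conv2(x)   # x0'': bypass branch
        xt = self.transition(self.blocks(x0_a))
        return self.out_conv(torch.cat([xt, x0_b], dim=1))

# Usage with a stand-in branch (a real MRes2 stack would go here):
csp = CSPBlockSketch(64, nn.Sequential(nn.Conv2d(32, 32, 3, padding=1), nn.SiLU()))
y = csp(torch.randn(1, 64, 40, 40))  # -> torch.Size([1, 64, 40, 40])
```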
In the MRes2 module, as shown in Figure 4, the input first goes through $n$ 1 × 1 convolutions to change the channels of the feature maps, after which we obtain $n$ feature subsets, denoted as $x_i$, where $i \in \{1, 2, \ldots, n\}$. Each $x_i$ has the same number of channels and the same spatial resolution, where the number of channels is $1/n$ of the input channels. Except for the feature subset $x_1$, each $x_i$ goes through a convolution with kernel size 3 × 3, denoted as $K_i(\cdot)$. Moreover, except for the feature subsets $x_1$ and $x_2$, each $x_i$ undergoes a coordinate attention module (CAM) [33], denoted as $C_i(\cdot)$. We denote the output of $K_i$ by $y_i$ and the output of $C_i$ by $z_i$. Thus, $y_i$ and $z_i$ can be written as:

$$y_i = \begin{cases} x_i, & i = 1 \\ K_i(x_i), & i = 2 \\ K_i(z_i), & 2 < i \le n \end{cases} \quad (4)$$

$$z_i = C_i(x_i + y_{i-1}), \quad 2 < i \le n \quad (5)$$

Combining Equations (4) and (5), we get:

$$y_i = \begin{cases} x_i, & i = 1 \\ K_i(x_i), & i = 2 \\ K_i(C_i(x_i + y_{i-1})), & 2 < i \le n \end{cases} \quad (6)$$

To fuse information optimally at the dimension of 'scale', we concatenate all the $y_i$, denoted by $Y = [y_1, y_2, \ldots, y_n]$, and pass $Y$ through a 1 × 1 convolution. Finally, in order to capture inter-channel relationships and obtain salient information with precise spatial location information, the feature map output by this 1 × 1 convolution is passed through a CAM as the final output of the MRes2 module.
This separation-and-combination strategy makes the convolutions process features efficiently. The CSPMRes2 module not only has multi-scale feature extraction capability [34] but also reduces a large amount of duplicate gradient information.
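Equations (4)–(6) translate into code almost directly. The sketch below is a minimal PyTorch rendering under the reconstruction above, assuming n = 4 subsets; `nn.Identity` stands in for the coordinate attention module of [33], whose internals are not reproduced here, and a single pre-convolution plus `chunk` replaces the n separate 1 × 1 convolutions.

```python
import torch
import torch.nn as nn

class MRes2Sketch(nn.Module):
    """Hierarchical multi-scale block following Equations (4)-(6).

    The input is split into n subsets x_1..x_n: x_1 passes through
    unchanged, x_2 gets a 3x3 conv (K_2), and each later subset is added
    to the previous output, attended, and convolved: K_i(C_i(x_i + y_{i-1})).
    """
    def __init__(self, channels, n=4):
        super().__init__()
        assert channels % n == 0
        self.n, w = n, channels // n
        self.pre = nn.Conv2d(channels, channels, 1)  # the n 1x1 convs, fused
        self.convs = nn.ModuleList(
            nn.Conv2d(w, w, 3, padding=1) for _ in range(n - 1))       # K_2..K_n
        self.cams = nn.ModuleList(nn.Identity() for _ in range(n - 2))  # C_3..C_n
        self.post = nn.Conv2d(channels, channels, 1)  # 1x1 conv on concat(Y)
        self.final_cam = nn.Identity()                # final CAM placeholder

    def forward(self, x):
        xs = torch.chunk(self.pre(x), self.n, dim=1)  # x_1..x_n
        ys = [xs[0], self.convs[0](xs[1])]            # i = 1 and i = 2
        for i in range(2, self.n):                    # 2 < i <= n
            z = self.cams[i - 2](xs[i] + ys[-1])      # z_i = C_i(x_i + y_{i-1})
            ys.append(self.convs[i - 1](z))           # y_i = K_i(z_i)
        y = self.post(torch.cat(ys, dim=1))           # fuse at 'scale' dim
        return self.final_cam(y)

out = MRes2Sketch(64)(torch.randn(1, 64, 40, 40))     # torch.Size([1, 64, 40, 40])
```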
2.2. FC-FPN Module
The architecture of the proposed FC-FPN is shown in Figure 5. A learnable fusion coefficient is set for each feature map participating in fusion, allowing adaptive feature fusion between different feature maps. For better extraction of multi-scale features, the output of the adaptive feature fusion goes through a CSPMRes2 module.
Assume that $F_1$ and $F_2$ represent the feature maps participating in feature fusion, and that $\alpha$ and $\beta$ are the fusion coefficients of $F_1$ and $F_2$, respectively. The output of feature fusion is then given by Equation (7):

$$F_{out} = \alpha \cdot F_1 + \beta \cdot F_2 \quad (7)$$

The coefficients $\alpha$ and $\beta$ respectively adjust the contributions of $F_1$ and $F_2$; making them learnable allows the network to find the best fusion result. Furthermore, $\alpha$ and $\beta$ are limited to a fixed range to ensure the stability of network training. In this paper, this optimal learning range is obtained by conducting fusion coefficient experiments on the SSDD dataset.
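A minimal sketch of the learnable fusion coefficients follows, assuming the element-wise weighted sum reconstructed in Equation (7); the clamping range (0 to 2 here) is an illustrative placeholder for the fixed range determined experimentally, not the value found on SSDD.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Adaptive fusion of two same-shaped feature maps (Equation (7) sketch)."""
    def __init__(self, lo=0.0, hi=2.0):
        super().__init__()
        # One learnable scalar per input feature map, trained with the network.
        self.alpha = nn.Parameter(torch.ones(1))
        self.beta = nn.Parameter(torch.ones(1))
        self.lo, self.hi = lo, hi  # fixed range keeps training stable

    def forward(self, f1, f2):
        a = self.alpha.clamp(self.lo, self.hi)
        b = self.beta.clamp(self.lo, self.hi)
        return a * f1 + b * f2  # a CSPMRes2 module follows this in FC-FPN

fuse = WeightedFusion()
out = fuse(torch.randn(1, 128, 40, 40), torch.randn(1, 128, 40, 40))
```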
2.3. Architecture of MSSDNet
The detailed architecture of MSSDNet, which applies the CSPMRes2 module and the FC-FPN module to YOLOv5s, is illustrated in Figure 6, from which the numbers and locations of the CSPMRes2 and FC-FPN modules can be seen. In MSSDNet, the outputs of the backbone's CSPMRes2 modules are the inputs of FC-FPN, and three feature maps with different scales are adopted to detect multi-scale ships in SAR images. To improve the feature representation capability of FC-FPN, a CSPMRes2 module is used after each feature fusion operation.
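To make the composition concrete, the sketch below wires one top-down FC-FPN step out of the classes sketched earlier (`WeightedFusion` and `CSPBlockSketch`); the nearest-neighbor upsampling and the channel width are assumptions for illustration, not the exact MSSDNet configuration.

```python
import torch
import torch.nn as nn

class FCFPNLevelSketch(nn.Module):
    """One top-down FC-FPN step: upsample, weighted-fuse, then refine."""
    def __init__(self, channels):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.fuse = WeightedFusion()            # learnable fusion coefficients
        # The 3x3 conv stands in for an MRes2 stack inside the CSP wrapper.
        self.refine = CSPBlockSketch(
            channels, nn.Conv2d(channels // 2, channels // 2, 3, padding=1))

    def forward(self, top, lateral):
        return self.refine(self.fuse(self.up(top), lateral))

# e.g. fusing a 20x20 top map into a 40x40 lateral map, both with 128 channels:
step = FCFPNLevelSketch(128)
out = step(torch.randn(1, 128, 20, 20), torch.randn(1, 128, 40, 40))
```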
5. Conclusions
In this paper, MSSDNet is proposed to detect ships of different sizes in SAR images. The CSPMRes2 module and the FC-FPN module are the vital components of MSSDNet: the CSPMRes2 module is responsible for improving the feature extraction capability of the network, and the FC-FPN module balances the detection of ships with multi-scale features in SAR images. The ablation study in this paper confirms the effectiveness of the two modules; MSSDNet, built on the CSPMRes2 and FC-FPN modules, improves the precision of multi-scale ship detection and generates more precise predicted boxes. According to the experimental results on the SSDD and SARShip datasets, MSSDNet achieves higher overall detection performance than other methods. Because the CSPMRes2 module adds only a few parameters, MSSDNet remains lightweight. Benefiting from this small number of network parameters, both the model size and the inference time of MSSDNet are lower than those of other methods, which is of great importance for the aviation, aerospace, and military fields.