1. Introduction
Synthetic aperture radar (SAR) has played a significant role in surveillance and battlefield reconnaissance, thanks to its all-day, all-weather, and high-resolution imaging capability. In recent years, SAR automatic target recognition (ATR) of ground military vehicles has received intensive attention in the radar ATR community. However, SAR images usually have low resolution and contain only the amplitude information of scattering centers. Thus, it is challenging to identify the targets in SAR images.
The MIT Lincoln Laboratory proposed the standard SAR ATR architecture, which consists of three stages: detection, discrimination, and classification [1]. In the detection stage, simple decision rules are used to find the bright pixels in SAR images and indicate the presence of targets. The output of this stage might include not only targets of interest but also clutter, because the detection stage is far from perfect. In the subsequent discrimination stage, a discriminator is designed to solve a two-class (target and clutter) classification problem, so that the probability of false alarm can be significantly reduced [2]. In the final classification stage, a classifier is designed to categorize each output image of the discrimination stage as a specific target type.
In the classification stage, there are three mainstream methods: template matching methods, model-based methods, and machine learning methods. For the template matching methods [3,4], the template database is generated from training samples according to some matching rules, and the best match is then found by comparing each test sample to the template database. The common matching rules include the minimum mean square error, the minimum Euclidean distance, and the maximum correlation coefficient. In these template matching methods, the initial SAR images or sub-images cut from them serve as templates. However, SAR images are sensitive to azimuth angle, depression angle, and target structure. When there is a large difference between the training and test samples, the recognition performance decreases severely. Additionally, such methods suffer from severe overfitting [5]. Model-based methods were proposed to solve the above problem [6,7]. In the model-based methods, SAR images are predicted by a computer-aided design model, and the modeling procedure is usually complicated.
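As an illustration of the maximum-correlation-coefficient matching rule mentioned above, the following minimal sketch compares a test image with one template per class and returns the best match. The flattened four-pixel "images" and the class labels `tank` and `truck` are hypothetical toy values, not data from the cited methods.

```python
import math

def correlation_coefficient(a, b):
    """Pearson correlation between two equally sized images given as flat lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    denom = math.sqrt(var_a * var_b)
    return cov / denom if denom > 0 else 0.0

def classify_by_template(test_image, templates):
    """Return the label whose template correlates most strongly with the test image."""
    return max(templates,
               key=lambda label: correlation_coefficient(test_image, templates[label]))

# Hypothetical toy templates (one flattened 2 x 2 image per class).
templates = {"tank": [0.0, 1.0, 1.0, 0.0], "truck": [1.0, 0.0, 0.0, 1.0]}
predicted = classify_by_template([0.1, 0.9, 0.8, 0.2], templates)
print(predicted)  # tank
```

Because the decision depends on pixel-wise agreement with stored templates, a change in azimuth or depression angle that shifts the bright pixels lowers the correlation directly, which is the sensitivity described above.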
SAR ATR algorithms that are based on machine learning methods can be further divided into two types, i.e., feature-based methods and deep learning methods. Feature-based methods [8,9] require features to be manually extracted from SAR images, while deep learning methods automatically extract features from SAR images. Thus, deep learning methods avoid the manual design of feature extractors. As a typical deep learning structure, the convolutional neural network (CNN) has been successfully applied in various fields, e.g., SAR image classification [10] and satellite image classification [11]. In particular, CNN-based methods outperform others in SAR ATR tasks due to their characteristics, which are well suited to two-dimensional image classification [12].
The MSTAR dataset serves as a benchmark for the evaluation and comparison of SAR ATR algorithms [13]. However, there is a high correlation between the target type and clutter in the MSTAR dataset, i.e., the SAR images of a specific target type may correspond to the same background clutter. It was demonstrated that, even if the target and shadow regions are removed, a traditional classifier still achieves high recognition accuracy (above 99%) on the remaining clutter [14]. In real-world situations, however, the target location may change, and various types of background clutter, rather than a fixed type, should accompany the corresponding SAR image. Therefore, we exclude such correlation by target region segmentation [15] and generate the MSTAR pure target dataset for a fair comparison and evaluation of SAR ATR algorithms.
The key factors in improving the recognition performance of SAR ATR algorithms that are based on CNN include: (i) SAR image preprocessing to extract features more effectively and easily; and, (ii) designing effective network structures that make full use of the extracted features from SAR images.
For SAR image preprocessing, Ding et al. [16] augmented the training set by image rotation and shifting to alleviate overfitting. Chen et al. generated the augmented training set by randomly cropping the initial 128 × 128 MSTAR images to 88 × 88 patches [12]. Wagner enlarged the training set by directly adding distorted SAR images to improve the robustness [17]. Lin et al. cropped the initial MSTAR images to 68 × 68 patches in order to reduce the computational burden of the CNN [18], and Shang et al. cropped the initial MSTAR images to 70 × 70 patches [19]. Wang et al. used a despeckling subnetwork to suppress speckle noise before inputting SAR images into a classification network [20].
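The random-cropping augmentation described above can be sketched as follows. This is a minimal illustration that assumes an image stored as a nested list; the function name `random_crop` and the fixed seed are choices made here, not details taken from the cited work.

```python
import random

def random_crop(image, crop_size, rng):
    """Cut a crop_size x crop_size patch at a random position from a 2-D list."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - crop_size)    # inclusive bounds
    left = rng.randint(0, width - crop_size)
    return [row[left:left + crop_size] for row in image[top:top + crop_size]]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
image = [[r * 128 + c for c in range(128)] for r in range(128)]  # stand-in 128 x 128 image
patches = [random_crop(image, 88, rng) for _ in range(4)]
print(len(patches[0]), len(patches[0][0]))  # 88 88
```

With these sizes, one 128 × 128 image admits 41 × 41 = 1681 distinct 88 × 88 patch positions, which is why cropping can enlarge a small training set considerably.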
For the design of CNN structures for SAR ATR, a traditional CNN structure that consists of convolutional layers, pooling layers, and a softmax classifier was proposed [16,21,22,23]. Later, Chen et al. designed A-ConvNets, where the number of unknown parameters is greatly reduced by removing the fully-connected layer [12]. Wagner replaced the softmax classifier in the traditional CNN structure with an SVM classifier and achieved high recognition accuracy [17,24]. Lin et al. proposed CHU-Nets, where a convolutional highway unit is inserted into the traditional CNN structure and the classification performance is improved on a training dataset with limited labels [18]. Shang et al. added an information recorder to the CNN to remember and store the spatial features of the samples, and then used the spatial similarity information of the recorded features to predict the unknown sample labels [19]. Kechagias-Stamatis et al. fused a convolutional neural network module with a sparse coding module under a decision-level scheme, which adaptively alters the fusion weights based on the SAR images [25]. Pei et al. proposed a multiple-view DCNN (m-VDCNN) to extract the features from target images with different azimuth angles [26].
Generally, a CNN is a data-driven model, and each pixel of the training and test samples directly participates in feature extraction. The correlation between the clutter in the training and test sets cannot be ignored, since input SAR images consist of both a target region and a clutter region. Additionally, for the available SAR ATR algorithms that are based on CNN, the softmax classifier directly uses the features that were extracted by the convolutional layers. However, a CNN may automatically learn useless feature maps, which prevent the classifier from effectively utilizing significant features [27,28]. Therefore, the available SAR ATR algorithms that are based on CNN ignore the negative effects of the feature maps with little information, and the recognition performance may degrade.
We propose a novel SAR ATR algorithm based on CNN to tackle the above-mentioned problems. The main contributions include: (i) an enhanced Squeeze and Excitation (SE) module is proposed to suppress feature maps with little information in the CNN by allocating different weights to feature maps according to the amount of information they contain; and, (ii) a modified CNN structure, i.e., the Enhanced Squeeze and Excitation Net (ESENet), which incorporates the enhanced-SE module, is proposed. The experimental results on the MSTAR dataset without clutter show that the proposed network outperforms the available CNN structures designed for SAR ATR.
The remainder of this paper is organized as follows. Section 2 introduces the Squeeze and Excitation module. Section 3 introduces a novel SAR ATR method based on the ESENet, and discusses the mechanism of the enhanced-SE module, together with the structure of the ESENet, in detail. Section 4 presents the experimental results to validate the effectiveness of the proposed network, and Section 5 concludes the paper.
2. Squeeze and Excitation Module
A typical CNN structure consists of a feature extractor and a classifier. The feature extractor is a multilayer structure that is formed by stacking convolutional layers and pooling layers. The feature maps of different hierarchies are extracted layer by layer, and the feature maps of the last layer are then used by the classifier for target recognition. In a typical feature extractor, the feature maps in the same layer are regarded as having the same importance to the next layer. However, such an assumption is usually violated in practice [29].
Figure 1 shows 16 feature maps extracted by the first convolutional layer for a typical CNN structure applied in a SAR ATR experiment. It is observed that some of the feature maps, e.g., the second feature map in the first row, only have several bright pixels, and contain less target structural information than others.
In a typical CNN structure, all of the feature maps in the same layer pass through the network equally, regardless of their importance. Thus, they make equal contributions to recognition, and such equal treatment hinders the utilization of important feature maps that contain more information. We can apply the SE module, which allocates different weights to different feature maps in the same layer, to enhance significant feature maps and suppress others with less information [29].
Figure 2 illustrates the structure of an SE module. For an arbitrary input feature map tensor $U \in \mathbb{R}^{W \times H \times C}$, where $W \times H$ represents the size of the input feature map and $C$ represents the number of input feature maps, the SE module transforms $U$ into a new feature map tensor $X$, where $X$ shares the same size with $U$, i.e., $X \in \mathbb{R}^{W \times H \times C}$. $r$ is a fixed hyperparameter in an SE module.
The computation of an SE module includes two steps, i.e., the squeeze operation $F_{sq}(\cdot)$ and the excitation operation $F_{ex}(\cdot)$. The squeeze operation obtains the global information of each feature map, while the excitation operation automatically learns the weight of each feature map. A simple implementation of the squeeze operation is global average pooling. For the feature map tensor $U \in \mathbb{R}^{W \times H \times C}$, such a squeeze operation outputs a description tensor $z \in \mathbb{R}^{C}$, where the $c$th element of $z$ is denoted by:

$$z_c = F_{sq}(u_c) = \frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} u_c(i,j), \qquad (1)$$

where $u_c \in \mathbb{R}^{W \times H}$ represents the $c$th feature map of $U$. The excitation operation is denoted by the following nonlinear function:

$$s = F_{ex}(z) = \sigma\left(W_2\, \delta(W_1 z)\right), \qquad (2)$$

where $\delta(\cdot)$ is the rectified linear unit (ReLU) function, $\sigma(\cdot)$ is the sigmoid activation function, $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$, $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$, $r$ is a fixed hyperparameter, and $s$ is the automatically-learned weight vector, which represents the importance of feature maps. It can be seen from Equations (1) and (2) that the combination of the squeeze operation and the excitation operation learns the importance of each feature map independently from the network. Finally, the $c$th feature map that is produced by the SE module is denoted by:

$$x_c = F_{scale}(u_c, s_c) = s_c \cdot u_c, \qquad (3)$$

where $s_c$ represents the weight of $u_c$ and $s_c \cdot u_c$ represents the product of them.
As discussed above, the SE module computes and allocates weights to the corresponding feature maps. The feature maps with little information will be suppressed after being multiplied by the weights that are much less than 1, while the others will remain almost unchanged after being multiplied by the weights near 1.
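The squeeze, excitation, and scaling steps can be traced on a toy tensor. The sketch below is a minimal illustration and not the authors' implementation: it assumes the usual SE formulation with two weight matrices of shapes (C/r) × C and C × (C/r), and the toy values for `u`, `w1`, and `w2` are hypothetical.

```python
import math

def relu(v):
    return [max(0.0, x) for x in v]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def squeeze(u):
    """Squeeze step: global average pooling, one scalar per feature map."""
    return [sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0])) for fmap in u]

def excite(z, w1, w2):
    """Excitation step: s = sigmoid(W2 * relu(W1 * z))."""
    return sigmoid(matvec(w2, relu(matvec(w1, z))))

def scale(u, s):
    """Scaling step: multiply every pixel of feature map c by its weight s[c]."""
    return [[[s_c * px for px in row] for row in fmap] for fmap, s_c in zip(u, s)]

# Toy input with C = 2 feature maps of size 2 x 2 and r = 2, so C / r = 1.
u = [[[1.0, 1.0], [1.0, 1.0]],
     [[0.0, 0.0], [0.0, 4.0]]]
w1 = [[0.5, 0.5]]      # hypothetical weights, shape (C/r, C)
w2 = [[1.0], [-1.0]]   # hypothetical weights, shape (C, C/r)

z = squeeze(u)          # [1.0, 1.0]
s = excite(z, w1, w2)   # [sigmoid(1), sigmoid(-1)], roughly [0.731, 0.269]
x = scale(u, s)
```

Both toy maps have the same mean, yet the excitation weights differ because they are produced by the learned matrices rather than by the pooled values alone; the second map is scaled down to roughly a quarter of its original amplitude.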
3. SAR ATR Based on ESENet
In this section, we propose the enhanced-SE module according to the characteristics of the SAR data, and then design a new CNN structure for SAR ATR, namely the ESENet.
Figure 3 shows the main steps of the training and test stages to give an overview of the proposed method. Firstly, image segmentation is utilized to remove the background clutter [15,30]. Subsequently, the segmented training images are input into the ESENet to learn weights, and all of the weights in the ESENet are fixed when the training stage ends. After that, the ESENet is used for classification. During the test stage, the segmented test images are input into the ESENet to obtain the classification results. The correlation between the clutter in the training and test sets is excluded, because the clutter irrelevant to the target does not take part in the training and test stages of the ESENet.
In what follows, we will explain the mechanisms of the ESENet in detail.
3.1. Overall Structure of the ESENet
In this part, we discuss the characteristics and general layout of the proposed ESENet. As shown in Figure 4, the ESENet consists of four convolutional layers, three max pooling layers, a fully-connected layer, an SE module, an enhanced-SE module, and an LM-softmax classifier [31]. There are 16 convolutional kernels of size 5 × 5 in the first convolutional layer, 32 kernels of size 3 × 3 in the second convolutional layer, 64 kernels of size 4 × 4 in the third convolutional layer, and 64 kernels of size 5 × 5 in the last convolutional layer. Batch normalization [32] is used in the first convolutional layer to accelerate the convergence. A max pooling layer with pooling size 2 × 2 and stride 2 is added after the first convolutional layer, the SE module, and the enhanced-SE module, respectively. The SE module is inserted in the middle of the network to preliminarily enhance the important feature maps. An enhanced-SE module is inserted before the last pooling layer to further suppress higher-level feature maps with little information. In addition, dropout is applied to the third convolutional layer and the last convolutional layer. The fully-connected layer has 10 nodes. Finally, we apply the LM-softmax classifier for classification. Below, we introduce the key components of the proposed network in detail.
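The kernel counts and sizes listed above determine how the convolutional weights are distributed across the four layers. The sketch below tallies them under several assumptions that are not stated in the text: a single-channel input image, strictly sequential layers, pooling and SE modules that leave the channel count unchanged, and bias terms and SE-module parameters left out of the tally.

```python
# (name, number of kernels, kernel height, kernel width) as listed in Section 3.1
conv_layers = [
    ("conv1", 16, 5, 5),
    ("conv2", 32, 3, 3),
    ("conv3", 64, 4, 4),
    ("conv4", 64, 5, 5),
]

def conv_weight_counts(layers, input_channels=1):
    """Count kernel weights per layer; each kernel spans all input channels."""
    counts = {}
    channels = input_channels  # assumption: single-channel SAR amplitude image
    for name, kernels, kh, kw in layers:
        counts[name] = kernels * kh * kw * channels  # biases excluded
        channels = kernels  # the next layer sees one channel per kernel
    return counts

counts = conv_weight_counts(conv_layers)
total = sum(counts.values())
share_last_two = (counts["conv3"] + counts["conv4"]) / total
print(counts)                    # {'conv1': 400, 'conv2': 4608, 'conv3': 32768, 'conv4': 102400}
print(total)                     # 140176
print(round(share_last_two, 3))  # 0.964
```

Under these assumptions, the third and fourth convolutional layers hold about 96% of the convolutional weights, which is consistent with dropout being placed on those two layers.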
3.2. Enhanced Squeeze and Excitation Module
We found that, if the original SE module is inserted directly into a CNN designed for SAR ATR, most of the weights output by the sigmoid function become 1 (or almost 1); thus, the feature maps remain almost unchanged after being multiplied by the corresponding weights. Accordingly, the original SE module cannot effectively suppress the feature maps with little information.
To solve this problem, a modified SE module is proposed, i.e., the enhanced-SE module. Firstly, although global average pooling can compute the global information of the current feature map, its ability to perceive this information accurately is limited. Thus, we design a new layer with learnable parameters to perceive the global information of the current feature map, which is realized by replacing the global average pooling layer with a convolutional layer whose kernel size is the same as the size of the current feature map. Additionally, the first fully-connected layer is deleted; thus, the perceived global information directly takes part in the computation of the final output weights.
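One way to see why a convolution whose kernel is as large as the feature map generalizes global average pooling is that the pooling result equals the response of such a kernel with uniform weights 1/(W × H). The sketch below shows this for a single feature map; the per-map (single-channel) kernel and the `learned` weights are hypothetical, since the text does not state whether the convolution mixes channels.

```python
def global_average_pool(fmap):
    """Mean of all pixels in one feature map."""
    h, w = len(fmap), len(fmap[0])
    return sum(sum(row) for row in fmap) / (h * w)

def full_size_kernel_response(fmap, kernel):
    """A kernel as large as the map fits in one position only, so it yields one scalar."""
    return sum(k * v for krow, vrow in zip(kernel, fmap) for k, v in zip(krow, vrow))

fmap = [[0.0, 2.0], [4.0, 6.0]]
uniform = [[0.25, 0.25], [0.25, 0.25]]  # 1 / (W * H) everywhere
learned = [[0.0, 0.1], [0.2, 0.7]]      # hypothetical learned weights

gap = global_average_pool(fmap)
uniform_response = full_size_kernel_response(fmap, uniform)
learned_response = full_size_kernel_response(fmap, learned)
print(gap, uniform_response)  # 3.0 3.0
```

With learnable weights, the layer can emphasize the pixel positions that matter most for the global descriptor instead of averaging all positions equally.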
The sigmoid function is utilized in the original SE module to avoid numerical explosion by transforming all the learned weights to (0,1); it is defined as [33]:

$$\sigma(x) = \frac{1}{1 + e^{-x}}. \qquad (4)$$

Although the sigmoid function is monotonically increasing, all of the large weights are transformed to almost 1 (e.g., the weight 2.5 becomes 0.9241 after the sigmoid transformation). Such a transformation does not help the network to distinguish the importance of different feature maps. To solve the above problem, we design a new function, i.e., the enhanced-sigmoid function $\sigma_{e}(x)$, where $a$ is the shift parameter, $b$ is the scale parameter, $q$ is the power parameter, and $\sigma(x)$ is the original sigmoid function. If $a = 0$, $b = 1$, $q = 1$, then $\sigma_{e}(x)$ is the same as $\sigma(x)$. For $a = 0$, $b = 1$, $q = 2$, Figure 5 shows the comparison between the sigmoid function and the enhanced-sigmoid function. If the input value falls in (−5,5), then the output of the enhanced-sigmoid function is smaller than the output of the sigmoid function (e.g., the weight 2.5 becomes 0.8540 after the enhanced-sigmoid transformation, which is clearly smaller than 0.9241).
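The two quoted values can be checked numerically. The sketch below assumes one possible form of the enhanced-sigmoid function, namely the sigmoid of (x − a)/b raised to the power q. The placement of a and b is an assumption made here; it is chosen only because a = 0, b = 1, q = 1 then recovers the sigmoid, and q = 2 reproduces the quoted value 0.8540.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def enhanced_sigmoid(x, a=0.0, b=1.0, q=2.0):
    """Assumed form: sigmoid((x - a) / b) ** q (shift a, scale b, power q)."""
    return sigmoid((x - a) / b) ** q

plain = sigmoid(2.5)
enhanced = enhanced_sigmoid(2.5)  # a = 0, b = 1, q = 2
print(round(plain, 4))     # 0.9241
print(round(enhanced, 4))  # 0.854
```

Since the sigmoid output lies strictly between 0 and 1, squaring it always gives a smaller value, so large inputs are pushed less aggressively towards 1 and the resulting weights stay more spread out.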
Figure 6 shows the structure of the enhanced-SE module with the above modification.
Figure 7 shows an illustrative comparison between the feature maps output by the SE module and the enhanced-SE module in a SAR ATR task. Many feature maps become blank in Figure 7b, indicating that the enhanced-SE module suppresses feature maps with little information more effectively than the original SE module.
3.3. Other Components in the ESENet
The convolutional layer and the pooling layer are the basic components of a typical CNN structure [34]. The convolutional layer often acts as a feature extractor, which convolves the input with a convolutional kernel to generate a new feature map. The pooling layer is a subsampling layer that reduces the number of trainable parameters of the network. By subsampling, the structural features of the current layer are maintained and the impact of deformed training samples on feature extraction is reduced.
Neural networks are essentially utilized to fit the data distribution. If the training and test sets have different distributions, the convergence speed will decrease and the generalization performance will degrade. To tackle this problem, batch normalization is added after the first convolutional layer of the ESENet to accelerate network training and improve the generalization performance.
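The core of batch normalization can be sketched on a single channel as follows. This is a minimal training-mode illustration in which the scale `gamma`, the shift `beta`, and the stabilizer `eps` carry their usual names and default values as assumptions; the running statistics used at test time are omitted.

```python
import math

def batch_normalize(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch of activations to zero mean and unit variance, then rescale."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n  # biased batch variance
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

batch = [2.0, 4.0, 6.0, 8.0]  # hypothetical activations of one channel
normalized = batch_normalize(batch)
out_mean = sum(normalized) / len(normalized)
out_var = sum((v - out_mean) ** 2 for v in normalized) / len(normalized)
```

After normalization, the batch has approximately zero mean and unit variance regardless of the scale of the incoming activations, which keeps the input distribution of the following layer stable during training.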
Dropout is a common regularization method that is utilized in deep neural networks [35]. This technique randomly samples the weights of the current layer with probability $p$ and prunes them, which acts like an ensemble of sub-networks. Usually, it is adopted in layers with a large number of parameters to alleviate overfitting. In the proposed ESENet, the fully-connected layer has a small number of parameters, while the third convolutional layer and the fourth convolutional layer contain most of the trainable weights. Thus, we apply dropout in these two layers with $p$ = 0.5 and $p$ = 0.25, respectively.
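A minimal sketch of dropout follows. Note that it uses the common formulation that zeroes activations and rescales the survivors by 1/(1 − p) (inverted dropout), rather than pruning weights as worded above; the rescaling, the function name, and the fixed seed are assumptions made for illustration.

```python
import random

def dropout(activations, p, rng, training=True):
    """Zero each activation with probability p; rescale survivors by 1 / (1 - p)."""
    if not training or p == 0.0:
        return list(activations)  # dropout is disabled at test time
    keep = 1.0 - p
    return [a / keep if rng.random() >= p else 0.0 for a in activations]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
out = dropout([1.0] * 1000, 0.5, rng)
dropped = out.count(0.0)
unchanged = dropout([1.0, 2.0], 0.5, rng, training=False)
```

With p = 0.5, roughly half of the entries are zeroed on each pass and the rest are doubled, so the expected activation is unchanged; at test time the layer passes its input through untouched.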
Additionally, we replace the common softmax classifier with the LM-softmax classifier, which can improve the classification performance by adjusting the decision boundary of the features that were extracted by the CNN.
3.4. Parameter Settings and Training Method
We apply the gradient descent technique with weight decay and momentum in the training process [36], which is defined by:

$$\theta_{i+1} = \theta_i + \Delta\theta_{i+1}, \qquad \Delta\theta_{i+1} = \alpha\, \Delta\theta_i - \beta\, \epsilon\, \theta_i - \epsilon\, \frac{\partial L}{\partial \theta_i}, \qquad (6)$$

where $\Delta\theta_{i+1}$ is the variation of $\theta$ in the ($i$ + 1)th iteration, $\epsilon$ is the learning rate, $\alpha$ is the momentum coefficient, $\beta$ is the weight decay coefficient, and $\frac{\partial L}{\partial \theta_i}$ is the derivative of the loss function $L$ with respect to $\theta$. In this paper, the base learning rate is set to 0.02, $\alpha$ is set to 0.9, and $\beta$ is set to 0.004. Subsequently, we adopt a multi-step iteration strategy, which updates the learning rate when the iteration number reaches 1000, 2000, 4000, etc. Additionally, we adopt a common training method that subtracts the mean of the training samples from both the training and test samples to accelerate the convergence of the CNN. In the enhanced-SE module, $a$ is set to 0, $b$ is set to 1, and $q$ is set to 2. In the SE module, $r$ is set to 16.
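The training rule can be sketched on a scalar parameter as follows. The update form (momentum term, weight decay term scaled by the learning rate, and gradient term) is a common formulation assumed here. The decay factor `gamma` of the multi-step schedule is hypothetical, because the text gives only the iteration milestones, and the quadratic loss is a toy stand-in.

```python
def sgd_step(theta, delta, grad, lr, momentum=0.9, weight_decay=0.004):
    """One update with momentum and weight decay (assumed common form)."""
    new_delta = momentum * delta - weight_decay * lr * theta - lr * grad
    return theta + new_delta, new_delta

def multistep_lr(base_lr, iteration, milestones=(1000, 2000, 4000), gamma=0.5):
    """Multiply the learning rate by gamma at each milestone; gamma is hypothetical."""
    passed = sum(1 for m in milestones if iteration >= m)
    return base_lr * gamma ** passed

# Toy problem: minimize L(theta) = 0.5 * theta ** 2, whose gradient is theta.
first_theta, first_delta = sgd_step(1.0, 0.0, grad=1.0, lr=0.02)

theta, delta = 1.0, 0.0
for i in range(200):
    lr = multistep_lr(0.02, i)
    theta, delta = sgd_step(theta, delta, grad=theta, lr=lr)
```

The first step moves the parameter from 1.0 to 0.97992: the gradient term contributes −0.02 and the weight decay term −0.00008. Over 200 iterations the momentum term accumulates these small steps and the parameter approaches the minimum at zero.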