1. Introduction
Synthetic Aperture Radar (SAR) features all-weather, long-range and large-scale detection performance. It can obtain high-resolution radar images under extremely low-visibility weather conditions, effectively identify camouflaged and masked ground targets, and is widely used in marine environment detection, terrain survey and military target recognition. SAR Automatic Target Recognition (ATR) is a crucial technique for interpreting SAR target images, which can effectively improve the utilization efficiency of SAR target images [1]. However, limited by SAR's coherent imaging system and electromagnetic scattering mechanism, SAR target images are full of strong coherent speckle noise and are affected by variations in target attitude, angle, imaging parameters and other factors.
Based on the imaging characteristics of SAR target images, researchers have conducted extensive research on SAR ATR algorithms. Traditional SAR target recognition methods mainly concentrate on the feature extraction stage and the construction of a classifier. In the feature extraction phase, Principal Component Analysis (PCA) [2], Independent Component Analysis (ICA) [3], the Gray Level Co-occurrence Matrix (GLCM) [4] and the Histogram of Oriented Gradients (HOG) [5] have been applied to SAR target recognition tasks. PCA is a multivariate statistical method that examines the correlation between multiple variables; its goal is to extract significant information from the acquired data and depict it as a new set of orthogonal variables named principal components. Gang et al. [6] put forth a joint multi-channel sparsity method on the basis of robust PCA to improve the display performance of SAR ground moving targets. ICA is an analytical method based on high-order statistical characteristics, used to decompose complex datasets into independent sub-parts. Vasile et al. [7] applied ICA to the speckle filtering of actual polarimetric SAR data, feeding the rotation-invariant scattering vector derived from each ICA into a minimum mean square error filter, which better preserved the spatial resolution. GLCM describes the joint distribution of two gray-level pixels with a certain spatial location relationship, on which the analysis of the local patterns and arrangement rules of the image is based. Numbisi et al. [8] used a random forest ensemble classifier on averaged texture features and the GLCM of SAR target images, showing that SAR target images can identify cocoa agroforests and transition forests in a multiphase landscape. HOG is a feature descriptor for object detection that computes the direction information of local image gradients. Song et al. [9] designed a HOG variant for SAR ATR, which accurately captured the target structure in SAR target images. In terms of classifier construction, the Support Vector Machine (SVM) [10], Adaptive Boosting (AdaBoost) [11] and K Nearest Neighbors (KNN) [12] have also been successfully applied to SAR ATR algorithms. Sukawatanavijit et al. [13] combined a genetic algorithm with SVM to present a novel algorithm that obtains optimal classification accuracy on multi-frequency RADARSAT-2 (RS2) SAR target images using merely a small number of input features. Kim et al. [14] brought forward a new target detection method based on AdaBoost decision-level SAR and IR fusion, which showed satisfactory performance on a synthetic database created by OKTAL-SE. Hou et al. [15] introduced the KNN algorithm to enhance the classification accuracy of superpixel SAR target images; this method takes into account the spatial position relationship between pixels and is strongly robust to coherent speckle noise. Eryildirim et al. [16] proposed a novel method for extracting descriptive feature parameters from the two-dimensional cepstrum of SAR images, which has a lower computational cost than PCA. Clemente et al. [17] utilized pseudo-Zernike moments of multi-channel SAR images as features to identify different targets and realized high-confidence ATR. Sun et al. [18] introduced a SAR image recognition method based on dictionary learning and joint dynamic sparse representation, which improved both recognition speed and accuracy. Clemente et al. [19] designed a SAR ATR algorithm based on Krawtchouk moments, with strong target classification and anti-noise ability. However, the above algorithms rely heavily on cumbersome manual feature design and empirical selection, which is not only costly but also often yields models with poor generalization ability.
CNNs are capable of processing multi-dimensional data and have powerful representation learning capabilities, and have therefore attracted the attention of many researchers. Since AlexNet [20] won the ImageNet challenge ILSVRC 2012 [21] and demonstrated the power of CNNs, CNNs have begun to appear in various computer vision tasks. VGGNet [22] then used a sequential structure to explore the impact of CNN depth on image classification, showing that network depth contributes greatly to performance, and achieved second place in ILSVRC 2014. ResNet [23] introduced the concept of residual representation into the construction of CNN models, which further extended the depth of CNNs and achieved better performance. GoogLeNet [24] provided another idea for the design of CNN models: its Inception module greatly improved parameter utilization by expanding the width of the network and using smaller convolution kernels. In recent years, various CNN-based models have made remarkable achievements in optical image target recognition, which has stimulated plenty of related research on SAR target image recognition. Chierchia et al. [25] adopted a residual learning strategy in their CNN to denoise SAR target images by subtracting the recovered speckle component from the noisy image. Pei et al. [26] augmented the data by generating sufficient multi-view SAR data and fed the expanded SAR data into a CNN with a multi-input parallel network topology for identification. Dong et al. [27] utilized spatial polarization information and XGBoost to perform classification experiments on PolSAR images from the Gaofen-3 satellite, and proved that incorporating spatial information helps improve overall performance. Wang et al. [28] proposed a fixed-feature-size CNN that classifies all pixels of a PolSAR image at the same time and improves classification accuracy by using the correlation between different land covers. Shao et al. [29] effectively reduced the impact of data imbalance on SAR image recognition results by introducing a visual attention mechanism and a new weighted distance measure loss function into their network. Zhang et al. [30] improved the scattering decomposition technique based on a multi-component model and adopted a superpixel-level classification strategy for the extracted features, providing a new method for land use classification of PolSAR data. He et al. [31] designed a special generative adversarial network to generate enough labeled SAR data, which improved the classification performance of CNNs on SAR images. Although CNN-based SAR ATR algorithms have achieved breakthroughs in recognition performance, three crucial problems in SAR target image recognition still need to be resolved. First, SAR target images are full of coherent speckle noise, resulting in highly redundant training sample features and a lack of representative features for target recognition, which greatly affects the classification performance of the model. Second, limited by the expert domain knowledge and labeling costs required for SAR target images, labeled SAR data is scarce; training a CNN with a small number of SAR samples causes serious overfitting and poor generalization. Third, deep CNNs have complex structures and enormous computational complexity, which is not conducive to deploying SAR identification systems on terminal equipment.
In view of the above challenges, a lightweight network architecture, TAI-SARNET, combined with transfer learning is put forward here to achieve efficient SAR target image recognition. Firstly, the Atrous-Inception module with small convolution kernels is adopted in the network to increase both the depth and width of the network while enlarging the receptive field and reducing the number of parameters. Secondly, Batch Normalization (BN) [32] is employed behind each convolutional layer to effectively prevent network overfitting, and Global Average Pooling [33] is exploited to further decrease the number of parameters. Subsequently, the robustness of the proposed algorithm on a small amount of sample data is verified, and transfer learning is introduced to enhance the performance of the model. Eventually, the suggested algorithm is tested on the MSTAR [34] database, and the experimental results demonstrate that it attains excellent recognition accuracy. The main contributions of this paper are summarized as follows:
- (1) An improved lightweight CNN model based on atrous convolution and the Inception module is proposed. This model obtains a rich global receptive field, effectively prevents the network from overfitting, and achieves high recognition accuracy on the MSTAR dataset.
- (2) The Atrous-Inception module is designed to extract more detailed target feature information, and it shows strong robustness on the constructed small-sample SAR target image datasets.
- (3) The transfer learning strategy is used to explore how prior knowledge from the optical, non-optical, and hybrid optical and non-optical fields transfers to SAR target image recognition tasks, further improving the robustness and generalization of the model on the small-sample SAR datasets.
The rest of this paper is arranged as follows: The second part introduces related work on CNNs and transfer learning. The third part introduces the proposed methods. The fourth part presents and analyzes the experimental results. The fifth part draws the conclusion.
3. Proposed Methods
In this section, we propose a lightweight SAR target image classification network combining atrous convolution and the Inception module, and introduce how transfer learning is applied. The first part presents the basic network architecture. The second part illustrates the Atrous-Inception module in detail and elaborates on the related concept of the receptive field. The third part derives the mathematical formulation of the optimization algorithm used in this paper. The fourth part gives the specific definition of transfer learning and introduces the transfer learning strategy adopted in this paper.
3.1. Proposed Network
This paper presents a lightweight network based on atrous convolution and Inception-v3, which we call TAI-SARNET. The specific network structure is presented in Figure 1, and the detailed parameter information of the structure is shown in Table 1.
The first four layers of TAI-SARNET are consistent with Inception-v3 and output 80 feature maps. They are immediately followed by an Atrous-Inception module, which contains three atrous convolution layers and outputs 96 feature maps; details of the Atrous-Inception module are introduced in Section 3.2. Finally, an atrous convolution layer is placed behind the Atrous-Inception module and outputs 256 feature maps. The above structure constitutes the feature extraction part of TAI-SARNET. In particular, the activation function used in TAI-SARNET is ReLU, and a BN layer is added behind each convolution layer. The BN layer dramatically accelerates network training, effectively addresses the gradient dispersion problem and helps avoid overfitting. The BN layer can be expressed as follows:
$$y_i = \gamma \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta \quad (1)$$

where $x_i$ represents the $i$-th input of the BN layer, $\mu$ is the mean of a batch of input data, $\sigma$ represents the standard deviation of the batch input data and $\epsilon$ is a very small constant value. $\gamma$ and $\beta$ are learnable reconstruction parameters, which are iteratively optimized through network training. By Equation (1), the $i$-th output $y_i$ of the BN layer can be calculated.
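As an illustration only (not the authors' implementation), the BN transform of Equation (1) can be sketched in a few lines of NumPy; `gamma`, `beta` and `eps` below stand in for the learnable parameters $\gamma$, $\beta$ and the constant $\epsilon$:

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch Normalization as in Equation (1): normalize a batch of
    inputs to zero mean and unit variance, then rescale with the
    learnable reconstruction parameters gamma and beta."""
    mu = x.mean()                          # batch mean
    var = x.var()                          # batch variance (sigma^2)
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalized input
    return gamma * x_hat + beta            # reconstructed output y_i

batch = np.array([1.0, 2.0, 3.0, 4.0])
y = batch_norm(batch)
# the normalized batch has approximately zero mean and unit variance
```

During training, `gamma` and `beta` would be updated by backpropagation rather than held fixed as here.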
Subsequently, a global average pooling layer is utilized in place of the traditional fully connected layer, computing the average of all pixels of each feature map output by the feature extraction part. The global average pooling layer directly reduces the data dimension, greatly reducing network parameters, and acts as a regularizer in the overall structure to prevent overfitting. Finally, the values from the global average pooling layer are input into the softmax classifier. Softmax is essentially a normalized exponential function that maps its inputs to probability values within the 0–1 interval; the softmax layer can be represented as follows:
$$p_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}} \quad (2)$$

In Equation (2), $z_j$ represents the $j$-th output value of the global average pooling layer, and $K$ represents the number of categories. After softmax normalization, each output value $p_j$ can be regarded as the probability that the target belongs to category $j$, and the outputs over all categories sum to one.
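A minimal softmax sketch in plain Python (illustrative, not the network code) makes the normalization concrete; subtracting the maximum score is a standard numerical-stability trick that does not change the result:

```python
import math

def softmax(z):
    """Softmax of Equation (2): map K raw scores to a probability
    distribution over K categories."""
    m = max(z)                              # subtract max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

# three hypothetical class scores from the global average pooling layer
probs = softmax([2.0, 1.0, 0.1])
# probs sums to 1, and the largest score gets the largest probability
```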
3.2. Atrous-Inception Module
The receptive field represents the range of the original image that different neurons in the network can perceive. The larger a neuron's receptive field, the greater the range of the original image it can access, which means it contains more global, higher-level semantic features. In traditional CNNs, a convolution operation first extracts feature maps, and downsampling is then performed by a pooling operation to increase the receptive field. However, this operation results in the loss of the internal structure and spatial information of the target, and the information of small target objects cannot be reconstructed. Atrous convolution [46] is a good solution to the problems caused by frequent pooling operations, and is widely used in semantic segmentation of images. Specifically, atrous convolution inserts holes into the standard convolution kernel, which increases the receptive field without losing image information, enabling each convolution output to contain a greater range of information.
The size of the receptive field of a standard convolution depends on the kernel size of the current layer, the stride of the convolution kernel, and the receptive field of the previous layer. The receptive field of standard convolution is defined as follows:

$$RF_l = RF_{l-1} + (k_l - 1) \times \prod_{i=1}^{l-1} s_i \quad (3)$$

where $RF_l$ represents the receptive field of the $l$-th convolution layer, $k_l$ represents the size of the convolution kernel in the $l$-th convolution layer, $s_i$ indicates the stride of the convolution kernel in the $i$-th convolution layer and the symbol $\times$ represents the multiplication of two factors. In particular, the receptive field of the first convolution layer equals the size of its convolution kernel. When every stride is 1, the formula simplifies to:

$$RF_l = RF_{l-1} + (k_l - 1) \quad (4)$$
For example, if the first layer is a 3 × 3 standard convolution, the receptive field of this layer is 3 × 3. After a second 3 × 3 standard convolution is stacked on it, the receptive field becomes 5 × 5. If three 3 × 3 standard convolutions are stacked, the receptive field is 7 × 7.
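The stride-1 recurrence of Equation (4) and the worked example above can be checked with a short helper (an illustration, not part of the paper's code):

```python
def receptive_field(kernel_sizes):
    """Receptive field of stacked stride-1 standard convolutions,
    following the simplified recurrence RF_l = RF_{l-1} + (k_l - 1)."""
    rf = kernel_sizes[0]          # first layer: RF equals the kernel size
    for k in kernel_sizes[1:]:
        rf += k - 1               # each extra layer grows RF linearly
    return rf

# one, two and three stacked 3x3 convolutions give 3, 5 and 7 respectively
sizes = [receptive_field([3] * n) for n in (1, 2, 3)]
```

The linear growth visible here is exactly the limitation that motivates atrous convolution in the next paragraph.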
Compared with standard convolution, atrous convolution has a hyperparameter called the dilated rate, which refers to the number of intervals between the elements of the convolution kernel. The receptive field of atrous convolution is calculated as follows:

$$RF_l = RF_{l-1} + (k_l - 1) \times r_l \quad (5)$$

where $r_l$ in Equation (5) represents the dilated rate of the $l$-th atrous convolution layer; an atrous convolution with a dilated rate of 1 is equivalent to a standard convolution. Figure 2 shows that the receptive field after stacking 3 × 3 atrous convolutions with dilated rates of 1, 2 and 4 reaches 15 × 15, which indicates that at the same network depth, the receptive field of atrous convolution is much larger than that of standard convolution.
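The 15 × 15 figure can be reproduced from Equation (5) with a small sketch (illustrative only); each layer is given as a `(kernel_size, dilated_rate)` pair:

```python
def atrous_receptive_field(layers):
    """Receptive field of stacked stride-1 atrous convolutions.
    Each layer is a (kernel_size, dilated_rate) pair; Equation (5)
    gives the recurrence RF_l = RF_{l-1} + (k_l - 1) * r_l."""
    k0, r0 = layers[0]
    rf = k0 + (k0 - 1) * (r0 - 1)   # effective kernel size of the first layer
    for k, r in layers[1:]:
        rf += (k - 1) * r           # dilation multiplies each layer's growth
    return rf

# three 3x3 atrous convolutions with dilated rates 1, 2, 4 -> RF of 15
rf_atrous = atrous_receptive_field([(3, 1), (3, 2), (3, 4)])
# the same depth of standard 3x3 convolutions (all rates 1) only reaches 7
rf_standard = atrous_receptive_field([(3, 1), (3, 1), (3, 1)])
```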
In the Inception series of networks, the adopted strategy is to stack small convolution kernels rather than use a large one, which reduces parameters while obtaining a receptive field of the same size as the large kernel. However, expanding the receptive field by stacking small kernels only grows it linearly, while atrous convolution can increase the receptive field exponentially without increasing the number of parameters. Therefore, the Atrous-Inception module is used to grow the receptive field, obtain a larger range of spatial information and further control the number of network parameters. The detailed structure of the Atrous-Inception module is displayed in Figure 3. The first part uses an atrous convolution layer with a dilated rate of 4 and a kernel size of 3 × 3; the second part uses an improved Inception module. In this improved Inception module, we maintain the main structure of the original Inception, using four 1 × 1 standard convolutions to build bottleneck layers for feature reduction, and using average pooling to decrease the number of parameters. On top of the original Inception structure, we replace the three 3 × 3 standard convolutions on the two branches with atrous convolutions with a dilated rate of 4 and a kernel size of 3 × 3, which increases the receptive field of the network and obtains more global information. In the final merge operation, the Maximum merge method is adopted, which outputs the element-wise maximum of the corresponding feature maps. Compared with the Concatenate merge used in the original Inception structure, Maximum avoids increasing the feature dimensions and greatly reduces the number of parameters.
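The difference between the two merge methods can be sketched with NumPy on two hypothetical branch outputs (the shapes below are illustrative, not the actual layer sizes of TAI-SARNET):

```python
import numpy as np

rng = np.random.default_rng(0)

# two hypothetical branch outputs: a batch of 8x8 feature maps, 96 channels each
branch_a = rng.random((1, 8, 8, 96))
branch_b = rng.random((1, 8, 8, 96))

# Concatenate merge stacks the channel axes, doubling the feature dimension
# (and hence the parameter count of every layer that follows) ...
concat = np.concatenate([branch_a, branch_b], axis=-1)

# ... while Maximum merge keeps the element-wise maximum, so the channel
# count of the merged output stays fixed at 96
maximum = np.maximum(branch_a, branch_b)
```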
3.3. RMSProp Optimization
The gradient descent algorithm is a typical optimization method commonly used in CNNs, and the core of the backpropagation algorithm is to continuously use gradient descent to update the weight parameters of each layer. However, gradient descent needs to traverse all samples per iteration, making the training process slow and taking a long time to reach convergence. The RMSProp algorithm is an adaptive learning rate algorithm proposed by Hinton [47], which uses an exponentially decaying average of historical gradients and discards gradients from the distant past, thus speeding up convergence. The RMSProp running-average update is shown in Equation (6):
$$E[g^2]_t = \rho E[g^2]_{t-1} + (1 - \rho) g_t^2 \quad (6)$$

In Equation (6), $g_t$ represents the gradient at the current moment, $\rho$ is a manually set fractional value and $E[g^2]_t$ represents the running average of the squared gradient at time step $t$. It can be seen from the formula that $E[g^2]_t$ depends only on the current gradient and the previous running average.
The calculation formula for parameter update is as follows:
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t \quad (7)$$

In Equation (7), $\theta_t$ represents the parameter at time step $t$, $\eta$ represents the initial learning rate and $\epsilon$ represents a smoothing term that avoids a zero denominator, generally taken as $1 \times 10^{-8}$. RMSProp updates the parameters according to the rule above and is one of the optimization algorithms commonly used in deep learning.
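One RMSProp step combining the two update rules can be written directly from the formulas; this is a scalar sketch for illustration, and the default values of `lr` and `rho` here are typical choices, not necessarily the paper's settings:

```python
def rmsprop_step(theta, grad, avg_sq, lr=0.001, rho=0.9, eps=1e-8):
    """One RMSProp update on a scalar parameter.
    avg_sq is the running average E[g^2] of Equation (6); the parameter
    update then follows Equation (7)."""
    avg_sq = rho * avg_sq + (1.0 - rho) * grad ** 2        # Equation (6)
    theta = theta - lr * grad / (avg_sq + eps) ** 0.5      # Equation (7)
    return theta, avg_sq

# a single step from theta = 1.0 with gradient 0.5 and empty history
theta, avg = rmsprop_step(theta=1.0, grad=0.5, avg_sq=0.0)
```

Because the denominator shrinks when recent gradients are small, the effective step size adapts per parameter, which is the property that speeds up convergence.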
3.4. Transfer Learning
When the distribution of training data and testing data does not meet the prerequisite of being identically distributed, traditional machine learning algorithms cannot achieve satisfactory performance. However, due to the cost of manual annotation and limits on data accessibility, it is difficult to construct a qualified dataset from scratch. Transfer learning is capable of improving the learning of related new tasks by transferring knowledge from already learned tasks, breaking through the limitations of data distribution differences and the lack of a vast quantity of labeled data in the target domain. Specifically, transfer learning transfers the knowledge learned from a source domain incorporating a large number of labeled training samples to a related target domain with a small number of labeled samples, so that the target domain achieves better results. In addition, transfer learning can reuse previous models, thus greatly speeding up learning. More rigorously, the relevant definitions and symbols of transfer learning are as follows: Given a domain $D = \{X, P(X)\}$, where $X$ represents the feature space, $P(X)$ represents the marginal probability distribution and $X = \{x_1, \dots, x_n\}$. Then given a task $T = \{Y, f(\cdot)\}$ corresponding to the domain $D$, where $Y$ represents the label space and $f(\cdot)$ represents the target prediction function. The training set data pairs can be represented as $\{(x_i, y_i)\}$, where $x_i \in X$ and $y_i \in Y$. $f(x)$ represents the predicted label of the unlabeled test sample $x$. In transfer learning, the domain that has already been learned is called the source domain $D_S$, and the related domain to be improved is called the target domain $D_T$. The task corresponding to the source domain is called the source task $T_S$, and the task corresponding to the target domain is called the target task $T_T$. The goal of transfer learning is to use the knowledge learned from the source domain $D_S$ and the source task $T_S$ to improve the prediction capability of the prediction function $f_T(\cdot)$ in the target domain $D_T$, requiring $D_S \neq D_T$ or $T_S \neq T_T$.
SAR target images are typical non-optical images. This paper explores the performance of transferring prior knowledge from the optical, non-optical, and hybrid optical and non-optical domains to SAR target recognition tasks.
For the optical domain, the source domain $D_S$ dataset uses ImageNet, the source task $T_S$ is the classification of 1000 classes of optical target images, the target domain $D_T$ dataset uses the 10-class MSTAR dataset under Standard Operating Conditions (SOC), and the target task $T_T$ is 10-class SAR target image classification. The features extracted by shallow layers are more general; therefore, the first four layers of TAI-SARNET share the structure of the first four layers of Inception-v3, and transfer based on the parameter migration of these specific layers is used. Specifically, we first obtain the Inception-v3 model pre-trained on $D_S$, extract its shallow-layer parameters and transfer them to the corresponding layers of the TAI-SARNET used on $D_T$, and then retrain the entire network until the model converges and obtains the optimal result. The specific transfer process is shown in Figure 4.
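The layer-wise parameter migration described above can be sketched abstractly with weight dictionaries; the layer names and values below are hypothetical placeholders, not the real parameter tensors of Inception-v3 or TAI-SARNET:

```python
# Hypothetical weights of a pre-trained source model and a freshly
# initialized target model (real entries would be weight tensors).
pretrained = {"conv1": [0.12], "conv2": [0.34], "conv3": [0.56],
              "conv4": [0.78], "fc": [0.90]}
target = {name: [0.0] for name in
          ["conv1", "conv2", "conv3", "conv4", "atrous_inception", "classifier"]}

# Only the shared shallow layers are migrated; deeper, task-specific
# layers keep their fresh initialization and are learned on the target task.
SHARED_LAYERS = ["conv1", "conv2", "conv3", "conv4"]

def transfer_weights(source, dest, shared):
    """Copy the parameters of the shared layers; leave the rest untouched."""
    for name in shared:
        dest[name] = list(source[name])   # copy, do not alias the source
    return dest

target = transfer_weights(pretrained, target, SHARED_LAYERS)
```

After this copy, the whole target network would be retrained (fine-tuned) until convergence, as the transfer procedure above describes.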
For the non-optical domain, the source domain $D_S$ dataset uses the augmented 3-class MSTAR dataset, the source task $T_S$ is 3-class SAR target image classification, the target domain $D_T$ dataset uses the 10-class MSTAR dataset under Standard Operating Conditions (SOC) and the target task $T_T$ is 10-class SAR target image classification. First, we use the full-angle rotation enhancement method to augment the 3-class MSTAR dataset by a factor of 360 as the $D_S$ dataset, and then use the enhanced data for $T_S$ to train TAI-SARNET and obtain a pre-trained model. Finally, the network on $D_T$ loads the specific-layer weights of the pre-trained model and fine-tunes, changing the number of classes of the softmax layer according to $T_T$, until the model converges and $T_T$ obtains the optimal prediction result. The specific transfer process is shown in Figure 5.
To fully compare the transfer effects of data from different imaging modes as source domain knowledge, we take the classification task of combined optical and non-optical radar images as the source task $T_S$ and transfer the knowledge obtained from $T_S$ to the 10-class SAR target image classification, which is the target task $T_T$. Specifically, the NWPU-RESISC45 dataset [48] and the 3-class MSTAR dataset extended by random angle rotation are mixed to construct a complete hybrid radar image dataset as the source domain $D_S$ dataset. The training set of the complete hybrid radar dataset contains 23,472 images, the validation set 6768 images and the testing set 4515 images. The transfer strategy is consistent with Figure 5: we start by training TAI-SARNET on the hybrid radar images to obtain a pre-trained model. Then, we keep the feature extraction part of TAI-SARNET and adjust only the classifier according to $T_T$. Finally, the weights of the pre-trained model are loaded into the network of the target domain $D_T$ and fine-tuned until the model converges on the 10-class MSTAR dataset under SOC.