1. Introduction
With the rapid development of sensors and satellite-based remote sensing technologies, the resolution of remote sensing images has greatly improved. High-resolution images contain richer textural details and target information, which facilitates the identification of various objects. As a typical civil and military target, the aircraft plays an important role in many fields, such as transportation services, wartime strikes, and air surveillance, so detecting aircraft in remote sensing images is of great importance.
To date, various aircraft detection methods have been proposed, which can be broadly grouped into template matching-based methods [1,2], segmentation- and edge-based methods [3], and machine learning-based methods [4,5]. The template matching-based method, one of the earliest, measures the similarity between a template and candidate targets to obtain detections. It depends heavily on the design of the template and can only detect aircraft whose shape, size, and orientation are consistent with the template. Segmentation- and edge-based methods focus on the salient contour, line, and edge features of targets; they are fast and simple but susceptible to external interference. The machine learning-based method treats object detection as a process of feature extraction and classification: it first extracts texture, shape, and spatial relationship features with bag-of-words (BoW) [5], histogram of oriented gradients (HOG) [6], etc., and then feeds the features to classifiers such as support vector machines (SVM) or AdaBoost for final determination. In summary, all of these methods rely on manually designed low-level features to describe object structure; they are highly dependent on prior knowledge, generalize poorly, and are not robust enough for automatic aircraft detection in remote sensing images with complex backgrounds.
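As a concrete illustration of this classical pipeline (a minimal sketch only, not the implementation of any cited work; the window size, HOG parameters, and classifier settings are our assumptions):

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# Stand-in data: 64 x 64 grayscale windows (in practice, cropped from the
# image by a sliding window) with binary labels (1 = aircraft, 0 = background).
windows = rng.random((20, 64, 64))
labels = rng.integers(0, 2, size=20)

# Describe each window with a hand-crafted HOG descriptor, then classify
# the descriptors with a linear SVM.
features = np.array([hog(w, orientations=9, pixels_per_cell=(8, 8),
                         cells_per_block=(2, 2)) for w in windows])
clf = LinearSVC().fit(features, labels)
```

The descriptor is fixed in advance, which is precisely the limitation noted above: it encodes prior knowledge rather than learned representations.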
The early-stage convolutional neural network (CNN) [7] had many limitations in application due to its weak expressive ability. In recent years, the development of deep learning and high-performance computing devices has made deep CNNs (DCNNs) practical. By constructing a multilayered neural network that simulates how the human cerebral cortex perceives external information, a DCNN can automatically learn feature representations, abstracting and describing objects hierarchically [8]. In 2014, the success of GoogLeNet [9] and VGGNet [10] in the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) brought DCNNs to wide public attention. VGGNet reaches a depth of 19 layers by repeatedly stacking 3 × 3 convolution layers and 2 × 2 pooling layers. GoogLeNet increases not only the depth of the network but also its width by performing multiple convolution and pooling operations in parallel at each layer, which enhances the feature representation. However, the vanishing-gradient problem [11,12] has always made DCNNs difficult to train. In 2015, the deep residual network (ResNet) proposed by He et al. [13] greatly alleviated this problem. ResNet maps the output of shallow layers directly to deeper layers through identity shortcut connections, so the network only needs to optimize a residual mapping, which reduces the difficulty of learning. Inspired by identity mapping, Huang et al. [14] proposed DenseNet in 2017: it enhances feature propagation through dense connections and greatly reduces the number of parameters while maintaining high performance.
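To make the two connection patterns concrete, the following minimal PyTorch sketch contrasts a residual block with a dense layer (simplified for illustration: batch normalization and the bottleneck layers of the original designs are omitted):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Identity shortcut: the block learns a residual F(x); output = F(x) + x."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))

    def forward(self, x):
        return torch.relu(self.body(x) + x)

class DenseLayer(nn.Module):
    """Dense connection: the output concatenates the input with the newly
    computed features, so every later layer sees the maps of all earlier layers."""
    def __init__(self, in_channels, growth_rate):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, growth_rate, 3, padding=1)

    def forward(self, x):
        return torch.cat([x, torch.relu(self.conv(x))], dim=1)
```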
Due to its powerful learning ability, the DCNN has achieved top performance in scene classification [15,16,17] and object detection [18,19,20]. Generally, CNN-based methods can be divided into two categories: two-stage detectors and one-stage detectors. In 2014, the emergence of the region-based CNN (R-CNN) [21] successfully introduced deep learning into object detection. As the ground-breaking two-stage algorithm, R-CNN applies a CNN to extract the features of region proposals obtained by selective search (SS) [22], which greatly improves detection accuracy compared with traditional machine learning-based methods. To reduce the computational cost, Fast R-CNN [23] extracts features over the whole image and maps the region proposals directly onto the last convolution layers, avoiding repeated feature extraction. However, relying on SS to generate proposals remained a major source of inefficiency. In 2016, the Faster R-CNN proposed by Ren et al. [24] replaced SS with the Region Proposal Network (RPN) and unified feature extraction, the RPN, object classification, and bounding-box regression into an end-to-end framework, making the detection pipeline more concise and achieving significant progress in both accuracy and efficiency. Building on Faster R-CNN, further strong algorithms have emerged, such as FPN [25], Mask R-CNN [26], and Cascade R-CNN [27], greatly advancing two-stage detectors. Conversely, a one-stage detector directly regresses the location and corresponding class probability of targets in the image, e.g., the You Only Look Once (YOLO) series [28,29,30]. By abandoning proposal generation, YOLO achieves real-time detection speed at the expense of accuracy. The Single Shot MultiBox Detector (SSD) [31] not only absorbs the anchor mechanism of Faster R-CNN and the regression idea of YOLO but also predicts from feature maps of different resolutions, improving detection accuracy and speed simultaneously. In 2018, RetinaNet [32] introduced focal loss into the one-stage detector, further alleviating the imbalance between positive and negative samples and outperforming the existing detectors of its time.
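For reference, a minimal sketch of the binary focal loss defined in [32] (α = 0.25 and γ = 2 are the defaults reported there; the function itself is our illustration, not RetinaNet's code):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).
    The (1 - p_t)**gamma factor down-weights easy, well-classified examples,
    so the abundant easy negatives no longer dominate the loss."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```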
At present, many research teams have applied CNN-based methods to detect aircraft in high-resolution images. Xie et al. [33] proposed a robust method for tiny and dense aircraft detection by combining Region-based Fully Convolutional Networks (R-FCN) [34] with ResNet-101. By replacing standard convolution with deformable convolution, Ren et al. [35] proposed a Deformable ResNet-based Faster R-CNN that predicts from a single high-level feature map, demonstrating its effectiveness in modeling geometric variations. Guo et al. [36] adopted VGGNet within Faster R-CNN and constructed a multi-scale base network that considers feature maps with various receptive fields. Zhang et al. [37] applied ResNet-101 as the feature extraction network and introduced Online Hard Example Mining (OHEM) [38] to improve the performance of Faster R-CNN. Motivated by SSD and YOLO, Zhuang et al. [39] designed a single-shot detection framework combining multi-scale feature fusion with soft Non-Maximum Suppression (soft-NMS), which obtains a good tradeoff between detection accuracy and computational efficiency. Zheng et al. [40] borrowed the idea of dense connection and built a new structure called Dense-YOLO by replacing the two residual network modules in YOLO V3 [30] with two dense network modules, achieving good performance in over-exposure and cloud-occlusion scenes. Guo et al. [41] also applied DenseNet to SSD and designed a series of candidate boxes with different aspect ratios to detect aircraft targets of different scales. As can be seen, Faster R-CNN is still the mainstream two-stage algorithm applied to aircraft detection, while one-stage detectors are increasingly adopted as the demand for detection speed grows. However, all of the methods above either improve one-stage algorithms such as YOLO and SSD or simply use ResNet or VGGNet as the backbone; none of them explores applying DenseNet within Faster R-CNN.
Aircraft target detection in remote sensing images is sensitive to resolution. The same aircraft appears at different scales in images of different resolutions, and the sizes of different aircraft types vary greatly even within a single resolution, so the scale variance of aircraft must be considered. Additionally, common aircraft types (e.g., the F-16 and F-22) generally occupy less than 50 × 50 pixels. After feature extraction by the DCNN, an aircraft in the top-level feature map is only 1/32 of its original size, approximately 1 × 1 pixel, which causes a serious loss of semantic information and makes detection very difficult. In fact, the features of each layer are mappings of the targets at various scales that carry different semantic meanings; predicting from the top-level features alone does not fully account for the contributions and differences of multi-scale features in target representation.
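The scale arithmetic behind this observation is simple enough to state in code (the helper below is ours, for illustration; a typical backbone has five stride-2 stages, hence the factor of 32):

```python
# Extent of an object on a feature map after n stride-2 downsampling stages.
def feature_map_extent(object_px: float, num_stride2_stages: int) -> float:
    return object_px / (2 ** num_stride2_stages)

print(feature_map_extent(50, 5))  # 1.5625 -> roughly 1 x 1 pixel at the top level
print(feature_map_extent(50, 3))  # 6.25   -> still resolvable on a shallower level
```

This is why lower-level, higher-resolution feature maps are indispensable for small aircraft.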
To mitigate the above problems, a multi-scale DenseNet-based method is proposed in this paper. Our contributions are as follows:
- (1)
We introduced DenseNet as the backbone and constructed an MS-DenseNet by applying FPN [25], which not only enhances the propagation of features but also comprehensively utilizes both bottom-level high-resolution features and top-level semantically strong features (a minimal sketch of this top-down fusion is given after this list). Additionally, we applied a multi-scale region proposal network (MS-RPN), which produces multi-scale proposals, each responsible for targets of the corresponding scale, ensuring effective detection of small aircraft.
- (2)
We developed a new compact structure named MS-DenseNet-65, which effectively improves the performance of small aircraft detection while costing less time in both training and testing. By eliminating unnecessary convolution layers, DenseNet-65 reduces the destruction of bottom-level high-resolution features and preserves the information of small aircraft targets, which is easily submerged by redundant features.
- (3)
We proposed a multi-scale training strategy and designed a suitable testing scale for detection, which allows the network to learn aircraft targets at different scales and resolutions, thus improving the robustness and generalization ability of the proposed model.
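As referenced in contribution (1), the following minimal PyTorch sketch shows FPN-style top-down fusion over a generic backbone (the channel widths and layer names are illustrative assumptions, not our exact MS-DenseNet configuration):

```python
import torch.nn as nn
import torch.nn.functional as F

class FPNFusion(nn.Module):
    """Top-down fusion of backbone stages C2..C5 into pyramid levels P2..P5."""
    def __init__(self, in_channels=(128, 256, 512, 1024), out_channels=256):
        super().__init__()
        # 1 x 1 lateral convs project every backbone stage to a common width.
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        # 3 x 3 convs smooth each merged map.
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, feats):  # feats: [C2, C3, C4, C5], high to low resolution
        laterals = [conv(f) for conv, f in zip(self.lateral, feats)]
        for i in range(len(laterals) - 2, -1, -1):
            # Upsample the coarser, semantically stronger level and add it in.
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode="nearest")
        return [conv(p) for conv, p in zip(self.smooth, laterals)]
```

Each output level then feeds the RPN at its own scale, so small aircraft are matched against high-resolution levels rather than the coarse top-level map alone.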
The rest of this paper is organized as follows. Section 2 presents the background of DenseNet and the details of our proposed method. Section 3 describes the dataset, experimental settings, and detection performance. Section 4 analyzes the results of the proposed method. Finally, Section 5 concludes this paper.
4. Discussion
The quantitative analysis in Table 3 and Table 4 shows that our proposed MS-DenseNet-65 makes great progress in detecting small aircraft targets, with a 3.5% recall improvement over MS-DenseNet-121, while remaining fast in both training and testing. Generally, the more layers a network has, the more expressive it is. However, as the number of layers increases, the bottom-level features are increasingly destroyed.
Figure 9b,d,f show the feature maps of MS-DenseNet-65, while Figure 9c,e,g show those of MS-DenseNet-121. Compared with Figure 9c, the large aircraft target in Figure 9b possesses obvious contour features and a richer target representation. In addition, it is clear that Figure 9d,f can still express small aircraft targets, while some of the aircraft targets in Figure 9e,g have disappeared. These results show that the repeated convolution layers are not all effective for small aircraft detection; a shallower feature hierarchy can actually improve the performance of the network.
In Table 5, we can observe that when the testing scale is set to 1024 × 1024, the multi-scale training method achieves the best detection performance, with a recall of 94% and an F1-score of 92.7%, far ahead of single-scale training. The reason is that multi-scale training improves the detector's ability to represent aircraft targets across resolutions. It is also apparent that as the testing scale increases, the recall first rises and then falls: enlarging a remote sensing image enhances the resolution features of small objects and makes them easier to detect, but enlarging it further also distorts the appearance of large aircraft targets, which then go undetected. The experimental results show that selecting a suitable testing scale is very important. For a single training scale t, we find that the network performs best when the testing scale is set to t + 256.
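Stated as a simple rule of thumb (the helper name and the example scale are ours, purely illustrative of the relationship reported in Table 5):

```python
# Empirical rule from our experiments: for a single training scale t,
# recall peaked when images were tested at scale t + 256.
def suggested_testing_scale(training_scale: int) -> int:
    return training_scale + 256

# e.g., a hypothetical training scale of 768 would suggest testing at 1024.
print(suggested_testing_scale(768))
```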
The comparative experiments in Table 6 and Table 7 reveal that our method has a great advantage in small aircraft detection. From Figure 7, we can also see that our method is capable of detecting aircraft targets of different resolutions and shows a strong feature representation ability in the detection of dense small targets. Moreover, in the test experiments on two new datasets, UCAS-AOD and RSOD, our method still obtains an F1-score of more than 92%, which demonstrates good transferability.