1. Introduction
There are many applications for automatically extracting buildings from remote sensing images (RSI), such as urban planning, population estimation, and disaster emergency response [1]. However, automatically assigning each pixel in RSI to the building or non-building class is a challenging task, because the pixel values of objects show large within-class variance and small between-class variance: buildings differ greatly in size and shape, while at the same time there is a strong similarity between buildings and non-building objects. With the development of artificial neural network technology, neural network structures [2,3,4,5,6,7,8,9] and operations such as convolution, pooling, and batch normalization [4,5,6,10,11,12,13,14,15] have made great progress. These developments have helped CNNs [16] surpass conventional methods in various computer vision tasks, such as object detection and semantic and instance segmentation [17]. Therefore, CNNs are also used for object extraction from RSI. Ball [18] comprehensively discussed the progress and challenges of extracting objects from RSI with deep learning methods. In this paper, we focus on building extraction from RSI, so we only review the application of CNNs to building extraction, which can be summarized in the following three methods.
The first method is based on the image classification task with a CNN. A fixed-size image tile is fed into a CNN, which predicts the classes of one or several pixels at the center of the tile [19,20]. This is called the sliding-window-based method, because a sliding window traverses the whole RSI at a certain step to acquire the fixed-size image tiles, from which the segmentation result of the entire image is assembled. However, this method causes a great deal of repetitive computation and seriously affects the efficiency of image segmentation. To reduce the impact of repeated calculations, a new approach combining proposal regions with a sliding-window convolutional neural network was proposed [21,22], but the quality of the proposal regions influences the results. The second method is object-oriented convolutional neural network semantic segmentation, which combines image segmentation with neural network classification. It consists of two steps. First, conventional image segmentation methods such as multi-scale segmentation are used to segment the image into potential object patches, which are then compressed, stretched, and padded to match the input size of the neural network. Second, these image patches are fed into the neural network for training and classification [23,24]. However, deep learning is not used in the image segmentation step, so the bottleneck of image segmentation is not alleviated, and the accuracy of the segmentation strongly limits the final semantic segmentation result. The third method is semantic segmentation based on the fully convolutional neural network (FCN) [25]. The basic idea of the FCN is to replace the fully connected layers with convolutional layers, so that the final feature map retains position information. Moreover, to restore the spatial resolution of the feature map, the last layer of the convolutional neural network is upsampled to the same size as the input image. FCN is an end-to-end deep learning network for image semantic segmentation. It does not depend on manually designed features, which makes it possible to carry out semantic segmentation by autonomously extracting semantic features from images.
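As a minimal illustration of this idea (a hypothetical PyTorch sketch of the general FCN principle, not the exact architecture of any network cited above), the fully connected classifier head is replaced by a 1x1 convolution, and the coarse score map is upsampled back to the input resolution:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFCN(nn.Module):
    """Minimal FCN-style model: a 1x1 convolution replaces the fully
    connected classifier, then the scores are upsampled to input size."""
    def __init__(self, in_channels=3, num_classes=2):
        super().__init__()
        # A small encoder that downsamples the image by a factor of 4.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        # 1x1 convolution keeps spatial position information.
        self.classifier = nn.Conv2d(64, num_classes, kernel_size=1)

    def forward(self, x):
        score = self.classifier(self.encoder(x))
        # Upsample the coarse score map to the input resolution.
        return F.interpolate(score, size=x.shape[2:], mode="bilinear",
                             align_corners=False)

logits = TinyFCN()(torch.randn(1, 3, 128, 128))  # -> (1, 2, 128, 128)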
At present, most CNNs used to extract buildings from RSI are still based on the idea of the FCN. To improve the accuracy and speed of network training, researchers have proposed many neural network structures [26,27,28,29,30,31,32] for the semantic segmentation of RSI.
To improve the results of building extraction, the features extracted by both shallow and deep layers are merged. Most methods fusing shallow and deep features use residual networks and skip-layer connections. In [26], a new FCN structure built around a spatial residual convolution module named spatial residual inception (SRI) was proposed for extracting buildings from RSI. In [33], residual connections were also used for building extraction. In [34], following the basic architecture of U-net [2], a deep convolutional neural network named DeepResUnet was proposed, which can effectively perform urban building segmentation at the pixel scale from RSI and generate accurate segmentation results. In [27], a new network named ResUnet-a was proposed based on U-net [2]; it combines atrous (dilated) convolution, residual connections, pyramid pooling, and a multi-task learning mechanism, but the fusion of deep and shallow features within the residual blocks is still insufficient.
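For reference, a basic residual block of the kind these networks build on (a generic sketch, not the exact block of ResUnet-a or DeepResUnet) adds the input back to the transformed features, so shallow information passes through unchanged:

import torch.nn as nn

class ResidualBlock(nn.Module):
    """Generic residual block: output = F(x) + x. The identity path
    carries shallow features forward unchanged."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # The two paths are merged only by a single addition, which is
        # the limited intra-block fusion criticized in the text above.
        return self.relu(self.body(x) + x)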
Another way to improve the performance of building extraction is to make full use of the multi-scale features of the pixels. Based on this idea, multi-scale feature extractors have been added to deep neural networks, such as the global multi-scale encoder-decoder network (GMEDN) [28], the U-shaped hollow pyramid pooling (USPP) network [29], ARC-Net [33], and ResUnet-a [30]. These structures extract and fuse the multi-scale feature information of pixels in the decoding module, as sketched below. However, to control the number of parameters in the neural network, these networks add the multi-scale feature extractor only in the decoding module. The lack of deep and shallow feature fusion in the encoding stage adversely affects building extraction.
To improve the result of building extraction, some scholars have further refined the output of the CNN. Based on U-net and the residual neural network, BRRNet [31] was proposed, which is composed of a prediction module and a result adjustment network. The adjustment network takes the probability map output by the prediction module as input and produces the final semantic segmentation result. However, BRRNet does not adopt depthwise separable convolution, batch normalization, or other such strategies, so it still has numerous parameters. Another strategy [32] combining a neural network with polygon regularization was applied to building extraction. It consists of two steps: first, a neural network preliminarily extracts buildings from RSI; then regularized polygons are used to correct the buildings extracted by the network. The first step strongly influences the final result, so it is necessary to improve the performance of the neural network.
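Schematically, such a predict-then-refine design can be written as a prediction network followed by a small adjustment network that residually corrects the coarse output; the sketch below is illustrative only and does not reproduce BRRNet's actual modules:

import torch.nn as nn

class Refiner(nn.Module):
    """Illustrative result-adjustment stage: computes a correction
    from the coarse prediction and adds it back residually."""
    def __init__(self):
        super().__init__()
        self.adjust = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 1, 3, padding=1),
        )

    def forward(self, coarse_logits):
        # Refined logits; a sigmoid would be applied afterwards to
        # obtain the final building probability map.
        return coarse_logits + self.adjust(coarse_logits)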
Some scholars have applied multi-task learning [27,35] and attention-based neural network structures [36,37] to building extraction from RSI. However, introducing more effective feature fusion and multi-scale information extraction strategies into multi-task learning and attention-based networks could further improve the results.
At present, to reduce the number of training parameters and improve the training speed of neural networks, depthwise separable convolution and atrous convolution are used to replace the conventional convolution operation on the one hand, and batch normalization is introduced to accelerate the convergence of the network on the other. To reduce the training parameters in our network, we reduce the number of convolution kernels in the densely connected networks.
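The parameter savings of depthwise separable convolution are easy to verify; the snippet below is an independent numeric check (not code from the paper) comparing a standard 3x3 convolution with its depthwise separable counterpart:

import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

in_ch, out_ch = 128, 128
standard = nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False)
separable = nn.Sequential(
    # Depthwise: one 3x3 filter per input channel.
    nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False),
    # Pointwise: 1x1 conv mixes the channels.
    nn.Conv2d(in_ch, out_ch, 1, bias=False),
)
print(n_params(standard))   # 128*128*3*3 = 147456
print(n_params(separable))  # 128*3*3 + 128*128 = 17536 (~8x fewer)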
Although many neural networks, as mentioned above, have been used for the semantic segmentation of RSI, it is still difficult to extract buildings with irregular shapes or small sizes. The reasons can be distilled as follows: first, current neural networks mostly use skip-layer connections [25] to fuse deep and shallow features, which cannot fuse features between skip-layers sufficiently; some networks also use residual connections to merge deep and shallow features, but feature fusion within the residual block is still lacking. Second, to control the number of parameters, most networks extract the multi-scale features of pixels only in the decoding stage and lack multi-scale feature extraction in the encoding stage. To fill the above-mentioned gaps, this paper proposes a dense residual neural network (DR-Net), which employs a deeplabv3+Net encoder/decoder backbone and integrates a densely connected convolutional neural network (DCNN) with ResNet. To reduce the complexity of the network, we decreased the number of parameters by reducing the number of convolution kernels.
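As a rough illustration of how dense and residual connectivity can be combined (a hypothetical sketch under our own assumptions, not the exact DR-Net block), each layer in a dense block receives the concatenation of all preceding feature maps, and a residual shortcut is added around the whole block:

import torch
import torch.nn as nn

class DenseResidualBlock(nn.Module):
    """Sketch of a dense block wrapped in a residual shortcut.
    Each layer sees the concatenation of all earlier feature maps."""
    def __init__(self, channels, growth=16, n_layers=3):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = channels
        for _ in range(n_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
                nn.Conv2d(ch, growth, 3, padding=1),
            ))
            ch += growth
        # 1x1 conv projects back so the residual addition is valid.
        self.project = nn.Conv2d(ch, channels, 1)

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return self.project(torch.cat(feats, dim=1)) + x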
The highlights of this paper can be summarized in three aspects. First, a dense residual neural network (DR-Net) is proposed, which uses a deeplabv3+Net encoder/decoder backbone combined with a densely connected convolutional neural network (DCNN) and residual network (ResNet) structure. Second, the number of parameters in this network is greatly reduced, yet DR-Net still shows outstanding performance in the building extraction task. Third, DR-Net converges faster and takes less time to train.
The following section presents the materials and the proposed DR-Net. Section 3 explains the experiments and results in detail. In Section 4, we discuss the reasons why DR-Net performs well and give some directions for further improving its performance. Finally, Section 5 presents the conclusions of this paper.
4. Discussion
This paper proposes a new convolutional neural network structure named DR-Net for extracting buildings from high-resolution RSI. DR-Net uses deeplabv3+Net as a backbone and combines DCNN modules with ResNet, so that the network can not only better extract context information from RSI but also greatly reduce the number of parameters.
DR-Net achieves better performance in extracting buildings. We consider that each layer of DR-Net retains more of the original spectral information in the image, and this spectral information better preserves the boundaries between buildings and background. The input of each layer within DR-Net contains the outputs of all layers before it. This structure ensures that each layer has access both to shallow features and to the abstract features obtained in deeper layers; in effect, it is similar to concatenating the original RSI and the abstract features along the channel dimension. As the depth increases, the proportion of original information input into each layer decreases, but it does not disappear. We believe that this design better integrates the contextual information contained in shallow and deep feature maps, which is why DR-Net achieves better results in the extraction of buildings.
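To make this concrete, consider a dense block with c0 input channels and growth rate g (hypothetical values below): after k layers the concatenated input has c0 + k*g channels, so the share of original information shrinks with depth but never reaches zero:

# Fraction of channels still coming from the block input after
# k densely connected layers (c0 input channels, growth rate g).
c0, g = 64, 16
for k in range(5):
    total = c0 + k * g
    print(k, total, round(c0 / total, 2))
# k=0: fraction 1.0; k=4: fraction 0.5 -- decreasing but nonzero.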
Compared with the deeplabv3+ neural network, DR-Net reduces the number of parameters by reducing the number of convolution kernels, which makes the network more lightweight and easier to train. More importantly, DR-Net does not sacrifice performance. Although the three networks perform similarly when the batch size is set to 1, considering the number of parameters and the complexity of the networks, DR-Net represents clear progress. Moreover, under the same hardware configuration (as described in Section 3.1), the batch size can be set to 2 for DR-Net; in comparison, deeplabv3+Net and BRRNet cannot be trained with a batch size of 2 because of the limitation of the GPU.
When the computing performance and GPU memory are limited, reducing the number of convolution kernels in the neural network and increasing the batch size may improve the performance of the network. We have not yet studied the balance between the number of convolution kernels and the batch size in a neural network; this work will be carried out in the future.
It is important to note that this paper focuses on improving the performance of DR-Net. We consider that the performance of different neural networks can only be compared fairly when the data set and the memory and performance of the computer remain the same. Thus, in this article, we did not use data augmentation strategies. Some results of other articles [30,38] may be better than ours, but their GPU memory is 11 GB and 12 GB, respectively, about twice ours, and data augmentation strategies were adopted in those papers [30,38]. Thus, our results and those of [30,38] do not rest on the same foundation, so we cannot simply judge which is better.
We investigated the misclassified areas, where buildings were predicted as background or background was predicted as buildings, and found some interesting phenomena. First, some background areas similar to buildings were predicted as buildings; for example, some containers were regarded as buildings. In fact, it is difficult even for the naked eye to recognize containers in a 500 × 500 pixel image tile. Second, some buildings under construction were predicted as background, because these buildings have different texture and spectral responses from completed buildings, and at the same time there were only a few buildings under construction in the training data set.
We found that networks trained on the Massachusetts data set and tested on the WHU building data set performed better than networks trained on the WHU building data set and tested on Massachusetts. We think this is because the WHU building data set has a higher spatial resolution than Massachusetts. Another interesting phenomenon is that BRRNet and Deeplabv3+Net had better generalization abilities. We think this is because, during training, DR-Net fuses shallow and deep features more thoroughly, and these fused features do not transfer directly to other data sets.
We suggest some possible directions to further improve the performance of DR-Net. The first is to introduce an advanced feature extractor, such as the Feature Pyramid Network (FPN) [46]. The second is to combine the multi-task learning mechanism with the attention mechanism [47].