Deep Learning in Remote Sensing
MOTIVATION
Deep learning is the fastest-growing trend in big data analysis and was deemed one of the ten breakthrough technologies of 2013 [1]. It is characterized by neural networks (NNs) that usually involve more than two hidden layers (for this reason, they are called deep). Like shallow NNs, deep NNs exploit feature representations learned exclusively from data, instead of handcrafted features that are designed based mainly on domain-specific knowledge. Deep learning research has been extensively pushed by Internet companies, such as Google, Baidu, Microsoft, and Facebook, for several image analysis tasks, including image indexing, segmentation, and object detection.
Based on recent advances, deep learning is proving to be a very successful set of tools, sometimes able to surpass even human performance in tasks such as large-scale image recognition.
L = Σ_{i=1}^{N} J(x_i, y_i; Θ, β) + μ Σ_{j=1}^{M} KL(ρ ‖ ρ̂_j),   (3)

where J(x_i, y_i; Θ, β) is an average sum-of-squares error term, which represents the reconstruction error between the input x_i and its reconstruction y_i. KL(ρ ‖ ρ̂_j) is the Kullback–Leibler (KL) divergence between a Bernoulli random variable with mean ρ and a Bernoulli random variable with mean ρ̂_j:

KL(ρ ‖ ρ̂_j) = ρ log(ρ/ρ̂_j) + (1 − ρ) log[(1 − ρ)/(1 − ρ̂_j)].   (4)

KL divergence is a standard function for measuring the similarity between two distributions. In the sparse AE model, the KL divergence is a sparsity penalty term, and μ controls its importance. ρ is a free parameter corresponding to a desired average activation value, and ρ̂_j indicates the actual average activation value of hidden neuron h_j over the training samples. An activation corresponds to how often a region of the image reacts when convolved with a filter. In the first layer, e.g., each location in the image receives a value that corresponds to a linear combination of the original input and the filter applied. The higher such value, the more activated this filter is on that region. When convolved over the whole image, a filter produces an activation map, which is the activation at each location where the filter has been applied. Similar to the AE, the optimization of a sparse AE can be achieved via SGD.
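To make (3) and (4) concrete, here is a minimal NumPy sketch of the sparsity penalty; the sigmoid hidden activations, the array shapes, and the value ρ = 0.05 are illustrative assumptions, not code from any referenced work.

    import numpy as np

    def kl_sparsity_penalty(H, rho=0.05):
        # H: hidden activations (n_samples x n_hidden), sigmoid outputs in (0, 1)
        # rho: desired average activation, the free parameter of (3)-(4)
        rho_hat = H.mean(axis=0)                   # actual mean activation per hidden unit
        rho_hat = np.clip(rho_hat, 1e-8, 1 - 1e-8) # numerical guard against log(0)
        kl = (rho * np.log(rho / rho_hat)
              + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
        return kl.sum()

    # toy batch: 100 samples, 64 inputs, 32 hidden units (illustrative)
    X = np.random.rand(100, 64)
    H = 1 / (1 + np.exp(-X @ np.random.randn(64, 32)))  # sigmoid hidden activations
    penalty = kl_sparsity_penalty(H, rho=0.05)

In a full sparse AE, this term would be added to the reconstruction error and weighted by μ, exactly as in (3).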
RESTRICTED BOLTZMANN MACHINE AND DEEP BELIEF NETWORK
Unlike the deterministic network architectures, such as AEs or sparse AEs, a restricted Boltzmann machine (RBM) (see Figure 1) is a stochastic undirected graphical model consisting of a visible layer and a hidden layer. No connections exist within the hidden layer or the input layer. The energy function of an RBM can be defined as follows:

E(x, h) = (1/2) x^T x − (h^T W x + c^T x + b^T h),   (5)

where W, c, and b are learnable weights. Here, the input x is also known as the visible random variable, which is denoted as v in the original RBM paper [16]. The joint probability distribution of the RBM is defined as

p(x, h) = (1/Z) exp(−E(x, h)),   (6)

where Z is a normalization constant. The form of the RBM makes the conditional probability distribution computationally feasible when x or h is fixed.

The feature representation ability of a single RBM is limited. However, its real power emerges when two or more RBMs are stacked, forming a deep belief network (DBN) [16].

FIGURE 1. A schematic comparison of (a) an AE and (b) an RBM.
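As an illustration of why these conditionals are computationally feasible, the following NumPy sketch performs one block-Gibbs step for the Gaussian-Bernoulli energy in (5): p(h_j = 1 | x) = sigmoid(Wx + b)_j and p(x | h) = N(W^T h + c, I) follow directly from (5) and (6). The toy dimensions and the untrained random weights are assumptions for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(a):
        return 1 / (1 + np.exp(-a))

    def gibbs_step(x, W, b, c):
        # sample hidden units given visible x, then resample x given h
        p_h = sigmoid(W @ x + b)                  # p(h_j = 1 | x) for each hidden unit
        h = (rng.random(p_h.shape) < p_h) * 1.0   # binary hidden sample
        x_new = W.T @ h + c + rng.standard_normal(x.shape)  # x | h ~ N(W^T h + c, I)
        return x_new, h

    # toy RBM: 64 visible units, 32 hidden units
    W = rng.standard_normal((32, 64)) * 0.01
    b, c = np.zeros(32), np.zeros(64)
    x = rng.standard_normal(64)
    x, h = gibbs_step(x, W, b, c)

Alternating such steps is the basic ingredient of contrastive-divergence training of RBMs and, by stacking, of DBNs.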
CONVOLUTIONAL NEURAL NETWORKS
Supervised deep NNs have come under the spotlight in recent years. The leading model is the CNN, which studies the filters performing convolutions in the image domain. Here, we briefly review some successful CNN architectures recently offered for computer vision. For a comprehensive introduction to CNNs, we refer readers to the excellent book by Goodfellow and colleagues [17].

ALEXNET
In 2012, Krizhevsky et al. [2] created AlexNet, a "large, deep convolutional neural network" that won the 2012 ImageNet Large-Scale Visual Recognition Challenge (ILSVRC). The year 2012 is marked as the first year that a CNN was used to achieve a top-five test error rate of 15.4%.

AlexNet (see Figure 2) scaled the insights of LeNet [18] into a deeper and much larger network that could be used to learn the appearance of more numerous and complicated objects. The contributions of AlexNet include the following:
◗ using rectified linear units (ReLUs) as nonlinearity functions capable of decreasing training time because a ReLU is several times faster than the conventional hyperbolic tangent function
◗ implementing dropout layers to avoid the problem of overfitting
◗ using data augmentation techniques to artificially increase the size of the training set (and observe a more diverse set of situations); for this, the training patches are translated and reflected on the horizontal and vertical axes.

FIGURE 2. The AlexNet architecture [2].
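As a rough PyTorch sketch (not the actual AlexNet implementation), the snippet below combines the three ingredients just listed: ReLU nonlinearities, dropout, and flip-based augmentation. The layer sizes are simplified assumptions; only the first convolutional stage mimics AlexNet's 11 × 11 filters with stride 4 on 224 × 224 inputs.

    import torch
    import torch.nn as nn

    block = nn.Sequential(
        nn.Conv2d(3, 96, kernel_size=11, stride=4),  # large first-layer filters, as in AlexNet
        nn.ReLU(inplace=True),                       # ReLU instead of the slower tanh
        nn.MaxPool2d(kernel_size=3, stride=2),
        nn.Flatten(),
        nn.Dropout(p=0.5),                           # dropout against overfitting
        nn.Linear(96 * 26 * 26, 1000),               # 26 x 26 maps after conv + pool; 1,000 classes
    )

    x = torch.randn(1, 3, 224, 224)      # one 224 x 224 RGB patch
    x_aug = torch.flip(x, dims=[3])      # horizontal reflection (data augmentation)
    logits = block(x_aug)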
One of the keys of the success of AlexNet is that the model was trained on graphics processing units (GPUs). The fact that GPUs can offer a much larger number of cores than central processing units allows much faster training, which, in turn, allows the use of larger data sets and bigger images.

VGG NETWORKS
The design philosophy of VGG networks (named for Oxford University's Visual Geometry Group) [3] is simplicity and depth. In 2014, Simonyan and Zisserman created VGG networks that make use strictly of 3 × 3 filters with stride and padding of 1, along with 2 × 2 max-pooling layers with stride of 2. The main points of VGG networks are that they
◗ use filters with a small receptive field of 3 × 3, rather than larger ones (5 × 5 or 7 × 7, as in AlexNet)
◗ have the same feature map size and number of filters in each convolutional layer of the same block
◗ increase the size of the features in the deeper layers, roughly doubling after each max-pooling layer
◗ use scale jittering as one data augmentation technique during training.
VGG networks are one of the most influential CNN models, as they reinforce the notion that CNNs with deeper architectures can promote hierarchical feature representations of visual data, which, in turn, improves classification accuracy. A drawback is that training such a model from scratch requires large computational power and a very large labeled training set.
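A hedged PyTorch sketch of the design rules above follows: stacked 3 × 3 convolutions with stride and padding of 1 (so the feature map size is unchanged within a block), a 2 × 2 max pool between blocks, and channel counts that roughly double per stage. The exact configuration is illustrative, not the published VGG-16.

    import torch
    import torch.nn as nn

    def vgg_block(in_ch, out_ch, n_convs):
        # n_convs 3x3 convolutions (stride/padding 1), then 2x2 max pooling with stride 2
        layers = []
        for i in range(n_convs):
            layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, stride=1, padding=1),
                       nn.ReLU(inplace=True)]
        layers.append(nn.MaxPool2d(2, stride=2))
        return nn.Sequential(*layers)

    # channels roughly double after each pooling stage
    features = nn.Sequential(vgg_block(3, 64, 2), vgg_block(64, 128, 2), vgg_block(128, 256, 3))
    out = features(torch.randn(1, 3, 224, 224))  # -> (1, 256, 28, 28)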
RESNET
He et al. [4] pushed the idea of very deep networks even further by proposing the 152-layer ResNet, which won ILSVRC 2015 with a top-five error rate of 3.6% and set new records in classification, detection, and localization through a single network architecture. In [4], the authors provide an in-depth analysis of the degradation problem, i.e., that simply increasing the number of layers in plain networks results in higher training and test errors, and they claim that it is easier to optimize the residual mapping in the ResNet than to optimize the original, unreferenced mapping in conventional CNNs. The core idea of ResNet is to add shortcut connections that bypass two or more stacked convolutional layers by performing identity mapping. The connections are then added together with the output of the stacked convolutions.
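The shortcut idea fits in a few lines of PyTorch; this is a generic residual block under the description above (two stacked 3 × 3 convolutions bypassed by an identity shortcut), not the exact block of [4].

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        # the block learns the residual F(x) and outputs F(x) + x
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
            self.bn1 = nn.BatchNorm2d(channels)
            self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
            self.bn2 = nn.BatchNorm2d(channels)
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            residual = self.relu(self.bn1(self.conv1(x)))
            residual = self.bn2(self.conv2(residual))
            return self.relu(residual + x)  # shortcut: add input to stacked-conv output

    y = ResidualBlock(64)(torch.randn(1, 64, 56, 56))

Because the shortcut is an identity mapping, the block can fall back to copying its input, which is what makes very deep stacks trainable.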
FULLY CONVOLUTIONAL NETWORK
The fully convolutional network (FCN) [7] is the most important work in deep learning for semantic segmentation, i.e., the task of assigning a semantic label to every pixel in the image. To perform this task, the output of the CNN must be of the same pixel size as the input (contrary to the single class per image of the aforementioned models). FCN introduces many significant ideas, such as
◗ end-to-end learning of the upsampling algorithm via an encoder/decoder structure that first downsamples the activation's size and then upsamples it again
◗ using a fully convolutional architecture, which allows the network to take images of arbitrary size as input because there is no fully connected layer at the end that requires a specific size of the activations
◗ introducing skip connections as a way of fusing information from different depths in the network for the multiscale inference.
Figure 3 shows the FCN architecture.

FIGURE 3. The FCN architecture [7]. g.t.: ground truth.
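A minimal PyTorch sketch of these three ideas (encoder/decoder upsampling, no fully connected layers, a skip connection) is given below; the tiny architecture is an assumption for illustration, not the FCN of [7].

    import torch
    import torch.nn as nn

    class TinyFCN(nn.Module):
        # downsample, upsample back to input size, and fuse an early feature map
        def __init__(self, n_classes):
            super().__init__()
            self.enc1 = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(inplace=True))
            self.pool = nn.MaxPool2d(2)
            self.enc2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(inplace=True))
            self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)  # learned upsampling
            self.classifier = nn.Conv2d(16, n_classes, 1)      # 1x1 conv: per-pixel scores

        def forward(self, x):
            f1 = self.enc1(x)              # full-resolution features
            f2 = self.enc2(self.pool(f1))  # half-resolution features
            out = self.up(f2) + f1         # skip connection fuses two depths
            return self.classifier(out)    # same spatial size as the input

    logits = TinyFCN(n_classes=6)(torch.randn(1, 3, 64, 64))  # -> (1, 6, 64, 64)

Because every layer is convolutional, the same network accepts inputs of any size, which is the second bullet above.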
REMOTE SENSING MEETS DEEP LEARNING
Deep learning is taking off in remote sensing, as shown in Figure 4, which illustrates the number of papers published on the topic since 2014. The exponential increase confirms the rapid surge of interest in deep learning for remote sensing. In this section, we focus on a variety of remote-sensing applications that are achieved by deep learning and provide an in-depth investigation from the perspectives of hyperspectral image analysis, interpretation of SAR images, interpretation of high-resolution satellite images, multimodal data fusion, and 3-D reconstruction.
HYPERSPECTRAL IMAGE ANALYSIS
… data is of great importance in many practical applications, such as land cover/use classification or change and object detection. Because high-quality hyperspectral satellite data are becoming available (e.g., via the launch of EnMAP, planned for 2020, and the DESIS on the International Space Station, planned for 2018), hyperspectral image analysis has been one of the most active research areas within the remote-sensing community over the last decade.
Inspired by the success of deep learning in computer vision, preliminary studies have been carried out on deep learning in hyperspectral data analysis, which brings new momentum to this field. In the following, we review two application cases, land cover/use classification and anomaly detection.

HYPERSPECTRAL IMAGE CLASSIFICATION
Supervised classification is probably the most active research area in hyperspectral data analysis. There is a vast literature on this topic using conventional supervised machine-learning models, such as decision trees, random forests, and support vector machines (SVMs) [20]. With the investigation of hyperspectral image classification [21], a major finding was that various atmospheric scattering conditions, complicated light-scattering mechanisms, interclass similarity, and intraclass variability result in the hyperspectral imaging procedure being inherently nonlinear. It is believed that, compared to the previously mentioned shallow models, deep learning architectures are able to extract high-level, hierarchical, and abstract features, which are generally more robust to the nonlinear processing.

The following sections discuss research on hyperspectral image classification.

… layer, and an output layer—and directly classify hyperspectral images in the spectral domain. Makantasis et al. [26] exploited a two-dimensional (2-D) CNN to encode spectral and spatial information, followed by a multilayer perceptron performing the actual classification. In [27], the authors attempted to carry out the classification of crop types using 1-D CNN and 2-D CNN. They concluded that the 2-D CNNs can outperform the 1-D CNNs, but some small objects in the final classification map provided by 2-D CNNs are smoothed and misclassified. To avoid overfitting, Zhao and Du [28] suggest a spectral-spatial-feature-based classification framework, which jointly makes use of a local-discriminant embedding-based dimension-reduction algorithm and a 2-D CNN. In [21], the authors propose a self-improving CNN model that combines a 2-D CNN with a fractional-order Darwinian particle swarm optimization algorithm to iteratively select the most informative bands suitable for training the designed CNN. Santara et al. [29] discuss …
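To make the 1-D (spectral) variant discussed above concrete, here is a hedged PyTorch sketch of a per-pixel spectral CNN; the band count, layer sizes, and class count are illustrative assumptions, not the models of [26] or [27].

    import torch
    import torch.nn as nn

    # each sample is a single pixel's spectrum, e.g., 200 bands
    spectral_cnn = nn.Sequential(
        nn.Conv1d(1, 16, kernel_size=7, padding=3),  # convolve along the spectral axis
        nn.ReLU(inplace=True),
        nn.MaxPool1d(2),
        nn.Conv1d(16, 32, kernel_size=7, padding=3),
        nn.ReLU(inplace=True),
        nn.AdaptiveAvgPool1d(1),                     # collapse the spectral dimension
        nn.Flatten(),
        nn.Linear(32, 16),                           # 16 land-cover classes (illustrative)
    )

    pixels = torch.randn(8, 1, 200)  # a batch of 8 spectra
    scores = spectral_cnn(pixels)    # -> (8, 16)

A 2-D variant would instead convolve over a spatial neighborhood of the pixel, trading spectral detail for spatial context, which matches the trade-off reported in [27].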
FIGURE 5. A flowchart of the 3-D CNN architecture proposed in [19] for spectral-spatial hyperspectral image classification. Conv.: convolution.
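To illustrate the core operation behind such spectral-spatial models, the following minimal PyTorch sketch applies a 3-D convolution jointly over a spatial neighborhood and the spectral dimension; the shapes are assumptions for illustration, not the architecture of [19].

    import torch
    import torch.nn as nn

    # input: batch of hyperspectral patches, (batch, 1, bands, height, width)
    patch = torch.randn(4, 1, 100, 7, 7)  # 7 x 7 neighborhood of the pixel, 100 bands

    # one 3-D convolution spans the spectral axis and both spatial axes at once
    conv3d = nn.Conv3d(in_channels=1, out_channels=8,
                       kernel_size=(7, 3, 3), padding=(3, 1, 1))
    features = conv3d(patch)              # -> (4, 8, 100, 7, 7)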
FIGURE 7. The RNN proposed for the hyperspectral image classification task in [36]. GRU: gated recurrent unit; PRetanh: parametric rectified tanh.
FIGURE 9. The Flevoland data set. (a) The Pauli RGB of the PolSAR data set. (b) The classification result from [53].
FIGURE 11. The deconvolution network proposed in [92]. The yellow and green parts correspond to a fully convolutional network with a 9 × 9-pixel bottleneck; then, a deconvolutional block (purple) leads to predictions of the same size as the input image (in [92], 65 × 65 pixels).
FIGURE 12. The image classification results on the Potsdam data sets, considering 65 × 65-pixel patches (from [92]). CNN-PC: patch-based CNN, predicting single labels per patch and using a sliding-window approach; CNN-SPL: fully convolutional CNN, predicting a 9 × 9 output, then upsampled to the original size via interpolation; CNN-FPL: deconvolutional network predicting the 65 × 65 output at full resolution; nDSM: normalized digital surface model; GT: ground truth.
FIGURE 13. A deep learning-based system that helps in analyzing how land cover changes using large-scale and long-term multitemporal image sequences. This example shows how Munich airport was built out over the past 30 years. DOY: day of year.
FIGURE 15. The most popular open-source deep learning frameworks. The ranking is based on the number of stars awarded by developers in GitHub. (Image courtesy of [190].)