Deep Learning in Remote Sensing
MOTIVATION
Deep learning is the fastest-growing trend in big data analysis and was deemed one of the ten breakthrough technologies of 2013 [1]. It is characterized by neural networks (NNs) that usually involve more than two hidden layers (for this reason, they are called deep). Like shallow NNs, deep NNs exploit feature representations learned exclusively from data, instead of handcrafted features that are designed based mainly on domain-specific knowledge. Deep learning research has been extensively pushed by Internet companies, such as Google, Baidu, Microsoft, and Facebook, for several image analysis tasks, including image indexing, segmentation, and object detection.
Based on recent advances, deep learning is proving to be a very successful set of tools, sometimes able to surpass even human performance in tasks such as large-scale image recognition.
L = Σ_{i=1}^{N} J(x_i, y_i; Θ, β) + μ Σ_{j=1}^{M} KL(ρ ‖ ρ̂_j),   (3)

where J(x_i, y_i; Θ, β) is an average sum-of-squares error term, which represents the reconstruction error between the input x_i and its reconstruction y_i. KL(ρ ‖ ρ̂_j) is the Kullback–Leibler (KL) divergence between a Bernoulli random variable with mean ρ and a Bernoulli random variable with mean ρ̂_j:

KL(ρ ‖ ρ̂_j) = ρ log(ρ/ρ̂_j) + (1 − ρ) log[(1 − ρ)/(1 − ρ̂_j)].   (4)

KL divergence is a standard function for measuring the similarity between two distributions. In the sparse AE model, the KL divergence is a sparsity penalty term, and μ controls its importance. ρ is a free parameter corresponding to a desired average activation value, and ρ̂_j indicates the actual average activation value of hidden neuron h_j over the training samples. An activation corresponds to how often a region of the image reacts when convolved with a filter. In the first layer, e.g., each location in the image receives a value that corresponds to a linear combination of the original input and the filter applied. The higher such value, the more activated this filter is on that region. When convolved over the whole image, a filter produces an activation map, which is the activation at each location where the filter has been applied. Similar to the AE, the optimization of a sparse AE can be achieved via SGD.
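To make (3) and (4) concrete, here is a minimal NumPy sketch of the sparsity penalty; the sigmoid hidden activations, the array shapes, and the value ρ = 0.05 are illustrative assumptions, not code from any referenced work.

    import numpy as np

    def kl_sparsity_penalty(H, rho=0.05):
        # H: hidden activations (n_samples x n_hidden), sigmoid outputs in (0, 1)
        # rho: desired average activation, the free parameter of (3)-(4)
        rho_hat = H.mean(axis=0)                   # actual mean activation per hidden unit
        rho_hat = np.clip(rho_hat, 1e-8, 1 - 1e-8) # numerical guard against log(0)
        kl = (rho * np.log(rho / rho_hat)
              + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
        return kl.sum()

    # toy batch: 100 samples, 64 inputs, 32 hidden units (illustrative)
    X = np.random.rand(100, 64)
    H = 1 / (1 + np.exp(-X @ np.random.randn(64, 32)))  # sigmoid hidden activations
    penalty = kl_sparsity_penalty(H, rho=0.05)

In a full sparse AE, this term would be added to the reconstruction error and weighted by μ, exactly as in (3).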
RESTRICTED BOLTZMANN MACHINE AND DEEP BELIEF NETWORK
Unlike the deterministic network architectures, such as AEs or sparse AEs, a restricted Boltzmann machine (RBM) (see Figure 1) is a stochastic undirected graphical model consisting of a visible layer and a hidden layer. No connections exist within the hidden layer or the input layer. The energy function of an RBM can be defined as follows:

E(x, h) = (1/2) x^T x − (h^T W x + c^T x + b^T h),   (5)

where W, c, and b are learnable weights. Here, the input x is also known as the visible random variable, which is denoted as v in the original RBM paper [16]. The joint probability distribution of the RBM is defined as

p(x, h) = (1/Z) exp(−E(x, h)),   (6)

where Z is a normalization constant. The form of the RBM makes the conditional probability distribution computationally feasible when x or h is fixed.

The feature representation ability of a single RBM is limited. However, its real power emerges when two or more RBMs are stacked, forming a deep belief network (DBN) [16].

FIGURE 1. A schematic comparison of (a) an AE and (b) an RBM.
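As an illustration of why these conditionals are computationally feasible, the following NumPy sketch performs one block-Gibbs step for the Gaussian-Bernoulli energy in (5): p(h_j = 1 | x) = sigmoid(Wx + b)_j and p(x | h) = N(W^T h + c, I) follow directly from (5) and (6). The toy dimensions and the untrained random weights are assumptions for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(a):
        return 1 / (1 + np.exp(-a))

    def gibbs_step(x, W, b, c):
        # sample hidden units given visible x, then resample x given h
        p_h = sigmoid(W @ x + b)                  # p(h_j = 1 | x) for each hidden unit
        h = (rng.random(p_h.shape) < p_h) * 1.0   # binary hidden sample
        x_new = W.T @ h + c + rng.standard_normal(x.shape)  # x | h ~ N(W^T h + c, I)
        return x_new, h

    # toy RBM: 64 visible units, 32 hidden units
    W = rng.standard_normal((32, 64)) * 0.01
    b, c = np.zeros(32), np.zeros(64)
    x = rng.standard_normal(64)
    x, h = gibbs_step(x, W, b, c)

Alternating such steps is the basic ingredient of contrastive-divergence training of RBMs and, by stacking, of DBNs.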
CONVOLUTIONAL NEURAL NETWORKS
Supervised deep NNs have come under the spotlight in recent years. The leading model is the CNN, which studies the filters performing convolutions in the image domain. Here, we briefly review some successful CNN architectures recently offered for computer vision. For a comprehensive introduction to CNNs, we refer readers to the excellent book by Goodfellow and colleagues [17].

ALEXNET
In 2012, Krizhevsky et al. [2] created AlexNet, a "large, deep convolutional neural network" that won the 2012 ImageNet Large-Scale Visual Recognition Challenge (ILSVRC). The year 2012 is marked as the first year that a CNN was used to achieve a top-five test error rate of 15.4%.

AlexNet (see Figure 2) scaled the insights of LeNet [18] into a deeper and much larger network that could be used to learn the appearance of more numerous and complicated objects. The contributions of AlexNet include the following:
◗ using rectified linear units (ReLUs) as nonlinearity functions capable of decreasing training time because a ReLU is several times faster than the conventional hyperbolic tangent function
◗ implementing dropout layers to avoid the problem of overfitting
◗ using data augmentation techniques to artificially increase the size of the training set (and observe a more diverse set of situations); for this, the training patches are translated and reflected on the horizontal and vertical axes.

FIGURE 2. The AlexNet architecture [2].
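As a rough PyTorch sketch (not the actual AlexNet implementation), the snippet below combines the three ingredients just listed: ReLU nonlinearities, dropout, and flip-based augmentation. The layer sizes are simplified assumptions; only the first convolutional stage mimics AlexNet's 11 × 11 filters with stride 4 on 224 × 224 inputs.

    import torch
    import torch.nn as nn

    block = nn.Sequential(
        nn.Conv2d(3, 96, kernel_size=11, stride=4),  # large first-layer filters, as in AlexNet
        nn.ReLU(inplace=True),                       # ReLU instead of the slower tanh
        nn.MaxPool2d(kernel_size=3, stride=2),
        nn.Flatten(),
        nn.Dropout(p=0.5),                           # dropout against overfitting
        nn.Linear(96 * 26 * 26, 1000),               # 26 x 26 maps after conv + pool; 1,000 classes
    )

    x = torch.randn(1, 3, 224, 224)      # one 224 x 224 RGB patch
    x_aug = torch.flip(x, dims=[3])      # horizontal reflection (data augmentation)
    logits = block(x_aug)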
One of the keys of the success of AlexNet is that the model was trained on graphics processing units (GPUs). The fact that GPUs can offer a much larger number of cores than central processing units allows much faster training, which, in turn, allows the use of larger data sets and bigger images.

VGG NETWORKS
The design philosophy of VGG networks (named for Oxford University's Visual Geometry Group) [3] is simplicity and depth. In 2014, Simonyan and Zisserman created VGG networks that make use strictly of 3 × 3 filters with stride and padding of 1, along with 2 × 2 max-pooling layers with stride of 2. The main points of VGG networks are that they
◗ use filters with a small receptive field of 3 × 3, rather than larger ones (5 × 5 or 7 × 7, as in AlexNet)
◗ have the same feature map size and number of filters in each convolutional layer of the same block
◗ increase the size of the features in the deeper layers, roughly doubling after each max-pooling layer
◗ use scale jittering as one data augmentation technique during training.
VGG networks are one of the most influential CNN models, as they reinforce the notion that CNNs with deeper architectures can promote hierarchical feature representations of visual data, which, in turn, improves classification accuracy. A drawback is that training such a model from scratch requires large computational power and a very large labeled training set.
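A hedged PyTorch sketch of the design rules above follows: stacked 3 × 3 convolutions with stride and padding of 1 (so the feature map size is unchanged within a block), a 2 × 2 max pool between blocks, and channel counts that roughly double per stage. The exact configuration is illustrative, not the published VGG-16.

    import torch
    import torch.nn as nn

    def vgg_block(in_ch, out_ch, n_convs):
        # n_convs 3x3 convolutions (stride/padding 1), then 2x2 max pooling with stride 2
        layers = []
        for i in range(n_convs):
            layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, stride=1, padding=1),
                       nn.ReLU(inplace=True)]
        layers.append(nn.MaxPool2d(2, stride=2))
        return nn.Sequential(*layers)

    # channels roughly double after each pooling stage
    features = nn.Sequential(vgg_block(3, 64, 2), vgg_block(64, 128, 2), vgg_block(128, 256, 3))
    out = features(torch.randn(1, 3, 224, 224))  # -> (1, 256, 28, 28)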
RESNET
He et al. [4] pushed the idea of very deep networks even further by proposing the 152-layer ResNet, which won ILSVRC 2015 with a top-five error rate of 3.6% and set new records in classification, detection, and localization through a single network architecture. In [4], the authors provide an in-depth analysis of the degradation problem, i.e., that simply increasing the number of layers in plain networks results in higher training and test errors, and they claim that it is easier to optimize the residual mapping in the ResNet than to optimize the original, unreferenced mapping in conventional CNNs. The core idea of ResNet is to add shortcut connections that bypass two or more stacked convolutional layers by performing identity mapping. The connections are then added together with the output of the stacked convolutions.
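The shortcut idea fits in a few lines of PyTorch; this is a generic residual block under the description above (two stacked 3 × 3 convolutions bypassed by an identity shortcut), not the exact block of [4].

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        # the block learns the residual F(x) and outputs F(x) + x
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
            self.bn1 = nn.BatchNorm2d(channels)
            self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
            self.bn2 = nn.BatchNorm2d(channels)
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            residual = self.relu(self.bn1(self.conv1(x)))
            residual = self.bn2(self.conv2(residual))
            return self.relu(residual + x)  # shortcut: add input to stacked-conv output

    y = ResidualBlock(64)(torch.randn(1, 64, 56, 56))

Because the shortcut is an identity mapping, the block can fall back to copying its input, which is what makes very deep stacks trainable.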
FULLY CONVOLUTIONAL NETWORK
The fully convolutional network (FCN) [7] is the most important work in deep learning for semantic segmentation, i.e., the task of assigning a semantic label to every pixel in the image. To perform this task, the output of the CNN must be of the same pixel size as the input (contrary to the single class per image of the aforementioned models). FCN introduces many significant ideas, such as
◗ end-to-end learning of the upsampling algorithm via an encoder/decoder structure that first downsamples the activation's size and then upsamples it again
◗ using a fully convolutional architecture, which allows the network to take images of arbitrary size as input because there is no fully connected layer at the end that requires a specific size of the activations
◗ introducing skip connections as a way of fusing information from different depths in the network for the multiscale inference.
Figure 3 shows the FCN architecture.

FIGURE 3. The FCN architecture [7]. g.t.: ground truth.
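A minimal PyTorch sketch of these three ideas (encoder/decoder upsampling, no fully connected layers, a skip connection) is given below; the tiny architecture is an assumption for illustration, not the FCN of [7].

    import torch
    import torch.nn as nn

    class TinyFCN(nn.Module):
        # downsample, upsample back to input size, and fuse an early feature map
        def __init__(self, n_classes):
            super().__init__()
            self.enc1 = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(inplace=True))
            self.pool = nn.MaxPool2d(2)
            self.enc2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(inplace=True))
            self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)  # learned upsampling
            self.classifier = nn.Conv2d(16, n_classes, 1)      # 1x1 conv: per-pixel scores

        def forward(self, x):
            f1 = self.enc1(x)              # full-resolution features
            f2 = self.enc2(self.pool(f1))  # half-resolution features
            out = self.up(f2) + f1         # skip connection fuses two depths
            return self.classifier(out)    # same spatial size as the input

    logits = TinyFCN(n_classes=6)(torch.randn(1, 3, 64, 64))  # -> (1, 6, 64, 64)

Because every layer is convolutional, the same network accepts inputs of any size, which is the second bullet above.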
REMOTE SENSING MEETS DEEP LEARNING
Deep learning is taking off in remote sensing, as shown in Figure 4, which illustrates the number of papers published on the topic since 2014. The exponential increase confirms the rapid surge of interest in deep learning for remote sensing. In this section, we focus on a variety of remote-sensing applications that are achieved by deep learning and provide an in-depth investigation from the perspectives of hyperspectral image analysis, interpretation of SAR images, interpretation of high-resolution satellite images, multimodal data fusion, and 3-D reconstruction.
HYPERSPECTRAL IMAGE ANALYSIS
… data is of great importance in many practical applications, such as land cover/use classification or change and object detection. Because high-quality hyperspectral satellite data are becoming available (e.g., via the launch of EnMAP, planned for 2020, and the DESIS on the International Space Station, planned for 2018), hyperspectral image analysis has been one of the most active research areas within the remote-sensing community over the last decade.
Inspired by the success of deep learning in computer vision, preliminary studies have been carried out on deep learning in hyperspectral data analysis, which brings new momentum to this field. In the following, we review two application cases, land cover/use classification and anomaly detection.

HYPERSPECTRAL IMAGE CLASSIFICATION
Supervised classification is probably the most active research area in hyperspectral data analysis. There is a vast literature on this topic using conventional supervised machine-learning models, such as decision trees, random forests, and support vector machines (SVMs) [20]. With the investigation of hyperspectral image classification [21], a major finding was that various atmospheric scattering conditions, complicated light-scattering mechanisms, interclass similarity, and intraclass variability result in the hyperspectral imaging procedure being inherently nonlinear. It is believed that, compared to the previously mentioned shallow models, deep learning architectures are able to extract high-level, hierarchical, and abstract features, which are generally more robust to the nonlinear processing.

The following sections discuss research on hyperspectral image classification.

… layer, and an output layer—and directly classify hyperspectral images in the spectral domain. Makantasis et al. [26] exploited a two-dimensional (2-D) CNN to encode spectral and spatial information, followed by a multilayer perceptron performing the actual classification. In [27], the authors attempted to carry out the classification of crop types using 1-D CNN and 2-D CNN. They concluded that the 2-D CNNs can outperform the 1-D CNNs, but some small objects in the final classification map provided by 2-D CNNs are smoothed and misclassified. To avoid overfitting, Zhao and Du [28] suggest a spectral-spatial-feature-based classification framework, which jointly makes use of a local-discriminant embedding-based dimension-reduction algorithm and a 2-D CNN. In [21], the authors propose a self-improving CNN model that combines a 2-D CNN with a fractional-order Darwinian particle swarm optimization algorithm to iteratively select the most informative bands suitable for training the designed CNN. Santara et al. [29] discuss …
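To make the 1-D (spectral) variant discussed above concrete, here is a hedged PyTorch sketch of a per-pixel spectral CNN; the band count, layer sizes, and class count are illustrative assumptions, not the models of [26] or [27].

    import torch
    import torch.nn as nn

    # each sample is a single pixel's spectrum, e.g., 200 bands
    spectral_cnn = nn.Sequential(
        nn.Conv1d(1, 16, kernel_size=7, padding=3),  # convolve along the spectral axis
        nn.ReLU(inplace=True),
        nn.MaxPool1d(2),
        nn.Conv1d(16, 32, kernel_size=7, padding=3),
        nn.ReLU(inplace=True),
        nn.AdaptiveAvgPool1d(1),                     # collapse the spectral dimension
        nn.Flatten(),
        nn.Linear(32, 16),                           # 16 land-cover classes (illustrative)
    )

    pixels = torch.randn(8, 1, 200)  # a batch of 8 spectra
    scores = spectral_cnn(pixels)    # -> (8, 16)

A 2-D variant would instead convolve over a spatial neighborhood of the pixel, trading spectral detail for spatial context, which matches the trade-off reported in [27].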
FIGURE 5. A flowchart of the 3-D CNN architecture proposed in [19] for spectral-spatial hyperspectral image classification. Conv.: convolution.
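To illustrate the core operation behind such spectral-spatial models, the following minimal PyTorch sketch applies a 3-D convolution jointly over a spatial neighborhood and the spectral dimension; the shapes are assumptions for illustration, not the architecture of [19].

    import torch
    import torch.nn as nn

    # input: batch of hyperspectral patches, (batch, 1, bands, height, width)
    patch = torch.randn(4, 1, 100, 7, 7)  # 7 x 7 neighborhood of the pixel, 100 bands

    # one 3-D convolution spans the spectral axis and both spatial axes at once
    conv3d = nn.Conv3d(in_channels=1, out_channels=8,
                       kernel_size=(7, 3, 3), padding=(3, 1, 1))
    features = conv3d(patch)              # -> (4, 8, 100, 7, 7)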
FIGURE 7. The RNN proposed for the hyperspectral image classification task in [36]. GRU: gated recurrent unit; PRetanh: parametric rectified tanh.
FIGURE 9. The Flevoland data set. (a) The Pauli RGB of the PolSAR data set. (b) The classification result from [53].
FIGURE 11. The deconvolution network proposed in [92]. The yellow and green parts correspond to a fully convolutional network with a 9 × 9-pixel bottleneck; then, a deconvolutional block (purple) leads to predictions of the same size as the input image (in [92], 65 × 65 pixels).
FIGURE 12. The image classification results on the Potsdam data sets, considering 65 × 65-pixel patches (from [92]). CNN-PC: patch-based CNN, predicting single labels per patch and using a sliding-window approach; CNN-SPL: fully convolutional CNN, predicting a 9 × 9 output, then upsampled to the original size via interpolation; CNN-FPL: deconvolutional network predicting the 65 × 65 output at full resolution; nDSM: normalized digital surface model; GT: ground truth.
FIGURE 13. A deep learning-based system that helps in analyzing how land cover changes using large-scale and long-term multitemporal image sequences. This example shows how Munich airport was built out over the past 30 years. DOY: day of year.
FIGURE 15. The most popular open-source deep learning frameworks. The ranking is based on the number of stars awarded by developers in GitHub. (Image courtesy of [190].)