Advanced Spectral Classifiers for Hyperspectral Images: A Review
source of information to be fed to advanced classifiers. The output of the classification step is known as the classification map.

Table 1 categorizes different groups of classifiers with respect to different criteria, followed by a brief description. Since classification is a wide field of research and it is not feasible to investigate all of those approaches in a single article, we tried to narrow down our description by excluding the items highlighted in green in Table 1, which have been extensively covered in other contributions. We reiterate that our main goal in this article is to provide a comparative assessment and best practice recommendations for the remaining contributions in Table 1.

With respect to the availability of training samples, classification approaches can be split into two categories, i.e., supervised and unsupervised classifiers. Supervised approaches classify input data for each class using a set of representative samples known as training samples. Training samples are usually collected either by manually labeling a small number of pixels in an image or based on some field measurements [2]. In contrast, unsupervised classification (also known as clustering) does not consider training samples. This type of approach classifies the data based only on an arbitrary number of initial cluster centers that may be either user specified or quite arbitrarily selected. During the processing, each pixel is associated with one of the cluster centers based on a similarity criterion [1], [3]. Therefore, pixels that belong to different clusters are more dissimilar to each other compared to pixels within the same cluster [4], [5].

There is a vast amount of literature on unsupervised classification approaches. Among these methods, K-means [6], the Iterative Self-Organizing Data Analysis Technique (ISODATA) [7], and fuzzy C-means [8] rank among the most popular. This set of approaches is known for being highly sensitive to the initial cluster configuration and may be trapped in suboptimal solutions [9]. To address this issue, researchers have tried to improve the resilience of K-means (and its family) by optimizing it with bioinspired optimization techniques [3].
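As a concrete illustration of this clustering workflow, the sketch below runs K-means on a hyperspectral cube reshaped into a pixel-by-band matrix. It is a minimal example assuming scikit-learn, a synthetic cube, and an arbitrarily chosen number of clusters, not the setup used in the references above.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical hyperspectral cube: rows x cols x spectral bands.
cube = np.random.rand(100, 100, 200).astype(np.float32)
rows, cols, bands = cube.shape

# Clustering operates on individual pixel spectra, so flatten the
# spatial dimensions into a (num_pixels, num_bands) matrix.
pixels = cube.reshape(-1, bands)

# The number of clusters is user specified and, as noted above, the
# result is sensitive to the initial cluster configuration.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0)
labels = kmeans.fit_predict(pixels)

# Reshape the cluster indices back into an unsupervised "map".
cluster_map = labels.reshape(rows, cols)
```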
Since supervised approaches consider class-specific information provided by training samples, they lead to more precise classification maps than unsupervised approaches. In addition to unsupervised and
descent-based learning methods, which are generally slow and may easily converge to local minima. These techniques adjust the weights in the steepest descent direction (the negative of the gradient), which is the direction in which the performance function decreases most rapidly, but this does not necessarily produce the fastest convergence [64]. In this sense, several conjugate gradient algorithms have been proposed to perform a search along conjugate directions, which generally results in faster convergence. These algorithms usually require high storage capacity and are widely used in networks with a large number of weights. Lastly, Newton-based learning algorithms generally provide better and faster optimization than conjugate gradient methods. Based on the Hessian matrix (second derivatives) of the performance index at the current values of the weights and biases, their convergence is faster, although their complexity usually introduces an extra computational burden for the calculation of the Hessian matrix.

Recently, the ELM algorithm has been proposed to train SLFNs [66], [67] and has emerged as an efficient algorithm that provides accurate results in much less time. Traditional gradient-based learning algorithms assume that all of the parameters (weights and biases) of the feedforward network need to be tuned, establishing a dependency between different layers of parameters and fostering very slow convergence. In [117] and [118], it was first shown that an SLFN (with N hidden nodes) with randomly chosen input weights and hidden-layer biases can learn exactly N distinct observations, which means that it may not be necessary to adjust the input weights and first hidden-layer biases.

Let (x_i, t_i) be n distinct samples, where x_i = [x_{i1}, x_{i2}, ..., x_{id}]^T ∈ R^d and t_i = [t_{i1}, t_{i2}, ..., t_{iK}]^T ∈ R^K, where d is the spectral dimensionality of the data and K is the number of spectral classes. An SLFN with L hidden nodes and an activation function f(x) can be expressed as

$$\sum_{i=1}^{L} \boldsymbol{\beta}_i f_i(\mathbf{x}_j) = \sum_{i=1}^{L} \boldsymbol{\beta}_i f(\mathbf{w}_i \cdot \mathbf{x}_j + b_i) = \mathbf{o}_j, \quad j = 1, \ldots, n, \tag{1}$$

where w_i = [w_{i1}, w_{i2}, ..., w_{id}]^T is the weight vector connecting the ith hidden node and the input nodes, β_i = [β_{i1}, β_{i2}, ..., β_{iK}]^T is the weight vector connecting the ith hidden node and the output nodes, b_i is the bias of the ith hidden node, and f(w_i · x_j + b_i) is the output of the ith hidden node for the input sample x_j. The above equation can be rewritten compactly as

$$\mathbf{H}\boldsymbol{\beta} = \mathbf{Y}, \tag{2}$$

$$\mathbf{H} = \begin{bmatrix} f(\mathbf{w}_1 \cdot \mathbf{x}_1 + b_1) & \cdots & f(\mathbf{w}_L \cdot \mathbf{x}_1 + b_L) \\ \vdots & \ddots & \vdots \\ f(\mathbf{w}_1 \cdot \mathbf{x}_n + b_1) & \cdots & f(\mathbf{w}_L \cdot \mathbf{x}_n + b_L) \end{bmatrix}_{n \times L}, \tag{3}$$
where H is the output matrix of the hidden layer and β is the output weight matrix. The objective is to find specific $\hat{\mathbf{w}}_i$, $\hat{b}_i$, and $\hat{\boldsymbol{\beta}}$ (i = 1, ..., L) so that

$$\big\|\mathbf{H}(\hat{\mathbf{w}}_i, \hat{b}_i)\hat{\boldsymbol{\beta}} - \mathbf{Y}\big\| = \min_{\mathbf{w}_i, b_i, \boldsymbol{\beta}} \big\|\mathbf{H}(\mathbf{w}_1, \ldots, \mathbf{w}_L, b_1, \ldots, b_L)\boldsymbol{\beta} - \mathbf{Y}\big\|. \tag{5}$$

As mentioned before, the minimum of ‖Hβ − Y‖² is traditionally calculated using gradient-based learning algorithms. The main issues related to these traditional methods are as follows:
◗ First and foremost, all gradient-based learning algorithms are very time consuming in most applications. This becomes an important problem when classifying hyperspectral data.
◗ The size of the learning rate parameter strongly affects the performance of the network. Values that are too small generate very slow convergence, while values that are too large make the learning algorithm diverge and become unstable.
◗ The error surface generally presents local minima, and gradient-based learning algorithms can get stuck at them. This can be an important issue if local minima are far above the global minimum.
◗ FNs can be overtrained using BP-based algorithms, thus obtaining worse generalization performance. The effects of overtraining can be alleviated using regularization or early stopping criteria [119].

It has been proved in [66] that the input weights w_i and the hidden-layer biases b_i do not need to be tuned, so the output matrix of the hidden layer H can remain unchanged after a random initialization. Fixing the input weights w_i and the hidden-layer biases b_i means that training an SLFN is equivalent to finding a least-squares solution $\hat{\boldsymbol{\beta}}$ of the linear system Hβ = Y. Different from the traditional gradient-based learning algorithms, ELM aims to reach not only the smallest training error but also the smallest norm of output weights:

$$\text{Minimize: } \|\mathbf{H}\boldsymbol{\beta} - \mathbf{Y}\|^2 \ \text{ and } \ \|\boldsymbol{\beta}\|^2. \tag{6}$$

Let h(x) = [f(w_1 · x + b_1), ..., f(w_L · x + b_L)]. If we express (6) from the optimization theory point of view, the regularized training problem can be written as

$$\text{Minimize: } \frac{1}{2}\|\boldsymbol{\beta}\|^2 + \frac{C}{2}\sum_{i=1}^{n} \rho_i^2 \tag{7}$$

$$\text{Subject to: } h(\mathbf{x}_i)\boldsymbol{\beta} = \mathbf{t}_i^T - \boldsymbol{\rho}_i^T, \quad i = 1, \ldots, n, \tag{8}$$

where ρ_i² is the training error of training sample x_i and C is a regularization parameter. The output of ELM can be analytically expressed as

$$h(\mathbf{x})\boldsymbol{\beta} = h(\mathbf{x})\mathbf{H}^T\left(\frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^T\right)^{-1}\mathbf{Y}. \tag{9}$$

This expression can be generalized to a kernel version of ELM using the kernel trick [71]: the inner product operations in h(x)H^T and HH^T can be replaced by a kernel function, h(x_i) · h(x_j) = k(x_i, x_j). Both the regularized and kernel extensions of the traditional ELM algorithm require the setting of the needed parameters (C and all kernel-dependent parameters). When compared with traditional learning algorithms, ELM has the following advantages:
◗ There is no need to iteratively tune the input weights w_i and the hidden-layer biases b_i using slow gradient-based learning algorithms.
◗ Because ELM tries to reach both the smallest training error and the smallest norm of output weights, it exhibits better generalization performance in most cases when compared with traditional approaches.
◗ ELM's learning speed is much faster than that of traditional gradient-based learning algorithms. Depending on the application, ELM can be tens to hundreds of times faster [66].
◗ The use of ELM avoids problems inherent to gradient-descent methods, such as getting stuck in local minima or overfitting the model [66].
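To make (6)–(9) concrete, the sketch below trains a regularized ELM in plain NumPy: the input weights and biases are drawn randomly and left fixed, the hidden-layer output matrix H is computed once, and the output weights follow the closed-form solution in (9). The problem sizes, sigmoid activation, and value of C are illustrative assumptions, not the configuration used in the experiments reported later.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy sizes: n training spectra of dimensionality d, K classes,
# L hidden nodes, and regularization parameter C.
n, d, K, L, C = 500, 200, 9, 1000, 100.0

X = rng.standard_normal((n, d))      # training pixel spectra (one per row)
labels = rng.integers(0, K, n)       # integer class labels
Y = np.eye(K)[labels]                # one-hot target matrix, shape (n, K)

# Randomly chosen input weights w_i and hidden biases b_i; they are fixed
# and never tuned, as discussed above (scaled to keep the sigmoid active).
W = rng.standard_normal((d, L)) * 0.1
b = rng.standard_normal(L)

def hidden(X):
    """Hidden-layer output h(x) for each row of X, with a sigmoid f."""
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))

# Compute H once and obtain the output weights following (9):
# beta = H^T (I/C + H H^T)^{-1} Y. Replacing H H^T and h(x) H^T with a
# kernel function would give the kernel ELM variant mentioned in the text.
H = hidden(X)
beta = H.T @ np.linalg.solve(np.eye(n) / C + H @ H.T, Y)

# Classify new spectra by the largest output score.
X_new = rng.standard_normal((10, d))
predicted = np.argmax(hidden(X_new) @ beta, axis=1)
print(predicted)
```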
SUPPORT VECTOR MACHINES
SVMs [113] have often been used for the classification of hyperspectral data because of their ability to handle high-dimensional data with a limited number of training samples. The goal is to define an optimal linear separating hyperplane (the class boundary) within a multidimensional feature space that differentiates the training samples of two classes. The best hyperplane is the one that leaves the maximum margin from both classes. The hyperplane is obtained through an optimization problem that is solved via structural risk minimization. In this way, in contrast to statistical approaches, SVMs minimize the classification error on unseen data without any prior assumptions made on the probability distribution of the data [120].

The SVM tries to maximize the margins between the hyperplane and the closest training samples [75]. In other words, to train the classifier, only samples that are close to the class boundary are needed to locate the hyperplane; this is why the training samples closest to the hyperplane are called support vectors. More importantly, since
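For a concrete counterpart of the maximum-margin classifier just described, the following sketch fits an RBF-kernel SVM to labeled pixel spectra. It assumes scikit-learn, and the synthetic data, C, and gamma values are placeholders rather than the settings used in this article's experiments.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical labeled pixel spectra: (num_samples, num_bands) plus class labels.
X = rng.standard_normal((1000, 200))
y = rng.integers(0, 9, 1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.2, stratify=y, random_state=0)

# Band-wise scaling helps the RBF kernel; C and gamma would normally be
# selected by cross-validation rather than fixed as below.
scaler = StandardScaler().fit(X_train)
svm = SVC(kernel="rbf", C=100.0, gamma="scale")
svm.fit(scaler.transform(X_train), y_train)

# The learned boundary is defined only by the support vectors.
print("support vectors per class:", svm.n_support_)
print("test accuracy:", svm.score(scaler.transform(X_test), y_test))
```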
FIGURE 4. A graphical illustration of an RBM. The top layer (h) represents the hidden units, and the bottom layer (v) represents the visible units. w: input weight.

$$\mathbf{x}_j^l = f\!\left(\sum_{i=1}^{M} \mathbf{x}_i^{l-1} * \mathbf{k}_{ij}^{l} + b_j^l\right),$$

where x_i^{l-1} is the ith feature map of the (l − 1)th layer, x_j^l is the jth feature map of the current (l)th layer, and M is the number of input feature maps. k_{ij}^l and b_j^l are the trainable parameters of the convolutional layer, f(·) is a nonlinear function, and * is the convolution operation. It should be noted that here we explain the one-dimensional (1-D) CNN, as this article deals with spectral classifiers. For detailed information about two-dimensional (2-D) and three-dimensional (3-D) CNNs for the classification of hyperspectral data, see [145].
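As a concrete reading of the equation above, the following NumPy sketch applies a single 1-D convolutional layer to one pixel spectrum; the number of feature maps, the kernel size, and the ReLU nonlinearity are illustrative assumptions rather than the configuration evaluated later in the experiments.

```python
import numpy as np

def conv1d_layer(feature_maps, kernels, biases):
    """One 1-D convolutional layer: for each output map j,
    x_j^l = f( sum_i x_i^{l-1} * k_ij^l + b_j^l ), with f chosen as ReLU here."""
    num_in, length = feature_maps.shape
    num_out, _, ksize = kernels.shape        # kernels[j, i, :] holds k_ij^l
    out_len = length - ksize + 1             # "valid" convolution
    out = np.zeros((num_out, out_len))
    for j in range(num_out):
        acc = np.zeros(out_len)
        for i in range(num_in):
            # np.convolve performs true convolution (kernel flipped),
            # matching the * operation in the equation above.
            acc += np.convolve(feature_maps[i], kernels[j, i], mode="valid")
        out[j] = np.maximum(acc + biases[j], 0.0)   # nonlinearity f(.)
    return out

# A hypothetical 200-band pixel spectrum treated as a single input feature map.
rng = np.random.default_rng(0)
spectrum = rng.standard_normal((1, 200))
kernels = rng.standard_normal((16, 1, 11)) * 0.1   # M = 1 input map, 16 output maps
biases = np.zeros(16)

feature_maps = conv1d_layer(spectrum, kernels, biases)
print(feature_maps.shape)   # (16, 190)
```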
The pooling operation offers invariance by reducing the resolution of the feature maps. A neuron in the pooling layer combines a small N × 1 patch of the convolution layer, and the most common pooling operation is max pooling. A convolution layer, a nonlinear function, and a pooling layer are the three fundamental parts of a CNN [144]. By stacking several convolution layers with nonlinear operations and several pooling layers, a deep CNN can be formulated. A deep CNN can hierarchically extract the features of its inputs, which tend to be invariant and robust [100].

The architecture of a deep CNN for spectral classification is shown in Figure 6. The input of the system is a pixel vector of hyperspectral data, and the output is the corresponding class label.

FIGURE 5. A spectral classifier based on a DBN. The classification scheme shown here has four layers: one input layer, two RBMs, and a logistic regression layer.
FIGURE 6. The architecture of the 1-D CNN for spectral classification: pixel vector → convolution → pooling → convolution → pooling → stack of feature maps → logistic regression → output class labels.
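A compact sketch of such a pipeline is given below, assuming PyTorch and arbitrary layer sizes; it mirrors the convolution, pooling, and logistic regression flow of Figure 6 rather than reproducing the exact networks of Figure 10.

```python
import torch
from torch import nn

num_bands, num_classes = 200, 9   # illustrative values

# Two convolution + pooling blocks followed by a linear output layer,
# mirroring the pixel-vector -> conv -> pool -> conv -> pool -> stack ->
# logistic regression flow of Figure 6.
model = nn.Sequential(
    nn.Conv1d(in_channels=1, out_channels=16, kernel_size=11),
    nn.ReLU(),
    nn.MaxPool1d(kernel_size=2),
    nn.Conv1d(in_channels=16, out_channels=32, kernel_size=11),
    nn.ReLU(),
    nn.MaxPool1d(kernel_size=2),
    nn.Flatten(),                 # "stack" the pooled feature maps
    nn.LazyLinear(num_classes),   # logistic-regression-style output layer
)

# Each input is a pixel spectrum shaped (batch, 1 channel, num_bands).
x = torch.randn(8, 1, num_bands)
logits = model(x)   # unnormalized class scores; a cross-entropy loss would train them
print(logits.shape)  # torch.Size([8, 9])
```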
Its spatial dimensions are 145 × 145 pixels, and its spatial resolution is 20 m per pixel. This data set originally included 220 spectral channels, but 20 water absorption bands (104–108, 150–163, 220) have been removed, and the remaining 200 bands were taken into account for the experiments. The reference data contain 16 classes of interest that represent mostly different types of crops and are detailed in Table 4. Figure 8 shows a three-band false color image and its corresponding reference samples.

HOUSTON DATA
This data set was captured by the Compact Airborne Spectrographic Imager (CASI) over the University of Houston campus and the neighboring urban area in June 2012. With a size of 349 × 1,905 pixels and a spatial resolution of 2.5 m, this data set is composed of 144 spectral bands ranging from 0.38 to 1.05 µm. These data consist of 15 classes, including healthy grass, stressed grass, synthetic grass, trees, soil, water, residential, commercial, road, highway, railway, parking lot 1, parking lot 2, tennis court, and running track. Parking lot 1 includes parking garages at the ground level and also in elevated areas, while parking lot 2 corresponds to parked vehicles. Table 5 lists the different classes with the corresponding numbers of training and test samples. Figure 9 shows a three-band false color image and its corresponding already-separated training and test samples.

TABLE 5. HOUSTON: THE NUMBER OF TRAINING AND TEST SAMPLES.

CLASS NUMBER  CLASS NAME       TRAINING SAMPLES  TEST SAMPLES
1             Grass-healthy    198               1,053
2             Grass-stressed   190               1,064
3             Grass-synthetic  192               505
4             Tree             188               1,056
5             Soil             186               1,056
6             Water            182               143
7             Residential      196               1,072
8             Commercial       191               1,053
9             Road             193               1,059
10            Highway          191               1,036
11            Railway          181               1,054
12            Parking lot 1    192               1,041
13            Parking lot 2    184               285
14            Tennis court     181               247
15            Running track    187               473
Total                          2,832             12,197

ALGORITHM SETUP
In this article, two different scenarios were defined to evaluate the different approaches. In the first scenario, different percentages of the available reference data were chosen as training samples. In this scenario, only Indian Pines and Pavia University were considered. For Indian Pines, 1, 5, 10, 15, 20, and 25% of the whole sample were randomly selected as training samples, except for classes alfalfa,
FIGURE 9. The CASI Houston hyperspectral data: (a) a color composite representation of the data, using bands 70, 50, and 20 as R, G, and B, respectively; (b) training samples; (c) test samples; and (d) a legend of the different classes.
FIGURE 10. The architectures of the 1-D CNN on three data sets.
FIGURE 13. Scenario 2: classification maps for Houston data using (a) RF, (b) SVM, (c) BP, (d) KELM, (e) MLR, and (f) 1-D CNN.
Regarding the classification accuracy, it can be seen that the ELM achieves comparable results.
◗ SVM versus KELM: The computational complexity of the SVM is much higher than that of the KELM, and the KELM slightly outperforms the SVM in terms of classification accuracy. Experimental validation shows that the kernel used in the KELM and SVM is more efficient than the activation function used in the ELM.
◗ BP versus ELM versus KELM: In light of the results, it can be seen that the three versions of the SLFN provide competitive results in terms of accuracy. However, it should be noticed that both the ELM and KELM are on the order of hundreds or even thousands of times faster than the BP. In fact, the ELM and KELM have practical complexities of O(L^3 + L^2 n + (K + d)Ln) and O(2n^3 + (K + d)n^2), respectively [149].
◗ SVM versus 1-D CNN: The main advantage of 2-D and 3-D CNNs is that they use local connections to handle spatial dependencies. In this work, however, the 1-D CNN is adopted to allow a fair comparison with the other spectral approaches. In general, the SVM can obtain higher classification accuracies and work faster than the 1-D CNN, so the use of the SVM over the 1-D CNN is recommended. In terms of central processing unit (CPU) processing time, deep-learning methods are time consuming in the training step. Compared to the RBF-SVM, the training time of the 1-D deep CNN is about two or three times longer. On the other hand, the advantage of the deep CNN is that it is extremely fast in the testing stage.
◗ MLR (executed via LORSAL) versus other methods: Some of the MLR advantages are as follows: 1) It converges very fast and is relatively insensitive to parameter settings. In our experiments, we used the same settings for all data sets and obtained very competitive results in comparison with those obtained by the other methods. 2) MLR has a very low computational cost, with a practical complexity of O(d^2(K − 1)).

For illustrative purposes, Figure 11 provides a comparison of the different classifiers tested in this work on the Indian Pines and Pavia University scenes (in terms of OA). As shown by Figure 11, different classifiers provide different performances for the two considered images, indicating that there is no classifier that consistently provides the best classification results for different scenes. The stability of the different classifiers on the two considered scenes is illustrated in Figure 12, which demonstrates how stable a classifier is with respect to changes in the available training sets. Furthermore, Table 6 gives detailed information about the classification accuracies obtained by the different approaches in a different application domain, represented by the Houston data set. In this case, the optimized classifiers also perform similarly in terms of classification accuracy; so, ultimately, the choice of a given classifier is driven more by the simplicity of tuning the parameters and configurations than by the obtained classification results. This is an important observation, as it is felt that the hyperspectral community has reached a point at which many classifiers are able to provide very high classification