Advanced Spectral Classifiers for Hyperspectral Images: A Review
source of information to be fed to advanced classifiers. The output of the classification step is known as the classification map.

Table 1 categorizes different groups of classifiers with respect to different criteria, followed by a brief description. Since classification is a wide field of research and it is not feasible to investigate all of those approaches in a single article, we tried to narrow down our description by excluding the items highlighted in green in Table 1, which have been extensively covered in other contributions. We reiterate that our main goal in this article is to provide a comparative assessment and best practice recommendations for the remaining contributions in Table 1.

With respect to the availability of training samples, classification approaches can be split into two categories, i.e., supervised and unsupervised classifiers. Supervised approaches classify input data for each class using a set of representative samples known as training samples. Training samples are usually collected either by manually labeling a small number of pixels in an image or based on some field measurements [2]. In contrast, unsupervised classification (also known as clustering) does not consider training samples. This type of approach classifies the data based only on an arbitrary number of initial cluster centers that may be either user specified or quite arbitrarily selected. During the processing, each pixel is associated with one of the cluster centers based on a similarity criterion [1], [3]. Therefore, pixels that belong to different clusters are more dissimilar to each other compared to pixels within the same cluster [4], [5].

There is a vast amount of literature on unsupervised classification approaches. Among these methods, K-means [6], the Iterative Self-Organizing Data Analysis Technique (ISODATA) [7], and fuzzy C-means [8] rank among the most popular. This set of approaches is known for being highly sensitive to the initial cluster configuration and may be trapped in suboptimal solutions [9]. To address this issue, researchers have tried to improve the resilience of K-means (and its family) by optimizing it with bioinspired optimization techniques [3].
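As a concrete illustration of this clustering workflow, the sketch below runs K-means on a hyperspectral cube reshaped into a pixel-by-band matrix. It is a minimal example assuming scikit-learn, a synthetic cube, and an arbitrarily chosen number of clusters, not the setup used in the references above.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical hyperspectral cube: rows x cols x spectral bands.
cube = np.random.rand(100, 100, 200).astype(np.float32)
rows, cols, bands = cube.shape

# Clustering operates on individual pixel spectra, so flatten the
# spatial dimensions into a (num_pixels, num_bands) matrix.
pixels = cube.reshape(-1, bands)

# The number of clusters is user specified and, as noted above, the
# result is sensitive to the initial cluster configuration.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0)
labels = kmeans.fit_predict(pixels)

# Reshape the cluster indices back into an unsupervised "map".
cluster_map = labels.reshape(rows, cols)
```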
Since supervised approaches consider class-specific information provided by training samples, they lead to more precise classification maps than unsupervised approaches. In addition to unsupervised and
descent-based learning methods, which are generally slow and may easily converge to local minima. These techniques adjust the weights in the steepest descent direction (the negative of the gradient), which is the direction in which the performance function decreases most rapidly, but this does not necessarily produce the fastest convergence [64]. In this sense, several conjugate gradient algorithms have been proposed to perform a search along conjugate directions, which generally results in faster convergence. These algorithms usually require high storage capacity and are widely used in networks with a large number of weights. Lastly, Newton-based learning algorithms generally provide better and faster optimization than conjugate gradient methods. Based on the Hessian matrix (second derivatives) of the performance index at the current values of the weights and biases, their convergence is faster, although their complexity usually introduces an extra computational burden for the calculation of the Hessian matrix.

Recently, the ELM algorithm has been proposed to train SLFNs [66], [67] and has emerged as an efficient algorithm that provides accurate results in much less time. Traditional gradient-based learning algorithms assume that all of the parameters (weights and biases) of the feedforward network need to be tuned, establishing a dependency between different layers of parameters and fostering very slow convergence. In [117] and [118], it was first shown that an SLFN (with N hidden nodes) with randomly chosen input weights and hidden-layer biases can learn exactly N distinct observations, which means that it may not be necessary to adjust the input weights and first hidden-layer biases.

Let (x_i, t_i) be n distinct samples, where x_i = [x_{i1}, x_{i2}, ..., x_{id}]^T ∈ R^d and t_i = [t_{i1}, t_{i2}, ..., t_{iK}]^T ∈ R^K, where d is the spectral dimensionality of the data and K is the number of spectral classes. An SLFN with L hidden nodes and an activation function f(x) can be expressed as

$$\sum_{i=1}^{L} \boldsymbol{\beta}_i f_i(\mathbf{x}_j) = \sum_{i=1}^{L} \boldsymbol{\beta}_i f(\mathbf{w}_i \cdot \mathbf{x}_j + b_i) = \mathbf{o}_j, \quad j = 1, \ldots, n, \tag{1}$$

where w_i = [w_{i1}, w_{i2}, ..., w_{id}]^T is the weight vector connecting the ith hidden node and the input nodes, β_i = [β_{i1}, β_{i2}, ..., β_{iK}]^T is the weight vector connecting the ith hidden node and the output nodes, b_i is the bias of the ith hidden node, and f(w_i · x_j + b_i) is the output of the ith hidden node for the input sample x_j. The above equation can be rewritten compactly as

$$\mathbf{H}\boldsymbol{\beta} = \mathbf{Y}, \tag{2}$$

$$\mathbf{H} = \begin{bmatrix} f(\mathbf{w}_1 \cdot \mathbf{x}_1 + b_1) & \cdots & f(\mathbf{w}_L \cdot \mathbf{x}_1 + b_L) \\ \vdots & \ddots & \vdots \\ f(\mathbf{w}_1 \cdot \mathbf{x}_n + b_1) & \cdots & f(\mathbf{w}_L \cdot \mathbf{x}_n + b_L) \end{bmatrix}_{n \times L}, \tag{3}$$
where H is the output matrix of the hidden layer and β is the output weight matrix. The objective is to find specific $\hat{\mathbf{w}}_i$, $\hat{b}_i$, and $\hat{\boldsymbol{\beta}}$ (i = 1, ..., L) so that

$$\big\|\mathbf{H}(\hat{\mathbf{w}}_i, \hat{b}_i)\hat{\boldsymbol{\beta}} - \mathbf{Y}\big\| = \min_{\mathbf{w}_i, b_i, \boldsymbol{\beta}} \big\|\mathbf{H}(\mathbf{w}_1, \ldots, \mathbf{w}_L, b_1, \ldots, b_L)\boldsymbol{\beta} - \mathbf{Y}\big\|. \tag{5}$$

As mentioned before, the minimum of ‖Hβ − Y‖² is traditionally calculated using gradient-based learning algorithms. The main issues related to these traditional methods are as follows:
◗ First and foremost, all gradient-based learning algorithms are very time consuming in most applications. This becomes an important problem when classifying hyperspectral data.
◗ The size of the learning rate parameter strongly affects the performance of the network. Values that are too small generate very slow convergence, while values that are too large make the learning algorithm diverge and become unstable.
◗ The error surface generally presents local minima, and gradient-based learning algorithms can get stuck at them. This can be an important issue if local minima are far above the global minimum.
◗ FNs can be overtrained using BP-based algorithms, thus obtaining worse generalization performance. The effects of overtraining can be alleviated using regularization or early stopping criteria [119].

It has been proved in [66] that the input weights w_i and the hidden-layer biases b_i do not need to be tuned, so the output matrix of the hidden layer H can remain unchanged after a random initialization. Fixing the input weights w_i and the hidden-layer biases b_i means that training an SLFN is equivalent to finding a least-squares solution $\hat{\boldsymbol{\beta}}$ of the linear system Hβ = Y. Different from the traditional gradient-based learning algorithms, ELM aims to reach not only the smallest training error but also the smallest norm of output weights:

$$\text{Minimize: } \|\mathbf{H}\boldsymbol{\beta} - \mathbf{Y}\|^2 \ \text{ and } \ \|\boldsymbol{\beta}\|^2. \tag{6}$$

Let h(x) = [f(w_1 · x + b_1), ..., f(w_L · x + b_L)]. If we express (6) from the optimization theory point of view, the regularized training problem can be written as

$$\text{Minimize: } \frac{1}{2}\|\boldsymbol{\beta}\|^2 + \frac{C}{2}\sum_{i=1}^{n} \rho_i^2 \tag{7}$$

$$\text{Subject to: } h(\mathbf{x}_i)\boldsymbol{\beta} = \mathbf{t}_i^T - \boldsymbol{\rho}_i^T, \quad i = 1, \ldots, n, \tag{8}$$

where ρ_i² is the training error of training sample x_i and C is a regularization parameter. The output of ELM can be analytically expressed as

$$h(\mathbf{x})\boldsymbol{\beta} = h(\mathbf{x})\mathbf{H}^T\left(\frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^T\right)^{-1}\mathbf{Y}. \tag{9}$$

This expression can be generalized to a kernel version of ELM using the kernel trick [71]: the inner product operations in h(x)H^T and HH^T can be replaced by a kernel function, h(x_i) · h(x_j) = k(x_i, x_j). Both the regularized and kernel extensions of the traditional ELM algorithm require the setting of the needed parameters (C and all kernel-dependent parameters). When compared with traditional learning algorithms, ELM has the following advantages:
◗ There is no need to iteratively tune the input weights w_i and the hidden-layer biases b_i using slow gradient-based learning algorithms.
◗ Because ELM tries to reach both the smallest training error and the smallest norm of output weights, it exhibits better generalization performance in most cases when compared with traditional approaches.
◗ ELM's learning speed is much faster than that of traditional gradient-based learning algorithms. Depending on the application, ELM can be tens to hundreds of times faster [66].
◗ The use of ELM avoids problems inherent to gradient-descent methods, such as getting stuck in local minima or overfitting the model [66].
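To make (6)–(9) concrete, the sketch below trains a regularized ELM in plain NumPy: the input weights and biases are drawn randomly and left fixed, the hidden-layer output matrix H is computed once, and the output weights follow the closed-form solution in (9). The problem sizes, sigmoid activation, and value of C are illustrative assumptions, not the configuration used in the experiments reported later.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy sizes: n training spectra of dimensionality d, K classes,
# L hidden nodes, and regularization parameter C.
n, d, K, L, C = 500, 200, 9, 1000, 100.0

X = rng.standard_normal((n, d))      # training pixel spectra (one per row)
labels = rng.integers(0, K, n)       # integer class labels
Y = np.eye(K)[labels]                # one-hot target matrix, shape (n, K)

# Randomly chosen input weights w_i and hidden biases b_i; they are fixed
# and never tuned, as discussed above (scaled to keep the sigmoid active).
W = rng.standard_normal((d, L)) * 0.1
b = rng.standard_normal(L)

def hidden(X):
    """Hidden-layer output h(x) for each row of X, with a sigmoid f."""
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))

# Compute H once and obtain the output weights following (9):
# beta = H^T (I/C + H H^T)^{-1} Y. Replacing H H^T and h(x) H^T with a
# kernel function would give the kernel ELM variant mentioned in the text.
H = hidden(X)
beta = H.T @ np.linalg.solve(np.eye(n) / C + H @ H.T, Y)

# Classify new spectra by the largest output score.
X_new = rng.standard_normal((10, d))
predicted = np.argmax(hidden(X_new) @ beta, axis=1)
print(predicted)
```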
SUPPORT VECTOR MACHINES
SVMs [113] have often been used for the classification of hyperspectral data because of their ability to handle high-dimensional data with a limited number of training samples. The goal is to define an optimal linear separating hyperplane (the class boundary) within a multidimensional feature space that differentiates the training samples of two classes. The best hyperplane is the one that leaves the maximum margin from both classes. The hyperplane is obtained through an optimization problem that is solved via structural risk minimization. In this way, in contrast to statistical approaches, SVMs minimize the classification error on unseen data without any prior assumptions made on the probability distribution of the data [120].

The SVM tries to maximize the margins between the hyperplane and the closest training samples [75]. In other words, to train the classifier, only samples that are close to the class boundary are needed to locate the hyperplane; this is why the training samples closest to the hyperplane are called support vectors. More importantly, since
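For a concrete counterpart of the maximum-margin classifier just described, the following sketch fits an RBF-kernel SVM to labeled pixel spectra. It assumes scikit-learn, and the synthetic data, C, and gamma values are placeholders rather than the settings used in this article's experiments.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical labeled pixel spectra: (num_samples, num_bands) plus class labels.
X = rng.standard_normal((1000, 200))
y = rng.integers(0, 9, 1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.2, stratify=y, random_state=0)

# Band-wise scaling helps the RBF kernel; C and gamma would normally be
# selected by cross-validation rather than fixed as below.
scaler = StandardScaler().fit(X_train)
svm = SVC(kernel="rbf", C=100.0, gamma="scale")
svm.fit(scaler.transform(X_train), y_train)

# The learned boundary is defined only by the support vectors.
print("support vectors per class:", svm.n_support_)
print("test accuracy:", svm.score(scaler.transform(X_test), y_test))
```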
FIGURE 4. A graphical illustration of an RBM. The top layer (h) represents the hidden units, and the bottom layer (v) represents the visible units. w: input weight.

$$\mathbf{x}_j^l = f\!\left(\sum_{i=1}^{M} \mathbf{x}_i^{l-1} * \mathbf{k}_{ij}^{l} + b_j^l\right),$$

where x_i^{l-1} is the ith feature map of the (l − 1)th layer, x_j^l is the jth feature map of the current (l)th layer, and M is the number of input feature maps. k_{ij}^l and b_j^l are the trainable parameters of the convolutional layer, f(·) is a nonlinear function, and * is the convolution operation. It should be noted that here we explain the one-dimensional (1-D) CNN, as this article deals with spectral classifiers. For detailed information about two-dimensional (2-D) and three-dimensional (3-D) CNNs for the classification of hyperspectral data, see [145].
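As a concrete reading of the equation above, the following NumPy sketch applies a single 1-D convolutional layer to one pixel spectrum; the number of feature maps, the kernel size, and the ReLU nonlinearity are illustrative assumptions rather than the configuration evaluated later in the experiments.

```python
import numpy as np

def conv1d_layer(feature_maps, kernels, biases):
    """One 1-D convolutional layer: for each output map j,
    x_j^l = f( sum_i x_i^{l-1} * k_ij^l + b_j^l ), with f chosen as ReLU here."""
    num_in, length = feature_maps.shape
    num_out, _, ksize = kernels.shape        # kernels[j, i, :] holds k_ij^l
    out_len = length - ksize + 1             # "valid" convolution
    out = np.zeros((num_out, out_len))
    for j in range(num_out):
        acc = np.zeros(out_len)
        for i in range(num_in):
            # np.convolve performs true convolution (kernel flipped),
            # matching the * operation in the equation above.
            acc += np.convolve(feature_maps[i], kernels[j, i], mode="valid")
        out[j] = np.maximum(acc + biases[j], 0.0)   # nonlinearity f(.)
    return out

# A hypothetical 200-band pixel spectrum treated as a single input feature map.
rng = np.random.default_rng(0)
spectrum = rng.standard_normal((1, 200))
kernels = rng.standard_normal((16, 1, 11)) * 0.1   # M = 1 input map, 16 output maps
biases = np.zeros(16)

feature_maps = conv1d_layer(spectrum, kernels, biases)
print(feature_maps.shape)   # (16, 190)
```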
The pooling operation offers invariance by reducing the resolution of the feature maps. A neuron in the pooling layer combines a small N × 1 patch of the convolution layer, and the most common pooling operation is max pooling. A convolution layer, a nonlinear function, and a pooling layer are the three fundamental parts of a CNN [144]. By stacking several convolution layers with nonlinear operations and several pooling layers, a deep CNN can be formulated. A deep CNN can hierarchically extract the features of its inputs, which tend to be invariant and robust [100].

The architecture of a deep CNN for spectral classification is shown in Figure 6. The input of the system is a pixel vector of hyperspectral data, and the output is the corresponding class label.

FIGURE 5. A spectral classifier based on a DBN. The classification scheme shown here has four layers: one input layer, two RBMs, and a logistic regression layer.
FIGURE 6. The architecture of the 1-D CNN for spectral classification: pixel vector → convolution → pooling → convolution → pooling → stack of feature maps → logistic regression → output class labels.
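A compact sketch of such a pipeline is given below, assuming PyTorch and arbitrary layer sizes; it mirrors the convolution, pooling, and logistic regression flow of Figure 6 rather than reproducing the exact networks of Figure 10.

```python
import torch
from torch import nn

num_bands, num_classes = 200, 9   # illustrative values

# Two convolution + pooling blocks followed by a linear output layer,
# mirroring the pixel-vector -> conv -> pool -> conv -> pool -> stack ->
# logistic regression flow of Figure 6.
model = nn.Sequential(
    nn.Conv1d(in_channels=1, out_channels=16, kernel_size=11),
    nn.ReLU(),
    nn.MaxPool1d(kernel_size=2),
    nn.Conv1d(in_channels=16, out_channels=32, kernel_size=11),
    nn.ReLU(),
    nn.MaxPool1d(kernel_size=2),
    nn.Flatten(),                 # "stack" the pooled feature maps
    nn.LazyLinear(num_classes),   # logistic-regression-style output layer
)

# Each input is a pixel spectrum shaped (batch, 1 channel, num_bands).
x = torch.randn(8, 1, num_bands)
logits = model(x)   # unnormalized class scores; a cross-entropy loss would train them
print(logits.shape)  # torch.Size([8, 9])
```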
Its spatial dimensions are 145 × 145 pixels, and its spatial resolution is 20 m per pixel. This data set originally included 220 spectral channels, but 20 water absorption bands (104–108, 150–163, 220) have been removed, and the remaining 200 bands were taken into account for the experiments. The reference data contain 16 classes of interest that represent mostly different types of crops and are detailed in Table 4. Figure 8 shows a three-band false color image and its corresponding reference samples.

HOUSTON DATA
This data set was captured by the Compact Airborne Spectrographic Imager (CASI) over the University of Houston campus and the neighboring urban area in June 2012. With a size of 349 × 1,905 pixels and a spatial resolution of 2.5 m, this data set is composed of 144 spectral bands ranging from 0.38 to 1.05 µm. These data consist of 15 classes, including healthy grass, stressed grass, synthetic grass, trees, soil, water, residential, commercial, road, highway, railway, parking lot 1, parking lot 2, tennis court, and running track. Parking lot 1 includes parking garages at the ground level and also in elevated areas, while parking lot 2 corresponds to parked vehicles. Table 5 lists the different classes with the corresponding numbers of training and test samples. Figure 9 shows a three-band false color image and its corresponding already-separated training and test samples.

TABLE 5. HOUSTON: THE NUMBER OF TRAINING AND TEST SAMPLES.

CLASS NUMBER  CLASS NAME       TRAINING SAMPLES  TEST SAMPLES
1             Grass-healthy    198               1,053
2             Grass-stressed   190               1,064
3             Grass-synthetic  192               505
4             Tree             188               1,056
5             Soil             186               1,056
6             Water            182               143
7             Residential      196               1,072
8             Commercial       191               1,053
9             Road             193               1,059
10            Highway          191               1,036
11            Railway          181               1,054
12            Parking lot 1    192               1,041
13            Parking lot 2    184               285
14            Tennis court     181               247
15            Running track    187               473
Total                          2,832             12,197

ALGORITHM SETUP
In this article, two different scenarios were defined to evaluate the different approaches. In the first scenario, different percentages of the available reference data were chosen as training samples. In this scenario, only Indian Pines and Pavia University were considered. For Indian Pines, 1, 5, 10, 15, 20, and 25% of the whole sample were randomly selected as training samples, except for classes alfalfa,
FIGURE 9. The CASI Houston hyperspectral data: (a) a color composite representation of the data, using bands 70, 50, and 20 as R, G, and B, respectively; (b) training samples; (c) test samples; and (d) a legend of the different classes.
FIGURE 10. The architectures of the 1-D CNN on three data sets.
FIGURE 13. Scenario 2: classification maps for Houston data using (a) RF, (b) SVM, (c) BP, (d) KELM, (e) MLR, and (f) 1-D CNN.
Regarding the classification accuracy, it can be seen that the ELM achieves comparable results.
◗ SVM versus KELM: The computational complexity of the SVM is much higher than that of the KELM, and the KELM slightly outperforms the SVM in terms of classification accuracy. Experimental validation shows that the kernel used in the KELM and SVM is more efficient than the activation function used in the ELM.
◗ BP versus ELM versus KELM: In light of the results, it can be seen that the three versions of the SLFN provide competitive results in terms of accuracy. However, it should be noticed that both the ELM and KELM are on the order of hundreds or even thousands of times faster than the BP. In fact, the ELM and KELM have practical complexities of O(L^3 + L^2 n + (K + d)Ln) and O(2n^3 + (K + d)n^2), respectively [149].
◗ SVM versus 1-D CNN: The main advantage of 2-D and 3-D CNNs is that they use local connections to handle spatial dependencies. In this work, however, the 1-D CNN is adopted to allow a fair comparison with the other spectral approaches. In general, the SVM can obtain higher classification accuracies and work faster than the 1-D CNN, so the use of the SVM over the 1-D CNN is recommended. In terms of central processing unit (CPU) processing time, deep-learning methods are time consuming in the training step. Compared to the RBF-SVM, the training time of the 1-D deep CNN is about two or three times longer. On the other hand, the advantage of the deep CNN is that it is extremely fast in the testing stage.
◗ MLR (executed via LORSAL) versus other methods: Some of the MLR advantages are as follows: 1) It converges very fast and is relatively insensitive to parameter settings. In our experiments, we used the same settings for all data sets and obtained very competitive results in comparison with those obtained by the other methods. 2) MLR has a very low computational cost, with a practical complexity of O(d^2(K − 1)).

For illustrative purposes, Figure 11 provides a comparison of the different classifiers tested in this work on the Indian Pines and Pavia University scenes (in terms of OA). As shown by Figure 11, different classifiers provide different performances for the two considered images, indicating that there is no classifier that consistently provides the best classification results for different scenes. The stability of the different classifiers on the two considered scenes is illustrated in Figure 12, which demonstrates how stable a classifier is with respect to changes in the available training sets. Furthermore, Table 6 gives detailed information about the classification accuracies obtained by the different approaches in a different application domain, represented by the Houston data set. In this case, the optimized classifiers also perform similarly in terms of classification accuracy; so, ultimately, the choice of a given classifier is driven more by the simplicity of tuning the parameters and configurations than by the obtained classification results. This is an important observation, as it is felt that the hyperspectral community has reached a point at which many classifiers are able to provide very high classification