1. Introduction
Hyperspectral sensors collect information as a series of images, represented by hundreds of narrow and contiguous spectral bands across a wide range of the spectrum, which allows detailed spectral signatures to be identified for different imaged materials [1,2,3]. The resulting hyperspectral image (HSI) can be used to find objects, identify specific materials, and detect processes in different application fields [1,3], such as the military, agriculture, and mineralogy. Among these applications, classification is a basic problem, which aims to assign a class label to each pixel in an HSI [4]. Due to the discriminative characteristics of spectral curves, traditional HSI classification models are often based on spectral information. Typical spectral-based classifiers [2] include support vector machines (SVM), Bayesian models, random forests (RF), and artificial neural networks.
However, the intrinsic complexity of hyperspectral images usually makes these traditional methods unable to consistently provide satisfactory classification results. Compared with the large number of spectral bands, the number of labeled training samples available in practice is usually quite limited. This high-dimensionality, small-sample problem makes classification much more difficult and can lead to the Hughes phenomenon [5]. In addition, due to the effects of the acquisition conditions and the imaging mechanism, there often exist redundant or even noisy spectral bands in an HSI. By performing feature extraction, the above two problems can be alleviated to a certain extent [6,7]. One of the key problems is how to effectively extract features from an HSI. Currently, spectral–spatial features are widely used, and HSI classification performance has gradually improved from the use of only spectral features to the joint use of spectral–spatial features [8,9,10,11].
To extract spectral–spatial features, deep learning models have been introduced for the purpose of HSI classification [12,13,14,15,16,17,18,19]. The main idea of deep learning is to extract more abstract features from raw data by means of multi-layer superimposed representations [20,21,22]. Chen et al. [12] proposed the use of a stacked auto-encoder (SAE) model to extract high-level features of an HSI by using spatial–spectral joint information. Zhao et al. [16] used a stacked sparse auto-encoder to extract more abstract and deep-seated features from spectral feature sets, spatial feature sets, and spectral space vectors. Li et al. [17] introduced the deep belief network (DBN) for spectral–spatial feature extraction and classification of HSIs. Zhong et al. [18] introduced a diversity-promoting prior into the pre-training and fine-tuning of the DBN model in order to enhance the HSI classification performance. These earlier deep learning-based HSI classification models were generally based on mature deep learning frameworks, such as the SAE and DBN. SAE and DBN models can extract high-level features and usually show better classification performance than traditional methods. However, due to the full connections between layers, they require a large number of parameters to be trained [19]. In addition, they suffer from spatial information loss, as they require spatial HSI patches to be flattened into one-dimensional vectors to satisfy their input requirements. Differing from the SAE and DBN, a convolutional neural network (CNN) uses local connections to effectively extract spatial information and uses shared weights to significantly reduce the number of parameters [19]. Mei et al. [23] proposed a five-layer CNN model that fused spectral and spatial features, where these features were obtained by calculating the mean and standard deviation per spectral band of the spatial neighborhood. Yang et al. [24] proposed a two-channel CNN model, where the two channels learned features from the spectral domain and the spatial domain, respectively. Zhang et al. [25] proposed a dual-channel CNN model, where a one-dimensional CNN was utilized to automatically extract hierarchical spectral features and a two-dimensional CNN was applied to extract hierarchical space-related features. To fully use the spatial–spectral joint information of an HSI, 3D-CNN models (instead of 2D-CNNs) have been proposed for HSI classification [19,26,27]. A 3D-CNN model directly processes a 3D data cube from the original HSI, which contains the central target pixel, its spatial neighbors, and the corresponding spectral information. Therefore, it can fully capture both spatial and spectral information.
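To make the 3D input concrete, the following sketch extracts such a data cube from an HSI array; the function name, the zero-padding at scene borders, and the array layout are our illustrative assumptions, not the exact code of the cited models.

```python
import numpy as np

def extract_cube(hsi, row, col, window):
    """Extract the (window x window x bands) 3D data cube centered at a
    target pixel, as consumed by a 3D-CNN. Pixels near the scene border
    are handled by zero-padding the spatial dimensions."""
    half = window // 2
    padded = np.pad(hsi, ((half, half), (half, half), (0, 0)), mode="constant")
    # After padding, the patch starting at (row, col) is centered on the
    # original pixel (row, col).
    return padded[row:row + window, col:col + window, :]

# Usage: a toy 10 x 10 scene with 5 bands and a 7 x 7 neighborhood.
hsi = np.random.rand(10, 10, 5)
cube = extract_cube(hsi, row=0, col=0, window=7)
print(cube.shape)  # (7, 7, 5)
```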
The central building block of a CNN is the convolution operator, which enables networks to construct informative features by fusing both spatial and channel-wise information within local receptive fields at each layer [28]. In this operation, the relationships between channels should be carefully investigated [28]. From the viewpoint of feature re-calibration, a squeeze-and-excitation (SE) structure has been proposed to model the interdependencies between the channels of convolutional features [28]. The SE block contains two operations: squeeze and excitation. The squeeze operation produces a channel descriptor for global information embedding by aggregating feature maps across their spatial dimensions, and the excitation operation produces channel-specific weights. By performing feature re-calibration, an SE block can selectively emphasize informative features and suppress less useful ones. The SE block can be integrated into standard deep learning architectures, such as residual networks. A supervised spectral–spatial residual network (SSRN) has been previously proposed for HSI classification [29]. An SSRN contains spectral and spatial residual blocks, which can be used to extract finer spectral and spatial features from an HSI, and it has achieved state-of-the-art HSI classification accuracy in a wide range of applications [29]. However, the design of its spectral and spatial residual blocks has not fully taken into account the characteristics of an HSI.
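A minimal numerical sketch of the squeeze and excitation operations of [28] is given below; the random bottleneck weights stand in for the learned excitation layers, and all names and shapes are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(features, w1, w2):
    """Squeeze-and-excitation over an (H, W, C) feature map.
    Squeeze: global average pooling yields one descriptor per channel.
    Excitation: a two-layer bottleneck (ReLU, then sigmoid) produces
    per-channel weights in (0, 1), which re-calibrate the feature maps."""
    z = features.mean(axis=(0, 1))           # squeeze: shape (C,)
    s = sigmoid(np.maximum(z @ w1, 0) @ w2)  # excitation: shape (C,)
    return features * s                      # channel-wise re-weighting

# Toy example: C = 8 channels with a reduction ratio of 2 (8 -> 4 -> 8).
rng = np.random.default_rng(0)
x = rng.standard_normal((5, 5, 8))
w1 = rng.standard_normal((8, 4))
w2 = rng.standard_normal((4, 8))
y = se_block(x, w1, w2)
print(y.shape)  # (5, 5, 8)
```

The reduction ratio in the bottleneck trades capacity against parameter count; in a full network, `w1` and `w2` would be trained jointly with the convolutional layers.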
An HSI usually contains a large number of spectral bands, some of which are correlated (redundant) or even noisy, as shown in Figure 1a,b. Figure 1a shows the correlation coefficients between the different bands of the Indian Pines hyperspectral image; it can be seen that adjacent bands are highly correlated. Figure 1b shows a noisy band of Indian Pines, in which the ground objects are almost completely covered by noise. In addition, the pixels in a spatial neighborhood may be inhomogeneous, especially for boundary pixels. For each pixel, we define a spatial neighborhood centered at that pixel and compute the ratio of the number of inhomogeneous pixels (i.e., pixels whose labels differ from that of the central pixel) to the total number of pixels in the neighborhood. Figure 1c shows this ratio for each pixel. It can be clearly seen that pixels around class boundaries usually have high ratio values, which means that their spatial neighborhoods contain a large number of inhomogeneous pixels. Both the redundant or noisy bands and the inhomogeneous neighboring pixels have negative effects on classification.
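The inhomogeneity ratio described above can be computed as in the following sketch; the edge-padding at scene borders and all names are our assumptions, not the authors' exact code.

```python
import numpy as np

def inhomogeneity_ratio(labels, window):
    """For every pixel, the fraction of pixels in its window x window
    neighborhood whose label differs from the central pixel's label."""
    half = window // 2
    padded = np.pad(labels, half, mode="edge")  # replicate labels at borders
    h, w = labels.shape
    ratio = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            patch = padded[i:i + window, j:j + window]
            ratio[i, j] = np.mean(patch != labels[i, j])
    return ratio

# A 4 x 4 label map split into two homogeneous halves: only pixels near
# the class boundary see inhomogeneous neighbors.
labels = np.array([[0] * 4] * 2 + [[1] * 4] * 2)
r = inhomogeneity_ratio(labels, window=3)
print(r[0, 0], r[1, 1])  # interior of a region vs. a boundary pixel
```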
In this paper, motivated by the idea of attention mechanisms, we construct a spatial–spectral squeeze-and-excitation (SSSE) structure to adaptively learn the weights for different spectral bands and for different neighboring pixels at the same time. The SSSE structure trains the network to suppress or emphasize features at certain spectral bands or spatial positions, which can effectively overcome the redundancy in the spectral channels and the pixel inconsistency in the spatial neighborhood. Furthermore, we embed several SSSE modules into a residual network architecture to generate an SSSE-based residual network (SSSERN) model for HSI classification.
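The following sketch illustrates how spectral and spatial re-calibration can be combined. The convex combination, the choice of which coefficient extreme selects which branch, and the simplified gating (the learned excitation layers of a full SE block are omitted for brevity) are all illustrative assumptions; Equation (7) defines the exact form used in the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ssse(features, alpha=0.5):
    """Spatial-spectral squeeze-and-excitation sketch on an (H, W, C) map.
    SpectralSE: per-channel weights from pooling over spatial positions.
    SpatialSE: per-position weights from pooling over channels.
    The two re-calibrated maps are blended by the coefficient alpha."""
    spectral_w = sigmoid(features.mean(axis=(0, 1)))  # shape (C,)
    spatial_w = sigmoid(features.mean(axis=2))        # shape (H, W)
    spectral_se = features * spectral_w               # channel re-weighting
    spatial_se = features * spatial_w[..., None]      # position re-weighting
    return alpha * spatial_se + (1.0 - alpha) * spectral_se

x = np.random.default_rng(1).standard_normal((7, 7, 6))
y = ssse(x, alpha=0.5)
print(y.shape)  # (7, 7, 6)
```

With this blending, the two extremes of `alpha` recover the pure spatial and pure spectral re-calibrations, matching the ablation settings studied in Section 3.3.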
The rest of this paper is organized as follows. Section 2 introduces the residual network and the SE structure, and then describes our proposed method. The experimental results and analysis are provided in Section 3. Section 4 gives a discussion. Finally, Section 5 draws the conclusions.
3. Experimental Results
3.1. Datasets
To evaluate the performance of the proposed method in HSI classification, we use the following two benchmark hyperspectral data sets:
(1) Indian Pines: This data set was acquired by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor. The image scene contains 145 × 145 pixels and 220 spectral bands, covering the range of 0.4–2.5 μm, in which 20 bands were discarded because of atmospheric water absorption. The spatial resolution of the Indian Pines data is 20 m. There are 16 classes in the data, as shown in Figure 5. The number of samples in each class is shown in Table 2.
(2) University of Pavia: This data set was acquired by the Reflective Optics System Imaging Spectrometer (ROSIS) sensor. The ROSIS sensor generates 115 bands, ranging from 0.43–0.86 μm, in which 12 noisy bands were deleted and the remaining 103 bands were used for the experimental analysis. The spatial resolution is 1.3 m. The scene has a size of 610 × 340 pixels and contains 9 ground categories, as shown in Figure 6. The number of samples in each class is shown in Table 3.
3.2. Classification Performance on Indian Pines and University of Pavia Data Sets
In this paper, the TensorFlow deep learning framework was used to build and train the proposed SSSERN. We compare the proposed method with six available classification methods from the literature: (1) a Support Vector Machine (SVM) with a radial basis function kernel; (2) Random Forest (RF); (3) a Multi-Layer Perceptron (MLP); (4) 2D-CNN [25]; (5) 3D-CNN [12]; and (6) SSRN [29]. Among these methods, SVM, RF, and MLP are spectral classifiers, while 2D-CNN can be considered a spatial method, which uses PCA to reduce the dimensionality of the hyperspectral data and extracts only one principal component. Finally, 3D-CNN, SSRN, and the proposed SSSERN are spatial–spectral methods.
In the experiments, we randomly selected 15% of the samples from each class to form the training set, and the test set consisted of the remaining samples. The experiment was repeated five times with randomly chosen training samples, and the results of the five runs were averaged. The class accuracy (CA), overall accuracy (OA), average accuracy (AA), and kappa coefficient (κ) on the testing set were recorded to assess the performance of the different classification methods. The same neighborhood window size was used in 2D-CNN, 3D-CNN, and our proposed algorithm. The classification results on the two data sets are shown in Table 4 and Table 5, respectively.
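The reported metrics follow their standard definitions; the following self-contained sketch computes OA, AA, and κ from a confusion matrix (variable names are ours).

```python
import numpy as np

def classification_scores(y_true, y_pred, n_classes):
    """Overall accuracy (OA), average (per-class) accuracy (AA), and the
    kappa coefficient, all computed from a confusion matrix."""
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    total = cm.sum()
    oa = np.trace(cm) / total                 # fraction of correct labels
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))  # mean of class accuracies
    # Kappa corrects OA for chance agreement pe between the marginals.
    pe = (cm.sum(axis=0) @ cm.sum(axis=1)) / total**2
    kappa = (oa - pe) / (1.0 - pe)
    return oa, aa, kappa

# Toy example with three classes.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
oa, aa, kappa = classification_scores(y_true, y_pred, 3)
print(round(oa, 3), round(aa, 3))  # 0.833 0.889
```

Note that AA weights each class equally, so it is more sensitive than OA to errors in the small classes of Indian Pines.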
From the classification results, we can see that:
(1) The proposed SSSERN provided the best classification results on the two data sets.
(2) By jointly using the spectral and spatial information in a deep network architecture, the spatial–spectral methods (i.e., 3D-CNN, SSRN, and the proposed SSSERN) dramatically improved on the spectral-based and spatial-based methods.
(3) Compared with existing deep learning methods (i.e., 2D-CNN, 3D-CNN and SSRN), the proposed SSSERN showed better results. This demonstrates that the proposed SSSE structure can extract much more effective spectral–spatial features by highlighting important spectral bands or neighboring pixels and suppressing noisy spectral bands or dissimilar neighboring pixels.
Figure 7 and Figure 8 show the classification maps of SVM, RF, MLP, 2D-CNN, 3D-CNN, SSRN, and our proposed SSSERN on the Indian Pines and University of Pavia data sets, respectively. The spectral-based classifiers, such as SVM and RF, generated noisy classification maps, because they only considered isolated spectral samples and did not use spatial information to enhance the spatial neighborhood consistency. The spatial–spectral classifiers (i.e., 3D-CNN, SSRN, and SSSERN) provided much better results than the spectral classifiers and generated maps with little noise and clear object boundaries. Among all the methods, our proposed SSSERN achieved the classification map closest to the actual ground truth; that is to say, the class boundaries were better defined and the background pixels were better classified.
3.3. Investigation on the Effect of Network Parameters
Now, we investigate the effect of the network parameters on the classification performance of SSSERN. The parameters considered are the width of the input feature window, the combination coefficient, and the number of residual blocks: the window width controls the size of the input features, the combination coefficient indicates the ratio of SpatialSE to SpectralSE, and the number of residual blocks determines the depth of the network. We also investigate the effect of the number of training samples, where 5% and 15% of the samples from each class in Indian Pines were chosen for training.
We first fix the combination coefficient and the number of residual blocks, and study the effect of the window width. Six different widths (3, 5, 7, 9, 11, and 13) were considered. The corresponding OA values of SSSERN, in the cases of 5% and 15% training samples, are shown in Figure 9. It can be clearly seen that the OA of SSSERN increased rapidly as the window width increased and then became relatively stable for the larger windows. The optimal widths were 9 and 11 for the 5% and 15% training samples, respectively. The optimal width was used in the subsequent experiments.
Next, we investigate the effect of the combination coefficient. From Equation (7), at one extreme value of the coefficient, the SSSE module reduces to SpatialSE; at the other extreme, it reduces to SpectralSE; and, when the coefficient equals 0.5, SpatialSE and SpectralSE have the same importance in the SSSE. For simplicity, we only considered these three values of the coefficient (i.e., 0, 1, and 0.5). The OA of SSSERN under the different coefficient values is shown in Figure 10, where SpectralSE, SpatialSE, and SSSE correspond to these three settings, respectively. It can be seen that the SSSE module, which combines SpatialSE and SpectralSE, provided the best results.
To further investigate the effectiveness of SSSE, we show the results of SSSERN with and without the SSSE modules. As shown in Figure 4, the SSSE module is attached to the residual block (resBlock). When the SSSE modules are deleted, SSSERN reduces to a general residual network. Figure 11 shows the OA of SSSERN with and without the SSSE modules. It can be clearly seen that the SSSE modules were more effective than the traditional residual modules, and that the optimal number of SSSE blocks was either 3 or 4.
3.4. Investigation of the Stimulus Values Produced by the SSSE Structure
Although the previous experiments have proven the effectiveness of the SSSE blocks in improving network performance, we also wish to understand how the automatic gating incentive mechanism works in practice. In this subsection, to show the behavior of the SSSE structure more clearly, we study the activation outputs of individual samples in the model and examine their distributions for different classes in the different residual modules. Specifically, we chose six different classes from the Indian Pines data set (Classes 1, 3, 4, 11, 14, and 15), selected 50 samples from each class, and then calculated the average SSSE module output of these samples in the different layers.
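The per-class averaging step can be sketched as follows; the array layout (one row of gate outputs per sample) and all names are hypothetical.

```python
import numpy as np

def class_mean_activations(activations, labels, classes, n_per_class=50):
    """Average gate outputs over a fixed number of samples per class.
    activations: (n_samples, n_channels) array of SSSE outputs for one layer.
    Returns a dict mapping each class to its mean activation vector."""
    means = {}
    for c in classes:
        idx = np.where(labels == c)[0][:n_per_class]  # first n samples of c
        means[c] = activations[idx].mean(axis=0)
    return means

# Toy data: 3 classes, 12-channel gate outputs for 300 samples.
rng = np.random.default_rng(2)
acts = rng.random((300, 12))
labs = rng.integers(0, 3, size=300)
m = class_mean_activations(acts, labs, classes=[0, 1, 2], n_per_class=50)
print(m[0].shape)  # (12,)
```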
As the activation value in the SSSE structure is composed of two parts, namely the stimulus values in the spectral and spatial dimensions, the visualization results of these two parts are shown below. Figure 12 shows the averaged spectral-dimension stimulus values for each class. It can be seen that the different classes of samples had different stimulus values for each channel in each SSSE structure. In the third SSSE structure, Classes 1, 3, 4, and 14 showed synchronized suppression effects at the 36th channel, which suggests that the spectral characteristics of these classes are similar in this channel.
Figure 13 shows the activation values of the six classes in the spatial dimensions of the different SSSE layers. In the figure, brighter regions correspond to higher activation values. It can be seen that the features were almost always activated at the center position, while the positions around the boundary were suppressed. For a large window, the boundary pixels may be background pixels or pixels from different classes; moreover, as they are far from the central pixel, they are less important. By suppressing these boundary pixels, the SSSERN model can obtain better results.