
1 Introduction

A hyperspectral image contains hundreds of bands ranging from the visible to the near-infrared wavelengths [8]. In other words, each pixel in the image captures hundreds of narrow spectral bands from the same area on the Earth. Hyperspectral remote sensing plays an important role in the national economy and national defense, and has been widely used in classification, target detection, agricultural monitoring, and mineral mapping [13].

There are many problems in the practical use of hyperspectral data. The most important one is the great redundancy among bands [4]. A hyperspectral image often contains hundreds of bands, and the rich spectral information can enhance the ability to distinguish objects. However, for a specific task, such as classifying crop types, it may not be necessary to analyze the objects with all the bands. On the contrary, bands irrelevant to the target object may weaken the classification accuracy [20]. Compared with the high dimensionality of the bands, the ground truth of HSI is often very limited, since labeling samples is time-consuming and expensive [2]. It is a well-known conclusion that the generalization ability of a classifier is weakened if the feature dimensionality is too high relative to the limited training samples [4, 8]. Therefore, it is necessary to apply dimensionality reduction methods to hyperspectral data.

Dimensionality reduction methods mostly fall into two categories: feature extraction and band selection. Feature extraction based methods project the hyperspectral data into a lower-dimensional space so as to obtain an abstract representation of the data [16, 21]. Representative directions include principal component analysis [18], wavelet transformation [10], independent component analysis [6], and linear discriminant analysis [9]. Although these methods can achieve relatively satisfactory results, they also have some serious problems. First, they have high time complexity [2]. Second, the extracted features may lose the physical meaning needed for interpretation [19].

Band selection based dimensionality reduction methods aim to select discriminative bands for a specific task. The selected bands should be able to represent the whole hyperspectral data with little loss of effectiveness [17]. Band selection obviously has the advantage of preserving the original physical information, compared with feature extraction methods. Therefore, we seek a band selection method for hyperspectral tasks such as HSI classification.

Band selection methods can be divided into unsupervised and supervised according to whether training samples are available [17]. Numerous studies have focused on unsupervised band selection. These methods are versatile: they comprehensively consider the distribution of all ground categories in the image and retain an optimal subset of the original bands [3]. However, they are not very effective for a specific task such as vegetation classification, since not all selected bands help distinguish vegetation categories. Supervised band selection algorithms often perform better because of the introduction of prior information, but they still have some problems. Traditional methods operate on the entire image without taking the differences among ground categories into account, and cannot determine which band is discriminative for a specific category. Another disadvantage is that they cannot flexibly select bands according to the ground categories required by different tasks. Moreover, traditional supervised methods are not intuitive or interpretable: they cannot explain the specific role of a band.

To alleviate the problems mentioned above, we propose a convolutional neural network that generates contribution maps (CM-CNN). Owing to their excellent feature extraction capabilities, convolutional neural networks are widely used in hyperspectral image classification. Recent research on CNNs shows their ability to localize discriminative image regions in natural image processing [23]. We apply this idea to hyperspectral image classification and try to locate discriminative bands through simple spectral classification. In this paper, the structure of the CNN is changed slightly: we propose a contribution map by replacing the fully-connected layer with pooled features, in order to record the locations of discriminative bands while classifying at the same time. The main contributions of this paper are as follows: (1) We apply neural network interpretability methods to hyperspectral tasks for the first time and achieve remarkable results. (2) The proposed model is able to locate discriminative bands for a specific category and show the contribution of all the bands. (3) The proposed model can be transferred to other hyperspectral tasks such as anomaly detection.

2 Related Work

Band selection is considered an efficient dimensionality reduction method for HSI. Its purpose is to obtain a subset of bands that represents the most discriminative information of the HSI. Band selection reduces the amount of data and speeds up processing at the same time. Supervised and unsupervised methods are the most commonly used approaches [17]. The difference is that supervised methods need training samples with ground truth, while unsupervised methods require no training or learning.

This paper mainly discusses supervised methods, since they perform better when prior knowledge is available. Existing supervised methods are mostly based on classification. Many supervised algorithms have been proposed, including methods based on local spatial information [22], the hypergraph model [1], the support vector machine (SVM) [14], the non-homogeneous hidden Markov chain [7], and the sparse independence criterion [5]. However, these supervised methods still have some problems. The result of band selection depends on the accuracy and even distribution of the labels. Another shortcoming is that traditional supervised methods are not intuitive or interpretable: they cannot determine which band is discriminative for a specific ground category.

With the development of machine learning and computer vision, deep learning methods have become mainstream in computer science [15, 24]. Applying deep learning to hyperspectral images is therefore an inevitable direction. Recently, convolutional neural networks have been widely applied in the HSI field [8]. The CNN is a special type of feedforward neural network that can automatically learn high-level semantic features from samples [11]. A 3D CNN model is proposed for HSI classification in [4]. Spectral-spatial CNN models also achieve satisfying results [13]. CNNs can even be applied to hyperspectral image reconstruction [12] and are effective for band selection [19]. As far as we know, however, deep learning methods for band selection remain quite limited. Since the neural network is a black box, the interpretability of such methods needs to be improved. Some works find that a CNN can, to some extent, locate the features that activate the classification result [23]. The advantage of deep learning methods is that they can extract more abstract high-level semantic features, which are thought to reflect the intrinsic meaning of the input [4, 11]. We therefore try to extract the location information of input bands with a CNN.

3 Proposed Framework

We propose a CNN model that can select bands specifically for each category while classifying. First, we introduce the specific process of 1D CNN classification using spectral information. Then, we put forward the contribution map, which retains the location information of the bands used in classification. Finally, we introduce how to select bands based on the contribution map.

3.1 Spectral Based Classification

The ultimate goal of hyperspectral image classification is to assign a unique category to each pixel of an image that contains hundreds of bands. In this paper, we adopt a typical 1D CNN model for classification that uses only spectral information.

Fig. 1. A typical 1D CNN model for spectral-based classification.

Since each pixel is a one-dimensional vector containing hundreds of bands, we build a 1D CNN model. The network structure is shown in Fig. 1. The 1D CNN consists of 5 convolution layers, 2 pooling layers, and 1 fully-connected layer. The output of the last convolution layer is reshaped into a one-dimensional vector, which then goes through a fully-connected layer to reduce its dimensionality. Finally, we use the softmax function to compute the probability of each category for the pixel.
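To make the pipeline concrete, the following is a minimal PyTorch sketch of such a network. The channel widths and padding are illustrative assumptions rather than the exact configuration, which is listed per data set in Tables 1 and 2.

```python
import torch
import torch.nn as nn

class Spectral1DCNN(nn.Module):
    """Sketch of the Fig. 1 baseline: five convolution layers, two
    average-pooling layers, and one fully-connected layer. Channel
    widths are illustrative assumptions, not the paper's settings."""
    def __init__(self, n_classes=16):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=4, stride=1, padding=2), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=4, stride=1, padding=2), nn.ReLU(),
            nn.AvgPool1d(2),
            nn.Conv1d(32, 32, kernel_size=4, stride=1, padding=2), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=4, stride=1, padding=2), nn.ReLU(),
            nn.AvgPool1d(2),
            nn.Conv1d(64, 64, kernel_size=4, stride=1, padding=2), nn.ReLU(),
        )
        self.classifier = nn.LazyLinear(n_classes)  # FC layer after flattening

    def forward(self, x):           # x: (batch, 1, n_bands)
        f = self.features(x)
        f = f.flatten(1)            # reshape to a one-dimensional vector
        return self.classifier(f)   # logits; softmax is applied in the loss
```

The returned logits would be fed to a cross-entropy loss, which applies the softmax internally.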

The main units of the 1D CNN are the convolution layer, the pooling layer, and the nonlinear activation function. The formula for the convolution operation is

$$\begin{aligned} x^{k}_{j} = g(\sum \limits ^{N}_{i=1}w^{k}_{ij}x^{k-1}_{i}+b^{k}_{j}), \end{aligned}$$
(1)

where \(x^{k-1}_{i}\) refers to the i-th feature map in the (\(k-1\))-th layer, \(x^{k}_{j}\) refers to the j-th feature map in the current k-th layer, and N refers to the number of input channels. \(w^{k}_{ij}\) and \(b^{k}_{j}\) are the trainable parameters of the convolution operation, and g denotes the activation function.

The output of the convolution layer is subjected to a non-linear operation through the activation function. Typical activation functions include the sigmoid, the rectified linear unit (ReLU), and tanh. ReLU is adopted in our model for its excellent performance in training. ReLU works as follows:

$$\begin{aligned} g(x)=ReLU(x)= {\left\{ \begin{array}{ll} 0&{} \text{ if } x\le 0,\\ x&{} \text{ if } x>0. \end{array}\right. } \end{aligned}$$
(2)

Max pooling and average pooling are the most widely used downsampling methods. Max pooling is thought to be robust to rotation and slight translation. However, the spectral dimension of hyperspectral images does not exhibit geometric distortion, so the pooling layers in our model adopt average pooling, which can also reduce the impact of noise to a certain extent. The method of average pooling is

$$\begin{aligned} x_{k}=\text {average}(x_{k-1})=\frac{1}{n}\sum \limits ^{n}_{i=1}x^{k-1}_{i}, \end{aligned}$$
(3)

where \(x^{k-1}_{i}\) refers to the output of the convolution layer, n refers to the size of the pooling window, and \(x_{k}\) refers to the output of the pooling layer.

The result of the CNN depends on the weight parameters of the network, so it is vital to find proper weights. In our model, the parameters are randomly initialized and trained by the back-propagation algorithm, which is the core training algorithm for all sorts of neural networks [23]. Our model adopts a mini-batch update strategy, which is efficient for processing large data sets. The cost is computed on a mini-batch of inputs

$$\begin{aligned} c=-\frac{1}{m}\sum \limits ^{m}_{i=1}[y_{i}^{label}\log (y_{i})+(1-y_{i}^{label})\log (1-y_{i})], \end{aligned}$$
(4)

where m is the size of the mini-batch, i.e., the number of samples per optimization step in training, and c refers to the total error of the cost function over a mini-batch. \(y_{i}\) is the output of the network for each sample, and \(y_{i}^{label}\) refers to the label of each sample. Since this is a multi-class problem, we adopt one-hot encoding for the labels.
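A minimal NumPy sketch of this cost, assuming the network outputs per-class probabilities and the labels are one-hot vectors:

```python
import numpy as np

def minibatch_cost(y_pred, y_label):
    """Mini-batch cross-entropy cost of Eq. (4).
    y_pred:  (m, n_classes) network outputs (probabilities).
    y_label: (m, n_classes) one-hot labels.
    eps guards against log(0)."""
    eps = 1e-12
    m = y_pred.shape[0]
    return -np.sum(y_label * np.log(y_pred + eps)
                   + (1 - y_label) * np.log(1 - y_pred + eps)) / m
```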

To maximize accuracy, a learning rate decay schedule is added to our training process. As the number of training epochs increases, the learning rate slowly declines as follows:

$$\begin{aligned} lr=lr_{min}+(lr_{max}-lr_{min})\exp (-\frac{i}{P}), \end{aligned}$$
(5)

where \(lr_{max}\) is the initial learning rate of the network and \(lr_{min}\) is the final learning rate. i represents the current training epoch, where an epoch means one complete pass over all the training data. P is a constant that controls the rate of decline of the learning rate.
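A one-function sketch of the schedule; the decay constant P is an assumption here, since only the initial and final rates are reported in Sect. 4:

```python
import math

def learning_rate(epoch, lr_max=0.003, lr_min=0.0001, P=100.0):
    """Exponential decay schedule of Eq. (5). lr_max and lr_min follow
    the Indian Pines settings of Sect. 4.2; the decay constant P is not
    reported in the paper, so 100 is an assumption."""
    return lr_min + (lr_max - lr_min) * math.exp(-epoch / P)
```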

3.2 Contribution Map

In this section, we describe the procedure of generating the contribution map by replacing the fully-connected layer of the CNN with global average pooling. A contribution map indicates the discriminative spectral regions used by the CNN to determine the category. Figure 2 shows the procedure of generating these maps for a particular category.

Fig. 2. CM-CNN directly uses the global average pooling of the feature maps for training. Right after the training step, the contribution map is obtained as a weighted sum of the feature maps.

The proposed method exploits the ability of CNNs to localize discriminative image regions, observed in natural image processing [23]. We apply this idea to hyperspectral image classification and try to locate the discriminative bands that play an important role in the classification result. Before the final output layer, our model performs global average pooling on the feature maps of the last convolution layer. The pooled vector is then classified through a softmax function. In this way, we obtain the weight of each feature map, which represents the contribution of that feature to the final classification. Finally, we compute a weighted sum of the feature maps of the last convolution layer to get our contribution map.

The specific operation is as follows. First, let \(f_{k}(x)\) denote the value of the k-th channel (feature map) of the last convolution layer at spatial position x, with \(k=1,\dots ,K\), where K is the number of channels. Then, global average pooling is performed on the k-th channel: let \(F_{k}=\sum \limits _{x}f_{k}(x)\), and \(S_{c}=\sum \limits _{k=1}^{K}w_{k}^{c}F_{k}\). For a given category c, \(w_{k}^{c}\) is the weight connecting channel k to category c; essentially, \(w_{k}^{c}\) reflects the importance of \(F_{k}\) for category c. In the end, the output of the softmax is calculated by

$$\begin{aligned} p_{c}=\frac{\exp (S_{c})}{\sum _{c'}\exp (S_{c'})}. \end{aligned}$$
(6)

A little trick is adopted: substituting \(F_{k}=\sum \limits _{x}f_{k}(x)\) into \(S_{c}=\sum \limits _{k=1}^{K}w_{k}^{c}F_{k}\), \(S_{c}\) can be rewritten as

$$\begin{aligned} S_{c}=\sum \limits _{k=1}^{K}w_{k}^{c}\sum \limits _{x}f_{k}(x)=\sum \limits _{x}\sum \limits _{k=1}^{K}w_{k}^{c}f_{k}(x). \end{aligned}$$
(7)

We define \(CM_{c}\) as the contribution map for category c; \(CM_{c}(x)\) represents the contribution of position x in the feature maps to the classification. It is defined as follows:

$$\begin{aligned} CM_{c}(x)=\sum \limits _{k=1}^{K}w_{k}^{c}f_{k}(x). \end{aligned}$$
(8)

We name this modified network CM-CNN, a convolutional neural network that can generate contribution maps. During training, \(F_{k}\) participates in the classification through the softmax function. Right after the training step, we calculate \(CM_{c}(x)=\sum \limits _{k=1}^{K}w_{k}^{c}f_{k}(x)\) to obtain the contribution map. Every time a sample passes through the network, we obtain a contribution map.
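The following sketch shows both halves of this procedure in PyTorch: a GAP classification head whose linear weights play the role of \(w_{k}^{c}\), and the post-training computation of \(CM_{c}\). Tensor shapes and names are our own illustrative choices.

```python
import torch
import torch.nn as nn

class GAPHead(nn.Module):
    """Classification head of CM-CNN: global average pooling over each
    of the K feature maps of the last convolution layer, followed by a
    linear layer whose weights w_k^c are reused for the contribution map."""
    def __init__(self, n_channels, n_classes):
        super().__init__()
        self.fc = nn.Linear(n_channels, n_classes, bias=False)

    def forward(self, feat):          # feat: (batch, K, length)
        pooled = feat.mean(dim=2)     # F_k (mean rather than sum; see note below)
        return self.fc(pooled)        # scores S_c = sum_k w_k^c F_k

def contribution_map(feat, head, category):
    """CM_c of Eq. (8) for one sample: weighted sum of the feature maps
    feat (K, length) using the weights of the given category."""
    w_c = head.fc.weight[category]    # (K,) weights w_k^c
    return torch.einsum('k,kl->l', w_c, feat)
```

Pooling by mean rather than sum only rescales every \(F_{k}\) by the same constant, so it does not change which category receives the highest score.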

3.3 Band Selection

Once the training of our CM-CNN is completed, each training sample passed through the network yields a corresponding contribution map. We assume that it roughly reflects the contribution of each band to the final classification result; our purpose is to show that this assumption is credible. Figure 3 shows an example contribution map.

Fig. 3. An example contribution map from Indian Pines, belonging to the category Wheat. Bands in highlighted areas contribute strongly to the classification.

For a particular category, we average the contribution maps of all its training samples to get the final contribution map of that category. The number of bands selected for each category is determined by the total number of bands to be selected. Two algorithms are proposed, taking into account the efficiency and accuracy of band selection. The first method ignores the non-uniform distribution of samples. Assume that the number of all bands is N, the number of categories is l, and M bands need to be selected. For each category, we choose the top M/l bands ranked by their values in the contribution map. Considering that hyperspectral bands are highly correlated, if a newly selected band is adjacent to a previously selected one, we abandon it and select the next band in rank order, until the number of bands meets the requirement. The second method takes the uneven distribution of the samples into account; we recommend it especially for classification tasks. The number of bands selected for a category with many samples should be appropriately reduced, so as to leave more bands for the categories that are difficult to classify. Assume that the number of samples of category c is \(H_{c}\); a min-max normalization is applied to \(\log (H_{c})\) over all categories, giving \(f(H_{c})\) as follows:

$$\begin{aligned} f(H_{c})=\frac{\log (H_{c})-\min (\log (H_{c}))}{\max (\log (H_{c}))-\min (\log (H_{c}))}, \end{aligned}$$
(9)

and then the number of bands to be selected for category c is determined by

$$\begin{aligned} N_{c}=\frac{1-f(H_{c})}{\sum \limits _{c'=1}^{l}(1-f(H_{c'}))}*M, \end{aligned}$$
(10)

so that the per-category budgets sum to the total budget M. The rule for selecting a fixed number of bands from the contribution map is the same as in the first method: for each category, we choose the top \(N_{c}\) bands ranked by their values in the contribution map, and if a newly selected band is adjacent to a previously selected one, we abandon it and select the next band in rank order until the number of bands meets the requirement. A sketch of both selection rules is given below.
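In this sketch, each category's averaged contribution map is a length-N NumPy vector; the adjacency threshold min_gap and the rounding of \(N_{c}\) are our assumptions.

```python
import numpy as np

def select_bands(cm, n_select, min_gap=2):
    """Top-ranked band selection with adjacency suppression: walk down
    the contribution ranking and skip any band that lies within min_gap
    of an already chosen band."""
    order = np.argsort(cm)[::-1]       # bands ranked by contribution value
    chosen = []
    for band in order:
        if all(abs(int(band) - b) >= min_gap for b in chosen):
            chosen.append(int(band))
        if len(chosen) == n_select:
            break
    return sorted(chosen)

def bands_per_category(H, M):
    """Per-category budgets of Eqs. (9)-(10): categories with more
    samples H_c receive fewer of the M bands to be selected."""
    logH = np.log(np.asarray(H, dtype=float))
    rng = logH.max() - logH.min()
    f = (logH - logH.min()) / rng if rng > 0 else np.zeros_like(logH)
    w = (1.0 - f) / (1.0 - f).sum()    # weights proportional to 1 - f(H_c)
    # rounding may leave the total slightly off M; adjust if an exact
    # budget is required
    return np.maximum(1, np.round(w * M)).astype(int)
```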

4 Experiment

4.1 Visualization

The interpretability of neural network models has always been an important topic in natural image processing. Visualization can help us gain intuitive knowledge and improve our understanding of the model. We therefore visualize the contribution map to examine whether the location distribution of the features is regular. Figure 4 shows the contribution maps of a few samples randomly selected from the first category (Alfalfa) of the Indian Pines training data. As we can see, for samples of the same category, the activation regions of the contribution maps are almost the same. We then aggregate all the contribution maps for each category of Indian Pines, as shown in Fig. 5. It can be seen that different categories correspond to different activation regions.

Fig. 4. Contribution maps of Alfalfa, a category in Indian Pines.

Fig. 5. Contribution maps of all 16 categories of Indian Pines. Bands in highlighted areas contribute strongly to the classification.

4.2 Indian Pines

The Indian Pines data set was acquired by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS). The image size is 145 \(\times \) 145 pixels with 220 spectral bands in the range of 0.4–2.5 \(\upmu \)m. The number of bands is reduced to 200 by discarding the water absorption bands. There are 16 categories in total in the Indian Pines data set.

For Indian Pines, Table 1 shows the architecture of the CM-CNN used in the experiment; \(C_{i}\) denotes the i-th convolution layer. There are a total of 10,249 labeled samples in Indian Pines, of which 7,168 are used for training and 3,081 for testing. The number of training epochs is set to 300. As the number of epochs increases, the learning rate slowly declines: the initial learning rate is set to 0.003, and the final learning rate decays to 0.0001. We found that average pooling always performs better than max pooling, and that a smaller kernel size significantly increases training speed but lowers accuracy, so we choose a kernel size of 4 for both speed and accuracy. In order to preserve location information, the stride is mostly set to 1 to keep the output size the same as the input. Once the training of CM-CNN is over, we obtain the contribution maps and perform band selection according to the method in Sect. 3.3.

Table 1. Architecture of the CM-CNN on Indian Pines.

We compared two supervised and two unsupervised band selection methods with our CM-CNN. The selected bands are classified by an SVM classifier with an RBF kernel to verify their validity. As shown in Fig. 6, CM-CNN outperforms all these methods when the number of selected bands is larger than 35. The wrapper-based SVM (SVMwrapper) method and the minimal redundancy maximal relevance (mRMR) method are supervised [20]; CEM-BCM and ordinary-clustering-based band selection (OCBBS) are unsupervised [17]. When the number of selected bands is small, the performance of our method is not ideal. We attribute this to the location information of the contribution map not being precise enough: band correlation makes adjacent bands likely to be selected at the same time. However, the experiment proves that CM-CNN is able to extract location information and achieves the best performance when the number of selected bands is larger than 35.

Fig. 6. Classification results on Indian Pines.
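A sketch of this evaluation protocol with scikit-learn; the SVM hyperparameters are assumptions, as the paper does not report them.

```python
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def evaluate_selected_bands(X_train, y_train, X_test, y_test, bands):
    """Classify pixels restricted to the selected bands with an
    RBF-kernel SVM and return overall accuracy.
    X_*: (n_samples, n_bands) pixel spectra; bands: selected indices."""
    clf = SVC(kernel='rbf', C=100.0, gamma='scale')  # assumed settings
    clf.fit(X_train[:, bands], y_train)
    return accuracy_score(y_test, clf.predict(X_test[:, bands]))
```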

4.3 Pavia University

The Pavia University data set was acquired by the Reflective Optics Spectrographic Imaging System (ROSIS-03) airborne instrument. The image size is 640 \(\times \) 340 pixels with 115 spectral bands in the range of 0.43–0.86 \(\upmu \)m. The number of bands is reduced to 103 by removing noisy bands. The spatial resolution is 1.3 m per pixel. There are 9 categories in total in the Pavia University data set.

For Pavia University, Table 2 shows the architecture of the CM-CNN used in the experiment. As for Indian Pines, \(C_{i}\) denotes the i-th convolution layer. There are a total of 42,776 labeled samples in Pavia University, of which 29,943 are used for training and 12,833 for testing. The number of training epochs is set to 400. The initial learning rate is set to 0.008, and the final learning rate decays to 0.0005. The rest of the procedure is the same as for Indian Pines.

Table 2. Architecture of the CM-CNN on Pavia University.
Fig. 7. Classification results on Pavia University.

We compared two supervised and one unsupervised band selection method with our CM-CNN, as shown in Fig. 7. The selected bands are classified by an SVM classifier with an RBF kernel to verify their validity. SVMCV is a wrapper-based supervised method, and the minimal redundancy maximal relevance (mRMR) method is also supervised [20]; CEM-BCM is an unsupervised method that has been widely used for comparison. CM-CNN performs as well as the two supervised methods. The results prove the effectiveness of our method in locating discriminative bands.

5 Conclusion

In this paper, a CNN-based supervised band selection model (CM-CNN) is proposed. The main task is to choose the most discriminative bands so as to effectively represent the original image cube. The bands selected by our method prove to perform well on HSI classification, and the contribution maps prove to be sensitive to discriminative bands. The approach in this paper is instructive for many hyperspectral tasks.