1. Introduction
With the growth of remote sensing technology, hyperspectral imagery (HSI), which provides both spatial information and abundant spectral information [1,2], has been widely employed in various applications such as mineral exploration, ground object identification, agricultural survey, and geological monitoring. In these applications, pixel-level classification is a commonly used technique, which is crucial for both low-level HSI processing and high-level HSI understanding.
Plenty of methods have been proposed for HSI classification. According to the features used to represent the pixels of an HSI, they can roughly be divided into two categories, namely handcrafted-feature-based methods and deep-learning-feature-based methods. In previous years, handcrafted-feature-based methods for HSI classification made promising progress. Nevertheless, handcrafted-feature-based methods [3,4,5,6,7,8,9] require various kinds of domain knowledge in order to extract appropriate features for the subsequent classification step. More importantly, handcrafted features often exhibit a shallow structure and are thus insufficient to represent the complicated structures encountered in challenging HSI classification problems. Recently, deep-learning-feature-based methods, which learn features from low level to high level with a deep hierarchical structure, have been extensively investigated. Compared with handcrafted features, the learned deep features often show better nonlinear representation ability for the original images. Therefore, numerous deep-learning-feature-based methods have been developed for HSI classification [10,11,12].
The deep belief network (DBN) [13] and the stacked auto-encoder (SAE) [14] are two widely used unsupervised deep learning methods. Given the learned deep features, an appropriate classifier can be further trained to perform spectral-spatial classification of hyperspectral data [11,15]. Different from SAE and DBN, convolutional neural networks (CNNs) learn deep features from extensive labeled data and have shown their advantage in traditional image classification problems. To date, many deep CNN frameworks have been developed for RGB image classification, e.g., AlexNet, VGGNet, GoogLeNet and ResNet [16,17,18,19,20,21]. Some of them have been employed for HSI classification [22,23] and achieved promising results. Nevertheless, methods based on traditional CNNs require intricate network structures and extensive time for network training [16,17,18,19,20,21].
To mitigate this problem, some recently proposed CNN methods employ pre-learned convolutional kernels to reduce both the computational cost and the number of training examples required to update the convolutional kernels. These pre-learned-kernel-based methods include PCA-Net [24], MCFSFDP-Net [25], Random Net and K-means Net [26]. For example, in [27], an approach for per-pixel classification of satellite data using CNNs is proposed. It first pre-learns the CNN kernels with an unsupervised clustering algorithm, e.g., the K-means algorithm. Given those pre-learned kernels, only the classifier is trained with the back-propagation scheme for per-pixel classification, object recognition and segmentation [26,27]. In [24], the principal component analysis (PCA) algorithm is combined with a support vector machine (SVM) classifier to build a network for image classification. Nevertheless, the dimension-reduction parameter of the PCA algorithm needs to be determined empirically. In [25], a modified clustering by fast search and find of density peaks (MCFSFDP) method is proposed to determine the number of pre-learned kernels data-adaptively. In both [25,26], the pre-learned convolutional kernels of the CNN framework are generated by clustering patches that are randomly extracted from the original image. However, an important parameter, the kernel size, still has to be determined empirically. How to determine the size of the convolutional kernels data-adaptively in pre-learned-kernel-based CNN frameworks has seldom been studied in recent research. In recent years, some methods based on large-scale computation for adaptively determining the CNN architecture have emerged. In [28], a shape-driven kernel adaptation method for convolutional neural networks is proposed to explore how shape information can be explicitly deployed into the popular CNN architecture to disentangle irrelevant non-rigid appearance variations in recognition tasks. In essence, this method adopts different adaptation functions to replace the single function in the CNN architecture. In addition, acquiring the optimal network parameters is also a difficult task. In [29], an evolutionary-algorithm (EA) based method is proposed to solve the problem of kernel size design in CNN frameworks. However, this method requires a large amount of computation and many hyper-parameters.
Since the convolutional kernels in both K-means Net and MCFSFDP-Net are learned through a clustering algorithm, the classification results rely on the quality of the clustering: better clustering results (i.e., better pre-learned kernels in the pre-learned CNN framework) yield better features. The evaluation indicators of clustering results can be divided into two categories, those for samples without labels and those for samples with given labels [30,31,32,33]. The first category, which requires no labels, includes Compactness (CP), Separation (SP), the Davies-Bouldin Index (DBI) and the Dunn Validity Index (DVI). CP is the average distance between the points in one class and the center of the same class, so it only considers the inner-class effect. SP is the average distance between two different cluster centers, so it only considers inter-class information. The DBI, because it uses the Euclidean distance, is not suitable for evaluating samples with a circular distribution. The DVI works well for discrete points but also performs poorly for samples with a circular distribution. The other category of indicators, such as Cluster Accuracy (CA), the Rand Index (RI) and Normalized Mutual Information (NMI), evaluates clustering results against given labels and is therefore only applicable to supervised settings; however, most clustering problems are unsupervised.
In [26], K-means Net utilizes the K-means algorithm to learn kernels; this is an unsupervised clustering method based on the distances between sample points. For this reason, the evaluation indicator for measuring the clustering results should be relevant to either the inter-class or the inner-class distance. Owing to the different kernel sizes, the same clustering sample at different sizes occupies a different location when projected onto the 2-D plane, and the number of samples in each class is always different, which makes this a problem of samples with a non-uniform distribution. For this evaluation task, traditional evaluation indicators based on either inter-class or inner-class distance alone are not suitable. To better handle this evaluation problem, a more practical evaluation indicator should be designed to replace the existing, unsuitable determination methods. Furthermore, the new indicator needs to take into account an important factor: the number of samples in each class after the clustering process.
In this paper, to improve the HSI classification results of K-means Net, we propose a new size-adaptive-kernel-based K-means Net. Specifically, a new clustering evaluation indicator for groups of pre-learned kernels with different sizes is proposed to evaluate the clustering results and determine the adaptive kernel size. Using the proposed method, the adaptive kernel size can be easily determined to represent the data characteristics well. Experimental results on two datasets demonstrate that, with the automatically determined kernel size, the proposed method outperforms several state-of-the-art CNN methods.
In summary, the proposed CNN framework has two key contributions: (1) a specific size of convolutional kernels can be determined by a new clustering evaluation indicator; (2) the K-means-based CNN framework with an adaptive kernel size is effective for HSI classification.
2. The Proposed Method
The K-means-based CNN method with adaptive kernel size includes four major steps: (1) data pre-processing, which extracts groups of patches with different patch sizes from block samples (the block samples are extracted from the original HSI for training); (2) K-means clustering of the convolutional kernels for the groups of different sizes; (3) evaluation of the clustering results to determine the adaptive kernel size; and (4) HSI classification using the pre-learned kernels with the adaptive kernel size in the K-means-based CNN. The flowchart of the proposed method is shown in
Figure 1.
2.1. Data Pre-Processing
For simplicity, we first describe how the samples used in this paper are constructed from the HSI employed for classification. We randomly select a number of pixels from the image and extract, as samples, the blocks of a fixed spatial size centered at each selected pixel. Each pixel contains the information from all spectral bands; here, we omit the bands when describing the spatial size. The extracted samples are roughly divided into three sets, namely the training set, the validation set and the testing set. The label of each block is given by the label of its central pixel; in other words, the property of the central pixel is described via the statistical properties of the pixel values of the whole block, which includes the central pixel and its surrounding pixels. The training blocks are then fed into the network, and the labels of their central pixels are used as the ground truth for training the network. In this paper, by comparing different block sizes, we select the block size that yields the best classification results [23].
Moreover, we randomly extract patches of a given spatial size from the training samples; these patches are used to learn convolutional kernels of the same size via K-means clustering. In this paper, we choose nine groups of patches with different sizes, chosen so that the feature maps produced by the convolutional layer can be divided without remainder in the pooling process, and the number of randomly extracted patches is set to 10,000.
The process of data extraction is shown in
Figure 2.
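To make the pre-processing step concrete, the following NumPy sketch extracts blocks centered at randomly selected pixels and then randomly samples square patches from them. The block size, patch size and band count used here are placeholders and not values fixed by the paper; only the patch count of 10,000 is taken from the text.

```python
import numpy as np

def extract_blocks(hsi, num_blocks, block_size, rng):
    """Extract square blocks (all bands kept) centered at randomly chosen pixels."""
    h, w, _ = hsi.shape
    half = block_size // 2
    rows = rng.integers(half, h - half, size=num_blocks)
    cols = rng.integers(half, w - half, size=num_blocks)
    return np.stack([hsi[r - half:r + half + 1, c - half:c + half + 1, :]
                     for r, c in zip(rows, cols)])

def extract_patches(blocks, num_patches, patch_size, rng):
    """Randomly sample spatial patches of a given size from the training blocks."""
    n, b, _, _ = blocks.shape
    idx = rng.integers(0, n, size=num_patches)
    offs = rng.integers(0, b - patch_size + 1, size=(num_patches, 2))
    return np.stack([blocks[i, r:r + patch_size, c:c + patch_size, :]
                     for i, (r, c) in zip(idx, offs)])

rng = np.random.default_rng(0)
hsi = np.random.rand(145, 145, 50).astype(np.float32)     # stand-in hyperspectral cube
blocks = extract_blocks(hsi, num_blocks=500, block_size=27, rng=rng)   # block size assumed
patches = extract_patches(blocks, num_patches=10000, patch_size=5, rng=rng)  # 10,000 patches
```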
2.2. Clustering the Convolutional Kernels via K-Means
The pre-learned convolutional kernels are obtained with the K-means algorithm. To verify the adaptive size, we empirically set the number of classes K to 50.
We first reshape each patch into a column vector. The resulting set of vectors is the input to K-means, whose steps are as follows:
Step 1: we randomly choose 50 vectors from the set of patch vectors as the initial cluster centers.
Step 2: each vector is assigned the label of its nearest cluster center, i.e., a vector shares the label of a cluster center only if its Euclidean distance to that center is smaller than its distance to any other center.
Step 3: for all vectors that share the same label and thus belong to the same class, we calculate their mean value and take it as the new cluster center of that class.
Step 4: stopping condition: steps 2 and 3 are repeated for a fixed number of iterations. After this process, the final mean values represent the final cluster centers. One set of cluster centers is obtained for each of the nine chosen groups of patch sizes (indexed by z = 1, 2, …, 9). The cluster centers are then reshaped into convolutional kernels and are ultimately adopted by K-means Net.
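A minimal sketch of this kernel pre-learning step is given below, using scikit-learn's KMeans as a stand-in for the hand-written loop above (the paper fixes the number of iterations, whereas KMeans also checks convergence). The candidate patch sizes, patch count and band count in the example are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans

def learn_kernels(patches, n_kernels=50, n_iter=100, seed=0):
    """Cluster flattened patches and reshape the cluster centers into kernels.

    patches: (num_patches, a, a, bands) -> kernels: (n_kernels, a, a, bands).
    """
    num_patches, a, _, bands = patches.shape
    vectors = patches.reshape(num_patches, -1)             # each patch as one vector
    km = KMeans(n_clusters=n_kernels, n_init=1, max_iter=n_iter,
                random_state=seed).fit(vectors)
    return km.cluster_centers_.reshape(n_kernels, a, a, bands), km.labels_

# One group of pre-learned kernels per candidate patch size (sizes are placeholders).
rng = np.random.default_rng(0)
kernel_groups = {}
for size in (3, 5, 7):
    patches = rng.random((2000, size, size, 50), dtype=np.float32)  # stand-in patches
    kernel_groups[size], _ = learn_kernels(patches)
```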
Since patches with different sizes are extracted from the same samples, they often show different degrees of representation ability. Moreover, patches of different sizes lead to different clustering results and different distributions in the 2-D plane. A detailed discussion is given in
Section 3.
2.3. Determination Method of Adaptive Kernel Size
Given the clustering results obtained with the different groups of patch (kernel) sizes, we determine the adaptive kernel size through the following steps:
Step 1: we compute the inner-class distance.
For each of the 50 classes, we compute the inner-class distance matrix (Equation (3)), whose entries are the Euclidean distances between the vectors assigned to the class and the cluster center of that class. A variable is also introduced to record the number of patches in each class, and the classes are ranked by this number from small to large.
Each class is then given a weight (Equation (4)) determined by the rank of its patch count: the class with the largest number of patches receives the weight 50/50, the class with the smallest number of patches receives the weight 1/50 (50 denotes the number of classes), and, ranking the remaining classes by their patch counts from large to small, their weights are set to 49/50, 48/50, …, 3/50, 2/50, respectively.
The weighted inner distance of each class is then computed (Equation (5)), and the final inner-class distance value is obtained by combining these values over all classes (Equation (6)).
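To make Step 1 concrete, the sketch below computes one plausible reading of the size-rank-weighted inner-class distance. The exact forms of Equations (3)-(6) are not reproduced here, so the per-class term (weighted mean member-to-center distance) and the final reduction (a sum over classes) are assumptions.

```python
import numpy as np

def inner_class_distance(vectors, labels, centers):
    """Size-rank-weighted inner-class distance (hedged reading of Equations (3)-(6)).

    vectors: (N, d) flattened patches, labels: (N,) cluster index of each patch,
    centers: (K, d) cluster centers.
    """
    K = centers.shape[0]
    counts = np.array([(labels == k).sum() for k in range(K)])
    # Rank class sizes from small to large: smallest class -> 1/K, largest -> K/K.
    ranks = np.empty(K, dtype=float)
    ranks[np.argsort(counts)] = np.arange(1, K + 1)
    weights = ranks / K
    per_class = np.zeros(K)
    for k in range(K):
        members = vectors[labels == k]
        if len(members):
            # Assumed per-class term: weighted mean distance of members to their center.
            per_class[k] = weights[k] * np.linalg.norm(members - centers[k], axis=1).mean()
    return per_class.sum()            # assumed reduction over all classes

# Tiny synthetic example (shapes only; real inputs come from the clustering step).
rng = np.random.default_rng(0)
vecs, ctrs = rng.random((2000, 75)), rng.random((50, 75))
lbls = rng.integers(0, 50, size=2000)
print(inner_class_distance(vecs, lbls, ctrs))
```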
Step 2: we compute the inter-class distance.
We compute the 50 × 50 distance matrix composed of the Euclidean distances among the cluster centers; the entry in row r and column t is the distance between the centers of class r and class t. This matrix is then normalized by its maximum element (Equation (7)). Finally, the inter-class distance, a scalar, is obtained from the entries of the normalized matrix, where r and t index the rows and columns of the distance matrix, respectively.
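A compact sketch of Step 2 follows. Only the normalization by the maximum element is stated explicitly in the text, so reducing the normalized matrix to a scalar by summing its entries is an assumption.

```python
import numpy as np

def inter_class_distance(centers):
    """Pairwise Euclidean distances among cluster centers, normalized by the
    maximum entry (Equation (7)), reduced to a scalar (assumed: sum of entries)."""
    diffs = centers[:, None, :] - centers[None, :, :]
    M = np.linalg.norm(diffs, axis=-1)      # (K, K) center-to-center distance matrix
    return (M / M.max()).sum()

centers = np.random.rand(50, 75)            # placeholder cluster centers
print(inter_class_distance(centers))
```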
Step 3: the evaluation indicator of the clustering results with different kernel sizes is computed.
For each candidate kernel size, an evaluation indicator value is computed from the inner-class and inter-class distances obtained in Steps 1 and 2. Then, by ranking the indicator values of all candidate kernel sizes, the optimal kernel size is selected as the one with the largest indicator value.
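A short sketch of this selection step is given below. It assumes the indicator rewards a large inter-class distance and a small inner-class distance (here a simple ratio, which is not necessarily the paper's exact formula), and the numbers in the dictionary are purely illustrative.

```python
# Hedged sketch of Step 3: pick the kernel size with the largest indicator value.
def evaluation_indicator(d_inter, d_inner):
    # Assumed form: well-separated centers (large d_inter) and compact classes (small d_inner).
    return d_inter / d_inner

# (d_inter, d_inner) per candidate kernel size; the values are purely illustrative.
candidate_results = {3: (41.2, 5.6), 5: (55.0, 3.2), 7: (47.8, 4.1)}
scores = {s: evaluation_indicator(di, dn) for s, (di, dn) in candidate_results.items()}
best_size = max(scores, key=scores.get)     # adaptive kernel size
print(best_size, scores[best_size])
```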
The flow chart for determining the adaptive size of the convolutional kernels is shown in
Figure 3.
2.4. Convolutional Neural Networks Classification
By reshaping the cluster centers obtained with the adaptive kernel size into patches, these patches can be directly used as the convolutional kernels in the CNN framework.
With the pre-learned kernels, a convolutional neural network as described in [27] is developed for per-pixel HSI classification. This CNN structure consists of an input layer, a convolutional layer, a pooling layer, a fully connected layer and a soft-max layer, as shown in
Figure 4.
There are 50 kernels in the convolutional layer. Each feature map is calculated by taking the dot product between a kernel and each local context area of the same spatial size, where both have the same number of spectral channels, and then applying the rectified linear unit (ReLU) activation. The kernels are the ones pre-trained with the K-means algorithm.
Maximum pooling over local overlapping spatial regions is adopted to down-sample each feature map of the convolutional layer.
The pooled feature maps are reshaped into column vectors. All the column vectors are then concatenated and fed into a fully connected layer, where an auto-encoder unit processes the concatenated vector and produces its feature representation. The outputs of the hidden layer of the auto-encoder unit are connected to the classification layer.
The last soft-max layer outputs the final classification result.
The structure of the K-means-based CNN with adaptive kernel size is shown in
Figure 4.
3. Experiments and Analysis
To demonstrate the effectiveness of the proposed method, two HSI datasets are adopted in the following experiments to validate the feasibility and effectiveness of the proposed K-means-based CNN with adaptive kernel size for classification. In the following subsections, we first introduce the datasets, then provide the detailed experimental settings, and finally conduct two experiments to show the HSI classification results of the proposed method.
3.1. Datasets
Two public image datasets are utilized in our experiments.
Dataset 1: In order to evaluate the proposed method on a complex dataset, the first dataset is the benchmark Indian Pines image, shown in
Figure 5a. It was gathered by the AVIRIS sensor over the Indian Pines test site in North-western Indiana. The ground reference is shown in
Figure 5b. This image contains 145 × 145 pixels and 224 spectral bands, with wavelengths ranging from 0.4 to 2.5 µm. The number of bands is reduced to 200. Eleven classes of interest in this image were classified: we choose 11 of the total 16 categories for the experiment, and 5108 image context-area samples (blocks) were extracted. The number of samples in each category is given in
Table 1. This dataset is used to analyze the distributions of the patches extracted in different groups with different sizes, in order to verify that patches with different sizes yield different distributions. It is also used to test the feasibility and effectiveness of the proposed approach for classification.
Dataset 2: The second dataset is the benchmark Pavia University image. It was acquired by the ROSIS sensor during a flight campaign over Pavia, northern Italy. As shown in
Figure 6, this image contains 610 × 340 pixels and 103 spectral bands. The number of bands is reduced to 100 by selecting the top 100 of the 103 bands, and the whole image was used. The total set of samples is split into training, validation and testing samples with ratios of 0.5, 0.1 and 0.4, respectively. In total, 31,571 image context-area samples (blocks) were extracted; among them, 15,785, 3157 and 12,629 samples are used for training, validation and testing, respectively. The details of each category of samples are given in
Table 2. This dataset is used to test the feasibility and effectiveness of the proposed approach for classification.
3.2. Experimental Parameter Settings
In the experiments, the samples (blocks) are randomly extracted from the HSI dataset, and then several groups of patches are extracted from the training samples for learning the pre-learned kernels. Within each group, the kernel size is constant, and the number of pre-learned kernels (clustered patches) is fixed at 50. The pre-learned kernels are then used in the pre-learned CNN framework.
In the experiment, as shown in
Figure 4, the CNN framework uses one convolutional layer, one pooling layer, one auto-encoder layer and a classifier. In our algorithm, the pooling layer adopts an overlapping pooling rule, the number of neurons in the hidden layer of the auto-encoder is set to 1000, and the maximum number of iterations for training the classifier is 400. The learning rate is 0.0001, the momentum is 1, and the batch size is set to 200. The reported testing accuracy is the average over 10 trials.
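For convenience, the training settings listed above can be gathered in one place; the dictionary below only restates values given in this subsection, and its layout is merely an illustrative convention.

```python
# Training settings stated in this subsection (the dictionary layout is illustrative).
train_config = {
    "num_kernels": 50,
    "autoencoder_hidden_units": 1000,
    "max_iterations": 400,
    "learning_rate": 1e-4,
    "momentum": 1,
    "batch_size": 200,
    "num_trials_averaged": 10,
}
```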
The code runs on a computer with two Intel Xeon E5-2678 V3 2.50 GHz CPUs (Intel, Santa Clara, CA, USA), two NVIDIA Tesla K40c GPUs (NVIDIA, Santa Clara, CA, USA), 128 GB RAM, a 120 GB SSD and Matlab 2016a (MathWorks, Natick, MA, USA). The gradient is computed via batch gradient descent on the CPU, not on the GPU.
3.3. Experimental Results
3.3.1. Different Performances in the 2-D Plane of the Patches with Different Sizes
The aim of this experiment is to show how the non-uniformly distributed patches with different sizes behave in the 2-D plane. In the experiment, patches with different sizes are extracted from the HSI of Dataset 1, reshaped into vectors, and then projected onto the 2-D plane through the tSNE_VISURE_2dDATA tool. The same candidate patch sizes as in Section 2.1 are used. The 2-D plane shows both the distribution of the patches and their distribution after K-means clustering with 50 classes.
Figure 7 shows that the 2-D projections of the patches differ across patch sizes. This is because patches with different sizes exhibit different qualities. The 2-D projections of the patches after K-means clustering also differ across sizes, which makes it difficult to evaluate the clustering results obtained with different patch sizes. For this reason, the evaluation indicator of the clustering results has to be defined anew.
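The projection described above can be approximated with an off-the-shelf t-SNE implementation; the scikit-learn call below is a stand-in for the tSNE_VISURE_2dDATA tool, and the patch sizes, patch count and band count are placeholders.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
for size in (3, 5, 7):                             # placeholder patch sizes
    patches = rng.random((1000, size, size, 50))   # stand-in for the extracted patches
    vectors = patches.reshape(len(patches), -1)    # reshape each patch into a vector
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(vectors)
    plt.figure()
    plt.scatter(emb[:, 0], emb[:, 1], s=2)
    plt.title(f"t-SNE projection of {size}x{size} patches")
plt.show()
```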
3.3.2. Effectiveness of the Adaptive Kernel Size Determined by K-Means Clustering
To demonstrate the effectiveness of the adaptive kernel size determined by the proposed evaluation indicator, we compare the evaluation indicator values and the classification accuracies of the K-means-based CNN on the two HSI datasets, Dataset 1 and Dataset 2.
We report the evaluation indicator value and the testing accuracy for different patch sizes on each dataset in
Table 3 and
Table 4.
In
Table 3, the proposed method determines the adaptive kernel size on Dataset 1 from the candidate kernel sizes evaluated in this experiment. The kernel size with the largest evaluation indicator value (16.9080) yields the best clustering result, and the adaptive kernel size selected in this way also achieves the best testing classification accuracy of 99.7945%. Similar observations can be made from
Table 4. Therefore, the proposed method is demonstrated to have the potential to determine the adaptive kernel size on other datasets as well.
3.3.3. Performance Evaluation of K-Means Net
In this part, the proposed method is compared with three state-of-the-art pre-learned-kernel-based CNN methods: PCA-Net [24], Random Net and MCFSFDP Net [25]. For a fair comparison, the same CNN architecture is used in all the compared methods. The number of kernels is set to 50, and the adaptive kernel size is determined by the proposed method. The parameters of the four networks, such as the learning rate and the momentum value, are determined by tuning, and the number of iterations is set to 400.
The average testing classification accuracies of our proposed algorithm, PCA-Net, Random Net and MCFSFDP Net on Dataset 1 and Dataset 2 are given in
Table 5 and
Table 6. The results clearly show that Random Net, MCFSFDP Net and K-means Net with the adaptive kernel size achieve acceptable accuracy, and that our proposed method produces the second-best classification result among the four compared methods. MCFSFDP Net achieves the best classification accuracy, which relies on a more advanced clustering method for pre-learning the convolutional kernels.
Specifically, K-means Net clusters the kernels quickly and in a data-determined manner; these are its two main advantages. Among the other methods, PCA-Net also learns kernels in a data-determined way. However, as the sample dimension increases, the effectiveness of the PCA reduction drops; in other words, the reduced dimension is hard to choose, which affects the quality of the PCA-Net kernels. Moreover, learning kernels with PCA-Net is slower than with K-means Net. In Random Net, the kernels are obtained by random initialization and are therefore not data-determined, so Random Net is not broadly applicable. In MCFSFDP Net, the kernels are learned via the MCFSFDP method; they are data-determined, and this clustering method, which is based on both density and distance, has better clustering performance than K-means. However, the MCFSFDP algorithm requires a step of calculating a distance matrix, which needs more time and memory than K-means Net, PCA-Net and Random Net.