Graph-Based Deep Multitask Few-Shot Learning for Hyperspectral Image Classification

Li, Na; Zhou, Deyun; Shi, Jiao; Zheng, Xiaolong; Wu, Tao; Yang, Zhen

doi:10.3390/rs14092246

Open AccessArticle

Graph-Based Deep Multitask Few-Shot Learning for Hyperspectral Image Classification

by

Na Li

¹,

Deyun Zhou

¹,

Jiao Shi

^1,*

,

Xiaolong Zheng

¹,

Tao Wu

¹

and

Zhen Yang

²

¹

School of Electronics and Information, Northwestern Polytechnical University, 127 West Youyi Road, Xi’an 710072, China

²

School of Automation Science and Electrical Engineering, Beihang University, Beijing 100191, China

^*

Author to whom correspondence should be addressed.

Remote Sens. 2022, 14(9), 2246; https://doi.org/10.3390/rs14092246

Submission received: 22 March 2022 / Revised: 2 May 2022 / Accepted: 4 May 2022 / Published: 7 May 2022

Download

Browse Figures

Review Reports Versions Notes

Abstract

:

Although the deep neural network (DNN) has shown a powerful ability in hyperspectral image (HSI) classification, its learning requires a large number of labeled training samples; otherwise, it is prone to over-fitting and has a poor classification performance. However, this requirement is impractical for HSIs due to the difficulty in obtaining class labels. To make DNNs suitable for HSI classification with few labeled samples, we propose a graph-based deep multitask few-shot learning (GDMFSL) framework that learns the intrinsic relationships among all samples (labeled and unlabeled) of HSIs with the assistance of graph information to alleviate the over-fitting caused by few labeled training samples. Firstly, a semi-supervised graph is constructed to generate graph information. Secondly, a deep multitask network (DMN) is designed, which contains two subnetworks (tasks): a classifier subnetwork for learning class information from labeled samples and a Siamese subnetwork for learning sample relationships from the semi-supervised graph. To effectively learn graph information, a loss function suitable for the Siamese subnetwork is designed that shortens (and expands) the distance between the target sample and its nearest (and farthest) neighbors. Finally, since the number of training samples of the two subnetworks is severely imbalanced, a multitask few-shot learning strategy is designed to make two subnetworks converge simultaneously. Experimental results on the Indian Pines, University of Pavia and Salinas datasets demonstrate that GDMFSL achieves a better classification performance relative to existing competitors in few-shot settings. In particular, when only five labels per class are involved in training, the classification accuracy of GDMFSL on the three datasets reaches 87.58%, 86.42% and 98.85%, respectively.

Keywords:

few-shot learning; graph; few labeled samples; hyperspectral images; semi-supervised; classification

1. Introduction

Combined with spectroscopic technology, hyperspectral imaging technology is used to detect the two-dimensional geometric space and one-dimensional spectral information of the target and obtain continuous and narrow band image data with a high spectral resolution [1]. With the development of hyperspectral imaging, hyperspectral images (HSIs) not only contain abundant spectral information reflecting the unique physical properties of the object material but also provide a fine spatial resolution for the ground features [2]. On account of these advantages, HSIs have been widely used in many fields, such as agriculture [3], transportation [4], medicine [5], earth observation [6] and so on.

Among these various HSI-related applications, one of the most essential tasks is HSI classification, which aims to assign a predefined class label to each pixel [7], and has received a substantial amount of attention [8,9,10,11]. However, due to the computational complexity of high dimensional data and the Hughes phenomenon (the higher the dimension, the worse the classification) caused by the limited labeled samples in HSIs, traditional classification techniques performed poorly [12]. To achieve a better classification of HSIs, one conventional solution is to utilize dimensionality reduction to obtain more discriminative features for facilitating the classification of classifiers (e.g., KNN and SVM). Many classic dimensionality reduction methods have been applied to HSIs; for example, principle component analysis [13], linear discriminant analysis [14], independent component analysis [15], low-rank [16] and so on. Nevertheless, the classification performance is still unsatisfactory because these methods are based on the statistical properties of HSI, which neglect the intrinsic geometric structures [17]. To reveal the intrinsic structures of data, manifold learning was designed to discover the geometric properties of HSI; for instance, isometric mapping [18], locally linear embedding [19] and Laplacian eigenmaps [20].

In fact, a unified framework, namely graph learning, can represent and redefine the above dimensionality reduction methods with different similarity matrices and constraint matrices, which can reveal the intrinsic similar relationships of data and have been widely applied to HSIs [17]. Recently, some advanced spectral–spatial graph learning methods were proposed to represent the complex intrinsic structures in HSIs. Zhou et al. [21] developed a spatial and spectral regularized local discriminant embedding method for the dimensionality reduction of HSIs that described the local similarity information by integrating a spectral-domain regularized local preserving scatter matrix and a spatial-domain local pixel neighborhood preserving scatter matrix. Huang et al. [22] proposed an unsupervised spatial–spectral manifold reconstruction preserving embedding method that explored the spatial relationship between each point and its neighbors to adjust the reconstruction weights to improve the efficiency of manifold reconstruction. Huang et al. [23] put forward a spatial–spectral local discriminant projection method where two weighted scatter matrices were designed to maintain the neighborhood structure in the spatial domain and two reconstruction graphs were constructed to discover the local discriminant relationship in the spectral domain. These advanced unsupervised or semi-supervised graph learning methods obtain more discriminative features by exploring and maintaining the intrinsic relationships among samples. Indeed, they improve the classification performance of classifiers. However, the disadvantage of this solution is that feature extraction and classification are separated and the feature extraction process cannot learn the data distribution suitable for the classifier.

Another solution to the HSI classification problem is deep learning technology, which has the powerful ability to learn discriminative features because of the deep structure and automatic learning patterns from data [24]. Different from the above classification methods, in deep learning, the learning process of feature extraction and classification is synchronous, and the extracted features are suitable for the data distribution of the classifier, so as to achieve a better classification performance. Recently, it has also shown a promising performance in HSI classification [25]. To gain a better spatial description of an object, convolutional neural networks (CNNs) have been widely applied for HSIs [26,27,28,29]. Boggavarapu et al. [30] proposed a robust classification framework for HSI by training convolution neural networks with Gabor embedded patches. Paoletti et al. [31] presented a 3-D CNN architecture for HSI classification that used both spectral and spatial information. Zhong et al. [28] designed an end-to-end spectral–spatial residual network that takes raw 3-D cubes as input data without feature engineering for HSIs classification. Although these deep-learning-based methods have achieved a promising classification performance, their learning process requires sufficient labeled samples as the training data, which is difficult for HSIs. In practice, the collection of HSI-labeled samples is generally laborious, expensive, time-consuming and requires field exploration and verification by experts, so the labeled samples available are always limited, insufficient or even deficient [32]. Unfortunately, the conventional deep learning models with limited training samples always face a serious over-fitting issue for HSI classification [33]. Hence, it remains a challenge to apply deep learning that requires sufficient training samples to HSIs with only limited and few labeled samples.

To address this problem, several different few-shot learning methods have been proposed to deal with HSI classification with few labeled samples in recent years. Few-shot learning aims to study the difference between the samples instead of directly learning what the sample is, which is different from most other deep learning methods [34]. As far as we know, there are three types of networks for few-shot learning for HSIs, including the prototypical network, relation network and Siamese network. The prototypical network learns a metric space in which classification can be performed by computing distances to prototype representations of each class [35]. In [36], Tang et al. proposed a spatial–spectral prototypical network for HSIs that first implemented the local-pattern-coding algorithm for HSIs to generate the spatial–spectral vectors. The relation network learns how to learn a deep distance metric based on the prototypical network, which can precisely describe the difference in samples [37]. Gao et al. in [38] designed a new deep classification model based on a relational network and trained it with the idea of meta-learning. The Siamese network is composed of two parallel subnetworks with the same structure and sharing parameters, in which, the input is a sample pair and the Euclidean distances are used to measure the similarity of an embedding pair. In [39], a supervised deep feature extraction method based on a Siamese convolutional neural network was proposed to improve the performance of HSI classification, in which, an additional classifier was required for classification. To sum up, it is not difficult to see that these networks can be summarized as metric-based models, but they usually measure the difference only among labeled samples and ignore unlabeled samples. In practice, labeled samples in HSIs are so few and limited that the neural network can only learn so much information, while the intrinsic structure of HSI is complex and the neural network needs to learn a variety of information. Hence, deep-learning-based few-shot or even one-shot HSI classification is still a challenge.

Although labeled samples in HSI are few and limited, attainable unlabeled samples are abundant and plentiful. Accordingly, the information implicit in unlabeled samples and the relationship between unlabeled samples and labeled samples are worthy and necessary to explore. Nevertheless, if there is no specific constraint, the neural network cannot learn the information beneficial to classification from the unlabeled samples. Fortunately, graph learning, as described earlier, is quite well-versed in this problem, and can effectively reveal the intrinsic relationship among samples. Inspired by the idea of graph learning, we propose a novel graph-based deep multitask few-shot learning (GDMFSL) framework for HSI classification with few labeled samples, which can learn the intrinsic relationships among all samples with the assist of graph information. In addition, another difference between GDMFSL and the aforementioned few-shot learning methods is that their metric function acts on the embedding feature layer, whereas GDMFSL directly constrains the output layer of the classifier. This will make the graph information act on the classification results more effectively.

The main contributions of this paper can be summarized as follows.

In order to make the deep learning method suitable for HSI classification with only few labeled samples, we propose a novel graph-based deep multitask few-shot learning (GDMFSL) framework that integrates graph information into the neural network. GDMFSL learns information not only from labeled samples but also from unlabeled samples, and even obtains the relationship between labeled samples and unlabeled samples, which can not only alleviate the over-fitting problem caused by limited training samples but also improve the classification performance.
In order to learn both the class information from labeled samples and the graph information, a deep multitask network (DMN) is designed, which contains two subnetworks (tasks): a Siamese subnetwork and a classifier subnetwork. The task of the Siamese subnetwork is to learn the intrinsic relationships among all samples with the assistance of graph information, whereas the classifier subnetwork learns the class information from labeled samples. Accordingly, unlike the networks described earlier for few-shot learning, DMN not only learns what the sample is but also the differences among all samples.
In order to effectively learn graph information, a loss function suitable for the Siamese subnetwork learning and training is designed, which shortens the distance between the target sample and its nearest (or in-class) neighbors and expands the distance between the target sample and its farthest (or inter-class) neighbors. Experimental results show that the designed loss function can converge well, effectively alleviate the over-fitting problem of the classifier subnetwork caused by the few labeled samples and improve the classification performance.
Due to the small number of labeled samples but large number of unlabeled samples in HSIs, the proportion between the number of training samples for the classifier subnetwork and that for the Siamese subnetwork is seriously unbalanced, and so the learning process of DMN is unstable. In order to balance the learning and training of two tasks in DMN, a multitask few-shot learning strategy is designed to make the two tasks converge simultaneously.

This paper is organized as follows. In Section 2, the proposed graph-based deep multitask few-shot learning framework is described in detail. Section 3 presents the experimental results on three datasets that demonstrate the superiority of the proposed GDMFSL. A conclusion is presented in Section 4.

2. Methodology

2.1. The Proposed Graph-Based Deep Multitask Few-Shot Learning Framework

In this paper, we study HSIs with few labeled samples and predict the classes of unlabeled samples. We represent a pixel (sample) of HSI as a vector

x_{i} \in ℜ^{D}

, where D is the number of spectral bands. Suppose that an HSI dataset

X

has m samples, of which, only n (

n ≪ m

) samples are labeled and

m - n

samples are unlabeled, m samples are denoted as

X = \{x_{1}, x_{2}, \dots, x_{m}\}

and n labeled samples are represented as

\{(x_{1}, y_{1}), (x_{2}, y_{2}), \dots, (x_{n}, y_{n})\}

, where

y_{i}

is the class label of

x_{i}

. For ease of calculation, the values of all samples are mapped to the range 0∼1 before learning.

HSI classification is used to predict the classes of unlabeled samples according to the class labels of labeled samples. In general, deep-learning-based classification methods aim to learn mapping between the training samples and their labels under the supervision of enough labeled samples. However, in the case of few and limited labeled samples in HSIs, the conventional deep neural network (DNN) will fall into over-fitting, resulting in poor classification results. In addition, the information obtained from only few labeled samples is not enough to support the classification of a mass of HSI samples with a complex intrinsic structure. To solve this difficulty, this study tries to guide DNN to gain information conducive to classification from plentiful unlabeled samples that are easily acquired. However, if there are no additional constraints on DNN, learning on unlabeled samples is often chaotic. Therefore, this idea is challenging to take forward. In this paper, inspired by graph learning, the proposed graph-based deep multitask few-shot learning framework finds a solution.

Graph learning is an effective technique to reveal the intrinsic similar relationships among samples, which can reflect the homogeneity of data. It has been widely applied for HSI to reduce data redundancy and dimensionality. In graph learning, the graph is used to reflect the relationship of two samples, which can represent some of the statistical or geometrical properties of data [17]. The relation information of unlabeled samples can also be captured and embodied in the graph. Thereupon, the graph should be a good auxiliary tool to assist the DNN in learning information from unlabeled samples.

In this paper, we study the HSI with few labels. Therefore, using not just unlabeled samples, the graph can reflect the relationship among all samples that contain labeled samples and unlabeled samples. In other words, the graph should be able to reflect the relationship not only within unlabeled samples and within labeled samples, but also between labeled samples and unlabeled samples. This will be key to predicting the classes of unlabeled samples. As a result, a semi-supervised graph is required.

Based on the semi-supervised graph and labeled samples, the DNN has two tasks to learn, namely the class attributes of samples and the relationship among samples. The two tasks are different: one is to learn what the sample is, and the other is to learn the differences among the samples. In order to simultaneously learn the two tasks and to make them promote each other, we designed a deep multitask network.

Based on the above, a graph-based deep multitask few-shot learning (GDMFSL) framework was proposed to deal with HSI classification with few labels, which is shown in Figure 1. Obviously, the first step of GDMFSL was to construct a semi-supervised graph on the basis of all of the samples, both labeled and unlabeled. Meanwhile, graph information was generated to prepare for deep multitask network. The second step was for the deep multitask network to learn and train under the supervision of few labels and graph information, where the input contains all samples. Finally, unlabeled samples were fed into deep multitask network to predict classes.

2.2. Construction of Semi-Supervised Graph

A graph G can be denoted as

G = \{X, E, W\}

, which is an undirected graph, where X denotes the vertexes, E denotes the edges and W represents the weight matrix of edges. To construct a graph, the neighbors are connected by edges and a weight is given to the corresponding edges [17]. If vertexes i and j are similar, we should put an edge between vertexes i and j in G and define a weight

W_{i j}

for the edge.

The key to constructing a graph is how to effectively calculate the similarity between samples. For this purpose, the spectral–locational–spatial distance (SLSD) [32] method was employed, which combines spectral, locational and spatial information to excavate the more realistic relationships among samples as much as possible. SLSD not only extracts local spatial neighborhood information but also explores global spatial relations in HSI-based location information. Experimental results in [32] show that neighbor samples obtained by SLSD are more likely to fall into the same class as target samples.

Figure 2 shows the construction of a semi-supervised graph, which is essentially adding the information of few labeled samples to the unsupervised graph. In the following, we will go through the process of constructing a semi-supervised graph in detail. In SLSD, the location information is one of the attributes of pixels. For an HSI dataset

X = \{x_{1}, x_{2}, \dots, x_{m}\} \in ℜ^{D \times m}

with m samples, each of its samples

x_{i} \in ℜ^{D \times 1}

have D spectral bands. Its location information can be denoted as

C = \{c_{1}, c_{2}, \dots, c_{m}\} \in ℜ^{2 \times m}

, where

c_{i} = {[p_{i}, q_{i}]}^{T}

is the coordinate of the pixel

x_{i}

. To fuse the spectral and locational information of pixels in HSIs, a weighted spectral-locational dataset

X^{C} = \{{x^{C}}_{1}, {x^{C}}_{2}, \dots, {x^{C}}_{m}\}

was constructed as follows:

\begin{matrix} X^{C} = [\begin{matrix} β C \\ (1 - β) X \end{matrix}] = [\begin{matrix} β c_{1}, \dots, β c_{m} \\ (1 - β) x_{1}, \dots, (1 - β) x_{m} \end{matrix}], \end{matrix}

(1)

where

β

is a spectral–locational trade-off parameter. The local neighborhood space of

{x^{C}}_{i}

is

Ω ({x^{C}}_{i})

in a

s \times s

spatial window, which has

s^{2} = {1, \dots, r}

samples. SLSD of the sample

x_{i}

and

x_{j}

is defined as

\begin{matrix} d_{SLSD} (x_{i}, x_{j}) = d (Ω ({x^{C}}_{i}), {x^{C}}_{j}) = \frac{\sum_{r = 1}^{s^{2}} t_{i r} ∥{x^{C}}_{j} - {x^{C}}_{i r}∥}{\sum_{r = 1}^{s^{2}} t_{i r}}, {x^{C}}_{i r} \in Ω ({x^{C}}_{i}) \end{matrix}

(2)

where

t_{i r}

is calculated by

t_{i r} = exp (- γ ∥{x^{C}}_{i} - {x^{C}}_{i r}∥), {x^{C}}_{i r} \in Ω ({x^{C}}_{i})

.

γ

is a constant that was empirically set to 0.2 in the experiments.

{x^{C}}_{i r}

is a pixel in

Ω ({x^{C}}_{i})

surrounding

{x^{C}}_{i}

.

Although SLSD is effective at revealing relationships between samples, it is still an estimated and imprecise measurement. For an HSI dataset with n labeled samples,

d_{SLSD} (x_{i}, x_{j})

of labeled samples

x_{i}

and

x_{j}

should be updated. In actual calculations, any

d_{SLSD} (x_{i}, x_{j})

is less than 1. In that way, in terms of n labeled samples

\{(x_{1}, y_{1}), (x_{2}, y_{2}), \dots, (x_{n}, y_{n})\}

,

d_{SLSD} (x_{i}, x_{j})

can be updated as follows:

\begin{matrix} d_{SLSD} (x_{i}, x_{j}) = \{\begin{matrix} 1, if y_{i} = y_{j} \\ 0, if y_{i} \neq y_{j} \\ d_{SLSD} (x_{i}, x_{j}), if y_{i} or y_{j} i s ϕ . \end{matrix} \end{matrix}

(3)

where, if

x_{i}

and

x_{j}

have the same class labels, their SLSD is set 0. If

x_{i}

and

x_{j}

have the different class labels, their SLSD is set 1. In addition, if

x_{i}

or

x_{j}

is unlabeled, its SLSD is not updated. In this manner, the updated

d_{SLSD}

contains information about n labels.

In a graph, a vertex and its neighbors are connected by edges. In this paper, we needed to construct two graphs:

G_{w} = \{X, E_{w}, W_{w}\}

based on the nearest neighbors and

G_{b} = \{X, E_{b}, W_{b}\}

based on the farthest neighbors. On the basis of SLSD,

G_{w}

and

G_{b}

can be constructed. For

G_{w} = \{X, E_{w}, W_{w}\}

,

k 1

nearest neighbors were found as

\begin{matrix} N_{k 1} (x_{i}) = min_{x_{j}} d_{SLSD} (x_{i}, x_{j}) . \end{matrix}

(4)

Since

d_{SLSD}

was updated based on labels,

k 1

nearest neighbors could be obtained from the samples with the smallest

d_{SLSD}

. Then, the weight matrix

W_{w}

was formulated as

\begin{matrix} {W_{w}}_{i j} = \{\begin{matrix} exp (- \frac{d_{SLSD} {(x_{i}, x_{j})}^{2}}{2 {t_{i}}^{2}}), x_{j} \in N_{k 1} (x_{i}), \\ 0, x_{j} \notin N_{k 1} (x_{i}), \end{matrix} \end{matrix}

(5)

in which,

t_{i} = \frac{1}{k 1} \sum_{x_{j} \in N_{k 1} (x_{i})} d_{SLSD} (x_{i}, x_{j})

. For

G_{b} = \{X, E_{b}, W_{b}\}

,

k 2

farthest neighbors were found as

\begin{matrix} N_{k 2} (x_{i}) = \{\begin{matrix} max_{x_{j}} d_{SLSD} (x_{i}, x_{j}), if y_{i} = ϕ, \\ min_{x_{j}} d_{SLSD} (x_{i}, x_{j}), if y_{i} \neq y_{j} . \end{matrix} \end{matrix}

(6)

From that,

k 2

farthest neighbors of unlabeled samples were obtained from the samples with the largest

d_{SLSD}

, whereas those of labeled samples were obtained from the samples with the different class labels and the smallest

d_{SLSD}

. The weight matrix

W_{b}

is formulated as

\begin{matrix} {W_{b}}_{i j} = \{\begin{matrix} 1, x_{j} \in N_{k 2} (x_{i}), \\ 0, x_{j} \notin N_{k 2} (x_{i}) . \end{matrix} \end{matrix}

(7)

In fact,

k 1 ≪ m

and

k 2 ≪ m

,

W_{w}

and

W_{b}

are sparse matrices.

G_{w}

and

G_{b}

involve different sample relationships.

G_{w}

reflects the relationships between the target sample and its nearest neighbors, which have a high probability of belonging to the same class as the target sample, whereas

G_{b}

reflects the relationships between the target sample and its farthest neighbors, which are most likely different classes from the target sample.

Figure 3 illustrates the pipeline of the proposed deep multitask network. The training data of DMN contain both labeled and unlabeled samples, and so the proposed DMN can be regarded as a semi-supervised network. DMN includes a Siamese subnetwork and a classifier subnetwork and they have different tasks and training data. The training data of classifier subnetwork must be labeled samples, which is a conventional standard supervised network. As the name implies, the task of classifier subnetwork is classification, learning the classes of labeled samples to predict unlabeled samples, which, in nature, is learning what samples are. Nevertheless, due to the few and limited labels in HSIs, the conventional classification network often suffers from the problems of over-fitting and poor classification performance. In proposed DMN, Siamese subnetwork was designed to address this problem, whose task was to learn the samples’ relationships from

G_{w}

and

G_{b}

to promote the learning and training of classifier subnetwork. It can be seen from Figure 3 that the two subnetworks have the same architecture and share parameters, which is the hub through which they can communicate and complement each other. The training data of Siamese subnetwork are all samples, including labeled and unlabeled samples. In addition, the training of Siamese network also requires the information generated by the semi-supervised graph, and the value of such information is reflected here. In fact, our designed Siamese subnetwork is essentially an unsupervised network and can still be trained without labels.

2.3. Network Architecture and Loss Function of Deep Multitask Network

Figure 4 shows the generation process of training data for DMN. For an HSI dataset,

X = \{x_{1}, x_{2}, \dots, x_{m}\} \in ℜ^{D \times m}

with m samples (labeled and unlabeled) and n labels (

n ≪ m

). M 3-D cube samples

\{x_{1}^{S}, x_{2}^{S}, \dots, x_{m}^{S}\} \in ℜ^{D \times s \times s \times m}

with the spatial neighborhood are first generated. Since the training data of Siamese subnetwork and classifier subnetwork are different, two training sets need to be established. Classifier subnetwork only trains the labeled samples, so its training data contain n 3-D cube samples

\{x_{1}^{t 1}, x_{2}^{t 1}, \dots, x_{n}^{t 1}\} \in ℜ^{D \times s \times s \times n}

with

\{y_{1}, y_{2}, \dots, y_{n}\}

labels, as shown in Training Set 1

X^{t 1}

of Figure 4. In practice, the size of its input is 3-D cube samples

x_{i}^{t 1} \in ℜ^{D \times s \times s}

. Due to the fact that Siamese subnetwork is to learn the sample relationships in

G_{w}

and

G_{b}

, in addition to the target sample,

k 1

nearest neighbors in

G_{w}

and

k 2

farthest neighbors in

G_{b}

also need to be input into the network during training. Training Set 2

X^{t 2}

of Figure 4 are the training data of Siamese subnetwork, which include m training samples

\{x_{1}^{t 2}, x_{2}^{t 2}, \dots, x_{m}^{t 2}\}

, where a training sample

x_{i}^{t 2} = [x_{i}^{S}, N_{k 1} (x_{i}^{S}), N_{k 2} (x_{i}^{S})] \in ℜ^{D \times s \times s \times (1 + k 1 + k 2)}

contains one target sample

x_{i}^{S}

, its

k 1

nearest neighbors

N_{k 1} (x_{i}^{S})

from

G_{w}

and its

k 2

farthest neighbors

N_{k 2} (x_{i}^{S})

from

G_{b}

. It is worth noting that some neighbors of unlabeled samples are labeled samples, which allows the network to learn the relationship between labeled samples and unlabeled samples to promote classification.

Figure 5 displays the network architecture of classifier subnetwork with a feature extractor and a logistic regression layer. In view of the strong feature extraction capability of convolutional layer, the feature extractor is a fully convolutional network. Here, the features of each layer decrease as the number of layers increases and the size of output is

d \times 1 \times 1

, so the feature extractor can also be regarded as a process of dimensionality reduction. Taking a four-layer feature extractor as an example, for input data

x_{i}^{S} \in ℜ^{D \times s \times s}

, their output can be formulated as

\begin{matrix} f (x_{i}^{S}, Θ) = r (conv (r (conv (r (conv (x_{i}^{S}, θ_{1})), θ_{2})), θ_{3})), \end{matrix}

(8)

where

r ()

is the ReLU function and

conv ()

is the 2-D convolution.

Θ = \{θ_{1}, θ_{2}, θ_{3}\}

is the learning parameter of the feature extractor. Feature extractor and logistic regression layer are fully connected. The output of logistic regression layer is formulated as

\begin{matrix} z_{i} = softmax (f (x_{i}^{S}, Θ), θ_{L}), \end{matrix}

(9)

in which,

θ_{L}

is the learning parameters of logistic regression layer and

softmax ()

is the softmax function. Since the task of classifier subnetwork is classification, for the Training Set 1

\{(x_{1}^{t 1}, y_{1}), (x_{2}^{t 1}, y_{2}), \dots, (x_{n}^{t 1}, y_{n})\}

, the loss function adopts the cross-entropy loss, which is defined as

\begin{matrix} L_{c} = - \frac{1}{n} \sum_{i = 1}^{n} \sum_{k = 1}^{N_{C}} y_{i k} ln (z_{i k}) . \end{matrix}

(10)

Here,

y_{i}

is the class label of

x_{i}^{S}

and

z_{i}

is its predicted label.

y_{i}

and

z_{i}

are two

N_{C}

-dimensional one-hot vectors and

N_{C}

is the number of classes.

y_{i k}

is the kth element of

y_{i}

and

z_{i k}

is the kth element of

z_{i}

.

The task of Siamese subnetwork is to learn the sample relationships from

G_{w}

and

G_{b}

. As described in Section 2.2,

G_{w}

represents the relationships between the target sample and its

k 1

nearest neighbors and

G_{b}

expresses the relationships between the target sample and its

k 2

farthest neighbors. In order to learn the graph information in

G_{w}

and

G_{b}

at the same time, a novel Siamese subnetwork with

(1 + k 1 + k 2)

subnets is designed to learn the relationship between one target sample and

(k 1 + k 2)

samples at a time, of which, the network architecture is shown in Figure 6. This is different from the traditional Siamese network, which has two subnets and only learns the relationship between two samples at a time [40]. In our designed Siamese network, each subnet has the same network structure as the classifier subnetwork. That is, all subnets have the same network structure with a feature extractor and a graph-based constraint layer. The point is that they share parameters.

Corresponding to the network architecture, input data

x_{i}^{t 2} = [x_{i}^{S}, N_{k 1} (x_{i}^{S}), N_{k 2} (x_{i}^{S})] \in ℜ^{D \times s \times s \times (1 + k 1 + k 2)}

of Siamese subnetwork have

(1 + k 1 + k 2)

3-D cubes and

x_{i}^{S}

is the target sample. Meanwhile, we suppose that

x_{i p}^{S} \in N_{k 1} (x_{i}^{S})

is the pth nearest neighbor of

x_{i}^{S}

and

x_{i q}^{S} \in N_{k 2} (x_{i}^{S})

is the qth farthest neighbor of

x_{i}^{S}

, which are both 3-D cubes. In Siamese subnetwork, each subnet inputs one 3-D cube

x_{i}^{S} \in ℜ^{D \times s \times s}

. According to the sequence from top to bottom in Figure 6, the first subnet inputs the target sample

x_{i}^{S}

, and its output can be formulated as

\begin{matrix} z_{i} = subN (x_{i}^{S}, Θ, θ_{L}) = softmax (f (x_{i}^{S}, Θ), θ_{L}), \end{matrix}

(11)

which is same as the output of classifier subnetwork.

f (x_{i}^{S}, Θ)

is also the output of feature extractor and

θ_{L}

is the learning parameters of graph-based constraint layer.

subN ()

represents a subnet mapping function, which applies to all subnets due to the same network structure and shared parameters. In that way, the second to

(1 + k 1)

th subnets input nearest neighbor samples and the

(k 1 + 2)

th to

(1 + k 1 + k 2)

th subnets input farthest neighbor samples. When a subnet inputs a nearest neighbor

x_{i p}^{S} \in N_{k 1} (x_{i}^{S})

of

x_{i}^{S}

, its output is described as

\begin{matrix} z_{i p}^{N} = subN (x_{i p}^{S}, Θ, θ_{L}) . \end{matrix}

(12)

In the same way, when a subnet inputs a farthest neighbor

x_{i q}^{S} \in N_{k 2} (x_{i}^{S})

of

x_{i}^{S}

, its output is described as

\begin{matrix} z_{i q}^{F} = subN (x_{i q}^{S}, Θ, θ_{L}) . \end{matrix}

(13)

Thus, for input data

x_{i}^{t 2} = [x_{i}^{S}, N_{k 1} (x_{i}^{S}), N_{k 2} (x_{i}^{S})]

in Training Set 2 of Figure 4, the output of Siamese subnetwork is

\begin{matrix} Z_{i} (x_{i}^{t 2}, Θ, θ_{L}) = [z_{i}, z_{i 1}^{N}, \dots, z_{i p}^{N}, \dots, z_{i k 1}^{N}, z_{i 1}^{F}, \dots, z_{i q}^{F}, \dots, z_{i k 2}^{F}], \end{matrix}

(14)

which includes the outputs of

(1 + k 1 + k 2)

subnets. In fact, Siamese subnetwork aims to promote the learning of classifier subnetwork to improve classification performance. As a result, based on

G_{w}

and

G_{b}

, Siamese subnetwork should compress the distance

D_{N}

between the target sample and

k 1

nearest neighbors and expand the distance

D_{F}

between the target sample and

k 2

farthest neighbors. The former can be formulated as

\begin{matrix} D_{N} = \frac{1}{m} \sum_{i = 1}^{m} \sum_{p = 1}^{k 1} {W_{w}}_{i p} {∥z_{i} - z_{i p}^{N}∥}^{2}, \end{matrix}

(15)

and the latter can be formulated as

\begin{matrix} D_{F} = \frac{1}{m} \sum_{i = 1}^{m} \sum_{q = 1}^{k 2} {W_{b}}_{i p} {∥z_{i} - z_{i q}^{F}∥}^{2} . \end{matrix}

(16)

W_{w}

and

W_{b}

are the weight matrix from

G_{w}

and

G_{w}

, respectively, which are calculated with Equations (5) and (7).

W_{w}

is based on SLSD, which is proven to be effective in revealing the more real relationships between samples [32], where, if SLSD between

x_{i}

and

x_{j}

is smaller,

{W_{w}}_{i j}

is larger, and vice versa. Generally, neural networks optimize learning parameters by minimizing objective functions. However, in Siamese subnetwork,

D_{N}

needs to be minimized, whereas

D_{F}

needs to be maximized. A simple negative

D_{F}

optimization will make the network unable to converge. To take into account the convergence of the network, the loss function of Siamese subnetwork is defined as

\begin{matrix} L_{s} = D_{N} + exp (- D_{F}) . \end{matrix}

(17)

Here,

exp (- x)

is a decreasing function and converges to 0 as the variable increases. As a result, the loss function

L_{s}

will be optimized towards zero.

2.4. Multitask Few-Shot Learning Strategy

In fact, as its name suggests, DMN is a two-task network. Since the training data required by these two tasks are completely different not only in content but also in format, the learning of DMN faces the problem of the two tasks not being able to update learning parameters in training at the same time. In addition, due to the large difference in the number of training data between the two tasks, DMN learning is easy to fall into a single task, so the two tasks cannot achieve uniform convergence. To sum up, it is challenging to achieve the synergy and balance effect of the two tasks. To solve this problem, we designed a multitask few-shot learning strategy (MFSL).

Next, for ease of explanation, we still introduce MFSL based on two sub-networks. Due to the fact that the purpose of DMN is classification and the number of labeled samples is very small, the learning of the classifier subnetwork is particularly important. The University of Pavia dataset, for example, contains 42,776 samples from nine classes. If five samples are taken from each class, the number of labeled samples is only 45 and the number of unlabeled samples is 42,731. Thus, the number of training samples for the classifier subnetwork is 45, whereas that for the Siamese subnetwork is 42,776, which shows a large gap between them. Therefore, the task of the classifier subnetwork needs to be emphasized constantly.

In this paper, our GDMFSL deals with HSIs with few labels. The training data of the classifier subnetwork of DMN are only labeled samples, so all of them can be used as one batch to participate in DMN learning. Algorithm 1 shows the multitask few-shot learning strategy. MFSL can be understood as that, in training, when the Siamese subnetwork learns a batch of data, the classifier subnetwork also learns a batch of data. When the Siamese subnetwork is learning different batches of data, the classifier subnetwork is always learning a batch of data repeatedly. Of course, when the number of labeled samples increases, the training data of the classifier subnetwork can also be divided into multiple batches. The following experiments also prove that MFSL can balance and converge the two tasks of DMN.

Algorithm 1: Multitask few-shot learning strategy

Input: Training Set 1

X^{t 1}

with labels, Training Set 2

X^{t 2}

and its number m, weight matrix

W_{w}

and

W_{b}

, batch size B, iterations I, learning rate.

Initialize:

θ_{1}, θ_{2}, θ_{3}, θ_{L}

1:: for epoch in $\{1, \dots, I\}$ do
2:: $T_{2} \leftarrow$ RandomShuffle ( $X^{t 2}$ )
3:: for i in $\{1, \dots, int (m / B)\}$ do
4:: $T_{1} \leftarrow$ RandomShuffle ( $X^{t 1}$ with labels)
5:: $T_{2}^{B} \leftarrow T_{2} [i \cdot B : min (i \cdot B + B, m)]$
6:: Update $\{θ_{1}, θ_{2}, θ_{3}, θ_{L}\} \leftarrow$ Minimize ( $L_{c} (T_{1})$ in Equation (10))
7:: Update $\{θ_{1}, θ_{2}, θ_{3}, θ_{L}\} \leftarrow$ Minimize ( $L_{s} (T_{2}^{B})$ in Equation (17))
8:: end for
9:: end for

Output:

Z = [z_{1, \dots,} z_{m}]

with the input

T_{2} [:, :, :, :, 0 : 1]

according to Equation (9)

3. Experiments and Discussion

3.1. Experimental Datasets

To assess the performance of GDMFSL, three public HSI datasets were used, including the Indian Pines (IP), the University of Pavia (UP) and the Salinas.

Figure 7 shows the color image and the labeled image of the IP dataset covering the Indian Pines region, northwest Indiana, USA, which was acquired by the AVIRIS sensor in 1992. Its spatial resolution is 20 m. It has 220 original spectral bands in the wavelength range 0.4∼2.5

μ m

. As a result of the noise and water absorption, 104∼108, 150∼163 and 220 spectral bands were abandoned and the remaining 200 bands were used in this paper. It contains

145 \times 145

pixels, including background with 10,776 pixels and 16 ground-truth classes with 10,249 pixels.

The UP dataset covers the University of Pavia, Northern Italy, which was acquired by the ROSIS sensor. Its spatial resolution is 1.3 m. It has 115 original spectral bands in the wavelength range of 0.4∼0.82

μ m

. Due to removing 12 noise bands, 103 bands are employed in this paper. It has

610 \times 340

pixels, where 164,624 pixels are background and 42,776 pixels are nine ground-truth classes. Figure 8 shows the color image and the labeled image with nine classes.

Salinas Dataset covers Salinas Vally, CA, which was acquired by the AVIRIS sensor in 1998. Its spatial resolution is 3.7 m. It has 224 original bands in the wavelength range of 0.4 to 2.45

μ m

. Due to the fact that 20 bands are severely affected by noise, the remaining 204 bands are used for this paper. Each band has

512 \times 217

pixels, including 16 ground-truth classes with 56,975 pixels and background with 54,129 pixels. The color image and the labeled image with 16 classes are shown in Figure 9.

3.2. Experimental Setting

As described in Section 2.1, our GDMFSL framework consists of two parts: a semi-supervised graph and a deep multitask network. For the semi-supervised graph, four parameters need to be manually set, namely, the size of spatial window s, the spectral–locational trade-off parameter

β

, the number of nearest neighbors

k 1

and that of farthest neighbors

k 2

. In fact, the influence of s,

β

and

k 1

on the graph has been analyzed in [32]. According to that, these three parameters are set separately for different datasets and

k 2

is set to be equal to

k 1

for convenience. In this paper, four parameters were set to

s = 5

,

β = 0.7

,

k 1 = k 2 = 10

for the IP dataset,

s = 7

,

β = 0.05

,

k 1 = k 2 = 20

for the UP dataset and

s = 7

,

β = 0.03

,

k 1 = k 2 = 20

for the Salinas dataset.

Although the DMN contains two subnetworks, their network structure is the same. To further slow down over-fitting and improve DMN classification performance, we added a dropout model between each convolutional layer in the feature extractor. For the IP dataset, the number of features in each layer is

200 - 100 - 50 - 30 - 16

, the filter size per layer is

3 \times 3 - 2 \times 2 - 2 \times 2 - 1 \times 1

, the output size per layer is

5 \times 5 - 3 \times 3 - 2 \times 2 - 1 \times 1 - 1 \times 1

, the dropout model has a retention probability of 0.9 and the learning rate is

6 \times 10^{- 4}

. For the UP dataset, the number of features in each layer is

103 - 70 - 30 - 9

, the filter size per layer is

3 \times 3 - 3 \times 3 - 1 \times 1

, the output size per layer is

5 \times 5 - 3 \times 3 - 1 \times 1 - 1 \times 1

, the dropout model has a retention probability of 0.9 and the learning rate is

3 \times 10^{- 4}

. For the Salinas dataset, the number of features in each layer is

204 - 110 - 60 - 30 - 16

, the filter size per layer is

3 \times 3 - 3 \times 3 - 3 \times 3 - 1 \times 1

, the output size per layer is

7 \times 7 - 5 \times 5 - 3 \times 3 - 1 \times 1 - 1 \times 1

, the dropout model has a retention probability of 0.8 and the learning rate is

8 \times 10^{- 5}

.

In order to verify the superiority of GDMFSL, eight classification methods are selected for comparison, including SVM, KNN, 3D-CNN [41], SSRN [28], SS-CNN [42], DFSL+NN [43], RN-FSC [38] and DCFSL [44]. SVM and KNN are the traditional classification methods and the rest are based on neural networks. In the actual experiment, we utilized the 1-nearest neighbors and a LibSVM toolbox with a radial basis function. 3D-CNN [41] and SSRN [28] are two supervised 3-D deep learning frameworks. SS-CNN [42] is a semi-supervised convolutional neural network. DFSL+NN [43], RN-FSC [38] and DCFSL [44] are three few-shot learning frameworks, all of which are cross-domain methods combined with meta-learning. In the following experiments, 200 labeled source domain samples per class are randomly selected to learn transferable knowledge for these three cross-domain methods.

In order to ensure the fairness of the experiment, 1∼5 labeled target dataset samples per class were used for training, and the rest of the samples of the target dataset were reserved as the testing set. The classification overall accuracy (OA), the average accuracy (AA) and the Kappa coefficient were used to evaluate the classification performance. In addition, each experiment in this paper was repeated 10 times in each condition in order to reduce the experimental random error.

3.3. Convergence Analysis

The DMN in our proposed GDMFSL framework has two subnetworks (tasks) with different loss functions. The classifier subnetwork is used to learn what the sample is, and its loss function is the cross-entropy loss between the outputs of the DMN and labels, which is described in Equation (10). The Siamese subnetwork is used to learn the relationship among samples, and its loss function is the mean-squared loss among outputs of the DMN under different inputs, which compresses the distance between the target samples and its nearest neighbors and expands the distance between the target samples and its farthest neighbors, as described in Equation (17). In addition, the two subnetworks learn under different training data. Meanwhile, they share parameters; that is, they jointly optimize the same learning parameter in the DMN. On the surface, under different directions of optimization, the loss of the two subnetworks is likely to fluctuate, and it is not easy to converge. Nevertheless, to prove the convergence of the two subnetworks of the DMN, we show their loss curves and the classification OA curves of the testing set on three datasets, as shown in Figure 10.

The experiment in Figure 10 was performed based on the training data of five labeled samples in each class. The first row is the classifier subnetwork loss, the second row is the Siamese subnetwork loss and the third row is the prediction accuracy of the testing set. Their learning rates are described in Section 3.2. In Figure 10, the x-axis represents the number of learning parameter updates that are performed after learning each batch of samples. The interval within the curve is 100; that is, one loss is recorded for every 100 parameter updates. From Figure 10, the classifier loss of the three datasets has a smooth convergence curve. This proves that, although the amount of training data of the classifier subnetworks is much less than that of the Siamese subnetwork, the task of the classifier subnetwork to learn the labeled samples is not interfered with by the task of the Siamese subnetwork. Though there are fluctuations, the loss curve of the Siamese subnetwork is still gradually converging, which also indicates that our designed loss function in Equation (17) has convergence. These two loss curves also show that the MFSL strategy we designed is also effective. It is worth noting that the prediction accuracy of the test set not only increases gradually with the decrease in the two losses, but also continues to increase with the convergence of the Siamese subnetwork when the classifier subnetwork has converged but the Siamese subnetwork has not. This proves that our designed Siamese subnetwork is quite advantageous for classification.

3.4. Ablation Study

To demonstrate the effectiveness of the strategy proposed in the GDMFSL framework, we conducted an ablation experiment; the results of which are shown in Table 1. Our GDMFSL framework can be divided into two parts: the semi-supervised graph and the DMN. In order to prove the effectiveness of the proposed DMN, we conducted graph learning based on the semi-supervised graph of GDMFSL to reduce the dimensionality of the HSIs dataset, and then classified the dimensionality reduction results using SVM and KNN, which are named SSGL+SVM and SSGL+KNN in this paper. The DMN contains two subnetworks: a classifier subnetwork and a Siamese subnetwork. In order to prove the contribution of the Siamese subnetwork to the DMN, we conducted an experiment to train only the classifier subnetwork, which is called Classifier SubN in Table 1.

Table 1 shows the classification accuracy of SSGL+SVM, SSGL+KNN, Classifier SubN and GDMFSL for three datasets under the condition of a different number of labeled samples, where the highest OA value for the same classification condition has been in bold. GDMFSL, SSGL+SVM and SSGL+KNN learn from the same graph, and the difference is that GDMFSL utilizes DMN to learn not only graph information but also class information, whereas SSGL+SVM and SSGL+KNN only learn graph information. From Table 1, GDMFSL is superior to SSGL+SVM and SSGL+KNN in all conditions on three datasets, which proves that our proposed DMN is effective. In addition, we found that, on the UP dataset, the classification accuracies of SSGL on SVM and KNN are significantly different. This is because the feature extraction and classification in SSGL+SVM and SSGL+KNN are separated, and the feature data distribution of SSGL on the UP dataset is not suitable for the SVM classifier. At this point, a method in which the feature extractor and classifier can learn together is more valuable.

GDMFSL and Classifier SubN are trained with the same labeled samples; the difference is that GDMFSL utilizes the Siamese subnetwork, learning graph information to share the learning parameters to the classifier subnetwork. Table 1 shows that GDMFSL is much better than Classifier SubN, which means that our designed Siamese sub-network is meaningful and can greatly improve the classification performance of the classifier sub-network. Moreover, we can observe from Table 1 that SSGL+SVM and SSGL+KNN are more likely to be better than Classifier SubN. Based on this, it has to be said that the graph-learning method has more advantages than the traditional deep learning method for HSI classification with few labeled samples. The reason for this is that the traditional deep learning method with a deep structure often falls into over-fitting when only few labeled samples are used to train the network. Nevertheless, GDMFSL has solved this problem with the graph, even surpassing the performance of graph learning.

3.5. Classification Result

To further demonstrate the effectiveness of the GDMFSL in HSI classification with few labeled samples, the classification results of GDMFSL and eight comparison methods are presented in this subsection. In the practice experiments, since DFSL+NN [43], RN-FSC [38] and DCFSL [44] are cross-domain methods, four available HSI data sets, Chikusei, Houston, Botswana and Kennedy Space Center, are collected to form source domain data. After discarding classes with less than 200 samples, 40 classes are used to build the source class.

Table 2, Table 3 and Table 4 report the OA, AA, kappa coefficient and the classification accuracy of each class for HSI classification, where the highest value has been in bold. For three target datasets, we randomly selected five labeled samples from each class for training and the rest for testing. In order to eliminate the influence of randomness on the classification accuracy when selecting labeled samples, each experiment was performed 10 times based on an independent randomly labeled sample. From Table 2, Table 3 and Table 4, we obtained the following conclusions.

(1) The classification accuracies of deep-learning-based methods are mostly better than those of traditional classification methods. For example, the OA values of 3D-CNN [41] and SSRN [28] are approximately 13.7% on IP, 8.08% on UP and 5.69% on Salinas higher than those of SVM and KNN, respectively. One reason is that deep learning methods with a hierarchical network structure can obtain more discriminative features. Another reason is that 3D-CNN [41] and SSRN [28] can obtain spectral–spatial features through the convolutional layer, whereas SVM and KNN only explore spectral features.

(2) Although the learning of unlabeled samples is added in addition to the traditional deep learning method, SS-CNN [42], as a semi-supervised method, does not have a better classification performance but is worse when dealing with few labeled sample classification. However, our proposed semi-supervised method, GDMFSL, achieves a good classification performance. The main reason is that SS-CNN [42] uses unlabeled samples only for data reconstruction and does not acquire and learn the relationship information among samples.

(3) The few-shot learning methods are superior to the traditional deep-learning-based methods. Numerically, the OA values of DFSL+NN [43], RN-FSC [38] and DCFSL [44] are approximately 3.48% on IP, 9.53% on UP and 1.53% on Salinas higher than those of 3D-CNN [41] and SSRN [28], respectively. These few-shot learning methods use a meta-learning strategy to learn a metric space suitable for classification. In effect, they are learning a mapping that better expresses the relationships between samples, which is similar to the learning of relationships between samples in our proposed GDMFSL.

(4) Among all of the algorithms, the GDMFSL proposed by us achieved the best classification results on all three data sets. The classification accuracy of GDMFSL is even much higher than other comparison methods. In terms of data, GDMFSL is 20.77% on IP, 2.77% on UP and 9.51% on Salinas higher than the highest OA value in comparison methods. GDMFSL achieved the highest classification accuracy in most classes, with even some classes with a classification accuracy of 100%. In addition, it is worth noting that, in Table 2, when the IP dataset is classified with only five labeled samples in each class, the OA values of all comparison algorithms are all lower than 70%, whereas that of GDMFSL reaches 87.58%. In addition to the OA value, the values of the AA and kappa coefficients of GDMFSL are the highest among all algorithms for the three datasets. All of these can strongly prove the excellent performance of GDMFSL in HSI few-shot classification.

In order to better display and compare the classification results of different methods, we present the classification maps corresponding to Table 2, Table 3 and Table 4, as shown in Figure 11, Figure 12 and Figure 13. Obviously, compared with other comparison methods, the classification map of GDMFSL is the most similar to the ground truth, and the difference in area is the least on the three datasets. Through Figure 11, Figure 12 and Figure 13 the advantages of GDMFSL in processing HSI classification with only few labeled samples are once again proved visually. Although the classification map of GDMFSL looks clear and smooth, GDMFSL still has drawbacks. A careful look at Figure 11, Figure 12 and Figure 13 shows that most of the misclassification pixels of GDMFSL are at the boundary of the class region and are identified as the class of the adjacent region. Meanwhile, the misclassification area is continuous. The reason is that the learning of the DMN in GDMFSL is greatly influenced by sample relationship information generated by a semi-supervised graph based on SLSD with locational information. Although the addition of location information can improve the ability of the DMN to identify samples within the class region, samples at the boundary of the class region are prone to be misclassified.

The above experiment shows the five-shot classification performance of GDMFSL for HSIs. In order to further verify the performance of GDMFSL with fewer labeled samples and the effect of a different number of labeled samples on different methods, we randomly selected one, two, three, four and five labeled samples per class for the experiment, of which, the classification OA values are shown in Table 5 in which the highest value has been in bold. To more clearly show the numerical changes and comparison in Table 5, the corresponding line chart is presented in Figure 14. From Table 5 and Figure 14, we obtain the following observations.

(1) Though 3D-CNN [41] and SSRN [28] are superior to SVM and KNN in classification when there are five labeled samples in each class, with the decrease in the number of labeled samples, the advantage of 3D-CNN [41] and SSRN [28] becomes less and less, and even weaker than SVM and KNN when the number of labeled samples per class is less than three. For example, in the case of one-shot sample classification in the UP dataset, the OA values of SVM and KNN are approximately 6.87% higher than those of 3D-CNN [41] and SSRN [28]. This is because the fewer the training samples, the more serious the overfitting of deep learning methods. DFSL+NN [43], RN-FSC [38] and DCFSL [44] solve this problem through a meta-learning strategy, whereas our GDMFSL learns the relationship among samples from the semi-supervised graph. The experimental results show that, compared with other few-shot learning methods, GDMFSL achieves a better classification performance.

(2) As the number of labeled samples increased, the OA values of all methods increased. When there was only one labeled sample in each class, the OA values of most methods were quite low and close, whereas GDMFSL was the highest, especially on the Salinas dataset, where the OA value of GDMFSL reached 83.52%.

(3) Although the classification performance of SS-CNN [42] is relatively weak compared with other methods in this paper, it does not mean that the semi-supervised deep learning method is unreliable. On the contrary, the experimental results of GDMFSL prove that semi-supervised deep learning is quite effective in processing hyperspectral image classification, with a small number of labeled samples when unlabeled samples are reasonably used. Compared with SS-CNN [42], GDMFSL has the advantage of learning the relationship among samples.

(4) Obviously, GDMFSL is superior to other few-shot learning methods: DFSL+NN [43], RN-FSC [38] and DCFSL [44]. GDMFSL differs from them in that GDMFSL borrows unlabeled sample information from the target dataset whereas they borrow labeled sample information from other source datasets. Objectively speaking, the latter has problems of domain conversion and different classes between different datasets, whereas the former does not.

(5) It is clear from Figure 14 that GDMFSL has the best classification performance under all conditions. Under different numbers of labeled samples, the OA values of GDMFSL are at least 11.29%, 12.18%, 19.46%, 21.12% and 20.77% higher than those of other methods on the IP dataset, 4.35%, 8,35%, 6.18%, 5.12% and 2.77% on the UP dataset and 7.48%, 12.6%, 11.77%, 8.71% and 9.71% on the Salinas dataset.

4. Concluding Remarks

In this paper, we proposed a GDMFSL framework to deal with HSI classification with few labeled training samples. GDMFSL can be viewed as two parts: a semi-supervised graph and a DMN. First, a semi-supervised graph is constructed to generate graph information, which uses SLSD to estimate sample similarities and then revises them with few labeled samples. Second, a DMN with two subnetworks (tasks) is constructed and trained. The classifier subnetwork is trained on few labeled samples, which learns what the sample class is. The Siamese subnetwork is trained based on all samples (labeled and unlabeled), which learns the differences (relationships) among all samples. The loss function constrains the Siamese subnetwork to shorten the distance between the target sample and its nearest (or intra-class) neighbors and widen the distance between the target sample and its farthest (or inter-class) neighbors. The classifier subnetwork and Siamese subnetwork are jointly trained according to the MFSL strategy, and converge cooperatively.

The experimental results demonstrate that our proposed strategy for incorporating graph information into the DNN is more effective than graph learning in handling the few-shot settings of HSIs, the proposed DMN is more efficient than traditional classification networks, our designed Siamese subnetwork indeed alleviates the over-fitting problem of the classifier subnetwork and greatly improves the classification performance, the loss function of the Siamese subnetwork is convergent, and the MFSL strategy is effective for promoting the common convergence of two subnetworks (tasks).

More importantly, GDMFSL is far superior to the other comparison methods in this paper. Under different numbers of labeled samples, the classification OA values of GDMFSL are at least 11.29%, 12.18%, 19.46%, 21.12% and 20.77% higher than those of other methods on the IP dataset, 4.35%, 8.35%, 6.18%, 5.12% and 2.77% on the UP dataset and 7.48%, 12.6%, 11.77%, 8.71% and 9.71% on the Salinas dataset.

The disadvantage of this work is that DMNs with different network structures need to be designed for different datasets, which makes a trained DMN not universal to other data. It can be said that the DMN is overfitting to the target data. Therefore, our future work will focus on improving the generalizability of DMNs.

Author Contributions

Conceptualization, N.L. and J.S.; methodology, N.L.; software, N.L.; validation, N.L., J.S. and X.Z.; formal analysis, N.L.; investigation, N.L.; resources, D.Z.; data curation, N.L. and T.W.; writing—original draft preparation, N.L.; writing—review and editing, N.L.; visualization, N.L.; supervision, D.Z.; project administration, D.Z.; funding acquisition, J.S. and Z.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China (Grant No. 62076204), the Postdoctoral Science Foundation of Shannxi Province (Grant No.2017BSHEDZZ77), the China Postdoctoral Science Foundation (Grant nos. 2017M613204 and 2017M623246), and the Postdoctoral Science Foundation of China under Grants 2021M700337.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

DNN	deep neural network
HSI	hyperspectral image
GDMFSL	graph-based deep multitask few-shot learning
DMN	deep multitask network
KNN	K nearest neighbor
SVM	support vector machine
CNN	convolutional neural network
SLSD	spectral-locational-spatial distance
MFSL	multitask few-shot learning
IP	Indian Pines
UP	University of Pavia
AVIRIS	Airborne Visible/Infrared Imaging Spectrometer
ROSIS	Reflective Optics System Imaging Spectrometer
3D-CNN	3D convolutional neural network
SSRN	spectral–spatial residual network
SS-CNN	semi-supervised convolutional neural network
DFSL+NN	Deep Few-Shot Learning method with nearest neighbor classifier
RN-FSC	relation network for few-shot classification
DCFSL	deep cross-domain few-shot learning
OA	overall accuracy
AA	average accuracy
SSGL+SVM	semi-supervised graph learning method with SVM classifier
SSGL+KNN	semi-supervised graph learning method with KNN classifier

References

ElMasry, G.; Sun, D.W. Principles of hyperspectral imaging technology. In Hyperspectral Imaging for Food Quality Analysis and Control; Elsevier: Amsterdam, The Netherlands, 2010; pp. 3–43. [Google Scholar]
Boldrini, B.; Kessler, W.; Rebner, K.; Kessler, R.W. Hyperspectral imaging: A review of best practice, performance and pitfalls for in-line and on-line applications. J. Near Infrared Spectrosc. 2012, 20, 483–508. [Google Scholar] [CrossRef]
Sahoo, R.N.; Ray, S.; Manjunath, K. Hyperspectral remote sensing of agriculture. Curr. Sci. 2015, 108, 848–859. [Google Scholar]
Bridgelall, R.; Rafert, J.B.; Tolliver, D. Hyperspectral imaging utility for transportation systems. In Sensors and Smart Structures Technologies for Civil, Mechanical, and Aerospace Systems 2015; International Society for Optics and Photonics: Bellingham, WA, USA, 2015; Volume 9435, p. 943522. [Google Scholar]
Fei, B. Hyperspectral imaging in medical applications. In Data Handling in Science and Technology; Elsevier: Amsterdam, The Netherlands, 2020; Volume 32, pp. 523–565. [Google Scholar]
Transon, J.; d’Andrimont, R.; Maugnard, A.; Defourny, P. Survey of hyperspectral earth observation applications from space in the sentinel-2 context. Remote Sens. 2018, 10, 157. [Google Scholar] [CrossRef] [Green Version]
Ding, C.; Li, Y.; Wen, Y.; Zheng, M.; Zhang, L.; Wei, W.; Zhang, Y. Boosting Few-Shot Hyperspectral Image Classification Using Pseudo-Label Learning. Remote Sens. 2021, 13, 3539. [Google Scholar] [CrossRef]
Chen, Y.; Lin, Z.; Zhao, X.; Wang, G.; Gu, Y. Deep learning-based classification of hyperspectral data. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2014, 7, 2094–2107. [Google Scholar] [CrossRef]
Hennessy, A.; Clarke, K.; Lewis, M. Hyperspectral classification of plants: A review of waveband selection generalisability. Remote Sens. 2020, 12, 113. [Google Scholar] [CrossRef] [Green Version]
Zhang, H.; Li, Y.; Jiang, Y.; Wang, P.; Shen, Q.; Shen, C. Hyperspectral classification based on lightweight 3-D-CNN with transfer learning. IEEE Trans. Geosci. Remote Sens. 2019, 57, 5813–5828. [Google Scholar] [CrossRef] [Green Version]
Tu, B.; Zhou, C.; He, D.; Huang, S.; Plaza, A. Hyperspectral classification with noisy label detection via superpixel-to-pixel weighting distance. IEEE Trans. Geosci. Remote Sens. 2020, 58, 4116–4131. [Google Scholar] [CrossRef]
Sawant, S.S.; Prabukumar, M. A review on graph-based semi-supervised learning methods for hyperspectral image classification. Egypt. J. Remote Sens. Space Sci. 2020, 23, 243–248. [Google Scholar] [CrossRef]
Kang, X.; Xiang, X.; Li, S.; Benediktsson, J.A. PCA-based edge-preserving features for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 7140–7151. [Google Scholar] [CrossRef]
Bandos, T.V.; Bruzzone, L.; Camps-Valls, G. Classification of hyperspectral images with regularized linear discriminant analysis. IEEE Trans. Geosci. Remote Sens. 2009, 47, 862–873. [Google Scholar] [CrossRef]
Wang, J.; Chang, C.I. Independent component analysis-based dimensionality reduction with applications in hyperspectral image analysis. IEEE Trans. Geosci. Remote Sens. 2006, 44, 1586–1600. [Google Scholar] [CrossRef]
He, L.; Li, J.; Plaza, A.; Li, Y. Discriminative low-rank Gabor filtering for spectral–spatial hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2016, 55, 1381–1395. [Google Scholar] [CrossRef]
Zhang, L.; Luo, F. Review on graph learning for dimensionality reduction of hyperspectral image. Geo-Spat. Inf. Sci. 2020, 23, 98–106. [Google Scholar] [CrossRef] [Green Version]
Tenenbaum, J.B.; De Silva, V.; Langford, J.C. A global geometric framework for nonlinear dimensionality reduction. Science 2000, 290, 2319–2323. [Google Scholar] [CrossRef] [PubMed]
Fang, Y.; Li, H.; Ma, Y.; Liang, K.; Hu, Y.; Zhang, S.; Wang, H. Dimensionality reduction of hyperspectral images based on robust spatial information using locally linear embedding. IEEE Geosci. Remote Sens. Lett. 2014, 11, 1712–1716. [Google Scholar] [CrossRef]
Yan, L.; Niu, X. Spectral-angle-based Laplacian eigenmaps for nonlinear dimensionality reduction of hyperspectral imagery. Photogramm. Eng. Remote Sens. 2014, 80, 849–861. [Google Scholar] [CrossRef]
Zhou, Y.; Peng, J.; Chen, C.P. Dimension reduction using spatial and spectral regularized local discriminant embedding for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2014, 53, 1082–1095. [Google Scholar] [CrossRef]
Huang, H.; Shi, G.; He, H.; Duan, Y.; Luo, F. Dimensionality reduction of hyperspectral imagery based on spatial–spectral manifold learning. IEEE Trans. Cybern. 2019, 50, 2604–2616. [Google Scholar] [CrossRef] [Green Version]
Huang, H.; Duan, Y.; He, H.; Shi, G.; Luo, F. Spatial-spectral local discriminant projection for dimensionality reduction of hyperspectral image. ISPRS J. Photogramm. Remote Sens. 2019, 156, 77–93. [Google Scholar] [CrossRef]
Paoletti, M.; Haut, J.; Plaza, J.; Plaza, A. Deep learning classifiers for hyperspectral imaging: A review. ISPRS J. Photogramm. Remote Sens. 2019, 158, 279–317. [Google Scholar] [CrossRef]
Li, S.; Song, W.; Fang, L.; Chen, Y.; Ghamisi, P.; Benediktsson, J.A. Deep learning for hyperspectral image classification: An overview. IEEE Trans. Geosci. Remote Sens. 2019, 57, 6690–6709. [Google Scholar] [CrossRef] [Green Version]
Jiao, L.; Liang, M.; Chen, H.; Yang, S.; Liu, H.; Cao, X. Deep fully convolutional network-based spatial distribution prediction for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 5585–5599. [Google Scholar] [CrossRef]
Xu, Y.; Du, B.; Zhang, F.; Zhang, L. Hyperspectral image classification via a random patches network. ISPRS J. Photogramm. Remote Sens. 2018, 142, 344–357. [Google Scholar] [CrossRef]
Zhong, Z.; Li, J.; Luo, Z.; Chapman, M. Spectral–spatial residual network for hyperspectral image classification: A 3-D deep learning framework. IEEE Trans. Geosci. Remote Sens. 2017, 56, 847–858. [Google Scholar] [CrossRef]
Li, Z.; Huang, H.; Li, Y.; Pan, Y. M3DNet: A manifold-based discriminant feature learning network for hyperspectral imagery. Expert Syst. Appl. 2020, 144, 113089. [Google Scholar] [CrossRef]
Boggavarapu, L.P.K.; Manoharan, P. A new framework for hyperspectral image classification using Gabor embedded patch based convolution neural network. Infrared Phys. Technol. 2020, 110, 103455. [Google Scholar] [CrossRef]
Paoletti, M.E.; Haut, J.M.; Plaza, J.; Plaza, A. A new deep convolutional neural network for fast hyperspectral image classification. ISPRS J. Photogramm. Remote Sens. 2018, 145, 120–147. [Google Scholar] [CrossRef]
Li, N.; Zhou, D.; Shi, J.; Wu, T.; Gong, M. Spectral-locational-spatial manifold learning for hyperspectral images dimensionality reduction. Remote Sens. 2021, 13, 2752. [Google Scholar] [CrossRef]
Deng, B.; Jia, S.; Shi, D. Deep metric learning-based feature embedding for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2019, 58, 1422–1435. [Google Scholar] [CrossRef]
Jia, S.; Jiang, S.; Lin, Z.; Li, N.; Yu, S. A Survey: Deep Learning for Hyperspectral Image Classification with Few Labeled Samples. Neurocomputing 2021, 448, 179–204. [Google Scholar] [CrossRef]
Snell, J.; Swersky, K.; Zemel, R.S. Prototypical Networks for Few-shot Learning. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
Tang, H.; Li, Y.; Han, X.; Huang, Q.; Xie, W. A Spatial–Spectral Prototypical Network for Hyperspectral Remote Sensing Image. IEEE Geosci. Remote Sens. Lett. 2019, 17, 167–171. [Google Scholar] [CrossRef]
Sung, F.; Yang, Y.; Zhang, L.; Xiang, T.; Torr, P.H.; Hospedales, T.M. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1199–1208. [Google Scholar]
Gao, K.; Liu, B.; Yu, X.; Qin, J.; Zhang, P.; Tan, X. Deep relation network for hyperspectral image few-shot classification. Remote Sens. 2020, 12, 923. [Google Scholar] [CrossRef] [Green Version]
Liu, B.; Yu, X.; Zhang, P.; Yu, A.; Fu, Q.; Wei, X. Supervised deep feature extraction for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2017, 56, 1909–1921. [Google Scholar] [CrossRef]
Wang, W.; Chen, Y.; He, X.; Li, Z. Soft Augmentation-Based Siamese CNN for Hyperspectral Image Classification With Limited Training Samples. IEEE Geosci. Remote Sens. Lett. 2021, 19, 1–5. [Google Scholar] [CrossRef]
Li, Y.; Zhang, H.; Shen, Q. Spectral–spatial classification of hyperspectral imagery with 3D convolutional neural network. Remote Sens. 2017, 9, 67. [Google Scholar] [CrossRef] [Green Version]
Liu, B.; Yu, X.; Zhang, P.; Tan, X.; Yu, A.; Xue, Z. A semi-supervised convolutional neural network for hyperspectral image classification. Remote Sens. Lett. 2017, 8, 839–848. [Google Scholar] [CrossRef]
Liu, B.; Yu, X.; Yu, A.; Zhang, P.; Wan, G.; Wang, R. Deep few-shot learning for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2018, 57, 2290–2304. [Google Scholar] [CrossRef]
Li, Z.; Liu, M.; Chen, Y.; Xu, Y.; Li, W.; Du, Q. Deep cross-domain few-shot learning for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–18. [Google Scholar] [CrossRef]

Figure 1. Brief of the proposed graph-based deep multitask few-shot learning framework.

Figure 2. Construction of semi-supervised graph.

Figure 3. Deep multitask network.

Figure 4. Generating training data for DMN.

Figure 5. The network architecture of classifier subnetwork.

Figure 6. The network architecture of Siamese subnetwork.

Figure 7. Indian Pines dataset.

Figure 8. University of Pavia dataset.

Figure 9. Salinas dataset.

Figure 10. The loss curves of the DMN and the predicted accuracy of the testing set on three datasets (five labeled samples per class).

Figure 11. Classification maps of different methods on the IP dataset.

Figure 12. Classification maps of different methods on the UP dataset.

Figure 13. Classification maps of different methods on the Salinas dataset.

Figure 14. Classification accuracy for all methods with different number of labeled sample on three datasets.

Table 1. Classification accuracy of different methods on three datasets.

Dataset	$n_{i}$	1	2	3	4	5
IP	SSGL+SVM	52.42	66.27	71.23	74.97	79.64
	SSGL+KNN	53.17	65.66	70.76	79.83	82.10
	Classifier SubN	41.29	48.15	54.62	59.20	62.19
	GDMFSL	54.29	70.07	78.10	84.18	87.58
UP	SSGL+SVM	38.11	51.56	52.64	55.64	62.51
	SSGL+KNN	45.73	54.07	69.71	73.81	82.60
	Classifier SubN	44.83	54.31	62.51	69.22	72.33
	GDMFSL	61.90	76.38	80.25	85.44	86.42
Salinas	SSGL+SVM	79.19	84.19	85.83	86.61	93.81
	SSGL+KNN	82.86	84.86	86.34	89.42	93.60
	Classifier SubN	69.02	76.07	82.24	85.73	89.32
	GDMFSL	83.52	93.50	97.27	97.20	98.85

Table 2. Classification accuracy (%) of different methods on the IP dataset (five labeled samples per class).

Class	Train	Test	SVM	KNN	3D-CNN	SSRN	SS-CNN	DFSL+NN	RN-FSC	DCFSL	GDMFSL
C1	5	41	72.20	86.96	95.12	18.38	44.13	96.75	96.34	95.37	100.0
C2	5	1423	34.27	37.53	37.70	64.79	46.63	38.65	46.13	43.26	76.82
C3	5	825	39.18	38.19	19.77	27.65	33.32	42.79	40.61	57.95	87.59
C4	5	232	50.34	41.35	32.51	26.65	23.05	68.10	58.62	80.60	100.0
C5	5	478	69.75	55.07	88.45	80.76	57.77	71.20	64.96	72.91	84.05
C6	5	725	66.36	79.04	73.65	86.87	78.98	76.18	69.45	87.96	100.0
C7	5	23	89.13	96.42	81.82	32.24	22.78	100.0	100.0	99.57	100.0
C8	5	473	68.73	67.76	53.35	100.0	93.85	74.84	77.70	86.26	100.0
C9	5	15	86.67	90.00	100.0	57.69	15.77	100.0	100.0	99.33	100.0
C10	5	967	37.49	38.68	41.35	59.69	33.47	47.98	25.49	62.44	83.02
C11	5	2450	33.96	36.00	66.71	70.87	56.32	57.95	65.51	62.75	76.98
C12	5	588	31.43	31.53	37.40	45.00	25.08	38.21	27.13	48.72	82.29
C13	5	200	86.50	89.26	85.71	88.29	60.39	97.50	99.75	99.35	100.0
C14	5	1260	62.93	51.93	62.57	97.18	81.22	83.44	76.35	85.40	100.0
C15	5	381	28.08	12.95	56.42	36.64	63.78	62.29	70.34	66.69	100.0
C16	5	88	90.91	89.24	90.36	60.98	88.33	100.0	100.0	97.61	100.0
OA	80	10169	45.85	42.86	54.76	61.36	51.73	59.65	58.17	66.81	87.58
			±2.44	±1.50	±0.03	±0.49	±3.12	±0.63	±0.02	±2.73	±3.41
AA	—	—	59.24	58.21	63.93	59.75	51.54	72.24	69.90	77.89	92.70
			±1.36	±1.48	±0.02	±0.20	±1.98	±0.42	±0.40	±0.86	±1.90
Kappa	—	—	39.68	40.06	48.72	56.91	45.28	54.55	52.52	62.64	86.13
			±2.48	±1.43	±0.03	±0.48	±2.75	±0.52	±0.14	±0.84	±3.72

Table 3. Classification accuracy (%) of different methods on the UP dataset (five labeled samples per class).

Class	Train	Test	SVM	KNN	3D-CNN	SSRN	SS-CNN	DFSL+NN	RN-FSC	DCFSL	GDMFSL
C1	5	6626	89.98	52.87	59.82	91.84	84.97	69.19	68.55	82.20	59.58
C2	5	18644	83.91	62.29	63.05	95.13	84.61	84.63	93.44	87.74	89.53
C3	5	2094	39.98	61.64	68.91	55.23	28.09	57.47	49.81	67.46	100.0
C4	5	3059	60.22	69.35	77.31	78.02	44.35	89.99	92.15	93.16	54.08
C5	5	1340	95.44	99.25	90.77	98.34	97.38	100.0	99.43	99.49	100.0
C6	5	5024	37.12	48.83	63.40	53.56	40.37	71.23	57.99	77.32	100.0
C7	5	1325	40.62	85.04	87.64	60.07	27.67	70.62	70.04	81.18	100.0
C8	5	3677	68.17	65.48	57.27	85.34	62.37	58.13	63.48	66.73	100.0
C9	5	942	99.13	99.68	95.57	98.08	51.28	96.92	99.19	98.66	93.77
OA	45	42731	64.12	61.72	65.74	76.26	56.61	77.75	80.19	83.65	86.42
			±4.55	±2.73	±1.77	±5.78	±6.79	±1.16	±2.18	±1.77	±2.90
AA	—	—	68.18	71.27	76.72	79.51	57.90	77.57	77.12	83.77	88.57
			±2.27	±1.78	±1.01	±3.21	±3.86	±0.31	±0.84	±1.74	±1.74
Kappa	—	—	55.59	55.49	57.37	70.56	48.20	71.11	73.73	78.70	82.89
			±4.64	±2.65	±1.97	±6.69	±6.45	±1.22	±2.79	±2.01	±3.30

Table 4. Classification accuracy (%) of different methods on the Salinas dataset (five labeled samples per class).

Class	Train	Test	SVM	KNN	3D-CNN	SSRN	SS-CNN	DFSL+NN	RN-FSC	DCFSL	GDMFSL
C1	5	2004	97.57	98.25	95.29	97.55	46.53	95.63	96.47	99.40	99.85
C2	5	3721	87.43	93.37	97.20	98.97	86.71	99.09	99.47	99.76	100.0
C3	5	1971	82.95	93.06	91.45	92.47	77.74	94.01	85.05	91.96	98.98
C4	5	1389	99.11	98.85	97.31	96.50	75.97	99.54	98.75	99.55	79.48
C5	5	2673	94.29	86.22	91.24	94.20	89.43	90.58	83.45	92.70	99.55
C6	5	3954	98.36	96.66	98.80	99.28	99.78	98.47	96.73	99.52	100.0
C7	5	3574	94.39	98.93	99.69	99.98	92.89	99.81	99.61	98.88	100.0
C8	5	11266	59.99	50.32	66.40	86.90	66.23	77.74	72.11	74.57	99.73
C9	5	6198	96.09	97.29	96.25	99.64	93.02	91.13	88.35	99.59	100.0
C10	5	3273	71.45	61.95	70.72	92.01	87.69	60.98	70.53	86.42	97.77
C11	5	1063	91.25	84.55	93.15	95.86	63.28	95.99	90.03	96.61	99.90
C12	5	1922	97.22	78.20	99.65	99.15	76.60	93.13	93.15	99.93	99.89
C13	5	911	97.30	98.03	92.63	89.24	95.57	99.34	98.54	99.30	100.0
C14	5	1065	91.84	91.49	93.56	95.15	95.51	98.06	96.43	98.85	98.50
C15	5	7263	60.52	56.16	68.02	55.97	46.26	77.54	70.18	75.38	97.44
C16	5	1802	81.45	78.36	81.41	98.91	97.43	85.05	82.39	92.22	100.0
OA	80	54049	80.71	78.50	84.20	86.39	72.51	87.05	84.11	89.34	98.85
			±2.75	±2.14	±2.62	±2.68	±3.82	±0.83	±1.36	±2.19	±0.77
AA	—	—	87.58	85.24	89.56	93.24	80.88	91.01	88.83	94.04	98.39
			±1.84	±1.55	±1.79	±1.29	±3.52	±0.66	±2.07	±1.14	±1.41
Kappa	—	—	78.61	76.84	82.46	84.95	69.42	85.63	82.38	88.17	98.72
			±3.00	±2.22	±2.90	±2.90	±4.19	±0.91	±1.53	±2.40	±0.86

Table 5. Classification accuracy (%) for all methods with different number of labeled sample on three datasets.

Dataset	$n_{i}$	Train	Test	SVM	KNN	3D-CNN	SSRN	SS-CNN	DFSL+NN	RN-FSC	DCFSL	GDMFSL
IP	1	16	10,233	32.30	33.81	36.23	37.76	32.18	40.15	31.84	43.00	54.29
	2	32	10,217	33.07	36.24	40.00	40.77	43.43	49.69	39.60	57.89	70.07
16	3	48	10,201	40.79	39.84	43.06	43.52	43.67	54.29	44.29	58.64	78.10
Classes	4	64	10,185	42.63	41.89	49.29	51.84	48.03	57.76	56.53	63.06	84.18
	5	80	10169	45.85	42.86	54.76	61.36	51.39	59.65	58.17	66.81	87.58
UP	1	9	42,767	51.38	51.24	42.96	45.92	34.72	51.85	54.07	57.55	61.90
	2	18	42,758	53.57	52.58	51.48	51.48	41.43	58.51	67.22	68.03	76.38
9	3	27	42,749	59.15	56.27	58.52	65.00	48.44	73.14	72.03	74.07	80.25
Classes	4	36	42,740	62.31	56.86	61.11	65.92	51.20	75.74	79.44	80.32	85.44
	5	45	42,731	64.12	61.72	65.74	76.26	56.61	77.75	80.19	83.65	86.42
Salinas	1	16	54,113	66.25	70.87	69.58	69.79	55.46	76.04	71.87	72.70	83.52
	2	32	54,097	68.12	73.77	71.87	72.39	56.85	80.62	74.06	80.88	93.50
16	3	48	54,081	77.08	76.34	74.79	83.85	65.86	82.81	76.87	85.50	97.27
Classes	4	64	54,065	78.96	77.12	79.27	85.62	70.95	86.46	82.29	88.49	97.20
	5	80	54,049	80.71	78.50	84.20	86.39	71.76	87.05	84.11	89.14	98.85

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Li, N.; Zhou, D.; Shi, J.; Zheng, X.; Wu, T.; Yang, Z. Graph-Based Deep Multitask Few-Shot Learning for Hyperspectral Image Classification. Remote Sens. 2022, 14, 2246. https://doi.org/10.3390/rs14092246

AMA Style

Li N, Zhou D, Shi J, Zheng X, Wu T, Yang Z. Graph-Based Deep Multitask Few-Shot Learning for Hyperspectral Image Classification. Remote Sensing. 2022; 14(9):2246. https://doi.org/10.3390/rs14092246

Chicago/Turabian Style

Li, Na, Deyun Zhou, Jiao Shi, Xiaolong Zheng, Tao Wu, and Zhen Yang. 2022. "Graph-Based Deep Multitask Few-Shot Learning for Hyperspectral Image Classification" Remote Sensing 14, no. 9: 2246. https://doi.org/10.3390/rs14092246

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Graph-Based Deep Multitask Few-Shot Learning for Hyperspectral Image Classification

Abstract

1. Introduction

2. Methodology

2.1. The Proposed Graph-Based Deep Multitask Few-Shot Learning Framework

2.2. Construction of Semi-Supervised Graph

2.3. Network Architecture and Loss Function of Deep Multitask Network

2.4. Multitask Few-Shot Learning Strategy

3. Experiments and Discussion

3.1. Experimental Datasets

3.2. Experimental Setting

3.3. Convergence Analysis

3.4. Ablation Study

3.5. Classification Result

4. Concluding Remarks

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI