2.1. The Proposed Graph-Based Deep Multitask Few-Shot Learning Framework
In this paper, we study HSIs with few labeled samples and predict the classes of the unlabeled samples. A pixel (sample) of an HSI is represented as a vector $x_i \in \mathbb{R}^{D}$, where $D$ is the number of spectral bands. Suppose that an HSI dataset has $m$ samples, of which only $n$ ($n \ll m$) samples are labeled and $m-n$ samples are unlabeled. The $m$ samples are denoted as $X = \{x_1, x_2, \ldots, x_m\}$ and the $n$ labeled samples are represented as $X_L = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where $y_i$ is the class label of $x_i$. For ease of calculation, the values of all samples are mapped to the range 0∼1 before learning.
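To make the notation concrete, the following minimal sketch (an illustration only, not part of the original method description) reshapes an $H \times W \times D$ HSI cube into an $m \times D$ pixel matrix with $m = H \cdot W$ and scales all values to the range 0∼1; the global min–max scaling is an assumption, and per-band scaling is an equally plausible reading.

```python
import numpy as np

def hsi_to_pixels(cube: np.ndarray) -> np.ndarray:
    """Reshape an H x W x D hyperspectral cube into an (m, D) pixel matrix
    and min-max scale the values to 0..1 (m = H * W).
    Global scaling is an illustrative assumption; per-band scaling also fits the text."""
    h, w, d = cube.shape
    pixels = cube.reshape(h * w, d).astype(np.float64)
    lo, hi = pixels.min(), pixels.max()
    return (pixels - lo) / (hi - lo + 1e-12)
```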
HSI classification aims to predict the classes of unlabeled samples according to the class labels of labeled samples. In general, deep-learning-based classification methods learn a mapping between training samples and their labels under the supervision of sufficient labeled samples. However, when labeled HSI samples are few and limited, a conventional deep neural network (DNN) tends to over-fit, resulting in poor classification results. In addition, the information obtained from only a few labeled samples is not enough to support the classification of a mass of HSI samples with a complex intrinsic structure. To overcome this difficulty, this study guides the DNN to gain information conducive to classification from plentiful unlabeled samples, which are easily acquired. However, without additional constraints on the DNN, learning on unlabeled samples is often chaotic, which makes this idea challenging to realize. In this paper, inspired by graph learning, the proposed graph-based deep multitask few-shot learning framework provides a solution.
Graph learning is an effective technique for revealing the intrinsic similarity relationships among samples, which reflect the homogeneity of the data. It has been widely applied to HSIs to reduce data redundancy and dimensionality. In graph learning, the graph reflects the pairwise relationships between samples and can represent some of the statistical or geometrical properties of the data [17]. The relation information of unlabeled samples can also be captured and embodied in the graph. Therefore, the graph is a good auxiliary tool to assist the DNN in learning information from unlabeled samples.
In this paper, we study HSIs with few labels. Therefore, the graph should reflect the relationships among all samples, both labeled and unlabeled. In other words, the graph should capture not only the relationships within the unlabeled samples and within the labeled samples, but also those between labeled and unlabeled samples; this is key to predicting the classes of unlabeled samples. As a result, a semi-supervised graph is required.
Based on the semi-supervised graph and the labeled samples, the DNN has two tasks to learn: the class attributes of the samples and the relationships among the samples. The two tasks are different: one learns what a sample is, and the other learns the differences among samples. In order to learn the two tasks simultaneously and make them promote each other, we designed a deep multitask network (DMN).
Based on the above, the graph-based deep multitask few-shot learning (GDMFSL) framework is proposed to deal with HSI classification with few labels, as shown in Figure 1. The first step of GDMFSL is to construct a semi-supervised graph on the basis of all samples, both labeled and unlabeled; meanwhile, graph information is generated in preparation for the deep multitask network. In the second step, the deep multitask network learns and trains under the supervision of the few labels and the graph information, where its input contains all samples. Finally, the unlabeled samples are fed into the deep multitask network to predict their classes.
2.2. Construction of Semi-Supervised Graph
A graph $G$ can be denoted as $G = (X, E, W)$, which is an undirected graph, where $X$ denotes the vertexes, $E$ denotes the edges and $W$ represents the weight matrix of the edges. To construct a graph, neighboring vertexes are connected by edges and a weight is assigned to the corresponding edges [17]. If vertexes $i$ and $j$ are similar, we put an edge between vertexes $i$ and $j$ in $G$ and define a weight $w_{ij}$ for the edge.
The key to constructing a graph is how to effectively calculate the similarity between samples. For this purpose, the spectral–locational–spatial distance (SLSD) [32] is employed, which combines spectral, locational and spatial information to excavate relationships among samples that are as realistic as possible. SLSD not only extracts local spatial neighborhood information but also explores global spatial relations in the HSI based on location information. Experimental results in [32] show that the neighbors obtained by SLSD are more likely to fall into the same class as the target sample.
Figure 2 shows the construction of a semi-supervised graph, which essentially adds the information of the few labeled samples to an unsupervised graph. In the following, we go through the process of constructing the semi-supervised graph in detail. In SLSD, the location information is one of the attributes of a pixel. For an HSI dataset $X = \{x_i\}_{i=1}^{m}$ with $m$ samples, each sample $x_i$ has $D$ spectral bands. The location information can be denoted as $L = \{l_i\}_{i=1}^{m}$, where $l_i = (a_i, b_i)$ is the coordinate of the pixel $x_i$. To fuse the spectral and locational information of pixels in HSIs, a weighted spectral–locational dataset $\hat{X} = \{\hat{x}_i\}_{i=1}^{m}$ is constructed as follows:
$$\hat{x}_i = [\, x_i, \; \mu l_i \,], \qquad (1)$$
where $\mu$ is a spectral–locational trade-off parameter. The local neighborhood space of $x_j$ is $\Omega(x_j)$ in an $s \times s$ spatial window, which has $s^2$ samples. The SLSD of the samples $x_i$ and $x_j$ is defined as
$$ d_{ij} = \min_{x_{j,t} \in \Omega(x_j)} \| \hat{x}_i - \hat{x}_{j,t} \|_2, \qquad (2)$$
where $\hat{x}_{j,t}$ is calculated by Equation (1). $\mu$ is a constant that was empirically set to 0.2 in the experiments. $x_{j,t}$ is a pixel in $\Omega(x_j)$ surrounding $x_j$.
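The exact SLSD formulation follows [32]; the sketch below is only one reading of the description above, assuming the distance between $x_i$ and $x_j$ is the minimum weighted spectral–locational distance from $x_i$ to the pixels in the $s \times s$ window around $x_j$, with the names `mu` and `s` introduced here for illustration.

```python
import numpy as np

def slsd(pixels, coords, i, j, mu=0.2, s=5):
    """Estimated spectral-locational-spatial distance between samples i and j.
    pixels: (m, D) spectra scaled to 0..1; coords: (m, 2) integer row/col locations.
    This is an illustrative reading of the SLSD description, not the exact formula of [32]."""
    # Weighted spectral-locational vectors: spectrum concatenated with mu-scaled location.
    fused = np.hstack([pixels, mu * coords])
    # Pixels of the s x s spatial window centred on sample j.
    r = s // 2
    in_window = (np.abs(coords[:, 0] - coords[j, 0]) <= r) & \
                (np.abs(coords[:, 1] - coords[j, 1]) <= r)
    neighbours = fused[in_window]
    # SLSD: the smallest fused distance from sample i to any pixel in j's window.
    return float(np.min(np.linalg.norm(neighbours - fused[i], axis=1)))
```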
Although SLSD is effective at revealing the relationships between samples, it is still an estimated and imprecise measurement. For an HSI dataset with $n$ labeled samples, the SLSD $d_{ij}$ between labeled samples $x_i$ and $x_j$ should be updated. In actual calculations, any $d_{ij}$ is less than 1. In that way, in terms of the $n$ labeled samples $\{(x_i, y_i)\}_{i=1}^{n}$, $d_{ij}$ can be updated as follows:
$$ d_{ij} = \begin{cases} 0, & \text{if } x_i \text{ and } x_j \text{ are labeled and } y_i = y_j \\ 1, & \text{if } x_i \text{ and } x_j \text{ are labeled and } y_i \neq y_j \\ d_{ij}, & \text{otherwise}, \end{cases} \qquad (3)$$
that is, if $x_i$ and $x_j$ have the same class label, their SLSD is set to 0; if $x_i$ and $x_j$ have different class labels, their SLSD is set to 1; and if $x_i$ or $x_j$ is unlabeled, its SLSD is not updated. In this manner, the updated $d_{ij}$ contains the information of the $n$ labels.
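A minimal sketch of this label-based update, assuming (for illustration) that the pairwise SLSDs are stored in a dense $m \times m$ array and that unlabeled samples carry the label −1:

```python
import numpy as np

def update_slsd_with_labels(d, labels):
    """Inject label information into the SLSD matrix d (m x m, values < 1).
    labels: (m,) array with class ids for labeled samples and -1 for unlabeled ones."""
    d = d.copy()
    labeled = np.where(labels >= 0)[0]
    for i in labeled:
        for j in labeled:
            if i == j:
                continue
            # Same class: minimum possible distance; different class: maximum.
            d[i, j] = 0.0 if labels[i] == labels[j] else 1.0
    # Pairs involving an unlabeled sample keep their estimated SLSD.
    return d
```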
In a graph, a vertex and its neighbors are connected by edges. In this paper, we need to construct two graphs: $G_n$ based on the nearest neighbors and $G_f$ based on the farthest neighbors. On the basis of SLSD, $G_n$ and $G_f$ can be constructed. For $G_n$, the $k_1$ nearest neighbors of $x_i$ are found as
$$ N_n(x_i) = \{\, x_{i1}^{n}, x_{i2}^{n}, \ldots, x_{ik_1}^{n} \,\}. \qquad (4)$$
Since $d_{ij}$ is updated based on the labels, the $k_1$ nearest neighbors can be obtained as the samples with the smallest $d_{ij}$. Then, the weight matrix $W_n$ is formulated as
$$ w_{ij}^{n} = \begin{cases} \exp(-d_{ij}), & \text{if } x_j \in N_n(x_i) \\ 0, & \text{otherwise}, \end{cases} \qquad (5)$$
in which, $w_{ij}^{n}$ is the $(i,j)$th element of $W_n$. For $G_f$, the $k_2$ farthest neighbors of $x_i$ are found as
$$ N_f(x_i) = \{\, x_{i1}^{f}, x_{i2}^{f}, \ldots, x_{ik_2}^{f} \,\}. \qquad (6)$$
From that, the $k_2$ farthest neighbors of unlabeled samples are obtained as the samples with the largest $d_{ij}$, whereas those of labeled samples are obtained from the samples with different class labels and the smallest $d_{ij}$. The weight matrix $W_f$ is formulated as
$$ w_{ij}^{f} = \begin{cases} \exp(-d_{ij}), & \text{if } x_j \in N_f(x_i) \\ 0, & \text{otherwise}. \end{cases} \qquad (7)$$
In fact, $W_n$ and $W_f$ are sparse matrices, since each of their rows has only $k_1$ or $k_2$ nonzero entries.
$G_n$ and $G_f$ involve different sample relationships. $G_n$ reflects the relationships between the target sample and its nearest neighbors, which have a high probability of belonging to the same class as the target sample, whereas $G_f$ reflects the relationships between the target sample and its farthest neighbors, which most likely belong to different classes from the target sample.
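The following sketch builds the two sparse weight matrices from the label-updated SLSD matrix. The neighbor selection follows the description above; the $\exp(-d_{ij})$ weighting and the default values of $k_1$ and $k_2$ are assumptions, chosen only so that the weight decreases as the SLSD grows.

```python
import numpy as np

def build_graphs(d, labels, k1=5, k2=5):
    """Build nearest-neighbour weights W_n and farthest-neighbour weights W_f
    from the label-updated SLSD matrix d. labels uses -1 for unlabeled samples.
    The exp(-d) weighting is an illustrative choice consistent with the text."""
    m = d.shape[0]
    w_n = np.zeros((m, m))
    w_f = np.zeros((m, m))
    for i in range(m):
        order = np.argsort(d[i])                      # ascending SLSD
        nearest = [j for j in order if j != i][:k1]   # k1 smallest distances
        if labels[i] >= 0:
            # Labeled target: farthest neighbours are different-class samples
            # with the smallest SLSD (hard negatives).
            diff = [j for j in order
                    if j != i and labels[j] >= 0 and labels[j] != labels[i]]
            farthest = diff[:k2]
        else:
            farthest = list(order[::-1][:k2])         # k2 largest distances
        w_n[i, nearest] = np.exp(-d[i, nearest])
        w_f[i, farthest] = np.exp(-d[i, farthest])
    return w_n, w_f
```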
Figure 3 illustrates the pipeline of the proposed deep multitask network. The training data of the DMN contain both labeled and unlabeled samples, so the proposed DMN can be regarded as a semi-supervised network. The DMN includes a Siamese subnetwork and a classifier subnetwork, which have different tasks and training data. The training data of the classifier subnetwork must be labeled samples, making it a conventional supervised network. As the name implies, the task of the classifier subnetwork is classification: learning the classes of the labeled samples to predict those of the unlabeled samples, which, in nature, is learning what samples are. Nevertheless, owing to the few and limited labels in HSIs, a conventional classification network often suffers from over-fitting and poor classification performance. In the proposed DMN, the Siamese subnetwork is designed to address this problem; its task is to learn the sample relationships from $G_n$ and $G_f$ to promote the learning and training of the classifier subnetwork. It can be seen from Figure 3 that the two subnetworks have the same architecture and share parameters, which is the hub through which they communicate and complement each other. The training data of the Siamese subnetwork are all samples, both labeled and unlabeled. In addition, the training of the Siamese subnetwork also requires the information generated by the semi-supervised graph, and the value of this information is reflected here. In fact, the designed Siamese subnetwork is essentially an unsupervised network and can still be trained without labels.
2.3. Network Architecture and Loss Function of Deep Multitask Network
Figure 4 shows the generation process of the training data for the DMN. Consider an HSI dataset $X$ with $m$ samples (labeled and unlabeled) and $n$ labels ($n \ll m$). First, $m$ 3-D cube samples $\{c_i\}_{i=1}^{m}$ with a spatial neighborhood are generated. Since the training data of the Siamese subnetwork and the classifier subnetwork are different, two training sets need to be established. The classifier subnetwork only trains on the labeled samples, so its training data contain the $n$ 3-D cube samples with their $n$ labels, as shown in Training Set 1, $T_1 = \{(c_i, y_i)\}_{i=1}^{n}$, of Figure 4. In practice, the size of its input is that of a 3-D cube sample ($a \times a \times D$, where $a$ is the width of the spatial neighborhood). Because the Siamese subnetwork is to learn the sample relationships in $G_n$ and $G_f$, in addition to the target sample, the $k_1$ nearest neighbors in $G_n$ and the $k_2$ farthest neighbors in $G_f$ also need to be input into the network during training. Training Set 2, $T_2 = \{t_i\}_{i=1}^{m}$, of Figure 4 is the training data of the Siamese subnetwork, which includes $m$ training samples, where a training sample $t_i = \{\, c_i, c_{i1}^{n}, \ldots, c_{ik_1}^{n}, c_{i1}^{f}, \ldots, c_{ik_2}^{f} \,\}$ contains one target sample $c_i$, its $k_1$ nearest neighbors $c_{ip}^{n}$ from $G_n$ and its $k_2$ farthest neighbors $c_{iq}^{f}$ from $G_f$. It is worth noting that some neighbors of unlabeled samples are labeled samples, which allows the network to learn the relationships between labeled and unlabeled samples to promote classification.
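A sketch of how the two training sets might be assembled from the 3-D cubes and the graphs, under the data layout assumed in the earlier sketches (cubes indexed by sample, weight matrices stored densely, and exactly $k_1$ and $k_2$ nonzero weights per row):

```python
import numpy as np

def build_training_sets(cubes, labels, w_n, w_f):
    """cubes: (m, a, a, D) 3-D cube samples; labels: (m,) with -1 for unlabeled.
    Returns Training Set 1 (labeled cubes + labels) and Training Set 2
    (for each target: [target, k1 nearest, k2 farthest] cube indices)."""
    labeled = np.where(labels >= 0)[0]
    train_set_1 = (cubes[labeled], labels[labeled])          # n labeled 3-D cubes
    train_set_2 = []
    for i in range(cubes.shape[0]):
        nearest = np.flatnonzero(w_n[i])                     # k1 nearest-neighbour indices
        farthest = np.flatnonzero(w_f[i])                    # k2 farthest-neighbour indices
        train_set_2.append(np.concatenate(([i], nearest, farthest)))
    return train_set_1, np.asarray(train_set_2)
```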
Figure 5 displays the network architecture of the classifier subnetwork, with a feature extractor and a logistic regression layer. In view of the strong feature extraction capability of convolutional layers, the feature extractor is a fully convolutional network. Here, the feature size of each layer decreases as the number of layers increases and the spatial size of the output is $1 \times 1$, so the feature extractor can also be regarded as a process of dimensionality reduction. Taking a four-layer feature extractor as an example, for an input cube $c_i$, its output can be formulated as
$$ F_i = f(c_i) = \varphi\big(\varphi\big(\varphi\big(\varphi(c_i * \Theta_1) * \Theta_2\big) * \Theta_3\big) * \Theta_4\big), \qquad (8)$$
where $\varphi(\cdot)$ is the ReLU function and $*$ is the 2-D convolution. $\Theta_F = \{\Theta_1, \Theta_2, \Theta_3, \Theta_4\}$ is the learning parameter of the feature extractor. The feature extractor and the logistic regression layer are fully connected. The output of the logistic regression layer is formulated as
$$ \hat{y}_i = \delta(\Theta_R \cdot F_i), \qquad (9)$$
in which, $\Theta_R$ is the learning parameter of the logistic regression layer and $\delta(\cdot)$ is the softmax function. Since the task of the classifier subnetwork is classification, for Training Set 1 $T_1$, the loss function adopts the cross-entropy loss, which is defined as
$$ L_C = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{C} y_{ik} \log \hat{y}_{ik}. \qquad (10)$$
Here, $y_i$ is the class label of $c_i$ and $\hat{y}_i$ is its predicted label. $y_i$ and $\hat{y}_i$ are two $C$-dimensional one-hot vectors and $C$ is the number of classes. $y_{ik}$ is the $k$th element of $y_i$ and $\hat{y}_{ik}$ is the $k$th element of $\hat{y}_i$.
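A PyTorch-style sketch of a classifier subnetwork matching this description (a small fully convolutional feature extractor whose channel count shrinks layer by layer, followed by a logistic-regression layer); the layer widths and the $9 \times 9$ cube size are assumptions made for illustration, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class ClassifierSubnetwork(nn.Module):
    """Feature extractor (fully convolutional, shrinking feature maps) plus a
    logistic-regression layer. Layer widths and the 9 x 9 cube size are assumptions."""
    def __init__(self, bands: int, num_classes: int):
        super().__init__()
        self.feature_extractor = nn.Sequential(
            nn.Conv2d(bands, 128, kernel_size=3), nn.ReLU(),   # 9x9 -> 7x7
            nn.Conv2d(128, 64, kernel_size=3), nn.ReLU(),      # 7x7 -> 5x5
            nn.Conv2d(64, 32, kernel_size=3), nn.ReLU(),       # 5x5 -> 3x3
            nn.Conv2d(32, 16, kernel_size=3), nn.ReLU(),       # 3x3 -> 1x1
        )
        self.logistic_regression = nn.Linear(16, num_classes)

    def forward(self, cubes: torch.Tensor) -> torch.Tensor:
        # cubes: (batch, bands, 9, 9) -> features: (batch, 16) -> class logits
        features = self.feature_extractor(cubes).flatten(1)
        return self.logistic_regression(features)

# Cross-entropy loss of Equation (10); nn.CrossEntropyLoss applies the softmax internally.
model = ClassifierSubnetwork(bands=103, num_classes=9)        # e.g., University of Pavia
loss_c = nn.CrossEntropyLoss()(model(torch.randn(4, 103, 9, 9)),
                               torch.randint(0, 9, (4,)))
```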
The task of the Siamese subnetwork is to learn the sample relationships from $G_n$ and $G_f$. As described in Section 2.2, $G_n$ represents the relationships between the target sample and its $k_1$ nearest neighbors and $G_f$ expresses the relationships between the target sample and its $k_2$ farthest neighbors. In order to learn the graph information in $G_n$ and $G_f$ at the same time, a novel Siamese subnetwork with $k_1 + k_2 + 1$ subnets is designed to learn the relationships between one target sample and $k_1 + k_2$ samples at a time; its network architecture is shown in Figure 6. This is different from the traditional Siamese network, which has two subnets and only learns the relationship between two samples at a time [40]. In our designed Siamese network, each subnet has the same network structure as the classifier subnetwork. That is, all subnets have the same network structure, with a feature extractor and a graph-based constraint layer. The point is that they share parameters.
Corresponding to the network architecture, the input data $t_i$ of the Siamese subnetwork consist of $k_1 + k_2 + 1$ 3-D cubes, of which $c_i$ is the target sample. Meanwhile, we suppose that $c_{ip}^{n}$ is the $p$th nearest neighbor of $c_i$ and $c_{iq}^{f}$ is the $q$th farthest neighbor of $c_i$, which are both 3-D cubes. In the Siamese subnetwork, each subnet inputs one 3-D cube. According to the sequence from top to bottom in Figure 6, the first subnet inputs the target sample $c_i$, and its output can be formulated as
$$ o_i = g(c_i) = \delta(\Theta_G \cdot f(c_i)), \qquad (11)$$
which is the same as the output of the classifier subnetwork. $f(c_i)$ is also the output of the feature extractor and $\Theta_G$ is the learning parameter of the graph-based constraint layer. $g(\cdot)$ represents the subnet mapping function, which applies to all subnets due to the same network structure and shared parameters. In that way, the second to $(k_1+1)$th subnets input the nearest neighbor samples and the $(k_1+2)$th to $(k_1+k_2+1)$th subnets input the farthest neighbor samples. When a subnet inputs a nearest neighbor $c_{ip}^{n}$ of $c_i$, its output is described as
$$ o_{ip}^{n} = g(c_{ip}^{n}). \qquad (12)$$
In the same way, when a subnet inputs a farthest neighbor $c_{iq}^{f}$ of $c_i$, its output is described as
$$ o_{iq}^{f} = g(c_{iq}^{f}). \qquad (13)$$
Thus, for the input data $t_i$ in Training Set 2 of Figure 4, the output of the Siamese subnetwork is
$$ O_i = \{\, o_i, o_{i1}^{n}, \ldots, o_{ik_1}^{n}, o_{i1}^{f}, \ldots, o_{ik_2}^{f} \,\}, \qquad (14)$$
which includes the outputs of the $k_1 + k_2 + 1$ subnets. In fact, the Siamese subnetwork aims to promote the learning of the classifier subnetwork to improve classification performance. As a result, based on $G_n$ and $G_f$, the Siamese subnetwork should compress the distance $L_n$ between the target sample and its $k_1$ nearest neighbors and expand the distance $L_f$ between the target sample and its $k_2$ farthest neighbors. The former can be formulated as
$$ L_n = \sum_{i=1}^{m} \sum_{p=1}^{k_1} w_{ip}^{n} \, \| o_i - o_{ip}^{n} \|_2^2, \qquad (15)$$
and the latter can be formulated as
$$ L_f = \sum_{i=1}^{m} \sum_{q=1}^{k_2} w_{iq}^{f} \, \| o_i - o_{iq}^{f} \|_2^2. \qquad (16)$$
$W_n$ and $W_f$ are the weight matrices from $G_n$ and $G_f$, respectively, which are calculated with Equations (5) and (7).
$W_n$ is based on SLSD, which has proven effective in revealing the more realistic relationships between samples [32]; if the SLSD between $x_i$ and $x_j$ is smaller, $w_{ij}^{n}$ is larger, and vice versa. Generally, neural networks optimize learning parameters by minimizing objective functions. However, in the Siamese subnetwork, $L_n$ needs to be minimized, whereas $L_f$ needs to be maximized. Simply optimizing the negative of $L_f$ would make the network unable to converge. To take the convergence of the network into account, the loss function of the Siamese subnetwork is defined as
$$ L_S = L_n + \exp(-L_f). \qquad (17)$$
Here, $\exp(-L_f)$ is a decreasing function of $L_f$ and converges to 0 as $L_f$ increases. As a result, the loss function is optimized towards zero.
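A sketch of this graph-based Siamese objective under the reconstruction above: the shared-parameter subnets are realised by passing every cube through the same network, $L_n$ and $L_f$ are the weighted squared distances of Equations (15) and (16) computed over a batch, and the decreasing function applied to $L_f$ is assumed to be $\exp(-L_f)$.

```python
import torch

def siamese_loss(subnet, target, nearest, farthest, w_n, w_f):
    """Graph-based Siamese loss for one batch.
    subnet: the shared-parameter subnet (same structure as the classifier subnetwork).
    target: (B, D, a, a); nearest: (B, k1, D, a, a); farthest: (B, k2, D, a, a).
    w_n: (B, k1) and w_f: (B, k2) weights taken from W_n and W_f.
    exp(-L_f) is an assumed decreasing function, chosen for convergence as in the text."""
    b, k1 = nearest.shape[:2]
    k2 = farthest.shape[1]
    o_t = torch.softmax(subnet(target), dim=-1)                          # (B, C)
    o_n = torch.softmax(subnet(nearest.flatten(0, 1)), dim=-1).reshape(b, k1, -1)
    o_f = torch.softmax(subnet(farthest.flatten(0, 1)), dim=-1).reshape(b, k2, -1)
    # L_n: compress distances to the nearest neighbours (Equation (15)).
    l_n = (w_n * (o_t.unsqueeze(1) - o_n).pow(2).sum(-1)).sum()
    # L_f: expand distances to the farthest neighbours (Equation (16)).
    l_f = (w_f * (o_t.unsqueeze(1) - o_f).pow(2).sum(-1)).sum()
    # Equation (17): minimising L_n + exp(-L_f) drives both terms towards zero.
    return l_n + torch.exp(-l_f)
```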
2.4. Multitask Few-Shot Learning Strategy
In fact, as its name suggests, the DMN is a two-task network. Since the training data required by the two tasks are completely different, not only in content but also in format, the DMN faces the problem that the two tasks cannot update the learning parameters at the same time during training. In addition, due to the large difference in the amount of training data between the two tasks, DMN learning easily collapses into a single task, so the two tasks cannot achieve uniform convergence. In short, it is challenging to achieve synergy and balance between the two tasks. To solve this problem, we designed a multitask few-shot learning strategy (MFSL).
Next, for ease of explanation, we introduce MFSL in terms of the two subnetworks. Because the purpose of the DMN is classification and the number of labeled samples is very small, the learning of the classifier subnetwork is particularly important. The University of Pavia dataset, for example, contains 42,776 samples from nine classes. If five samples are taken from each class, the number of labeled samples is only 45 and the number of unlabeled samples is 42,731. Thus, the number of training samples for the classifier subnetwork is 45, whereas that for the Siamese subnetwork is 42,776, which is a large gap. Therefore, the task of the classifier subnetwork needs to be emphasized constantly.
In this paper, our GDMFSL deals with HSIs with few labels. The training data of the classifier subnetwork of the DMN are only the labeled samples, so all of them can be used as one batch to participate in DMN learning. Algorithm 1 shows the multitask few-shot learning strategy. MFSL can be understood as follows: during training, each time the Siamese subnetwork learns a batch of data, the classifier subnetwork also learns a batch of data. While the Siamese subnetwork moves through different batches of data, the classifier subnetwork repeatedly learns its single batch. Of course, when the number of labeled samples increases, the training data of the classifier subnetwork can also be divided into multiple batches. The following experiments also prove that MFSL can balance and converge the two tasks of the DMN.
Algorithm 1: Multitask few-shot learning strategy
Input: Training Set 1 $T_1$ with its labels, Training Set 2 $T_2$ and its number $m$, weight matrices $W_n$ and $W_f$, batch size $B$, iterations $I$, learning rate.
Initialize: the learning parameters of the DMN.
1: for epoch in $[1, I]$ do
2:   RandomShuffle($T_2$)
3:   for $i$ in $[1, \lceil m/B \rceil]$ do
4:     RandomShuffle($T_1$ with labels)
5:     Take the $i$th batch of $T_2$ and the batch of $T_1$
6:     Update: Minimize($L_C$ in Equation (10))
7:     Update: Minimize($L_S$ in Equation (17))
8:   end for
9: end for
Output: the predicted labels $\hat{y}$ of the input samples according to Equation (9)
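A compact PyTorch-style sketch of the loop in Algorithm 1, reusing the classifier model and the Siamese loss sketched earlier. How each batch of Training Set 2 is gathered (cubes and the corresponding $W_n$/$W_f$ weights) is abstracted into the batch iterable, and Adam is used only as a placeholder optimizer; both are assumptions of this sketch.

```python
import torch

def mfsl_train(model, siamese_loss_fn, set1_cubes, set1_labels,
               set2_batches, epochs=100, lr=1e-3):
    """Multitask few-shot learning loop (Algorithm 1, simplified sketch).
    set1_cubes/set1_labels: the few labeled cubes, reused as one batch every step.
    set2_batches: an iterable yielding (target, nearest, farthest, w_n, w_f) batches;
    shuffling is assumed to be handled by this iterable.
    The same parameters receive both updates because the subnets share weights."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    ce = torch.nn.CrossEntropyLoss()
    for epoch in range(epochs):
        for target, nearest, farthest, w_n, w_f in set2_batches:
            # Step 6: one classifier update on the (repeated) labeled batch, Equation (10).
            optimizer.zero_grad()
            ce(model(set1_cubes), set1_labels).backward()
            optimizer.step()
            # Step 7: one Siamese update on the current graph-based batch, Equation (17).
            optimizer.zero_grad()
            siamese_loss_fn(model, target, nearest, farthest, w_n, w_f).backward()
            optimizer.step()
    return model
```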