3.1. Problem Formulation
In few-shot classification, models are usually trained with meta-learning, which proceeds in two stages: meta-training and meta-testing. In this setting, both the original training data set and the test data set must be reorganized before use. Each stage, training and testing alike, works with two sets: the support set, S, and the query set, Q. Few-shot learning (FSL) problems are usually posed as N-way K-shot problems, where N is the number of categories used in the meta-testing process and K is the number of samples available for each of those categories.
An N-way K-shot data set is constructed by randomly selecting N categories from those with many samples and then randomly selecting K + X samples from each of these N categories; X can be any number, provided that K + X does not exceed the number of samples available in each category. The K selected samples per category form the support set, $S = \{(x_i^s, y_i^s)\}_{i=1}^{N \times K}$, and the X samples per category form the query set, $Q = \{(x_i^q, y_i^q)\}_{i=1}^{N \times X}$. Here, $(x_i^s, y_i^s)$ denotes the i-th image in the support set S and its label, and $(x_i^q, y_i^q)$ denotes the i-th image in the query set Q and its label. During meta-training, a model that learns the mapping from S to Q is trained at each iteration. At meta-testing, the model uses the learned mapping to assign each example in Q to one of the N categories of the sampled support set S.
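The episode construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sample_episode`, the flat `labels` list, and the toy data are hypothetical and are not taken from the paper's code.

```python
import random
from collections import defaultdict


def sample_episode(labels, n_way, k_shot, x_query, rng=None):
    """Sample one N-way K-shot episode as (support, query) lists of (index, label).

    `labels[i]` is the category of sample i (a hypothetical data layout).
    """
    rng = rng or random.Random()
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Only categories holding at least K + X samples can fill both sets.
    eligible = [c for c, idxs in by_class.items() if len(idxs) >= k_shot + x_query]
    if len(eligible) < n_way:
        raise ValueError("not enough categories with K + X samples")

    support, query = [], []
    for c in rng.sample(eligible, n_way):
        picked = rng.sample(by_class[c], k_shot + x_query)
        support += [(i, c) for i in picked[:k_shot]]   # first K samples: support set S
        query += [(i, c) for i in picked[k_shot:]]     # remaining X samples: query set Q
    return support, query


# Toy data: 10 categories with 20 samples each.
labels = [c for c in range(10) for _ in range(20)]
support, query = sample_episode(labels, n_way=5, k_shot=1, x_query=15,
                                rng=random.Random(0))
```

Sampling K + X indices per category in a single draw keeps the support and query sets disjoint, which is why X is bounded by the number of samples left after K are taken.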
3.3. Corner Multi-Head Attention (CMA)
The CFM module is used to extract corner feature maps. The basic representation produced by the corner extractor is converted into a suitable form and, together with the feature produced by the feature extractor, serves as reliable input for the CMA module to analyze feature correlations between images. Figure 2 and Figure 3 show the architecture of the CFM module and the CMA module, respectively.
Corner feature map (CFM). This map is generated by a corner detector, mainly the Shi–Tomasi corner detector [23], which can produce higher-quality corner information. For each pixel (x, y) in each image, with S denoting the image (structure) tensor at that pixel, the corner response function is

$$R(x, y) = \min(\lambda_1, \lambda_2) \tag{1}$$

where $\lambda_1$ and $\lambda_2$ are the eigenvalues of S. If the response exceeds a given threshold, the pixel (x, y) is a corner point and its position is mapped to 1; otherwise, it is mapped to 0. The corner point map is therefore a sparse binary matrix in which the entries equal to 1 record the corner positions. Applying the function yields a single-channel output. To match the feature fusion method, a three-layer Concat operation is used to obtain a tensor of shape (N, 3, H, W), and a linear transformation (Linear) is then applied to obtain the corner feature.
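As a rough illustration of the corner map described above, the following pure-Python sketch thresholds the smaller eigenvalue of a per-pixel 2×2 structure tensor. Several details are assumptions made for the example and are not stated in the text: central-difference gradients, an unweighted square window of radius 1, clamped borders, and the helper names `shi_tomasi_map` and `to_three_channels`. The final linear transformation is omitted.

```python
import math


def shi_tomasi_map(img, threshold, radius=1):
    """Binary corner map: 1 where the smaller structure-tensor eigenvalue > threshold."""
    h, w = len(img), len(img[0])

    def px(y, x):
        # Clamp coordinates so border pixels reuse their nearest neighbour.
        return img[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    # Central-difference image gradients.
    ix = [[(px(y, x + 1) - px(y, x - 1)) / 2.0 for x in range(w)] for y in range(h)]
    iy = [[(px(y + 1, x) - px(y - 1, x)) / 2.0 for x in range(w)] for y in range(h)]

    corner = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            a = b = c = 0.0  # entries of the symmetric tensor [[a, b], [b, c]]
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        gx, gy = ix[yy][xx], iy[yy][xx]
                        a += gx * gx
                        b += gx * gy
                        c += gy * gy
            # Closed-form smaller eigenvalue of a symmetric 2x2 matrix.
            lam_min = (a + c) / 2.0 - math.sqrt(((a - c) / 2.0) ** 2 + b * b)
            corner[y][x] = 1 if lam_min > threshold else 0
    return corner


def to_three_channels(corner):
    """Repeat the single-channel map three times, giving shape (3, H, W)."""
    return [[row[:] for row in corner] for _ in range(3)]


# Toy image: a bright square occupying the lower-right quadrant of an 8x8 grid.
image = [[1.0 if (y >= 4 and x >= 4) else 0.0 for x in range(8)] for y in range(8)]
corner_map = shi_tomasi_map(image, threshold=0.6)
stacked = to_three_channels(corner_map)
```

On this toy image, pixels along a straight edge have one eigenvalue near zero and are rejected, while the pixel where the two edges meet has two sizeable eigenvalues and is kept, which is the behaviour the min-eigenvalue criterion is designed for.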
The schematic diagram of the corner feature map calculation is shown in Figure 2. The corner point map extracted via the CFM module is shown in Figure 4.
CMA computation. This module models the cross relationship between image features and corner features. Its core is the attention mechanism, defined as follows (here Q, K, and V denote the query, key, and value matrices, not the query set):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \tag{2}$$

Q is a matrix of size $n \times d_k$, and K is a matrix of size $m \times d_k$, where $d_k$ is the dimension of the keys and queries. The dot product $QK^{\top}$ is the attention score matrix, which is of size $n \times m$. Therefore, $(QK^{\top})_{ij}$ is the attention score of the i-th query on the j-th key. The larger this value, the stronger the association between the i-th query and the j-th key. To prevent the scores from becoming too large and to stabilize training, the result is scaled by dividing it by $\sqrt{d_k}$. The softmax function then normalizes the scores of each query over all keys. Finally, the normalized scores are multiplied by the V matrix, so that the i-th row of the output is the weighted sum of values retrieved by the i-th query. To enhance the model’s attention to different features, the attention mechanism is extended to multiple heads:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_H)\,W^{O}, \qquad \mathrm{head}_h = \mathrm{Attention}(QW_h^{Q}, KW_h^{K}, VW_h^{V}) \tag{3}$$

$W^{O}$ is the weight matrix at the output, and H is the number of heads. Multi-head attention operates on three sets of vectors: query vectors, key vectors, and value vectors. For a given query vector, it computes a weighted sum of the value vectors, where each weight is given by the similarity between the query vector and the corresponding key vector [26]. In the multi-head mechanism, the input is projected into multiple heads, each of which performs its computation independently. The outputs of the heads are then combined to form the final result. This allows the model to capture different aspects of the input simultaneously, which helps it learn diverse representations and improves overall performance. A normalization layer is applied to the input features before or after the multi-head attention mechanism to reduce the internal covariate shift problem during training.
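The attention computation described above can be illustrated with a small pure-Python sketch of scaled dot-product attention and its multi-head extension. It follows the standard formulation rather than the authors' implementation; the helper names, the list-of-lists matrix layout, and the toy projection matrices are assumptions made for the example.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Return (softmax(Q K^T / sqrt(d_k)) V, attention weights)."""
    d_k = len(k[0])
    scores = [[sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_k) for kj in k]
              for qi in q]
    weights = [softmax(row) for row in scores]  # each query normalized over all keys
    return matmul(weights, v), weights


def multi_head(q, k, v, w_q, w_k, w_v, w_o):
    """Concat(head_1, ..., head_H) W^O with per-head projections w_q[h], w_k[h], w_v[h]."""
    heads = [attention(matmul(q, w_q[h]), matmul(k, w_k[h]), matmul(v, w_v[h]))[0]
             for h in range(len(w_q))]
    # Concatenate the head outputs feature-wise for every query position.
    concat = [sum((head[i] for head in heads), []) for i in range(len(q))]
    return matmul(concat, w_o)


# Toy inputs: two queries, two keys, two values of dimension 2.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out, weights = attention(q, k, v)

# Two heads with identity projections; W^O averages the two identical heads.
identity = [[1.0, 0.0], [0.0, 1.0]]
w_o = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.0], [0.0, 0.5]]
mh_out = multi_head(q, k, v, [identity] * 2, [identity] * 2, [identity] * 2, w_o)
```

With identity projections the two heads are identical, so the averaging output matrix reproduces the single-head result; in practice each head has its own learned projections, which is what lets the heads attend to different aspects of the input.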
Then, the image features and corner features are added through a residual connection. This strengthens the image-level features and helps few-shot learners better understand “what to learn and distinguish” in images. Our experiments show that CMA is robust to intra-class invariant features and helps generalize to unseen classes.
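A residual fusion of this kind might look like the sketch below. The exact residual formula is not reproduced in the text, so the form used here, layer normalization applied to the sum of the image features and the attention output, is an assumption, as are the names `layer_norm` and `residual_fuse` and the post-normalization placement.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Normalize one feature vector to zero mean and (approximately) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def residual_fuse(image_feats, attended_feats):
    """Assumed form: LayerNorm(image feature + attention output), row by row."""
    return [layer_norm([a + b for a, b in zip(img, att)])
            for img, att in zip(image_feats, attended_feats)]


# Toy example: one image feature vector and one attention output vector.
fused = residual_fuse([[1.0, 2.0, 3.0]], [[0.5, 0.5, 2.0]])
```

The skip connection keeps the original image features in the output, so the corner-guided attention acts as a correction on top of them rather than a replacement.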
3.4. Learning Embedding Adaptation
In this subsection, we derive adaptive embeddings for few-shot learning and then describe the learning objectives of our approach (Algorithm 1).
Algorithm 1: Few-Shot Learning via Embedding Adaptation with CMA
Input: few-shot training set; loss weight hyperparameters.
Output: learned model parameters.
1: for each episode in the training set do
2:   Extract the features
3:   Calculate the corner feature map by Equation (1)
4:   Calculate the intersection feature map of image features and corner features by Equations (2)–(4)
5:   Calculate the task embedding by Equation (5)
6:   Calculate the few-shot classifier loss by Equations (6) and (7)
7:   Calculate the total loss L by Equation (8)
8: end for
9: return the learned model parameters
Embedding adaptation (EA) for FSL. What is learned is a set-to-set function, rather than the traditional instance-wise function, so that each image instance is contextualized by the set it belongs to. We adopt the transformer-based set-to-set embedding adaptation method from the FEAT [14] framework. It can be applied to different types of task-independent embedding functions, and its outputs are compared using similarity measures.
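Learning objective. The main goal is to learn the parameters, $\theta$, through the meta-training process.
Firstly, we compute a prototype, $c_k$, for category k in the support set S as follows:

$$c_k = \frac{1}{|S_k|} \sum_{(x_i, y_i) \in S_k} f_{\theta}(x_i) \tag{5}$$

where $S_k$ is the subset of S labeled with category k and $f_{\theta}$ is a feature extractor with learnable parameters $\theta$. Then, based on the prototype method, the probability distribution of a data point x in the query set Q over the C classes is defined as follows:

$$p(y = k \mid x) = \frac{\exp\left(-d(f_{\theta}(x), c_k)\right)}{\sum_{k'=1}^{C} \exp\left(-d(f_{\theta}(x), c_{k'})\right)} \tag{6}$$

where d(·,·) represents the cosine distance. Therefore, the loss of the nearest centroid classifier on an episode can be calculated as follows:

$$L_{\mathrm{nc}} = -\frac{1}{|Q|} \sum_{(x_i, y_i) \in Q} \log p(y = y_i \mid x_i) \tag{7}$$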
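The prototype-based classifier described above can be sketched as follows. This is a minimal illustration of a nearest-centroid classifier with cosine distance, assuming mean-of-embeddings prototypes, a softmax over negative distances with no temperature, and a mean negative log-likelihood loss; the function names and the toy embeddings are hypothetical, and the embeddings stand in for the output of the feature extractor.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def prototypes(support):
    """Map each label to the mean embedding of its support samples."""
    sums, counts = {}, {}
    for emb, label in support:
        acc = sums.setdefault(label, [0.0] * len(emb))
        for j, val in enumerate(emb):
            acc[j] += val
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in acc] for c, acc in sums.items()}


def class_probs(emb, protos):
    """Softmax over negative cosine distances to every prototype."""
    logits = {c: -cosine_distance(emb, p) for c, p in protos.items()}
    m = max(logits.values())
    exps = {c: math.exp(l - m) for c, l in logits.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}


def episode_loss(query, protos):
    """Mean negative log-likelihood of the true labels over the query set."""
    return -sum(math.log(class_probs(emb, protos)[label])
                for emb, label in query) / len(query)


# Toy 2-way 2-shot episode with 2-D embeddings (stand-ins for extractor outputs).
support = [([1.0, 0.0], "a"), ([0.8, 0.2], "a"), ([0.0, 1.0], "b"), ([0.2, 0.8], "b")]
query = [([0.9, 0.1], "a"), ([0.1, 0.9], "b")]
protos = prototypes(support)
probs = class_probs(query[0][0], protos)
loss = episode_loss(query, protos)
```

Because the cosine distance is bounded between 0 and 2, the resulting probabilities are fairly soft; implementations often divide the distances by a temperature to sharpen them, which is not shown here.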
We adopt a joint training approach to improve the proposed module and the backbone network by combining two losses, the cross-entropy-based classification loss and the embedding space-based classification loss, in a weighted sum. L is the total loss, and the loss weight hyperparameter is set to its default value.