1. Introduction
In recent decades, advancements in remote sensing technology have led to the development of spaceborne sensors that now offer sub-meter spatial resolution, comparable to airborne images from a few decades ago [1]. Operating continuously, these sensors generate highly valuable data, making automatic image interpretation and object detection increasingly essential.
With the advancement of convolutional neural networks (CNNs), significant breakthroughs have been achieved in object detection, with many detectors demonstrating outstanding performance when extensive training data are available [2,3,4]. However, in practical applications, the labeled data obtained from remote sensing sensors are limited, and acquiring them is costly and difficult. CNN-based detectors are prone to overfitting and exhibit poor performance when faced with such limited data. To address this issue, many studies have investigated how to train detectors that perform well with a limited amount of data, giving rise to few-shot object detection (FSOD) [5,6,7].
Numerous FSOD methods adhere to a two-stage training approach: an initial base training phase uses an established base dataset containing ample training samples to acquire general prior knowledge, and the base-trained detector is subsequently fine-tuned on a few-shot dataset comprising the target categories. The categories within the base dataset are referred to as base classes, while the target categories added in the few-shot dataset are denoted as novel classes. Currently, there are two mainstream approaches to FSOD: meta-learning and transfer learning. Both follow the two-stage training paradigm, whose effectiveness has been demonstrated across various scenarios [8,9]. Specifically, meta-learning-based methods stand out for their ability to rapidly adapt to extremely limited sample scenarios and exhibit strong generalization capabilities [10,11,12], which makes them an excellent choice when training data are scarce. Nonetheless, several challenges arise when dealing with remote sensing images (RSIs) in FSOD.
Firstly, during fine-tuning on the novel classes, the fine-tuning dataset must be constructed according to the N-way-K-shot principle [13]. For instance, when N = 10 and K = 3, the fine-tuning dataset should consist of 10 categories, with each category containing three labels. However, the correspondence between the three labels and the images is not strictly one-to-one. In existing remote sensing datasets, it is challenging to ensure that one label corresponds to one image, as a single image usually contains multiple objects. To adhere to the N-way-K-shot principle, only a portion of the labels from these images can be utilized as training inputs. Consequently, incomplete labeling frequently occurs in the fine-tuning dataset. As shown in Figure 1, when fine-tuning for the “airplane” class as a novel class, only certain objects in the image are annotated. The given labels provide positive guidance to the detector, whereas missing labels may lead the detector to regard the corresponding objects as background, causing significant confusion; we refer to this as the incompletely annotated objects (IAO) issue. Current FSOD methods addressing the IAO issue [14,15] employ pseudo-labels or novel classifiers to mitigate the problem. However, these approaches rely on the model’s current understanding of novel class knowledge, which is evidently much weaker than that of base classes under few-shot conditions. Since the root cause of the IAO issue lies in data processing, we believe that addressing it from the data perspective is straightforward and effective.
Additionally, a significant concern within meta-learning frameworks is how support set features are integrated with query set features. Presently, two predominant aggregation approaches are recognized: class-specific aggregation (CSA) [12] and class-agnostic aggregation (CAA) [16]. CSA [12] merges features of the same class from both the support set and the query set, enhancing the detector’s ability to memorize specific objects. Conversely, CAA [16] allows the fusion of object features from different classes between the support set and the query set, thereby aiding the detector in distinguishing between classes. Moreover, there exists a technique for encoding support set features into vector form and assisting the query set through channel-wise multiplication, which offers operational simplicity, lightweight parameters, and universality. Li et al. [11] employed this method in RSIs, where support set images and their corresponding label mask images were jointly fed into convolutional layers, ultimately encoding feature vectors to assist the query set features. However, Han et al. [16] suggested that such vectors may be influenced by data scarcity and variations in examples, failing to adequately represent the entire class distribution. This limitation can be partly attributed to [11] introducing background information during encoding. Guan et al. [17] proposed encoding only the objects, which reduces the variance to some extent. However, objects in RSIs exhibit significant intra-class differences, and encoding only the objects may still result in large variance in the output vectors, as shown in Figure 2. Hence, we hold that during training, deriving the support set’s features not solely from a specific set of data but by synthesizing the auxiliary features across the support set would stabilize the support set’s output vectors and enhance their auxiliary effect.
Finally, objects within RSIs often exhibit minor inter-class differences (MIDs), manifested in aspects such as color and shape, as shown in Figure 3. Relying solely on support set features to assist detection does not effectively enhance the classifier’s discriminative ability. The prevailing method for addressing the MID issue is contrastive learning. However, these approaches [18,19] significantly increase model complexity and are challenging to apply directly to meta-learning-based few-shot detectors. We instead fully exploit the distinct characteristics of the various objects in the support set: by using cross-entropy to sharpen the detector’s ability to distinguish between the different class components of the support set’s output vectors, its knowledge of class discrimination can be reinforced.
To address the aforementioned challenges, and considering that limited computational resources in practical applications require models to be as lightweight as possible, we propose a novel Balanced Few-Shot Object Detector based on the single-stage detector YOLOv9 (GELAN-C version) [20]. Given that all of its components are designed with balance and stability in handling few-shot samples in mind, it is fittingly dubbed B-FSDet. To begin with, to achieve genuinely balanced input samples during fine-tuning, we propose a straightforward yet highly effective data clearing strategy (DCS). The DCS operates on fine-tuning dataset images, removing redundant objects based on the complete set of labels and the subset of labels used for few-shot learning. Notably, this process is lightweight, as it does not rely on complex deep-learning-based image inpainting techniques; it simply replaces the objects with white Gaussian noise (WGN). Importantly, our detector’s loss computation only involves valuable positive samples, thus minimizing the impact of the substituted WGN on detector performance. Furthermore, to ensure that the output vectors from the support set comprehensively represent the features of each class, we introduce the stationary feature extraction module (SFEM), based on [11,20]. We also apply a dynamic exponential moving average (DEMA) to the output vectors to mitigate the instability stemming from changing model parameters during training and the addition of novel classes during fine-tuning. In the meantime, we propose a stationary and fast prediction method (SFPM), coupled with SFEM, that does not rely on support set images specifically matched with the object and thus achieves high detection speed. Instead, it randomly selects from the support set class library, demonstrating significant robustness and efficiency. Finally, we propose the inter-class discrimination support loss (ICDSL). Building upon the existing detection head, we augment it with a decoding function for support set vectors. ICDSL is calculated between the decoded results and the ground truth classes provided by the support set to strengthen the detection head’s ability to discriminate between classes.
The main contributions of this paper can be summarized as follows:
We propose a novel Balanced Few-Shot Object Detector (B-FSDet), based on YOLOv9 (GELAN-C version) [20] and meta-learning. Considering limited computational resources, B-FSDet achieves remarkably high detection accuracy with a low parameter count and effectively addresses numerous challenges prevalent in RSIs.
To ensure genuine balance in input samples during fine-tuning, we introduce DCS, which removes redundant objects from fine-tuning dataset images. The lightweight process employs WGN to replace the redundant objects, resulting in precise alignment of objects with labels and adherence to the N-way-K-shot principle.
To make the output vectors comprehensively represent the features of each class in the support set, we introduce SFEM and SFPM. The two parts construct a stationary meta-learning mode, improving the robustness of the detector.
Addressing the issue of minor inter-class differences, we propose ICDSL to strengthen the detection head’s ability to discriminate between classes.
3. Methods
3.1. Problem Setting
FSOD aims to train a detection model on a dataset of base classes such that it can detect objects from novel classes given only a few annotated samples. A meta-learning-based detector is trained to glean meta-knowledge from a vast number of detection tasks sampled from the base classes, enabling it to generalize effectively to novel classes. Each sampled task is termed an episode, where an episode consists of a set of support images and a set of query images. During each episode, the support images serve as training samples, instructing the model on how to tackle the given task, while the query images act as test samples, assessing the model’s performance on the task.
For a remote sensing dataset encountering the FSOD problem, we proceed as follows. Following a methodology akin to the fine-tuning approach [36], our meta-learning mode entails two primary stages: base class training and novel class (meta) fine-tuning. Initially, as delineated in [40], we partition the dataset into base class data $D_{base}$ and novel class data $D_{novel}$. Here, $D_{base}$ comprises image data $I$ and the corresponding labels $L$, denoted as
$$D_{base} = \{(I_i^m, L_i^m)\}, \quad m \in \{1, \dots, N_{base}\},$$
where $m$ represents the class sequence, $i$ symbolizes the image sequence, and $N_{base}$ denotes the number of base classes. $D_{novel}$ is similarly structured, expressed by
$$D_{novel} = \{(I_j^m, L_j^m)\}, \quad m \in \{N_{base}+1, \dots, N\},$$
where $I$, $L$, and $m$ have the same meanings as in $D_{base}$, $j$ symbolizes the image sequence, and $N$ denotes the total number of classes. In addition, there must be
$$L_{base} \cap L_{novel} = \emptyset,$$
meaning that there is no overlap between the labels of the novel classes and the base classes.
During base class training, we form both the query set $Q_{base}$ and the support set $S_{base}$ from $D_{base}$, which can be expressed by
$$Q_{base} = \{(I_i^m, L_i^m)\}, \quad S_{base} = \{(I_i^m, M_i^m)\}, \quad m \in \{1, \dots, N_{base}\},$$
where $M$ denotes the label mask images corresponding to the images. Both the query set and the support set utilize all the base class data. Subsequently, in novel class fine-tuning, adhering to the N-way-K-shot principle, we select $K$ objects and labels for each class to form $Q_{novel}$ and $S_{novel}$, which can be expressed as
$$Q_{novel} = \{(I_j^m, L_j^m)\}, \quad S_{novel} = \{(I_j^m, M_j^m)\}, \quad m \in \{1, \dots, N\}.$$
Assuming $F_{init}$ represents the untrained model, it adheres to the following paradigm throughout the entire training:
$$F_{init} \xrightarrow{\,Q_{base},\, S_{base}\,} F_{base} \xrightarrow{\,Q_{novel},\, S_{novel}\,} F_{novel},$$
where $F_{base}$ denotes the model after base training and $F_{novel}$ inherits the parameters of $F_{base}$, obtained through fine-tuning with $Q_{novel}$ and $S_{novel}$.
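To make the N-way-K-shot construction concrete, the sketch below shows how a K-shot fine-tuning label set could be sampled. The annotation format and helper names are illustrative assumptions, not the paper’s actual data pipeline.

```python
import random
from collections import defaultdict

def build_k_shot_labels(annotations, classes, k, seed=0):
    """Sample K labels per class for the fine-tuning set (illustrative).

    `annotations` is assumed to be a list of (image_id, class_id, box)
    tuples. Because an RSI usually contains several objects, keeping
    only K labels per class leaves some objects in the selected images
    unannotated -- the IAO issue that DCS later repairs.
    """
    rng = random.Random(seed)
    per_class = defaultdict(list)
    for image_id, class_id, box in annotations:
        if class_id in classes:
            per_class[class_id].append((image_id, class_id, box))

    few_shot = []
    for class_id in classes:
        labels = per_class[class_id]
        rng.shuffle(labels)
        few_shot.extend(labels[:k])  # exactly K labels per class
    return few_shot
```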
3.2. Framework Overview
The overall structure of B-FSDet is shown in Figure 6. Built upon the YOLOv9 [20] framework, our work focuses on transforming an efficient single-stage detector into a robust few-shot detector while addressing the diverse challenges inherent in the practical use of RSIs. B-FSDet comprises SFEM, SFPM, DCS, feature extraction layers based on YOLOv9 [20], and a detection head. The training process follows the meta-learning paradigm outlined in [11,12], involving base class training followed by fine-tuning on novel classes. During base class training, query set images are processed through the YOLOv9-based feature extraction layers, while support set images pass through shared feature extraction layers before vector encoding. All network architecture parameters based on YOLOv9 are available in [20] and Section 2.4, and the specific implementation of the other modules is detailed in later sections. The encoded vectors then undergo the DEMA operation with previously encoded vectors, followed by channel-wise multiplication with the query set features to obtain fused features, which are then input into the detection head for loss calculation. In the fine-tuning phase, the query set images undergo DCS to remove unannotated objects before being processed, and the operations on the support set remain consistent with the base class training phase. During prediction, SFPM utilizes the stationary support set vectors obtained during training for inference, enabling a further reduction in model parameters.
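The channel-wise fusion step described above reduces to a single broadcasted multiplication. A minimal sketch follows; the tensor shapes are assumptions chosen for illustration.

```python
import torch

def channel_wise_fusion(query_feat: torch.Tensor, support_vec: torch.Tensor) -> torch.Tensor:
    """Reweight query features with an encoded support-set class vector.

    query_feat:  (B, C, H, W) feature map from the YOLOv9-based layers
    support_vec: (C,) class vector produced by SFEM after DEMA
    """
    return query_feat * support_vec.view(1, -1, 1, 1)

# Example: a 512-channel query feature map fused with one class vector.
fused = channel_wise_fusion(torch.randn(2, 512, 40, 40), torch.randn(512))
print(fused.shape)  # torch.Size([2, 512, 40, 40])
```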
3.3. Few-Shot Data Clearing Strategy (DCS)
As previously discussed, the issues of IAO-novel and IAO-base have a significant impact on the detector’s performance. Previous research has aimed to mitigate confusion for the detector by assigning pseudo-labels to unannotated objects or by reducing the impact of noisy labels [38,39]. In a similar manner, a consistent label classifier has been proposed to make the labels more consistent between base training and fine-tuning [15], as shown in Figure 7. However, the quality of pseudo-labels is contingent upon the current performance of the network and the distribution of objects within the current batch. Consequently, pseudo-labels are often random and not entirely accurate across all classes, and the resulting sample imbalance can decrease detection accuracy for certain classes. Similar methods also rely on the current performance of the model’s classifier, so they cannot achieve complete accuracy. The root cause of the IAO problem lies in data imbalance, stemming from both data collection and data processing. Instead of focusing solely on enhancing the model’s ability to detect unannotated objects, it is more effective to address the data imbalance directly through data clearing.
Hence, we propose a simple and effective data clearing strategy (DCS) aimed at FSOD. This method focuses on removing redundant objects from training images rather than adding missing parts to training labels. The process is conducted only during the fine-tuning stage, when the data volume is extremely low. It is both feasible and yields significant performance improvements, aligning with the training principles of FSOD.
Specifically, DCS comprises four steps. Firstly, identify the missing labels for each image during fine-tuning. For example, if an image contains objects $\{o_1, o_2, \dots, o_n\}$ with corresponding labels $\{l_1, l_2, \dots, l_n\}$ ($n$ denotes the number of objects included), but only the label $l_1$ is provided in the fine-tuning dataset, then the missing labels $\{l_2, \dots, l_n\}$ corresponding to objects $\{o_2, \dots, o_n\}$ are determined. There should be
$$\{l_2, \dots, l_n\} = \{l_1, l_2, \dots, l_n\} \setminus \{l_1\}.$$
Secondly, locate the objects corresponding to the missing labels. In digital images, each pixel is inherently discrete; $f(x, y)$ represents a two-dimensional function denoting the object portion of the entire image. Thirdly, replace these objects with white Gaussian noise (WGN). Finally, restore the annotated objects onto the background. It is important to consider that targets in RSIs may overlap, and losing annotated objects along with some background information is not desirable. Therefore, annotated instances and their surrounding 10 pixels are preserved and restored.
The universality of WGN in randomness enables its substitution for unannotated objects, making it adaptable to any target. Coupled with a loss function that ignores negative samples, it effectively mitigates the noise impact while addressing the issues of IAO-novel and IAO-base. Let $X \sim \mathcal{N}(\mu, \sigma^2)$; the pixel values of each point of the target with missing labels are replaced through random sampling from $X$. $X$ is a random variable with a probability density function given by
$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right).$$
In DCS, $\mu$ and $\sigma$ are set to fixed values.
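A minimal NumPy sketch of the four DCS steps follows. The box format, noise parameters, and function names are assumptions for illustration; the paper fixes its own $\mu$ and $\sigma$.

```python
import numpy as np

def apply_dcs(image, all_boxes, kept_boxes, mu=127.5, sigma=30.0, margin=10):
    """Replace unannotated objects with WGN (illustrative DCS sketch).

    Boxes are (x1, y1, x2, y2) pixel coordinates; `mu`/`sigma` are
    placeholder noise parameters. Annotated objects plus a 10-pixel
    margin are restored at the end, as in step four of DCS.
    """
    out = image.astype(np.float32)
    h, w = image.shape[:2]

    # Steps 1-3: overwrite objects whose labels are missing with WGN.
    kept = {tuple(b) for b in kept_boxes}
    for box in all_boxes:
        if tuple(box) in kept:
            continue
        x1, y1, x2, y2 = box
        region = out[y1:y2, x1:x2]
        out[y1:y2, x1:x2] = np.random.normal(mu, sigma, region.shape)

    # Step 4: restore annotated objects and their surrounding margin,
    # since overlapping targets should not lose their context.
    for x1, y1, x2, y2 in kept_boxes:
        x1m, y1m = max(0, x1 - margin), max(0, y1 - margin)
        x2m, y2m = min(w, x2 + margin), min(h, y2 + margin)
        out[y1m:y2m, x1m:x2m] = image[y1m:y2m, x1m:x2m]

    return np.clip(out, 0, 255).astype(np.uint8)
```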
3.4. Stationary Feature Extraction Module (SFEM)
A notable drawback of meta-learning-based methods is the difficulty of designing the detector. FSODM [11] demonstrated the rationality of having support set output features assist query set feature learning in vector form. This approach is efficient and parameter-free, yet the variance of the output vectors is relatively large, making it hard for them to represent the whole class. Based on [11,20], we redesign the feature extraction module and propose a loss-based dynamic exponential moving average (DEMA) method, whereby the output vectors are influenced not only by the support set images of the current batch but also by the output vectors from previous batches. Therefore, during the later stages of training, the vectors, originating from all objects within the support set, can adequately represent the majority of class features. In this process, the task is accomplished solely with common convolutional layers, in accordance with the lightweight design of the model. The extraction part is detailed in Table 1.
As shown in Figure 8, the specific procedure of DEMA involves recording the encoding vectors $v_{t-1}$ from the last batch of training. Subsequently, in the next batch, the output vectors are updated by
$$v_t = \beta_t\, v_{t-1} + (1 - \beta_t)\, \hat{v}_t,$$
where $\hat{v}_t$ and $v_t$, respectively, represent the current vector output and the updated result, and $\beta_t$ is the dynamic factor scaled by the loss $\mathcal{L}_{t-1}$ of the last batch, as expressed by
$$\beta_t = \frac{\mathcal{L}_{t-1}}{k}.$$
In SFEM, $k = 100$. During the experiments, the typical range of the total loss is between 10 and 40, resulting in the weight of the previous DEMA results accounting for approximately 10% to 40%. Further details of $\mathcal{L}$ are introduced in the next part. The decay weight $\beta_t$ is dynamically adjusted based on the loss of the current iteration. If the loss for the current iteration is large, indicating that the quality of the current batch is not satisfactory, the decay weight increases, placing more emphasis on the previous DEMA results to stabilize the learning process. Conversely, if the loss is lower, implying a higher-quality batch, the decay weight decreases, allowing the model to adapt more quickly to new information.
As training progresses, particularly in the later stages, $v_t$ comes to represent relatively stable class centroids. This stability is crucial for ensuring that the model’s predictions become more consistent and reliable over time, reflecting the accumulated knowledge from previous iterations. By dynamically adjusting the decay weight, the model effectively balances integrating new data against retaining the robustness of learned features, ultimately leading to improved performance and generalization.
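The DEMA update can be expressed in a few lines. The sketch below assumes the $\beta_t = \mathcal{L}_{t-1}/k$ scaling reconstructed above; the class names and clamping are illustrative additions.

```python
import torch

class DEMA:
    """Loss-scaled exponential moving average over support vectors
    (a sketch of the SFEM update, not the paper's exact code)."""

    def __init__(self, k: float = 100.0):
        self.k = k          # scaling constant for the decay weight
        self.state = None   # running class vectors v_{t-1}

    @torch.no_grad()
    def update(self, current_vec: torch.Tensor, last_loss: float) -> torch.Tensor:
        # A large loss (low-quality batch) yields a larger decay weight,
        # leaning on the previous DEMA result to stabilize training.
        beta = min(max(last_loss / self.k, 0.0), 1.0)
        if self.state is None:
            self.state = current_vec.clone()
        else:
            self.state = beta * self.state + (1.0 - beta) * current_vec
        return self.state
```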
3.5. Loss Computation
During training, SFEM reduces the variance in the output vectors, making the feature vectors provided by the support set more stable and reliable. However, in RSIs, inter-class differences are often minimal. To increase the differentiation of each class component in the support set output vectors while maintaining stability, we propose the inter-class discrimination support loss (ICDSL). In addition to channel fusion with the query set, the vectors encoded by the support set directly enter the detection head. We integrate a vector decoder into the detection head and apply the cross-entropy loss (CEL) to the decoded results to enhance the detector’s ability to distinguish between classes.
The whole loss is calculated by
$$\mathcal{L} = \mathcal{L}_{box} + \mathcal{L}_{cls} + \mathcal{L}_{dfl} + \lambda\, \mathcal{L}_{ICDSL},$$
where $\mathcal{L}_{box}$, $\mathcal{L}_{cls}$, and $\mathcal{L}_{dfl}$ are based on [20], and $\mathcal{L}_{ICDSL}$ is based on the CEL. The gains of $\mathcal{L}_{box}$, $\mathcal{L}_{cls}$, and $\mathcal{L}_{dfl}$ are fixed hyperparameters, while $\lambda$ represents the gain of $\mathcal{L}_{ICDSL}$, which is discussed in detail in subsequent experiments. Suppose the aggregation features before the detection head are expressed by
$$F^{(q,\,s)} \in \mathbb{R}^{W \times H \times C},$$
where $W$, $H$, and $C$, respectively, represent the width, height, and channels of the images (for the sake of simplicity, only one scale is presented), and the superscript $(q, s)$ represents the sources of the fused features, where $q$ denotes those from the query set and $s$ denotes those from the support set.
As shown in Figure 9, when calculating the detection losses, the detection layer first filters out the fused features $F^{(q_m,\,s_m)}$ that match between the support set and the query set, and then decouples them, calculating the corresponding $\mathcal{L}_{box}$ and $\mathcal{L}_{cls}$. It is evident that non-matching fused features $F^{(q_m,\,s_{m'})}$ and $F^{(q_{m'},\,s_m)}$ ($m' \neq m$) are introduced during the computation. We smooth the calculated loss by averaging over all the losses involved; hence, the final detection loss is expressed by
$$\mathcal{L}_{det} = \frac{1}{N_f} \sum_{i=1}^{N_f} \left( \mathcal{L}_{box}^{(i)} + \mathcal{L}_{cls}^{(i)} \right),$$
where $N_f$ denotes the number of fused features involved. We consider the non-matching fused features as noise handled by data augmentation, enhancing the model’s robustness against interference.
As for $\mathcal{L}_{ICDSL}$, firstly, let the output vectors be $v^{(1)}$, $v^{(2)}$, and $v^{(3)}$, representing the three scales of features. $D$ denotes the decoder operator, and we have the decoding output
$$\hat{c} = D\!\left(v^{(1)}, v^{(2)}, v^{(3)}\right),$$
where $\hat{c}$ represents the class score after the linear transformation $D$ of the three scales. Then, we introduce CEL to enhance inter-class disparities within RSIs. This enables the model to effectively discern subtle differences between classes, thereby improving classification performance. As illustrated in Figure 10, by penalizing incorrect classifications and rewarding correct ones based on the logarithmic difference between predicted and ground truth class probabilities, the model is incentivized to learn more discriminative features representative of each class. $\mathcal{L}_{ICDSL}$ is calculated by
$$\mathcal{L}_{ICDSL} = -\sum_{m=1}^{N} c_m \log \hat{c}_m,$$
where $c$ represents the ground truth classes.
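A sketch of ICDSL under the reconstruction above: the three-scale support vectors are decoded by a linear operator $D$ and scored with cross-entropy. The channel sizes and the concatenation-based decoder are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICDSL(nn.Module):
    """Inter-class discrimination support loss (illustrative sketch)."""

    def __init__(self, channels=(256, 512, 512), num_classes=20):
        super().__init__()
        # D: a linear decoder over the concatenated three-scale vectors
        self.decoder = nn.Linear(sum(channels), num_classes)

    def forward(self, v1, v2, v3, gt_classes):
        # Decode the support vectors into class scores, then apply CEL
        # so that vectors of different classes are pushed apart.
        scores = self.decoder(torch.cat([v1, v2, v3], dim=-1))
        return F.cross_entropy(scores, gt_classes)

# Example: one support vector per class at each of the three scales.
loss_fn = ICDSL()
loss = loss_fn(torch.randn(20, 256), torch.randn(20, 512),
               torch.randn(20, 512), torch.arange(20))
```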
3.6. Stationary and Fast Prediction Method (SFPM)
In both the CAA [16] and CSA [12] feature aggregation methods, previous meta-learning-based few-shot detectors fuse the image output features with all classes in the support set during prediction. The result corresponding to the matched support set class is then obtained from the fused features, as shown in Figure 11.
Regardless of whether the auxiliary features from the support set are constant or generated in real time during prediction, this approach slows down the model’s prediction speed. Additionally, due to significant intra-class differences, the vectors may not necessarily match the objects in the query set images perfectly. We propose a fast and accurate prediction method, SFPM, as shown in Figure 11. During prediction, it is unnecessary to fuse every vector from the support set with the query set; instead, only one is randomly selected. This implies that B-FSDet can effectively distinguish between different classes, thanks to the support of ICDSL and the introduction of non-matching scenarios during feature fusion at training time. In subsequent experiments, we demonstrate that SFPM achieves significantly faster inference speeds than previous methods without compromising much detection accuracy.
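A minimal sketch of SFPM inference under stated assumptions; a `model` exposing `extract`, `fuse`, and `head` callables is hypothetical.

```python
import random
import torch

@torch.no_grad()
def sfpm_predict(model, image, class_vectors):
    """SFPM inference sketch: fuse the query features with ONE randomly
    chosen stationary support vector instead of iterating over every
    class; ICDSL training makes the head robust to this non-matching
    fusion, which is what yields the speedup.
    """
    feats = model.extract(image)                        # query features
    vec = random.choice(list(class_vectors.values()))   # one class vector
    fused = model.fuse(feats, vec)                      # channel-wise mult.
    return model.head(fused)                            # boxes + classes
```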
5. Conclusions
This paper presents B-FSDet, a few-shot object detector based on YOLOv9 and meta-learning, designed to tackle various challenges encountered in RSIs. Firstly, we introduce DCS, which effectively filters out incompletely annotated objects from images, ensuring a balanced distribution of true labels and objects and thereby reducing confusion for the detector. Secondly, we propose SFEM and SFPM, constructing a stationary meta-learning mode that achieves high detection accuracy while maintaining extremely fast inference speeds. Finally, ICDSL is introduced to increase inter-class differences among target classes, enhancing the detector’s ability to distinguish between classes effectively. The results indicate that B-FSDet achieves a detection accuracy approximately 8% higher than current methods in most scenarios. Specifically, on the NWPU.v2 dataset, under the 3-shot setting, B-FSDet achieves an nAP exceeding 75%, while under the 20-shot setting, both bAP and nAP are close to or exceed 90%. On the DIOR dataset, B-FSDet achieves a bAP exceeding 70% under all settings, while under all split settings with the 20-shot setup, nAP exceeds 40%. Additionally, the experimental data in Figure 16 demonstrate that our proposed B-FSDet achieves an inference speed approximately 2–3 times faster than current SOTA methods.
While B-FSDet has achieved considerable progress, addressing the substantial intra-class variations present in RSIs remains an ongoing challenge. In particular, acquiring comprehensive knowledge of new classes from extremely limited samples remains a significant hurdle. Future efforts will focus on “expanding balanced samples” to enable the detector to acquire more nuanced knowledge of new classes, and on exploring a balanced classifier to complement it. Additionally, a learnable self-attention layer may replace the DEMA strategy to achieve better performance.