2411.02861v1
Bowei Du (boweidu@buaa.edu.cn) · Zhixuan Liao (zxliao2000@buaa.edu.cn) · Yanan Zhang (zhangyanan@buaa.edu.cn) · Zhi Cai (caizhi97@buaa.edu.cn)

Abstract Developing accurate and efficient detectors for drone imagery is challenging due to the inherent complexity of aerial scenes. While some existing methods aim to achieve high accuracy by utilizing larger models, their computational cost is prohibitive for drones. Recently, Knowledge Distillation (KD) has shown promising potential for maintaining satisfactory accuracy while significantly compressing models in general object detection. Considering the advantages of KD, this paper presents the first attempt to adapt it to object detection on drone imagery and addresses two intrinsic issues: (1) the low foreground-background ratio and (2) small instances and complex backgrounds, which lead to inadequate training and thus insufficient distillation. Therefore, we propose a task-wise Lightweight Mutual Lifting (Light-ML) module with a Centerness-based Instance-aware Distillation (CID) strategy. The Light-ML module mutually harmonizes the classification and localization branches by channel shuffling and convolution, integrating teacher supervision across different tasks during back-propagation, thus facilitating training the student model. The CID strategy extracts valuable regions surrounding instances through the centerness of proposals, enhancing distillation efficacy. Experiments on the VisDrone, UAVDT, and COCO benchmarks demonstrate that the proposed approach improves the accuracies of existing state-of-the-art KD methods with comparable computational requirements. Codes will be available upon acceptance.

Keywords Object Detection · Drone Imagery Detection · Knowledge Distillation
Fig. 1: Challenges to object detection on drone imagery (VisDrone): (a) low foreground-background ratio and (b)
small instances and complex backgrounds.
A number of efforts have been made to address the trade-off between accuracy and efficiency of detectors on drone images through model compression techniques, including lightweight network methods (Zhu et al., 2021; Zhou et al., 2022; Zhu et al., 2023), sparse convolution methods (Yang et al., 2022a; Du et al., 2023) and network pruning methods (Liu et al., 2021; Zhang et al., 2019). Nevertheless, lightweight network methods design elaborate modules to improve the performance without reducing much computational load, while network pruning methods tend to disregard essential calculations on small objects, which are dominant in drone images, thus leading to sub-optimal accuracy. Additionally, sparse convolution methods show a better speed-accuracy trade-off by implementing constrained sparse convolutions, albeit at the cost of compromising the generalizability of the model on drone platforms.

Recently, Knowledge Distillation (KD) has emerged as a promising alternative for model compression, offering largely reduced training and deployment costs while maintaining competitive performance, and it has proved effective in general object detection. KD typically follows a student-teacher framework, where knowledge of the larger teacher is distilled and transferred to the smaller student, thus generating a model of a compressed size with enhanced accuracy.

Unfortunately, it is not so straightforward to adapt KD to detectors on drone images, and two major challenges remain. First, as illustrated in Fig. 1 (a), drone images typically exhibit a low ratio between foreground and background, which limits the supervisory information conveyed by the teacher model in the foreground area. This limited supervision leads to insufficient distillation, resulting in a more pronounced performance gap between the teacher and student models. Second, when applying KD in object detection, a common practice is to extract additional valuable regions complementing positive samples for teacher supervision reinforcement. The strategies mainly include foreground-background region separation (Guo et al., 2021; Wang et al., 2019; Yang et al., 2022b), teacher-student comparison (Dai et al., 2021), and IoU-based selection (Zheng et al., 2022). However, in the case of drone imagery, instances are often small and backgrounds are usually complicated, as shown in Fig. 1 (b), both of which make it more difficult to extract additional regions, further limiting the supervisory information and thus incurring an inadequate distillation effect.

To address the aforementioned challenges, this paper proposes a novel approach with a task-wise Lightweight Mutual Lifting (Light-ML) module and a Centerness-based Instance-aware Distillation (CID) strategy for drone images. To mitigate the insufficient supervision caused by the low foreground-to-background ratio, the Light-ML module is designed to mutually harmonize the classification and localization branches through an innovative module consisting of channel shuffling and convolution. It enhances the input information for both branches, and integrates teacher supervision clues across different tasks during the back-propagation stage, thereby amplifying the impact of KD and effectively reducing the performance gap between the teacher and student models. For the issue of insufficient distillation due to difficulties in extracting additional information from small instances in complex backgrounds, the CID strategy generates valuable regions around instances based on the centerness of the prediction of each anchor box or anchor point, which enables the smooth and flexible estimation of informative regions surrounding objects, particularly concerning small instances.

The contributions of our work are three-fold:
Centerness-based Instance-aware Knowledge Distillation with Task-wise Mutual Lifting
– We propose a novel KD approach to object detection, in particular for drone images. To the best of our knowledge, this is the first attempt to introduce KD techniques for model compression to build detectors for drone imagery.
– We design a Lightweight Mutual Lifting (Light-ML) module and a Centerness-based Instance-aware Distillation (CID) strategy to address the inherent challenges of applying KD to drone imagery detection. These components enrich the supervision information during distillation, thereby enhancing the distillation effect.
– We make extensive evaluations on three public benchmarks (VisDrone, UAVDT, and COCO) with various detection pipelines (e.g. GFL v1 and ATSS) and achieve state-of-the-art accuracy, showing the great potential of KD in this field.

2 Related Work

2.1 General Object Detection

Most CNN-based general object detection methods are categorized into multi-stage detectors (Ren et al., 2015; He et al., 2017; Cai and Vasconcelos, 2018), one-stage anchor-based detectors (Lin et al., 2017; Zhang et al., 2020; Li et al., 2020b, 2021) and one-stage anchor-free detectors (Duan et al., 2019; Zhu et al., 2019; Tian et al., 2019). Multi-stage detectors first generate proposal regions and subsequently refine classification and regression results in the next stage. Conversely, one-stage detectors directly classify and locate objects within the entire feature map. Anchor-based detectors utilize prior anchor boxes to identify potential proposals or predictions, while anchor-free detectors alleviate the complicated computation associated with anchor boxes by leveraging key points of the object, e.g. centerness constraints.

2.2 Drone Images Object Detection

Contemporary studies mainly concentrate on promoting detection accuracy in drone images and usually employ a coarse-to-fine framework, which first adopts a coarse network to approximate the locations of regions harboring densely distributed objects and then a fine network to detect small objects within these identified regions. ClusDet (Yang et al., 2019) integrates a sub-network to generate coarse regions and proposes a scale estimation network for better fine detection. DMNet (Li et al., 2020a) introduces density maps into coarse estimation and utilizes a sliding window to obtain minimal regions. UFPMP-Det (Huang et al., 2022) employs a mosaic-based approach to merge estimation results into a unified image and employs a Multi-Proxy Detection Network to improve the classification accuracy. Focus&Detect (Koyun et al., 2022) introduces the Gaussian Mixture Model for coarse region estimation and proposes an incomplete box suppression algorithm to avoid overlapping regions.

In contrast to the inefficiency of the coarse-to-fine detection methodology, various methods based on lightweight detection models are adopted. SlimYOLOv3 (Zhang et al., 2019) introduces network slimming (Liu et al., 2017) into YOLOv3 (Redmon and Farhadi, 2018), while TPH-YOLOv5 (Zhu et al., 2021) and FasterX (Zhou et al., 2022) introduce attention-based modules into YOLOv5 (Jocher et al., 2020) and YOLOX (Ge et al., 2021), respectively, to improve the detection accuracy on drone images. QueryDet (Yang et al., 2022a) and CEASC (Du et al., 2023) introduce sparse convolution into detectors to exploit the low foreground ratio on drone images. Nevertheless, these methods encounter limitations concerning their acceleration capabilities and the inherent complexity of drone images, making it challenging to achieve an effective trade-off between speed and efficiency.

2.3 Knowledge Distillation for Object Detection

Knowledge Distillation (KD) serves as an efficient model compression method to train lightweight student models through extra supervised knowledge generated from more powerful teacher models. KD was first introduced into the classification task (Hinton et al., 2015; Romero et al., 2014; Park et al., 2019; Liu et al., 2019; Zhao et al., 2022; Zong et al., 2022) and demonstrated its efficacy. On the object detection task, KD can be broadly divided into feature-based distillation (Guo et al., 2021; Yang et al., 2022b; Cao et al., 2022; Yang et al., 2022c) and logit-based distillation (Zheng et al., 2022; Yang et al., 2023). Feature-based distillation focuses on multi-scale intermediate features from the FPN output. FGD (Yang et al., 2022b) separates the foreground and background features, employing attention-based supervision to enhance information extraction within focal areas. PKD (Cao et al., 2022) leverages the Pearson Correlation Coefficient to focus on the relational information. MGD (Yang et al., 2022c) introduces a masked image modeling structure, establishing a general feature-based distillation approach. GKD (Lan and Tian, 2024) leverages gradient information to highlight and distill the most impactful features. Logit-based distillation focuses on logits from the classification and localization tasks. LD (Zheng et al., 2022) leverages the probability distribution of bounding boxes proposed by GFL (Li et al., 2020b), applying Kullback-Leibler divergence for distillation, and proposes an IoU-based Region Weighting method. BCKD (Yang et al., 2023) formulates classification logit maps as multiple binary classification maps, mitigating inconsistency issues. These methods have not considered the performance gap between teacher and student models or information extraction on small objects, resulting in restricted performance on drone images. In contrast, our method introduces lightweight student feature lifting with instance-aware information extraction to achieve better accuracy.

Logit-based approaches focus on distilling the detection results C_i and B_i:

L^dis = (1/N) Σ_i [ I_cls L_cls(C_i^T, C_i^S) + I_reg L_reg(B_i^T, B_i^S) ]  (2)

where C_i^T, C_i^S are the classification results from the teacher and student detectors, and B_i^T, B_i^S are the localization results from the teacher and student detectors.
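For reference, the generic logit-based objective of Eq. (2) can be sketched as follows. This is an illustrative stand-in rather than any cited method's exact loss: the KL divergence for L_cls and the L1 penalty for L_reg are our assumptions, and the per-location information weights I_cls and I_reg are supplied by the caller.

```python
import torch
import torch.nn.functional as F


def logit_distillation_loss(cls_t, cls_s, box_t, box_s, i_cls, i_reg):
    """Sketch of the generic logit-based distillation loss of Eq. (2).

    cls_t / cls_s: (N, num_classes) teacher / student classification logits.
    box_t / box_s: (N, 4) teacher / student box outputs.
    i_cls / i_reg: (N,) per-location information weights (I_cls, I_reg).
    """
    n = cls_s.shape[0]
    # L_cls: KL divergence between teacher and student class distributions
    l_cls = F.kl_div(F.log_softmax(cls_s, dim=-1),
                     F.softmax(cls_t, dim=-1), reduction="none").sum(-1)
    # L_reg: L1 distance between teacher and student box outputs
    l_reg = F.l1_loss(box_s, box_t, reduction="none").sum(-1)
    return (i_cls * l_cls + i_reg * l_reg).sum() / n
```

When the student matches the teacher exactly, both terms vanish; the weights I_cls and I_reg then control which locations contribute supervision, which is precisely the role of the weighting results I discussed in the following sections.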
3.2 Lightweight Mutual Lifting
Fig. 2: Framework of our proposed distillation approach. The Lightweight Mutual Lifting (Light-ML) mechanism is integrated into the detection heads of the student model for feature lifting, and Centerness-based Instance-aware Distillation (CID) introduces an adaptive knowledge weighting algorithm for focal distillation combined with global distillation, enabling the extraction of additional supervision information from the teacher model. Notably, our proposed method can be easily applied to existing logit-based distillation approaches.
An increased number of feature channels leads to a higher computational load. On the other hand, feature processing guided by an attention mechanism may be constrained in both operational speed and deployment efficiency. Consequently, we introduce a CSP-wise structure (Wang et al., 2020) to reduce module computations. Specifically, we split the input features into the convolutional parts F_c^cls ∈ R^{N×kC×H×W}, F_c^reg ∈ R^{N×kC×H×W} and the shuffling parts F_s^cls ∈ R^{N×(1−k)C×H×W}, F_s^reg ∈ R^{N×(1−k)C×H×W}, where k ∈ [0, 1] denotes the division ratio.

Firstly, we introduce channel shuffle (Zhang et al., 2018) between F_s^cls and F_s^reg to facilitate the exchange of feature information between these two branches, thus enabling feature fusion without incurring additional computational costs. We further apply convolution on F_c^cls and F_c^reg to achieve enhanced feature lifting:

F_s^mid = c-shuffle(concat(F_s^cls, F_s^reg))  (3)
F_c^mid = conv(concat(F_c^cls, F_c^reg))  (4)

where c-shuffle indicates the channel shuffle function. F_s^mid ∈ R^{N×2(1−k)C×H×W} and F_c^mid ∈ R^{N×2kC×H×W} are then evenly split into F_s'^cls, F_s'^reg and F_c'^cls, F_c'^reg. Finally, we derive F_out^cls ∈ R^{N×C×H×W} and F_out^reg ∈ R^{N×C×H×W} by concatenating the obtained features, as illustrated in Fig. 2:

F_out^cls = concat(F_c'^cls, F_s'^cls)  (5)
F_out^reg = concat(F_c'^reg, F_s'^reg)  (6)

The proposed Light-ML structure achieves feature lifting within the student model. Similar to the channel shuffle in ShuffleNet (Zhang et al., 2018), Light-ML augments features by integrating information between different task branches during forward propagation, while in backward propagation it optimizes the model from the detection heads to the neck using gradient and distillation information from both branches as additional supervisory clues. This further improves the overall utilization of supervision information, thereby alleviating the inadequate training caused by insufficient distillation information and enhancing the distillation process. Therefore, the Light-ML structure alleviates the insufficient supervision issue caused by the low foreground ratio, making it suitable for drone imagery detection.
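The feature flow of Eqs. (3)-(6) can be sketched as a small PyTorch module. This is a hypothetical reimplementation from the description alone: the 3×3 kernel size, the two-group shuffle, and the even re-splitting are our assumptions, not details confirmed by the paper.

```python
import torch
import torch.nn as nn


class LightML(nn.Module):
    """Sketch of the task-wise Lightweight Mutual Lifting module (Eqs. 3-6).

    Each branch's C channels are split into a conv part (ratio k) and a
    shuffle part (ratio 1-k); the parts are mixed across the cls/reg
    branches and then re-split so each branch keeps C output channels.
    """

    def __init__(self, channels: int, k: float = 0.25):
        super().__init__()
        self.c_conv = int(channels * k)       # channels routed through conv
        self.c_shuf = channels - self.c_conv  # channels routed through shuffle
        # Eq. (4): one conv over the concatenated conv parts (3x3 assumed)
        self.conv = nn.Conv2d(2 * self.c_conv, 2 * self.c_conv, 3, padding=1)

    @staticmethod
    def channel_shuffle(x: torch.Tensor, groups: int = 2) -> torch.Tensor:
        n, c, h, w = x.shape
        return x.view(n, groups, c // groups, h, w).transpose(1, 2).reshape(n, c, h, w)

    def forward(self, f_cls: torch.Tensor, f_reg: torch.Tensor):
        fc_cls, fs_cls = torch.split(f_cls, [self.c_conv, self.c_shuf], dim=1)
        fc_reg, fs_reg = torch.split(f_reg, [self.c_conv, self.c_shuf], dim=1)
        f_s_mid = self.channel_shuffle(torch.cat([fs_cls, fs_reg], dim=1))  # Eq. (3)
        f_c_mid = self.conv(torch.cat([fc_cls, fc_reg], dim=1))             # Eq. (4)
        fc_cls2, fc_reg2 = torch.chunk(f_c_mid, 2, dim=1)
        fs_cls2, fs_reg2 = torch.chunk(f_s_mid, 2, dim=1)
        out_cls = torch.cat([fc_cls2, fs_cls2], dim=1)  # Eq. (5)
        out_reg = torch.cat([fc_reg2, fs_reg2], dim=1)  # Eq. (6)
        return out_cls, out_reg
```

With k = 0.25, the value chosen in the experiments, only a quarter of the channels pass through the convolution, which is where the module's small GFLOPs overhead comes from.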
3.3 Centerness-based Instance-aware Distillation

As illustrated in Sec. 3.1, the information weighting results I highlight more consequential knowledge to distill, improving the accuracy of the training procedure. LD (Zheng et al., 2022) analyzes the knowledge distribution patterns related to the classification and localization tasks, identifying context regions of instances as Valuable Localization Regions (VLR) for localization distillation. However, despite the fact that the localization branch heavily relies on information around small instances in drone image detection, previous work (Zheng et al., 2022) suffers from a low confidence issue due to the low DIoU values when capturing valuable localization information, leading to unreliable knowledge and making it inappropriate for drone images.

Specifically, previous methods propose the VLR based on regions where the DIoU value is lower than γ_LD · α_pos, setting the distillation weight to the IoU. Here, γ_LD serves as a hyper-parameter, and α_pos as the threshold for the positive sample region. In drone imagery detection, small instances lead to low IoU values between the ground truth (GT) boxes and the predefined anchor boxes. As a result, the previous method cannot provide adequate supervision due to the low distillation weight, and is thus inappropriate for drone imagery.

Moreover, lower DIoU values lead the previous methods to set a lower positive sample threshold α_pos, resulting in the expansion of positive sample regions, which causes the VLR to deviate from small instances. Consequently, the teacher model distills more background noise instead of valuable localization knowledge.

To address these issues, we propose a novel centerness-based instance-aware distillation technique, aimed at introducing a size-robust metric to enhance distillation efficiency.

To effectively identify potentially valuable regions around the instances, we introduce the centerness target (Tian et al., 2019) for weighting purposes. For the i-th level, predefined anchors A_i ∈ R^{n×2} and corresponding bounding boxes B_i ∈ R^{n×4} are obtained through the detection heads. The centerness can be derived as follows:

centerness_i = sqrt( (min(l_i*, r_i*) / max(l_i*, r_i*)) × (min(t_i*, b_i*) / max(t_i*, b_i*)) )  (7)

where l_i*, r_i*, t_i* and b_i* come from the location offsets calculated through A_i and B_i.

Considering the properties of centerness, it relies solely on the position relative to the corresponding bounding box, making it robust to size variations. Additionally, when using the GT boxes as B_i, as in FCOS (Tian et al., 2019), the centerness exhibits high values in the positive sample regions of small instances and low values in the contextual regions.

Nevertheless, in FCOS, centerness is only defined within the GT boxes, which may lead to insufficient supervision given the low foreground ratio in drone imagery. Therefore, we utilize the detection results derived from fully trained teacher models as the basis for information weighting.

Specifically, we employ the predicted bounding boxes or the integral results of the predicted regression distributions from trained teacher models as B_i ∈ R^{n×4} to derive centerness. Additionally, we calculate the diagonal length of each GT box, and apply 0.75 times this length as the threshold to preliminarily filter the distillation regions. This approach allows our centerness target to be size-robust, effectively indicating whether a position lies within the contextual regions surrounding the small instances.

Since our primary objective is to capture valuable information surrounding instances, we prioritize the selection of anchors A_i characterized by lower centerness values, which signify greater deviation between predicted objects and these anchors, indicating their proximity to the contexts. Therefore, the centerness-based instance-aware region weighting can be derived as follows:

I_vlr = { 1 − centerness_i,  if centerness_i < γ
        { 0,                 if centerness_i ≥ γ      (8)

where the hyper-parameter γ ∈ [0, 1] controls the region weighting degree; the coverage of valuable regions gradually expands as γ increases.

Combining the centerness-based instance-aware regions I_vlr with the regions I_main of the positive samples generated from the label assignment process, the comprehensive focal distillation within the localization branch can be formulated as:

L_f^dis = (1/N) Σ_i (I_main + α I_vlr) L_reg(B_i^T, B_i^S)  (9)

where α denotes the loss weighting of the centerness-based instance-aware regions.

Compared to the VLR proposed in LD (Zheng et al., 2022), the centerness-based instance-aware region depends only on the position within the predicted box and is robust to small sizes, making it more general and avoiding the low DIoU problem. Furthermore, it places more emphasis on the regions around small instances, effectively focusing on local information.

As merely distilling knowledge within the positive sample regions and their immediate contexts can be insufficient in transferring enough supervisory information from the teacher model, especially concerning drone
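Under the definitions of Eqs. (7)-(8), the region weighting can be sketched as below. This is a minimal sketch under our own assumptions: point anchors, teacher boxes in (x1, y1, x2, y2) format, and the GT-diagonal pre-filtering step omitted; `centerness_weights` is our own name.

```python
import torch


def centerness_weights(anchors: torch.Tensor, boxes: torch.Tensor,
                       gamma: float = 0.45) -> torch.Tensor:
    """Sketch of the CID region weighting of Eqs. (7)-(8).

    anchors: (n, 2) anchor point locations (x, y).
    boxes:   (n, 4) teacher-predicted boxes, (x1, y1, x2, y2).
    Returns (n,) weights: 1 - centerness where centerness < gamma, else 0.
    """
    l = anchors[:, 0] - boxes[:, 0]
    t = anchors[:, 1] - boxes[:, 1]
    r = boxes[:, 2] - anchors[:, 0]
    b = boxes[:, 3] - anchors[:, 1]
    # Eq. (7): centerness is 1 at the box centre, approaching 0 at its border;
    # the clamp zeroes out anchors lying outside the predicted box.
    centerness = torch.sqrt(
        (torch.minimum(l, r) / torch.maximum(l, r)).clamp(min=0)
        * (torch.minimum(t, b) / torch.maximum(t, b)).clamp(min=0)
    )
    # Eq. (8): keep only low-centerness (contextual) positions
    return torch.where(centerness < gamma,
                       1.0 - centerness,
                       torch.zeros_like(centerness))
```

An anchor at the centre of its box gets centerness 1 and is excluded (it already belongs to the positive region), while anchors near the box border receive weights close to 1, which is exactly the contextual emphasis CID aims for.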
images with a significantly lower foreground ratio. Therefore, we additionally adopt the global distillation approach for features proposed by (Yang et al., 2022c) within the localization branch, which incorporates a Masked Image Modeling technique for global information reconstruction on the student model. The mask M_i is set to 1 at each pixel with a probability determined by the hyper-parameter λ, and the reconstruction module G is lightweight, consisting of two convolutional layers. The overall global distillation loss can be obtained from:

L_g^dis = (1/N) Σ_i L(B_i^T, G(B_i^S × M_i))  (10)

where L refers to the Smooth-L1 loss.

Combining the focal loss L_f^dis with the global loss L_g^dis, the overall distillation loss on the localization branch can be formulated as:

L_loc^dis = L_f^dis + β L_g^dis  (11)

4.2 Implementation Details

We implement the proposed method using PyTorch (Paszke et al., 2019) and MMDetection (Chen et al., 2019). All the student models on the VisDrone and COCO datasets are trained for 12 epochs with the SGD optimizer. The learning rate is initially set to 0.01 with a linear warm-up strategy and is decreased by a factor of 10 after epochs 8 and 11. On the UAVDT dataset, we train models for 6 epochs with an initial learning rate of 0.01, decreased by a factor of 10 after epochs 4 and 5. The input image sizes are set to 1333×800 for VisDrone and COCO, and 1024×540 for UAVDT. The hyperparameters k in Sec. 3.2 and γ in Eq. (8) are set as {k = 0.25, γ = 0.45}, and following previous work, the hyperparameters α in Eq. (9), λ in Sec. 3.3, and β in Eq. (11) are set as {α = 1, λ = 0.65, β = 4}. The training phase for VisDrone and UAVDT is conducted on 2 Nvidia RTX 2080Ti GPUs, and on 8 Nvidia RTX 2080Ti GPUs for the COCO dataset.
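The masked global distillation term of Eq. (10) can also be sketched in PyTorch. This is a hedged reimplementation in which the module name, the ReLU, and the 3×3 kernels are our assumptions; the returned scalar would then enter Eq. (11) as L_g^dis with weight β.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlobalDistill(nn.Module):
    """Sketch of the MGD-style global distillation of Eq. (10): randomly mask
    the student feature, reconstruct it with a light two-conv module G, and
    regress toward the teacher feature with Smooth-L1. `lambda_keep` plays the
    role of the paper's hyper-parameter λ (per-pixel keep probability)."""

    def __init__(self, channels: int, lambda_keep: float = 0.65):
        super().__init__()
        self.lambda_keep = lambda_keep
        self.g = nn.Sequential(  # reconstruction module G: two conv layers
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, feat_t: torch.Tensor, feat_s: torch.Tensor) -> torch.Tensor:
        n, _, h, w = feat_s.shape
        # M_i: 1 with probability lambda_keep at each spatial position
        mask = (torch.rand(n, 1, h, w, device=feat_s.device) < self.lambda_keep).float()
        return F.smooth_l1_loss(self.g(feat_s * mask), feat_t)
```

Because the student must reconstruct the teacher feature from a partially masked input, supervision is spread across the whole feature map rather than only the positive regions, which is what makes this term complementary to the focal loss L_f^dis.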
Table 1: Comparison of AP/AR (%) and GFLOPs on VisDrone between our proposed method and existing state-of-the-art methods, with GFL-ResNet101 as the teacher and GFL-ResNet50/GFL-ResNet18 as the students.
Table 2: Comparison of mAP/AP (%) and GFLOPs on UAVDT between our proposed method and existing state-of-the-art methods, with GFL-ResNet101 as the teacher and GFL-ResNet50/GFL-ResNet18 as the students.
As summarized in Table 3, our method achieves improvements of 2.0% and 1.3% with various student models compared to the feature-based method FGD. Additionally, it shows improvements ranging from 0.1% to 1.1% over basic logit-distillation methods, with only a minor increase in computational demand.

It is noteworthy that, as shown in Table 3, when employing ResNet18 as the student model, we observe significant relative improvements of 6.4% and 3.7% over the baseline in AP_S accuracy, greater than those in AP_M (0.7% and 1.4%) and AP_L (2.8% and 3.2%). This demonstrates the effectiveness of our newly proposed Light-ML module on weaker student models and the impact of CID on smaller instances, indicating that our contributions are beneficial for the student to learn from the teacher in locating small instances, which are dominant in drone images.

To further demonstrate the generalizability of our method, we validate it on other base detectors, i.e. FCOS (Tian et al., 2019) and ATSS (Zhang et al., 2020), as shown in Table 4, employing ResNet101 as the teacher model and ResNet18 as the student model on the VisDrone dataset. Similar to GFL, our method shows an improvement of 1.8%-3.2% in mAP over the feature-based method FGD, and 0.2%-0.6% over basic logit-based methods, demonstrating the flexibility of our distillation approach.
Table 3: Comparison of mAP/AP (%) and GFLOPs on COCO between our proposed method and existing state-of-the-art methods, with GFL-ResNet101 as the teacher and GFL-ResNet50/GFL-ResNet18 as the students.
Table 4: Comparison of AP/AR (%) and GFLOPs on VisDrone between our proposed method and existing state-of-the-art methods on different dense object detectors (FCOS and ATSS).
Table 6: Ablation study on the detailed designs of Light-ML with GFL ResNet101-ResNet18 on VisDrone.
4.4.2 On Light-ML

We evaluate the performance of our proposed Light-ML structure by comparing it with other lightweight feature lifting structures, including channel shuffle (Zhang et al., 2018) for all channels, and feature fusion from the classification branch to the regression branch as well as from the regression branch to the classification branch using an attention-based approach proposed in (Li et al., 2021), denoted as "Cls to Reg Fusion" and "Reg to Cls Fusion", respectively. As displayed in Table 6, the two attention-based fusion methods show a performance decline, primarily attributed to the incomplete feature lifting from different tasks. The application of channel shuffling results in a 0.4% performance improvement, whereas our approach achieves a superior performance-efficiency trade-off by introducing the CSP structure and additional convolution, leading to a 0.6% improvement in accuracy with only 2% additional GFLOPs.

As described in Sec. 3.2, k signifies the division ratio of the convolution and channel shuffle operators. The computational cost of the Light-ML module grows as k increases. In Table 7, we report the detection accuracy and efficiency for various values of k. The results indicate that the student model achieves the best trade-off between accuracy and efficiency when k = 0.25, which is therefore used in our experiments.

4.4.3 On CID

We separately evaluate the effects of Centerness-based Instance-aware Distillation and Global Distillation in Table 8. When employing the CID method, student models achieve a better distillation effect around small instances, resulting in a 0.7% improvement in accuracy. Moreover, utilizing only global distillation results in inferior performance, as it cannot provide sufficient clues around instances. However, when we combine global distillation with CID, it further enhances the utilization of potential distillation information in background regions, leading to an additional 0.1% improvement in accuracy.

As described in Eq. (8), γ influences the extent of region weighting in the CID method. In Table 9, we report the detection accuracy for various values of γ. The results indicate that the student model achieves the highest performance when γ = 0.45, which is therefore used in our experiments.
Table 8: Ablation study of CID and Global distillation with GFL ResNet101-ResNet18 on VisDrone.
Fig. 3: Visualization of the frequency distribution of the mean (a) IoU, (b) GIoU and (c) DIoU value for each ground truth box in the VisDrone dataset, where small instances refer to instances with an area less than 32 × 32.
As illustrated in Fig. 3, we compile statistics on the IoU, GIoU and DIoU distributions corresponding to each ground truth box. It is obvious that the IoU, GIoU and DIoU values for the ground truth of small instances are relatively lower than those of medium and large instances. This demonstrates that small instances are constrained by their smaller sizes, leading to a lower distillation weight in LD, which in turn results in LD being unable to distill adequate supervisory information.

To demonstrate that LD introduces more background noise for small instances, we calculate the frequency distribution of the area ratio between the positive sample regions of LD and the ground truth boxes. As shown in Fig. 4, LD sets a larger positive sample region for smaller instances, resulting in a farther VLR, which in turn introduces more background noise.

We also visualize the VLR proposed by LD with different γ_LD values and the VLR proposed by our CID. As shown in Fig. 5, LD generates a farther VLR for small instances with a lower distillation information weight I_feat. Moreover, although a lower γ_LD assigns a higher I_feat to regions far from the small instances, compared to our CID this I_feat is still insufficient to distill enough knowledge. In contrast, CID generates a more precise VLR with a higher distillation information weight, demonstrating the effectiveness of our method.

To address the issue of the low distillation information weight, a common practice is to assign a larger weight to the VLR in the loss function. Therefore, we double the distillation weight in the VLR proposed by LD with different γ_LD values; as shown in Fig. 5, the doubled distillation information weights I_feat show some advantage over the weight used in CID. Concretely, we double the distillation weight hyper-parameter λ_VLR of LD on the VLR from the default 0.25 to 0.5, and conduct experiments on LD with different γ_LD values. However, as shown in Fig. 6, LD still struggles to distill more knowledge to the student model and remains inferior to CID due to the background noise.

Furthermore, when we block the information in the VLR during distillation, the student model still achieves an mAP of 26.1, which is close to the mAP of LD. This demonstrates that the VLR proposed by LD is far from the small instances.
Fig. 4: Visualization of the frequency distribution of the area ratio between the positive sample regions of LD and
the ground truth boxes in the VisDrone dataset, where: (a) instances with an area less than 32 × 32, (b) instances
with an area not less than 32 × 32.
Fig. 5: Visualization of VLR regions using CID and LD with different γLD values. Highlighted areas indicate
activated regions for distillation.
It introduces background noise rather than valuable localization knowledge. In contrast, our CID method places the VLR around small instances, making it more suitable for drone imagery detection.

It is worth noting that when we adopt a centerness calculation similar to that used in FCOS to propose the VLR and set the distillation weight to 1 − centerness_i, the student model achieves an improvement of 0.9%. This result further confirms that LD suffers from distilling background noise, making it unsuitable for drone imagery detection. However, as illustrated in Fig. 6, the traditional centerness method still falls short in distillation due to the low ratio of foreground regions in drone imagery. Consequently, we extend the concept of centerness to be computed across the entire feature maps, leading to an additional improvement of 0.3%. This demonstrates the effectiveness of our method.

4.4.4 Visualization of Knowledge Distillation

To demonstrate the effectiveness of our distillation method on the localization branch, we visualize the performance distance between the teacher and student models at the P3 level of the FPN for small instances on the VisDrone and UAVDT datasets. As shown in Fig. 7,
Fig. 6: Ablation study on the different VLR regions and weights with GFL ResNet18 on VisDrone. "LD (λ_VLR = 0.00/0.25/0.50)" denotes LD with different distillation weights on the VLR; "CID within GT" indicates the use of the FCOS method for calculating centerness.
Fig. 7: Visualization of cosine distance of the localization branch between the teacher (GFL-ResNet101) and the
student (GFL-ResNet18) at the P3 level of the FPN. Lower cosine distance indicates better performance.
our approach significantly reduces the performance gap between the teacher and student models in the positive sample regions and the VLR regions surrounding the instances that are important for localization, thereby confirming the effectiveness of our method.

5 Conclusion

We propose a novel knowledge distillation method for drone image object detection. It first introduces a Light-ML structure designed to improve the utilization of supervision information during distillation through task-wise feature lifting, consequently boosting the distillation efficiency. Meanwhile, it proposes a localization branch distillation strategy integrating Centerness-based Instance-aware Distillation to refine the distillation around small instances. Extensive experimental results on the VisDrone, COCO, and UAVDT datasets demonstrate the improvement in accuracy with competitive computation of our proposed approach.

Data Availability Statements

The VisDrone (Zhu et al., 2018), UAVDT (Du et al., 2018) and COCO (Lin et al., 2014) databases used in this manuscript are deposited in publicly available repositories, respectively: https://github.com/VisDrone/VisDrone-Dataset, https://sites.google.com/view/grli-uavdt and https://cocodataset.org.