2411.02861v1
Bowei Du (boweidu@buaa.edu.cn) · Zhixuan Liao (zxliao2000@buaa.edu.cn) · Yanan Zhang (zhangyanan@buaa.edu.cn) · Zhi Cai (caizhi97@buaa.edu.cn)

Abstract Developing accurate and efficient detectors for drone imagery is challenging due to the inherent complexity of aerial scenes. While some existing methods aim to achieve high accuracy by utilizing larger models, their computational cost is prohibitive for drones. Recently, Knowledge Distillation (KD) has shown promising potential for maintaining satisfactory accuracy while significantly compressing models in general object detection. Considering the advantages of KD, this paper presents the first attempt to adapt it to object detection on drone imagery and addresses two intrinsic issues: (1) the low foreground-background ratio and (2) small instances and complex backgrounds, which lead to inadequate training and thus insufficient distillation. Therefore, we propose a task-wise Lightweight Mutual Lifting (Light-ML) module with a Centerness-based Instance-aware Distillation (CID) strategy. The Light-ML module mutually harmonizes the classification and localization branches by channel shuffling and convolution, integrating teacher supervision across different tasks during back-propagation, thus facilitating training the student model. The CID strategy extracts valuable regions surrounding instances through the centerness of proposals, enhancing distillation efficacy. Experiments on the VisDrone, UAVDT, and COCO benchmarks demonstrate that the proposed approach improves the accuracies of existing state-of-the-art KD methods with comparable computational requirements. Codes will be available upon acceptance.

Keywords Object Detection · Drone Imagery Detection · Knowledge Distillation
Fig. 1: Challenges to object detection on drone imagery (VisDrone): (a) low foreground-background ratio and (b)
small instances and complex backgrounds.
A number of efforts have been made to address the trade-off between accuracy and efficiency of detectors on drone images through model compression techniques, including lightweight network methods (Zhu et al., 2021; Zhou et al., 2022; Zhu et al., 2023), sparse convolution methods (Yang et al., 2022a; Du et al., 2023) and network pruning methods (Liu et al., 2021; Zhang et al., 2019). Nevertheless, lightweight network methods design elaborate modules to improve the performance without reducing much computational load, while network pruning methods tend to disregard essential calculations on small objects, which are dominant in drone images, thus leading to sub-optimal accuracy. Additionally, sparse convolution methods show a better speed-accuracy trade-off by implementing constrained sparse convolutions, albeit at the cost of compromising the generalizability of the model on drone platforms.

Recently, Knowledge Distillation (KD) has emerged as a promising alternative for model compression, offering largely reduced training and deployment costs while maintaining competitive performance, and it has proved effective in general object detection. KD typically follows a student-teacher framework, where knowledge of the larger teacher is distilled and transferred to the smaller student, thus generating a model of a compressed size with enhanced accuracy.

Unfortunately, it is not so straightforward to adapt KD to detectors on drone images, and two major challenges remain. First, as illustrated in Fig. 1 (a), drone images typically exhibit a low ratio between foreground and background, which limits the supervisory information conveyed by the teacher model in the foreground area. This limited supervision leads to insufficient distillation, resulting in a more pronounced performance gap between the teacher and student models. Second, when applying KD in object detection, a common practice is to extract additional valuable regions complementing positive samples for teacher supervision reinforcement. The strategies mainly include foreground-background region separation (Guo et al., 2021; Wang et al., 2019; Yang et al., 2022b), teacher-student comparison (Dai et al., 2021), and IoU-based selection (Zheng et al., 2022). However, in the case of drone imagery, instances are often small and backgrounds are usually complicated, as shown in Fig. 1 (b), both of which make it more difficult to extract additional regions, further limiting the supervisory information and thus incurring an inadequate distillation effect.

To address the aforementioned challenges, this paper proposes a novel approach with a task-wise Lightweight Mutual Lifting (Light-ML) module and a Centerness-based Instance-aware Distillation (CID) strategy for drone images. To mitigate the insufficient supervision caused by the low foreground-to-background ratio, the Light-ML module is designed to mutually harmonize the classification and localization branches through an innovative module consisting of channel shuffling and convolution. It enhances the input information for both branches, and integrates teacher supervision clues across different tasks during the back-propagation stage, thereby amplifying the impact of KD and effectively reducing the performance gap between the teacher and student models. For the issue of insufficient distillation due to difficulties in extracting additional information from small instances in complex backgrounds, the CID strategy generates valuable regions around instances based on the centerness of the prediction of each anchor box or anchor point, which enables the smooth and flexible estimation of informative regions surrounding objects, particularly concerning small instances.

The contributions of our work are three-fold:
Centerness-based Instance-aware Knowledge Distillation with Task-wise Mutual Lifting
– We propose a novel KD approach to object detection, in particular for drone images. To the best of our knowledge, this is the first attempt to introduce KD techniques for model compression to build detectors for drone imagery.
– We design a Lightweight Mutual Lifting (Light-ML) module and a Centerness-based Instance-aware Distillation (CID) strategy to address the inherent challenges of applying KD to drone imagery detection. These components enrich the supervision information during distillation, thereby enhancing the distillation effect.
– We make extensive evaluations on three public benchmarks (VisDrone, UAVDT, and COCO) with various detection pipelines (e.g. GFL v1 and ATSS) and achieve state-of-the-art accuracy, showing the great potential of KD in this field.

2 Related Work

2.1 General Object Detection

Most CNN-based general object detection methods are categorized into multi-stage detectors (Ren et al., 2015; He et al., 2017; Cai and Vasconcelos, 2018), one-stage anchor-based detectors (Lin et al., 2017; Zhang et al., 2020; Li et al., 2020b, 2021) and one-stage anchor-free detectors (Duan et al., 2019; Zhu et al., 2019; Tian et al., 2019). Multi-stage detectors first generate proposal regions and subsequently refine classification and regression results in the next stage. Conversely, one-stage detectors directly classify and locate objects within the entire feature map. Anchor-based detectors utilize prior anchor boxes to identify potential proposals or predictions, while anchor-free detectors alleviate the complicated computation associated with anchor boxes by leveraging key points of the object, e.g. centerness constraints.

2.2 Drone Images Object Detection

Contemporary studies mainly concentrate on promoting detection accuracy in drone images and usually employ a coarse-to-fine framework, which first adopts a coarse network to approximate the locations of regions harboring densely distributed objects and then a fine network to detect small objects within these identified regions. ClusDet (Yang et al., 2019) integrates a sub-network to generate coarse regions and proposes a scale estimation network for better fine detection. DMNet (Li et al., 2020a) introduces density maps into coarse estimation and utilizes a sliding window to obtain minimal regions. UFPMP-Det (Huang et al., 2022) employs a mosaic-based approach to merge estimation results into a unified image and employs a Multi-Proxy Detection Network to improve the classification accuracy. Focus&Detect (Koyun et al., 2022) introduces the Gaussian Mixture Model for coarse region estimation and proposes an incomplete box suppression algorithm to avoid overlapping regions.

In contrast to the inefficiency of the coarse-to-fine detection methodology, various methods based on lightweight detection models are adopted. SlimYOLOv3 (Zhang et al., 2019) introduces network slimming (Liu et al., 2017) into YOLOv3 (Redmon and Farhadi, 2018), while TPH-YOLOv5 (Zhu et al., 2021) and FasterX (Zhou et al., 2022) introduce attention-based modules into YOLOv5 (Jocher et al., 2020) and YOLOX (Ge et al., 2021), respectively, to improve the detection accuracy on drone images. QueryDet (Yang et al., 2022a) and CEASC (Du et al., 2023) introduce sparse convolution into detectors to exploit the low foreground ratio on drone images. Nevertheless, these methods encounter limitations concerning their acceleration capabilities and the inherent complexity of drone images, making it challenging to achieve an effective trade-off between speed and efficiency.

2.3 Knowledge Distillation for Object Detection

Knowledge Distillation (KD) serves as an efficient model compression method to train lightweight student models through extra supervised knowledge generated from more powerful teacher models. KD was first introduced into the classification task (Hinton et al., 2015; Romero et al., 2014; Park et al., 2019; Liu et al., 2019; Zhao et al., 2022; Zong et al., 2022) and demonstrated its efficacy. On the object detection task, KD can be broadly divided into feature-based distillation (Guo et al., 2021; Yang et al., 2022b; Cao et al., 2022; Yang et al., 2022c) and logit-based distillation (Zheng et al., 2022; Yang et al., 2023). Feature-based distillation focuses on multi-scale intermediate features from the FPN output. FGD (Yang et al., 2022b) separates the foreground and background features, employing attention-based supervision to enhance information extraction within focal areas. PKD (Cao et al., 2022) leverages the Pearson Correlation Coefficient to focus on the relational information. MGD (Yang et al., 2022c) introduces a masked image modeling structure, establishing a general feature-based distillation approach. GKD (Lan and Tian, 2024) leverages gradient information to highlight and distill the most impactful features. Logit-based distillation focuses on logits from the classification and localization tasks. LD (Zheng et al., 2022) leverages the probability distribution of bounding boxes proposed by GFL (Li et al., 2020b), applying Kullback-Leibler divergence for distillation, and proposes an IoU-based Region Weighting method. BCKD (Yang et al., 2023) formulates classification logit maps as multiple binary classification maps, mitigating inconsistency issues. These methods have not considered the performance gap between teacher and student models or information extraction on small objects, resulting in restricted performance on drone images. In contrast, our method introduces lightweight student feature lifting with instance-aware information extraction to achieve better accuracy.

Logit-based approaches focus on distilling the detection results C_i and B_i:

L^dis = (1/N) Σ_i [ I_cls L_cls(C_i^T, C_i^S) + I_reg L_reg(B_i^T, B_i^S) ]  (2)

where C_i^T, C_i^S are the classification results from the teacher and student detectors, and B_i^T, B_i^S are the localization results from the teacher and student detectors.
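For reference, the generic logit-based objective of Eq. (2) can be sketched as follows. This is an illustrative stand-in rather than any cited method's exact loss: the KL divergence for L_cls and the L1 penalty for L_reg are our assumptions, and the per-location information weights I_cls and I_reg are supplied by the caller.

```python
import torch
import torch.nn.functional as F


def logit_distillation_loss(cls_t, cls_s, box_t, box_s, i_cls, i_reg):
    """Sketch of the generic logit-based distillation loss of Eq. (2).

    cls_t / cls_s: (N, num_classes) teacher / student classification logits.
    box_t / box_s: (N, 4) teacher / student box outputs.
    i_cls / i_reg: (N,) per-location information weights (I_cls, I_reg).
    """
    n = cls_s.shape[0]
    # L_cls: KL divergence between teacher and student class distributions
    l_cls = F.kl_div(F.log_softmax(cls_s, dim=-1),
                     F.softmax(cls_t, dim=-1), reduction="none").sum(-1)
    # L_reg: L1 distance between teacher and student box outputs
    l_reg = F.l1_loss(box_s, box_t, reduction="none").sum(-1)
    return (i_cls * l_cls + i_reg * l_reg).sum() / n
```

When the student matches the teacher exactly, both terms vanish; the weights I_cls and I_reg then control which locations contribute supervision, which is precisely the role of the weighting results I discussed in the following sections.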
3.2 Lightweight Mutual Lifting
Fig. 2: Framework of our proposed distillation approach. The Lightweight Mutual Lifting (Light-ML) mechanism is integrated into the detection heads of the student model for feature lifting, and Centerness-based Instance-aware Distillation (CID) introduces an adaptive knowledge weighting algorithm for focal distillation combined with global distillation, enabling the extraction of additional supervision information from the teacher model. Notably, our proposed method can be easily applied to existing logit-based distillation approaches.
An increased number of feature channels leads to a higher computational load. On the other hand, feature processing guided by an attention mechanism may be constrained in both operational speed and deployment efficiency. Consequently, we introduce a CSP-wise structure (Wang et al., 2020) to reduce module computations. Specifically, we split the input features into the convolutional parts F_c^cls ∈ R^{N×kC×H×W}, F_c^reg ∈ R^{N×kC×H×W} and the shuffling parts F_s^cls ∈ R^{N×(1−k)C×H×W}, F_s^reg ∈ R^{N×(1−k)C×H×W}, where k ∈ [0, 1] denotes the division ratio.

Firstly, we introduce channel shuffle (Zhang et al., 2018) between F_s^cls and F_s^reg to facilitate the exchange of feature information between these two branches, thus enabling feature fusion without incurring additional computational costs. We further apply convolution on F_c^cls and F_c^reg to achieve enhanced feature lifting:

F_s^mid = c-shuffle(concat(F_s^cls, F_s^reg))  (3)
F_c^mid = conv(concat(F_c^cls, F_c^reg))  (4)

where c-shuffle indicates the channel shuffle function. F_s^mid ∈ R^{N×2(1−k)C×H×W} and F_c^mid ∈ R^{N×2kC×H×W} are then evenly split into F_s'^cls, F_s'^reg and F_c'^cls, F_c'^reg. Finally, we derive F_out^cls ∈ R^{N×C×H×W} and F_out^reg ∈ R^{N×C×H×W} by concatenating the obtained features, as illustrated in Fig. 2:

F_out^cls = concat(F_c'^cls, F_s'^cls)  (5)
F_out^reg = concat(F_c'^reg, F_s'^reg)  (6)

The proposed Light-ML structure achieves feature lifting within the student model. Similar to the channel shuffle in ShuffleNet (Zhang et al., 2018), Light-ML augments features by integrating information between different task branches during forward propagation, while in backward propagation it optimizes the model from the detection heads to the neck using gradient and distillation information from both branches as additional supervisory clues. This further improves the overall utilization of supervision information, thereby alleviating the inadequate training caused by insufficient distillation information and enhancing the distillation process. Therefore, the Light-ML structure alleviates the insufficient supervision issue caused by the low foreground ratio, making it suitable for drone imagery detection.
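The feature flow of Eqs. (3)-(6) can be sketched as a small PyTorch module. This is a hypothetical reimplementation from the description alone: the 3×3 kernel size, the two-group shuffle, and the even re-splitting are our assumptions, not details confirmed by the paper.

```python
import torch
import torch.nn as nn


class LightML(nn.Module):
    """Sketch of the task-wise Lightweight Mutual Lifting module (Eqs. 3-6).

    Each branch's C channels are split into a conv part (ratio k) and a
    shuffle part (ratio 1-k); the parts are mixed across the cls/reg
    branches and then re-split so each branch keeps C output channels.
    """

    def __init__(self, channels: int, k: float = 0.25):
        super().__init__()
        self.c_conv = int(channels * k)       # channels routed through conv
        self.c_shuf = channels - self.c_conv  # channels routed through shuffle
        # Eq. (4): one conv over the concatenated conv parts (3x3 assumed)
        self.conv = nn.Conv2d(2 * self.c_conv, 2 * self.c_conv, 3, padding=1)

    @staticmethod
    def channel_shuffle(x: torch.Tensor, groups: int = 2) -> torch.Tensor:
        n, c, h, w = x.shape
        return x.view(n, groups, c // groups, h, w).transpose(1, 2).reshape(n, c, h, w)

    def forward(self, f_cls: torch.Tensor, f_reg: torch.Tensor):
        fc_cls, fs_cls = torch.split(f_cls, [self.c_conv, self.c_shuf], dim=1)
        fc_reg, fs_reg = torch.split(f_reg, [self.c_conv, self.c_shuf], dim=1)
        f_s_mid = self.channel_shuffle(torch.cat([fs_cls, fs_reg], dim=1))  # Eq. (3)
        f_c_mid = self.conv(torch.cat([fc_cls, fc_reg], dim=1))             # Eq. (4)
        fc_cls2, fc_reg2 = torch.chunk(f_c_mid, 2, dim=1)
        fs_cls2, fs_reg2 = torch.chunk(f_s_mid, 2, dim=1)
        out_cls = torch.cat([fc_cls2, fs_cls2], dim=1)  # Eq. (5)
        out_reg = torch.cat([fc_reg2, fs_reg2], dim=1)  # Eq. (6)
        return out_cls, out_reg
```

With k = 0.25, the value chosen in the experiments, only a quarter of the channels pass through the convolution, which is where the module's small GFLOPs overhead comes from.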
3.3 Centerness-based Instance-aware Distillation

As illustrated in Sec. 3.1, the information weighting results I highlight more consequential knowledge to distill, improving the accuracy of the training procedure. LD (Zheng et al., 2022) analyzes the knowledge distribution patterns related to the classification and localization tasks, identifying context regions of instances as Valuable Localization Regions (VLR) for localization distillation. However, despite the fact that the localization branch heavily relies on information around small instances in drone image detection, previous work (Zheng et al., 2022) suffers from a low confidence issue due to the low DIoU values when capturing valuable localization information, leading to unreliable knowledge and making it inappropriate for drone images.

Specifically, previous methods propose the VLR based on regions where the DIoU value is lower than γ_LD · α_pos, setting the distillation weight to the IoU. Here, γ_LD serves as a hyper-parameter, and α_pos as the threshold for the positive sample region. In drone imagery detection, small instances lead to low IoU values between the ground truth (GT) boxes and the predefined anchor boxes. As a result, the previous method cannot provide adequate supervision due to the low distillation weight, and is thus inappropriate for drone imagery.

Moreover, lower DIoU values lead the previous methods to set a lower positive sample threshold α_pos, resulting in the expansion of positive sample regions, which causes the VLR to deviate from small instances. Consequently, the teacher model distills more background noise instead of valuable localization knowledge.

To address these issues, we propose a novel centerness-based instance-aware distillation technique, aimed at introducing a size-robust metric to enhance distillation efficiency.

To effectively identify potentially valuable regions around the instances, we introduce the centerness target (Tian et al., 2019) for weighting purposes. For the i-th level, predefined anchors A_i ∈ R^{n×2} and corresponding bounding boxes B_i ∈ R^{n×4} are obtained through the detection heads. The centerness can be derived as follows:

centerness_i = sqrt( (min(l_i*, r_i*) / max(l_i*, r_i*)) × (min(t_i*, b_i*) / max(t_i*, b_i*)) )  (7)

where l_i*, r_i*, t_i* and b_i* come from the location offsets calculated through A_i and B_i.

Considering the properties of centerness, it relies solely on the position relative to the corresponding bounding box, making it robust to size variations. Additionally, when using the GT boxes as B_i, as in FCOS (Tian et al., 2019), the centerness exhibits high values in the positive sample regions of small instances and low values in the contextual regions.

Nevertheless, in FCOS, centerness is only defined within the GT boxes, which may lead to insufficient supervision given the low foreground ratio in drone imagery. Therefore, we utilize the detection results derived from fully trained teacher models as the basis for information weighting.

Specifically, we employ the predicted bounding boxes or the integral results of the predicted regression distributions from trained teacher models as B_i ∈ R^{n×4} to derive centerness. Additionally, we calculate the diagonal length of each GT box, and apply 0.75 times this length as the threshold to preliminarily filter the distillation regions. This approach allows our centerness target to be size-robust, effectively indicating whether a position lies within the contextual regions surrounding the small instances.

Since our primary objective is to capture valuable information surrounding instances, we prioritize the selection of anchors A_i characterized by lower centerness values, which signify greater deviation between predicted objects and these anchors, indicating their proximity to the contexts. Therefore, the centerness-based instance-aware region weighting can be derived as follows:

I_vlr = { 1 − centerness_i,  if centerness_i < γ
        { 0,                 if centerness_i ≥ γ      (8)

where the hyper-parameter γ ∈ [0, 1] controls the region weighting degree; the coverage of valuable regions gradually expands as γ increases.

Combining the centerness-based instance-aware regions I_vlr with the regions I_main of the positive samples generated from the label assignment process, the comprehensive focal distillation within the localization branch can be formulated as:

L_f^dis = (1/N) Σ_i (I_main + α I_vlr) L_reg(B_i^T, B_i^S)  (9)

where α denotes the loss weighting of the centerness-based instance-aware regions.

Compared to the VLR proposed in LD (Zheng et al., 2022), the centerness-based instance-aware region depends only on the position within the predicted box and is robust to small sizes, making it more general and avoiding the low DIoU problem. Furthermore, it places more emphasis on the regions around small instances, effectively focusing on local information.

As merely distilling knowledge within the positive sample regions and their immediate contexts can be insufficient in transferring enough supervisory information from the teacher model, especially concerning drone
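Under the definitions of Eqs. (7)-(8), the region weighting can be sketched as below. This is a minimal sketch under our own assumptions: point anchors, teacher boxes in (x1, y1, x2, y2) format, and the GT-diagonal pre-filtering step omitted; `centerness_weights` is our own name.

```python
import torch


def centerness_weights(anchors: torch.Tensor, boxes: torch.Tensor,
                       gamma: float = 0.45) -> torch.Tensor:
    """Sketch of the CID region weighting of Eqs. (7)-(8).

    anchors: (n, 2) anchor point locations (x, y).
    boxes:   (n, 4) teacher-predicted boxes, (x1, y1, x2, y2).
    Returns (n,) weights: 1 - centerness where centerness < gamma, else 0.
    """
    l = anchors[:, 0] - boxes[:, 0]
    t = anchors[:, 1] - boxes[:, 1]
    r = boxes[:, 2] - anchors[:, 0]
    b = boxes[:, 3] - anchors[:, 1]
    # Eq. (7): centerness is 1 at the box centre, approaching 0 at its border;
    # the clamp zeroes out anchors lying outside the predicted box.
    centerness = torch.sqrt(
        (torch.minimum(l, r) / torch.maximum(l, r)).clamp(min=0)
        * (torch.minimum(t, b) / torch.maximum(t, b)).clamp(min=0)
    )
    # Eq. (8): keep only low-centerness (contextual) positions
    return torch.where(centerness < gamma,
                       1.0 - centerness,
                       torch.zeros_like(centerness))
```

An anchor at the centre of its box gets centerness 1 and is excluded (it already belongs to the positive region), while anchors near the box border receive weights close to 1, which is exactly the contextual emphasis CID aims for.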
images with a significantly lower foreground ratio. Therefore, we additionally adopt the global distillation approach for features proposed by (Yang et al., 2022c) within the localization branch, which incorporates a Masked Image Modeling technique for global information reconstruction on the student model. The mask M_i is set to 1 at each pixel with a probability determined by the hyper-parameter λ, and the reconstruction module G is lightweight, consisting of two convolutional layers. The overall global distillation loss can be obtained from:

L_g^dis = (1/N) Σ_i L(B_i^T, G(B_i^S × M_i))  (10)

where L refers to the Smooth-L1 loss.

Combining the focal loss L_f^dis with the global loss L_g^dis, the overall distillation loss on the localization branch can be formulated as:

L_loc^dis = L_f^dis + β L_g^dis  (11)

4.2 Implementation Details

We implement the proposed method using PyTorch (Paszke et al., 2019) and MMDetection (Chen et al., 2019). All the student models on the VisDrone and COCO datasets are trained for 12 epochs with the SGD optimizer. The learning rate is initially set to 0.01 with a linear warm-up strategy and is decreased by a factor of 10 after epochs 8 and 11. On the UAVDT dataset, we train models for 6 epochs with an initial learning rate of 0.01, decreased by a factor of 10 after epochs 4 and 5. The input image sizes are set to 1333×800 for VisDrone and COCO, and 1024×540 for UAVDT. The hyperparameters k in Sec. 3.2 and γ in Eq. (8) are set as {k = 0.25, γ = 0.45}, and following previous work, the hyperparameters α in Eq. (9), λ in Sec. 3.3, and β in Eq. (11) are set as {α = 1, λ = 0.65, β = 4}. The training phase for VisDrone and UAVDT is conducted on 2 Nvidia RTX 2080Ti GPUs, and on 8 Nvidia RTX 2080Ti GPUs for the COCO dataset.
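The masked global distillation term of Eq. (10) can also be sketched in PyTorch. This is a hedged reimplementation in which the module name, the ReLU, and the 3×3 kernels are our assumptions; the returned scalar would then enter Eq. (11) as L_g^dis with weight β.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlobalDistill(nn.Module):
    """Sketch of the MGD-style global distillation of Eq. (10): randomly mask
    the student feature, reconstruct it with a light two-conv module G, and
    regress toward the teacher feature with Smooth-L1. `lambda_keep` plays the
    role of the paper's hyper-parameter λ (per-pixel keep probability)."""

    def __init__(self, channels: int, lambda_keep: float = 0.65):
        super().__init__()
        self.lambda_keep = lambda_keep
        self.g = nn.Sequential(  # reconstruction module G: two conv layers
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, feat_t: torch.Tensor, feat_s: torch.Tensor) -> torch.Tensor:
        n, _, h, w = feat_s.shape
        # M_i: 1 with probability lambda_keep at each spatial position
        mask = (torch.rand(n, 1, h, w, device=feat_s.device) < self.lambda_keep).float()
        return F.smooth_l1_loss(self.g(feat_s * mask), feat_t)
```

Because the student must reconstruct the teacher feature from a partially masked input, supervision is spread across the whole feature map rather than only the positive regions, which is what makes this term complementary to the focal loss L_f^dis.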
Table 1: Comparison of AP/AR (%) and GFLOPs on VisDrone between our proposed method and existing state-of-the-art methods, with GFL-ResNet101 as the teacher and GFL-ResNet50/GFL-ResNet18 as the students.
Table 2: Comparison of mAP/AP (%) and GFLOPs on UAVDT between our proposed method and existing state-of-the-art methods, with GFL-ResNet101 as the teacher and GFL-ResNet50/GFL-ResNet18 as the students.
As summarized in Table 3, our method achieves improvements of 2.0% and 1.3% with various student models compared to the feature-based method FGD. Additionally, it shows improvements ranging from 0.1% to 1.1% over basic logit-distillation methods, with only a minor increase in computational demand.

It is noteworthy that, as shown in Table 3, when employing ResNet18 as the student model, we observe significant relative improvements of 6.4% and 3.7% over the baseline in AP_S accuracy, greater than those in AP_M (0.7% and 1.4%) and AP_L (2.8% and 3.2%). This demonstrates the effectiveness of our newly proposed Light-ML module on weaker student models and the impact of CID on smaller instances, indicating that our contributions are beneficial for the student to learn from the teacher in locating small instances, which are dominant in drone images.

To further demonstrate the generalizability of our method, we validate it on other base detectors, i.e. FCOS (Tian et al., 2019) and ATSS (Zhang et al., 2020), as shown in Table 4, employing ResNet101 as the teacher model and ResNet18 as the student model on the VisDrone dataset. Similar to GFL, our method shows an improvement of 1.8%-3.2% in mAP over the feature-based method FGD, and 0.2%-0.6% over basic logit-based methods, demonstrating the flexibility of our distillation approach.
Table 3: Comparison of mAP/AP (%) and GFLOPs on COCO between our proposed method and existing state-of-the-art methods, with GFL-ResNet101 as the teacher and GFL-ResNet50/GFL-ResNet18 as the students.
Table 4: Comparison of AP/AR (%) and GFLOPs on VisDrone between our proposed method and existing state-of-the-art methods on different dense object detectors (FCOS and ATSS).
Table 6: Ablation study on the detailed designs of Light-ML with GFL ResNet101-ResNet18 on VisDrone.
4.4.2 On Light-ML

We evaluate the performance of our proposed Light-ML structure by comparing it with other lightweight feature lifting structures, including channel shuffle (Zhang et al., 2018) for all channels, and feature fusion from the classification branch to the regression branch as well as from the regression branch to the classification branch using an attention-based approach proposed in (Li et al., 2021), denoted as "Cls to Reg Fusion" and "Reg to Cls Fusion", respectively. As displayed in Table 6, the two attention-based fusion methods show a performance decline, primarily attributed to the incomplete feature lifting from different tasks. The application of channel shuffling results in a 0.4% performance improvement, whereas our approach achieves a superior performance-efficiency trade-off by introducing the CSP structure and additional convolution, leading to a 0.6% improvement in accuracy with only 2% additional GFLOPs.

As described in Sec. 3.2, k signifies the division ratio of the convolution and channel shuffle operators. The computational cost of the Light-ML module grows as k increases. In Table 7, we report the detection accuracy and efficiency for various values of k. The results indicate that the student model achieves the best trade-off between accuracy and efficiency when k = 0.25, which is therefore used in our experiments.

4.4.3 On CID

We separately evaluate the effects of Centerness-based Instance-aware Distillation and Global Distillation in Table 8. When employing the CID method, student models achieve a better distillation effect around small instances, resulting in a 0.7% improvement in accuracy. Moreover, utilizing only global distillation results in inferior performance, as it cannot provide sufficient clues around instances. However, when we combine global distillation with CID, it further enhances the utilization of potential distillation information in background regions, leading to an additional 0.1% improvement in accuracy.

As described in Eq. (8), γ influences the extent of region weighting in the CID method. In Table 9, we report the detection accuracy for various values of γ. The results indicate that the student model achieves the highest performance when γ = 0.45, which is therefore used in our experiments.
Table 8: Ablation study of CID and Global distillation with GFL ResNet101-ResNet18 on VisDrone.
Fig. 3: Visualization of the frequency distribution of the mean (a) IoU, (b) GIoU and (c) DIoU value for each ground truth box in the VisDrone dataset, where small instances refer to instances with an area less than 32 × 32.
As illustrated in Fig. 3, we compile statistics on the IoU, GIoU and DIoU distributions corresponding to each ground truth box. It is obvious that the IoU, GIoU and DIoU values for the ground truth of small instances are relatively lower than those of medium and large instances. This demonstrates that small instances are constrained by their smaller sizes, leading to a lower distillation weight in LD, which in turn results in LD being unable to distill adequate supervisory information.

To demonstrate that LD introduces more background noise for small instances, we calculate the frequency distribution of the area ratio between the positive sample regions of LD and the ground truth boxes. As shown in Fig. 4, LD sets a larger positive sample region for smaller instances, resulting in a farther VLR, which in turn introduces more background noise.

We also visualize the VLR proposed by LD with different γ_LD values and the VLR proposed by our CID. As shown in Fig. 5, LD generates a farther VLR for small instances with a lower distillation information weight I_feat. Moreover, although a lower γ_LD assigns a higher I_feat to regions far from the small instances, compared to our CID this I_feat is still insufficient to distill enough knowledge. In contrast, CID generates a more precise VLR with a higher distillation information weight, demonstrating the effectiveness of our method.

To address the issue of the low distillation information weight, a common practice is to assign a larger weight to the VLR in the loss function. Therefore, we double the distillation weight in the VLR proposed by LD with different γ_LD values; as shown in Fig. 5, the doubled distillation information weights I_feat show some advantage over the weight used in CID. Concretely, we double the distillation weight hyper-parameter λ_VLR of LD on the VLR from the default 0.25 to 0.5, and conduct experiments on LD with different γ_LD values. However, as shown in Fig. 6, LD still struggles to distill more knowledge to the student model and remains inferior to CID due to the background noise.

Furthermore, when we block the information in the VLR during distillation, the student model still achieves an mAP of 26.1, which is close to the mAP of LD. This demonstrates that the VLR proposed by LD is far from the small instances.
Fig. 4: Visualization of the frequency distribution of the area ratio between the positive sample regions of LD and
the ground truth boxes in the VisDrone dataset, where: (a) instances with an area less than 32 × 32, (b) instances
with an area not less than 32 × 32.
Fig. 5: Visualization of VLR regions using CID and LD with different γLD values. Highlighted areas indicate
activated regions for distillation.
It introduces background noise rather than valuable localization knowledge. In contrast, our CID method places the VLR around small instances, making it more suitable for drone imagery detection.

It is worth noting that when we adopt a centerness calculation similar to that used in FCOS to propose the VLR and set the distillation weight to 1 − centerness_i, the student model achieves an improvement of 0.9%. This result further confirms that LD suffers from distilling background noise, making it unsuitable for drone imagery detection. However, as illustrated in Fig. 6, the traditional centerness method still falls short in distillation due to the low ratio of foreground regions in drone imagery. Consequently, we extend the concept of centerness to be computed across the entire feature maps, leading to an additional improvement of 0.3%. This demonstrates the effectiveness of our method.

4.4.4 Visualization of Knowledge Distillation

To demonstrate the effectiveness of our distillation method on the localization branch, we visualize the performance distance between the teacher and student models at the P3 level of the FPN for small instances on the VisDrone and UAVDT datasets. As shown in Fig. 7,
Fig. 6: Ablation study on the different VLR regions and weights with GFL ResNet18 on VisDrone. "LD (λ_VLR = 0.00/0.25/0.50)" denotes LD with different distillation weights on the VLR; "CID within GT" indicates the use of the FCOS method for calculating centerness.
Fig. 7: Visualization of cosine distance of the localization branch between the teacher (GFL-ResNet101) and the
student (GFL-ResNet18) at the P3 level of the FPN. Lower cosine distance indicates better performance.
our approach significantly reduces the performance gap between the teacher and student models in the positive sample regions and the VLR regions surrounding the instances that are important for localization, thereby confirming the effectiveness of our method.

5 Conclusion

We propose a novel knowledge distillation method for drone image object detection. It first introduces a Light-ML structure designed to improve the utilization of supervision information during distillation through task-wise feature lifting, consequently boosting the distillation efficiency. Meanwhile, it proposes a localization branch distillation strategy integrating Centerness-based Instance-aware Distillation to refine the distillation around small instances. Extensive experimental results on the VisDrone, COCO, and UAVDT datasets demonstrate the improvement in accuracy with competitive computation of our proposed approach.

Data Availability Statements

The VisDrone (Zhu et al., 2018), UAVDT (Du et al., 2018) and COCO (Lin et al., 2014) databases used in this manuscript are deposited in publicly available repositories, respectively: https://github.com/VisDrone/VisDrone-Dataset, https://sites.google.com/view/grli-uavdt and https://cocodataset.org.