
Decoupled Knowledge Distillation

Borui Zhao(1)  Quan Cui(2)  Renjie Song(1)  Yiyu Qiu(1,3)  Jiajun Liang(1)
(1) MEGVII Technology  (2) Waseda University  (3) Tsinghua University
zhaoborui.gm@gmail.com, cui-quan@toki.waseda.jp,
chouyy18@mails.tsinghua.edu.cn, {songrenjie, liangjiajun}@megvii.com
arXiv:2203.08679v2 [cs.CV] 12 Jul 2022

Abstract

State-of-the-art distillation methods are mainly based on distilling deep features from intermediate layers, while the significance of logit distillation is greatly overlooked. To provide a novel viewpoint to study logit distillation, we reformulate the classical KD loss into two parts, i.e., target class knowledge distillation (TCKD) and non-target class knowledge distillation (NCKD). We empirically investigate and prove the effects of the two parts: TCKD transfers knowledge concerning the "difficulty" of training samples, while NCKD is the prominent reason why logit distillation works. More importantly, we reveal that the classical KD loss is a coupled formulation, which (1) suppresses the effectiveness of NCKD and (2) limits the flexibility to balance these two parts. To address these issues, we present Decoupled Knowledge Distillation (DKD), enabling TCKD and NCKD to play their roles more efficiently and flexibly. Compared with complex feature-based methods, our DKD achieves comparable or even better results and has better training efficiency on the CIFAR-100, ImageNet, and MS-COCO datasets for image classification and object detection tasks. This paper proves the great potential of logit distillation, and we hope it will be helpful for future research. The code is available at https://github.com/megvii-research/mdistiller.

[Figure 1: (a) Classical Knowledge Distillation (KD): teacher and student logits are compared with a single KL loss over all classes. (b) Decoupled Knowledge Distillation (DKD): the prediction is split into a target-class part (TCKD) and a non-target-class part (NCKD), with Classical KD = TCKD + (1 - p_t^T) * NCKD and DKD loss = alpha * TCKD + beta * NCKD.]
Figure 1. Illustration of the classical KD [12] and our DKD. We reformulate KD into a weighted sum of two parts, i.e., TCKD and NCKD. The first equation shows that KD (1) couples NCKD with p_t^T (the teacher's confidence on the target class), and (2) couples the importance of the two parts. Furthermore, we demonstrate that the first coupling suppresses the effectiveness, and the second limits the flexibility of knowledge transfer. We propose DKD to address these issues, which employs hyper-parameters α for TCKD and β for NCKD, killing two birds with one stone.
1. Introduction
In the last decades, the computer vision field has been revolutionized by deep neural networks (DNN), which successfully boost various real-scenario tasks, e.g., image classification [9, 13, 21], object detection [8, 27], and semantic segmentation [31, 45]. However, powerful networks normally benefit from large model capacities, introducing high computational and storage costs. Such costs are not preferable in industrial applications, where lightweight models are widely deployed. In the literature, a potential direction for cutting down these costs is knowledge distillation (KD). KD represents a series of methods concentrating on transferring knowledge from a heavy model (teacher) to a light one (student), which can improve the light model's performance without introducing extra costs.

The concept of KD was firstly proposed in [12] to transfer knowledge by minimizing the KL-Divergence between the prediction logits of teachers and students (Figure 1a). Since [28], most of the research attention has been drawn to distilling knowledge from deep features of intermediate layers. Compared with logits-based methods, the performance of feature distillation is superior on various tasks, so research on logit distillation has been barely touched. However, the training costs of feature-based methods are unsatisfactory, because extra computational and storage usage is introduced (e.g., network modules and complex operations) for distilling deep features during training time.

Logit distillation requires marginal computational and storage costs, but the performance is inferior. Intuitively, logit distillation should achieve comparable performance to feature distillation, since logits are at a higher semantic level than deep features. We suppose that the potential of logit distillation is limited by unknown reasons, causing the unsatisfactory performance. To revitalize logits-based methods, we start this work by delving into the mechanism of KD. Firstly, we divide a classification prediction into two levels: (1) a binary prediction for the target class versus all the non-target classes and (2) a multi-category prediction for each non-target class. Based on this, we reformulate the classical KD loss [12] into two parts, as shown in Figure 1b. One is a binary logit distillation for the target class and the other is a multi-category logit distillation for non-target classes. For simplification, we respectively name them target classification knowledge distillation (TCKD) and non-target classification knowledge distillation (NCKD). The reformulation allows us to study the effects of the two parts independently.

TCKD transfers knowledge via binary logit distillation, which means only the prediction of the target class is provided while the specific prediction of each non-target class is unknown. A reasonable hypothesis is that TCKD transfers knowledge about the "difficulty" of training samples, i.e., the knowledge describes how difficult it is to recognize each training sample. To validate this, we design experiments from three aspects to increase the "difficulty" of training data, i.e., stronger augmentation, noisier labels, and inherently challenging datasets.

NCKD only considers the knowledge among non-target logits. Interestingly, we empirically prove that only applying NCKD achieves comparable or even better results than the classical KD, indicating the vital importance of the knowledge contained in non-target logits, which could be the prominent "dark knowledge".

More importantly, our reformulation demonstrates that the classical KD loss is a highly coupled formulation (as shown in Figure 1b), which could be the reason why the potential of logit distillation is limited. Firstly, the NCKD loss term is weighted by a coefficient that negatively correlates with the teacher's prediction confidence on the target class; thus larger prediction scores lead to smaller weights. This coupling significantly suppresses the effects of NCKD on well-predicted training samples. Such suppression is not preferable, since the more confident the teacher is in a training sample, the more reliable and valuable knowledge it could provide. Secondly, the significances of TCKD and NCKD are coupled, i.e., weighting TCKD and NCKD separately is not allowed. Such a limitation is not preferable, since TCKD and NCKD should be separately considered, as their contributions come from different aspects.

To address these issues, we propose a flexible and efficient logit distillation method named Decoupled Knowledge Distillation (DKD, Figure 1b). DKD decouples the NCKD loss from the coefficient negatively correlated with the teacher's confidence by replacing it with a constant value, improving the distillation effectiveness on well-predicted samples. Meanwhile, NCKD and TCKD are also decoupled so that their importance can be separately considered by adjusting the weight of each part.

Overall, our contributions are summarized as follows:

• We provide an insightful view to study logit distillation by dividing the classical KD into TCKD and NCKD. Additionally, the effects of both parts are respectively analyzed and proved.

• We reveal the limitations of the classical KD loss caused by its highly coupled formulation. Coupling NCKD with the teacher's confidence suppresses the effectiveness of knowledge transfer. Coupling TCKD with NCKD limits the flexibility to balance the two parts.

• We propose an effective logit distillation method named DKD to overcome these limitations. DKD achieves state-of-the-art performances on various tasks. We also empirically validate the higher training efficiency and better feature transferability of DKD compared with feature-based distillation methods.

2. Related work

The concept of knowledge distillation (KD) was firstly proposed by Hinton et al. in [12]. KD defines a learning manner where a bigger teacher network is employed to guide the training of a smaller student network for many tasks [12, 17, 18]. The "dark knowledge" is transferred to students via soft labels from teachers. To raise the attention on negative logits, the hyper-parameter temperature was introduced. The following works can be divided into two types, distillation from logits [3, 6, 22, 40, 44] and from intermediate features [10, 11, 14, 15, 23, 25, 28, 33, 34, 41, 43].

Previous works on logit distillation mainly focus on proposing effective regularization and optimization methods rather than novel distillation mechanisms. DML [44] proposes a mutual learning manner to train students and teachers simultaneously. TAKD [22] introduces an intermediate-sized network named "teacher assistant" to bridge the gap between teachers and students. Besides, several works also focus on interpreting the classical KD method [2, 26].

State-of-the-art methods are mainly based on intermediate features, which can directly transfer representations from the teacher to the student [10, 11, 28] or transfer the correlation between samples captured in the teacher to the student [23, 33, 34]. Most of the feature-based methods achieve preferable performances (significantly higher than logits-based methods), yet involve considerably high computational and storage costs.

This paper focuses on analyzing what limits the potential of logits-based methods and on revitalizing logit distillation.
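For reference, the classical logit-distillation objective discussed above, i.e., soft labels compared under a temperature, can be sketched in a few lines of PyTorch. This is a minimal sketch under stated assumptions (the function name and the temperature value are illustrative), not the paper's implementation:

import torch
import torch.nn.functional as F

def classical_kd_loss(student_logits, teacher_logits, T=4.0):
    # Hinton-style KD: KL between temperature-softened teacher and student
    # distributions, scaled by T**2 so the gradient magnitude stays comparable.
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    p_t = F.softmax(teacher_logits / T, dim=1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * T ** 2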

3. Rethinking Knowledge Distillation

In this section, we delve into the mechanism of knowledge distillation. We reformulate the KD loss into a weighted sum of two parts, one relevant to the target class and the other not. We explore the effect of each part in the knowledge distillation framework and reveal some limitations of the classical KD. Inspired by the findings, we further propose a novel logit distillation method, achieving remarkable performance on various tasks.

3.1. Reformulating KD

Notations. For a training sample from the t-th class, the classification probabilities can be denoted as p = [p_1, p_2, ..., p_t, ..., p_C] ∈ R^{1×C}, where p_i is the probability of the i-th class and C is the number of classes. Each element in p can be obtained by the softmax function:

p_{i} = \frac{\exp(z_{i})}{\sum_{j=1}^{C} \exp(z_{j})},    (1)

where z_i represents the logit of the i-th class.

To separate the predictions relevant and irrelevant to the target class, we define the following notations. b = [p_t, p_\t] ∈ R^{1×2} represents the binary probabilities of the target class (p_t) and of all the other non-target classes (p_\t), which can be calculated by:

p_{t} = \frac{\exp(z_{t})}{\sum_{j=1}^{C} \exp(z_{j})}, \qquad p_{\backslash t} = \frac{\sum_{k=1,k\neq t}^{C} \exp(z_{k})}{\sum_{j=1}^{C} \exp(z_{j})}.

Meanwhile, we declare p̂ = [p̂_1, ..., p̂_{t-1}, p̂_{t+1}, ..., p̂_C] ∈ R^{1×(C-1)} to independently model probabilities among non-target classes (i.e., without considering the t-th class). Each element is calculated by:

\hat{p}_{i} = \frac{\exp(z_{i})}{\sum_{j=1,j\neq t}^{C} \exp(z_{j})}.    (2)
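These notations translate directly into a few tensor operations. The sketch below is an illustrative PyTorch helper (the function and variable names are ours, not the paper's) that computes p, b = [p_t, p_\t], and p̂ from a batch of logits and integer labels, following Eqns. (1)-(2):

import torch
import torch.nn.functional as F

def split_probabilities(logits, target):
    # logits: (N, C); target: (N,) integer class labels.
    # Returns p (N, C), b = [p_t, p_\t] (N, 2), and p_hat (N, C-1).
    N, C = logits.shape
    p = F.softmax(logits, dim=1)                            # Eqn. (1)
    mask = F.one_hot(target, C).bool()                      # True at the target class
    p_t = p[mask]                                           # (N,) target probabilities
    b = torch.stack([p_t, 1.0 - p_t], dim=1)                # binary probabilities
    p_hat = F.softmax(logits[~mask].view(N, C - 1), dim=1)  # Eqn. (2), target excluded
    return p, b, p_hat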
Reformulation. In this part, we attempt to reformulate KD with the binary probabilities b and the probabilities among non-target classes p̂ (more mathematical formulations are in the supplement). T and S denote the teacher and the student, respectively. The classical KD uses the KL-Divergence as the loss function, which can be written as (we omit the temperature T of [12] without loss of generality):

\begin{split} \text{KD} &= \text{KL}(\mathbf{p}^{\mathcal{T}} || \mathbf{p}^{\mathcal{S}}) \\ &= p^{\mathcal{T}}_{t}\log(\frac{p^{\mathcal{T}}_{t}}{p^{\mathcal{S}}_{t}}) + \sum_{i=1,i\neq t}^{C} p^{\mathcal{T}}_{i}\log(\frac{p^{\mathcal{T}}_{i}}{p^{\mathcal{S}}_{i}}). \end{split}    (3)

According to Eqn. (1) and Eqn. (2) we have p̂_i = p_i / p_\t, so we can rewrite Eqn. (3) as:

\begin{split} \text{KD} &= p^{\mathcal{T}}_{t}\log(\frac{p^{\mathcal{T}}_{t}}{p^{\mathcal{S}}_{t}}) + p^{\mathcal{T}}_{\backslash t}\sum_{i=1,i\neq t}^{C} \hat{p}^{\mathcal{T}}_{i}\Big(\log(\frac{\hat{p}^{\mathcal{T}}_{i}}{\hat{p}^{\mathcal{S}}_{i}}) + \log(\frac{p^{\mathcal{T}}_{\backslash t}}{p^{\mathcal{S}}_{\backslash t}})\Big) \\ &= \underbrace{p^{\mathcal{T}}_{t}\log(\frac{p^{\mathcal{T}}_{t}}{p^{\mathcal{S}}_{t}}) + p^{\mathcal{T}}_{\backslash t}\log(\frac{p^{\mathcal{T}}_{\backslash t}}{p^{\mathcal{S}}_{\backslash t}})}_{\text{KL}(\mathbf{b}^{\mathcal{T}} || \mathbf{b}^{\mathcal{S}})} + p^{\mathcal{T}}_{\backslash t}\underbrace{\sum_{i=1,i\neq t}^{C} \hat{p}^{\mathcal{T}}_{i}\log(\frac{\hat{p}^{\mathcal{T}}_{i}}{\hat{p}^{\mathcal{S}}_{i}})}_{\text{KL}(\hat{\mathbf{p}}^{\mathcal{T}} || \hat{\mathbf{p}}^{\mathcal{S}})}. \end{split}    (4)

Then, Eqn. (4) can be rewritten as:

\text{KD} = \text{KL}(\mathbf{b}^{\mathcal{T}} || \mathbf{b}^{\mathcal{S}}) + (1 - p_{t}^{\mathcal{T}})\,\text{KL}(\hat{\mathbf{p}}^{\mathcal{T}} || \hat{\mathbf{p}}^{\mathcal{S}}).    (5)

As reflected by Eqn. (5), the KD loss is reformulated into a weighted sum of two terms. KL(b^T || b^S) represents the similarity between the teacher's and student's binary probabilities of the target class; thus, we name it Target Class Knowledge Distillation (TCKD). Meanwhile, KL(p̂^T || p̂^S) represents the similarity between the teacher's and student's probabilities among non-target classes, named Non-Target Class Knowledge Distillation (NCKD). Eqn. (5) can be rewritten as:

\text{KD} = \text{TCKD} + (1 - p_{t}^{\mathcal{T}})\,\text{NCKD}.    (6)

Obviously, the weight of NCKD is coupled with p_t^T. The reformulation above inspires us to investigate the individual effects of TCKD and NCKD, which will reveal the limitations of the classical coupled formulation.
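The decomposition can be checked numerically. The following sketch uses hypothetical logit values and plain PyTorch to confirm that the classical KD loss of Eqn. (3) equals TCKD + (1 - p_t^T) * NCKD, as in Eqns. (5)-(6):

import torch
import torch.nn.functional as F

def kl(p, q):  # KL(p || q) for 1-D probability vectors
    return (p * (p / q).log()).sum()

z_t = torch.tensor([3.0, 1.0, 0.5, -1.0])   # teacher logits (hypothetical)
z_s = torch.tensor([2.0, 1.5, 0.0, -0.5])   # student logits (hypothetical)
t = 0                                        # target class index

p_tea, p_stu = F.softmax(z_t, dim=0), F.softmax(z_s, dim=0)
kd = kl(p_tea, p_stu)                        # classical KD, Eqn. (3)

b_tea = torch.stack([p_tea[t], 1 - p_tea[t]])
b_stu = torch.stack([p_stu[t], 1 - p_stu[t]])
nt = [i for i in range(4) if i != t]         # non-target indices
ph_tea, ph_stu = F.softmax(z_t[nt], dim=0), F.softmax(z_s[nt], dim=0)

tckd, nckd = kl(b_tea, b_stu), kl(ph_tea, ph_stu)
assert torch.allclose(kd, tckd + (1 - p_tea[t]) * nckd)   # Eqn. (5)/(6)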
3.2. Effects of TCKD and NCKD

Performance gain of each part. We individually study the effects of TCKD and NCKD on CIFAR-100 [16]. ResNet [9], WideResNet (WRN) [42] and ShuffleNet [21] are selected as training models, among which both the same and different architectures are considered. The experimental results are reported in Table 1. For each teacher-student pair, we report the results of (1) the student baseline (vanilla training), (2) the classical KD (where TCKD and NCKD are both used), (3) singly TCKD, and (4) singly NCKD. The weight of each loss is set as 1.0 (including the default cross-entropy loss). Other implementation details are the same as those in Sec 4.

student          TCKD   NCKD   top-1   Δ
ResNet32×4 as the teacher
ResNet8×4                      72.50   -
ResNet8×4         ✓      ✓     73.63   +1.13
ResNet8×4         ✓            68.63   -3.87
ResNet8×4                ✓     74.26   +1.76
ShuffleNet-V1                  70.50   -
ShuffleNet-V1     ✓      ✓     74.29   +3.79
ShuffleNet-V1     ✓            70.52   +0.02
ShuffleNet-V1            ✓     74.91   +4.41
WRN-40-2 as the teacher
WRN-16-2                       73.26   -
WRN-16-2          ✓      ✓     74.96   +1.70
WRN-16-2          ✓            70.96   -2.30
WRN-16-2                 ✓     74.76   +1.50
ShuffleNet-V1                  70.50   -
ShuffleNet-V1     ✓      ✓     74.92   +4.42
ShuffleNet-V1     ✓            70.62   +0.12
ShuffleNet-V1            ✓     75.12   +4.62
Table 1. Accuracy (%) on the CIFAR-100 validation set. Δ represents the performance improvement over the baseline.

Intuitively, TCKD concentrates on the knowledge related to the target class, since the corresponding loss function considers only binary probabilities. Conversely, NCKD focuses on the knowledge among non-target classes. We notice that singly applying TCKD could be unhelpful (e.g., 0.02% and 0.12% gain on ShuffleNet-V1) or even harmful (e.g., 2.30% drop on WRN-16-2 and 3.87% drop on ResNet8×4) for the student. However, the distillation performances of NCKD are comparable to and even better than the classical KD (e.g., 1.76% vs. 1.13% on ResNet8×4). The ablation results suggest that the target-class-related knowledge may not be as important as the knowledge among non-target classes. To dive into this phenomenon, we provide further analyses as follows.

TCKD transfers the knowledge concerning the "difficulty" of training samples. According to Eqn. (5), TCKD transfers "dark knowledge" via the binary classification task, which could be related to the sample "difficulty". For instance, a training sample with p_t^T = 0.99 could be "easier" for the student to learn compared with another one with p_t^T = 0.75. Since TCKD conveys the "difficulty" of training samples, we suppose its effectiveness would be revealed when the training data becomes challenging. However, the CIFAR-100 training set is easy to fit (training accuracies on CIFAR-100 could be 100% after convergence), so the knowledge of "difficulty" provided by the teacher is not informative. In this part, experiments from three perspectives are performed to validate: the more difficult the training data is, the more benefits TCKD can provide. (All experiments from these perspectives are performed with NCKD, since we suppose that TCKD should not be singly employed according to the results in Table 1; the probable reasons and analyses are attached in the supplement.)

(1) Applying Strong Augmentation is a straightforward way to increase the difficulty of training data. We train a ResNet32×4 model as the teacher with AutoAugment [5] on CIFAR-100, achieving 81.29% top-1 validation accuracy. As for students, we train ResNet8×4 and ShuffleNet-V1 models with/without TCKD. Results in Table 2 reveal that TCKD obtains significant performance gains if strong augmentations are applied.

student         TCKD   top-1   Δ
ResNet8×4              73.82   -
ResNet8×4        ✓     75.33   +1.51
ShuffleNet-V1          77.13   -
ShuffleNet-V1    ✓     77.98   +0.85
Table 2. Accuracy (%) on the CIFAR-100 validation. We set ResNet32×4 as the teacher and ResNet8×4 as the student. Both teachers and students are trained with AutoAugment [5].

(2) Noisy Labels can also increase the difficulty of training data. We train ResNet32×4 models as teachers and ResNet8×4 as students on CIFAR-100 with {0.1, 0.2, 0.3} symmetric noisy ratios, following [7, 35]. As reported in Table 3, the results indicate that TCKD achieves larger performance gains on noisier training data.

noisy ratio   TCKD   top-1   Δ
0.1                  70.99   -
0.1            ✓     70.96   -0.03
0.2                  67.55   -
0.2            ✓     68.03   +0.48
0.3                  64.62   -
0.3            ✓     65.26   +0.64
Table 3. Accuracy (%) on the CIFAR-100 validation with different noisy ratios on the training set. We set ResNet32×4 as the teacher and ResNet8×4 as the student.
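For concreteness, symmetric label noise with ratio r can be generated by re-assigning a fraction r of the training labels uniformly at random; the sketch below is an illustrative helper, not the exact script of [7, 35]:

import torch

def symmetric_noisy_labels(labels, noise_ratio, num_classes, generator=None):
    # Replace a noise_ratio fraction of labels with classes drawn uniformly at
    # random (in this simple variant the replacement may coincide with the original).
    labels = labels.clone()
    n = labels.numel()
    idx = torch.randperm(n, generator=generator)[: int(noise_ratio * n)]
    labels[idx] = torch.randint(0, num_classes, (idx.numel(),), generator=generator)
    return labels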
(3) Challenging Datasets (e.g., ImageNet [29]) are also considered. Table 4 shows that TCKD brings a +0.32% performance gain on ImageNet.

TCKD   top-1   Δ
       70.71   -
 ✓     71.03   +0.32
Table 4. Accuracy (%) on the ImageNet validation. We set ResNet-34 as the teacher and ResNet-18 as the student.

Conclusively, we demonstrate the effectiveness of TCKD by experimenting with various strategies to increase the difficulty of training data (e.g., strong augmentation, noisy labels, difficult tasks). The results validate that the knowledge concerning the "difficulty" of training samples is more beneficial when distilling knowledge on more challenging training data.

NCKD is the prominent reason why logit distillation works but is greatly suppressed. Interestingly, we notice in Table 1 that when only NCKD is applied, the performances are comparable to or even better than the classical KD. It shows that the knowledge among non-target classes is of vital importance to logit distillation, which can be the prominent "dark knowledge". However, by reviewing Eqn. (5), we notice that the NCKD loss is coupled with (1 - p_t^T), where p_t^T represents the teacher's confidence on the target class. Therefore, more confident predictions result in smaller NCKD weights. We suppose that the more confident the teacher is in a training sample, the more reliable and valuable knowledge it could provide. However, the loss weights are highly suppressed by such confident predictions. We suppose that this fact would limit the effectiveness of knowledge transfer, which is firstly investigated thanks to our reformulation of KD in Eqn. (5).

We design an ablation experiment to verify that well-predicted samples do transfer better knowledge than the others. Firstly, we rank the training samples according to p_t^T and evenly split them into two sub-sets. For clarity, one sub-set includes the samples with top-50% p_t^T while the remaining samples are in the other sub-set. Then we train student networks with NCKD on each sub-set to compare the performance gain (while the cross-entropy loss is still applied on the whole set). Table 5 shows that utilizing NCKD on the top-50% samples achieves better performance, suggesting that the knowledge of well-predicted samples is richer than that of the others. However, the loss weights of well-predicted samples are suppressed by the high confidence of the teacher.

0-50%   50-100%   top-1
  ✓        ✓      74.26
  ✓               74.23
           ✓      73.96
Table 5. Accuracy (%) on the CIFAR-100 validation set. We set ResNet32×4 as the teacher and ResNet8×4 as the student.

3.3. Decoupled Knowledge Distillation

So far, we have reformulated the classical KD loss into a weighted sum of two independent parts, validated the effectiveness of TCKD, and revealed the suppression of NCKD. Specifically, TCKD transfers knowledge concerning the "difficulty" of training samples, and more significant improvements are obtained by TCKD on more challenging training data. NCKD transfers knowledge among non-target classes, but it is suppressed when the weight (1 - p_t^T) is relatively small.

Intuitively, both TCKD and NCKD are indispensable and crucial. However, in the classical KD formulation, TCKD and NCKD are coupled in the following aspects:

• For one thing, NCKD is coupled with (1 - p_t^T), which could suppress NCKD on well-predicted samples. Since the results in Table 5 show that well-predicted samples bring more performance gain, the coupled form could limit the effectiveness of NCKD.

• For another, the weights of NCKD and TCKD are coupled under the classical KD framework. It is not allowed to change each term's weight to balance their importance. We suppose that TCKD and NCKD should be separately considered, since their contributions come from different aspects.

Benefiting from our reformulation of KD, we propose a novel logit distillation method named Decoupled Knowledge Distillation (DKD) to address the above issues. Our proposed DKD independently considers TCKD and NCKD in a decoupled formulation, as shown in Figure 1b. Specifically, we introduce two hyper-parameters α and β as the weights of TCKD and NCKD, respectively. The loss function of DKD can be written as follows:

\text{DKD} = \alpha\,\text{TCKD} + \beta\,\text{NCKD}.    (7)

In DKD, (1 - p_t^T), which would suppress NCKD's effectiveness, is replaced by β. What's more, α and β can be adjusted to balance the importance of TCKD and NCKD. Through decoupling NCKD and TCKD, DKD provides an efficient and flexible manner for logit distillation. Algorithm 1 provides the pseudo-code of DKD in a PyTorch-like [24] style.

Algorithm 1  Pseudo-code of DKD in a PyTorch-like style.

import torch
import torch.nn.functional as F

def dkd_loss(l_stu, l_tea, t, alpha, beta, T):
    # l_stu: student output logits, (N, C)
    # l_tea: teacher output logits, (N, C)
    # t: labels, (N, C), bool type (one-hot)
    # T: temperature for KD & DKD
    # alpha, beta: hyper-parameters for DKD
    N, C = l_stu.shape
    p_stu = F.softmax(l_stu / T, dim=1)
    p_tea = F.softmax(l_tea / T, dim=1)
    # b = [pt, pnt]: (N, 2), binary target / non-target probabilities of Sec 3.1
    b_stu = torch.stack([p_stu[t], p_stu.masked_fill(t, 0).sum(1)], dim=1)
    b_tea = torch.stack([p_tea[t], p_tea.masked_fill(t, 0).sum(1)], dim=1)
    # pnct: (N, C-1), probabilities among the non-target classes, Eqn.(2)
    pnct_stu = F.softmax(l_stu[~t].view(N, C - 1) / T, dim=1)
    pnct_tea = F.softmax(l_tea[~t].view(N, C - 1) / T, dim=1)
    # TCKD: binary KL divergence KL(b_tea || b_stu)
    tckd = F.kl_div(b_stu.log(), b_tea, reduction="batchmean")
    # NCKD: KL divergence among non-target classes KL(pnct_tea || pnct_stu)
    nckd = F.kl_div(pnct_stu.log(), pnct_tea, reduction="batchmean")
    # the classical KD of Eqn.(6) instead weights NCKD per sample by (1 - pt_tea)
    return (alpha * tckd + beta * nckd) * T ** 2
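For illustration, a training step adds the DKD term to the ordinary cross-entropy on hard labels. The sketch below assumes the dkd_loss function from Algorithm 1 and hypothetical student/teacher/optimizer objects; the hyper-parameter values shown (α = 1.0, T = 4, β = 8.0 for a ResNet32×4 teacher) follow the CIFAR-100 settings reported later and in the supplement:

import torch
import torch.nn.functional as F

def train_step(student, teacher, images, labels, optimizer,
               alpha=1.0, beta=8.0, T=4.0, num_classes=100):
    l_stu = student(images)
    with torch.no_grad():                  # the teacher is frozen
        l_tea = teacher(images)
    t = F.one_hot(labels, num_classes).bool()
    loss = F.cross_entropy(l_stu, labels) + dkd_loss(l_stu, l_tea, t, alpha, beta, T)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()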
4. Experiments

We mainly experiment on two representative tasks, i.e., image classification and object detection, including:

CIFAR-100 [16] is a well-known image classification dataset, containing 32×32 images of 100 categories. The training and validation sets are composed of 50k and 10k images, respectively.

ImageNet [29] is a large-scale classification dataset that consists of 1000 classes. The training set contains 1.28 million images and the validation set contains 50k images.

MS-COCO [20] is an 80-category general object detection dataset. The train2017 split contains 118k images, and the val2017 split contains 5k images.

All implementation details are attached in the supplement due to the page limit.

4.1. Main Results

Firstly, we demonstrate the improvements contributed by decoupling (1) NCKD and p_t^T and (2) NCKD and TCKD, respectively. Then, we benchmark our method on image classification and object detection tasks.

Ablation: α and β. The two tables below report the student accuracy (%) with different α and β. ResNet32×4 and ResNet8×4 are set as the teacher and the student, respectively. Firstly, we show that decoupling (1 - p_t^T) and NCKD brings a reasonable performance gain (73.63% vs. 74.79%) in the first table. Then, we demonstrate that decoupling the weights of NCKD and TCKD contributes to further improvements (74.79% vs. 76.32%). Moreover, the second table indicates that TCKD is indispensable, and the improvements from TCKD are stable for different α around 1.0. (We fix α as 1.0 for simplification in the first table, and β as 8.0 in the second table since it achieves the best performance in the first one.)

β      1 − p_t^T   1.0     2.0     4.0     8.0     10.0
top-1  73.63       74.79   75.44   75.94   76.32   76.18

α      0.0     0.2     0.5     1.0     2.0     4.0
top-1  75.30   75.64   76.12   76.32   76.11   75.42

CIFAR-100 image classification. We discuss experimental results on CIFAR-100 to examine our DKD. The validation accuracy is reported in Table 6 and Table 7. Table 6 contains the results where teachers and students have the same network architectures. Table 7 shows the results where teachers and students are from different series.

                    teacher   ResNet56  ResNet110  ResNet32×4  WRN-40-2  WRN-40-2  VGG13
                              72.34     74.31      79.42       75.61     75.61     74.64
                    student   ResNet20  ResNet32   ResNet8×4   WRN-16-2  WRN-40-1  VGG8
                              69.06     71.14      72.50       73.26     71.98     70.36
features  FitNet [28]         69.21     71.06      73.50       73.58     72.24     71.02
          RKD [23]            69.61     71.82      71.90       73.35     72.22     71.48
          CRD [33]            71.16     73.48      75.51       75.48     74.14     73.94
          OFD [10]            70.98     73.23      74.95       75.24     74.33     73.95
          ReviewKD [1]        71.89     73.89      75.63       76.12     75.09     74.84
logits    KD [12]             70.66     73.08      73.33       74.92     73.54     72.98
          DKD                 71.97     74.11      76.32       76.24     74.81     74.68
          Δ                   +1.31     +1.03      +2.99       +1.32     +1.27     +1.70
Table 6. Results on the CIFAR-100 validation. Teachers and students are of the same architectures. Δ represents the performance improvement over the classical KD. All results are the average over 5 trials.

Notably, DKD achieves consistent improvements on all teacher-student pairs, compared with the baseline and the classical KD. Our method achieves 1~2% and 2~3% improvements on teacher-student pairs of the same and different series, respectively. This strongly supports the effectiveness of DKD. Furthermore, DKD achieves comparable or even better performances than feature-based distillation methods, significantly improving the trade-off between distillation performance and training efficiency, which will be further discussed in Sec 4.2.

ImageNet image classification. Top-1 and top-5 accuracies of image classification on ImageNet are reported in Table 8 and Table 9. Our DKD achieves a significant improvement. It is worth mentioning that the performance of DKD is better than most state-of-the-art results of feature distillation methods.

MS-COCO object detection. As discussed in previous works, the performance of the object detection task greatly depends on the quality of deep features to locate objects of interest. This rule also stands in transferring knowledge between detectors [17, 37], i.e., feature mimicking is of vital importance since logits are not capable of providing knowledge for object localization. As shown in Table 10, singly applying DKD can hardly achieve outstanding performances, but it expectedly surpasses the classical KD. Thus, we introduce the feature-based distillation method ReviewKD [1] to obtain satisfactory results. It can be observed that our DKD can further boost the AP metrics, even though the distillation performance of ReviewKD is already relatively high. Conclusively, new state-of-the-art results are obtained by combining our DKD with feature-based distillation methods on the object detection task.

4.2. Extensions

For a better understanding of DKD, we provide extensions from four perspectives. First of all, we comprehensively compare the training efficiency of DKD with representative state-of-the-art methods. Then, we provide a new perspective to explain why bigger models are not always better teachers and alleviate this problem by utilizing DKD. Moreover, following [33], we examine the transferability of deep features learned by DKD. And we also present some visualizations to validate the superiority of DKD.
                    teacher   ResNet32×4     WRN-40-2       VGG13          ResNet50       ResNet32×4
                              79.42          75.61          74.64          79.34          79.42
                    student   ShuffleNet-V1  ShuffleNet-V1  MobileNet-V2   MobileNet-V2   ShuffleNet-V2
                              70.50          70.50          64.60          64.60          71.82
features  FitNet [28]         73.59          73.73          64.14          63.16          73.54
          RKD [23]            72.28          72.21          64.52          64.43          73.21
          CRD [33]            75.11          76.05          69.73          69.11          75.65
          OFD [10]            75.98          75.85          69.48          69.04          76.82
          ReviewKD [1]        77.45          77.14          70.37          69.89          77.78
logits    KD [12]             74.07          74.83          67.37          67.35          74.45
          DKD                 76.45          76.70          69.71          70.35          77.07
          Δ                   +2.38          +1.87          +2.34          +3.00          +2.62
Table 7. Results on the CIFAR-100 validation. Teachers and students are of different architectures. Δ represents the performance improvement over the classical KD. All results are the average over 5 trials.

distillation manner                      features                                         logits
         teacher  student    AT [43]  OFD [10]  CRD [33]  ReviewKD [1]    KD [12]  KD*    DKD
top-1    73.31    69.75      70.69    70.81     71.17     71.61           70.66    71.03  71.70
top-5    91.42    89.07      90.01    89.98     90.13     90.51           89.88    90.05  90.41
Table 8. Top-1 and top-5 accuracy (%) on the ImageNet validation. We set ResNet-34 as the teacher and ResNet-18 as the student. KD* represents the result of our implementation. All results are the average over 3 trials.

distillation manner                      features                                         logits
         teacher  student    AT [43]  OFD [10]  CRD [33]  ReviewKD [1]    KD [12]  KD*    DKD
top-1    76.16    68.87      69.56    71.25     71.37     72.56           68.58    70.50  72.05
top-5    92.86    88.76      89.33    90.34     90.41     91.00           88.98    89.80  91.05
Table 9. Top-1 and top-5 accuracy (%) on the ImageNet validation. We set ResNet-50 as the teacher and MobileNet-V1 as the student. KD* represents the result of our implementation. All results are the average over 3 trials.

Training efficiency. We assess the training costs of state-of-the-art distillation methods, proving the high training efficiency of DKD. As shown in Figure 2, our DKD achieves the best trade-off between model performance and training costs (e.g., training time and extra parameters). Since DKD is reformulated from the classical KD, it needs almost the same computational complexity as KD, and of course no extra parameters. However, feature-based distillation methods require extra training time for distilling intermediate layer features, as well as extra GPU memory costs.

[Figure 2: scatter plot of top-1 accuracy (%) versus training time per batch (ms) for KD, FitNet, RKD, OFD, CRD, ReviewKD, and our DKD.]
Figure 2. Training time (per batch) vs. accuracy on CIFAR-100. We set ResNet32×4 as the teacher and ResNet8×4 as the student. The table shows the number of extra parameters for each method.

Improving performances of big teachers. We provide a new potential explanation of the "bigger models are not always better teachers" issue. Specifically, bigger teachers are expected to transfer more beneficial knowledge but cannot, and may even yield worse distillation performances than smaller ones. Previous works [3, 36] explained this phenomenon with the large capacity gap between big teachers and small students. However, we suppose that the main reason is the suppression of NCKD, i.e., (1 - p_t^T) becomes smaller as the teacher gets bigger. What's more, related works on this problem could also be explained from this perspective, e.g., ESKD [3] employs early-stopped teacher models to alleviate this problem, and these teachers would be under-converged and yield smaller p_t^T.

To validate our conjecture, we apply our DKD to a series of teacher models. Experimental results in Table 11 and Table 12 consistently indicate that our DKD alleviates the "bigger models are not always better teachers" problem.

Feature transferability. We perform experiments to evaluate the transferability of deep features, to verify that our DKD transfers more generalizable knowledge. Following [33], we use the WRN-16-2 distilled from WRN-40-2 as the feature extractor. Then linear probing tasks are performed on downstream datasets, i.e., STL-10 [4] and Tiny-ImageNet (https://www.kaggle.com/c/tiny-imagenet). Results are reported in Table 13, proving the outstanding transferability of features learned with our DKD. Implementation details are in the supplement.
                 R-101 & R-18           R-101 & R-50           R-50 & MV2
                 AP     AP50   AP75     AP     AP50   AP75     AP     AP50   AP75
teacher          42.04  62.48  45.88    42.04  62.48  45.88    40.22  61.02  43.81
student          33.26  53.61  35.26    37.93  58.84  41.05    29.47  48.87  30.90
KD [12]          33.97  54.66  36.62    38.35  59.41  41.71    30.13  50.28  31.35
FitNet [28]      34.13  54.16  36.71    38.76  59.62  41.80    30.20  49.80  31.69
FGFI [38]        35.44  55.51  38.17    39.44  60.27  43.04    31.16  50.68  32.92
ReviewKD [1]     36.75  56.72  34.00    40.36  60.97  44.08    33.71  53.15  36.13
DKD              35.05  56.60  37.54    39.25  60.90  42.73    32.34  53.77  34.01
DKD+ReviewKD     37.01  57.53  39.85    40.65  61.51  44.44    34.35  54.89  36.61
Table 10. Results on MS-COCO based on Faster R-CNN [27]-FPN [19]: AP evaluated on val2017. Teacher-student pairs are ResNet-101 (R-101) & ResNet-18 (R-18), ResNet-101 & ResNet-50 (R-50), and ResNet-50 & MobileNet-V2 (MV2), respectively. All results are the average over 3 trials. More details are attached in the supplement.

teacher   W-28-2   W-40-2   W-16-4   W-28-4
          75.45    75.61    77.51    78.60
KD        75.37    74.92    75.79    75.04
DKD       75.92    76.24    76.00    76.45
Table 11. Results on CIFAR-100. We set WRN-16-2 as the student and WRN series networks as teachers.

teacher   VGG13   WRN-16-4   ResNet50
          74.64   77.51      79.34
KD        74.93   75.79      75.36
DKD       75.45   76.00      76.60
Table 12. Results on CIFAR-100. We set WRN-16-2 as the student and networks from different series as teachers.

          baseline   KD     FitNet   CRD    ReviewKD   DKD
STL-10    69.7       70.9   70.3     71.6   72.4       72.9
TI        33.7       33.9   33.5     35.6   36.6       37.1
Table 13. Comparison with previous methods on transferring features from CIFAR-100 to STL-10 and Tiny-ImageNet (TI).

[Figure 3: t-SNE visualizations of student features.]
Figure 3. t-SNE of features learned by KD (left) and DKD (right).

[Figure 4: heatmaps of the difference between the correlation matrices of student and teacher logits.]
Figure 4. Difference of the correlation matrices of student and teacher logits. Obviously, DKD (right) leads to a smaller difference (i.e., more similar predictions) than KD (left).

Visualizations. We present visualizations from two perspectives (with ResNet32×4 as the teacher and ResNet8×4 as the student on CIFAR-100). (1) The t-SNE results (Fig. 3) show that the representations of DKD are more separable than those of KD, proving that DKD benefits the discriminability of deep features. (2) We also visualize the difference of the correlation matrices of student and teacher logits (Fig. 4). Compared with KD, DKD helps the student output logits that are more similar to the teacher's, i.e., achieving better distillation performance.

5. Discussion and Conclusion

This paper provides a novel viewpoint to interpret logit distillation by reformulating the classical KD loss into two parts, i.e., target class knowledge distillation (TCKD) and non-target class knowledge distillation (NCKD). The effects of both parts are respectively investigated and proved. More importantly, we reveal that the coupled formulation of KD limits the effectiveness and flexibility of knowledge transfer. To overcome these issues, we propose Decoupled Knowledge Distillation (DKD), which achieves significant improvements on the CIFAR-100, ImageNet and MS-COCO datasets for image classification and object detection tasks. Besides, the superiority of DKD in training efficiency and feature transferability is also demonstrated. We hope this paper will contribute to future logit distillation research.

Limitations and future works. Noticeable limitations are discussed as follows. DKD cannot outperform state-of-the-art feature-based methods (e.g., ReviewKD [1]) on object detection tasks, because logits-based methods cannot transfer knowledge about localization. Besides, we have provided intuitive guidance on how to adjust β in our supplement. However, the strict correlation between the distillation performance and β is not fully investigated, which will be our future research direction.
References

[1] Pengguang Chen, Shu Liu, Hengshuang Zhao, and Jiaya Jia. Distilling knowledge via knowledge review. In CVPR, 2021.
[2] Xu Cheng, Zhefan Rao, Yilan Chen, and Quanshi Zhang. Explaining knowledge distillation by quantifying the knowledge. In CVPR, 2020.
[3] Jang Hyun Cho and Bharath Hariharan. On the efficacy of knowledge distillation. In ICCV, 2019.
[4] Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In AISTATS, 2011.
[5] Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment: Learning augmentation strategies from data. In CVPR, 2019.
[6] Tommaso Furlanello, Zachary Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. Born again neural networks. In ICML, 2018.
[7] Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor Tsang, and Masashi Sugiyama. Co-teaching: Robust training of deep neural networks with extremely noisy labels. arXiv:1804.06872, 2018.
[8] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, 2017.
[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[10] Byeongho Heo, Jeesoo Kim, Sangdoo Yun, Hyojin Park, Nojun Kwak, and Jin Young Choi. A comprehensive overhaul of feature distillation. In ICCV, 2019.
[11] Byeongho Heo, Minsik Lee, Sangdoo Yun, and Jin Young Choi. Knowledge transfer via distillation of activation boundaries formed by hidden neurons. In AAAI, 2019.
[12] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv:1503.02531, 2015.
[13] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, 2018.
[14] Zehao Huang and Naiyan Wang. Like what you like: Knowledge distill via neuron selectivity transfer. arXiv:1707.01219, 2017.
[15] Jangho Kim, SeongUk Park, and Nojun Kwak. Paraphrasing complex network: Network compression via factor transfer. In NeurIPS, 2018.
[16] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[17] Quanquan Li, Shengying Jin, and Junjie Yan. Mimicking very efficient network for object detection. In CVPR, 2017.
[18] Zheng Li, Jingwen Ye, Mingli Song, Ying Huang, and Zhigeng Pan. Online knowledge distillation for efficient pose estimation. In ICCV, 2021.
[19] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
[20] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[21] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In ECCV, 2018.
[22] Seyed Iman Mirzadeh, Mehrdad Farajtabar, Ang Li, Nir Levine, Akihiro Matsukawa, and Hassan Ghasemzadeh. Improved knowledge distillation via teacher assistant. In AAAI, 2020.
[23] Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho. Relational knowledge distillation. In CVPR, 2019.
[24] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. 2019.
[25] Baoyun Peng, Xiao Jin, Jiaheng Liu, Dongsheng Li, Yichao Wu, Yu Liu, Shunfeng Zhou, and Zhaoning Zhang. Correlation congruence for knowledge distillation. In CVPR, 2019.
[26] Mary Phuong and Christoph Lampert. Towards understanding knowledge distillation. In ICML, 2019.
[27] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS, 2015.
[28] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. In ICLR, 2015.
[29] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. IJCV, 2015.
[30] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In CVPR, 2018.
[31] Evan Shelhamer, Jonathan Long, and Trevor Darrell. Fully convolutional networks for semantic segmentation. T-PAMI, 2016.
[32] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[33] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive representation distillation. In ICLR, 2020.
[34] Frederick Tung and Greg Mori. Similarity-preserving knowledge distillation. In ICCV, 2019.
[35] Brendan Van Rooyen, Aditya Krishna Menon, and Robert C Williamson. Learning with symmetric label noise: The importance of being unhinged. arXiv:1505.07634, 2015.
[36] Lin Wang and Kuk-Jin Yoon. Knowledge distillation and student-teacher learning for visual intelligence: A review and new outlooks. T-PAMI, 2021.
[37] Tao Wang, Li Yuan, Xiaopeng Zhang, and Jiashi Feng. Distilling object detectors with fine-grained feature imitation. In CVPR, 2019.
[38] Tao Wang, Li Yuan, Xiaopeng Zhang, and Jiashi Feng. Distilling object detectors with fine-grained feature imitation. In CVPR, 2019.
[39] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.
[40] Chenglin Yang, Lingxi Xie, Chi Su, and Alan L Yuille. Snapshot distillation: Teacher-student optimization in one generation. In CVPR, 2019.
[41] Junho Yim, Donggyu Joo, Jihoon Bae, and Junmo Kim. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In CVPR, 2017.
[42] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In BMVC, 2016.
[43] Sergey Zagoruyko and Nikos Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In ICLR, 2017.
[44] Ying Zhang, Tao Xiang, Timothy M Hospedales, and Huchuan Lu. Deep mutual learning. In CVPR, 2018.
[45] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In CVPR, 2017.
A. Appendix

A.1. Details about the reformulation in Sec 3.1

Details of the mathematical derivation in Sec 3.1 of the manuscript are as follows (notations are the same as in Sec 3.1 of the manuscript):

\begin{split} \text{KD} &= \text{KL}(\mathbf{p}^{\mathcal{T}} || \mathbf{p}^{\mathcal{S}}) \\ &= \sum_{i=1}^{C} p^{\mathcal{T}}_{i}\log(\frac{p^{\mathcal{T}}_{i}}{p^{\mathcal{S}}_{i}}) \\ &= p^{\mathcal{T}}_{t}\log(\frac{p^{\mathcal{T}}_{t}}{p^{\mathcal{S}}_{t}}) + \sum_{i=1,i\neq t}^{C} p^{\mathcal{T}}_{i}\log(\frac{p^{\mathcal{T}}_{i}}{p^{\mathcal{S}}_{i}}). \end{split}    (8)

According to Eqn. (1) and Eqn. (2) of the manuscript, we have p̂_i = p_i / p_\t. Thus, we can rewrite Eqn. (8) as:

\begin{split} \text{KD} &= p^{\mathcal{T}}_{t}\log(\frac{p^{\mathcal{T}}_{t}}{p^{\mathcal{S}}_{t}}) + \sum_{i=1,i\neq t}^{C} p^{\mathcal{T}}_{\backslash t}\hat{p}^{\mathcal{T}}_{i}\log(\frac{p^{\mathcal{T}}_{\backslash t}\hat{p}^{\mathcal{T}}_{i}}{p^{\mathcal{S}}_{\backslash t}\hat{p}^{\mathcal{S}}_{i}}) \\ &= p^{\mathcal{T}}_{t}\log(\frac{p^{\mathcal{T}}_{t}}{p^{\mathcal{S}}_{t}}) + \sum_{i=1,i\neq t}^{C} p^{\mathcal{T}}_{\backslash t}\hat{p}^{\mathcal{T}}_{i}\Big(\log(\frac{\hat{p}^{\mathcal{T}}_{i}}{\hat{p}^{\mathcal{S}}_{i}}) + \log(\frac{p^{\mathcal{T}}_{\backslash t}}{p^{\mathcal{S}}_{\backslash t}})\Big) \\ &= p^{\mathcal{T}}_{t}\log(\frac{p^{\mathcal{T}}_{t}}{p^{\mathcal{S}}_{t}}) + \sum_{i=1,i\neq t}^{C} p^{\mathcal{T}}_{\backslash t}\hat{p}^{\mathcal{T}}_{i}\log(\frac{\hat{p}^{\mathcal{T}}_{i}}{\hat{p}^{\mathcal{S}}_{i}}) + \sum_{i=1,i\neq t}^{C} p^{\mathcal{T}}_{\backslash t}\hat{p}^{\mathcal{T}}_{i}\log(\frac{p^{\mathcal{T}}_{\backslash t}}{p^{\mathcal{S}}_{\backslash t}}). \end{split}    (9)

Since p^T_\t and p^S_\t are irrelevant to the class index i, we have:

\begin{split} \sum_{i=1,i\neq t}^{C} p^{\mathcal{T}}_{\backslash t}\hat{p}^{\mathcal{T}}_{i}\log(\frac{p^{\mathcal{T}}_{\backslash t}}{p^{\mathcal{S}}_{\backslash t}}) &= p^{\mathcal{T}}_{\backslash t}\log(\frac{p^{\mathcal{T}}_{\backslash t}}{p^{\mathcal{S}}_{\backslash t}}) \sum_{i=1,i\neq t}^{C} \hat{p}^{\mathcal{T}}_{i} \\ &= p^{\mathcal{T}}_{\backslash t}\log(\frac{p^{\mathcal{T}}_{\backslash t}}{p^{\mathcal{S}}_{\backslash t}}). \end{split}    (10)

Then,

\begin{split} \text{KD} &= \underbrace{p^{\mathcal{T}}_{t}\log(\frac{p^{\mathcal{T}}_{t}}{p^{\mathcal{S}}_{t}}) + p^{\mathcal{T}}_{\backslash t}\log(\frac{p^{\mathcal{T}}_{\backslash t}}{p^{\mathcal{S}}_{\backslash t}})}_{\text{KL}(\mathbf{b}^{\mathcal{T}} || \mathbf{b}^{\mathcal{S}})} + p^{\mathcal{T}}_{\backslash t}\underbrace{\sum_{i=1,i\neq t}^{C} \hat{p}^{\mathcal{T}}_{i}\log(\frac{\hat{p}^{\mathcal{T}}_{i}}{\hat{p}^{\mathcal{S}}_{i}})}_{\text{KL}(\hat{\mathbf{p}}^{\mathcal{T}} || \hat{\mathbf{p}}^{\mathcal{S}})}. \end{split}    (11)

According to the definition of the KL-Divergence, Eqn. (11) can be rewritten as (which is the same as Eqn. (5) of the manuscript):

\text{KD} = \text{KL}(\mathbf{b}^{\mathcal{T}} || \mathbf{b}^{\mathcal{S}}) + (1 - p_{t}^{\mathcal{T}})\,\text{KL}(\hat{\mathbf{p}}^{\mathcal{T}} || \hat{\mathbf{p}}^{\mathcal{S}}).    (12)

A.2. Implementation: Experiments in Sec 4

CIFAR-100: Our implementation for CIFAR-100 follows the practice in [33]. Teachers and students are trained for 240 epochs with SGD. As the batch size is 64, the learning rate is 0.01 for ShuffleNet [21] and MobileNet-V2 [30], and 0.05 for the other series (e.g., VGG [32], ResNet [9] and WRN [42]). The learning rate is divided by 10 at 150, 180 and 210 epochs. The weight decay and the momentum are set to 5e-4 and 0.9. The weight for the cross-entropy loss is set to 1.0. The temperature is set as 4 and α is set as 1.0 for all experiments. The proper value of β could be different for different teachers; the details and discussions are in the next section. We also utilize a 20-epoch linear warmup for all experiments, since the value of β could be high, leading to a large initial loss.

ImageNet: Our implementation for ImageNet follows the standard practice. We train the models for 100 epochs. As the batch size is 512, the learning rate is initialized to 0.2 and divided by 10 every 30 epochs. Weight decay is 1e-4 and the weight for the cross-entropy loss is set to 1.0. We set the temperature as 1 and α as 0.5 for all experiments. Strictly following [1, 33], for distilling networks of the same architecture, the teacher is a ResNet-34 model, the student is ResNet-18, and β is set to 0.5. For different series, the teacher is a ResNet-50 model, the student is MobileNet-V1, and β is set to 2.0.

MS-COCO: Our implementation for MS-COCO follows the settings in [1]. We use the two-stage method Faster R-CNN [27] with FPN [19] as the feature extractor. ResNet [9] models and MobileNet-V2 [30] are selected as teachers and students. All students are trained with the 1x scheduler (schedulers and task-specific loss weights follow Detectron2 [39]). We employ the DKD loss on the R-CNN head, and set α as 1.0, β as 0.25, and the temperature as 1 for all experiments. Results of compared methods are reported in their original papers or reproduced by previous works [1, 33].

A.3. Guidance for tuning β

We suppose that the importance of NCKD in knowledge transfer could be related to the confidence of the teacher. Intuitively, the more confident the teacher is, the more valuable the NCKD could be, and the larger β should be applied. However, NCKD could increase the gradient contributed by the logits of non-target classes. Thus, an improperly large β could harm the correctness of the student's prediction. If the logit value of the target class is much higher than those of all non-target classes, the teacher could be regarded as more confident and a larger β could be more reasonable. Thus, we suppose that the value of β could be related to the logit gap between the target and all non-target classes. Specifically, the gap between the logit of the target class (i.e., z_t, where z represents the output logits and t represents the target class) and the maximum logit among the non-target classes could be reliable guidance for tuning β, which can be denoted as z_t − z_max, where z_max = max({z_i | i ≠ t}).
β             ResNet56   WRN-40-2   ResNet32×4
1.0           76.02      75.94      74.95
2.0           76.32      76.25      75.64
4.0           75.91      76.17      75.82
6.0           75.62      76.70      76.34
8.0           75.33      76.44      76.45
10.0          75.35      76.21      76.32
z_t − z_max   5.36       7.24       8.40
Table A.1. Accuracy (%) on CIFAR-100 [16] with different β and different teachers. The gap (z_t − z_max) is also reported.

We report experimental results on CIFAR-100 [16] to verify this conjecture. We select ResNet56, WRN-40-2 and ResNet32×4 as teachers and ShuffleNet-V1 as the student, and apply different β. Both the top-1 accuracy (%) and the gap z_t − z_max (averaged over all training samples) are reported. As shown in Table A.1, the best value of β could be positively proportional to the gap, which we suppose could serve as guidance for tuning β and a direction for further research. Based on this, the value of β for each teacher in Table 6 and Table 7 of the manuscript is set as follows (in Table A.2):

teacher      z_t − z_max   β
ResNet56     5.36          2.0
ResNet110    6.73          2.0
WRN-40-2     7.24          6.0
VGG13        8.25          6.0
ResNet50     8.53          8.0
ResNet32×4   8.40          8.0
Table A.2. The value of β for different teachers in Table 6 and Table 7 of the manuscript.
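The gap statistic z_t − z_max can be computed with one pass of the teacher over the training set; the sketch below is an illustrative helper (the teacher and loader objects are hypothetical) that averages it over all training samples:

import torch

@torch.no_grad()
def average_logit_gap(teacher, loader, device="cuda"):
    # Average of z_t - max_{i != t} z_i over the training set, used here only as
    # a rough guide for choosing beta (larger gap -> larger beta).
    total, count = 0.0, 0
    teacher.eval()
    for images, labels in loader:
        logits = teacher(images.to(device))
        labels = labels.to(device)
        z_t = logits.gather(1, labels.unsqueeze(1)).squeeze(1)
        # mask out the target logit before taking the max over non-target classes
        z_max = logits.scatter(1, labels.unsqueeze(1), float("-inf")).max(dim=1).values
        total += (z_t - z_max).sum().item()
        count += labels.numel()
    return total / count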
A.4. Implementation: Experiments in Sec 3.2

In this part, we report the implementation details of the experiments in Sec 3.2 of the manuscript.

Basic settings. We set the loss weights of KD and CE to 1.0 and 1.0, respectively (instead of the default 0.1 CE + 0.9 KD setting in [33]). The setting in [33] follows the loss form proposed by [12], which assumes that the sum of all terms' weights should be 1.0. However, the NCKD loss is target-irrelevant, which means the target-relevant loss could be only 0.1 if we utilized the original setting when only applying NCKD. Based on this, we set the loss weights of all terms (e.g., KD, TCKD, NCKD and CE) to 1.0 for all experiments in Sec 3.2 of the manuscript.

Strong augmentation. We employ AutoAugment [5] to reveal the effectiveness of TCKD in Sec 3.2 of the manuscript. Specifically, we add the CIFAR AutoAugment policy (https://github.com/DeepVoltaire/AutoAugment) after applying the default augmentation (random crop and horizontal flip). Then we train the teacher and the student with the same augmentation policy.

Noisy labels. We also perform experiments on noisy training data to verify that TCKD conveys the knowledge about sample "difficulty". Specifically, we follow the settings of [7, 35], utilizing the symmetric noise type (https://github.com/bhanML/Co-teaching/blob/master/data/cifar.py). We train a teacher network on the noisy training data and select the best epoch to distill the student (on the same training data).

A.5. Explanation of why TCKD brings a performance drop in Table 1

In Table 1 of the manuscript, we reveal that singly applying TCKD can sometimes bring a performance drop. An explanation for this phenomenon is that the high temperature (T=4) leads to a large gradient that increases the non-target classes' logits, which harms the correctness of the student's prediction. Without NCKD, the information about class similarity (i.e., the prominent dark knowledge) is not available, so TCKD's gradient does no good but leads to a performance drop (since TCKD brings only marginal performance gains on easy-to-fit training data). To verify that a large temperature is not proper when singly applying TCKD, we perform experiments with different temperatures (T) in the table below.

T      1      2      3      4
top-1  73.24  73.05  71.69  70.96
Table A.3. Accuracy (%) with different temperature (T) when only applying TCKD. The teacher and the student are set as WRN-40-2 and WRN-16-2, respectively.

Results in Table A.3 show that the performance is almost the same as the vanilla training baseline (73.26 in Table 1 of the manuscript) when the temperature is set as 1, and the performance drop is positively related to the temperature.

A.6. How to employ DKD on detectors

In this paper, we employ our DKD on the two-stage object detector Faster R-CNN, and only on the R-CNN head. Specifically, given a student network, we utilize the labels assigned to the proposals (generated by the RPN module) as the target classes (if IoU(proposal) < 0.5, the target class is set as "background"). Then, we use a teacher network to get the R-CNN prediction logits of the same proposals (the locations are the same, while the features are from the teacher's backbone). Thus, we can employ our DKD by minimizing the KL-Divergence (i.e., TCKD and NCKD) between the student's logits and the teacher's.
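In code, this reduces to reusing the classification-style dkd_loss of Algorithm 1 on the per-proposal R-CNN logits; the sketch below is schematic (the tensors and the helper name are hypothetical), with α = 1.0, β = 0.25 and temperature 1 as stated above, and with the background treated as an ordinary class:

import torch.nn.functional as F

def rcnn_head_dkd(stu_logits, tea_logits, proposal_labels, num_classes,
                  alpha=1.0, beta=0.25, T=1.0):
    # stu_logits, tea_logits: (P, num_classes) R-CNN classification logits for the
    # same P proposals; proposal_labels: (P,) assigned classes, where proposals with
    # IoU below 0.5 against every ground-truth box are already labeled as background.
    t = F.one_hot(proposal_labels, num_classes).bool()
    return dkd_loss(stu_logits, tea_logits, t, alpha, beta, T)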
A.7. Implementation: Experiments in Sec 4.2

Training efficiency. We report the training time of each distillation method in Figure 2 of the manuscript. The training time (per batch) is the sum of (1) the data processing time (e.g., including the time to sample the contrast examples in [33]), (2) the network forward time and the gradient backward time, and (3) the memory updating time (e.g., including the time to update the contrast memory in [33]). We also report the number of extra parameters for each method. Besides the learnable parameters (e.g., the connectors in [10] and the ABF modules in [1]), we also count the extra dictionary memory (e.g., the contrast memory in [33]).

Feature transferability. We perform linear probing experiments to verify the feature transferability of our DKD in Sec 4.2 of the manuscript. We use the WRN-16-2 distilled from a WRN-40-2 teacher as the feature extractor (only using the feature generated by the final global average pooling module), then train linear fully-connected (FC) layers as classifier modules for the STL-10 and Tiny-ImageNet datasets (the feature extractor is fixed during training). We train the FC layers via an SGD optimizer with 0.9 momentum and 0.0 weight decay. The total number of epochs is set as 40, and the learning rate is set to 0.1 for a batch size of 128 and divided by 10 every 10 epochs.
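The linear-probing protocol above amounts to training a single FC layer on frozen, globally pooled features; the following sketch (the extractor and dataloader objects are hypothetical) follows the stated schedule:

import torch
import torch.nn as nn

def linear_probe(extractor, train_loader, feat_dim, num_classes, device="cuda"):
    # Train an FC classifier on frozen features: SGD, momentum 0.9, no weight
    # decay, 40 epochs, lr 0.1 (batch size 128) divided by 10 every 10 epochs.
    extractor.eval()                       # the distilled student stays fixed
    fc = nn.Linear(feat_dim, num_classes).to(device)
    optim = torch.optim.SGD(fc.parameters(), lr=0.1, momentum=0.9, weight_decay=0.0)
    sched = torch.optim.lr_scheduler.StepLR(optim, step_size=10, gamma=0.1)
    for epoch in range(40):
        for images, labels in train_loader:
            with torch.no_grad():
                feats = extractor(images.to(device))   # pooled features
            loss = nn.functional.cross_entropy(fc(feats), labels.to(device))
            optim.zero_grad()
            loss.backward()
            optim.step()
        sched.step()
    return fc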
