Decoupled Knowledge Distillation
Borui Zhao1 Quan Cui2 Renjie Song1 Yiyu Qiu1,3 Jiajun Liang1
1MEGVII Technology   2Waseda University   3Tsinghua University
zhaoborui.gm@gmail.com, cui-quan@toki.waseda.jp,
chouyy18@mails.tsinghua.edu.cn, {songrenjie, liangjiajun}@megvii.com
Abstract

We reformulate the classical KD loss into two parts, i.e., target classification knowledge distillation (TCKD) and non-target classification knowledge distillation (NCKD), and empirically study the effects of each part. TCKD transfers knowledge concerning the "difficulty" of training samples, while NCKD is the prominent reason why logit distillation works. More importantly, we reveal that the classical KD loss is a coupled formulation, which (1) suppresses the effectiveness of NCKD and (2) limits the flexibility to balance the two parts. To address these issues, we present Decoupled Knowledge Distillation (DKD), enabling TCKD and NCKD to play their roles more efficiently and flexibly. Compared with complex feature-based methods, our DKD achieves comparable or even better results and has better training efficiency on CIFAR-100, ImageNet, and MS-COCO datasets for image classification and object detection tasks. This paper proves the great potential of logit distillation, and we hope it will be helpful for future research. The code is available at https://github.com/megvii-research/mdistiller.

(a) Classical KD: Classical KD = TCKD + (1 − p^T_t) ∗ NCKD.
(b) Decoupled Knowledge Distillation (DKD): DKD Loss = α ∗ TCKD + β ∗ NCKD.
Figure 1. Illustration of the classical KD [12] and our DKD. We reformulate KD into a weighted sum of two parts, i.e., TCKD and NCKD. The first equation shows that KD (1) couples NCKD with p^T_t (the teacher's confidence on the target class), and (2) couples the importance of the two parts. Furthermore, we demonstrate that the first coupling suppresses the effectiveness, and the second limits the flexibility for knowledge transfer. We propose DKD to address these issues, which employs hyper-parameters α for TCKD and β for NCKD, killing the two birds with one stone.
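As a concrete illustration of the first coupling (the numbers below are only an illustrative example, not taken from the paper's experiments), the weight on NCKD in the classical formulation shrinks exactly when the teacher is most confident:

```latex
% Illustrative arithmetic for the (1 - p_t^T) coupling in classical KD.
\text{Classical KD} = \text{TCKD} + (1 - p^{\mathcal{T}}_{t})\cdot\text{NCKD}
\;\Longrightarrow\;
\begin{cases}
p^{\mathcal{T}}_{t} = 0.99: & \text{NCKD is scaled by } 1 - 0.99 = 0.01,\\
p^{\mathcal{T}}_{t} = 0.60: & \text{NCKD is scaled by } 1 - 0.60 = 0.40.
\end{cases}
```

DKD instead scales NCKD by the constant β (and TCKD by α), so samples the teacher predicts confidently, which are argued in the introduction to carry the most reliable knowledge, are no longer suppressed.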
1. Introduction
In the last decades, the computer vision field has been revolutionized by deep neural networks (DNN), which successfully boost various real-scenario tasks, e.g., image classification [9, 13, 21], object detection [8, 27], and semantic segmentation [31, 45]. However, powerful networks normally benefit from large model capacities, introducing high computational and storage costs. Such costs are not preferable in industrial applications, where lightweight models are widely deployed. In the literature, a potential direction of cutting down the costs is knowledge distillation (KD). KD represents a series of methods concentrating on transferring knowledge from a heavy model (teacher) to a light one (student), which can improve the light model's performance without introducing extra costs.

The concept of KD was firstly proposed in [12] to transfer the knowledge via minimizing the KL-Divergence between prediction logits of teachers and students (Figure 1a). Since [28], most of the research attention has been drawn to distilling knowledge from deep features of intermediate layers. Compared with logits-based methods, the performance
of feature distillation is superior on various tasks, so research on logit distillation has been barely touched. However, the training costs of feature-based methods are unsatisfactory, because extra computational and storage usage is introduced (e.g., network modules and complex operations) for distilling deep features during training time.

Logit distillation requires marginal computational and storage costs, but the performance is inferior. Intuitively, logit distillation should achieve comparable performance to feature distillation, since logits are at a higher semantic level than deep features. We suppose that the potential of logit distillation is limited by unknown reasons, causing the unsatisfactory performance. To revitalize logits-based methods, we start this work by delving into the mechanism of KD. Firstly, we divide a classification prediction into two levels: (1) a binary prediction for the target class and all the non-target classes and (2) a multi-category prediction for each non-target class. Based on this, we reformulate the classical KD loss [12] into two parts, as shown in Figure 1b. One is a binary logit distillation for the target class and the other is a multi-category logit distillation for non-target classes. For simplification, we respectively name them target classification knowledge distillation (TCKD) and non-target classification knowledge distillation (NCKD). The reformulation allows us to study the effects of the two parts independently.

TCKD transfers knowledge via binary logit distillation, which means only the prediction of the target class is provided while the specific prediction of each non-target class is unknown. A reasonable hypothesis is that TCKD transfers knowledge about the "difficulty" of training samples, i.e., the knowledge describes how difficult it is to recognize each training sample. To validate this, we design experiments from three aspects to increase the "difficulty" of training data, i.e., stronger augmentation, noisier labels, and inherently challenging datasets.

NCKD only considers the knowledge among non-target logits. Interestingly, we empirically prove that only applying NCKD achieves comparable or even better results than the classical KD, indicating the vital importance of the knowledge contained in non-target logits, which could be the prominent "dark knowledge".

More importantly, our reformulation demonstrates that the classical KD loss is a highly coupled formulation (as shown in Figure 1b), which could be the reason why the potential of logit distillation is limited. Firstly, the NCKD loss term is weighted by a coefficient that negatively correlates with the teacher's prediction confidence on the target class. Thus, larger prediction scores lead to smaller weights. The coupling significantly suppresses the effects of NCKD on well-predicted training samples. Such suppression is not preferable, since the more confident the teacher is about a training sample, the more reliable and valuable knowledge it could provide. Secondly, the importance of TCKD and NCKD is coupled, i.e., weighting TCKD and NCKD separately is not allowed. Such a limitation is not preferable, since TCKD and NCKD should be considered separately, as their contributions come from different aspects.

To address these issues, we propose a flexible and efficient logit distillation method named Decoupled Knowledge Distillation (DKD, Figure 1b). DKD decouples the NCKD loss from the coefficient negatively correlated with the teacher's confidence by replacing it with a constant value, improving the distillation effectiveness on well-predicted samples. Meanwhile, NCKD and TCKD are also decoupled so that their importance can be separately considered by adjusting the weight of each part.

Overall, our contributions are summarized as follows:
• We provide an insightful view to study logit distillation by dividing the classical KD into TCKD and NCKD. Additionally, the effects of both parts are respectively analyzed and proved.
• We reveal limitations of the classical KD loss caused by its highly coupled formulation. Coupling NCKD with the teacher's confidence suppresses the effectiveness of knowledge transfer. Coupling TCKD with NCKD limits the flexibility to balance the two parts.
• We propose an effective logit distillation method named DKD to overcome these limitations. DKD achieves state-of-the-art performances on various tasks. We also empirically validate the higher training efficiency and better feature transferability of DKD compared with feature-based distillation methods.

2. Related work

The concept of knowledge distillation (KD) was firstly proposed by Hinton et al. in [12]. KD defines a learning manner where a bigger teacher network is employed to guide the training of a smaller student network for many tasks [12, 17, 18]. The "dark knowledge" is transferred to students via soft labels from teachers. To raise the attention on negative logits, the hyper-parameter temperature was introduced. The following works can be divided into two types: distillation from logits [3, 6, 22, 40, 44] and distillation from intermediate features [10, 11, 14, 15, 23, 25, 28, 33, 34, 41, 43].

Previous works on logit distillation mainly focus on proposing effective regularization and optimization methods rather than novel distillation formulations. DML [44] proposes a mutual learning manner to train students and teachers simultaneously. TAKD [22] introduces an intermediate-sized network named "teacher assistant" to bridge the gap between teachers and students. Besides, several works also focus on interpreting the classical KD method [2, 26].

State-of-the-art methods are mainly based on intermediate features, which can directly transfer representations from the teacher to the student [10, 11, 28] or transfer the
correlation between samples captured by the teacher to the student [23, 33, 34]. Most of the feature-based methods could achieve preferable performances (significantly higher than logits-based methods), yet involve considerably high computational and storage costs.

This paper focuses on analyzing what limits the potential of logits-based methods and on revitalizing logit distillation.

3. Rethinking Knowledge Distillation

For a training sample from the t-th class with logits z = [z_1, ..., z_C] over C classes, the probability of the target class, p_t, and the total probability of all non-target classes, p_{\backslash t}, are

p_{t} = \frac{\exp(z_{t})}{\sum_{j=1}^{C} \exp(z_{j})}, \quad p_{\backslash t} = \frac{\sum_{k=1,k\neq t}^{C} \exp(z_{k})}{\sum_{j=1}^{C} \exp(z_{j})}.  (1)

The classical KD loss is the KL-Divergence between the teacher's and the student's predictions:

\begin{split}
\text{KD} &= \text{KL}(\mathbf{p}^{\mathcal{T}} \,\|\, \mathbf{p}^{\mathcal{S}}) \\
&= p^{\mathcal{T}}_{t}\log\Big(\frac{p^{\mathcal{T}}_{t}}{p^{\mathcal{S}}_{t}}\Big) + \sum_{i=1,i\neq t}^{C} p^{\mathcal{T}}_{i}\log\Big(\frac{p^{\mathcal{T}}_{i}}{p^{\mathcal{S}}_{i}}\Big).
\end{split}  (3)

According to Eqn.(1) and Eqn.(2) we have \hat{p}_{i} = p_{i} / p_{\backslash t}, so we can rewrite Eqn.(3) as:

\begin{split}
\text{KD} &= p^{\mathcal{T}}_{t}\log\Big(\frac{p^{\mathcal{T}}_{t}}{p^{\mathcal{S}}_{t}}\Big) + p^{\mathcal{T}}_{\backslash t}\sum_{i=1,i\neq t}^{C} \hat{p}^{\mathcal{T}}_{i}\Big(\log\Big(\frac{\hat{p}^{\mathcal{T}}_{i}}{\hat{p}^{\mathcal{S}}_{i}}\Big) + \log\Big(\frac{p^{\mathcal{T}}_{\backslash t}}{p^{\mathcal{S}}_{\backslash t}}\Big)\Big) \\
&= \underbrace{p^{\mathcal{T}}_{t}\log\Big(\frac{p^{\mathcal{T}}_{t}}{p^{\mathcal{S}}_{t}}\Big) + p^{\mathcal{T}}_{\backslash t}\log\Big(\frac{p^{\mathcal{T}}_{\backslash t}}{p^{\mathcal{S}}_{\backslash t}}\Big)}_{\text{KL}(\mathbf{b}^{\mathcal{T}}\|\mathbf{b}^{\mathcal{S}})} + p^{\mathcal{T}}_{\backslash t}\underbrace{\sum_{i=1,i\neq t}^{C} \hat{p}^{\mathcal{T}}_{i}\log\Big(\frac{\hat{p}^{\mathcal{T}}_{i}}{\hat{p}^{\mathcal{S}}_{i}}\Big)}_{\text{KL}(\hat{\mathbf{p}}^{\mathcal{T}}\|\hat{\mathbf{p}}^{\mathcal{S}})}.
\end{split}  (4)

3.2. Effects of TCKD and NCKD
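To make the TCKD/NCKD decomposition in Eqn.(4) concrete, the snippet below is a minimal PyTorch-style sketch of how the two terms and the DKD weighting of Figure 1b could be computed. It is an illustrative reconstruction rather than the released mdistiller implementation: the helper name dkd_loss, the temperature value T = 4, the omission of the usual T^2 scaling, and the batch reductions are assumptions; the defaults α = 1.0 and β = 8.0 mirror the CIFAR-100 ablation in Sec. 4.1.

```python
# Illustrative sketch of Eqn.(4): classical KD = TCKD + (1 - p_t^T) * NCKD,
# and the decoupled loss DKD = alpha * TCKD + beta * NCKD (Figure 1b).
# Names, temperature, and reductions are assumptions, not the released code.
import torch
import torch.nn.functional as F

def dkd_loss(logits_student, logits_teacher, target, alpha=1.0, beta=8.0, T=4.0):
    num_classes = logits_student.shape[1]
    p_s = F.softmax(logits_student / T, dim=1)   # student probabilities
    p_t = F.softmax(logits_teacher / T, dim=1)   # teacher probabilities
    mask = F.one_hot(target, num_classes).bool() # one True per row at the target class

    # Binary distributions b = [p_t, p_\t] over target vs. all non-target classes.
    b_s = torch.stack([p_s[mask], 1.0 - p_s[mask]], dim=1)
    b_t = torch.stack([p_t[mask], 1.0 - p_t[mask]], dim=1)
    # TCKD = KL(b^T || b^S), averaged over the batch.
    tckd = (b_t * (b_t.clamp_min(1e-8).log() - b_s.clamp_min(1e-8).log())).sum(1).mean()

    # hat_p: probabilities re-normalized among non-target classes only
    # (the relation hat_p_i = p_i / p_\t above), via masking out the target logit.
    hat_s = F.softmax(logits_student / T - 1e9 * mask.float(), dim=1)
    hat_t = F.softmax(logits_teacher / T - 1e9 * mask.float(), dim=1)
    # NCKD = KL(hat_p^T || hat_p^S), averaged over the batch.
    nckd = (hat_t * (hat_t.clamp_min(1e-8).log() - hat_s.clamp_min(1e-8).log())).sum(1).mean()

    # Classical KD couples the per-sample NCKD term with the weight (1 - p_t^T);
    # DKD replaces that coefficient with constant weights alpha and beta.
    return alpha * tckd + beta * nckd
```

Replacing the constant β with the per-sample weight (1 − p^T_t) would recover the coupled form of the classical KD loss in Eqn.(4).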
ImageNet is a large-scale image classification dataset which consists of 1000 classes. The training set contains 1.28 million images and the validation set contains 50k images. MS-COCO [20] is an 80-category general object detection dataset. The train2017 split contains 118k images, and the val2017 split contains 5k images. All implementation details are attached in the supplement due to the page limit.

4.1. Main Results

Firstly, we demonstrate the improvements contributed by decoupling (1) NCKD and p^T_t and (2) NCKD and TCKD, respectively. Then, we benchmark our method on image classification and object detection tasks.

Ablation: α and β. The two tables below report the student accuracy (%) with different α and β. ResNet32×4 and ResNet8×4 are set as the teacher and the student, respectively. Firstly, we prove that decoupling (1 − p^T_t) and NCKD brings a reasonable performance gain (73.63% vs. 74.79%) in the first table. Then, we demonstrate that decoupling the weights of NCKD and TCKD contributes further improvements (74.79% vs. 76.32%). Moreover, the second table indicates that TCKD is indispensable, and the improvements from TCKD are stable with different α around 1.0. (We fix α as 1.0 for simplification in the first table, and β as 8.0 in the second table since it achieves the best performance in the first one.)

β       1 − p^T_t   1.0     2.0     4.0     8.0     10.0
top-1   73.63       74.79   75.44   75.94   76.32   76.18

α       0.0     0.2     0.5     1.0     2.0     4.0
top-1   75.30   75.64   76.12   76.32   76.11   75.42

CIFAR-100 image classification. We discuss experimental results on CIFAR-100 to examine our DKD. The validation accuracy is reported in Table 6 and Table 7. Table 6 contains the results where teachers and students are of the same network architectures. Table 7 shows the results where teachers and students are from different series.

Notably, DKD achieves consistent improvements on all teacher-student pairs, compared with the baseline and the classical KD. Our method achieves 1∼2% and 2∼3% improvements on teacher-student pairs of the same and different series, respectively. This strongly supports the effectiveness of DKD. Furthermore, DKD achieves comparable or even better performances than feature-based distillation methods, significantly improving the trade-off between distillation performance and training efficiency, which will be further discussed in Sec. 4.2.

ImageNet image classification. Top-1 and top-5 accuracies of image classification on ImageNet are reported in Table 8 and Table 9. Our DKD achieves a significant improvement. It is worth mentioning that the performance of DKD is better than most state-of-the-art results of feature distillation methods.

MS-COCO object detection. As discussed in previous works, the performance of the object detection task greatly depends on the quality of deep features to locate objects of interest. This rule also stands in transferring knowledge between detectors [17, 37], i.e., feature mimicking is of vital importance since logits are not capable of providing knowledge for object localization. As shown in Table 10, applying DKD alone can hardly achieve outstanding performance, but it surpasses the classical KD as expected. Thus, we introduce the feature-based distillation method ReviewKD [1] to obtain satisfactory results. It can be observed that our DKD can further boost the AP metrics, even though the distillation performance of ReviewKD is already relatively high. Conclusively, new state-of-the-art results are obtained by combining our DKD with feature-based distillation methods on the object detection task.

4.2. Extensions

For a better understanding of DKD, we provide extensions from four perspectives. First of all, we comprehensively compare the training efficiency of DKD with representative state-of-the-art methods. Then, we provide a new
teacher          ResNet32×4      WRN-40-2        VGG13           ResNet50        ResNet32×4
                 79.42           75.61           74.64           79.34           79.42
student          ShuffleNet-V1   ShuffleNet-V1   MobileNet-V2    MobileNet-V2    ShuffleNet-V2
                 70.50           70.50           64.60           64.60           71.82
features
FitNet [28]      73.59           73.73           64.14           63.16           73.54
RKD [23]         72.28           72.21           64.52           64.43           73.21
CRD [33]         75.11           76.05           69.73           69.11           75.65
OFD [10]         75.98           75.85           69.48           69.04           76.82
ReviewKD [1]     77.45           77.14           70.37           69.89           77.78
logits
KD [12]          74.07           74.83           67.37           67.35           74.45
DKD              76.45           76.70           69.71           70.35           77.07
∆                +2.38           +1.87           +2.34           +3.00           +2.62
Table 7. Results on the CIFAR-100 validation. Teachers and students are of different architectures. ∆ represents the performance improvement over the classical KD. All results are averaged over 5 trials.
distillation manner            features                                        logits
        teacher   student   AT [43]   OFD [10]   CRD [33]   ReviewKD [1]   KD [12]   KD*     DKD
top-1   73.31     69.75     70.69     70.81      71.17      71.61          70.66     71.03   71.70
top-5   91.42     89.07     90.01     89.98      90.13      90.51          89.88     90.05   90.41
Table 8. Top-1 and top-5 accuracy (%) on the ImageNet validation. We set ResNet-34 as the teacher and ResNet-18 as the student. KD* represents the result of our implementation. All results are averaged over 3 trials.

distillation manner            features                                        logits
        teacher   student   AT [43]   OFD [10]   CRD [33]   ReviewKD [1]   KD [12]   KD*     DKD
top-1   76.16     68.87     69.56     71.25      71.37      72.56          68.58     70.50   72.05
top-5   92.86     88.76     89.33     90.34      90.41      91.00          88.98     89.80   91.05
Table 9. Top-1 and top-5 accuracy (%) on the ImageNet validation. We set ResNet-50 as the teacher and MobileNet-V1 as the student. KD* represents the result of our implementation. All results are averaged over 3 trials.
perspective to explain why bigger models are not always better teachers and alleviate this problem by utilizing DKD. Moreover, following [33], we examine the transferability of deep features learned by DKD. We also present some visualizations to validate the superiority of DKD.

Training efficiency. We assess the training costs of state-of-the-art distillation methods, proving the high training efficiency of DKD. As shown in Figure 2, our DKD achieves the best trade-off between model performance and training cost. Besides, DKD requires no extra parameters. However, feature-based distillation methods require extra training time for distilling intermediate layer features, as well as extra GPU memory costs.

For example, ESKD [3] employs early-stopped teacher models to alleviate this problem, and these teachers would be under-converged and yield smaller p^T_t. To validate our conjecture, we perform our DKD on a series of teacher models. Experimental results in Table 11 and Table 12 consistently indicate that our DKD alleviates the "bigger models are not always better teachers" problem.