From Knowledge Distillation to Self-Knowledge Distillation: A Unified Approach with Normalized Loss and Customized Soft Labels

Yang, Zhendong; Zeng, Ailing; Li, Zhe; Zhang, Tianke; Yuan, Chun; Li, Yu

Computer Science > Computer Vision and Pattern Recognition

arXiv:2303.13005 (cs)

[Submitted on 23 Mar 2023 (v1), last revised 17 Jul 2023 (this version, v2)]

Title:From Knowledge Distillation to Self-Knowledge Distillation: A Unified Approach with Normalized Loss and Customized Soft Labels

Authors:Zhendong Yang, Ailing Zeng, Zhe Li, Tianke Zhang, Chun Yuan, Yu Li

View PDF

Abstract:Knowledge Distillation (KD) uses the teacher's prediction logits as soft labels to guide the student, while self-KD does not need a real teacher to require the soft labels. This work unifies the formulations of the two tasks by decomposing and reorganizing the generic KD loss into a Normalized KD (NKD) loss and customized soft labels for both target class (image's category) and non-target classes named Universal Self-Knowledge Distillation (USKD). We decompose the KD loss and find the non-target loss from it forces the student's non-target logits to match the teacher's, but the sum of the two non-target logits is different, preventing them from being identical. NKD normalizes the non-target logits to equalize their sum. It can be generally used for KD and self-KD to better use the soft labels for distillation loss. USKD generates customized soft labels for both target and non-target classes without a teacher. It smooths the target logit of the student as the soft target label and uses the rank of the intermediate feature to generate the soft non-target labels with Zipf's law. For KD with teachers, our NKD achieves state-of-the-art performance on CIFAR-100 and ImageNet datasets, boosting the ImageNet Top-1 accuracy of ResNet18 from 69.90% to 71.96% with a ResNet-34 teacher. For self-KD without teachers, USKD is the first self-KD method that can be effectively applied to both CNN and ViT models with negligible additional time and memory cost, resulting in new state-of-the-art results, such as 1.17% and 0.55% accuracy gains on ImageNet for MobileNet and DeiT-Tiny, respectively. Our codes are available at this https URL.

Comments:	Accepted by ICCV 2023
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2303.13005 [cs.CV]
	(or arXiv:2303.13005v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2303.13005

Submission history

From: Zhendong Yang [view email]
[v1] Thu, 23 Mar 2023 02:59:36 UTC (2,367 KB)
[v2] Mon, 17 Jul 2023 12:22:21 UTC (2,367 KB)

Computer Science > Computer Vision and Pattern Recognition

Title:From Knowledge Distillation to Self-Knowledge Distillation: A Unified Approach with Normalized Loss and Customized Soft Labels

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Computer Vision and Pattern Recognition

Title:From Knowledge Distillation to Self-Knowledge Distillation: A Unified Approach with Normalized Loss and Customized Soft Labels

Submission history

Access Paper:

References & Citations

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators