The Role of Masking for Efficient Supervised Knowledge Distillation of Vision Transformers

Son, Seungwoo; Ryu, Jegwang; Lee, Namhoon; Lee, Jaeho

arXiv:2302.10494 (cs)

[Submitted on 21 Feb 2023 (v1), last revised 15 Jul 2024 (this version, v3)]

Abstract: Knowledge distillation is an effective method for training lightweight vision models. However, acquiring teacher supervision for training samples is often costly, especially from large-scale models like vision transformers (ViTs). In this paper, we develop a simple framework to reduce the supervision cost of ViT distillation: masking out a fraction of input tokens given to the teacher. By masking input tokens, one can skip the computations associated with the masked tokens without requiring any change to teacher parameters or architecture. We find that masking patches with the lowest student attention scores is highly effective, saving up to 50% of teacher FLOPs without any drop in student accuracy, while other masking criteria lead to suboptimal efficiency gains. Through in-depth analyses, we reveal that the student-guided masking provides a good curriculum to the student, making teacher supervision easier to follow during the early stage and challenging in the later stage.
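The student-guided masking idea in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the abstract does not say how the student attention scores are computed, so the sketch simply assumes one score per patch token is already available. The function names `select_tokens_to_keep` and `mask_teacher_input` and the `mask_ratio` parameter are hypothetical.

```python
from typing import List, Sequence


def select_tokens_to_keep(student_attention: Sequence[float], mask_ratio: float) -> List[int]:
    """Return indices of the patch tokens that stay visible to the teacher.

    Assumption: `student_attention` holds one precomputed score per patch token.
    The tokens with the lowest scores are masked out.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    num_tokens = len(student_attention)
    num_masked = int(num_tokens * mask_ratio)
    # Rank token indices from lowest to highest score (sorted() is stable for ties).
    ranked = sorted(range(num_tokens), key=lambda i: student_attention[i])
    masked = set(ranked[:num_masked])
    # Keep the surviving tokens in their original spatial order.
    return [i for i in range(num_tokens) if i not in masked]


def mask_teacher_input(tokens: Sequence, student_attention: Sequence[float], mask_ratio: float) -> list:
    """Drop the masked tokens so the teacher only processes the kept ones."""
    keep = select_tokens_to_keep(student_attention, mask_ratio)
    return [tokens[i] for i in keep]


# Toy example: six patch tokens with hypothetical student attention scores.
scores = [0.05, 0.30, 0.10, 0.25, 0.20, 0.10]
print(select_tokens_to_keep(scores, mask_ratio=0.5))  # -> [1, 3, 4]
```

Because the masked tokens are removed from the teacher's input rather than replaced, the teacher runs on a shorter sequence and needs no change to its parameters or architecture, which is the source of the savings the abstract describes.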

Comments:	ECCV 2024
Subjects:	Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2302.10494 [cs.LG]
	(or arXiv:2302.10494v3 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2302.10494

Submission history

From: Seungwoo Son
[v1] Tue, 21 Feb 2023 07:48:34 UTC (5,837 KB)
[v2] Wed, 31 May 2023 04:50:46 UTC (3,084 KB)
[v3] Mon, 15 Jul 2024 06:37:04 UTC (8,169 KB)
