Training data-efficient image transformers & distillation through attention

H Touvron, M Cord, M Douze, F Massa, A Sablayrolles, H Jégou

International Conference on Machine Learning, 2021 • proceedings.mlr.press

Abstract

Recently, neural networks purely based on attention were shown to address image understanding tasks such as image classification. These high-performing vision transformers are pre-trained with hundreds of millions of images using a large infrastructure, thereby limiting their adoption. In this work, we produce competitive convolution-free transformers trained on ImageNet only using a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop) on ImageNet with no external data. We also introduce a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention, typically from a convnet teacher. The learned transformers are competitive (85.2% top-1 acc.) with the state of the art on ImageNet, and similarly when transferred to other tasks. We will share our code and models.
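
The teacher-student strategy in the abstract can be illustrated with a small sketch. The abstract does not give the training objective, so the following assumes a hard-label formulation: the student has two output heads, one fed by the class token and one by the distillation token, and the distillation head is trained against the teacher's predicted class. The function names, the equal 0.5/0.5 weighting, and the toy logits are all illustrative assumptions, not details taken from the text above.

```python
import math


def softmax_cross_entropy(logits, target):
    """Cross-entropy of one logit vector against an integer class index."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, label):
    """Assumed objective: average of two cross-entropy terms.

    cls_logits     -- student head on the class token, trained on the true label
    dist_logits    -- student head on the distillation token, trained on the
                      teacher's hard prediction (argmax of teacher_logits)
    """
    teacher_label = max(range(len(teacher_logits)), key=teacher_logits.__getitem__)
    loss_cls = softmax_cross_entropy(cls_logits, label)
    loss_dist = softmax_cross_entropy(dist_logits, teacher_label)
    return 0.5 * loss_cls + 0.5 * loss_dist  # equal weighting is an assumption


# Toy example with 3 classes; all numbers are made up for illustration.
loss = hard_distillation_loss(
    cls_logits=[2.0, 0.5, -1.0],
    dist_logits=[0.1, 1.5, 0.3],
    teacher_logits=[0.2, 3.0, 0.1],
    label=0,
)
print(f"hard distillation loss: {loss:.4f}")
```

In a real transformer, both heads would read from tokens that interact with the image patches through self-attention, which is what lets the distillation signal flow "through attention"; this sketch only covers the loss computed on the two heads' outputs.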

Cited by 6380