Exploring plain vision transformer backbones for object detection

Y Li, H Mao, R Girshick, K He - European conference on computer vision, 2022 - Springer

Abstract

We explore the plain, non-hierarchical Vision Transformer (ViT) as a backbone network for object detection. This design enables the original ViT architecture to be fine-tuned for object detection without needing to redesign a hierarchical backbone for pre-training. With minimal adaptations for fine-tuning, our plain-backbone detector can achieve competitive results. Surprisingly, we observe: (i) it is sufficient to build a simple feature pyramid from a single-scale feature map (without the common FPN design) and (ii) it is sufficient to use window attention (without shifting) aided with very few cross-window propagation blocks. With plain ViT backbones pre-trained as Masked Autoencoders (MAE), our detector, named ViTDet, can compete with the previous leading methods that were all based on hierarchical backbones, reaching up to 61.3 AP on the COCO dataset using only ImageNet-1K pre-training. We hope our study will draw attention to research on plain-backbone detectors. Code for ViTDet is available (https://github.com/facebookresearch/detectron2/tree/main/projects/ViTDet).
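The two observations in the abstract can be illustrated with a small sketch. It is not the ViTDet implementation. It assumes a backbone whose single output map has stride 16 and a pyramid with strides 4, 8, 16 and 32; neither figure is stated in the abstract. It also swaps in parameter-free average pooling and nearest-neighbour upsampling where a real detector would use learned layers, and it leaves out the cross-window propagation blocks. All function names are hypothetical.

```python
from typing import Dict, List

Grid = List[List[float]]


def avg_pool2x(x: Grid) -> Grid:
    """Halve resolution by averaging each 2x2 block (stand-in for a learned strided layer)."""
    return [
        [(x[i][j] + x[i][j + 1] + x[i + 1][j] + x[i + 1][j + 1]) / 4.0
         for j in range(0, len(x[0]) - 1, 2)]
        for i in range(0, len(x) - 1, 2)
    ]


def upsample2x(x: Grid) -> Grid:
    """Double resolution by repeating each value (stand-in for a learned deconvolution)."""
    out: Grid = []
    for row in x:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def simple_feature_pyramid(x: Grid) -> Dict[int, Grid]:
    """Derive every level directly from the one input map.

    Unlike an FPN, there are no lateral or top-down connections between levels.
    Keys are the assumed strides relative to the input image.
    """
    return {
        32: avg_pool2x(x),
        16: x,
        8: upsample2x(x),
        4: upsample2x(upsample2x(x)),
    }


def window_ids(height: int, width: int, window: int) -> List[List[int]]:
    """Assign each token in a height x width grid to a fixed, non-overlapping window.

    Tokens sharing an id would attend only to each other; the windows are never shifted.
    """
    per_row = -(-width // window)  # ceiling division: number of windows per grid row
    return [
        [(i // window) * per_row + (j // window) for j in range(width)]
        for i in range(height)
    ]


if __name__ == "__main__":
    feat = [[float(i * 4 + j) for j in range(4)] for i in range(4)]
    for stride, level in sorted(simple_feature_pyramid(feat).items()):
        print(f"stride {stride}: {len(level)} x {len(level[0])}")
    print(window_ids(4, 4, 2))
```

Each pyramid level here depends only on the single backbone output, which is the sense in which the pyramid is "simple". The window ids show which tokens would share an attention window without any shifting.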
