A Simple Single-Scale Vision Transformer For Object Localization
Wuyang Chen¹*, Xianzhi Du², Fan Yang², Lucas Beyer², Xiaohua Zhai², Tsung-Yi Lin², Huizhong Chen², Jing Li², Xiaodan Song², Zhangyang Wang¹, Denny Zhou²
¹University of Texas, Austin   ²Google
Abstract
[Figure 2 plot: COCO detection mAP versus GFLOPs (roughly 500–1100) for models with and without "SD", "MF", and "2×".]

Figure 2. The benefits of various commonly used CNN-inspired design changes to ViT: spatial downsampling ("SD"), multi-scale features ("MF"), and doubled channels ("2×"). With a controlled number of parameters (72M) and input size (640×640), not using any of these designs and sticking to the original ViT model [15] performs the best across the wide range of FLOPs we explore.

All models have around 72 million parameters, and are trained and evaluated under the same 640 × 640 input size. Therefore, vertically aligned dots share the same FLOPs, number of parameters, and input size, and are thus fairly comparable. We control the FLOPs (x-axis) by changing the depths or attention windows allocated to different stages; see our Supplement Section A for more architecture details.

Observations

• Spatial Downsampling ("SD") does not seem to be beneficial. Our hypothesis is that, under the same FLOPs constraint, the self-attention layers already provide global features and do not need downsampled features to enlarge the receptive field.

• Multi-scale Features ("MF") can mitigate the poor performance from downsampling by leveraging early high-resolution features ("SD+MF"). However, the vanilla setting still outperforms this combination. We hypothesize that high-resolution features are extracted too early in the encoder; in contrast, tokens in vanilla ViTs are able to learn fine-grained details throughout the encoder blocks.

• Doubled channels ("2×") plus multi-scale features ("MF") may seem competitive. However, ViT does not show a strong inductive bias toward "deeper compressed features with more embedding dimensions". This observation is also aligned with findings in [31] that ViTs have highly similar representations throughout the model, indicating that we should not sacrifice the embedding dimensions of early layers to compensate deeper layers.

3.2. UViT: a simple yet effective solution

Based on our study in Section 3.1, we are motivated to simplify the ViT design for dense prediction tasks and provide a neat solution, as illustrated in Figure 3. Taking 8 × 8 patches of input images, we learn the representation using a constant token resolution of 1/8 scale (i.e., the number of tokens remains the same) and a constant hidden size (i.e., the channel number is not increased). A single-scale feature map is fed into a detection or segmentation head. Meanwhile, attention windows [3] are leveraged to reduce the computation cost.

Though simple, two core questions remain to be determined in our design: (1) How to balance the UViT's depth, width, and input size to achieve the best performance-efficiency trade-off? (Section 3.2.1) (2) Which attention window strategy can effectively save computation cost without sacrificing performance? (Section 3.2.2)

3.2.1 A compound scaling rule of UViTs

Previous works studied compound scaling rules for CNNs [36] and ViTs [44] on image classification tasks. However, few works studied the scaling of ViTs on dense prediction tasks. To achieve the best performance-efficiency trade-off, we systematically study the compound scaling of UViTs along three dimensions: input size, depth, and width. We show our results in Figure 4 and Figure 5¹. For all models (circle markers), we first train them on ImageNet-1k, then fine-tune them on the COCO detection task.

• Depth (number of attention blocks): we study UViT models with depths selected from {12, 18, 24, 32, 40}.

• Input size: we study four levels of input sizes: 640 × 640, 768 × 768, 896 × 896, and 1024 × 1024.

• Width (i.e., hidden size, or output dimension of attention blocks): we tune the width to further control different model sizes and computation costs, making different scaling rules fairly comparable.

¹ This compound scaling rule is studied in Section 3.2.1 before we study the attention window strategy in Section 3.2.2. Thus, for all models in Figure 4 and Figure 5 we adopt a window scale of 1/2 for fair comparisons.
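To make the interplay between these dimensions concrete, the backbone size of such a plain, single-scale ViT can be estimated from depth and width alone. The sketch below is our own back-of-the-envelope approximation (it assumes 8 × 8 patches, learned per-token position embeddings, and the FFN expansion ratio of four stated in Section 4.2, and it ignores biases and layer norms), not the paper's accounting:

```python
# Rough backbone parameter count for a plain single-scale ViT, as a function of
# the compound-scaling dimensions (depth, width, input size). Approximation only:
# FFN ratio 4 and learned position embeddings assumed; biases and norms ignored.

def uvit_backbone_params(depth, width, input_size=896, patch=8, ffn_ratio=4):
    tokens = (input_size // patch) ** 2          # constant 1/8-scale token grid
    patch_embed = patch * patch * 3 * width      # 8x8x3 -> width linear projection
    pos_embed = tokens * width                   # one learned embedding per token
    attn = 4 * width * width                     # Q, K, V and output projections
    ffn = 2 * ffn_ratio * width * width          # two FFN linear layers
    return patch_embed + pos_embed + depth * (attn + ffn)

# e.g. depth 18, width 384 (the "B" configuration in Table 2)
print(uvit_backbone_params(18, 384) / 1e6)       # ~36.7M, close to the 36.9M reported
```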
[Figure 3 schematic: a 3×H×W image is split into 8×8 patches (plus position embeddings) and processed by the UViT encoder of basic attention blocks with a constant token resolution and hidden size (C × H/8 × W/8 feature maps); no spatial downsampling, doubled channels, or multi-scale stages; constant (UViT) or progressive (UViT+) attention windows; the single-scale feature maps feed detection and instance segmentation heads.]
Figure 3. We keep the architecture of our UViT neat: image patches (plus position embeddings) are processed by a stack of vanilla
attention blocks with a constant resolution and hidden size. Single-scale feature maps as outputs are fed into head modules for detection or
segmentation tasks. Constant (UViT, Section 3.2.1) or progressive (UViT+, Section 3.2.2) attention windows are introduced to reduce the
computation cost. We demonstrate that this simple architecture is strong, without introducing design overhead from hierarchical spatial
downsampling, doubled channels, and multi-scale feature pyramids.
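A minimal Keras sketch of the encoder in Figure 3, under our own assumptions (pre-norm blocks, a strided 8 × 8 convolution as the patch embedding, learned position embeddings); attention windows and the detection/segmentation heads are omitted, so this is an illustration of the single-scale design rather than the paper's implementation:

```python
import tensorflow as tf
from tensorflow.keras import layers


class AddPositionEmbedding(layers.Layer):
    """Adds a learned position embedding, one vector per token."""

    def build(self, input_shape):
        self.pos = self.add_weight(
            name="pos",
            shape=(1, int(input_shape[1]), int(input_shape[2])),
            initializer="zeros",
        )

    def call(self, x):
        return x + self.pos


def uvit_backbone(input_size=896, patch=8, depth=18, width=384, heads=6):
    """Single-scale encoder: constant token resolution and hidden size."""
    grid = input_size // patch                   # 112 x 112 tokens at 1/8 scale
    images = layers.Input((input_size, input_size, 3))
    # 8x8 patch embedding as a strided convolution, flattened to a token sequence.
    x = layers.Conv2D(width, patch, strides=patch)(images)
    x = layers.Reshape((grid * grid, width))(x)
    x = AddPositionEmbedding()(x)
    for _ in range(depth):                       # identical blocks, no downsampling
        h = layers.LayerNormalization()(x)
        h = layers.MultiHeadAttention(num_heads=heads, key_dim=width // heads)(h, h)
        x = x + h
        h = layers.LayerNormalization()(x)
        h = layers.Dense(4 * width, activation="gelu")(h)
        h = layers.Dense(width)(h)
        x = x + h
    # Reshape back into one single-scale 2D feature map for the detection head.
    return tf.keras.Model(images, layers.Reshape((grid, grid, width))(x))
```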
Observations

• In general, UViT can achieve a strong mAP with a moderate computation cost (FLOPs) and a highly compact number of parameters (even fewer than 70M including the Cascaded FPN head).

• For input sizes (Figure 4, different line styles): larger inputs generally create more room for models to be scaled up further. Across a wide range of model parameters and FLOPs, we find that scaling under an 896 × 896 input size consistently outperforms smaller input sizes (which lead to severe model overfitting), and is also better than 1024 × 1024 within a comparable FLOPs range.

• For model depths (Figure 5, different depths in colors): we find that, considering both FLOPs and the number of parameters, 18 blocks achieve better performance than 12/24/32/40 blocks. This indicates that UViT needs a balanced trade-off between depth and width, instead of sacrificing depth for more width (e.g., 12 blocks) or sacrificing width for more depth (e.g., 40 blocks).

In summary, based on our final compound scaling rule, we propose our basic version of UViT as 18 attention blocks under an 896 × 896 input size. See our Supplement Section B for more architecture details.

3.2.2 Attention windows: a progressive strategy

In this section, we show that a progressive attention window strategy can reduce UViT's computation cost while still preserving or even increasing its performance.

Early attentions are local operators. Originally, self-attention [39] is a global operation: unlike convolution layers, which share their weights over local regions, any pair of tokens in the sequence contributes to the feature aggregation, thus collecting global information for each token. In practice, however, we quantify the spatial behavior of each self-attention layer on COCO. Given a sequence feature of length L and the attention scores s (after softmax) from a specific head, the relative receptive field r is defined as:

r = \frac{1}{L} \sum_{i=1}^{L} \frac{\sum_{j=1}^{L} s_{i,j} \, |i - j|}{\max(i, \, L - i)}, \qquad i, j = 1, \cdots, L,    (1)

where \sum_{j=1}^{L} s_{i,j} = 1 for each i. This relative receptive field takes into consideration the token's position and the furthest possible location a token can aggregate from, and indicates the spatial focus of the self-attention layer.

We collect the averages and standard deviations across different attention heads. As shown in Figure 6, tokens in early attention layers, although having the potential to aggregate long-range features, put more weight on their neighboring tokens and thus act like a "local operator". As the attention layers stack deeper, the receptive field increases, turning the self-attention into a global operation. This suggests that, if we explicitly limit the attention range of early layers, we may save computation cost while still preserving the capability of self-attention.
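A direct NumPy transcription of Eq. (1) for a single head may help make the statistic concrete; this is our own illustration (the paper additionally averages it over heads and measures it on COCO):

```python
import numpy as np


def relative_receptive_field(scores):
    """Eq. (1): `scores` is an (L, L) post-softmax attention matrix, rows sum to 1."""
    L = scores.shape[0]
    pos = np.arange(1, L + 1)                    # token positions i (and j), 1-indexed
    dist = np.abs(pos[:, None] - pos[None, :])   # |i - j| for every query/key pair
    reach = np.maximum(pos, L - pos)             # furthest position token i could reach
    per_token = (scores * dist).sum(axis=1) / reach
    return per_token.mean()                      # average over the L query tokens


# A uniform attention row gives r around 0.5, while attention concentrated on the
# diagonal (each token attending mostly to itself) drives r toward 0 -- the
# "local operator" behaviour described for early layers.
print(relative_receptive_field(np.full((100, 100), 0.01)))  # ~0.5
print(relative_receptive_field(np.eye(100)))                 # 0.0
```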
[Figure 4 plots: MS-COCO mAP [%] versus FLOPs (G) and versus Params. (M), with curves for 640 × 640, 768 × 768, 896 × 896, and 1024 × 1024 inputs at depth L = 18; "T", "S", and "B" mark the final UViT variants.]
Figure 4. Input scaling rule for UViT on COCO object detection. Given a fixed depth, an input size of 896 × 896 (thin solid line) leaves more room for model scaling (by increasing the width) and is slightly better than 1024 × 1024 (thick solid line), while 640 × 640 (dashed line) and 768 × 768 (dotted line) give a worse performance-efficiency trade-off. Black capital letters "T", "S", and "B" annotate the three final depth/width configurations of the UViT variants we will propose (Table 2). Different marker sizes represent the hidden sizes (widths).
Attention window improves UViT efficiency. Motivated by Figure 6, we want to find the most effective attention window strategy. Specifically, our study can be decomposed into two sub-problems:

1) How small a window size can early attention layers endure? To answer this question, we start the attention blocks with square windows of different small scales: {1/16, 1/8, 1/4} of the height or width of the sequence's 2D shape.

2) Which window sizes do deeper attention layers require? For the later blocks we consider global attention, and also windows of 1/4 ∼ 1/2 scale.

We show our results in Table 1, and summarize our observations below:

• With a smaller constant window scale (1/2 / 1/3 / 1/4), we save more computation cost with only a slight sacrifice in mAP.

• Adopting constant global attention throughout all encoder blocks (window scale of 1, first row) is largely redundant: it contributes marginal benefits but suffers from a huge computation cost.

• Early attentions can use smaller windows such as 1/4 scale, but over-shrunk window sizes (1/16, 1/8) can impair the capability of self-attention (3rd, 4th rows).

• Deeper layers still require global attention to preserve the final performance (last two rows).

• A properly designed window strategy (5th row) can outperform the vanilla solutions (1st, 2nd rows) with a reduced computation cost.

In conclusion, we set the window scale of our basic version (UViT, Section 3.2.1) as a constant 3⁻¹, and propose an improved version of our model, dubbed "UViT+", which adopts the attention window strategy "[4⁻¹] × 14 → [2⁻¹] × 2 → [1] × 2".

4. Final Results

We conduct our experiments on the COCO [27] object detection and instance segmentation tasks to show our final performance.

4.1. Implementations

We implement our model and training in TensorFlow and Keras. Experiments are conducted on TPUs. Before fine-tuning on object detection or segmentation, we follow the DeiT [37] training settings to pretrain our UViTs on ImageNet-1k with a 224 × 224 input size and a batch size of 1024. We follow the convention in [15]: during ImageNet pretraining, the kernel size of the first linear projection layer is 16 × 16. During fine-tuning, we use a more fine-grained 8 × 8 patch size for dense sampling. The kernel weight of the first linear projection layer is interpolated from 16 × 16 to 8 × 8, and the position embedding is also elongated by interpolation.

4.2. Architectures

We propose three UViT variants. Their architecture configurations are listed in Table 2, and are also annotated in Figure 4 and Figure 5 ("T", "S", "B" in black). The number of heads is fixed at six, and the expansion ratio of each FFN (feed-forward network) layer is fixed at four in all experiments. As discussed in Section 3.2.2, the attention window strategy is "[4⁻¹] × 14 → [2⁻¹] × 2 → [1] × 2".

Table 2. Architecture variants of our UViT.

Name   | Depth | Hidden Size | Params. (M, backbone) | Params. (M, Cascade Mask R-CNN)
UViT-T | 18    | 222         | 13.5                  | 47.4
UViT-S | 18    | 288         | 21.7                  | 53.8
UViT-B | 18    | 384         | 36.9                  | 74.4

4.3. COCO detection & instance segmentation

Settings. Object detection experiments are conducted on COCO 2017 [27], which contains 118K training and 5K validation images. We consider the popular Cascade Mask R-CNN detection framework [4, 19], and leverage multi-scale training [6, 35] (resizing the input to 896 × 896), the AdamW optimizer [29] (with an initial learning rate of 3 × 10⁻³), a weight decay of 1 × 10⁻⁴, and a batch size of 256.
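Section 4.1's switch from 16 × 16-patch pretraining to 8 × 8-patch fine-tuning can be illustrated with a small weight-adaptation step. The sketch below is our own reading of that description (bilinear interpolation is an assumption, and the shapes assume 224 × 224 pretraining and 896 × 896 fine-tuning inputs); it is not the paper's code:

```python
import tensorflow as tf


def adapt_patch_embed(kernel, pos_embed, old_grid=14, new_grid=112, new_patch=8):
    """Interpolate pretrained weights for dense fine-tuning.

    kernel:    (16, 16, 3, width) conv kernel of the first linear projection.
    pos_embed: (1, old_grid * old_grid, width) learned position embeddings
               (14x14 tokens for 224x224 pretraining with 16x16 patches).
    Returns an (8, 8, 3, width) kernel and a (1, new_grid * new_grid, width)
    position embedding (112x112 tokens for 896x896 inputs with 8x8 patches).
    """
    kh, kw, cin, width = kernel.shape
    # Resize the kernel spatially, treating the (in, out) channels as one axis.
    k = tf.reshape(kernel, (kh, kw, cin * width))
    k = tf.image.resize(k, (new_patch, new_patch), method="bilinear")
    new_kernel = tf.reshape(k, (new_patch, new_patch, cin, width))
    # Stretch the position embedding over the larger token grid.
    p = tf.reshape(pos_embed, (old_grid, old_grid, width))
    p = tf.image.resize(p, (new_grid, new_grid), method="bilinear")
    new_pos = tf.reshape(p, (1, new_grid * new_grid, width))
    return new_kernel, new_pos
```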
Table 3. Two-stage object detection and instance segmentation results on COCO 2017. We compare employing different backbones with Cascade Mask R-CNN on a single model without test-time augmentation. UViT uses a constant window scale of 3⁻¹, and UViT+ adopts the attention window strategy "[4⁻¹] × 14 → [2⁻¹] × 2 → [1] × 2". We also reproduced the performance of ResNet under the same settings.

Backbone | Resolution | GFLOPs | Params. (M) | AP^val (box) | AP^val (mask)
ResNet-18 | 896×896 | 370.4 | 48.9 | 44.2 | 38.5
ResNet-50 | 896×896 | 408.8 | 61.9 | 47.4 | 40.8
Swin-T [28] | 480∼800×1333 | 745 | 86 | 50.5 | 43.7
Shuffle-T [24] | 480∼800×1333 | 746 | 86 | 50.8 | 44.1
UViT-T (ours) | 896×896 | 613 | 47.4 | 51.1 | 43.6
UViT-T+ (ours) | 896×896 | 710 | 47.4 | 51.2 | 43.9
ResNet-101 | 896×896 | 468.2 | 81.0 | 48.5 | 41.8
Swin-S [28] | 480∼800×1333 | 838 | 107 | 51.8 | 44.7
Shuffle-S [24] | 480∼800×1333 | 844 | 107 | 51.9 | 44.9
UViT-S (ours) | 896×896 | 744 | 53.8 | 51.4 | 44.1
UViT-S+ (ours) | 896×896 | 866 | 53.8 | 51.9 | 44.5
ResNet-152 | 896×896 | 527.7 | 96.7 | 49.1 | 42.1
Swin-B [28] | 480∼800×1333 | 982 | 145 | 51.9 | 45
Shuffle-B [24] | 480∼800×1333 | 989 | 145 | 52.2 | 45.3
GCNet [5] | - | 1041 | - | 51.8 | 44.7
UViT-B (ours) | 896×896 | 975 | 74.4 | 51.9 | 44.3
UViT-B+ (ours) | 896×896 | 1160 | 74.4 | 52.5 | 44.8
UViT-B+ w/ self-training (ours) | 896×896 | 1160 | 74.4 | 53.9 | 46.1
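For reference, the window notation above ("[4⁻¹] × 14" meaning fourteen blocks whose attention windows span 1/4 of the token grid's height and width, cf. the window-scale definition in Table 4's caption) can be realized by partitioning the token grid before self-attention. The helper below is our own generic sketch of that technique, assuming the grid divides evenly by the window size; it is not the paper's implementation:

```python
import tensorflow as tf


def window_partition(tokens, grid, window_scale):
    """Split a (B, grid*grid, C) token sequence into non-overlapping windows.

    `window_scale` is the fraction of the grid's height/width covered by one
    window, e.g. 0.25 for the "4^-1" blocks and 1.0 for global attention.
    Returns (B * num_windows, win*win, C) so self-attention can be applied
    per window; the inverse reshapes restore the original layout afterwards.
    """
    B = tf.shape(tokens)[0]
    C = tokens.shape[-1]
    win = int(grid * window_scale)               # window side length in tokens
    x = tf.reshape(tokens, (B, grid // win, win, grid // win, win, C))
    x = tf.transpose(x, (0, 1, 3, 2, 4, 5))      # group each window's tokens together
    return tf.reshape(x, (-1, win * win, C))


# UViT+ schedule from Table 3: 1/4-scale windows for 14 blocks, then 1/2-scale
# windows for 2 blocks, then global attention for the last 2 blocks.
schedule = [0.25] * 14 + [0.5] * 2 + [1.0] * 2
```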
From Table 3 we can see that, across different levels of model variants, our UViTs are highly compact in terms of the number of parameters, and achieve competitive or stronger performance with comparable FLOPs. To keep this comparison clean, we did not adopt any system-level techniques [28] to boost the performance. As we did not leverage any CNN-like hierarchical pyramid structures, the results of our simple and neat solution suggest that the original design philosophy of ViT [15] is a strong baseline without any hand-crafted architecture customization. We also show the mAP-efficiency trade-off curve in Figure 1. Our UViT achieves strong results with much better efficiency, compared with both CNNs and other ViT works.

Additionally, we adopt self-training on top of our largest model (UViT-B) to evaluate the performance gain from leveraging unlabeled data, similar to [48]. We use ImageNet-1K without labels as the unlabeled set, and a pretrained UViT-B model as the teacher model to generate pseudo-labels. All predicted boxes with a confidence score larger than 0.5 are kept, together with their corresponding masks. For self-training, the student model is initialized from the same weights as the teacher model. The ratio of labeled data to pseudo-labeled data is 1:1 in each batch. Apart from increasing the training steps by 2× for each epoch, all other hyperparameters remain unchanged. We can see from the last row in Table 3 that self-training significantly improves box AP and mask AP, by 1.4% and 1.3%, respectively.

5. Conclusion

In this work, we present a simple, single-scale vision transformer backbone that can serve as a strong baseline for object detection and instance segmentation. Our UViT architecture does not involve any of the hierarchical pyramid designs that are widely adopted in CNNs and in recently developed vision transformer models. To validate our method, we first carefully study the benefits of individual design techniques: spatial downsampling, multi-scale features, and doubled channels. Our study shows that for vision transformers, a plain encoder with a constant feature resolution and hidden size performs the best, indicating that the design philosophy of CNNs may not be optimal for ViTs on dense prediction tasks. Following this observation, we further optimize the UViT's performance-efficiency trade-off by studying a compound scaling rule (depth, width, input size) and a progressive attention window strategy. Our proposed UViT architectures achieve strong performance on both COCO object detection and instance segmentation. Most importantly, we hope our work brings to the community's attention that ViTs may require careful, dedicated architecture design for dense prediction tasks, instead of directly adopting CNN design conventions as a black box.
References

[1] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lucic, and Cordelia Schmid. Vivit: A video vision transformer. arXiv preprint arXiv:2103.15691, 2021.
[2] Josh Beal, Eric Kim, Eric Tzeng, Dong Huk Park, Andrew Zhai, and Dmitry Kislyuk. Toward transformer-based object detection. arXiv preprint arXiv:2012.09958, 2020.
[3] Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.
[4] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6154–6162, 2018.
[5] Yue Cao, Jiarui Xu, Stephen Lin, Fangyun Wei, and Han Hu. Global context networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[6] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision, pages 213–229. Springer, 2020.
[7] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
[8] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 801–818, 2018.
[9] Xiangning Chen, Cho-Jui Hsieh, and Boqing Gong. When vision transformers outperform resnets without pretraining or strong data augmentations. arXiv preprint arXiv:2106.01548, 2021.
[10] Xiangxiang Chu, Bo Zhang, Zhi Tian, Xiaolin Wei, and Huaxia Xia. Do we really need explicit position encodings for vision transformers? arXiv e-prints, pages arXiv–2102, 2021.
[11] Nadav Cohen, Or Sharir, and Amnon Shashua. On the expressive power of deep learning: A tensor analysis. In Conference on Learning Theory, pages 698–728. PMLR, 2016.
[12] Arlin P. S. Crotts. Vatt/columbia microlensing survey of m31 and the galaxy. arXiv: Astrophysics, 1996.
[13] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 764–773, 2017.
[14] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[15] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[16] Dennis Elbrächter, Dmytro Perekrestenko, Philipp Grohs, and Helmut Bölcskei. Deep neural network approximation theory. arXiv preprint arXiv:1901.02220, 2019.
[17] Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. In Conference on Learning Theory, pages 907–940. PMLR, 2016.
[18] Kai Han, An Xiao, Enhua Wu, Jianyuan Guo, Chunjing Xu, and Yunhe Wang. Transformer in transformer. arXiv preprint arXiv:2103.00112, 2021.
[19] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[21] Byeongho Heo, Sangdoo Yun, Dongyoon Han, Sanghyuk Chun, Junsuk Choe, and Seong Joon Oh. Rethinking spatial dimensions of vision transformers. arXiv preprint arXiv:2103.16302, 2021.
[22] Han Hu, Zheng Zhang, Zhenda Xie, and Stephen Lin. Local relation networks for image recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 3464–3473, 2019.
[23] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708, 2017.
[24] Zilong Huang, Youcheng Ben, Guozhong Luo, Pei Cheng, Gang Yu, and Bin Fu. Shuffle transformer: Rethinking spatial shuffle for vision transformer. arXiv preprint arXiv:2106.03650, 2021.
[25] Shiyu Liang and Rayadurgam Srikant. Why deep neural networks for function approximation? arXiv preprint arXiv:1610.04161, 2016.
[26] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017.
[27] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[28] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030, 2021.
[29] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[30] Muzammal Naseer, Kanchana Ranasinghe, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang. Intriguing properties of vision transformers. arXiv preprint arXiv:2105.10497, 2021.
[31] Maithra Raghu, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang, and Alexey Dosovitskiy. Do vision transformers see like convolutional neural networks? arXiv preprint arXiv:2108.08810, 2021.
[32] Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and Jonathon Shlens. Stand-alone self-attention in vision models. arXiv preprint arXiv:1906.05909, 2019.
[33] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[34] Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE International Conference on Computer Vision, pages 843–852, 2017.
[35] Peize Sun, Rufeng Zhang, Yi Jiang, Tao Kong, Chenfeng Xu, Wei Zhan, Masayoshi Tomizuka, Lei Li, Zehuan Yuan, Changhu Wang, et al. Sparse R-CNN: End-to-end object detection with learnable proposals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14454–14463, 2021.
[36] Mingxing Tan and Quoc V Le. EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946, 2019.
[37] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. arXiv preprint arXiv:2012.12877, 2020.
[38] Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. Going deeper with image transformers. arXiv preprint arXiv:2103.17239, 2021.
[39] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30:5998–6008, 2017.
[40] Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, et al. Deep high-resolution representation learning for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[41] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. arXiv preprint arXiv:2102.12122, 2021.
[42] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. SegFormer: Simple and efficient design for semantic segmentation with transformers. arXiv preprint arXiv:2105.15203, 2021.
[43] Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Francis EH Tay, Jiashi Feng, and Shuicheng Yan. Tokens-to-token ViT: Training vision transformers from scratch on ImageNet. arXiv preprint arXiv:2101.11986, 2021.
[44] Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. arXiv preprint arXiv:2106.04560, 2021.
[45] Hengshuang Zhao, Jiaya Jia, and Vladlen Koltun. Exploring self-attention for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10076–10085, 2020.
[46] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip HS Torr, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6881–6890, 2021.
[47] Daquan Zhou, Bingyi Kang, Xiaojie Jin, Linjie Yang, Xiaochen Lian, Zihang Jiang, Qibin Hou, and Jiashi Feng. DeepViT: Towards deeper vision transformer. arXiv preprint arXiv:2103.11886, 2021.
[48] Barret Zoph, Golnaz Ghiasi, Tsung-Yi Lin, Yin Cui, Hanxiao Liu, Ekin D Cubuk, and Quoc V Le. Rethinking pre-training and self-training. arXiv preprint arXiv:2006.06882, 2020.
A. Model architectures studied in Figure 2
We show details of all architectures studied in Figure 2 in Table 4 below. As mentioned in Section 3.1, we study all combinations of the three techniques (spatial downsampling "SD", multi-scale features "MF", doubled channels "2×"), i.e., eight settings in total, and show the results in Figure 2. Each dot in Figure 2 represents an individually designed and trained model. To make all comparisons fair, we carefully design all models such that they have around 72 million parameters. We control their FLOPs by changing the depths or attention windows allocated to different stages.
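As a rough illustration of why smaller attention windows trade FLOPs for mAP in the rows below, the quadratic attention term of one block can be approximated as in the sketch that follows. This is our own estimate, counting only the attention score and weighted-sum matrix multiplications, so it is not directly comparable to the full-model GFLOPs in the tables; the point is the ratio between window scales.

```python
def attention_matmul_macs(grid, width, window_scale=1.0):
    """Multiply-accumulates of the QK^T and attention-weighted-sum matmuls in one block.

    With non-overlapping square windows covering `window_scale` of the token
    grid's height and width, each of the (1 / window_scale)^2 windows attends
    over win*win tokens, so the quadratic term shrinks by window_scale^2
    compared with global attention.
    """
    win = int(grid * window_scale)           # window side length, in tokens
    num_windows = (grid // win) ** 2
    window_len = win * win
    return num_windows * 2 * window_len * window_len * width


# 80x80 token grid (640x640 input with 8x8 patches, as in Table 4), width 384:
g = attention_matmul_macs(80, 384, window_scale=1.0)    # global attention
q = attention_matmul_macs(80, 384, window_scale=0.25)   # 1/4-scale windows
print(g / q)   # 16.0 -- a 1/4-scale window cuts this quadratic term by 16x
```

Over 18 such blocks this difference amounts to roughly 530 G multiply-accumulates, which is roughly in line with the ~540 GFLOPs gap between the window-scale-1 and window-scale-4⁻¹ rows in the first section of Table 4.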
Table 4. Model architectures in Figure 2, all studied under a 640 × 640 input size on MS-COCO [27]. "SD": spatial downsampling. "MF": multi-scale features. "2×": doubled channels. Without any of these three techniques (first section of this table), the whole network has a constant feature resolution and hidden size; the other seven settings split the network into three stages, since they require either progressive feature downsampling or multi-scale features from each stage. The input scale is relative to the 2D shape of the input image H × W (e.g., 8⁻¹ indicates that the 2D shape of the UViT's sequence feature is 1/8 H × 1/8 W). The window scale is relative to the 2D shape of the sequence feature h × w (e.g., 8⁻¹ indicates that the 2D shape of the attention window is 1/8 h × 1/8 w). The feature maps fed into the FPN detection head are the last output of the backbone if no "MF" is applied, or the features from all three stages if "MF" is applied.
No SD / MF / 2× — single stage: input scale 8⁻¹, 18 layers, hidden size 384, output scale 8⁻¹, 72.1 M params.
  Window scale 16⁻¹: 534.1 GFLOPs, 44.5 mAP
  Window scale 8⁻¹: 540.9 GFLOPs, 48.2 mAP
  Window scale 4⁻¹: 567.9 GFLOPs, 50.1 mAP
  Window scale 2⁻¹: 676.2 GFLOPs, 50.7 mAP
  Window scale 1: 1109.1 GFLOPs, 50.8 mAP

SD — three stages with input/output scales 8⁻¹ → 16⁻¹ → 32⁻¹, window scale 1 and hidden size 384 in every stage, 72.1 M params.
  Layers 6/6/6: 607.1 GFLOPs, 41.0 mAP
  Layers 8/5/5: 688.28 GFLOPs, 42.0 mAP
  Layers 10/4/4: 769.47 GFLOPs, 42.6 mAP
  Layers 12/3/3: 850.68 GFLOPs, 43.0 mAP
  Layers 14/2/2: 931.88 GFLOPs, 43.4 mAP

MF — three stages with input scale 8⁻¹ and 6 layers each, hidden size 384, output scales 8⁻¹ / 16⁻¹ / 32⁻¹, 72.1 M params.
  Window scale 16⁻¹: 534.3 GFLOPs, 44.3 mAP
  Window scale 8⁻¹: 541.03 GFLOPs, 47.6 mAP
  Window scale 4⁻¹: 568.09 GFLOPs, 49.4 mAP
  Window scale 2⁻¹: 676.33 GFLOPs, 50.3 mAP
  Window scale 1: 1109.3 GFLOPs, 50.2 mAP

2× — three stages with input/output scale 8⁻¹ and 6 layers each, hidden sizes 152 / 304 / 608, 73.8 M params.
  Window scale 16⁻¹: 558.4 GFLOPs, 43.4 mAP
  Window scale 8⁻¹: 561.5 GFLOPs, 44.4 mAP
  Window scale 4⁻¹: 587.7 GFLOPs, 46.3 mAP
  Window scale 2⁻¹: 692.2 GFLOPs, 46.6 mAP
  Window scale 1: 1110.2 GFLOPs, 48.3 mAP

SD + MF — three stages with input/output scales 8⁻¹ → 16⁻¹ → 32⁻¹, window scale 1 and hidden size 384 in every stage, 72.1 M params.
  Layers 2/8/8: 459.7 GFLOPs, 45.8 mAP
  Layers 4/7/7: 540.9 GFLOPs, 47.5 mAP
  Layers 6/6/6: 622.1 GFLOPs, 48.5 mAP
  Layers 8/5/5: 703.3 GFLOPs, 48.0 mAP
  Layers 10/4/4: 784.5 GFLOPs, 48.6 mAP
  Layers 12/3/3: 865.7 GFLOPs, 50.2 mAP
  Layers 15/2/1: 989.5 GFLOPs, 50.4 mAP

SD + 2× — three stages with input/output scales 8⁻¹ → 16⁻¹ → 32⁻¹, window scale 1, and 16 / 1 / n layers.
  Hidden 128/256/512, stage-3 layers 9: 70.2 M params, 529.1 GFLOPs, 37.6 mAP
  Hidden 160/320/640, stage-3 layers 5: 69.3 M params, 581.7 GFLOPs, 38.9 mAP
  Hidden 192/384/768, stage-3 layers 3: 69.3 M params, 637.4 GFLOPs, 40.2 mAP
  Hidden 224/448/896, stage-3 layers 2: 71.4 M params, 696.6 GFLOPs, 41.7 mAP
  Hidden 256/512/1024, stage-3 layers 1: 69.2 M params, 756.5 GFLOPs, 42.5 mAP

MF + 2× — three stages with input scale 8⁻¹ and 6 layers each, hidden sizes 152 / 304 / 608, output scales 8⁻¹ / 16⁻¹ / 32⁻¹, 73.8 M params.
  Window scale 16⁻¹: 566.3 GFLOPs, 45.7 mAP
  Window scale 8⁻¹: 569.5 GFLOPs, 46.4 mAP
  Window scale 4⁻¹: 595.6 GFLOPs, 48.1 mAP
  Window scale 2⁻¹: 700.1 GFLOPs, 49.0 mAP

SD + MF + 2× — three stages with input/output scales 8⁻¹ → 16⁻¹ → 32⁻¹, window scale 1, and 16 / 1 / n layers (28 stage-1 layers in the last row).
  Hidden 128/256/512, stage-3 layers 9: 73.3 M params, 552.1 GFLOPs, 44.3 mAP
  Hidden 160/320/640, stage-3 layers 5: 72.4 M params, 604.9 GFLOPs, 45.5 mAP
  Hidden 192/384/768, stage-3 layers 3: 72.4 M params, 660.7 GFLOPs, 47.6 mAP
  Hidden 224/448/896, stage-3 layers 2: 74.5 M params, 719.9 GFLOPs, 48.8 mAP
  Hidden 256/512/1024, stage-3 layers 1: 72.4 M params, 779.9 GFLOPs, 49.4 mAP
  Hidden 224/448/896, stage-3 layers 1 (28 stage-1 layers): 72.1 M params, 992.3 GFLOPs, 49.5 mAP
B. Model architectures in Figure 4 and Figure 5
We show all architectures studied in our compound scaling rule in Figure 4 and Figure 5. All models use 2⁻¹-scale attention windows for fair comparisons.
Table 5. Model architectures in Figure 4 and Figure 5 (MS-COCO [27]). Configurations (depth, width) of UViT-T/S/B are annotated.