
A Simple Single-Scale Vision Transformer for Object Localization and Instance Segmentation

Wuyang Chen¹*, Xianzhi Du², Fan Yang², Lucas Beyer², Xiaohua Zhai², Tsung-Yi Lin², Huizhong Chen², Jing Li², Xiaodan Song², Zhangyang Wang¹, Denny Zhou²
¹University of Texas, Austin   ²Google
* Work done during the first author's research internship with Google.

arXiv:2112.09747v1 [cs.CV] 17 Dec 2021

Abstract

This work presents a simple vision transformer design as a strong baseline for object localization and instance segmentation tasks. Transformers recently demonstrate competitive performance in image classification tasks. To adopt ViT for object detection and dense prediction tasks, many works inherit the multistage design from convolutional networks and highly customized ViT architectures. Behind this design, the goal is to pursue a better trade-off between computational cost and effective aggregation of multiscale global contexts. However, existing works adopt the multistage architectural design as a black-box solution without a clear understanding of its true benefits. In this paper, we comprehensively study three architecture design choices on ViT – spatial reduction, doubled channels, and multiscale features – and demonstrate that a vanilla ViT architecture can fulfill this goal without handcrafting multiscale features, maintaining the original ViT design philosophy [15]. We further complete a scaling rule to optimize our model's trade-off on accuracy and computation cost / model size. By leveraging a constant feature resolution and hidden size throughout the encoder blocks, we propose a simple and compact ViT architecture called Universal Vision Transformer (UViT) that achieves strong performance on COCO object detection and instance segmentation tasks.

Figure 1. Trade-off between mAP (COCO) and FLOPs (left) / number of parameters (right). We compare our UViT / UViT+ with Swin Transformer [28], ViT [44], and ResNet (18/50/101/152) [20], all adopting the same standard Cascade Mask R-CNN framework [4]. Our UViT is compact, strong, and simple, avoiding any hierarchical design ("SD": spatial downsampling, "MF": multi-scale features, "2×": doubled channels).

1. Introduction

Transformer [39], the de-facto standard architecture for natural language processing (NLP), has recently shown promising results on computer vision tasks. Vision Transformer (ViT) [15], an architecture consisting of a sequence of transformer encoder blocks, has achieved competitive performance compared to convolutional neural networks (CNNs) [20, 33] on the ImageNet classification task [14]. With the success on image classification, recent works extend transformers to more vision tasks, such as object detection [18, 24, 28], semantic segmentation [18, 24, 28, 42], and video action recognition [1, 12]. Conventionally, CNN architectures adopt a multistage design with gradually enlarged receptive fields [20, 33] via spatial reduction or dilated convolutions. These design choices are also naturally introduced to modify vision transformer architectures [21, 28, 42], with two main purposes: 1) support of multi-scale features, since dense vision tasks require the visual understanding of objects of different scales and sizes; 2) reduction of computation cost, since the input images of dense vision tasks are often of high resolution, and the computation complexity of vanilla self-attention is quadratic in the sequence length. The motivation behind these changes is that the tokens of the original ViT are all of a fixed fine-grained scale throughout the encoder attention blocks, which is not adaptive to dense vision applications and, more importantly, incurs huge computation/memory overhead. Despite the success of recent works, it is still unclear if complex design conventions of
CNNs are indeed necessary for ViT to achieve competitive performance on vision tasks, and how much benefit comes from each individual design.

In this work, we demonstrate that a vanilla ViT architecture, which we call Universal Vision Transformer (UViT), is sufficient for achieving competitive results on the tasks of object detection and instance segmentation. First, to support multiscale features, instead of re-designing ViT in a multistage fashion [21, 28, 42], our core motivation is that the self-attention mechanism naturally encourages the learning of non-local information and makes the feature pyramid no longer necessary for ViTs on dense vision tasks. This leads us to design a simple yet strong UViT architecture: we only leverage a constant feature resolution and hidden size throughout the encoder blocks, and extract a single-scale feature map. This uniform design also has the potential of supporting multi-modal/multi-task learning and vision-language problems. Second, to reduce the computation cost, we adopt window splits in attention layers. We observe that on the large inputs of detection or segmentation, global attention in early layers is redundant, and compact local attentions are both effective and efficient. This motivates us to progressively increase the window sizes as the attention layers become deeper, reducing self-attention's computation cost while preserving performance.

To support the above two motivations, we systematically study fundamental architecture design choices for UViTs on dense prediction tasks. It is worth noting that, although recent works try to analyze vision transformers' generalization [30], loss landscapes [9], and patterns of learned representations [31, 38], they mostly focus on image classification tasks. On dense prediction tasks like object detection and instance segmentation, many transformer models [21, 28, 42] directly inherit architecture principles from CNNs, without validating the actual benefit of each individual design choice. In contrast, our simple solution is based on rigorous ablation studies on dense prediction tasks, conducted here for the first time. Moreover, we complete a comprehensive study of UViT's compound scaling rule on dense prediction tasks, providing a series of UViT configurations that improve the performance-efficiency trade-off with highly compact architectures (even fewer than 40M parameters in the transformer backbone). Our proposed UViT architectures serve as a simple yet strong baseline on both COCO object detection and instance segmentation. We summarize our contributions as:

• We systematically study the benefits of fundamental architecture designs for ViTs on dense prediction tasks, and propose a simple UViT design that shows strong performance without hand-crafting CNN-like feature pyramid design conventions into transformers.

• We discover a new compound scaling rule (depth, width, input size) for UViTs on dense vision tasks. We find that a larger input size creates more room for improvement via model scaling, and that a moderate depth (number of attention blocks) outperforms shallower or deeper ones.

• We reduce the computation cost via attention windows alone. We observe that attention's receptive field is limited in early layers and compact local attentions are sufficient, while only deeper layers require global attention.

• Experiments on COCO detection and instance segmentation demonstrate that our UViT is a simple yet strong baseline for transformers on dense prediction tasks.

2. Related Works

2.1. CNN backbones for dense prediction problems

CNNs are now the mainstream and standard deep network models for dense prediction tasks in computer vision, such as object detection and semantic segmentation. Over decades of development, several high-level and fundamental design conventions have been summarized: 1) deeper networks for more accurate function approximation [11, 16, 17, 25]: ResNet [20], DenseNet [23]; 2) narrow widths in early layers for high feature resolutions and wider widths in deeper layers for compressed features, which can deliver a good performance-efficiency trade-off: VGG [33], ResNet [20]; 3) enlarged receptive fields for learning long-term correlations: dilated convolution (DeepLab series [7]), deformable convolutions [13]; 4) hierarchical feature pyramids for learning across a wide range of object scales: FPN [26], ASPP [7], HRNet [40]. In short, the motivations behind these successful design solutions are two-fold: 1) to support the semantic understanding of objects with diverse sizes and scales; 2) to maintain a balanced computation cost under large input sizes. These two motivations, or challenges, also exist in designing our UViT architectures for dense prediction tasks, for which we provide a comprehensive study in this work (Section 3.1).

2.2. ViT backbones for dense prediction problems

The first ViT work [15] adopted a transformer encoder on coarse non-overlapping image patches for image classification and requires large-scale training datasets (JFT [34], ImageNet-21K [14]) for pretraining. DeiT [37] further introduces strong augmentations at both the data level and the architecture level to efficiently train ViT on ImageNet-1k [14]. Beyond image classification, more and more works try to design ViT backbones for dense prediction tasks. Initially, people tried to directly learn high-resolution features extracted by a ViT backbone via extra interpolation or convolution layers [2, 46]. Some works also leverage self-attention operations to replace part or all of the convolution layers in CNNs [22, 32, 45]. More recent trends [10, 18, 28, 41–43] start following the design
conventions in CNNs discussed above (Section 2.1) and customize ViT architectures to be CNN-like: tokens are progressively merged to downsample the feature resolutions with reduced computation cost, along with increased embedding sizes. Multi-scale feature maps are also collected from the ViT backbone. These ViT works can successfully achieve strong or state-of-the-art performance on object detection or semantic segmentation, but the architectures are again highly customized for vision problems and lose the potential for multi-modal learning in the future. More importantly, those CNN-like design conventions are directly inherited into ViTs without a clear understanding of each individual benefit, leading to empirical black-box designs. In contrast, the simple and neat solution we provide is motivated by a complete study of ViT's architecture preferences on dense prediction tasks (Sections 3.1 and 3.2).

2.3. Inductive bias of ViT architecture

Since the architecture of vision transformers is still in its infant stage, there are few works that systematically study principles of ViT model design and scaling rules. Initially, people leveraged coarse tokenizations, a constant feature resolution, and a constant hidden size [15, 37], while recently fine-grained tokens, spatial downsampling, and doubled channels have also become popular in ViT design [28, 47]. They all achieve good performance, calling for an organized study on the benefits of different fundamental designs. In addition, the different learning behaviors of self-attention (compared with CNNs) make the scaling law of ViTs highly unclear. Recent work [31] revealed that ViT generates more uniform receptive fields across layers, enabling the aggregation of global information in early layers. This contradicts CNNs, which require deeper layers to help the learning of global visual information [8]. Attention scores of ViTs are also found to gradually become indistinguishable as the encoder goes deeper, leading to identical and redundant feature maps and plateaued performance [47]. These observations all indicate that previously discovered design conventions and scaling laws for CNNs [20, 36] may not be suitable for ViTs, thus calling for comprehensive studies on the new inductive bias of ViT architectures on dense prediction tasks.

3. Methods

Our work targets designing a simple ViT model for dense prediction tasks while avoiding hand-crafted customization of architectures. We first explain our motivations with comprehensive ablation studies on individual design benefits in Section 3.1, and then elaborate the discovered principles of our UViT designs in Section 3.2.

3.1. Is a simple ViT design all you need?

As discussed in Section 1, traditionally CNNs leverage resolution downsampling, doubled channels, and hierarchical pyramid structures to support both multi-scale features and a reduced computation cost [7, 20, 26]. Although recent trends in designing ViTs also inherit these techniques, it is still unclear whether they are beneficial to ViTs. Meanwhile, ViT [15], DeiT [37], and T2T-ViT [43] demonstrate that, at least for image classification, a constant feature resolution and hidden size can also achieve competitive performance. Without spatial downsampling, the computation cost of self-attention blocks can also be reduced by using attention window splits [3, 28], i.e., by limiting the spatial range of the query and the key when calculating the dot-product attention scores. To better understand each individual technique, we provide a comprehensive study of the contributions of CNN-like design conventions to ViTs on dense prediction tasks.

Implementations: We conduct this study on the object detection task on the COCO 2017 dataset. We leverage the standard Cascade Mask R-CNN detection framework [4, 19], with a fixed input size of 640 × 640. All detection models are fine-tuned from an ImageNet-pretrained initialization. More details can be found in Section 4.1.

Settings: We systematically study the benefit of spatial downsampling, multi-scale features, and doubled channels to the object detection performance of ViT. We start from a baseline ViT architecture close to the S16 model proposed in [15], which has 18 attention blocks, a hidden size of 384, and six heads for each self-attention layer. The first linear projection layer embeds images into 1/8-scale patches, i.e., the input feature resolution to the transformer encoder is 1/8. The attention blocks are grouped into three stages for one, or a combination of two or three, of the purposes below:

• Spatial downsampling: tokens will be merged between two consecutive stages to downsample the feature resolution. If the channel number is also doubled between stages, the tokens will be merged by a learned convolution with a stride of 2; otherwise, tokens will be merged by 2D bilinear interpolation (see the token-merging sketch below).

• Multi-scale features: after each stage, features of a specific resolution will be output and fed into the detection FPN head. Multi-scale features of three target resolutions (1/8, 1/16, 1/32) will be collected from the encoder from early to deep attention layers.

• Doubled channels: after each of the first two stages, the token's hidden size will be doubled via a linear projection.

We study all combinations of the above three techniques, i.e., eight settings in total, and show the results in Figure 2.
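The token-merging step in the "Spatial downsampling" bullet above can be made concrete with a short sketch. This is our own illustration under stated assumptions (tokens kept as a (batch, height, width, channels) map; the helper name downsample_tokens is hypothetical), not the authors' released code:

import tensorflow as tf

def downsample_tokens(x, double_channels=False):
    """Merge tokens between two consecutive stages, as in the 'SD' ablation setting.

    x: float tensor of shape (batch, height, width, channels).
    If the channel count is also doubled ('2x'), merging uses a learned stride-2
    convolution; otherwise a plain 2D bilinear interpolation halves the resolution.
    """
    b, h, w, c = x.shape
    if double_channels:
        # Learned merge: halves resolution and doubles the hidden size.
        merge = tf.keras.layers.Conv2D(filters=2 * c, kernel_size=2, strides=2)
        return merge(x)
    # Parameter-free merge to half resolution.
    return tf.image.resize(x, size=(h // 2, w // 2), method="bilinear")

# Example: a 1/8-scale feature map of a 640x640 image (80x80 tokens, hidden size 384).
tokens = tf.random.normal([2, 80, 80, 384])
print(downsample_tokens(tokens).shape)                        # (2, 40, 40, 384)
print(downsample_tokens(tokens, double_channels=True).shape)  # (2, 40, 40, 768)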
Figure 2. The benefits of various commonly used CNN-inspired design changes to ViT: spatial downsampling ("SD"), multi-scale features ("MF"), and doubled channels ("2×"). With a controlled number of parameters (72M) and input size (640 × 640), not using any of these designs and sticking to the original ViT model [15] performs the best over the wide range of FLOPs we explore.

Note that each dot in Figure 2 indicates an individually designed and trained model. All models have around 72 million parameters, and are trained and evaluated under the same 640 × 640 input size. Therefore, vertically aligned dots share the same FLOPs, number of parameters, and input size, and are thus fairly comparable. We control the FLOPs (x-axis) by changing the depths or attention windows allocated to different stages; see our Supplement Section A for more architecture details.

Observations

• Spatial downsampling ("SD") does not seem to be beneficial. Our hypothesis is that, under the same FLOPs constraint, the self-attention layers already provide global features and do not need to downsample the features to enlarge the receptive field.

• Multi-scale features ("MF") can mitigate the poor performance from downsampling by leveraging early high-resolution features ("SD+MF"). However, the vanilla setting still outperforms this combination. We hypothesize that high-resolution features are extracted too early in the encoder; in contrast, tokens in vanilla ViTs are able to learn fine-grained details throughout the encoder blocks.

• Doubled channels ("2×") plus multi-scale features ("MF") may seem potentially competitive. However, ViT does not show a strong inductive bias toward "deeper compressed features with more embedding dimensions". This observation is also aligned with findings in [31] that ViTs have highly similar representations throughout the model, indicating that we should not sacrifice the embedding dimensions of early layers to compensate deeper layers.

In summary, we did not find strong benefits from adopting CNN-like design conventions. Instead, a simple architecture solution with a constant feature resolution and hidden size can be a strong ViT baseline.

3.2. UViT: a simple yet effective solution

Based on our study in Section 3.1, we are motivated to simplify the ViT design for dense prediction tasks and provide a neat solution, as illustrated in Figure 3. Taking 8 × 8 patches of input images, we learn the representation using a constant token resolution of 1/8 scale (i.e., the number of tokens remains the same) and a constant hidden size (i.e., the channel number will not be increased). A single-scale feature map will be fed into a detection or segmentation head. Meanwhile, attention windows [3] will be leveraged to reduce the computation cost.

Though the design is simple, two core questions remain to be determined: (1) How do we balance the UViT's depth, width, and input size to achieve the best performance-efficiency trade-off? (Section 3.2.1) (2) Which attention window strategy can effectively save computation cost without sacrificing performance? (Section 3.2.2)

Figure 3. We keep the architecture of our UViT neat: image patches (plus position embeddings) are processed by a stack of vanilla attention blocks with a constant resolution and hidden size. Single-scale feature maps as outputs are fed into head modules for detection or segmentation tasks. Constant (UViT, Section 3.2.1) or progressive (UViT+, Section 3.2.2) attention windows are introduced to reduce the computation cost. We demonstrate that this simple architecture is strong, without introducing design overhead from hierarchical spatial downsampling, doubled channels, and multi-scale feature pyramids.
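To make the single-scale design in Figure 3 concrete, here is a schematic forward pass written only from the description above (a minimal sketch assuming standard pre-norm ViT blocks; class names such as UViTEncoder are ours, and position embeddings and attention windows are omitted), not the authors' implementation:

import tensorflow as tf
from tensorflow.keras import layers

class UViTBlock(layers.Layer):
    """A vanilla pre-norm transformer encoder block (assumed; details not given in the paper)."""
    def __init__(self, hidden, heads=6, mlp_ratio=4):
        super().__init__()
        self.norm1 = layers.LayerNormalization()
        self.attn = layers.MultiHeadAttention(num_heads=heads, key_dim=hidden // heads)
        self.norm2 = layers.LayerNormalization()
        self.mlp = tf.keras.Sequential([
            layers.Dense(hidden * mlp_ratio, activation="gelu"),
            layers.Dense(hidden),
        ])

    def call(self, x):
        y = self.norm1(x)
        x = x + self.attn(y, y)                      # self-attention (windowing omitted here)
        return x + self.mlp(self.norm2(x))

class UViTEncoder(tf.keras.Model):
    """Single-scale encoder: constant 1/8 token resolution and constant hidden size throughout."""
    def __init__(self, depth=18, hidden=384, patch=8):
        super().__init__()
        self.patch_embed = layers.Conv2D(hidden, kernel_size=patch, strides=patch)
        self.blocks = [UViTBlock(hidden) for _ in range(depth)]

    def call(self, images):
        x = self.patch_embed(images)                 # (B, H/8, W/8, hidden)
        b, h, w, c = x.shape
        x = tf.reshape(x, [b, h * w, c])             # tokens; position embeddings would be added here
        for blk in self.blocks:
            x = blk(x)                               # resolution and width never change
        return tf.reshape(x, [b, h, w, c])           # single-scale map for the detection head

feats = UViTEncoder(depth=18, hidden=384)(tf.random.normal([1, 640, 640, 3]))
print(feats.shape)  # (1, 80, 80, 384)

The point of the sketch is only that nothing in the backbone changes the token count or the hidden size between the patch embedding and the output map.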
3.2.1 A compound scaling rule of UViTs

Previous works studied compound scaling rules for CNNs [36] and ViTs [44] on image classification tasks. However, few works have studied the scaling of ViTs on dense prediction tasks. To achieve the best performance-efficiency trade-off, we systematically study the compound scaling of UViTs along three dimensions: input size, depth, and width. We show our results in Figure 4 and Figure 5.¹ For all models (circle markers), we first train them on ImageNet-1k, then fine-tune them on the COCO detection task.

¹ This compound scaling rule is studied in Section 3.2.1 before we study the attention window strategy in Section 3.2.2. Thus, for all models in Figure 4 and Figure 5 we adopt a window scale of 1/2, for fair comparisons.

• Depth (number of attention blocks): we study UViT models with depths selected from {12, 18, 24, 32, 40}.

• Input size: we study four input sizes: 640 × 640, 768 × 768, 896 × 896, and 1024 × 1024.

• Width (i.e., hidden size, or output dimension of attention blocks): we tune the width to further control model sizes and computation costs, so that different scaling rules are fairly comparable.

Observations

• In general, UViT can achieve a strong mAP with a moderate computation cost (FLOPs) and a highly compact number of parameters (even fewer than 70M including the cascaded FPN head).

• For input sizes (Figure 4, different line styles): larger inputs generally create more room for models to be further scaled up. Across a wide range of model parameters and FLOPs, we find that scaling under an 896 × 896 input size constantly outperforms smaller input sizes (which lead to severe model overfitting), and is also better than 1024 × 1024 within a comparable FLOPs range.

• For model depths (Figure 5, different depths in colors): considering both FLOPs and number of parameters, 18 blocks achieve better performance than 12/24/32/40 blocks. This indicates that UViT needs a balanced trade-off between depth and width, instead of sacrificing depth for more width (e.g., 12 blocks) or sacrificing width for more depth (e.g., 40 blocks).

Figure 4. Input scaling rule for UViT on COCO object detection. Given a fixed depth, an input size of 896 × 896 (thin solid line) leaves more room for model scaling (by increasing the width) and is slightly better than 1024 × 1024 (thick solid line); 640 × 640 (dashed line) and 768 × 768 (dotted line) give a worse performance-efficiency trade-off. Black capital letters "T", "S", and "B" annotate the three final depth/width configurations of the UViT variants we will propose (Table 2). Different marker sizes represent the hidden sizes (widths).

Figure 5. Model scaling rule for UViT on COCO object detection. 18 attention blocks (orange), which provide a balanced trade-off between depth and width, perform better than shallower or deeper UViTs. Black capital letters "T", "S", and "B" annotate the three final depth/width configurations of the UViT variants we will propose (Table 2). Different marker sizes represent the hidden sizes (widths).

In summary, based on our final compound scaling rule, we propose the basic version of UViT with 18 attention blocks under an 896 × 896 input size. See our Supplement Section B for more architecture details.

3.2.2 Attention windows: a progressive strategy

In this section, we show that a progressive attention window strategy can reduce UViT's computation cost while still preserving or even increasing the performance.

Early attentions are local operators. Originally, self-attention [39] is a global operation: unlike convolution layers, which share their weights across local regions, any pair of tokens in the sequence contributes to the feature aggregation, thus collecting global information into each token. In practice, however, self-attention in different layers may still be biased toward the regions it prefers to focus on.

To validate this assumption, we select a pretrained ViT-B16 model [15] and calculate the relative receptive field of each self-attention layer on COCO. Given a sequence feature of length L and the attention scores s (after softmax) from a specific head, the relative receptive field r is defined as:

r = \frac{1}{L} \sum_{i=1}^{L} \frac{\sum_{j=1}^{L} s_{i,j} \, |i - j|}{\max(i, \, L - i)}, \qquad i, j = 1, \dots, L, \tag{1}

where \sum_{j=1}^{L} s_{i,j} = 1 for each i. This relative receptive field takes into consideration the token's position and the furthest possible location a token can aggregate from, and indicates the spatial focus of the self-attention layer.

We collect the averages and standard deviations across different attention heads. As shown in Figure 6, tokens in early attention layers, although having the potential to aggregate long-range features, weight their neighboring tokens more heavily and thus act like a "local operator". As the attention layers stack deeper, the receptive field increases, transitioning the self-attention to a global operation. This suggests that, if we explicitly limit the attention range of early layers, we may save computation cost while still preserving the capability of self-attention.

Figure 6. Relative attention receptive field of an ImageNet-pretrained ViT-B16 [15] along depth (indices of attention blocks), on the COCO dataset. Error bars are standard deviations across different attention heads.
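Eq. (1) can be evaluated directly from a layer's post-softmax attention matrix. The following is a minimal NumPy version for a single head (our sketch; it assumes the L × L attention matrix has already been extracted from the pretrained ViT-B16, and uses 1-indexed positions as in Eq. (1)):

import numpy as np

def relative_receptive_field(attn):
    """Relative receptive field r of Eq. (1) for one attention head.

    attn: (L, L) post-softmax attention scores, rows summing to 1
          (attn[i, j] = weight that query token i puts on key token j).
    """
    L = attn.shape[0]
    idx = np.arange(1, L + 1)                    # 1-indexed token positions
    dist = np.abs(idx[:, None] - idx[None, :])   # |i - j|
    reach = np.maximum(idx, L - idx)             # furthest position token i could attend to
    per_token = (attn * dist).sum(axis=1) / reach
    return per_token.mean()

# Toy check: near-diagonal attention behaves like a "local operator",
# uniform attention like a global one.
local = np.eye(100) * 0.9 + 0.1 / 100
uniform = np.full((100, 100), 1.0 / 100)
print(relative_receptive_field(local))    # small r
print(relative_receptive_field(uniform))  # larger r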
Attention window improves UViT efficiency. Motivated by Figure 6, we want to study the most effective attention window strategy. Specifically, our study can be decomposed into two sub-problems:

1) How small a window size can early attention layers endure? To answer this question, we start the attention blocks with square windows of different small scales: {1/16, 1/8, 1/4} of the height or width of the sequence's 2D shape.²

2) Do deeper layers require global attention, or are some local attentions also sufficient? To compare with global attention (window size of 1), we also try small attention windows (window size of 1/2 scale) in deeper layers.

² For example, if the input sequence has (896/8) × (896/8) = 112 × 112 tokens, a window of scale 1/16 will contain 7 × 7 = 49 elements. The same applies for 1/8 and 1/4.

To represent an attention window strategy that "progressively increases window scales from 1/4 to 1/2 to 1", we use a simple annotation "[4⁻¹] × 14 → [2⁻¹] × 2 → [1] × 2", indicating that there are 14 attention blocks assigned 1/4-scale windows, then two attention blocks assigned 1/2-scale windows, and finally two attention blocks assigned 1-scale windows (a short sketch of this bookkeeping follows).
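The annotation maps one-to-one onto a per-block list of window sizes. The sketch below only illustrates this bookkeeping (the function name expand_schedule is ours; the windowed attention itself — partitioning tokens into non-overlapping windows before computing attention — is assumed to follow [3, 28]):

def expand_schedule(schedule, grid):
    """Expand a progressive window annotation into one window size per block.

    schedule: list of (window_scale, num_blocks), e.g. the UViT+ strategy
              "[4^-1] x 14 -> [2^-1] x 2 -> [1] x 2".
    grid:     side length of the square token grid (e.g. 896 / 8 = 112).
    Returns one (window_h, window_w) entry per attention block.
    """
    sizes = []
    for scale, num_blocks in schedule:
        win = max(1, round(grid * scale))     # window side length in tokens
        sizes.extend([(win, win)] * num_blocks)
    return sizes

uvit_plus = [(1 / 4, 14), (1 / 2, 2), (1, 2)]      # 18 blocks in total
windows = expand_schedule(uvit_plus, grid=112)
print(len(windows), windows[0], windows[-1])        # 18 (28, 28) (112, 112)

For UViT+ at 896 × 896 (a 112 × 112 token grid), this yields fourteen blocks with 28 × 28-token windows, two with 56 × 56 windows, and two with full 112 × 112 (global) attention.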

When comparing different window strategies, we make sure all strategies have the same number of parameters and share a similar computation cost for fair comparisons. We also include four more baselines with a constant attention window scale across all attention blocks: global attention, as well as windows of 1/4 to 1/2 scale. We show our results in Table 1.
Table 1. Over-shrunk window sizes in early layers are harmful, and global attention windows in deep layers are vital to the final performance. Fractions in brackets indicate attention window scales (relative to the sequence feature size), and the multiplier indicates the number of attention blocks allocated to that window scale (18 blocks in total).

[window scale] × #layers                                    GFLOPs    AP^val
[1] × 18                                                    2962      52.4
[2⁻¹] × 18                                                  1299      52.3
[3⁻¹] × 18                                                  974.82    51.9
[4⁻¹] × 18                                                  882.83    51.7
[16⁻¹] × 4 → [8⁻¹] × 4 → [4⁻¹] × 4 → [2⁻¹] × 4 → [1] × 2    1154      52.0
[8⁻¹] × 9 → [4⁻¹] × 4 → [2⁻¹] × 3 → [1] × 2                 1131      52.2
[4⁻¹] × 14 → [2⁻¹] × 2 → [1] × 2                            1160      52.5
[4⁻¹] × 6 → [2⁻¹] × 12                                      1160      52.2
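As a rough sanity check on the GFLOPs column (our own back-of-the-envelope estimate, not a figure from the paper): splitting the N-token sequence into k² non-overlapping windows shrinks only the quadratic attention term, from roughly 2N²d to 2N²d/k² per layer, counting a multiply-add as one FLOP and ignoring projections, FFNs, and the detection head (the paper does not state its counting convention). With N = 112² tokens and d = 384,

2 N^2 d \approx 2 \cdot (112^2)^2 \cdot 384 \approx 1.2 \times 10^{11} \ \text{FLOPs per globally attended layer.}

A 2⁻¹-scale window (k = 2) removes about three quarters of this term, i.e., roughly 0.75 × 121 × 18 ≈ 1.6 × 10³ GFLOPs over 18 blocks, which is close to the observed drop from 2962 to 1299 GFLOPs between the first two rows; the remaining cost is dominated by the window-independent projection and FFN layers.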

From Table 1, we summarize the following observations:

• With a smaller constant window scale (1/2, 1/3, or 1/4), we save more computation cost with a slight sacrifice in mAP.

• Adopting constant global attention throughout the whole encoder (window scale of 1, first row) is largely redundant: it contributes marginal benefits but suffers from a huge computation cost.

• Early attention layers can use smaller windows such as 1/4 scale, but over-shrunk window sizes (1/16, 1/8) can impair the capability of self-attention (the two progressive strategies that start from 1/16- and 1/8-scale windows).

• Deeper layers still require global attention to preserve the final performance (last two rows).

• A properly designed window strategy (the "[4⁻¹] × 14 → [2⁻¹] × 2 → [1] × 2" row) can outperform the vanilla solutions (first two rows) with a reduced computation cost.

In conclusion, we set the window scale of our basic version (UViT, Section 3.2.1) to a constant 3⁻¹, and propose an improved version of our model, dubbed "UViT+", which adopts the attention window strategy "[4⁻¹] × 14 → [2⁻¹] × 2 → [1] × 2".

4. Final Results

We conduct our experiments on COCO [27] object detection and instance segmentation tasks to show our final performance.

4.1. Implementations

We implement our model and training in TensorFlow and Keras. Experiments are conducted on TPUs. Before fine-tuning on object detection or segmentation, we follow the DeiT [37] training settings to pretrain our UViTs on ImageNet-1k with a 224 × 224 input size and a batch size of 1024. We follow the convention in [15]: during ImageNet pretraining, the kernel size of the first linear projection layer is 16 × 16. During fine-tuning, we use a more fine-grained 8 × 8 patch size for dense sampling. The kernel weights of the first linear projection layer are interpolated from 16 × 16 to 8 × 8, and the position embeddings are also elongated by interpolation.
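The 16 × 16 → 8 × 8 transfer described above amounts to resizing two sets of pretrained weights. Below is a minimal TensorFlow sketch of that step (ours, not the released training code; it assumes a (16, 16, 3, D) patch-embedding kernel and a position-embedding table laid out on a square grid with no class token):

import tensorflow as tf

def adapt_for_finetuning(patch_kernel, pos_embed, new_patch=8, new_grid=112):
    """Interpolate pretraining weights (16x16 patches, 224x224 inputs) for dense fine-tuning.

    patch_kernel: (16, 16, 3, D) kernel of the first linear projection (patch embedding).
    pos_embed:    (G*G, D) position embeddings on a square G x G token grid (G = 224/16 = 14).
    Returns a (new_patch, new_patch, 3, D) kernel and a (new_grid*new_grid, D) embedding table.
    """
    ph, pw, cin, d = patch_kernel.shape
    # Resize the patch-embedding kernel spatially, e.g. 16x16 -> 8x8 (bilinear).
    k = tf.reshape(patch_kernel, [ph, pw, cin * d])
    k = tf.image.resize(k, [new_patch, new_patch], method="bilinear")
    new_kernel = tf.reshape(k, [new_patch, new_patch, cin, d])

    # "Elongate" the position embeddings: treat them as a G x G x D image and resize.
    g = int(pos_embed.shape[0] ** 0.5)
    p = tf.reshape(pos_embed, [g, g, d])
    p = tf.image.resize(p, [new_grid, new_grid], method="bilinear")
    new_pos = tf.reshape(p, [new_grid * new_grid, d])
    return new_kernel, new_pos

# Example: ViT-B/16-style shapes -> 8x8 patches on 896x896 inputs (112x112 tokens).
kernel, pos = adapt_for_finetuning(tf.random.normal([16, 16, 3, 384]),
                                   tf.random.normal([14 * 14, 384]))
print(kernel.shape, pos.shape)  # (8, 8, 3, 384) (12544, 384)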
4.2. Architectures

We propose three UViT variants. Their architecture configurations are listed in Table 2, and are also annotated in Figure 4 and Figure 5 ("T", "S", "B" in black). The number of heads is fixed at six, and the expansion ratio of each FFN (feed-forward network) layer is fixed at four in all experiments. As discussed in Section 3.2.2, the attention window strategy is "[4⁻¹] × 14 → [2⁻¹] × 2 → [1] × 2".

Table 2. Architecture variants of our UViT.

Name      Depth   Hidden Size   Params. (M, backbone)   Params. (M, Cascade Mask R-CNN)
UViT-T    18      222           13.5                    47.4
UViT-S    18      288           21.7                    53.8
UViT-B    18      384           36.9                    74.4

4.3. COCO detection & instance segmentation

Settings: Object detection experiments are conducted on COCO 2017 [27], which contains 118K training and 5K validation images. We consider the popular Cascade Mask R-CNN detection framework [4, 19], and leverage multi-scale training [6, 35] (resizing the input to 896 × 896), the AdamW optimizer [29] (with an initial learning rate of 3 × 10⁻³), a weight decay of 1 × 10⁻⁴, and a batch size of 256.
Table 3. Two-stage object detection and instance segmentation results on COCO 2017. We compare different backbones employed with Cascade Mask R-CNN, single-model results without test-time augmentation. UViT uses a constant window scale of 3⁻¹, and UViT+ adopts the attention window strategy "[4⁻¹] × 14 → [2⁻¹] × 2 → [1] × 2". We also reproduce the performance of ResNet under the same settings.

Backbone                          Resolution       GFLOPs   Params. (M)   AP^val   AP^val_mask
ResNet-18                         896×896          370.4    48.9          44.2     38.5
ResNet-50                         896×896          408.8    61.9          47.4     40.8
Swin-T [28]                       480∼800×1333     745      86            50.5     43.7
Shuffle-T [24]                    480∼800×1333     746      86            50.8     44.1
UViT-T (ours)                     896×896          613      47.4          51.1     43.6
UViT-T+ (ours)                    896×896          710      47.4          51.2     43.9
ResNet-101                        896×896          468.2    81.0          48.5     41.8
Swin-S [28]                       480∼800×1333     838      107           51.8     44.7
Shuffle-S [24]                    480∼800×1333     844      107           51.9     44.9
UViT-S (ours)                     896×896          744      53.8          51.4     44.1
UViT-S+ (ours)                    896×896          866      53.8          51.9     44.5
ResNet-152                        896×896          527.7    96.7          49.1     42.1
Swin-B [28]                       480∼800×1333     982      145           51.9     45.0
Shuffle-B [24]                    480∼800×1333     989      145           52.2     45.3
GCNet [5]                         -                1041     -             51.8     44.7
UViT-B (ours)                     896×896          975      74.4          51.9     44.3
UViT-B+ (ours)                    896×896          1160     74.4          52.5     44.8
UViT-B+ w/ self-training (ours)   896×896          1160     74.4          53.9     46.1

From Table 3 we can see that, at every model-size level, our UViTs are highly compact in terms of the number of parameters, and achieve competitive or stronger performance at comparable FLOPs. To keep this comparison clean, we did not adopt any system-level techniques [28] to boost performance. As we did not leverage any CNN-like hierarchical pyramid structures, the results of our simple and neat solution suggest that the original design philosophy of ViT [15] is a strong baseline without any hand-crafted architecture customization. We also show the mAP-efficiency trade-off curves in Figure 1. Our UViT achieves strong results with much better efficiency, compared with both CNNs and other ViT works.

Additionally, we adopt self-training on top of our largest model (UViT-B) to evaluate the performance gain from leveraging unlabeled data, similar to [48]. We use ImageNet-1K without labels as the unlabeled set, and a pretrained UViT-B model as the teacher model to generate pseudo-labels. All predicted boxes with a confidence score larger than 0.5 are kept, together with their corresponding masks. For self-training, the student model is initialized from the same weights as the teacher model. The ratio of labeled data to pseudo-labeled data is 1:1 in each batch. Apart from increasing the training steps by 2× for each epoch, all other hyperparameters remain unchanged. We can see from the last row in Table 3 that self-training significantly improves box AP and mask AP, by 1.4% and 1.3%, respectively.
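A minimal sketch of the pseudo-label filtering described above (our illustration; the dictionary keys 'boxes', 'scores', 'classes', and 'masks' are placeholder names for whatever format the detector emits):

import tensorflow as tf

SCORE_THRESHOLD = 0.5  # keep teacher predictions with confidence > 0.5

def filter_pseudo_labels(predictions):
    """Keep high-confidence teacher boxes (and their masks) as pseudo-labels."""
    keep = predictions["scores"] > SCORE_THRESHOLD
    return {
        "boxes": tf.boolean_mask(predictions["boxes"], keep),
        "classes": tf.boolean_mask(predictions["classes"], keep),
        "masks": tf.boolean_mask(predictions["masks"], keep),
    }

# Example with dummy teacher outputs: three of five boxes exceed the threshold.
preds = {"boxes": tf.zeros([5, 4]),
         "scores": tf.constant([0.9, 0.3, 0.7, 0.4, 0.6]),
         "classes": tf.zeros([5], tf.int32),
         "masks": tf.zeros([5, 28, 28])}
print(filter_pseudo_labels(preds)["boxes"].shape)  # (3, 4)

Each training batch then mixes labeled COCO examples and pseudo-labeled ImageNet examples 1:1, with the student initialized from the teacher weights as stated above.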
5. Conclusion

In this work, we present a simple, single-scale vision transformer backbone that can serve as a strong baseline for object detection and instance segmentation. Our UViT architecture does not involve any of the hierarchical pyramid designs that are widely adopted in CNNs and in recently developed vision transformer models. To validate our method, we first carefully study the benefits of individual design techniques: spatial downsampling, multi-scale features, and doubled channels. Our study shows that, for vision transformers, a plain encoder with a constant feature resolution and hidden size performs the best, indicating that the design philosophy of CNNs may not be optimal for ViTs on dense prediction tasks. Following this observation, we further optimize UViT's performance-efficiency trade-off by studying a compound scaling rule (depth, width, input size) and a progressive attention window strategy. Our proposed UViT architectures achieve strong performance on both COCO object detection and instance segmentation. Most importantly, we hope our work brings to the community's attention that ViTs may require careful and dedicated architecture design on dense prediction tasks, instead of directly adopting CNN design conventions as a black box.
References

[1] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lucic, and Cordelia Schmid. Vivit: A video vision transformer. ArXiv, abs/2103.15691, 2021. 1
[2] Josh Beal, Eric Kim, Eric Tzeng, Dong Huk Park, Andrew Zhai, and Dmitry Kislyuk. Toward transformer-based object detection. arXiv preprint arXiv:2012.09958, 2020. 2
[3] Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020. 3, 4
[4] Zhaowei Cai and Nuno Vasconcelos. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6154–6162, 2018. 1, 3, 7
[5] Yue Cao, Jiarui Xu, Stephen Lin, Fangyun Wei, and Han Hu. Global context networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020. 8
[6] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision, pages 213–229. Springer, 2020. 7
[7] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017. 2, 3
[8] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV), pages 801–818, 2018. 3
[9] Xiangning Chen, Cho-Jui Hsieh, and Boqing Gong. When vision transformers outperform resnets without pretraining or strong data augmentations. arXiv preprint arXiv:2106.01548, 2021. 2
[10] Xiangxiang Chu, Bo Zhang, Zhi Tian, Xiaolin Wei, and Huaxia Xia. Do we really need explicit position encodings for vision transformers? arXiv e-prints, pages arXiv–2102, 2021. 2
[11] Nadav Cohen, Or Sharir, and Amnon Shashua. On the expressive power of deep learning: A tensor analysis. In Conference on learning theory, pages 698–728. PMLR, 2016. 2
[12] Arlin P. S. Crotts. Vatt/columbia microlensing survey of m31 and the galaxy. arXiv: Astrophysics, 1996. 1
[13] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 764–773, 2017. 2
[14] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009. 1, 2
[15] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. 1, 2, 3, 4, 5, 6, 7, 8
[16] Dennis Elbrächter, Dmytro Perekrestenko, Philipp Grohs, and Helmut Bölcskei. Deep neural network approximation theory. arXiv preprint arXiv:1901.02220, 2019. 2
[17] Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. In Conference on learning theory, pages 907–940. PMLR, 2016. 2
[18] Kai Han, An Xiao, Enhua Wu, Jianyuan Guo, Chunjing Xu, and Yunhe Wang. Transformer in transformer. arXiv preprint arXiv:2103.00112, 2021. 1, 2
[19] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017. 3, 7
[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 1, 2, 3
[21] Byeongho Heo, Sangdoo Yun, Dongyoon Han, Sanghyuk Chun, Junsuk Choe, and Seong Joon Oh. Rethinking spatial dimensions of vision transformers. arXiv preprint arXiv:2103.16302, 2021. 1, 2
[22] Han Hu, Zheng Zhang, Zhenda Xie, and Stephen Lin. Local relation networks for image recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 3464–3473, 2019. 2
[23] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017. 2
[24] Zilong Huang, Youcheng Ben, Guozhong Luo, Pei Cheng, Gang Yu, and Bin Fu. Shuffle transformer: Rethinking spatial shuffle for vision transformer. arXiv preprint arXiv:2106.03650, 2021. 1, 8
[25] Shiyu Liang and Rayadurgam Srikant. Why deep neural networks for function approximation? arXiv preprint arXiv:1610.04161, 2016. 2
[26] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017. 2, 3
[27] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014. 7, 11, 12
[28] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030, 2021. 1, 2, 3, 8
[29] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017. 7
[30] Muzammal Naseer, Kanchana Ranasinghe, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang. Intriguing properties of vision transformers. arXiv preprint arXiv:2105.10497, 2021. 2
[31] Maithra Raghu, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang, and Alexey Dosovitskiy. Do vision transformers see like convolutional neural networks? arXiv preprint arXiv:2108.08810, 2021. 2, 3, 4
[32] Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and Jonathon Shlens. Stand-alone self-attention in vision models. arXiv preprint arXiv:1906.05909, 2019. 2
[33] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. 1, 2
[34] Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE international conference on computer vision, pages 843–852, 2017. 2
[35] Peize Sun, Rufeng Zhang, Yi Jiang, Tao Kong, Chenfeng Xu, Wei Zhan, Masayoshi Tomizuka, Lei Li, Zehuan Yuan, Changhu Wang, et al. Sparse r-cnn: End-to-end object detection with learnable proposals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14454–14463, 2021. 7
[36] Mingxing Tan and Quoc V Le. Efficientnet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946, 2019. 3, 4
[37] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. arXiv preprint arXiv:2012.12877, 2020. 3, 7
[38] Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. Going deeper with image transformers. arXiv preprint arXiv:2103.17239, 2021. 2
[39] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30:5998–6008, 2017. 1, 5
[40] Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, et al. Deep high-resolution representation learning for visual recognition. IEEE transactions on pattern analysis and machine intelligence, 2020. 2
[41] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. arXiv preprint arXiv:2102.12122, 2021. 2
[42] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. Segformer: Simple and efficient design for semantic segmentation with transformers. arXiv preprint arXiv:2105.15203, 2021. 1, 2
[43] Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Francis EH Tay, Jiashi Feng, and Shuicheng Yan. Tokens-to-token vit: Training vision transformers from scratch on imagenet. arXiv preprint arXiv:2101.11986, 2021. 2, 3
[44] Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. arXiv preprint arXiv:2106.04560, 2021. 1, 4
[45] Hengshuang Zhao, Jiaya Jia, and Vladlen Koltun. Exploring self-attention for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10076–10085, 2020. 2
[46] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip HS Torr, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6881–6890, 2021. 2
[47] Daquan Zhou, Bingyi Kang, Xiaojie Jin, Linjie Yang, Xiaochen Lian, Zihang Jiang, Qibin Hou, and Jiashi Feng. Deepvit: Towards deeper vision transformer. arXiv preprint arXiv:2103.11886, 2021. 3
[48] Barret Zoph, Golnaz Ghiasi, Tsung-Yi Lin, Yin Cui, Hanxiao Liu, Ekin D Cubuk, and Quoc V Le. Rethinking pre-training and self-training. arXiv preprint arXiv:2006.06882, 2020. 8
A. Model architectures studied in Figure 2

We show details of all architectures studied in Figure 2 in Table 4 below. As mentioned in Section 3.1, we study all combinations of the three techniques (spatial downsampling "SD", multi-scale features "MF", doubled channels "2×"), i.e., eight settings in total, and show the results in Figure 2. Each dot in Figure 2 indicates an individually designed and trained model. To make all comparisons fair, we carefully design all models such that they all have around 72 million parameters. We control their FLOPs by changing the depths or attention windows allocated to different stages.

Table 4. Model architectures in Figure 2, all studied under a 640 × 640 input size on MS-COCO [27]. "SD": spatial downsampling. "MF": multi-scale features. "2×": doubled channels. Without any of these three techniques (first setting), the whole network has a constant feature resolution and hidden size; the other seven settings split the network into three stages, since they require either a progressive feature downsampling or multi-scale features from each stage. The input scale is relative to the 2D shape of the input image H × W (e.g., 8⁻¹ indicates the 2D shape of the UViT's sequence feature is ⅛H × ⅛W). The window scale is relative to the 2D shape of the sequence feature h × w (e.g., 8⁻¹ indicates the 2D shape of the attention window is ⅛h × ⅛w). The feature maps fed into the FPN detection head are the last output of the backbone if "MF" is not used, or the outputs of all three stages if "MF" is used.
Setting 1 (none of SD / MF / 2×). Single stage; Input Scale 8⁻¹, 18 layers, Hidden Size 384, Output Scale 8⁻¹; 72.1 M params.
  Window Scale    FLOPs (G)    mAP
  16⁻¹            534.1        44.5
  8⁻¹             540.9        48.2
  4⁻¹             567.9        50.1
  2⁻¹             676.2        50.7
  1               1109.1       50.8

Setting 2 (SD only). Stages 1/2/3: Input Scale 8⁻¹/16⁻¹/32⁻¹, Window Scale 1, Hidden Size 384, Output Scale 8⁻¹/16⁻¹/32⁻¹; 72.1 M params.
  #Layers (per stage)    FLOPs (G)    mAP
  6 / 6 / 6              607.1        41.0
  8 / 5 / 5              688.28       42.0
  10 / 4 / 4             769.47       42.6
  12 / 3 / 3             850.68       43.0
  14 / 2 / 2             931.88       43.4

Setting 3 (MF only). Stages 1/2/3: Input Scale 8⁻¹, 6 layers each, Hidden Size 384, Output Scale 8⁻¹/16⁻¹/32⁻¹; 72.1 M params.
  Window Scale    FLOPs (G)    mAP
  16⁻¹            534.3        44.3
  8⁻¹             541.03       47.6
  4⁻¹             568.09       49.4
  2⁻¹             676.33       50.3
  1               1109.3       50.2

Setting 4 (2× only). Stages 1/2/3: Input Scale 8⁻¹, 6 layers each, Hidden Size 152/304/608, Output Scale 8⁻¹; 73.8 M params.
  Window Scale    FLOPs (G)    mAP
  16⁻¹            558.4        43.4
  8⁻¹             561.5        44.4
  4⁻¹             587.7        46.3
  2⁻¹             692.2        46.6
  1               1110.2       48.3

Setting 5 (SD + MF). Stages 1/2/3: Input Scale 8⁻¹/16⁻¹/32⁻¹, Window Scale 1, Hidden Size 384, Output Scale 8⁻¹/16⁻¹/32⁻¹; 72.1 M params.
  #Layers (per stage)    FLOPs (G)    mAP
  2 / 8 / 8              459.7        45.8
  4 / 7 / 7              540.9        47.5
  6 / 6 / 6              622.1        48.5
  8 / 5 / 5              703.3        48.0
  10 / 4 / 4             784.5        48.6
  12 / 3 / 3             865.7        50.2
  15 / 2 / 1             989.5        50.4

Setting 6 (SD + 2×). Stages 1/2/3: Input Scale 8⁻¹/16⁻¹/32⁻¹, Window Scale 1, Output Scale 8⁻¹/16⁻¹/32⁻¹; 16 layers in Stage 1, 1 layer in Stage 2.
  Hidden Size (per stage)    #Layers (Stage 3)    Params. (M)    FLOPs (G)    mAP
  128 / 256 / 512            9                    70.2           529.1        37.6
  160 / 320 / 640            5                    69.3           581.7        38.9
  192 / 384 / 768            3                    69.3           637.4        40.2
  224 / 448 / 896            2                    71.4           696.6        41.7
  256 / 512 / 1024           1                    69.2           756.5        42.5

Setting 7 (MF + 2×). Stages 1/2/3: Input Scale 8⁻¹, 6 layers each, Hidden Size 152/304/608, Output Scale 8⁻¹/16⁻¹/32⁻¹; 73.8 M params.
  Window Scale    FLOPs (G)    mAP
  16⁻¹            566.3        45.7
  8⁻¹             569.5        46.4
  4⁻¹             595.6        48.1
  2⁻¹             700.1        49.0

Setting 8 (SD + MF + 2×). Stages 1/2/3: Input Scale 8⁻¹/16⁻¹/32⁻¹, Window Scale 1, Output Scale 8⁻¹/16⁻¹/32⁻¹; 1 layer in Stage 2.
  #Layers (Stage 1)    Hidden Size (per stage)    #Layers (Stage 3)    Params. (M)    FLOPs (G)    mAP
  16                   128 / 256 / 512            9                    73.3           552.1        44.3
  16                   160 / 320 / 640            5                    72.4           604.9        45.5
  16                   192 / 384 / 768            3                    72.4           660.7        47.6
  16                   224 / 448 / 896            2                    74.5           719.9        48.8
  16                   256 / 512 / 1024           1                    72.4           779.9        49.4
  28                   224 / 448 / 896            1                    72.1           992.3        49.5
B. Model architectures in Figure 4 and Figure 5

We show all architectures studied in our compound scaling rule in Figure 4 and Figure 5. All models use 2⁻¹-scale attention windows for fair comparisons.

Table 5. Model architectures in Figure 4 and Figure 5 (MS-COCO [27]). Configurations (depth, width) of UViT-T/S/B are annotated.

Input size 640 × 640, depth 18:
  Width            Params. (M)    FLOPs (G)    mAP
  384              72.1           676.2        50.4
  432              80.9           748.3        50.5
  462              86.9           796.6        50.7
  492              93.3           847.4        50.4
  564              110.2          979.4        50.1

Input size 768 × 768, depth 18:
  288              58.2           725.9        51.1
  306              60.7           761.1        51.5
  330              64.3           810.0        51.3
  384              73.1           928.5        51.5
  432              82.1           1043.5       51.6
  462              88.2           1120.1       51.3

Input size 896 × 896, depth 18:
  186              47.4           710.2        51.0
  222 (UViT-T)     51.0           801.4        51.3
  246              53.8           866.1        51.7
  288 (UViT-S)     59.2           986.8        51.7
  330              65.4           1117.1       52.1
  384 (UViT-B)     74.4           1298.7       52.3

Input size 1024 × 1024, depth 18:
  120              42.6           710.3        47.9
  132              43.5           750.1        48.9
  144              44.4           791.0        49.3
  162              45.8           854.3        50.4
  198              49.3           987.6        51.4
  246              54.7           1179.7       51.7
  288              60.3           1361.2       52.0

Input size 896 × 896, depth 12:
  276              52.1           748.4        50.8
  300              54.4           796.2        50.8
  324              56.9           846.2        51.0
  360              60.9           925.0        51.5
  390              64.5           994.2        51.5

Input size 896 × 896, depth 24:
  156              46.5           739.0        50.6
  180              49.2           813.8        50.8
  192              50.6           852.7        51.3
  258              60.1           1085.4       51.8
  294              66.3           1225.7       51.6

Input size 896 × 896, depth 32:
  120              44.6           732.5        50.0
  132              45.9           777.4        50.4
  144              47.3           823.8        51.2
  180              52.3           971.1        51.5
  240              62.8           1244.4       52.0

Input size 896 × 896, depth 40:
  96               43.2           723.2        48.5
  102              43.8           749.3        49.1
  114              45.2           802.9        50.1
  126              46.8           858.2        50.7
  150              50.3           974.0        51.2
  156              51.2           1004.0       51.2
