
Brief Review — FasterViT: Fast Vision Transformers with Hierarchical Attention

FasterViT Outperforms Swin and ConvNeXt

Sik-Ho Tsang
Comparison of image throughput and ImageNet-1K Top-1 accuracy

FasterViT: Fast Vision Transformers with Hierarchical Attention
FasterViT, by NVIDIA
2024 ICLR, Over 20 Citations (Sik-Ho Tsang @ Medium)

Image Classification
1989 … 2023
[Vision Permutator (ViP)] [ConvMixer] [CrossFormer++] [FastViT] [EfficientFormerV2] [MobileViTv2]
==== My Other Paper Readings Are Also Over Here ====

  • Hierarchical Attention (HAT) approach is proposed, which decomposes global self-attention with quadratic complexity into a multi-level attention with reduced computational costs.
  • Efficient window-based self-attention is used in which each window has access to dedicated carrier tokens that participate in local and global representation learning.
  • At a high level, global self-attentions enable efficient cross-window communication at lower cost.

Outline

  1. FasterViT
  2. Results

1. FasterViT

1.1. Overall Architecture

Overview of the FasterViT architecture.
  • It exploits convolutional layers in the earlier stages, which operate at higher resolutions. The second half of the model relies on the novel hierarchical attention layers.

1.2. Stem & Downsampler Blocks & Conv Blocks

  • Stem: An input image x is converted into overlapping patches by two consecutive 3×3 convolutional layers, each with a stride of 2.
  • Downsampler Blocks: The spatial resolution is reduced by 2 between stages by a downsampling block: a 2D layer norm followed by a 3×3 convolutional layer with a stride of 2.
  • Conv Blocks: Stages 1 and 2 consist of residual convolutional blocks, with the use of BN and GELU, which are defined as: x̂ = GELU(BN(Conv3×3(x))), x = BN(Conv3×3(x̂)) + x (see the sketch after this list).
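
Putting these together, below is a minimal PyTorch sketch of the three components. The stem's activation, the exact form of the 2D layer norm (approximated here with GroupNorm), and all class and argument names are assumptions for illustration, not the official implementation.

```python
import torch.nn as nn

class Stem(nn.Module):
    """Two consecutive 3x3 convs, each with stride 2 (4x spatial downsampling)."""
    def __init__(self, in_ch=3, dim=64):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(in_ch, dim, 3, stride=2, padding=1),
            nn.ReLU(inplace=True),  # assumption: activation between the two convs
            nn.Conv2d(dim, dim, 3, stride=2, padding=1),
        )

    def forward(self, x):
        return self.proj(x)


class Downsampler(nn.Module):
    """2D layer norm followed by a stride-2 3x3 conv that doubles channels."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.GroupNorm(1, dim)  # stand-in for a channels-first 2D layer norm
        self.reduce = nn.Conv2d(dim, 2 * dim, 3, stride=2, padding=1)

    def forward(self, x):
        return self.reduce(self.norm(x))


class ConvBlock(nn.Module):
    """Residual conv block of stages 1-2:
    x_hat = GELU(BN(Conv3x3(x))); x = BN(Conv3x3(x_hat)) + x."""
    def __init__(self, dim):
        super().__init__()
        self.conv1, self.bn1 = nn.Conv2d(dim, dim, 3, padding=1), nn.BatchNorm2d(dim)
        self.conv2, self.bn2 = nn.Conv2d(dim, dim, 3, padding=1), nn.BatchNorm2d(dim)
        self.act = nn.GELU()

    def forward(self, x):
        x_hat = self.act(self.bn1(self.conv1(x)))
        return self.bn2(self.conv2(x_hat)) + x
```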

1.3. Hierarchical Attention

Hierarchical Attention (HAT) Approach
Proposed Hierarchical Attention block
  • With an input feature map x ∈ ℝ^{H×W×d} (assuming H = W), the feature map is first partitioned into n local windows of size k×k, with n = H²/k², where k is the window size: x̂_l = Split_{k×k}(x).
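
A common way to implement this k×k partition is the Swin-style reshape below; this is a sketch assuming a channels-last (B, H, W, C) tensor with H and W divisible by k.

```python
import torch

def window_partition(x: torch.Tensor, k: int) -> torch.Tensor:
    """Split a (B, H, W, C) map into (B * H*W/k**2, k*k, C) window sequences."""
    B, H, W, C = x.shape
    x = x.view(B, H // k, k, W // k, k, C)
    # -> (B, H//k, W//k, k, k, C), then flatten windows into the batch dim
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, k * k, C)

x = torch.randn(2, 14, 14, 192)     # illustrative feature map sizes
windows = window_partition(x, k=7)  # (2 * 4, 49, 192): n = 14**2 / 7**2 = 4 windows
```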

The key idea is the formulation of carrier tokens (CTs), which give each window an attention footprint much larger than the local window, at low cost.

  • At first, CTs are initialized by pooling to L = 2^c tokens per window:
  • The convolution here acts as an efficient positional encoding, as also used in Twins.
  • c = 1 is used.
  • These pooled tokens represent a summary of their respective local windows.
  • In every HAT block, CTs undergo the attention procedure: x̂_ct = x̂_ct + MHSA(LN(x̂_ct)), then x̂_ct = x̂_ct + MLP(LN(x̂_ct)),
  • where LN is layer norm, MHSA is multi-head self-attention, and MLP is a 2-layer MLP with GELU (see the sketch below).
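
A hedged PyTorch sketch of CT initialization and the CT attention block follows. The depthwise 3×3 conv as the positional encoding and the pooling to a square token grid are assumptions; the names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CarrierTokenInit(nn.Module):
    """Initialize L = 2**c carrier tokens per window: a conv positional
    encoding (as in Twins) followed by average pooling to n*L tokens."""
    def __init__(self, dim: int):
        super().__init__()
        self.pe = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # assumed depthwise PE

    def forward(self, x: torch.Tensor, n: int, L: int) -> torch.Tensor:
        x = self.pe(x)                        # x: (B, C, H, W)
        side = int((n * L) ** 0.5)            # assumes n*L is a perfect square
        ct = F.adaptive_avg_pool2d(x, side)   # (B, C, side, side)
        return ct.flatten(2).transpose(1, 2)  # (B, n*L, C)


class CTBlock(nn.Module):
    """Pre-norm transformer block: t = t + MHSA(LN(t)); t = t + MLP(LN(t))."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        h = self.norm1(t)
        t = t + self.attn(h, h, h, need_weights=False)[0]
        return t + self.mlp(self.norm2(t))
```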

The interaction between the local tokens x̂_l and the carrier tokens x̂_ct is then computed.

  • At first, local features and CTs are concatenated; each local window only has access to its corresponding CTs:
  • The concatenated tokens undergo another attention procedure of the same form:
  • Finally, the tokens are split back and used in the subsequent hierarchical attention layers:
  • The above procedure is applied iteratively for a number of layers in the stage.
  • To further facilitate long-range interaction, global information propagation is performed at the end of the stage, where the CTs are upsampled and merged back into the local tokens (see the sketch after this list).
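
Continuing the sketch above (and reusing CTBlock), here is how one hierarchical attention step could wire the two token streams together. The CT-to-window routing is simplified and the names are hypothetical; the end-of-stage upsampling of CTs is noted but omitted.

```python
import torch
import torch.nn as nn

class HATStep(nn.Module):
    """One hierarchical attention step: CTs attend globally to each other,
    then each window attends over [its k*k local tokens ++ its L CTs]."""
    def __init__(self, dim: int, heads: int = 8, L: int = 2):  # L = 2**c with c = 1
        super().__init__()
        self.L = L
        self.ct_attn = CTBlock(dim, heads)   # global attention among all CTs
        self.win_attn = CTBlock(dim, heads)  # window-local joint attention

    def forward(self, x_l: torch.Tensor, x_ct: torch.Tensor):
        # x_l: (B*n, k*k, C) window tokens; x_ct: (B, n*L, C) carrier tokens
        x_ct = self.ct_attn(x_ct)                      # cross-window summary exchange
        ct_w = x_ct.reshape(x_l.shape[0], self.L, -1)  # route L CTs to each window
        y = self.win_attn(torch.cat([x_l, ct_w], 1))   # window sees only its own CTs
        x_l, ct_w = y[:, :-self.L], y[:, -self.L:]     # split the streams back apart
        return x_l, ct_w.reshape(x_ct.shape)
```

At the end of the stage, the CTs would additionally be upsampled to the full token resolution and added onto the local features, realizing the global information propagation step.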

1.4. Attention Map Comparison

Attention map comparison

2. Results

2.1. Image Classification

Image Classification
  • Compared to Conv-based architectures, FasterViT achieves higher accuracy at the same throughput; for example, FasterViT outperforms ConvNeXt-T by 2.2%.
  • Considering the accuracy-throughput trade-off, FasterViT models are significantly faster than Transformer-based models such as the Swin Transformer family.
  • Furthermore, compared to hybrid models, such as the recent EfficientFormer and MaxViT (Tu et al., 2022), FasterViT on average has a higher throughput while achieving better ImageNet top-1 performance.
ImageNet-21K pretrained classification benchmarks

FasterViT-4 has a better accuracy-throughput trade-off compared to other counterparts.

2.2. Downstream Tasks

MS COCO Using Cascade Mask R-CNN Framework

FasterViT models have better accuracy-throughput trade-off when compared to other counterparts.

ADE20K Using UPerNet Framework

Similar to previous tasks, FasterViT models benefit from a better performance-throughput trade-off.


Sik-Ho Tsang

PhD, Researcher. I share what I learn. :) Linktree: https://linktr.ee/shtsang for Twitter, LinkedIn, etc.