
Brief Review — FasterViT: Fast Vision Transformers with Hierarchical Attention

FasterViT Outperforms Swin and ConvNeXt

Sik-Ho Tsang
Comparison of image throughput and ImageNet-1K Top-1 accuracy

FasterViT: Fast Vision Transformers with Hierarchical Attention
FasterViT, by NVIDIA
2024 ICLR, Over 20 Citations (Sik-Ho Tsang @ Medium)

Image Classification
1989 … 2023
[Vision Permutator (ViP)] [ConvMixer] [CrossFormer++] [FastViT] [EfficientFormerV2] [MobileViTv2]
==== My Other Paper Readings Are Also Over Here ====

  • Hierarchical Attention (HAT) approach is proposed, which decomposes global self-attention with quadratic complexity into a multi-level attention with reduced computational costs.
  • Efficient window-based self-attention is used in which each window has access to dedicated carrier tokens that participate in local and global representation learning.
  • At a high level, global self-attentions enable efficient cross-window communication at lower cost.

Outline

  1. FasterViT
  2. Results

1. FasterViT

1.1. Overall Architecture

Overview of the FasterViT architecture.
  • It exploits convolutional layers in the earlier stages, which operate at higher resolutions. The second half of the model relies on the novel hierarchical attention layers.

1.2. Stem & Downsampler Blocks & Conv Blocks

  • Stem: An input image x is converted into overlapping patches by two consecutive 3×3 convolutional layers, each with a stride of 2.
  • Downsampler Blocks: The spatial resolution is reduced by 2 between stages by a downsampling block: a 2D layer norm followed by a 3×3 convolutional layer with a stride of 2.
  • Conv Blocks: Stages 1 and 2 consist of residual convolutional blocks, with the use of BN and GELU, which are defined as: x̂ = GELU(BN(Conv3×3(x))), x = BN(Conv3×3(x̂)) + x (see the sketch after this list).
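
Putting these together, below is a minimal PyTorch sketch of the three components. The stem's activation, the exact form of the 2D layer norm (approximated here with GroupNorm), and all class and argument names are assumptions for illustration, not the official implementation.

```python
import torch.nn as nn

class Stem(nn.Module):
    """Two consecutive 3x3 convs, each with stride 2 (4x spatial downsampling)."""
    def __init__(self, in_ch=3, dim=64):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(in_ch, dim, 3, stride=2, padding=1),
            nn.ReLU(inplace=True),  # assumption: activation between the two convs
            nn.Conv2d(dim, dim, 3, stride=2, padding=1),
        )

    def forward(self, x):
        return self.proj(x)


class Downsampler(nn.Module):
    """2D layer norm followed by a stride-2 3x3 conv that doubles channels."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.GroupNorm(1, dim)  # stand-in for a channels-first 2D layer norm
        self.reduce = nn.Conv2d(dim, 2 * dim, 3, stride=2, padding=1)

    def forward(self, x):
        return self.reduce(self.norm(x))


class ConvBlock(nn.Module):
    """Residual conv block of stages 1-2:
    x_hat = GELU(BN(Conv3x3(x))); x = BN(Conv3x3(x_hat)) + x."""
    def __init__(self, dim):
        super().__init__()
        self.conv1, self.bn1 = nn.Conv2d(dim, dim, 3, padding=1), nn.BatchNorm2d(dim)
        self.conv2, self.bn2 = nn.Conv2d(dim, dim, 3, padding=1), nn.BatchNorm2d(dim)
        self.act = nn.GELU()

    def forward(self, x):
        x_hat = self.act(self.bn1(self.conv1(x)))
        return self.bn2(self.conv2(x_hat)) + x
```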

1.3. Hierarchical Attention

Hierarchical Attention (HAT) Approach
Proposed Hierarchical Attention block
  • With an input feature map x ∈ ℝ^{H×W×d} (assuming H = W), the feature map is first partitioned into n local windows of size k×k, with n = H²/k², where k is the window size: x̂_l = Split_{k×k}(x).
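
A common way to implement this k×k partition is the Swin-style reshape below; this is a sketch assuming a channels-last (B, H, W, C) tensor with H and W divisible by k.

```python
import torch

def window_partition(x: torch.Tensor, k: int) -> torch.Tensor:
    """Split a (B, H, W, C) map into (B * H*W/k**2, k*k, C) window sequences."""
    B, H, W, C = x.shape
    x = x.view(B, H // k, k, W // k, k, C)
    # -> (B, H//k, W//k, k, k, C), then flatten windows into the batch dim
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, k * k, C)

x = torch.randn(2, 14, 14, 192)     # illustrative feature map sizes
windows = window_partition(x, k=7)  # (2 * 4, 49, 192): n = 14**2 / 7**2 = 4 windows
```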

The key idea is the formulation of carrier tokens (CTs), which give each window an attention footprint much larger than the local window, at low cost.

  • At first, CTs are initialized by pooling to L = 2^c tokens per window:
  • The convolution here acts as an efficient positional encoding, as also used in Twins.
  • c = 1 is used.
  • These pooled tokens represent a summary of their respective local windows.
  • In every HAT block, CTs undergo the attention procedure: x̂_ct = x̂_ct + MHSA(LN(x̂_ct)), then x̂_ct = x̂_ct + MLP(LN(x̂_ct)),
  • where LN is layer norm, MHSA is multi-head self-attention, and MLP is a 2-layer MLP with GELU (see the sketch below).
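
A hedged PyTorch sketch of CT initialization and the CT attention block follows. The depthwise 3×3 conv as the positional encoding and the pooling to a square token grid are assumptions; the names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CarrierTokenInit(nn.Module):
    """Initialize L = 2**c carrier tokens per window: a conv positional
    encoding (as in Twins) followed by average pooling to n*L tokens."""
    def __init__(self, dim: int):
        super().__init__()
        self.pe = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # assumed depthwise PE

    def forward(self, x: torch.Tensor, n: int, L: int) -> torch.Tensor:
        x = self.pe(x)                        # x: (B, C, H, W)
        side = int((n * L) ** 0.5)            # assumes n*L is a perfect square
        ct = F.adaptive_avg_pool2d(x, side)   # (B, C, side, side)
        return ct.flatten(2).transpose(1, 2)  # (B, n*L, C)


class CTBlock(nn.Module):
    """Pre-norm transformer block: t = t + MHSA(LN(t)); t = t + MLP(LN(t))."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        h = self.norm1(t)
        t = t + self.attn(h, h, h, need_weights=False)[0]
        return t + self.mlp(self.norm2(t))
```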

The interaction between the local tokens x̂_l and the carrier tokens x̂_ct is then computed.

  • At first, local features and CTs are concatenated; each local window only has access to its corresponding CTs:
  • The concatenated tokens undergo another attention procedure of the same form:
  • Finally, the tokens are split back and used in the subsequent hierarchical attention layers:
  • The above procedure is applied iteratively for a number of layers in the stage.
  • To further facilitate long-range interaction, global information propagation is performed at the end of the stage, where the CTs are upsampled and merged back into the local tokens (see the sketch after this list).
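
Continuing the sketch above (and reusing CTBlock), here is how one hierarchical attention step could wire the two token streams together. The CT-to-window routing is simplified and the names are hypothetical; the end-of-stage upsampling of CTs is noted but omitted.

```python
import torch
import torch.nn as nn

class HATStep(nn.Module):
    """One hierarchical attention step: CTs attend globally to each other,
    then each window attends over [its k*k local tokens ++ its L CTs]."""
    def __init__(self, dim: int, heads: int = 8, L: int = 2):  # L = 2**c with c = 1
        super().__init__()
        self.L = L
        self.ct_attn = CTBlock(dim, heads)   # global attention among all CTs
        self.win_attn = CTBlock(dim, heads)  # window-local joint attention

    def forward(self, x_l: torch.Tensor, x_ct: torch.Tensor):
        # x_l: (B*n, k*k, C) window tokens; x_ct: (B, n*L, C) carrier tokens
        x_ct = self.ct_attn(x_ct)                      # cross-window summary exchange
        ct_w = x_ct.reshape(x_l.shape[0], self.L, -1)  # route L CTs to each window
        y = self.win_attn(torch.cat([x_l, ct_w], 1))   # window sees only its own CTs
        x_l, ct_w = y[:, :-self.L], y[:, -self.L:]     # split the streams back apart
        return x_l, ct_w.reshape(x_ct.shape)
```

At the end of the stage, the CTs would additionally be upsampled to the full token resolution and added onto the local features, realizing the global information propagation step.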

1.4. Attention Map Comparison

Attention map comparison

2. Results

2.1. Image Classification

Image Classification
  • Compared to Conv-based architectures, FasterViT achieves higher accuracy at the same throughput; for example, FasterViT outperforms ConvNeXt-T by 2.2%.
  • Considering the accuracy-throughput trade-off, FasterViT models are significantly faster than Transformer-based models such as the Swin Transformer family.
  • Furthermore, compared to hybrid models, such as the recent EfficientFormer and MaxViT (Tu et al., 2022), FasterViT on average has a higher throughput while achieving better ImageNet top-1 performance.
ImageNet-21K pretrained classification benchmarks

FasterViT-4 has a better accuracy-throughput trade-off compared to other counterparts.

2.2. Downstream Tasks

MS COCO Using Cascade Mask R-CNN Framework

FasterViT models have better accuracy-throughput trade-off when compared to other counterparts.

ADE20K Using UPerNet Framework

Similar to previous tasks, FasterViT models benefit from a better performance-throughput trade-off.


Sik-Ho Tsang

PhD, Researcher. I share what I learn. :) Linktree: https://linktr.ee/shtsang for Twitter, LinkedIn, etc.