1 Department of Electronics Engineering, Sogang University, South Korea
2 AI Lab, CTO Division, LG Electronics, South Korea
3 Department of Electrical & Electronics Engineering, Pusan National University, South Korea
Email: {hyunwoo137, dbqls1219, beoungwoo, moonsh97, sjkang}@sogang.ac.kr
Email: kbkong@pusan.ac.kr

Embedding-Free Transformer with Inference Spatial Reduction for Efficient Semantic Segmentation

Hyunwoo Yu (1, *; ORCID 0009-0009-4426-8272), Yubin Cho (1, 2, *; ORCID 0009-0001-8604-5431), Beoungwoo Kang (1, *), Seunghun Moon (1, *), Kyeongbo Kong (3, *; ORCID 0000-0002-1135-7502), Suk-Ju Kang (1, corresponding author; ORCID 0000-0002-4809-956X)
* Equal contribution
Abstract

We present an Encoder-Decoder Attention Transformer, EDAFormer, which consists of the Embedding-Free Transformer (EFT) encoder and the all-attention decoder leveraging our Embedding-Free Attention (EFA) structure. The proposed EFA is a novel global context modeling mechanism that focuses on the global non-linearity itself rather than on the specific roles of the query, key and value. For the decoder, we explore a structure optimized for global context, which improves the semantic segmentation performance. In addition, we propose a novel Inference Spatial Reduction (ISR) method for computational efficiency. Unlike previous spatial reduction attention methods, our ISR method further reduces the key-value resolution at the inference phase, which mitigates the computation-performance trade-off for efficient semantic segmentation. Our EDAFormer achieves state-of-the-art performance with efficient computation compared to existing transformer-based semantic segmentation models on three public benchmarks: ADE20K, Cityscapes and COCO-Stuff. Furthermore, our ISR method reduces the computational cost by up to 61% with minimal mIoU degradation on the Cityscapes dataset. The code is available at https://github.com/hyunwoo137/EDAFormer.

Keywords: Semantic segmentation · Embedding-free self-attention · Inference spatial reduction

1 Introduction

Semantic segmentation, which aims to obtain an accurate pixel-wise prediction for the whole image, is one of the most fundamental tasks in computer vision [32, 42] and is widely used in various downstream applications [11, 12, 34]. From CNN-based models [29, 49, 41, 51, 42, 4] to transformer-based models [55, 78, 58, 59, 40, 62, 22, 72], semantic segmentation models have been introduced in different structures. However, compared to other tasks, semantic segmentation requires a large amount of computation, as it processes high-resolution images and requires a per-pixel prediction decoder. Therefore, exploring an efficient structure for this task is a significant challenge.

With the great success of the Vision Transformer (ViT) [20], recent semantic segmentation models [65, 24, 67, 18, 46, 6, 66, 45, 43, 52] mainly utilize transformer-based structures to improve performance by modeling the global context via the self-attention mechanism, and various advanced self-attention structures [27, 23, 73, 75, 7, 35, 21, 38, 44, 67, 19, 36] have been introduced. In this paper, we analyze the general self-attention mechanism as two parts. The first assigns the input feature specific roles as the query, key and value by embedding it through linear projections with learnable parameters. The second functions as a global non-linearity: it obtains the attention weights between the query and the key via the softmax and then applies those weights to the value. We argue that the truly important part of global context modeling is the global non-linearity, not the specific roles (i.e., the query, key, and value) assigned to the input feature. We found that a simple but effective method, which removes the specific roles of the input feature, actually improves the performance. Therefore, we propose a novel self-attention structure, Embedding-Free Attention (EFA), which omits the embeddings of the query, key and value.

With this powerful module, we also propose a semantic segmentation model, Encoder-Decoder Attention Transformer (EDAFormer), which is composed of the proposed Embedding-Free Transformer (EFT) encoder and the all-attention decoder. For the encoder, we adopt a hierarchical structure and leverage our EFA module in the transformer blocks to effectively extract global context features. For the decoder, inspired by [77, 24, 31], we not only leverage our EFA, which effectively extracts the global context, but also explore which feature levels need more global attention. We empirically found that considering the global context is more effective for the higher-level features. Therefore, we design the all-attention decoder to apply a larger number of EFA modules to the higher-level features.

In addition, this paper addresses the issue that additional training of a different structure is required whenever a lighter model (for lower computation) or a heavier model (for higher accuracy) is needed. This issue causes user inconvenience and limits the versatility of lightweight methodologies.

To solve this issue, we introduce a novel Inference Spatial Reduction (ISR) method that reduces the key-value resolution more at the inference phase than at the training phase. Our ISR exploits the Spatial Reduction Attention (SRA)-based structure from a completely different perspective than the existing SRA-based models [58, 62, 65, 59, 50], as we focus on making the reduction ratio differ between training and inference. With our method, the query learns from a larger amount of key and value information during training, and copes better with the reduced key and value during inference. This has two advantages. (1) Our method reduces the computational cost with little degradation in performance. (2) Our method allows the computational cost of one pretrained model to be adjusted selectively.

We demonstrate the effectiveness of the proposed method in terms of the computational cost and performance on three public semantic segmentation benchmarks. Compared to the transformer-based semantic segmentation models, our model achieves competitive performance in terms of efficiency and accuracy. Our contributions are summarized as follows:

  • We propose a novel embedding-free attention structure that removes the specific roles of the query, key, and value but focuses on global non-linearity, thus achieving strong performance.

  • We introduce a semantic segmentation model, EDAFormer, which is designed with the EFT encoder and the all-attention decoder. Our decoder applies a larger number of the proposed EFA modules at the higher levels to capture the global context more effectively.

  • We propose a novel ISR method for efficiency, which reduces the computational cost at the inference phase with little degradation in performance and allows the computational cost of a pretrained transformer model to be adjusted selectively.

  • Our EDAFormer outperforms the existing transformer-based semantic segmentation models in terms of the efficiency and the accuracy on three public semantic segmentation benchmarks.

2 Related Works

2.1 Attention for Global Context

The importance of modeling the global context has been demonstrated by the self-attention mechanism in the transformer. Beyond the general attention method, various attention methods have been studied. [58, 59] proposed the spatial reduction attention mechanism, which reduces the key-value resolution for efficiency. [63] leveraged pyramid pooling to reduce the key-value to multi-scale resolutions. Based on the spatial reduction attention structure, [23, 75, 76] exploited the convolutional layer in the attention. The window-based attention methods [40, 39] considered local window regions for efficiency. [13] proposed local window attention combined with global attention. The convolution-based attention methods [62, 65, 23, 15] used the convolutional operation to consider the local context together with the global context. The channel reduction attention method [31] reduced the query and key channels. However, all these self-attention methods are based on the query, key and value embeddings. Different from these methods, we propose the efficient Embedding-Free Attention module, based on the observation that the global non-linearity is the important part of the attention mechanism.

2.2 Transformer-based Semantic Segmentation

Since ViT [20] achieved great performance in the image classification task, transformer-based architectures have also been studied for semantic segmentation, one of the most fundamental vision tasks. SETR [78] was the first semantic segmentation model to adopt the transformer architecture as a backbone, with a convolutional decoder. Beyond introducing effective encoder structures, recent methods such as [65] proposed efficient encoder-decoder structures for semantic segmentation. SegFormer [65] introduced a mix transformer encoder and a purely MLP-based decoder. FeedFormer [50] introduced a cross attention-based decoder to refer to the low-level feature information of the transformer encoder. VWFormer [66] used the transformer encoder and exploited window-based attention to consider the multi-scale representation in the decoder. We introduce the efficient Encoder-Decoder Attention Transformer model for semantic segmentation to effectively capture the global context at both the encoder and the decoder.

3 Proposed Method

This section introduces our Encoder-Decoder Attention Transformer (EDAFormer), which is composed of the Embedding-Free Transformer (EFT) encoder and the all-attention decoder. Additionally, we describe our Inference Spatial Reduction (ISR) method that can reduce the computational cost effectively.

Figure 1: (a) Overall architecture of the proposed EDAFormer, consisting of two main parts: an EFT encoder and an all-attention decoder. The encoder and decoder of EDAFormer are designed with the query, key and value embedding free attention structure. (b) Details of the EFT block that contains EFA module.

3.1 Overall Architecture

EDAFormer. As shown in Fig. 1 (a), we leverage a hierarchical encoder structure, which is effective in the semantic segmentation task. When the input image is $I\in\mathbb{R}^{H\times W\times 3}$, the output feature of each stage is defined as $\textbf{F}_{i}\in\mathbb{R}^{\frac{H}{2^{i+1}}\times\frac{W}{2^{i+1}}\times C_{i}}$, where $i\in\{1,2,3,4\}$ denotes the index of the encoder stage, and $C$ is the channel dimension. At each stage, the features are first downsampled by the patch embedding block before being input to the transformer block.
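To make the stage resolutions concrete, the following minimal sketch computes the shape of each encoder output from the definition above, where stage $i$ downsamples the input by $2^{i+1}$. The function name and the channel widths are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def encoder_feature_shapes(height, width, channels):
    """Return (H_i, W_i, C_i) for encoder stages i = 1..4.

    Stage i downsamples the input by a factor of 2 ** (i + 1).
    """
    shapes = []
    for i, c in enumerate(channels, start=1):
        stride = 2 ** (i + 1)
        shapes.append((height // stride, width // stride, c))
    return shapes


# Hypothetical channel widths, for illustration only.
shapes = encoder_feature_shapes(512, 512, channels=(32, 64, 160, 256))
for i, shape in enumerate(shapes, start=1):
    print(f"stage {i}: {shape}")
```

For a 512×512 input, the four stages therefore produce feature maps at 1/4, 1/8, 1/16 and 1/32 of the input resolution.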

Figure 2: Comparison of the previous method and our EFA.

As illustrated in Fig. 1 (b), the transformer block of our encoder is composed of the Embedding-Free Attention (EFA) and the Feed-Forward Layer (FFL). As shown in Fig. 2 (b), our EFA module omits the linear projections for the query Q, key K and value V embeddings, which makes it lightweight while still effectively extracting the global context. Additionally, we adopt the spatial reduction attention (SRA) structure [59] to leverage our ISR in the inference phase. We use a non-parametric operation, average pooling, to reduce the key-value spatial resolution, so that further spatial reduction in the inference phase has less impact on performance. The EFA module is formulated as follows:

$$\begin{split}&\textbf{Q}=\textbf{x}_{in},\ \textbf{K}=\textbf{V}=\mathtt{SR}(\textbf{x}_{in},R),\\ &\textbf{Att}=\mathtt{softmax}(\textbf{Q}\cdot\textbf{K}^{T}/\sqrt{d_{k}}),\ \textbf{x}_{out}=\textbf{Att}\cdot\textbf{V},\end{split} \tag{1}$$

where $\mathtt{SR}$ and $R$ denote the spatial reduction via average pooling and the reduction ratio, respectively. $\textbf{x}_{in}$ is directly used as the query, and the spatially reduced features are used as the key and value. Because the softmax is applied to the similarity scores between the query and the key, a global non-linearity is applied to the input features, allowing global context extraction without assigning the specific roles of the query, key, and value. Then, the FFL is formulated as follows:

$$\mathtt{FFL}(\textbf{x}_{in})=\mathtt{Linear}(\mathtt{DW}(\mathtt{Linear}(\textbf{x}_{in}))), \tag{2}$$

where $\mathtt{DW}$ indicates the depth-wise convolution. As the EFA and FFL are connected sequentially, the whole process of our EFT block is formulated as:

$$\begin{split}&\textbf{z}=\mathtt{EFA}(\mathtt{LN}(\textbf{x}_{in}))+\textbf{x}_{in},\\ &\textbf{x}_{out}=\mathtt{FFL}(\mathtt{LN}(\textbf{z}))+\textbf{z},\end{split} \tag{3}$$

where $\textbf{z}$ is the intermediate feature, and $\mathtt{LN}$ is layer normalization. This embedding-free structure is effective for both classification and semantic segmentation. In addition, we empirically find that our embedding-free structure is effective for our ISR in terms of the trade-off between the computation and the performance degradation.
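The attention step of Eq. (1) can be sketched in a few lines of NumPy. This is a minimal single-head illustration under stated assumptions, not the authors' implementation: the helper names are hypothetical, tokens are assumed to be stored row by row as an `(h*w, c)` array, `h` and `w` are assumed divisible by the reduction ratio, and $d_k$ is assumed to equal the channel dimension.

```python
import numpy as np


def avg_pool_tokens(x, h, w, ratio):
    """Average-pool an (h*w, c) token map by `ratio` along height and width."""
    c = x.shape[1]
    grid = x.reshape(h // ratio, ratio, w // ratio, ratio, c)
    return grid.mean(axis=(1, 3)).reshape(-1, c)


def softmax(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def embedding_free_attention(x, h, w, ratio):
    """Eq. (1): Q = x, K = V = SR(x, R), with no learned projections."""
    q = x
    kv = avg_pool_tokens(x, h, w, ratio)   # spatially reduced key and value
    d_k = x.shape[1]                       # assumption: d_k is the channel dimension
    att = softmax(q @ kv.T / np.sqrt(d_k))
    return att @ kv


rng = np.random.default_rng(0)
h, w, c = 8, 8, 16
x = rng.standard_normal((h * w, c))
out_r2 = embedding_free_attention(x, h, w, ratio=2)
out_r4 = embedding_free_attention(x, h, w, ratio=4)
print(out_r2.shape, out_r4.shape)
```

Both calls return an array with the same shape as the input, because only the number of key-value tokens depends on the reduction ratio. The function has no parameters to learn, which is the sense in which the attention is embedding-free.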

All-attention decoder. As previous models [70, 77, 24] have demonstrated, applying the SRA to the encoder features in the decoder is effective for capturing global semantic-aware features. We thus design an all-attention decoder, which consists of EFT blocks at all of the decoder stages. We also explore the optimal arrangement of EFT blocks in the decoder and find that applying more attention blocks to the high-level features is effective for capturing more semantically informative global features. As shown in Fig. 1 (a), our decoder has a hierarchical structure that utilizes 3, 2, and 1 EFT blocks at the 1st to 3rd decoder stages, respectively. This structure is composed of a larger number of transformer blocks than the decoders of previous transformer-based segmentation models, but has a lower computational cost because the EFT block is lightweight.

In the all-attention decoder, the output features $\textbf{F}_{i}$ of each encoder stage $i\in\{2,3,4\}$ are first fed into the EFT blocks in each decoder stage $j\in\{3,2,1\}$, where $j$ denotes the index of the decoder stage. Then, the features $\widehat{\textbf{F}}_{j}\in\mathbb{R}^{H_{j}\times W_{j}\times C_{j}}$ of each decoder stage are up-sampled to the $H_{2}\times W_{2}$ resolution using bilinear interpolation. These up-sampled features $\textbf{U}_{j}\in\mathbb{R}^{H_{2}\times W_{2}\times C_{j}}$ are then concatenated and passed to linear layers for fusion. Finally, the fused features are projected into a prediction mask with $C_{cls}$ channels, the number of classes, by another linear layer. This process is formulated as:

$$\begin{split}&\widehat{\textbf{F}}_{j}=\mathtt{EFT}(\mathtt{LN}(\textbf{F}_{i}))+\textbf{F}_{i},\ \forall i\\ &\textbf{U}_{j}=\mathtt{Upsample}(\widehat{\textbf{F}}_{j}),\ \textbf{F}_{c}=\mathtt{Concat}(\textbf{U}_{j}),\ \forall j\\ &\textbf{M}=\mathtt{Linear}(\mathtt{Linear}(\textbf{F}_{c})),\end{split} \tag{4}$$

where $\textbf{M}\in\mathbb{R}^{H_{2}\times W_{2}\times C_{cls}}$ is the final prediction mask.
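The fusion part of Eq. (4), upsampling, concatenation and two linear layers, can be illustrated at the level of tensor shapes. The sketch below makes several loud assumptions: the EFT blocks are omitted, nearest-neighbour upsampling stands in for the bilinear interpolation described in the text, the linear layers are bias-free random matrices, the target resolution is taken to be the largest of the three inputs, and all names, channel widths and the class count are hypothetical.

```python
import numpy as np


def upsample_nearest(feat, factor):
    """Nearest-neighbour upsampling of an (H, W, C) map (stand-in for bilinear)."""
    return feat.repeat(factor, axis=0).repeat(factor, axis=1)


def fuse_decoder_features(features, target_hw, w_fuse, w_cls):
    """Upsample each map, concatenate along channels, apply two linear layers."""
    upsampled = []
    for feat in features:
        factor = target_hw[0] // feat.shape[0]   # assumes equal factors in H and W
        upsampled.append(upsample_nearest(feat, factor))
    fused = np.concatenate(upsampled, axis=-1)    # F_c in Eq. (4)
    hidden = fused @ w_fuse                       # first Linear (fusion)
    return hidden @ w_cls                         # second Linear -> mask M


rng = np.random.default_rng(0)
# Hypothetical decoder outputs at 1/8, 1/16 and 1/32 of a 256x256 input.
feats = [
    rng.standard_normal((32, 32, 8)),
    rng.standard_normal((16, 16, 16)),
    rng.standard_normal((8, 8, 32)),
]
num_classes = 5
w_fuse = rng.standard_normal((8 + 16 + 32, 24))
w_cls = rng.standard_normal((24, num_classes))
mask = fuse_decoder_features(feats, (32, 32), w_fuse, w_cls)
print(mask.shape)
```

The resulting mask has one channel per class at the shared target resolution, matching the shape stated for $\textbf{M}$.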

Figure 3: Overview of our ISR method at the encoder stage-1. Our ISR applies the reduction ratio at inference, reducing the key and value tokens selectively. This framework can be applied at every stage that contains the self-attention structure. It flexibly reduces the computational cost without disrupting the spatial structure.

3.2 Inference Spatial Reduction Method

Different from previous SRA, our inference spatial reduction (ISR) method reduces the key-value spatial resolution at the inference phase. Our method achieves computational efficiency by changing a hyperparameter, the reduction ratio $R$ of the average pooling in the EFA module. Our ISR can be used in self-attention structures because reducing the resolution of the key and value does not affect the shape of the input and output features. The reduction ratio can therefore be adjusted during inference without affecting the resolution of the input and output features.

Reducing the key and value resolution heavily at training has the advantage of computational efficiency, but leads to performance degradation because the query cannot consider enough information from the key and value. To address this issue, our ISR alleviates the trade-off between the computational cost and the accuracy by applying the additional reduction of the key and value resolution only at inference. In this part, we describe how our ISR is applied to our EDAFormer, which is the optimized architecture for applying our ISR effectively.

As shown in Fig. 1, our EDAFormer uses the proposed transformer blocks in both the encoder and the decoder. Each pooling-based SRA used in each encoder stage and decoder stage has a corresponding reduction ratio setting that reduces the key and value resolution. At training, as illustrated in Fig. 3, the reduction ratios $t_{E}^{i}$ of the encoder stages are set to [8, 4, 2, 1], which is the default setting of previous models [58, 65, 59] using SRA. The reduction ratios $t_{D}^{j}$ of the decoder stages, each of which takes the features of one encoder stage, are set to [1, 2, 4], which are equal to the reduction ratios of the corresponding encoder stages. $t_{E}$ and $t_{D}$ denote the reduction ratios of the encoder and decoder at training, respectively. The computational complexity of the previous attention is as follows:

$$\mathrm{\Omega}(\mathtt{SRA})=2\frac{(hw)^{2}}{r^{2}}c, \tag{5}$$

where $\mathrm{\Omega}$ and $\mathtt{SRA}$ denote the computational complexity and the spatial reduction attention, respectively. $h$, $w$ and $c$ represent the height, the width and the channel dimension of the features, respectively. $r$ is the reduction ratio at the training phase.
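One way to read Eq. (5): the query has $hw$ tokens, the pooled key and value each have $hw/r^{2}$ tokens, and the two matrix products (query times key, attention times value) each cost $hw\cdot\frac{hw}{r^{2}}\cdot c$ multiply-accumulates. The short sketch below evaluates the formula; the function name and the example sizes are hypothetical.

```python
def sra_complexity(h, w, c, r):
    """Eq. (5): Omega(SRA) = 2 * (h * w)^2 / r^2 * c."""
    return 2 * (h * w) ** 2 * c / r ** 2


# Hypothetical stage-1 feature map: 128 x 128 tokens with 32 channels.
base = sra_complexity(128, 128, 32, r=1)
reduced = sra_complexity(128, 128, 32, r=8)
print(base / reduced)
```

The ratio of the two costs is $r^{2}$, so a reduction ratio of 8 cuts the attention cost of that stage by a factor of 64.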

Under these reduction ratio settings, we train our EDAFormer to get pretrained weights. After that, at the inference phase, the user can optionally adjust the inference computational reduction by selecting the reduction ratios. As shown in Fig. 3, $r_{E}^{i}$ and $r_{D}^{j}$ denote the reduction ratios of the encoder and decoder at inference, respectively. They are formulated as:

$$\begin{split}&r_{E}^{i}=t_{E}^{i}\times a_{E}^{i},\ \forall i\\ &r_{D}^{j}=t_{D}^{j}\times a_{D}^{j},\ \forall j\end{split} \tag{6}$$

where $a_{E}^{i}$ and $a_{D}^{j}$ denote the additional reduction ratios of the encoder and decoder at inference, respectively. After applying our ISR, the computational complexity is as follows:

$$\mathrm{\Omega}(\mathtt{ISR}(\mathtt{SRA}))=2\frac{(hw)^{2}}{r^{2}a^{2}}c, \tag{7}$$

where $\mathtt{ISR}$ is the inference spatial reduction and $a$ is the additional reduction ratio at inference. Therefore, one of the advantages of our ISR is that it obtains the computational reduction simply, on the pretrained model and without additional training. Our ISR also causes less performance degradation than training with the key-value tokens reduced by $r^{2}a^{2}$. Empirically, the optimal setting is [16,8,2,1]-[2,4,8] in the encoder-decoder, which gives the best ratio of performance degradation to computational cost reduction.
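As a worked instance of Eq. (6) and Eq. (7), the sketch below reproduces the reported inference setting [16,8,2,1]-[2,4,8] from the training ratios [8, 4, 2, 1] and [1, 2, 4]. The additional ratios are not listed in the text; the values used here are inferred by dividing the inference ratios by the training ratios, and the function names are hypothetical.

```python
def inference_ratios(train_ratios, extra_ratios):
    """Eq. (6): r = t * a, stage by stage."""
    return [t * a for t, a in zip(train_ratios, extra_ratios)]


def isr_cost_factor(extra_ratio):
    """Eq. (7) divided by Eq. (5): the attention cost scales by 1 / a^2."""
    return 1 / extra_ratio ** 2


t_enc, t_dec = [8, 4, 2, 1], [1, 2, 4]
# Inferred so that r matches the reported setting; not stated in the text.
a_enc, a_dec = [2, 2, 1, 1], [2, 2, 2]

r_enc = inference_ratios(t_enc, a_enc)
r_dec = inference_ratios(t_dec, a_dec)
print(r_enc, r_dec)
print([isr_cost_factor(a) for a in a_enc])
```

Under these inferred values, the attention cost of the first two encoder stages and of every decoder stage drops to one quarter, while encoder stages 3 and 4 are unchanged. This factor applies to the attention term of Eq. (5) only, not to the total GFLOPs of the model.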

4 Experiment

| Method | Params (M) | ADE20K GFLOPs ↓ | ADE20K mIoU (%) ↑ | Cityscapes GFLOPs ↓ | Cityscapes mIoU (%) ↑ | COCO-Stuff GFLOPs ↓ | COCO-Stuff mIoU (%) ↑ |
|---|---|---|---|---|---|---|---|
| Segformer-B0 [65] | 3.8 | 8.4 | 37.4 | 125.5 | 76.2 | 8.4 | 35.6 |
| FeedFormer [50] | 4.5 | 7.8 | 39.2 | 107.4 | 77.9 | - | - |
| VWFormer-B0 [66] | 3.7 | 5.1 | 38.9 | - | 77.2 | 5.1 | 36.2 |
| EDAFormer-T (w/o ISR) | 4.9 | 5.6 | 42.3 | 151.7 | 78.7 | 5.6 | 40.3 |
| EDAFormer-T (w/ ISR) | 4.9 | 4.7 | 42.1 | 94.9 | 78.7 | 4.7 | 40.3 |
| OCRNet [17] | 70.5 | 164.8 | 45.6 | 1296.8 | 81.1 | - | - |
| Swin UperNet-T [40] | 60.0 | 236.0 | 44.4 | - | - | - | - |
| ContrastiveSeg [57] | 58.0 | - | - | - | 79.2 | - | - |
| SenFormer [2] | 144.0 | 179.0 | 46.0 | - | - | - | - |
| Segformer-B2 [65] | 27.5 | 62.4 | 46.5 | 717.1 | 81.0 | 62.4 | 44.6 |
| ProtoSeg [80] | 90.5 | - | 48.6 | - | 80.6 | - | 42.4 |
| MaskFormer [10] | 42.0 | 55.0 | 46.7 | - | - | - | - |
| Mask2Former [9] | 47.0 | 74.0 | 47.7 | - | - | - | - |
| FeedFormer-B2 [50] | 29.1 | 42.7 | 48.0 | 522.7 | 81.5 | - | - |
| VWFormer-B2 [66] | 27.4 | 38.5 | 48.1 | - | 81.7 | 38.5 | 45.2 |
| EDAFormer-B (w/o ISR) | 29.4 | 32.0 | 49.0 | 605.9 | 81.6 | 32.0 | 45.9 |
| EDAFormer-B (w/ ISR) | 29.4 | 29.4 | 48.9 | 452.9 | 81.6 | 29.4 | 45.8 |

Table 1: Comparison with the transformer-based state-of-the-art semantic segmentation models on three public datasets. GFLOPs are computed using $512\times 512$ resolution for ADE20K and COCO-Stuff, and $2048\times 1024$ resolution for Cityscapes.

4.1 Experimental Settings

Datasets. ADE20K [79] is a challenging scene parsing dataset captured indoors and outdoors. It consists of 150 semantic categories and 20,210/2,000/3,352 images for training, validation, and testing, respectively. Cityscapes [14] is an urban driving scene dataset that contains 5,000 finely annotated images with 19 semantic categories. It consists of 2,975/500/1,525 images in the training, validation, and test sets. COCO-Stuff [3] is a challenging dataset, which contains 164,062 images labeled with 172 semantic categories.

Implementation details. The mmsegmentation codebase was used to train our model on 4 RTX 3090 GPUs. We pretrained our encoder on ImageNet-1K [16], and our decoder was randomly initialized. For classification and segmentation evaluation, we adopted Top-1 accuracy and mean Intersection over Union (mIoU), respectively. We applied the same training settings and data augmentation as PVTv2 [58] for ImageNet pretraining. We applied random horizontal flipping, random scaling with a ratio of 0.5-2.0 and random cropping with the size of 512×512, 1024×1024, and 512×512 for ADE20K, Cityscapes, and COCO-Stuff, respectively. The batch size was 16 for ADE20K and COCO-Stuff, and 8 for Cityscapes. We used the AdamW optimizer for 160K iterations on ADE20K, Cityscapes and COCO-Stuff.

4.2 Comparison with State-of-the-art Methods

| Models | Params (M) | GFLOPs | Top-1 Acc. (%) |
|---|---|---|---|
| RSB-ResNet-18 [29, 61] | 12 | 1.8 | 70.6 |
| PVTv2-B0 [59] | 3.4 | 0.6 | 70.5 |
| MiT-B0 [65] | 3.7 | 0.6 | 70.5 |
| EFT-T (Ours) | 3.7 | 0.6 | 72.3 |
| ResNet50 [29] | 25.5 | 4.1 | 78.5 |
| RSB-ResNet-152 [29, 61] | 60.0 | 11.6 | 81.8 |
| DeiT-S [54] | 22.0 | 4.6 | 79.8 |
| PVT-Small [58] | 25.0 | 3.8 | 79.8 |
| PVTv2-B2 [59] | 25.4 | 4.0 | 82.0 |
| MiT-B2 [65] | 25.4 | 4.0 | 81.6 |
| T2T-ViT-14 [74] | 21.5 | 4.8 | 81.5 |
| TNT-S [26] | 23.8 | 4.8 | 81.5 |
| ResMLP-S24 [53] | 30.0 | 6.0 | 79.4 |
| Swin-Mixer-T/D6 [40] | 23.0 | 4.0 | 79.7 |
| Visformer-S [8] | 40.2 | 4.8 | 82.1 |
| gMLP-S [37] | 20.0 | 4.5 | 79.6 |
| PoolFormer-S36 [71] | 31.0 | 5.0 | 81.4 |
| EfficientFormer-L3 [35] | 31.3 | 3.9 | 82.4 |
| FasterViT-0 [27] | 31.4 | 3.3 | 82.1 |
| EFT-B (Ours) | 25.4 | 4.2 | 82.4 |

Table 2: Comparison with the previous models on ImageNet. GFLOPs were computed with 224×224.

Semantic segmentation. In Table 1, we compared our EDAFormer with the previous transformer-based methods on three public datasets. The comparison includes the parameter size, FLOPs, and mIoU performance. Our lightweight model, EDAFormer-T (w/ ISR), showed 42.1%, 78.7% and 40.3% mIoU on ADE20K, Cityscapes and COCO-Stuff, respectively, and our larger model, EDAFormer-B (w/ ISR), yielded 48.9%, 81.6% and 45.8% mIoU. Compared to previous methods, both of our EDAFormer models achieved state-of-the-art performance with efficient computation.
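To quantify how much computation ISR saves under the Table 1 settings, the following small sketch computes the relative GFLOPs reduction between the w/o ISR and w/ ISR rows. The numbers are copied from Table 1; the function and variable names are hypothetical.

```python
def relative_reduction(before, after):
    """Percentage drop from `before` to `after`."""
    return 100.0 * (before - after) / before


# (GFLOPs without ISR, GFLOPs with ISR), copied from Table 1.
gflops = {
    ("EDAFormer-T", "ADE20K"): (5.6, 4.7),
    ("EDAFormer-T", "Cityscapes"): (151.7, 94.9),
    ("EDAFormer-B", "ADE20K"): (32.0, 29.4),
    ("EDAFormer-B", "Cityscapes"): (605.9, 452.9),
}

savings = {key: relative_reduction(*pair) for key, pair in gflops.items()}
for (model, dataset), pct in savings.items():
    print(f"{model} on {dataset}: {pct:.1f}% fewer GFLOPs")
```

For these rows the saving is largest on Cityscapes, where the input resolution, and therefore the share of computation spent in attention, is highest.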

EFT encoder on ImageNet. In Table 2, we compared our Embedding-Free Transformer (EFT) encoder with the existing models on ImageNet-1K classification. Our EFT achieved higher performance than other transformer models. This result indicates that our EFT backbone is effective in the classification task by considering the spatial information globally even without the embeddings of the query, key and value.

4.3 Effectiveness of our EFA at Decoder

To verify the effectiveness of considering the global context at the decoder, we compared different operations at the Embedding-Free Attention (EFA) position of the EFT block in Table 3 (a). The applied operations are the local context operations (i.e., DW Conv, Conv) and the global context operations (i.e., w/ embedding attention, w/o embedding attention). Our w/o embedding structure improved mIoU by 1.6% and 2.4% over the depth-wise convolution and the standard convolution, respectively. These results show that capturing the global context in the decoder is important for improving mIoU. While the w/ embedding method outperformed the local context operations by capturing the global context, our EFA further improved mIoU by 0.8% with fewer parameters and FLOPs. This indicates that our EFA module better models the global context.

4.4 Structural Analysis of our All-attention Decoder

Our decoder has a hierarchical {3-2-1} structure with six EFT blocks, which assigns more attention blocks to the high-level semantic features. In Table 3 (b), we verified the effectiveness of this structure against three alternatives. The {2-2-2} structure assigns two EFT blocks to every decoder stage. The {1-2-3}, {1-4-1} and our {3-2-1} structures allocate more EFT blocks to decoder stage-3, stage-2 and stage-1, respectively. Our {3-2-1} structure, which assigns more attention to the higher-level features, outperformed {2-2-2}, {1-2-3} and {1-4-1} by 0.8%, 1.7% and 1.8% mIoU, respectively. These results indicate that allocating the additional attention layers to the higher-level features, which contain richer semantic information, is more effective for semantic segmentation.

(a) Effectiveness of our EFA for the decoder

| Operation | Params (M) | ADE20K GFLOPs | ADE20K mIoU (%) |
| --- | --- | --- | --- |
| DW Conv | 4.5 | 5.1 | 40.7 |
| Conv | 6.6 | 6.0 | 39.9 |
| w/ embedding | 5.7 | 6.1 | 41.5 |
| w/o embedding | 4.9 | 5.6 | 42.3 |

(b) Ablation on the number of EFA at each decoder stage

| Stage-1 | Stage-2 | Stage-3 | Params (M) | ADE20K GFLOPs | ADE20K mIoU (%) |
| --- | --- | --- | --- | --- | --- |
| 2 | 2 | 2 | 4.6 | 5.7 | 41.5 |
| 1 | 2 | 3 | 4.2 | 5.8 | 40.6 |
| 1 | 4 | 1 | 4.4 | 5.7 | 40.5 |
| 3 | 2 | 1 | 4.9 | 5.6 | 42.3 |

Table 3: Ablation studies of our all-attention decoder structure on the validation set of ADE20K. Our EFT encoder is used as the backbone.
Reduction ratios are written as $[r_E^1, r_E^2, r_E^3, r_E^4]$-$[r_D^1, r_D^2, r_D^3]$.

(a) EDAFormer-T with the different reduction ratio at inference.

| Train ratio | Inference ratio | Params (M) | ADE20K GFLOPs ↓ | ADE20K mIoU (%) ↑ | Cityscapes GFLOPs ↓ | Cityscapes mIoU (%) ↑ | COCO-Stuff GFLOPs ↓ | COCO-Stuff mIoU (%) ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| [8, 4, 2, 1]-[1, 2, 4] | [8, 4, 2, 1]-[1, 2, 4] | 4.9 | 5.6 | 42.3 | 151.7 | 78.7 | 5.6 | 40.3 |
| [8, 4, 2, 1]-[1, 2, 4] | [8, 4, 2, 1]-[2, 4, 8] | 4.9 | 5.3 (-5.4%) | 42.2 (-0.1) | 133.6 (-11.9%) | 78.7 (-0.0) | 5.3 (-5.4%) | 40.3 (-0.0) |
| [8, 4, 2, 1]-[1, 2, 4] | [16, 8, 2, 1]-[2, 4, 8] | 4.9 | 4.7 (-16.1%) | 42.1 (-0.2) | 94.9 (-37.4%) | 78.7 (-0.0) | 4.7 (-16.1%) | 40.3 (-0.0) |
| [8, 4, 2, 1]-[1, 2, 4] | [16, 8, 4, 2]-[2, 4, 8] | 4.9 | 4.1 (-26.8%) | 41.3 (-1.0) | 59.1 (-61.0%) | 78.1 (-0.6) | 4.1 (-26.8%) | 39.1 (-1.2) |
| [8, 4, 2, 1]-[1, 2, 4] | [16, 8, 4, 2]-[2, 4, 8] (fine-tuned) | 4.9 | 4.1 (-26.8%) | 42.1 (-0.2) | 59.1 (-61.0%) | 78.6 (-0.1) | 4.1 (-26.8%) | 40.2 (-0.1) |

(b) EDAFormer-B with the different reduction ratio at inference.

| Train ratio | Inference ratio | Params (M) | ADE20K GFLOPs ↓ | ADE20K mIoU (%) ↑ | Cityscapes GFLOPs ↓ | Cityscapes mIoU (%) ↑ | COCO-Stuff GFLOPs ↓ | COCO-Stuff mIoU (%) ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| [8, 4, 2, 1]-[1, 2, 4] | [8, 4, 2, 1]-[1, 2, 4] | 29.4 | 32.0 | 49.0 | 605.9 | 81.6 | 32.0 | 45.9 |
| [8, 4, 2, 1]-[1, 2, 4] | [8, 4, 2, 1]-[2, 4, 8] | 29.4 | 31.3 (-2.2%) | 48.9 (-0.1) | 569.0 (-6.1%) | 81.6 (-0.0) | 31.3 (-2.2%) | 45.8 (-0.1) |
| [8, 4, 2, 1]-[1, 2, 4] | [16, 8, 2, 1]-[2, 4, 8] | 29.4 | 29.4 (-8.1%) | 48.9 (-0.1) | 452.9 (-25.3%) | 81.6 (-0.0) | 29.4 (-8.1%) | 45.8 (-0.1) |
| [8, 4, 2, 1]-[1, 2, 4] | [16, 8, 4, 2]-[2, 4, 8] | 29.4 | 26.6 (-16.9%) | 48.3 (-0.7) | 298.1 (-50.8%) | 81.4 (-0.2) | 26.6 (-16.9%) | 45.0 (-0.9) |
| [8, 4, 2, 1]-[1, 2, 4] | [16, 8, 4, 2]-[2, 4, 8] (fine-tuned) | 29.4 | 26.6 (-16.9%) | 48.7 (-0.3) | 298.1 (-50.8%) | 81.6 (-0.0) | 26.6 (-16.9%) | 45.7 (-0.2) |

Table 4: Computation and performance of our model on three standard benchmarks. The first row of each block applies the same reduction ratio at training and inference. Rows marked (fine-tuned) indicate fine-tuning. The optimal inference reduction ratio for our EDAFormer is [16, 8, 2, 1]-[2, 4, 8].

4.5 Effectiveness of our ISR in our EDAFormer

In Table 4, we verified the effectiveness of our Inference Spatial Reduction (ISR) method in the proposed EDAFormer-T and EDAFormer-B, and empirically found the optimal reduction ratio. At training, our EDAFormer was trained with the base setting of [8,4,2,1]-[1,2,4]. At inference, we applied our ISR to the decoder only (i.e., [8,4,2,1]-[2,4,8]), to part of the encoder-decoder (i.e., [16,8,2,1]-[2,4,8]), and to all of the encoder-decoder (i.e., [16,8,4,2]-[2,4,8]). The setting of [16,8,2,1]-[2,4,8] offered the best trade-off between computational savings and accuracy degradation. Compared to EDAFormer-T with the base setting, EDAFormer-T with the optimal setting reduced the computation by 16.1%, 37.4% and 16.1% on ADE20K, Cityscapes and COCO-Stuff, respectively. The performance dropped by only 0.2% mIoU on ADE20K and did not drop on COCO-Stuff and Cityscapes. Furthermore, EDAFormer-B reduced the computation by 8.1% with only 0.1% mIoU degradation on ADE20K and COCO-Stuff, and reduced the computation by 25.3% without performance degradation on Cityscapes. These results indicate that our ISR method is simple, yet significantly reduces the computational cost with little performance degradation. Notably, these gains were obtained only by adjusting the reduction ratio at inference, without fine-tuning.

Although our ISR is effective without fine-tuning, we also fine-tuned the models for 40K iterations to further compensate for the performance degradation at the higher reduction ratio of [16,8,4,2]-[2,4,8]. As a result, EDAFormer-T showed a 0.2% drop in mIoU on ADE20K, and 0.1% drops in mIoU on Cityscapes and COCO-Stuff. EDAFormer-B showed 0.3% and 0.2% drops in mIoU on ADE20K and COCO-Stuff, respectively, and no drop in mIoU on Cityscapes.
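
To make the mechanism concrete, the sketch below average-pools a grid of key-value tokens by a reduction ratio that is chosen at call time, and counts the multiply-accumulate operations of the two attention matrix products. Average pooling is the reduction operation this paper adopts; the function names, the toy 4×4 grid, and the cost formula (which covers only the attention products, not the whole network) are illustrative assumptions.

```python
def average_pool_tokens(tokens, height, width, ratio):
    """Average-pool a row-major (height x width) token grid by `ratio` per side.

    Pooling has no learned weights, so `ratio` can differ between training
    and inference without touching the model parameters.
    """
    if height % ratio or width % ratio:
        raise ValueError("ratio must divide both height and width")
    channels = len(tokens[0])
    pooled = []
    for top in range(0, height, ratio):
        for left in range(0, width, ratio):
            window = [
                tokens[(top + dy) * width + (left + dx)]
                for dy in range(ratio)
                for dx in range(ratio)
            ]
            pooled.append([
                sum(token[c] for token in window) / len(window)
                for c in range(channels)
            ])
    return pooled


def attention_macs(num_queries, num_kv, channels):
    """Multiply-accumulates of Q @ K^T plus (attention weights) @ V."""
    return 2 * num_queries * num_kv * channels


height, width = 4, 4
tokens = [[float(i)] for i in range(height * width)]  # toy grid, one channel

for ratio in (1, 2, 4):
    kv = average_pool_tokens(tokens, height, width, ratio)
    macs = attention_macs(len(tokens), len(kv), channels=1)
    print(f"ratio={ratio}: {len(kv)} key-value tokens, {macs} attention MACs")
```

Doubling the ratio cuts the number of key-value tokens, and hence the cost of both attention products, by a factor of four, while the query resolution and the output resolution stay unchanged.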

4.6 Comparison between the Model with and without ISR

In Table 5 (a), we compared our model w/ ISR against the model w/o ISR, which used the same reduction ratio of [16,8,2,1]-[2,4,8] at both training and inference. Our EDAFormer w/ ISR was trained with the reduction ratio of [8,4,2,1]-[1,2,4], and the ratio was adjusted to [16,8,2,1]-[2,4,8] at inference. Despite the same computation at inference, the model w/ ISR achieved 0.5% higher mIoU than the model w/o ISR for both EDAFormer-T and EDAFormer-B. Therefore, our model w/ ISR, which has access to sufficient key and value information during training, can achieve better performance than a model whose key and value resolution is already reduced during training.

Reduction ratios are written as $[r_E^1, r_E^2, r_E^3, r_E^4]$-$[r_D^1, r_D^2, r_D^3]$.

(a) Comparisons of our models with and without our ISR method

| Model | Method | Train ratio | Inference ratio | COCO-Stuff GFLOPs | COCO-Stuff mIoU (%) |
| --- | --- | --- | --- | --- | --- |
| EDAFormer-T | w/o ISR | [16, 8, 2, 1]-[2, 4, 8] | [16, 8, 2, 1]-[2, 4, 8] | 4.7 | 39.8 |
| EDAFormer-T | w/ ISR | [8, 4, 2, 1]-[1, 2, 4] | [16, 8, 2, 1]-[2, 4, 8] | 4.7 | 40.3 |
| EDAFormer-B | w/o ISR | [16, 8, 2, 1]-[2, 4, 8] | [16, 8, 2, 1]-[2, 4, 8] | 29.4 | 45.3 |
| EDAFormer-B | w/ ISR | [8, 4, 2, 1]-[1, 2, 4] | [16, 8, 2, 1]-[2, 4, 8] | 29.4 | 45.8 |

(b) Effectiveness of our EFA structure for our ISR

| Method | Reduction ratio | Params (M) | Cityscapes GFLOPs | Cityscapes mIoU (%) |
| --- | --- | --- | --- | --- |
| w/ embedding | [8, 4, 2, 1]-[1, 2, 4] | 5.7 | 153.5 | 78.7 |
| w/ embedding | [8, 4, 2, 1]-[2, 4, 8] | 5.7 | 134.7 | 78.5 (-0.2) |
| w/ embedding | [8, 4, 2, 1]-[3, 6, 9] | 5.7 | 131.2 | 78.2 (-0.5) |
| w/ embedding | [8, 4, 2, 1]-[4, 8, 12] | 5.7 | 130.0 | 77.9 (-0.8) |
| w/o embedding | [8, 4, 2, 1]-[1, 2, 4] | 4.9 | 151.7 | 78.7 |
| w/o embedding | [8, 4, 2, 1]-[2, 4, 8] | 4.9 | 133.6 | 78.7 (-0.0) |
| w/o embedding | [8, 4, 2, 1]-[3, 6, 9] | 4.9 | 130.3 | 78.7 (-0.0) |
| w/o embedding | [8, 4, 2, 1]-[4, 8, 12] | 4.9 | 129.1 | 78.6 (-0.1) |

Table 5: (a) Ablation for mIoU (%) performance comparisons of our models with and without our ISR method on COCO-Stuff. (b) Ablation for the effectiveness of our EFA structure for our ISR on Cityscapes val.

4.7 Effectiveness of Embedding-Free Structure for ISR

To verify the effectiveness of our embedding-free structure for ISR, we experimented with an ablated model in which attention w/ embedding replaces our EFA in the all-attention decoder. In Table 5 (b), we compared our model with this ablated model (i.e., w/ embedding) by applying our ISR to the decoder stages. The w/ embedding structure showed gradual performance degradation as the reduction ratio increased, and at the reduction ratio of [8,4,2,1]-[4,8,12] its mIoU decreased by 0.8%. In contrast, our structure showed no performance degradation up to the reduction ratio of [8,4,2,1]-[3,6,9], and only a 0.1% drop in mIoU at the reduction ratio of [8,4,2,1]-[4,8,12]. This indicates that our w/o embedding structure works well with the proposed ISR method.

Reduction ratios are written as $[r_E^1, r_E^2, r_E^3, r_E^4]$-$[r_D^1, r_D^2, r_D^3]$.

(a) Comparison of different spatial reduction methods for our ISR

| Method | Reduction ratio | Cityscapes mIoU (%) ↑ | FPS (img/s) ↑ |
| --- | --- | --- | --- |
| w/o ISR | [8, 4, 2, 1]-[1, 2, 4] | 78.7 | 10.2 |
| Bipartite matching [1] | [11.2, 5.6, 2.8, 1.4]-[1.4, 2.8, 5.6] | 78.7 | 10.5 |
| Max pooling | [16, 8, 2, 1]-[2, 4, 8] | 78.4 | 13.3 |
| Overlapped pooling | [16, 8, 2, 1]-[2, 4, 8] | 78.7 | 13.2 |
| Average pooling | [16, 8, 2, 1]-[2, 4, 8] | 78.7 | 13.3 |

(b) Inference speed improvement by increasing the reduction ratio

| Method | Reduction ratio | Cityscapes mIoU (%) ↑ | FPS (img/s) ↑ |
| --- | --- | --- | --- |
| Average pooling | [8, 4, 2, 1]-[1, 2, 4] | 78.7 | 10.2 |
| Average pooling | [8, 4, 2, 1]-[2, 4, 8] | 78.7 | 11.0 (+7.8%) |
| Average pooling | [16, 8, 2, 1]-[2, 4, 8] | 78.7 | 13.2 (+29.4%) |
| Average pooling | [16, 8, 4, 2]-[2, 4, 8] | 78.1 | 15.0 (+47.1%) |

Table 6: (a) Performance and inference speed of our ISR with different spatial reduction methods. (b) Inference speed by increasing the reduction ratio.
| Models | Params (M) | GFLOPs ↓ | mIoU (%) ↑ |
| --- | --- | --- | --- |
| CvT [62] | 21.0 | 365.5 | 80.1 |
| CvT + ISR | 21.0 | 222.6 (-39.1%) | 79.8 (-0.3) |
| MViT [72] | 32.0 | 1435.6 | 80.5 |
| MViT + ISR | 32.0 | 838.0 (-41.6%) | 80.3 (-0.2) |
| LVT [67] | 5.0 | 132.1 | 79.6 |
| LVT + ISR | 5.0 | 86.1 (-34.8%) | 79.5 (-0.1) |
| Swin [40] | 36.2 | 272.2 | 79.7 |
| Swin + ISR | 36.2 | 208.0 (-23.6%) | 79.0 (-0.7) |
| DaViT [18] | 36.2 | 304.8 | 81.3 |
| DaViT + ISR | 36.2 | 242.0 (-20.6%) | 80.9 (-0.4) |
| PVTv2 [59] | 4.8 | 121.8 | 78.6 |
| PVTv2 + ISR | 4.8 | 63.4 (-47.9%) | 78.3 (-0.3) |
| MiT [65] | 4.9 | 117.4 | 78.2 |
| MiT + ISR | 4.9 | 59.0 (-49.7%) | 77.6 (-0.6) |
| SegFormer [65] | 3.8 | 125.5 | 76.2 |
| SegFormer + ISR | 3.8 | 82.5 (-34.3%) | 75.6 (-0.6) |
| FeedFormer [50] | 4.5 | 107.5 | 77.9 |
| FeedFormer + ISR | 4.5 | 68.8 (-36.0%) | 77.4 (-0.5) |
| EDAFormer (Ours) | 4.9 | 151.7 | 78.7 |
| EDAFormer + ISR (Ours) | 4.9 | 94.9 (-37.4%) | 78.7 (-0.0) |

Table 7: Applying our ISR without fine-tuning to various transformer-based models on Cityscapes val.

4.8 Comparison of Spatial Reduction Methods for ISR

In Table 6 (a), we compared key-value spatial reduction methods in terms of mIoU and inference speed (FPS). The bipartite matching-based pooling [1] showed no mIoU degradation even though it was applied to every encoder-decoder stage. However, bipartite matching can remove at most 50% of the tokens, which corresponds to a reduction ratio of $r = 1.4$ ($\approx \sqrt{2}$), because it divides the tokens into two sets and merges them. In addition, this method incurs additional latency from the matching algorithm. Therefore, bipartite matching showed FPS similar to w/o ISR even though it reduces the computation of the attention. Max pooling showed a drop of 0.3% mIoU, and overlapped pooling was slightly slower than average pooling. Therefore, we adopted average pooling to reduce the tokens, since it is a simple, general-purpose operation and offers the best trade-off between mIoU and inference speed.
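
The link between a 50% token reduction and a ratio of about $\sqrt{2}$ follows from the ratio acting on each spatial side, so the token count scales with $1/r^2$. A short sketch of that arithmetic, under the assumption that the bipartite matching row simply scales the base ratios by 1.4 (the helper names are hypothetical):

```python
import math


def ratio_for_kept_fraction(kept_fraction):
    """Per-side reduction ratio r that keeps `kept_fraction` of the tokens.

    A ratio r shrinks height and width by r each, so kept_fraction = 1 / r**2.
    """
    return math.sqrt(1.0 / kept_fraction)


def scale_ratios(base_ratios, factor):
    """Multiply every stage ratio by `factor`, rounded to one decimal."""
    return [round(r * factor, 1) for r in base_ratios]


r = ratio_for_kept_fraction(0.5)  # bipartite matching merges at most half
print(f"r = {r:.3f}")

encoder = scale_ratios([8, 4, 2, 1], 1.4)
decoder = scale_ratios([1, 2, 4], 1.4)
print(encoder, decoder)
```

By the same relation, the pooling-based setting that doubles a stage ratio keeps only a quarter of that stage's key-value tokens, which is a reduction bipartite matching cannot reach in a single step.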

4.9 Inference Speed Enhancement

In Table 6 (b), we report the inference speed (FPS) for various reduction ratios. We measured the inference speed on a single RTX 3090 GPU without any additional acceleration techniques. Compared to the base setting, applying our ISR improved FPS by 29.4% and 47.1% at the reduction ratios of [16,8,2,1]-[2,4,8] and [16,8,4,2]-[2,4,8], respectively. The inference speed increased as the computational cost was reduced by increasing the reduction ratio. These results indicate that the computational reduction achieved by our ISR translates into an improvement in actual inference speed.
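
The reported speed-ups can be set against the FLOPs reductions with a few lines of arithmetic. The sketch below pairs the reported FPS values with the EDAFormer-T Cityscapes GFLOPs from Table 4 (a); that pairing is an assumption based on the matching mIoU values, and the "FLOPs-proportional" gain assumes latency exactly proportional to FLOPs, which real hardware does not guarantee.

```python
# (GFLOPs, FPS) per inference reduction ratio; the first entry is the base setting.
reported = {
    "[8,4,2,1]-[1,2,4]": (151.7, 10.2),
    "[8,4,2,1]-[2,4,8]": (133.6, 11.0),
    "[16,8,2,1]-[2,4,8]": (94.9, 13.2),
    "[16,8,4,2]-[2,4,8]": (59.1, 15.0),
}


def percent_gain(new, old):
    """Relative increase of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


base_gflops, base_fps = reported["[8,4,2,1]-[1,2,4]"]
for ratio, (gflops, fps) in reported.items():
    measured = percent_gain(fps, base_fps)
    # Hypothetical upper bound if runtime scaled linearly with FLOPs.
    ideal = percent_gain(base_gflops, gflops)
    print(f"{ratio}: measured +{measured:.1f}% FPS, FLOPs-proportional +{ideal:.1f}%")
```

The measured gains stay below the FLOPs-proportional figures, which is plausible because memory access and the layers that ISR does not touch also contribute to latency.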

4.10 Applying ISR to Various Transformer-based Models

Our ISR can be applied not only to our EDAFormer, but also to other transformer-based models by using the additional spatial reduction at inference. To verify the generalizability of our ISR, we applied it to various models in Table 7. The transformer-based backbones were trained with our decoder for the semantic segmentation task. For the convolutional self-attention models (i.e., CvT [62], MViT [72] and LVT [67]), our ISR significantly reduced computation by 34.8% to 41.6% with 0.1% to 0.3% mIoU degradation. Our method also reduced computation effectively with little performance degradation for window-based attention models (i.e., Swin [40] and DaViT [18]), spatial reduction attention-based models (i.e., PVTv2 [59] and MiT [65]) and segmentation models (i.e., SegFormer [65] and FeedFormer [50]). The result for FeedFormer, which uses a cross-attention decoder, showed that our method is also effective for the cross-attention mechanism. These results indicate that our ISR framework can be effectively extended to various transformer-based architectures using different attention methods, and that our EDAFormer is particularly well suited to ISR, since it is the only model in Table 7 with no mIoU drop.
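
The percentage ranges quoted above can be recomputed from the GFLOPs pairs reported in Table 7. The grouping into attention families follows the text; the dictionary layout and the helper name are only illustrative.

```python
# (GFLOPs without ISR, GFLOPs with ISR) on Cityscapes val, as reported.
gflops = {
    "convolutional self-attention": {
        "CvT": (365.5, 222.6), "MViT": (1435.6, 838.0), "LVT": (132.1, 86.1),
    },
    "window-based attention": {
        "Swin": (272.2, 208.0), "DaViT": (304.8, 242.0),
    },
    "spatial reduction attention": {
        "PVTv2": (121.8, 63.4), "MiT": (117.4, 59.0),
    },
    "segmentation models": {
        "SegFormer": (125.5, 82.5), "FeedFormer": (107.5, 68.8),
    },
}


def reduction_percent(before, after):
    """Relative GFLOPs reduction in percent, rounded to one decimal."""
    return round((before - after) / before * 100.0, 1)


for family, models in gflops.items():
    cuts = {name: reduction_percent(*pair) for name, pair in models.items()}
    print(f"{family}: {min(cuts.values())}% to {max(cuts.values())}%  {cuts}")
```

Grouping the numbers this way makes it easy to see that the window-based models gain the least from ISR, while the spatial reduction attention-based backbones gain the most.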

Figure 4: Visualization of the attention score map, output features, and prediction map on ADE20K. ‘Base’ denotes our EDAFormer with the base reduction ratio of [8,4,2,1]-[1,2,4]. ‘w/ ISR’ denotes our EDAFormer with our ISR method applied.

4.11 Visualization of Features

Fig. 4 visualizes the features and prediction maps of the EDAFormer-B decoder stage-2 before and after applying ISR. First, we visualized the attention score maps, which represent the similarity scores between the query and key. When ISR was applied, the resolution of the attention score map was reduced because the resolution of the key was reduced. Compared to the similarity scores without ISR, the similarity scores between the query and key with ISR were well maintained. In other words, the attention regions before and after applying ISR were similar, even though we reduced the key tokens rather than the attention score map itself. This means that applying our ISR can maintain the semantic similarity scores in the global regions.

Second, we compared the output features obtained by multiplying the attention score map with the values. Surprisingly, the output features before and after applying ISR were almost the same. These results indicate that the information obtained from the self-attention operation is maintained even though the spatial reduction is applied to the key and value at inference. Third, the prediction maps before and after applying ISR were also almost the same. This means that ISR is effective not only for decoder stage-2, but also for the whole EDAFormer network.

Figure 5: Qualitative results on ADE20K, Cityscapes, and COCO-Stuff. Compared to SegFormer, the predictions of our EDAFormer are more precise for various categories.

4.12 Qualitative Results

In Fig. 5, we visualized our segmentation predictions on ADE20K, Cityscapes and COCO-Stuff and compared them with those of an embedding-based transformer model (i.e., SegFormer [65]). Our EDAFormer predicted the finer details near object boundaries better. Our model also segmented large regions (e.g., road, roof and truck) better than SegFormer. Furthermore, our model predicted distant objects of the same category (e.g., sofa) more precisely than SegFormer. This indicates that our embedding-free attention structure can capture enough global spatial information.

5 Conclusion

In this paper, we present an efficient transformer-based semantic segmentation model, EDAFormer, which leverages the proposed embedding-free attention module. The embedding-free attention structure rethinks the self-attention mechanism from the perspective of modeling the global context. In addition, we propose a novel inference spatial reduction framework for efficiency, which uses a higher key-value reduction ratio at inference than at training. We hope that our attention mechanism and framework can further research efforts in exploring lightweight and efficient transformer-based semantic segmentation models.

Acknowledgements

This work was supported by Samsung Electronics Co., Ltd (IO201218-08232-01), by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. RS-2024-00414230), by MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2024-RS-2023-00260091) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation), and by the National Supercomputing Center with supercomputing resources including technical support (KSC-2023-CRE-0444).

References

  • [1] Bolya, D., Fu, C.Y., Dai, X., Zhang, P., Feichtenhofer, C., Hoffman, J.: Token merging: Your vit but faster. arXiv preprint arXiv:2210.09461 (2022)
  • [2] Bousselham, W., Thibault, G., Pagano, L., Machireddy, A., Gray, J., Chang, Y.H., Song, X.: Efficient self-ensemble for semantic segmentation. arXiv preprint arXiv:2111.13280 (2021)
  • [3] Caesar, H., Uijlings, J., Ferrari, V.: Coco-stuff: Thing and stuff classes in context. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1209–1218 (2018)
  • [4] Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.: Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs (2017)
  • [5] Chen, Q., Wu, Q., Wang, J., Hu, Q., Hu, T., Ding, E., Cheng, J., Wang, J.: Mixformer: Mixing features across windows and dimensions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5249–5259 (2022)
  • [6] Chen, X., Liu, Z., Tang, H., Yi, L., Zhao, H., Han, S.: Sparsevit: Revisiting activation sparsity for efficient high-resolution vision transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2061–2070 (2023)
  • [7] Chen, Y., Dai, X., Chen, D., Liu, M., Dong, X., Yuan, L., Liu, Z.: Mobile-former: Bridging mobilenet and transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5270–5279 (2022)
  • [8] Chen, Z., Xie, L., Niu, J., Liu, X., Wei, L., Tian, Q.: Visformer: The vision-friendly transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 589–598 (2021)
  • [9] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1290–1299 (2022)
  • [10] Cheng, B., Schwing, A., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation pp. 17864–17875 (2021)
  • [11] Cho, Y., Kang, S.: Class attention transfer for semantic segmentation. In: 2022 IEEE 4th International Conference on Artificial Intelligence Circuits and Systems (AICAS). pp. 41–45. IEEE (2022)
  • [12] Cho, Y., Yu, H., Kang, S.J.: Cross-aware early fusion with stage-divided vision and language transformer encoders for referring image segmentation. IEEE Transactions on Multimedia (2023)
  • [13] Chu, X., Tian, Z., Wang, Y., Zhang, B., Ren, H., Wei, X., Xia, H., Shen, C.: Twins: Revisiting the design of spatial attention in vision transformers pp. 9355–9366 (2021)
  • [14] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3213–3223 (2016)
  • [15] Dai, Z., Liu, H., Le, Q.V., Tan, M.: Coatnet: Marrying convolution and attention for all data sizes pp. 3965–3977 (2021)
  • [16] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. pp. 248–255. IEEE (2009)
  • [17] Yuan, Y., Chen, X., Wang, J.: Object-contextual representations for semantic segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VI 16. pp. 173–190 (2020)
  • [18] Ding, M., Xiao, B., Codella, N., Luo, P., Wang, J., Yuan, L.: Davit: Dual attention vision transformers. In: Proceedings of the European conference on computer vision (ECCV). pp. 74–92 (2022)
  • [19] Dong, X., Bao, J., Chen, D., Zhang, W., Yu, N., Yuan, L., Chen, D., Guo, B.: Cswin transformer: A general vision transformer backbone with cross-shaped windows. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12124–12134 (2022)
  • [20] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
  • [21] Gong, C., Wang, D., Li, M., Chen, X., Yan, Z., Tian, Y., Liu, Q., Chandra, V.: Nasvit: Neural architecture search for efficient vision transformers with gradient conflict aware supernet training. In: International Conference on Learning Representations (2021)
  • [22] Graham, B., El-Nouby, A., Touvron, H., Stock, P., Joulin, A., Jégou, H., Douze, M.: Levit: a vision transformer in convnet’s clothing for faster inference. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 12259–12269 (2021)
  • [23] Guo, J., Han, K., Wu, H., Tang, Y., Chen, X., Wang, Y., Xu, C.: Cmt: Convolutional neural networks meet vision transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12175–12185 (2022)
  • [24] Guo, M.H., Lu, C.Z., Hou, Q., Liu, Z., Cheng, M.M., Hu, S.M.: Segnext: Rethinking convolutional attention design for semantic segmentation. arXiv preprint arXiv:2209.08575 (2022)
  • [25] Han, D., Pan, X., Han, Y., Song, S., Huang, G.: Flatten transformer: Vision transformer using focused linear attention. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 5961–5971 (2023)
  • [26] Han, K., Xiao, A., Wu, E., Guo, J., Xu, C., Wang, Y.: Transformer in transformer pp. 15908–15919 (2021)
  • [27] Hatamizadeh, A., Heinrich, G., Yin, H., Tao, A., Alvarez, J.M., Kautz, J., Molchanov, P.: Fastervit: Fast vision transformers with hierarchical attention. In: The Twelfth International Conference on Learning Representations (2024), https://openreview.net/forum?id=kB4yBiNmXX
  • [28] Hatamizadeh, A., Yin, H., Heinrich, G., Kautz, J., Molchanov, P.: Global context vision transformers. In: International Conference on Machine Learning. PMLR. pp. 12633–12646 (2023)
  • [29] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)
  • [30] Huang, H., Zhou, X., He, R.: Orthogonal transformer: An efficient vision transformer backbone with token orthogonalization pp. 14596–14607 (2022)
  • [31] Kang, B., Moon, S., Cho, Y., Yu, H., Kang, S.J.: Metaseg: Metaformer-based global contexts-aware network for efficient semantic segmentation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 434–443 (2024)
  • [32] Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems 25 (2012)
  • [33] Lee, Y., Kim, J., Willette, J., Hwang, S.J.: Mpvit: Multi-path vision transformer for dense prediction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7287–7296 (2022)
  • [34] Li, L., et al.: Semantic hierarchy-aware segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023)
  • [35] Li, Y., Yuan, G., Wen, Y., Hu, E., Evangelidis, G., Tulyakov, S., Wang, Y., Ren, J.: Efficientformer: Vision transformers at mobilenet speed. arXiv preprint arXiv:2206.01191 (2022)
  • [36] Lin, W., Wu, Z., Chen, J., Huang, J., Jin, L.: Scale-aware modulation meet transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 6015–6026 (2023)
  • [37] Liu, H., Dai, Z., So, D., Le, Q.V.: Pay attention to mlps. Advances in Neural Information Processing Systems 34, 9204–9215 (2021)
  • [38] Liu, H., Jiang, X., Li, X., Bao, Z., Jiang, D., Ren, B.: Nommer: Nominate synergistic context in vision transformer for visual recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12073–12082 (2022)
  • [39] Liu, Z., Hu, H., Lin, Y., Yao, Z., Xie, Z., Wei, Y., Ning, J., Cao, Y., Zhang, Z., Dong, L., Wei, F., Guo, B.: Swin transformer v2: Scaling up capacity and resolution. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 12009–12019 (2022)
  • [40] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 10012–10022 (2021)
  • [41] Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 11976–11986 (2022)
  • [42] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3431–3440 (2015)
  • [43] Lu, C., de Geus, D., Dubbelman, G.: Content-aware token sharing for efficient semantic segmentation with vision transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 23631–23640 (2023)
  • [44] Mehta, S., Rastegari, M.: Mobilevit: light-weight, general-purpose, and mobile-friendly vision transformer. arXiv preprint arXiv:2110.02178 (2021)
  • [45] Ranftl, R., Bochkovskiy, A., Koltun, V.: Vision transformers for dense prediction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 12179–12188 (2021)
  • [46] Rao, Y., Zhao, W., Liu, B., Lu, J., Zhou, J., Hsieh, C.J.: Dynamicvit: Efficient vision transformers with dynamic token sparsification pp. 13937–13949 (2021)
  • [47] Ren, P., Li, C., Wang, G., Xiao, Y., Du, Q., Liang, X., Chang, X.: Beyond fixation: Dynamic window visual transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11987–11997 (2022)
  • [48] Ren, S., Zhou, D., He, S., Feng, J., Wang, X.: Shunted self-attention via multi-scale token aggregation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10853–10862 (2022)
  • [49] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference. pp. 234–241 (2015)
  • [50] Shim, J.h., Yu, H., Kong, K., Kang, S.J.: Feedformer: revisiting transformer decoder for efficient semantic segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 37, pp. 2263–2271 (2023)
  • [51] Tan, M., Le, Q.: Efficientnetv2: Smaller models and faster training. In: International conference on machine learning. pp. 10096–10106. PMLR (2021)
  • [52] Tang, Q., Zhang, B., Liu, J., Liu, F., Liu, Y.: Dynamic token pruning in plain vision transformers for semantic segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 777–786 (2023)
  • [53] Touvron, H., Bojanowski, P., Caron, M., Cord, M., El-Nouby, A., Grave, E., Izacard, G., Joulin, A., Synnaeve, G., Verbeek, J., et al.: Resmlp: Feedforward networks for image classification with data-efficient training. IEEE Transactions on Pattern Analysis and Machine Intelligence (2022)
  • [54] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International conference on machine learning. pp. 10347–10357. PMLR (2021)
  • [55] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017)
  • [56] Wang, J., Gou, C., Wu, Q., Feng, H., Han, J., Ding, E., Wang, J.: Rtformer: Efficient design for real-time semantic segmentation with transformer pp. 7423–7436 (2022)
  • [57] Wang, W., et al.: Exploring cross-image pixel contrast for semantic segmentation. In: ICCV. pp. 7303–7313 (2021)
  • [58] Wang, W., Xie, E., Li, X., Fan, D.P., Song, K., Liang, D., Lu, T., Luo, P., Shao, L.: Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 568–578 (2021)
  • [59] Wang, W., Xie, E., Li, X., Fan, D.P., Song, K., Liang, D., Lu, T., Luo, P., Shao, L.: Pvt v2: Improved baselines with pyramid vision transformer. Computational Visual Media 8(3), 415–424 (2022)
  • [60] Wang, W., Yao, L., Chen, L., Lin, B., Cai, D., He, X., Liu, W.: Crossformer: A versatile vision transformer hinging on cross-scale attention. In: International Conference on Learning Representations (2022)
  • [61] Wightman, R., Touvron, H., Jégou, H.: Resnet strikes back: An improved training procedure in timm. arXiv preprint arXiv:2110.00476 (2021)
  • [62] Wu, H., Xiao, B., Codella, N., Liu, M., Dai, X., Yuan, L., Zhang, L.: Cvt: Introducing convolutions to vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 22–31 (2021)
  • [63] Wu, Y.H., Liu, Y., Zhan, X., Cheng, M.M.: P2t: Pyramid pooling transformer for scene understanding (2022)
  • [64] Xia, Z., Pan, X., Song, S., Li, L.E., Gao, H.: Vision transformer with deformable attention. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4794–4803 (2022)
  • [65] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. Advances in Neural Information Processing Systems 34, 12077–12090 (2021)
  • [66] Yan, H., Wu, M., Zhang, C.: Multi-scale representations by varying window attention for semantic segmentation. In: The Twelfth International Conference on Learning Representations (2024), https://openreview.net/forum?id=lAhWGOkpSR
  • [67] Yang, C., Wang, Y., Zhang, J., Zhang, H., Wei, Z., Lin, Z., Yuille, A.: Lite vision transformer with enhanced self-attention. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11998–12008 (2022)
  • [68] Yang, J., Li, C., Zhang, P., Dai, X., Xiao, B., Yuan, L., Gao, J.: Focal self-attention for local-global interactions in vision transformers. arXiv preprint arXiv:2107.00641 (2021)
  • [69] Yang, R., Ma, H., Wu, J., Tang, Y., Xiao, X., Zheng, M., Li, X.: Scalablevit: Rethinking the context-oriented generalization of vision transformer. In: Proceedings of the European conference on computer vision (ECCV). pp. 480–496 (2022)
  • [70] Yu, H., Shim, J.h., Kwak, J., Song, J.W., Kang, S.J.: Vision transformer-based retina vessel segmentation with deep adaptive gamma correction. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 1456–1460. IEEE (2022)
  • [71] Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10819–10829 (2022)
  • [72] Yuan, L., Chen, Y., Wang, T., Yu, W., Shi, Y., Jiang, Z.H., Tay, F.E., Feng, J., Yan, S.: Multiscale vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 6824–6835 (2021)
  • [73] Yuan, L., Chen, Y., Wang, T., Yu, W., Shi, Y., Jiang, Z.H., Tay, F.E., Feng, J., Yan, S.: Tokens-to-token vit: Training vision transformers from scratch on imagenet. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 558–567 (2021)
  • [74] Yuan, L., Chen, Y., Wang, T., Yu, W., Shi, Y., Jiang, Z.H., Tay, F.E., Feng, J., Yan, S.: Tokens-to-token vit: Training vision transformers from scratch on imagenet. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 558–567 (2021)
  • [75] Zhang, Q., Yang, Y.B.: Rest: An efficient transformer for visual recognition pp. 15475–15485 (2021)
  • [76] Zhang, Q., Yang, Y.B.: Rest v2: simpler, faster and stronger pp. 36440–36452 (2022)
  • [77] Zhang, W., Huang, Z., Luo, G., Chen, T., Wang, X., Liu, W., Yu, G., Shen, C.: Topformer: Token pyramid transformer for mobile semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12083–12093 (2022)
  • [78] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H., et al.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 6881–6890 (2021)
  • [79] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ade20k dataset. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 633–641 (2017)
  • [80] Zhou, T., et al.: Rethinking semantic segmentation: A prototype view. In: CVPR. pp. 2582–2593 (2022)

Appendix

  • In Appendix 0.A, we present performance-computation comparisons between our EDAFormer and other transformer-based state-of-the-art models.

  • In Appendix 0.B, we provide the computational analysis of our method in the attention block.

  • In Appendix 0.C, we present fair comparisons of semantic segmentation decoder structures with the same backbone.

  • In Appendix 0.D, we present the comparisons of computations and performance on various reduction ratios.

  • In Appendix 0.E, we present the FPS comparison with other segmentation models.

  • In Appendix 0.F, we provide an in-depth analysis of the effectiveness of our embedding-free structure.

  • In Appendix 0.G, we provide the additional visualizations of features before and after applying our ISR.

  • In Appendix 0.H, we provide qualitative comparisons between the proposed model and previous state-of-the-art models on the ADE20K, Cityscapes and COCO-Stuff datasets.

Models Params (M) GFLOPs ↓ mIoU (%) ↑
PVT-Tiny [58] 15.5 32.2 32.9
SegFormer-B0 [65] 3.8 8.4 37.4
RTFormer-Slim [56] 4.8 17.5 36.7
FeedFormer-B0 [50] 4.5 7.8 39.2
VWFormer-B0 [66] 3.7 5.1 38.9
EDAFormer-T   (w/o ISR) 4.9 5.6 42.3
EDAFormer-T   (w/   ISR) 4.9 4.7 42.1
PVT-Medium [58] 48.0 61.0 41.6
Focal-T [68] 62.0 998.0 45.8
Twins-SVT-UperNet-S [13] 54.4 228.0 46.2
SegFormer-B2 [65] 27.5 62.4 46.5
MaskFormer [10] 42.0 55.0 46.7
SenFormer [2] 144.0 179.0 46.0
CrossFormer-S [60] 62.3 968.5 47.4
MPViT-S [33] 52.0 943.0 48.3
DW-T [47] 61.0 953.0 45.7
MixFormer-B1 [5] 35.0 854.0 42.0
DAT-T [64] 32.0 198.0 42.6
NomMer-T [38] 54.0 954.0 46.1
Shunted-S [48] 52.0 940.0 48.9
Mask2Former [9] 47.0 74.0 47.7
DaViT-Tiny [18] 60.0 940.0 46.3
Scalable ViT-S [69] 57.0 931.0 48.5
Ortho-S [30] 54.0 956.0 48.5
RTFormer-Base [56] 16.8 67.4 42.1
FeedFormer-B2 [50] 29.1 42.7 48.0
GC ViT-T [28] 58.0 947.0 47.0
Flatten-Swin-T [25] 60.0 946.0 44.8
VWFormer-B2 [66] 27.4 38.5 48.1
EDAFormer-B   (w/o ISR) 29.4 32.0 49.0
EDAFormer-B   (w/   ISR) 29.4 29.4 48.9
Table 8: Comparison with previous transformer-based models on ADE20K. GFLOPs were computed with an input size of 512×512.
Figure 6: Performance-Computation curves of our EDAFormer and existing segmentation models on ADE20K.

Appendix 0.A Additional Comparison to Transformer-based Models

In Table 8, we compared our model with additional transformer-based models on the ADE20K [79] validation set, as most transformer-based backbone studies report ADE20K results to show their semantic segmentation performance. With ISR, EDAFormer-T reached 42.1% mIoU at 4.7 GFLOPs and EDAFormer-B reached 48.9% mIoU at 29.4 GFLOPs. In addition, Fig. 6 presents the performance-computation curves, which include comparisons with lightweight transformer-based models. These results show that our EDAFormer achieves higher mIoU at lower computational cost than previous transformer-based state-of-the-art models of comparable size.

Mechanism | QKV Embedding | Global Functioning | Output Projection | Others | Total
(each cell: MFLOPs ↓ / Params (K))
Ω(SRA) | 4.82 / 49.6 | 2.46 / 0.0 | 3.21 / 16.5 | 0.83 / 16.5 | 11.32 / 82.6
Ω(EFA w/o ISR) | 0.00 / 0.0 | 2.46 / 0.0 | 3.21 / 16.5 | 0.83 / 16.5 | 6.50 (-42.6%) / 33.0 (-60.0%)
Ω(EFA w/ ISR) | 0.00 / 0.0 | 0.61 / 0.0 | 3.21 / 16.5 | 0.18 / 16.5 | 4.00 (-64.7%) / 33.0 (-60.0%)
Table 9: Computational analysis of our method in the attention block. The FLOPs and parameters were computed on stage 3 features for an input of size 224×224.

Appendix 0.B Computational Analysis in Attention Block

In Table 9, we compared the computation of the attention block to analyze the effectiveness of our embedding-free structure and our inference spatial reduction (ISR) method. We divided the attention mechanism into the query-key-value embeddings, the global functioning, the output projection, and others. Since our structure is based on spatial reduction attention (SRA), ‘others’ refers to the spatial reduction operation. Removing the embeddings reduced the total MFLOPs by 42.6% and the parameters by 60.0%. In addition, our ISR reduced the computation of the global functioning from 2.46 to 0.61 MFLOPs. Therefore, compared to the original SRA, our embedding-free structure with ISR reduced the MFLOPs by 64.7% and the parameters by 60.0%.
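The totals and percentage reductions reported in Table 9 can be re-derived from the per-component entries. The following is a minimal sketch of that arithmetic; the dictionary layout and the helper names (`totals`, `reduction_pct`) are illustrative assumptions, and the numbers are transcribed from Table 9.

```python
# Per-component cost of one attention block, transcribed from Table 9.
# Each tuple is (MFLOPs, Params in K) for, in order: QKV embedding,
# global functioning, output projection, others (spatial reduction).
table9 = {
    "SRA":         [(4.82, 49.6), (2.46, 0.0), (3.21, 16.5), (0.83, 16.5)],
    "EFA w/o ISR": [(0.00, 0.0),  (2.46, 0.0), (3.21, 16.5), (0.83, 16.5)],
    "EFA w/ ISR":  [(0.00, 0.0),  (0.61, 0.0), (3.21, 16.5), (0.18, 16.5)],
}


def totals(rows):
    """Sum MFLOPs and parameters over the four components."""
    flops = round(sum(f for f, _ in rows), 2)
    params = round(sum(p for _, p in rows), 1)
    return flops, params


def reduction_pct(baseline, value):
    """Relative reduction with respect to the baseline, in percent."""
    return round(100.0 * (baseline - value) / baseline, 1)


base_flops, base_params = totals(table9["SRA"])
summary = {}
for name, rows in table9.items():
    flops, params = totals(rows)
    summary[name] = (
        flops,
        params,
        reduction_pct(base_flops, flops),
        reduction_pct(base_params, params),
    )
    print(name, summary[name])
```

The parameter saving comes entirely from dropping the QKV embeddings (49.6 K of the 82.6 K parameters), which is why it is identical with and without ISR.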

Appendix 0.C Comparison of Decoder Structure with Same Backbone

Encoder Decoder GFLOPs ↓ mIoU (%) ↑
MiT-B0 SegFormer [65] 8.4 37.4
FeedFormer [50] 7.8 39.2
SegNeXt [24] 5.2 38.7
VWFormer [66] 5.1 38.9
EDAFormer (Ours) 4.6 40.1
Table 10: Comparison of our EDAFormer and other segmentation methods with a MiT backbone [65] on ADE20K.

As the backbone has a significant impact on semantic segmentation performance, we compared our decoder with other segmentation methods using the same backbone for a fairer comparison of the decoder structures in Table 10. We used the Mix Transformer (MiT) as the common backbone, since it is widely used as a transformer-based backbone in semantic segmentation. In the decoder, we compared our EDAFormer (i.e., all-attention decoder) with previous strong methods, including SegFormer [65] (i.e., all-MLP decoder), FeedFormer [50] (i.e., feature query decoder), SegNeXt [24] (i.e., Ham decoder), and VWFormer [66] (i.e., multi-scale decoder). As shown in Table 10, our EDAFormer achieved the lowest computational cost (4.6 GFLOPs) and the highest mIoU (40.1%) by modeling the global context.

Reduction ratio Cityscapes
[ $r_E^1, r_E^2, r_E^3, r_E^4$ ]-[ $r_D^1, r_D^2, r_D^3$ ] GFLOPs ↓ mIoU (%) ↑
[ 8 - 4 - 2 - 1 ]-[ 1 - 2 - 4 ] 151.7 78.7
[ 8 - 4 - 2 - 1 ]-[ 1 - 2 - 8 ] 145.3 78.7
[ 8 - 4 - 2 - 1 ]-[ 1 - 4 - 8 ] 138.8 78.7
[ 8 - 4 - 2 - 1 ]-[ 2 - 4 - 8 ] 133.6 78.7
[16 - 4 - 2 - 1 ]-[ 1 - 2 - 4 ] 125.9 78.7
[16 - 4 - 2 - 1 ]-[ 1 - 2 - 8 ] 119.5 78.7
[16 - 4 - 2 - 1 ]-[ 1 - 4 - 8 ] 113.0 78.7
[16 - 4 - 2 - 1 ]-[ 2 - 4 - 8 ] 107.8 78.7
[16 - 8 - 2 - 1 ]-[ 1 - 2 - 4 ] 113.0 78.7
[16 - 8 - 2 - 1 ]-[ 1 - 2 - 8 ] 106.6 78.7
[16 - 8 - 2 - 1 ]-[ 1 - 4 - 8 ] 100.1 78.7
[16 - 8 - 2 - 1 ]-[ 2 - 4 - 8 ] 94.9 78.7
[16 - 8 - 4 - 1 ]-[ 1 - 2 - 4 ] 80.6 78.1
[16 - 8 - 4 - 1 ]-[ 1 - 2 - 8 ] 74.1 78.1
[16 - 8 - 4 - 1 ]-[ 1 - 4 - 8 ] 67.6 78.1
[16 - 8 - 4 - 1 ]-[ 2 - 4 - 8 ] 62.5 78.1
[16 - 8 - 4 - 2 ]-[ 1 - 2 - 4 ] 77.1 78.1
[16 - 8 - 4 - 2 ]-[ 1 - 2 - 8 ] 70.7 78.1
[16 - 8 - 4 - 2 ]-[ 1 - 4 - 8 ] 64.2 78.1
[16 - 8 - 4 - 2 ]-[ 2 - 4 - 8 ] 59.1 78.1
Table 11: Comparison of the computations and performance of our model with different reduction ratios at inference. During training, [Encoder]-[Decoder] reduction ratio was [8, 4, 2, 1]-[1, 2, 4], and it is notated with . GFLOPs were computed with the input size of 2048×1024. The optimal inference reduction ratio (i.e., [16, 8, 2, 1]-[2, 4, 8]) is in gray.

Appendix 0.D Various Reduction Ratios for ISR

In Table 11, we compared the computational cost (GFLOPs) and mIoU performance of our EDAFormer-T under various ISR conditions (i.e., reduction ratios) on the Cityscapes dataset. We experimented with 20 conditions by increasing the reduction ratio of each encoder stage and decoder stage, which extends the results in Table 4 of the main paper. Firstly, the results showed that there was no mIoU degradation when the reduction ratio was increased in each decoder stage of our all-attention decoder. These results indicate that a decoder composed of embedding-free attention is well suited to our ISR. Secondly, the mIoU performance was maintained when the reduction ratio was increased in both the 1st and 2nd encoder stages. However, the performance decreased by 0.6 percentage points (i.e., 78.7% → 78.1%) once the reduction ratio was increased in the 3rd encoder stage, and it stayed at 78.1% when the ratio of the 4th stage was also increased. Therefore, we suggest the reduction ratio of [16,8,2,1]-[2,4,8] as the optimal condition of our ISR, but the user can choose other conditions if a lower computational cost is required even at the expense of some performance degradation.
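The selection of the suggested ratio can be reproduced from the rows of Table 11. The sketch below is only an illustration of that reasoning, with the selection rule (cheapest condition whose mIoU is not below the training-time condition) stated here as an assumption; the variable names are hypothetical and the numbers are transcribed from Table 11.

```python
# Rows of Table 11: (encoder ratios, decoder ratios, GFLOPs, mIoU %).
decoders = [(1, 2, 4), (1, 2, 8), (1, 4, 8), (2, 4, 8)]
encoders = {
    (8, 4, 2, 1):  [(151.7, 78.7), (145.3, 78.7), (138.8, 78.7), (133.6, 78.7)],
    (16, 4, 2, 1): [(125.9, 78.7), (119.5, 78.7), (113.0, 78.7), (107.8, 78.7)],
    (16, 8, 2, 1): [(113.0, 78.7), (106.6, 78.7), (100.1, 78.7), (94.9, 78.7)],
    (16, 8, 4, 1): [(80.6, 78.1), (74.1, 78.1), (67.6, 78.1), (62.5, 78.1)],
    (16, 8, 4, 2): [(77.1, 78.1), (70.7, 78.1), (64.2, 78.1), (59.1, 78.1)],
}
rows = [
    (enc, dec, gflops, miou)
    for enc, results in encoders.items()
    for dec, (gflops, miou) in zip(decoders, results)
]

# The first row is the reduction ratio used during training.
baseline = rows[0]


def saving_pct(row):
    """GFLOPs saved relative to the training-time ratio, in percent."""
    return round(100.0 * (baseline[2] - row[2]) / baseline[2], 1)


# Cheapest condition that keeps the baseline mIoU.
lossless = min((r for r in rows if r[3] >= baseline[3]), key=lambda r: r[2])
# Cheapest condition overall, accepting some mIoU degradation.
cheapest = min(rows, key=lambda r: r[2])

print(lossless[:2], saving_pct(lossless))
print(cheapest[:2], saving_pct(cheapest), round(baseline[3] - cheapest[3], 1))
```

Under this rule the selected condition is [16,8,2,1]-[2,4,8], which saves about 37.4% of the GFLOPs at unchanged mIoU, while the cheapest condition saves about 61.0% at a cost of 0.6 mIoU points.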

Appendix 0.E FPS Comparison with Other Segmentation Models

In Table 12, we present the inference speed comparison without any additional acceleration techniques. For a fair comparison, we measured the Frames Per Second (FPS) on whole single images of 2048×1024 from Cityscapes using a single RTX 3090 GPU. Compared to previous segmentation methods, our method achieved a higher FPS (13.3 versus 8.8 to 9.3) together with a higher mIoU score.

Model Params (M) GFLOPs ↓ mIoU (%) ↑ FPS (img/s) ↑
SegFormer-B0 3.8 125.5 76.2 8.8
FeedFormer-B0 4.5 107.4 77.9 9.3
VWFormer-B0 3.7 - 77.2 8.9
EDAFormer-T (w/  ISR) 4.9 94.9 78.7 13.3
Table 12: FPS comparison with other segmentation models on Cityscapes. FPS is tested on a single RTX 3090 GPU.
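An FPS measurement of this kind is commonly done by timing repeated single-image inference after a few untimed warm-up runs. The following is a generic sketch of such a harness and not the authors' benchmarking script; `measure_fps`, `dummy_infer`, and the warm-up and run counts are hypothetical. On a GPU, the device would additionally need to be synchronized before each clock reading so that asynchronous kernels are included in the timing.

```python
import time


def measure_fps(infer, image, warmup=5, runs=20):
    """Return images per second for a single-image inference callable."""
    for _ in range(warmup):
        infer(image)  # warm-up iterations are not timed
    start = time.perf_counter()
    for _ in range(runs):
        infer(image)
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against zero
    return runs / elapsed


# Hypothetical stand-in for a segmentation model; any callable works.
calls = []


def dummy_infer(image):
    calls.append(1)
    return [sum(row) for row in image]


image = [[0.0] * 64 for _ in range(32)]
fps = measure_fps(dummy_infer, image)
print(f"{fps:.1f} img/s")
```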

Appendix 0.F Applying EFA to Other Backbones

For an in-depth analysis, we evaluated the effectiveness of our embedding-free structure on other backbones in Table 13. In each backbone, we applied our method and increased the number of attention blocks to keep the model size comparable. Compared to the original PVT and PVT v2, our method showed 1.5% and 0.8% higher top-1 accuracy, respectively, with a similar parameter size and the same computational cost. These results demonstrate that our method is also effective for other transformer-based encoders.

Model Method Params (M) GFLOPs ↓ Top-1 Acc. (%)
PVT original 13.2 1.9 75.1
w/o embedding 13.3 1.9 76.6
PVT v2 original 3.7 0.6 70.5
w/o embedding 3.6 0.6 71.3
Table 13: Applying our EFA to other backbones on ImageNet-1K.
Figure 7: Visualization of the input features of the self-attention, the attention score maps, the output features of the attention, and the prediction maps. ‘w/o ISR’ represents our EDAFormer-T with the base reduction ratio of [8,4,2,1]-[1,2,4]. ‘w/ ISR’ represents our EDAFormer-T with our ISR method applied at the reduction ratio of [16,8,2,1]-[2,4,8].
Figure 8: Visualization of qualitative results on ADE20K, Cityscapes and COCO-Stuff. Compared to previous state-of-the-art semantic segmentation models (i.e., SegFormer and FeedFormer), our EDAFormer predicts more precisely for various categories.

Appendix 0.G Additional Feature Visualization

In Fig. 7, we visualized the input and output features of our embedding-free attention, the attention score maps, and the predictions before and after applying ISR. The attention regions in the score maps with our ISR were well maintained in comparison to those without ISR, even though the key and value tokens were reduced. In addition, the output features with ISR preserved the spatial information, because in the self-attention mechanism the number of key-value tokens does not affect the input-output spatial structure. As a result, the prediction maps with our ISR were largely identical to those without ISR.
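The point that the number of key-value tokens does not change the output resolution can be illustrated with a small sketch. The code below is a simplified illustration and not the authors' implementation: it assumes plain average pooling as a stand-in for the spatial reduction operation, uses scaled dot-product softmax as the global functioning, applies no query, key, or value projection, and omits the output projection. The function names and the toy 8×8 feature grid are hypothetical.

```python
import math


def avg_pool_tokens(grid, r):
    """Average-pool an H x W grid of C-dim tokens with an r x r window."""
    h, w, c = len(grid), len(grid[0]), len(grid[0][0])
    pooled = []
    for i in range(0, h, r):
        for j in range(0, w, r):
            window = [
                grid[y][x]
                for y in range(i, min(i + r, h))
                for x in range(j, min(j + r, w))
            ]
            pooled.append([sum(t[k] for t in window) / len(window) for k in range(c)])
    return pooled


def embedding_free_attention(grid, r):
    """Attention without QKV projections; keys and values are pooled inputs."""
    queries = [tok for row in grid for tok in row]  # H*W query tokens
    kv = avg_pool_tokens(grid, r)                   # reduced key-value tokens
    c = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(c) for k in kv]
        m = max(scores)                             # stabilize the softmax
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        out.append([sum(e / z * v[k] for e, v in zip(exps, kv)) for k in range(c)])
    return out, len(kv)


# Toy 8 x 8 feature map with 4 channels (deterministic values).
grid = [
    [[math.sin(0.1 * (y * 8 + x) + k) for k in range(4)] for x in range(8)]
    for y in range(8)
]

out_base, n_kv_base = embedding_free_attention(grid, r=2)  # training-style ratio
out_isr, n_kv_isr = embedding_free_attention(grid, r=4)    # larger ratio at inference
print(len(out_base), n_kv_base, len(out_isr), n_kv_isr)
```

In this sketch the number of key-value tokens drops from 16 to 4 when the ratio is doubled, while the output still contains one token per input position, so the spatial layout of the output is unchanged.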

Appendix 0.H Additional Qualitative Results

Qualitative results of our EDAFormer and other state-of-the-art models on ADE20K [79], Cityscapes [14] and COCO-Stuff [3] are illustrated in Fig. 8. SegFormer [65] and FeedFormer [50] were compared on ADE20K and Cityscapes, while only SegFormer was compared on COCO-Stuff. Compared to previous methods, our EDAFormer not only performed better on large regions, but also produced more precise and detailed predictions in boundary regions. These results demonstrate that our EDAFormer, an encoder-decoder attention structure based on EFA, is an efficient yet powerful network for semantic segmentation.