Mixed Attention and Channel Shift Transformer for Efficient Action Recognition
Abstract
1 Introduction


2 Related Work
2.1 CNNs
2.2 Visual Transformers
3 Method
3.1 Random, Spatial, and Temporal Attention Mechanisms
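As a concrete reference for the three attention scopes compared in the ablation of Sec. 4.3, below is a minimal PyTorch sketch assuming a TimeSformer-style [4] layout of \(T\times H\times W\) patch tokens: spatial attention (S) restricts each token to its own frame, temporal attention (T) to its own spatial location across frames. The reading of random attention (R) as attention within randomly drawn token groups is an assumption for illustration, not the paper's definition.

```python
import torch

def attend(q, k, v):
    # Scaled dot-product attention over the token dimension (dim -2).
    attn = (q @ k.transpose(-2, -1) * q.shape[-1] ** -0.5).softmax(dim=-1)
    return attn @ v

def spatial_attention(x, T, H, W):
    # S: each token attends only within its own frame.
    B, N, C = x.shape                              # N == T * H * W
    y = x.reshape(B * T, H * W, C)
    y = attend(y, y, y)
    return y.reshape(B, N, C)

def temporal_attention(x, T, H, W):
    # T: each token attends across time at its own spatial location.
    B, N, C = x.shape
    y = x.reshape(B, T, H * W, C).transpose(1, 2).reshape(B * H * W, T, C)
    y = attend(y, y, y)
    return y.reshape(B, H * W, T, C).transpose(1, 2).reshape(B, N, C)

def random_attention(x, T, H, W, generator=None):
    # R (assumed reading): tokens attend within random groups of size T.
    B, N, C = x.shape
    perm = torch.randperm(N, generator=generator)
    y = x[:, perm].reshape(B * (N // T), T, C)
    y = attend(y, y, y)
    return y.reshape(B, N, C)[:, torch.argsort(perm)]

if __name__ == "__main__":
    x = torch.randn(2, 4 * 7 * 7, 96)              # B=2, T=4, 7x7 grid, C=96
    for fn in (spatial_attention, temporal_attention, random_attention):
        assert fn(x, T=4, H=7, W=7).shape == x.shape
```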

3.2 Mixed Attention Operation
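The sketch below shows one plausible form of a mixed attention operation: the heads of a single layer are split between spatial and temporal scopes and their outputs concatenated, which keeps the cost near 2D attention (cf. the \(\uparrow 5.6\%\) FLOPs rows in Sec. 4.3) rather than full space-time attention. The class name, the half-and-half split, and the scope pairing are illustrative assumptions; the Mix-A/B/C variants of the ablation presumably differ in exactly such choices.

```python
import torch
import torch.nn as nn

def attend(q, k, v):
    attn = (q @ k.transpose(-2, -1) * q.shape[-1] ** -0.5).softmax(dim=-1)
    return attn @ v

class MixedAttention(nn.Module):
    """Illustrative mixed attention: half the heads attend within a frame,
    half across frames at a fixed location; outputs are concatenated."""

    def __init__(self, dim, heads=8):
        super().__init__()
        assert heads % 2 == 0 and dim % heads == 0
        self.h, self.dh = heads, dim // heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, T, H, W):
        B, N, C = x.shape                                  # N == T * H * W
        qkv = self.qkv(x).reshape(B, N, 3, self.h, self.dh).permute(2, 0, 3, 1, 4)
        q, k, v = qkv.unbind(0)                            # each (B, h, N, dh)
        hs, S = self.h // 2, H * W

        # Spatial heads: fold time into the batch, attend over S frame tokens.
        qs, ks, vs = (t[:, :hs].reshape(B * hs * T, S, self.dh) for t in (q, k, v))
        out_s = attend(qs, ks, vs).reshape(B, hs, N, self.dh)

        # Temporal heads: fold space into the batch, attend over the T frames.
        qt, kt, vt = (t[:, hs:].reshape(B, hs, T, S, self.dh).transpose(2, 3)
                      .reshape(B * hs * S, T, self.dh) for t in (q, k, v))
        out_t = (attend(qt, kt, vt).reshape(B, hs, S, T, self.dh)
                 .transpose(2, 3).reshape(B, hs, N, self.dh))

        out = torch.cat([out_s, out_t], dim=1)             # (B, h, N, dh)
        return self.proj(out.transpose(1, 2).reshape(B, N, C))
```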


3.3 Channel Shift Operation
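The ablation in Sec. 4.3 compares a temporal shift, a periodic shift, and learnable 1D convolutions initialized from either. Below is a sketch of the temporal-shift case in the style of TSM [32] / TokShift [63], plus a depthwise temporal conv initialized to reproduce it, matching the spirit of the "1D Conv (TS Init)" row; the `fold_div` fraction and the initialization details are assumptions, and the periodic-shift variant is not shown.

```python
import torch
import torch.nn as nn

def temporal_channel_shift(x, T, fold_div=8):
    """TSM-style shift [32]: move 1/fold_div of the channels one step forward
    in time and another 1/fold_div one step backward, at zero FLOPs.
    x: (B*T, N, C) token features; fold_div=8 is an assumed default."""
    BT, N, C = x.shape
    x = x.reshape(-1, T, N, C)
    fold = C // fold_div
    out = torch.zeros_like(x)
    out[:, 1:, :, :fold] = x[:, :-1, :, :fold]                  # from t-1
    out[:, :-1, :, fold:2 * fold] = x[:, 1:, :, fold:2 * fold]  # from t+1
    out[:, :, :, 2 * fold:] = x[:, :, :, 2 * fold:]             # unchanged
    return out.reshape(BT, N, C)

def shift_as_conv1d(C, fold_div=8):
    """Depthwise temporal conv (kernel 3) initialized to equal the shift
    above, then trainable; applied to features laid out as (B*N, C, T).
    Zero padding matches the shift's zero fill at the clip boundaries."""
    conv = nn.Conv1d(C, C, kernel_size=3, padding=1, groups=C, bias=False)
    fold = C // fold_div
    with torch.no_grad():
        conv.weight.zero_()
        conv.weight[:fold, 0, 0] = 1.0             # out[t] = in[t-1]
        conv.weight[fold:2 * fold, 0, 2] = 1.0     # out[t] = in[t+1]
        conv.weight[2 * fold:, 0, 1] = 1.0         # identity channels
    return conv
```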

3.4 MACS Transformer
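Tying the pieces together, one way the block could compose (reusing the `MixedAttention` and `temporal_channel_shift` sketches from Secs. 3.2 and 3.3 above, which must be in scope) is a standard pre-norm transformer block with the FLOPs-free channel shift applied before attention. The ordering and residual placement are assumptions, not the paper's stated design.

```python
import torch.nn as nn

class MACSBlock(nn.Module):
    """Sketch of a MACS-style block: channel shift -> mixed attention -> MLP,
    with pre-norm residuals as in a standard ViT block (an assumption)."""

    def __init__(self, dim, heads=8, mlp_ratio=4):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = MixedAttention(dim, heads)         # Sec. 3.2 sketch
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))

    def forward(self, x, T, H, W):                     # x: (B, T*H*W, C)
        B, N, C = x.shape
        h = temporal_channel_shift(x.reshape(B * T, N // T, C), T)  # Sec. 3.3
        x = x + self.attn(self.norm1(h.reshape(B, N, C)), T, H, W)
        return x + self.mlp(self.norm2(x))
```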
4 Experiments
4.1 Datasets
4.2 Implementation Details
4.3 Ablation Study
Method | FLOPs (G) | Params (M) | Top-1 (%) |
---|---|---|---|
Base2D from [62] | 39.1 | 39.8 | 74.0 |
Base3D from [62] | 46.5 | 39.8 | 76.3 |
S | 39.1 | 39.8 | 74.1 |
T | 38.1 | 39.8 | 56.4 |
R | 40.1 | 39.8 | 73.5 |
S + T | 39.1 | 39.8 | 75.6 |
S\(\cup\)T | 39.1 | 39.8 | 75.4 |
R + T | 40.2 | 39.8 | 73.3 |
R\(\cup\)T | 40.2 | 39.8 | 68.8 |
S + R | 41.2 | 39.8 | 75.1 |
S\(\cup\)R | 41.2 | 39.8 | 75.4 |
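Here S, T, and R denote spatial, temporal, and random attention; "+" and "\(\cup\)" presumably denote two ways of combining scopes (e.g., sequential stacking versus a parallel union of the attended tokens). Read this way, every combination that includes spatial attention improves on any single scope (75.1-75.6% vs. at best 74.1%), whereas pairing random with temporal attention lags behind.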
Method | FLOPs (G) | Params (M) | Top-1 (%) |
---|---|---|---|
Base2D from [62] | 39.1 | 39.8 | 74.0 |
Base3D from [62] | 46.5 (\(\uparrow 18.9\%\)) | 39.8 | 76.3 |
Mix-A | 41.3 (\(\uparrow 5.6\%\)) | 39.8 | 75.9 |
Mix-B | 41.3 (\(\uparrow 5.6\%\)) | 39.8 | 75.8 |
Mix-C | 41.3 (\(\uparrow 5.6\%\)) | 39.8 | 75.0 |
Mix-A\({}^{\ast}\) | 41.3 (\(\uparrow 5.6\%\)) | 39.8 | 76.0 |
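For reference, the bracketed percentages are FLOPs overheads relative to Base2D: \((41.3-39.1)/39.1\approx 5.6\%\) for the mixed-attention variants versus \((46.5-39.1)/39.1\approx 18.9\%\) for Base3D, so Mix-A\({}^{\ast}\) recovers nearly all of Base3D's accuracy gain (76.0% vs. 76.3%) at less than a third of its extra compute.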

Attention Operation | Channel Shift | FLOPs (G) | Top-1 (%) |
---|---|---|---|
Base2D from [62] | - | 39.1 | 74.0 |
Base3D from [62] | - | 46.5 (\(\uparrow 18.9\%\)) | 76.3 |
Mixed attention | Temporal shift | 41.3 (\(\uparrow 5.6\%\)) | 75.9 |
Mixed attention | Periodic shift | 41.3 (\(\uparrow 5.6\%\)) | 76.0 |
Mixed attention | 1D Conv (TS Init) | 41.3 (\(\uparrow 5.6\%\)) | 76.0 |
Mixed attention | 1D Conv (PS Init) | 41.3 (\(\uparrow 5.6\%\)) | 76.2 |
Method | Backbone | #F \(\times\) Res (H \(\times\) W) | Test Setting | Top-1 (%) |
---|---|---|---|---|
Base2D from [62] | Visformer-S | 8 \(\times\, 224^{2}\) | \(3\times 5\) | 74.0 |
Base2D from [62] | ViT-B | 8 \(\times\, 224^{2}\) | \(3\times 5\) | 76.4 |
MACS | Visformer-S | 8 \(\times\, 224^{2}\) | \(3\times 5\) | 76.2 |
MACS | Visformer-S | 32 \(\times\, 224^{2}\) | \(3\times 1\) | 78.0 |
MACS | Visformer-S | 32 \(\times\, 320^{2}\) | \(3\times 1\) | 79.8 |
MACS | Visformer-S | 32 \(\times\, 360^{2}\) | \(3\times 1\) | 80.0 |
MACS | ViT-B | 8 \(\times\, 224^{2}\) | \(3\times 5\) | 78.1 |
MACS | ViT-B | 32 \(\times\, 224^{2}\) | \(3\times 1\) | 80.2 |
MACS | ViT-B | 48 \(\times\, 224^{2}\) | \(3\times 1\) | 80.8 |
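In the test-setting column, \(a\times b\) is read here as \(a\) spatial crops \(\times\) \(b\) temporal clips per video, so \(3\times 5\) averages predictions over 15 views and \(3\times 1\) over 3, consistent with the \(\times\)Views factors in the comparison table of Sec. 4.4.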

4.4 Comparison with the State-of-the-Art
Method | Backbone | Pre-Train Dataset | Resolution (H\(\,\times\,\)W) | Frames | GFLOPs (\(\times\,\)Views) | Params (M) | Training Epochs | Top-1 (%) | Top-5 (%) |
---|---|---|---|---|---|---|---|---|---|
3D CNNs | | | | | | | | | |
I3D [7] | Inception V1 | ImgNet-1K | 224\(\,\times\,\)224 | 250 | 108\(\,\times\,\)NA | 12.0 | - | 71.1 | 90.3 |
Two-Stream I3D [7] | Inception V1 | ImgNet-1K | 224\(\,\times\,\)224 | 500 | 216\(\,\times\,\)NA | 25.0 | - | 75.7 | 92.0 |
ARTNet [52] | ResNet18 | None | 112\(\,\times\,\) 112 | 16 | 24\(\,\times\,\)250 | 35.2 | - | 69.2 | 88.3 |
S3D-G [58] | Inception V1 | ImgNet-1K | 224\(\,\times\,\)224 | 250 | 71\(\,\times\,\)NA | 11.5 | 112 | 74.7 | 93.4 |
Non-Local R101 [54] | ResNet101 | ImgNet-1K | 256\(\,\times\,\) 256 | 128 | 359\(\,\times\,\) 30 | 54.3 | 196 | 77.7 | 93.3 |
SlowFast\({}_{16\times 8}\) [14] | ResNet101+NL | None | 256\(\,\times\,\) 256 | 32 | 234\(\,\times\,\) 30 | 59.9 | 196 | 79.8 | 93.9 |
ip-CSN [47] | ResNet152 | Sports1M | 224\(\,\times\,\) 224 | 32 | 109\(\,\times\,\) 30 | - | 45 | 79.2 | 93.8 |
SmallBigNet [26] | ResNet101 | ImgNet-1K | 224\(\,\times\,\) 224 | 32 | 418\(\,\times\,\) 12 | - | 110 | 77.4 | 93.3 |
X3D-XL [13] | X2D | None | 356\(\,\times\,\) 356 | 16 | 48\(\,\times\,\) 30 | 11.0 | 256 | 79.1 | 93.9 |
CorrNet [51] | ResNet101 | None | 224\(\,\times\,\) 224 | 32 | 224\(\,\times\,\) 30 | - | 250 | 79.2 | - |
2D CNNs | | | | | | | | | |
TSM [32] | ResNet50 | ImgNet-1K | 256\(\,\times\,\) 256 | 16 | 65\(\,\times\,\) 10 | 24.3 | 100 | 74.7 | - |
TAM [12] | bLResNet50 | Kinetics-400 | 224\(\,\times\,\) 224 | 48 | 93\(\,\times\,\) 9 | 25.0 | 75 | 73.5 | 91.2 |
TEA [28] | ResNet50 | ImgNet-1K | 256\(\,\times\,\) 256 | 16 | 70\(\,\times\,\) 30 | 35.3 | 50 | 76.1 | 92.5 |
TEINet [34] | ResNet50 | ImgNet-1K | 256\(\,\times\,\) 256 | 16 | 66\(\,\times\,\) 30 | 30.8 | 100 | 76.2 | 92.5 |
TANet [36] | ResNet50 | ImgNet-1K | 256\(\,\times\,\) 256 | 16 | 86\(\,\times\,\) 12 | 25.6 | 100 | 76.9 | 92.9 |
TDN-R101 [53] | ResNet101 | ImgNet-1K | 256\(\,\times\,\) 256 | 24 | 198\(\,\times\,\) 30 | 43.9 | 100 | 79.4 | 94.4 |
GC-TDN-R50 [20] | ResNet50 | ImgNet-1K | 256\(\,\times\,\) 256 | 24 | 110\(\,\times\,\) 30 | 27.4 | 100 | 79.6 | 94.1 |
MDAF [50] | ResNet50 | ImgNet-1K | 224\(\,\times\,\) 224 | 8 | 34\(\,\times\,\) 30 | 24.5 | 150 | 76.2 | 92.0 |
CANet [16] | ResNet101 | ImgNet-1K | 256\(\,\times\,\) 256 | 8 | 67\(\,\times\,\)30 | 44.1 | 50 | 77.9 | 93.5 |
Transformers | | | | | | | | | |
LAPS [62] | Visformer-S | ImgNet-10K | 224\(\,\times\,\) 224 | 8 | 40\(\,\times\,\) 15 | 39.8 | 18 | 76.0 | 92.6 |
ViT (Video) [10] | ViT-B | ImgNet-21K | 224\(\,\times\,\) 224 | 8 | 135\(\,\times\,\) 30 | 85.9 | 18 | 76.0 | 92.5 |
TokShift [63] | ViT-B | ImgNet-21K | 224\(\,\times\,\) 224 | 16 | 270\(\,\times\,\) 30 | 85.9 | 18 | 78.2 | 93.8 |
TokShift (MR) [63] | ViT-B | ImgNet-21K | 256\(\,\times\,\) 256 | 8 | 176\(\,\times\,\) 30 | 85.9 | 18 | 77.7 | 93.6 |
VTN [39] | ViT-B | ImgNet-21K | 224\(\,\times\,\) 224 | 250 | 4218\(\,\times\,\) 1 | 114.0 | 25 | 78.6 | 93.7 |
TimeSformer [4] | ViT-B | ImgNet-21K | 224\(\,\times\,\) 224 | 8 | 197\(\,\times\,\) 3 | 121.4 | 15 | 78.0 | 93.7 |
TimeSformer-HR [4] | ViT-B | ImgNet-21K | 448\(\,\times\,\) 448 | 16 | 1703\(\,\times\,\) 3 | 121.4 | 15 | 79.7 | 94.4 |
TimeSformer-L [4] | ViT-B | ImgNet-21K | 224\(\,\times\,\) 224 | 96 | 2380\(\,\times\,\) 3 | 121.4 | 15 | 80.7 | 94.7 |
STTM [15] | ViT-B | ImgNet-21K | 224\(\,\times\,\) 224 | 11 | 2288\(\,\times\,\) 1 | 89.6 | 30 | 80.2 | - |
MACS-Visf | Visformer-S | ImgNet-10K | 224\(\,\times\,\) 224 | 8 | 41\(\,\times\,\) 15 | 39.8 | 18 | 76.2 | 92.4 |
MACS-Visf (H) | Visformer-S | ImgNet-10K | 320\(\,\times\,\) 320 | 32 | 472\(\,\times\,\) 3 | 40.0 | 18 | 79.8 | 94.3 |
MACS-ViT | ViT-B | ImgNet-21K | 224\(\,\times\,\) 224 | 8 | 152\(\,\times\,\) 15 | 86.1 | 18 | 78.1 | 93.7 |
MACS-ViT (L) | ViT-B | ImgNet-21K | 224\(\,\times\,\) 224 | 48 | 1265\(\,\times\,\) 3 | 86.1 | 18 | 80.8 | 94.8 |
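Notably, MACS-ViT (L) edges out TimeSformer-L (80.8% vs. 80.7% top-1) at roughly half the per-view inference compute (1265 vs. 2380 GFLOPs), and MACS-Visf attains slightly higher top-1 than LAPS on the same Visformer-S backbone at essentially the same cost (76.2% vs. 76.0%, \(\sim\)40 GFLOPs).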
Method | Backbone | #F\(\,\times\,\)Res (T\(\,\times\,\)HW) | Top-1 (%) |
---|---|---|---|
TSM [32] | ResNet50 | 16\(\,\times\, 224^{2}\) | 63.4 |
SlowFast [14] | ResNet50 | 16\(\,\times\, 224^{2}\) | 61.7 |
SmallBig [26] | ResNet50 | 24\(\,\times\, 224^{2}\) | 63.3 |
VidTr-L [27] | ViT-B | 32\(\,\times\, 224^{2}\) | 63.0 |
TimeSformer-HR [4] | ViT-B | 16\(\,\times\, 448^{2}\) | 62.2 |
TimeSformer-L [4] | ViT-B | 96\(\,\times\, 224^{2}\) | 62.4 |
MACS-Visf (H) | Visformer-S | 32\(\,\times\, 320^{2}\) | 64.8 |
Method | Pre-Train Dataset | #F\(\,\times\,\)Res (T\(\,\times\,\)HW) | Top-1 (%) |
---|---|---|---|
P3D [42] | Sports-1M | 16\(\,\times\,224^{2}\) | 84.2 |
TSM [32] | Kinetics-400 | 8\(\,\times\,256^{2}\) | 95.9 |
ViT (Video) [10] | ImageNet-21k | 8\(\,\times\,256^{2}\) | 91.5 |
TokShift [63] | Kinetics-400 | 8\(\,\times\,256^{2}\) | 95.4 |
TokShift-L (HR) [63] | Kinetics-400 | 8\(\,\times\,384^{2}\) | 96.8 |
LAPS (H) [62] | Kinetics-400 | 32\(\,\times\,320^{2}\) | 96.9 |
MACS-Visf (H) | Kinetics-400 | 32\(\,\times\,320^{2}\) | 97.2 |
MACS-ViT (L) | Kinetics-400 | 48\(\,\times\,224^{2}\) | 97.3 |
Method | Pre-Train Dataset | #F\(\,\times\,\)Res (T\(\,\times\,\)HW) | Top-1 (%) |
---|---|---|---|
TSM [32] | Kinetics-400 | 8\(\,\times\, 224^{2}\) | 63.5 |
SAP [55] | Kinetics-400 | 64\(\,\times\, 256^{2}\) | 64.1 |
ViT (Video) [10] | ImageNet-21k | 8\(\,\times\, 224^{2}\) | 62.6 |
TokShift (HR) [63] | Kinetics-400 | 8\(\,\times\, 384^{2}\) | 65.8 |
LAPS (H) [62] | Kinetics-400 | 32\(\,\times\, 320^{2}\) | 66.1 |
MACS-Visf (H) | Kinetics-400 | 32\(\,\times\, 320^{2}\) | 66.3 |
MACS-ViT (L) | Kinetics-400 | 48\(\,\times\, 224^{2}\) | 67.3 |
4.5 Visualization



4.6 Discussion
5 Conclusion
References