MorphMLP: An Efficient MLP-Like Backbone for Spatial-Temporal Representation Learning

Zhang, David Junhao; Li, Kunchang; Wang, Yali; Chen, Yunpeng; Chandra, Shashwat; Qiao, Yu; Liu, Luoqi; Shou, Mike Zheng

arXiv:2111.12527 (cs)

[Submitted on 24 Nov 2021 (v1), last revised 23 Aug 2022 (this version, v3)]

Abstract: MLP-Like networks have recently been revived for image recognition. However, whether a generic MLP-Like architecture can be built for the video domain has not been explored, because spatial-temporal modeling is complex and computationally heavy. To fill this gap, we present an efficient self-attention-free backbone, MorphMLP, which flexibly leverages the concise Fully-Connected (FC) layer for video representation learning. Specifically, a MorphMLP block consists of two key layers in sequence, MorphFC_s and MorphFC_t, for spatial and temporal modeling respectively. MorphFC_s captures core semantics in each frame through progressive token interaction along both the height and width dimensions. MorphFC_t, in turn, adaptively learns long-term dependencies across frames through temporal token aggregation at each spatial location. With this multi-dimension and multi-scale factorization, the MorphMLP block achieves a strong accuracy-computation balance. We evaluate MorphMLP on a number of popular video benchmarks. Compared with recent state-of-the-art models, MorphMLP significantly reduces computation while improving accuracy: MorphMLP-S uses only 50% of the GFLOPs of VideoSwin-T yet improves top-1 accuracy by 0.9% on Kinetics400 under ImageNet1K pretraining, and MorphMLP-B uses only 43% of the GFLOPs of MViT-B yet improves top-1 accuracy by 2.4% on SSV2, even though MorphMLP-B is pretrained on ImageNet1K while MViT-B is pretrained on Kinetics400. Moreover, our method adapted to the image domain outperforms previous SOTA MLP-Like architectures. Code is available at this https URL.
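The factorization described in the abstract can be illustrated with a rough NumPy sketch. This is not the authors' implementation: the helper names `chunked_fc` and `morph_block_sketch`, the `(T, H, W, C)` tensor layout, the fixed chunk length, and the use of a single shared weight matrix per axis are all assumptions made for illustration, and normalization, channel mixing, skip connections, and the multi-scale schedule are left out. It only shows the general idea of using an FC layer to mix tokens along width and height within a frame, and then along time at each spatial location.

```python
import numpy as np


def chunked_fc(x, weight, axis, chunk_len):
    """Mix tokens along `axis` inside non-overlapping chunks of `chunk_len`.

    Hypothetical helper: one (chunk_len, chunk_len) weight is shared by
    every chunk, every channel, and every position on the other axes.
    """
    x = np.moveaxis(x, axis, -1)          # bring the axis to mix to the end
    n = x.shape[-1]
    assert n % chunk_len == 0, "axis length must be divisible by chunk_len"
    chunks = x.reshape(*x.shape[:-1], n // chunk_len, chunk_len)
    mixed = chunks @ weight               # FC applied to each chunk of tokens
    mixed = mixed.reshape(*x.shape[:-1], n)
    return np.moveaxis(mixed, -1, axis)   # restore the original layout


def morph_block_sketch(x, w_width, w_height, w_time, chunk_len):
    """Spatial mixing (width, then height) followed by temporal mixing.

    x has the assumed layout (T, H, W, C).
    """
    # Spatial step: token interaction along width, then along height.
    x = chunked_fc(x, w_width, axis=2, chunk_len=chunk_len)
    x = chunked_fc(x, w_height, axis=1, chunk_len=chunk_len)
    # Temporal step: aggregate all T frames at each spatial location.
    x = chunked_fc(x, w_time, axis=0, chunk_len=x.shape[0])
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, H, W, C, L = 4, 8, 8, 3, 4         # illustrative sizes only
    video = rng.standard_normal((T, H, W, C))
    out = morph_block_sketch(
        video,
        w_width=rng.standard_normal((L, L)),
        w_height=rng.standard_normal((L, L)),
        w_time=rng.standard_normal((T, T)),
        chunk_len=L,
    )
    print(out.shape)
```

In this sketch each mixing step costs on the order of the chunk length per token, which is one way to see why factorizing along separate axes is cheaper than mixing all spatial-temporal tokens jointly. Varying `chunk_len` from stage to stage would be one way to mimic the progressive, multi-scale behaviour the abstract mentions, though the exact schedule is not given here.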

Comments:	ECCV2022
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2111.12527 [cs.CV]
	(or arXiv:2111.12527v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2111.12527

Submission history

From: Junhao Zhang
[v1] Wed, 24 Nov 2021 14:52:20 UTC (1,829 KB)
[v2] Mon, 15 Aug 2022 07:21:36 UTC (5,541 KB)
[v3] Tue, 23 Aug 2022 12:05:19 UTC (5,541 KB)