ActionFormer: Localizing Moments of Actions with Transformers

Zhang, Chenlin; Wu, Jianxin; Li, Yin

arXiv:2202.07925 (cs)

[Submitted on 16 Feb 2022 (v1), last revised 28 Aug 2022 (this version, v2)]

Abstract: Self-attention-based Transformer models have demonstrated impressive results for image classification and object detection, and more recently for video understanding. Inspired by this success, we investigate the application of Transformer networks for temporal action localization in videos. To this end, we present ActionFormer -- a simple yet powerful model to identify actions in time and recognize their categories in a single shot, without using action proposals or relying on pre-defined anchor windows. ActionFormer combines a multiscale feature representation with local self-attention, and uses a lightweight decoder to classify every moment in time and estimate the corresponding action boundaries. We show that this orchestrated design results in major improvements upon prior work. Without bells and whistles, ActionFormer achieves 71.0% mAP at tIoU=0.5 on THUMOS14, outperforming the best prior model by 14.1 absolute percentage points. Further, ActionFormer demonstrates strong results on ActivityNet 1.3 (36.6% average mAP) and EPIC-Kitchens 100 (+13.5% average mAP over prior work). Our code is available at this http URL.
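
To make the abstract's terms concrete, the following is a minimal sketch of two ideas it mentions: the temporal IoU (tIoU) used to match a predicted segment against a ground-truth one, and the step of turning per-moment outputs into action segments. It is not the authors' implementation. The function names, the 0.5 score threshold, and the choice to represent boundaries as distances from each moment to the action's start and end are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two 1-D time intervals."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def decode_segments(
    times: Sequence[float],
    scores: Sequence[float],
    dist_to_start: Sequence[float],
    dist_to_end: Sequence[float],
    threshold: float = 0.5,  # hypothetical cutoff, not taken from the paper
) -> List[Tuple[float, float, float]]:
    """Turn per-moment outputs into (start, end, score) candidates.

    Assumes each moment t carries a class score and two non-negative
    distances, so a confident moment proposes [t - d_start, t + d_end].
    """
    candidates = []
    for t, score, d_start, d_end in zip(times, scores, dist_to_start, dist_to_end):
        if score >= threshold:
            candidates.append((t - d_start, t + d_end, score))
    return candidates


if __name__ == "__main__":
    predicted = decode_segments(
        times=[1.0, 2.0, 3.0],
        scores=[0.2, 0.9, 0.4],
        dist_to_start=[0.5, 1.0, 0.5],
        dist_to_end=[0.5, 2.0, 0.5],
    )
    ground_truth = (1.0, 5.0)
    for start, end, score in predicted:
        print(start, end, score, temporal_iou((start, end), ground_truth))
```

Under this reading, a detection counts as correct at tIoU=0.5 when its overlap with a ground-truth segment of the same class reaches at least 0.5. A full pipeline would typically also suppress overlapping candidates, which this sketch omits.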

Comments:	Accepted to ECCV 2022
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2202.07925 [cs.CV]
	(or arXiv:2202.07925v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2202.07925

Submission history

From: Yin Li
[v1] Wed, 16 Feb 2022 08:34:11 UTC (1,305 KB)
[v2] Sun, 28 Aug 2022 19:46:29 UTC (1,371 KB)
