Multiscale Vision Transformers meet Bipartite Matching for efficient single-stage Action Localization

Ntinou, Ioanna; Sanchez, Enrique; Tzimiropoulos, Georgios

arXiv:2312.17686 (cs)

[Submitted on 29 Dec 2023 (v1), last revised 23 May 2024 (this version, v2)]

Abstract: Action Localization is a challenging problem that combines detection and recognition tasks, which are often addressed separately. State-of-the-art methods rely on off-the-shelf bounding box detections pre-computed at high resolution, and propose transformer models that focus on the classification task alone. Such two-stage solutions are prohibitive for real-time deployment. On the other hand, single-stage methods target both tasks by devoting part of the network (generally the backbone) to sharing the majority of the workload, compromising performance for speed. These methods build on adding a DETR head with learnable queries that, after cross- and self-attention, can be sent to corresponding MLPs for detecting a person's bounding box and action. However, DETR-like architectures are challenging to train and can incur high complexity.
In this paper, we observe that a straight bipartite matching loss can be applied to the output tokens of a vision transformer. This results in a backbone + MLP architecture that can do both tasks without the need for an extra encoder-decoder head and learnable queries. We show that a single MViTv2-S architecture trained with bipartite matching to perform both tasks surpasses the same MViTv2-S when trained with RoI align on pre-computed bounding boxes. With a careful design of token pooling and the proposed training pipeline, our Bipartite-Matching Vision Transformer model, BMViT, achieves +3 mAP on AVA2.2 with respect to the two-stage MViTv2-S counterpart. Code is available at this https URL
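
To make the central idea concrete, the following is a minimal sketch of bipartite matching between per-token predictions and ground-truth persons. It is not the authors' implementation. The cost function (negative confidence plus L1 box distance), the dictionary layout, and the names `match_cost` and `bipartite_match` are assumptions chosen for illustration. The exhaustive search stands in for the Hungarian algorithm that practical code would typically use (for example `scipy.optimize.linear_sum_assignment`), and is only viable for tiny inputs.

```python
from itertools import permutations


def l1(a, b):
    """L1 distance between two boxes given as (x1, y1, x2, y2)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def match_cost(pred, gt, box_weight=1.0):
    """Assumed cost of assigning one token prediction to one ground-truth person.

    A higher predicted confidence lowers the cost; a larger box distance raises it.
    """
    return -pred["score"] + box_weight * l1(pred["box"], gt["box"])


def bipartite_match(preds, gts):
    """Return (pred_idx, gt_idx) pairs that minimise the total assignment cost.

    Brute-force search over one-to-one assignments; assumes len(preds) >= len(gts).
    """
    best_cost, best = float("inf"), ()
    for perm in permutations(range(len(preds)), len(gts)):
        cost = sum(match_cost(preds[p], gts[g]) for g, p in enumerate(perm))
        if cost < best_cost:
            best_cost, best = cost, perm
    return [(p, g) for g, p in enumerate(best)]


# Hypothetical outputs for four tokens: a box and a "person" confidence each.
preds = [
    {"box": (0.10, 0.10, 0.40, 0.90), "score": 0.9},
    {"box": (0.50, 0.20, 0.90, 0.80), "score": 0.8},
    {"box": (0.00, 0.00, 1.00, 1.00), "score": 0.1},
    {"box": (0.12, 0.10, 0.40, 0.90), "score": 0.3},
]
# Two annotated persons in the frame.
gts = [
    {"box": (0.50, 0.20, 0.90, 0.80)},
    {"box": (0.10, 0.10, 0.40, 0.90)},
]

matches = bipartite_match(preds, gts)
matched_tokens = {p for p, _ in matches}
unmatched = [i for i in range(len(preds)) if i not in matched_tokens]
print(matches)    # token/ground-truth pairs that would receive box and action losses
print(unmatched)  # tokens that would be supervised as background
```

In a training loop built along these lines, the matched tokens would be supervised with box and action targets while the remaining tokens are pushed towards a "no person" prediction, which is what allows a backbone + MLP model to do without learnable queries.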

Comments:	CVPR 2024
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2312.17686 [cs.CV]
	(or arXiv:2312.17686v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2312.17686

Submission history

From: Ioanna Ntinou
[v1] Fri, 29 Dec 2023 17:08:38 UTC (39,339 KB)
[v2] Thu, 23 May 2024 15:52:11 UTC (16,535 KB)
