Relaxed Transformer Decoders for Direct Action Proposal Generation

Tan, Jing; Tang, Jiaqi; Wang, Limin; Wu, Gangshan

arXiv:2102.01894 (cs)

[Submitted on 3 Feb 2021 (v1), last revised 19 Aug 2021 (this version, v3)]

Abstract: Temporal action proposal generation is an important and challenging task in video understanding, which aims at detecting all temporal segments containing action instances of interest. Existing proposal generation approaches are generally based on pre-defined anchor windows or heuristic bottom-up boundary matching strategies. This paper presents a simple and efficient framework (RTD-Net) for direct action proposal generation, by re-purposing a Transformer-like architecture. To tackle the essential visual difference between time and space, we make three important improvements over the original transformer detection framework (DETR). First, to deal with the slowness prior in videos, we replace the original Transformer encoder with a boundary attentive module to better capture long-range temporal information. Second, due to the ambiguous temporal boundaries and relatively sparse annotations, we present a relaxed matching scheme to relieve the strict criterion of a single assignment to each ground truth. Finally, we devise a three-branch head to further improve the proposal confidence estimation by explicitly predicting its completeness. Extensive experiments on the THUMOS14 and ActivityNet-1.3 benchmarks demonstrate the effectiveness of RTD-Net on both temporal action proposal generation and temporal action detection. Moreover, due to its simplicity in design, our framework is more efficient than previous proposal generation methods, without non-maximum suppression post-processing. The code and models are made available at this https URL.
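
As a rough illustration of the relaxed matching idea mentioned in the abstract, the sketch below first computes a strict one-to-one assignment between proposals and ground-truth segments, then additionally treats any proposal whose temporal IoU (tIoU) with some ground truth reaches a threshold as a positive. The abstract does not spell out the rule, so the threshold of 0.5, the function names, and the brute-force assignment (a small-input stand-in for the Hungarian matching used in DETR) are all assumptions made for illustration and may differ from the authors' implementation.

```python
from itertools import permutations


def tiou(a, b):
    """Temporal IoU of two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def strict_match(proposals, gts):
    """One-to-one assignment maximizing total tIoU.

    Brute force over permutations, so only suitable for tiny inputs;
    this is a hypothetical stand-in for Hungarian matching.
    """
    best, best_score = (), -1.0
    for perm in permutations(range(len(proposals)), len(gts)):
        score = sum(tiou(proposals[p], gts[g]) for g, p in enumerate(perm))
        if score > best_score:
            best, best_score = perm, score
    return set(best)


def relaxed_positives(proposals, gts, thresh=0.5):
    """Strict matches plus every proposal with tIoU >= thresh to some ground truth.

    The threshold value is an assumption, not taken from the abstract.
    """
    positives = strict_match(proposals, gts)
    for i, p in enumerate(proposals):
        if any(tiou(p, g) >= thresh for g in gts):
            positives.add(i)
    return positives


# Toy data: proposals 0 and 1 both overlap the first ground truth heavily.
proposals = [(0.0, 10.0), (1.0, 10.0), (20.0, 30.0), (50.0, 60.0)]
gts = [(0.0, 10.0), (21.0, 30.0)]

strict = strict_match(proposals, gts)        # one proposal per ground truth
relaxed = relaxed_positives(proposals, gts)  # also keeps near-duplicates
```

With this toy data, the strict assignment keeps only one proposal per ground truth, while the relaxed set also includes proposal 1, which overlaps the first ground truth almost as well as the matched proposal 0. Under ambiguous temporal boundaries, penalizing such a near-duplicate as a negative is the kind of strictness the relaxed scheme is described as relieving.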

Comments:	ICCV 2021 camera ready version
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2102.01894 [cs.CV]
	(or arXiv:2102.01894v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2102.01894

Submission history

From: Limin Wang
[v1] Wed, 3 Feb 2021 06:29:28 UTC (1,110 KB)
[v2] Wed, 7 Apr 2021 08:43:34 UTC (1,631 KB)
[v3] Thu, 19 Aug 2021 13:42:57 UTC (1,371 KB)
