On Pursuit of Designing Multi-modal Transformer for Video Grounding

Cao, Meng; Chen, Long; Shou, Mike Zheng; Zhang, Can; Zou, Yuexian

arXiv:2109.06085 (cs)

[Submitted on 13 Sep 2021 (v1), last revised 11 Apr 2022 (this version, v2)]

Abstract: Video grounding aims to localize the temporal segment corresponding to a sentence query from an untrimmed video. Almost all existing video grounding methods fall into two frameworks: 1) Top-down models, which predefine a set of segment candidates and then perform segment classification and regression. 2) Bottom-up models, which directly predict frame-wise probabilities of the referential segment boundaries. However, none of these methods is end-to-end, i.e., they always rely on time-consuming post-processing steps to refine predictions. To this end, we reformulate video grounding as a set prediction task and propose a novel end-to-end multi-modal Transformer model, dubbed GTR. Specifically, GTR has two encoders for video and language encoding, and a cross-modal decoder for grounding prediction. To facilitate end-to-end training, we use a Cubic Embedding layer to transform the raw videos into a set of visual tokens. To better fuse the two modalities in the decoder, we design a new Multi-head Cross-Modal Attention. The whole GTR is optimized via a Many-to-One matching loss. Furthermore, we conduct comprehensive studies to investigate different model design choices. Extensive results on three benchmarks validate the superiority of GTR. All three typical GTR variants achieve record-breaking performance on all datasets and metrics, while running inference several times faster.
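To make the set-prediction framing more concrete, here is a minimal sketch of how several predicted segments could be matched to one ground-truth segment. It is not the paper's loss: the abstract does not say how the Many-to-One matching is computed, so ranking by temporal IoU, the helper names `temporal_iou` and `match_many_to_one`, and the parameter `k` are all assumptions made for illustration.

```python
def temporal_iou(a, b):
    """Intersection-over-union of two (start, end) segments, in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def match_many_to_one(predictions, ground_truth, k=2):
    """Return the indices of the k predictions that best overlap the ground truth.

    Assumption: candidates are ranked by temporal IoU only; the matching cost
    used by GTR may include other terms and is not given in the abstract.
    """
    ranked = sorted(
        range(len(predictions)),
        key=lambda i: temporal_iou(predictions[i], ground_truth),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical model output: a fixed-size set of (start, end) segments.
predictions = [(0.0, 5.0), (4.0, 10.0), (5.0, 9.0), (12.0, 20.0)]
ground_truth = (5.0, 10.0)

# Several predictions are treated as positives for the single ground truth.
positives = match_many_to_one(predictions, ground_truth, k=2)
print(positives)
```

In a training loop under these assumptions, the matched predictions would receive the regression and classification targets of the ground-truth segment, and the remaining predictions would be treated as background.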

Comments:	Accepted by Conference on Empirical Methods in Natural Language Processing (EMNLP 2021, Oral)
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2109.06085 [cs.CV]
	(or arXiv:2109.06085v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2109.06085

Submission history

From: Meng Cao
[v1] Mon, 13 Sep 2021 16:01:19 UTC (3,040 KB)
[v2] Mon, 11 Apr 2022 09:12:15 UTC (3,040 KB)
