Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding

Tan, Chaolei; Lai, Jianhuang; Zheng, Wei-Shi; Hu, Jian-Fang

arXiv:2403.11463 (cs)

[Submitted on 18 Mar 2024 (v1), last revised 14 May 2024 (this version, v2)]

Abstract: Video Paragraph Grounding (VPG) is an emerging task in video-language understanding that aims to localize multiple sentences with semantic relations and temporal order in an untrimmed video. However, existing VPG approaches rely heavily on a considerable number of temporal labels that are laborious and time-consuming to acquire. In this work, we introduce and explore Weakly-Supervised Video Paragraph Grounding (WSVPG) to eliminate the need for temporal annotations. Unlike previous weakly-supervised grounding frameworks, which are based on multiple instance learning or reconstruction learning for two-stage candidate ranking, we propose a novel siamese learning framework that jointly learns cross-modal feature alignment and temporal coordinate regression without timestamp labels, achieving concise one-stage localization for WSVPG. Specifically, we devise a Siamese Grounding TRansformer (SiamGTR) consisting of two weight-sharing branches for learning complementary supervision. An Augmentation Branch directly regresses the temporal boundaries of a complete paragraph within a pseudo video, and an Inference Branch captures the order-guided feature correspondence for localizing multiple sentences in a normal video. Extensive experiments demonstrate that our paradigm offers superior practicality and flexibility for efficient weakly-supervised or semi-supervised learning, outperforming state-of-the-art methods trained with the same or stronger supervision.
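
The abstract does not say how the pseudo video used by the Augmentation Branch is built. The sketch below illustrates one possible construction, purely as an assumption: the features of the video paired with the paragraph are spliced into an unrelated background video at a random cut point, so the normalized span of the paired video is known for free and can serve as a regression target without any timestamp labels. The names `make_pseudo_video` and `l1_span_loss` are hypothetical and are not taken from the paper or its code.

```python
import random
from typing import List, Sequence, Tuple

Frame = Sequence[float]  # one per-frame (or per-clip) feature vector


def make_pseudo_video(
    paired: List[Frame],
    background: List[Frame],
    rng: random.Random,
) -> Tuple[List[Frame], Tuple[float, float]]:
    """Splice `paired` into `background` at a random cut point.

    Assumed construction, not the paper's confirmed procedure.
    Returns the pseudo video and the normalized (start, end) span
    that the paired video occupies inside it.
    """
    if not paired:
        raise ValueError("paired video must contain at least one frame")
    # randint is inclusive on both ends, so the paired video may also be
    # placed at the very beginning or the very end of the background.
    cut = rng.randint(0, len(background))
    pseudo = background[:cut] + paired + background[cut:]
    total = len(pseudo)
    start = cut / total
    end = (cut + len(paired)) / total
    return pseudo, (start, end)


def l1_span_loss(
    pred: Tuple[float, float], target: Tuple[float, float]
) -> float:
    """L1 distance between a predicted and a target (start, end) span."""
    return abs(pred[0] - target[0]) + abs(pred[1] - target[1])


if __name__ == "__main__":
    rng = random.Random(0)
    paired = [[1.0]] * 4      # toy features for the paired video
    background = [[0.0]] * 6  # toy features for an unrelated video
    pseudo, span = make_pseudo_video(paired, background, rng)
    print(len(pseudo), span)
```

In this toy setup the span is the only supervision signal: a model that predicts `(start, end)` for the whole paragraph inside the pseudo video can be trained with `l1_span_loss` against it, while the original video never needs annotated timestamps.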

Comments:	Accepted to CVPR 2024. v2: fix a typo in figure 1
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2403.11463 [cs.CV]
	(or arXiv:2403.11463v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2403.11463

Submission history

From: Chaolei Tan
[v1] Mon, 18 Mar 2024 04:30:31 UTC (3,273 KB)
[v2] Tue, 14 May 2024 17:34:46 UTC (1,640 KB)
