Weakly Supervised Dense Video Captioning via Jointly Usage of Knowledge Distillation and Cross-modal Matching

Wu, Bofeng; Niu, Guocheng; Yu, Jun; Xiao, Xinyan; Zhang, Jian; Wu, Hua

Computer Science > Computer Vision and Pattern Recognition

arXiv:2105.08252 (cs)

[Submitted on 18 May 2021]

Title:Weakly Supervised Dense Video Captioning via Jointly Usage of Knowledge Distillation and Cross-modal Matching

Authors:Bofeng Wu, Guocheng Niu, Jun Yu, Xinyan Xiao, Jian Zhang, Hua Wu

View PDF

Abstract:This paper proposes an approach to Dense Video Captioning (DVC) without pairwise event-sentence annotation. First, we adopt the knowledge distilled from relevant and well solved tasks to generate high-quality event proposals. Then we incorporate contrastive loss and cycle-consistency loss typically applied to cross-modal retrieval tasks to build semantic matching between the proposals and sentences, which are eventually used to train the caption generation module. In addition, the parameters of matching module are initialized via pre-training based on annotated images to improve the matching performance. Extensive experiments on ActivityNet-Caption dataset reveal the significance of distillation-based event proposal generation and cross-modal retrieval-based semantic matching to weakly supervised DVC, and demonstrate the superiority of our method to existing state-of-the-art methods.

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2105.08252 [cs.CV]
	(or arXiv:2105.08252v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2105.08252

Submission history

From: Bofeng Wu [view email]
[v1] Tue, 18 May 2021 03:21:37 UTC (2,652 KB)

Full-text links:

Access Paper:

view license

Current browse context:

cs.CV

< prev | next >

new | recent | 2021-05

Change to browse by:

References & Citations

DBLP - CS Bibliography

listing | bibtex

Guocheng Niu
Jun Yu
Xinyan Xiao
Jian Zhang
Hua Wu

export BibTeX citation

Computer Science > Computer Vision and Pattern Recognition

Title:Weakly Supervised Dense Video Captioning via Jointly Usage of Knowledge Distillation and Cross-modal Matching

Submission history

Access Paper:

References & Citations

DBLP - CS Bibliography

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Computer Vision and Pattern Recognition

Title:Weakly Supervised Dense Video Captioning via Jointly Usage of Knowledge Distillation and Cross-modal Matching

Submission history

Access Paper:

References & Citations

DBLP - CS Bibliography

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators