DOI: 10.1145/3503161.3547976

Partially Relevant Video Retrieval

Published: 10 October 2022

Abstract

Current methods for text-to-video retrieval (T2VR) are trained and tested on video-captioning oriented datasets such as MSVD, MSR-VTT and VATEX. A key property of these datasets is that videos are assumed to be temporally pre-trimmed and short, while the provided captions describe the gist of the video content well. Consequently, for a given paired video and caption, the video is supposed to be fully relevant to the caption. In reality, however, as queries are not known a priori, pre-trimmed video clips may not contain sufficient content to fully meet a query. This suggests a gap between the literature and the real world. To fill the gap, we propose in this paper a novel T2VR subtask termed Partially Relevant Video Retrieval (PRVR). An untrimmed video is considered partially relevant with respect to a given textual query if it contains a moment relevant to the query. PRVR aims to retrieve such partially relevant videos from a large collection of untrimmed videos. PRVR differs from single video moment retrieval and video corpus moment retrieval, as the latter two aim to retrieve moments rather than untrimmed videos. We formulate PRVR as a multiple instance learning (MIL) problem, in which a video is viewed simultaneously as a bag of video clips and a bag of video frames. Clips and frames represent video content at different temporal scales. We propose a Multi-Scale Similarity Learning (MS-SL) network that jointly learns clip-scale and frame-scale similarities for PRVR. Extensive experiments on three datasets (TVR, ActivityNet Captions, and Charades-STA) demonstrate the viability of the proposed method. We also show that our method can be used to improve video corpus moment retrieval.
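The MIL formulation in the abstract can be sketched compactly. The following is an illustrative NumPy example, not the authors' MS-SL implementation: a video is treated both as a bag of frame vectors and as a bag of clip vectors obtained by mean-pooling consecutive frames at several temporal scales, and the video-query similarity fuses the best-matching (key) clip with the best-matching (key) frame. The fusion weight `alpha` and the scale set `(2, 4)` are made-up illustration values, and cosine similarity stands in for the learned similarity branches.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between each row of `a` and the vector `b`.
    a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b) + 1e-8)
    return a @ b

def multi_scale_clips(frames, scales=(2, 4)):
    # Build clip embeddings by mean-pooling runs of consecutive frames at
    # several temporal scales (a rough stand-in for the clip-scale branch).
    clips = []
    n = len(frames)
    for s in scales:
        for start in range(0, n - s + 1):
            clips.append(frames[start:start + s].mean(axis=0))
    return np.stack(clips)

def prvr_similarity(query, frames, alpha=0.7):
    # MIL-style similarity: the video is a bag of clips and a bag of frames;
    # score it by its best (key) clip and best (key) frame, then fuse.
    # `alpha` is an illustrative fusion weight, not a learned value.
    clip_sim = cosine(multi_scale_clips(frames), query).max()
    frame_sim = cosine(frames, query).max()
    return alpha * clip_sim + (1 - alpha) * frame_sim
```

Under this sketch, an untrimmed video containing one query-relevant moment scores higher than a video with no relevant content, even when most of its frames are irrelevant, which is the partial-relevance behaviour PRVR targets.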

Supplementary Material

MP4 File (MM22-fp0929.mp4)
This video accompanies the paper "Partially Relevant Video Retrieval", accepted at ACM MM 2022. To close the gap between the conventional text-to-video retrieval (T2VR) task in the literature and the real world, the paper proposes a novel T2VR subtask termed Partially Relevant Video Retrieval (PRVR). The video starts with the motivation of the paper, then covers related work, the method, and the experiments in turn. If you are interested in our research, please visit the paper's homepage at http://danieljf24.github.io/prvr/ for the full paper and source code.





Published In

MM '22: Proceedings of the 30th ACM International Conference on Multimedia
October 2022
7537 pages
ISBN:9781450392037
DOI:10.1145/3503161
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. multiple instance learning
  2. partially relevant
  3. video representation learning
  4. video-text retrieval

Qualifiers

  • Research-article

Funding Sources

  • National Key R&D Program of China
  • NSFC
  • Public Welfare Technology Research Project of Zhejiang Province

Conference

MM '22

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%



Article Metrics

  • Downloads (last 12 months): 130
  • Downloads (last 6 weeks): 22
Reflects downloads up to 10 Nov 2024


Cited By

  • (2024) Exploiting Instance-level Relationships in Weakly Supervised Text-to-Video Retrieval. ACM Transactions on Multimedia Computing, Communications, and Applications 20(10), 1-21. DOI: 10.1145/3663571. Online publication date: 12-Sep-2024.
  • (2024) A New Framework for Evaluating Faithfulness of Video Moment Retrieval against Multiple Distractors. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, 2869-2878. DOI: 10.1145/3627673.3679838. Online publication date: 21-Oct-2024.
  • (2024) Weakly Supervised Video Moment Retrieval via Location-irrelevant Proposal Learning. Companion Proceedings of the ACM on Web Conference 2024, 1595-1603. DOI: 10.1145/3589335.3651942. Online publication date: 13-May-2024.
  • (2024) Toward Video Anomaly Retrieval From Video Anomaly Detection: New Benchmarks and Model. IEEE Transactions on Image Processing 33, 2213-2225. DOI: 10.1109/TIP.2024.3374070. Online publication date: 2024.
  • (2024) Emotional Video Captioning With Vision-Based Emotion Interpretation Network. IEEE Transactions on Image Processing 33, 1122-1135. DOI: 10.1109/TIP.2024.3359045. Online publication date: 2024.
  • (2024) Video Corpus Moment Retrieval via Deformable Multigranularity Feature Fusion and Adversarial Training. IEEE Transactions on Circuits and Systems for Video Technology 34(8), 6686-6698. DOI: 10.1109/TCSVT.2023.3294567. Online publication date: Aug-2024.
  • (2024) Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 16551-16560. DOI: 10.1109/CVPR52733.2024.01566. Online publication date: 16-Jun-2024.
  • (2024) Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 13569-13580. DOI: 10.1109/CVPR52733.2024.01288. Online publication date: 16-Jun-2024.
  • (2024) Hierarchical matching and reasoning for multi-query image retrieval. Neural Networks 173, 106200. DOI: 10.1016/j.neunet.2024.106200. Online publication date: May-2024.
  • (2023) Feature Enhancement and Foreground-Background Separation for Weakly Supervised Temporal Action Localization. Proceedings of the 5th ACM International Conference on Multimedia in Asia, 1-7. DOI: 10.1145/3595916.3626423. Online publication date: 6-Dec-2023.
