
Reducing 0s bias in video moment retrieval with a circular competence-based captioner

Published: 01 March 2023

Abstract

The current study addresses the problem of retrieving a specific moment from an untrimmed video given a sentence query. Existing methods have achieved high performance by designing various structures to match visual-text relations. Yet these methods tend to return an interval starting from 0s, a phenomenon we name “0s bias”. In this paper, we propose a Circular Co-Teaching (CCT) mechanism that uses a captioner to improve an existing retrieval model (localizer) from two aspects: biased annotations and easy samples. Correspondingly, CCT contains two processes: (1) Pseudo Query Generation (captioner to localizer), which transfers knowledge from generated queries to the localizer to balance annotations; (2) Competence-based Curriculum Learning (localizer to captioner), which trains the captioner in an easy-to-hard fashion guided by localization results, so that pairs of false-positive moments and pseudo queries become easy samples for the localizer. Extensive experiments show that CCT alleviates “0s bias”, improving existing approaches by as much as 4% on average on two public datasets (ActivityNet-Captions and Charades-STA) in terms of R@1,IoU=0.7. Notably, our method also outperforms baselines in an out-of-distribution scenario. We also quantitatively validate CCT’s ability to cope with “0s bias” using a proposed metric, DM. Our study not only contributes theoretically to detecting “0s bias”, but also provides a highly effective tool for video moment retrieval by alleviating such bias.
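To make the first process concrete, the following is a minimal Python sketch, under our own assumptions, of how pseudo (moment, query) pairs might be built from 0s-anchored false positives and added to the localizer's training data. The `Sample` fields, the `localize`/`caption` callables, the 1-second "starts near 0s" heuristic, and the IoU threshold are illustrative placeholders, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical data structures; field names are illustrative, not from the paper.
@dataclass
class Sample:
    video_id: str
    query: str
    start: float   # moment start time (seconds)
    end: float     # moment end time (seconds)

def temporal_iou(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Temporal IoU between two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def generate_pseudo_pairs(
    samples: List[Sample],
    localize: Callable[[str, str], Tuple[float, float]],  # (video_id, query) -> predicted (start, end)
    caption: Callable[[str, Tuple[float, float]], str],   # (video_id, interval) -> pseudo query
    fp_iou_thresh: float = 0.5,
) -> List[Sample]:
    """Build pseudo (moment, query) pairs for 0s-anchored false positives.

    If the localizer predicts an interval that starts near 0s but overlaps the
    annotation poorly (a "0s bias" false positive), the captioner describes
    that predicted interval, and the resulting pair is added as an extra,
    easy training sample for the localizer.
    """
    pseudo: List[Sample] = []
    for s in samples:
        pred = localize(s.video_id, s.query)
        starts_at_zero = pred[0] < 1.0  # heuristic: prediction anchored near 0s
        poor_overlap = temporal_iou(pred, (s.start, s.end)) < fp_iou_thresh
        if starts_at_zero and poor_overlap:
            pseudo.append(Sample(s.video_id, caption(s.video_id, pred), pred[0], pred[1]))
    return pseudo

if __name__ == "__main__":
    # Toy stubs standing in for trained models, just to make the sketch runnable.
    train_set = [Sample("v1", "the person opens the fridge", 12.0, 18.0)]
    dummy_localizer = lambda vid, q: (0.0, 5.0)  # always predicts a 0s-anchored moment
    dummy_captioner = lambda vid, span: "a person walks into the kitchen"
    augmented = train_set + generate_pseudo_pairs(train_set, dummy_localizer, dummy_captioner)
    print(len(augmented), "training samples after augmentation")
```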

Highlights

We, for the first time, detect and alleviate the “0s bias” in video moment retrieval.
We show that “0s bias” leads models to predict a moment starting from 0s rather than the correct one.
We alleviate “0s bias” by Circular Co-Teaching of a localizer and a captioner.
Our model improves the existing methods with little extra training cost as a plug-in.
We generate pseudo queries matching the false-positive moments starting from 0s.
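The easy-to-hard pacing in the second process (Competence-based Curriculum Learning) can be sketched with the square-root competence schedule that is common in the curriculum-learning literature. It is an assumption, made only for this illustration, that the paper uses this exact pacing function or a difficulty score of 1 minus the localizer's IoU; the function and parameter names below are hypothetical. Under such a schedule the captioner first sees moments the localizer already grounds well, and harder pairs are admitted as competence grows.

```python
import math
from typing import List, Tuple

def competence(step: int, total_steps: int, c0: float = 0.1) -> float:
    """Square-root competence schedule, growing from c0 to 1 over total_steps:
    c(t) = min(1, sqrt(t * (1 - c0^2) / T + c0^2))."""
    return min(1.0, math.sqrt(step * (1.0 - c0 ** 2) / total_steps + c0 ** 2))

def curriculum_subset(
    scored_samples: List[Tuple[float, str]],  # (difficulty in [0, 1], sample id)
    step: int,
    total_steps: int,
) -> List[str]:
    """Keep only the easiest fraction of samples allowed by the current
    competence; difficulty could be 1 - localizer IoU, so the captioner is
    trained first on moments the localizer localizes well."""
    c = competence(step, total_steps)
    ranked = sorted(scored_samples)            # easiest first
    cutoff = max(1, int(c * len(ranked)))      # admit the easiest c-fraction
    return [sid for _, sid in ranked[:cutoff]]

if __name__ == "__main__":
    samples = [(0.9, "hard"), (0.2, "easy"), (0.5, "medium")]
    for t in (0, 500, 1000):
        print(t, round(competence(t, 1000), 3), curriculum_subset(samples, t, 1000))
```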


Cited By

  • (2024) Unsupervised Video Moment Retrieval with Knowledge-Based Pseudo-Supervision Construction. ACM Transactions on Information Systems, 43(1), 1–26. https://doi.org/10.1145/3701229
  • (2024) Routing Evidence for Unseen Actions in Video Moment Retrieval. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 3024–3035. https://doi.org/10.1145/3637528.3671693
  • (2023) Self-Supervised Graph Convolution for Video Moment Retrieval. Artificial Neural Networks and Machine Learning – ICANN 2023, 407–419. https://doi.org/10.1007/978-3-031-44204-9_34


          Published In

Information Processing and Management: an International Journal, Volume 60, Issue 2
          Mar 2023
          1443 pages

          Publisher

          Pergamon Press, Inc.

          United States

          Publication History

          Published: 01 March 2023

          Author Tags

          1. Video moment retrieval
          2. Competence-based captioner
          3. Circular Co-Teaching
          4. 0s bias

          Qualifiers

          • Research-article
