
Reducing 0s bias in video moment retrieval with a circular competence-based captioner

Published: 01 March 2023

Abstract

The current study addresses the problem of retrieving a specific moment from an untrimmed video given a sentence query. Existing methods have achieved high performance by designing various structures to match visual-text relations. Yet these methods tend to return an interval starting from 0s, a phenomenon we name “0s bias”. In this paper, we propose a Circular Co-Teaching (CCT) mechanism that uses a captioner to improve an existing retrieval model (localizer) from two aspects: biased annotations and easy samples. Correspondingly, CCT contains two processes: (1) Pseudo Query Generation (captioner to localizer), which transfers knowledge from generated queries to the localizer to balance annotations; (2) Competence-based Curriculum Learning (localizer to captioner), which trains the captioner in an easy-to-hard fashion guided by localization results, so that pairs of false-positive moments and pseudo queries become easy samples for the localizer. Extensive experiments show that CCT alleviates “0s bias”, improving existing approaches by as much as 4% on average on two public datasets (ActivityNet-Captions and Charades-STA) in terms of R@1,IoU=0.7. Notably, our method also outperforms baselines in an out-of-distribution scenario. We also quantitatively validate CCT’s ability to cope with “0s bias” using a proposed metric, DM. Our study not only contributes theoretically to detecting “0s bias”, but also provides a highly effective tool for video moment retrieval by alleviating such bias.
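To make the first process concrete, the following is a minimal Python sketch, under our own assumptions, of how pseudo (moment, query) pairs might be built from 0s-anchored false positives and added to the localizer's training data. The `Sample` fields, the `localize`/`caption` callables, the 1-second "starts near 0s" heuristic, and the IoU threshold are illustrative placeholders, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical data structures; field names are illustrative, not from the paper.
@dataclass
class Sample:
    video_id: str
    query: str
    start: float   # moment start time (seconds)
    end: float     # moment end time (seconds)

def temporal_iou(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Temporal IoU between two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def generate_pseudo_pairs(
    samples: List[Sample],
    localize: Callable[[str, str], Tuple[float, float]],  # (video_id, query) -> predicted (start, end)
    caption: Callable[[str, Tuple[float, float]], str],   # (video_id, interval) -> pseudo query
    fp_iou_thresh: float = 0.5,
) -> List[Sample]:
    """Build pseudo (moment, query) pairs for 0s-anchored false positives.

    If the localizer predicts an interval that starts near 0s but overlaps the
    annotation poorly (a "0s bias" false positive), the captioner describes
    that predicted interval, and the resulting pair is added as an extra,
    easy training sample for the localizer.
    """
    pseudo: List[Sample] = []
    for s in samples:
        pred = localize(s.video_id, s.query)
        starts_at_zero = pred[0] < 1.0  # heuristic: prediction anchored near 0s
        poor_overlap = temporal_iou(pred, (s.start, s.end)) < fp_iou_thresh
        if starts_at_zero and poor_overlap:
            pseudo.append(Sample(s.video_id, caption(s.video_id, pred), pred[0], pred[1]))
    return pseudo

if __name__ == "__main__":
    # Toy stubs standing in for trained models, just to make the sketch runnable.
    train_set = [Sample("v1", "the person opens the fridge", 12.0, 18.0)]
    dummy_localizer = lambda vid, q: (0.0, 5.0)  # always predicts a 0s-anchored moment
    dummy_captioner = lambda vid, span: "a person walks into the kitchen"
    augmented = train_set + generate_pseudo_pairs(train_set, dummy_localizer, dummy_captioner)
    print(len(augmented), "training samples after augmentation")
```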

Highlights

We, for the first time, detect and alleviate the “0s bias” in video moment retrieval.
We show that “0s bias” leads models to predict a moment starting from 0s rather than the correct one.
We alleviate “0s bias” by Circular Co-Teaching of a localizer and a captioner.
Our model improves the existing methods with little extra training cost as a plug-in.
We generate pseudo queries matching the false-positive moments starting from 0s.
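The easy-to-hard pacing in the second process (Competence-based Curriculum Learning) can be sketched with the square-root competence schedule that is common in the curriculum-learning literature. It is an assumption, made only for this illustration, that the paper uses this exact pacing function or a difficulty score of 1 minus the localizer's IoU; the function and parameter names below are hypothetical. Under such a schedule the captioner first sees moments the localizer already grounds well, and harder pairs are admitted as competence grows.

```python
import math
from typing import List, Tuple

def competence(step: int, total_steps: int, c0: float = 0.1) -> float:
    """Square-root competence schedule, growing from c0 to 1 over total_steps:
    c(t) = min(1, sqrt(t * (1 - c0^2) / T + c0^2))."""
    return min(1.0, math.sqrt(step * (1.0 - c0 ** 2) / total_steps + c0 ** 2))

def curriculum_subset(
    scored_samples: List[Tuple[float, str]],  # (difficulty in [0, 1], sample id)
    step: int,
    total_steps: int,
) -> List[str]:
    """Keep only the easiest fraction of samples allowed by the current
    competence; difficulty could be 1 - localizer IoU, so the captioner is
    trained first on moments the localizer localizes well."""
    c = competence(step, total_steps)
    ranked = sorted(scored_samples)            # easiest first
    cutoff = max(1, int(c * len(ranked)))      # admit the easiest c-fraction
    return [sid for _, sid in ranked[:cutoff]]

if __name__ == "__main__":
    samples = [(0.9, "hard"), (0.2, "easy"), (0.5, "medium")]
    for t in (0, 500, 1000):
        print(t, round(competence(t, 1000), 3), curriculum_subset(samples, t, 1000))
```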


Cited By

  • (2024) Unsupervised Video Moment Retrieval with Knowledge-Based Pseudo-Supervision Construction. ACM Transactions on Information Systems, 43(1), 1–26. https://doi.org/10.1145/3701229
  • (2024) Routing Evidence for Unseen Actions in Video Moment Retrieval. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 3024–3035. https://doi.org/10.1145/3637528.3671693
  • (2023) Self-Supervised Graph Convolution for Video Moment Retrieval. Artificial Neural Networks and Machine Learning – ICANN 2023, 407–419. https://doi.org/10.1007/978-3-031-44204-9_34


          Published In

Information Processing and Management: an International Journal, Volume 60, Issue 2
          Mar 2023
          1443 pages

          Publisher

          Pergamon Press, Inc.

          United States

          Publication History

          Published: 01 March 2023

          Author Tags

          1. Video moment retrieval
          2. Competence-based captioner
          3. Circular Co-Teaching
          4. 0s bias

          Qualifiers

          • Research-article
