DOI: 10.1145/3512527.3531403
Research article, ICMR '22 Conference Proceedings

Learning Sample Importance for Cross-Scenario Video Temporal Grounding

Published: 27 June 2022

Abstract

The task of temporal grounding aims to locate a video moment in an untrimmed video given a sentence query. This paper is the first to investigate superficial biases specific to the temporal grounding task, and it proposes a targeted solution. Most alarmingly, we observe that existing temporal grounding models rely heavily on biases in the visual modality (e.g., a strong preference for frequent concepts or for certain temporal intervals), which leads to inferior performance when the model is generalized to a cross-scenario test setting. To this end, we propose a novel method called the Debiased Temporal Language Localizer (Debias-TLL), which prevents the model from naively memorizing these biases and forces it to ground the query sentence on the true inter-modal relationship. Debias-TLL trains two models simultaneously; by design, a large discrepancy between the two models' predictions on a sample indicates a higher probability that the sample is biased. Harnessing this informative discrepancy, we devise a data re-weighting scheme for mitigating the data biases. We evaluate the proposed model on cross-scenario temporal grounding, where the train and test data are heterogeneously sourced. Experiments show a large-margin superiority of the proposed method over state-of-the-art competitors.
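As a rough illustration of the two-model co-training and discrepancy-based re-weighting described in the abstract, the following is a minimal PyTorch sketch. It is not the paper's implementation: the `Localizer` module, the `debiased_step` helper, and the exponential weighting function are hypothetical stand-ins, and the actual Debias-TLL operates on fused video-query features with its own architecture and re-weighting scheme.

```python
# Minimal sketch of discrepancy-based sample re-weighting.
# All module/function names and the weighting function are hypothetical.
import torch
import torch.nn as nn

class Localizer(nn.Module):
    """Stand-in for a grounding model: maps fused video/query features
    to a per-sample matching score in [0, 1]."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, x):
        return torch.sigmoid(self.net(x)).squeeze(-1)

model_a, model_b = Localizer(), Localizer()
optim = torch.optim.Adam(
    list(model_a.parameters()) + list(model_b.parameters()), lr=1e-4)
bce = nn.BCELoss(reduction="none")  # per-sample losses, so we can weight them

def debiased_step(feats, labels):
    """One training step: samples on which the two models disagree most
    are treated as likely biased and are down-weighted."""
    pred_a, pred_b = model_a(feats), model_b(feats)
    # Prediction discrepancy as a per-sample bias indicator.
    disc = (pred_a - pred_b).abs().detach()
    # One possible scheme: larger discrepancy -> smaller weight.
    weights = torch.exp(-disc / disc.mean().clamp(min=1e-6))
    weights = weights / weights.sum() * len(weights)  # keep mean weight ~1
    loss = (weights * (bce(pred_a, labels) + bce(pred_b, labels))).mean()
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item()

# Toy usage: random fused features and binary grounding labels.
feats = torch.randn(32, 512)
labels = torch.randint(0, 2, (32,)).float()
print(debiased_step(feats, labels))
```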

Supplementary Material

MP4 File (ICMR22-222.mp4)
This paper is the first to investigate superficial biases specific to the video temporal grounding task. The goal of temporal grounding is to locate a video moment in an untrimmed video given a sentence query. We observe that existing temporal grounding models rely heavily on data biases in the visual modality (e.g., a strong preference for frequent concepts or for certain temporal intervals), which leads to inferior performance when the model is generalized to a cross-scenario test setting. To this end, we propose a novel method called the Debiased Temporal Language Localizer (Debias-TLL), which prevents the model from naively memorizing these biases and forces it to ground the query sentence on the true inter-modal relationship.




Published In

ICMR '22: Proceedings of the 2022 International Conference on Multimedia Retrieval
June 2022
714 pages
ISBN:9781450392389
DOI:10.1145/3512527
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 June 2022


Author Tags

  1. Debias
  2. deep neural network
  3. visual temporal grounding

Qualifiers

  • Research-article

Conference

ICMR '22

Acceptance Rates

Overall Acceptance Rate 254 of 830 submissions, 31%


Bibliometrics & Citations


Article Metrics

  • Downloads (Last 12 months): 7
  • Downloads (Last 6 weeks): 1
Reflects downloads up to 09 Nov 2024


Cited By
  • (2024) Semantic Video Moment Retrieval by Temporal Feature Perturbation and Refinement. 2024 14th International Conference on Pattern Recognition Systems (ICPRS), pp. 1-7. DOI: 10.1109/ICPRS62101.2024.10677814. Online publication date: 15-Jul-2024
  • (2024) Training-Free Video Temporal Grounding Using Large-Scale Pre-trained Models. Computer Vision – ECCV 2024, pp. 20-37. DOI: 10.1007/978-3-031-73007-8_2. Online publication date: 1-Oct-2024
  • (2023) Mixup-Augmented Temporally Debiased Video Grounding with Content-Location Disentanglement. Proceedings of the 31st ACM International Conference on Multimedia, pp. 4450-4459. DOI: 10.1145/3581783.3612401. Online publication date: 26-Oct-2023
  • (2023) Temporal Sentence Grounding in Videos: A Survey and Future Directions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(8), pp. 10443-10465. DOI: 10.1109/TPAMI.2023.3258628. Online publication date: 1-Aug-2023
  • (2023) FedVMR: A New Federated Learning Method for Video Moment Retrieval. ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1-5. DOI: 10.1109/ICASSP49357.2023.10096019. Online publication date: 4-Jun-2023
