Frame-Wise Cross-Modal Matching for Video Moment Retrieval

Published: 01 January 2022

Abstract

Video moment retrieval aims to retrieve the golden moment in a video that corresponds to a given natural language query. The main challenges of this task are 1) accurately localizing the relevant moment (i.e., its start time and end time) in an untrimmed video stream, and 2) bridging the semantic gap between the textual query and the video content. To tackle these problems, early approaches use sliding windows or uniform sampling to collect video clips first and then match each clip with the query to identify the relevant ones. These strategies are time-consuming and often yield unsatisfactory localization accuracy because the length of the golden moment is unpredictable. To avoid these limitations, researchers have recently attempted to predict the relevant moment boundaries directly, without generating video clips first. One mainstream approach builds a multimodal feature vector from the query and video frames (e.g., by concatenation) and then applies a regression model on top of it for boundary detection. Although this approach has made some progress, we argue that it does not fully capture the cross-modal interactions between the query and video frames. In this paper, we propose an Attentive Cross-modal Relevance Matching (ACRM) model that predicts the temporal boundaries based on interaction modeling between the two modalities. In addition, an attention module automatically assigns higher weights to query words with richer semantic cues, which are considered more important for finding relevant video content. As a further contribution, we introduce an additional predictor that exploits internal frames during training to improve localization accuracy. Extensive experiments on two public datasets, TACoS and Charades-STA, demonstrate the superiority of our method over several state-of-the-art methods. Ablation studies are also conducted to examine the effectiveness of the different modules in our ACRM model.
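
To make the frame-wise matching idea concrete, the sketch below illustrates the general pipeline described in the abstract: both modalities are encoded, an attention module pools the query words by their semantic importance, each video frame is fused with the pooled query representation, and per-frame heads score candidate start and end boundaries. This is a minimal illustrative sketch in PyTorch, not the authors' ACRM implementation; the GRU encoders, module names, and feature dimensions (e.g., C3D-style frame features and GloVe-style word embeddings) are assumptions made for the example.

import torch
import torch.nn as nn

class FrameWiseMatcher(nn.Module):
    # Illustrative frame-wise cross-modal matcher (not the official ACRM code).
    def __init__(self, video_dim=1024, word_dim=300, hidden=256):
        super().__init__()
        # Bidirectional encoders for video frames and query words.
        self.video_rnn = nn.GRU(video_dim, hidden, batch_first=True, bidirectional=True)
        self.query_rnn = nn.GRU(word_dim, hidden, batch_first=True, bidirectional=True)
        # Word-level attention: assigns higher weights to semantically richer words.
        self.word_attn = nn.Linear(2 * hidden, 1)
        # Frame-query fusion and per-frame boundary heads.
        self.fuse = nn.Sequential(nn.Linear(4 * hidden, hidden), nn.ReLU())
        self.start_head = nn.Linear(hidden, 1)
        self.end_head = nn.Linear(hidden, 1)

    def forward(self, frames, words):
        # frames: (B, T, video_dim), words: (B, L, word_dim)
        v, _ = self.video_rnn(frames)                      # (B, T, 2H) frame features
        q, _ = self.query_rnn(words)                       # (B, L, 2H) word features
        alpha = torch.softmax(self.word_attn(q), dim=1)    # (B, L, 1) word weights
        q_vec = (alpha * q).sum(dim=1, keepdim=True)       # (B, 1, 2H) pooled query
        # Frame-wise interaction: pair every frame with the pooled query vector.
        fused = self.fuse(torch.cat([v, q_vec.expand_as(v)], dim=-1))  # (B, T, H)
        # Per-frame start/end scores; the argmax over time gives the boundaries.
        return self.start_head(fused).squeeze(-1), self.end_head(fused).squeeze(-1)

# Usage with random stand-in features: 2 videos of 128 frames, 12-word queries.
model = FrameWiseMatcher()
start_logits, end_logits = model(torch.randn(2, 128, 1024), torch.randn(2, 12, 300))

In a training setup of this kind, the per-frame start and end scores would be supervised with the annotated boundary indices (e.g., via cross-entropy over frames), and, as the abstract notes, ACRM additionally supervises internal frames with an extra predictor to sharpen localization.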



Published In

IEEE Transactions on Multimedia, Volume 24, 2022, 2475 pages

Publisher

IEEE Press

Publication History

Published: 01 January 2022

Qualifiers

  • Research-article

Cited By

  • "Breaking barriers of system heterogeneity," in Proc. Thirty-Third Int. Joint Conf. Artificial Intelligence (IJCAI), Aug. 2024, pp. 3789–3797. DOI: 10.24963/ijcai.2024/419
  • "Bias-conflict sample synthesis and adversarial removal debias strategy for temporal sentence grounding in video," in Proc. Thirty-Eighth AAAI Conf. Artificial Intelligence, Feb. 2024, pp. 4533–4541. DOI: 10.1609/aaai.v38i5.28252
  • "Revisiting Unsupervised Temporal Action Localization: The Primacy of High-Quality Actionness and Pseudolabels," in Proc. 32nd ACM Int. Conf. Multimedia, Oct. 2024, pp. 5643–5652. DOI: 10.1145/3664647.3681197
  • "Explicit Granularity and Implicit Scale Correspondence Learning for Point-Supervised Video Moment Localization," in Proc. 32nd ACM Int. Conf. Multimedia, Oct. 2024, pp. 9214–9223. DOI: 10.1145/3664647.3680774
  • "PTAN: Principal Token-aware Adjacent Network for Compositional Temporal Grounding," in Proc. 2024 Int. Conf. Multimedia Retrieval, May 2024, pp. 618–627. DOI: 10.1145/3652583.3658113
  • "Towards Visual-Prompt Temporal Answer Grounding in Instructional Video," IEEE Trans. Pattern Anal. Mach. Intell., vol. 46, no. 12, pp. 8836–8853, Dec. 2024. DOI: 10.1109/TPAMI.2024.3411045
  • "Rethinking Video Sentence Grounding From a Tracking Perspective With Memory Network and Masked Attention," IEEE Trans. Multimedia, vol. 26, pp. 11204–11218, 2024. DOI: 10.1109/TMM.2024.3453062
  • "Gist, Content, Target-Oriented: A 3-Level Human-Like Framework for Video Moment Retrieval," IEEE Trans. Multimedia, vol. 26, pp. 11044–11056, 2024. DOI: 10.1109/TMM.2024.3443672
  • "DPHANet: Discriminative Parallel and Hierarchical Attention Network for Natural Language Video Localization," IEEE Trans. Multimedia, vol. 26, pp. 9575–9590, 2024. DOI: 10.1109/TMM.2024.3395888
  • "Effective and Robust Adversarial Training Against Data and Label Corruptions," IEEE Trans. Multimedia, vol. 26, pp. 9477–9488, 2024. DOI: 10.1109/TMM.2024.3394677
  • Show More Cited By
