DOI: 10.1145/3664647.3680746

Maskable Retentive Network for Video Moment Retrieval

Published: 28 October 2024

Abstract

Video Moment Retrieval (MR) involves predicting the moment described by a given natural language or spoken language query in an untrimmed video. In this paper, we propose a novel Maskable Retentive Network (MRNet) to address two key challenges in MR: cross-modal guidance and video sequence modeling. Our approach introduces a new retention mechanism into the multimodal Transformer architecture, incorporating modality-specific attention modes. Specifically, we employ Unlimited Attention over language-related attention regions to maximize cross-modal mutual guidance, and Maskable Retention over the video-only attention region to enhance video sequence modeling. The latter recognizes two crucial characteristics of video sequences: 1) bidirectional, decaying, and non-linear temporal associations between video clips, and 2) sparse associations of key information semantically related to the query. Accordingly, we propose a bidirectional decay retention mask to explicitly model temporally distant context dependencies in video sequences, along with a learnable sparse retention mask to adaptively capture strong associations relevant to the target event. Extensive experiments on five popular MR benchmarks, ActivityNet Captions, TACoS, Charades-STA, ActivityNet Speech, and QVHighlights, demonstrate the significant improvements our method achieves over existing approaches. Code is available at https://github.com/xian-sh/MRNet.
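The two video-specific masks described above can be illustrated with a minimal, hypothetical sketch. Everything below, including the decay factor gamma, the single-head formulation, the sigmoid-gated sparse mask, and the class and function names, is an assumption made for illustration rather than a detail taken from the paper; the authors' repository contains the actual implementation.

```python
import torch

def bidirectional_decay_mask(num_clips: int, gamma: float = 0.9) -> torch.Tensor:
    """Decay mask D[i, j] = gamma ** |i - j|: the association between two clips
    weakens non-linearly with their temporal distance, in both directions."""
    idx = torch.arange(num_clips)
    dist = (idx[:, None] - idx[None, :]).abs().float()
    return gamma ** dist

class MaskableRetentionSketch(torch.nn.Module):
    """Hypothetical single-head retention over video clip features, combining a
    fixed bidirectional decay mask with a learnable (soft) sparse retention mask."""

    def __init__(self, dim: int, num_clips: int, gamma: float = 0.9):
        super().__init__()
        self.q_proj = torch.nn.Linear(dim, dim)
        self.k_proj = torch.nn.Linear(dim, dim)
        self.v_proj = torch.nn.Linear(dim, dim)
        # Fixed, non-learnable bidirectional decay mask over clip pairs.
        self.register_buffer("decay", bidirectional_decay_mask(num_clips, gamma))
        # Learnable logits; sigmoid maps them to a soft sparse mask in (0, 1).
        self.sparse_logits = torch.nn.Parameter(torch.zeros(num_clips, num_clips))

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, num_clips, dim) video clip features.
        q, k, v = self.q_proj(clips), self.k_proj(clips), self.v_proj(clips)
        scores = q @ k.transpose(-2, -1) / (clips.size(-1) ** 0.5)
        # Element-wise product of the two masks gates the retention scores.
        retention_mask = self.decay * torch.sigmoid(self.sparse_logits)
        return (scores * retention_mask) @ v

# Example: a batch of 2 videos, each with 32 clips of 256-dim features.
x = torch.randn(2, 32, 256)
out = MaskableRetentionSketch(dim=256, num_clips=32)(x)
print(out.shape)  # torch.Size([2, 32, 256])
```

In this sketch the decay mask enforces the bidirectional, distance-decaying prior, while the learned mask can amplify or suppress individual clip pairs, approximating the sparse, query-relevant associations the abstract refers to.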

Published In

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
October 2024
11719 pages
ISBN:9798400706868
DOI:10.1145/3664647
Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 28 October 2024

Author Tags

  1. maskable retention
  2. moment retrieval
  3. transformer

Qualifiers

  • Research-article

Conference

MM '24: The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne VIC, Australia

Acceptance Rates

MM '24 Paper Acceptance Rate: 1,150 of 4,385 submissions, 26%
Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%
