A Survey on Video Moment Localization

Published: 16 January 2023

Abstract

Video moment localization, also known as video moment retrieval, aims to localize a target segment within a video that is described by a given natural language query. Unlike temporal action localization, where the target actions are drawn from a pre-defined set, video moment retrieval can handle queries describing arbitrarily complex activities. In this survey paper, we present a comprehensive review of existing video moment localization techniques, covering supervised, weakly supervised, and unsupervised approaches. We also review the datasets available for video moment localization and summarize the results of related work on them. In addition, we discuss promising future directions for this field, in particular large-scale datasets and interpretable video moment localization models.
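The abstract frames the task as returning a temporal segment (start, end) within a video for a natural-language query. Although the abstract does not spell out evaluation, work in this area conventionally scores a predicted moment against the ground-truth segment with temporal Intersection-over-Union (IoU), reported as metrics such as R@1 at IoU thresholds of 0.5 or 0.7. A minimal sketch of that scoring, with illustrative timestamps of our own choosing:

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two moments, each an (start, end) pair in seconds.

    intersection: overlap length (clamped at 0 for disjoint segments)
    union: sum of both lengths minus the intersection
    """
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


# Example: a prediction of [5s, 15s] against a ground truth of [10s, 20s]
# overlaps for 5s out of a 15s union, i.e., IoU = 1/3, which would fail an
# IoU >= 0.5 threshold but count as correct at a looser IoU >= 0.3.
print(temporal_iou((5.0, 15.0), (10.0, 20.0)))
```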


Published In

ACM Computing Surveys  Volume 55, Issue 9
September 2023
835 pages
ISSN:0360-0300
EISSN:1557-7341
DOI:10.1145/3567474

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 January 2023
Online AM: 17 August 2022
Accepted: 08 August 2022
Revised: 11 May 2022
Received: 11 April 2021
Published in CSUR Volume 55, Issue 9


Author Tags

  1. Video moment localization
  2. video moment retrieval
  3. vision and language
  4. cross-modal retrieval
  5. survey

Qualifiers

  • Survey
  • Refereed

Funding Sources

  • National Natural Science Foundation of China
  • Shandong Provincial Natural Science Foundation for Distinguished Young Scholars
  • Major Basic Research Project of Natural Science Foundation of Shandong Province
  • Science and Technology Innovation Program for Distinguished Young Scholars of Shandong Province Higher Education Institutions
  • Professors of Shandong Jianzhu University

