DOI: 10.1145/3503161.3548277

Interactive Video Corpus Moment Retrieval using Reinforcement Learning

Published: 10 October 2022

Abstract

Known-item video search is effective with a human in the loop who interactively inspects the search results and refines the initial query. Nevertheless, when the first few pages of results are swamped with visually similar items, or the search target is buried deep in the ranked list, finding the known-item target usually requires a long session of browsing and result inspection. This paper tackles the problem with reinforcement learning, aiming to reach the search target within a few rounds of interaction through long-term learning from user feedback. Specifically, the system interactively plans a navigation path based on the feedback and recommends a potential target, chosen to maximize the long-term reward, for the user to comment on. We conduct experiments on the challenging task of video corpus moment retrieval (VCMR), which localizes moments in a large video corpus. The experimental results on the TVR and DiDeMo datasets verify that the proposed approach is effective in retrieving moments hidden deep inside the ranked lists of CONQUER and HERO, the state-of-the-art automatic search engines for VCMR.
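
To make the interaction loop described in the abstract concrete, the following is a minimal sketch, not the paper's actual model: a GRU tracks the search state across rounds of simulated user feedback, a policy head scores candidate moments and recommends one per round, and REINFORCE with a discounted return and a small per-round penalty trains the agent to reach the target within a few interactions. All names (PolicyNet, simulate_feedback), the feature dimensions, and the toy residual-feedback model are assumptions for illustration; in the paper, candidates come from ranked lists produced by engines such as CONQUER or HERO, and the feedback comes from a user or a user simulator.

```python
# Minimal illustrative sketch (PyTorch) of RL-driven interactive retrieval.
# Hypothetical names and toy data; not the paper's implementation.
import torch
import torch.nn as nn

FEAT_DIM = 16         # assumed dimensionality of candidate-moment features
NUM_CANDIDATES = 50   # assumed size of the ranked list the agent navigates
HIDDEN_DIM = 32       # assumed size of the GRU search state


class PolicyNet(nn.Module):
    """GRU state tracker plus a linear head that scores each candidate moment."""

    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        self.gru = nn.GRUCell(feat_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim + feat_dim, 1)

    def forward(self, feedback_feat, hidden, candidates):
        # Fold the latest user feedback into the search state.
        hidden = self.gru(feedback_feat, hidden)
        # Score every candidate moment conditioned on the current state.
        expanded = hidden.unsqueeze(1).expand(-1, candidates.size(1), -1)
        logits = self.score(torch.cat([expanded, candidates], dim=-1)).squeeze(-1)
        return logits, hidden


def simulate_feedback(target_feat, chosen_feat):
    """Toy simulated user: feedback is the residual toward the true target."""
    return target_feat - chosen_feat


policy = PolicyNet(FEAT_DIM, HIDDEN_DIM)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for episode in range(200):
    candidates = torch.randn(1, NUM_CANDIDATES, FEAT_DIM)  # stand-in ranked list
    target_idx = torch.randint(NUM_CANDIDATES, (1,))       # hidden search target
    target_feat = candidates[0, target_idx]
    hidden = torch.zeros(1, HIDDEN_DIM)
    feedback = torch.zeros(1, FEAT_DIM)                    # no feedback yet
    log_probs, rewards = [], []

    for step in range(5):                                  # a few interaction rounds
        logits, hidden = policy(feedback, hidden, candidates)
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()                             # moment to recommend
        log_probs.append(dist.log_prob(action))
        hit = (action == target_idx).float()
        rewards.append(hit - 0.1)                          # small per-round cost
        if hit.item() == 1.0:
            break                                          # target reached
        feedback = simulate_feedback(target_feat, candidates[0, action])

    # REINFORCE: weight each log-probability by the discounted return from that step onward.
    returns, G = [], torch.zeros(1)
    for r in reversed(rewards):
        G = r + 0.9 * G
        returns.insert(0, G)
    loss = -torch.stack([lp * R for lp, R in zip(log_probs, returns)]).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The discounted return and per-round penalty reflect the stated goal of reaching the target within a few rounds of interaction; a real system would replace the random candidate features with moment embeddings from the underlying ranked list and the residual feedback with actual user comments or a learned user simulator.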

Supplementary Material

MOV File (mmfp2265.mov)
Presentation video

References

[1]
Giuseppe Amato, Paolo Bolettieri, Fabrizio Falchi, Claudio Gennaro, Nicola Messina, Lucia Vadicamo, and Claudio Vairo. 2021. Visione at video browser showdown 2021. In International Conference on Multimedia Modeling. Springer, 473--478.
[2]
George Awad, Asad A. Butt, Keith Curtis, Jonathan Fiscus, Afzal Godil, Yooyoung Lee, Andrew Delgado, Jesse Zhang, Eliot Godard, Baptiste Chocot, Lukas Diduch, Jeffrey Liu, Yvette Graham, Gareth J. F. Jones, and Georges Quénot. 2021. Evaluating Multiple Video Understanding and Retrieval Tasks at TRECVID 2021. In Proceedings of TRECVID 2021. NIST, USA.
[3]
Da Cao, Yawen Zeng, Meng Liu, Xiangnan He, Meng Wang, and Zheng Qin. 2020. Strong: Spatio-temporal reinforcement learning for cross-modal video moment localization. In Proceedings of the 28th ACM International Conference on Multimedia. 4162--4170.
[4]
Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS 2014 Workshop on Deep Learning, December 2014.
[5]
I.J. Cox, M.L. Miller, S.M. Omohundro, and P.N. Yianilos. 1996. PicHunter: Bayesian relevance feedback for image retrieval. In Proceedings of the 13th International Conference on Pattern Recognition, Vol. 3. 361--369. https://doi.org/10.1109/ICPR.1996.546971
[6]
Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José MF Moura, Devi Parikh, and Dhruv Batra. 2017. Visual dialog. In Proceedings of the IEEE conference on computer vision and pattern recognition. 326--335.
[7]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2--7, 2019, Volume 1 (Long and Short Papers), Jill Burstein, Christy Doran, and Thamar Solorio (Eds.). Association for Computational Linguistics, 4171--4186. https://doi.org/10.18653/v1/N19-1423
[8]
Victor Escorcia, Mattia Soldan, Josef Sivic, Bernard Ghanem, and Bryan Russell. 2019. Temporal Localization of Moments in Video Collections with Natural Language. arxiv: 1907.12763 [cs.CV]
[9]
Xiaoxiao Guo, Steven Rennie, Hui Wu, Gerald Tesauro, Yu Cheng, and Rogerio Schmidt Feris. 2018. Dialog-based Interactive Image Retrieval. Advances in Neural Information Processing Systems, Vol. 31 (2018), 678--688.
[10]
Dongliang He, Xiang Zhao, Jizhou Huang, Fu Li, Xiao Liu, and Shilei Wen. 2019. Read, watch, and move: Reinforcement learning for temporally grounding natural language descriptions in videos. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI 2019). 8393--8400. https://doi.org/10.1609/aaai.v33i01.33018393
[11]
Silvan Heller, Ralph Gasser, Cristina Illi, Maurizio Pasquinelli, Loris Sauter, Florian Spiess, and Heiko Schuldt. 2021. Towards explainable interactive multi-modal video retrieval with vitrivr. In International Conference on Multimedia Modeling. Springer, 435--440.
[12]
Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. 2017. Localizing Moments in Video with Natural Language. In Proceedings of the IEEE International Conference on Computer Vision (2017). https://doi.org/10.1109/ICCV.2017.618
[13]
Zhijian Hou, Chong-Wah Ngo, and W. K. Chan. 2021. CONQUER: Contextual Query-aware Ranking for Video Corpus Moment Retrieval. In Proceedings of the 29th ACM International Conference on Multimedia. Association for Computing Machinery, 3900--3908. https://doi.org/10.1145/3474085.3475281
[14]
Ahmed Hussein, Mohamed Medhat Gaber, Eyad Elyan, and Chrisina Jayne. 2017. Imitation learning: A survey of learning methods. ACM Computing Surveys (CSUR), Vol. 50, 2 (2017), 1--35.
[15]
Yu-Gang Jiang, Jun Yang, Chong-Wah Ngo, and Alexander G Hauptmann. 2009. Representations of keypoint-based semantic concept detection: A comprehensive study. IEEE Transactions on Multimedia, Vol. 12, 1 (2009), 42--53.
[16]
Miroslav Kratochvil, František Mejzlík, Patrik Veselý, Tomáš Souček, and Jakub Lokoč. 2020. SOMHunter: lightweight video search system with SOM-guided relevance feedback. In Proceedings of the 28th ACM International Conference on Multimedia. 4481--4484.
[17]
Miroslav Kratochvíl, Patrik Veselý, František Mejzlík, and Jakub Lokoč. 2020. Som-hunter: Video browsing with relevance-to-som feedback loop. In International Conference on Multimedia Modeling. Springer, 790--795.
[18]
Yoonho Lee, Heeju Choi, Sungjune Park, and Yong Man Ro. 2021. IVIST: interactive video search tool in VBS 2021. In International Conference on Multimedia Modeling. Springer, 423--428.
[19]
Jie Lei, Licheng Yu, Tamara L Berg, and Mohit Bansal. 2020. TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval. In ECCV.
[20]
Linjie Li, Yen-Chun Chen, Yu Cheng, Zhe Gan, Licheng Yu, and Jingjing Liu. 2020. HERO: Hierarchical Encoder for Video Language Omni-representation Pre-training. In EMNLP.
[21]
Xiujun Li, Zachary C Lipton, Bhuwan Dhingra, Lihong Li, Jianfeng Gao, and Yun-Nung Chen. 2016. A User Simulator for Task-Completion Dialogues. arXiv preprint arXiv:1612.05688 (2016).
[22]
Jakub Lokoč, Gregor Kovalčík, Bernd Münzer, Klaus Schöffmann, Werner Bailer, Ralph Gasser, Stefanos Vrochidis, Phuong Anh Nguyen, Sitapa Rujikietgumjorn, and Kai Uwe Barthel. 2019a. Interactive search or sequential browsing? A detailed analysis of the video browser showdown 2018. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), Vol. 15, 1 (2019), 1--18.
[23]
Jakub Lokoč, Gregor Kovalčík, Tomáš Souček, Jaroslav Moravec, and Přemysl Čech. 2019b. VIRET: A video retrieval tool for interactive known-item search. In Proceedings of the 2019 on International Conference on Multimedia Retrieval. 177--181.
[24]
Jakub Lokoč, František Mejzlík, Tomáš Souček, Patrik Dokoupil, and Ladislav Peška. 2022. Video Search with Context-Aware Ranker and Relevance Feedback. In International Conference on Multimedia Modeling. Springer, 505--510.
[25]
Jakub Lokoč, Patrik Veselý, František Mejzlík, Gregor Kovalčík, Tomáš Souček, Luca Rossetto, Klaus Schoeffmann, Werner Bailer, Cathal Gurrin, Loris Sauter, et al. 2021. Is the reign of interactive search eternal? findings from the video browser showdown 2020. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), Vol. 17, 3 (2021), 1--26.
[26]
Jakub Lokoč, Gregor Kovalčík, Tomáš Souček, Jaroslav Moravec, and Přemysl Čech. 2019. A Framework for Effective Known-Item Search in Video (MM '19). Association for Computing Machinery, New York, NY, USA, 1777--1785. https://doi.org/10.1145/3343031.3351046
[27]
Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems, Vol. 32 (2019).
[28]
Ziyang Ma, Xianjing Han, Xuemeng Song, Yiran Cui, and Liqiang Nie. 2021. Hierarchical deep residual reasoning for temporal moment localization. In ACM Multimedia Asia. 1--7.
[29]
Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Tim Harley, Timothy P Lillicrap, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. In 33rd International Conference on Machine Learning, ICML 2016, Vol. 4. 2850--2869.
[30]
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing Atari With Deep Reinforcement Learning. In NIPS Deep Learning Workshop.
[31]
Phuong Anh Nguyen and Chong-Wah Ngo. 2021. Interactive Search vs. Automatic Search: An Extensive Study on Video Retrieval. ACM Transactions on Multimedia Computing, Communications, and Applications, Vol. 17, 2 (2021). https://doi.org/10.1145/3429457
[32]
Ladislav Peška, Gregor Kovalčík, Tomáš Souček, Vít Škrhák, and Jakub Lokoč. 2021. W2VV++ BERT model at VBS 2021. In International Conference on Multimedia Modeling. Springer, 467--472.
[33]
Luca Rossetto, Mahnaz Amiri Parian, Ralph Gasser, Ivan Giangreco, Silvan Heller, and Heiko Schuldt. 2019a. Deep learning-based concept detection in vitrivr. In International Conference on Multimedia Modeling. Springer, 616--621.
[34]
Luca Rossetto, Ralph Gasser, Jakub Lokoč, Werner Bailer, Klaus Schoeffmann, Bernd Muenzer, Tomáš Souček, Phuong Anh Nguyen, Paolo Bolettieri, Andreas Leibetseder, et al. 2020. Interactive video retrieval in the age of deep learning--detailed evaluation of VBS 2019. IEEE Transactions on Multimedia, Vol. 23 (2020), 243--256.
[35]
Luca Rossetto, Heiko Schuldt, George Awad, and Asad A. Butt. 2019b. V3C -- A Research Video Collection. In MultiMedia Modeling, Ioannis Kompatsiaris, Benoit Huet, Vasileios Mezaris, Cathal Gurrin, Wen-Huang Cheng, and Stefanos Vrochidis (Eds.). Springer International Publishing, Cham, 349--360.
[36]
Jost Schatzmann, Karl Weilhammer, Matt Stuttle, and Steve Young. 2006. A survey of statistical user simulation techniques for reinforcement-learning of dialogue management strategies. The knowledge engineering review, Vol. 21, 2 (2006), 97--126.
[37]
David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature, Vol. 529, 7587 (2016), 484--489.
[38]
Cees GM Snoek, Marcel Worring, Ork de Rooij, Koen EA van de Sande, Rong Yan, and Alexander G Hauptmann. 2008. VideOlympics: real-time evaluation of multimedia retrieval systems. IEEE MultiMedia, Vol. 15, 1 (2008), 86--91.
[39]
Richard S Sutton and Andrew G Barto. 2018. Reinforcement learning: An introduction. MIT Press.
[40]
Fuwen Tan, Paola Cascante-Bonilla, Xiaoxiao Guo, Hui Wu, Song Feng, and Vicente Ordonez. 2019. Drill-down: Interactive retrieval of complex scenes using natural language queries. Advances in neural information processing systems, Vol. 32 (2019).
[41]
Hao Tan and Mohit Bansal. 2019. LXMERT: Learning Cross-Modality Encoder Representations from Transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, 5100--5111. https://doi.org/10.18653/v1/D19-1514
[42]
Bart Thomee and Michael S Lew. 2012. Interactive search in image retrieval: a survey. International Journal of Multimedia Information Retrieval, Vol. 1, 2 (2012), 71--86.
[43]
Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. 2016. YFCC100M: The new data in multimedia research. Commun. ACM, Vol. 59, 2 (2016), 64--73.
[44]
Kazuya Ueki, Yumi Nakagome, Koji Hirakawa, Kotaro Kikuchi, Yoshihiko Hayashi, Tetsuji Ogawa, and Tetsunori Kobayashi. 2018. Waseda_Meisei at TRECVID 2018: Ad-hoc Video Search. In TRECVID.
[45]
Ramakrishna Vedantam, Samy Bengio, Kevin Murphy, Devi Parikh, and Gal Chechik. 2017. Context-aware captions from context-agnostic supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 251--260.
[46]
Jie Wu, Guanbin Li, Xiaoguang Han, and Liang Lin. 2020a. Reinforcement learning for weakly supervised temporal grounding of natural language in untrimmed videos. In Proceedings of the 28th ACM International Conference on Multimedia. 1283--1291.
[47]
Jie Wu, Guanbin Li, Si Liu, and Liang Lin. 2020b. Tree-structured policy based progressive reinforcement learning for temporally language grounding in video. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 12386--12393.
[48]
Jiaxin Wu and Chong-Wah Ngo. 2020. Interpretable Embedding for Ad-Hoc Video Search. In Proceedings of the 28th ACM International Conference on Multimedia (MM 2020). 3357--3366. https://doi.org/10.1145/3394171.3413916
[49]
Jiaxin Wu, Phuong Anh Nguyen, Zhixin Ma, and Chong-Wah Ngo. 2021. SQL-Like Interpretable Interactive Video Search. In International Conference on Multimedia Modeling.
[50]
Rong Yan, Alexander Hauptmann, and Rong Jin. 2003. Multimedia search with pseudo-relevance feedback. In International Conference on Image and Video Retrieval. Springer, 238--247.
[51]
Tong Yu, Yilin Shen, Ruiyi Zhang, Xiangyu Zeng, and Hongxia Jin. 2019. Vision-Language Recommendation via Attribute Augmented Multimodal Reinforcement Learning. Proceedings of the 27th ACM International Conference on Multimedia (2019).
[52]
Hao Zhang, Aixin Sun, Wei Jing, Guoshun Nan, Liangli Zhen, Joey Tianyi Zhou, and Rick Siow Mong Goh. 2021. Video corpus moment retrieval with contrastive learning. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 685--695.
[53]
Zhenxing Zhang, Rami Albatal, Cathal Gurrin, and Alan F Smeaton. 2015. Interactive known-item search using semantic textual and colour modalities. In International Conference on Multimedia Modeling. Springer, 282--286.

Published In

MM '22: Proceedings of the 30th ACM International Conference on Multimedia
October 2022
7537 pages
ISBN: 9781450392037
DOI: 10.1145/3503161

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

  1. interactive search
  2. reinforcement learning
  3. user simulation
  4. video corpus moment retrieval

Qualifiers

  • Research-article

Funding Sources

  • Singapore Ministry of Education (MOE)

Conference

MM '22

Acceptance Rates

Overall acceptance rate: 995 of 4,171 submissions (24%)
