DOI: 10.1145/3503161.3548277

Interactive Video Corpus Moment Retrieval using Reinforcement Learning

Published: 10 October 2022

Abstract

Known-item video search is effective with a human in the loop who interactively inspects the search results and refines the initial query. Nevertheless, when the first few pages of results are swamped with visually similar items, or the search target is buried deep in the ranked list, finding the known-item target usually requires a long session of browsing and result inspection. This paper tackles the problem with reinforcement learning, aiming to reach the search target within a few rounds of interaction through long-term learning from user feedback. Specifically, the system interactively plans a navigation path based on the feedback and recommends a potential target, chosen to maximize the long-term reward, for the user to comment on. We conduct experiments on the challenging task of video corpus moment retrieval (VCMR), which localizes moments in a large video corpus. The experimental results on the TVR and DiDeMo datasets verify that the proposed approach is effective in retrieving moments hidden deep inside the ranked lists of CONQUER and HERO, the state-of-the-art automatic search engines for VCMR.
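
To make the interaction loop described in the abstract concrete, the following is a minimal sketch, not the paper's actual model: a GRU tracks the search state across rounds of simulated user feedback, a policy head scores candidate moments and recommends one per round, and REINFORCE with a discounted return and a small per-round penalty trains the agent to reach the target within a few interactions. All names (PolicyNet, simulate_feedback), the feature dimensions, and the toy residual-feedback model are assumptions for illustration; in the paper, candidates come from ranked lists produced by engines such as CONQUER or HERO, and the feedback comes from a user or a user simulator.

```python
# Minimal illustrative sketch (PyTorch) of RL-driven interactive retrieval.
# Hypothetical names and toy data; not the paper's implementation.
import torch
import torch.nn as nn

FEAT_DIM = 16         # assumed dimensionality of candidate-moment features
NUM_CANDIDATES = 50   # assumed size of the ranked list the agent navigates
HIDDEN_DIM = 32       # assumed size of the GRU search state


class PolicyNet(nn.Module):
    """GRU state tracker plus a linear head that scores each candidate moment."""

    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        self.gru = nn.GRUCell(feat_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim + feat_dim, 1)

    def forward(self, feedback_feat, hidden, candidates):
        # Fold the latest user feedback into the search state.
        hidden = self.gru(feedback_feat, hidden)
        # Score every candidate moment conditioned on the current state.
        expanded = hidden.unsqueeze(1).expand(-1, candidates.size(1), -1)
        logits = self.score(torch.cat([expanded, candidates], dim=-1)).squeeze(-1)
        return logits, hidden


def simulate_feedback(target_feat, chosen_feat):
    """Toy simulated user: feedback is the residual toward the true target."""
    return target_feat - chosen_feat


policy = PolicyNet(FEAT_DIM, HIDDEN_DIM)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for episode in range(200):
    candidates = torch.randn(1, NUM_CANDIDATES, FEAT_DIM)  # stand-in ranked list
    target_idx = torch.randint(NUM_CANDIDATES, (1,))       # hidden search target
    target_feat = candidates[0, target_idx]
    hidden = torch.zeros(1, HIDDEN_DIM)
    feedback = torch.zeros(1, FEAT_DIM)                    # no feedback yet
    log_probs, rewards = [], []

    for step in range(5):                                  # a few interaction rounds
        logits, hidden = policy(feedback, hidden, candidates)
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()                             # moment to recommend
        log_probs.append(dist.log_prob(action))
        hit = (action == target_idx).float()
        rewards.append(hit - 0.1)                          # small per-round cost
        if hit.item() == 1.0:
            break                                          # target reached
        feedback = simulate_feedback(target_feat, candidates[0, action])

    # REINFORCE: weight each log-probability by the discounted return from that step onward.
    returns, G = [], torch.zeros(1)
    for r in reversed(rewards):
        G = r + 0.9 * G
        returns.insert(0, G)
    loss = -torch.stack([lp * R for lp, R in zip(log_probs, returns)]).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The discounted return and per-round penalty reflect the stated goal of reaching the target within a few rounds of interaction; a real system would replace the random candidate features with moment embeddings from the underlying ranked list and the residual feedback with actual user comments or a learned user simulator.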

Supplementary Material

MOV File (mmfp2265.mov)
Presentation video

References

[1]
Giuseppe Amato, Paolo Bolettieri, Fabrizio Falchi, Claudio Gennaro, Nicola Messina, Lucia Vadicamo, and Claudio Vairo. 2021. Visione at video browser showdown 2021. In International Conference on Multimedia Modeling. Springer, 473--478.
[2]
George Awad, Asad A. Butt, Keith Curtis, Jonathan Fiscus, Afzal Godil, Yooyoung Lee, Andrew Delgado, Jesse Zhang, Eliot Godard, Baptiste Chocot, Lukas Diduch, Jeffrey Liu, Yvette Graham, Gareth J. F. Jones, and Georges Quénot. 2021. Evaluating Multiple Video Understanding and Retrieval Tasks at TRECVID 2021. In Proceedings of TRECVID 2021. NIST, USA.
[3]
Da Cao, Yawen Zeng, Meng Liu, Xiangnan He, Meng Wang, and Zheng Qin. 2020. Strong: Spatio-temporal reinforcement learning for cross-modal video moment localization. In Proceedings of the 28th ACM International Conference on Multimedia. 4162--4170.
[4]
Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS 2014 Workshop on Deep Learning, December 2014.
[5]
I.J. Cox, M.L. Miller, S.M. Omohundro, and P.N. Yianilos. 1996. PicHunter: Bayesian relevance feedback for image retrieval. In Proceedings of the 13th International Conference on Pattern Recognition, Vol. 3. 361--369. https://doi.org/10.1109/ICPR.1996.546971
[6]
Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José MF Moura, Devi Parikh, and Dhruv Batra. 2017. Visual dialog. In Proceedings of the IEEE conference on computer vision and pattern recognition. 326--335.
[7]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2--7, 2019, Volume 1 (Long and Short Papers), Jill Burstein, Christy Doran, and Thamar Solorio (Eds.). Association for Computational Linguistics, 4171--4186. https://doi.org/10.18653/v1/N19-1423
[8]
Victor Escorcia, Mattia Soldan, Josef Sivic, Bernard Ghanem, and Bryan Russell. 2019. Temporal Localization of Moments in Video Collections with Natural Language. arxiv: 1907.12763 [cs.CV]
[9]
Xiaoxiao Guo, Steven Rennie, Hui Wu, Gerald Tesauro, Yu Cheng, and Rogerio Schmidt Feris. 2018. Dialog-based Interactive Image Retrieval. Advances in Neural Information Processing Systems, Vol. 31 (2018), 678--688.
[10]
Dongliang He, Xiang Zhao, Jizhou Huang, Fu Li, Xiao Liu, and Shilei Wen. 2019. Read, watch, and move: Reinforcement learning for temporally grounding natural language descriptions in videos. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI 2019). 8393--8400. https://doi.org/10.1609/aaai.v33i01.33018393
[11]
Silvan Heller, Ralph Gasser, Cristina Illi, Maurizio Pasquinelli, Loris Sauter, Florian Spiess, and Heiko Schuldt. 2021. Towards explainable interactive multi-modal video retrieval with vitrivr. In International Conference on Multimedia Modeling. Springer, 435--440.
[12]
Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. 2017. Localizing Moments in Video with Natural Language. In Proceedings of the IEEE International Conference on Computer Vision (2017). https://doi.org/10.1109/ICCV.2017.618
[13]
Zhijian Hou, Chong-Wah Ngo, and W. K. Chan. 2021. CONQUER: Contextual Query-aware Ranking for Video Corpus Moment Retrieval. In Proceedings of the 29th ACM International Conference on Multimedia. Association for Computing Machinery, 3900--3908. https://doi.org/10.1145/3474085.3475281
[14]
Ahmed Hussein, Mohamed Medhat Gaber, Eyad Elyan, and Chrisina Jayne. 2017. Imitation learning: A survey of learning methods. ACM Computing Surveys (CSUR), Vol. 50, 2 (2017), 1--35.
[15]
Yu-Gang Jiang, Jun Yang, Chong-Wah Ngo, and Alexander G Hauptmann. 2009. Representations of keypoint-based semantic concept detection: A comprehensive study. IEEE Transactions on Multimedia, Vol. 12, 1 (2009), 42--53.
[16]
Miroslav Kratochvil, František Mejzlík, Patrik Veselý, Tomáš Souček, and Jakub Lokoč. 2020. SOMHunter: lightweight video search system with SOM-guided relevance feedback. In Proceedings of the 28th ACM International Conference on Multimedia. 4481--4484.
[17]
Miroslav Kratochvíl, Patrik Veselý, František Mejzlík, and Jakub Lokoč. 2020. Som-hunter: Video browsing with relevance-to-som feedback loop. In International Conference on Multimedia Modeling. Springer, 790--795.
[18]
Yoonho Lee, Heeju Choi, Sungjune Park, and Yong Man Ro. 2021. IVIST: interactive video search tool in VBS 2021. In International Conference on Multimedia Modeling. Springer, 423--428.
[19]
Jie Lei, Licheng Yu, Tamara L Berg, and Mohit Bansal. 2020. TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval. In ECCV.
[20]
Linjie Li, Yen-Chun Chen, Yu Cheng, Zhe Gan, Licheng Yu, and Jingjing Liu. 2020. HERO: Hierarchical Encoder for Video Language Omni-representation Pre-training. In EMNLP.
[21]
Xiujun Li, Zachary C Lipton, Bhuwan Dhingra, Lihong Li, Jianfeng Gao, and Yun-Nung Chen. 2016. A User Simulator for Task-Completion Dialogues. arXiv preprint arXiv:1612.05688 (2016).
[22]
Jakub Lokoč, Gregor Kovalčík, Bernd Münzer, Klaus Schöffmann, Werner Bailer, Ralph Gasser, Stefanos Vrochidis, Phuong Anh Nguyen, Sitapa Rujikietgumjorn, and Kai Uwe Barthel. 2019a. Interactive search or sequential browsing? A detailed analysis of the video browser showdown 2018. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), Vol. 15, 1 (2019), 1--18.
[23]
Jakub Lokoč, Gregor Kovalčík, Tomáš Souček, Jaroslav Moravec, and Přemysl Čech. 2019b. VIRET: A video retrieval tool for interactive known-item search. In Proceedings of the 2019 on International Conference on Multimedia Retrieval. 177--181.
[24]
Jakub Lokoč, František Mejzlík, Tomáš Souček, Patrik Dokoupil, and Ladislav Peška. 2022. Video Search with Context-Aware Ranker and Relevance Feedback. In International Conference on Multimedia Modeling. Springer, 505--510.
[25]
Jakub Lokoč, Patrik Veselý, František Mejzlík, Gregor Kovalčík, Tomáš Souček, Luca Rossetto, Klaus Schoeffmann, Werner Bailer, Cathal Gurrin, Loris Sauter, et al. 2021. Is the reign of interactive search eternal? findings from the video browser showdown 2020. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), Vol. 17, 3 (2021), 1--26.
[26]
Jakub Lokoč, Gregor Kovalčík, Tomáš Souček, Jaroslav Moravec, and Přemysl Čech. 2019. A Framework for Effective Known-Item Search in Video (MM '19). Association for Computing Machinery, New York, NY, USA, 1777--1785. https://doi.org/10.1145/3343031.3351046
[27]
Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems, Vol. 32 (2019).
[28]
Ziyang Ma, Xianjing Han, Xuemeng Song, Yiran Cui, and Liqiang Nie. 2021. Hierarchical deep residual reasoning for temporal moment localization. In ACM Multimedia Asia. 1--7.
[29]
Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Tim Harley, Timothy P Lillicrap, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. In 33rd International Conference on Machine Learning, ICML 2016, Vol. 4. 2850--2869.
[30]
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing Atari With Deep Reinforcement Learning. In NIPS Deep Learning Workshop.
[31]
Phuong Anh Nguyen and Chong-Wah Ngo. 2021. Interactive Search vs. Automatic Search: An Extensive Study on Video Retrieval. ACM Transactions on Multimedia Computing, Communications, and Applications, Vol. 17, 2 (2021). https://doi.org/10.1145/3429457
[32]
Ladislav Peška, Gregor Kovalčík, Tomáš Souček, Vít Škrhák, and Jakub Lokoč. 2021. W2VV++ BERT model at VBS 2021. In International Conference on Multimedia Modeling. Springer, 467--472.
[33]
Luca Rossetto, Mahnaz Amiri Parian, Ralph Gasser, Ivan Giangreco, Silvan Heller, and Heiko Schuldt. 2019a. Deep learning-based concept detection in vitrivr. In International Conference on Multimedia Modeling. Springer, 616--621.
[34]
Luca Rossetto, Ralph Gasser, Jakub Lokoč, Werner Bailer, Klaus Schoeffmann, Bernd Muenzer, Tomáš Souček, Phuong Anh Nguyen, Paolo Bolettieri, Andreas Leibetseder, et al. 2020. Interactive video retrieval in the age of deep learning--detailed evaluation of VBS 2019. IEEE Transactions on Multimedia, Vol. 23 (2020), 243--256.
[35]
Luca Rossetto, Heiko Schuldt, George Awad, and Asad A. Butt. 2019b. V3C -- A Research Video Collection. In MultiMedia Modeling, Ioannis Kompatsiaris, Benoit Huet, Vasileios Mezaris, Cathal Gurrin, Wen-Huang Cheng, and Stefanos Vrochidis (Eds.). Springer International Publishing, Cham, 349--360.
[36]
Jost Schatzmann, Karl Weilhammer, Matt Stuttle, and Steve Young. 2006. A survey of statistical user simulation techniques for reinforcement-learning of dialogue management strategies. The knowledge engineering review, Vol. 21, 2 (2006), 97--126.
[37]
David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature, Vol. 529, 7587 (2016), 484--489.
[38]
Cees GM Snoek, Marcel Worring, Ork de Rooij, Koen EA van de Sande, Rong Yan, and Alexander G Hauptmann. 2008. VideOlympics: real-time evaluation of multimedia retrieval systems. IEEE MultiMedia, Vol. 15, 1 (2008), 86--91.
[39]
Richard S Sutton and Andrew G Barto. 2018. Reinforcement learning: An introduction. MIT Press.
[40]
Fuwen Tan, Paola Cascante-Bonilla, Xiaoxiao Guo, Hui Wu, Song Feng, and Vicente Ordonez. 2019. Drill-down: Interactive retrieval of complex scenes using natural language queries. Advances in neural information processing systems, Vol. 32 (2019).
[41]
Hao Tan and Mohit Bansal. 2019. LXMERT: Learning Cross-Modality Encoder Representations from Transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, 5100--5111. https://doi.org/10.18653/v1/D19-1514
[42]
Bart Thomee and Michael S Lew. 2012. Interactive search in image retrieval: a survey. International Journal of Multimedia Information Retrieval, Vol. 1, 2 (2012), 71--86.
[43]
Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. 2016. YFCC100M: The new data in multimedia research. Commun. ACM, Vol. 59, 2 (2016), 64--73.
[44]
Kazuya Ueki, Yumi Nakagome, Koji Hirakawa, Kotaro Kikuchi, Yoshihiko Hayashi, Tetsuji Ogawa, and Tetsunori Kobayashi. 2018. Waseda_Meisei at TRECVID 2018: Ad-hoc Video Search. In TRECVID.
[45]
Ramakrishna Vedantam, Samy Bengio, Kevin Murphy, Devi Parikh, and Gal Chechik. 2017. Context-aware captions from context-agnostic supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 251--260.
[46]
Jie Wu, Guanbin Li, Xiaoguang Han, and Liang Lin. 2020a. Reinforcement learning for weakly supervised temporal grounding of natural language in untrimmed videos. In Proceedings of the 28th ACM International Conference on Multimedia. 1283--1291.
[47]
Jie Wu, Guanbin Li, Si Liu, and Liang Lin. 2020b. Tree-structured policy based progressive reinforcement learning for temporally language grounding in video. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 12386--12393.
[48]
Jiaxin Wu and Chong-Wah Ngo. 2020. Interpretable Embedding for Ad-Hoc Video Search. In Proceedings of the 28th ACM International Conference on Multimedia (MM 2020). 3357--3366. https://doi.org/10.1145/3394171.3413916
[49]
Jiaxin Wu, Phuong Anh Nguyen, Zhixin Ma, and Chong-Wah Ngo. 2021. SQL-Like Interpretable Interactive Video Search. In International Conference on Multimedia Modeling.
[50]
Rong Yan, Alexander Hauptmann, and Rong Jin. 2003. Multimedia search with pseudo-relevance feedback. In International Conference on Image and Video Retrieval. Springer, 238--247.
[51]
Tong Yu, Yilin Shen, Ruiyi Zhang, Xiangyu Zeng, and Hongxia Jin. 2019. Vision-Language Recommendation via Attribute Augmented Multimodal Reinforcement Learning. Proceedings of the 27th ACM International Conference on Multimedia (2019).
[52]
Hao Zhang, Aixin Sun, Wei Jing, Guoshun Nan, Liangli Zhen, Joey Tianyi Zhou, and Rick Siow Mong Goh. 2021. Video corpus moment retrieval with contrastive learning. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 685--695.
[53]
Zhenxing Zhang, Rami Albatal, Cathal Gurrin, and Alan F Smeaton. 2015. Interactive known-item search using semantic textual and colour modalities. In International Conference on Multimedia Modeling. Springer, 282--286.

Published In

MM '22: Proceedings of the 30th ACM International Conference on Multimedia
October 2022
7537 pages
ISBN: 9781450392037
DOI: 10.1145/3503161

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

  1. interactive search
  2. reinforcement learning
  3. user simulation
  4. video corpus moment retrieval

Qualifiers

  • Research-article

Funding Sources

  • Singapore Ministry of Education (MOE)

Conference

MM '22

Acceptance Rates

Overall acceptance rate: 995 of 4,171 submissions (24%)
