Text-Video Retrieval via Multi-Modal Hypergraph Networks

Authors:

Dawei Yin

WSDM '24: Proceedings of the 17th ACM International Conference on Web Search and Data Mining

Pages 369 - 377

https://doi.org/10.1145/3616855.3635757

Published: 04 March 2024

Abstract

Text-video retrieval is a challenging task that aims to identify relevant videos given textual queries. Compared with conventional text retrieval, its main obstacle is the semantic gap between the textual nature of queries and the visual richness of video content. Previous work primarily aligns the query and the video by finely aggregating word-frame matching signals. Humans, however, judge the relevance between text and video in a modular fashion, and because video content is consecutive and complex, such judgment requires high-order matching signals. In this paper, we propose chunk-level text-video matching, where query chunks are extracted to describe a specific retrieval unit and video chunks are obtained by segmenting videos into distinct clips. We formulate chunk-level matching as modeling n-ary correlations between the words of the query and the frames of the video, and introduce a multi-modal hypergraph to model them. The hypergraph is constructed by representing textual units and video frames as nodes and using hyperedges to depict their relationships. In this way, the query and the video can be aligned in a high-order semantic space. In addition, to enhance the model's generalization ability, the extracted features are fed into a variational inference component, which yields a variational representation under a Gaussian distribution. Combining hypergraphs with variational inference allows our model to capture complex, n-ary interactions among textual and visual contents. Experimental results demonstrate that our proposed method achieves state-of-the-art performance on the text-video retrieval task.
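The abstract does not spell out how the hypergraph is built or propagated, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a toy query of three words and a video of four frames, two hand-picked hyperedges standing in for chunks, the widely used normalized hypergraph convolution rule (D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2} X) as the propagation step, and the standard Gaussian reparameterization trick as the variational step. The names `hypergraph_conv`, `reparameterize`, and all sizes are hypothetical.

```python
import numpy as np

def hypergraph_conv(X, H, edge_weights=None):
    """One normalized hypergraph propagation step (assumed rule, no learned weights).

    X: (n_nodes, dim) node features; H: (n_nodes, n_edges) 0/1 incidence matrix.
    """
    n_edges = H.shape[1]
    w = np.ones(n_edges) if edge_weights is None else np.asarray(edge_weights, dtype=float)
    d_v = H @ w              # weighted degree of each node
    d_e = H.sum(axis=0)      # number of nodes in each hyperedge
    dv_inv_sqrt = np.where(d_v > 0, 1.0 / np.sqrt(np.maximum(d_v, 1e-12)), 0.0)
    de_inv = np.where(d_e > 0, 1.0 / np.maximum(d_e, 1e-12), 0.0)
    left = (dv_inv_sqrt[:, None] * H) * (w * de_inv)[None, :]   # D_v^-1/2 H W D_e^-1
    right = H.T * dv_inv_sqrt[None, :]                          # H^T D_v^-1/2
    return left @ right @ X

def reparameterize(mu, log_var, rng):
    """Sample z ~ N(mu, exp(log_var)) via the reparameterization trick."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)

# Nodes 0-2 are word features, nodes 3-6 are frame features (random placeholders).
X = rng.standard_normal((7, 4))

# Two hypothetical chunk-level hyperedges, each joining words and frames.
H = np.zeros((7, 2))
H[[0, 1, 3, 4], 0] = 1.0   # chunk 1: words 0-1 with frames 3-4
H[[2, 5, 6], 1] = 1.0      # chunk 2: word 2 with frames 5-6

fused = hypergraph_conv(X, H)                    # each node mixes with its chunk
mu, log_var = fused, np.full_like(fused, -2.0)   # placeholder Gaussian parameters
z = reparameterize(mu, log_var, rng)
print(fused.shape, z.shape)
```

Because a hyperedge can join any number of nodes, a single edge here ties several words to several frames at once, which is what distinguishes n-ary matching from pairwise word-frame matching. In a trained model the mean and log-variance would come from learned layers rather than the fixed placeholders used above.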

Index Terms

Text-Video Retrieval via Multi-Modal Hypergraph Networks
1. Information systems
  1. Information retrieval
    1. Retrieval models and ranking
    2. Retrieval tasks and goals


WSDM '24: Proceedings of the 17th ACM International Conference on Web Search and Data Mining

March 2024

1246 pages

ISBN:9798400703713

DOI:10.1145/3616855

General Chairs: Luz Angélica Caudillo Mata (MDA Geointelligence), Silvio Lattanzi (Google Research), Andrés Muñoz Medina (Google Research)

Program Chairs: Leman Akoglu (CMU), Aristides Gionis (KTH), Sergei Vassilvitskii (Google Research)

Copyright © 2024 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

WSDM '24

WSDM '24: The 17th ACM International Conference on Web Search and Data Mining

March 4 - 8, 2024

Merida, Mexico

Acceptance Rates

Overall Acceptance Rate 498 of 2,863 submissions, 17%
