DOI: 10.1145/3616855.3635757
Research Article

Text-Video Retrieval via Multi-Modal Hypergraph Networks

Published: 04 March 2024

Abstract

Text-video retrieval is a challenging task that aims to identify relevant videos given textual queries. Compared to conventional textual retrieval, the main obstacle in text-video retrieval is the semantic gap between the textual nature of queries and the visual richness of video content. Previous works primarily focus on aligning the query and the video by finely aggregating word-frame matching signals. Inspired by the human cognitive process of modularly judging the relevance between text and video, we argue that this judgment requires high-order matching signals because of the continuous and complex nature of video content. In this paper, we propose chunk-level text-video matching, in which query chunks are extracted to describe specific retrieval units and video chunks are segmented from the video into distinct clips. We formulate chunk-level matching as n-ary correlation modeling between the words of the query and the frames of the video, and introduce a multi-modal hypergraph to capture these n-ary correlations. The hypergraph represents textual units and video frames as nodes and uses hyperedges to depict their relationships, so that the query and the video can be aligned in a high-order semantic space. In addition, to enhance the model's generalization ability, the extracted features are fed into a variational inference component, which yields a variational representation under a Gaussian distribution. The combination of hypergraphs and variational inference allows our model to capture complex, n-ary interactions among textual and visual contents. Experimental results demonstrate that our proposed method achieves state-of-the-art performance on the text-video retrieval task.
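To make the n-ary correlation modeling concrete, here is a minimal PyTorch sketch of a hypergraph convolution layer in the standard HGNN form X' = D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2} X Θ (with W taken as the identity), followed by a toy multi-modal hypergraph in which word and frame nodes are joined by chunk-level hyperedges. This illustrates the general technique only, not the paper's implementation; the class name HypergraphConv, the incidence layout, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class HypergraphConv(nn.Module):
    """One hypergraph convolution layer (hypothetical name), in the common
    HGNN form X' = D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2} X Theta, with W = I.
    H is the (num_nodes, num_edges) node-hyperedge incidence matrix."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.theta = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
        d_v = H.sum(dim=1).clamp(min=1.0)   # node degrees, (num_nodes,)
        d_e = H.sum(dim=0).clamp(min=1.0)   # hyperedge degrees, (num_edges,)
        x = self.theta(x)
        x = x / d_v.sqrt().unsqueeze(1)                  # D_v^{-1/2} X Theta
        edge_feats = (H.t() @ x) / d_e.unsqueeze(1)      # D_e^{-1} H^T (...)
        x = (H @ edge_feats) / d_v.sqrt().unsqueeze(1)   # D_v^{-1/2} H (...)
        return torch.relu(x)

# Toy multi-modal hypergraph: 5 word nodes + 8 frame nodes, 3 hyperedges.
words = torch.randn(5, 256)                  # query word embeddings
frames = torch.randn(8, 256)                 # video frame embeddings
nodes = torch.cat([words, frames], dim=0)    # (13, 256) joint node set

H = torch.zeros(13, 3)
H[0:3, 0] = 1.0; H[5:9, 0] = 1.0    # hyperedge 0: query chunk 1 + clip 1 frames
H[2:5, 1] = 1.0; H[8:13, 1] = 1.0   # hyperedge 1: query chunk 2 + clip 2 frames
H[:, 2] = 1.0                       # hyperedge 2: global context over all nodes

layer = HypergraphConv(256, 256)
out = layer(nodes, H)               # (13, 256) chunk-aware node features
```

Because each hyperedge aggregates an entire query chunk together with an entire video clip in one step, stacked layers propagate matching signals across whole chunks at once rather than along pairwise word-frame edges, which is what makes the modeled correlations n-ary.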
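The variational inference component can likewise be sketched with the standard Gaussian reparameterization trick: features are mapped to the mean and log-variance of a diagonal Gaussian, a representation is sampled, and a KL term toward a unit-Gaussian prior acts as a regularizer. This is a hedged sketch under common-practice assumptions; GaussianVariationalHead and the choice of prior are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class GaussianVariationalHead(nn.Module):
    """Maps features to a diagonal-Gaussian posterior and samples a
    variational representation via the reparameterization trick."""

    def __init__(self, dim: int):
        super().__init__()
        self.mu = nn.Linear(dim, dim)
        self.logvar = nn.Linear(dim, dim)

    def forward(self, h: torch.Tensor):
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # z = mu + sigma * eps
        # KL(q(z|h) || N(0, I)), summed over feature dims, averaged over inputs
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1).mean()
        return z, kl

head = GaussianVariationalHead(256)
z, kl = head(torch.randn(13, 256))  # e.g., the hypergraph outputs from above
# kl would be added to the retrieval loss as a (weighted) regularizer
```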


Cited By

  • Simple Yet Effective: Structure Guided Pre-trained Transformer for Multi-modal Knowledge Graph Reasoning. In Proceedings of the 32nd ACM International Conference on Multimedia (2024), 1554-1563. https://doi.org/10.1145/3664647.3681112
  • GDPR-compliant Video Search and Retrieval System for Surveillance Data. In Proceedings of the 19th International Conference on Availability, Reliability and Security (2024), 1-6. https://doi.org/10.1145/3664476.3670472
  • Hierarchical bi-directional conceptual interaction for text-video retrieval. Multimedia Systems 30, 6 (2024). https://doi.org/10.1007/s00530-024-01525-3


Published In

WSDM '24: Proceedings of the 17th ACM International Conference on Web Search and Data Mining
March 2024, 1246 pages
ISBN: 9798400703713
DOI: 10.1145/3616855

Publisher

Association for Computing Machinery, New York, NY, United States


      Author Tags

      1. hypergraph neural networks
      2. multi-modal hypergraph
      3. text-video retrieval

      Conference

      WSDM '24

      Acceptance Rates

      Overall Acceptance Rate 498 of 2,863 submissions, 17%

