research-article

External Knowledge Dynamic Modeling for Image-text Retrieval

Authors:

Anan Liu

MM '23: Proceedings of the 31st ACM International Conference on Multimedia

Pages 5330 - 5338

https://doi.org/10.1145/3581783.3613786

Published: 27 October 2023

Abstract

Image-text retrieval is a fundamental branch of cross-modal retrieval. Its core is to explore semantic correspondence in order to align relevant image-text pairs. Some existing methods rely on global semantics and co-occurrence frequency to design knowledge introduction patterns for consistent representations. However, they lack flexibility due to the limitations of fixed information and empirical feedback. To address these issues, we develop an External Knowledge Dynamic Modeling (EKDM) architecture based on a filtering mechanism, which dynamically explores different knowledge for different image-text pairs. Specifically, we first capture abundant concepts and relationships from external knowledge to construct visual and textual corpus sets. Then, we progressively explore concepts related to images and texts using dynamic global representations. To endow the model with the capability of relationship decision, we integrate the variable spatial locations between objects for association exploration. Since the filtering mechanism is conditioned on dynamic semantics and variable spatial locations, our model can dynamically model different knowledge for different image-text pairs. Extensive experimental results on two benchmark datasets demonstrate the effectiveness of our proposed method.
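
The abstract does not specify how the filtering mechanism is implemented, so the following is only a minimal sketch of the general idea of selecting corpus concepts conditioned on a changing global representation. It is not the EKDM architecture. The function names (`cosine`, `filter_concepts`, `progressive_filter`), the toy three-dimensional vectors, the cosine-similarity scoring, and the averaging update are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def filter_concepts(global_vec, corpus, k):
    """Return the names of the k corpus concepts most similar to global_vec."""
    ranked = sorted(corpus, key=lambda name: cosine(global_vec, corpus[name]),
                    reverse=True)
    return ranked[:k]

def progressive_filter(global_vec, corpus, k, steps):
    """Repeat filtering, folding the selected concepts back into the global vector.

    Hypothetical update rule: average the current global vector with the
    mean of the selected concept vectors. Assumes k >= 1.
    """
    selected = []
    for _ in range(steps):
        selected = filter_concepts(global_vec, corpus, k)
        dim = len(global_vec)
        mean = [sum(corpus[n][i] for n in selected) / len(selected)
                for i in range(dim)]
        global_vec = [(g + m) / 2 for g, m in zip(global_vec, mean)]
    return selected, global_vec

# Toy corpus of concept embeddings (made-up values, not from the paper).
corpus = {
    "dog": [1.0, 0.0, 0.0],
    "frisbee": [0.8, 0.6, 0.0],
    "kitchen": [0.0, 0.0, 1.0],
}
image_vec = [0.9, 0.3, 0.1]  # stand-in for an image's global representation
concepts, updated = progressive_filter(image_vec, corpus, k=2, steps=2)
```

In this toy setting the selection depends on the query vector, so a different image or text representation would retain a different subset of the corpus. That is the only property of the abstract's description the sketch is meant to illustrate. Spatial locations between objects are not modeled here.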

Index Terms

1. Information systems
  1. Information retrieval
    1. Specialized information retrieval
      1. Multimedia and multimodal retrieval

Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia

October 2023

9913 pages

ISBN:9798400701085

DOI:10.1145/3581783

General Chairs: Abdulmotaleb El Saddik (University of Ottawa, Canada & MBZUAI, UAE), Tao Mei (HiDream.ai, China), Rita Cucchiara (University of Modena and Reggio Emilia, Italy)

Program Chairs: Marco Bertini (University of Florence, Italy), Diana Patricia Tobon Vallejo (Universidad de Medellin, Colombia), Pradeep K. Atrey (University at Albany, State University of New York, USA), M. Shamim Hossain (King Saud University, KSA)

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

MM '23

Sponsor:

SIGMM

MM '23: The 31st ACM International Conference on Multimedia

October 29 - November 3, 2023

Ottawa ON, Canada

Acceptance Rates

Overall Acceptance Rate 995 of 4,171 submissions, 24%
