Abstract
The image-text retrieval task has received considerable attention in artificial intelligence research, yet it remains challenging because image and text are heterogeneous cross-modal data. The key issue is how to learn a common feature space while preserving the semantic correspondence between image and text. Existing works fail to obtain fine cross-modal feature representations because the semantic relations between local features are not effectively exploited and noisy information is not suppressed. To address these issues, we propose a Cross-modal Alignment with Graph Reasoning (CAGR) model, which learns refined cross-modal features in the common feature space and then performs fine-grained cross-modal alignment. Specifically, we introduce a graph reasoning module that explores semantic connections among local elements in each modality and measures their importance with a self-attention mechanism. Through multi-step reasoning, the visual and textual semantic graphs are effectively learned and refined visual and textual features are obtained. Finally, to measure the similarity between an image and a text, a novel alignment approach named cross-modal attentional fine-grained alignment computes a similarity score between the two sets of local features. Our model achieves competitive performance compared with state-of-the-art methods on the Flickr30K and MS-COCO datasets, and extensive experiments demonstrate its effectiveness.
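To make the alignment step concrete, the sketch below shows one way a fine-grained, attention-weighted similarity between a set of image region features and a set of word features can be computed once both are projected into the common space. This is a minimal PyTorch sketch assuming a stacked-cross-attention-style formulation; the function name, temperature value, and mean-pooling choice are illustrative assumptions and are not taken from the paper.

```python
import torch
import torch.nn.functional as F

def attentional_alignment_score(regions, words, temperature=9.0):
    """Illustrative similarity between one image (n region features) and one
    sentence (m word features), both already projected into the shared space.

    regions: (n, d) tensor; words: (m, d) tensor. Returns a scalar score.
    """
    r = F.normalize(regions, dim=-1)                 # L2-normalise local features
    w = F.normalize(words, dim=-1)
    sim = w @ r.t()                                  # (m, n) word-region cosine similarities
    attn = F.softmax(temperature * sim, dim=-1)      # each word attends over regions
    attended = attn @ r                              # (m, d) attended region context per word
    word_scores = F.cosine_similarity(w, attended, dim=-1)  # (m,) word-level alignment scores
    return word_scores.mean()                        # pool word scores into one image-text score

# Toy usage: 36 region features and 12 word features of dimension 1024.
score = attentional_alignment_score(torch.randn(36, 1024), torch.randn(12, 1024))
```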
Acknowledgements
This work was supported by the National Key R&D Program of China (No. 2021ZD0111900) and the Natural Science Foundation of China (U21B2038, U1811463, U19B2039).
Cite this article
Cui, Z., Hu, Y., Sun, Y. et al. Cross-modal alignment with graph reasoning for image-text retrieval. Multimed Tools Appl 81, 23615–23632 (2022). https://doi.org/10.1007/s11042-022-12444-8