Abstract
The image-text retrieval task has received considerable attention in artificial intelligence research, yet it remains challenging because image and text are heterogeneous cross-modal data. The key issue is how to learn a common feature space while preserving the semantic correspondence between image and text. Existing works fail to obtain fine cross-modal feature representations because the semantic relations between local features are not effectively exploited and noisy information is not suppressed. To address these issues, we propose a Cross-modal Alignment with Graph Reasoning (CAGR) model, which learns refined cross-modal features in the common feature space and then performs fine-grained cross-modal alignment. Specifically, we introduce a graph reasoning module that explores semantic connections among local elements in each modality and measures their importance with a self-attention mechanism. Through multi-step reasoning, the visual and textual semantic graphs are effectively learned and refined visual and textual features are obtained. Finally, to measure the similarity between an image and a text, a novel alignment approach named cross-modal attentional fine-grained alignment computes a similarity score between the two sets of local features. Our model achieves competitive performance compared with state-of-the-art methods on the Flickr30K and MS-COCO datasets, and extensive experiments demonstrate its effectiveness.
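To make the alignment step concrete, the sketch below shows one way a fine-grained, attention-weighted similarity between a set of image region features and a set of word features can be computed once both are projected into the common space. This is a minimal PyTorch sketch assuming a stacked-cross-attention-style formulation; the function name, temperature value, and mean-pooling choice are illustrative assumptions and are not taken from the paper.

```python
import torch
import torch.nn.functional as F

def attentional_alignment_score(regions, words, temperature=9.0):
    """Illustrative similarity between one image (n region features) and one
    sentence (m word features), both already projected into the shared space.

    regions: (n, d) tensor; words: (m, d) tensor. Returns a scalar score.
    """
    r = F.normalize(regions, dim=-1)                 # L2-normalise local features
    w = F.normalize(words, dim=-1)
    sim = w @ r.t()                                  # (m, n) word-region cosine similarities
    attn = F.softmax(temperature * sim, dim=-1)      # each word attends over regions
    attended = attn @ r                              # (m, d) attended region context per word
    word_scores = F.cosine_similarity(w, attended, dim=-1)  # (m,) word-level alignment scores
    return word_scores.mean()                        # pool word scores into one image-text score

# Toy usage: 36 region features and 12 word features of dimension 1024.
score = attentional_alignment_score(torch.randn(36, 1024), torch.randn(12, 1024))
```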
Acknowledgements
This work was supported by the National Key R&D Program of China (No. 2021ZD0111900) and the Natural Science Foundation of China (U21B2038, U1811463, U19B2039).
Cite this article
Cui, Z., Hu, Y., Sun, Y. et al. Cross-modal alignment with graph reasoning for image-text retrieval. Multimed Tools Appl 81, 23615–23632 (2022). https://doi.org/10.1007/s11042-022-12444-8