
A method for image–text matching based on semantic filtering and adaptive adjustment

Published: 29 August 2024

Abstract

Image–text matching, a critical task in computer vision, links cross-modal data and has therefore attracted extensive attention. Most existing image–text matching methods align images with texts by exploring the local similarities between image regions and sentence words. Although this fine-grained approach yields remarkable gains, how to further mine the deep semantics shared by a data pair, and how to focus on the essential semantics in the data, remain open questions. In this work, we propose a new semantic filtering and adaptive regulation approach (FAAR) to ease this problem. Specifically, the filtered attention (FA) module selectively focuses on typical alignments, eliminating the interference of meaningless comparisons. The adaptive regulator (AR) then further adjusts the attention weights of the key segments among the filtered regions and words. The superiority of the proposed method is validated by a number of qualitative experiments and analyses on the Flickr30K and MSCOCO data sets.
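
To make the mechanism sketched above concrete, the following is a minimal, hypothetical PyTorch sketch of a filtered, adaptively re-weighted cross-attention step. It illustrates the general technique only and is not the authors' FAAR model: the threshold tau, the temperature lam, the feature dimensions, and the function name filtered_adaptive_attention are all assumptions introduced here for illustration.

import torch
import torch.nn.functional as F

def filtered_adaptive_attention(regions, words, tau=0.1, lam=10.0):
    # Hypothetical sketch of filtered attention (FA) followed by
    # adaptive re-weighting (AR); tau and lam are assumed
    # hyperparameters, not values taken from the paper.
    #   regions: (n_r, d) image-region features
    #   words:   (n_w, d) word features

    # Cosine similarity between every region and every word.
    sim = F.normalize(regions, dim=-1) @ F.normalize(words, dim=-1).t()

    # Filtering step: keep only non-negative, above-threshold
    # alignments so that meaningless region-word comparisons cannot
    # contribute to the attention pooling.
    filtered = sim.clamp(min=0)
    filtered = filtered * (filtered >= tau).float()

    # Adaptive re-weighting step: a temperature-scaled softmax
    # sharpens the weights of the surviving (key) alignments, while
    # filtered-out pairs fall to the baseline minimum weight.
    attn = F.softmax(lam * filtered, dim=-1)          # (n_r, n_w)

    # Each region attends to its relevant textual context.
    attended_text = attn @ words                      # (n_r, d)

    # Global image-sentence score: mean region-level cosine match.
    return F.cosine_similarity(regions, attended_text, dim=-1).mean()

# Toy usage with randomly initialized features.
torch.manual_seed(0)
img = torch.randn(36, 1024)   # e.g. 36 detected region features
txt = torch.randn(12, 1024)   # 12 word embeddings
print(filtered_adaptive_attention(img, txt).item())

Zeroing sub-threshold similarities before the sharpened softmax is one simple way to realize the abstract's two ideas in sequence: first eliminate meaningless comparisons, then boost the attention weights of the key segments that remain.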



Published In

Journal on Image and Video Processing, Volume 2024, Issue 1
September 2024, 644 pages

Publisher

Hindawi Limited, London, United Kingdom

Publication History

Published: 29 August 2024
Accepted: 03 August 2024
Received: 31 August 2023

Author Tags

1. Cross-modal retrieval
2. Image–text matching
3. Semantic filtration
4. Attention mechanisms
5. Feature learning

Qualifiers

• Research-article
