
A method for image–text matching based on semantic filtering and adaptive adjustment

Published: 29 August 2024

Abstract

Image–text matching, a critical task in computer vision, links cross-modal data and has therefore attracted extensive attention. Most existing image–text matching methods align images with texts by exploring the local similarities between image regions and sentence words. Although this fine-grained approach yields remarkable gains, how to further mine the deep semantics shared by a data pair, and how to focus on the essential semantics in the data, remain open questions. In this work, we propose a new semantic filtering and adaptive regulation approach (FAAR) to ease this problem. Specifically, the filtered attention (FA) module selectively focuses on typical alignments, eliminating the interference of meaningless comparisons. The adaptive regulator (AR) then further adjusts the attention weights of the key segments among the filtered regions and words. The superiority of the proposed method is validated by a number of qualitative experiments and analyses on the Flickr30K and MSCOCO data sets.
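
To make the mechanism sketched above concrete, the following is a minimal, hypothetical PyTorch sketch of a filtered, adaptively re-weighted cross-attention step. It illustrates the general technique only and is not the authors' FAAR model: the threshold tau, the temperature lam, the feature dimensions, and the function name filtered_adaptive_attention are all assumptions introduced here for illustration.

import torch
import torch.nn.functional as F

def filtered_adaptive_attention(regions, words, tau=0.1, lam=10.0):
    # Hypothetical sketch of filtered attention (FA) followed by
    # adaptive re-weighting (AR); tau and lam are assumed
    # hyperparameters, not values taken from the paper.
    #   regions: (n_r, d) image-region features
    #   words:   (n_w, d) word features

    # Cosine similarity between every region and every word.
    sim = F.normalize(regions, dim=-1) @ F.normalize(words, dim=-1).t()

    # Filtering step: keep only non-negative, above-threshold
    # alignments so that meaningless region-word comparisons cannot
    # contribute to the attention pooling.
    filtered = sim.clamp(min=0)
    filtered = filtered * (filtered >= tau).float()

    # Adaptive re-weighting step: a temperature-scaled softmax
    # sharpens the weights of the surviving (key) alignments, while
    # filtered-out pairs fall to the baseline minimum weight.
    attn = F.softmax(lam * filtered, dim=-1)          # (n_r, n_w)

    # Each region attends to its relevant textual context.
    attended_text = attn @ words                      # (n_r, d)

    # Global image-sentence score: mean region-level cosine match.
    return F.cosine_similarity(regions, attended_text, dim=-1).mean()

# Toy usage with randomly initialized features.
torch.manual_seed(0)
img = torch.randn(36, 1024)   # e.g. 36 detected region features
txt = torch.randn(12, 1024)   # 12 word embeddings
print(filtered_adaptive_attention(img, txt).item())

Zeroing sub-threshold similarities before the sharpened softmax is one simple way to realize the abstract's two ideas in sequence: first eliminate meaningless comparisons, then boost the attention weights of the key segments that remain.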



Published In

Journal on Image and Video Processing, Volume 2024, Issue 1
September 2024, 644 pages

Publisher

Hindawi Limited, London, United Kingdom

Publication History

Published: 29 August 2024
Accepted: 03 August 2024
Received: 31 August 2023

Author Tags

1. Cross-modal retrieval
2. Image–text matching
3. Semantic filtration
4. Attention mechanisms
5. Feature learning

Qualifiers

• Research-article
