Cross-modal Semantically Augmented Network for Image-text Matching

Published: 11 December 2023
Abstract

    Image-text matching plays an important role in cross-modal information processing. Since there are non-negligible semantic differences between heterogeneous pairwise data, a crucial challenge is how to learn a unified representation. Existing methods mainly rely on the alignment between regional image features and the corresponding entity words. However, regional image features tend to focus on foreground entity information, while the attribute information of entities and the relational information between them are ignored; how to effectively integrate entity-attribute alignment and relationship alignment has not been fully studied. We therefore propose a Cross-Modal Semantically Augmented Network for Image-Text Matching (CMSAN), which combines the relationships between entities in the image with the semantics of relational words in the text. CMSAN (1) proposes an adaptive word-type prediction model that classifies words into four types, i.e., entity words, attribute words, relation words, and unnecessary words, so that different image features can be aligned at multiple levels, and (2) designs a relationship alignment module and an entity-attribute alignment module that maximize the exploitation of semantic information, which gives the model more discriminative power and further improves matching accuracy.
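    To make the two-stage idea in the abstract concrete, below is a minimal sketch of a word-type predictor feeding a type-weighted alignment score. It assumes encoded caption words and detected region features in a shared embedding space; all module names, dimensions, and the weighting scheme are illustrative assumptions, not the authors' published CMSAN implementation.

    ```python
    # Hypothetical sketch of adaptive word-type prediction + typed alignment,
    # loosely following the abstract. Names and dimensions are assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    WORD_TYPES = ("entity", "attribute", "relation", "unnecessary")

    class WordTypePredictor(nn.Module):
        """Classifies each word embedding into one of the four word types."""
        def __init__(self, dim: int = 1024):
            super().__init__()
            self.classifier = nn.Linear(dim, len(WORD_TYPES))

        def forward(self, words: torch.Tensor) -> torch.Tensor:
            # words: (batch, num_words, dim) -> type probs (batch, num_words, 4)
            return F.softmax(self.classifier(words), dim=-1)

    def typed_alignment_score(words, regions, type_probs):
        """Aggregate word-region similarities, weighting each word by how
        likely it is to be a meaningful (entity/attribute/relation) word and
        down-weighting words predicted as 'unnecessary'."""
        # Cosine similarity between every word and every image region.
        sim = torch.einsum("bwd,brd->bwr",
                           F.normalize(words, dim=-1),
                           F.normalize(regions, dim=-1))
        # Best-matching region for each word.
        best = sim.max(dim=-1).values                  # (batch, num_words)
        # Keep only the probability mass on the three meaningful types.
        weight = type_probs[..., :3].sum(dim=-1)       # drop "unnecessary"
        return (best * weight).sum(dim=-1) / weight.sum(dim=-1).clamp(min=1e-6)

    if __name__ == "__main__":
        words = torch.randn(2, 12, 1024)    # e.g., GRU-encoded caption words
        regions = torch.randn(2, 36, 1024)  # e.g., Faster R-CNN region features
        probs = WordTypePredictor()(words)
        print(typed_alignment_score(words, regions, probs))  # one score per pair
    ```

    In a full model, the per-type probabilities would instead route words to separate entity-attribute and relationship alignment modules; the single weighted score above only illustrates how a four-way word classification can suppress uninformative words during matching.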


    Cited By

    • (2024) Towards Retrieval-Augmented Architectures for Image Captioning. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 8 (2024), 1–22. DOI: 10.1145/3663667. Online publication date: 12-Jun-2024.
    • (2024) Cross-modal Semantic Interference Suppression for image-text matching. Engineering Applications of Artificial Intelligence 133, Part A (2024). DOI: 10.1016/j.engappai.2024.108005. Online publication date: 1-Jul-2024.

      Published In

      ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 4
      April 2024, 676 pages
      ISSN: 1551-6857
      EISSN: 1551-6865
      DOI: 10.1145/3613617
      Editor: Abdulmotaleb El Saddik

      Publisher

      Association for Computing Machinery, New York, NY, United States

      Publication History

      Published: 11 December 2023
      Online AM: 04 November 2023
      Accepted: 26 October 2023
      Revised: 24 September 2023
      Received: 03 June 2023
      Published in TOMM Volume 20, Issue 4

      Author Tags

      1. Image-text matching
      2. cross-modal semantically augmented
      3. adaptive word-type prediction model
      4. relationship alignment

      Qualifiers

      • Research-article

      Funding Sources

      • Natural Science Foundation of Shandong Province
      • National Natural Science Foundation of China
