
Cross-Modal Hybrid Feature Fusion for Image-Sentence Matching

Published: 12 November 2021

Abstract

Image-sentence matching is a challenging task at the intersection of language and vision, which aims to measure the similarity between images and sentence descriptions. Most existing methods independently map the global features of images and sentences into a common space to calculate the image-sentence similarity. However, the similarity obtained by these methods may be coarse because (1) an intermediate common space is introduced to implicitly match the heterogeneous features of images and sentences at the global level, and (2) only the inter-modality relations between images and sentences are captured while the intra-modality relations are ignored. To overcome these limitations, we propose a novel Cross-Modal Hybrid Feature Fusion (CMHF) framework that directly learns the image-sentence similarity by fusing multimodal features with both inter- and intra-modality relations incorporated. It robustly captures high-level interactions between visual regions in images and words in sentences, using flexible attention mechanisms to generate effective attention flows within and across the two modalities. A structured objective with a ranking-loss constraint is formulated in CMHF to learn the image-sentence similarity from the fused fine-grained features of different modalities, bypassing the use of an intermediate common space. Extensive experiments and comprehensive analysis on two widely used datasets, Microsoft COCO and Flickr30K, show the effectiveness of the hybrid feature fusion framework, with the proposed CMHF method achieving state-of-the-art matching performance.
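To make the described approach concrete, the following is a minimal, hypothetical PyTorch sketch (not the authors' implementation) of the two ingredients the abstract names: intra- and inter-modality attention flows over region and word features, and a hinge-based ranking loss on a directly predicted image-sentence similarity score, with no intermediate common space. The class and function names (HybridFusionScorer, ranking_loss) and all dimensions are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F


class HybridFusionScorer(nn.Module):
    """Fuses region and word features and predicts a scalar image-sentence similarity."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        # Intra-modality attention: regions attend to regions, words attend to words.
        self.intra_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.intra_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Inter-modality attention: each modality attends to the other.
        self.img2txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt2img = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Similarity head on the fused features (no shared embedding space).
        self.score = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, regions, words):
        # regions: (B, R, dim) visual region features; words: (B, W, dim) word features.
        v, _ = self.intra_img(regions, regions, regions)  # intra-modality flow (visual)
        t, _ = self.intra_txt(words, words, words)        # intra-modality flow (textual)
        v2, _ = self.txt2img(v, t, t)                     # regions attend to words
        t2, _ = self.img2txt(t, v, v)                     # words attend to regions
        fused = torch.cat([v2.mean(dim=1), t2.mean(dim=1)], dim=-1)
        return self.score(fused).squeeze(-1)              # (B,) similarity scores


def ranking_loss(model, regions, words, margin=0.2):
    """Hinge ranking loss: matched pairs should outscore mismatched pairs by a margin."""
    pos = model(regions, words)                           # aligned image-sentence pairs
    neg = model(regions, words.roll(shifts=1, dims=0))    # shuffled (negative) sentences
    return F.relu(margin - pos + neg).mean()


# Toy usage with random features standing in for detected regions and encoded words.
model = HybridFusionScorer()
regions = torch.randn(8, 36, 256)
words = torch.randn(8, 12, 256)
loss = ranking_loss(model, regions, words)
loss.backward()

Note that the negatives here are formed by shuffling sentences within the batch, a simple stand-in for the harder negative sampling that ranking losses in this literature typically rely on.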

Published In

ACM Transactions on Multimedia Computing, Communications, and Applications  Volume 17, Issue 4
November 2021
529 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3492437

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 12 November 2021
Accepted: 01 March 2021
Revised: 01 February 2021
Received: 01 November 2020
Published in TOMM Volume 17, Issue 4

Author Tags

  1. Image-sentence matching
  2. multimodal feature fusion
  3. cross-modal retrieval
  4. attention mechanism

Qualifiers

  • Research-article
  • Refereed

Funding Sources

  • National Natural Science Foundation of China
  • Fundamental Research Funds for the Central Universities
  • Sichuan Science and Technology Program, China
