DOI: 10.1145/3460426.3463610

Learning Hierarchical Visual-Semantic Representation with Phrase Alignment

Published: 01 September 2021

Abstract

Effective visual-semantic representation is critical to the image-text matching task. Various methods have been proposed to enrich image representations with semantic concepts, and considerable progress has been achieved. However, the internal hierarchical structure of both images and text, which could effectively enhance semantic representation, is rarely explored in image-text matching. In this work, we propose a Hierarchical Visual-Semantic Network (HVSN) with fine-grained semantic alignment to exploit this hierarchical structure. Specifically, we first model the spatial or semantic relationships among objects and aggregate them into visual semantic concepts with a Local Relational Attention (LRA) module. We then employ a Gated Recurrent Unit (GRU) to learn the relationships between visual semantic concepts and generate the global image representation. For the text, we build phrase features from related words, then generate the text representation by learning the relationships between these phrases. In addition, the model is trained by jointly optimizing the image-text retrieval and phrase alignment tasks to capture the fine-grained interplay between vision and language. Our approach achieves state-of-the-art performance on the Flickr30K and MS-COCO datasets. On Flickr30K, it outperforms the previous state of the art by a relative 3.9% in text retrieval with an image query and a relative 1.3% in image retrieval with a text query (based on Recall@1). On MS-COCO, HVSN improves image retrieval by a relative 2.3% and text retrieval by a relative 1.2%. Both quantitative and visual ablation studies verify the effectiveness of the proposed modules.
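To make the pipeline described above concrete, the following is a minimal, hypothetical PyTorch sketch of the image side (relational attention over detected regions to form concepts, then a GRU over the concept sequence) together with a standard bidirectional ranking loss. It is not the authors' code: the internals of `LocalRelationalAttention`, the 2048-d region features (e.g., from a bottom-up detector), the 1024-d joint space, the number of concepts, and the sum-over-negatives hinge loss are all assumptions, and the phrase alignment term is only indicated in a comment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalRelationalAttention(nn.Module):
    """Aggregates detected object regions into a small set of visual
    semantic concepts via attention. The internals here are assumed,
    not the paper's actual LRA formulation."""

    def __init__(self, region_dim=2048, concept_dim=1024, n_concepts=8):
        super().__init__()
        # Learned concept queries attend over the projected region features.
        self.query = nn.Parameter(torch.randn(n_concepts, concept_dim))
        self.key = nn.Linear(region_dim, concept_dim)
        self.value = nn.Linear(region_dim, concept_dim)

    def forward(self, regions):                        # regions: (B, R, region_dim)
        k = self.key(regions)                          # (B, R, D)
        v = self.value(regions)                        # (B, R, D)
        att = torch.einsum('cd,brd->bcr', self.query, k) / k.size(-1) ** 0.5
        att = att.softmax(dim=-1)                      # each concept attends over regions
        return torch.einsum('bcr,brd->bcd', att, v)    # concepts: (B, n_concepts, D)


class HVSNImageEncoder(nn.Module):
    """Concept aggregation followed by a GRU over the concept sequence,
    mirroring the image side described in the abstract."""

    def __init__(self, concept_dim=1024):
        super().__init__()
        self.lra = LocalRelationalAttention(concept_dim=concept_dim)
        self.gru = nn.GRU(concept_dim, concept_dim, batch_first=True)

    def forward(self, regions):
        concepts = self.lra(regions)                   # (B, C, D)
        _, h = self.gru(concepts)                      # final hidden state summarizes concepts
        return F.normalize(h.squeeze(0), dim=-1)       # unit-norm global image embedding


def ranking_loss(im, txt, margin=0.2):
    """Bidirectional hinge loss over in-batch negatives (an assumed
    stand-in for the retrieval objective; the paper's phrase alignment
    loss would be added as a second, jointly optimized term)."""
    scores = im @ txt.t()                              # cosine similarity matrix (B, B)
    pos = scores.diag().unsqueeze(1)                   # matched pairs on the diagonal
    cost_t = (margin + scores - pos).clamp(min=0)      # text retrieval with image query
    cost_i = (margin + scores - pos.t()).clamp(min=0)  # image retrieval with text query
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    return cost_t.masked_fill(mask, 0).sum() + cost_i.masked_fill(mask, 0).sum()
```

A text encoder producing embeddings of the same dimension would supply `txt`; training would then minimize `ranking_loss(image_emb, text_emb)` plus the phrase alignment term over matched image-caption batches.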




Published In

ICMR '21: Proceedings of the 2021 International Conference on Multimedia Retrieval
August 2021, 715 pages
ISBN: 9781450384636
DOI: 10.1145/3460426

Publisher

Association for Computing Machinery, New York, NY, United States



Author Tags

  1. image-text matching
  2. multi-modal retrieval
  3. phrase alignment
  4. visual-semantic representation

Qualifiers

  • Research-article

Conference

ICMR '21

Acceptance Rates

Overall Acceptance Rate: 254 of 830 submissions, 31%

