DOI: 10.1145/3460426.3463610

Learning Hierarchical Visual-Semantic Representation with Phrase Alignment

Published: 01 September 2021

Abstract

Effective visual-semantic representation is critical to the image-text matching task. Various methods have been proposed to enrich image representations with semantic concepts, and considerable progress has been achieved. However, the internal hierarchical structure of both images and text, which could effectively enhance semantic representation, is rarely explored in image-text matching. In this work, we propose a Hierarchical Visual-Semantic Network (HVSN) with fine-grained semantic alignment to exploit this hierarchical structure. Specifically, we first model the spatial or semantic relationships among objects and aggregate them into visual semantic concepts with a Local Relational Attention (LRA) module. We then employ a Gated Recurrent Unit (GRU) to learn the relationships between visual semantic concepts and generate the global image representation. For the text, we build phrase features from related words, then generate the text representation by learning the relationships between these phrases. In addition, the model is trained by jointly optimizing the image-text retrieval and phrase alignment tasks to capture the fine-grained interplay between vision and language. Our approach achieves state-of-the-art performance on the Flickr30K and MS-COCO datasets. On Flickr30K, it outperforms the previous state of the art by a relative 3.9% in text retrieval with an image query and a relative 1.3% in image retrieval with a text query (based on Recall@1). On MS-COCO, HVSN improves image retrieval by a relative 2.3% and text retrieval by a relative 1.2%. Both quantitative and visual ablation studies verify the effectiveness of the proposed modules.
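To make the pipeline described above concrete, the following is a minimal, hypothetical PyTorch sketch of the image side (relational attention over detected regions to form concepts, then a GRU over the concept sequence) together with a standard bidirectional ranking loss. It is not the authors' code: the internals of `LocalRelationalAttention`, the 2048-d region features (e.g., from a bottom-up detector), the 1024-d joint space, the number of concepts, and the sum-over-negatives hinge loss are all assumptions, and the phrase alignment term is only indicated in a comment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalRelationalAttention(nn.Module):
    """Aggregates detected object regions into a small set of visual
    semantic concepts via attention. The internals here are assumed,
    not the paper's actual LRA formulation."""

    def __init__(self, region_dim=2048, concept_dim=1024, n_concepts=8):
        super().__init__()
        # Learned concept queries attend over the projected region features.
        self.query = nn.Parameter(torch.randn(n_concepts, concept_dim))
        self.key = nn.Linear(region_dim, concept_dim)
        self.value = nn.Linear(region_dim, concept_dim)

    def forward(self, regions):                        # regions: (B, R, region_dim)
        k = self.key(regions)                          # (B, R, D)
        v = self.value(regions)                        # (B, R, D)
        att = torch.einsum('cd,brd->bcr', self.query, k) / k.size(-1) ** 0.5
        att = att.softmax(dim=-1)                      # each concept attends over regions
        return torch.einsum('bcr,brd->bcd', att, v)    # concepts: (B, n_concepts, D)


class HVSNImageEncoder(nn.Module):
    """Concept aggregation followed by a GRU over the concept sequence,
    mirroring the image side described in the abstract."""

    def __init__(self, concept_dim=1024):
        super().__init__()
        self.lra = LocalRelationalAttention(concept_dim=concept_dim)
        self.gru = nn.GRU(concept_dim, concept_dim, batch_first=True)

    def forward(self, regions):
        concepts = self.lra(regions)                   # (B, C, D)
        _, h = self.gru(concepts)                      # final hidden state summarizes concepts
        return F.normalize(h.squeeze(0), dim=-1)       # unit-norm global image embedding


def ranking_loss(im, txt, margin=0.2):
    """Bidirectional hinge loss over in-batch negatives (an assumed
    stand-in for the retrieval objective; the paper's phrase alignment
    loss would be added as a second, jointly optimized term)."""
    scores = im @ txt.t()                              # cosine similarity matrix (B, B)
    pos = scores.diag().unsqueeze(1)                   # matched pairs on the diagonal
    cost_t = (margin + scores - pos).clamp(min=0)      # text retrieval with image query
    cost_i = (margin + scores - pos.t()).clamp(min=0)  # image retrieval with text query
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    return cost_t.masked_fill(mask, 0).sum() + cost_i.masked_fill(mask, 0).sum()
```

A text encoder producing embeddings of the same dimension would supply `txt`; training would then minimize `ranking_loss(image_emb, text_emb)` plus the phrase alignment term over matched image-caption batches.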




Published In

ICMR '21: Proceedings of the 2021 International Conference on Multimedia Retrieval
August 2021, 715 pages
ISBN: 9781450384636
DOI: 10.1145/3460426

Publisher

Association for Computing Machinery, New York, NY, United States



Author Tags

  1. image-text matching
  2. multi-modal retrieval
  3. phrase alignment
  4. visual-semantic representation

Qualifiers

  • Research-article

Conference

ICMR '21

Acceptance Rates

Overall Acceptance Rate: 254 of 830 submissions, 31%

