research-article

Hierarchical Composition Learning for Composed Query Image Retrieval

Authors:

Yang Yang

MMAsia '21: Proceedings of the 3rd ACM International Conference on Multimedia in Asia

Article No.: 21, Pages 1 - 7

https://doi.org/10.1145/3469877.3490601

Published: 10 January 2022

Abstract

Composed query image retrieval is a growing research topic. The objective is to retrieve images that not only broadly resemble the reference image but also differ from it as the modification text specifies. Existing methods mainly compose the modification text with either the global feature or the local entity descriptors of the reference image. However, they overlook the fact that modification text is diverse and arbitrary: it relates not only to abstract global features or concrete local entity transformations, but often also to fine-grained, structured visual adjustments. Emphasizing only the global or local entity visual features is therefore insufficient for query composition. In this work, we tackle the task with hierarchical composition learning. Specifically, the proposed method first encodes images into three representations: global, entity, and structure level. The structure-level representation is highly interpretable, as it explicitly describes the entities in the image, together with their attributes and relationships, as a directed graph. Based on these representations, we perform hierarchical composition learning by fusing the modification text and the reference image in a global-entity-structure manner. This transforms the visual features, conditioned on the modification text, toward the target image in a coarse-to-fine manner and exploits the complementary information among the three levels. Moreover, we introduce hybrid space matching to explore global, entity, and structure alignments, which yields high performance and good interpretability.
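The hybrid space matching mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the similarity function, the fusion operator, or any weighting, so everything below is an assumption made for illustration: each image and each composed query is taken to be already encoded as one vector per level, cosine similarity is used within each level, and the per-level scores are combined by a weighted sum. The names `cosine`, `hybrid_score`, `rank_gallery`, and `LEVELS` are hypothetical and are not taken from the paper.

```python
import math
from typing import Dict, List, Optional, Sequence

Vector = Sequence[float]
# Assumed level names, mirroring the three representations in the abstract.
LEVELS = ("global", "entity", "structure")


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def hybrid_score(
    query: Dict[str, Vector],
    target: Dict[str, Vector],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted sum of per-level similarities (assumed combination rule)."""
    if weights is None:
        weights = {level: 1.0 for level in LEVELS}
    return sum(weights[level] * cosine(query[level], target[level]) for level in LEVELS)


def rank_gallery(query: Dict[str, Vector], gallery: List[Dict[str, Vector]]) -> List[int]:
    """Return gallery indices sorted from best to worst hybrid score."""
    scores = [hybrid_score(query, item) for item in gallery]
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    # Toy composed-query embedding and a two-image gallery.
    query = {"global": [1.0, 0.0], "entity": [0.0, 1.0], "structure": [1.0, 1.0]}
    mismatch = {"global": [0.0, 1.0], "entity": [1.0, 0.0], "structure": [1.0, -1.0]}
    match = {"global": [1.0, 0.0], "entity": [0.0, 1.0], "structure": [1.0, 1.0]}
    print(rank_gallery(query, [mismatch, match]))
```

Because the per-level scores stay separate until the final sum, one can inspect which level contributed most to a match, which is one plausible reading of the interpretability the abstract claims.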

Cited By

Xu Y, Bin Y, Wei J, Yang Y, Wang G, Shen H (2023). Multi-Modal Transformer With Global-Local Alignment for Composed Query Image Retrieval. IEEE Transactions on Multimedia 25, 8346-8357. DOI: 10.1109/TMM.2023.3235495. Online publication date: 1-Jan-2023.
https://dl.acm.org/doi/10.1109/TMM.2023.3235495

Vaze S, Carion N, Misra I (2023). GeneCIS: A Benchmark for General Conditional Image Similarity. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 6862-6872. DOI: 10.1109/CVPR52729.2023.00663. Online publication date: Jun-2023.
https://doi.org/10.1109/CVPR52729.2023.00663

Wan Y, Zou G, Yan C, Zhang B (2022). Dual attention composition network for fashion image retrieval with attribute manipulation. Neural Computing and Applications 35:8, 5889-5902. DOI: 10.1007/s00521-022-07994-9. Online publication date: 9-Nov-2022.
https://dl.acm.org/doi/10.1007/s00521-022-07994-9

Index Terms

Hierarchical Composition Learning for Composed Query Image Retrieval
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
2. Information systems
  1. Information retrieval
    1. Retrieval models and ranking

Index terms have been assigned to the content through auto-classification.


Information & Contributors

Information

Published In

MMAsia '21: Proceedings of the 3rd ACM International Conference on Multimedia in Asia

December 2021

508 pages

ISBN:9781450386074

DOI:10.1145/3469877

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

MMAsia '21

Sponsor:

SIGMM

MMAsia '21: ACM Multimedia Asia

December 1 - 3, 2021

Gold Coast, Australia

Acceptance Rates

Overall Acceptance Rate 59 of 204 submissions, 29%
