DOI: 10.1145/3469877.3490601
research-article

Hierarchical Composition Learning for Composed Query Image Retrieval

Published: 10 January 2022

Abstract

Composed query image retrieval is a growing research topic. The objective is to retrieve images that not only generally resemble the reference image but also differ from it as specified by the modification text. Existing methods mainly compose the modification text with either the global feature or the local entity descriptors of the reference image. However, they overlook the fact that modification text is diverse and arbitrary: it relates not only to abstract global features or concrete local entity transformations, but often also to fine-grained, structured visual adjustments. It is therefore insufficient to rely on global or local entity visual features alone for query composition. In this work, we tackle the task with hierarchical composition learning. Specifically, the proposed method first encodes each image into three representations at the global, entity, and structure levels. The structure-level representation is highly interpretable: it explicitly describes the entities in the image, together with their attributes and relationships, as a directed graph. Building on these representations, we perform hierarchical composition learning by fusing the modification text with the reference image in a global-entity-structure manner. This transforms the visual features, conditioned on the modification text, toward the target image in a coarse-to-fine manner and exploits the complementary information among the three levels. Moreover, we introduce hybrid space matching to explore global, entity, and structure alignments, which achieves high performance and good interpretability.
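
The coarse-to-fine composition and hybrid space matching described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the abstract does not specify the fusion operator, the feature dimensions, or how the three similarity scores are combined, so `ImageRep`, `compose`, `compose_query`, `hybrid_score`, and `rank` are hypothetical names, the fusion is a simple gated blend, and the matching is a weighted sum of cosine similarities. It is not the authors' implementation.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence

Vector = List[float]


@dataclass
class ImageRep:
    """Three-level image representation (hypothetical container)."""
    global_feat: Vector     # whole-image feature
    entity_feat: Vector     # pooled local entity feature
    structure_feat: Vector  # pooled scene-graph (directed graph) feature


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def compose(feat: Vector, cond: Vector, gate: float = 0.5) -> Vector:
    """Assumed fusion: blend a visual feature with a conditioning vector."""
    return [(1.0 - gate) * f + gate * c for f, c in zip(feat, cond)]


def compose_query(ref: ImageRep, text: Vector) -> ImageRep:
    """Coarse-to-fine: each finer level is conditioned on the coarser result."""
    g = compose(ref.global_feat, text)    # global level sees the text directly
    e = compose(ref.entity_feat, g)       # entity level sees the composed global
    s = compose(ref.structure_feat, e)    # structure level sees the composed entity
    return ImageRep(g, e, s)


def hybrid_score(query: ImageRep, cand: ImageRep,
                 weights: Sequence[float] = (1 / 3, 1 / 3, 1 / 3)) -> float:
    """Assumed hybrid matching: weighted sum of per-level cosine similarities."""
    sims = (
        cosine(query.global_feat, cand.global_feat),
        cosine(query.entity_feat, cand.entity_feat),
        cosine(query.structure_feat, cand.structure_feat),
    )
    return sum(w * s for w, s in zip(weights, sims))


def rank(query: ImageRep, candidates: Sequence[ImageRep]) -> List[int]:
    """Return candidate indices sorted from best to worst match."""
    scores = [hybrid_score(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


# Toy usage with 2-D features: the text pushes the query toward the second axis.
reference = ImageRep([1.0, 0.0], [1.0, 0.0], [1.0, 0.0])
text_vec = [0.0, 1.0]
query = compose_query(reference, text_vec)
candidates = [
    ImageRep([1.0, 0.0], [1.0, 0.0], [1.0, 0.0]),  # identical to the reference
    ImageRep([1.0, 1.0], [1.0, 1.0], [1.0, 1.0]),  # reference plus the modification
    ImageRep([0.0, 1.0], [0.0, 1.0], [0.0, 1.0]),  # modification only
]
ranking = rank(query, candidates)
```

In this toy setting the candidate that keeps the reference content while also reflecting the modification ranks first, which mirrors the retrieval objective stated above. Because each level contributes its own similarity term, the per-level scores can be inspected separately, which is one plausible reading of the interpretability claim.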


Cited By

  • (2023) Multi-Modal Transformer With Global-Local Alignment for Composed Query Image Retrieval. IEEE Transactions on Multimedia 25, 8346-8357. DOI: 10.1109/TMM.2023.3235495. Online publication date: 1-Jan-2023
  • (2023) GeneCIS: A Benchmark for General Conditional Image Similarity. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 6862-6872. DOI: 10.1109/CVPR52729.2023.00663. Online publication date: Jun-2023
  • (2022) Dual attention composition network for fashion image retrieval with attribute manipulation. Neural Computing and Applications 35:8, 5889-5902. DOI: 10.1007/s00521-022-07994-9. Online publication date: 9-Nov-2022


      Published In

      MMAsia '21: Proceedings of the 3rd ACM International Conference on Multimedia in Asia
      December 2021
      508 pages
      ISBN:9781450386074
      DOI:10.1145/3469877

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. Composed query image retrieval
      2. Hierarchical composition learning
      3. Large-scale dataset

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

      MMAsia '21
      MMAsia '21: ACM Multimedia Asia
      December 1 - 3, 2021
      Gold Coast, Australia

      Acceptance Rates

      Overall Acceptance Rate 59 of 204 submissions, 29%


      Article Metrics

      • Downloads (last 12 months): 20
      • Downloads (last 6 weeks): 1
      Reflects downloads up to 21 Sep 2024.
