SPIRIT: Style-guided Patch Interaction for Fashion Image Retrieval with Text Feedback

Published: 08 March 2024

Abstract

    Fashion image retrieval with text feedback aims to find the target image according to the reference image and the modification from the user. This is a challenging task, as it requires not only the synergistic understanding of both visual and textual modalities but also the ability to model a wide variety of styles that fashion images contain. Hence, the crucial aspect of addressing this problem lies in exploiting the abundant semantic information inherent in fashion images and correlating it with the textual description of style. Recognizing that style is generally situated at the local level, we explicitly define style as the commonalities and differences between local areas of fashion images. Building upon this, we propose a Style-guided Patch InteRaction approach for fashion Image retrieval with Text feedback (SPIRIT), which focuses on the decisive influence of local details of fashion images on their style. Three corresponding networks are designed pertinently. The Patch-level Style Commonality network is introduced to fully leverage the semantic information among patches and compute their average as the style commonality. Subsequently, the Patch-level Style Difference network employs a graph reasoning network to model the patch-level difference and filter out insignificant patches. By considering the above two networks, mutual information about style is obtained from the interaction between patches. Finally, the Visual Textual Fusion network is utilized to integrate visual features with rich semantic information and textual features. Experimental results on four benchmark datasets demonstrate that our proposed SPIRIT achieves state-of-the-art performance. Source code is available at https://github.com/PKU-ICST-MIPL/SPIRIT_TOMM2024.
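The three components described above can be illustrated with a toy sketch: average patch features for the style commonality, deviations from that average (with weak patches filtered) for the style difference, and a simple fusion with the text feature. Everything here — shapes, variable names, and the median-threshold filter standing in for the paper's graph reasoning network — is an illustrative assumption, not the authors' implementation.

```python
import numpy as np

# Toy sketch of the three SPIRIT components from the abstract:
# patch-level style commonality, patch-level style difference,
# and visual-textual fusion. All names, shapes, and the filtering
# heuristic are illustrative stand-ins, not the authors' code.
rng = np.random.default_rng(0)
num_patches, dim = 16, 8
patches = rng.normal(size=(num_patches, dim))  # patch-level visual features
text = rng.normal(size=(dim,))                 # feature of the text modification

# Style commonality: the average over all patch features.
commonality = patches.mean(axis=0)

# Style difference: each patch's deviation from the commonality;
# low-magnitude (insignificant) patches are filtered out — a crude
# stand-in for the paper's graph reasoning network.
deviation = patches - commonality
magnitude = np.linalg.norm(deviation, axis=1)
significant = deviation[magnitude >= np.median(magnitude)]
difference = significant.mean(axis=0)

# Visual-textual fusion: combine both style cues with the text feature
# into one normalized query embedding used to rank target images.
query = np.concatenate([commonality, difference, text])
query /= np.linalg.norm(query)
print(query.shape)  # (24,)
```

In the actual model these steps are learned networks over ViT patch embeddings and CLIP-style text features; the sketch only shows how commonality, difference, and fusion compose into a single retrieval query.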


Cited By

• (2024) Simple but Effective Raw-Data Level Multimodal Fusion for Composed Image Retrieval. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. 229–239. DOI: 10.1145/3626772.3657727. Online publication date: 11 July 2024.

Information

Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 6
June 2024, 715 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3613638
Editor: Abdulmotaleb El Saddik

Publisher

Association for Computing Machinery, New York, NY, United States

      Publication History

      Published: 08 March 2024
      Online AM: 15 January 2024
      Accepted: 28 December 2023
      Revised: 18 November 2023
      Received: 26 August 2023
      Published in TOMM Volume 20, Issue 6


      Author Tags

      1. Fashion image retrieval with text feedback
      2. style modeling
      3. multimodal fusion

      Qualifiers

      • Research-article

      Funding Sources

      • National Natural Science Foundation of China

Article Metrics

• Downloads (last 12 months): 281
• Downloads (last 6 weeks): 48

Reflects downloads up to 27 Jul 2024
