Real20M: A Large-scale E-commerce Dataset for Cross-domain Retrieval

Published: 27 October 2023

Abstract

In e-commerce, products and micro-videos serve as two primary carriers. Introducing cross-domain retrieval between these carriers can establish associations, thereby leading to the advancement of specific scenarios, such as retrieving products based on micro-videos or recommending relevant videos based on products. However, existing datasets only focus on retrieval within the product domain while neglecting the micro-video domain and often ignore the multi-modal characteristics of the product domain. Additionally, these datasets strictly limit their data scale through content alignment and use a content-based data organization format that hinders the inclusion of user retrieval intentions. To address these limitations, we propose the PKU Real20M dataset, a large-scale e-commerce dataset designed for cross-domain retrieval. We adopt a query-driven approach to efficiently gather over 20 million e-commerce products and micro-videos, including multimodal information. Additionally, we design a three-level entity prompt learning framework to align inter-modality information from coarse to fine. Moreover, we introduce the Query-driven Cross-Domain retrieval framework (QCD), which leverages user queries to facilitate efficient alignment between the product and micro-video domains. Extensive experiments on two downstream tasks validate the effectiveness of our proposed approaches. The dataset and source code are available at https://github.com/PKU-ICST-MIPL/Real20M_ACMMM2023.
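The query-driven organization described in the abstract can be illustrated with a small sketch. This is not the authors' QCD framework or the dataset's actual schema. It only assumes that each collected item is stored with the user query it was gathered under, and shows how items from the two domains that share a query could be grouped into candidate product/micro-video pairs. The record layout `(query, domain, item_id)`, the domain labels, and the function name `cross_domain_pairs` are hypothetical.

```python
from collections import defaultdict
from itertools import product as cartesian


def cross_domain_pairs(records):
    """Pair products with micro-videos gathered under the same query.

    `records` is an iterable of (query, domain, item_id) tuples, where
    domain is "product" or "video". This layout is an assumption made
    for illustration, not the Real20M file format.
    """
    # Group item ids by query, keeping the two domains separate.
    by_query = defaultdict(lambda: {"product": [], "video": []})
    for query, domain, item_id in records:
        by_query[query][domain].append(item_id)

    # Every product/video combination under one query is a candidate pair.
    pairs = []
    for query, items in by_query.items():
        for p, v in cartesian(items["product"], items["video"]):
            pairs.append((query, p, v))
    return pairs


records = [
    ("red dress", "product", "p1"),
    ("red dress", "video", "v1"),
    ("red dress", "video", "v2"),
    ("running shoes", "product", "p2"),  # no video under this query
]
pairs = cross_domain_pairs(records)
print(pairs)
```

In this toy input, only the query "red dress" has items in both domains, so it is the only query that yields pairs. Such pairs are weak, query-level associations rather than verified content matches.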

Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. cross-domain retrieval
  2. e-commerce datasets
  3. large-scale data collection

Qualifiers

  • Research-article

Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa ON, Canada

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Article Metrics

  • Downloads (Last 12 months): 109
  • Downloads (Last 6 weeks): 11
Reflects downloads up to 06 Feb 2025

Cited By

  • (2024) MFE-transformer: Adaptive English text named entity recognition method based on multi-feature extraction and transformer. Computer Science and Information Systems 21:4 (1865-1885). DOI: 10.2298/CSIS240418061G. Online publication date: 2024
  • (2024) Multization: Multi-Modal Summarization Enhanced by Multi-Contextually Relevant and Irrelevant Attention Alignment. ACM Transactions on Asian and Low-Resource Language Information Processing 23:5 (1-29). DOI: 10.1145/3651983. Online publication date: 10-May-2024
  • (2024) SPIRIT: Style-guided Patch Interaction for Fashion Image Retrieval with Text Feedback. ACM Transactions on Multimedia Computing, Communications, and Applications 20:6 (1-17). DOI: 10.1145/3640345. Online publication date: 8-Mar-2024
  • (2024) MuJo-SF: Multimodal Joint Slot Filling for Attribute Value Prediction of E-Commerce Commodities. IEEE Transactions on Multimedia 26 (10354-10366). DOI: 10.1109/TMM.2024.3407667. Online publication date: 1-Jan-2024
