research-article

Multi-Modal Self-Supervised Learning for Recommendation

Authors:

Chuxu Zhang

WWW '23: Proceedings of the ACM Web Conference 2023

Pages 790 - 800

https://doi.org/10.1145/3543507.3583206

Published: 30 April 2023

Abstract

The online emergence of multi-modal sharing platforms (e.g., TikTok, YouTube) is driving personalized recommender systems to incorporate various modalities (e.g., visual, textual, and acoustic) into latent user representations. While existing work on multi-modal recommendation exploits multimedia content features to enhance item embeddings, its representation capability is limited by heavy label reliance and weak robustness on sparse user behavior data. Inspired by recent progress in self-supervised learning for alleviating the label scarcity issue, we explore deriving self-supervision signals by effectively learning modality-aware user preferences and cross-modal dependencies. To this end, we propose a new Multi-Modal Self-Supervised Learning (MMSSL) method that tackles two key challenges. Specifically, to characterize the inter-dependency between the user-item collaborative view and the item multi-modal semantic view, we design a modality-aware interactive structure learning paradigm via adversarial perturbations for data augmentation. In addition, to capture how a user's modality-aware interaction patterns interweave with each other, a cross-modal contrastive learning approach is introduced to jointly preserve inter-modal semantic commonality and user preference diversity. Experiments on real-world datasets verify the superiority of our method over various state-of-the-art baselines for multimedia recommendation. The implementation is released at: https://github.com/HKUDS/MMSSL.
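
To make the idea of cross-modal contrastive learning concrete, the following is a minimal sketch of an InfoNCE-style loss between two modality-specific views of the same users. It is not the authors' implementation (that is available in the repository linked above); InfoNCE is used here only as a common form of contrastive objective, and the function names, the temperature value, and the toy embeddings are all assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def info_nce(view_a, view_b, temperature=0.5):
    """Mean InfoNCE loss between two aligned lists of user embeddings.

    view_a[i] and view_b[i] describe the same user in two modalities
    (the positive pair); every view_b[j] with j != i acts as a negative.
    The temperature is a hypothetical default, not a value from the paper.
    """
    a = [l2_normalize(v) for v in view_a]
    b = [l2_normalize(v) for v in view_b]
    total = 0.0
    for i, anchor in enumerate(a):
        logits = [dot(anchor, other) / temperature for other in b]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denominator - logits[i]
    return total / len(a)


# Toy embeddings for three users (hypothetical values).
visual = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
textual_aligned = [[0.9, 0.1], [0.1, 0.9], [-0.9, -0.1]]
# Same textual vectors, but assigned to the wrong users.
textual_shuffled = [textual_aligned[1], textual_aligned[2], textual_aligned[0]]

loss_aligned = info_nce(visual, textual_aligned)
loss_shuffled = info_nce(visual, textual_shuffled)
print(loss_aligned, loss_shuffled)
```

In this sketch the loss is lower when each user's visual and textual embeddings agree than when the textual embeddings are assigned to the wrong users, which is the behavior a contrastive objective is meant to reward.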

Index Terms

Multi-Modal Self-Supervised Learning for Recommendation
1. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Recommender systems

Information

Published In

WWW '23: Proceedings of the ACM Web Conference 2023

April 2023

4293 pages

ISBN:9781450394161

DOI:10.1145/3543507

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

WWW '23

Sponsor:

SIGWEB

WWW '23: The ACM Web Conference 2023

April 30 - May 4, 2023

Austin, TX, USA

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
