research-article

Two-Stage Constrained Actor-Critic for Short Video Recommendation

Authors:

Kun GaiAuthors Info & Claims

WWW '23: Proceedings of the ACM Web Conference 2023

Pages 865 - 875

https://doi.org/10.1145/3543507.3583259

Published: 30 April 2023 Publication History

Abstract

The wide popularity of short videos on social media poses new opportunities and challenges to optimize recommender systems on the video-sharing platforms. Users sequentially interact with the system and provide complex and multi-faceted responses, including WatchTime and various types of interactions with multiple videos. On the one hand, the platforms aim at optimizing the users’ cumulative WatchTime (main goal) in the long term, which can be effectively optimized by Reinforcement Learning. On the other hand, the platforms also need to satisfy the constraint of accommodating the responses of multiple user interactions (auxiliary goals) such as Like, Follow, Share, etc. In this paper, we formulate the problem of short video recommendation as a Constrained Markov Decision Process (CMDP). We find that traditional constrained reinforcement learning algorithms fail to work well in this setting. We propose a novel two-stage constrained actor-critic method: At stage one, we learn individual policies to optimize each auxiliary signal. In stage two, we learn a policy to (i) optimize the main signal and (ii) stay close to policies learned in the first stage, which effectively guarantees the performance of this main policy on the auxiliaries. Through extensive offline evaluations, we demonstrate the effectiveness of our method over alternatives in both optimizing the main goal as well as balancing the others. We further show the advantage of our method in live experiments of short video recommendations, where it significantly outperforms other baselines in terms of both WatchTime and interactions. Our approach has been fully launched in the production system to optimize user experiences on the platform.

References

[1]

M Mehdi Afsar, Trafford Crump, and Behrouz Far. 2021. Reinforcement learning based recommender systems: A survey. arXiv preprint arXiv:2101.06286 (2021).

[2]

Md Hijbul Alam, Woo-Jong Ryu, and SangKeun Lee. 2016. Joint multi-grain topic sentiment: modeling semantic aspects for online reviews. Information Sciences 339 (2016), 206–223.

Digital Library

[3]

Haokun Chen, Xinyi Dai, Han Cai, Weinan Zhang, Xuejian Wang, Ruiming Tang, Yuzhou Zhang, and Yong Yu. 2019. Large-scale interactive recommendation with tree-structured policy gradient. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 3312–3320.

Digital Library

[4]

Minmin Chen, Alex Beutel, Paul Covington, Sagar Jain, Francois Belletti, and Ed H Chi. 2019. Top-k off-policy correction for a REINFORCE recommender system. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining. 456–464.

Digital Library

[5]

Shi-Yong Chen, Yang Yu, Qing Da, Jun Tan, Hai-Kuan Huang, and Hai-Hong Tang. 2018. Stabilizing reinforcement learning in dynamic environment with application to online recommendation. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1187–1196.

Digital Library

[6]

Xu Chen, Yali Du, Long Xia, and Jun Wang. 2021. Reinforcement Recommendation with User Multi-aspect Preference. In Proceedings of the Web Conference 2021. 425–435.

Digital Library

[7]

Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st workshop on deep learning for recommender systems. 7–10.

Digital Library

[8]

Yinlam Chow, Mohammad Ghavamzadeh, Lucas Janson, and Marco Pavone. 2017. Risk-constrained reinforcement learning with percentile risk criteria. The Journal of Machine Learning Research 18, 1 (2017), 6070–6120.

Digital Library

[9]

Gal Dalal, Krishnamurthy Dvijotham, Matej Vecerik, Todd Hester, Cosmin Paduraru, and Yuval Tassa. 2018. Safe exploration in continuous action spaces. arXiv preprint arXiv:1801.08757 (2018).

[10]

Gabriel Dulac-Arnold, Richard Evans, Hado van Hasselt, Peter Sunehag, Timothy Lillicrap, Jonathan Hunt, Timothy Mann, Theophane Weber, Thomas Degris, and Ben Coppin. 2015. Deep reinforcement learning in large discrete action spaces. arXiv preprint arXiv:1512.07679 (2015).

[11]

Chongming Gao, Wenqiang Lei, Jiawei Chen, Shiqi Wang, Xiangnan He, Shijun Li, Biao Li, Yuan Zhang, and Peng Jiang. 2022. CIRS: Bursting Filter Bubbles by Counterfactual Interactive Recommender System. arXiv preprint arXiv:2204.01266 (2022).

[12]

Chongming Gao, Shijun Li, Yuan Zhang, Jiawei Chen, Biao Li, Wenqiang Lei, Peng Jiang, and Xiangnan He. 2022. KuaiRand: An Unbiased Sequential Recommendation Dataset with Randomly Exposed Videos. In Proceedings of the 31st ACM International Conference on Information and Knowledge Management (Atlanta, GA, USA) (CIKM ’22). 5 pages. https://doi.org/10.1145/3511808.3557624

Digital Library

[13]

Javier Garcıa and Fernando Fernández. 2015. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research 16, 1 (2015), 1437–1480.

Digital Library

[14]

Yingqiang Ge, Shuchang Liu, Ruoyuan Gao, Yikun Xian, Yunqi Li, Xiangyu Zhao, Changhua Pei, Fei Sun, Junfeng Ge, Wenwu Ou, 2021. Towards Long-term Fairness in Recommendation. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining. 445–453.

Digital Library

[15]

Yingqiang Ge, Xiaoting Zhao, Lucia Yu, Saurabh Paul, Diane Hu, Chu-Cheng Hsieh, and Yongfeng Zhang. 2022. Toward Pareto Efficient Fairness-Utility Trade-off inRecommendation through Reinforcement Learning. arXiv preprint arXiv:2201.00140 (2022).

[16]

Xudong Gong, Qinlin Feng, Yuan Zhang, Jiangling Qin, Weijie Ding, Biao Li, and Peng Jiang. 2022. Real-time Short Video Recommendation on Mobile Devices. In Proceedings of the 31st ACM International Conference on Information and Knowledge Management (Atlanta, GA, USA) (CIKM ’22).

Digital Library

[17]

Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: a factorization-machine based neural network for CTR prediction. arXiv preprint arXiv:1703.04247 (2017).

[18]

Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015).

[19]

Xiao Lin, Hongjie Chen, Changhua Pei, Fei Sun, Xuanji Xiao, Hanxiao Sun, Yongfeng Zhang, Wenwu Ou, and Peng Jiang. 2019. A pareto-efficient algorithm for multiple objective optimization in e-commerce recommendation. In Proceedings of the 13th ACM Conference on recommender systems. 20–28.

Digital Library

[20]

Zihan Lin, Hui Wang, Jingshu Mao, Wayne Xin Zhao, Cheng Wang, Peng Jiang, and Ji-Rong Wen. 2022. Feature-aware Diversified Re-ranking with Disentangled Representations for Relevant Recommendation. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 3327–3335.

Digital Library

[21]

Dong Liu and Chenyang Yang. 2019. A deep reinforcement learning approach to proactive content pushing and recommendation for mobile users. IEEE Access 7 (2019), 83120–83136.

[22]

Tie-Yan Liu 2009. Learning to rank for information retrieval. Foundations and Trends® in Information Retrieval 3, 3 (2009), 225–331.

[23]

Yongshuai Liu, Avishai Halev, and Xin Liu. 2021. Policy learning with constraints in model-free reinforcement learning: A survey. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence.

[24]

Jiaqi Ma, Zhe Zhao, Xinyang Yi, Ji Yang, Minmin Chen, Jiaxi Tang, Lichan Hong, and Ed H Chi. 2020. Off-policy learning in two-stage recommender systems. In Proceedings of The Web Conference 2020. 463–473.

Digital Library

[25]

Hossam Mossalam, Yannis M Assael, Diederik M Roijers, and Shimon Whiteson. 2016. Multi-objective deep reinforcement learning. arXiv preprint arXiv:1610.02707 (2016).

[26]

Ashvin Nair, Murtaza Dalal, Abhishek Gupta, and Sergey Levine. 2020. AWAC: Accelerating Online Reinforcement Learning with Offline Datasets. (2020).

[27]

Shamim Nemati, Mohammad M Ghassemi, and Gari D Clifford. 2016. Optimal medication dosing from suboptimal clinical examples: A deep reinforcement learning approach. In 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). IEEE, 2978–2981.

[28]

Thanh Thi Nguyen, Ngoc Duy Nguyen, Peter Vamplew, Saeid Nahavandi, Richard Dazeley, and Chee Peng Lim. 2020. A multi-objective deep reinforcement learning framework. Engineering Applications of Artificial Intelligence 96 (2020), 103915.

[29]

Qi Pi, Guorui Zhou, Yujing Zhang, Zhe Wang, Lejian Ren, Ying Fan, Xiaoqiang Zhu, and Kun Gai. 2020. Search-based user interest modeling with lifelong sequential behavior data for click-through rate prediction. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management. 2685–2692.

Digital Library

[30]

Doina Precup. 2000. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series (2000), 80.

Digital Library

[31]

Doina Precup, Richard S Sutton, and Sanjoy Dasgupta. 2001. Off-policy temporal-difference learning with function approximation. In ICML. 417–424.

[32]

Ozan Sener and Vladlen Koltun. 2018. Multi-task learning as multi-objective optimization. arXiv preprint arXiv:1810.04650 (2018).

[33]

Dusan Stamenkovic, Alexandros Karatzoglou, Ioannis Arapakis, Xin Xin, and Kleomenis Katevas. 2021. Choosing the Best of Both Worlds: Diverse and Novel Recommendations through Multi-Objective Reinforcement Learning. arXiv preprint arXiv:2110.15097 (2021).

[34]

Richard S Sutton and Andrew G Barto. 2018. Reinforcement learning: An introduction. MIT press.

Digital Library

[35]

Chen Tessler, Daniel J Mankowitz, and Shie Mannor. 2018. Reward constrained policy optimization. arXiv preprint arXiv:1805.11074 (2018).

[36]

Jiayin Wang, Weizhi Ma, Jiayu Li, Hongyu Lu, Min Zhang, Biao Li, Yiqun Liu, Peng Jiang, and Shaoping Ma. 2022. Make Fairness More Fair: Fair Item Utility Estimation and Exposure Re-Distribution. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 1868–1877.

Digital Library

[37]

Yuyan Wang, Mohit Sharma, Can Xu, Sriraj Badam, Qian Sun, Lee Richardson, Lisa Chung, Ed H Chi, and Minmin Chen. 2022. Surrogate for Long-Term User Experience in Recommender Systems. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 4100–4109.

Digital Library

[38]

C Ch White, CC III WHITE, and KIM KW. 1980. Solution procedures for vector criterion Markov decision processes. (1980).

[39]

Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Reinforcement learning (1992), 5–32.

[40]

Yikun Xian, Zuohui Fu, Shan Muthukrishnan, Gerard De Melo, and Yongfeng Zhang. 2019. Reinforcement knowledge graph reasoning for explainable recommendation. In Proceedings of the 42nd international ACM SIGIR conference on research and development in information retrieval. 285–294.

Digital Library

[41]

Xin Xin, Alexandros Karatzoglou, Ioannis Arapakis, and Joemon M Jose. 2022. Supervised Advantage Actor-Critic for Recommender Systems. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining. 1186–1196.

Digital Library

[42]

Ruohan Zhan, Changhua Pei, Qiang Su, Jianfeng Wen, Xueliang Wang, Guanyu Mu, Dong Zheng, Peng Jiang, and Kun Gai. 2022. Deconfounding Duration Bias in Watch-time Prediction for Video Recommendation. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 4472–4481.

Digital Library

[43]

Xiangyu Zhao, Liang Zhang, Zhuoye Ding, Long Xia, Jiliang Tang, and Dawei Yin. 2018. Recommendations with negative feedback via pairwise deep reinforcement learning. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1040–1048.

Digital Library

[44]

Xiangyu Zhao, Liang Zhang, Long Xia, Zhuoye Ding, Dawei Yin, and Jiliang Tang. 2017. Deep reinforcement learning for list-wise recommendations. arXiv preprint arXiv:1801.00209 (2017).

[45]

Lixin Zou, Long Xia, Zhuoye Ding, Jiaxing Song, Weidong Liu, and Dawei Yin. 2019. Reinforcement learning to optimize long-term user engagement in recommender systems. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2810–2818.

Digital Library

Cited By

Zhou THairi FYang HLiu JTong TYang FMomma MGao YSalakhutdinov RKolter ZHeller KWeller AOliver NScarlett JBerkenkamp F(2024)Finite-time convergence and sample complexity of actor-critic multi-objective reinforcement learningProceedings of the 41st International Conference on Machine Learning10.5555/3692070.3694632(61913-61933)Online publication date: 21-Jul-2024
https://dl.acm.org/doi/10.5555/3692070.3694632
Zhang GWang YChen XQian HZhan KWang BWooldridge MDy JNatarajan S(2024)UNEX-RLProceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Educational Advances in Artificial Intelligence10.1609/aaai.v38i8.28783(9305-9313)Online publication date: 20-Feb-2024
https://dl.acm.org/doi/10.1609/aaai.v38i8.28783
Su SChen XWang YWu YZhang ZZhan KWang BGai K(2024)RPAF: A Reinforcement Prediction-Allocation Framework for Cache Allocation in Large-Scale Recommender SystemsProceedings of the 18th ACM Conference on Recommender Systems10.1145/3640457.3688128(670-679)Online publication date: 8-Oct-2024
https://dl.acm.org/doi/10.1145/3640457.3688128
Show More Cited By

Index Terms

Two-Stage Constrained Actor-Critic for Short Video Recommendation
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Reinforcement learning
2. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Recommender systems

Recommendations

Actor-Critic Algorithms for Constrained Multi-agent Reinforcement Learning
AAMAS '19: Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems

Multi-agent reinforcement learning has gained lot of popularity primarily owing to the success of deep function approximation architectures. However, many real-life multi-agent applications often impose constraints on the joint action sequence that can ...
Off-Policy Actor-critic for Recommender Systems
RecSys '22: Proceedings of the 16th ACM Conference on Recommender Systems

Industrial recommendation platforms are increasingly concerned with how to make recommendations that cause users to enjoy their long term experience on the platform. Reinforcement learning emerged naturally as an appealing approach for its promise in 1) ...
Offline-boosted actor-critic: adaptively blending optimal historical behaviors in deep off-policy RL
ICML'24: Proceedings of the 41st International Conference on Machine Learning

Off-policy reinforcement learning (RL) has achieved notable success in tackling many complex real-world tasks, by leveraging previously collected data for policy learning. However, most existing off-policy RL algorithms fail to maximally exploit the ...

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences

WWW '23: Proceedings of the ACM Web Conference 2023

April 2023

4293 pages

ISBN:9781450394161

DOI:10.1145/3543507

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 30 April 2023

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article
Research
Refereed limited

Conference

WWW '23

Sponsor:

SIGWEB

WWW '23: The ACM Web Conference 2023

April 30 - May 4, 2023

TX, Austin, USA

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

20
Total Citations
View Citations
516
Total Downloads

Downloads (Last 12 months)221
Downloads (Last 6 weeks)20

Reflects downloads up to 22 Jan 2025

Other Metrics

View Author Metrics

Citations

Cited By

Zhou THairi FYang HLiu JTong TYang FMomma MGao YSalakhutdinov RKolter ZHeller KWeller AOliver NScarlett JBerkenkamp F(2024)Finite-time convergence and sample complexity of actor-critic multi-objective reinforcement learningProceedings of the 41st International Conference on Machine Learning10.5555/3692070.3694632(61913-61933)Online publication date: 21-Jul-2024
https://dl.acm.org/doi/10.5555/3692070.3694632
Zhang GWang YChen XQian HZhan KWang BWooldridge MDy JNatarajan S(2024)UNEX-RLProceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Educational Advances in Artificial Intelligence10.1609/aaai.v38i8.28783(9305-9313)Online publication date: 20-Feb-2024
https://dl.acm.org/doi/10.1609/aaai.v38i8.28783
Su SChen XWang YWu YZhang ZZhan KWang BGai K(2024)RPAF: A Reinforcement Prediction-Allocation Framework for Cache Allocation in Large-Scale Recommender SystemsProceedings of the 18th ACM Conference on Recommender Systems10.1145/3640457.3688128(670-679)Online publication date: 8-Oct-2024
https://dl.acm.org/doi/10.1145/3640457.3688128
Zhao HCai GZhu JDong ZXu JWen JBaeza-Yates RBonchi F(2024)Counteracting Duration Bias in Video Recommendation via Counterfactual Watch TimeProceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining10.1145/3637528.3671817(4455-4466)Online publication date: 25-Aug-2024
https://dl.acm.org/doi/10.1145/3637528.3671817
Bai YZhang YFeng FLu JZang XLei CSong YBaeza-Yates RBonchi F(2024)GradCraft: Elevating Multi-task Recommendations through Holistic Gradient CraftingProceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining10.1145/3637528.3671585(4774-4783)Online publication date: 25-Aug-2024
https://dl.acm.org/doi/10.1145/3637528.3671585
Liu ZLiu SYang BXue ZCai QZhao XZhang ZHu LLi HJiang PBaeza-Yates RBonchi F(2024)Modeling User Retention through Generative Flow NetworksProceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining10.1145/3637528.3671531(5497-5508)Online publication date: 25-Aug-2024
https://dl.acm.org/doi/10.1145/3637528.3671531
Wang XLiu SWang XCai QHu LLi HJiang PGai KXie GBaeza-Yates RBonchi F(2024)Future Impact Decomposition in Request-level RecommendationsProceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining10.1145/3637528.3671506(5905-5916)Online publication date: 25-Aug-2024
https://dl.acm.org/doi/10.1145/3637528.3671506
Cai QZhao XPan LXin XHuang JZhang WZhao LYin DYang GHui Yang GWang HHan SHauff CZuccon GZhang Y(2024)AgentIR: 1st Workshop on Agent-based Information RetrievalProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval10.1145/3626772.3657989(3025-3028)Online publication date: 10-Jul-2024
https://dl.acm.org/doi/10.1145/3626772.3657989
Yu YGao CChen JTang HSun YChen QMa WZhang MHui Yang GWang HHan SHauff CZuccon GZhang Y(2024)EasyRL4Rec: An Easy-to-use Library for Reinforcement Learning Based Recommender SystemsProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval10.1145/3626772.3657868(977-987)Online publication date: 10-Jul-2024
https://dl.acm.org/doi/10.1145/3626772.3657868
Liu ZLiu SZhang ZCai QZhao XZhao KHu LJiang PGai KHui Yang GWang HHan SHauff CZuccon GZhang Y(2024)Sequential Recommendation for Optimizing Both Immediate Feedback and Long-term RetentionProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval10.1145/3626772.3657829(1872-1882)Online publication date: 10-Jul-2024
https://dl.acm.org/doi/10.1145/3626772.3657829
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

HTML Format

View this article in HTML Format.

Media

Figures

Other

Tables

View Table of Contents