DOI: 10.1145/3219819.3220122
Research article

Stabilizing Reinforcement Learning in Dynamic Environment with Application to Online Recommendation

Published: 19 July 2018

Abstract

Deep reinforcement learning has shown great potential for improving system performance autonomously by learning from interactions with the environment. However, traditional reinforcement learning approaches are designed to work in static environments, whereas the environments of many real-world problems are dynamic, and the performance of reinforcement learning approaches can degrade drastically in them. A direct cause of this degradation is the high-variance, biased estimation of the reward, due to the distribution shift in dynamic environments. In this paper, we propose two techniques to alleviate the unstable reward estimation problem in dynamic environments: the stratified sampling replay strategy and the approximate regretted reward, which address the problem from the sample aspect and the reward aspect, respectively. Integrating the two techniques with Double DQN, we propose the Robust DQN method. We apply Robust DQN to the tip recommendation system of the Taobao online retail trading platform. We first demonstrate the highly dynamic property of this recommendation application, and then carry out an online A/B test to examine Robust DQN. The results show that Robust DQN effectively stabilizes the value estimation and therefore improves performance in this real-world dynamic environment.
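The abstract only names the two stabilization techniques; the sketches below are rough, hypothetical readings of them, not the paper's actual implementation. The first assumes that stratified sampling replay means bucketing transitions by a stratum key (for example, the time segment in which they were collected) and drawing each minibatch evenly across strata, so the sample distribution used for value updates stays stable even as the incoming data distribution shifts. All identifiers (StratifiedReplayBuffer, capacity_per_stratum, stratum_key) are illustrative.

```python
import random
from collections import defaultdict, deque

class StratifiedReplayBuffer:
    """Hypothetical replay buffer that samples evenly across strata."""

    def __init__(self, capacity_per_stratum=10_000):
        # One bounded FIFO buffer per stratum; stale transitions age out.
        self.strata = defaultdict(lambda: deque(maxlen=capacity_per_stratum))

    def add(self, stratum_key, transition):
        # transition: (state, action, reward, next_state, done)
        self.strata[stratum_key].append(transition)

    def sample(self, batch_size):
        # Spread the batch evenly over non-empty strata, sampling
        # uniformly (with replacement) inside each stratum.
        keys = [k for k, buf in self.strata.items() if buf]
        if not keys:
            return []
        per_stratum = max(1, batch_size // len(keys))
        batch = []
        for k in keys:
            batch.extend(random.choices(self.strata[k], k=per_stratum))
        return batch[:batch_size]
```

The second sketch guesses at how an approximate regretted reward could enter the Double DQN target: here "regretted" is taken to mean subtracting a baseline reward estimated over the same period (for instance, the concurrent mean reward of a fixed reference strategy), so that environment-wide drift cancels out of the temporal-difference target. The callables q_online and q_target stand in for the online and target Q-networks and are likewise assumptions, not the paper's API.

```python
import numpy as np

def double_dqn_targets(batch, q_online, q_target, baseline_reward, gamma=0.99):
    # batch: list of (state, action, reward, next_state, done) tuples.
    states, actions, rewards, next_states, dones = map(np.asarray, zip(*batch))
    regretted = rewards - baseline_reward  # drift-corrected reward (assumed form)
    # Double DQN: select the next action with the online net ...
    next_actions = q_online(next_states).argmax(axis=1)
    # ... but evaluate it with the target net to reduce overestimation.
    next_q = q_target(next_states)[np.arange(len(batch)), next_actions]
    return regretted + gamma * (1.0 - dones.astype(float)) * next_q
```

In either case, these pieces would feed a standard Double DQN update; the exact formulations are given in the paper itself.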




Published In

KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
July 2018
2925 pages
ISBN: 9781450355520
DOI: 10.1145/3219819
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. approximate regretted reward
  2. dynamic environment
  3. recommendation
  4. reinforcement learning
  5. stratified sampling replay

Qualifiers

  • Research-article

Funding Sources

  • Jiangsu Science Foundation

Conference

KDD '18

Acceptance Rates

KDD '18 Paper Acceptance Rate: 107 of 983 submissions, 11%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%


