DOI: 10.1145/3219819.3220122
Research article

Stabilizing Reinforcement Learning in Dynamic Environment with Application to Online Recommendation

Published: 19 July 2018

Abstract

Deep reinforcement learning has shown great potential for improving system performance autonomously by learning from interactions with the environment. However, traditional reinforcement learning approaches are designed to work in static environments, whereas the environments of many real-world problems are dynamic, and the performance of reinforcement learning approaches can degrade drastically in them. A direct cause of this degradation is the high-variance, biased estimation of the reward, due to the distribution shift in dynamic environments. In this paper, we propose two techniques to alleviate the unstable reward estimation problem in dynamic environments: the stratified sampling replay strategy and the approximate regretted reward, which address the problem from the sample aspect and the reward aspect, respectively. Integrating the two techniques with Double DQN, we propose the Robust DQN method. We apply Robust DQN to the tip recommendation system of the Taobao online retail trading platform. We first demonstrate the highly dynamic property of this recommendation application, and then carry out an online A/B test to examine Robust DQN. The results show that Robust DQN effectively stabilizes the value estimation and therefore improves performance in this real-world dynamic environment.
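The abstract only names the two stabilization techniques; the sketches below are rough, hypothetical readings of them, not the paper's actual implementation. The first assumes that stratified sampling replay means bucketing transitions by a stratum key (for example, the time segment in which they were collected) and drawing each minibatch evenly across strata, so the sample distribution used for value updates stays stable even as the incoming data distribution shifts. All identifiers (StratifiedReplayBuffer, capacity_per_stratum, stratum_key) are illustrative.

```python
import random
from collections import defaultdict, deque

class StratifiedReplayBuffer:
    """Hypothetical replay buffer that samples evenly across strata."""

    def __init__(self, capacity_per_stratum=10_000):
        # One bounded FIFO buffer per stratum; stale transitions age out.
        self.strata = defaultdict(lambda: deque(maxlen=capacity_per_stratum))

    def add(self, stratum_key, transition):
        # transition: (state, action, reward, next_state, done)
        self.strata[stratum_key].append(transition)

    def sample(self, batch_size):
        # Spread the batch evenly over non-empty strata, sampling
        # uniformly (with replacement) inside each stratum.
        keys = [k for k, buf in self.strata.items() if buf]
        if not keys:
            return []
        per_stratum = max(1, batch_size // len(keys))
        batch = []
        for k in keys:
            batch.extend(random.choices(self.strata[k], k=per_stratum))
        return batch[:batch_size]
```

The second sketch guesses at how an approximate regretted reward could enter the Double DQN target: here "regretted" is taken to mean subtracting a baseline reward estimated over the same period (for instance, the concurrent mean reward of a fixed reference strategy), so that environment-wide drift cancels out of the temporal-difference target. The callables q_online and q_target stand in for the online and target Q-networks and are likewise assumptions, not the paper's API.

```python
import numpy as np

def double_dqn_targets(batch, q_online, q_target, baseline_reward, gamma=0.99):
    # batch: list of (state, action, reward, next_state, done) tuples.
    states, actions, rewards, next_states, dones = map(np.asarray, zip(*batch))
    regretted = rewards - baseline_reward  # drift-corrected reward (assumed form)
    # Double DQN: select the next action with the online net ...
    next_actions = q_online(next_states).argmax(axis=1)
    # ... but evaluate it with the target net to reduce overestimation.
    next_q = q_target(next_states)[np.arange(len(batch)), next_actions]
    return regretted + gamma * (1.0 - dones.astype(float)) * next_q
```

In either case, these pieces would feed a standard Double DQN update; the exact formulations are given in the paper itself.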




Published In

KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
July 2018
2925 pages
ISBN: 9781450355520
DOI: 10.1145/3219819
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. approximate regretted reward
  2. dynamic environment
  3. recommendation
  4. reinforcement learning
  5. stratified sampling replay

Qualifiers

  • Research-article

Funding Sources

  • Jiangsu Science Foundation

Conference

KDD '18

Acceptance Rates

KDD '18 Paper Acceptance Rate: 107 of 983 submissions, 11%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%


