
Reinforcement learning for joint optimization of multiple rewards

Published: 06 March 2024

Abstract

    Finding optimal policies that maximize the long-term rewards of a Markov Decision Process requires dynamic programming and backward induction to solve the Bellman optimality equation. However, many real-world problems require optimizing an objective that is non-linear in the cumulative rewards, and dynamic programming cannot be applied to such objectives directly. For example, in a resource allocation problem, one of the objectives is to maximize long-term fairness among the users. We note that when an agent aims to optimize some function of the sum of rewards, the problem loses its Markov nature. This paper addresses and formalizes the problem of optimizing a non-linear function of the long-term average of rewards. We propose model-based and model-free algorithms to learn the policy, where the model-based algorithm is shown to achieve a regret of Õ(LKDS√A/T) for K objectives combined with a concave L-Lipschitz function. Further, using fairness in cellular base-station scheduling and in queueing-system scheduling as examples, the proposed algorithms are shown to significantly outperform conventional RL approaches.
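
    To make the objective concrete, below is a minimal Python sketch (not the paper's implementation; the proportional-fair log utility and the synthetic reward trajectory are illustrative assumptions) of a concave scalarization f applied to the long-term average of a K-dimensional reward vector. It also illustrates why such an objective is not expressible as a per-step scalar reward: f of the average reward differs from the average of the per-step f values.

        import numpy as np

        def proportional_fairness(avg_rewards, eps=1e-8):
            # Concave scalarization of the K per-objective average rewards:
            # the proportional-fair (log) utility. It is Lipschitz once the
            # averages are bounded away from zero, hence the small eps.
            return float(np.sum(np.log(np.asarray(avg_rewards) + eps)))

        def joint_objective(rewards, f=proportional_fairness):
            # rewards has shape (T, K): the reward of each of the K objectives
            # at each step. The quantity of interest has the form
            # f(long-term average reward vector); because f is non-linear,
            # f(mean of r_t) != mean of f(r_t), so the problem cannot be
            # reduced to a standard scalar-reward MDP and solved by dynamic
            # programming directly.
            avg_rewards = np.asarray(rewards).mean(axis=0)   # shape (K,)
            return f(avg_rewards)

        if __name__ == "__main__":
            rng = np.random.default_rng(0)
            T, K = 10_000, 3                                 # horizon, number of objectives
            rewards = rng.uniform(0.0, 1.0, size=(T, K))     # synthetic trajectory
            print("f(average rewards):", joint_objective(rewards))
            print("average of f(r_t): ", np.mean([proportional_fairness(r) for r in rewards]))

    Running the sketch shows the two quantities disagree, which is the gap that motivates treating the non-linear objective directly rather than as a standard cumulative-reward MDP.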



    Published In

    The Journal of Machine Learning Research, Volume 24, Issue 1
    January 2023
    18881 pages
    ISSN: 1532-4435
    EISSN: 1533-7928
    CC-BY 4.0

    Publisher

    JMLR.org

    Publication History

    Published: 06 March 2024
    Accepted: 01 April 2023
    Revised: 01 July 2022
    Received: 01 November 2019
    Published in JMLR Volume 24, Issue 1

