Online Learning in Weakly Coupled Markov Decision Processes: A Convergence Time Study

Published: 03 April 2018

Abstract

We consider multiple parallel Markov decision processes (MDPs) coupled by global constraints, where the time-varying objective and constraint functions can only be observed after the decision is made. Special attention is given to how well the decision maker can perform in T slots, starting from any state, compared to the best feasible randomized stationary policy in hindsight. We develop a new distributed online algorithm where each MDP makes its own decision each slot after observing a multiplier computed from past information. While this scenario is significantly more challenging than the classical online learning context, the algorithm is shown to achieve tight O(√T) bounds on both regret and constraint violations simultaneously. To obtain these bounds, we combine several new ingredients, including ergodicity and mixing time bounds in weakly coupled MDPs, a new regret analysis for online constrained optimization, a drift analysis for queue processes, and a perturbation analysis based on Farkas' Lemma.
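
The abstract describes a coordination pattern in which each MDP acts independently each slot after observing a multiplier computed from past constraint observations, and the multiplier is then updated once the slot's objective and constraint values are revealed. The following is a minimal sketch of that general pattern only, written in the style of a virtual-queue (drift-plus-penalty) scheme; the names VirtualQueueCoordinator, decide, penalty_of, constraint_cost_of, and the parameter V are illustrative assumptions, not the paper's exact algorithm or notation.

# Sketch of multiplier-based coordination across parallel MDPs (assumed form).
import numpy as np

class VirtualQueueCoordinator:
    """Maintains one multiplier (virtual queue) per global constraint."""

    def __init__(self, num_constraints: int):
        self.Q = np.zeros(num_constraints)

    def multipliers(self) -> np.ndarray:
        # Broadcast to all MDPs before the slot's decisions are made.
        return self.Q.copy()

    def update(self, observed_constraints: np.ndarray) -> None:
        # Queue-like update using constraint values revealed after the
        # decisions; the projection keeps the multipliers nonnegative.
        self.Q = np.maximum(self.Q + observed_constraints, 0.0)

def decide(mdp_state, actions, penalty_of, constraint_cost_of, q, V=10.0):
    """Each MDP trades off its own penalty against the multiplier-weighted
    constraint cost (a drift-plus-penalty-style rule, assumed here for
    illustration; penalty_of and constraint_cost_of are hypothetical
    per-MDP cost functions)."""
    def score(a):
        return V * penalty_of(mdp_state, a) + q @ constraint_cost_of(mdp_state, a)
    return min(actions, key=score)

In a run of T slots, each slot would call multipliers(), let every MDP call decide() with the shared vector, and then call update() with the observed global constraint values; the choice of V trades off objective regret against constraint violation in such schemes.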


Cited By

  • (2024) Deep Reinforcement Learning for Weakly Coupled MDP’s with Continuous Actions. Analytical and Stochastic Modelling Techniques and Applications, 67-80. 10.1007/978-3-031-70753-7_5. Online publication date: 13-Sep-2024.
  • (2020) Upper confidence primal-dual reinforcement learning for CMDP with adversarial loss. Proceedings of the 34th International Conference on Neural Information Processing Systems, 15277-15287. 10.5555/3495724.3497005. Online publication date: 6-Dec-2020.
  • (2018) Online Learning in Weakly Coupled Markov Decision Processes. ACM SIGMETRICS Performance Evaluation Review 46(1), 56-58. 10.1145/3292040.3219640. Online publication date: 12-Jun-2018.
  • (2018) Online Learning in Weakly Coupled Markov Decision Processes. Abstracts of the 2018 ACM International Conference on Measurement and Modeling of Computer Systems, 56-58. 10.1145/3219617.3219640. Online publication date: 12-Jun-2018.


      Published In

      Proceedings of the ACM on Measurement and Analysis of Computing Systems  Volume 2, Issue 1
      March 2018
      603 pages
      EISSN:2476-1249
      DOI:10.1145/3203302

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 03 April 2018
      Published in POMACS Volume 2, Issue 1


      Author Tags

      1. markov decision processes
      2. online learning
      3. stochastic programming

      Qualifiers

      • Research-article
