Abstract
We study a novel setting in Online Markov Decision Processes (OMDPs) where the loss function is chosen by a non-oblivious strategic adversary who follows a no-external-regret algorithm. In this setting, we first demonstrate that MDP-Expert, an existing algorithm that works well against oblivious adversaries, still applies and achieves a policy regret bound of \({\mathcal {O}}(\sqrt{T \log (L)}+\tau ^2\sqrt{ T \log (\vert A \vert )})\), where L is the size of the adversary's pure strategy set and \(\vert A \vert\) denotes the size of the agent's action space. Considering real-world games where the support size of a NE is small, we further propose a new algorithm, MDP-Online Oracle Expert (MDP-OOE), that achieves a policy regret bound of \({\mathcal {O}}(\sqrt{T\log (L)}+\tau ^2\sqrt{ T k \log (k)})\), where k depends only on the support size of the NE. MDP-OOE leverages the key benefit of Double Oracle in game theory and can thus solve games with prohibitively large action spaces. Finally, to better understand the learning dynamics of no-regret methods, under the same setting of a no-external-regret adversary in OMDPs, we introduce an algorithm that achieves last-round convergence to a NE. To the best of our knowledge, this is the first work to obtain a last-iterate convergence result in OMDPs.
Notes
In the multi-armed bandit setting, it is also impossible to achieve sublinear policy regret against all adaptive adversaries (see Theorem 1 in [24]).
For the completeness of the paper, we provide the lemma in Appendix A.
If the adversary does not follow the optimal bound (i.e., is irrational), then the regret bound of the agent will change accordingly.
W.l.o.g., we consider the payoff (i.e., the negative of the loss) for the agent in our experiments, so that the agent aims to maximize the payoff.
References
Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. Massachusetts: MIT press.
Laurent, G. J., Matignon, L., Fort-Piat, L., et al. (2011). The world of independent learners is not Markovian. International Journal of Knowledge-based and Intelligent Engineering Systems, 15(1), 55–64.
Even-Dar, E., Kakade, S. M., & Mansour, Y. (2009). Online Markov decision processes. Mathematics of Operations Research, 34(3), 726–736.
Dick, T., Gyorgy, A., & Szepesvari, C. (2014). Online learning in Markov decision processes with changing cost sequences. In ICML (pp. 512–520).
Neu, G., Antos, A., György, A., & Szepesvári, C. (2010). Online Markov decision processes under bandit feedback. In NeurIPS (pp. 1804–1812).
Neu, G., & Olkhovskaya, J. (2020). Online learning in MDPs with linear function approximation and bandit feedback. arXiv e-prints, 2007.
Yang, Y., & Wang, J. (2020). An overview of multi-agent reinforcement learning from game theoretical perspective. arXiv preprint arXiv:2011.00583
Freund, Y., & Schapire, R. E. (1999). Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1–2), 79–103.
Shalev-Shwartz, S., et al. (2011). Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2), 107–194.
Mertikopoulos, P., Papadimitriou, C., & Piliouras, G. (2018). Cycles in adversarial regularized learning. In Proceedings of the twenty-ninth annual ACM-SIAM symposium on discrete algorithms (pp. 2703–2717). SIAM.
Bailey, J. P., & Piliouras, G. (2018). Multiplicative weights update in zero-sum games. In Proceedings of the 2018 ACM conference on economics and computation (pp. 321–338).
Dinh, L. C., Nguyen, T.-D., Zemhoho, A. B., & Tran-Thanh, L. (2021). Last round convergence and no-dynamic regret in asymmetric repeated games. In Algorithmic learning theory (pp. 553–577) PMLR.
Mertikopoulos, P., Lecouat, B., Zenati, H., Foo, C.-S., Chandrasekhar, V., & Piliouras, G. (2019). Optimistic mirror descent in saddle-point problems: Going the extra (gradient) mile. In ICLR 2019-7th international conference on learning representations (pp. 1–23).
Leslie, D. S., Perkins, S., & Xu, Z. (2020). Best-response dynamics in zero-sum stochastic games. Journal of Economic Theory, 189, 105095.
Guan, P., Raginsky, M., Willett, R., & Zois, D.-S. (2016). Regret minimization algorithms for single-controller zero-sum stochastic games. In 2016 IEEE 55th conference on decision and control (CDC) (pp. 7075–7080). IEEE.
Neu, G., György, A., Szepesvári, C., & Antos, A. (2013). Online Markov decision processes under bandit feedback. IEEE Transactions on Automatic Control, 59(3), 676–691.
Filar, J., & Vrieze, K. (1997). Applications and special classes of stochastic games. In Competitive Markov decision processes (pp. 301–341). Springer, New York.
Puterman, M. L. (1990). Markov decision processes. Handbooks in Operations Research and Management Science, 2, 331–434.
McMahan, H. B., Gordon, G. J., & Blum, A. (2003). Planning in the presence of cost functions controlled by an adversary. In Proceedings of the 20th international conference on machine learning (ICML-03) (pp. 536–543).
Dinh, L. C., Yang, Y., Tian, Z., Nieves, N. P., Slumbers, O., Mguni, D. H., & Wang, J. (2021). Online double oracle. arXiv preprint arXiv:2103.07780
Wei, C.-Y., Hong, Y.-T., & Lu, C.-J. (2017). Online reinforcement learning in stochastic games. arXiv preprint arXiv:1712.00579
Cheung, W. C., Simchi-Levi, D., & Zhu, R. (2019). Non-stationary reinforcement learning: The blessing of (more) optimism. Available at SSRN 3397818.
Yu, J. Y., Mannor, S., & Shimkin, N. (2009). Markov decision processes with arbitrary reward processes. Mathematics of Operations Research, 34(3), 737–757.
Arora, R., Dekel, O., & Tewari, A. (2012). Online bandit learning against an adaptive adversary: from regret to policy regret. arXiv preprint arXiv:1206.6400
Cesa-Bianchi, N., & Lugosi, G. (2006). Prediction, Learning, and Games. Cambridge: Cambridge University Press.
Zinkevich, M., Johanson, M., Bowling, M., & Piccione, C. (2007). Regret minimization in games with incomplete information. Advances in Neural Information Processing Systems, 20, 1729–1736.
Daskalakis, C., Ilyas, A., Syrgkanis, V., & Zeng, H. (2017). Training GANs with optimism. arXiv preprint arXiv:1711.00141
Shapley, L. S. (1953). Stochastic games. Proceedings of the National Academy of Sciences, 39(10), 1095–1100.
Deng, X., Li, Y., Mguni, D. H., Wang, J., & Yang, Y. (2021). On the complexity of computing Markov perfect equilibrium in general-sum stochastic games. arXiv preprint arXiv:2109.01795
Tian, Y., Wang, Y., Yu, T., & Sra, S. (2020). Online learning in unknown Markov games.
Neumann, J. v. (1928). Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100(1), 295–320.
Nash, J. F., et al. (1950). Equilibrium points in n-person games. Proceedings of the National Academy of Sciences, 36(1), 48–49.
Brown, G. W. (1951). Iterative solution of games by fictitious play. Activity Analysis of Production and Allocation, 13(1), 374–376.
Czarnecki, W. M., Gidel, G., Tracey, B., Tuyls, K., Omidshafiei, S., Balduzzi, D., & Jaderberg, M. (2020). Real world games look like spinning tops. arXiv preprint arXiv:2004.09468
Perez-Nieves, N., Yang, Y., Slumbers, O., Mguni, D. H., Wen, Y., & Wang, J. (2021). Modelling behavioural diversity for learning in open-ended games. In International conference on machine learning (pp. 8514–8524). PMLR
Liu, X., Jia, H., Wen, Y., Yang, Y., Hu, Y., Chen, Y., Fan, C., & Hu, Z. (2021). Unifying behavioral and response diversity for open-ended learning in zero-sum games. arXiv preprint arXiv:2106.04958
Yang, Y., Luo, J., Wen, Y., Slumbers, O., Graves, D., Bou Ammar, H., Wang, J., & Taylor, M. E. (2021). Diverse auto-curriculum is critical for successful real-world multiagent learning systems. In Proceedings of the 20th international conference on autonomous agents and multiagent systems (pp. 51–56).
Bohnenblust, H., Karlin, S., & Shapley, L. (1950). Solutions of discrete, two-person games. Contributions to the Theory of Games, 1, 51–72.
Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., Georgiev, P., et al. (2019). Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782), 350–354.
Daskalakis, C., & Panageas, I. (2019). Last-iterate convergence: Zero-sum games and constrained min-max optimization. In 10th innovations in theoretical computer science.
Conitzer, V., & Sandholm, T. (2007). AWESOME: A general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents. Machine Learning, 67(1–2), 23–43.
Chakraborty, D., & Stone, P. (2014). Multiagent learning in the presence of memory-bounded agents. Autonomous Agents and Multi-agent Systems, 28(2), 182–213.
Appendices
Appendix A: Proofs
We provide the following lemmas and proposition:
Lemma 7
(Lemma 3.3 in [3]) For all loss functions \({\varvec{l}}\) in [0, 1] and all policies \(\pi\), \(Q_{{\varvec{l}},\pi }(s,a) \le 3\tau\).
Lemma 8
(Lemma 1 from [16]) Consider a uniformly ergodic OMDP with mixing time \(\tau\) and losses \({{\varvec{l}}}_t \in [0,1]^{\varvec{d}}\). Then, for any \(T > 1\) and any policy \(\pi\) with stationary distribution \({\varvec{d}}_{\pi }\), it holds that
This lemma guarantees that, for a fixed policy, the loss measured with respect to the policy's stationary distribution is close to the actual loss incurred by the policy.
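To make this concrete, the short sketch below (an illustration of ours, not part of the original analysis; the transition kernel, policy, and loss values are arbitrary) computes the stationary state-action distribution \({\varvec{d}}_{\pi}\) of a fixed policy and the average loss \(\langle {\varvec{d}}_{\pi}, {\varvec{l}}\rangle\) that Lemma 8 compares to the actual accumulated loss:

```python
import numpy as np

# Illustrative example only: a 2-state, 2-action MDP with a fixed policy.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s']: transition kernel
              [[0.3, 0.7], [0.6, 0.4]]])
pi = np.array([[0.5, 0.5],                 # pi[s, a]: fixed stochastic policy
               [0.7, 0.3]])
loss = np.array([[0.2, 0.9],               # l[s, a] in [0, 1]
                 [0.4, 0.1]])

# State transition matrix induced by the policy: P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a).
P_pi = np.einsum('sa,sab->sb', pi, P)

# Stationary state distribution: left eigenvector of P_pi with eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P_pi.T)
d_state = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1))])
d_state /= d_state.sum()

# Stationary state-action distribution d_pi(s, a) = d_state(s) * pi(a|s).
d_pi = d_state[:, None] * pi

# Average loss under the stationary distribution, <d_pi, l>.
avg_loss = np.sum(d_pi * loss)
print(d_state, avg_loss)
```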
In the case of a non-fixed policy, the following lemma bounds the gap between the performance under the stationary distribution of algorithm \({\mathcal {A}}\) and its actual performance:
Lemma 9
(Lemma 5.2 in [3]) Let \(\pi _1, \pi _2,\dots\) be the policies played by the MDP-E algorithm \({\mathcal {A}}\) and let \({\tilde{{\varvec{d}}}}_{{\mathcal {A}},t},\;{\tilde{{\varvec{d}}}}_{\pi _t} \in [0,1]^{|S|}\) be the corresponding stationary state distributions. Then,
From the above lemma, since the policy's stationary distribution is a combination of the stationary state distribution and the policy's action distribution in each state, it is easy to show that:
Proposition 8
For the MWU algorithm [8] with appropriate \(\mu _t\), we have:
where \(\Vert {\varvec{l}}_t(.)\Vert \le M\). Furthermore, the strategy \({\varvec{\pi }}_t\) does not change quickly: \(\Vert {\varvec{\pi }}_t-{\varvec{\pi }}_{t+1}\Vert \le \sqrt{\frac{\log (n)}{t}}.\)
Proof
For a fixed T, if the loss function satisfies \(\Vert {\varvec{l}}_t(.)\Vert \le 1\), then by setting \(\mu _t=\sqrt{\frac{8 \log (n)}{T}}\) and following Theorem 2.2 in [25] we have:
Thus, in the case where \(\Vert {\varvec{l}}_t(.)\Vert \le M\), by scaling both sides of Eq. (A1) by M we obtain the first result of Proposition 8. For the second part, following the update rule of MWU we have:
where we use the approximation \(e^x\approx 1+x\) for small x in Eq. (A2a). Thus, the difference between two consecutive strategies \(\pi _t\) and \(\pi_{t+1}\) is proportional to the learning rate \(\mu _t\), which is set to \({\mathcal {O}}\big (\sqrt{\frac{\log (n)}{t}}\big )\). A similar result can be found in Proposition 1 in [3]. \(\square\)
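As an illustration of the second claim, the following minimal MWU sketch (the loss vectors are synthetic and the slack factor 2 in the check is ours) tracks how far consecutive strategies drift and confirms that the drift is of order \(\mu_t = \sqrt{\log(n)/t}\):

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 5, 1000                       # number of actions and horizon (illustrative)
weights = np.ones(n)
prev_pi = weights / weights.sum()

for t in range(1, T + 1):
    mu_t = np.sqrt(np.log(n) / t)    # time-varying learning rate O(sqrt(log(n)/t))
    loss = rng.uniform(0.0, 1.0, n)  # synthetic loss vector in [0, 1]^n
    weights *= np.exp(-mu_t * loss)  # multiplicative weights update
    pi = weights / weights.sum()
    drift = np.abs(pi - prev_pi).sum()
    assert drift <= 2 * mu_t         # consecutive strategies move by O(mu_t)
    prev_pi = pi
```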
Theorem
(Theorem 5) Suppose the agent uses Algorithm 2 in our online MDP setting. Then the regret in Eq. (1) can be bounded by:
Proof
First, we bound the difference between the true loss and the loss with respect to the policy's stationary distribution. Following Algorithm 2, at the start of each time interval \(T_i\) (i.e., a time interval in which the effective strategy set does not change), the learning rate is reset to \({\mathcal {O}}(\sqrt{\log (i)/t_i})\), where i denotes the number of pure strategies in the effective strategy set during \(T_i\) and \(t_i\) is the relative position of the current round within that interval. Thus, following Lemma 5.2 in [3], in each time interval \(T_i\) the difference between the true loss and the loss with respect to the policy's stationary distribution is:
From this we have:
Following Lemma 1 from [16], we also have:
Thus the regret in Eq. (1) can be bounded by:
The proof is complete. \(\square\)
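To clarify the restart argument, here is a minimal sketch of the learning-rate schedule assumed above: the interval clock \(t_i\) is reset whenever a new pure strategy enters the effective set, and the rate is \(\sqrt{\log(i)/t_i}\). The function and driver names are hypothetical and do not reproduce Algorithm 2 itself; the oracle decision below is a stand-in.

```python
import numpy as np

def interval_learning_rate(effective_set_size: int, t_in_interval: int) -> float:
    """Learning rate used within the current time interval T_i:
    mu = sqrt(log(i) / t_i), where i is the number of pure strategies
    in the effective set and t_i the round index inside the interval."""
    i = max(effective_set_size, 2)          # guard against log(1) = 0
    return np.sqrt(np.log(i) / t_in_interval)

# Illustrative driver: restart t_i whenever a (hypothetical) best-response
# oracle adds a new pure strategy to the effective set.
effective_set = [0]                         # start with one pure strategy
t_i = 0
for t in range(1, 21):
    new_strategy_added = (t % 7 == 0)       # stand-in for the oracle's decision
    if new_strategy_added:
        effective_set.append(len(effective_set))
        t_i = 0                             # reset the interval clock
    t_i += 1
    mu = interval_learning_rate(len(effective_set), t_i)
    print(t, len(effective_set), round(mu, 3))
```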
Theorem
(Theorem 6) Suppose the agent only has access to an \(\epsilon\)-best response in each iteration when following Algorithm 2. If the adversary follows a no-external-regret algorithm, then the average strategies of the agent and the adversary converge to an \(\epsilon\)-Nash equilibrium. Furthermore, the algorithm has \(\epsilon\)-regret.
Proof
Suppose that the player uses the Multiplicative Weights Update in Algorithm 2 with an \(\epsilon\)-best response. Let \(T_1, T_2, \dots , T_k\) be the time windows in which the player does not add a new strategy. Since the strategy set A is finite, k is finite. Furthermore,
In a time window \(T_i\), the regret with respect to the best strategy in the strategy set at time \(T_i\) is:
where \(\bar{T}_i=\sum _{j=1}^{i-1}T_j\). Since in the time window \(T_i\) the \(\epsilon\)-best response strategy stays in \(\Pi _{\bar{T}_i +1}\), we have:
Then, from Eq. (A4) we have:
Summing up Eq. (A5) for \(i=1,\dots, k\), we have:
Inequality (A6a) is due to \(\sum \min \le \min \sum\). Inequality (A6b) follows from the Cauchy-Schwarz inequality and Stirling's approximation. Using Inequality (A6b), we have:
Since the adversary follows a no-regret algorithm, we have:
Using Inequalities (A7) and (A8), we have:
Similarly, we also have:
Taking the limit \(T \rightarrow \infty\), we then have:
Thus \((\bar{{{\varvec{l}}}}, \bar{{\varvec{d}}_{\pi }})\) is an \(\epsilon\)-Nash equilibrium of the game. \(\square\)
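Operationally, the \(\epsilon\)-Nash conclusion can be checked through exploitability: neither player should gain more than \(\epsilon\) by deviating from the averaged strategies. The sketch below (a generic zero-sum matrix-game check of ours; the loss matrix and averaged strategies are made up) illustrates this test:

```python
import numpy as np

def exploitability(G: np.ndarray, x: np.ndarray, y: np.ndarray) -> float:
    """For a zero-sum game with loss matrix G (the row player minimizes x^T G y),
    return the largest gain either player could obtain by deviating.
    The pair (x, y) is an eps-Nash equilibrium iff this value is <= eps."""
    value = x @ G @ y
    row_gain = value - np.min(G @ y)        # best row deviation lowers the loss
    col_gain = np.max(x @ G) - value        # best column deviation raises it
    return max(row_gain, col_gain)

# Illustrative use with averaged strategies from some no-regret dynamics.
G = np.array([[0.0, 1.0], [1.0, 0.0]])      # matching-pennies-style loss matrix
x_bar = np.array([0.48, 0.52])              # e.g., time-averaged agent strategy
y_bar = np.array([0.55, 0.45])              # e.g., time-averaged adversary strategy
print(exploitability(G, x_bar, y_bar))      # small value -> approximate Nash
```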
Appendix B: Experiments
We provide further experiment results to demonstrate the performance of MDP-OOE and MDP-E.
In Fig. 2, by considering a different number of loss vectors (\(L=7\)), we test whether the performance gap between MDP-OOE and MDP-E is consistent with respect to the number of loss vectors. As shown in Fig. 2, MDP-OOE also outperforms MDP-E when the number of loss functions is \(L=7\). This result further validates the advantage of MDP-OOE over MDP-E in the setting of a small NE support size.
In Fig. 3, we consider a larger agent action set in each state (\(|A| = 500\)). As shown in Fig. 3, the performance gap between MDP-OOE and MDP-E becomes more significant with the larger action set, in both the \(L=3\) and \(L=7\) cases, as expected from our theoretical results.