
Online Markov decision processes with non-oblivious strategic adversary

Autonomous Agents and Multi-Agent Systems

Abstract

We study a novel setting in Online Markov Decision Processes (OMDPs) where the loss function is chosen by a non-oblivious strategic adversary who follows a no-external regret algorithm. In this setting, we first demonstrate that MDP-Expert, an existing algorithm that works well with oblivious adversaries, can still apply and achieve a policy regret bound of \({\mathcal {O}}(\sqrt{T \log (L)}+\tau ^2\sqrt{ T \log (\vert A \vert )})\) where L is the size of the adversary’s pure strategy set and \(\vert A \vert\) denotes the size of the agent’s action space. Considering real-world games where the support size of a NE is small, we further propose a new algorithm, MDP-Online Oracle Expert (MDP-OOE), that achieves a policy regret bound of \({\mathcal {O}}(\sqrt{T\log (L)}+\tau ^2\sqrt{ T k \log (k)})\) where k depends only on the support size of the NE. MDP-OOE leverages the key benefit of Double Oracle in game theory and thus can solve games with prohibitively large action spaces. Finally, to better understand the learning dynamics of no-regret methods, under the same setting of a no-external regret adversary in OMDPs, we introduce an algorithm that achieves last-round convergence to a NE. To the best of our knowledge, this is the first work leading to a last-iteration result in OMDPs.


Notes

  1. In the multi-armed bandit setting, it is also impossible to achieve sublinear policy regret against all adaptive adversaries (see Theorem 1 in [24]).

  2. For the completeness of the paper, we provide the lemma in Appendix A.

  3. If the adversary does not follow the optimal bound (i.e., is irrational), then the regret bound of the agent will change accordingly.

  4. W.l.o.g., we consider the payoff (i.e., the negative of the loss) for the agent in our experiments, so that the agent aims to maximize the payoff.

References

  1. Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. Massachusetts: MIT press.


  2. Laurent, G. J., Matignon, L., Fort-Piat, L., et al. (2011). The world of independent learners is not markovian. International Journal of Knowledge-based and Intelligent Engineering Systems, 15(1), 55–64.


  3. Even-Dar, E., Kakade, S. M., & Mansour, Y. (2009). Online markov decision processes. Mathematics of Operations Research, 34(3), 726–736.


  4. Dick, T., Gyorgy, A., & Szepesvari, C. (2014). Online learning in markov decision processes with changing cost sequences. In ICML (pp. 512–520).

  5. Neu, G., Antos, A., György, A., & Szepesvári, C. (2010). Online markov decision processes under bandit feedback. In NeurIPS (pp. 1804–1812).

  6. Neu, G., & Olkhovskaya, J. (2020). Online learning in mdps with linear function approximation and bandit feedback. arXiv e-prints, 2007.

  7. Yang, Y., & Wang, J. (2020). An overview of multi-agent reinforcement learning from game theoretical perspective. arXiv preprint arXiv:2011.00583

  8. Freund, Y., & Schapire, R. E. (1999). Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1–2), 79–103.


  9. Shalev-Shwartz, S., et al. (2011). Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2), 107–194.


  10. Mertikopoulos, P., Papadimitriou, C., & Piliouras, G. (2018). Cycles in adversarial regularized learning. In Proceedings of the twenty-ninth annual ACM-SIAM symposium on discrete algorithms (pp. 2703–2717). SIAM.

  11. Bailey, J. P., & Piliouras, G. (2018). Multiplicative weights update in zero-sum games. In Proceedings of the 2018 ACM conference on economics and computation (pp. 321–338).

  12. Dinh, L. C., Nguyen, T.-D., Zemhoho, A. B., & Tran-Thanh, L. (2021). Last round convergence and no-dynamic regret in asymmetric repeated games. In Algorithmic learning theory (pp. 553–577) PMLR.

  13. Mertikopoulos, P., Lecouat, B., Zenati, H., Foo, C.-S., Chandrasekhar, V., & Piliouras, G. (2019). Optimistic mirror descent in saddle-point problems: Going the extra (gradient) mile. In ICLR 2019-7th international conference on learning representations (pp. 1–23).

  14. Leslie, D. S., Perkins, S., & Xu, Z. (2020). Best-response dynamics in zero-sum stochastic games. Journal of Economic Theory, 189, 105095.


  15. Guan, P., Raginsky, M., Willett, R., & Zois, D.-S. (2016). Regret minimization algorithms for single-controller zero-sum stochastic games. In 2016 IEEE 55th conference on decision and control (CDC) (pp 7075–7080). IEEE

  16. Neu, G., György, A., Szepesvári, C., & Antos, A. (2013). Online markov decision processes under bandit feedback. IEEE Transactions on Automatic Control, 59(3), 676–691.


  17. Filar, J., & Vrieze, K. (1997). Applications and special classes of stochastic games. In Competitive markov decision processes (pp 301–341). Springer, New York.

  18. Puterman, M. L. (1990). Markov decision processes. Handbooks in Operations Research and Management Science, 2, 331–434.


  19. McMahan, H. B., Gordon, G. J., & Blum, A. (2003). Planning in the presence of cost functions controlled by an adversary. In Proceedings of the 20th international conference on machine learning (ICML-03) (pp. 536–543).

  20. Dinh, L. C., Yang, Y., Tian, Z., Nieves, N. P., Slumbers, O., Mguni, D. H., & Wang, J. (2021). Online double oracle. arXiv preprint arXiv:2103.07780

  21. Wei, C.-Y., Hong, Y.-T., & Lu, C.-J. (2017). Online reinforcement learning in stochastic games. arXiv preprint arXiv:1712.00579

  22. Cheung, W. C., Simchi-Levi, D., & Zhu, R. (2019). Non-stationary reinforcement learning: The blessing of (more) optimism. Available at SSRN 3397818.

  23. Yu, J. Y., Mannor, S., & Shimkin, N. (2009). Markov decision processes with arbitrary reward processes. Mathematics of Operations Research, 34(3), 737–757.


  24. Arora, R., Dekel, O., & Tewari, A. (2012). Online bandit learning against an adaptive adversary: from regret to policy regret. arXiv preprint arXiv:1206.6400

  25. Cesa-Bianchi, N., & Lugosi, G. (2006). Prediction, Learning, and Games. Cambridge: Cambridge University Press.


  26. Zinkevich, M., Johanson, M., Bowling, M., & Piccione, C. (2007). Regret minimization in games with incomplete information. Advances in Neural Information Processing Systems, 20, 1729–1736.


  27. Daskalakis, C., Ilyas, A., Syrgkanis, V., & Zeng, H. (2017). Training gans with optimism. arXiv preprint arXiv:1711.00141

  28. Shapley, L. S. (1953). Stochastic games. Proceedings of the National Academy of Sciences, 39(10), 1095–1100.


  29. Deng, X., Li, Y., Mguni, D. H., Wang, J., & Yang, Y. (2021). On the complexity of computing markov perfect equilibrium in general-sum stochastic games. arXiv preprint arXiv:2109.01795

  30. Tian, Y., Wang, Y., Yu, T., & Sra, S. (2020). Online learning in unknown markov games.

  31. Neumann, Jv. (1928). Zur theorie der gesellschaftsspiele. Mathematische Annalen, 100(1), 295–320.


  32. Nash, J. F., et al. (1950). Equilibrium points in n-person games. Proceedings of the National Academy of Sciences, 36(1), 48–49.


  33. Brown, G. W. (1951). Iterative solution of games by fictitious play. Activity Analysis of Production and Allocation, 13(1), 374–376.


  34. Czarnecki, W. M., Gidel, G., Tracey, B., Tuyls, K., Omidshafiei, S., Balduzzi, D., & Jaderberg, M. (2020). Real world games look like spinning tops. arXiv preprint arXiv:2004.09468

  35. Perez-Nieves, N., Yang, Y., Slumbers, O., Mguni, D. H., Wen, Y., & Wang, J. (2021). Modelling behavioural diversity for learning in open-ended games. In International conference on machine learning (pp. 8514–8524). PMLR

  36. Liu, X., Jia, H., Wen, Y., Yang, Y., Hu, Y., Chen, Y., Fan, C., & Hu, Z. (2021). Unifying behavioral and response diversity for open-ended learning in zero-sum games. arXiv preprint arXiv:2106.04958

  37. Yang, Y., Luo, J., Wen, Y., Slumbers, O., Graves, D., Bou Ammar, H., Wang, J., & Taylor, M. E. (2021). Diverse auto-curriculum is critical for successful real-world multiagent learning systems. In Proceedings of the 20th international conference on autonomous agents and multiagent systems (pp. 51–56).

  38. Bohnenblust, H., Karlin, S., & Shapley, L. (1950). Solutions of discrete, two-person games. Contributions to the Theory of Games, 1, 51–72.


  39. Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., Georgiev, P., et al. (2019). Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, 575(7782), 350–354.


  40. Daskalakis, C., & Panageas, I. (2019). Last-iterate convergence: Zero-sum games and constrained min-max optimization. In 10th innovations in theoretical computer science.

  41. Conitzer, V., & Sandholm, T. (2007). Awesome: A general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents. Machine Learning, 67(1–2), 23–43.


  42. Chakraborty, D., & Stone, P. (2014). Multiagent learning in the presence of memory-bounded agents. Autonomous Agents and Multi-agent Systems, 28(2), 182–213.



Author information

Corresponding author

Correspondence to Yaodong Yang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A proofs

We provide the following lemmas and proposition:

Lemma 7

(Lemma 3.3 in [3]) For all loss functions \({\varvec{l}}\) taking values in [0, 1] and all policies \(\pi\), \(Q_{{\varvec{l}},\pi }(s,a) \le 3\tau\).

Lemma 8

(Lemma 1 from [16]) Consider a uniformly ergodic OMDP with mixing time \(\tau\) and losses \({{\varvec{l}}}_t \in [0,1]^{\varvec{d}}\). Then, for any \(T > 1\) and policy \(\pi\) with stationary distribution \({\varvec{d}}_{\pi }\), it holds that

$$\begin{aligned} \sum _{t=1}^T \vert \langle {{\varvec{l}}}_t, {\varvec{d}}_{\pi } -{\varvec{v}}_t^{\pi } \rangle \vert \le 2 \tau +2 . \end{aligned}$$

This lemma guarantees that the performance of a policy’s stationary distribution is similar to the actual performance of the policy in the case of a fixed policy.
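To make the statement of Lemma 8 concrete, the following minimal sketch tracks a fixed policy on a two-state chain and accumulates \(\vert \langle {{\varvec{l}}}_t, {\varvec{d}}_{\pi } -{\varvec{v}}_t^{\pi } \rangle \vert\). Several things here are assumptions for illustration only: state distributions stand in for the state-action distributions of the paper, the transition matrix `P` and the loss sequence are made up, and the mixing time is taken to satisfy the contraction \(\Vert ({\varvec{d}}-{\varvec{d}}')P\Vert _1 \le e^{-1/\tau }\Vert {\varvec{d}}-{\varvec{d}}'\Vert _1\), a definition that is not restated in this appendix. The helper names `step` and `cumulative_gap` are hypothetical.

```python
import math

def step(v, P):
    """Push a state distribution v one step through the transition matrix P."""
    n = len(v)
    return [sum(v[s] * P[s][s2] for s in range(n)) for s2 in range(n)]

def cumulative_gap(P, d, v0, losses):
    """Sum over rounds of |<l_t, d - v_t>|, where v_t is the actual distribution."""
    v, total = list(v0), 0.0
    for l in losses:
        total += abs(sum(li * (di - vi) for li, di, vi in zip(l, d, v)))
        v = step(v, P)
    return total

# Illustrative two-state chain induced by a fixed policy (assumed values).
P = [[0.7, 0.3], [0.2, 0.8]]
d = [0.4, 0.6]                    # stationary distribution of P
tau = 1.0 / math.log(2.0)         # contraction factor of this chain is 0.5 = exp(-1/tau)
losses = [[1.0, 0.0] if t % 2 == 0 else [0.0, 1.0] for t in range(200)]

total = cumulative_gap(P, d, [1.0, 0.0], losses)
print(total, "<=", 2 * tau + 2)
```

Because the distance to the stationary distribution shrinks geometrically, the accumulated gap stays bounded no matter how many rounds are played, which is the qualitative content of the lemma.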

In the other case of a non-fixed policy, the following lemma bounds the performance of the policy’s stationary distribution under algorithm \({\mathcal {A}}\) relative to the actual performance:

Lemma 9

(Lemma 5.2 in [3]) Let \(\pi _1, \pi _2,\dots\) be the policies played by MDP-E algorithm \({\mathcal {A}}\) and let \({\tilde{{\varvec{d}}}}_{{\mathcal {A}},t},\;{\tilde{{\varvec{d}}}}_{\pi _t} \in [0,1]^{|S|}\) be the stationary state distribution. Then,

$$\begin{aligned} \Vert {\tilde{{\varvec{d}}}}_{{\mathcal {A}},t}-{\tilde{{\varvec{d}}}}_{\pi _t}\Vert _1\le 2\tau ^2 \sqrt{\frac{\log (\vert A \vert )}{t}}+2e^{-t/\tau }. \end{aligned}$$

From the above lemma, since the policy’s stationary distribution is a combination of stationary state distribution and the policy’s action in each state, it is easy to show that:

$$\begin{aligned} \Vert {\varvec{v}}_t-{\varvec{d}}_{\pi _t}\Vert _1 \le \Vert {\tilde{{\varvec{d}}}}_{{\mathcal {A}},t}-{\tilde{{\varvec{d}}}}_{\pi _t}\Vert _1\le 2\tau ^2 \sqrt{\frac{\log (\vert A \vert )}{t}}+2e^{-t/\tau }. \end{aligned}$$

Proposition 8

For the MWU algorithm [8] with appropriate \(\mu _t\), we have:

$$\begin{aligned} R_T(\pi )= {\mathbb {E}} \left[ \sum _{t=1}^T {\varvec{l}}_t(\pi _t)\right] - {\mathbb {E}} \left[ \sum _{t=1}^T {\varvec{l}}_t(\pi )\right] \le M \sqrt{\frac{T \log (n)}{2}}, \end{aligned}$$

where \(\Vert {\varvec{l}}_t(.)\Vert \le M\). Furthermore, the strategy \({\varvec{\pi }}_t\) does not change quickly: \(\Vert {\varvec{\pi }}_t-{\varvec{\pi }}_{t+1}\Vert \le \sqrt{\frac{\log (n)}{t}}.\)

Proof

For a fixed T, if the loss function satisfies \(\Vert {\varvec{l}}_t(.)\Vert \le 1\), then by setting \(\mu _t=\sqrt{\frac{8 \log (n)}{T}}\) and following Theorem 2.2 in [25], we have:

$$\begin{aligned} R_T(\pi )= {\mathbb {E}} \left[ \sum _{t=1}^T {\varvec{l}}_t(\pi _t)\right] - {\mathbb {E}} \left[ \sum _{t=1}^T {\varvec{l}}_t(\pi )\right] \le 1 \sqrt{\frac{T \log (n)}{2}}. \end{aligned}$$
(A1)

Thus, in the case where \(\Vert {\varvec{l}}_t(.)\Vert \le M\), scaling both sides of Eq. (A1) by M gives the first result of Proposition 8. For the second part, following the update rule of MWU, we have:

$$\begin{aligned} \pi _{t+1}(i)-\pi _t(i)&=\pi _t(i)\left( \frac{\exp (-\mu _t {\varvec{l}}_t({\varvec{a}}^i))}{\sum _{j=1}^n {\varvec{\pi }}_t(j)\exp (-\mu _t {\varvec{l}}_t({\varvec{a}}^j))}-1\right) \nonumber \\&\approx \pi _t(i) \left( \frac{1-\mu _t{\varvec{l}}_t({\varvec{a}}^i)}{1-\mu _t{\varvec{l}}_t(\pi _t)}-1\right) \nonumber \\&=\mu _t \pi _t(i) \frac{{\varvec{l}}_t(\pi _t)-{\varvec{l}}_t({\varvec{a}}^i)}{1-\mu _t{\varvec{l}}_t(\pi _t)} = {\mathcal {O}}(\mu _t), \end{aligned}$$
(A2a)

where we use the approximation \(e^x\approx 1+x\) for small x in Eq. (A2a). Thus, the difference between two consecutive strategies \(\pi _t\) and \(\pi _{t+1}\) is proportional to the learning rate \(\mu _t\), which is set to \({\mathcal {O}}\big (\sqrt{\frac{\log (n)}{t}}\big )\). A similar result can be found in Proposition 1 in [3]. \(\square\)
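Both parts of Proposition 8 can be checked numerically. The sketch below is a minimal illustration, not the implementation used in the paper: it runs MWU with the fixed-horizon rate \(\mu =\sqrt{8\log (n)/T}\) from the proof on a made-up loss sequence in [0, 1], then reports the regret against the best fixed action and the largest \(\ell _1\) change between consecutive strategies. The function names `mwu_step` and `run_mwu` and the loss sequence are assumptions introduced here.

```python
import math

def mwu_step(p, losses, mu):
    """One MWU update: p_i <- p_i * exp(-mu * loss_i), then renormalise."""
    scaled = [pi * math.exp(-mu * li) for pi, li in zip(p, losses)]
    total = sum(scaled)
    return [s / total for s in scaled]

def run_mwu(loss_sequence, n):
    """Run MWU over the whole sequence; return (regret, max L1 step, mu)."""
    T = len(loss_sequence)
    mu = math.sqrt(8.0 * math.log(n) / T)   # fixed-horizon learning rate
    p = [1.0 / n] * n                        # start from the uniform strategy
    expected_loss, max_step = 0.0, 0.0
    for losses in loss_sequence:
        expected_loss += sum(pi * li for pi, li in zip(p, losses))
        q = mwu_step(p, losses, mu)
        max_step = max(max_step, sum(abs(a - b) for a, b in zip(p, q)))
        p = q
    best_fixed = min(sum(l[i] for l in loss_sequence) for i in range(n))
    return expected_loss - best_fixed, max_step, mu

# Illustrative loss sequence with values in {0, 0.5, 1} (assumed data).
T, n = 2000, 3
seq = [[((t * (i + 1)) % 3) / 2.0 for i in range(n)] for t in range(T)]
regret, max_step, mu = run_mwu(seq, n)
print(regret, "<=", math.sqrt(T * math.log(n) / 2.0))
print(max_step, "is of order", mu)
```

The second printed line illustrates the slow-change property: each update moves the strategy by an amount of order \(\mu\), in line with Eq. (A2a).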

Theorem

(Theorem 5) Suppose the agent uses Algorithm 2 in our online MDP setting. Then the regret in Eq. (1) can be bounded by:

$$\begin{aligned} R_T(\pi ) ={\mathcal {O}}(\tau ^2\sqrt{ T k \log (k)} +\sqrt{T\log (L)}). \end{aligned}$$

Proof

First we bound the difference between the true loss and the loss with respect to the policy’s stationary distribution. Following Algorithm 2, at the start of each time interval \(T_i\) (i.e., the time interval in which the effective strategy set does not change), the learning rate needs to restart to \({\mathcal {O}}(\sqrt{\log (i)/t_i})\), where i denotes the number of pure strategies in the effective strategy set in the time interval \(T_i\) and \(t_i\) is the relative position of the current round in that interval. Thus, following Lemma 5.2 in [3], in each time interval \(T_i\), the difference between the true loss and the loss with respect to the policy’s stationary distribution is bounded by:

$$\begin{aligned} \begin{aligned} \sum _{t=t_{i-1}+1}^{t_i} \vert \langle {{\varvec{l}}}_t, {\varvec{v}}_t-{\varvec{d}}_{\pi _t} \rangle \vert&\le \sum _{t=t_{i-1}+1}^{t_i} \Vert {\varvec{v}}_t-{\varvec{d}}_{\pi _t} \Vert _1 \\&\le \sum _{t=1}^{T_i} \left( 2\tau ^2 \sqrt{\frac{\log (i)}{t}}+2e^{-t/\tau }\right) \\&\le 4\tau ^2 \sqrt{T_i\log (i)}+2(1+\tau ). \end{aligned} \end{aligned}$$

From this we have:

$$\begin{aligned} \begin{aligned} \sum _{t=1}^T \vert \langle {{\varvec{l}}}_t, {\varvec{v}}_t-{\varvec{d}}_{\pi _t} \rangle \vert&=\sum _{i=1}^k \sum _{t=t_{i-1}+1}^{t_i} \vert \langle {{\varvec{l}}}_t, {\varvec{v}}_t-{\varvec{d}}_{\pi _t} \rangle \vert \\&\le \sum _{i=1}^k \left( 4\tau ^2 \sqrt{T_i\log (i)}+2(1+\tau )\right) \\&\le 4\tau ^2 \sqrt{Tk \log (k)}+2k(1+\tau ). \end{aligned} \end{aligned}$$
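The two bounds above rest on \(\sum _{t=1}^{n} t^{-1/2}\le 2\sqrt{n}\), \(\sum _{t\ge 1} e^{-t/\tau }\le \tau\), and the Cauchy-Schwarz step \(\sum _{i=1}^k \sqrt{T_i\log (i)}\le \sqrt{Tk\log (k)}\). As a sanity check, the sketch below evaluates both sides for one arbitrary choice of \(\tau\) and interval lengths; these values, and the helper names, are assumptions made purely for illustration.

```python
import math

def interval_sum(tau, i, T_i):
    """Per-interval sum of 2 tau^2 sqrt(log(i)/t) + 2 exp(-t/tau) for t = 1..T_i."""
    return sum(2 * tau ** 2 * math.sqrt(math.log(i) / t) + 2 * math.exp(-t / tau)
               for t in range(1, T_i + 1))

def interval_bound(tau, i, T_i):
    """Per-interval bound: 4 tau^2 sqrt(T_i log(i)) + 2 (1 + tau)."""
    return 4 * tau ** 2 * math.sqrt(T_i * math.log(i)) + 2 * (1 + tau)

def total_bound(tau, T, k):
    """Aggregated bound: 4 tau^2 sqrt(T k log(k)) + 2 k (1 + tau)."""
    return 4 * tau ** 2 * math.sqrt(T * k * math.log(k)) + 2 * k * (1 + tau)

# Illustrative values: k = 3 intervals whose lengths sum to T = 1000.
tau = 2.0
lengths = [300, 500, 200]
k, T = len(lengths), sum(lengths)

per_interval = [interval_sum(tau, i, T_i) for i, T_i in enumerate(lengths, start=1)]
for i, (lhs, T_i) in enumerate(zip(per_interval, lengths), start=1):
    print(f"interval {i}: {lhs:.2f} <= {interval_bound(tau, i, T_i):.2f}")
print(f"total: {sum(per_interval):.2f} <= {total_bound(tau, T, k):.2f}")
```

The first interval has \(\log (1)=0\), so only the exponential term contributes there; the aggregated bound is loose by design because it replaces every \(\log (i)\) with \(\log (k)\).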

Following Lemma 1 from [16], we also have:

$$\begin{aligned} \sum _{t=1}^T\vert \langle {{\varvec{l}}}_t, {\varvec{d}}_{\pi } -{\varvec{v}}_t^{\pi } \rangle \vert \le 2 \tau +2. \end{aligned}$$

Thus the regret in Eq. (1) can be bounded by:

$$\begin{aligned} \begin{aligned} R_T(\pi )&\le \left( \sum _{t=1}^T \langle {\varvec{d}}_{\pi _t},{{\varvec{l}}}_t \rangle + \sum _{t=1}^T \vert \langle {{\varvec{l}}}_t, {\varvec{v}}_t-{\varvec{d}}_{\pi _t} \rangle \vert \right) -\left( \sum _{t=1}^T \langle {{\varvec{l}}}_t^{\pi }, {\varvec{d}}_{\pi } \rangle - \sum _{t=1}^T\vert \langle {{\varvec{l}}}_t, {\varvec{d}}_{\pi } -{\varvec{v}}_t^{\pi } \rangle \vert \right) \\&= \left( \sum _{t=1}^T \langle {\varvec{d}}_{\pi _t},{{\varvec{l}}}_t \rangle -\sum _{t=1}^T \langle {{\varvec{l}}}_t^{\pi }, {\varvec{d}}_{\pi } \rangle \right) + \sum _{t=1}^T \vert \langle {{\varvec{l}}}_t, {\varvec{v}}_t-{\varvec{d}}_{\pi _t} \rangle \vert + \sum _{t=1}^T\vert \langle {{\varvec{l}}}_t, {\varvec{d}}_{\pi } -{\varvec{v}}_t^{\pi } \rangle \vert \\&\le 3 \tau \left( \sqrt{2 {T k \log (k)}} +\frac{k\log (k)}{8} \right) + \frac{\sqrt{T \log (L)}}{\sqrt{2}}+ 4\tau ^2 \sqrt{Tk \log (k)}+2k(1+\tau )+2\tau +2\\&={\mathcal {O}}(\tau ^2\sqrt{ T k \log (k)} +\sqrt{T\log (L)}). \end{aligned} \end{aligned}$$
(A3)

The proof is complete. \(\square\)

Theorem

(Theorem 6) Suppose the agent only has access to an \(\epsilon\)-best response in each iteration when following Algorithm 2. If the adversary follows a no-external regret algorithm, then the average strategies of the agent and the adversary will converge to an \(\epsilon\)-Nash equilibrium. Furthermore, the algorithm has \(\epsilon\)-regret.

Proof

Suppose that the player uses the Multiplicative Weights Update in Algorithm 2 with an \(\epsilon\)-best response. Let \(T_1, T_2, \dots , T_k\) be the time windows in which the player does not add a new strategy. Since the set of strategies A is finite, k is finite. Furthermore,

$$\begin{aligned} \sum _{i=1}^k T_i=T. \end{aligned}$$

In a time window \(T_i\), the regret with respect to the best strategy in the strategy set at time \(T_i\) is:

$$\begin{aligned} \begin{aligned} \sum _{t=\bar{T}_i}^{\bar{T}_{i+1}} \langle {{\varvec{l}}}_t^{\pi _t}, {\varvec{d}}_{\pi _t}\rangle -\min _{\pi \in A_{{\bar{T}}_i+1}}\sum _{t=\vert \bar{T}_i\vert }^{\bar{T}_{i+1}} \langle {{\varvec{l}}}_t^{\pi _t}, {\varvec{d}}_{\pi }\rangle \le 3 \tau \left( \sqrt{2 {T_i \log (i)}} +\frac{\log (i)}{8} \right) , \end{aligned} \end{aligned}$$
(A4)

where \(\bar{T}_i=\sum _{j=1}^{i-1}T_j\). Since, in the time window \(T_i\), the \(\epsilon\)-best response strategy stays in \(\Pi _{\bar{T}_i +1}\), we have:

$$\begin{aligned} \min _{\pi \in A_{{\bar{T}}_i+1}} \sum _{t=\vert {\bar{T}}_i\vert }^{{\bar{T}}_{i+1}} \langle {{\varvec{l}}}_t^{\pi _t}, {\varvec{d}}_{\pi }\rangle -\min _{\pi \in \Pi } \sum _{t=\vert {\bar{T}}_i\vert }^{{\bar{T}}_{i+1}} \langle {{\varvec{l}}}_t^{\pi _t}, {\varvec{d}}_{\pi }\rangle \le \epsilon T_i. \end{aligned}$$

Then, from Eq. (A4), we have:

$$\begin{aligned} \begin{aligned} \sum _{t={\bar{T}}_i}^{{\bar{T}}_{i+1}} \langle {{\varvec{l}}}_t^{\pi _t}, {\varvec{d}}_{\pi _t}\rangle - \min _{\pi \in \Pi } \sum _{t=\vert {\bar{T}}_i\vert }^{{\bar{T}}_{i+1}} \langle {{\varvec{l}}}_t^{\pi _t}, {\varvec{d}}_{\pi }\rangle \le 3 \tau \left( \sqrt{2 {T_i \log (i)}} +\frac{\log (i)}{8} \right) + \epsilon T_i. \end{aligned} \end{aligned}$$
(A5)

Summing Eq. (A5) over \(i=1,\dots ,k\), we have:

$$\begin{aligned}&\sum _{t=1}^T \langle {{\varvec{l}}}_t^{\pi _t}, {\varvec{d}}_{\pi _t}\rangle -\sum _{i=1}^k \min _{\pi \in \Pi } \sum _{t=\vert {\bar{T}}_i\vert }^{{\bar{T}}_{i+1}} \langle {{\varvec{l}}}_t^{\pi _t}, {\varvec{d}}_{\pi }\rangle \le \sum _{i=1}^k \left[ 3 \tau \left( \sqrt{2 {T_i \log (i)}} +\frac{\log (i)}{8} \right) + \epsilon T_i \right] \nonumber \\&\quad \implies \sum _{t=1}^T \langle {{\varvec{l}}}_t^{\pi _t}, {\varvec{d}}_{\pi _t}\rangle -\min _{\pi \in \Pi } \sum _{i=1}^k \sum _{t=\vert {\bar{T}}_i\vert }^{{\bar{T}}_{i+1}} \langle {{\varvec{l}}}_t^{\pi _t}, {\varvec{d}}_{\pi }\rangle \le \epsilon T+ \sum _{i=1}^k 3 \tau \left( \sqrt{2 {T_i \log (i)}} +\frac{\log (i)}{8} \right) \end{aligned}$$
(A6a)
$$\begin{aligned}&\quad \implies \sum _{t=1}^T \langle {{\varvec{l}}}_t^{\pi _t}, {\varvec{d}}_{\pi _t}\rangle - \min _{\pi \in \Pi } \sum _{t=1}^{T} \langle {{\varvec{l}}}_t^{\pi _t}, {\varvec{d}}_{\pi }\rangle \le \epsilon T+ \sum _{i=1}^k 3 \tau \left( \sqrt{2 {T_i \log (i)}} +\frac{\log (i)}{8} \right) \nonumber \\&\quad \implies \sum _{t=1}^T \langle {{\varvec{l}}}_t^{\pi _t}, {\varvec{d}}_{\pi _t}\rangle - \min _{\pi \in \Pi } \sum _{t=1}^{T} \langle {{\varvec{l}}}_t^{\pi _t}, {\varvec{d}}_{\pi }\rangle \le \epsilon T + 3 \tau \left( \sqrt{2 {T k \log (k)}} +\frac{k\log (k)}{8} \right) . \end{aligned}$$
(A6b)

Inequality (A6a) is due to \(\sum \min \le \min \sum\). Inequality (A6b) comes from the Cauchy-Schwarz inequality and Stirling’s approximation. Using Inequality (A6b), we have:

$$\begin{aligned} \min _{\pi \in \Pi } \langle \bar{{{\varvec{l}}}}, {\varvec{d}}_{\pi } \rangle \ge \frac{1}{T} \sum _{t=1}^T \langle {{\varvec{l}}}_t^{\pi _t}, {\varvec{d}}_{\pi _t}\rangle - 3\tau \left( \sqrt{\frac{2k \log (k)}{T}} +\frac{k \log (k)}{8T} \right) -\epsilon . \end{aligned}$$
(A7)

Since the adversary follows a no-regret algorithm, we have:

$$\begin{aligned} \begin{aligned}&\max _{{{\varvec{l}}} \in \Delta _L} \sum _{t=1}^T \langle {{\varvec{l}}}, {\varvec{d}}_{\pi _t} \rangle -\sum _{t=1}^T \langle {{\varvec{l}}}_t^{\pi _t}, {\varvec{d}}_{\pi _t}\rangle \le \sqrt{\frac{T}{2}} \sqrt{\log (L)}\\&\quad \implies \max _{{{\varvec{l}}} \in \Delta _L} \langle {{\varvec{l}}}, \bar{{\varvec{d}}_{\pi }} \rangle \le \frac{1}{T} \sum _{t=1}^T \langle {{\varvec{l}}}_t^{\pi _t}, {\varvec{d}}_{\pi _t}\rangle +\sqrt{\frac{ \log (L)}{2T}}. \end{aligned} \end{aligned}$$
(A8)

Using the Inequalities (A7) and (A8) we have:

$$\begin{aligned} \begin{aligned} \langle \bar{{{\varvec{l}}}}, \bar{{\varvec{d}}_{\pi }} \rangle&\ge \min _{\pi \in \Pi } \langle \bar{{{\varvec{l}}}}, {\varvec{d}}_{\pi } \rangle \ge \frac{1}{T} \sum _{t=1}^T \langle {{\varvec{l}}}_t^{\pi _t}, {\varvec{d}}_{\pi _t}\rangle - 3\tau \left( \sqrt{\frac{2k \log (k)}{T}} +\frac{k \log (k)}{8T} \right) -\epsilon \\&\ge \max _{{{\varvec{l}}} \in \Delta _L} \langle {{\varvec{l}}}, \bar{{\varvec{d}}_{\pi }} \rangle - \sqrt{\frac{\log (L)}{2T}}- 3\tau \left( \sqrt{\frac{2k \log (k)}{T}} +\frac{k \log (k)}{8T} \right) -\epsilon . \end{aligned} \end{aligned}$$

Similarly, we also have:

$$\begin{aligned} \begin{aligned} \langle \bar{{{\varvec{l}}}}, \bar{{\varvec{d}}_{\pi }} \rangle&\le \max _{{{\varvec{l}}} \in \Delta _L} \langle {{\varvec{l}}}, \bar{{\varvec{d}}_{\pi }} \rangle \le \frac{1}{T} \sum _{t=1}^T \langle {{\varvec{l}}}_t^{\pi _t}, {\varvec{d}}_{\pi _t}\rangle +\sqrt{\frac{ \log (L)}{2T}}\\&\le \min _{\pi \in \Pi } \langle \bar{{{\varvec{l}}}}, {\varvec{d}}_{\pi } \rangle + 3\tau \left( \sqrt{\frac{2k \log (k)}{T}} +\frac{k \log (k)}{8T} \right) +\epsilon . \end{aligned} \end{aligned}$$

Taking the limit \(T \rightarrow \infty\), we then have:

$$\begin{aligned} \max _{{{\varvec{l}}} \in \Delta _L} \langle {{\varvec{l}}}, \bar{{\varvec{d}}_{\pi }} \rangle -\epsilon \le \langle \bar{{{\varvec{l}}}}, \bar{{\varvec{d}}_{\pi }} \rangle \le \min _{\pi \in \Pi } \langle \bar{{{\varvec{l}}}}, {\varvec{d}}_{\pi } \rangle +\epsilon . \end{aligned}$$

Thus \((\bar{{{\varvec{l}}}}, \bar{{\varvec{d}}_{\pi }})\) is an \(\epsilon\)-Nash equilibrium of the game. \(\square\)
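The sandwich argument in this proof can be illustrated in a much simpler setting. The sketch below is a stateless simplification and not Algorithm 2: there is no MDP, no best-response oracle, and \(\epsilon =0\). Both the agent (minimising) and the adversary (maximising) run plain MWU on a small made-up loss matrix `G`, where `G[j][a]` is the loss of agent action `a` under the adversary's pure strategy `j`. The code then measures the gap \(\max _{{\varvec{l}}}\langle {\varvec{l}}, \bar{{\varvec{d}}}\rangle -\min _{\pi }\langle \bar{{\varvec{l}}}, {\varvec{d}}_{\pi }\rangle\) at the average strategies, which the two regret bounds together force to be small. All names and values are assumptions introduced here.

```python
import math

def mwu(p, losses, mu):
    """Multiplicative weights update on the probability vector p."""
    w = [pi * math.exp(-mu * li) for pi, li in zip(p, losses)]
    z = sum(w)
    return [wi / z for wi in w]

def self_play_gap(G, T):
    """Run MWU vs MWU for T rounds; return the equilibrium gap of the averages."""
    L, n = len(G), len(G[0])
    mu_x = math.sqrt(8 * math.log(n) / T)    # agent's learning rate
    mu_y = math.sqrt(8 * math.log(L) / T)    # adversary's learning rate
    x, y = [1.0 / n] * n, [1.0 / L] * L
    x_avg, y_avg = [0.0] * n, [0.0] * L
    for _ in range(T):
        x_avg = [a + xi / T for a, xi in zip(x_avg, x)]
        y_avg = [a + yi / T for a, yi in zip(y_avg, y)]
        agent_loss = [sum(y[j] * G[j][a] for j in range(L)) for a in range(n)]
        adv_gain = [sum(G[j][a] * x[a] for a in range(n)) for j in range(L)]
        x = mwu(x, agent_loss, mu_x)
        # The adversary maximises, so it minimises the loss 1 - gain.
        y = mwu(y, [1.0 - g for g in adv_gain], mu_y)
    best_adv = max(sum(G[j][a] * x_avg[a] for a in range(n)) for j in range(L))
    best_agent = min(sum(y_avg[j] * G[j][a] for j in range(L)) for a in range(n))
    return best_adv - best_agent

# Illustrative 2 x 3 loss matrix with entries in [0, 1] (assumed data).
G = [[0.2, 0.9, 0.4], [0.8, 0.1, 0.6]]
T = 5000
gap = self_play_gap(G, T)
bound = math.sqrt(math.log(3) / (2 * T)) + math.sqrt(math.log(2) / (2 * T))
print(gap, "<=", bound)
```

The gap is non-negative by construction and is at most the sum of the two average regrets, so it shrinks at rate \({\mathcal {O}}(1/\sqrt{T})\) in this simplified game.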

Appendix B experiments

We provide further experiment results to demonstrate the performance of MDP-OOE and MDP-E.

In Fig. 2, by considering a different number of loss vectors (\(L=7\)), we test whether the performance difference between MDP-OOE and MDP-E is consistent with regard to the number of loss vectors. As we can see in Fig. 2, MDP-OOE also outperforms MDP-E when the number of loss functions is \(L=7\). The result further validates the advantage of MDP-OOE over MDP-E in the setting of a small support size of the NE.

In Fig. 3, we consider a larger set of agent actions in each state (\(A = 500\)). As we can see in Fig. 3, the difference in performance between MDP-OOE and MDP-E becomes more significant when a larger action set is considered, both when \(L=3\) and when \(L=7\), as expected from our theoretical results.

Fig. 2 Performance comparisons in average payoff in random games with \(L=7\)

Fig. 3 Performance comparisons in average payoff in random games

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Dinh, L.C., Mguni, D.H., Tran-Thanh, L. et al. Online Markov decision processes with non-oblivious strategic adversary. Auton Agent Multi-Agent Syst 37, 15 (2023). https://doi.org/10.1007/s10458-023-09599-5


  • DOI: https://doi.org/10.1007/s10458-023-09599-5
