Search | arXiv e-print repository

A Lower Bound on Swap Regret in Extensive-Form Games

Authors: Constantinos Daskalakis, Gabriele Farina, Noah Golowich, Tuomas Sandholm, Brian Hu Zhang

Abstract: Recent simultaneous works by Peng and Rubinstein [2024] and Dagan et al. [2024] have demonstrated the existence of a no-swap-regret learning algorithm that can reach $ε$ average swap regret against an adversary in any extensive-form game within $m^{\tilde{\mathcal O}(1/ε)}$ rounds, where $m$ is the number of nodes in the game tree. However, the question of whether a $\mathrm{poly}(m, 1/ε)$-round a… ▽ More Recent simultaneous works by Peng and Rubinstein [2024] and Dagan et al. [2024] have demonstrated the existence of a no-swap-regret learning algorithm that can reach $ε$ average swap regret against an adversary in any extensive-form game within $m^{\tilde{\mathcal O}(1/ε)}$ rounds, where $m$ is the number of nodes in the game tree. However, the question of whether a $\mathrm{poly}(m, 1/ε)$-round algorithm could exist remained open. In this paper, we show a lower bound that precludes the existence of such an algorithm. In particular, we show that achieving average swap regret $ε$ against an oblivious adversary in general extensive-form games requires at least $\mathrm{exp}\left(Ω\left(\min\left\{m^{1/14}, ε^{-1/6}\right\}\right)\right)$ rounds. △ Less

Submitted 18 June, 2024; originally announced June 2024.

arXiv:2406.10631 [pdf, other]

Fast Last-Iterate Convergence of Learning in Games Requires Forgetful Algorithms

Authors: Yang Cai, Gabriele Farina, Julien Grand-Clément, Christian Kroer, Chung-Wei Lee, Haipeng Luo, Weiqiang Zheng

Abstract: Self-play via online learning is one of the premier ways to solve large-scale two-player zero-sum games, both in theory and practice. Particularly popular algorithms include optimistic multiplicative weights update (OMWU) and optimistic gradient-descent-ascent (OGDA). While both algorithms enjoy $O(1/T)$ ergodic convergence to Nash equilibrium in two-player zero-sum games, OMWU offers several adva… ▽ More Self-play via online learning is one of the premier ways to solve large-scale two-player zero-sum games, both in theory and practice. Particularly popular algorithms include optimistic multiplicative weights update (OMWU) and optimistic gradient-descent-ascent (OGDA). While both algorithms enjoy $O(1/T)$ ergodic convergence to Nash equilibrium in two-player zero-sum games, OMWU offers several advantages including logarithmic dependence on the size of the payoff matrix and $\widetilde{O}(1/T)$ convergence to coarse correlated equilibria even in general-sum games. However, in terms of last-iterate convergence in two-player zero-sum games, an increasingly popular topic in this area, OGDA guarantees that the duality gap shrinks at a rate of $O(1/\sqrt{T})$, while the best existing last-iterate convergence for OMWU depends on some game-dependent constant that could be arbitrarily large. This begs the question: is this potentially slow last-iterate convergence an inherent disadvantage of OMWU, or is the current analysis too loose? Somewhat surprisingly, we show that the former is true. More generally, we prove that a broad class of algorithms that do not forget the past quickly all suffer the same issue: for any arbitrarily small $δ>0$, there exists a $2\times 2$ matrix game such that the algorithm admits a constant duality gap even after $1/δ$ rounds. This class of algorithms includes OMWU and other standard optimistic follow-the-regularized-leader algorithms. △ Less

Submitted 15 June, 2024; originally announced June 2024.

Comments: 27 pages, 4 figures

arXiv:2402.16316 [pdf, ps, other]

Polynomial-Time Computation of Exact $Φ$-Equilibria in Polyhedral Games

Authors: Gabriele Farina, Charilaos Pipis

Abstract: It is a well-known fact that correlated equilibria can be computed in polynomial time in a large class of concisely represented games using the celebrated Ellipsoid Against Hope algorithm (Papadimitriou and Roughgarden, 2008; Jiang and Leyton-Brown, 2015). However, the landscape of efficiently computable equilibria in sequential (extensive-form) games remains unknown. The Ellipsoid Against Hope do… ▽ More It is a well-known fact that correlated equilibria can be computed in polynomial time in a large class of concisely represented games using the celebrated Ellipsoid Against Hope algorithm (Papadimitriou and Roughgarden, 2008; Jiang and Leyton-Brown, 2015). However, the landscape of efficiently computable equilibria in sequential (extensive-form) games remains unknown. The Ellipsoid Against Hope does not apply directly to these games, because they do not have the required "polynomial type" property. Despite this barrier, Huang and von Stengel (2008) altered the algorithm to compute exact extensive-form correlated equilibria. In this paper, we generalize the Ellipsoid Against Hope and develop a simple algorithmic framework for efficiently computing saddle-points in bilinear zero-sum games, even when one of the dimensions is exponentially large. Moreover, the framework only requires a "good-enough-response" oracle, which is a weakened notion of a best-response oracle. Using this machinery, we develop a general algorithmic framework for computing exact linear $Φ$-equilibria in any polyhedral game (under mild assumptions), including correlated equilibria in normal-form games, and extensive-form correlated equilibria in extensive-form games. This enables us to give the first polynomial-time algorithm for computing exact linear-deviation correlated equilibria in extensive-form games, thus resolving an open question by Farina and Pipis (2023). Furthermore, even for the cases for which a polynomial time algorithm for exact equilibria was already known, our framework provides a conceptually simpler solution. △ Less

Submitted 8 May, 2024; v1 submitted 26 February, 2024; originally announced February 2024.

arXiv:2402.09670 [pdf, ps, other]

Efficient $Φ$-Regret Minimization with Low-Degree Swap Deviations in Extensive-Form Games

Authors: Brian Hu Zhang, Ioannis Anagnostides, Gabriele Farina, Tuomas Sandholm

Abstract: Recent breakthrough results by Dagan, Daskalakis, Fishelson and Golowich [2023] and Peng and Rubinstein [2023] established an efficient algorithm attaining at most $ε$ swap regret over extensive-form strategy spaces of dimension $N$ in $N^{\tilde O(1/ε)}$ rounds. On the other extreme, Farina and Pipis [2023] developed an efficient algorithm for minimizing the weaker notion of linear-swap regret in… ▽ More Recent breakthrough results by Dagan, Daskalakis, Fishelson and Golowich [2023] and Peng and Rubinstein [2023] established an efficient algorithm attaining at most $ε$ swap regret over extensive-form strategy spaces of dimension $N$ in $N^{\tilde O(1/ε)}$ rounds. On the other extreme, Farina and Pipis [2023] developed an efficient algorithm for minimizing the weaker notion of linear-swap regret in $\mathsf{poly}(N)/ε^2$ rounds. In this paper, we take a step toward bridging the gap between those two results. We introduce the set of $k$-mediator deviations, which generalize the untimed communication deviations recently introduced by Zhang, Farina and Sandholm [2024] to the case of having multiple mediators. We develop parameterized algorithms for minimizing the regret with respect to this set of deviations in $N^{O(k)}/ε^2$ rounds. This closes the gap in the sense that $k=1$ recovers linear swap regret, while $k=N$ recovers swap regret. Moreover, by relating $k$-mediator deviations to low-degree polynomials, we show that regret minimization against degree-$k$ polynomial swap deviations is achievable in $N^{O(kd)^3}/ε^2$ rounds, where $d$ is the depth of the game, assuming constant branching factor. For a fixed degree $k$, this is polynomial for Bayesian games and quasipolynomial more broadly when $d = \mathsf{polylog} N$ -- the usual balancedness assumption on the game tree. △ Less

Submitted 17 February, 2024; v1 submitted 14 February, 2024; originally announced February 2024.

arXiv:2312.12067 [pdf, other]

Optimistic Policy Gradient in Multi-Player Markov Games with a Single Controller: Convergence Beyond the Minty Property

Authors: Ioannis Anagnostides, Ioannis Panageas, Gabriele Farina, Tuomas Sandholm

Abstract: Policy gradient methods enjoy strong practical performance in numerous tasks in reinforcement learning. Their theoretical understanding in multiagent settings, however, remains limited, especially beyond two-player competitive and potential Markov games. In this paper, we develop a new framework to characterize optimistic policy gradient methods in multi-player Markov games with a single controlle… ▽ More Policy gradient methods enjoy strong practical performance in numerous tasks in reinforcement learning. Their theoretical understanding in multiagent settings, however, remains limited, especially beyond two-player competitive and potential Markov games. In this paper, we develop a new framework to characterize optimistic policy gradient methods in multi-player Markov games with a single controller. Specifically, under the further assumption that the game exhibits an equilibrium collapse, in that the marginals of coarse correlated equilibria (CCE) induce Nash equilibria (NE), we show convergence to stationary $ε$-NE in $O(1/ε^2)$ iterations, where $O(\cdot)$ suppresses polynomial factors in the natural parameters of the game. Such an equilibrium collapse is well-known to manifest itself in two-player zero-sum Markov games, but also occurs even in a class of multi-player Markov games with separable interactions, as established by recent work. As a result, we bypass known complexity barriers for computing stationary NE when either of our assumptions fails. Our approach relies on a natural generalization of the classical Minty property that we introduce, which we anticipate to have further applications beyond Markov games. △ Less

Submitted 21 December, 2023; v1 submitted 19 December, 2023; originally announced December 2023.

Comments: To appear at AAAI 2024

arXiv:2312.03696 [pdf, other]

Efficient Learning in Polyhedral Games via Best Response Oracles

Authors: Darshan Chakrabarti, Gabriele Farina, Christian Kroer

Abstract: We study online learning and equilibrium computation in games with polyhedral decision sets, a property shared by both normal-form games and extensive-form games (EFGs), when the learning agent is restricted to using a best-response oracle. We show how to achieve constant regret in zero-sum games and $O(T^{1/4})$ regret in general-sum games while using only $O(\log t)$ best-response queries at a g… ▽ More We study online learning and equilibrium computation in games with polyhedral decision sets, a property shared by both normal-form games and extensive-form games (EFGs), when the learning agent is restricted to using a best-response oracle. We show how to achieve constant regret in zero-sum games and $O(T^{1/4})$ regret in general-sum games while using only $O(\log t)$ best-response queries at a given iteration $t$, thus improving over the best prior result, which required $O(T)$ queries per iteration. Moreover, our framework yields the first last-iterate convergence guarantees for self-play with best-response oracles in zero-sum games. This convergence occurs at a linear rate, though with a condition-number dependence. We go on to show a $O(1/\sqrt{T})$ best-iterate convergence rate without such a dependence. Our results build on linear-rate convergence results for variants of the Frank-Wolfe (FW) algorithm for strongly convex and smooth minimization problems over polyhedral domains. These FW results depend on a condition number of the polytope, known as facial distance. In order to enable application to settings such as EFGs, we show two broad new results: 1) the facial distance for polytopes in standard form is at least $γ/\sqrt{k}$ where $γ$ is the minimum value of a nonzero coordinate of a vertex of the polytope and $k\leq n$ is the number of tight inequality constraints in the optimal face, and 2) the facial distance for polytopes of the form $\mathbf{A}\boldsymbol{x}=\boldsymbol{b},\mathbf{C}\boldsymbol{x}\leq\boldsymbol{d}, \boldsymbol{x}\geq \mathbf{0}$ where $\boldsymbol{x}\in\mathbb{R}^n$, $\mathbf{C}\geq\boldsymbol{0}$ is a nonzero integral matrix, and $\boldsymbol{d}\geq \boldsymbol{0}$, is at least $1/(\|\mathbf{C}\|_\infty\sqrt{n})$. This yields the first such results for several problems such as sequence-form polytopes, flow polytopes, and matching polytopes. △ Less

Submitted 6 December, 2023; originally announced December 2023.

arXiv:2311.09712 [pdf, other]

Regularized Conventions: Equilibrium Computation as a Model of Pragmatic Reasoning

Authors: Athul Paul Jacob, Gabriele Farina, Jacob Andreas

Abstract: We present a model of pragmatic language understanding, where utterances are produced and understood by searching for regularized equilibria of signaling games. In this model (which we call ReCo, for Regularized Conventions), speakers and listeners search for contextually appropriate utterance--meaning mappings that are both close to game-theoretically optimal conventions and close to a shared, ''… ▽ More We present a model of pragmatic language understanding, where utterances are produced and understood by searching for regularized equilibria of signaling games. In this model (which we call ReCo, for Regularized Conventions), speakers and listeners search for contextually appropriate utterance--meaning mappings that are both close to game-theoretically optimal conventions and close to a shared, ''default'' semantics. By characterizing pragmatic communication as equilibrium search, we obtain principled sampling algorithms and formal guarantees about the trade-off between communicative success and naturalness. Across several datasets capturing real and idealized human judgments about pragmatic implicatures, ReCo matches or improves upon predictions made by best response and rational speech act models of language understanding. △ Less

Submitted 16 November, 2023; originally announced November 2023.

arXiv:2311.05918 [pdf, other]

Reliable Broadcast despite Mobile Byzantine Faults

Authors: Silvia Bonomi, Giovanni Farina, Sébastien Tixeuil

Abstract: We investigate the solvability of the Byzantine Reliable Broadcast and Byzantine Broadcast Channel problems in distributed systems affected by Mobile Byzantine Faults. We show that both problems are not solvable even in one of the most constrained system models for mobile Byzantine faults defined so far. By endowing processes with an additional local failure oracle, we provide a solution to the By… ▽ More We investigate the solvability of the Byzantine Reliable Broadcast and Byzantine Broadcast Channel problems in distributed systems affected by Mobile Byzantine Faults. We show that both problems are not solvable even in one of the most constrained system models for mobile Byzantine faults defined so far. By endowing processes with an additional local failure oracle, we provide a solution to the Byzantine Broadcast Channel problem. △ Less

Submitted 10 November, 2023; originally announced November 2023.

arXiv:2311.00676 [pdf, other]

Last-Iterate Convergence Properties of Regret-Matching Algorithms in Games

Authors: Yang Cai, Gabriele Farina, Julien Grand-Clément, Christian Kroer, Chung-Wei Lee, Haipeng Luo, Weiqiang Zheng

Abstract: Algorithms based on regret matching, specifically regret matching$^+$ (RM$^+$), and its variants are the most popular approaches for solving large-scale two-player zero-sum games in practice. Unlike algorithms such as optimistic gradient descent ascent, which have strong last-iterate and ergodic convergence properties for zero-sum games, virtually nothing is known about the last-iterate properties… ▽ More Algorithms based on regret matching, specifically regret matching$^+$ (RM$^+$), and its variants are the most popular approaches for solving large-scale two-player zero-sum games in practice. Unlike algorithms such as optimistic gradient descent ascent, which have strong last-iterate and ergodic convergence properties for zero-sum games, virtually nothing is known about the last-iterate properties of regret-matching algorithms. Given the importance of last-iterate convergence for numerical optimization reasons and relevance as modeling real-word learning in games, in this paper, we study the last-iterate convergence properties of various popular variants of RM$^+$. First, we show numerically that several practical variants such as simultaneous RM$^+$, alternating RM$^+$, and simultaneous predictive RM$^+$, all lack last-iterate convergence guarantees even on a simple $3\times 3$ game. We then prove that recent variants of these algorithms based on a smoothing technique do enjoy last-iterate convergence: we prove that extragradient RM$^{+}$ and smooth Predictive RM$^+$ enjoy asymptotic last-iterate convergence (without a rate) and $1/\sqrt{t}$ best-iterate convergence. Finally, we introduce restarted variants of these algorithms, and show that they enjoy linear-rate last-iterate convergence. △ Less

Submitted 1 November, 2023; originally announced November 2023.

arXiv:2310.15935 [pdf, other]

Mediator Interpretation and Faster Learning Algorithms for Linear Correlated Equilibria in General Extensive-Form Games

Authors: Brian Hu Zhang, Gabriele Farina, Tuomas Sandholm

Abstract: A recent paper by Farina & Pipis (2023) established the existence of uncoupled no-linear-swap regret dynamics with polynomial-time iterations in extensive-form games. The equilibrium points reached by these dynamics, known as linear correlated equilibria, are currently the tightest known relaxation of correlated equilibrium that can be learned in polynomial time in any finite extensive-form game.… ▽ More A recent paper by Farina & Pipis (2023) established the existence of uncoupled no-linear-swap regret dynamics with polynomial-time iterations in extensive-form games. The equilibrium points reached by these dynamics, known as linear correlated equilibria, are currently the tightest known relaxation of correlated equilibrium that can be learned in polynomial time in any finite extensive-form game. However, their properties remain vastly unexplored, and their computation is onerous. In this paper, we provide several contributions shedding light on the fundamental nature of linear-swap regret. First, we show a connection between linear deviations and a generalization of communication deviations in which the player can make queries to a "mediator" who replies with action recommendations, and, critically, the player is not constrained to match the timing of the game as would be the case for communication deviations. We coin this latter set the untimed communication (UTC) deviations. We show that the UTC deviations coincide precisely with the linear deviations, and therefore that any player minimizing UTC regret also minimizes linear-swap regret. We then leverage this connection to develop state-of-the-art no-regret algorithms for computing linear correlated equilibria, both in theory and in practice. In theory, our algorithms achieve polynomially better per-iteration runtimes; in practice, our algorithms represent the state of the art by several orders of magnitude. △ Less

Submitted 15 March, 2024; v1 submitted 24 October, 2023; originally announced October 2023.

arXiv:2310.09139 [pdf, other]

The Consensus Game: Language Model Generation via Equilibrium Search

Authors: Athul Paul Jacob, Yikang Shen, Gabriele Farina, Jacob Andreas

Abstract: When applied to question answering and other text generation tasks, language models (LMs) may be queried generatively (by sampling answers from their output distribution) or discriminatively (by using them to score or rank a set of candidate outputs). These procedures sometimes yield very different predictions. How do we reconcile mutually incompatible scoring procedures to obtain coherent LM pred… ▽ More When applied to question answering and other text generation tasks, language models (LMs) may be queried generatively (by sampling answers from their output distribution) or discriminatively (by using them to score or rank a set of candidate outputs). These procedures sometimes yield very different predictions. How do we reconcile mutually incompatible scoring procedures to obtain coherent LM predictions? We introduce a new, a training-free, game-theoretic procedure for language model decoding. Our approach casts language model decoding as a regularized imperfect-information sequential signaling game - which we term the CONSENSUS GAME - in which a GENERATOR seeks to communicate an abstract correctness parameter using natural language sentences to a DISCRIMINATOR. We develop computational procedures for finding approximate equilibria of this game, resulting in a decoding algorithm we call EQUILIBRIUM-RANKING. Applied to a large number of tasks (including reading comprehension, commonsense reasoning, mathematical problem-solving, and dialog), EQUILIBRIUM-RANKING consistently, and sometimes substantially, improves performance over existing LM decoding procedures - on multiple benchmarks, we observe that applying EQUILIBRIUM-RANKING to LLaMA-7B outperforms the much larger LLaMA-65B and PaLM-540B models. These results highlight the promise of game-theoretic tools for addressing fundamental challenges of truthfulness and consistency in LMs. △ Less

Submitted 13 October, 2023; originally announced October 2023.

arXiv:2308.16017 [pdf, ps, other]

Hidden-Role Games: Equilibrium Concepts and Computation

Authors: Luca Carminati, Brian Hu Zhang, Gabriele Farina, Nicola Gatti, Tuomas Sandholm

Abstract: In this paper, we study the class of games known as hidden-role games in which players are assigned privately to teams and are faced with the challenge of recognizing and cooperating with teammates. This model includes both popular recreational games such as the Mafia/Werewolf family and The Resistance (Avalon) and many real-world settings, such as distributed systems where nodes need to work toge… ▽ More In this paper, we study the class of games known as hidden-role games in which players are assigned privately to teams and are faced with the challenge of recognizing and cooperating with teammates. This model includes both popular recreational games such as the Mafia/Werewolf family and The Resistance (Avalon) and many real-world settings, such as distributed systems where nodes need to work together to accomplish a goal in the face of possible corruptions. There has been little to no formal mathematical grounding of such settings in the literature, and it was previously not even clear what the right solution concepts (notions of equilibria) should be. A suitable notion of equilibrium should take into account the communication channels available to the players (e.g., can they communicate? Can they communicate in private?). Defining such suitable notions turns out to be a nontrivial task with several surprising consequences. In this paper, we provide the first rigorous definition of equilibrium for hidden-role games, which overcomes serious limitations of other solution concepts not designed for hidden-role games. We then show that in certain cases, including the above recreational games, optimal equilibria can be computed efficiently. In most other cases, we show that computing an optimal equilibrium is at least NP-hard or coNP-hard. Lastly, we experimentally validate our approach by computing exact equilibria for complete 5- and 6-player Avalon instances whose size in terms of number of information sets is larger than $10^{56}$. △ Less

Submitted 17 February, 2024; v1 submitted 30 August, 2023; originally announced August 2023.

arXiv:2307.05448 [pdf, other]

Polynomial-Time Linear-Swap Regret Minimization in Imperfect-Information Sequential Games

Authors: Gabriele Farina, Charilaos Pipis

Abstract: No-regret learners seek to minimize the difference between the loss they cumulated through the actions they played, and the loss they would have cumulated in hindsight had they consistently modified their behavior according to some strategy transformation function. The size of the set of transformations considered by the learner determines a natural notion of rationality. As the set of transformat… ▽ More No-regret learners seek to minimize the difference between the loss they cumulated through the actions they played, and the loss they would have cumulated in hindsight had they consistently modified their behavior according to some strategy transformation function. The size of the set of transformations considered by the learner determines a natural notion of rationality. As the set of transformations each learner considers grows, the strategies played by the learners recover more complex game-theoretic equilibria, including correlated equilibria in normal-form games and extensive-form correlated equilibria in extensive-form games. At the extreme, a no-swap-regret agent is one that minimizes regret against the set of all functions from the set of strategies to itself. While it is known that the no-swap-regret condition can be attained efficiently in nonsequential (normal-form) games, understanding what is the strongest notion of rationality that can be attained efficiently in the worst case in sequential (extensive-form) games is a longstanding open problem. In this paper we provide a positive result, by showing that it is possible, in any sequential game, to retain polynomial-time (in the game tree size) iterations while achieving sublinear regret with respect to all linear transformations of the mixed strategy space, a notion called no-linear-swap regret. This notion of hindsight rationality is as strong as no-swap-regret in nonsequential games, and stronger than no-trigger-regret in sequential games -- thereby proving the existence of a subset of extensive-form correlated equilibria robust to linear deviations, which we call linear-deviation correlated equilibria, that can be approached efficiently. △ Less

Submitted 7 November, 2023; v1 submitted 11 July, 2023; originally announced July 2023.

Comments: Accepted for publication at NeurIPS 2023

arXiv:2306.05221 [pdf, other]

Steering No-Regret Learners to a Desired Equilibrium

Authors: Brian Hu Zhang, Gabriele Farina, Ioannis Anagnostides, Federico Cacciamani, Stephen Marcus McAleer, Andreas Alexander Haupt, Andrea Celli, Nicola Gatti, Vincent Conitzer, Tuomas Sandholm

Abstract: A mediator observes no-regret learners playing an extensive-form game repeatedly across $T$ rounds. The mediator attempts to steer players toward some desirable predetermined equilibrium by giving (nonnegative) payments to players. We call this the steering problem. The steering problem captures problems several problems of interest, among them equilibrium selection and information design (persuas… ▽ More A mediator observes no-regret learners playing an extensive-form game repeatedly across $T$ rounds. The mediator attempts to steer players toward some desirable predetermined equilibrium by giving (nonnegative) payments to players. We call this the steering problem. The steering problem captures problems several problems of interest, among them equilibrium selection and information design (persuasion). If the mediator's budget is unbounded, steering is trivial because the mediator can simply pay the players to play desirable actions. We study two bounds on the mediator's payments: a total budget and a per-round budget. If the mediator's total budget does not grow with $T$, we show that steering is impossible. However, we show that it is enough for the total budget to grow sublinearly with $T$, that is, for the average payment to vanish. When players' full strategies are observed at each round, we show that constant per-round budgets permit steering. In the more challenging setting where only trajectories through the game tree are observable, we show that steering is impossible with constant per-round budgets in general extensive-form games, but possible in normal-form games or if the per-round budget may itself depend on $T$. We also show how our results can be generalized to the case when the equilibrium is being computed online while steering is happening. We supplement our theoretical positive results with experiments highlighting the efficacy of steering in large games. △ Less

Submitted 17 February, 2024; v1 submitted 8 June, 2023; originally announced June 2023.

arXiv:2306.05216 [pdf, ps, other]

Computing Optimal Equilibria and Mechanisms via Learning in Zero-Sum Extensive-Form Games

Authors: Brian Hu Zhang, Gabriele Farina, Ioannis Anagnostides, Federico Cacciamani, Stephen Marcus McAleer, Andreas Alexander Haupt, Andrea Celli, Nicola Gatti, Vincent Conitzer, Tuomas Sandholm

Abstract: We introduce a new approach for computing optimal equilibria via learning in games. It applies to extensive-form settings with any number of players, including mechanism design, information design, and solution concepts such as correlated, communication, and certification equilibria. We observe that optimal equilibria are minimax equilibrium strategies of a player in an extensive-form zero-sum gam… ▽ More We introduce a new approach for computing optimal equilibria via learning in games. It applies to extensive-form settings with any number of players, including mechanism design, information design, and solution concepts such as correlated, communication, and certification equilibria. We observe that optimal equilibria are minimax equilibrium strategies of a player in an extensive-form zero-sum game. This reformulation allows to apply techniques for learning in zero-sum games, yielding the first learning dynamics that converge to optimal equilibria, not only in empirical averages, but also in iterates. We demonstrate the practical scalability and flexibility of our approach by attaining state-of-the-art performance in benchmark tabular games, and by computing an optimal mechanism for a sequential auction design problem using deep reinforcement learning. △ Less

Submitted 23 May, 2024; v1 submitted 8 June, 2023; originally announced June 2023.

arXiv:2305.14709 [pdf, ps, other]

Regret Matching+: (In)Stability and Fast Convergence in Games

Authors: Gabriele Farina, Julien Grand-Clément, Christian Kroer, Chung-Wei Lee, Haipeng Luo

Abstract: Regret Matching+ (RM+) and its variants are important algorithms for solving large-scale games. However, a theoretical understanding of their success in practice is still a mystery. Moreover, recent advances on fast convergence in games are limited to no-regret algorithms such as online mirror descent, which satisfy stability. In this paper, we first give counterexamples showing that RM+ and its p… ▽ More Regret Matching+ (RM+) and its variants are important algorithms for solving large-scale games. However, a theoretical understanding of their success in practice is still a mystery. Moreover, recent advances on fast convergence in games are limited to no-regret algorithms such as online mirror descent, which satisfy stability. In this paper, we first give counterexamples showing that RM+ and its predictive version can be unstable, which might cause other players to suffer large regret. We then provide two fixes: restarting and chopping off the positive orthant that RM+ works in. We show that these fixes are sufficient to get $O(T^{1/4})$ individual regret and $O(1)$ social regret in normal-form games via RM+ with predictions. We also apply our stabilizing techniques to clairvoyant updates in the uncoupled learning setting for RM+ and prove desirable results akin to recent works for Clairvoyant online mirror descent. Our experiments show the advantages of our algorithms over vanilla RM+-based algorithms in matrix and extensive-form games. △ Less

Submitted 24 May, 2023; originally announced May 2023.

arXiv:2304.13138 [pdf, other]

The Update-Equivalence Framework for Decision-Time Planning

Authors: Samuel Sokota, Gabriele Farina, David J. Wu, Hengyuan Hu, Kevin A. Wang, J. Zico Kolter, Noam Brown

Abstract: The process of revising (or constructing) a policy at execution time -- known as decision-time planning -- has been key to achieving superhuman performance in perfect-information games like chess and Go. A recent line of work has extended decision-time planning to imperfect-information games, leading to superhuman performance in poker. However, these methods involve solving subgames whose sizes gr… ▽ More The process of revising (or constructing) a policy at execution time -- known as decision-time planning -- has been key to achieving superhuman performance in perfect-information games like chess and Go. A recent line of work has extended decision-time planning to imperfect-information games, leading to superhuman performance in poker. However, these methods involve solving subgames whose sizes grow quickly in the amount of non-public information, making them unhelpful when the amount of non-public information is large. Motivated by this issue, we introduce an alternative framework for decision-time planning that is not based on solving subgames, but rather on update equivalence. In this update-equivalence framework, decision-time planning algorithms replicate the updates of last-iterate algorithms, which need not rely on public information. This facilitates scalability to games with large amounts of non-public information. Using this framework, we derive a provably sound search algorithm for fully cooperative games based on mirror descent and a search algorithm for adversarial games based on magnetic mirror descent. We validate the performance of these algorithms in cooperative and adversarial domains, notably in Hanabi, the standard benchmark for search in fully cooperative imperfect-information games. Here, our mirror descent approach exceeds or matches the performance of public information-based search while using two orders of magnitude less search time. This is the first instance of a non-public-information-based algorithm outperforming public-information-based approaches in a domain they have historically dominated. △ Less

Submitted 13 May, 2024; v1 submitted 25 April, 2023; originally announced April 2023.

arXiv:2303.12817 [pdf, other]

IRIS: a Record and Replay Framework to Enable Hardware-assisted Virtualization Fuzzing

Authors: Carmine Cesarano, Marcello Cinque, Domenico Cotroneo, Luigi De Simone, Giorgio Farina

Abstract: Nowadays, industries are looking into virtualization as an effective means to build safe applications, thanks to the isolation it can provide among virtual machines (VMs) running on the same hardware. In this context, a fundamental issue is understanding to what extent the isolation is guaranteed, despite possible (or induced) problems in the virtualization mechanisms. Uncovering such isolation is… ▽ More Nowadays, industries are looking into virtualization as an effective means to build safe applications, thanks to the isolation it can provide among virtual machines (VMs) running on the same hardware. In this context, a fundamental issue is understanding to what extent the isolation is guaranteed, despite possible (or induced) problems in the virtualization mechanisms. Uncovering such isolation issues is still an open challenge, especially for hardware-assisted virtualization, since the search space should include all the possible VM states (and the linked hypervisor state), which is prohibitive. In this paper, we propose IRIS, a framework to record (learn) sequences of inputs (i.e., VM seeds) from the real guest execution (e.g., OS boot), replay them as-is to reach valid and complex VM states, and finally use them as valid seed to be mutated for enabling fuzzing solutions for hardware-assisted hypervisors. We demonstrate the accuracy and efficiency of IRIS in automatically reproducing valid VM behaviors, with no need to execute guest workloads. We also provide a proof-of-concept fuzzer, based on the proposed architecture, showing its potential on the Xen hypervisor. △ Less

Submitted 22 March, 2023; originally announced March 2023.

Comments: 13 pages, Accepted for publication at The 53rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)

arXiv:2301.11241 [pdf, other]

On the Convergence of No-Regret Learning Dynamics in Time-Varying Games

Authors: Ioannis Anagnostides, Ioannis Panageas, Gabriele Farina, Tuomas Sandholm

Abstract: Most of the literature on learning in games has focused on the restrictive setting where the underlying repeated game does not change over time. Much less is known about the convergence of no-regret learning algorithms in dynamic multiagent settings. In this paper, we characterize the convergence of optimistic gradient descent (OGD) in time-varying games. Our framework yields sharp convergence bou… ▽ More Most of the literature on learning in games has focused on the restrictive setting where the underlying repeated game does not change over time. Much less is known about the convergence of no-regret learning algorithms in dynamic multiagent settings. In this paper, we characterize the convergence of optimistic gradient descent (OGD) in time-varying games. Our framework yields sharp convergence bounds for the equilibrium gap of OGD in zero-sum games parameterized on natural variation measures of the sequence of games, subsuming known results for static games. Furthermore, we establish improved second-order variation bounds under strong convexity-concavity, as long as each game is repeated multiple times. Our results also apply to time-varying general-sum multi-player games via a bilinear formulation of correlated equilibria, which has novel implications for meta-learning and for obtaining refined variation-dependent regret bounds, addressing questions left open in prior papers. Finally, we leverage our framework to also provide new insights on dynamic regret guarantees in static games. △ Less

Submitted 18 October, 2023; v1 submitted 26 January, 2023; originally announced January 2023.

Comments: To appear at NeurIPS 2023; V3 incorporates reviewers' feedback and minor corrections

arXiv:2210.05492 [pdf, other]

Mastering the Game of No-Press Diplomacy via Human-Regularized Reinforcement Learning and Planning

Authors: Anton Bakhtin, David J Wu, Adam Lerer, Jonathan Gray, Athul Paul Jacob, Gabriele Farina, Alexander H Miller, Noam Brown

Abstract: No-press Diplomacy is a complex strategy game involving both cooperation and competition that has served as a benchmark for multi-agent AI research. While self-play reinforcement learning has resulted in numerous successes in purely adversarial games like chess, Go, and poker, self-play alone is insufficient for achieving optimal performance in domains involving cooperation with humans. We address… ▽ More No-press Diplomacy is a complex strategy game involving both cooperation and competition that has served as a benchmark for multi-agent AI research. While self-play reinforcement learning has resulted in numerous successes in purely adversarial games like chess, Go, and poker, self-play alone is insufficient for achieving optimal performance in domains involving cooperation with humans. We address this shortcoming by first introducing a planning algorithm we call DiL-piKL that regularizes a reward-maximizing policy toward a human imitation-learned policy. We prove that this is a no-regret learning algorithm under a modified utility function. We then show that DiL-piKL can be extended into a self-play reinforcement learning algorithm we call RL-DiL-piKL that provides a model of human play while simultaneously training an agent that responds well to this human model. We used RL-DiL-piKL to train an agent we name Diplodocus. In a 200-game no-press Diplomacy tournament involving 62 human participants spanning skill levels from beginner to expert, two Diplodocus agents both achieved a higher average score than all other participants who played more than two games, and ranked first and third according to an Elo ratings model. △ Less

Submitted 11 October, 2022; originally announced October 2022.

arXiv:2209.14110 [pdf, other]

Meta-Learning in Games

Authors: Keegan Harris, Ioannis Anagnostides, Gabriele Farina, Mikhail Khodak, Zhiwei Steven Wu, Tuomas Sandholm

Abstract: In the literature on game-theoretic equilibrium finding, focus has mainly been on solving a single game in isolation. In practice, however, strategic interactions -- ranging from routing problems to online advertising auctions -- evolve dynamically, thereby leading to many similar games to be solved. To address this gap, we introduce meta-learning for equilibrium finding and learning to play games… ▽ More In the literature on game-theoretic equilibrium finding, focus has mainly been on solving a single game in isolation. In practice, however, strategic interactions -- ranging from routing problems to online advertising auctions -- evolve dynamically, thereby leading to many similar games to be solved. To address this gap, we introduce meta-learning for equilibrium finding and learning to play games. We establish the first meta-learning guarantees for a variety of fundamental and well-studied classes of games, including two-player zero-sum games, general-sum games, and Stackelberg games. In particular, we obtain rates of convergence to different game-theoretic equilibria that depend on natural notions of similarity between the sequence of games encountered, while at the same time recovering the known single-game guarantees when the sequence of games is arbitrary. Along the way, we prove a number of new results in the single-game regime through a simple and unified framework, which may be of independent interest. Finally, we evaluate our meta-learning algorithms on endgames faced by the poker agent Libratus against top human professionals. The experiments show that games with varying stack sizes can be solved significantly faster using our meta-learning techniques than by solving them separately, often by an order of magnitude. △ Less

Submitted 1 March, 2023; v1 submitted 28 September, 2022; originally announced September 2022.

Comments: In the eleventh Conference on Learning Representations (ICLR 2023)

arXiv:2209.01843 [pdf, other]

RunPHI: Enabling Mixed-criticality Containers via Partitioning Hypervisors in Industry 4.0

Authors: Marco Barletta, Marcello Cinque, Luigi De Simone, Raffaele Della Corte, Giorgio Farina, Daniele Ottaviano

Abstract: Orchestration systems are becoming a key component to automatically manage distributed computing resources in many fields with criticality requirements like Industry 4.0 (I4.0). However, they are mainly linked to OS-level virtualization, which is known to suffer from reduced isolation. In this paper, we propose RunPHI with the aim of integrating partitioning hypervisors, as a solution for assuring… ▽ More Orchestration systems are becoming a key component to automatically manage distributed computing resources in many fields with criticality requirements like Industry 4.0 (I4.0). However, they are mainly linked to OS-level virtualization, which is known to suffer from reduced isolation. In this paper, we propose RunPHI with the aim of integrating partitioning hypervisors, as a solution for assuring strong isolation, with OS-level orchestration systems. The purpose is to enable container orchestration in mixed-criticality systems with isolation requirements through partitioned containers. △ Less

Submitted 5 September, 2022; originally announced September 2022.

Comments: 2 pages, accepted for publication in Proc. ISSREW, 2022

arXiv:2208.14891 [pdf, ps, other]

Clairvoyant Regret Minimization: Equivalence with Nemirovski's Conceptual Prox Method and Extension to General Convex Games

Authors: Gabriele Farina, Christian Kroer, Chung-Wei Lee, Haipeng Luo

Abstract: A recent paper by Piliouras et al. [2021, 2022] introduces an uncoupled learning algorithm for normal-form games -- called Clairvoyant MWU (CMWU). In this note we show that CMWU is equivalent to the conceptual prox method described by Nemirovski [2004]. This connection immediately shows that it is possible to extend the CMWU algorithm to any convex game, a question left open by Piliouras et al. We… ▽ More A recent paper by Piliouras et al. [2021, 2022] introduces an uncoupled learning algorithm for normal-form games -- called Clairvoyant MWU (CMWU). In this note we show that CMWU is equivalent to the conceptual prox method described by Nemirovski [2004]. This connection immediately shows that it is possible to extend the CMWU algorithm to any convex game, a question left open by Piliouras et al. We call the resulting algorithm -- again equivalent to the conceptual prox method -- Clairvoyant OMD. At the same time, we show that our analysis yields an improved regret bound compared to the original bound by Piliouras et al., in that the regret of CMWU scales only with the square root of the number of players, rather than the number of players themselves. △ Less

Submitted 31 August, 2022; originally announced August 2022.

arXiv:2208.09747 [pdf, ps, other]

Near-Optimal $Φ$-Regret Learning in Extensive-Form Games

Authors: Ioannis Anagnostides, Gabriele Farina, Tuomas Sandholm

Abstract: In this paper, we establish efficient and uncoupled learning dynamics so that, when employed by all players in multiplayer perfect-recall imperfect-information extensive-form games, the trigger regret of each player grows as $O(\log T)$ after $T$ repetitions of play. This improves exponentially over the prior best known trigger-regret bound of $O(T^{1/4})$, and settles a recent open question by Ba… ▽ More In this paper, we establish efficient and uncoupled learning dynamics so that, when employed by all players in multiplayer perfect-recall imperfect-information extensive-form games, the trigger regret of each player grows as $O(\log T)$ after $T$ repetitions of play. This improves exponentially over the prior best known trigger-regret bound of $O(T^{1/4})$, and settles a recent open question by Bai et al. (2022). As an immediate consequence, we guarantee convergence to the set of extensive-form correlated equilibria and coarse correlated equilibria at a near-optimal rate of $\frac{\log T}{T}$. Building on prior work, at the heart of our construction lies a more general result regarding fixed points deriving from rational functions with polynomial degree, a property that we establish for the fixed points of (coarse) trigger deviation functions. Moreover, our construction leverages a refined regret circuit for the convex hull, which -- unlike prior guarantees -- preserves the RVU property introduced by Syrgkanis et al. (NIPS, 2015); this observation has an independent interest in establishing near-optimal regret under learning dynamics based on a CFR-type decomposition of the regret. △ Less

Submitted 19 September, 2023; v1 submitted 20 August, 2022; originally announced August 2022.

Comments: Appearing at ICML 2023. V3 corrects a statement

arXiv:2206.14637 [pdf, other]

Assessing Intel's Memory Bandwidth Allocation for resource limitation in real-time systems

Authors: Giorgio Farina, Gautam Gala, Marcello Cinque, Gerhard Fohler

Abstract: Industries are recently considering the adoption of cloud computing for hosting safety critical applications. However, the use of multicore processors usually adopted in the cloud introduces temporal anomalies due to contention for shared resources, such as the memory subsystem. In this paper we explore the potential of Intel's Memory Bandwidth Allocation (MBA) technology, available on Xeon Scalab… ▽ More Industries are recently considering the adoption of cloud computing for hosting safety critical applications. However, the use of multicore processors usually adopted in the cloud introduces temporal anomalies due to contention for shared resources, such as the memory subsystem. In this paper we explore the potential of Intel's Memory Bandwidth Allocation (MBA) technology, available on Xeon Scalable processors. By adopting a systematic measurement approach on real hardware, we assess the indirect memory bandwidth limitation achievable by applying MBA delays, showing that only given delay values (namely 70, 80 and 90) are effective in our setting. We also test the derived bandwidth assured to a hypothetical critical core when interfering cores (e.g., generating a concurrent memory access workload) are present on the same machine. Our results can support designers by providing understanding of impact of the shared memory to enable predictable progress of safety critical applications in cloud environments. △ Less

Submitted 29 June, 2022; originally announced June 2022.

Comments: 8 pages, to appear in proceedings of "The 25th International Symposium On Real-Time Distributed Computing ISORC"

arXiv:2206.08742 [pdf, other]

Near-Optimal No-Regret Learning Dynamics for General Convex Games

Authors: Gabriele Farina, Ioannis Anagnostides, Haipeng Luo, Chung-Wei Lee, Christian Kroer, Tuomas Sandholm

Abstract: A recent line of work has established uncoupled learning dynamics such that, when employed by all players in a game, each player's \emph{regret} after $T$ repetitions grows polylogarithmically in $T$, an exponential improvement over the traditional guarantees within the no-regret framework. However, so far these results have only been limited to certain classes of games with structured strategy sp… ▽ More A recent line of work has established uncoupled learning dynamics such that, when employed by all players in a game, each player's \emph{regret} after $T$ repetitions grows polylogarithmically in $T$, an exponential improvement over the traditional guarantees within the no-regret framework. However, so far these results have only been limited to certain classes of games with structured strategy spaces -- such as normal-form and extensive-form games. The question as to whether $O(\text{polylog} T)$ regret bounds can be obtained for general convex and compact strategy sets -- which occur in many fundamental models in economics and multiagent systems -- while retaining efficient strategy updates is an important question. In this paper, we answer this in the positive by establishing the first uncoupled learning algorithm with $O(\log T)$ per-player regret in general \emph{convex games}, that is, games with concave utility functions supported on arbitrary convex and compact strategy sets. Our learning dynamics are based on an instantiation of optimistic follow-the-regularized-leader over an appropriately \emph{lifted} space using a \emph{self-concordant regularizer} that is, peculiarly, not a barrier for the feasible region. Further, our learning dynamics are efficiently implementable given access to a proximal oracle for the convex strategy set, leading to $O(\log\log T)$ per-iteration complexity; we also give extensions when access to only a \emph{linear} optimization oracle is assumed. Finally, we adapt our dynamics to guarantee $O(\sqrt{T})$ regret in the adversarial regime. Even in those special cases where prior results apply, our algorithm improves over the state-of-the-art regret bounds either in terms of the dependence on the number of iterations or on the dimension of the strategy sets. △ Less

Submitted 16 October, 2022; v1 submitted 17 June, 2022; originally announced June 2022.

Comments: To appear at NeurIPS 2022. V2 incorporates reviewers' feedback

arXiv:2206.04122 [pdf, other]

ESCHER: Eschewing Importance Sampling in Games by Computing a History Value Function to Estimate Regret

Authors: Stephen McAleer, Gabriele Farina, Marc Lanctot, Tuomas Sandholm

Abstract: Recent techniques for approximating Nash equilibria in very large games leverage neural networks to learn approximately optimal policies (strategies). One promising line of research uses neural networks to approximate counterfactual regret minimization (CFR) or its modern variants. DREAM, the only current CFR-based neural method that is model free and therefore scalable to very large games, trains… ▽ More Recent techniques for approximating Nash equilibria in very large games leverage neural networks to learn approximately optimal policies (strategies). One promising line of research uses neural networks to approximate counterfactual regret minimization (CFR) or its modern variants. DREAM, the only current CFR-based neural method that is model free and therefore scalable to very large games, trains a neural network on an estimated regret target that can have extremely high variance due to an importance sampling term inherited from Monte Carlo CFR (MCCFR). In this paper we propose an unbiased model-free method that does not require any importance sampling. Our method, ESCHER, is principled and is guaranteed to converge to an approximate Nash equilibrium with high probability. We show that the variance of the estimated regret of ESCHER is orders of magnitude lower than DREAM and other baselines. We then show that ESCHER outperforms the prior state of the art -- DREAM and neural fictitious self play (NFSP) -- on a number of games and the difference becomes dramatic as game size increases. In the very large game of dark chess, ESCHER is able to beat DREAM and NFSP in a head-to-head competition over $90\%$ of the time. △ Less

Submitted 11 October, 2022; v1 submitted 8 June, 2022; originally announced June 2022.

arXiv:2204.11417 [pdf, other]

Uncoupled Learning Dynamics with $O(\log T)$ Swap Regret in Multiplayer Games

Authors: Ioannis Anagnostides, Gabriele Farina, Christian Kroer, Chung-Wei Lee, Haipeng Luo, Tuomas Sandholm

Abstract: In this paper we establish efficient and \emph{uncoupled} learning dynamics so that, when employed by all players in a general-sum multiplayer game, the \emph{swap regret} of each player after $T$ repetitions of the game is bounded by $O(\log T)$, improving over the prior best bounds of $O(\log^4 (T))$. At the same time, we guarantee optimal $O(\sqrt{T})$ swap regret in the adversarial regime as w… ▽ More In this paper we establish efficient and \emph{uncoupled} learning dynamics so that, when employed by all players in a general-sum multiplayer game, the \emph{swap regret} of each player after $T$ repetitions of the game is bounded by $O(\log T)$, improving over the prior best bounds of $O(\log^4 (T))$. At the same time, we guarantee optimal $O(\sqrt{T})$ swap regret in the adversarial regime as well. To obtain these results, our primary contribution is to show that when all players follow our dynamics with a \emph{time-invariant} learning rate, the \emph{second-order path lengths} of the dynamics up to time $T$ are bounded by $O(\log T)$, a fundamental property which could have further implications beyond near-optimally bounding the (swap) regret. Our proposed learning dynamics combine in a novel way \emph{optimistic} regularized learning with the use of \emph{self-concordant barriers}. Further, our analysis is remarkably simple, bypassing the cumbersome framework of higher-order smoothness recently developed by Daskalakis, Fishelson, and Golowich (NeurIPS'21). △ Less

Submitted 5 October, 2022; v1 submitted 24 April, 2022; originally announced April 2022.

Comments: To appear at NeurIPS 2022. V2 incorporates reviewers' feedback and minor corrections

arXiv:2203.12074 [pdf, other]

Optimistic Mirror Descent Either Converges to Nash or to Strong Coarse Correlated Equilibria in Bimatrix Games

Authors: Ioannis Anagnostides, Gabriele Farina, Ioannis Panageas, Tuomas Sandholm

Abstract: We show that, for any sufficiently small fixed $ε> 0$, when both players in a general-sum two-player (bimatrix) game employ optimistic mirror descent (OMD) with smooth regularization, learning rate $η= O(ε^2)$ and $T = Ω(\text{poly}(1/ε))$ repetitions, either the dynamics reach an $ε$-approximate Nash equilibrium (NE), or the average correlated distribution of play is an $Ω(\text{poly}(ε))$-strong… ▽ More We show that, for any sufficiently small fixed $ε> 0$, when both players in a general-sum two-player (bimatrix) game employ optimistic mirror descent (OMD) with smooth regularization, learning rate $η= O(ε^2)$ and $T = Ω(\text{poly}(1/ε))$ repetitions, either the dynamics reach an $ε$-approximate Nash equilibrium (NE), or the average correlated distribution of play is an $Ω(\text{poly}(ε))$-strong coarse correlated equilibrium (CCE): any possible unilateral deviation does not only leave the player worse, but will decrease its utility by $Ω(\text{poly}(ε))$. As an immediate consequence, when the iterates of OMD are bounded away from being Nash equilibria in a bimatrix game, we guarantee convergence to an exact CCE after only $O(1)$ iterations. Our results reveal that uncoupled no-regret learning algorithms can converge to CCE in general-sum games remarkably faster than to NE in, for example, zero-sum games. To establish this, we show that when OMD does not reach arbitrarily close to a NE, the (cumulative) regret of both players is not only negative, but decays linearly with time. Given that regret is the canonical measure of performance in online learning, our results suggest that cycling behavior of no-regret learning algorithms in games can be justified in terms of efficiency. △ Less

Submitted 6 October, 2022; v1 submitted 22 March, 2022; originally announced March 2022.

Comments: To appear at NeurIPS 2022. V2 incorporates reviewers' feedback

arXiv:2203.12056 [pdf, other]

On Last-Iterate Convergence Beyond Zero-Sum Games

Authors: Ioannis Anagnostides, Ioannis Panageas, Gabriele Farina, Tuomas Sandholm

Abstract: Most existing results about \emph{last-iterate convergence} of learning dynamics are limited to two-player zero-sum games, and only apply under rigid assumptions about what dynamics the players follow. In this paper we provide new results and techniques that apply to broader families of games and learning dynamics. First, we use a regret-based analysis to show that in a class of games that include… ▽ More Most existing results about \emph{last-iterate convergence} of learning dynamics are limited to two-player zero-sum games, and only apply under rigid assumptions about what dynamics the players follow. In this paper we provide new results and techniques that apply to broader families of games and learning dynamics. First, we use a regret-based analysis to show that in a class of games that includes constant-sum polymatrix and strategically zero-sum games, dynamics such as \emph{optimistic mirror descent (OMD)} have \emph{bounded second-order path lengths}, a property which holds even when players employ different algorithms and prediction mechanisms. This enables us to obtain $O(1/\sqrt{T})$ rates and optimal $O(1)$ regret bounds. Our analysis also reveals a surprising property: OMD either reaches arbitrarily close to a Nash equilibrium, or it outperforms the \emph{robust price of anarchy} in efficiency. Moreover, for potential games we establish convergence to an $ε$-equilibrium after $O(1/ε^2)$ iterations for mirror descent under a broad class of regularizers, as well as optimal $O(1)$ regret bounds for OMD variants. Our framework also extends to near-potential games, and unifies known analyses for distributed learning in Fisher's market model. Finally, we analyze the convergence, efficiency, and robustness of \emph{optimistic gradient descent (OGD)} in general-sum continuous games. △ Less

Submitted 22 March, 2022; originally announced March 2022.

arXiv:2203.07181 [pdf, other]

Optimal Correlated Equilibria in General-Sum Extensive-Form Games: Fixed-Parameter Algorithms, Hardness, and Two-Sided Column-Generation

Authors: Brian Zhang, Gabriele Farina, Andrea Celli, Tuomas Sandholm

Abstract: We study the problem of finding optimal correlated equilibria of various sorts: normal-form coarse correlated equilibrium (NFCCE), extensive-form coarse correlated equilibrium (EFCCE), and extensive-form correlated equilibrium (EFCE). This is NP-hard in the general case and has been studied in special cases, most notably triangle-free games, which include all two-player games with public chance mo… ▽ More We study the problem of finding optimal correlated equilibria of various sorts: normal-form coarse correlated equilibrium (NFCCE), extensive-form coarse correlated equilibrium (EFCCE), and extensive-form correlated equilibrium (EFCE). This is NP-hard in the general case and has been studied in special cases, most notably triangle-free games, which include all two-player games with public chance moves. However, the general case is not well understood, and algorithms usually scale poorly. First, we introduce the correlation DAG, a representation of the space of correlated strategies whose size is dependent on the specific solution concept. It extends the team belief DAG of Zhang et al. to general-sum games. For each of the three solution concepts, its size depends exponentially only on a parameter related to the game's information structure. We also prove a fundamental complexity gap: while our size bounds for NFCCE are similar to those achieved in the case of team games by Zhang et al., this is impossible to achieve for the other two concepts under standard complexity assumptions. Second, we propose a two-sided column generation approach to compute optimal correlated strategies. Our algorithm improves upon the one-sided approach of Farina et al. by means of a new decomposition of correlated strategies which allows players to re-optimize their sequence-form strategies with respect to correlation plans which were previously added to the support. Our techniques outperform the prior state of the art for computing optimal general-sum correlated equilibria. For team games, the two-sided column generation approach vastly outperforms standard column generation approaches, making it the state of the art algorithm when the parameter is large. Along the way we also introduce two new benchmark games: a trick-taking game that emulates the endgame phase of the card game bridge, and a ride-sharing game. △ Less

Submitted 14 March, 2022; originally announced March 2022.

arXiv:2202.05446 [pdf, other]

Faster No-Regret Learning Dynamics for Extensive-Form Correlated and Coarse Correlated Equilibria

Authors: Ioannis Anagnostides, Gabriele Farina, Christian Kroer, Andrea Celli, Tuomas Sandholm

Abstract: A recent emerging trend in the literature on learning in games has been concerned with providing faster learning dynamics for correlated and coarse correlated equilibria in normal-form games. Much less is known about the significantly more challenging setting of extensive-form games, which can capture both sequential and simultaneous moves, as well as imperfect information. In this paper we establ… ▽ More A recent emerging trend in the literature on learning in games has been concerned with providing faster learning dynamics for correlated and coarse correlated equilibria in normal-form games. Much less is known about the significantly more challenging setting of extensive-form games, which can capture both sequential and simultaneous moves, as well as imperfect information. In this paper we establish faster no-regret learning dynamics for \textit{extensive-form correlated equilibria (EFCE)} in multiplayer general-sum imperfect-information extensive-form games. When all players follow our accelerated dynamics, the correlated distribution of play is an $O(T^{-3/4})$-approximate EFCE, where the $O(\cdot)$ notation suppresses parameters polynomial in the description of the game. This significantly improves over the best prior rate of $O(T^{-1/2})$. To achieve this, we develop a framework for performing accelerated \emph{Phi-regret minimization} via predictions. One of our key technical contributions -- that enables us to employ our generic template -- is to characterize the stability of fixed points associated with \emph{trigger deviation functions} through a refined perturbation analysis of a structured Markov chain. Furthermore, for the simpler solution concept of extensive-form \emph{coarse} correlated equilibrium (EFCCE) we give a new succinct closed-form characterization of the associated fixed points, bypassing the expensive computation of stationary distributions required for EFCE. Our results place EFCCE closer to \emph{normal-form coarse correlated equilibria} in terms of the per-iteration complexity, although the former prescribes a much more compelling notion of correlation. Finally, experiments conducted on standard benchmarks corroborate our theoretical findings. △ Less

Submitted 10 February, 2022; originally announced February 2022.

Comments: Preliminary parts of this paper will appear at the AAAI-22 Workshop on Reinforcement Learning in Games. This version also contains results from an earlier preprint published by a subset of the authors (arXiv:2109.08138)

arXiv:2202.00789 [pdf, other]

Team Belief DAG: Generalizing the Sequence Form to Team Games for Fast Computation of Correlated Team Max-Min Equilibria via Regret Minimization

Authors: Brian Hu Zhang, Gabriele Farina, Tuomas Sandholm

Abstract: A classic result in the theory of extensive-form games asserts that the set of strategies available to any perfect-recall player is strategically equivalent to a low-dimensional convex polytope, called the sequence-form polytope. Online convex optimization tools operating on this polytope are the current state-of-the-art for computing several notions of equilibria in games, and have been crucial i… ▽ More A classic result in the theory of extensive-form games asserts that the set of strategies available to any perfect-recall player is strategically equivalent to a low-dimensional convex polytope, called the sequence-form polytope. Online convex optimization tools operating on this polytope are the current state-of-the-art for computing several notions of equilibria in games, and have been crucial in landmark applications of computational game theory. However, when optimizing over the joint strategy space of a team of players, one cannot use the sequence form to obtain a strategically-equivalent convex description of the strategy set of the team. In this paper, we provide new complexity results on the computation of optimal strategies for teams, and propose a new representation, coined team belief DAG (TB-DAG), that describes team strategies as a convex set. The TB-DAG enjoys state-of-the-art parameterized complexity bounds, while at the same time enjoying the advantages of efficient regret minimization techniques. We show that TB-DAG can be exponentially smaller and can be computed exponentially faster than all other known representations, and that the converse is never true. Experimentally, we show that the TB-DAG, when paired with learning techniques, yields state of the art on a wide variety of benchmark team games. △ Less

Submitted 17 February, 2024; v1 submitted 1 February, 2022; originally announced February 2022.

arXiv:2202.00237 [pdf, other]

Kernelized Multiplicative Weights for 0/1-Polyhedral Games: Bridging the Gap Between Learning in Extensive-Form and Normal-Form Games

Authors: Gabriele Farina, Chung-Wei Lee, Haipeng Luo, Christian Kroer

Abstract: While extensive-form games (EFGs) can be converted into normal-form games (NFGs), doing so comes at the cost of an exponential blowup of the strategy space. So, progress on NFGs and EFGs has historically followed separate tracks, with the EFG community often having to catch up with advances (e.g., last-iterate convergence and predictive regret bounds) from the larger NFG community. In this paper w… ▽ More While extensive-form games (EFGs) can be converted into normal-form games (NFGs), doing so comes at the cost of an exponential blowup of the strategy space. So, progress on NFGs and EFGs has historically followed separate tracks, with the EFG community often having to catch up with advances (e.g., last-iterate convergence and predictive regret bounds) from the larger NFG community. In this paper we show that the Optimistic Multiplicative Weights Update (OMWU) algorithm -- the premier learning algorithm for NFGs -- can be simulated on the normal-form equivalent of an EFG in linear time per iteration in the game tree size using a kernel trick. The resulting algorithm, Kernelized OMWU (KOMWU), applies more broadly to all convex games whose strategy space is a polytope with 0/1 integral vertices, as long as the kernel can be evaluated efficiently. In the particular case of EFGs, KOMWU closes several standing gaps between NFG and EFG learning, by enabling direct, black-box transfer to EFGs of desirable properties of learning dynamics that were so far known to be achievable only in NFGs. Specifically, KOMWU gives the first algorithm that guarantees at the same time last-iterate convergence, lower dependence on the size of the game tree than all prior algorithms, and $\tilde{\mathcal{O}}(1)$ regret when followed by all players. △ Less

Submitted 1 February, 2022; originally announced February 2022.

arXiv:2112.07544 [pdf, other]

Modeling Strong and Human-Like Gameplay with KL-Regularized Search

Authors: Athul Paul Jacob, David J. Wu, Gabriele Farina, Adam Lerer, Hengyuan Hu, Anton Bakhtin, Jacob Andreas, Noam Brown

Abstract: We consider the task of building strong but human-like policies in multi-agent decision-making problems, given examples of human behavior. Imitation learning is effective at predicting human actions but may not match the strength of expert humans, while self-play learning and search techniques (e.g. AlphaZero) lead to strong performance but may produce policies that are difficult for humans to und… ▽ More We consider the task of building strong but human-like policies in multi-agent decision-making problems, given examples of human behavior. Imitation learning is effective at predicting human actions but may not match the strength of expert humans, while self-play learning and search techniques (e.g. AlphaZero) lead to strong performance but may produce policies that are difficult for humans to understand and coordinate with. We show in chess and Go that regularizing search based on the KL divergence from an imitation-learned policy results in higher human prediction accuracy and stronger performance than imitation learning alone. We then introduce a novel regret minimization algorithm that is regularized based on the KL divergence from an imitation-learned policy, and show that using this algorithm for search in no-press Diplomacy yields a policy that matches the human prediction accuracy of imitation learning while being substantially stronger. △ Less

Submitted 16 February, 2022; v1 submitted 14 December, 2021; originally announced December 2021.

arXiv:2112.03804 [pdf, other]

Fast Payoff Matrix Sparsification Techniques for Structured Extensive-Form Games

Authors: Gabriele Farina, Tuomas Sandholm

Abstract: The practical scalability of many optimization algorithms for large extensive-form games is often limited by the games' huge payoff matrices. To ameliorate the issue, Zhang and Sandholm (2020) recently proposed a sparsification technique that factorizes the payoff matrix $\mathbf{A}$ into a sparser object $\mathbf{A} = \hat{\mathbf{A}} + \mathbf{U}\mathbf{V}^\top$, where the total combined number… ▽ More The practical scalability of many optimization algorithms for large extensive-form games is often limited by the games' huge payoff matrices. To ameliorate the issue, Zhang and Sandholm (2020) recently proposed a sparsification technique that factorizes the payoff matrix $\mathbf{A}$ into a sparser object $\mathbf{A} = \hat{\mathbf{A}} + \mathbf{U}\mathbf{V}^\top$, where the total combined number of nonzeros of $\hat{\mathbf{A}}$, $\mathbf{U}$, and $\mathbf{V}$ is significantly smaller. Such a factorization can be used in place of the original payoff matrix in many optimization algorithm, such as interior-point and second-order methods, thus increasing the size of games that can be handled. Their technique significantly sparsifies poker (end)games, standard benchmarks used in computational game theory, AI, and more broadly. We show that the existence of extremely sparse factorizations in poker games can be tied to their particular \emph{Kronecker-product} structure. We clarify how such structure arises and introduce the connection between that structure and sparsification. By leveraging such structure, we give two ways of computing strong sparsifications of poker games (as well as any other game with a similar structure) that are i) orders of magnitude faster to compute, ii) more numerically stable, and iii) produce a dramatically smaller number of nonzeros than the prior technique. Our techniques enable -- for the first time -- effective computation of high-precision Nash equilibria and strategies subject to constraints on the amount of allowed randomization. Furthermore, they significantly speed up parallel first-order game-solving algorithms; we show state-of-the-art speed on a GPU. △ Less

Submitted 7 December, 2021; originally announced December 2021.

Comments: To appear at AAAI'22

arXiv:2111.06008 [pdf, ps, other]

doi 10.1145/3519935.3520031

Near-Optimal No-Regret Learning for Correlated Equilibria in Multi-Player General-Sum Games

Authors: Ioannis Anagnostides, Constantinos Daskalakis, Gabriele Farina, Maxwell Fishelson, Noah Golowich, Tuomas Sandholm

Abstract: Recently, Daskalakis, Fishelson, and Golowich (DFG) (NeurIPS`21) showed that if all agents in a multi-player general-sum normal-form game employ Optimistic Multiplicative Weights Update (OMWU), the external regret of every player is $O(\textrm{polylog}(T))$ after $T$ repetitions of the game. We extend their result from external regret to internal regret and swap regret, thereby establishing uncoup… ▽ More Recently, Daskalakis, Fishelson, and Golowich (DFG) (NeurIPS`21) showed that if all agents in a multi-player general-sum normal-form game employ Optimistic Multiplicative Weights Update (OMWU), the external regret of every player is $O(\textrm{polylog}(T))$ after $T$ repetitions of the game. We extend their result from external regret to internal regret and swap regret, thereby establishing uncoupled learning dynamics that converge to an approximate correlated equilibrium at the rate of $\tilde{O}(T^{-1})$. This substantially improves over the prior best rate of convergence for correlated equilibria of $O(T^{-3/4})$ due to Chen and Peng (NeurIPS`20), and it is optimal -- within the no-regret framework -- up to polylogarithmic factors in $T$. To obtain these results, we develop new techniques for establishing higher-order smoothness for learning dynamics involving fixed point operations. Specifically, we establish that the no-internal-regret learning dynamics of Stoltz and Lugosi (Mach Learn`05) are equivalently simulated by no-external-regret dynamics on a combinatorial space. This allows us to trade the computation of the stationary distribution on a polynomial-sized Markov chain for a (much more well-behaved) linear transformation on an exponential-sized set, enabling us to leverage similar techniques as DFG to near-optimally bound the internal regret. Moreover, we establish an $O(\textrm{polylog}(T))$ no-swap-regret bound for the classic algorithm of Blum and Mansour (BM) (JMLR`07). We do so by introducing a technique based on the Cauchy Integral Formula that circumvents the more limited combinatorial arguments of DFG. In addition to shedding clarity on the near-optimal regret guarantees of BM, our arguments provide insights into the various ways in which the techniques by DFG can be extended and leveraged in the analysis of more involved learning algorithms. △ Less

Submitted 24 January, 2023; v1 submitted 10 November, 2021; originally announced November 2021.

Comments: Appeared at STOC 2022

arXiv:2109.08138 [pdf, other]

Efficient Decentralized Learning Dynamics for Extensive-Form Coarse Correlated Equilibrium: No Expensive Computation of Stationary Distributions Required

Authors: Gabriele Farina, Andrea Celli, Tuomas Sandholm

Abstract: While in two-player zero-sum games the Nash equilibrium is a well-established prescriptive notion of optimal play, its applicability as a prescriptive tool beyond that setting is limited. Consequently, the study of decentralized learning dynamics that guarantee convergence to correlated solution concepts in multiplayer, general-sum extensive-form (i.e., tree-form) games has become an important top… ▽ More While in two-player zero-sum games the Nash equilibrium is a well-established prescriptive notion of optimal play, its applicability as a prescriptive tool beyond that setting is limited. Consequently, the study of decentralized learning dynamics that guarantee convergence to correlated solution concepts in multiplayer, general-sum extensive-form (i.e., tree-form) games has become an important topic of active research. The per-iteration complexity of the currently known learning dynamics depends on the specific correlated solution concept considered. For example, in the case of extensive-form correlated equilibrium (EFCE), all known dynamics require, as an intermediate step at each iteration, to compute the stationary distribution of multiple Markov chains, an expensive operation in practice. Oppositely, in the case of normal-form coarse correlated equilibrium (NFCCE), simple no-external-regret learning dynamics that amount to a linear-time traversal of the tree-form decision space of each agent suffice to guarantee convergence. This paper focuses on extensive-form coarse correlated equilibrium (EFCCE), an intermediate solution concept that is a subset of NFCCE and a superset of EFCE. Being a superset of EFCE, any learning dynamics for EFCE automatically guarantees convergence to EFCCE. However, since EFCCE is a simpler solution concept, this begs the question: do learning dynamics for EFCCE that avoid the expensive computation of stationary distributions exist? This paper answers the previous question in the positive. Our learning dynamics only require the orchestration of no-external-regret minimizers, thus showing that EFCCE is more akin to NFCCE than to EFCE from a learning perspective. Our dynamics guarantees that the empirical frequency of play after $T$ iteration is a $O(1/\sqrt{T})$-approximate EFCCE with high probability, and an EFCCE almost surely in the limit. △ Less

Submitted 16 September, 2021; originally announced September 2021.

arXiv:2105.12954 [pdf, other]

Better Regularization for Sequential Decision Spaces: Fast Convergence Rates for Nash, Correlated, and Team Equilibria

Authors: Gabriele Farina, Christian Kroer, Tuomas Sandholm

Abstract: We study the application of iterative first-order methods to the problem of computing equilibria of large-scale two-player extensive-form games. First-order methods must typically be instantiated with a regularizer that serves as a distance-generating function for the decision sets of the players. For the case of two-player zero-sum games, the state-of-the-art theoretical convergence rate for Nash… ▽ More We study the application of iterative first-order methods to the problem of computing equilibria of large-scale two-player extensive-form games. First-order methods must typically be instantiated with a regularizer that serves as a distance-generating function for the decision sets of the players. For the case of two-player zero-sum games, the state-of-the-art theoretical convergence rate for Nash equilibrium is achieved by using the dilated entropy function. In this paper, we introduce a new entropy-based distance-generating function for two-player zero-sum games, and show that this function achieves significantly better strong convexity properties than the dilated entropy, while maintaining the same easily-implemented closed-form proximal mapping. Extensive numerical simulations show that these superior theoretical properties translate into better numerical performance as well. We then generalize our new entropy distance function, as well as general dilated distance functions, to the scaled extension operator. The scaled extension operator is a way to recursively construct convex sets, which generalizes the decision polytope of extensive-form games, as well as the convex polytopes corresponding to correlated and team equilibria. By instantiating first-order methods with our regularizers, we develop the first accelerated first-order methods for computing correlated equilibra and ex-ante coordinated team equilibria. Our methods have a guaranteed $1/T$ rate of convergence, along with linear-time proximal updates. △ Less

Submitted 12 October, 2021; v1 submitted 27 May, 2021; originally announced May 2021.

Comments: Extended version of the EC21 conference version

arXiv:2104.03673 [pdf, other]

Practical Byzantine Reliable Broadcast on Partially Connected Networks (Extended version)

Authors: Silvia Bonomi, Jérémie Decouchant, Giovanni Farina, Vincent Rahli, Sébastien Tixeuil

Abstract: In this paper, we consider the Byzantine reliable broadcast problem on authenticated and partially connected networks. The state-of-the-art method to solve this problem consists in combining two algorithms from the literature. Handling asynchrony and faulty senders is typically done thanks to Gabriel Bracha's authenticated double-echo broadcast protocol, which assumes an asynchronous fully connect… ▽ More In this paper, we consider the Byzantine reliable broadcast problem on authenticated and partially connected networks. The state-of-the-art method to solve this problem consists in combining two algorithms from the literature. Handling asynchrony and faulty senders is typically done thanks to Gabriel Bracha's authenticated double-echo broadcast protocol, which assumes an asynchronous fully connected network. Danny Dolev's algorithm can then be used to provide reliable communications between processes in the global fault model, where up to f processes among N can be faulty in a communication network that is at least 2f+1-connected. Following recent works that showed that Dolev's protocol can be made more practical thanks to several optimizations, we show that the state-of-the-art methods to solve our problem can be optimized thanks to layer-specific and cross-layer optimizations. Our simulations with the Omnet++ network simulator show that these optimizations can be efficiently combined to decrease the total amount of information transmitted or the protocol's latency (e.g., respectively, -25% and -50% with a 16B payload, N=31 and f=4) compared to the state-of-the-art combination of Bracha's and Dolev's protocols. △ Less

Submitted 26 February, 2024; v1 submitted 8 April, 2021; originally announced April 2021.

Comments: This is an extended version of a paper that appeared at the IEEE ICDCS 2021 conference

arXiv:2104.01520 [pdf, ps, other]

Simple Uncoupled No-Regret Learning Dynamics for Extensive-Form Correlated Equilibrium

Authors: Gabriele Farina, Andrea Celli, Alberto Marchesi, Nicola Gatti

Abstract: The existence of simple uncoupled no-regret learning dynamics that converge to correlated equilibria in normal-form games is a celebrated result in the theory of multi-agent systems. Specifically, it has been known for more than 20 years that when all players seek to minimize their internal regret in a repeated normal-form game, the empirical frequency of play converges to a normal-form correlated… ▽ More The existence of simple uncoupled no-regret learning dynamics that converge to correlated equilibria in normal-form games is a celebrated result in the theory of multi-agent systems. Specifically, it has been known for more than 20 years that when all players seek to minimize their internal regret in a repeated normal-form game, the empirical frequency of play converges to a normal-form correlated equilibrium. Extensive-form games generalize normal-form games by modeling both sequential and simultaneous moves, as well as imperfect information. Because of the sequential nature and presence of private information in the game, correlation in extensive-form games possesses significantly different properties than its counterpart in normal-form games, many of which are still open research directions. Extensive-form correlated equilibrium (EFCE) has been proposed as the natural extensive-form counterpart to the classical notion of correlated equilibrium in normal-form games. Compared to the latter, the constraints that define the set of EFCEs are significantly more complex, as the correlation device must keep into account the evolution of beliefs of each player as they make observations throughout the game. Due to that significant added complexity, the existence of uncoupled learning dynamics leading to an EFCE has remained a challenging open research question for a long time. In this article, we settle that question by giving the first uncoupled no-regret dynamics that converge to the set of EFCEs in n-player general-sum extensive-form games with perfect recall. We show that each iterate can be computed in time polynomial in the size of the game tree, and that, when all players play repeatedly according to our learning dynamics, the empirical frequency of play is proven to be a O(T^-0.5)-approximate EFCE with high probability after T game repetitions, and an EFCE almost surely in the limit. △ Less

Submitted 27 May, 2021; v1 submitted 3 April, 2021; originally announced April 2021.

Comments: Extended version of our NeurIPS 2020 paper. Compared to the conference version, this preprint gives finer, in-high-probability regret bounds. We also better connected our work to the phi-regret minimization framework

arXiv:2103.04546 [pdf, other]

Bandit Linear Optimization for Sequential Decision Making and Extensive-Form Games

Authors: Gabriele Farina, Robin Schmucker, Tuomas Sandholm

Abstract: Tree-form sequential decision making (TFSDM) extends classical one-shot decision making by modeling tree-form interactions between an agent and a potentially adversarial environment. It captures the online decision-making problems that each player faces in an extensive-form game, as well as Markov decision processes and partially-observable Markov decision processes where the agent conditions on o… ▽ More Tree-form sequential decision making (TFSDM) extends classical one-shot decision making by modeling tree-form interactions between an agent and a potentially adversarial environment. It captures the online decision-making problems that each player faces in an extensive-form game, as well as Markov decision processes and partially-observable Markov decision processes where the agent conditions on observed history. Over the past decade, there has been considerable effort into designing online optimization methods for TFSDM. Virtually all of that work has been in the full-feedback setting, where the agent has access to counterfactuals, that is, information on what would have happened had the agent chosen a different action at any decision node. Little is known about the bandit setting, where that assumption is reversed (no counterfactual information is available), despite this latter setting being well understood for almost 20 years in one-shot decision making. In this paper, we give the first algorithm for the bandit linear optimization problem for TFSDM that offers both (i) linear-time iterations (in the size of the decision tree) and (ii) $O(\sqrt{T})$ cumulative regret in expectation compared to any fixed strategy, at all times $T$. This is made possible by new results that we derive, which may have independent uses as well: 1) geometry of the dilated entropy regularizer, 2) autocorrelation matrix of the natural sampling scheme for sequence-form strategies, 3) construction of an unbiased estimator for linear losses for sequence-form strategies, and 4) a refined regret analysis for mirror descent when using the dilated entropy regularizer. △ Less

Submitted 8 March, 2021; originally announced March 2021.

Comments: Full version. The body of the paper appeared in the proceedings of the AAAI 2021 conference

arXiv:2103.04539 [pdf, other]

Model-Free Online Learning in Unknown Sequential Decision Making Problems and Games

Authors: Gabriele Farina, Tuomas Sandholm

Abstract: Regret minimization has proved to be a versatile tool for tree-form sequential decision making and extensive-form games. In large two-player zero-sum imperfect-information games, modern extensions of counterfactual regret minimization (CFR) are currently the practical state of the art for computing a Nash equilibrium. Most regret-minimization algorithms for tree-form sequential decision making, in… ▽ More Regret minimization has proved to be a versatile tool for tree-form sequential decision making and extensive-form games. In large two-player zero-sum imperfect-information games, modern extensions of counterfactual regret minimization (CFR) are currently the practical state of the art for computing a Nash equilibrium. Most regret-minimization algorithms for tree-form sequential decision making, including CFR, require (i) an exact model of the player's decision nodes, observation nodes, and how they are linked, and (ii) full knowledge, at all times t, about the payoffs -- even in parts of the decision space that are not encountered at time t. Recently, there has been growing interest towards relaxing some of those restrictions and making regret minimization applicable to settings for which reinforcement learning methods have traditionally been used -- for example, those in which only black-box access to the environment is available. We give the first, to our knowledge, regret-minimization algorithm that guarantees sublinear regret with high probability even when requirement (i) -- and thus also (ii) -- is dropped. We formalize an online learning setting in which the strategy space is not known to the agent and gets revealed incrementally whenever the agent encounters new decision points. We give an efficient algorithm that achieves $O(T^{3/4})$ regret with high probability for that setting, even when the agent faces an adversarial environment. Our experiments show it significantly outperforms the prior algorithms for the problem, which do not have such guarantees. It can be used in any application for which regret minimization is useful: approximating Nash equilibrium or quantal response equilibrium, approximating coarse correlated equilibrium in multi-player games, learning a best response, learning safe opponent exploitation, and online play against an unknown opponent/environment. △ Less

Submitted 7 March, 2021; originally announced March 2021.

Comments: Full version. The body of the paper appeared in the proceedings of the AAAI 2021 conference

arXiv:2009.10061 [pdf, ps, other]

Faster Algorithms for Optimal Ex-Ante Coordinated Collusive Strategies in Extensive-Form Zero-Sum Games

Authors: Gabriele Farina, Andrea Celli, Nicola Gatti, Tuomas Sandholm

Abstract: We focus on the problem of finding an optimal strategy for a team of two players that faces an opponent in an imperfect-information zero-sum extensive-form game. Team members are not allowed to communicate during play but can coordinate before the game. In that setting, it is known that the best the team can do is sample a profile of potentially randomized strategies (one per player) from a joint… ▽ More We focus on the problem of finding an optimal strategy for a team of two players that faces an opponent in an imperfect-information zero-sum extensive-form game. Team members are not allowed to communicate during play but can coordinate before the game. In that setting, it is known that the best the team can do is sample a profile of potentially randomized strategies (one per player) from a joint (a.k.a. correlated) probability distribution at the beginning of the game. In this paper, we first provide new modeling results about computing such an optimal distribution by drawing a connection to a different literature on extensive-form correlation. Second, we provide an algorithm that computes such an optimal distribution by only using profiles where only one of the team members gets to randomize in each profile. We can also cap the number of such profiles we allow in the solution. This begets an anytime algorithm by increasing the cap. We find that often a handful of well-chosen such profiles suffices to reach optimal utility for the team. This enables team members to reach coordination through a relatively simple and understandable plan. Finally, inspired by this observation and leveraging theoretical concepts that we introduce, we develop an efficient column-generation algorithm for finding an optimal distribution for the team. We evaluate it on a suite of common benchmark games. It is three orders of magnitude faster than the prior state of the art on games that the latter can solve and it can also solve several games that were previously unsolvable. △ Less

Submitted 21 September, 2020; originally announced September 2020.

arXiv:2009.04336 [pdf, other]

Polynomial-Time Computation of Optimal Correlated Equilibria in Two-Player Extensive-Form Games with Public Chance Moves and Beyond

Authors: Gabriele Farina, Tuomas Sandholm

Abstract: Unlike normal-form games, where correlated equilibria have been studied for more than 45 years, extensive-form correlation is still generally not well understood. Part of the reason for this gap is that the sequential nature of extensive-form games allows for a richness of behaviors and incentives that are not possible in normal-form settings. This richness translates to a significantly different… ▽ More Unlike normal-form games, where correlated equilibria have been studied for more than 45 years, extensive-form correlation is still generally not well understood. Part of the reason for this gap is that the sequential nature of extensive-form games allows for a richness of behaviors and incentives that are not possible in normal-form settings. This richness translates to a significantly different complexity landscape surrounding extensive-form correlated equilibria. As of today, it is known that finding an optimal extensive-form correlated equilibrium (EFCE), extensive-form coarse correlated equilibrium (EFCCE), or normal-form coarse correlated equilibrium (NFCCE) in a two-player extensive-form game is computationally tractable when the game does not include chance moves, and intractable when the game involves chance moves. In this paper we significantly refine this complexity threshold by showing that, in two-player games, an optimal correlated equilibrium can be computed in polynomial time, provided that a certain condition is satisfied. We show that the condition holds, for example, when all chance moves are public, that is, both players observe all chance moves. This implies that an optimal EFCE, EFCCE and NFCCE can be computed in polynomial time in the game size in two-player games with public chance moves, providing the biggest positive complexity result surrounding extensive-form correlation in more than a decade. △ Less

Submitted 9 September, 2020; originally announced September 2020.

arXiv:2007.14358 [pdf, other]

Faster Game Solving via Predictive Blackwell Approachability: Connecting Regret Matching and Mirror Descent

Authors: Gabriele Farina, Christian Kroer, Tuomas Sandholm

Abstract: Blackwell approachability is a framework for reasoning about repeated games with vector-valued payoffs. We introduce predictive Blackwell approachability, where an estimate of the next payoff vector is given, and the decision maker tries to achieve better performance based on the accuracy of that estimator. In order to derive algorithms that achieve predictive Blackwell approachability, we start b… ▽ More Blackwell approachability is a framework for reasoning about repeated games with vector-valued payoffs. We introduce predictive Blackwell approachability, where an estimate of the next payoff vector is given, and the decision maker tries to achieve better performance based on the accuracy of that estimator. In order to derive algorithms that achieve predictive Blackwell approachability, we start by showing a powerful connection between four well-known algorithms. Follow-the-regularized-leader (FTRL) and online mirror descent (OMD) are the most prevalent regret minimizers in online convex optimization. In spite of this prevalence, the regret matching (RM) and regret matching+ (RM+) algorithms have been preferred in the practice of solving large-scale games (as the local regret minimizers within the counterfactual regret minimization framework). We show that RM and RM+ are the algorithms that result from running FTRL and OMD, respectively, to select the halfspace to force at all times in the underlying Blackwell approachability game. By applying the predictive variants of FTRL or OMD to this connection, we obtain predictive Blackwell approachability algorithms, as well as predictive variants of RM and RM+. In experiments across 18 common zero-sum extensive-form benchmark games, we show that predictive RM+ coupled with counterfactual regret minimization converges vastly faster than the fastest prior algorithms (CFR+, DCFR, LCFR) across all games but two of the poker games, sometimes by two or more orders of magnitude. △ Less

Submitted 7 March, 2021; v1 submitted 28 July, 2020; originally announced July 2020.

Comments: Full version. The body of the paper appeared in the proceedings of the AAAI 2021 conference

arXiv:2007.12987 [pdf, other]

Coupled Relational Symbolic Execution for Differential Privacy

Authors: Gian Pietro Farina, Stephen Chong, Marco Gaboardi

Abstract: Differential privacy is a de facto standard in data privacy with applications in the private and public sectors. Most of the techniques that achieve differential privacy are based on a judicious use of randomness. However, reasoning about randomized programs is difficult and error prone. For this reason, several techniques have been recently proposed to support designer in proving programs differe… ▽ More Differential privacy is a de facto standard in data privacy with applications in the private and public sectors. Most of the techniques that achieve differential privacy are based on a judicious use of randomness. However, reasoning about randomized programs is difficult and error prone. For this reason, several techniques have been recently proposed to support designer in proving programs differentially private or in finding violations to it. In this work we propose a technique based on symbolic execution for reasoning about differential privacy. Symbolic execution is a classic technique used for testing, counterexample generation and to prove absence of bugs. Here we use symbolic execution to support these tasks specifically for differential privacy. To achieve this goal, we leverage two ideas that have been already proven useful in formal reasoning about differential privacy: relational reasoning and probabilistic coupling. Our technique integrates these two ideas and shows how such a combination can be used to both verify and find violations to differential privacy. △ Less

Submitted 25 July, 2020; originally announced July 2020.

arXiv:2004.00603 [pdf, other]

No-Regret Learning Dynamics for Extensive-Form Correlated Equilibrium

Authors: Andrea Celli, Alberto Marchesi, Gabriele Farina, Nicola Gatti

Abstract: The existence of simple, uncoupled no-regret dynamics that converge to correlated equilibria in normal-form games is a celebrated result in the theory of multi-agent systems. Specifically, it has been known for more than 20 years that when all players seek to minimize their internal regret in a repeated normal-form game, the empirical frequency of play converges to a normal-form correlated equilib… ▽ More The existence of simple, uncoupled no-regret dynamics that converge to correlated equilibria in normal-form games is a celebrated result in the theory of multi-agent systems. Specifically, it has been known for more than 20 years that when all players seek to minimize their internal regret in a repeated normal-form game, the empirical frequency of play converges to a normal-form correlated equilibrium. Extensive-form (that is, tree-form) games generalize normal-form games by modeling both sequential and simultaneous moves, as well as private information. Because of the sequential nature and presence of partial information in the game, extensive-form correlation has significantly different properties than the normal-form counterpart, many of which are still open research directions. Extensive-form correlated equilibrium (EFCE) has been proposed as the natural extensive-form counterpart to normal-form correlated equilibrium. However, it was currently unknown whether EFCE emerges as the result of uncoupled agent dynamics. In this paper, we give the first uncoupled no-regret dynamics that converge to the set of EFCEs in $n$-player general-sum extensive-form games with perfect recall. First, we introduce a notion of trigger regret in extensive-form games, which extends that of internal regret in normal-form games. When each player has low trigger regret, the empirical frequency of play is close to an EFCE. Then, we give an efficient no-trigger-regret algorithm. Our algorithm decomposes trigger regret into local subproblems at each decision point for the player, and constructs a global strategy of the player from the local solutions at each decision point. △ Less

Submitted 2 September, 2022; v1 submitted 1 April, 2020; originally announced April 2020.

arXiv:2002.08493 [pdf, other]

Stochastic Regret Minimization in Extensive-Form Games

Authors: Gabriele Farina, Christian Kroer, Tuomas Sandholm

Abstract: Monte-Carlo counterfactual regret minimization (MCCFR) is the state-of-the-art algorithm for solving sequential games that are too large for full tree traversals. It works by using gradient estimates that can be computed via sampling. However, stochastic methods for sequential games have not been investigated extensively beyond MCCFR. In this paper we develop a new framework for developing stochas… ▽ More Monte-Carlo counterfactual regret minimization (MCCFR) is the state-of-the-art algorithm for solving sequential games that are too large for full tree traversals. It works by using gradient estimates that can be computed via sampling. However, stochastic methods for sequential games have not been investigated extensively beyond MCCFR. In this paper we develop a new framework for developing stochastic regret minimization methods. This framework allows us to use any regret-minimization algorithm, coupled with any gradient estimator. The MCCFR algorithm can be analyzed as a special case of our framework, and this analysis leads to significantly-stronger theoretical on convergence, while simultaneously yielding a simplified proof. Our framework allows us to instantiate several new stochastic methods for solving sequential games. We show extensive experiments on three games, where some variants of our methods outperform MCCFR. △ Less

Submitted 19 February, 2020; originally announced February 2020.

arXiv:1910.12450 [pdf, other]

Efficient Regret Minimization Algorithm for Extensive-Form Correlated Equilibrium

Authors: Gabriele Farina, Chun Kai Ling, Fei Fang, Tuomas Sandholm

Abstract: Self-play methods based on regret minimization have become the state of the art for computing Nash equilibria in large two-players zero-sum extensive-form games. These methods fundamentally rely on the hierarchical structure of the players' sequential strategy spaces to construct a regret minimizer that recursively minimizes regret at each decision point in the game tree. In this paper, we introdu… ▽ More Self-play methods based on regret minimization have become the state of the art for computing Nash equilibria in large two-players zero-sum extensive-form games. These methods fundamentally rely on the hierarchical structure of the players' sequential strategy spaces to construct a regret minimizer that recursively minimizes regret at each decision point in the game tree. In this paper, we introduce the first efficient regret minimization algorithm for computing extensive-form correlated equilibria in large two-player general-sum games with no chance moves. Designing such an algorithm is significantly more challenging than designing one for the Nash equilibrium counterpart, as the constraints that define the space of correlation plans lack the hierarchical structure and might even form cycles. We show that some of the constraints are redundant and can be excluded from consideration, and present an efficient algorithm that generates the space of extensive-form correlation plans incrementally from the remaining constraints. This structural decomposition is achieved via a special convexity-preserving operation that we coin scaled extension. We show that a regret minimizer can be designed for a scaled extension of any two convex sets, and that from the decomposition we then obtain a global regret minimizer. Our algorithm produces feasible iterates. Experiments show that it significantly outperforms prior approaches and for larger problems it is the only viable option. △ Less

Submitted 28 October, 2019; originally announced October 2019.

Comments: Full version of NeurIPS 2019 paper

Showing 1–50 of 71 results for author: Farina, G