Steering No-Regret Agents in MFGs under Model Uncertainty
Leo Widmer, Jiawei Huang, Niao He
lewidmer@student.ethz.ch, {jiawei.huang, niao.he}@inf.ethz.ch
Department of Computer Science, ETH Zürich
Abstract
Incentive design is a popular framework for guiding agents' learning dynamics towards desired outcomes by providing additional payments beyond intrinsic rewards. However, most existing works focus on a finite, small set of agents or assume complete knowledge of the game, limiting their applicability to real-world scenarios involving large populations and model uncertainty. To address this gap, we study the design of steering rewards in Mean-Field Games (MFGs) with density-independent transitions, where both the transition dynamics and intrinsic reward functions are unknown. This setting presents non-trivial challenges, as the mediator must incentivize the agents to explore for its model learning under uncertainty, while simultaneously steering them to converge to desired behaviors without incurring excessive incentive payments. Assuming agents exhibit no(-adaptive) regret behaviors, we contribute novel optimistic exploration algorithms. Theoretically, we establish sub-linear regret guarantees for the cumulative gaps between the agents' behaviors and the desired ones. In terms of the steering cost, we demonstrate that our total incentive payments incur only sub-linear excess, competing with a baseline steering strategy that stabilizes the target policy as an equilibrium. Our work presents an effective framework for steering agents' behaviors in large-population systems under uncertainty.
1 INTRODUCTION
Mean-Field Games (MFGs) (Huang et al., 2006; Lasry and Lions, 2007) are a widely used and powerful framework for modeling competition and cooperation in large-population systems of symmetric and interchangeable agents. MFGs effectively capture the dynamics of many real-world scenarios, such as macro-economic models (Steinbacher et al., 2021), road traffic systems (Chen and Cheng, 2010), autonomous vehicle systems (Dinneweth et al., 2022) and auctions (Iyer et al., 2014), and they have been successfully applied in those domains (Gomes et al., 2014; Cabannes et al., 2021; Achdou and Lasry, 2019; Guo et al., 2021). Similar to finite-agent systems (Roughgarden and Tardos, 2007), MFGs with self-interested agents may lead to undesirable collective behaviors. Typically, the agents' learning dynamics may converge to equilibria where all the participants are worse off compared to other possible outcomes (Guo et al., 2023a).
To address this dilemma, the field of incentive design explores methods to guide agents towards more favorable behaviors by modifying the reward structure. A widely studied formulation, known as the steering problem (Zhang et al., 2024; Canyakmaz et al., 2024; Huang et al., 2024b), assumes the presence of a mediator (incentive designer) outside the game, who can influence the agents' learning dynamics by providing additional steering rewards. However, previous research on steering mainly focuses on either Extensive-Form Games (Zhang et al., 2024) or Markov Games (Canyakmaz et al., 2024; Huang et al., 2024b) with a limited number of agents. The methods developed for those small-scale settings become intractable as the number of agents grows, a phenomenon known as the curse of multi-agency.
To address this gap, in this work, we study incentive design in Mean-Field Games (MFGs). More concretely, we focus on finite-horizon MFGs with density-independent transitions, a standard model in the literature (Huang et al., 2006; Lasry and Lions, 2007; Perolat et al., 2021). In the steering problem setup, we play the role of the mediator, with access to a utility function dependent on the collective behavior of the agents through the population density. During the interactions with the mediator, the agents are continuously learning and adapting. We assume the agents are self-interested no-adaptive-regret learners (Hazan and Seshadhri, 2007); similar no-regret assumptions have been widely adopted in previous literature (Camara et al., 2020; Ge et al., 2024). Following previous works (Zhang et al., 2024; Huang et al., 2024b), our primary goal is to design steering rewards that guide the agents towards desired policies (i.e., minimize the steering gap), such that the resulting behaviors maximize the utility function. Meanwhile, the incentives paid by the mediator to the agents, referred to as the steering cost, should remain low.
In practice, the mediator usually lacks knowledge of the transition dynamics and the intrinsic reward functions of the MFG. Therefore, in this work, we focus on the design of steering strategies without prior knowledge of the game model. The model uncertainty makes the steering problem much more challenging, and requires the mediator to strategically balance exploration and exploitation. Typically, without knowledge of the MFG model, the mediator does not know which agent behaviors (population densities) are feasible, let alone how to maximize the utility function. Therefore, the mediator needs not only to steer the agents to explore the MFG for its own learning, but also to ensure the agents converge to desired outcomes, while keeping the cumulative incentive payments affordable. In summary, the key question we would like to address is:
How can we design effective steering strategies for no-regret agents in MFGs under model uncertainty?
Main contributions. We address the above open question by proposing novel exploration algorithms with provable guarantees. We highlight our main contributions in the following. A summary of the main theorems in this paper can be found in Appx. B.
- Firstly, in Sec. 3, we contribute the first formulation for steering in mean-field games, with details about the problem setting and learning objectives.
- Secondly, as preparation, in Sec. 4, we investigate how to steer the agents to a given density or policy. Notably, in Sec. 4.2, we propose a novel steering strategy, which can guide the no-adaptive-regret agents towards any target policy without prior knowledge of the model. This method serves as the key ingredient of our steering algorithms in the following sections.
- Thirdly, in Sec. 5 and Sec. 6, we investigate strategic exploration methods for steering agents in MFGs under uncertainty. In Sec. 5, we start with the setting where the intrinsic reward is zero, and propose an optimism-based exploration algorithm, which guarantees that both the cumulative steering gap and cost only have sub-linear growth. Furthermore, in Sec. 6, we extend our methods to the setting with non-zero and unknown intrinsic reward by integrating a pessimism-based reward estimation strategy. We establish sub-linear regret in steering gap, and show that the total steering cost is only sub-linearly worse, compared to a baseline strategy that stabilizes the target policy as an equilibrium by offsetting differences in intrinsic rewards.
1.1 Closely Related Work
Due to space limits, we only discuss closely related works here and defer the others to Appendix C.
Incentive Design in Multi-Agent Systems. The problem of incentive design broadly refers to the design of mechanisms for shaping the behavior of autonomous agents (Ehtamo et al., 2002; Ratliff et al., 2019). A recently popular framework for incentive design is known as the steering problem (Zhang et al., 2024; Canyakmaz et al., 2024; Huang et al., 2024b), which considers a repeated interaction between a mediator and learning agents. All of these works focus on small-scale problems (e.g., Markov Games or Extensive-Form Games), and their proposed methods become intractable when extended to large-population settings, including MFGs. Besides, Canyakmaz et al. (2024) and Huang et al. (2024b) consider the agents' learning dynamics to be memoryless, which is different from our no-regret assumptions.
Another related direction is contract design (DellaVigna and Malmendier, 2004), which studies the interactions between a principal and agents when the two parties transact in the presence of private information. (To save space, we defer a more detailed comparison between our steering framework and the contract design setting to Appx. C.) The fundamental question is how the principal should design the incentives for the agents to maximize its own utility after deducting the payments to agents. However, most of the literature studies the single-agent setting (Zhu et al., 2022; Ho et al., 2014; Scheid et al., 2024), or focuses on the computational aspects without addressing exploration under uncertainty (Dütting et al., 2023; Castiglioni et al., 2023). Carmona and Wang (2021) and Elie et al. (2019) study contract design in the MFG setting, but neither considers model uncertainty. Moreover, a common assumption in those works is that the agents always best respond (or take equilibrium policies) to the principal's intervention, which is much stronger than our no-adaptive-regret assumption.
Besides, Sanjari et al. (2024) consider incentive design in a large-population setting, but they study Stackelberg games with one leader and a large number of followers, which differs substantially from our setting. Fu and Horst (2018) consider mean-field leader-follower games; however, they assume knowledge of the dynamics, while we consider the steering problem without this knowledge. Moreover, they study dynamics where the agents cooperate and optimally respond to the leader's control signal. In contrast, we consider decentralized and self-interested agents with no-regret behaviors in maximizing individual interests.
2 PRELIMINARIES
Mean-Field Games. We consider the MFG setting with a finite yet extremely large number of agents, each of which acts independently. In line with Subramanian et al. (2022), we refer to this setting as "Decentralized-MFGs", although their model allows diversity in action space and reward functions for agents.
Definition 2.1.
A Finite-Horizon Decentralized MFG is defined by a tuple , given the number of agents ; state and action spaces with sizes and ; horizon length ; initial state distribution . with and with denote the transition and reward function, respectively.
In this paper, we focus on density-independent transition functions, a common assumption in previous literature (Huang et al., 2006; Lasry and Lions, 2007; Perolat et al., 2021). For the reward function, we consider the general setup, where the rewards depend on the state-action density (Guo et al., 2021).
We only focus on non-stationary Markovian policies, denoted by . Given a model , considering an agent taking policy , we use to denote its state-action density for each step . Starting with , for , we have:
When agents take policies , respectively, the trajectory of agent is specified by:
(1) |
where we use to denote the population density at step . We also assume is Lipschitz in the density, which is standard in previous works (Guo et al., 2021; Yardim et al., 2022).
Other Notational Convention. For convenience, we implicitly treat as a vector in concatenated by . We denote to be the set of all feasible state-action densities given . Note that is a convex set (see Lem. E.1), which implies . If it is not necessary to distinguish which model we use, we omit it in the subscripts, for example, instead of . We also omit in if it is clear from the context. With slight abuse of notation, given a population density and a reward function , we use to denote the reward vector where . In this way, given an arbitrary agent taking policy , its expected total return conditioned on population density can be written as: .
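The forward recursion for the state-action density and the inner-product form of the expected return can be illustrated with a minimal Python sketch. The names `state_action_density`, `expected_return`, and the toy two-state model are hypothetical and chosen only for illustration; the reward table is assumed to be already evaluated at a fixed population density.

```python
def state_action_density(mu1, P, pi):
    """Forward recursion for one agent's state-action density.

    mu1[s]          : initial state distribution
    P[h][s][a][s2]  : density-independent transition at step h (H-1 tables)
    pi[h][s][a]     : non-stationary Markovian policy (H tables)
    Returns d with d[h][s][a] = Pr(s_h = s, a_h = a).
    """
    H, S, A = len(pi), len(mu1), len(pi[0][0])
    state = list(mu1)  # state marginal at the current step
    d = []
    for h in range(H):
        d_h = [[state[s] * pi[h][s][a] for a in range(A)] for s in range(S)]
        d.append(d_h)
        if h + 1 < H:
            # Push the density through the transition kernel.
            state = [
                sum(d_h[s][a] * P[h][s][a][s2] for s in range(S) for a in range(A))
                for s2 in range(S)
            ]
    return d


def expected_return(d, reward):
    """Inner product between a density and a reward table of the same shape."""
    return sum(
        d[h][s][a] * reward[h][s][a]
        for h in range(len(d))
        for s in range(len(d[h]))
        for a in range(len(d[h][s]))
    )


# Toy model (assumed): action 0 stays, action 1 moves to the other state.
mu1 = [1.0, 0.0]
P = [[[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]]
pi = [[[0.0, 1.0], [0.0, 1.0]], [[0.5, 0.5], [0.5, 0.5]]]
reward = [[[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.0], [1.0, 1.0]]]  # pays 1 in state 1

d = state_action_density(mu1, P, pi)
ret = expected_return(d, reward)
```

Because the transitions do not depend on the population density, the recursion needs only the agent's own policy, which is what makes the return linear in the density.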
Given that this paper considers learning under uncertainty, we use to denote the true hidden mean-field model with transition and intrinsic reward , in order to distinguish it from the estimated ones.
Besides, given a population density in a model , we will use to denote the policy, which induces the population density (i.e., ), defined by: (or if ).
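As a rough illustration of the induced policy, the sketch below normalizes a state-action density over actions at each step and state. The function name `induced_policy` is hypothetical, and the uniform fallback for states with zero mass is an assumption made here for completeness rather than a detail confirmed by the text.

```python
def induced_policy(mu):
    """Map a density mu[h][s][a] to a policy pi[h][s][a] by normalizing over actions.

    States with zero mass get a uniform policy (an assumed fallback).
    """
    pi = []
    for mu_h in mu:
        pi_h = []
        for row in mu_h:
            mass = sum(row)
            n_actions = len(row)
            if mass > 0:
                pi_h.append([x / mass for x in row])
            else:
                pi_h.append([1.0 / n_actions] * n_actions)
        pi.append(pi_h)
    return pi


# One step, two states, two actions; the second state is never visited.
example_policy = induced_policy([[[0.2, 0.6], [0.0, 0.0]]])
```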
Reward Function Approximation and Eluder Dimension. In this paper, we consider the setting where the true intrinsic reward, denoted by , is unknown. Note that the reward function depends on not only the state and action but also the density, which belongs to a high-dimensional continuous space. Therefore, we consider function approximation for reward estimation with the standard realizability assumption.
Assumption A.
A reward function class is available, s.t. (i) , ; (ii) .
In the function approximation setting, the fundamental sample efficiency is closely related to the complexity of the function class. We follow previous works (Russo and Van Roy, 2013; Huang et al., 2024a) and utilize the Eluder Dimension as the complexity measure of the function class. Intuitively, the Eluder Dimension is defined to be the length of the longest "independent" sequence, such that each element in the sequence "reveals" some new information about the function class compared with the previous ones.
Definition 2.2 (-independent sequence).
Given a domain and a class of functions defined on , we say is -independent on if there exists , such that , but .
Definition 2.3 (Eluder Dimension).
Given a mean-field reward function class and domain , the Eluder Dimension of , denoted by , is defined to be the length of the longest sequence , such that, for any , is -independent w.r.t. .
3 THE STEERING PROBLEM FORMULATION FOR MFGS
In this section, we introduce our steering setup. In Sec. 3.1, we first provide our formulation of the steering protocol. Then, in Sec. 3.2, we discuss our assumptions on the agents' behavior. After that, we introduce the learning objectives and other setups in Sec. 3.3 and 3.4.
3.1 Agent-Mediator Interaction Protocol
We consider a repeated game setup, and summarize the interaction procedure between agents and the mediator in Procedure 1. In each iteration , the mediator first selects a steering reward function (we use capital to denote the steering reward, to distinguish it from the intrinsic reward ), which is a mapping from the density space to the non-negative reward vector space, upper bounded by . The non-negativity of the steering reward is known as limited liability (Innes, 1990), which is standard in previous works (Zhang et al., 2024; Huang et al., 2024b). Besides, each agent computes a policy and plays the game. The agents' policies result in a population density , by which the mediator realizes the steering reward . Then, each agent receives payments from the mediator equal to the expected return induced by the steering reward and the agent's policy, i.e., . We highlight here that in our setup, at each iteration , the mediator designs the steering reward function without the knowledge of the agents' policies , and we do not restrict whether the agents can observe or not before they make decisions. Furthermore, the agents can either independently compute their policies or collaborate. In the next section, we will characterize our assumptions on the agents' behaviors in more detail.
At the end of each iteration, the mediator can observe a trajectory sampled from a random agent with noisy reward samples. We assume noises are i.i.d. -sub-Gaussian random variables with zero mean. We also assume the mediator has access to the population density, which is necessary to estimate the unknown intrinsic reward function from samples.
3.2 Behavioral Assumptions on Agents
We first introduce our no-adaptive-regret assumption and its implication, and then justify it.
Assumption B (No-Adaptive Regret Behavior).
Justification for Assump. B. We remark that it is common to consider agents exhibiting no-regret behaviors in previous literature (Deng et al., 2019; Zhang et al., 2024; Brown et al., 2024). Most of these works assume no-external regret (directly assigning and in Eq. (2)), which is weaker than our no-adaptive-regret assumption. However, similar stronger assumptions, such as no-dynamic-regret learners, have also been considered in some studies (Ge et al., 2024). Moreover, our no-adaptive-regret assumption is standard when interpreted through the online linear optimization perspective (Hazan and Seshadhri, 2007; Hazan, 2023), where in each iteration, each agent picks a density from the convex set and receives potentially adversarial feedback . Then, Assump. B aligns with the standard no-adaptive-regret guarantees in the online linear optimization setting, and there are very simple algorithms (e.g., Online Gradient Descent) achieving . We defer a more detailed discussion to Appx. D.2.
Under Assump. B, we have the following property, which suggests the collective population will also exhibit no-regret behaviors. This is a useful property we will leverage in algorithm design.
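To make the online linear optimization view concrete, here is a minimal sketch of projected Online Gradient Descent. Two simplifications are assumed: the decision set is the probability simplex rather than the feasible density set of an MFG, and the function reports only external regret on a fixed reward sequence, which is weaker than the adaptive regret in Assump. B. The names `project_to_simplex` and `ogd_external_regret` are hypothetical.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for i, u_i in enumerate(u, start=1):
        cumulative += u_i
        t = (cumulative - 1.0) / i
        if u_i - t > 0:
            theta = t  # keep the threshold of the largest valid index
    return [max(x - theta, 0.0) for x in v]


def ogd_external_regret(reward_seq, eta):
    """Run projected gradient ascent on linear rewards; return external regret."""
    n = len(reward_seq[0])
    x = [1.0 / n] * n
    earned = 0.0
    totals = [0.0] * n
    for g in reward_seq:
        earned += sum(xi * gi for xi, gi in zip(x, g))
        totals = [t + gi for t, gi in zip(totals, g)]
        # Gradient step on the linear reward, then project back to the simplex.
        x = project_to_simplex([xi + eta * gi for xi, gi in zip(x, g)])
    # For linear rewards, the best fixed point in hindsight is a vertex.
    return max(totals) - earned


T = 400
regret = ogd_external_regret([[0.0, 1.0, 0.2]] * T, eta=1.0 / T ** 0.5)
```

The step size proportional to one over the square root of the horizon is the usual choice for a square-root regret rate; any other schedule here would be an equally illustrative assumption.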
Proposition 3.1 (No-Adaptive-Regret Population Behavior).
Under Assump. B, we have:
3.3 Performance Metrics
Inspired by previous works (Zhang et al., 2024; Huang et al., 2024b), we evaluate the steering algorithm from two aspects: the steering gap and the steering cost. We provide the concrete definitions in our MFG setup as follows.
The Steering Gap. Intuitively, the steering gap measures the difference between the desired outcomes and the agents' behavior under the mediator's guidance. In this paper, we assume the mediator is given a utility function assigning each population density a utility value. The only assumption we make for it is Lipschitz continuity:
Assumption C (Lipschitz Utility Function).
The steering gap up to step is defined by:
Here represents the utility paid to the mediator at each iteration , induced by the population density . Note that we consider the best density maximizing utility function as the comparator. This can be interpreted as the best population density if all the agents are restricted to take the same policy, and finding the best shared policy is a standard objective in previous MFGs literature.
The Steering Cost. The motivation for introducing a steering cost is that the agents will not accept the mediator's guidance for free. A common measure of the cost is the expected total return associated with the reward received by the agents. Formally, suppose at iteration , the mediator computes a steering reward function , and the agents select policies , which induce a population density , then the steering cost is defined to be the average payments to the agents: We will use
to denote the cumulative steering cost. Since the steering rewards are non-negative, the steering cost effectively reflects the strength of the steering signal.
3.4 Two Steering Scenarios and Objectives
In this paper, we consider the case when the mediator does not know the true transition and reward functions of . However, to make our algorithm design and technical contributions easier to follow, we will start with a special case, where the agents do not have intrinsic rewards, i.e., .
Scenario 1: No Intrinsic Reward. The goal of this setting is to find an incentive design algorithm producing a sequence of such that both the steering gap and the steering cost are sub-linear:
The motivation for the sub-linear guarantee here is that it implies the average utility converges to the maximum and the average steering cost vanishes. This implies that the incentive design strategies fulfilling these guarantees will eventually pay off as a long-term investment. In Sec. 5, we analyze this case and provide algorithms achieving our objective.
Scenario 2: Non-Zero Intrinsic Reward. In Sec. 6, we study the complete setting where the agents' original reward is non-zero and unknown. In this case, the mediator additionally has to estimate the reward function from observed noisy samples and steer the agents based on that. Similarly, we expect sub-linear steering gap , while for steering cost, we manually choose the "sandboxing reward" as the comparator:
Here we use to denote the all-ones vector. Intuitively, because the intrinsic rewards are non-zero, if the desired behavior is not an equilibrium induced by , the mediator has to maintain non-zero steering rewards to prevent the agents from deviating from , so we cannot expect the average steering cost to vanish to 0 as in Scenario 1. Therefore, we consider the sandboxing reward as a baseline comparator, which mitigates differences in the intrinsic rewards, so that would be a "stable equilibrium" even if the additional steering reward vanishes to zero. We admit, though, that other choices of sandboxing terms may result in lower steering cost, or one can consider optimizing utility and steering cost together. We leave those interesting directions for future work.
4 STEERING TOWARDS A FIXED TARGET
In this section, we focus on how to design rewards to guide the agents to a target population density or policy, which serves as a preparation step for the following sections. For convenience, we assume the agents' intrinsic rewards are zero and ignore them.
4.1 Warm-Up: Steering in a Known Model
We start with the case when the MFG model is known. In this case, we can also compute and find the best . If we want to steer the population to , one steering reward choice is , where the first shift term is to ensure the non-negativity. The key motivation for our choice is that . As a result, if we consider the accumulative performance, we have the following theorem.
Theorem 4.1.
4.2 Steering towards a Target Policy in an Unknown Model
Without knowledge of the transition function , steering becomes challenging, because we can no longer compute or identify whether a given density (e.g. ) can actually be achieved by the agents. Therefore, we shift our focus to the policy space. Interestingly, we show that it is possible to steer the agents to any target policy , even without the knowledge of or . Our key observation is the following lemma, which provides an upper bound on the difference between the population density and the density induced by the target policy.
Lemma 4.2.
Given any and target , suppose the agents induce population density in , then:
with | (3) |
This motivates us to design a steering reward function that penalizes the RHS of Eq. (3), which is actually doable without the knowledge of the model. Given a policy , we define matrix to be the block diagonal of for all and , where
(4) |
Now, consider the steering reward function:
(5) |
where is the identity matrix. We can verify that, for any possible population density occurs at step , we have
(6) |
Recall that denotes the policy induced by population density (see definition in Sec. 2). Here in the first equality, we use the fact that, for any , since . Eq. (6) above is important in that it connects the one-step regret (LHS) with the gap between the population density and target density (RHS through Lemma 4.2).
Combining with Prop. 3.1, if all the agents are no-regret learners, and we steer the agents with the same steering reward for steps, we should expect to converge to , which we summarize in the following theorem. This result provides important insights for our incentive design algorithm in Section 5.
Theorem 4.3.
Let and for all . Under Assump. B,
5 STEERING WITH NO INTRINSIC REWARD
In this section, we study Scenario 1 introduced in Sec. 3.4, where the transition function is unknown and the original reward is zero, so the steering rewards are the only incentives for the agents. The main challenge in this setting is that, without the knowledge of , we cannot determine the feasible density set and the maximizer of the utility function. Therefore, we have to design a steering strategy to incentivize the agents to explore for the mediator to estimate , while balancing the exploration-exploitation trade-off to ensure sub-linear steering gap and cost.
Our main contribution is an optimism-based exploration algorithm in Alg. 2, which provably addresses the above challenges and achieves our objectives. The algorithm builds on the techniques we developed in Sec. 4.2, which allow us to steer the agents to any target policy without the knowledge of the model. Next, we introduce the key components in algorithm design.
Low Policy Switching Optimistic Exploration Strategy. For efficient exploration, we maintain a confidence set for denoted by :
(7) |
where . We highlight that we only update and switch the target policy at low frequency, and here we use index to count the policy switching episodes, to distinguish them from the steering steps . We use to denote the index of the episode at iteration and use to denote the iteration number at the end of the -th policy switch. We define to be the number of samples equal to at time in episode , and . A new episode begins as soon as we have as many samples in this episode as in all the previous ones for some , i.e., . The main motivation for this technique is to guard against the agents' potentially adversarial behaviors. As we will see later in the proof sketch, will appear in the steering gap upper bound.
For exploration, we select the optimistic policy and model (line 10) s.t. the induced density maximizes utility. Then, we choose steering reward to guide the agents towards and collect data samples to update the model confidence set. Intuitively, either indeed maximizes the utility, implying a low steering gap; or the exploration helps to reduce the uncertainty.
Managing the steering gap and cost. We have the following guarantees for Alg. 2.
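The doubling-style switching rule described above can be sketched as follows. This is an illustrative counter only, not the full Alg. 2: the function name `count_episodes` and the representation of each sample as a hashable key are assumptions.

```python
import math
import random


def count_episodes(visits):
    """Count episodes under a doubling rule.

    visits: sequence of hashable (step, state, action) keys in observation order.
    A new episode starts once some key has at least as many samples in the
    current episode as in all previous episodes combined (at least one).
    """
    prior = {}    # samples accumulated before the current episode
    current = {}  # samples collected inside the current episode
    episodes = 1
    for key in visits:
        current[key] = current.get(key, 0) + 1
        if current[key] >= max(1, prior.get(key, 0)):
            # Merge the finished episode into the history and start a new one.
            for k, c in current.items():
                prior[k] = prior.get(k, 0) + c
            current = {}
            episodes += 1
    return episodes


single_key_episodes = count_episodes([(0, 0, 0)] * 1024)

rng = random.Random(0)
keys = [(0, s, a) for s in range(3) for a in range(2)]
random_episodes = count_episodes([rng.choice(keys) for _ in range(2000)])
```

Each switch at least doubles the count of the triggering key, so the number of episodes grows only logarithmically in the number of iterations for a fixed number of keys, which is the property the analysis relies on.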
Theorem 5.1.
As a concrete example, agents following Online Gradient Descent with step size (Hazan, 2023) result in (ignoring and ), which implies steering gap. Besides, if all the agents are capable enough s.t. for any , are equilibria w.r.t. , would be constant-level, resulting in a bound.
Proof Sketch. We first analyze the steering gap. Intuitively, Alg. 2 can be interpreted as a "-stage" version of what we did in Sec. 4.2. In each stage, we pick a target policy, and steer the agents towards it for exploration. Following this intuition, and thanks to the Lipschitz condition (Assump. C) and the optimism in planning, we can decompose the steering gap as follows:
(8) |
We refer to the first term as the model estimation error, which measures the gap between the population density and the density induced by the population average policy (see definition in Sec. 2) in the estimated model . As we collect more and more data, gets closer to , and we can show only grows sub-linearly. The second term can be interpreted as the population convergence error, which is determined by how fast the agents converge to the target policy we steer them to. Following techniques similar to those in the proof of Thm. 4.3, can be upper bounded by:
(9) |
Here we use AgentReg to refer to the summation term, which can be interpreted as the agents' dynamic regret if choosing as the comparators. Thanks to the low policy switching, AgentReg can be controlled by , and the only remaining step is to control . Note that we only switch policy when the number of visits to some state-action pair has doubled; therefore, only grows in .
6 STEERING WITH NON-ZERO INTRINSIC REWARD
Next, we turn to Scenario 2 in Sec. 3.4, the complete setting where the agents' pre-existing reward function is both non-zero and unknown. The non-zero intrinsic reward introduces non-trivial additional challenges. Firstly, it changes the steering landscape and introduces some prior bias for our steering reward design. Secondly, since it is unknown, we must account for its interference with the steering dynamics and undertake strategic exploration to estimate . In the following, we explain how we overcome these challenges by a pessimism-based reward estimation strategy.
Confidence set for . We recall our setup in Sec. 3.1: the mediator can observe the population density and noisy reward perturbed by i.i.d. zero-mean -sub-Gaussian noise . We will use this information to estimate the original reward. At each iteration , we maintain a confidence set for , defined by:
(10) |
where for any function as a shorthand. We use to denote the confidence interval length, chosen to ensure is contained in the confidence set at any time with high probability. We defer a detailed choice of to Lem. I.1. Informally, grows in , where is the -covering number of .
Steering Reward Design with Pessimism. We consider the following steering reward design:
(11) |
Here is computed in the same way as in Alg. 2; (defined in Eq. (10)) is the reward estimate achieving the minimal empirical loss; is a vector with elements , which quantifies the estimation uncertainty for each state-action pair; the last constant shift term ensures non-negativity.
As we can see, the main difference compared with the steering reward in Alg. 2 is that we include an additional reward estimation term to offset the effect of the non-zero original reward . In this way, the agents will follow the guidance by to explore as we want. Note that here we conduct a pessimism-based reward estimation such that for a technical reason, which we will explain later.
Steering Algorithm Design. The algorithm design for the non-zero intrinsic reward setting only differs from Alg. 2 in the additional update of as in Eq. (10) and in choosing Eq. (11) as the steering reward . For completeness, we defer the detailed algorithm to Alg. 4 in Appx. I.1. We have the following guarantees for steering gap and steering cost.
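The idea of a pessimistic reward estimate can be illustrated on a toy finite function class. Everything concrete in this sketch is an assumption made for illustration: the confidence set is taken to be the candidates whose predictions stay within squared distance `beta` of the least-squares fit on the observed inputs, the uncertainty width is the largest disagreement with that fit, and the names `pessimistic_estimate`, `candidates`, and `beta` are hypothetical. It does not reproduce Eq. (10) or Eq. (11).

```python
def pessimistic_estimate(candidates, data, beta, x):
    """Return (pessimistic value, width) at query x for a finite function class.

    candidates : list of callables f(x) -> float
    data       : list of (x_i, y_i) noisy observations
    beta       : radius of the assumed least-squares confidence set
    """
    losses = [sum((f(xi) - yi) ** 2 for xi, yi in data) for f in candidates]
    r_hat = candidates[losses.index(min(losses))]  # empirical loss minimizer
    conf_set = [
        f for f in candidates
        if sum((f(xi) - r_hat(xi)) ** 2 for xi, _ in data) <= beta
    ]
    # Width: largest disagreement with the fit among plausible candidates.
    width = max(abs(f(x) - r_hat(x)) for f in conf_set)
    return r_hat(x) - width, width


# Toy class: f_theta(x) = theta * x with theta on a grid; true theta is 0.5.
candidates = [lambda x, th=i / 10: th * x for i in range(11)]
data = [(0.2, 0.11), (0.4, 0.18), (0.6, 0.31)]  # 0.5 * x plus small noise
pess_value, width = pessimistic_estimate(candidates, data, beta=0.01, x=1.0)
true_value = 0.5 * 1.0
```

Whenever the true function lies in the confidence set, subtracting the width guarantees the estimate does not exceed the true value, which is the kind of one-sided property a pessimism argument typically needs.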
Theorem 6.1.
Compared with Theorem 5.1, both the steering gap and cost differ only in the additional term , which results from the estimation error of . The term depends on the Eluder dimension of and . In Appx. H.1, we show several common function classes with , and where by choosing appropriately, we have . As a result, both the steering gap and cost upper bounds in Thm. 6.1 will be sub-linear in .
Proof Sketch. Similar to the proof of Thm. 5.1, we can decompose the steering gap as in Eq. (8), and upper bound the model estimation error term in the same way. The proof diverges when we upper bound AgentReg in Eq. (9), because the agents' no-regret behavior holds for in this setting. We can write
AgentReg | |||
Using pessimism, i.e., , we can bound this by
Clearly, the first term above is just the agents' dynamic regret with respect to the total reward they received and can be bounded again by . The second term above can be further controlled by , which is essentially the cumulative confidence interval length for reward estimation; its growth can be controlled by the Eluder dimension (Lem. H.5) and is only sub-linear in .
For the steering cost, we can provide an upper bound involving AgentReg and the reward estimation error that we analyzed before. To save space, we do not repeat it here and refer the reader to Appx. I for the full proof.
Remark 6.1.
Our strategy to deal with the intrinsic reward is to try to "cancel" it with our steering reward. This approach is justified by the fact that we keep and very general, which means that the target density to maximize may not coincide with an equilibrium associated with the original reward . Therefore, to ensure the target density is still a stationary point for no-regret learners, we treat as a competing force to offset. We admit that there might be other options to counteract the impact of with lower steering costs, and we leave further investigation to future work.
Remark 6.2 (Generalization to Unknown Utility Setting).
Although this paper focuses on the case when is revealed to the mediator, it is possible to generalize our results to the case where the utility function is unknown, but lies in a known function class with bounded Eluder dimension. In Appx. J, we formalize this setting and present a solution based on a simple modification of the current methods. Our established regret bounds for the steering gap and steering cost grow at a rate of . Although the results are worse than the rate of in Thm. 6.1 due to the challenges in exploring the utility function, they are still sub-linear in .
7 CONCLUSION
We study a novel problem setting for incentive design in unknown mean-field games with no-regret agents. Our optimistic algorithm introduces newly developed steering reward designs, achieving sub-linear utility regret and steering costs when the intrinsic reward is zero. Extending to the setting with a non-zero and unknown intrinsic reward function, we adapt our algorithm to handle this new challenge, maintaining sub-linear utility regret and steering costs with only sub-linear excess over a baseline strategy. Future work could explore the more challenging case where the transition function also depends on the population density. Another interesting direction is to identify better, or even optimal, steering reward designs that stabilize the target policy, and to design an algorithm with sub-linear guarantees relative to that benchmark.
Acknowledgements
This work is supported by the Swiss National Science Foundation (SNSF) Project Funding No. 200021-207343 and an SNSF Starting Grant.
References
- Achdou and Lasry (2019) Achdou, Y. and Lasry, J.-M. (2019). Mean Field Games for Modeling Crowd Motion. In Chetverushkin, B. N., Fitzgibbon, W., Kuznetsov, Y., Neittaanmäki, P., Periaux, J., and Pironneau, O., editors, Contributions to Partial Differential Equations and Applications, pages 17–42. Springer International Publishing, Cham.
- Baumann et al. (2020) Baumann, T., Graepel, T., and Shawe-Taylor, J. (2020). Adaptive mechanism design: Learning to promote cooperation. In 2020 International Joint Conference on Neural Networks (IJCNN), pages 1–7. IEEE.
- Brown et al. (2024) Brown, W., Schneider, J., and Vodrahalli, K. (2024). Is learning in games good for the learners? Advances in Neural Information Processing Systems, 36.
- Cabannes et al. (2021) Cabannes, T., Lauriere, M., Perolat, J., Marinier, R., Girgin, S., Perrin, S., Pietquin, O., Bayen, A. M., Goubault, E., and Elie, R. (2021). Solving N-player dynamic routing games with congestion: a mean field approach. arXiv:2110.11943 [cs, eess, math].
- Camara et al. (2020) Camara, M., Hartline, J., and Johnsen, A. (2020). Mechanisms for a No-Regret Agent: Beyond the Common Prior. arXiv:2009.05518 [cs, econ].
- Canyakmaz et al. (2024) Canyakmaz, I., Sakos, I., Lin, W., Varvitsiotis, A., and Piliouras, G. (2024). Steering game dynamics towards desired outcomes. arXiv:2404.01066 [cs, eess].
- Carmona and Wang (2021) Carmona, R. and Wang, P. (2021). Finite-state contract theory with a principal and a field of agents. Management Science, 67(8):4725–4741.
- Castiglioni et al. (2023) Castiglioni, M., Marchesi, A., and Gatti, N. (2023). Multi-agent contract design: How to commission multiple agents with individual outcomes. In Proceedings of the 24th ACM Conference on Economics and Computation, pages 412–448.
- Chen and Cheng (2010) Chen, B. and Cheng, H. H. (2010). A review of the applications of agent technology in traffic and transportation systems. IEEE Transactions on Intelligent Transportation Systems, 11(2):485–497.
- Curry et al. (2024) Curry, M., Thoma, V., Chakrabarti, D., McAleer, S., Kroer, C., Sandholm, T., He, N., and Seuken, S. (2024). Automated design of affine maximizer mechanisms in dynamic settings. Proceedings of the AAAI Conference on Artificial Intelligence, 38(9):9626–9635.
- DellaVigna and Malmendier (2004) DellaVigna, S. and Malmendier, U. (2004). Contract design and self-control: Theory and evidence. The Quarterly Journal of Economics, 119(2):353–402.
- Deng et al. (2019) Deng, Y., Schneider, J., and Sivan, B. (2019). Strategizing against No-regret Learners. arXiv:1909.13861 [cs].
- Dinneweth et al. (2022) Dinneweth, J., Boubezoul, A., Mandiau, R., and Espié, S. (2022). Multi-agent reinforcement learning for autonomous vehicles: a survey. Autonomous Intelligent Systems, 2(1):27.
- Dütting et al. (2023) Dütting, P., Ezra, T., Feldman, M., and Kesselheim, T. (2023). Multi-agent contracts. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing, pages 1311–1324.
- Ehtamo et al. (2002) Ehtamo, H., Kitti, M., and Hämäläinen, R. P. (2002). Recent studies on incentive design problems in game theory and management science. In Optimal Control and Differential Games: Essays in Honor of Steffen Jørgensen, pages 121–134. Springer.
- Elie et al. (2019) Elie, R., Mastrolia, T., and Possamaï, D. (2019). A tale of a principal and many, many agents. Mathematics of Operations Research, 44(2):440–467.
- Freund and Schapire (1997) Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139.
- Fu and Horst (2018) Fu, G. and Horst, U. (2018). Mean-field leader-follower games with terminal state constraint.
- Ge et al. (2024) Ge, J., Wang, Y., Li, W., and Jin, C. (2024). Towards principled superhuman AI for multiplayer symmetric games.
- Gomes et al. (2014) Gomes, D. A., Velho, R. M., and Wolfram, M.-T. (2014). Socio-economic applications of finite state mean field games. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 372(2028):20130405. arXiv:1403.4217 [math].
- Guo et al. (2021) Guo, X., Hu, A., Xu, R., and Zhang, J. (2021). Learning Mean-Field Games. arXiv:1901.09585 [math].
- Guo et al. (2023a) Guo, X., Li, L., Nabi, S., Salhab, R., and Zhang, J. (2023a). MESOB: Balancing Equilibria & Social Optimality. arXiv:2307.07911 [cs, math].
- Guo et al. (2023b) Guo, X., Li, L., Nabi, S., Salhab, R., and Zhang, J. (2023b). MESOB: Balancing equilibria & social optimality.
- Hazan (2023) Hazan, E. (2023). Introduction to Online Convex Optimization. arXiv:1909.05207 [cs, math, stat].
- Hazan and Seshadhri (2007) Hazan, E. and Seshadhri, C. (2007). Adaptive algorithms for online decision problems. Electronic Colloquium on Computational Complexity (ECCC), 14.
- Ho et al. (2014) Ho, C.-J., Slivkins, A., and Vaughan, J. W. (2014). Adaptive contract design for crowdsourcing markets: Bandit algorithms for repeated principal-agent problems. In Proceedings of the Fifteenth ACM Conference on Economics and Computation, pages 359–376.
- Holmström (1979) Holmström, B. (1979). Moral hazard and observability. The Bell Journal of Economics, pages 74–91.
- Hu and Zhang (2024) Hu, A. and Zhang, J. (2024). MF-OML: Online Mean-Field Reinforcement Learning with Occupation Measures for Large Population Games. arXiv:2405.00282 [cs, math].
- Huang et al. (2024a) Huang, J., He, N., and Krause, A. (2024a). Model-Based RL for Mean-Field Games is not Statistically Harder than Single-Agent RL. arXiv:2402.05724 [cs, stat].
- Huang et al. (2024b) Huang, J., Thoma, V., Shen, Z., Nax, H. H., and He, N. (2024b). Learning to Steer Markovian Agents under Model Uncertainty. arXiv:2407.10207 [cs, stat].
- Huang et al. (2023) Huang, J., Yardim, B., and He, N. (2023). On the Statistical Efficiency of Mean Field Reinforcement Learning with General Function Approximation. arXiv:2305.11283 [cs, stat].
- Huang et al. (2006) Huang, M., Malhamé, R. P., and Caines, P. E. (2006). Large population stochastic dynamic games: closed-loop McKean-Vlasov systems and the Nash certainty equivalence principle.
- Innes (1990) Innes, R. D. (1990). Limited liability and incentive contracting with ex-ante action choices. Journal of Economic Theory, 52(1):45–67.
- Iyer et al. (2014) Iyer, K., Johari, R., and Sundararajan, M. (2014). Mean field equilibria of dynamic auctions with learning. Management Science, 60(12):2949–2970.
- Jaksch et al. (2010) Jaksch, T., Ortner, R., and Auer, P. (2010). Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(51):1563–1600.
- Lasry and Lions (2007) Lasry, J.-M. and Lions, P.-L. (2007). Mean field games. Japanese Journal of Mathematics, 2(1):229–260.
- Laurière et al. (2024) Laurière, M., Perrin, S., Pérolat, J., Girgin, S., Muller, P., Élie, R., Geist, M., and Pietquin, O. (2024). Learning in Mean Field Games: A Survey. arXiv:2205.12944 [cs, math].
- Liu et al. (2022) Liu, B., Li, J., Yang, Z., Wai, H.-T., Hong, M., Nie, Y. M., and Wang, Z. (2022). Inducing Equilibria via Incentives: Simultaneous Design-and-Play Ensures Global Convergence. arXiv:2110.01212 [cs].
- Luo et al. (1996) Luo, Z.-Q., Pang, J.-S., and Ralph, D. (1996). Mathematical Programs with Equilibrium Constraints. Cambridge University Press.
- Osband and Roy (2014) Osband, I. and Roy, B. V. (2014). Model-based reinforcement learning and the eluder dimension.
- Perolat et al. (2021) Perolat, J., Perrin, S., Elie, R., Laurière, M., Piliouras, G., Geist, M., Tuyls, K., and Pietquin, O. (2021). Scaling up Mean Field Games with Online Mirror Descent. arXiv:2103.00623 [cs].
- Ratliff et al. (2019) Ratliff, L. J., Dong, R., Sekar, S., and Fiez, T. (2019). A perspective on incentive design: Challenges and opportunities. Annual Review of Control, Robotics, and Autonomous Systems, 2(1):305–338.
- Rosenberg and Mansour (2019) Rosenberg, A. and Mansour, Y. (2019). Online convex optimization in adversarial Markov decision processes.
- Roughgarden and Tardos (2007) Roughgarden, T. and Tardos, É. (2007). Introduction to the inefficiency of equilibria. In Nisan, N., Roughgarden, T., Tardos, E., and Vazirani, V. V., editors, Algorithmic Game Theory, pages 443–460. Cambridge University Press, Cambridge.
- Russo and Van Roy (2013) Russo, D. and Van Roy, B. (2013). Eluder dimension and the sample complexity of optimistic exploration. In Burges, C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K., editors, Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc.
- Sanjari et al. (2024) Sanjari, S., Bose, S., and Başar, T. (2024). Incentive Designs for Stackelberg Games with a Large Number of Followers and their Mean-Field Limits. arXiv:2207.10611 [cs].
- Scheid et al. (2024) Scheid, A., Tiapkin, D., Boursier, E., Capitaine, A., Mhamdi, E. M. E., Moulines, É., Jordan, M. I., and Durmus, A. (2024). Incentivized learning in principal-agent bandit games. arXiv preprint arXiv:2403.03811.
- Steinbacher et al. (2021) Steinbacher, M., Raddant, M., Karimi, F., Camacho Cuena, E., Alfarano, S., Iori, G., and Lux, T. (2021). Advances in the agent-based modeling of economic and social behavior. SN Business & Economics, 1(7):99.
- Subramanian et al. (2022) Subramanian, S. G., Taylor, M. E., Crowley, M., and Poupart, P. (2022). Decentralized mean field games. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 9439–9447.
- Wang et al. (2022) Wang, K., Xu, L., Perrault, A., Reiter, M. K., and Tambe, M. (2022). Coordinating followers to reach better equilibria: End-to-end gradient descent for Stackelberg games. Proceedings of the AAAI Conference on Artificial Intelligence, 36(5):5219–5227.
- Weissman et al. (2003) Weissman, T., Ordentlich, E., Seroussi, G., Verdú, S., and Weinberger, M. J. (2003). Inequalities for the l1 deviation of the empirical distribution.
- Yang et al. (2022) Yang, J., Wang, E., Trivedi, R., Zhao, T., and Zha, H. (2022). Adaptive incentive design with multi-agent meta-gradient reinforcement learning. In Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems, AAMAS '22, pages 1436–1445, Richland, SC. International Foundation for Autonomous Agents and Multiagent Systems.
- Yardim et al. (2022) Yardim, B., Cayci, S., Geist, M., and He, N. (2022). Policy Mirror Ascent for Efficient and Independent Learning in Mean Field Games.
- Zhang et al. (2024) Zhang, B. H., Farina, G., Anagnostides, I., Cacciamani, F., McAleer, S. M., Haupt, A. A., Celli, A., Gatti, N., Conitzer, V., and Sandholm, T. (2024). Steering No-Regret Learners to a Desired Equilibrium. arXiv:2306.05221 [cs].
- Zhu et al. (2022) Zhu, B., Bates, S., Yang, Z., Wang, Y., Jiao, J., and Jordan, M. I. (2022). The sample complexity of online contract design. arXiv preprint arXiv:2211.05732.
Checklist
1. For all models and algorithms presented, check if you include:
(a) A clear description of the mathematical setting, assumptions, algorithm, and/or model. [Yes]
(b) An analysis of the properties and complexity (time, space, sample size) of any algorithm. [Yes]
(c) (Optional) Anonymized source code, with specification of all dependencies, including external libraries. [Not Applicable]
2. For any theoretical claim, check if you include:
(a) Statements of the full set of assumptions of all theoretical results. [Yes]
(b) Complete proofs of all theoretical results. [Yes]
(c) Clear explanations of any assumptions. [Yes]
3. For all figures and tables that present empirical results, check if you include:
(a) The code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL). [Not Applicable]
(b) All the training details (e.g., data splits, hyperparameters, how they were chosen). [Not Applicable]
(c) A clear definition of the specific measure or statistics and error bars (e.g., with respect to the random seed after running experiments multiple times). [Not Applicable]
(d) A description of the computing infrastructure used. (e.g., type of GPUs, internal cluster, or cloud provider). [Not Applicable]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets, check if you include:
(a) Citations of the creator If your work uses existing assets. [Not Applicable]
(b) The license information of the assets, if applicable. [Not Applicable]
(c) New assets either in the supplemental material or as a URL, if applicable. [Not Applicable]
(d) Information about consent from data providers/curators. [Not Applicable]
(e) Discussion of sensible content if applicable, e.g., personally identifiable information or offensive content. [Not Applicable]
5. If you used crowdsourcing or conducted research with human subjects, check if you include:
(a) The full text of instructions given to participants and screenshots. [Not Applicable]
(b) Descriptions of potential participant risks, with links to Institutional Review Board (IRB) approvals if applicable. [Not Applicable]
(c) The estimated hourly wage paid to participants and the total amount spent on participant compensation. [Not Applicable]
Contents
- 1 INTRODUCTION
- 2 PRELIMINARIES
- 3 THE STEERING PROBLEM FORMULATION FOR MFGS
- 4 STEERING TOWARDS A FIXED TARGET
- 5 STEERING WITH NO INTRINSIC REWARD
- 6 STEERING WITH NON-ZERO INTRINSIC REWARD
- 7 CONCLUSION
- A TABLE OF FREQUENTLY USED NOTATIONS
- B SUMMARY OF MAIN RESULTS
- C OTHER RELATED WORKS
- D REGARDING NO-ADAPTIVE REGRET ASSUMPTION
- E STATE-ACTION DENSITY
- F PROOFS OF RESULTS IN SECTION 4
- G PROOF OF THEOREM 5.1
- H ELUDER DIMENSION
- I PROOF OF THEOREM 6.1
- J EXTENSION TO UNKNOWN UTILITY FUNCTION
Appendix A TABLE OF FREQUENTLY USED NOTATIONS
Notation | Description |
for any | |
Set of probability distributions over a finite set | |
Indicator function for the event | |
All-one vector | |
The -th standard-basis vector | |
The model / game | |
Number of agents | |
State and action space | |
Horizon length of the game | |
Initial state distribution | |
Transition function | |
Reward function | |
Steering reward function (capitalized) | |
Vectorized reward function | |
Markov policy | |
Set of all policies | |
State-action density of policy in model | |
Set of possible state-action densities in model | |
Adaptive regret bound after games | |
Utility function | |
Steering cost function | |
Reward function which incentivizes policy | |
True model, intrinsic reward, transition function | |
Steering reward for the setting where . | |
The subscript “z” is short for “zero”. | |
Steering reward for the setting where . | |
The subscript “nz” is short for “non-zero”. | |
Eluder dimension of function class | |
Population density | |
Population average policy induced by | |
Standard big-O notations |
Appendix B SUMMARY OF MAIN RESULTS
In the following, we summarize the main theorems of this paper under Assump. A, B and C. We study the steering gaps and costs in four settings, categorized by whether (or ) is known, and whether the intrinsic reward function is zero or non-zero and unknown.
Setting | ? | Steering Gap | Steering Cost | Thm. |
Known | โ | 4.1 | ||
Unknown (known ) | โ | 4.3 | ||
Unknown | โ | 5.1 | ||
Unknown | โ | 6.1 |
Here and , where is the eluder dimension of reward function class , and .
Appendix C OTHER RELATED WORKS
More Elaboration on Comparison between the Steering Setting and Contract Design Setting
The steering setup differs from the previous incentive design literature in two aspects: (1) it deals with “learning agents” that continuously update their policies, and (2) it concerns the steering gap towards a target policy and the cumulative steering cost. One of the most closely related and representative existing problem setups is contract design (a.k.a. the principal-agent problem), a classical problem dating back to the seminal work of Holmström (1979). As discussed in Sec. 1.1, it considers a similar mediator-agent interaction procedure. In the following, we elaborate on the comparison between the two settings to motivate our steering setting.
(1) Contract design assumes the agents respond optimally to the mediator/principal (e.g., maximize the total return including the mediator's incentives), which is a rather strong assumption that “simplifies” the problem by making the agents' behaviors predictable.
In contrast, the steering framework treats the agents' behavior as a dynamic process. For example, Zhang et al. (2024) and our work consider no-regret behaviors, while Huang et al. (2024b) and Canyakmaz et al. (2024) assume Markovian learning dynamics. Such non-stationarity is more reasonable in practice and introduces additional challenges in achieving a low steering gap and cost.
(2) Contract design considers a more challenging objective: finding the optimal incentive design that maximizes the mediator's gain net of the incentive cost. It usually also assumes the agents' behaviors are unobservable. Due to these challenges, most of the contract design literature focuses on the single-agent setting and assumes knowledge of the model.
On the other hand, the steering setting considers steering the agents to target policies that maximize some utility function, which makes the framework more general. Besides, we do not pursue optimality in the steering cost; sub-linearity suffices. This is reasonable because in many scenarios we only face budget constraints and do not need to achieve the optimum. This relaxation also makes the problem more tractable.
Mean-field game
The mean-field game (MFG) is an important framework for modeling systems with a large number of symmetric agents (Laurière et al., 2024). Most works on MFGs focus on learning equilibrium policies. As the pioneers, Lasry and Lions (2007) and Huang et al. (2006) show that learning a Nash Equilibrium (NE) is computationally efficient under monotonicity conditions if the model is known in advance. Without knowledge of the true model, many previous works contribute sample-efficient model-free (Guo et al., 2021; Yardim et al., 2022; Perolat et al., 2021) and model-based (Huang et al., 2023; Huang et al., 2024a) methods to compute an NE. Our mean-field game definition is similar to the general MFG setting (Guo et al., 2021), but unlike them, we assume transitions are density-independent and allow independence of agents' policies. This density-independent transition assumption has been frequently considered in previous works (Lasry and Lions, 2007; Huang et al., 2006; Hu and Zhang, 2024; Perolat et al., 2021). To our knowledge, we are the first to investigate steering agents' behaviors in the context of mean-field games.
Mathematical Programming with Equilibrium Constraints (MPEC) and Mechanism Design
MPEC considers a bilevel optimization formulation, where the upper level can be a utility maximization problem and the lower level involves equilibrium constraints (Luo et al., 1996). A line of work (Liu et al., 2022; Wang et al., 2022; Yang et al., 2022) considers gradient-based approaches to solving MPEC problems. These approaches usually require strong assumptions for computing hyper-gradients, which may fail to hold in most games. In contrast, we neither make those assumptions nor restrict the target policies to be equilibria. We only assume the agents are no-regret learners and do not require them to solve for the equilibria induced by the modified reward functions.
Another related field within game theory is Mechanism Design, which focuses on designing rules or systems (mechanisms) to achieve a specific objective, especially when participants (agents) have private information and act according to their own interests. Most recent works consider mechanism design in Markov Games (Curry et al., 2024; Baumann et al., 2020).
Guo et al. (2023b) consider a bi-level optimization framework and a bi-objective variant, where the goal of the social planner is to find an equilibrium policy maximizing some social welfare function. They do not consider using steering rewards to intervene in agents' behavior, and focus on the optimization side without considering model uncertainty. In contrast, we study the incentive design problem and focus on how to explore and design appropriate steering rewards to guide agents' behaviors without knowledge of the model.
Appendix D REGARDING NO-ADAPTIVE REGRET ASSUMPTION
D.1 Proof Of Proposition 3.1
See 3.1
Proof.
D.2 Concrete Examples Satisfying No-Adaptive Regret Assumption
In this section, we provide concrete examples of agents' learning dynamics to support our arguments on the practicality of Assump. B.
Example 1: Colluded Agents with Full Observation of
If the agents are able to observe the mediator's steering strategy and is Lipschitz in density (which is indeed satisfied by our proposed algorithms), the agents can collude and adopt an (approximate) Nash Equilibrium policy induced by the reward function , which is guaranteed to exist given the Lipschitz condition (Huang et al., 2023). By the definition of Nash Equilibrium, each agent has non-positive adaptive regret, which satisfies Assump. B.
Note that in the contract design literature, it is usually assumed that the agent best responds (Ho et al., 2014; Zhu et al., 2022) to the principal's (mediator's) strategy if there is only one agent, or that the agents adopt equilibrium policies in the many-agent setting (Carmona and Wang, 2021; Elie et al., 2019). Based on the discussion above, those assumptions are strictly stronger than, and imply, our no-adaptive-regret assumption.
Example 2: Independent Agents Conducting Online Convex Learning
In this second example, we consider less powerful agents who cannot observe the entire or coordinate with the other agents. Note that from an agent's perspective, the interaction protocol in Procedure 1 can be interpreted as an online linear optimization task, as in Procedure 3.
In our setting, in each iteration , the agents pick a density (by picking a policy) from the convex set and receive potentially adversarial feedback (or in the bandit feedback setting). Then, Assump. B coincides with the standard no-adaptive-regret guarantee in the online convex optimization setting. Therefore, Assump. B can be realized if each agent independently adopts any no-adaptive-regret online learning algorithm (Hazan and Seshadhri, 2007; Hazan, 2023).
As a concrete algorithm choice, online gradient descent (OGD) achieves an external regret bound of (Hazan, 2023), where is the diameter of and an upper bound on . In our case, we can bound . A bound for is discussed in Appendix D.4. Moreover, in the full feedback setting (the agents know the model and are able to observe ), the no-adaptive-regret assumption is not much stronger than no-external-regret, as demonstrated by the following proposition.
D.3 Motivating Adaptive Regret
Here, we give a small example to motivate why we need the no-adaptive-regret assumption instead of no-external-regret. External regret is one of the most common regret notions; it is the same as the adaptive regret in Assumption B, except that are fixed. If we want to steer the agents in different directions, the no-external-regret assumption might not be enough, as the following example shows.
Consider the stateless setting with , where the incentive designer deploys for the first iterations and for the remaining iterations.
Suppose all the agents perform the Hedge algorithm, where
and is the normalizing constant. This algorithm is known to have sub-linear external regret (Freund and Schapire, 1997). The population density at iteration is
while the optimal action is . Thus, over the interval , the agents accumulate expected regret
So, although this algorithm has no external regret, we still might have to wait many rounds to let the agents converge to a different density. One can easily observe that with the no-adaptive-regret assumption, this is not an issue.
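The slow adaptation described above can be reproduced numerically. The following is a minimal sketch, not the exact instance of the example: it assumes two actions, a reward that favors action 0 during the first half of the horizon and action 1 afterwards, and a learning rate of sqrt(log(2)/T). The function names `hedge_weights` and `simulate` and all numeric choices are illustrative assumptions.

```python
import math


def hedge_weights(cum_rewards, eta):
    """Hedge distribution: p(a) proportional to exp(eta * cumulative reward of a)."""
    m = max(cum_rewards)  # subtract the max for numerical stability
    w = [math.exp(eta * (c - m)) for c in cum_rewards]
    z = sum(w)
    return [x / z for x in w]


def simulate(T):
    """Run Hedge on a two-action problem whose best action switches at T // 2.

    Returns the expected regret accumulated after the switch (against the
    action that is optimal on that interval) and the first round at which
    the newly optimal action has probability at least 1/2 (None if never).
    """
    eta = math.sqrt(math.log(2) / T)  # assumed learning rate
    switch = T // 2                   # assumed switching time
    cum = [0.0, 0.0]
    regret_after_switch = 0.0
    recover_round = None
    for t in range(T):
        p = hedge_weights(cum, eta)
        reward = [1.0, 0.0] if t < switch else [0.0, 1.0]
        if t >= switch:
            # Expected per-round regret against action 1 on the second interval.
            regret_after_switch += 1.0 - p[1]
            if recover_round is None and p[1] >= 0.5:
                recover_round = t
        cum = [c + r for c, r in zip(cum, reward)]
    return regret_after_switch, recover_round


regret, recover_round = simulate(1000)
print(regret, recover_round)
```

In this sketch the cumulative reward of action 0 stays ahead of that of action 1 throughout the second half, so the newly optimal action never reaches probability 1/2 and the regret on that interval grows linearly in its length, even though the external regret over the whole horizon remains sub-linear.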
D.4 Boundedness Of Steering Rewards
As we see in Assumption B, the adaptive regret bound depends on . In this section, we show that for both of our steering rewards and .
Proposition D.2.
For any and , , where is defined as in Eq. (5).
Proof.
Proposition D.3.
Appendix E STATE-ACTION DENSITY
E.1 Is Convex
Lemma E.1.
Proof.
We abbreviate and , since the model is fixed throughout. For , it is easy to see that the conditions on the right-hand side are fulfilled. The other direction is more involved. Suppose fulfills and for all , as well as . Now, define such that for all ,
Clearly, and , which means . First of all,
By induction, for all we have if ,
and in case , we know that and therefore
We can conclude that and thus . ∎
Lemma E.2.
where
( is the tensor product) and is viewed as a matrix such that . An immediate consequence of this formulation is that is convex.
Proof.
This result is simply a reformulation of Lemma E.1. We can rewrite the condition as . The condition can be written as . ∎
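To make the characterization concrete, the following is a minimal sketch in plain Python. It computes the state-action density of a Markov policy by forward recursion and checks what we assume to be the conditions of Lemma E.1 in their standard form: non-negativity, agreement of the first-step state marginal with the initial distribution, and flow conservation between consecutive steps. The data layout (`P[h][s][a][s2]`, `pi[h][s][a]`), the function names, and the toy instance are assumptions made for illustration.

```python
def state_action_density(mu0, P, pi):
    """Forward recursion in a finite-horizon model with density-independent transitions.

    mu0[s]         : initial state distribution
    P[h][s][a][s2] : probability of moving from s to s2 under a at step h (H-1 entries)
    pi[h][s][a]    : Markov policy at step h (H entries)
    Returns d with d[h][s][a] = Pr(s_h = s, a_h = a).
    """
    H, S, A = len(pi), len(mu0), len(pi[0][0])
    d = [[[mu0[s] * pi[0][s][a] for a in range(A)] for s in range(S)]]
    for h in range(H - 1):
        nxt = [sum(d[h][s][a] * P[h][s][a][s2] for s in range(S) for a in range(A))
               for s2 in range(S)]
        d.append([[nxt[s2] * pi[h + 1][s2][a] for a in range(A)] for s2 in range(S)])
    return d


def satisfies_flow(d, mu0, P, tol=1e-9):
    """Check non-negativity, the initial-distribution constraint, and flow conservation."""
    H, S, A = len(d), len(mu0), len(d[0][0])
    if any(x < -tol for dh in d for ds in dh for x in ds):
        return False
    if any(abs(sum(d[0][s]) - mu0[s]) > tol for s in range(S)):
        return False
    for h in range(H - 1):
        for s2 in range(S):
            inflow = sum(d[h][s][a] * P[h][s][a][s2] for s in range(S) for a in range(A))
            if abs(sum(d[h + 1][s2]) - inflow) > tol:
                return False
    return True


def mix(d1, d2, lam):
    """Convex combination lam * d1 + (1 - lam) * d2 of two densities."""
    return [[[lam * x + (1 - lam) * y for x, y in zip(r1, r2)]
             for r1, r2 in zip(h1, h2)] for h1, h2 in zip(d1, d2)]


# Toy instance with 2 states, 2 actions and horizon 2 (all numbers are made up).
mu0 = [0.5, 0.5]
P = [[[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [0.2, 0.8]]]]
pi_a = [[[1.0, 0.0], [0.5, 0.5]], [[0.3, 0.7], [1.0, 0.0]]]
pi_b = [[[0.0, 1.0], [0.2, 0.8]], [[0.6, 0.4], [0.1, 0.9]]]
d_a = state_action_density(mu0, P, pi_a)
d_b = state_action_density(mu0, P, pi_b)
print(satisfies_flow(d_a, mu0, P), satisfies_flow(mix(d_a, d_b, 0.3), mu0, P))
```

Because the conditions are linear in the density, a convex combination of two valid densities passes the same check, which is the convexity property stated above.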
E.2 Inequalities
Lemma E.3.
For any model and any ,
where .
Proof.
Since the model is fixed throughout, we abbreviate and . First of all, . Furthermore, for any ,
By induction,
Finally,
∎
See 4.2
Proof.
Note that is a convex set and . By definition, we have . By applying Lem. E.3 for model policy and in , we finish the proof. ∎
Lemma E.4.
Consider any and models , which are identical except for their transition functions respectively. Then,
Proof.
Recall the definition of the state-action density function:
We abbreviate . Since is the same for and , . Furthermore, for all ,
Using induction on , we obtain . Thus,
∎
Appendix F PROOFS OF RESULTS IN SECTION 4
See 4.1
Proof.
The steering cost can be bounded similarly.
Given that is sub-linear in , we finish the proof. ∎
See 4.3
Appendix G PROOF OF THEOREM 5.1
Lemma G.1.
Let be a sequence of policies. We abbreviate . Then,
where is the (population) policy which induces .
Proof.
The first inequality follows from Lemma E.3. We can write
Furthermore, by Jensen's inequality,
∎
Lemma G.2.
For any , with probability at least ,
for all , where .
Proof.
Lemma G.3.
For any and respective ,
Proof.
We can define . Clearly, . The condition in line 5 of the algorithm ensures that for all . Thus, we can use Lemma 19 in Jaksch et al. (2010) and Jensen's inequality,
Now, using the definition of ,
∎
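As a numerical sanity check of the inequality borrowed from Jaksch et al. (2010), the sketch below evaluates both sides for a sequence satisfying its precondition. It assumes the lemma in its usual form: for any z_1, ..., z_n with 0 <= z_k <= Z_{k-1} := max{1, z_1 + ... + z_{k-1}}, the sum of z_k / sqrt(Z_{k-1}) is at most (sqrt(2) + 1) * sqrt(Z_n). The function name `jaksch_sides` and the doubling test sequence are illustrative assumptions, and how z_k maps to the visit counts of the proof above is not spelled out here.

```python
import math


def jaksch_sides(z):
    """Return (lhs, rhs) of sum_k z_k / sqrt(Z_{k-1}) <= (sqrt(2) + 1) * sqrt(Z_n)."""
    total, lhs = 0.0, 0.0
    for zk in z:
        z_prev = max(1.0, total)  # Z_{k-1}
        if not 0.0 <= zk <= z_prev:
            raise ValueError("precondition 0 <= z_k <= Z_{k-1} violated")
        lhs += zk / math.sqrt(z_prev)
        total += zk
    rhs = (math.sqrt(2) + 1) * math.sqrt(max(1.0, total))
    return lhs, rhs


# Extreme case allowed by the precondition: every new term equals the running total,
# which mimics an episode that ends as soon as a visit count has doubled.
z = [1.0]
while sum(z) < 1e6:
    z.append(max(1.0, sum(z)))

lhs, rhs = jaksch_sides(z)
print(lhs <= rhs)
```

The doubling sequence is the case in which each term is as large as the precondition permits, so it is a natural stress test; the check raises an error for sequences that violate the precondition instead of silently evaluating the bound.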
Lemma G.4.
Let be the policy sequence of the population and the sequence of the corresponding model estimates. We abbreviate . With probability at least ,
Proof.
The proof is based on Rosenberg and Mansour (2019). Let be the trajectory sampled in the -th game. We define . By Lemma E.4,
where is a martingale difference sequence w.r.t. the trajectories sampled and with . In the following, we bound the first and second term above with high probability.
The first term can be bounded using Lemma G.2 and G.3, such that we have, with probability at least ,
By the Hoeffding-Azuma inequality, we have for a fixed that with probability at least ,
Thus, by the union bound over all , the second term is at most with probability at least .
Finally, by union bound over the events used to bound the first and second term, we have with probability at least that
∎
See 5.1
Proof.
We first establish the upper bound for steering gap and then investigate the steering cost.
Proof for Steering Gap
We denote with the episode index at the -th game and denote . Furthermore, we abbreviate and . Consider a fixed and . We can decompose the steering gap term of round as follows:
The first term can be bounded by 0 using the optimism of the algorithm. We use the -Lipschitzness of and the triangle inequality to further decompose the second term.
Putting it all together we now arrive at
By summing over ,
Using Lemma G.4, the estimation error term can be bounded by with probability at least .
To bound the population convergence term , we can use Lemma G.1:
Furthermore, it can be easily seen that AgentReg is
Finally, to bound the number of episodes , note that is also the number of times the condition in line 5 of the algorithm has been true. For each , this condition can be true at most times. Thus, .
Proof for Steering Costs
Note that for any reward function ,
Let for some . Recall that . By looking at the definition of in (4), we see that
where the -matrix norm is defined as . Using this, we can bound
Finally, using Jensen's inequality and the fact that the agent regret is bounded by , our steering cost can be bounded by
∎
Appendix H ELUDER DIMENSION
H.1 Example Function Classes
Here, we list some bounds of the eluder dimension for different function classes that are commonly considered. We see that in all these cases, the eluder dimension can be bounded logarithmically in , if .
Proposition H.1 (Linear functions, Russo and Van Roy (2013)).
Let .
Proposition H.2 (Quadratic functions, Osband and Roy (2014)).
Let .
Proposition H.3 (Generalized linear functions, Russo and Van Roy (2013)).
Let be strictly increasing, differentiable and have derivatives bounded in with . Let and .
Remark H.3 (Bounding ).
If we assume that the functions in are parametrized by parameters in some set with constant diameter and the functions are -Lipschitz in that parameter, we have . Then, we might choose such that
can also be bounded logarithmically in .
H.2 Bounding The Width Of The Confidence Set
Notations and Definitions
Here, we introduce some notation used in this section. We define the width function . Throughout this section, we use the notation with to describe elements of a sequence . The idea behind it is that we can later define and apply the results in this section to our setting. Furthermore, for any function we write .
Lemma H.4 (Proposition 3 of Russo and Van Roy (2013)).
If is a positive non-decreasing sequence, some function sequence and then with probability 1, for all ,
for all and .
Proof.
First we show that for any , if then is -dependent on fewer than disjoint subsequences of . Suppose . Then, there are such that . Furthermore, let be a subsequence of on which is -dependent. This implies, by definition, that . If is -dependent on disjoint subsequences of then we must have
By the triangle inequality, . Combining these two inequalities, we get .
Next, we show that in any sequence there is an element which is -dependent on at least disjoint subsequences of , where . Let be an integer with . We will construct disjoint subsequences . First, for all . If is already -dependent on , we are done. Otherwise, select a of which is -independent and append to . We repeat this for until we find that is -dependent on each subsequence or until we have reached . In the latter case, each element of a subsequence is independent of its predecessors and hence . Then, must be -dependent on each subsequence, by definition of the eluder dimension. In both cases we find an element in that is -dependent on disjoint subsequences.
Finally, let be a subsequence of consisting of all elements for which . From before, we know there is some that is -dependent on at least disjoint subsequences of . Let be such that . Note that in there are at most elements for some . It follows that is -dependent on at least disjoint subsequences of . Now, as we have also shown, is -dependent on fewer than disjoint subsequences of . Combining these two bounds, we get , and therefore . ∎
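The bucket construction in the second step of the proof can be sketched in code. The following is a minimal illustration, assuming a finite function class given as a list of callables and the standard definition of ε-dependence from Russo and Van Roy (2013); the names `is_eps_dependent` and `build_disjoint_buckets` and the toy class of binary functions are hypothetical and not part of the paper.

```python
import itertools
import math

def is_eps_dependent(x, points, function_class, eps):
    """Check eps-dependence of x on `points` for a finite function class.

    x is eps-dependent if every pair (f, g) whose Euclidean distance on
    `points` is at most eps also differs by at most eps at x.
    """
    for f, g in itertools.combinations(function_class, 2):
        dist = math.sqrt(sum((f(p) - g(p)) ** 2 for p in points))
        if dist <= eps and abs(f(x) - g(x)) > eps:
            return False
    return True

def build_disjoint_buckets(sequence, function_class, eps, num_buckets):
    """Greedy construction of disjoint subsequences (buckets).

    The first `num_buckets` elements seed one bucket each. Every later
    element is appended to some bucket it is eps-independent of; if no
    such bucket exists, it is eps-dependent on all buckets and is
    returned as the witness.
    """
    buckets = [[x] for x in sequence[:num_buckets]]
    for x in sequence[num_buckets:]:
        independent = [b for b in buckets
                       if not is_eps_dependent(x, b, function_class, eps)]
        if not independent:
            return buckets, x
        independent[0].append(x)
    return buckets, None

# Toy class (an assumption for illustration): all functions {0, 1, 2} -> {0, 1}.
toy_class = [lambda p, t=t: t[p] for t in itertools.product([0, 1], repeat=3)]
buckets, witness = build_disjoint_buckets([0, 1, 0, 0], toy_class, 0.5, 2)
print(buckets, witness)
```

For this toy class, a point is ε-dependent on a bucket exactly when it already appears in that bucket, which makes the greedy step easy to trace by hand.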
Lemma H.5 (Variant of Lemma 2 in Russo and Van Roy (2013)).
Let be a positive non-decreasing sequence, some function sequence and . Let for all . Then, for all and ,
Proof.
We abbreviate and . Let . Using this ordering of the sequence, implies that . By Lemma H.4, this would mean or, equivalently, . Now, since implies , this means that .
In the following, we bound the first and largest widths by and the remaining widths (larger than ) by the previously established bound.
∎
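The last step of this argument can be checked numerically. The sketch below assumes the tail bound has the form sqrt(4βd/(t−d)) for t > d, as in the original Lemma 2 of Russo and Van Roy (2013); the constants in the variant stated here may differ. The symbols `d` (eluder dimension), `beta` (confidence radius) and `C` (uniform bound on the widths) are illustrative stand-ins.

```python
import math

def tail_width_sum(d, beta, T):
    """Sum of sqrt(4 * beta * d / (t - d)) over t = d + 1, ..., T."""
    return sum(math.sqrt(4.0 * beta * d / (t - d)) for t in range(d + 1, T + 1))

def integral_bound(d, beta, T):
    """Closed form 4 * sqrt(d * beta * T), obtained by comparing the
    sum with the integral of s ** -0.5 over [0, T]."""
    return 4.0 * math.sqrt(d * beta * T)

def capped_width_sum(d, beta, T, C):
    """Bound the d largest widths by C and the rest by the tail sum."""
    return C * min(d, T) + tail_width_sum(d, beta, T)

# Illustrative values only (assumptions, not taken from the paper).
d, beta, C = 5, 2.0, 1.0
for T in (10, 100, 1000, 10000):
    assert tail_width_sum(d, beta, T) <= integral_bound(d, beta, T)
    print(T, capped_width_sum(d, beta, T, C), C * d + integral_bound(d, beta, T))
```

Under these assumptions, the sum of widths grows on the order of sqrt(d·β·T), plus a term linear in d from the capped widths.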
Appendix I PROOF OF THEOREM 6.1
I.1 Algorithm Details
We present our full algorithm for the unknown reward setting in Alg. 4.
I.2 Missing Proofs
Lemma I.1 (Proposition 2 in Russo and Van Roy (2013)).
Let be the -covering number of w.r.t. the -norm. Let , and for each , . With probability at least , .
Lemma I.2.
We abbreviate . With probability at least ,
Proof.
Note that . Recall that are the trajectories we gather from the population at step . Therefore, we can define with . By the assumption that is bounded in , we have that for any and . Therefore, . A direct application of Lemma D.4 from Huang et al. (2023) shows that with probability at least ,
∎
Lemma I.3.
We abbreviate . If the true is contained in all , then, with probability at least ,
Proof.
See Theorem 6.1.
Proof.
We abbreviate . We can use exactly the same arguments as in the proof of Theorem 5.1, up to the point where we have to bound
Combining Lemma I.1 and Lemma I.3, we have with probability at least that
Now, we summarize , where , and with slight abuse of notation, we rewrite . With this rewriting of notation, we can apply Lemma H.5 with to show that
Combining this with the previous results, we get that with probability at least ,
The new agent regret term NewAgentReg can be bounded in the same way as in the proof of Theorem 5.1:
For the steering cost, we have for any ,
Then, summing over ,
Using Lemma I.2, we can bound the first term by with probability at least . Using Lemma H.5 with , we can further bound this by . From the steering cost bound in Thm. 5.1, it follows that the second term is bounded by
which is at most , as we have already shown in this proof.
Lastly, we have to discuss the asymptotic bound for . The term depends on the eluder dimension of and . In Appendix H.1, we show several common function classes with . Furthermore, if we assume that the functions in are parametrized by parameters in some set with constant diameter and are -Lipschitz in that parameter, we have . Then, we might choose such that can also be bounded logarithmically in . In the cases where the eluder dimension and are in , we have (ignoring other factors). ∎
Appendix J EXTENSION TO UNKNOWN UTILITY FUNCTION
In this section, we generalize our previous results to the setting where the mediator does not have prior knowledge of the utility function . We consider the non-zero intrinsic reward setting, i.e., Scenario 2 described in Section 3.4. Note that the results for Scenario 1 can be directly derived by setting .
Motivation for Unknown Utility Setting
This setting is natural, especially when partially depends on the agents' intrinsic rewards . As a motivating example, in financial markets, the government (the mediator) derives its benefit (utility ) not only from the societal impact of the desired behaviors of the companies (the agents), but also from the taxes they pay, which are directly related to the rewards received by the agents (see Footnote 4). Due to the lack of knowledge of , should only be partially revealed to the mediator. This restricts the applicability of our methods to this setting. However, if is unknown, we might infer it, for example, by estimating the true reward functions through online interaction with the agents. We can also generalize this setting as follows.
Footnote 4: Another way to interpret this scenario is that the mediator's utility can be decomposed into a known function representing its intrinsic utility and another unknown part , which reflects the agents' interests and depends on . Here, serves as a parameter to trade off the interests of the two parties.
We consider a general setting where the mediator does not have prior knowledge of , but can observe samples from , perturbed by -sub-Gaussian noise, and has access to a function class which contains and whose functions are bounded in .
J.1 Algorithm
We can use the standard technique described in Russo and Van Roy (2013) to handle this case. We define
(12)
(13)
where and, e.g., .
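The standard technique referred to above combines a least-squares estimate with a confidence set of all functions that stay close to that estimate on the observed data. The following is a minimal sketch under strong simplifying assumptions: a hypothetical one-parameter class f_theta(x) = theta * x on a finite grid, a deterministic bounded perturbation standing in for sub-Gaussian noise, and an arbitrary radius `beta`. None of these names or values come from the paper.

```python
import math

def predict(theta, x):
    """Hypothetical one-parameter utility model f_theta(x) = theta * x."""
    return theta * x

def least_squares_estimate(thetas, xs, ys):
    """Parameter in the finite class with the smallest squared error."""
    return min(thetas,
               key=lambda th: sum((predict(th, x) - y) ** 2
                                  for x, y in zip(xs, ys)))

def confidence_set(thetas, xs, ys, beta):
    """All parameters whose predictions lie within sqrt(beta) of the
    least-squares fit, measured in the empirical 2-norm on xs."""
    theta_hat = least_squares_estimate(thetas, xs, ys)

    def empirical_distance(th):
        return math.sqrt(sum((predict(th, x) - predict(theta_hat, x)) ** 2
                             for x in xs))

    return [th for th in thetas if empirical_distance(th) <= math.sqrt(beta)]

# Illustrative data (assumptions): true parameter 0.5, bounded noise of size 0.1.
thetas = [i / 10 for i in range(11)]
true_theta = 0.5
xs = [(i % 10 + 1) / 10 for i in range(200)]
ys = [predict(true_theta, x) + 0.1 * (-1) ** i for i, x in enumerate(xs)]
candidates = confidence_set(thetas, xs, ys, beta=1.0)
print(candidates)
```

An optimistic mediator would then plan with the most favorable function in `candidates`. In the analysis, the radius is chosen so that the true utility function lies in the set with high probability.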
Algorithm 5 differs from Algorithm 4 in the if-condition in line 7 as well as in lines 10 and 11.
The if-condition in line 7 now includes the case , where will be chosen later. We need this to guarantee for all and thereby bound the estimation error of the utility function estimate. Intuitively, we need to keep the estimates of somewhat up to date to be able to bound the estimation error. Meanwhile, we cannot update the estimate in each round (or too often) since then we would also have to change in each round, which would lead to .
J.2 Analysis
Theorem J.1.
Compared with Theorem 6.1, the steering gap has an additional term originating from the estimation of . Furthermore, the bound on the number of epochs has an additional . Similar to the discussion about in Section H.1, we can also bound and under suitable assumptions about . If and , both the steering gap and the steering cost are in (ignoring all other constants).
Proof.
We can adapt the proof of Theorem 6.1 by choosing the following regret decomposition.
Using Lemma I.1 (and replacing by in the lemma), we have with probability at least . Thus, with probability at least , the first term can be bounded by 0 using optimism. The second term can be bounded in the same way as in the proof of Theorem 6.1. Summing over all , the last term accumulates to
Using the fact that and Lemma H.5 with , the sum above is at most
We also have to find a new bound for . As before, we can enter the if block at most times because of the first condition. In addition, we can enter the if block at most times due to the condition . Therefore, and .
Now, we set . Then, and
Finally, the steering gap is the previous bound of Theorem 6.1 plus the above term.
With regard to the steering cost, the only change is the bound of .
∎