
Steering No-Regret Agents in MFGs under Model Uncertainty

Leo Widmer Jiawei Huang Niao He

lewidmer@student.ethz.ch, {jiawei.huang, niao.he}@inf.ethz.ch Department of Computer Science ETH Zürich

Abstract

Incentive design is a popular framework for guiding agents’ learning dynamics towards desired outcomes by providing additional payments beyond intrinsic rewards. However, most existing works focus on a finite, small set of agents or assume complete knowledge of the game, limiting their applicability to real-world scenarios involving large populations and model uncertainty. To address this gap, we study the design of steering rewards in Mean-Field Games (MFGs) with density-independent transitions, where both the transition dynamics and intrinsic reward functions are unknown. This setting presents non-trivial challenges, as the mediator must incentivize the agents to explore for its model learning under uncertainty, while simultaneously steering them to converge to desired behaviors without incurring excessive incentive payments. Assuming agents exhibit no(-adaptive) regret behaviors, we contribute novel optimistic exploration algorithms. Theoretically, we establish sub-linear regret guarantees for the cumulative gaps between the agents’ behaviors and the desired ones. In terms of the steering cost, we demonstrate that our total incentive payments incur only sub-linear excess, competing with a baseline steering strategy that stabilizes the target policy as an equilibrium. Our work presents an effective framework for steering agents’ behaviors in large-population systems under uncertainty.

1 INTRODUCTION

Mean-Field Games (MFGs) (Huang et al., 2006; Lasry and Lions, 2007) are a widely used and powerful framework for modeling competition and cooperation in large-population systems of symmetric and interchangeable agents. MFGs effectively capture the dynamics of many real-world scenarios, such as macro-economic models (Steinbacher et al., 2021), road traffic systems (Chen and Cheng, 2010), autonomous vehicle systems (Dinneweth et al., 2022) and auctions (Iyer et al., 2014), and they have been successfully applied in those domains (Gomes et al., 2014; Cabannes et al., 2021; Achdou and Lasry, 2019; Guo et al., 2021). As in finite-agent systems (Roughgarden and Tardos, 2007), MFGs with self-interested agents may lead to undesirable collective behaviors. Typically, the agents’ learning dynamics may converge to equilibria where all the participants are worse off compared to other possible outcomes (Guo et al., 2023a).

To address this dilemma, the field of incentive design explores methods to guide agents towards more favorable behaviors by modifying the reward structure. A widely studied formulation, known as the steering problem (Zhang et al., 2024; Canyakmaz et al., 2024; Huang et al., 2024b), assumes the presence of a mediator (incentive designer) outside the game, who can influence the agents’ learning dynamics by providing additional steering rewards. However, previous research on steering mainly focuses on either Extensive-Form Games (Zhang et al., 2024) or Markov Games (Canyakmaz et al., 2024; Huang et al., 2024b) with a limited number of agents. The methods developed for those small-scale settings become intractable in large-population scenarios as the number of agents grows, a phenomenon known as the curse of multi-agency.

To address this gap, in this work, we study incentive design in Mean-Field Games (MFGs). More concretely, we focus on finite-horizon MFGs with density-independent transitions, a standard model in the literature (Huang et al., 2006; Lasry and Lions, 2007; Perolat et al., 2021). In the steering problem setup, we play the role of the mediator, with access to a utility function that depends on the collective behavior of the agents through the population density. During the interactions with the mediator, the agents are continuously learning and adapting. We assume the agents are self-interested no-adaptive-regret learners (Hazan and Seshadhri, 2007); similar no-regret assumptions have been widely adopted in previous literature (Camara et al., 2020; Ge et al., 2024). Following previous works (Zhang et al., 2024; Huang et al., 2024b), our primary goal is to design steering rewards that guide the agents towards desired policies (i.e., minimize the steering gap), such that the resulting behaviors maximize the utility function. Meanwhile, the incentives paid by the mediator to the agents, referred to as the steering cost, should remain low.

In practice, the mediator usually lacks knowledge of the transition dynamics and the intrinsic reward functions of the MFG. Therefore, in this work, we focus on the design of steering strategies without prior knowledge of the game model. The model uncertainty makes the steering problem much more challenging, and requires the mediator to strategically balance exploration and exploitation. Typically, without knowledge of the MFG model, the mediator does not know which agents’ behaviors (population densities) are feasible, let alone how to maximize the utility function. Therefore, the mediator needs not only to steer the agents to explore the MFG for its own learning, but also to ensure that the agents converge to desired outcomes, while keeping the cumulative incentive payments affordable. In summary, the key question we would like to address is:

How can we design effective steering strategies for no-regret agents in MFGs under model uncertainty?

Main contributions We address the above open question by proposing novel exploration algorithms with provable guarantees. We highlight our main contributions in the following. A summary of the main theorems in this paper can be found in Appx. B.

• Firstly, in Sec. 3, we contribute the first formulation for steering in mean-field games, with details about the problem setting and learning objectives.

• Secondly, as preparation, in Sec. 4, we investigate how to steer the agents to a given density or policy. Notably, in Sec. 4.2, we propose a novel steering strategy, which can guide the no-adaptive-regret agents towards any target policy without prior knowledge of the model. This method serves as the key ingredient of our steering algorithms in the following sections.

• Thirdly, in Sec. 5 and Sec. 6, we investigate strategic exploration methods for steering agents in MFGs under uncertainty. In Sec. 5, we start with the setting where the intrinsic reward is zero, and propose an optimism-based exploration algorithm, which guarantees that both the cumulative steering gap and cost only have sub-linear growth. Furthermore, in Sec. 6, we extend our methods to the setting with non-zero and unknown intrinsic reward by integrating a pessimism-based reward estimation strategy. We establish sub-linear regret in steering gap, and show that the total steering cost is only sub-linearly worse, compared to a baseline strategy that stabilizes the target policy as an equilibrium by offsetting differences in intrinsic rewards.

1.1 Closely Related Work

Due to space limits, we only discuss closely related works here and defer the others to Appendix C.

Incentive Design in Multi-Agent Systems The problem of incentive design broadly refers to the design of mechanisms for shaping the behavior of autonomous agents (Ehtamo et al., 2002; Ratliff et al., 2019). A recently popular framework for incentive design is known as the steering problem (Zhang et al., 2024; Canyakmaz et al., 2024; Huang et al., 2024b), which considers a repeated interaction between a mediator and learning agents. All of these works focus on small-scale problems (e.g., Markov Games or Extensive-Form Games), and their proposed methods become intractable when extended to large-population settings, including MFGs. Besides, Canyakmaz et al. (2024) and Huang et al. (2024b) consider the agents’ learning dynamics to be memoryless, which is different from our no-regret assumptions.

Another related direction is contract design (DellaVigna and Malmendier, 2004), which studies the interactions between a principal and agents when the two parties transact in the presence of private information (to save space, we defer a more detailed comparison between our steering framework and the contract design setting to Appx. C). The fundamental question is how the principal should design the incentives for the agents to maximize its own utility after deducting the payments to agents. However, most of the literature studies the single-agent setting (Zhu et al., 2022; Ho et al., 2014; Scheid et al., 2024), or focuses on the computational aspects without addressing exploration under uncertainty (Dütting et al., 2023; Castiglioni et al., 2023). Carmona and Wang (2021) and Elie et al. (2019) study contract design in the MFG setting, but neither considers model uncertainty. Moreover, a common assumption in those works is that the agents always best respond (or take equilibrium policies) to the principal’s intervention, which is much stronger than our no-adaptive-regret assumption.

Besides, Sanjari et al. (2024) consider incentive design in a large-population setting, but they study Stackelberg games with one leader and a large number of followers, which differs substantially from our setting. Fu and Horst (2018) consider mean-field leader-follower games; however, they assume knowledge of the dynamics, while we consider the steering problem without this knowledge. Moreover, they study dynamics where the agents cooperate and optimally respond to the leader’s control signal. In contrast, we consider decentralized and self-interested agents with no-regret behaviors in maximizing individual interests.

2 PRELIMINARIES

Mean-Field Games We consider the MFG setting with a finite yet extremely large number of agents, each of which acts independently. In line with Subramanian et al. (2022), we refer to this setting as “Decentralized-MFGs”, although their model allows diversity in action space and reward functions for agents.

Definition 2.1.

A Finite-Horizon Decentralized MFG is defined by a tuple $M=(N,\mathcal{S},\mathcal{A},H,\mathbb{P}_{M},r_{M},\mu_{1})$, given the number of agents $N$; state and action spaces $\mathcal{S},\mathcal{A}$ with sizes $S$ and $A$; horizon length $H$; initial state distribution $\mu_{1}\in\Delta_{\mathcal{S}}$. $\mathbb{P}_{M}:=\{\mathbb{P}_{M,h}\}_{h=1}^{H}$ with $\mathbb{P}_{M,h}:\mathcal{S}\times\mathcal{A}\to\Delta_{\mathcal{S}}$ and $r_{M}:=\{r_{M,h}\}_{h=1}^{H}$ with $r_{M,h}:\mathcal{S}\times\mathcal{A}\times\Delta_{\mathcal{S}\times\mathcal{A}}\to[0,r_{\max}]$ denote the transition and reward function, respectively.

In this paper, we focus on density-independent transition functions, a common assumption in previous literature (Huang et al., 2006; Lasry and Lions, 2007; Perolat et al., 2021). For the reward function, we consider the general setup, where the rewards depend on the state-action density (Guo et al., 2021).

We only focus on non-stationary Markovian policies, denoted by $\Pi:=\{\pi:=\{\pi_{h}\}_{h\in[H]}|\pi_{h}:\mathcal{S}\rightarrow\Delta(\mathcal{A})\}$. Given a model $M$ and an agent taking policy $\pi\in\Pi$, we use $\mu^{\pi}_{M}:=\{\mu^{\pi}_{M,h}\}_{h=1}^{H}$ to denote its state-action density for each step $h\in[H]$. Starting with $\mu^{\pi}_{M,1}(s,a)=\mu_{1}(s)\pi_{1}(a|s)$, for $1\leq h<H$, we have:

$$\mu^{\pi}_{M,h+1}(s,a)=\pi_{h+1}(a|s)\sum_{s^{\prime},a^{\prime}}\mathbb{P}_{M,h}(s|s^{\prime},a^{\prime})\mu^{\pi}_{M,h}(s^{\prime},a^{\prime}).$$
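The recursion above can be unrolled step by step. The following is a minimal NumPy sketch, assuming a tabular model stored as arrays; the array layouts (`P[h, s, a, s_next]`, `pi[h, s, a]`), the function name `state_action_density`, and the random test model are illustrative choices and not part of the paper.

```python
import numpy as np

def state_action_density(mu1, P, pi):
    """Forward recursion for the state-action density of a single policy.

    mu1: (S,) initial state distribution.
    P:   (H, S, A, S) array, P[h, s, a, s_next] = transition probability at step h.
    pi:  (H, S, A) array, pi[h, s, a] = probability of action a in state s at step h.
    Returns mu of shape (H, S, A); only P[0..H-2] is used.
    """
    H, S, A = pi.shape
    mu = np.zeros((H, S, A))
    mu[0] = mu1[:, None] * pi[0]
    for h in range(H - 1):
        # State distribution at step h+1: sum over (s', a') of P(s | s', a') * mu_h(s', a').
        next_state = np.einsum("sa,sat->t", mu[h], P[h])
        mu[h + 1] = next_state[:, None] * pi[h + 1]
    return mu

# Hypothetical random model, used only to exercise the function.
rng = np.random.default_rng(0)
H, S, A = 3, 4, 2
mu1 = rng.dirichlet(np.ones(S))
P = rng.dirichlet(np.ones(S), size=(H, S, A))
pi = rng.dirichlet(np.ones(A), size=(H, S))
mu = state_action_density(mu1, P, pi)
```

Each slice `mu[h]` is a probability distribution over state-action pairs, which is what makes the density an element of $\Delta_{\mathcal{S}\times\mathcal{A}}^{H}$.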

When $N$ agents take policies $\pi^{1},...\pi^{N}\in\Pi$ , respectively, the trajectory of agent $n\in[N]$ is specified by:

$$s_{1}^{n}\sim\mu_{1},\quad\forall h\geq 1:\quad a_{h}^{n}\sim\pi^{n}_{h}(\cdot|s_{h}^{n}),\quad s_{h+1}^{n}\sim\mathbb{P}_{M,h}(\cdot|s_{h}^{n},a_{h}^{n}),\quad r_{h}^{n}\leftarrow r_{M,h}(s_{h}^{n},a_{h}^{n},\bar{\mu}_{M,h}). \tag{1}$$

where we use $\bar{\mu}_{M,h}=\frac{1}{N}\sum_{n=1}^{N}\mu^{\pi^{n}}_{M,h}$ to denote the population density at step $h$. We also assume $r_{M}$ is Lipschitz in the density, which is standard in previous works (Guo et al., 2021; Yardim et al., 2022).

Other Notational Convention For convenience, we implicitly treat $\mu^{\pi}_{M}$ as a vector in $\mathbb{R}^{HSA}$ obtained by concatenating $\{\mu_{M,h}^{\pi}\}_{h\in[H]}$. We denote by $\Psi_{M}:=\{\mu^{\pi}_{M}:\pi\in\Pi\}\subseteq\Delta_{\mathcal{S}\times\mathcal{A}}^{H}$ the set of all feasible state-action densities given $M$. Note that $\Psi_{M}$ is a convex set (see Lem. E.1), which implies $\bar{\mu}_{M}\in\Psi_{M}$. If it is not necessary to distinguish which model $M$ we use, we omit it in the subscripts, for example, $\mu^{\pi}/\Psi$ instead of $\mu^{\pi}_{M}/\Psi_{M}$. We also omit $h$ in $s_{h},a_{h}$ if it is clear from the context. With slight abuse of notation, given a population density $\bar{\mu}:=\{\bar{\mu}_{h}\}_{h=1}^{H}\in\Delta^{H}_{\mathcal{S}\times\mathcal{A}}$ and a reward function $r$, we use $r(\bar{\mu})\in\mathbb{R}^{HSA}$ to denote the reward vector where $(r(\bar{\mu}))_{h,s,a}=r_{h}(s,a,\bar{\mu}_{h})$. In this way, given an arbitrary agent $n\in[N]$ taking policy $\pi^{n}$, its expected total return conditioned on the population density $\bar{\mu}$ can be written as: $\mathbb{E}_{\pi^{n}}[\sum_{h=1}^{H}r_{h}(s_{h}^{n},a_{h}^{n},\bar{\mu}_{h})]=\langle r(\bar{\mu}),\mu^{\pi^{n}}\rangle$.

Given that this paper considers learning under uncertainty, we use $M^{*}$ to denote the true hidden mean-field model with transition $\mathbb{P}^{*}$ and intrinsic reward $r^{*}$ , in order to distinguish it from the estimated ones.

Besides, given a population density $\bar{\mu}$ in a model $M$, we will use $\bar{\pi}$ to denote the policy that induces the population density (i.e., $\mu^{\bar{\pi}}=\bar{\mu}$), defined by: $\bar{\pi}_{h}(\cdot|s):=\bar{\mu}_{h}(s,\cdot)/\bar{\mu}_{h}(s)$, where $\bar{\mu}_{h}(s):=\sum_{a}\bar{\mu}_{h}(s,a)$ (or $\bar{\pi}_{h}(\cdot|s)=1/A$ if $\bar{\mu}_{h}(s)=0$).
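As a small illustration of this definition, the sketch below recovers $\bar{\pi}$ from a density array by normalizing over actions and falling back to the uniform policy on unvisited states. The function name `induced_policy` and the toy density are assumptions made for the example.

```python
import numpy as np

def induced_policy(mu_bar):
    """Policy induced by a density of shape (H, S, A): normalize over actions.

    States with zero mass get the uniform policy 1/A, as in the text.
    """
    A = mu_bar.shape[2]
    state_mass = mu_bar.sum(axis=2, keepdims=True)        # \bar{mu}_h(s)
    safe_mass = np.where(state_mass > 0, state_mass, 1.0)  # avoid division by zero
    uniform = np.full_like(mu_bar, 1.0 / A)
    return np.where(state_mass > 0, mu_bar / safe_mass, uniform)

# Toy density with H=1, S=2, A=2: the second state is never visited.
mu_bar = np.array([[[0.3, 0.7],
                    [0.0, 0.0]]])
pi_bar = induced_policy(mu_bar)
```

In this toy case the first state keeps its action proportions and the unvisited second state receives the uniform policy.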

Reward Function Approximation and Eluder Dimension In this paper, we consider the setting where the true intrinsic reward, denoted by $r^{*}$ , is unknown. Note that the reward function depends on not only the state and action but also the density, which belongs to a high-dimensional continuous space. Therefore, we consider function approximation for reward estimation with the standard realizability assumption.

Assumption A.

A reward function class $\mathcal{R}$ is available, s.t. (i) $\forall r\in\mathcal{R}$ , $\forall h,~{}r_{h}(\cdot,\cdot,\cdot)\in[0,r_{\max}]$ ; (ii) $r^{*}\in\mathcal{R}$ .

In the function approximation setting, the fundamental sample efficiency is closely related to the complexity of the function class. We follow previous works (Russo and Van Roy, 2013; Huang et al., 2024a) and utilize the Eluder Dimension as the complexity measure of the function class. Intuitively, the Eluder Dimension is defined to be the length of the longest “independent” sequence, such that each element in the sequence “reveals” some new information about the function class compared with the previous ones.

Definition 2.2 ( $\varepsilon$ -independent sequence).

Given a domain $\mathcal{X}$ and a class of functions $\mathcal{F}$ defined on $\mathcal{X}$, we say $x\in\mathcal{X}$ is $\varepsilon$-independent on $\{x_{1},...,x_{J}\}\subseteq\mathcal{X}$ if there exist $f,\tilde{f}\in\mathcal{F}$, such that $\sum_{j=1}^{J}(f(x_{j})-\tilde{f}(x_{j}))^{2}\leq\varepsilon^{2}$, but $|f(x)-\tilde{f}(x)|>\varepsilon$.

Definition 2.3 (Eluder Dimension).

Given a mean-field reward function class $\mathcal{R}$ and domain $\mathcal{X}:=[H]\times\mathcal{S}\times\mathcal{A}\times\Delta_{\mathcal{S}\times\mathcal{A}}$, the Eluder Dimension of $\mathcal{R}$, denoted by $\dim_{E}(\mathcal{R},\varepsilon)$, is defined to be the length of the longest sequence $\{x^{j}\}_{j=1}^{J}$, such that, for any $i\in[J]$, $x^{i}$ is $\varepsilon$-independent w.r.t. $\{x^{j}\}_{j=1}^{i-1}$.
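To make Definition 2.2 concrete, the following sketch checks $\varepsilon$-independence by brute force. It assumes a toy setting that is much simpler than the paper's: a finite function class over a finite domain, stored as a value table `F[i, j]` (function $i$ evaluated at point $j$). The name `is_eps_independent` and the example table are hypothetical.

```python
import numpy as np

def is_eps_independent(F, x, prev, eps):
    """Check whether point index x is eps-independent of the points in prev.

    F:    (num_functions, num_points) table of function values (toy finite class).
    prev: list of point indices already in the sequence.
    Returns True if some pair of functions nearly agrees on prev
    (squared differences sum to at most eps^2) yet differs by more than eps at x.
    """
    prev_idx = np.asarray(prev, dtype=int)
    num_functions = F.shape[0]
    for i in range(num_functions):
        for j in range(i + 1, num_functions):
            diff = F[i] - F[j]
            close_on_prev = np.sum(diff[prev_idx] ** 2) <= eps ** 2
            if close_on_prev and abs(diff[x]) > eps:
                return True
    return False

# Two functions on three points that differ only at point 1.
F = np.array([[0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
new_information = is_eps_independent(F, x=1, prev=[0], eps=0.5)
no_new_information = is_eps_independent(F, x=2, prev=[0], eps=0.5)
```

Point 1 separates the two functions although they agree on point 0, so it is independent; point 2 does not separate them, so it is not.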

3 THE STEERING PROBLEM FORMULATION FOR MFGS

In this section, we introduce our steering setup. In Sec. 3.1, we first provide our formulation of the steering protocol. Then, in Sec. 3.2, we discuss our assumptions on the agents’ behavior. After that, we introduce the learning objectives and other setups in Sec. 3.3 and 3.4.

3.1 Agent-Mediator Interaction Protocol

We consider a repeated game setup, and summarize the interaction procedure between agents and the mediator in Procedure 1. In each iteration $t\in[T]$, the mediator first selects a steering reward function $R^{t}$ (we use capital $R$ to denote the steering reward, to distinguish it from the intrinsic reward $r$), which is a mapping from the density space to the non-negative reward vector space, upper bounded by $R_{\max}$. The non-negativity of the steering reward is known as limited liability (Innes, 1990), which is standard in previous works (Zhang et al., 2024; Huang et al., 2024b). Next, each agent computes a policy and plays the game. The agents’ policies result in a population density $\bar{\mu}^{t}:=\frac{1}{N}\sum_{n=1}^{N}\mu^{\pi^{n,t}}$, by which the mediator realizes the steering reward $R^{t}(\bar{\mu}^{t})$. Then, each agent $n\in[N]$ receives payments from the mediator equal to the expected return induced by the steering reward and the agent’s policy, i.e., $\langle R^{t}(\bar{\mu}^{t}),\mu^{\pi^{n,t}}\rangle$. We highlight here that in our setup, at each iteration $t$, the mediator designs the steering reward function $R^{t}$ without knowledge of the agents’ policies $\pi^{n,t}$, and we do not restrict whether the agents can observe $R^{t}$ before they make decisions. Furthermore, the agents can either independently compute their policies or collaborate. In the next section, we will characterize our assumptions on the agents’ behaviors in more detail.

Procedure 1 Agent-Mediator Interaction Protocol

1: for $t=1,...,T$ do
2:   Mediator chooses $R^{t}:\Delta_{\mathcal{S}\times\mathcal{A}}^{H}\to[0,R_{\max}]^{HSA}$.
3:   Each agent $n\in[N]$ computes policy $\pi^{n,t}\in\Pi$, resulting in the population density $\bar{\mu}^{t}$, and gets payment $\langle R^{t}(\bar{\mu}^{t}),\mu^{\pi^{n,t}}\rangle$ from the mediator.
4:   Mediator observes $\bar{\mu}^{t}$ and a trajectory $\{(s_{h}^{n,t},a_{h}^{n,t},r_{h}^{n,t}+\xi_{h}^{t})\}_{h\in[H]}$ generated by $\pi^{n,t}$ following Eq. (1), where $n\sim\text{Uniform}\{1,2,...,N\}$.
5: end for

At the end of each iteration, the mediator can observe a trajectory sampled from a random agent with noisy reward samples. We assume noises $\xi_{h}^{t}$ are i.i.d. $\sigma$ -sub-Gaussian random variables with zero mean. We also assume the mediator has access to the population density, which is necessary to estimate the unknown intrinsic reward function from samples.

3.2 Behavioral Assumptions on Agents

We first introduce our no-adaptive-regret assumption and its implication, and then provide some justification.

Assumption B (No-Adaptive Regret Behavior).

In Procedure 1, the adaptive regret of each agent $n\in[N]$, which is defined below, can be upper bounded by some term $\operatorname*{AdaReg}(T)=(r_{\max}+R_{\max})\cdot o(T)$:

$$\max_{\begin{subarray}{c}1\leq a<b\leq T\\ \mu\in\Psi_{M^{*}}\end{subarray}}\sum_{t=a}^{b}\langle r^{*}(\bar{\mu}^{t})+R^{t}(\bar{\mu}^{t}),\mu-\mu^{\pi^{n,t}}\rangle, \tag{2}$$

where $(r_{\max}+R_{\max})$ is the normalization term. In Appx. D.4 we show that for all the steering rewards that we deploy in this paper we have $R_{\max}=\mathcal{O}(1+r_{\max})$ .

Justification for Assump. B We remark that it is common to consider agents exhibiting no-regret behaviors in previous literature (Deng et al., 2019; Zhang et al., 2024; Brown et al., 2024). Most of these works assume no-external regret (i.e., fixing $a=1$ and $b=T$ in Eq. (2)), which is weaker than our no-adaptive-regret assumption. However, similar stronger assumptions, such as no-dynamic-regret learners, have also been considered in some studies (Ge et al., 2024). Moreover, our no-adaptive-regret assumption is standard when interpreted through the online linear optimization perspective (Hazan and Seshadhri, 2007; Hazan, 2023), where in each iteration, each agent picks a density from the convex set $\Psi_{M^{*}}$ and receives potentially adversarial feedback $R^{t}(\bar{\mu}^{t})$. Then, Assump. B aligns with the standard no-adaptive-regret guarantees in the online linear optimization setting, and there are very simple algorithms (e.g., Online Gradient Descent) achieving $\operatorname*{AdaReg}=\tilde{O}(\sqrt{T})$. We defer a more detailed discussion to Appx. D.2.

Under Assump. B, we have the following property, which suggests the collective population will also exhibit no-regret behaviors. This is a useful property we will leverage in algorithm design.

Proposition 3.1 (No-Adaptive-Regret Population Behavior).

Under Assump. B, we have:

$$\max_{1\leq a<b\leq T,\,\mu\in\Psi_{M^{*}}}\sum_{t=a}^{b}\langle r^{*}(\bar{\mu}^{t})+R^{t}(\bar{\mu}^{t}),\mu-\bar{\mu}^{t}\rangle\leq\operatorname*{AdaReg}(T)=(r_{\max}+R_{\max})\cdot o(T).$$
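Why this should follow from Assump. B can be sketched in one line, using only that $\bar{\mu}^{t}=\frac{1}{N}\sum_{n=1}^{N}\mu^{\pi^{n,t}}$ and that the inner product is linear in its second argument. For any fixed $1\leq a<b\leq T$ and $\mu\in\Psi_{M^{*}}$:

$$\sum_{t=a}^{b}\langle r^{*}(\bar{\mu}^{t})+R^{t}(\bar{\mu}^{t}),\mu-\bar{\mu}^{t}\rangle=\frac{1}{N}\sum_{n=1}^{N}\sum_{t=a}^{b}\langle r^{*}(\bar{\mu}^{t})+R^{t}(\bar{\mu}^{t}),\mu-\mu^{\pi^{n,t}}\rangle\leq\frac{1}{N}\sum_{n=1}^{N}\operatorname*{AdaReg}(T)=\operatorname*{AdaReg}(T).$$

Each inner sum over $t$ is bounded by the per-agent adaptive regret in Eq. (2), and taking the maximum over $(a,b,\mu)$ on the left-hand side gives the stated bound. This is only a sketch of the argument and not a substitute for a formal proof.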

3.3 Performance Metrics

Inspired by previous works (Zhang et al., 2024; Huang et al., 2024b), we evaluate the steering algorithm from two aspects: the steering gap and the steering cost. We provide the concrete definitions in our MFG setup as follows.

The Steering Gap Intuitively, the steering gap measures the difference between the desired outcomes and the agents’ behavior under the mediator’s guidance. In this paper, we assume the mediator is given a utility function $U:\Delta_{\mathcal{S}\times\mathcal{A}}^{H}\to\mathbb{R}$ assigning each population density a utility value. The only assumption we make for it is about the Lipschitz continuity:

Assumption C (Lipschitz Utility Function).

$\forall\mu,\mu^{\prime}\in\Delta^{H}_{\mathcal{S}\times\mathcal{A}},~|U(\mu)-U(\mu^{\prime})|\leq L_{U}\|\mu-\mu^{\prime}\|_{1}.$

The steering gap up to step $T$ is defined by:

$$\Delta_{T}(\{\bar{\mu}^{t}\}_{t=1}^{T}):=\max_{\pi^{*}\in\Pi}\sum_{t=1}^{T}\big(U(\mu^{\pi^{*}})-U(\bar{\mu}^{t})\big).$$

Here $U(\bar{\mu}^{t})$ represents the utility paid to the mediator at each iteration $t\in[T]$ , induced by the population density $\bar{\mu}^{t}$ . Note that we consider the best density maximizing utility function as the comparator. This can be interpreted as the best population density if all the agents are restricted to take the same policy, and finding the best shared policy is a standard objective in previous MFGs literature.

The Steering Cost The motivation for introducing a steering cost is that the agents will not accept the mediator’s guidance for free. A common measure of the cost is the expected total return associated with the reward received by the agents. Formally, suppose that at iteration $t$ the mediator computes a steering reward function $R^{t}$, and the $N$ agents select policies $\pi^{1,t},\pi^{2,t},...,\pi^{N,t}\in\Pi$, which induce a population density $\bar{\mu}^{t}:=\frac{1}{N}\sum_{n=1}^{N}\mu^{\pi^{n,t}}$. Then the steering cost is defined as the average payment to the agents: $C(\bar{\mu}^{t},R^{t}):=\langle R^{t}(\bar{\mu}^{t}),\bar{\mu}^{t}\rangle=\frac{1}{N}\sum_{n=1}^{N}\langle R^{t}(\bar{\mu}^{t}),\mu^{\pi^{n,t}}\rangle.$ We will use

$$C_{T}(\{\bar{\mu}^{t},R^{t}\}_{t=1}^{T}):=\sum_{t=1}^{T}C(\bar{\mu}^{t},R^{t})$$

to denote the cumulative steering cost. Since the steering rewards are non-negative, the steering cost effectively reflects the strength of the steering signal.

3.4 Two Steering Scenarios and Objectives

In this paper, we consider the case when the mediator does not know the true transition and reward functions of $M^{*}$. However, to make it easier for the reader to understand our algorithm design and technical contributions, we will start with a special case, where the agents do not have intrinsic rewards, i.e., $r^{*}=0$.

Scenario 1: No Intrinsic Reward The goal in this setting is to find an incentive design algorithm producing a sequence of $R^{t}$ such that both the steering gap and the steering cost are sub-linear: $\Delta_{T}(\{\bar{\mu}^{t}\}_{t=1}^{T})=o(T),~C_{T}(\{\bar{\mu}^{t},R^{t}\}_{t=1}^{T})=o(T).$

The motivation for the sub-linear guarantee here is that it implies the average utility converges to the maximum and the average steering cost vanishes. This implies that the incentive design strategies fulfilling these guarantees will eventually pay off as a long-term investment. In Sec. 5, we analyze this case and provide algorithms achieving our objective.

Scenario 2: Non-Zero Intrinsic Reward In Sec. 6, we study the complete setting where the agents’ original reward is non-zero and unknown. In this case, the mediator additionally has to estimate the reward function from observed noisy samples and steer the agents based on that. Similarly, we expect sub-linear steering gap $\Delta_{T}(\{\bar{\mu}^{t}\}_{t=1}^{T})=o(T)$ , while for steering cost, we manually choose the “sandboxing reward” as the comparator:

$$C_{T}(\{\bar{\mu}^{t},R^{t}-\underbrace{(r_{\max}\cdot\mathbf{1}-r^{*})}_{\text{sandboxing reward}}\}_{t=1}^{T})=o(T).$$

Here we use $\mathbf{1}$ to denote the all-ones vector. Intuitively, because the intrinsic rewards are non-zero, if the desired behavior $\mu^{\pi^{*}}$ is not an equilibrium induced by $r^{*}$, the mediator has to maintain non-zero steering rewards to prevent the agents from deviating from $\pi^{*}$, so we cannot expect the average steering cost to vanish as in Scenario 1. Therefore, we consider the sandboxing reward as a baseline comparator, which mitigates differences in the intrinsic rewards, so that $\pi^{*}$ would be a “stable equilibrium” even if the additional steering reward $R^{t}-(r_{\max}\cdot\mathbf{1}-r^{*}(\bar{\mu}^{t}))$ vanishes to zero. We acknowledge, however, that other choices of sandboxing terms may result in lower steering cost, or one can consider optimizing utility and steering cost together. We leave those interesting directions for future work.

4 STEERING TOWARDS A FIXED TARGET

In this section, we focus on how to design rewards to guide the agents to a target population density or policy, which serves as preparation steps for the following sections. For convenience, we assume the agents’ intrinsic rewards are zero and ignore them.

4.1 Warm-Up: Steering in a Known Model

We start with the case when the MFG model is known. In this case, we can also compute $\Psi_{M^{*}}$ and find the best $\mu^{*}=\operatorname*{arg\,max}_{\mu\in\Psi_{M^{*}}}U(\mu)$ . If we want to steer the population to $\mu^{*}$ , one steering reward choice is $R(\mu)=\bm{1}\|\mu^{*}-\mu\|_{\infty}+\mu^{*}-\mu$ , where the first shift term is to ensure the non-negativity. The key motivation for our choice is that $\langle R(\mu),\mu^{*}-\mu\rangle=\|\mu-\mu^{*}\|_{2}^{2}$ . As a result, if we consider the accumulative performance, we have the following theorem.

Theorem 4.1.

If $M^{*}$ is known and $r^{*}$ is zero everywhere, under Assump. B and C, by choosing the steering reward $\forall~t\in[T]$, $R^{t}(\mu)=\mu^{*}-\mu+\bm{1}\|\mu^{*}-\mu\|_{\infty}$, for any $\mu\in\Delta_{\mathcal{S}\times\mathcal{A}}^{H}$, we have:

$$\Delta_{T}(\{\bar{\mu}^{t}_{M^{*}}\}_{t=1}^{T})\leq L_{U}\sqrt{HSAT\operatorname{AdaReg}(T)}=o(T),$$
$$C_{T}(\{\bar{\mu}^{t}_{M^{*}},R^{t}\}_{t=1}^{T})\leq 2H\sqrt{T\operatorname{AdaReg}(T)}=o(T).$$

The bound above for the known-model setting, although it may not be tight, can serve as a benchmark for the more challenging unknown-model settings. We will see that the bounds in Theorems 5.1 and 6.1 (for the unknown-model setting) are not much worse than the bound of Theorem 4.1.
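The identity $\langle R(\mu),\mu^{*}-\mu\rangle=\|\mu-\mu^{*}\|_{2}^{2}$ behind the warm-up reward of Sec. 4.1 can be checked numerically. It holds because $\langle\bm{1},\mu^{*}-\mu\rangle=H-H=0$ for any two elements of $\Delta_{\mathcal{S}\times\mathcal{A}}^{H}$, so the shift term drops out of the inner product. The sketch below uses random points of $\Delta_{\mathcal{S}\times\mathcal{A}}^{H}$, which are not necessarily feasible densities of any model; the function name and the random inputs are assumptions for illustration.

```python
import numpy as np

def known_model_steering_reward(mu, mu_star):
    """R(mu) = 1 * ||mu_star - mu||_inf + (mu_star - mu), entrywise non-negative."""
    diff = mu_star - mu
    return diff + np.max(np.abs(diff))

# Random elements of Delta^H over S*A state-action pairs (one simplex point per step h).
rng = np.random.default_rng(0)
H, S, A = 3, 4, 2
mu = rng.dirichlet(np.ones(S * A), size=H)
mu_star = rng.dirichlet(np.ones(S * A), size=H)

R = known_model_steering_reward(mu, mu_star)
lhs = np.sum(R * (mu_star - mu))       # <R(mu), mu_star - mu>
rhs = np.sum((mu - mu_star) ** 2)      # ||mu - mu_star||_2^2
```

The two quantities `lhs` and `rhs` agree up to floating-point error, and every entry of `R` is non-negative thanks to the $\|\cdot\|_{\infty}$ shift.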

4.2 Steering towards a Target Policy in an Unknown Model

Without knowledge of the transition function $\mathbb{P}^{*}$, steering becomes challenging, because we can no longer compute $\Psi_{M^{*}}$ or identify whether a given density (e.g. $\operatorname*{arg\,max}_{\mu\in\Delta_{\mathcal{S}\times\mathcal{A}}^{H}}U(\mu)$) can actually be achieved by the agents. Therefore, we shift our focus to the policy space. Interestingly, we show that it is possible to steer the agents to any target policy $\pi\in\Pi$, even without knowledge of $\mathbb{P}^{*}$ or $\Psi_{M^{*}}$. Our key observation is the following lemma, which provides an upper bound on the difference between the population density and the density induced by the target policy.

Lemma 4.2.

Given any $M$ and target $\pi\in\Pi$ , suppose the agents induce population density $\bar{\mu}_{M}$ in $M$ , then:

$$\|\bar{\mu}_{M}-\mu^{\pi}_{M}\|_{1}\leq H\sum_{h,s}\bar{\mu}_{M,h}(s)\|\bar{\pi}_{h}(\cdot|s)-\pi_{h}(\cdot|s)\|_{1},\quad\text{with}\quad\bar{\mu}_{M,h}(s):=\sum_{a}\bar{\mu}_{M,h}(s,a). \tag{3}$$

This motivates us to design a steering reward function that penalizes the RHS of Eq. (3), which is actually doable without knowledge of the model. Given a policy $\pi$, we define the matrix $W^{\pi}\in\mathbb{R}^{SAH\times SAH}$ to be the block-diagonal matrix with blocks $W^{\pi}_{h,s}$ for all $h\in[H]$ and $s\in\mathcal{S}$, where

$$W^{\pi}_{h,s}:=\left[\begin{matrix}\pi_{h}(a_{1}|s)&\ldots&\pi_{h}(a_{1}|s)\\ \vdots&&\vdots\\ \pi_{h}(a_{A}|s)&\ldots&\pi_{h}(a_{A}|s)\end{matrix}\right]\in\mathbb{R}^{A\times A}. \tag{4}$$

Now, consider the steering reward function:

$$\forall\mu\in\Delta^{H}_{\mathcal{S}\times\mathcal{A}},~R_{\pi}(\mu):=-\mu^{\top}(W^{\pi}-I)^{\top}(W^{\pi}-I), \tag{5}$$

where $I\in\mathbb{R}^{SAH\times SAH}$ is the identity matrix. We can verify that, for any possible population density $\bar{\mu}^{t}$ occurring at step $t$, we have

$$\langle R_{\pi}(\bar{\mu}^{t}),\mu^{\pi}-\bar{\mu}^{t}\rangle=\|(W^{\pi}-I)\bar{\mu}^{t}\|_{2}^{2}=\sum_{h,s,a}(\bar{\mu}_{h}^{t}(s))^{2}\,|\pi_{h}(a|s)-\bar{\pi}_{h}^{t}(a|s)|^{2}. \tag{6}$$

Recall that $\bar{\pi}_{h}^{t}$ denotes the policy induced by population density (see definition in Sec. 2). Here in the first equality, we use the fact that, for any $\mu$ , $\langle R_{\pi}(\mu),\mu^{\pi}\rangle=0$ since $(W^{\pi}-I)\mu^{\pi}=0$ . Eq. (6) above is important in that it connects the one step regret (LHS) with the gap between the population density and target density (RHS through Lemma 4.2).
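Both equalities in Eq. (6), and the fact $(W^{\pi}-I)\mu^{\pi}=0$, can be verified numerically on a small random model. The sketch below is only illustrative: it assumes densities are flattened in $(h,s,a)$ order, builds $W^{\pi}$ as an explicit dense matrix (fine for tiny $H,S,A$), and uses hypothetical helper names (`density`, `build_W`, `steering_reward`) and random policies that do not come from the paper.

```python
import numpy as np

def density(mu1, P, pi):
    """Forward recursion for the state-action density, shape (H, S, A)."""
    H, S, A = pi.shape
    mu = np.zeros((H, S, A))
    mu[0] = mu1[:, None] * pi[0]
    for h in range(H - 1):
        mu[h + 1] = np.einsum("sa,sat->t", mu[h], P[h])[:, None] * pi[h + 1]
    return mu

def build_W(pi):
    """Block-diagonal W^pi; in the (h, s) block every column equals pi_h(.|s)."""
    H, S, A = pi.shape
    W = np.zeros((H * S * A, H * S * A))
    for h in range(H):
        for s in range(S):
            i = (h * S + s) * A
            W[i:i + A, i:i + A] = np.tile(pi[h, s][:, None], (1, A))
    return W

def steering_reward(pi, mu):
    """R_pi(mu) = -mu^T (W - I)^T (W - I), returned as a flat vector of length HSA."""
    D = build_W(pi) - np.eye(mu.size)
    return -(D.T @ D) @ mu.reshape(-1)

# Hypothetical random model, target policy, and a population of five agents.
rng = np.random.default_rng(0)
H, S, A = 3, 4, 2
mu1 = rng.dirichlet(np.ones(S))
P = rng.dirichlet(np.ones(S), size=(H, S, A))
target = rng.dirichlet(np.ones(A), size=(H, S))
agents = [rng.dirichlet(np.ones(A), size=(H, S)) for _ in range(5)]

mu_target = density(mu1, P, target)
mu_bar = np.mean([density(mu1, P, p) for p in agents], axis=0)

D = build_W(target) - np.eye(H * S * A)
lhs = steering_reward(target, mu_bar) @ (mu_target - mu_bar).reshape(-1)
middle = np.sum((D @ mu_bar.reshape(-1)) ** 2)
state_mass = mu_bar.sum(axis=2, keepdims=True)   # \bar{mu}_h(s), positive here
pi_bar = mu_bar / state_mass                     # induced population policy
rhs = np.sum(state_mass ** 2 * (target - pi_bar) ** 2)
```

Note that `steering_reward` never touches the transition array `P`: the reward depends only on the target policy and the observed density, which is what makes it usable without knowledge of the model.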

Combining with Prop. 3.1, if all the agents are no-regret learners, and we steer the agents with the same steering reward $R_{\pi}$ for $T$ steps, we should expect $\bar{\mu}^{t}$ to converge to $\mu^{\pi}$, which we summarize in the following theorem. This result provides important insights for our incentive design algorithm in Section 5.

Theorem 4.3.

Let $\pi^{*}=\operatorname*{arg\,max}_{\pi}U(\mu^{\pi})$ and $R^{t}(\mu)=R_{\pi^{*}}(\mu)+\|R_{\pi^{*}}(\mu)\|_{\infty}\bm{1}$ for all $t$ . Under Assump. B,

$$\Delta_{T}(\{\bar{\mu}^{t}_{M^{*}}\}_{t=1}^{T})\leq L_{U}\sqrt{H^{3}SAT\operatorname{AdaReg}(T)}=o(T),$$
$$C_{T}(\{\bar{\mu}^{t}_{M^{*}},R^{t}\}_{t=1}^{T})\leq 4H\sqrt{T\operatorname{AdaReg}(T)}=o(T).$$

5 STEERING WITH NO INTRINSIC REWARD

In this section, we study Scenario 1 introduced in Sec. 3.4, where the transition function $\mathbb{P}^{*}$ is unknown and the original reward $r^{*}$ is zero, so the steering rewards are the only incentives for the agents. The main challenge in this setting is that, without knowledge of $\mathbb{P}^{*}$, we cannot determine the feasible density set $\Psi_{M^{*}}$ or the maximizer of the utility function. Therefore, we have to design a steering strategy that incentivizes the agents to explore so that the mediator can estimate $\mathbb{P}^{*}$, while balancing the exploration-exploitation trade-off to ensure sub-linear steering gap and cost.

Our main contribution is an optimism-based exploration algorithm in Alg. 2, which provably addresses the above challenges and achieves our objectives. The algorithm is built based on the techniques we developed in Sec. 4.2, which allows us to steer the agents to any target policy without the knowledge of model. Next, we introduce the key components in algorithm design.

Algorithm 2 Steering reward design for Scenario 1

1: Initialize $\mathcal{P}^{1}:=$ set of all possible transition functions, $\pi_{*}^{1}$ (arbitrarily), $k=1,T_{0}=0$
2: for $t=1,...,T$ do
3:   $\triangleright$ Recall $R_{\pi_{*}^{k}}$ as defined in Eq. (5)
4:   Compute steering reward function $R^{t}_{\text{z}}(\cdot)\leftarrow R_{\pi_{*}^{k}}(\cdot)+\|R_{\pi_{*}^{k}}(\cdot)\|_{\infty}\bm{1}.$
5:   Agents play the $t$ -th game.
6:   Obtain trajectory $((s^{t}_{h},a^{t}_{h}))_{h=1}^{H}$
7:   if $\exists(h,s,a),~s.t.~n_{k}(h,s,a)\geq N_{k}(h,s,a)$ then
8:     Update $\mathcal{P}^{k+1}$ as in Eq. (7).
9:     $T_{k}\leftarrow t$ ; $k\leftarrow k+1$
10:     $\pi_{*}^{k},\hat{M}^{k}\leftarrow\operatorname*{arg\,max}_{\pi\in\Pi,\hat{M}:\hat{\mathbb{P}}_{\hat{M}}\in\mathcal{P}^{k}}U(\mu^{\pi}_{\hat{M}}).$
11:   end if
12: end for

Low Policy Switching Optimistic Exploration Strategy For efficient exploration, we maintain a confidence set for $\mathbb{P}^{*}$ denoted by $\mathcal{P}$ :

		$\displaystyle\bar{\mathbb{P}}^{k+1}_{h}(s^{\prime}|s,a):=\sum_{t=1}^{T_{k}}\frac{\mathbb{I}\{s^{t}_{h}=s,a^{t}_{h}=a,s^{t}_{h+1}=s^{\prime}\}}{\max\{1,N_{k+1}(h,s,a)\}},$
		$\displaystyle\mathcal{P}^{k+1}:=\bigg\{\hat{\mathbb{P}}:\forall h,s,a,~\|\hat{\mathbb{P}}_{h}(\cdot|s,a)-\bar{\mathbb{P}}^{k+1}_{h}(\cdot|s,a)\|_{1}\leq\varepsilon_{k+1}(h,s,a)\bigg\},$		(7)

where $\varepsilon_{k+1}(h,s,a):=\sqrt{\frac{2S\ln(THSA/\delta)}{\max\{1,N_{k+1}(h,s,a)\}}}$ . We highlight that we update $\mathcal{P}$ and switch the target policy only at low frequency. We use the index $1\leq k\leq K$ to count policy-switching episodes, to distinguish them from the steering steps $t\in[T]$ . We use $k(t)$ to denote the index of the episode containing iteration $t$ and $T_{k}$ to denote the iteration at which the $k$ -th episode ends. We define $n_{k}(h,s,a)$ to be the number of visits to $(s,a)$ at step $h$ in episode $k$ , and $N_{k}(h,s,a)=\sum_{k^{\prime}<k}n_{k^{\prime}}(h,s,a)$ . A new episode begins as soon as, for some $(h,s,a)$ , we have collected as many samples in the current episode as in all previous episodes combined, i.e., $n_{k}(h,s,a)\geq N_{k}(h,s,a)$ . The main motivation for this technique is to guard against the agents’ potentially adversarial behaviors. As we will see later in the proof sketch, $K$ appears in the steering gap upper bound.
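As an illustration of Eq. (7), the following is a minimal tabular sketch of the empirical transition estimate, the $\ell_{1}$ radius, and the membership test. The function names, the 0-based indexing, and the convention that transitions are only counted for steps with an observed successor are our assumptions for the sketch, not part of the algorithm's specification.

```python
import numpy as np

def empirical_transitions(trajectories, H, S, A):
    """Return (P_bar, N): empirical next-state frequencies and visit counts.

    trajectories: list of length-H lists of (state, action) pairs (0-based).
    Assumption of this sketch: only steps h = 0..H-2 have an observed
    successor, so counts for the last step stay at zero.
    """
    counts = np.zeros((H, S, A, S))
    for traj in trajectories:
        for h in range(H - 1):
            s, a = traj[h]
            s_next, _ = traj[h + 1]
            counts[h, s, a, s_next] += 1
    N = counts.sum(axis=-1)
    # Divide by max{1, N} so unvisited triples get the all-zero estimate.
    P_bar = counts / np.maximum(1, N)[..., None]
    return P_bar, N

def l1_radius(N, T, delta):
    """Radius sqrt(2 S ln(T H S A / delta) / max{1, N}) for every (h, s, a)."""
    H, S, A = N.shape
    return np.sqrt(2 * S * np.log(T * H * S * A / delta) / np.maximum(1, N))

def in_confidence_set(P_hat, P_bar, eps):
    """Check the l1 constraint of the confidence set for all (h, s, a)."""
    l1 = np.abs(P_hat - P_bar).sum(axis=-1)
    return bool(np.all(l1 <= eps))
```

The radius shrinks as $1/\sqrt{N}$ , so triples that are visited more often constrain the candidate models more tightly.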

For exploration, we select the optimistic policy $\pi_{*}^{(\cdot)}$ and model $\hat{M}^{(\cdot)}$ (line 10) s.t. the induced density maximizes utility. Then, we choose steering reward $R_{\text{z}}$ to guide the agents towards $\pi_{*}^{(\cdot)}$ and collect data samples to update the model confidence set. Intuitively, either $\pi_{*}^{(\cdot)}$ indeed maximizes the utility, implying a low steering gap; or the exploration helps to reduce the uncertainty.
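The optimistic selection step can be sketched as follows. This is a simplified illustration, not the planner used in Alg. 2: we assume a finite list of candidate models (standing in for the confidence set) and a finite list of candidate policies (standing in for $\Pi$ ), and the names `state_action_density` and `optimistic_plan` are hypothetical.

```python
import numpy as np

def state_action_density(P, pi, mu1):
    """Forward recursion for the state-action density of pi in model P.

    P: (H, S, A, S) transitions, pi: (H, S, A) policy, mu1: (S,) initial law.
    Returns mu with shape (H, S, A); each mu[h] sums to one.
    """
    H, S, A = pi.shape
    mu = np.zeros((H, S, A))
    state_dist = mu1
    for h in range(H):
        mu[h] = state_dist[:, None] * pi[h]
        # Push the mass through the step-h kernel to get the next state law.
        state_dist = np.einsum("sa,sat->t", mu[h], P[h])
    return mu

def optimistic_plan(candidate_models, candidate_policies, mu1, utility):
    """Return (value, policy, model) maximizing utility over all pairs.

    Assumption of this sketch: exhaustive search over finite candidate lists.
    """
    best = None
    for P in candidate_models:
        for pi in candidate_policies:
            value = utility(state_action_density(P, pi, mu1))
            if best is None or value > best[0]:
                best = (value, pi, P)
    return best
```

Because the maximization ranges jointly over policies and models, the selected pair attains at least the utility of the best policy in the true model whenever the true model is among the candidates, which is the sense in which the choice is optimistic.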

Managing the steering gap and cost We have the following guarantees for Alg. 2.

Theorem 5.1.

Suppose the intrinsic reward $r^{*}=0$ , under Assump. B and C, if we run Alg. 2 with $\delta\in(0,1)$ , then with probability at least $1-2\delta$ , $K\leq HSA\log_{2}T$ , and

	$\displaystyle\Delta_{T}(\{\bar{\mu}^{t}_{M^{*}}\}_{t=1}^{T})\leq L_{U}\sqrt{H^{3}SATK\operatorname{AdaReg}(T)}$
	$\displaystyle\qquad\qquad\quad+36L_{U}H^{3}S\sqrt{\ln(THSA/\delta)AT}=o(T).$
	$\displaystyle C_{T}(\{\bar{\mu}^{t}_{M^{*}},R_{\text{z}}^{t}\}_{t=1}^{T})\leq 4H\sqrt{TK\operatorname{AdaReg}(T)}=o(T).$

As a concrete example, agents following Online Gradient Descent with step size $O(1/\sqrt{t})$ (Hazan, 2023) result in $\operatorname*{AdaReg}(T)=\tilde{\mathcal{O}}(\sqrt{T})$ (ignoring $H,S$ and $A$ ), which implies an $\tilde{\mathcal{O}}(T^{3/4})$ steering gap. Besides, if all the agents are capable enough s.t. for any $t\in[T]$ , $\pi^{1,t},...,\pi^{N,t}$ are equilibria w.r.t. $r^{*}+R_{\text{z}}^{t}$ , then $\operatorname*{AdaReg}$ is constant-level, resulting in an $\tilde{\mathcal{O}}(\sqrt{T})$ bound.
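The rates quoted above follow from substituting into Thm. 5.1. The calculation below is a sketch that treats $H,S,A,L_{U}$ and $\delta$ as constants and uses $K\leq HSA\log_{2}T$ , so that $K$ is absorbed into $\tilde{\mathcal{O}}(\cdot)$ .

```latex
% Case AdaReg(T) = \tilde{O}(\sqrt{T}); K = \tilde{O}(1) is absorbed.
\begin{align*}
L_{U}\sqrt{H^{3}SAT\,K\operatorname{AdaReg}(T)}
  &= \tilde{\mathcal{O}}\Big(\sqrt{T\cdot\sqrt{T}}\Big)
   = \tilde{\mathcal{O}}(T^{3/4}), \\
36L_{U}H^{3}S\sqrt{\ln(THSA/\delta)AT}
  &= \tilde{\mathcal{O}}(\sqrt{T}).
\end{align*}
% Case AdaReg(T) = O(1): both terms are \tilde{O}(\sqrt{T}).
```

In the first case the agents' regret term dominates the model estimation term; in the second case the two terms are of the same order.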

Proof Sketch We first analyze the steering gap. Intuitively, Alg. 2 can be interpreted as a “ $K$ -stage” version of what we did in Sec. 4.2. In each stage, we pick a target policy and steer the agents towards it for exploration. Following this intuition, and thanks to the Lipschitz condition (Assump. C) and the optimism in planning, we can decompose the steering gap as follows:

	$\displaystyle\Delta_{T}(\{\bar{\mu}^{t}_{M^{*}}\}_{t=1}^{T})\leq L_{U}(2H+1)\underbrace{\sum_{t=1}^{T}\|\mu^{\bar{\pi}^{t}}_{\hat{M}^{k(t)}}-\bar{\mu}^{t}_{M^{*}}\|_{1}}_{\Delta_{\text{est}}}$
	$\displaystyle\qquad+L_{U}H\underbrace{\sum_{t=1}^{T}\sum_{h,s}\bar{\mu}_{M^{*},h}^{t}(s)\|\pi^{k(t)}_{*,h}(\cdot|s)-\bar{\pi}^{t}_{h}(\cdot|s)\|_{1}}_{\Delta_{\text{pop}}}.$		(8)

We refer to the first term $\Delta_{\text{est}}$ as the model estimation error, which measures the gap between the population density $\bar{\mu}^{t}$ and the density induced by the population average policy $\bar{\pi}^{t}$ (see definition in Sec. 2) in the estimated model $\hat{M}^{k}$ . As we collect more and more data, $\hat{M}^{k}$ gets closer to $M^{*}$ , and we can show that $\Delta_{\text{est}}$ only grows sub-linearly. The second term $\Delta_{\text{pop}}$ can be interpreted as the population convergence error, which is determined by how fast the agents converge to the target policy we steer them to. Following similar techniques to those in the proof of Thm. 4.3, $\Delta_{\text{pop}}$ can be upper bounded by:

	$\displaystyle\sqrt{HSAT\underbrace{\sum_{t=1}^{T}\langle R_{\pi_{*}^{k(t)}}(\bar{\mu}^{t}_{M^{*}}),\mu_{M^{*}}^{\pi_{*}^{k(t)}}-\bar{\mu}^{t}_{M^{*}}\rangle}_{\texttt{AgentReg}}}.$		(9)

Here we use AgentReg to refer to the summation term, which can be interpreted as the agents’ dynamic regret when the densities $\mu_{M^{*}}^{\pi_{*}^{k(t)}}$ are chosen as the comparators. Thanks to the low policy switching, AgentReg can be controlled by $O(K\operatorname*{AdaReg}(T))$ , and the only remaining step is to control $K$ . Note that we only switch policy when the visitation count of some state-action pair has doubled; therefore, $K$ only grows as $O(\log(T))$ .
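The logarithmic growth of $K$ can be checked numerically with a small simulation of the doubling rule. This is a sketch under assumptions: trajectories are drawn uniformly at random rather than generated by learning agents, and the trigger uses $n_{k}\geq\max\{1,N_{k}\}$ (a UCRL2-style convention we adopt here so that unvisited triples do not trigger a switch); the function name is hypothetical.

```python
import math
import random

def count_policy_switches(trajectories, H, S, A):
    """Count episodes under a doubling rule.

    trajectories: iterable of length-H lists of (s, a) pairs (0-based).
    A switch happens after a step in which n_k(h,s,a) >= max(1, N_k(h,s,a))
    for some visited triple (assumed variant of the condition in Alg. 2).
    """
    N = [[[0] * A for _ in range(S)] for _ in range(H)]  # counts before episode k
    n = [[[0] * A for _ in range(S)] for _ in range(H)]  # counts inside episode k
    switches = 0
    for traj in trajectories:
        triggered = False
        for h, (s, a) in enumerate(traj):
            n[h][s][a] += 1
            if n[h][s][a] >= max(1, N[h][s][a]):
                triggered = True
        if triggered:
            switches += 1
            # Start a new episode: fold the episode counts into the totals.
            for h in range(H):
                for s in range(S):
                    for a in range(A):
                        N[h][s][a] += n[h][s][a]
                        n[h][s][a] = 0
    return switches

random.seed(0)
H, S, A, T = 3, 4, 2, 5000
trajs = [[(random.randrange(S), random.randrange(A)) for _ in range(H)]
         for _ in range(T)]
K = count_policy_switches(trajs, H, S, A)
print(K, "switches; doubling bound:", H * S * A * (1 + math.log2(T)))
```

Under this variant, each triple can trigger at most $1+\log_{2}T$ switches because its cumulative count at least doubles every time it triggers after the first, which gives $K\leq HSA(1+\log_{2}T)$ in the simulation.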

For the steering cost, we can calculate that

	$\displaystyle C(\bar{\mu}^{t}_{M^{*}},R^{t}_{\text{z}})\leq 2H\|R_{\pi_{*}^{k(t)}}(\bar{\mu}^{t}_{M^{*}})\|_{\infty},$

and for any $\pi,\mu$ , $\|R_{\pi}(\mu)\|_{\infty}\leq 2\|(W^{\pi}-I)\mu\|_{2}$ which, by Eq. (6), is equal to $2\sqrt{\langle R_{\pi}(\mu),\mu^{\pi}-\mu\rangle}$ . Using Jensen’s inequality and Assump. B, we derive the final bound.
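The chain of inequalities behind the steering cost bound can be written out as follows. This is a sketch under two assumptions that are not restated in this section: that $C_{T}$ is the sum of the per-step costs, and that AgentReg is bounded by $K\operatorname{AdaReg}(T)$ with constant one, as the constant in Thm. 5.1 suggests.

```latex
\begin{align*}
C_{T}
  &= \sum_{t=1}^{T} C(\bar{\mu}^{t}_{M^{*}},R^{t}_{\text{z}})
   \leq 2H\sum_{t=1}^{T}\|R_{\pi_{*}^{k(t)}}(\bar{\mu}^{t}_{M^{*}})\|_{\infty} \\
  &\leq 4H\sum_{t=1}^{T}
     \sqrt{\langle R_{\pi_{*}^{k(t)}}(\bar{\mu}^{t}_{M^{*}}),
       \mu_{M^{*}}^{\pi_{*}^{k(t)}}-\bar{\mu}^{t}_{M^{*}}\rangle}
     && \text{(Eq. (6))} \\
  &\leq 4H\sqrt{T\sum_{t=1}^{T}
     \langle R_{\pi_{*}^{k(t)}}(\bar{\mu}^{t}_{M^{*}}),
       \mu_{M^{*}}^{\pi_{*}^{k(t)}}-\bar{\mu}^{t}_{M^{*}}\rangle}
     && \text{(Jensen)} \\
  &= 4H\sqrt{T\cdot\texttt{AgentReg}}
   \leq 4H\sqrt{TK\operatorname{AdaReg}(T)}.
     && \text{(Assump. B)}
\end{align*}
```

The Jensen step uses the concavity of the square root, $\frac{1}{T}\sum_{t}\sqrt{x_{t}}\leq\sqrt{\frac{1}{T}\sum_{t}x_{t}}$ for $x_{t}\geq 0$ .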

6 STEERING WITH NON-ZERO INTRINSIC REWARD

Next, we turn to Scenario 2 in Sec. 3.4, the complete setting where the agents’ pre-existing reward function $r^{*}\in[0,r_{\max}]$ is both non-zero and unknown. The non-zero intrinsic reward introduces non-trivial additional challenges. Firstly, it changes the steering landscape and introduces some prior bias for our steering reward design. Secondly, since it is unknown, we must account for its interference with the steering dynamics and undertake strategic exploration to estimate $r^{*}$ . In the following, we explain how we overcome these challenges with a pessimism-based reward estimation strategy.

Confidence set for $r^{*}$ We recall our setup in Sec. 3.1: the mediator can observe the population density $\bar{\mu}^{t}$ and noisy reward $r^{t}=(r^{*}_{h}(s_{h}^{t},a_{h}^{t},\bar{\mu}^{t}_{M^{*},h})+\xi_{h})_{h\in[H]}$ perturbed by i.i.d. zero-mean $\sigma$ -sub-Gaussian noise $\xi$ . We will use this information to estimate the original reward. At each iteration $t$ , we maintain a confidence set $\hat{\mathcal{R}}^{t}$ for $r^{*}$ , defined by:

	$\displaystyle\hat{\mathcal{R}}^{t}:=$	$\displaystyle\left\{\hat{r}\in\mathcal{R}:\|\hat{r}-\bar{r}^{t}\|_{2,E_{t}}\leq\sqrt{\beta_{t}}\right\},$
	$\displaystyle\bar{r}^{t}:=$	$\displaystyle\operatorname*{arg\,min}_{\hat{r}\in\mathcal{R}}\sum_{i=1}^{t-1}\sum_{h=1}^{H}\left(\hat{r}_{h}(s^{i}_{h},a^{i}_{h},\bar{\mu}^{i}_{M^{*},h})-r_{h}^{i}\right)^{2},$		(10)

where $\|g\|_{2,E_{t}}^{2}:=\sum_{i=1}^{t-1}\sum_{h=1}^{H}(g_{h}(s^{i}_{h},a^{i}_{h},\bar{\mu}^{i}_{M^{*},h}))^{2}$ is shorthand defined for any function $g$ . We use $\beta_{t}$ to denote the confidence interval length, chosen to ensure that $r^{*}$ is contained in the confidence set at all times with high probability. We defer the detailed choice of $\beta_{t}$ to Lem. I.1. Informally, $\beta_{t}=O(\sigma^{2}\log N(\mathcal{R},\frac{1}{T}))$ grows with $\log T$ , where $N(\mathcal{R},\varepsilon)$ is the $\varepsilon$ -covering number of $\mathcal{R}$ .
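To illustrate Eq. (10), the following sketch computes the least-squares center and the confidence set when the reward class is a finite list of candidate functions. The finite class, the function name, and the flat encoding of each observation $(h,s,a,\bar{\mu}_{h})$ as a single input `x` are assumptions made for the sketch.

```python
import numpy as np

def fit_reward_confidence_set(candidates, inputs, observed, beta):
    """Least-squares center and confidence set over a finite reward class.

    candidates: list of callables r(x) -> float (hypothetical finite class R).
    inputs:     list of past inputs x_i, one per observed (i, h) pair.
    observed:   list of noisy rewards aligned with `inputs`.
    beta:       squared radius beta_t of the confidence set.
    Returns (index of r_bar, indices of the members of the confidence set).
    """
    preds = np.array([[r(x) for x in inputs] for r in candidates])  # (|R|, n)
    losses = ((preds - np.asarray(observed)) ** 2).sum(axis=1)
    center = int(np.argmin(losses))
    # Squared empirical distance ||r - r_bar||_{2,E_t}^2 to the least-squares fit.
    dist2 = ((preds - preds[center]) ** 2).sum(axis=1)
    members = [i for i in range(len(candidates)) if dist2[i] <= beta]
    return center, members
```

The distance is measured only on inputs that were actually observed, so candidates that agree with $\bar{r}^{t}$ on the data remain in the set even if they differ elsewhere.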

Steering Reward Design with Pessimism We consider the following steering reward design

	$\displaystyle\forall\mu\in\Delta_{\mathcal{S}\times\mathcal{A}}^{H},~$	$\displaystyle R^{t}_{\text{nz}}(\mu):=R_{\pi_{*}^{k(t)}}(\mu)-(\bar{r}^{t}(\mu)-w_{\hat{\mathcal{R}}^{t}}(\mu))$
		$\displaystyle+(r_{\max}+\|R_{\pi_{*}^{k(t)}}(\mu)\|_{\infty})\bm{1}$		(11)

Here $\pi_{*}^{k(t)}$ is computed in the same way as in Alg. 2; $\bar{r}^{t}\in\hat{\mathcal{R}}^{t}$ (defined in Eq. (10)) is the reward estimate achieving the minimal empirical loss; $w_{\hat{\mathcal{R}}^{t}}(\mu)$ is a vector with elements $(w_{\hat{\mathcal{R}}^{t}}(\mu))_{h,s,a}:=\sup_{r,\tilde{r}\in\hat{\mathcal{R}}^{t}}\left|r_{h}(s,a,\mu)-\tilde{r}_{h}(s,a,\mu)\right|$ , which quantifies the estimation uncertainty for each state-action pair; the last constant shift term ensures non-negativity.

As we can see, the main difference compared with the steering reward $R^{t}_{\text{z}}$ in Alg. 2 is that we include an additional reward estimation term to offset the effect of the non-zero original reward $r^{*}$ . In this way, the agents will follow the guidance of $R_{\pi_{*}^{k(t)}}$ to explore as we want. Note that here we conduct a pessimism-based reward estimation such that $\bar{r}^{t}-w_{\hat{\mathcal{R}}^{t}}\leq r^{*}$ for a technical reason, which we will explain later.
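The construction in Eq. (11) can be sketched numerically for a fixed density $\mu$ . We assume here that the confidence set is finite and given as a stack of reward vectors evaluated at $\mu$ , so that the supremum in the width reduces to an elementwise max minus min; the function names are hypothetical.

```python
import numpy as np

def width(member_rewards):
    """Elementwise sup_{r, r'} |r - r'| over a finite confidence set.

    member_rewards: array (m, H, S, A) of candidate reward vectors at a fixed mu.
    """
    return member_rewards.max(axis=0) - member_rewards.min(axis=0)

def steering_reward_nz(R_target, r_bar, member_rewards, r_max):
    """Assemble R_nz = R_pi - (r_bar - w) + (r_max + ||R_pi||_inf) * 1."""
    w = width(member_rewards)
    pessimistic = r_bar - w                 # lower estimate of r^* on the set
    shift = r_max + np.abs(R_target).max()  # constant shift for non-negativity
    return R_target - pessimistic + shift
```

Two properties are visible in this form: the result is elementwise non-negative because $\bar{r}^{t}\leq r_{\max}$ and $w\geq 0$ , and $\bar{r}^{t}-w$ lies below every member of the set, hence below $r^{*}$ whenever $r^{*}$ belongs to it.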

Steering Algorithm Design The algorithm design for the non-zero intrinsic reward setting only differs from Alg. 2 in the additional update of $\hat{\mathcal{R}}^{t}$ as in Eq. (10) and in choosing Eq. (11) as the steering reward $R_{\text{nz}}^{t}$ . For completeness, we defer the detailed algorithm to Alg. 4 in Appx. I.1. We have the following guarantees for the steering gap and steering cost.

Theorem 6.1.

Under Assump. A, B and C, if we run Alg. 4 with $0<\delta<1$ , then with probability at least $1-6\delta$ , $K\leq HSA\log_{2}T$ , and

	$\displaystyle\Delta_{T}(\{\bar{\mu}^{t}_{M^{*}}\}_{t=1}^{T})\leq\,$	$\displaystyle L_{U}\sqrt{H^{3}SAT(K\operatorname*{AdaReg}(T)+D)}$
		$\displaystyle+36L_{U}H^{3}S\sqrt{AT\ln(THSA/\delta)},$
	$\displaystyle C_{T}(\{\bar{\mu}^{t}_{M^{*}},R_{\text{nz}}^{t}-(r_{\max}\cdot\bm{1}-r^{*})\}_{t=1}^{T})$
	$\displaystyle\leq 4H\sqrt{T(K\operatorname*{AdaReg}(T)+D)}+D,$

where $D=\tilde{O}(\sqrt{\beta_{T}H\dim_{E}(\mathcal{R},T^{-1})T})$ .

Comparing with Theorem 5.1, we find that both the steering gap and cost differ only in the additional term $D$ , which results from the estimation error of $r^{*}$ . The term $D$ depends on the Eluder dimension of $\mathcal{R}$ and $\beta_{T}$ . In Appx. H.1, we show several common function classes with $\dim_{E}(\mathcal{R},T^{-1})\in\tilde{\mathcal{O}}(1)$ , for which, by choosing $\beta_{T}$ appropriately, we have $D\in\tilde{\mathcal{O}}(\sqrt{T})$ . As a result, both the steering gap and cost upper bounds in Thm. 6.1 are sub-linear in $T$ .

Proof Sketch Similar to the proof for Thm. 5.1, we can decompose the steering gap as Eq. (8), and upper bound model estimation error term $\Delta_{\text{est}}$ in the same way. The proof diverges when we upper bound AgentReg in Eq. (9), because the agents’ no-regret behavior holds for $r^{*}+R_{\text{nz}}^{t}$ in this setting. We can write

	AgentReg	$\displaystyle=\sum_{t=1}^{T}\langle R_{\text{nz}}^{t}(\bar{\mu}^{t}_{M^{*}})+r^{*}(\bar{\mu}^{t}_{M^{*}})-r^{*}(\bar{\mu}^{t}_{M^{*}})$
		$\displaystyle\quad+\bar{r}^{t}(\bar{\mu}^{t}_{M^{*}})-w_{\hat{\mathcal{R}}^{t}}(\bar{\mu}^{t}_{M^{*}}),\mu_{M^{*}}^{\pi_{*}^{k(t)}}-\bar{\mu}^{t}_{M^{*}}\rangle.$

Using pessimism, i.e., $r^{*}\geq\bar{r}^{t}-w_{\hat{\mathcal{R}}^{t}}$ , we can bound this by

	$\displaystyle\sum_{t=1}^{T}\langle R_{\text{nz}}^{t}(\bar{\mu}^{t}_{M^{*}})+r^{*}(\bar{\mu}^{t}_{M^{*}}),\mu_{M^{*}}^{\pi_{*}^{k(t)}}-\bar{\mu}^{t}_{M^{*}}\rangle$
	$\displaystyle\quad+\sum_{t=1}^{T}\langle r^{*}(\bar{\mu}^{t}_{M^{*}})-\bar{r}^{t}(\bar{\mu}^{t}_{M^{*}})+w_{\hat{\mathcal{R}}^{t}}(\bar{\mu}^{t}_{M^{*}}),\bar{\mu}^{t}_{M^{*}}\rangle.$

Clearly, the first term above is just the agents’ dynamic regret with respect to the total reward they receive, and it can again be bounded by $K\operatorname*{AdaReg}(T)$ . The second term can be further controlled by $\mathcal{O}(\sum_{t}\langle w_{\hat{\mathcal{R}}^{t}}(\bar{\mu}^{t}),\bar{\mu}^{t}\rangle)$ , which is essentially the cumulative confidence interval length for reward estimation. Its growth can be controlled by the Eluder dimension (Lem. H.5) and is only sub-linear in $T$ .
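The role of pessimism in the step above can be spelled out as follows. This is a sketch on the high-probability event $r^{*}\in\hat{\mathcal{R}}^{t}$ ; the shorthand $g^{t}$ is introduced here for illustration only, and all quantities are evaluated at $\bar{\mu}^{t}_{M^{*}}$ .

```latex
% Shorthand (ours): g^t := r^* - \bar{r}^t + w_{\hat{\mathcal{R}}^t} \ge 0
% elementwise, by pessimism.
\begin{align*}
-\langle g^{t},\mu_{M^{*}}^{\pi_{*}^{k(t)}}-\bar{\mu}^{t}_{M^{*}}\rangle
  &= -\underbrace{\langle g^{t},\mu_{M^{*}}^{\pi_{*}^{k(t)}}\rangle}_{\geq 0}
     +\langle g^{t},\bar{\mu}^{t}_{M^{*}}\rangle
   \leq \langle g^{t},\bar{\mu}^{t}_{M^{*}}\rangle, \\
\langle g^{t},\bar{\mu}^{t}_{M^{*}}\rangle
  &\leq 2\langle w_{\hat{\mathcal{R}}^{t}},\bar{\mu}^{t}_{M^{*}}\rangle,
  \quad\text{since } |r^{*}-\bar{r}^{t}|\leq w_{\hat{\mathcal{R}}^{t}}
  \text{ when } r^{*},\bar{r}^{t}\in\hat{\mathcal{R}}^{t}.
\end{align*}
```

The first line is where the sign condition $g^{t}\geq 0$ is needed: without pessimism, the inner product with the comparator density could not be dropped.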

For the steering cost, we can provide an upper bound involving AgentReg and reward estimation error that we analyzed before. To save space, we do not repeat it here and refer the reader to Appx. I for the full proof.

Remark 6.1.

Our strategy to deal with the intrinsic reward $r^{*}$ is to try to “cancel” it with our steering reward. This approach is justified by the fact that we keep $r^{*}$ and $U$ very general, which means that the target density maximizing $U$ may not coincide with an equilibrium associated with the original reward $r^{*}$ . Therefore, to ensure the target density is still a stationary point for no-regret learners, we treat $r^{*}$ as a competing force to offset. We admit that there might be other options to counteract the impact of $r^{*}$ with lower steering costs, and we leave further investigation to future work.

Remark 6.2 (Generalization to Unknown Utility Setting).

Although this paper focuses on the case where $U$ is revealed to the mediator, it is possible to generalize our results to the case where the utility function $U$ is unknown but lies in a known function class $\mathcal{U}$ with bounded Eluder dimension. In Appx. J, we formalize this setting and present a solution based on a simple modification of the current methods. Our established regret bounds for the steering gap and steering cost grow at a rate of $\tilde{\mathcal{O}}(T^{5/6})$ . Although these results are worse than the rate of $\tilde{\mathcal{O}}(T^{3/4})$ in Thm. 6.1 due to the challenges in exploring the utility function, they are still sub-linear in $T$ .

7 CONCLUSION

We study a novel problem setting for incentive design in unknown mean-field games with no-regret agents. Our optimistic algorithm introduces newly developed steering reward designs, achieving sublinear utility regret and steering costs when the intrinsic reward is zero. Extending to the setting with a non-zero and unknown intrinsic reward function, we adapt our algorithm to handle this new challenge, maintaining sublinear utility regret and vanishing steering costs relative to a baseline strategy. Future work could explore the more challenging case where the transition function also depends on the population density. Another interesting direction is to identify better or even optimal steering reward designs that stabilize the target policy, and to design an algorithm with sub-linear guarantees compared with that benchmark.

Acknowledgements

This work is supported by the Swiss National Science Foundation (SNSF) Project Funding No. 200021-207343 and an SNSF Starting Grant.

References

Achdou and Lasry, (2019) Achdou, Y. and Lasry, J.-M. (2019). Mean Field Games for Modeling Crowd Motion. In Chetverushkin, B. N., Fitzgibbon, W., Kuznetsov, Y., Neittaanmäki, P., Periaux, J., and Pironneau, O., editors, Contributions to Partial Differential Equations and Applications, pages 17–42. Springer International Publishing, Cham.
Baumann et al., (2020) Baumann, T., Graepel, T., and Shawe-Taylor, J. (2020). Adaptive mechanism design: Learning to promote cooperation. In 2020 International Joint Conference on Neural Networks (IJCNN), pages 1–7. IEEE.
Brown et al., (2024) Brown, W., Schneider, J., and Vodrahalli, K. (2024). Is learning in games good for the learners? Advances in Neural Information Processing Systems, 36.
Cabannes et al., (2021) Cabannes, T., Lauriere, M., Perolat, J., Marinier, R., Girgin, S., Perrin, S., Pietquin, O., Bayen, A. M., Goubault, E., and Elie, R. (2021). Solving N-player dynamic routing games with congestion: a mean field approach. arXiv:2110.11943 [cs, eess, math].
Camara et al., (2020) Camara, M., Hartline, J., and Johnsen, A. (2020). Mechanisms for a No-Regret Agent: Beyond the Common Prior. arXiv:2009.05518 [cs, econ].
Canyakmaz et al., (2024) Canyakmaz, I., Sakos, I., Lin, W., Varvitsiotis, A., and Piliouras, G. (2024). Steering game dynamics towards desired outcomes. arXiv:2404.01066 [cs, eess].
Carmona and Wang, (2021) Carmona, R. and Wang, P. (2021). Finite-state contract theory with a principal and a field of agents. Management Science, 67(8):4725–4741.
Castiglioni et al., (2023) Castiglioni, M., Marchesi, A., and Gatti, N. (2023). Multi-agent contract design: How to commission multiple agents with individual outcomes. In Proceedings of the 24th ACM Conference on Economics and Computation, pages 412–448.
Chen and Cheng, (2010) Chen, B. and Cheng, H. H. (2010). A review of the applications of agent technology in traffic and transportation systems. IEEE Transactions on Intelligent Transportation Systems, 11(2):485–497.
Curry et al., (2024) Curry, M., Thoma, V., Chakrabarti, D., McAleer, S., Kroer, C., Sandholm, T., He, N., and Seuken, S. (2024). Automated design of affine maximizer mechanisms in dynamic settings. Proceedings of the AAAI Conference on Artificial Intelligence, 38(9):9626–9635.
DellaVigna and Malmendier, (2004) DellaVigna, S. and Malmendier, U. (2004). Contract design and self-control: Theory and evidence. The Quarterly Journal of Economics, 119(2):353–402.
Deng et al., (2019) Deng, Y., Schneider, J., and Sivan, B. (2019). Strategizing against No-regret Learners. arXiv:1909.13861 [cs].
Dinneweth et al., (2022) Dinneweth, J., Boubezoul, A., Mandiau, R., and Espié, S. (2022). Multi-agent reinforcement learning for autonomous vehicles: a survey. Autonomous Intelligent Systems, 2(1):27.
Dütting et al., (2023) Dütting, P., Ezra, T., Feldman, M., and Kesselheim, T. (2023). Multi-agent contracts. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing, pages 1311–1324.
Ehtamo et al., (2002) Ehtamo, H., Kitti, M., and Hämäläinen, R. P. (2002). Recent studies on incentive design problems in game theory and management science. In Optimal Control and Differential Games: Essays in Honor of Steffen Jørgensen, pages 121–134. Springer.
Elie et al., (2019) Elie, R., Mastrolia, T., and Possamaï, D. (2019). A tale of a principal and many, many agents. Mathematics of Operations Research, 44(2):440–467.
Freund and Schapire, (1997) Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139.
Fu and Horst, (2018) Fu, G. and Horst, U. (2018). Mean-field leader-follower games with terminal state constraint.
Ge et al., (2024) Ge, J., Wang, Y., Li, W., and Jin, C. (2024). Towards principled superhuman AI for multiplayer symmetric games.
Gomes et al., (2014) Gomes, D. A., Velho, R. M., and Wolfram, M.-T. (2014). Socio-economic applications of finite state mean field games. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 372(2028):20130405. arXiv:1403.4217 [math].
Guo et al., (2021) Guo, X., Hu, A., Xu, R., and Zhang, J. (2021). Learning Mean-Field Games. arXiv:1901.09585 [math].
Guo et al., (2023a) Guo, X., Li, L., Nabi, S., Salhab, R., and Zhang, J. (2023a). MESOB: Balancing Equilibria & Social Optimality. arXiv:2307.07911 [cs, math].
Guo et al., (2023b) Guo, X., Li, L., Nabi, S., Salhab, R., and Zhang, J. (2023b). MESOB: Balancing Equilibria & Social Optimality.
Hazan, (2023) Hazan, E. (2023). Introduction to Online Convex Optimization. arXiv:1909.05207 [cs, math, stat].
Hazan and Seshadhri, (2007) Hazan, E. and Seshadhri, C. (2007). Adaptive algorithms for online decision problems. Electronic Colloquium on Computational Complexity (ECCC), 14.
Ho et al., (2014) Ho, C.-J., Slivkins, A., and Vaughan, J. W. (2014). Adaptive contract design for crowdsourcing markets: Bandit algorithms for repeated principal-agent problems. In Proceedings of the fifteenth ACM conference on Economics and computation, pages 359–376.
Holmström, (1979) Holmström, B. (1979). Moral hazard and observability. The Bell journal of economics, pages 74–91.
Hu and Zhang, (2024) Hu, A. and Zhang, J. (2024). MF-OML: Online Mean-Field Reinforcement Learning with Occupation Measures for Large Population Games. arXiv:2405.00282 [cs, math].
Huang et al., (2024a) Huang, J., He, N., and Krause, A. (2024a). Model-Based RL for Mean-Field Games is not Statistically Harder than Single-Agent RL. arXiv:2402.05724 [cs, stat].
Huang et al., (2024b) Huang, J., Thoma, V., Shen, Z., Nax, H. H., and He, N. (2024b). Learning to Steer Markovian Agents under Model Uncertainty. arXiv:2407.10207 [cs, stat].
Huang et al., (2023) Huang, J., Yardim, B., and He, N. (2023). On the Statistical Efficiency of Mean Field Reinforcement Learning with General Function Approximation. arXiv:2305.11283 [cs, stat].
Huang et al., (2006) Huang, M., Malhamé, R. P., and Caines, P. E. (2006). Large population stochastic dynamic games: closed-loop McKean-Vlasov systems and the Nash certainty equivalence principle.
Innes, (1990) Innes, R. D. (1990). Limited liability and incentive contracting with ex-ante action choices. Journal of economic theory, 52(1):45–67.
Iyer et al., (2014) Iyer, K., Johari, R., and Sundararajan, M. (2014). Mean field equilibria of dynamic auctions with learning. Management Science, 60(12):2949–2970.
Jaksch et al., (2010) Jaksch, T., Ortner, R., and Auer, P. (2010). Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(51):1563–1600.
Lasry and Lions, (2007) Lasry, J.-M. and Lions, P.-L. (2007). Mean field games. Japanese journal of mathematics, 2(1):229–260.
Laurière et al., (2024) Laurière, M., Perrin, S., Pérolat, J., Girgin, S., Muller, P., Élie, R., Geist, M., and Pietquin, O. (2024). Learning in Mean Field Games: A Survey. arXiv:2205.12944 [cs, math].
Liu et al., (2022) Liu, B., Li, J., Yang, Z., Wai, H.-T., Hong, M., Nie, Y. M., and Wang, Z. (2022). Inducing Equilibria via Incentives: Simultaneous Design-and-Play Ensures Global Convergence. arXiv:2110.01212 [cs].
Luo et al., (1996) Luo, Z.-Q., Pang, J.-S., and Ralph, D. (1996). Mathematical Programs with Equilibrium Constraints. Cambridge University Press.
Osband and Roy, (2014) Osband, I. and Roy, B. V. (2014). Model-based reinforcement learning and the eluder dimension.
Perolat et al., (2021) Perolat, J., Perrin, S., Elie, R., Laurière, M., Piliouras, G., Geist, M., Tuyls, K., and Pietquin, O. (2021). Scaling up Mean Field Games with Online Mirror Descent. arXiv:2103.00623 [cs].
Ratliff et al., (2019) Ratliff, L. J., Dong, R., Sekar, S., and Fiez, T. (2019). A perspective on incentive design: Challenges and opportunities. Annual Review of Control, Robotics, and Autonomous Systems, 2(1):305–338.
Rosenberg and Mansour, (2019) Rosenberg, A. and Mansour, Y. (2019). Online convex optimization in adversarial Markov decision processes.
Roughgarden and Tardos, (2007) Roughgarden, T. and Tardos, É. (2007). Introduction to the inefficiency of equilibria. In Nisan, N., Roughgarden, T., Tardos, E., and Vazirani, V. V., editors, Algorithmic Game Theory, pages 443–460. Cambridge University Press, Cambridge.
Russo and Van Roy, (2013) Russo, D. and Van Roy, B. (2013). Eluder dimension and the sample complexity of optimistic exploration. In Burges, C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K., editors, Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc.
Sanjari et al., (2024) Sanjari, S., Bose, S., and Başar, T. (2024). Incentive Designs for Stackelberg Games with a Large Number of Followers and their Mean-Field Limits. arXiv:2207.10611 [cs].
Scheid et al., (2024) Scheid, A., Tiapkin, D., Boursier, E., Capitaine, A., Mhamdi, E. M. E., Moulines, É., Jordan, M. I., and Durmus, A. (2024). Incentivized learning in principal-agent bandit games. arXiv preprint arXiv:2403.03811.
Steinbacher et al., (2021) Steinbacher, M., Raddant, M., Karimi, F., Camacho Cuena, E., Alfarano, S., Iori, G., and Lux, T. (2021). Advances in the agent-based modeling of economic and social behavior. SN Business & Economics, 1(7):99.
Subramanian et al., (2022) Subramanian, S. G., Taylor, M. E., Crowley, M., and Poupart, P. (2022). Decentralized mean field games. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 9439–9447.
Wang et al., (2022) Wang, K., Xu, L., Perrault, A., Reiter, M. K., and Tambe, M. (2022). Coordinating followers to reach better equilibria: End-to-end gradient descent for Stackelberg games. Proceedings of the AAAI Conference on Artificial Intelligence, 36(5):5219–5227.
Weissman et al., (2003) Weissman, T., Ordentlich, E., Seroussi, G., Verdú, S., and Weinberger, M. J. (2003). Inequalities for the l1 deviation of the empirical distribution.
Yang et al., (2022) Yang, J., Wang, E., Trivedi, R., Zhao, T., and Zha, H. (2022). Adaptive incentive design with multi-agent meta-gradient reinforcement learning. In Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems, AAMAS ’22, pages 1436–1445, Richland, SC. International Foundation for Autonomous Agents and Multiagent Systems.
Yardim et al., (2022) Yardim, B., Cayci, S., Geist, M., and He, N. (2022). Policy Mirror Ascent for Efficient and Independent Learning in Mean Field Games.
Zhang et al., (2024) Zhang, B. H., Farina, G., Anagnostides, I., Cacciamani, F., McAleer, S. M., Haupt, A. A., Celli, A., Gatti, N., Conitzer, V., and Sandholm, T. (2024). Steering No-Regret Learners to a Desired Equilibrium. arXiv:2306.05221 [cs].
Zhu et al., (2022) Zhu, B., Bates, S., Yang, Z., Wang, Y., Jiao, J., and Jordan, M. I. (2022). The sample complexity of online contract design. arXiv preprint arXiv:2211.05732.

Checklist

1. For all models and algorithms presented, check if you include:
  (a) A clear description of the mathematical setting, assumptions, algorithm, and/or model. [Yes]
  (b) An analysis of the properties and complexity (time, space, sample size) of any algorithm. [Yes]
  (c) (Optional) Anonymized source code, with specification of all dependencies, including external libraries. [Not Applicable]
2. For any theoretical claim, check if you include:
  (a) Statements of the full set of assumptions of all theoretical results. [Yes]
  (b) Complete proofs of all theoretical results. [Yes]
  (c) Clear explanations of any assumptions. [Yes]
3. For all figures and tables that present empirical results, check if you include:
  (a) The code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL). [Not Applicable]
  (b) All the training details (e.g., data splits, hyperparameters, how they were chosen). [Not Applicable]
  (c) A clear definition of the specific measure or statistics and error bars (e.g., with respect to the random seed after running experiments multiple times). [Not Applicable]
  (d) A description of the computing infrastructure used (e.g., type of GPUs, internal cluster, or cloud provider). [Not Applicable]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets, check if you include:
  (a) Citations of the creator if your work uses existing assets. [Not Applicable]
  (b) The license information of the assets, if applicable. [Not Applicable]
  (c) New assets either in the supplemental material or as a URL, if applicable. [Not Applicable]
  (d) Information about consent from data providers/curators. [Not Applicable]
  (e) Discussion of sensible content if applicable, e.g., personally identifiable information or offensive content. [Not Applicable]
5. If you used crowdsourcing or conducted research with human subjects, check if you include:
  (a) The full text of instructions given to participants and screenshots. [Not Applicable]
  (b) Descriptions of potential participant risks, with links to Institutional Review Board (IRB) approvals if applicable. [Not Applicable]
  (c) The estimated hourly wage paid to participants and the total amount spent on participant compensation. [Not Applicable]

Appendix A TABLE OF FREQUENTLY USED NOTATIONS

Notation	Description
$[n]$	$\{1,2,...,n\}$ for any $n\in\mathbb{N}$
$\Delta_{\mathcal{X}}$	Set of probability distributions over a finite set $\mathcal{X}$
$\mathbb{I}\{\mathcal{E}\}$	Indicator function for the event $\mathcal{E}$
$\bm{1}$	All-one vector
$\mathbf{e}_{i}$	The $i$ -th standard-basis vector
$M=(N,\mathcal{S},\mathcal{A},H,\mathbb{P}_{M},r_{M},\mu_{1})$	The model / game
$N$	Number of agents
$\mathcal{S},\mathcal{A}$	State and action space
$H$	Horizon length of the game
$\mu_{1}$	Initial state distribution
$\{\mathbb{P}_{M,h}:\mathcal{S}\times\mathcal{A}\to\Delta_{\mathcal{S}}\}_{h\in[H]}$	Transition function
$\{r_{M,h}:\mathcal{S}\times\mathcal{A}\times\Delta_{\mathcal{S}\times\mathcal{A}}\to[0,r_{\max}]\}_{h\in[H]}$	Reward function
$\{R_{h}:\mathcal{S}\times\mathcal{A}\times\Delta_{\mathcal{S}\times\mathcal{A}}\to\mathbb{R}\}_{h\in[H]}$	Steering reward function (capitalized)
$r:\Delta_{\mathcal{S}\times\mathcal{A}}^{H}\to\mathbb{R}^{HSA}$	Vectorized reward function $(r(\mu))_{h,s,a}=r_{h}(s,a,\mu_{h})$
$\{\pi_{h}:\mathcal{S}\to\Delta_{\mathcal{A}}\}_{h\in[H]}$	Markov policy
$\Pi$	Set of all policies
$\mu_{M}^{\pi}$	State-action density of policy $\pi$ in model $M$
$\Psi_{M}$	Set of possible state-action densities in model $M$
$\operatorname*{AdaReg}(T)$	Adaptive regret bound after $T$ games
$U:\Delta_{\mathcal{S}\times\mathcal{A}}^{H}\to\mathbb{R}$	Utility function
$C(\bar{\mu}^{t},R^{t})=\langle R^{t}(\bar{\mu}^{t}),\bar{\mu}^{t}\rangle$	Steering cost function
$R_{\pi}$	Reward function which incentivizes policy $\pi$
$M^{*},r^{*},\mathbb{P}^{*}$	True model, intrinsic reward, transition function
$R_{\text{z}}$	Steering reward for the setting where $r^{*}=0$ (the subscript “z” is short for “zero”)
$R_{\text{nz}}$	Steering reward for the setting where $r^{*}\in\mathcal{R}$ (the subscript “nz” is short for “non-zero”)
$\dim_{E}(\mathcal{F},\varepsilon)$	Eluder dimension of function class $\mathcal{F}$
$\bar{\mu}$	Population density $\bar{\mu}:=\frac{1}{N}\sum_{n}\mu^{\pi^{n}}$
$\bar{\pi}$	Population average policy induced by $\bar{\mu}$
$\mathcal{O},\tilde{\mathcal{O}}$	Standard big-O notations
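To make the last few entries of the table concrete, the sketch below computes the population density, a population average policy, and the steering cost for tabular arrays. The formula used for the induced policy, $\bar{\pi}_{h}(a|s)=\bar{\mu}_{h}(s,a)/\sum_{a^{\prime}}\bar{\mu}_{h}(s,a^{\prime})$ with a uniform fallback on unvisited states, is our assumption, since the definition in Sec. 2 is not restated here; the function names are hypothetical.

```python
import numpy as np

def population_density(densities):
    """Average of the per-agent densities; input shape (N, H, S, A)."""
    return np.mean(densities, axis=0)

def induced_policy(mu_bar):
    """Policy induced by mu_bar (assumed form: normalize over actions).

    States with zero mass get a uniform action distribution (our convention).
    """
    H, S, A = mu_bar.shape
    state_mass = mu_bar.sum(axis=-1, keepdims=True)
    uniform = np.full(mu_bar.shape, 1.0 / A)
    safe = np.where(state_mass > 0, state_mass, 1.0)
    return np.where(state_mass > 0, mu_bar / safe, uniform)

def steering_cost(R_vec, mu_bar):
    """C(mu_bar, R) = <R(mu_bar), mu_bar>, with R(mu_bar) given as an array."""
    return float(np.sum(R_vec * mu_bar))
```

With a steering reward bounded in $[0,c]$ , the cost of one step is at most $cH$ because each of the $H$ layers of $\bar{\mu}$ sums to one.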

Appendix B SUMMARY OF MAIN RESULTS

In the following, we summarize the main theorems in this paper under Assump. A, B and C. We study the steering gaps and costs of four settings. The settings are categorized depending on whether $M^{*}$ (or $\pi^{*}:=\operatorname*{arg\,max}_{\pi\in\Pi}U(\mu_{M^{*}}^{\pi})$ ) is known or not, and whether the intrinsic reward function $r^{*}$ is zero or non-zero and unknown.

Setting	$r^{*}=0$ ?	Steering Gap	Steering Cost	Thm.
Known $M^{*}$	✓	$\mathcal{O}(L_{U}\sqrt{HSAT\operatorname*{AdaReg}(T)})$	$\mathcal{O}(H\sqrt{T\operatorname*{AdaReg}(T)})$	4.1
Unknown $M^{*}$ (known $\pi^{*}$ )	✓	$\mathcal{O}(L_{U}\sqrt{H^{3}SAT\operatorname*{AdaReg}(T)})$	$\mathcal{O}(H\sqrt{T\operatorname*{AdaReg}(T)})$	4.3
Unknown $M^{*}$	✓	$\begin{aligned} {\mathcal{O}}(&L_{U}\sqrt{H^{3}SATK\operatorname*{AdaReg}(T)}% \\ &+L_{U}H^{3}S\sqrt{AT\ln(THSA/\delta)})\end{aligned}$	$\mathcal{O}(H\sqrt{TK\operatorname*{AdaReg}(T)})$	5.1
Unknown $M^{*}$	✗	$\begin{aligned} \mathcal{O}(&L_{U}\sqrt{H^{3}SAT(K\operatorname*{AdaReg}(T)+D)}\\ &+L_{U}H^{3}S\sqrt{AT\ln(THSA/\delta)})\end{aligned}$	$\begin{aligned} \mathcal{O}(H\sqrt{T(K\operatorname*{AdaReg}(T)+D)})\\ +D+C_{T}(\{\bar{\mu}^{t},r_{\max}\cdot\mathbf{1}-r^{*}\}_{t=1}^{T})\end{aligned}$	6.1

Here $K=\mathcal{O}(HSA\log T)$ and $D=\tilde{\mathcal{O}}(\sqrt{\beta_{T}H\dim_{E}(\mathcal{R},T^{-1})T})$ , where $\dim_{E}$ is the Eluder dimension of the reward function class $\mathcal{R}$ , and $\beta_{T}=\tilde{\mathcal{O}}(1)$ .

Appendix C OTHER RELATED WORKS

More Elaboration on Comparison between the Steering Setting and Contract Design Setting

The steering setup differs from the previous incentive design literature in two aspects: (1) it deals with “learning agents” that continuously update their policies, and (2) it cares about the steering gap towards a target policy and the cumulative steering cost. One of the most related and representative existing problem setups is contract design (a.k.a. the principal-agent problem), a classical problem dating back to the seminal work of Holmström, (1979). As we discussed in Sec. 1.1, it considers a similar mediator-agents interaction procedure. In the following, we elaborate on the comparison between the two settings to support our steering setting.

(1)

Contract design assumes the agents respond optimally to the mediator/principal (e.g. maximize the total return including the incentives by mediator), which is a quite strong assumption and “simplifies” the problem by making the agents’ behaviors predictable.

In contrast, the steering framework treats the agents’ behavior as a dynamic process. For example, Zhang et al., (2024) and our work consider no-regret behaviors, while (Huang et al., 2024b, ; Canyakmaz et al.,, 2024) assume Markovian learning dynamics. Such non-stationarity is more realistic in practice and introduces additional challenges in achieving a low steering gap and cost.
(2)

Contract design considers a more challenging objective: it aims to find the optimal incentive design that maximizes the mediator’s gain net of the incentive cost. Usually, it also assumes the agents’ behaviors are unobservable. Due to these challenges, most of the contract design literature focuses on the single-agent setting and assumes knowledge of the model.

On the other hand, the steering setting considers steering the agents to target policies that maximize some utility function, which makes the framework more general. Moreover, we do not pursue optimality in the steering cost; a sub-linear cost suffices. This is reasonable because in many scenarios we only face budget constraints and do not have to achieve the optimum. Such a relaxation also makes the problem more tractable.

Mean-field game

The mean-field game (MFG) is an important framework to model systems with a large number of symmetric agents (Laurière et al.,, 2024). Most works in the context of MFGs focus on learning equilibrium policies. As the pioneers, Lasry and Lions, (2007) and Huang et al., (2006) reveal that learning Nash Equilibrium (NE) is computationally efficient under monotonicity conditions if the model is known in advance. Without the knowledge of the true model, many previous works contribute sample-efficient model-free (Guo et al.,, 2021; Yardim et al.,, 2022; Perolat et al.,, 2021) and model-based (Huang et al.,, 2023; Huang et al., 2024a, ) methods to compute NE. Our mean-field game definition is similar to the general MFG setting (Guo et al.,, 2021), but unlike them, we assume transitions are density-independent and allow independence of agents’ policies. This density-independent transition assumption has been frequently considered in previous works (Lasry and Lions,, 2007; Huang et al.,, 2006; Hu and Zhang,, 2024; Perolat et al.,, 2021). To our knowledge, we are the first to investigate steering agents’ behaviors in the context of the mean-field game.

Mathematical Programming with Equilibrium Constraints (MPEC) and Mechanism Design

MPEC considers a bilevel optimization formulation, where the upper level can be a utility maximization problem and the lower level involves equilibrium constraints (Luo et al.,, 1996). A line of research (Liu et al.,, 2022; Wang et al.,, 2022; Yang et al.,, 2022) considers gradient-based approaches to solve MPEC problems. These approaches usually require strong assumptions for computing hyper-gradients, which may fail to hold in most games. In contrast, we do not rely on those assumptions, nor do we restrict the target policies to be equilibria. We only assume the agents are no-regret learners and do not require them to solve for the equilibria induced by the modified reward functions.

Another related field within game theory is Mechanism Design, which focuses on designing rules or systems (mechanisms) to achieve a specific objective, especially when participants (agents) have private information and act according to their own interests. Most recent works consider mechanism design on Markov Games (Curry et al.,, 2024; Baumann et al.,, 2020).

Guo et al., 2023b consider a bi-level optimization framework and another bi-objective variant, where the goal of the social planner is to solve for an equilibrium policy maximizing some social welfare function. They do not consider using steering rewards to influence the agents, and focus on the optimization side without considering model uncertainty. In contrast, we study the incentive design problem, and focus on how to explore and design appropriate steering rewards to guide agents’ behaviors without knowledge of the model.

Appendix D REGARDING NO-ADAPTIVE REGRET ASSUMPTION

D.1 Proof Of Proposition 3.1

See 3.1

Proof.

We have

	$\displaystyle\sup_{1\leq a<b\leq T}\max_{\mu\in\Psi_{M^{*}}}\sum_{t=a}^{b}\langle r^{*}(\bar{\mu}^{t})+R^{t}(\bar{\mu}^{t}),\mu-\bar{\mu}^{t}\rangle=\sup_{1\leq a<b\leq T}\max_{\mu\in\Psi_{M^{*}}}\sum_{t=a}^{b}\langle r^{*}(\bar{\mu}^{t})+R^{t}(\bar{\mu}^{t}),\mu-\frac{1}{N}\sum_{n=1}^{N}\mu^{\pi^{n,t}}\rangle$
	$\displaystyle\leq\frac{1}{N}\sum_{n=1}^{N}\sup_{1\leq a<b\leq T}\max_{\mu\in\Psi_{M^{*}}}\sum_{t=a}^{b}\langle r^{*}(\bar{\mu}^{t})+R^{t}(\bar{\mu}^{t}),\mu-\mu^{\pi^{n,t}}\rangle\leq\frac{1}{N}\sum_{n=1}^{N}\operatorname*{AdaReg}(T)=\operatorname*{AdaReg}(T),$

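where the second step holds because the supremum (and maximum) of an average is at most the average of the suprema (and maxima), and the third step uses Assumption B. ∎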

D.2 Concrete Examples Satisfying No-Adaptive Regret Assumption

In this section, we provide concrete examples of agents’ learning dynamics to support our arguments on the practicality of Assump. B.

Example 1: Colluded Agents with Full Observation of $R^{t}$

If the agents are able to observe the mediator’s steering strategy $R^{t}$ and $R^{t}$ is Lipschitz in the density (which is indeed satisfied by our proposed algorithms), the agents can collude and play an (approximate) Nash Equilibrium policy induced by the reward function $r^{*}+R^{t}$ , which is guaranteed to exist given the Lipschitz condition (Huang et al.,, 2023). By the definition of Nash Equilibrium, each agent will have non-positive adaptive regret, which satisfies Assump. B.

Note that in the contract design literature, it is usually assumed that the agents best respond (Ho et al.,, 2014; Zhu et al.,, 2022) to the principal’s (mediator’s) strategy if there is only one agent, or play equilibrium policies in the many-agent setting (Carmona and Wang,, 2021; Elie et al.,, 2019). Based on the discussion above, those assumptions are strictly stronger than, and imply, our no-adaptive-regret assumption.

Example 2: Independent Agents Conducting Online Convex Learning

In this second example, we consider less powerful agents who cannot observe the entire $R^{t}$ or coordinate with the other agents. Note that from an agent’s perspective, the interaction protocol in Procedure 1 can be interpreted as an online linear optimization task, as in Procedure 3.

Procedure 3 Agent-adversary interaction

1: for $t=1,...,T$
2: Agent chooses $x_{t}\in\mathcal{X}$ , where $\mathcal{X}\subseteq\mathbb{R}^{d}$ is a convex set in Euclidean space.
3: Adversary chooses a reward vector $r_{t}\in\mathbb{R}^{d}$ , possibly based on the history and $x_{t}$ .
4: Agent observes $r_{t}$ and obtains reward $\langle r_{t},x_{t}\rangle$ .
5: end for

In our setting, in each iteration $t\in[T]$ , the agents pick a density (by picking a policy) from the convex set $\Psi_{M^{*}}$ and receive potentially adversarial feedback $R^{t}(\bar{\mu}^{t})$ (or $\langle R^{t}(\bar{\mu}^{t}),\mu^{\pi^{n,t}}\rangle$ in the bandit-feedback setting). Then, Assump. B coincides with the standard no-adaptive-regret guarantee in the online convex optimization setting. Therefore, Assump. B can be realized if each agent independently adopts any no-adaptive-regret online learning algorithm (Hazan and Seshadhri,, 2007; Hazan,, 2023).

As a concrete algorithm choice, online gradient descent (OGD) achieves an external regret bound of $\frac{3}{2}GD\sqrt{T}$ (Hazan,, 2023), where $D\leq 2H$ is the diameter of $\mathcal{X}=\Psi_{M^{*}}$ and $G$ is an upper bound on $\|r_{t}\|_{2}\leq\sqrt{d}\|r_{t}\|_{\infty}$ . In our case, we can bound $G\leq\sqrt{HSA}(r_{\max}+R_{\max})$ . A bound for $r_{\max}+R_{\max}$ is discussed in Appendix D.4. Moreover, in the full feedback setting (the agents know the model $M$ and are able to observe $R^{t}(\bar{\mu}^{t})$ ), the no-adaptive-regret assumption is not much stronger than no-external-regret, as demonstrated by the following proposition.

Proposition D.1 (Theorem 1.3 of Hazan and Seshadhri, (2007)).

Let $(r_{t})_{t=1}^{T}$ be reward vectors in $[0,C]^{d}$ . Any algorithm following Procedure 3 with external regret $\operatorname*{Reg}(T)$ can be utilized to build an algorithm with adaptive regret at most $\operatorname*{Reg}(T)+\mathcal{O}(C\sqrt{T\log T})$ .

Thus, Assump. B can be satisfied with an adaptive regret bound of $\tilde{\mathcal{O}}(\sqrt{T})$ if all the agents follow OGD, modified as in Prop. D.1.

D.3 Motivating Adaptive Regret

Here, we give a small example that motivates why we need the no-adaptive-regret assumption instead of no-external-regret. External regret is one of the most common regret notions; it coincides with the adaptive regret in Assumption B when the interval is fixed to $a=1,b=T$ . If we want to steer the agents in different directions over time, the no-external-regret assumption might not be enough, as the following example shows.

Consider the stateless setting with $|\mathcal{A}|=2$ , where the incentive designer deploys $R(\mu)=\mathbf{e}_{1}$ for the first $T/2$ iterations and $R(\mu)=\mathbf{e}_{2}$ for the remaining $T/2$ iterations.

Suppose all the agents perform the Hedge algorithm, where

\displaystyle\bar{\mu}^{t}(a)=\bar{\pi}^{t}(a)=\frac{1}{Z_{t}}\exp\left(1+\eta\sum_{s=1}^{t-1}\langle R^{s}(\bar{\mu}^{s}),\mathbf{e}_{a}\rangle\right),

and $Z_{t}$ is the normalizing constant. This algorithm is known to have sublinear external regret (Freund and Schapire,, 1997). The population density at iteration $t\geq T/2$ is

\displaystyle\bar{\mu}^{t}=\frac{1}{Z_{t}}\left(\begin{matrix}\exp(1+\eta T/2)\\ \exp(1+\eta(t-T/2))\end{matrix}\right),

while the optimal action is $\mathbf{e}_{2}$ . Thus, over the interval $[T/2+1,T]$ , the agents accumulate expected regret

\displaystyle\sum_{t=T/2+1}^{T}(1-\bar{\mu}^{t}(2))=\sum_{t=T/2+1}^{T}\underset{\geq 1/2}{\underbrace{\frac{\exp(1+\eta T/2)}{\exp(1+\eta T/2)+\exp(1+\eta(t-T/2))}}}\geq T/4.

So, although this algorithm has no external regret, we still might have to wait $\Omega(T)$ many rounds to let the agents converge to a different density. One can easily observe that with the no-adaptive-regret assumption, this is not an issue.
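The example above can be checked numerically. The following is a minimal Python sketch, not part of the original analysis: the function name `hedge_second_half_regret` and the learning rate are illustrative assumptions, and the constant offset $1$ inside the exponent is dropped because it cancels in the normalization by $Z_{t}$ .

```python
import math

def hedge_second_half_regret(T, eta):
    """Simulate Hedge on two actions where action 1 is rewarded during the
    first T/2 rounds and action 2 afterwards. Returns the expected regret
    accumulated over rounds T/2+1, ..., T (hypothetical helper)."""
    cum = [0.0, 0.0]      # cumulative reward collected by each action so far
    regret = 0.0
    for t in range(1, T + 1):
        # Hedge distribution; subtracting the max keeps exp() stable
        m = max(cum)
        w = [math.exp(eta * (c - m)) for c in cum]
        z = sum(w)
        p = [x / z for x in w]
        rewarded = 0 if t <= T // 2 else 1
        if t > T // 2:
            regret += 1.0 - p[1]   # optimal action has reward 1
        cum[rewarded] += 1.0
    return regret

T = 1000
eta = math.sqrt(math.log(2) / T)   # assumed learning rate, any eta > 0 works
print(hedge_second_half_regret(T, eta), T / 4)
```

In this sketch, the weight on the first action stays above $1/2$ throughout the second half regardless of $\eta>0$ , which is why the regret on that interval is linear in $T$ .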

D.4 Boundedness Of Steering Rewards

As we see in Assumption B, the adaptive regret bound $\operatorname*{AdaReg}(T)$ is dependent on $r_{\max}+R_{\max}$ . In this section, we show that $R_{\max}=\mathcal{O}(1+r_{\max})$ for both of our steering rewards $R_{\text{z}}$ and $R_{\text{nz}}$ .

Proposition D.2.

For any $\pi\in\Pi$ and $\mu\in\Psi$ , $\|R_{\pi}(\mu)\|_{\infty}\leq 2$ , where $R_{\pi}$ is defined as in Eq. (5).

Proof.

As one can observe using the definition of $R_{\pi}$ in Eq. (5) and $W^{\pi}$ in Eq. (4), we have for any $h,s,a$ ,

	$\displaystyle\left|(R_{\pi}(\mu))_{h,s,a}\right|=\left|(\mu^{\top}(W^{\pi}-I)^{\top}(W^{\pi}-I))_{h,s,a}\right|=\left|(W^{\pi}\mu-\mu)^{\top}(W^{\pi}-I)_{(h,s,a)}\right|$
	$\displaystyle=\left|\sum_{a^{\prime}}(\pi_{h}(a^{\prime}|s)\mu_{h}(s)-\mu_{h}(s,a^{\prime}))(\pi_{h}(a^{\prime}|s)-\mathbb{I}\{a^{\prime}=a\})\right|\leq\sum_{a^{\prime}}\underbrace{|\pi_{h}(a^{\prime}|s)\mu_{h}(s)-\mu_{h}(s,a^{\prime})|}_{\leq 1}\cdot|\pi_{h}(a^{\prime}|s)-\mathbb{I}\{a^{\prime}=a\}|$
	$\displaystyle\leq\sum_{a^{\prime}\neq a}\pi_{h}(a^{\prime}|s)+|\pi_{h}(a|s)-1|\leq 2,$

where $(W^{\pi}-I)_{(h,s,a)}$ is the $(h,s,a)$ -th column of $W^{\pi}-I$ . ∎
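As a quick numerical illustration of the bound, the Python sketch below evaluates the per-component expression $\sum_{a^{\prime}}(\pi_{h}(a^{\prime}|s)\mu_{h}(s)-\mu_{h}(s,a^{\prime}))(\pi_{h}(a^{\prime}|s)-\mathbb{I}\{a^{\prime}=a\})$ from the proof on random inputs. The helper names are hypothetical, and the sketch only uses this component formula as it appears in the proof; it does not rely on the definitions in Eq. (4) and Eq. (5), which are not restated here.

```python
import random

def random_simplex(n, rng):
    """Sample a random probability vector of length n."""
    x = [rng.random() for _ in range(n)]
    total = sum(x)
    return [v / total for v in x]

def steering_component(pi_s, mu_s, a):
    """sum_{a'} (pi(a'|s) mu(s) - mu(s,a')) (pi(a'|s) - 1{a'=a}) for a fixed
    (h, s); pi_s and mu_s are lists indexed by action."""
    mass = sum(mu_s)  # state mass mu_h(s)
    return sum((pi_s[b] * mass - mu_s[b]) * (pi_s[b] - (1.0 if b == a else 0.0))
               for b in range(len(pi_s)))

def max_abs_component(A, trials, seed=0):
    """Largest absolute component value observed over random draws."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        pi_s = random_simplex(A, rng)
        mass = rng.random()                          # mu_h(s) lies in [0, 1]
        mu_s = [mass * v for v in random_simplex(A, rng)]
        for a in range(A):
            worst = max(worst, abs(steering_component(pi_s, mu_s, a)))
    return worst

print(max_abs_component(A=4, trials=2000))
```

The deterministic pair $\pi_{h}(\cdot|s)=(1,0)$ , $\mu_{h}(s,\cdot)=(0,1)$ attains the value $2$ for the second action, so the constant in the bound cannot be improved in general.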

Proposition D.3.

We have for $R_{\text{z}}$ , as defined in Alg. 2, and for $R_{\text{nz}}$ , as defined in Eq. (6), that for all iterations $t\in[T]$ ,

\displaystyle\|R_{\text{z}}^{t}(\mu)\|_{\infty}\leq 4\quad\text{and}\quad\|R_{% \text{nz}}^{t}(\mu)\|_{\infty}\leq 2r_{\max}+4\quad\forall\mu\in\Delta_{% \mathcal{S}\times\mathcal{A}}^{H}.

Proof.

We have $\|R_{\text{z}}^{t}(\mu)\|_{\infty}=\|R_{\pi_{*}^{k(t)}}(\mu)+\|R_{\pi_{*}^{k(t% )}}(\mu)\|_{\infty}\bm{1}\|_{\infty}\leq 2\|R_{\pi_{*}^{k(t)}}(\mu)\|_{\infty}$ , which is at most $4$ , by Prop. D.2.

By the definition of $w_{\hat{\mathcal{R}}^{t}}$ , we know that the elements of $w_{\hat{\mathcal{R}}^{t}}(\mu)$ are bounded in $[0,r_{\max}]$ for any $\mu$ . Therefore, and by Prop. D.2,

	$\displaystyle\left\|R_{\text{nz}}^{t}(\mu)\right\|_{\infty}$	$\displaystyle=\left\|R_{\pi_{*}^{k(t)}}(\mu)-(\bar{r}^{t}(\mu)-w_{\hat{\mathcal{R}}^{t}}(\mu))+(r_{\max}+\|R_{\pi_{*}^{k(t)}}(\mu)\|_{\infty})\bm{1}\right\|_{\infty}$
		$\displaystyle\leq\left\|R_{\pi_{*}^{k(t)}}(\mu)+\|R_{\pi_{*}^{k(t)}}(\mu)\|_{\infty}\bm{1}\right\|_{\infty}+\left\|\bar{r}^{t}(\mu)-w_{\hat{\mathcal{R}}^{t}}(\mu)\right\|_{\infty}+\left\|r_{\max}\bm{1}\right\|_{\infty}$
		$\displaystyle\leq 4+2r_{\max}.$

∎

Appendix E STATE-ACTION DENSITY

E.1 $\Psi_{M}$ Is Convex

Lemma E.1.

\displaystyle\Psi_{M}=\{\mu:\mu\geq 0,\sum_{a^{\prime}}\mu_{h+1}(s,a^{\prime})% =\sum_{s^{\prime},a^{\prime}}\mathbb{P}_{M,h}(s|s^{\prime},a^{\prime})\mu_{h}(% s^{\prime},a^{\prime})\forall h,s,\sum_{a^{\prime}}\mu_{1}(s,a^{\prime})=\mu_{% 1}(s)\}

Proof.

We abbreviate $\mathbb{P}=\mathbb{P}_{M}$ and $\mu=\mu_{M}$ , since the model is fixed throughout. For $\mu\in\Psi_{M}$ , it is easy to see that the conditions on the right-hand side are fulfilled. The other direction is more involved. Suppose $\tilde{\mu}$ fulfills $\tilde{\mu}\geq 0$ and for all $s,h$ , $\sum_{a^{\prime}}\tilde{\mu}_{h+1}(s,a^{\prime})=\sum_{s^{\prime},a^{\prime}}% \mathbb{P}_{h}(s|s^{\prime},a^{\prime})\tilde{\mu}_{h}(s^{\prime},a^{\prime})$ as well as $\sum_{a^{\prime}}\tilde{\mu}_{1}(s,a^{\prime})=\mu_{1}(s)$ . Now, define $\pi$ such that for all $s,a,h$ ,

\displaystyle\pi_{h}(a|s)=\begin{cases}\frac{\tilde{\mu}_{h}(s,a)}{\sum_{a^{% \prime}}\tilde{\mu}_{h}(s,a^{\prime})},&\text{if }\sum_{a^{\prime}}\tilde{\mu}% _{h}(s,a^{\prime})\neq 0\\ 1/A,&\text{else}\end{cases}.

Clearly, $\pi_{h}(a|s)\geq 0$ and $\sum_{a}\pi_{h}(a|s)=1$ , which means $\pi\in\Pi$ . First of all,

\displaystyle\mu_{1}^{\pi}(s,a)=\pi_{1}(a|s)\mu_{1}(s)=\frac{\tilde{\mu}_{1}(s% ,a)}{\sum_{a^{\prime}}\tilde{\mu}_{1}(s,a^{\prime})}\mu_{1}(s)=\tilde{\mu}_{1}% (s,a).

We proceed by induction: suppose $\mu_{h}^{\pi}=\tilde{\mu}_{h}$ for some $h\geq 1$ . If $\sum_{a^{\prime}}\tilde{\mu}_{h+1}(s,a^{\prime})\neq 0$ , then

	$\displaystyle\mu_{h+1}^{\pi}(s,a)$	$\displaystyle=\sum_{s^{\prime},a^{\prime}}\pi_{h+1}(a|s)\mathbb{P}_{h}(s|s^{\prime},a^{\prime})\mu_{h}^{\pi}(s^{\prime},a^{\prime})$
		$\displaystyle=\pi_{h+1}(a|s)\sum_{s^{\prime},a^{\prime}}\mathbb{P}_{h}(s|s^{\prime},a^{\prime})\tilde{\mu}_{h}(s^{\prime},a^{\prime})$
		$\displaystyle=\frac{\tilde{\mu}_{h+1}(s,a)}{\sum_{a^{\prime}}\tilde{\mu}_{h+1}(s,a^{\prime})}\sum_{a^{\prime}}\tilde{\mu}_{h+1}(s,a^{\prime})=\tilde{\mu}_{h+1}(s,a),$

and in case $\sum_{a^{\prime}}\tilde{\mu}_{h+1}(s,a^{\prime})=0$ , we know that $\tilde{\mu}_{h+1}(s,a)=0$ and therefore

	$\displaystyle\mu_{h+1}^{\pi}(s,a)$	$\displaystyle=\sum_{s^{\prime},a^{\prime}}\pi_{h+1}(a|s)\mathbb{P}_{h}(s|s^{\prime},a^{\prime})\mu_{h}^{\pi}(s^{\prime},a^{\prime})$
		$\displaystyle=\pi_{h+1}(a|s)\sum_{s^{\prime},a^{\prime}}\mathbb{P}_{h}(s|s^{\prime},a^{\prime})\tilde{\mu}_{h}(s^{\prime},a^{\prime})$
		$\displaystyle=1/A\sum_{a^{\prime}}\tilde{\mu}_{h+1}(s,a^{\prime})=0=\tilde{\mu}_{h+1}(s,a).$

We can conclude that $\tilde{\mu}=\mu^{\pi}$ and thus $\tilde{\mu}\in\Psi_{M}$ . ∎
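The construction in this proof can be sketched in a few lines of Python. The code below is an illustration under assumptions: the nested-list layout `P[h][s][a][s_next]`, the helper names, and the random instance are all hypothetical choices. It builds a density from a random policy (so the flow constraints hold by construction), recovers a policy by normalizing over actions with a uniform fallback on zero-mass states, and checks that the recovered policy induces the same density.

```python
import random

def rand_dist(n, rng):
    """Sample a random probability vector of length n."""
    x = [rng.random() for _ in range(n)]
    total = sum(x)
    return [v / total for v in x]

def density(P, mu1, pi):
    """mu_1(s,a) = pi_1(a|s) mu_1(s) and
    mu_{h+1}(s,a) = pi_{h+1}(a|s) sum_{s',a'} P_h(s|s',a') mu_h(s',a')."""
    H, S, A = len(pi), len(mu1), len(pi[0][0])
    mu = [[[pi[0][s][a] * mu1[s] for a in range(A)] for s in range(S)]]
    for h in range(H - 1):
        flow = [sum(P[h][sp][ap][s] * mu[h][sp][ap]
                    for sp in range(S) for ap in range(A)) for s in range(S)]
        mu.append([[pi[h + 1][s][a] * flow[s] for a in range(A)] for s in range(S)])
    return mu

def policy_from_density(mu):
    """Normalize mu_h(s, .) over actions; use the uniform policy when the
    state carries zero mass, as in the case distinction of the proof."""
    A = len(mu[0][0])
    return [[[v / sum(row) for v in row] if sum(row) > 0 else [1.0 / A] * A
             for row in layer] for layer in mu]

rng = random.Random(0)
H, S, A = 4, 3, 2
P = [[[rand_dist(S, rng) for _ in range(A)] for _ in range(S)] for _ in range(H - 1)]
mu1 = [0.0, 0.4, 0.6]                     # first state has zero initial mass
pi = [[rand_dist(A, rng) for _ in range(S)] for _ in range(H)]
mu = density(P, mu1, pi)
mu_back = density(P, mu1, policy_from_density(mu))
gap = max(abs(mu[h][s][a] - mu_back[h][s][a])
          for h in range(H) for s in range(S) for a in range(A))
print(gap)
```

The zero-mass initial state exercises the uniform fallback: the recovered policy differs from the original one there, yet the induced densities agree.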

Lemma E.2.

\displaystyle\Psi_{M}=\{\mu:\mu\geq 0,B\mu=b\},

where

\displaystyle B=\left(\begin{matrix}D&&&&\\ -\mathbb{P}_{M,1}^{\top}&D&&&\\ &-\mathbb{P}_{M,2}^{\top}&D&&\\ &&...&&\\ &&&-\mathbb{P}_{M,H-1}^{\top}&D\end{matrix}\right),\quad b=\left(\begin{matrix% }\mu_{1}\\ 0\\ \vdots\\ 0\end{matrix}\right),

$D:=I_{S}\otimes\bm{1}_{A}^{\top}$ ( $\otimes$ is the tensor product) and $\mathbb{P}_{M}$ is viewed as a matrix such that $(\mathbb{P}_{M,h})_{(s,a),s^{\prime}}=\mathbb{P}_{M,h}(s^{\prime}|s,a)$ . An immediate consequence of this formulation is that $\Psi_{M}$ is convex.

Proof.

This result is simply a reformulation of Lemma E.1. We can rewrite the condition $\sum_{a^{\prime}}\mu_{h+1}(s,a^{\prime})=\sum_{s^{\prime},a^{\prime}}\mathbb{P% }_{M,h}(s|s^{\prime},a^{\prime})\mu_{h}(s^{\prime},a^{\prime})$ as $D\mu_{h+1}=\mathbb{P}_{M,h}^{\top}\mu_{h}$ . The condition $\sum_{a^{\prime}}\mu_{1}(s,a^{\prime})=\mu_{1}(s)$ can be written as $D\mu_{1}(\cdot,\cdot)=\mu_{1}$ . ∎
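To make the block structure of $B$ concrete, the following Python sketch assembles $B$ and $b$ for a small random model and checks $B\mu=b$ for a density induced by a random policy. The flattening order $(h,s,a)\mapsto hSA+sA+a$ , the list layout `P[h][s][a][s_next]`, and all function names are assumptions made for this illustration.

```python
import random

def rand_dist(n, rng):
    x = [rng.random() for _ in range(n)]
    total = sum(x)
    return [v / total for v in x]

def density(P, mu1, pi):
    """State-action density of policy pi, layer by layer."""
    H, S, A = len(pi), len(mu1), len(pi[0][0])
    mu = [[[pi[0][s][a] * mu1[s] for a in range(A)] for s in range(S)]]
    for h in range(H - 1):
        flow = [sum(P[h][sp][ap][s] * mu[h][sp][ap]
                    for sp in range(S) for ap in range(A)) for s in range(S)]
        mu.append([[pi[h + 1][s][a] * flow[s] for a in range(A)] for s in range(S)])
    return mu

def build_B_b(P, mu1, H, S, A):
    """Block row h has D = I_S (x) 1_A^T on the diagonal and, for h > 1,
    -P_{h-1}^T on the sub-diagonal; b stacks mu_1 followed by zeros."""
    n = S * A
    B = [[0.0] * (H * n) for _ in range(H * S)]
    for h in range(H):
        for s in range(S):
            row = B[h * S + s]
            for a in range(A):
                row[h * n + s * A + a] = 1.0
            if h > 0:
                for sp in range(S):
                    for ap in range(A):
                        row[(h - 1) * n + sp * A + ap] = -P[h - 1][sp][ap][s]
    return B, list(mu1) + [0.0] * ((H - 1) * S)

def residual(B, b, mu):
    """Max absolute entry of B mu - b for a density given as nested lists."""
    vec = [v for layer in mu for row in layer for v in row]
    return max(abs(sum(c * x for c, x in zip(r, vec)) - t) for r, t in zip(B, b))

rng = random.Random(1)
H, S, A = 3, 3, 2
P = [[[rand_dist(S, rng) for _ in range(A)] for _ in range(S)] for _ in range(H - 1)]
mu1 = rand_dist(S, rng)
B, b = build_B_b(P, mu1, H, S, A)
mu_a = density(P, mu1, [[rand_dist(A, rng) for _ in range(S)] for _ in range(H)])
mu_b = density(P, mu1, [[rand_dist(A, rng) for _ in range(S)] for _ in range(H)])
mix = [[[0.3 * x + 0.7 * y for x, y in zip(ra, rb)] for ra, rb in zip(la, lb)]
       for la, lb in zip(mu_a, mu_b)]
print(residual(B, b, mu_a), residual(B, b, mix))
```

The convex combination `mix` also satisfies the linear constraints and stays non-negative, which is the convexity statement in numerical form.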

E.2 Inequalities

Lemma E.3.

For any model $M$ and any $\pi,\tilde{\pi}\in\Pi$ ,

\displaystyle\|\mu_{M}^{\pi}-\mu_{M}^{\tilde{\pi}}\|_{1}\leq H\sum_{h,s}\mu_{M% ,h}^{\pi}(s)\|\pi_{h}(\cdot|s)-\tilde{\pi}_{h}(\cdot|s)\|_{1},

where $\mu_{M,h}^{\pi}(s)=\sum_{a}\mu_{M,h}^{\pi}(s,a)$ .

Proof.

Since the model $M$ is fixed throughout, we abbreviate $\mu=\mu_{M}$ and $\mathbb{P}=\mathbb{P}_{M}$ . First of all, $\|\mu_{1}^{\pi}-\mu_{1}^{\tilde{\pi}}\|_{1}=\sum_{s,a}\mu_{1}(s)|\pi_{1}(a|s)-% \tilde{\pi}_{1}(a|s)|$ . Furthermore, for any $h$ ,

	$\displaystyle\|\mu_{h+1}^{\pi}-\mu_{h+1}^{\tilde{\pi}}\|_{1}$	$\displaystyle=\sum_{s,a}|\mu_{h+1}^{\pi}(s,a)-\mu_{h+1}^{\tilde{\pi}}(s,a)|$
		$\displaystyle=\sum_{s,a}\left|\pi_{h+1}(a|s)\sum_{s^{\prime},a^{\prime}}\mu_{h}^{\pi}(s^{\prime},a^{\prime})\mathbb{P}_{h}(s|s^{\prime},a^{\prime})-\tilde{\pi}_{h+1}(a|s)\sum_{s^{\prime},a^{\prime}}\mu_{h}^{\tilde{\pi}}(s^{\prime},a^{\prime})\mathbb{P}_{h}(s|s^{\prime},a^{\prime})\right|$
		$\displaystyle\leq\sum_{s,a}\left|\pi_{h+1}(a|s)-\tilde{\pi}_{h+1}(a|s)\right|\sum_{s^{\prime},a^{\prime}}\mu_{h}^{\pi}(s^{\prime},a^{\prime})\mathbb{P}_{h}(s|s^{\prime},a^{\prime})$
		$\displaystyle~{}+\sum_{s,a}\tilde{\pi}_{h+1}(a|s)\sum_{s^{\prime},a^{\prime}}\left|\mu_{h}^{\pi}(s^{\prime},a^{\prime})-\mu_{h}^{\tilde{\pi}}(s^{\prime},a^{\prime})\right|\mathbb{P}_{h}(s|s^{\prime},a^{\prime})$
		$\displaystyle=\sum_{s,a}\mu_{h+1}^{\pi}(s)\left|\pi_{h+1}(a|s)-\tilde{\pi}_{h+1}(a|s)\right|+\sum_{s^{\prime},a^{\prime}}\left|\mu_{h}^{\pi}(s^{\prime},a^{\prime})-\mu_{h}^{\tilde{\pi}}(s^{\prime},a^{\prime})\right|\sum_{s,a}\tilde{\pi}_{h+1}(a|s)\mathbb{P}_{h}(s|s^{\prime},a^{\prime})$
		$\displaystyle=\sum_{s}\mu_{h+1}^{\pi}(s)\|\pi_{h+1}(\cdot|s)-\tilde{\pi}_{h+1}(\cdot|s)\|_{1}+\|\mu_{h}^{\pi}-\mu_{h}^{\tilde{\pi}}\|_{1}.$

By induction,

\displaystyle\|\mu_{h}^{\pi}-\mu_{h}^{\tilde{\pi}}\|_{1}\leq\sum_{h^{\prime}=1% }^{h}\sum_{s}\mu_{h^{\prime}}^{\pi}(s)\|\pi_{h^{\prime}}(\cdot|s)-\tilde{\pi}_% {h^{\prime}}(\cdot|s)\|_{1}.

Finally,

	$\displaystyle\|\mu^{\pi}-\mu^{\tilde{\pi}}\|_{1}$	$\displaystyle\leq\sum_{h=1}^{H}\sum_{h^{\prime}=1}^{h}\sum_{s}\mu_{h^{\prime}}^{\pi}(s)\|\pi_{h^{\prime}}(\cdot|s)-\tilde{\pi}_{h^{\prime}}(\cdot|s)\|_{1}$
		$\displaystyle\leq H\sum_{h=1}^{H}\sum_{s}\mu_{h}^{\pi}(s)\|\pi_{h}(\cdot|s)-\tilde{\pi}_{h}(\cdot|s)\|_{1}.$

∎
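Lemma E.3 can be sanity-checked on random instances. The Python sketch below computes both sides of the inequality for two random policies in the same random model; the data layout `P[h][s][a][s_next]` and the helper names are illustrative assumptions, and a numerical check of this kind is of course no substitute for the proof.

```python
import random

def rand_dist(n, rng):
    x = [rng.random() for _ in range(n)]
    total = sum(x)
    return [v / total for v in x]

def density(P, mu1, pi):
    """State-action density of policy pi, layer by layer."""
    H, S, A = len(pi), len(mu1), len(pi[0][0])
    mu = [[[pi[0][s][a] * mu1[s] for a in range(A)] for s in range(S)]]
    for h in range(H - 1):
        flow = [sum(P[h][sp][ap][s] * mu[h][sp][ap]
                    for sp in range(S) for ap in range(A)) for s in range(S)]
        mu.append([[pi[h + 1][s][a] * flow[s] for a in range(A)] for s in range(S)])
    return mu

def lemma_e3_sides(P, mu1, pi, pi_tilde):
    """Return (lhs, rhs) with lhs = ||mu^pi - mu^pi_tilde||_1 and
    rhs = H * sum_{h,s} mu_h^pi(s) * ||pi_h(.|s) - pi_tilde_h(.|s)||_1."""
    H, S, A = len(pi), len(mu1), len(pi[0][0])
    mu, mu_t = density(P, mu1, pi), density(P, mu1, pi_tilde)
    lhs = sum(abs(mu[h][s][a] - mu_t[h][s][a])
              for h in range(H) for s in range(S) for a in range(A))
    rhs = H * sum(sum(mu[h][s]) * sum(abs(pi[h][s][a] - pi_tilde[h][s][a])
                                      for a in range(A))
                  for h in range(H) for s in range(S))
    return lhs, rhs

def random_check(seed, H=4, S=3, A=3):
    """Draw a random model and two random policies; return both sides."""
    rng = random.Random(seed)
    P = [[[rand_dist(S, rng) for _ in range(A)] for _ in range(S)]
         for _ in range(H - 1)]
    mu1 = rand_dist(S, rng)
    pi = [[rand_dist(A, rng) for _ in range(S)] for _ in range(H)]
    pi_tilde = [[rand_dist(A, rng) for _ in range(S)] for _ in range(H)]
    return lemma_e3_sides(P, mu1, pi, pi_tilde)

print(random_check(0))
```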

See 4.2

Proof.

Note that $\Psi_{M}$ is a convex set and $\bar{\mu}_{M}\in\Psi_{M}$ . By definition, we have $\bar{\mu}_{M}=\mu_{M}^{\bar{\pi}}$ . Applying Lem. E.3 to the policies $\bar{\pi}$ and $\pi$ in model $M$ finishes the proof. ∎

Lemma E.4.

Consider any $\pi\in\Pi$ and models $M,\tilde{M}$ , which are identical except for their transition functions $\mathbb{P},\tilde{\mathbb{P}}$ , respectively. Then,

\displaystyle\|\mu_{\tilde{M}}^{\pi}-\mu_{M}^{\pi}\|_{1}\leq H\sum_{h=1}^{H-1}% \sum_{s,a}\mu_{M,h}^{\pi}(s,a)\|\tilde{\mathbb{P}}_{h}(\cdot|s,a)-\mathbb{P}_{% h}(\cdot|s,a)\|_{1}.

Proof.

Recall the definition of the state-action density function:

\displaystyle\mu_{M,h+1}^{\pi}(s,a)=\sum_{s^{\prime},a^{\prime}}\pi_{h+1}(a|s)% \mathbb{P}_{M,h}(s|s^{\prime},a^{\prime})\mu_{M,h}^{\pi}(s^{\prime},a^{\prime}% )\quad\text{and}\quad\mu_{1}^{\pi}(s,a)=\pi_{1}(a|s)\mu_{1}(s).

We abbreviate $\tilde{\mu}=\mu_{\tilde{M}}^{\pi},\mu=\mu_{M}^{\pi}$ . Since $\mu_{1}$ is the same for $M$ and $\tilde{M}$ , $\sum_{s,a}|\tilde{\mu}_{1}(s,a)-\mu_{1}(s,a)|=0$ . Furthermore, for all $h$ ,

	$\displaystyle\|\tilde{\mu}_{h+1}-\mu_{h+1}\|_{1}=\sum_{s,a}\left|\tilde{\mu}_{h+1}(s,a)-\mu_{h+1}(s,a)\right|$
	$\displaystyle\leq\sum_{s,a}\sum_{s^{\prime},a^{\prime}}\pi_{h+1}(a|s)\left|\tilde{\mathbb{P}}_{h}(s|s^{\prime},a^{\prime})\tilde{\mu}_{h}(s^{\prime},a^{\prime})-\mathbb{P}_{h}(s|s^{\prime},a^{\prime})\mu_{h}(s^{\prime},a^{\prime})\right|$
	$\displaystyle=\sum_{s,a,s^{\prime}}\left|\tilde{\mathbb{P}}_{h}(s^{\prime}|s,a)\tilde{\mu}_{h}(s,a)-\mathbb{P}_{h}(s^{\prime}|s,a)\mu_{h}(s,a)\right|$
	$\displaystyle\leq\sum_{s,a,s^{\prime}}\tilde{\mathbb{P}}_{h}(s^{\prime}|s,a)\left|\tilde{\mu}_{h}(s,a)-\mu_{h}(s,a)\right|+\mu_{h}(s,a)\left|\tilde{\mathbb{P}}_{h}(s^{\prime}|s,a)-\mathbb{P}_{h}(s^{\prime}|s,a)\right|$
	$\displaystyle=\sum_{s,a}\left|\tilde{\mu}_{h}(s,a)-\mu_{h}(s,a)\right|+\sum_{s,a}\mu_{h}(s,a)\sum_{s^{\prime}}\left|\tilde{\mathbb{P}}_{h}(s^{\prime}|s,a)-\mathbb{P}_{h}(s^{\prime}|s,a)\right|$
	$\displaystyle=\|\tilde{\mu}_{h}-\mu_{h}\|_{1}+\sum_{s,a}\mu_{h}(s,a)\|\tilde{\mathbb{P}}_{h}(\cdot|s,a)-\mathbb{P}_{h}(\cdot|s,a)\|_{1}.$

Using induction on $h$ , we obtain $\|\tilde{\mu}_{h}-\mu_{h}\|_{1}\leq\sum_{h^{\prime}=1}^{h-1}\sum_{s,a}\mu_{h^{% \prime}}(s,a)\|\tilde{\mathbb{P}}_{h^{\prime}}(\cdot|s,a)-\mathbb{P}_{h^{% \prime}}(\cdot|s,a)\|_{1}$ . Thus,

	$\displaystyle\|\tilde{\mu}-\mu\|_{1}$	$\displaystyle\leq\sum_{h=1}^{H}\sum_{h^{\prime}=1}^{h-1}\sum_{s,a}\mu_{h^{\prime}}(s,a)\|\tilde{\mathbb{P}}_{h^{\prime}}(\cdot|s,a)-\mathbb{P}_{h^{\prime}}(\cdot|s,a)\|_{1}$
		$\displaystyle\leq H\sum_{h=1}^{H-1}\sum_{s,a}\mu_{h}(s,a)\|\tilde{\mathbb{P}}_{h}(\cdot|s,a)-\mathbb{P}_{h}(\cdot|s,a)\|_{1}.$

∎
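The bound of Lemma E.4 can likewise be checked numerically. The Python sketch below fixes one random policy, draws two random transition kernels, and compares both sides of the inequality; the list layout `P[h][s][a][s_next]` and the helper names are assumptions made for illustration only.

```python
import random

def rand_dist(n, rng):
    x = [rng.random() for _ in range(n)]
    total = sum(x)
    return [v / total for v in x]

def density(P, mu1, pi):
    """State-action density of policy pi under transitions P."""
    H, S, A = len(pi), len(mu1), len(pi[0][0])
    mu = [[[pi[0][s][a] * mu1[s] for a in range(A)] for s in range(S)]]
    for h in range(H - 1):
        flow = [sum(P[h][sp][ap][s] * mu[h][sp][ap]
                    for sp in range(S) for ap in range(A)) for s in range(S)]
        mu.append([[pi[h + 1][s][a] * flow[s] for a in range(A)] for s in range(S)])
    return mu

def lemma_e4_sides(P, P_tilde, mu1, pi):
    """Return (lhs, rhs) with lhs = ||mu_{M~}^pi - mu_M^pi||_1 and
    rhs = H * sum_{h<H} sum_{s,a} mu_{M,h}^pi(s,a) ||P~_h(.|s,a) - P_h(.|s,a)||_1."""
    H, S, A = len(pi), len(mu1), len(pi[0][0])
    mu, mu_t = density(P, mu1, pi), density(P_tilde, mu1, pi)
    lhs = sum(abs(mu_t[h][s][a] - mu[h][s][a])
              for h in range(H) for s in range(S) for a in range(A))
    rhs = H * sum(mu[h][s][a] * sum(abs(P_tilde[h][s][a][sn] - P[h][s][a][sn])
                                    for sn in range(S))
                  for h in range(H - 1) for s in range(S) for a in range(A))
    return lhs, rhs

def random_check(seed, H=4, S=3, A=2):
    """Draw two random kernels sharing mu_1 and a policy; return both sides."""
    rng = random.Random(seed)
    def kernel():
        return [[[rand_dist(S, rng) for _ in range(A)] for _ in range(S)]
                for _ in range(H - 1)]
    P, P_tilde = kernel(), kernel()
    mu1 = rand_dist(S, rng)
    pi = [[rand_dist(A, rng) for _ in range(S)] for _ in range(H)]
    return lemma_e4_sides(P, P_tilde, mu1, pi)

print(random_check(0))
```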

Appendix F PROOFS OF RESULTS IN SECTION 4

See 4.1

Proof.

We abbreviate $\bar{\mu}^{t}=\bar{\mu}^{t}_{M^{*}}$ . By Assumption B, $\sum_{t=1}^{T}\|\bar{\mu}^{t}-\mu^{*}\|_{2}^{2}=\sum_{t=1}^{T}\langle R^{t}(% \bar{\mu}^{t}),\mu^{*}-\bar{\mu}^{t}\rangle\leq\operatorname*{AdaReg}(T)$ . Thus,

\displaystyle\sum_{t=1}^{T}\|\bar{\mu}^{t}-\mu^{*}\|_{1}^{2}\leq HSA\sum_{t=1}% ^{T}\|\bar{\mu}^{t}-\mu^{*}\|_{2}^{2}\leq HSA\operatorname*{AdaReg}(T).

By Assumption C and Jensen’s inequality,

\displaystyle\max_{\mu}\sum_{t=1}^{T}U(\mu)-U(\bar{\mu}^{t})\leq L_{U}\cdot% \sum_{t=1}^{T}\|\mu^{*}-\bar{\mu}^{t}\|_{1}\leq L_{U}\cdot\sqrt{HSAT% \operatorname*{AdaReg}(T)}.

The steering cost can be bounded similarly.

	$\displaystyle\sum_{t=1}^{T}\langle R^{t}(\bar{\mu}^{t}),\bar{\mu}^{t}\rangle$	$\displaystyle=\sum_{t=1}^{T}H\|\mu^{*}-\bar{\mu}^{t}\|_{\infty}+\langle\mu^{*}-\bar{\mu}^{t},\bar{\mu}^{t}\rangle\leq 2H\sum_{t=1}^{T}\|\mu^{*}-\bar{\mu}^{t}\|_{\infty}$
		$\displaystyle\leq 2H\sum_{t=1}^{T}\|\mu^{*}-\bar{\mu}^{t}\|_{2}\leq 2H\sqrt{T\operatorname*{AdaReg}(T)}.$

Given that $\operatorname*{AdaReg}$ is sub-linear in $T$ , we finish the proof. ∎

See 4.3

Proof.

The bound for the steering gap can be shown by first using the $L_{U}$ -Lipschitzness of $U$ and then applying Lem. G.1 under Assump. B, where $\pi_{*}^{t}=\pi$ for all $t$ . The calculation of the steering cost is the same as in the proof of Theorem 5.1 with $K=1$ . ∎

Appendix G PROOF OF THEOREM 5.1

Lemma G.1.

Let $(\pi_{*}^{t})_{t=1}^{T}$ be a sequence of policies. We abbreviate $\bar{\mu}^{t}=\bar{\mu}^{t}_{M^{*}},\mu^{\pi_{*}^{t}}=\mu^{\pi_{*}^{t}}_{M^{*}}$ . Then,

	$\displaystyle\frac{1}{H}\sum_{t=1}^{T}\|\bar{\mu}^{t}-\mu^{\pi_{*}^{t}}\|_{1}\leq$	$\displaystyle\sum_{t=1}^{T}\sum_{h=1}^{H}\sum_{s\in\mathcal{S}}\bar{\mu}^{t}_{h}(s)\|\bar{\pi}^{t}_{h}(\cdot|s)-\pi_{*,h}^{t}(\cdot|s)\|_{1}$
	$\displaystyle\leq$	$\displaystyle\sqrt{HSAT\sum_{t=1}^{T}\left\langle R_{\pi_{*}^{t}}(\bar{\mu}^{t}),\mu^{\pi_{*}^{t}}-\bar{\mu}^{t}\right\rangle},$

where $\bar{\pi}^{t}$ is the (population) policy which induces $\bar{\mu}^{t}=\mu^{\bar{\pi}^{t}}$ .

Proof.

The first inequality follows from Lemma E.3. We can write

	$\displaystyle\sum_{t=1}^{T}\left\langle R_{\pi^{t}_{*}}(\bar{\mu}^{t}),\mu^{\pi^{t}_{*}}-\bar{\mu}^{t}\right\rangle=\sum_{t=1}^{T}\left\|(W^{\pi^{t}_{*}}-I)\bar{\mu}^{t}\right\|_{2}^{2}$
	$\displaystyle=\sum_{t=1}^{T}\sum_{h,s,a}\left(\pi^{t}_{*,h}(a|s)\sum_{a^{\prime}}\bar{\mu}^{t}_{h}(s,a^{\prime})-\bar{\mu}^{t}_{h}(s,a)\right)^{2}$
	$\displaystyle=\sum_{t=1}^{T}\sum_{h,s,a}(\bar{\mu}^{t}_{h}(s))^{2}\left(\pi^{t}_{*,h}(a|s)-\bar{\pi}^{t}_{h}(a|s)\right)^{2}$
	$\displaystyle=\sum_{t=1}^{T}\sum_{h,s}(\bar{\mu}^{t}_{h}(s))^{2}\left\|\pi^{t}_{*,h}(\cdot|s)-\bar{\pi}^{t}_{h}(\cdot|s)\right\|_{2}^{2}.$

Furthermore, by Jensen’s inequality,

	$\displaystyle\sum_{t=1}^{T}\sum_{h,s}\bar{\mu}^{t}_{h}(s)\|\bar{\pi}^{t}_{h}(\cdot|s)-\pi^{t}_{*,h}(\cdot|s)\|_{1}\leq\sqrt{A}\sum_{t=1}^{T}\sum_{h,s}\bar{\mu}^{t}_{h}(s)\|\bar{\pi}^{t}_{h}(\cdot|s)-\pi^{t}_{*,h}(\cdot|s)\|_{2}$
	$\displaystyle\leq\sqrt{HSAT\sum_{t=1}^{T}\sum_{h,s}(\bar{\mu}^{t}_{h}(s))^{2}\|\bar{\pi}^{t}_{h}(\cdot|s)-\pi^{t}_{*,h}(\cdot|s)\|_{2}^{2}}$
	$\displaystyle=\sqrt{HSAT\sum_{t=1}^{T}\left\langle R_{\pi^{t}_{*}}(\bar{\mu}^{t}),\mu^{\pi^{t}_{*}}-\bar{\mu}^{t}\right\rangle}.$

∎

Lemma G.2.

For any $0<\delta<1$ , with probability at least $1-\delta$ ,

\displaystyle\|\mathbb{P}_{\hat{M}^{k},h}(\cdot|s,a)-\mathbb{P}^{*}_{h}(\cdot|% s,a)\|_{1}\leq 2\varepsilon_{k}(h,s,a)

for all $k,h,s,a$ , where $\varepsilon_{k}(h,s,a):=\sqrt{\frac{2S\ln(THSA/\delta)}{\max\{1,N_{k}(h,s,a)\}}}$ .

Proof.

Since $\mathbb{P}_{\hat{M}^{k}}\in\mathcal{P}^{k}$ , we have $\|\mathbb{P}_{\hat{M}^{k},h}(\cdot|s,a)-\bar{\mathbb{P}}^{k}_{h}(\cdot|s,a)\|_% {1}\leq\varepsilon_{k}(h,s,a)$ for all $k,h,s,a$ . By (5) and Theorem 2.1 of Weissman et al., (2003),

\displaystyle\Pr\left[\|\bar{\mathbb{P}}^{k}_{h}(\cdot|s,a)-\mathbb{P}^{*}_{h}% (\cdot|s,a)\|_{1}>\varepsilon\right]\leq(2^{S}-2)e^{-N_{k}(h,s,a)\varepsilon^{% 2}/2}.

Plugging in $\varepsilon_{k}(h,s,a)$ for $\varepsilon$ bounds this probability with $\delta/(THSA)$ . The triangle inequality and a union bound over all $k,h,s,a$ imply the result. ∎
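The plug-in step can be illustrated numerically. The Python sketch below computes the confidence radius $\varepsilon_{k}(h,s,a)$ and the right-hand side of the concentration bound for one example configuration; the function names and the specific values of $T,H,S,A,\delta$ and the visit count are arbitrary assumptions chosen for illustration.

```python
import math

def confidence_radius(n_visits, T, H, S, A, delta):
    """epsilon_k(h,s,a) = sqrt(2 S ln(THSA/delta) / max(1, N_k(h,s,a)))."""
    return math.sqrt(2 * S * math.log(T * H * S * A / delta) / max(1, n_visits))

def weissman_tail(n_visits, S, eps):
    """Right-hand side (2^S - 2) exp(-N eps^2 / 2) of the L1 deviation bound."""
    return (2 ** S - 2) * math.exp(-n_visits * eps ** 2 / 2)

# Example configuration (assumed values, with at least one visit).
T, H, S, A, delta, n_visits = 1000, 10, 5, 3, 0.05, 50
eps = confidence_radius(n_visits, T, H, S, A, delta)
print(eps, weissman_tail(n_visits, S, eps), delta / (T * H * S * A))
```

For this configuration the tail probability falls below $\delta/(THSA)$ , and the radius shrinks at rate $1/\sqrt{N_{k}(h,s,a)}$ as the visit count grows.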

Lemma G.3.

For any $0<\delta<1$ and respective $\varepsilon_{k}(h,s,a)$ ,

\displaystyle\sum_{t=1}^{T}\sum_{h=1}^{H-1}\varepsilon_{k(t)}(h,s^{t}_{h},a^{t% }_{h})\leq 3HS\sqrt{2\ln(THSA/\delta)AT}.

Proof.

We define $n_{k}(h,s,a):=\sum_{t=T_{k-1}+1}^{T_{k}}\mathbb{I}\{s^{t}_{h}=s,a^{t}_{h}=a\}$ . Clearly, $N_{k}(h,s,a)=\sum_{k^{\prime}<k}n_{k^{\prime}}(h,s,a)$ . The condition in line 5 of the algorithm ensures that $n_{k}(h,s,a)\leq N_{k}(h,s,a)$ for all $k,h,s,a$ . Thus, we can use Lemma 19 in Jaksch et al., (2010) and Jensen’s inequality,

	$\displaystyle\sum_{t=1}^{T}\sum_{h=1}^{H-1}\frac{1}{\sqrt{\max\{1,N_{k(t)}(h,s% ^{t}_{h},a^{t}_{h})\}}}=\sum_{k=1}^{K}\sum_{t=T_{k-1}+1}^{T_{k}}\sum_{h=1}^{H-% 1}\frac{1}{\sqrt{\max\{1,N_{k}(h,s^{t}_{h},a^{t}_{h})\}}}$
	$\displaystyle=\sum_{k=1}^{K}\sum_{h=1}^{H-1}\sum_{s,a}\sum_{t=T_{k-1}+1}^{T_{k% }}\frac{\mathbb{I}\{s^{t}_{h}=s,a^{t}_{h}=a\}}{\sqrt{\max\{1,N_{k}(h,s,a)\}}}=% \sum_{k=1}^{K}\sum_{h=1}^{H-1}\sum_{s,a}\frac{n_{k}(h,s,a)}{\sqrt{\max\{1,N_{k% }(h,s,a)\}}}$
	$\displaystyle\leq 3\sum_{h=1}^{H-1}\sum_{s,a}\sqrt{N_{K}(h,s,a)+n_{K}(h,s,a)}% \leq 3\sqrt{HSA\sum_{h=1}^{H-1}\sum_{s,a}(N_{K}(h,s,a)+n_{K}(h,s,a))}$
	$\displaystyle=3\sqrt{HSA\cdot HT}=3H\sqrt{SAT}.$

Now, using the definition of $\varepsilon_{k}(h,s,a)$ ,

	$\displaystyle\sum_{t=1}^{T}\sum_{h=1}^{H-1}\varepsilon_{k(t)}(h,s^{t}_{h},a^{t% }_{h})$	$\displaystyle=\sum_{t=1}^{T}\sum_{h=1}^{H-1}\sqrt{\frac{2S\ln(THSA/\delta)}{% \max\{1,N_{k(t)}(h,s^{t}_{h},a^{t}_{h})\}}}$
		$\displaystyle\leq\sqrt{2S\ln(THSA/\delta)}\cdot 3H\sqrt{SAT}=3HS\sqrt{2\ln(% THSA/\delta)AT}.$

∎

Lemma G.4.

Let $(\bar{\pi}^{t})_{t=1}^{T}$ be the policy sequence of the population and $(\hat{M}^{k})_{k=1}^{K}$ the sequence of the corresponding model estimates. We abbreviate $\bar{\mu}^{t}=\bar{\mu}^{t}_{M^{*}},\hat{\mu}^{t}=\mu_{\hat{M}^{k(t)}}^{\bar{% \pi}^{t}}$ . With probability at least $1-2\delta$ ,

\displaystyle\sum_{t=1}^{T}\left\|\hat{\mu}^{t}-\bar{\mu}^{t}\right\|_{1}\leq 1% 2H^{2}S\sqrt{\ln(THSA/\delta)AT}.

Proof.

The proof is based on Rosenberg and Mansour, (2019). Let $(s^{t}_{h},a^{t}_{h})_{h=1}^{H}$ be the trajectory sampled in the $t$ -th game. We define $\xi_{k}(h,s,a):=\|\mathbb{P}_{\hat{M}^{k},h}(\cdot|s,a)-\mathbb{P}^{*}_{h}(% \cdot|s,a)\|_{1}$ . By Lemma E.4,

	$\displaystyle\sum_{t=1}^{T}\left\|\hat{\mu}^{t}-\bar{\mu}^{t}\right\|_{1}\leq H\sum_{t=1}^{T}\sum_{h=1}^{H-1}\sum_{s,a}\bar{\mu}_{h}^{t}(s,a)\xi_{k(t)}(h,s,a)$
	$\displaystyle=H\sum_{t=1}^{T}\sum_{h=1}^{H-1}\xi_{k(t)}(h,s^{t}_{h},a^{t}_{h})$
	$\displaystyle~{}+H\sum_{t=1}^{T}\sum_{h=1}^{H-1}\underset{=:Y_{t}(h)}{\underbrace{\left(\sum_{s,a}\bar{\mu}^{t}_{h}(s,a)\xi_{k(t)}(h,s,a)-\sum_{s,a}\mathbb{I}\{s^{t}_{h}=s,a^{t}_{h}=a\}\xi_{k(t)}(h,s,a)\right)}},$

where $(Y_{t}(h))_{t}$ is a martingale difference sequence w.r.t. the trajectories sampled and with $|Y_{t}(h)|\leq\max_{s,a}\xi_{k(t)}(h,s,a)\leq 2$ . In the following, we bound the first and second term above with high probability.

The first term can be bounded using Lemma G.2 and G.3, such that we have, with probability at least $1-\delta$ ,

\displaystyle H\sum_{t=1}^{T}\sum_{h=1}^{H-1}\xi_{k(t)}(h,s^{t}_{h},a^{t}_{h})% \leq 2H\sum_{t=1}^{T}\sum_{h=1}^{H-1}\varepsilon_{k(t)}(h,s^{t}_{h},a^{t}_{h})% \leq 2H\cdot 3H\sqrt{2S\ln(THSA/\delta)\cdot SAT}.

By the Hoeffding-Azuma inequality, we have for a fixed $h$ that with probability at least $1-\delta/H$ ,

\displaystyle\sum_{t=1}^{T}Y_{t}(h)\leq 2\sqrt{2T\ln(H/\delta)}.

Thus, by the union bound over all $h$ , the second term is at most $2H^{2}\sqrt{2T\ln(H/\delta)}$ with probability at least $1-\delta$ .

Finally, by union bound over the events used to bound the first and second term, we have with probability at least $1-2\delta$ that

	$\displaystyle\sum_{t=1}^{T}\left\|\hat{\mu}^{t}-\bar{\mu}^{t}\right\|_{1}$	$\displaystyle\leq 2H^{2}\sqrt{2T\ln(H/\delta)}+6H^{2}\sqrt{2S\ln(THSA/\delta)\cdot SAT}$
		$\displaystyle\leq 12H^{2}S\sqrt{\ln(THSA/\delta)AT}.$

∎

See 5.1

Proof.

We first establish the upper bound for steering gap and then investigate the steering cost.

Proof for Steering Gap

We denote with $k(t)$ the episode index at the $t$ -th game and denote $\pi^{*}=\operatorname*{arg\,max}_{\pi}U(\mu_{M^{*}}^{\pi})$ . Furthermore, we abbreviate $\bar{\mu}^{t}=\bar{\mu}^{t}_{M^{*}},\mu_{*}^{k}=\mu_{M^{*}}^{\pi_{*}^{k}},\hat% {\mu}^{t}=\mu_{\hat{M}^{k(t)}}^{\bar{\pi}^{t}}$ and $\hat{\mu}^{k}_{*}=\mu_{\hat{M}^{k}}^{\pi_{*}^{k}}$ . Consider a fixed $t$ and $k=k(t)$ . We can decompose the steering gap term of round $t$ as follows:

\displaystyle U(\mu^{\pi^{*}})-U(\bar{\mu}^{t})=\left(U(\mu^{\pi^{*}})-U(\hat{% \mu}^{k}_{*})\right)+\left(U(\hat{\mu}^{k}_{*})-U(\bar{\mu}^{t})\right)

The first term can be bounded by 0 using the optimism of the algorithm. We use the $L_{U}$ -Lipschitzness of $U$ and the triangle inequality to further decompose the second term.

\displaystyle U(\hat{\mu}^{k}_{*})-U(\bar{\mu}^{t})\leq L_{U}\|\hat{\mu}^{k}_{% *}-\bar{\mu}^{t}\|_{1}\leq L_{U}\|\hat{\mu}^{k}_{*}-\hat{\mu}^{t}\|_{1}+L_{U}% \|\hat{\mu}^{t}-\bar{\mu}^{t}\|_{1}.

Applying Lemma E.3, we get

	$\displaystyle\|\hat{\mu}^{k}_{*}-\hat{\mu}^{t}\|_{1}\leq H\sum_{h,s}\hat{\mu}_{h}^{t}(s)\|\pi^{k}_{*,h}(\cdot|s)-\bar{\pi}^{t}_{h}(\cdot|s)\|_{1}$
	$\displaystyle\leq H\sum_{h,s}\bar{\mu}_{h}^{t}(s)\cdot\|\pi^{k}_{*,h}(\cdot|s)-\bar{\pi}^{t}_{h}(\cdot|s)\|_{1}+H\underset{(*)}{\underbrace{\sum_{h,s}|\hat{\mu}_{h}^{t}(s)-\bar{\mu}_{h}^{t}(s)|\cdot\|\pi^{k}_{*,h}(\cdot|s)-\bar{\pi}^{t}_{h}(\cdot|s)\|_{1}}},$

where the second term can be bounded with

\displaystyle(*)\leq 2\sum_{h,s}|\hat{\mu}_{h}^{t}(s)-\bar{\mu}_{h}^{t}(s)|% \leq 2\sum_{h,s}\left|\sum_{a}\hat{\mu}_{h}^{t}(s,a)-\sum_{a}\bar{\mu}_{h}^{t}% (s,a)\right|\leq 2\|\hat{\mu}^{t}-\bar{\mu}^{t}\|_{1}.
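The last step of the bound on $(*)$ uses that summing out the actions cannot increase the $\ell_1$ distance between two state-action distributions. A minimal numerical illustration follows, with two hypothetical distributions over 2 states and 2 actions at a single step $h$; the helper names are ours.

```python
def l1_distance(p, q):
    """L1 distance between two equally long lists of numbers."""
    return sum(abs(x - y) for x, y in zip(p, q))


def state_marginal(mu):
    """Sum a state-action table mu[s][a] over actions."""
    return [sum(row) for row in mu]


def flatten(mu):
    """Turn a table mu[s][a] into a flat list."""
    return [x for row in mu for x in row]


# Two hypothetical state-action distributions (each sums to one).
mu_hat = [[0.10, 0.30], [0.25, 0.35]]
mu_bar = [[0.20, 0.30], [0.30, 0.20]]

marginal_gap = l1_distance(state_marginal(mu_hat), state_marginal(mu_bar))
joint_gap = l1_distance(flatten(mu_hat), flatten(mu_bar))
print(marginal_gap <= joint_gap)
```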

Putting it all together we now arrive at

\displaystyle U(\mu^{\pi^{*}})-U(\bar{\mu}^{t})\leq L_{U}H\sum_{h,s}\bar{\mu}^% {t}_{h}(s)\|\pi^{k}_{*,h}(\cdot|s)-\bar{\pi}^{t}_{h}(\cdot|s)\|_{1}+L_{U}(2H+1% )\|\hat{\mu}^{t}-\bar{\mu}^{t}\|_{1}.

By summing over $t$ ,

\displaystyle\sum_{t=1}^{T}U(\mu^{\pi^{*}})-U(\bar{\mu}^{t})\leq L_{U}H% \underset{\Delta_{\text{pop}}}{\underbrace{\sum_{t=1}^{T}\sum_{h,s}\bar{\mu}^{% t}_{h}(s)\|\pi^{k(t)}_{*,h}(\cdot|s)-\bar{\pi}^{t}_{h}(\cdot|s)\|_{1}}}+L_{U}(% 2H+1)\underset{\Delta_{\text{est}}}{\underbrace{\sum_{t=1}^{T}\|\hat{\mu}^{t}-% \bar{\mu}^{t}\|_{1}}}.

Using Lemma G.4, the estimation error term $\Delta_{\text{est}}$ can be bounded by $12H^{2}S\sqrt{\ln(THSA/\delta)AT}$ with probability at least $1-2\delta$.

To bound the population convergence term $\Delta_{\text{pop}}$ , we can use Lemma G.1:

\displaystyle\sum_{t=1}^{T}\sum_{h,s}\bar{\mu}^{t}_{h}(s)\|\pi^{k(t)}_{*,h}(% \cdot|s)-\bar{\pi}^{t}_{h}(\cdot|s)\|_{1}\leq\vphantom{\underbrace{\sum_{t}^{T% }}_{\texttt{AgentReg}}}\sqrt{\vphantom{\sum_{t}^{T}}\smash[b]{HSAT\!% \underbrace{\sum_{t=1}^{T}\langle R_{\pi_{*}^{k(t)}}(\bar{\mu}^{t}),\mu_{*}^{k% (t)}-\bar{\mu}^{t}\rangle}_{\texttt{AgentReg}}\,}}

Furthermore, it can be easily seen that AgentReg is

\displaystyle\sum_{t=1}^{T}\langle R_{\text{z}}^{t}(\bar{\mu}^{t}),\mu_{*}^{k(% t)}-\bar{\mu}^{t}\rangle

\displaystyle=\sum_{k=1}^{K}\underset{\leq\operatorname*{AdaReg}(T)}{% \underbrace{\sum_{t=T_{k-1}+1}^{T_{k}}\langle R_{\text{z}}^{t}(\bar{\mu}^{t}),% \mu_{*}^{k}-\bar{\mu}^{t}\rangle}}\leq K\cdot\operatorname*{AdaReg}(T).

Finally, to bound the number of episodes $K$ , note that $K$ is also the number of times the condition in line 5 of the algorithm has been true. For each $(h,s,a)$ , this condition can be true at most $\log_{2}T$ times. Thus, $K\leq HSA\log_{2}T$ .
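The doubling argument behind the bound on $K$ can be made concrete with a small simulation. The sketch below assumes that $n_{k}$ counts visits within the current episode, that $N_{k}$ counts visits before it, and that the threshold is $\max(1,N_{k})$; these conventions, the uniformly random visit pattern and the name `count_episodes` are assumptions for illustration and not the algorithm of the paper. Under this convention each triple $(h,s,a)$ can trigger at most $\log_{2}T+1$ switches, so the check uses this slightly looser count.

```python
import math
import random


def count_episodes(T, H, S, A, seed=0):
    """Simulate the doubling rule and return the number of episode switches."""
    rng = random.Random(seed)
    total = {}    # N_k(h, s, a): visits before the current episode
    current = {}  # n_k(h, s, a): visits within the current episode
    switches = 0
    for _ in range(T):
        # One game: exactly one (s, a) pair is visited at every step h.
        for h in range(H):
            key = (h, rng.randrange(S), rng.randrange(A))
            current[key] = current.get(key, 0) + 1
        # Assumed convention: compare against max(1, N_k).
        if any(n >= max(1, total.get(key, 0)) for key, n in current.items()):
            for key, n in current.items():
                total[key] = total.get(key, 0) + n
            current = {}
            switches += 1
    return switches


T, H, S, A = 5000, 3, 4, 2  # hypothetical problem sizes
K = count_episodes(T, H, S, A)
print(K, H * S * A * (math.log2(T) + 1))
```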

Proof for Steering Costs

Note that for any reward function $R$ ,

\displaystyle\langle R(\mu)+\|R(\mu)\|_{\infty}\bm{1},\mu\rangle=H\|R(\mu)\|_{% \infty}+\langle R(\mu),\mu\rangle\leq 2H\|R(\mu)\|_{\infty}.
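The identity above only uses that each $\mu_{h}$ is a probability distribution, so that $\langle\bm{1},\mu\rangle=H$. A small numerical check follows, with a hypothetical reward table and density for $H=2$ steps and three state-action pairs per step; `r` stands for the reward already evaluated at $\mu$.

```python
def inner(r, mu):
    """Inner product of two tables indexed by [h][state-action index]."""
    return sum(r[h][i] * mu[h][i] for h in range(len(mu)) for i in range(len(mu[h])))


def shifted_cost(r, mu):
    """Return (<r + ||r||_inf * 1, mu>, ||r||_inf)."""
    r_inf = max(abs(x) for row in r for x in row)
    shifted = [[x + r_inf for x in row] for row in r]
    return inner(shifted, mu), r_inf


H = 2
mu = [[0.5, 0.5, 0.0], [0.2, 0.3, 0.5]]   # each row sums to one
r = [[-1.0, 0.4, 0.0], [0.7, -0.2, 0.1]]  # hypothetical reward values
cost, r_inf = shifted_cost(r, mu)
print(cost, H * r_inf + inner(r, mu), 2 * H * r_inf)
```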

Let $\pi^{*}=\pi_{*}^{k}$ for some $k$ . Recall that $R_{\pi^{*}}(\mu)=-((W^{\pi^{*}}-I)\mu)^{\top}(W^{\pi^{*}}-I)$ . By looking at the definition of $W^{\pi^{*}}$ in (4), we see that

\displaystyle\left\|(W^{\pi^{*}}-I)^{\top}\right\|_{\infty}=\max_{h,s,a}\sum_{% a^{\prime}\neq a}|\pi_{h}(a^{\prime}|s)|+|\pi_{h}(a|s)-1|\leq 2,

where the $\|\cdot\|_{\infty}$ -matrix norm is defined as $\|M\|_{\infty}=\max_{i}\sum_{j}|M_{ij}|$ . Using this, we can bound

	$\displaystyle\left\|R_{\pi^{*}}(\mu)\right\|_{\infty}$	$\displaystyle=\left\|((W^{\pi^{*}}-I)\mu)^{\top}(W^{\pi^{*}}-I)\right\|_{\infty}=\left\|(W^{\pi^{*}}-I)^{\top}(W^{\pi^{*}}-I)\mu\right\|_{\infty}$
		$\displaystyle\leq\left\|(W^{\pi^{*}}-I)^{\top}\right\|_{\infty}\cdot\left\|(W^{\pi^{*}}-I)\mu\right\|_{\infty}\leq 2\left\|(W^{\pi^{*}}-I)\mu\right\|_{2}.$
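The bound $\|(W^{\pi^{*}}-I)^{\top}\|_{\infty}\leq 2$ can be checked on a single $(h,s)$ block. The sketch below assumes, based only on the row sums displayed above, that the row of $(W^{\pi^{*}}-I)^{\top}$ indexed by $(h,s,a)$ has entries $\pi_{h}(a^{\prime}|s)-\mathbb{I}\{a^{\prime}=a\}$; the full definition in (4) is not reproduced here, and the helper names are hypothetical.

```python
def row_sum_norm(M):
    """Matrix infinity norm: maximum absolute row sum."""
    return max(sum(abs(x) for x in row) for row in M)


def block_w_minus_i_transposed(pi_s):
    """Assumed (h, s) block with entries pi(a'|s) - 1[a' == a] in row a."""
    A = len(pi_s)
    return [[pi_s[ap] - (1.0 if ap == a else 0.0) for ap in range(A)]
            for a in range(A)]


for pi_s in ([1.0, 0.0, 0.0], [0.2, 0.5, 0.3], [0.25, 0.25, 0.25, 0.25]):
    norm = row_sum_norm(block_w_minus_i_transposed(pi_s))
    print(pi_s, norm, norm <= 2)
```

In this block structure the row sum for action $a$ equals $2(1-\pi_{h}(a|s))$, so the value $2$ is attained when some action has probability zero.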

Finally, using Jensen’s inequality and the fact that the agent regret is bounded by $K\operatorname*{AdaReg}(T)$ , our steering cost can be bounded by

	$\displaystyle\sum_{t=1}^{T}\left\langle R_{\text{z}}^{t}(\bar{\mu}^{t}),\bar{\mu}^{t}\right\rangle=\sum_{t=1}^{T}\left\langle R_{\pi_{*}^{k(t)}}(\bar{\mu}^{t})+\|R_{\pi_{*}^{k(t)}}(\bar{\mu}^{t})\|_{\infty}\bm{1},\bar{\mu}^{t}\right\rangle$
	$\displaystyle\leq 4H\sum_{t=1}^{T}\left\|(W^{\pi^{k(t)}_{*}}-I)\bar{\mu}^{t}\right\|_{2}\leq 4H\sqrt{T\sum_{t=1}^{T}\left\|(W^{\pi^{k(t)}_{*}}-I)\bar{\mu}^{t}\right\|_{2}^{2}}$
	$\displaystyle\leq 4H\sqrt{T\sum_{t=1}^{T}\langle R_{\text{z}}^{t}(\bar{\mu}^{t}),\mu_{*}^{k(t)}-\bar{\mu}^{t}\rangle}\leq 4H\sqrt{TK\operatorname*{AdaReg}(T)}.$

∎

Appendix H ELUDER DIMENSION

H.1 Example Function Classes

Here, we list some bounds of the eluder dimension for different function classes that are commonly considered. We see that in all these cases, the eluder dimension can be bounded logarithmically in $T$ , if $\varepsilon=T^{-1}$ .

Proposition H.1 (Linear functions, Russo and Van Roy, (2013)).

Let $\mathcal{F}=\{f|f(x)=\theta^{\top}\phi(x),\theta\in\mathbb{R}^{d},\|\theta\|_{% 2}\leq C_{\theta},\|\phi(x)\|_{2}\leq C_{\phi}\}$ .

\displaystyle\dim_{E}(\mathcal{F},\varepsilon)\leq 3d\frac{e}{e-1}\ln\left(3+3% \left(\frac{2C_{\theta}}{\varepsilon}\right)^{2}\right)+1.
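To illustrate the logarithmic growth claimed above, the following sketch evaluates the bound of Proposition H.1 exactly as stated, with $\varepsilon=T^{-1}$. The function name `linear_eluder_bound` and the values of $d$ and $C_{\theta}$ are hypothetical.

```python
import math


def linear_eluder_bound(d, c_theta, eps):
    """Evaluate the upper bound of Proposition H.1 as stated in the text."""
    e = math.e
    return 3 * d * e / (e - 1) * math.log(3 + 3 * (2 * c_theta / eps) ** 2) + 1


# Hypothetical dimension and norm bound; eps = 1 / T.
for T in (10**3, 10**6):
    print(T, linear_eluder_bound(d=5, c_theta=1.0, eps=1.0 / T))
```

Increasing $T$ by a factor of $1000$ less than doubles the bound in this example, which is the behavior expected from a bound that is logarithmic in $T$.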

Proposition H.2 (Quadratic functions, Osband and Roy, (2014)).

Let $\mathcal{F}=\{f|f(x)=\phi(x)^{\top}\theta\phi(x),\theta\in\mathbb{R}^{p\times p% },\phi\in\mathbb{R}^{p},\|\theta\|_{2}\leq C_{\theta},\|\phi\|_{2}\leq C_{\phi}\}$ .

\displaystyle\dim_{E}(\mathcal{F},\varepsilon)\leq p(4p-1)\frac{e}{e-1}\log% \left(\left(1+\left(\frac{2pC_{\phi}^{2}C_{\theta}}{\varepsilon}\right)^{2}% \right)(4p-1)\right)+1.

Proposition H.3 (Generalized linear functions, Russo and Van Roy, (2013)).

Let $g$ be strictly increasing, differentiable and have derivatives bounded in $[\underline{h},\overline{h}]$ with $\overline{h}>\underline{h}>0$ . Let $r=\overline{h}/\underline{h}$ and $\mathcal{F}=\{f|f(x)=g(\theta^{\top}\phi(x)),\theta\in\mathbb{R}^{d},\|\theta% \|_{2}\leq C_{\theta},\|\phi\|_{2}\leq C_{\phi}\}$ .

\displaystyle\dim_{E}(\mathcal{F},\varepsilon)\leq 3dr^{2}\frac{e}{e-1}\log% \left(3r^{2}+3r^{2}\left(\frac{2C_{\theta}\overline{h}}{\varepsilon}\right)^{2% }\right)+1.

Remark H.3 (Bounding $\beta_{T}$ ).

If we assume that the functions in $\mathcal{R}$ are parametrized by parameters in some set $\Theta\subset\mathbb{R}^{d}$ with constant diameter and the functions are $L$ -Lipschitz in that parameter, we have $N(\mathcal{R},\alpha,\|\cdot\|_{\infty})\leq N(\Theta,\alpha/L,\|\cdot\|_{% \infty})\leq\left(1+\mathcal{O}(L/\alpha)\right)^{d}$ . Then, we might choose $\alpha=T^{-1}$ such that

\displaystyle\beta_{T}=8\sigma^{2}\log(N(\mathcal{R},\alpha,\|\cdot\|_{\infty}% )/\delta)+2\alpha T(8r_{\max}+\sqrt{8\sigma^{2}\ln(4T^{2}/\delta)})

can also be bounded logarithmically in $T$ .
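The remark can be illustrated by evaluating $\beta_{T}$ with $\alpha=T^{-1}$ and the covering number replaced by its upper bound $(1+cL/\alpha)^{d}$. The constant `c` stands in for the factor hidden by the $\mathcal{O}(\cdot)$ notation and is an assumption, as are the function name `beta_T` and all numerical values.

```python
import math


def beta_T(T, d, L, sigma, r_max, delta, c=1.0):
    """beta_T with alpha = 1/T and log N replaced by d * log(1 + c * L / alpha)."""
    alpha = 1.0 / T
    log_cover = d * math.log(1.0 + c * L / alpha)
    first = 8 * sigma**2 * (log_cover + math.log(1.0 / delta))
    second = 2 * alpha * T * (
        8 * r_max + math.sqrt(8 * sigma**2 * math.log(4 * T**2 / delta))
    )
    return first + second


# Hypothetical problem constants.
for T in (10**3, 10**6):
    print(T, beta_T(T, d=3, L=1.0, sigma=1.0, r_max=1.0, delta=0.1))
```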

H.2 Bounding The Width Of The Confidence Set

Notations and Definitions

Here, we introduce some notation used in this section. We define the width function $w_{\mathcal{F}}(x)=\sup_{\underline{f},\overline{f}\in\mathcal{F}}|\underline{f}(x)-\overline{f}(x)|$. Throughout this section, we use the notation $x_{H_{t}+h}$ with $H_{t}=(t-1)H$ to describe elements of a sequence $x_{1},...,x_{HT}$. The idea behind it is that we can later define $x_{H_{t}+h}=(h,s_{h}^{t},a_{h}^{t},\bar{\mu}^{t}_{M^{*},h})$ and apply the results in this section to our setting. Furthermore, for any function $g$ we write $\|g\|_{2,E_{t}}^{2}=\sum_{i=1}^{t-1}\sum_{h=1}^{H}g^{2}(x_{H_{i}+h})$.

Lemma H.4 (Proposition 3 of Russo and Van Roy, (2013)).

If $(\beta_{t})_{t\in\mathbb{N}}$ is a positive non-decreasing sequence, $(\hat{f}_{t})_{t}$ some function sequence and $\mathcal{F}_{t}:=\{f\in\mathcal{F}:\|f-\hat{f}_{t}\|_{2,E_{t}}\leq\sqrt{\beta_{t}}\}$, then with probability 1,

\displaystyle\sum_{t=1}^{T}\sum_{h=1}^{H}\mathbb{I}\{w_{\mathcal{F}_{t}}(x_{H_% {t}+h})>\varepsilon\}\leq\left(\frac{4\beta_{T}}{\varepsilon^{2}}+H\right)\dim% _{E}(\mathcal{F},\varepsilon)

for all $T\in\mathbb{N}$ and $\varepsilon>0$ .

Proof.

First we show that for any $\tau=H_{t}+h<TH$ , if $w_{\mathcal{F}_{t}}(x_{\tau})>\varepsilon$ then $x_{\tau}$ is $(\mathcal{F},\varepsilon)$ -dependent on fewer than $4\beta_{T}/\varepsilon^{2}$ disjoint subsequences of $(x_{1},...,x_{H_{t}})$ . Suppose $w_{\mathcal{F}_{t}}(x_{\tau})>\varepsilon$ . Then, there are $f,\tilde{f}\in\mathcal{F}_{t}$ such that $|f(x_{\tau})-\tilde{f}(x_{\tau})|>\varepsilon$ . Furthermore, let $(x_{i_{1}},...,x_{i_{k}})$ be a subsequence of $(x_{1},...,x_{H_{t}})$ on which $x_{\tau}$ is $(\mathcal{F},\varepsilon)$ -dependent. This implies, by definition, that $\sum_{j=1}^{k}(f(x_{i_{j}})-\tilde{f}(x_{i_{j}}))^{2}>\varepsilon^{2}$ . If $x_{\tau}$ is $(\mathcal{F},\varepsilon)$ -dependent on $K$ disjoint subsequences of $(x_{1},...,x_{H_{t}})$ then we must have

\displaystyle\|f-\tilde{f}\|_{2,E_{t}}^{2}=\sum_{i=1}^{t-1}\sum_{h=1}^{H}(f(x_% {H_{i}+h})-\tilde{f}(x_{H_{i}+h}))^{2}\geq\sum_{l=1}^{K}\sum_{j=1}^{k_{l}}(f(x% _{i^{l}_{j}})-\tilde{f}(x_{i^{l}_{j}}))^{2}>K\varepsilon^{2}.

By the triangle inequality, $\|f-\tilde{f}\|_{2,E_{t}}\leq\|f-\hat{f}_{t}\|_{2,E_{t}}+\|\tilde{f}-\hat{f}_{% t}\|_{2,E_{t}}\leq 2\sqrt{\beta_{t}}\leq 2\sqrt{\beta_{T}}$ . Combining these two inequalities, we get $K<4\beta_{T}/\varepsilon^{2}$ .

Next, we show that in any sequence $(y_{1},...,y_{l})$ there is an element $y_{j}$ which is $(\mathcal{F},\varepsilon)$-dependent on at least $l/d-1$ disjoint subsequences of $(y_{1},...,y_{j-1})$, where $d=\dim_{E}(\mathcal{F},\varepsilon)$. Let $K$ be an integer with $Kd+1\leq l\leq Kd+d$. We will construct $K$ disjoint subsequences $B_{1},...,B_{K}$. First, $B_{i}=(y_{i})$ for all $i\in[K]$. If $y_{K+1}$ is already $(\mathcal{F},\varepsilon)$-dependent on $B_{1},...,B_{K}$, we are done. Otherwise, select a $B_{i}$ of which $y_{K+1}$ is $(\mathcal{F},\varepsilon)$-independent and append $y_{K+1}$ to $B_{i}$. We repeat this for $y_{K+2},y_{K+3},...$ until we find $y_{j}$ that is $(\mathcal{F},\varepsilon)$-dependent on each subsequence or until we have reached $y_{l}$. In the latter case, each element of a subsequence $B_{i}$ is independent of its predecessors and hence $|B_{i}|=d$. Then, $y_{l}$ must be $(\mathcal{F},\varepsilon)$-dependent on each subsequence, by definition of the eluder dimension. In both cases we find an element in $(y_{1},...,y_{l})$ that is $(\mathcal{F},\varepsilon)$-dependent on $K\geq l/d-1$ disjoint subsequences.

Finally, let $(y_{1},...,y_{l})=(x_{i_{1}},...,x_{i_{l}})$ be a subsequence of $(x_{1},...,x_{TH})$ consisting of all elements $x_{H_{t}+h}$ for which $w_{\mathcal{F}_{t}}(x_{H_{t}+h})>\varepsilon$ . From before, we know there is some $y_{j}$ that is $(\mathcal{F},\varepsilon)$ -dependent on at least $l/d-1$ disjoint subsequences of $(y_{1},...,y_{j-1})$ . Let $t,h$ be such that $y_{j}=x_{H_{t}+h}$ . Note that in $(y_{1},...,y_{j-1})$ there are at most $H-1$ elements $y_{i}=x_{H_{t}+h^{\prime}}$ for some $h^{\prime}<h$ . From this follows that $y_{j}=x_{H_{t}+h}$ is $(\mathcal{F},\varepsilon)$ -dependent on at least $l/d-1-(H-1)=l/d-H$ disjoint subsequences of $(y_{1},...,y_{j-H})\subseteq(x_{1},...,x_{H_{t}})$ . Now, as we have also shown, $x_{H_{t}+h}$ is $(\mathcal{F},\varepsilon)$ -dependent on fewer than $4\beta_{T}/\varepsilon^{2}$ disjoint subsequences of $(x_{1},...,x_{H_{t}})$ . Combining these two bounds, we get $l/d-H\leq 4\beta_{T}/\varepsilon^{2}$ , and therefore $l\leq(4\beta_{T}/\varepsilon^{2}+H)d$ . ∎

Lemma H.5 (Variant of Lemma 2 in Russo and Van Roy, (2013)).

Let $(\beta_{t})_{t\in\mathbb{N}}$ be a positive non-decreasing sequence, $(\hat{f}_{t})_{t}$ some function sequence and $\mathcal{F}_{t}:=\{f\in\mathcal{F}:\|f-\hat{f}_{t}\|_{2,E_{t}}\leq\sqrt{\beta_% {t}}\}$ . Let $w_{\mathcal{F}}(x)\leq C$ for all $x$ . Then, for all $T\in\mathbb{N}$ and $\varepsilon>0$ ,

	$\displaystyle\sum_{t=1}^{T}\sum_{h=1}^{H}w_{\mathcal{F}_{t}}(x_{H_{t}+h})\leq% \varepsilon HT+CH\dim_{E}(\mathcal{F},\varepsilon)$
	$\displaystyle+4\sqrt{\beta_{T}H\dim_{E}(\mathcal{F},\varepsilon)T}.$

Proof.

We abbreviate $w_{H_{t}+h}=w_{\mathcal{F}_{t}}(x_{H_{t}+h})$ and $d=\dim_{E}(\mathcal{F},\varepsilon)$. Let $w_{i_{1}}\geq...\geq w_{i_{HT}}$. Using this ordering of the sequence, $w_{i_{k}}>\varepsilon$ implies that $\sum_{j=1}^{HT}\mathbb{I}\{w_{j}>\varepsilon\}\geq k$. By Lemma H.4, this would mean $k\leq(4\beta_{T}/\varepsilon^{2}+H)d$ or, equivalently, $\varepsilon<\sqrt{4\beta_{T}d/(k-Hd)}$. Now, since $w_{i_{k}}>\varepsilon$ implies $\varepsilon<\sqrt{4\beta_{T}d/(k-Hd)}$, this means that $w_{i_{k}}<\sqrt{4\beta_{T}d/(k-Hd)}$.

In the following, we bound the first and largest widths $w_{i_{1}},...,w_{i_{Hd}}$ by $C$ and the remaining widths (larger than $\varepsilon$ ) by the previously established bound.

	$\displaystyle\sum_{t=1}^{T}\sum_{h=1}^{H}w_{H_{t}+h}$	$\displaystyle=\sum_{k=1}^{HT}\mathbb{I}\{w_{k}\leq\varepsilon\}w_{k}+\sum_{k=1% }^{HT}\mathbb{I}\{w_{k}>\varepsilon\}w_{k}\leq\varepsilon HT+\sum_{k=1}^{HT}% \mathbb{I}\{w_{k}>\varepsilon\}w_{k}$
		$\displaystyle\leq\varepsilon HT+HdC+\sum_{k=Hd+1}^{HT}\mathbb{I}\{w_{i_{k}}>\varepsilon\}w_{i_{k}}$
		$\displaystyle\leq\varepsilon HT+HdC+\sum_{k=Hd+1}^{HT}\sqrt{4\beta_{T}d/(k-Hd)}$
		$\displaystyle\leq\varepsilon HT+HdC+\sqrt{4d\beta_{T}}\int_{0}^{HT}\frac{1}{% \sqrt{x}}dx$
		$\displaystyle=\varepsilon HT+HdC+4\sqrt{d\beta_{T}HT}$

∎
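The step from the sum to the integral in the proof of Lemma H.5 relies on $\sum_{j=1}^{n}j^{-1/2}\leq 2\sqrt{n}$. The following sketch compares the tail sum with the closed form $4\sqrt{d\beta_{T}HT}$ for hypothetical values of $H$, $T$, $d$ and $\beta_{T}$; the function names are ours.

```python
import math


def tail_sum(H, T, d, beta):
    """Sum of sqrt(4 * beta * d / (k - H * d)) for k = H*d + 1, ..., H*T."""
    return sum(math.sqrt(4 * beta * d / (k - H * d))
               for k in range(H * d + 1, H * T + 1))


def closed_form(H, T, d, beta):
    """The bound 4 * sqrt(d * beta * H * T) obtained from the integral."""
    return 4 * math.sqrt(d * beta * H * T)


H, T, d, beta = 3, 200, 5, 2.0  # hypothetical values
print(tail_sum(H, T, d, beta), closed_form(H, T, d, beta))
```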

Appendix I PROOF OF THEOREM 6.1

I.1 Algorithm Details

We present our full algorithm for the unknown reward setting in Alg. 4.

Algorithm 4 Steering reward design for Scenario 2

1: Initialize $\mathcal{P}^{1}:=$ set of all possible transition functions, $\pi_{*}^{1}$ (arbitrarily), $k=1,T_{0}=0$
2: for $t=1,...,T$
3:   Update $\hat{\mathcal{R}}^{t}$ as in (10).
4:   Choose $R_{\text{nz}}^{t}$ as in (6).
5:   Agents play $t$-th game with $r^{*}+R_{\text{nz}}^{t}$
6:   Obtain trajectory $((s^{t}_{h},a^{t}_{h},r^{t}_{h}))_{h=1}^{H}$
7:   if $\exists(h,s,a)$ s.t. $n_{k}(h,s,a)\geq N_{k}(h,s,a)$ then
8:     Update $\mathcal{P}^{k+1}$ as in (5).
9:     $T_{k}\leftarrow t$; $k\leftarrow k+1$
10:    $\pi_{*}^{k},\hat{M}^{k}\leftarrow\operatorname*{arg\,max}_{\pi\in\Pi,\hat{M}:\mathbb{P}_{\hat{M}}\in\mathcal{P}^{k}}U(\mu^{\pi}_{\hat{M}}).$
11:   end if
12: end for

I.2 Missing Proofs

Lemma I.1 (Proposition 2 in Russo and Van Roy, (2013)).

Let $N(\mathcal{R},\alpha,\|\cdot\|_{\infty})$ be the $\alpha$ -covering number of $\mathcal{R}$ w.r.t. the $\|\cdot\|_{\infty}$ -norm. Let $\delta>0,\alpha>0$ , and for each $t$ , $\beta_{t}=8\sigma^{2}\log(N(\mathcal{R},\alpha,\|\cdot\|_{\infty})/\delta)+2% \alpha t(8r_{\max}+\sqrt{8\sigma^{2}\ln(4t^{2}/\delta)})$ . With probability at least $1-2\delta$ , $r^{*}\in\bigcap_{t=1}^{\infty}\hat{\mathcal{R}}^{t}$ .

Lemma I.2.

We abbreviate $\bar{\mu}^{t}=\bar{\mu}^{t}_{M^{*}}$ . With probability at least $1-\delta$ ,

\displaystyle\sum_{t=1}^{T}\langle w_{\hat{\mathcal{R}}^{t}}(\bar{\mu}^{t}),% \bar{\mu}^{t}\rangle\leq 3\sum_{t=1}^{T}\sum_{h=1}^{H}w_{\hat{\mathcal{R}}^{t}% }(h,s_{h}^{t},a_{h}^{t},\bar{\mu}^{t})+r_{\max}H\ln(1/\delta).

Proof.

Note that $\langle w_{\hat{\mathcal{R}}^{t}}(\bar{\mu}^{t}),\bar{\mu}^{t}\rangle=\mathbb{% E}_{(s_{h},a_{h})_{h=1}^{H}\sim\bar{\mu}^{t}}[\sum_{h=1}^{H}w_{\hat{\mathcal{R% }}^{t}}(h,s_{h},a_{h},\bar{\mu}^{t})]=:Y_{t}$ . Recall that $(s_{h}^{t},a_{h}^{t})_{h}\sim\bar{\mu}^{t}$ are the trajectories we gather from the population at step $t$ . Therefore, we can define $X_{t}:=\sum_{h=1}^{H}w_{\hat{\mathcal{R}}^{t}}(h,s_{h}^{t},a_{h}^{t},\bar{\mu}% ^{t})$ with $\mathbb{E}[X_{t}|\bar{\mu}^{t}]=Y_{t}$ . By the assumption that $r^{*}$ is bounded in $[0,r_{\max}]$ , we have that $w_{\hat{\mathcal{R}}}(h,s,a,\mu)\leq r_{\max}$ for any $\mu,h,s,a$ and $\hat{\mathcal{R}}\subseteq\mathcal{R}$ . Therefore, $0\leq X_{t}\leq r_{\max}H$ . A direct application of Lemma D.4 from Huang et al., (2023) shows that with probability at least $1-\delta$ ,

\displaystyle\sum_{t=1}^{T}Y_{t}\leq 3\sum_{t=1}^{T}X_{t}+r_{\max}H\ln\frac{1}% {\delta}.

∎

Lemma I.3.

We abbreviate $\mu_{*}^{k}=\mu_{M^{*}}^{\pi_{*}^{k}},\bar{\mu}^{t}=\bar{\mu}^{t}_{M^{*}}$ . If the true $r^{*}$ is contained in all $\hat{\mathcal{R}}^{t}$ , then, with probability at least $1-\delta$ ,

\displaystyle\sum_{t=1}^{T}\langle R_{\pi_{*}^{k(t)}}(\bar{\mu}^{t}),\mu_{*}^{% k(t)}-\bar{\mu}^{t}\rangle\leq\sum_{t=1}^{T}\langle r^{*}(\bar{\mu}^{t})+R_{% \text{nz}}^{t}(\bar{\mu}^{t}),\mu_{*}^{k(t)}-\bar{\mu}^{t}\rangle+6\sum_{t=1}^% {T}\sum_{h=1}^{H}w_{\hat{\mathcal{R}}^{t}}(h,s_{h}^{t},a_{h}^{t},\bar{\mu}^{t}% )+2r_{\max}H\ln\frac{1}{\delta}.

Proof.

Let $t\in[T]$ and $k=k(t)$ . By Eq. (6),

\displaystyle\langle R_{\pi_{*}^{k}}(\bar{\mu}^{t}),\mu_{*}^{k}-\bar{\mu}^{t}% \rangle=\langle r^{*}(\bar{\mu}^{t})+R_{\text{nz}}^{t}(\bar{\mu}^{t}),\mu_{*}^% {k}-\bar{\mu}^{t}\rangle+\langle\bar{r}^{t}(\bar{\mu}^{t})-r^{*}(\bar{\mu}^{t}% )-w_{\hat{\mathcal{R}}^{t}}(\bar{\mu}^{t}),\mu_{*}^{k}-\bar{\mu}^{t}\rangle.

With that, we have already separated out the first term (agent regret). Using the assumption that $r^{*}\in\hat{\mathcal{R}}^{t}$ for all $t$ , we can bound the second term as follows.

	$\displaystyle\langle\bar{r}^{t}(\bar{\mu}^{t})-r^{*}(\bar{\mu}^{t})-w_{\hat{\mathcal{R}}^{t}}(\bar{\mu}^{t}),\mu_{*}^{k}-\bar{\mu}^{t}\rangle$
	$\displaystyle=\langle r^{*}(\bar{\mu}^{t})-\bar{r}^{t}(\bar{\mu}^{t})+w_{\hat{\mathcal{R}}^{t}}(\bar{\mu}^{t}),\bar{\mu}^{t}\rangle+\langle\underset{\leq w_{\hat{\mathcal{R}}^{t}}(\bar{\mu}^{t})}{\underbrace{\bar{r}^{t}(\bar{\mu}^{t})-r^{*}(\bar{\mu}^{t})}}-w_{\hat{\mathcal{R}}^{t}}(\bar{\mu}^{t}),\mu^{k}_{*}\rangle$
	$\displaystyle\leq\langle\underset{\leq w_{\hat{\mathcal{R}}^{t}}(\bar{\mu}^{t})}{\underbrace{r^{*}(\bar{\mu}^{t})-\bar{r}^{t}(\bar{\mu}^{t})}}+w_{\hat{\mathcal{R}}^{t}}(\bar{\mu}^{t}),\bar{\mu}^{t}\rangle\leq 2\langle w_{\hat{\mathcal{R}}^{t}}(\bar{\mu}^{t}),\bar{\mu}^{t}\rangle.$

Finally, we can bound $\sum_{t=1}^{T}\langle w_{\hat{\mathcal{R}}^{t}}(\bar{\mu}^{t}),\bar{\mu}^{t}\rangle$ using Lemma I.2, which implies the result. ∎

See Theorem 6.1.

Proof.

We abbreviate $\mu_{*}^{k}=\mu_{M^{*}}^{\pi_{*}^{k}},\bar{\mu}^{t}=\bar{\mu}^{t}_{M^{*}}$ . We can use the exact same arguments as in the proof of Theorem 5.1, up until the point where we have to bound

\displaystyle\texttt{AgentReg}=\sum_{t=1}^{T}\langle R_{\pi_{*}^{k(t)}}(\bar{% \mu}^{t}),\mu_{*}^{k(t)}-\bar{\mu}^{t}\rangle.

Combining Lemma I.1 and Lemma I.3 we have with probability at least $1-3\delta$ that

\displaystyle\sum_{t=1}^{T}\langle R_{\pi_{*}^{k(t)}}(\bar{\mu}^{t}),\mu_{*}^{% k(t)}-\bar{\mu}^{t}\rangle\leq\underset{\texttt{NewAgentReg}}{\underbrace{\sum% _{t=1}^{T}\langle r^{*}(\bar{\mu}^{t})+R_{\text{nz}}^{t}(\bar{\mu}^{t}),\mu_{*% }^{k(t)}-\bar{\mu}^{t}\rangle}}+6\sum_{t=1}^{T}\sum_{h=1}^{H}w_{\hat{\mathcal{% R}}^{t}}(h,s_{h}^{t},a_{h}^{t},\bar{\mu}^{t})+2r_{\max}H\ln\frac{1}{\delta}.

Now, we define $x_{H_{t}+h}=(h,s_{h}^{t},a_{h}^{t},\bar{\mu}^{t}_{h})$, where $H_{t}=H(t-1)$, and with slight abuse of notation, we write $\hat{r}(x_{H_{t}+h})=\hat{r}_{h}(s_{h}^{t},a_{h}^{t},\bar{\mu}^{t}_{h})$. With this notation, we can apply Lemma H.5 with $\varepsilon=T^{-1}$ to show that

\displaystyle\sum_{t=1}^{T}\sum_{h=1}^{H}w_{\hat{\mathcal{R}}^{t}}(h,s_{h}^{t}% ,a_{h}^{t},\bar{\mu}^{t})\leq H+r_{\max}H\dim_{E}(\mathcal{R},T^{-1})+4\sqrt{% \beta_{T}H\dim_{E}(\mathcal{R},T^{-1})T}.

Combining this with the previous results, we get that with probability at least $1-3\delta$,

	$\displaystyle\sum_{t=1}^{T}\langle R_{\pi_{*}^{k(t)}}(\bar{\mu}^{t}),\mu_{*}^{k(t)}-\bar{\mu}^{t}\rangle$	$\displaystyle\leq\texttt{NewAgentReg}$
		$\displaystyle\quad+\underset{=:D}{\underbrace{6\left(H+r_{\max}H\dim_{E}(\mathcal{R},T^{-1})+4\sqrt{\beta_{T}H\dim_{E}(\mathcal{R},T^{-1})T}\right)+2r_{\max}H\ln\frac{1}{\delta}}}.$

The new agent regret term NewAgentReg can be bounded in the same way as in the proof of Theorem 5.1:

\displaystyle\sum_{t=1}^{T}\langle r^{*}(\bar{\mu}^{t})+R_{\text{nz}}^{t}(\bar% {\mu}^{t}),\mu_{*}^{k(t)}-\bar{\mu}^{t}\rangle\leq\sum_{k=1}^{K}\sum_{t=T_{k-1% }+1}^{T_{k}}\langle r^{*}(\bar{\mu}^{t})+R_{\text{nz}}^{t}(\bar{\mu}^{t}),\mu_% {*}^{k}-\bar{\mu}^{t}\rangle\leq K\operatorname*{AdaReg}(T).

For the steering cost, we have for any $\mu\in\Psi_{M}$ ,

	$\displaystyle C(\mu,R_{\text{nz}}^{t})-C(\mu,r_{\max}\bm{1}-r^{*})$	$\displaystyle=\langle r^{*}(\mu)-\bar{r}^{t}(\mu)+w_{\hat{\mathcal{R}}^{t}}(\mu)+R_{\pi_{*}^{k(t)}}(\mu)+\|R_{\pi_{*}^{k(t)}}(\mu)\|_{\infty}\bm{1},\mu\rangle$
		$\displaystyle\leq 2\langle w_{\hat{\mathcal{R}}^{t}}(\mu),\mu\rangle+\langle R_{\pi_{*}^{k(t)}}(\mu)+\|R_{\pi_{*}^{k(t)}}(\mu)\|_{\infty}\bm{1},\mu\rangle.$

Then, summing over $t=1,...,T$ ,

\displaystyle C_{T}(\{\bar{\mu}^{t},R_{\text{nz}}^{t}-(r_{\max}\bm{1}-r^{*})\}% _{t=1}^{T})\leq 2\sum_{t=1}^{T}\langle w_{\hat{\mathcal{R}}^{t}}(\bar{\mu}^{t}% ),\bar{\mu}^{t}\rangle+\sum_{t=1}^{T}\langle R_{\pi_{*}^{k(t)}}(\bar{\mu}^{t})% +\|R_{\pi_{*}^{k(t)}}(\bar{\mu}^{t})\|_{\infty}\bm{1},\bar{\mu}^{t}\rangle.

Using Lemma I.2, we can bound the first term by $2(3\sum_{t=1}^{T}\sum_{h=1}^{H}w_{\hat{\mathcal{R}}^{t}}(x_{H_{t}+h})+r_{\max}H\ln(1/\delta))$ with probability at least $1-\delta$. Using Lemma H.5 with $\varepsilon=T^{-1}$, we can further bound this by $D=2r_{\max}H\ln(1/\delta)+6(H+r_{\max}H\dim_{E}(\mathcal{R},T^{-1})+4\sqrt{\beta_{T}H\dim_{E}(\mathcal{R},T^{-1})T})$. From the steering cost bound in Thm. 5.1, it follows that the second term is bounded by

\displaystyle 4H\sqrt{T\sum_{t=1}^{T}\langle R_{\pi_{*}^{k(t)}}(\bar{\mu}^{t})% ,\mu_{*}^{k(t)}-\bar{\mu}^{t}\rangle},

which is at most $4H\sqrt{T(K\operatorname*{AdaReg}(T)+D)}$ , as we have already shown in this proof.

Lastly, we have to discuss the asymptotic bound for $D$. The term $D$ depends on the eluder dimension of $\mathcal{R}$ and on $\beta_{T}$. In Appendix H.1, we show several common function classes with $\dim_{E}(\mathcal{R},T^{-1})\in\tilde{\mathcal{O}}(1)$. Furthermore, if we assume that the functions in $\mathcal{R}$ are parametrized by parameters in some set $\Theta\subset\mathbb{R}^{d}$ with constant diameter and are $L$-Lipschitz in that parameter, we have $N(\mathcal{R},\alpha,\|\cdot\|_{\infty})\leq N(\Theta,\alpha/L,\|\cdot\|_{\infty})\leq\left(1+\mathcal{O}(L/\alpha)\right)^{d}$. Then, we might choose $\alpha=T^{-1}$ such that $\beta_{T}$ can also be bounded logarithmically in $T$. In the cases where the eluder dimension and $\beta_{T}$ are in $\tilde{\mathcal{O}}(1)$, we have $D\in\tilde{\mathcal{O}}(\sqrt{T})$ (ignoring other factors). ∎

Appendix J EXTENSION TO UNKNOWN UTILITY FUNCTION

In this section, we generalize our previous results to the setting where the mediator does not have prior knowledge of the utility function $U$. We consider the non-zero intrinsic reward setting, i.e., Scenario 2 described in Section 3.4. Note that the results for Scenario 1 can be directly derived by setting $r^{*}=0$.

Motivation for Unknown Utility Setting

This setting makes sense, especially when $U$ partially depends on the agents’ intrinsic rewards $r^{*}$. As a motivating example, in financial markets, the government (mediator) gains benefits (utility $U$) not only from the impact on society of the desired behaviors of the companies (the agents), but also from the tax paid by them, which is directly related to the rewards $r^{*}$ received by the agents. (Footnote 4: Another way to interpret this scenario is that the mediator’s utility $U=\alpha U_{\text{mediator}}+(1-\alpha)U_{\text{agents};r^{*}}$ can be decomposed into a known function $U_{\text{mediator}}$ representing its intrinsic utility, and another unknown part $U_{\text{agents};r^{*}}$, which reflects the agents’ interests and depends on $r^{*}$. Here $\alpha$ serves as a parameter to trade off the interests of the two parties.) Due to the lack of knowledge of $r^{*}$, $U$ is only partially revealed to the mediator. This restricts the applicability of our methods to this setting. However, if $U$ is unknown, we might infer it, for example, by estimating the true reward functions $r^{*}$ through the online interaction with the agents. We can also generalize this setting as follows.

We consider a general setting, where the mediator does not have prior knowledge on $U$ , but it can observe samples from $U$ , perturbed by $\sigma_{U}$ -sub-Gaussian noise, and get access to a function class $\mathcal{U}$ which contains $U$ and whose functions are bounded in $[0,U_{\max}]$ .

J.1 Algorithm

We can use the standard technique described in Russo and Van Roy, (2013) to handle this case. We define

	$\displaystyle\bar{U}^{k}$	$\displaystyle=\operatorname*{arg\,min}_{\hat{U}\in\mathcal{U}}\sum_{t=1}^{T_{k}}(\hat{U}(\bar{\mu}^{t})-U(\bar{\mu}^{t}))^{2},$		(12)
	$\displaystyle\hat{\mathcal{U}}^{k}$	$\displaystyle=\left\{\hat{U}\in\mathcal{U}:\|\hat{U}-\bar{U}^{k}\|_{2,E_{T_{k}}}^{2}\leq\beta_{k}^{U}\right\},$		(13)

where $\beta_{k}^{U}:=8\sigma_{U}^{2}\log(N(\mathcal{U},\alpha,\|\cdot\|_{\infty})/% \delta)+2\alpha k(8U_{\max}+\sqrt{8\sigma_{U}^{2}\ln(4k^{2}/\delta)})$ and, e.g., $\alpha=T^{-1}$ .

Algorithm 5 Steering reward design for Scenario 2 and unknown utility

1: Initialize $\mathcal{P}^{1}:=$ set of all possible transition functions, $\pi_{*}^{1}$ (arbitrarily), $k=1,T_{0}=0$
2: for $t=1,...,T$
3:   Update $\hat{\mathcal{R}}^{t}$ as in (10).
4:   Choose $R_{\text{nz}}^{t}$ as in (6).
5:   Agents play $t$-th game with $r^{*}+R_{\text{nz}}^{t}$
6:   Obtain trajectory $((s^{t}_{h},a^{t}_{h},r^{t}_{h}))_{h=1}^{H}$
7:   if $\exists(h,s,a)$ s.t. $n_{k}(h,s,a)\geq N_{k}(h,s,a)$ or $t-T_{k-1}\geq T_{\mathit{epoch}}$ then
8:     Update $\mathcal{P}^{k+1}$ as in (5).
9:     $T_{k}\leftarrow t$; $k\leftarrow k+1$
10:    Compute $\hat{\mathcal{U}}^{k}$ as in (13).
11:    $\hat{U}^{k},\pi_{*}^{k},\hat{M}^{k}\leftarrow\operatorname*{arg\,max}_{\hat{U}\in\hat{\mathcal{U}}^{k},\pi\in\Pi,\hat{M}:\mathbb{P}_{\hat{M}}\in\mathcal{P}^{k}}\hat{U}(\mu^{\pi}_{\hat{M}}).$
12:   end if
13: end for

Algorithm 5 differs from Algorithm 4 in the if-condition in line 7 as well as in lines 10 and 11.

The if-condition in line 7 now includes the case $t-T_{k-1}\geq T_{\mathit{epoch}}$ , where $T_{\mathit{epoch}}$ will be chosen later. We need this to guarantee $T_{k}-T_{k-1}\leq T_{\mathit{epoch}}$ for all $k$ and thereby bound the estimation error of the utility function estimate. Intuitively, we need to keep the estimates of $U$ somewhat up to date to be able to bound the estimation error. Meanwhile, we cannot update the estimate in each round (or too often) since then we would also have to change $\pi_{*}^{k}$ in each round, which would lead to $K=T$ .

Since we also need to estimate the utility function, we changed line 11 to also compute an optimistic estimate of the utility using the definition in (13).

J.2 Analysis

Theorem J.1.

Under Assump. A, B and C, if we run Alg. 5 with $0<\delta<1$ , then with probability at least $1-8\delta$ , $K\leq T^{1/6}+HSA\log_{2}T$ , and

	$\displaystyle\Delta_{T}(\{\bar{\mu}^{t}_{M^{*}}\}_{t=1}^{T})$	$\displaystyle\leq L_{U}\sqrt{H^{3}SAT(K\operatorname*{AdaReg}(T)+D)}+36L_{U}H^% {3}S\sqrt{AT\ln(THSA/\delta)}$
		$\displaystyle+\mathcal{O}\left(T^{5/6}U_{\max}\dim_{E}(\mathcal{U},T^{-1})+% \sqrt{\beta_{K}^{U}\dim_{E}(\mathcal{U},T^{-1})T}\right),$
	$\displaystyle C_{T}(\{\bar{\mu}^{t}_{M^{*}},R_{\text{nz}}^{t}-(r_{\max}\cdot\mathbf{1}-r^{*})\}_{t=1}^{T})$	$\displaystyle\leq 4H\sqrt{T(K\operatorname*{AdaReg}(T)+D)}+D,$

where $D=\tilde{O}(\sqrt{\beta_{T}H\dim_{E}(\mathcal{R},T^{-1})T})$.

Comparing with Theorem 6.1, we see that the steering gap has an additional term originating from the estimation of $U$. Furthermore, the bound on the number of epochs $K$ has an additional $T^{1/6}$. Similar to the discussion about $\mathcal{R}$ in Appendix H.1, we can also bound $\beta^{U}_{K}$ and $\dim_{E}(\mathcal{U},T^{-1})$ under suitable assumptions about $\mathcal{U}$. If $\beta^{U}_{K},\dim_{E}(\mathcal{U},T^{-1})\in\tilde{\mathcal{O}}(1)$ and $\operatorname*{AdaReg}(T)=\tilde{\mathcal{O}}(\sqrt{T})$, both the steering gap and the steering cost are in $\tilde{\mathcal{O}}(T^{5/6})$ (ignoring all other constants).

Proof.

We can adapt the proof of Theorem 6.1 by choosing the following regret decomposition.

\displaystyle U(\mu^{\pi^{*}})-U(\bar{\mu}^{t})=\left(U(\mu^{\pi^{*}})-\hat{U}% ^{k}(\hat{\mu}^{k}_{*})\right)+\left(\hat{U}^{k}(\hat{\mu}^{k}_{*})-\hat{U}^{k% }(\bar{\mu}^{t})\right)+\left(\hat{U}^{k}(\bar{\mu}^{t})-U(\bar{\mu}^{t})\right)

Using Lemma I.1 (and replacing $\mathcal{R}$ by $\mathcal{U}$ in the Lemma), we have $U\in\bigcap_{k=1}^{K}\hat{\mathcal{U}}^{k}$ with probability at least $1-2\delta$ . Thus, with probability at least $1-2\delta$ , the first term can be bounded by 0 using optimism. The second term can be bounded in the same way as in the proof of Theorem 6.1. Summing over all $t$ , the last term accumulates to

\displaystyle\sum_{t=1}^{T}\hat{U}^{k}(\bar{\mu}^{t})-U(\bar{\mu}^{t})=\sum_{k% =1}^{K}\sum_{t=T_{k-1}+1}^{T_{k}}\hat{U}^{k}(\bar{\mu}^{t})-U(\bar{\mu}^{t})% \leq\sum_{k=1}^{K}\sum_{t=T_{k-1}+1}^{T_{k}}w_{\hat{\mathcal{U}}^{k}}(\bar{\mu% }^{t}).

Using the fact that $T_{k}-T_{k-1}\leq T_{\mathit{epoch}}$ and Lemma H.5 with $\varepsilon=T^{-1}$ , the sum above is at most

\displaystyle\frac{KT_{\mathit{epoch}}}{T}+U_{\max}T_{\mathit{epoch}}\dim_{E}(% \mathcal{U},T^{-1})+4\sqrt{\beta_{T}^{U}KT_{\mathit{epoch}}\dim_{E}(\mathcal{U% },T^{-1})}.

We also have to find a new bound for $K$. As before, we can enter the if block at most $HSA\log_{2}T$ times because of the first condition. In addition, we can enter the if block at most $T/T_{\mathit{epoch}}$ times due to the condition $t-T_{k-1}\geq T_{\mathit{epoch}}$. Therefore, $K\leq T/T_{\mathit{epoch}}+HSA\log_{2}T$ and $KT_{\mathit{epoch}}\leq T+HSAT_{\mathit{epoch}}\log_{2}T$.

Now, we set $T_{\mathit{epoch}}=T^{5/6}$ . Then, $K\leq T^{1/6}+HSA\log_{2}T$ and

\displaystyle\sum_{t=1}^{T}\hat{U}^{k}(\bar{\mu}^{t})-U(\bar{\mu}^{t})\leq% \mathcal{O}\left(T^{5/6}U_{\max}\dim_{E}(\mathcal{U},T^{-1})+\sqrt{\beta_{T}^{% U}\dim_{E}(\mathcal{U},T^{-1})T}\right).
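To see how the choice $T_{\mathit{epoch}}=T^{5/6}$ enters these bounds, the sketch below evaluates $T_{\mathit{epoch}}$, the bound $T/T_{\mathit{epoch}}+HSA\log_{2}T$ on $K$, and the product $KT_{\mathit{epoch}}$ that appears under the square root. The function name `epoch_terms` and the problem sizes are hypothetical.

```python
import math


def epoch_terms(T, H, S, A):
    """Evaluate T_epoch = T^(5/6) and the resulting bounds on K and K * T_epoch."""
    T_epoch = T ** (5 / 6)
    forced = T / T_epoch                 # switches caused by the time-out
    doubling = H * S * A * math.log2(T)  # switches caused by the visit counts
    K_bound = forced + doubling
    return T_epoch, K_bound, K_bound * T_epoch


T, H, S, A = 10**6, 5, 10, 4  # hypothetical problem sizes
T_epoch, K_bound, KT = epoch_terms(T, H, S, A)
print(T_epoch, K_bound, KT)
```

In this example the time-out contributes only about $T^{1/6}=10$ switches, while $KT_{\mathit{epoch}}$ is $T$ plus a term of order $T^{5/6}\log_{2}T$.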

Finally, the steering gap is the previous bound of Theorem 6.1 plus the above term.

With regard to the steering cost, the only change is the bound of $K$ .

∎

	$\displaystyle\left\|R_{\text{nz}}^{t}(\mu)\right\|_{\infty}$	$\displaystyle=\left\|R_{\pi_{*}^{k(t)}}(\mu)-(\bar{r}^{t}(\mu)-w_{\hat{\mathcal{R}}^{t}}(\mu))+(r_{\max}+\|R_{\pi_{*}^{k(t)}}(\mu)\|_{\infty})\bm{1}\right\|_{\infty}$
		$\displaystyle\leq\left\|R_{\pi_{*}^{k(t)}}(\mu)+\|R_{\pi_{*}^{k(t)}}(\mu)\|_{\infty}\bm{1}\right\|_{\infty}+\left\|\bar{r}^{t}(\mu)-w_{\hat{\mathcal{R}}^{t}}(\mu)\right\|_{\infty}+\left\|r_{\max}\bm{1}\right\|_{\infty}$
		$\displaystyle\leq 4+2r_{\max}.$

$$\begin{aligned}
\|\mu_{h+1}^{\pi}-\mu_{h+1}^{\tilde{\pi}}\|_{1}
&=\sum_{s,a}\left|\mu_{h+1}^{\pi}(s,a)-\mu_{h+1}^{\tilde{\pi}}(s,a)\right|\\
&=\sum_{s,a}\left|\pi_{h+1}(a|s)\sum_{s^{\prime},a^{\prime}}\mu_{h}^{\pi}(s^{\prime},a^{\prime})\mathbb{P}_{h}(s|s^{\prime},a^{\prime})-\tilde{\pi}_{h+1}(a|s)\sum_{s^{\prime},a^{\prime}}\mu_{h}^{\tilde{\pi}}(s^{\prime},a^{\prime})\mathbb{P}_{h}(s|s^{\prime},a^{\prime})\right|\\
&\leq\sum_{s,a}\left|\pi_{h+1}(a|s)-\tilde{\pi}_{h+1}(a|s)\right|\sum_{s^{\prime},a^{\prime}}\mu_{h}^{\pi}(s^{\prime},a^{\prime})\mathbb{P}_{h}(s|s^{\prime},a^{\prime})\\
&\quad+\sum_{s,a}\tilde{\pi}_{h+1}(a|s)\sum_{s^{\prime},a^{\prime}}\left|\mu_{h}^{\pi}(s^{\prime},a^{\prime})-\mu_{h}^{\tilde{\pi}}(s^{\prime},a^{\prime})\right|\mathbb{P}_{h}(s|s^{\prime},a^{\prime})\\
&=\sum_{s,a}\mu_{h+1}^{\pi}(s)\left|\pi_{h+1}(a|s)-\tilde{\pi}_{h+1}(a|s)\right|+\sum_{s^{\prime},a^{\prime}}\left|\mu_{h}^{\pi}(s^{\prime},a^{\prime})-\mu_{h}^{\tilde{\pi}}(s^{\prime},a^{\prime})\right|\sum_{s,a}\tilde{\pi}_{h+1}(a|s)\mathbb{P}_{h}(s|s^{\prime},a^{\prime})\\
&=\sum_{s}\mu_{h+1}^{\pi}(s)\|\pi_{h+1}(\cdot|s)-\tilde{\pi}_{h+1}(\cdot|s)\|_{1}+\|\mu_{h}^{\pi}-\mu_{h}^{\tilde{\pi}}\|_{1}.
\end{aligned}$$
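The one-step inequality above (the density gap at step $h+1$ is at most a policy-difference term plus the density gap at step $h$) can be sanity-checked numerically. The sketch below assumes a small random tabular model with density-independent transitions; the helper names (`rand_dist`, `densities`, `l1`, `policy_gap_holds`) and all sizes are hypothetical choices made for illustration.

```python
import random


def rand_dist(n, rng):
    """Random probability vector of length n."""
    w = [rng.random() + 1e-3 for _ in range(n)]
    total = sum(w)
    return [x / total for x in w]


def densities(mu1, P, pi):
    """Roll out state-action densities mu[h][s][a] for policy pi.

    P[h][s][a] is a distribution over next states (independent of the density).
    """
    H, S, A = len(pi), len(mu1), len(pi[0][0])
    state, out = list(mu1), []
    for h in range(H):
        mu_h = [[state[s] * pi[h][s][a] for a in range(A)] for s in range(S)]
        out.append(mu_h)
        if h + 1 < H:
            state = [sum(mu_h[s][a] * P[h][s][a][s2]
                         for s in range(S) for a in range(A))
                     for s2 in range(S)]
    return out


def l1(x, y):
    """L1 distance between two S-by-A tables."""
    return sum(abs(u - v) for rx, ry in zip(x, y) for u, v in zip(rx, ry))


def policy_gap_holds(H=5, S=4, A=3, seed=0, tol=1e-12):
    """Check the one-step bound at every h for two random policies."""
    rng = random.Random(seed)
    mu1 = rand_dist(S, rng)
    P = [[[rand_dist(S, rng) for _ in range(A)] for _ in range(S)]
         for _ in range(H - 1)]
    pi = [[rand_dist(A, rng) for _ in range(S)] for _ in range(H)]
    pi_tilde = [[rand_dist(A, rng) for _ in range(S)] for _ in range(H)]
    mu, mu_tilde = densities(mu1, P, pi), densities(mu1, P, pi_tilde)
    for h in range(H - 1):
        lhs = l1(mu[h + 1], mu_tilde[h + 1])
        next_state = [sum(row) for row in mu[h + 1]]  # state marginal under pi
        policy_term = sum(
            next_state[s] * sum(abs(p - q)
                                for p, q in zip(pi[h + 1][s], pi_tilde[h + 1][s]))
            for s in range(S))
        if lhs > policy_term + l1(mu[h], mu_tilde[h]) + tol:
            return False
    return True


if __name__ == "__main__":
    print(all(policy_gap_holds(seed=k) for k in range(10)))
```

A random check of this kind does not replace the derivation; it only guards against sign or index slips when the recursion is implemented.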

$$\begin{aligned}
\|\tilde{\mu}_{h+1}-\mu_{h+1}\|_{1}
&=\sum_{s,a}\left|\tilde{\mu}_{h+1}(s,a)-\mu_{h+1}(s,a)\right|\\
&\leq\sum_{s,a}\sum_{s^{\prime},a^{\prime}}\pi_{h+1}(a|s)\left|\tilde{\mathbb{P}}_{h}(s|s^{\prime},a^{\prime})\tilde{\mu}_{h}(s^{\prime},a^{\prime})-\mathbb{P}_{h}(s|s^{\prime},a^{\prime})\mu_{h}(s^{\prime},a^{\prime})\right|\\
&=\sum_{s,a,s^{\prime}}\left|\tilde{\mathbb{P}}_{h}(s^{\prime}|s,a)\tilde{\mu}_{h}(s,a)-\mathbb{P}_{h}(s^{\prime}|s,a)\mu_{h}(s,a)\right|\\
&\leq\sum_{s,a,s^{\prime}}\tilde{\mathbb{P}}_{h}(s^{\prime}|s,a)\left|\tilde{\mu}_{h}(s,a)-\mu_{h}(s,a)\right|+\mu_{h}(s,a)\left|\tilde{\mathbb{P}}_{h}(s^{\prime}|s,a)-\mathbb{P}_{h}(s^{\prime}|s,a)\right|\\
&=\sum_{s,a}\left|\tilde{\mu}_{h}(s,a)-\mu_{h}(s,a)\right|+\sum_{s,a}\mu_{h}(s,a)\sum_{s^{\prime}}\left|\tilde{\mathbb{P}}_{h}(s^{\prime}|s,a)-\mathbb{P}_{h}(s^{\prime}|s,a)\right|\\
&=\|\tilde{\mu}_{h}-\mu_{h}\|_{1}+\sum_{s,a}\mu_{h}(s,a)\|\tilde{\mathbb{P}}_{h}(\cdot|s,a)-\mathbb{P}_{h}(\cdot|s,a)\|_{1}.
\end{aligned}$$
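The model-perturbation bound above can be checked in the same spirit for a single step: push two densities forward under two transition kernels with a shared next-step policy, then compare the gap with the two terms on the right-hand side. This is a minimal sketch on random toy data; `push_forward`, `model_gap` and the table sizes are hypothetical names and values chosen for illustration.

```python
import random


def rand_dist(n, rng):
    """Random probability vector of length n."""
    w = [rng.random() + 1e-3 for _ in range(n)]
    total = sum(w)
    return [x / total for x in w]


def push_forward(mu, P, pi_next):
    """One step of the density recursion.

    Returns mu_next[s2][a2] = pi_next[s2][a2] * sum_{s,a} P[s][a][s2] * mu[s][a].
    """
    S, A = len(pi_next), len(pi_next[0])
    nxt = [sum(mu[s][a] * P[s][a][s2] for s in range(S) for a in range(A))
           for s2 in range(S)]
    return [[nxt[s2] * pi_next[s2][a2] for a2 in range(A)] for s2 in range(S)]


def model_gap(S=4, A=3, seed=0):
    """Return (lhs, rhs) of the one-step model-perturbation bound."""
    rng = random.Random(seed)

    def table():
        flat = rand_dist(S * A, rng)  # a random density over (s, a)
        return [flat[s * A:(s + 1) * A] for s in range(S)]

    mu, mu_tilde = table(), table()
    P = [[rand_dist(S, rng) for _ in range(A)] for _ in range(S)]
    P_tilde = [[rand_dist(S, rng) for _ in range(A)] for _ in range(S)]
    pi_next = [rand_dist(A, rng) for _ in range(S)]

    new = push_forward(mu, P, pi_next)
    new_tilde = push_forward(mu_tilde, P_tilde, pi_next)
    pairs = [(s, a) for s in range(S) for a in range(A)]
    lhs = sum(abs(new_tilde[s][a] - new[s][a]) for s, a in pairs)
    density_term = sum(abs(mu_tilde[s][a] - mu[s][a]) for s, a in pairs)
    model_term = sum(
        mu[s][a] * sum(abs(P_tilde[s][a][s2] - P[s][a][s2]) for s2 in range(S))
        for s, a in pairs)
    return lhs, density_term + model_term


if __name__ == "__main__":
    lhs, rhs = model_gap()
    print(lhs <= rhs + 1e-12)
```

Unrolling this step over $h$ is what turns per-step transition errors into a bound on the final density gap.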

$$\begin{aligned}
\frac{1}{H}\sum_{t=1}^{T}\|\bar{\mu}^{t}-\mu^{\pi_{*}^{t}}\|_{1}
&\leq\sum_{t=1}^{T}\sum_{h=1}^{H}\sum_{s\in\mathcal{S}}\bar{\mu}^{t}_{h}(s)\|\bar{\pi}^{t}_{h}(\cdot|s)-\pi_{*,h}^{t}(\cdot|s)\|_{1}\\
&\leq\sqrt{HSAT\sum_{t=1}^{T}\left\langle R_{\pi_{*}^{t}}(\bar{\mu}^{t}),\mu^{\pi_{*}^{t}}-\bar{\mu}^{t}\right\rangle},
\end{aligned}$$

$$\begin{aligned}
\sum_{t=1}^{T}\left\langle R_{\pi^{t}_{*}}(\bar{\mu}^{t}),\mu^{\pi^{t}_{*}}-\bar{\mu}^{t}\right\rangle
&=\sum_{t=1}^{T}\left\|(W^{\pi^{t}_{*}}-I)\bar{\mu}^{t}\right\|_{2}^{2}\\
&=\sum_{t=1}^{T}\sum_{h,s,a}\left(\pi^{t}_{*,h}(a|s)\sum_{a^{\prime}}\bar{\mu}^{t}_{h}(s,a^{\prime})-\bar{\mu}^{t}_{h}(s,a)\right)^{2}\\
&=\sum_{t=1}^{T}\sum_{h,s,a}(\bar{\mu}^{t}_{h}(s))^{2}\left(\pi^{t}_{*,h}(a|s)-\bar{\pi}^{t}_{h}(a|s)\right)^{2}\\
&=\sum_{t=1}^{T}\sum_{h,s}(\bar{\mu}^{t}_{h}(s))^{2}\left\|\pi^{t}_{*,h}(\cdot|s)-\bar{\pi}^{t}_{h}(\cdot|s)\right\|_{2}^{2}.
\end{aligned}$$
