
Steering No-Regret Agents in MFGs under Model Uncertainty




Leo Widmer, Jiawei Huang, Niao He

lewidmer@student.ethz.ch, {jiawei.huang, niao.he}@inf.ethz.ch
Department of Computer Science, ETH Zürich

Abstract

Incentive design is a popular framework for guiding agents’ learning dynamics towards desired outcomes by providing additional payments beyond intrinsic rewards. However, most existing works focus on a finite, small set of agents or assume complete knowledge of the game, limiting their applicability to real-world scenarios involving large populations and model uncertainty. To address this gap, we study the design of steering rewards in Mean-Field Games (MFGs) with density-independent transitions, where both the transition dynamics and intrinsic reward functions are unknown. This setting presents non-trivial challenges, as the mediator must incentivize the agents to explore for its model learning under uncertainty, while simultaneously steering them to converge to desired behaviors without incurring excessive incentive payments. Assuming agents exhibit no(-adaptive) regret behaviors, we contribute novel optimistic exploration algorithms. Theoretically, we establish sub-linear regret guarantees for the cumulative gaps between the agents’ behaviors and the desired ones. In terms of the steering cost, we demonstrate that our total incentive payments incur only sub-linear excess, competing with a baseline steering strategy that stabilizes the target policy as an equilibrium. Our work presents an effective framework for steering agents’ behaviors in large-population systems under uncertainty.

1 INTRODUCTION

Mean-Field Games (MFGs) (Huang et al., 2006; Lasry and Lions, 2007) are a widely used and powerful framework for modeling competition and cooperation in large-population systems of symmetric and interchangeable agents. MFGs effectively capture the dynamics of many real-world scenarios, such as macro-economic models (Steinbacher et al., 2021), road traffic systems (Chen and Cheng, 2010), autonomous vehicle systems (Dinneweth et al., 2022) and auctions (Iyer et al., 2014), and they have been successfully applied in those domains (Gomes et al., 2014; Cabannes et al., 2021; Achdou and Lasry, 2019; Guo et al., 2021). As in finite-agent systems (Roughgarden and Tardos, 2007), MFGs with self-interested agents may lead to undesirable collective behaviors. Typically, the agents’ learning dynamics may converge to equilibria where all the participants are worse off compared to other possible outcomes (Guo et al., 2023a).

To address this dilemma, the field of incentive design explores methods to guide agents towards more favorable behaviors by modifying the reward structure. A widely studied formulation, known as the steering problem (Zhang et al., 2024; Canyakmaz et al., 2024; Huang et al., 2024b), assumes the presence of a mediator (incentive designer) outside the game, who can influence the agents’ learning dynamics by providing additional steering rewards. However, previous research on steering mainly focuses on either Extensive-Form Games (Zhang et al., 2024) or Markov Games (Canyakmaz et al., 2024; Huang et al., 2024b) with a limited number of agents. The methods developed for those small-scale settings become intractable in large-population scenarios as the number of agents increases, a phenomenon known as the curse of multi-agency.

To address this gap, we study incentive design in Mean-Field Games (MFGs). More concretely, we focus on finite-horizon MFGs with density-independent transitions, a standard model in the literature (Huang et al., 2006; Lasry and Lions, 2007; Perolat et al., 2021). In the steering problem setup, we play the role of the mediator, with access to a utility function that depends on the collective behavior of the agents through the population density. During the interactions with the mediator, the agents are continuously learning and adapting. We assume the agents are self-interested no-adaptive-regret learners (Hazan and Seshadhri, 2007); similar no-regret assumptions have been widely adopted in previous literature (Camara et al., 2020; Ge et al., 2024). Following previous works (Zhang et al., 2024; Huang et al., 2024b), our primary goal is to design steering rewards that guide the agents towards desired policies (i.e., minimize the steering gap), such that the resulting behaviors maximize the utility function. Meanwhile, the incentives paid by the mediator to the agents, referred to as the steering cost, should remain low.

In practice, the mediator usually lacks knowledge of the transition dynamics and the intrinsic reward functions of the MFG. Therefore, in this work, we focus on the design of steering strategies without prior knowledge of the game model. Model uncertainty makes the steering problem much more challenging, and requires the mediator to strategically balance exploration and exploitation. Typically, without knowledge of the MFG model, the mediator does not know which agent behaviors (population densities) are feasible, let alone how to maximize the utility function. Therefore, the mediator needs not only to steer the agents to explore the MFG for its own learning, but also to ensure the agents converge to desired outcomes, while keeping the cumulative incentive payments affordable. In summary, the key question we would like to address is:

How can we design effective steering strategies for no-regret agents in MFGs under model uncertainty?

Main contributions. We address the above open question by proposing novel exploration algorithms with provable guarantees. We highlight our main contributions below. A summary of the main theorems in this paper can be found in Appx. B.

  • Firstly, in Sec. 3, we contribute the first formulation for steering in mean-field games, with details about the problem setting and learning objectives.

  • Secondly, as preparation, in Sec. 4, we investigate how to steer the agents to a given density or policy. Notably, in Sec. 4.2, we propose a novel steering strategy, which can guide the no-adaptive-regret agents towards any target policy without prior knowledge of the model. This method serves as the key ingredient of our steering algorithms in the following sections.

  • Thirdly, in Sec. 5 and Sec. 6, we investigate strategic exploration methods for steering agents in MFGs under uncertainty. In Sec. 5, we start with the setting where the intrinsic reward is zero, and propose an optimism-based exploration algorithm, which guarantees that both the cumulative steering gap and cost grow only sub-linearly. Furthermore, in Sec. 6, we extend our methods to the setting with non-zero and unknown intrinsic reward by integrating a pessimism-based reward estimation strategy. We establish sub-linear regret in the steering gap, and show that the total steering cost is only sub-linearly worse than that of a baseline strategy that stabilizes the target policy as an equilibrium by offsetting differences in intrinsic rewards.

1.1 Closely Related Work

Due to space limits, we only discuss closely related work here and defer the rest to Appendix C.

Incentive Design in Multi-Agent Systems. The problem of incentive design broadly refers to the design of mechanisms for shaping the behavior of autonomous agents (Ehtamo et al., 2002; Ratliff et al., 2019). A recently popular framework for incentive design is known as the steering problem (Zhang et al., 2024; Canyakmaz et al., 2024; Huang et al., 2024b), which considers a repeated interaction between a mediator and learning agents. All of these works focus on small-scale problems (e.g., Markov Games or Extensive-Form Games), and their proposed methods become intractable when extended to large-population settings, including MFGs. Besides, Canyakmaz et al. (2024) and Huang et al. (2024b) consider the agents’ learning dynamics to be memoryless, which differs from our no-regret assumptions.

Another related direction is contract design (DellaVigna and Malmendier, 2004), which studies the interactions between a principal and agents when the two parties transact in the presence of private information. (To save space, we defer a more detailed comparison between our steering framework and the contract design setting to Appx. C.) The fundamental question is how the principal should design the incentives for the agents to maximize its own utility after deducting the payments to agents. However, most of the literature studies the single-agent setting (Zhu et al., 2022; Ho et al., 2014; Scheid et al., 2024), or focuses on the computational aspects without addressing exploration under uncertainty (Dütting et al., 2023; Castiglioni et al., 2023). Carmona and Wang (2021) and Elie et al. (2019) study contract design in the MFG setting, but neither considers model uncertainty. Moreover, a common assumption in those works is that the agents always best-respond (or play equilibrium policies) to the principal’s intervention, which is much stronger than our no-adaptive-regret assumption.

Besides, Sanjari et al. (2024) consider incentive design in a large-population setting, but they study Stackelberg games with one leader and a large number of followers, which differs substantially from our setting. Fu and Horst (2018) consider mean-field leader-follower games; however, they assume knowledge of the dynamics, while we consider the steering problem without this knowledge. Moreover, they study dynamics where the agents cooperate and optimally respond to the leader’s control signal. In contrast, we consider decentralized and self-interested agents with no-regret behaviors in maximizing individual interests.

2 PRELIMINARIES

Mean-Field Games. We consider the MFG setting with a finite yet extremely large number of agents, each of which acts independently. In line with Subramanian et al. (2022), we refer to this setting as “Decentralized-MFGs”, although their model additionally allows agents to differ in action spaces and reward functions.

Definition 2.1.

A Finite-Horizon Decentralized MFG is defined by a tuple $M=(N,\mathcal{S},\mathcal{A},H,\mathbb{P}_{M},r_{M},\mu_{1})$, given the number of agents $N$; state and action spaces $\mathcal{S},\mathcal{A}$ with sizes $S$ and $A$; horizon length $H$; and initial state distribution $\mu_{1}\in\Delta_{\mathcal{S}}$. $\mathbb{P}_{M}:=\{\mathbb{P}_{M,h}\}_{h=1}^{H}$ with $\mathbb{P}_{M,h}:\mathcal{S}\times\mathcal{A}\to\Delta_{\mathcal{S}}$ and $r_{M}:=\{r_{M,h}\}_{h=1}^{H}$ with $r_{M,h}:\mathcal{S}\times\mathcal{A}\times\Delta_{\mathcal{S}\times\mathcal{A}}\to[0,r_{\max}]$ denote the transition and reward functions, respectively.

In this paper, we focus on density-independent transition functions, a common assumption in previous literature (Huang et al., 2006; Lasry and Lions, 2007; Perolat et al., 2021). For the reward function, we consider the general setup, where the rewards depend on the state-action density (Guo et al., 2021).

We only focus on non-stationary Markovian policies, denoted by $\Pi:=\{\pi:=\{\pi_{h}\}_{h\in[H]}\mid\pi_{h}:\mathcal{S}\rightarrow\Delta(\mathcal{A})\}$. Given a model $M$ and an agent taking policy $\pi\in\Pi$, we use $\mu^{\pi}_{M}:=\{\mu^{\pi}_{M,h}\}_{h=1}^{H}$ to denote its state-action density for each step $h\in[H]$. Starting with $\mu^{\pi}_{M,1}(s,a)=\mu_{1}(s)\pi_{1}(a|s)$, for $1\leq h<H$, we have:

$$\mu^{\pi}_{M,h+1}(s,a)=\pi_{h+1}(a|s)\sum_{s^{\prime},a^{\prime}}\mathbb{P}_{M,h}(s|s^{\prime},a^{\prime})\,\mu^{\pi}_{M,h}(s^{\prime},a^{\prime}).$$

When $N$ agents take policies $\pi^{1},\dots,\pi^{N}\in\Pi$, respectively, the trajectory of agent $n\in[N]$ is specified by:

$$s_{1}^{n}\sim\mu_{1},\quad\forall h\geq 1,\quad a_{h}^{n}\sim\pi^{n}_{h}(\cdot|s_{h}^{n}),\quad s_{h+1}^{n}\sim\mathbb{P}_{M,h}(\cdot|s_{h}^{n},a_{h}^{n}),\quad r_{h}^{n}\leftarrow r_{M,h}(s_{h}^{n},a_{h}^{n},\bar{\mu}_{M,h}),\tag{1}$$

where we use $\bar{\mu}_{M,h}=\frac{1}{N}\sum_{n=1}^{N}\mu^{\pi^{n}}_{M,h}$ to denote the population density at step $h$. We also assume $r_{M}$ is Lipschitz in the density, which is standard in previous works (Guo et al., 2021; Yardim et al., 2022).
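The density recursion and the population average above can be sketched numerically. The following is a minimal illustration rather than part of the paper: it assumes a tabular representation (arrays `mu1`, `P`, `pi` with the shapes given in the docstring), and the helper names `state_action_density`, `population_density` and `random_simplex` are hypothetical.

```python
import numpy as np

def state_action_density(mu1, P, pi):
    """Forward recursion for the state-action density of a single policy.

    mu1: shape (S,), initial state distribution.
    P:   shape (H, S, A, S), P[h, s, a, s2] = probability of moving from (s, a) to s2.
    pi:  shape (H, S, A), pi[h, s, a] = probability of action a in state s.
    Returns an array of shape (H, S, A); index h = 0 corresponds to step 1.
    """
    H, S, A = pi.shape
    mu = np.zeros((H, S, A))
    mu[0] = mu1[:, None] * pi[0]
    for h in range(H - 1):
        # Next-state marginal: sum over (s', a') of P_h(s | s', a') * mu_h(s', a').
        next_state = np.einsum("ijk,ij->k", P[h], mu[h])
        mu[h + 1] = next_state[:, None] * pi[h + 1]
    return mu

def population_density(mu1, P, policies):
    """Average the per-agent densities over the N agents."""
    return np.mean([state_action_density(mu1, P, pi) for pi in policies], axis=0)

def random_simplex(rng, shape):
    """Random nonnegative array normalized to sum to one along the last axis."""
    x = rng.random(shape)
    return x / x.sum(axis=-1, keepdims=True)

# Hypothetical toy instance: random model and random policies.
rng = np.random.default_rng(0)
H, S, A, N = 3, 4, 2, 5
mu1 = random_simplex(rng, (S,))
P = random_simplex(rng, (H, S, A, S))
policies = [random_simplex(rng, (H, S, A)) for _ in range(N)]
mu_bar = population_density(mu1, P, policies)
print(mu_bar.sum(axis=(1, 2)))  # each step's density should sum to one
```

Because the transitions do not depend on the density, each agent's density can be computed independently of the others, and the population density is simply their average.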

Other Notational Convention. For convenience, we implicitly treat $\mu^{\pi}_{M}$ as a vector in $\mathbb{R}^{HSA}$ obtained by concatenating $\{\mu_{M,h}^{\pi}\}_{h\in[H]}$. We denote by $\Psi_{M}:=\{\mu^{\pi}_{M}:\pi\in\Pi\}\subseteq\Delta_{\mathcal{S}\times\mathcal{A}}^{H}$ the set of all feasible state-action densities given $M$. Note that $\Psi_{M}$ is a convex set (see Lem. E.1), which implies $\bar{\mu}_{M}\in\Psi_{M}$. If it is not necessary to distinguish which model $M$ we use, we omit it from the subscripts, for example, $\mu^{\pi}/\Psi$ instead of $\mu^{\pi}_{M}/\Psi_{M}$. We also omit $h$ in $s_{h},a_{h}$ if it is clear from the context. With slight abuse of notation, given a population density $\bar{\mu}:=\{\bar{\mu}_{h}\}_{h=1}^{H}\in\Delta^{H}_{\mathcal{S}\times\mathcal{A}}$ and a reward function $r$, we use $r(\bar{\mu})\in\mathbb{R}^{HSA}$ to denote the reward vector where $(r(\bar{\mu}))_{h,s,a}=r_{h}(s,a,\bar{\mu}_{h})$. In this way, given an arbitrary agent $n\in[N]$ taking policy $\pi^{n}$, its expected total return conditioned on the population density $\bar{\mu}$ can be written as: $\mathbb{E}_{\pi^{n}}[\sum_{h=1}^{H}r_{h}(s_{h}^{n},a_{h}^{n},\bar{\mu}_{h})]=\langle r(\bar{\mu}),\mu^{\pi^{n}}\rangle$.
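The inner-product form of the expected return follows from linearity of expectation. A short derivation, using only the definitions of this section and treating $\bar{\mu}$ as fixed, is sketched below; the intermediate probability notation $\Pr(\cdot)$ is introduced here for illustration only.

```latex
\begin{aligned}
\mathbb{E}_{\pi^{n}}\Big[\sum_{h=1}^{H} r_{h}(s_{h}^{n},a_{h}^{n},\bar{\mu}_{h})\Big]
&= \sum_{h=1}^{H}\sum_{s,a}\Pr\big(s_{h}^{n}=s,\,a_{h}^{n}=a\big)\,r_{h}(s,a,\bar{\mu}_{h}) \\
&= \sum_{h=1}^{H}\sum_{s,a}\mu^{\pi^{n}}_{h}(s,a)\,\big(r(\bar{\mu})\big)_{h,s,a} \\
&= \langle r(\bar{\mu}),\mu^{\pi^{n}}\rangle .
\end{aligned}
```

The second equality uses the fact that, with density-independent transitions, the probability that agent $n$ visits $(s,a)$ at step $h$ is exactly $\mu^{\pi^{n}}_{h}(s,a)$, regardless of the other agents' policies.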

Given that this paper considers learning under uncertainty, we use $M^{*}$ to denote the true hidden mean-field model with transition $\mathbb{P}^{*}$ and intrinsic reward $r^{*}$, in order to distinguish it from the estimated ones.

Besides, given a population density $\bar{\mu}$ in a model $M$, we use $\bar{\pi}$ to denote the policy that induces this population density (i.e., $\mu^{\bar{\pi}}=\bar{\mu}$), defined by: $\bar{\pi}_{h}(\cdot|s):=\bar{\mu}_{h}(s,\cdot)/\bar{\mu}_{h}(s)$, where $\bar{\mu}_{h}(s):=\sum_{a}\bar{\mu}_{h}(s,a)$ is the state marginal (or $\bar{\pi}_{h}(\cdot|s)=1/A$ if $\bar{\mu}_{h}(s)=0$).
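The construction of $\bar{\pi}$ from $\bar{\mu}$ amounts to normalizing the density over actions at each state. A minimal sketch follows, assuming a tabular density stored as an array of shape (H, S, A) and reading the fallback condition as zero state mass; the function name `policy_from_density` and the toy density are hypothetical.

```python
import numpy as np

def policy_from_density(mu_bar):
    """Recover the policy pi_bar from a state-action density of shape (H, S, A).

    States with zero mass at a step receive the uniform policy 1/A.
    """
    H, S, A = mu_bar.shape
    state_mass = mu_bar.sum(axis=2, keepdims=True)       # state marginal per step
    uniform = np.full((H, S, A), 1.0 / A)
    # Replace zero masses by 1 so the division below never divides by zero.
    safe_mass = np.where(state_mass > 0, state_mass, 1.0)
    return np.where(state_mass > 0, mu_bar / safe_mass, uniform)

# Hypothetical toy density with H = 2, S = 2, A = 2.
mu_bar = np.array([[[0.2, 0.3], [0.5, 0.0]],
                   [[0.0, 0.0], [0.4, 0.6]]])
pi_bar = policy_from_density(mu_bar)
print(pi_bar)
```

In the toy density, the first state has no mass at the second step, so the sketch falls back to the uniform policy there.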

Reward Function Approximation and Eluder Dimension. In this paper, we consider the setting where the true intrinsic reward, denoted by $r^{*}$, is unknown. Note that the reward function depends not only on the state and action but also on the density, which belongs to a high-dimensional continuous space. Therefore, we consider function approximation for reward estimation with the standard realizability assumption.

Assumption A.

A reward function class $\mathcal{R}$ is available, s.t. (i) $\forall r\in\mathcal{R}$, $\forall h,~r_{h}(\cdot,\cdot,\cdot)\in[0,r_{\max}]$; (ii) $r^{*}\in\mathcal{R}$.

In the function approximation setting, the fundamental sample efficiency is closely related to the complexity of the function class. We follow previous works (Russo and Van Roy, 2013; Huang et al., 2024a) and use the Eluder Dimension as the complexity measure of the function class. Intuitively, the Eluder Dimension is the length of the longest “independent” sequence, in which each element “reveals” some new information about the function class compared with the previous ones.

Definition 2.2 ($\varepsilon$-independent sequence).

Given a domain $\mathcal{X}$ and a class of functions $\mathcal{F}$ defined on $\mathcal{X}$, we say $x\in\mathcal{X}$ is $\varepsilon$-independent of $\{x_{1},\dots,x_{J}\}\subseteq\mathcal{X}$ if there exist $f,\tilde{f}\in\mathcal{F}$ such that $\sum_{j=1}^{J}(f(x_{j})-\tilde{f}(x_{j}))^{2}\leq\varepsilon^{2}$, but $|f(x)-\tilde{f}(x)|>\varepsilon$.

Definition 2.3 (Eluder Dimension).

Given a mean-field reward function class $\mathcal{R}$ and domain $\mathcal{X}:=[H]\times\mathcal{S}\times\mathcal{A}\times\Delta_{\mathcal{S}\times\mathcal{A}}$, the Eluder Dimension of $\mathcal{R}$, denoted by $\dim_{E}(\mathcal{R},\varepsilon)$, is defined to be the length of the longest sequence $\{x^{j}\}_{j=1}^{J}$, such that, for any $i\in[J]$, $x^{i}$ is $\varepsilon$-independent w.r.t. $\{x^{j}\}_{j=1}^{i-1}$.
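To make the two definitions concrete, the following sketch checks $\varepsilon$-independence for a finite function class evaluated on a finite list of candidate points. This is a simplification for illustration only: the paper's domain includes a continuous density component, and the function names and the toy class below are hypothetical. The greedy routine builds one valid $\varepsilon$-independent sequence, so its length is only a lower bound on the Eluder Dimension, not the dimension itself.

```python
import numpy as np
from itertools import combinations

def is_eps_independent(values_prev, values_x, eps):
    """Check Definition 2.2 for a finite function class.

    values_prev: shape (F, J), values_prev[i, j] = f_i(x_j) on the earlier points.
    values_x:    shape (F,),   values_x[i] = f_i(x) at the candidate point x.
    """
    for i, k in combinations(range(len(values_x)), 2):
        close_on_prev = np.sum((values_prev[i] - values_prev[k]) ** 2) <= eps ** 2
        far_at_x = abs(values_x[i] - values_x[k]) > eps
        if close_on_prev and far_at_x:
            return True
    return False

def greedy_independent_length(values, eps):
    """Length of a greedily built eps-independent sequence over the columns of values.

    values: shape (F, n), values[i, j] = f_i(x_j) for n candidate points.
    """
    chosen = []
    for j in range(values.shape[1]):
        if is_eps_independent(values[:, chosen], values[:, j], eps):
            chosen.append(j)
    return len(chosen)

# Hypothetical toy class: three indicator-like functions plus the zero function.
values = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0],
                   [0.0, 0.0, 0.0]])
print(greedy_independent_length(values, eps=0.5))
```

In the toy class, each new point separates two functions that agree on all earlier points, so every point extends the sequence; a repeated point would not, since two functions that are close on a point cannot be far apart on the same point.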

3 THE STEERING PROBLEM FORMULATION FOR MFGS

In this section, we introduce our steering setup. In Sec. 3.1, we first provide our formulation of the steering protocol. Then, in Sec. 3.2, we discuss our assumptions on the agents’ behavior. After that, we introduce the learning objectives and other setups in Sec. 3.3 and 3.4.

3.1 Agent-Mediator Interaction Protocol

We consider a repeated game setup, and summarize the interaction procedure between agents and the mediator in Procedure 1. In each iteration $t\in[T]$, the mediator first selects a steering reward function $R^{t}$ (we use capital $R$ to denote the steering reward, to distinguish it from the intrinsic reward $r$), which is a mapping from the density space to the non-negative reward vector space, upper bounded by $R_{\max}$. The non-negativity of the steering reward is known as limited liability (Innes, 1990), which is standard in previous works (Zhang et al., 2024; Huang et al., 2024b). Meanwhile, each agent computes a policy and plays the game. The agents’ policies result in a population density $\bar{\mu}^{t}:=\frac{1}{N}\sum_{n=1}^{N}\mu^{\pi^{n,t}}$, by which the mediator realizes the steering reward $R^{t}(\bar{\mu}^{t})$. Then, each agent $n\in[N]$ receives payments from the mediator equal to the expected return induced by the steering reward and the agent’s policy, i.e., $\langle R^{t}(\bar{\mu}^{t}),\mu^{\pi^{n,t}}\rangle$. We highlight that in our setup, at each iteration $t$, the mediator designs the steering reward function $R^{t}$ without knowledge of the agents’ policies $\pi^{n,t}$, and we do not restrict whether the agents can observe $R^{t}$ before they make decisions. Furthermore, the agents can either independently compute their policies or collaborate. In the next section, we characterize our assumptions on the agents’ behaviors in more detail.

1: for $t=1,\dots,T$ do
2:     Mediator chooses $R^{t}:\Delta_{\mathcal{S}\times\mathcal{A}}^{H}\to[0,R_{\max}]^{HSA}$.
3:     Each agent $n\in[N]$ computes policy $\pi^{n,t}\in\Pi$, resulting in the population density $\bar{\mu}^{t}$, and gets payment $\langle R^{t}(\bar{\mu}^{t}),\mu^{\pi^{n,t}}\rangle$ from the mediator.
4:     Mediator observes $\bar{\mu}^{t}$ and a trajectory $\{(s_{h}^{n,t},a_{h}^{n,t},r_{h}^{n,t}+\xi_{h}^{t})\}_{h\in[H]}$ generated by $\pi^{n,t}$ following Eq. (1), where $n\sim\text{Uniform}\{1,2,\dots,N\}$.
5: end for
Procedure 1 Agent-Mediator Interaction Protocol

At the end of each iteration, the mediator observes a trajectory sampled from a uniformly random agent, with noisy reward samples. We assume the noises $\xi_{h}^{t}$ are i.i.d. zero-mean $\sigma$-sub-Gaussian random variables. We also assume the mediator has access to the population density, which is necessary to estimate the unknown intrinsic reward function from samples.
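One round of Procedure 1 can be sketched in a small tabular setting as follows. This is a minimal illustration, not the authors' implementation: the array layouts `P[h, s, a, s']` and `pi[h, s, a]`, the initial distribution `mu1`, the helper names, and the reward `steering_reward` are all assumptions made here, and the trajectory observation in step 4 is omitted.

```python
import numpy as np

def occupancy(P, mu1, pi):
    """State-action density mu[h, s, a] of policy pi[h, s, a].

    Assumes density-independent transitions P[h, s, a, s'] and an
    initial state distribution mu1[s] (both hypothetical layouts).
    """
    H, S, A = pi.shape
    mu = np.zeros((H, S, A))
    state_dist = mu1.copy()
    for h in range(H):
        mu[h] = state_dist[:, None] * pi[h]
        # Push the layer-h density through the transition kernel.
        state_dist = np.einsum("sa,sat->t", mu[h], P[h])
    return mu

def play_round(P, mu1, policies, steering_reward):
    """Steps 2-3 of the protocol: population density and per-agent payments."""
    densities = [occupancy(P, mu1, pi) for pi in policies]
    mu_bar = np.mean(densities, axis=0)           # population density
    R = steering_reward(mu_bar)                   # realized steering reward
    payments = [float(np.sum(R * mu)) for mu in densities]
    return mu_bar, payments

rng = np.random.default_rng(0)
H, S, A, N = 3, 4, 2, 5
P = rng.dirichlet(np.ones(S), size=(H, S, A))     # random transition model
mu1 = rng.dirichlet(np.ones(S))
policies = [rng.dirichlet(np.ones(A), size=(H, S)) for _ in range(N)]
steering_reward = lambda mu: 1.0 - mu             # hypothetical reward in [0, 1]
mu_bar, payments = play_round(P, mu1, policies, steering_reward)
```

Because the payment is linear in the agent's density, the average of `payments` equals $\langle R^{t}(\bar{\mu}^{t}),\bar{\mu}^{t}\rangle$ in this sketch.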

3.2 Behavioral Assumptions on Agents

We first state our no-adaptive-regret assumption, then justify it and discuss its implication.

Assumption B (No-Adaptive Regret Behavior).

In Procedure 1, for every agent $n\in[N]$, the adaptive regret defined below is upper bounded by some term $\operatorname{AdaReg}(T)=(r_{\max}+R_{\max})\cdot o(T)$:

$$\max_{\substack{1\le a<b\le T\\ \mu\in\Psi_{M^{*}}}}\ \sum_{t=a}^{b}\langle r^{*}(\bar{\mu}^{t})+R^{t}(\bar{\mu}^{t}),\,\mu-\mu^{\pi^{n,t}}\rangle, \tag{2}$$

where $(r_{\max}+R_{\max})$ is the normalization term. In Appx. D.4 we show that, for all the steering rewards deployed in this paper, $R_{\max}=\mathcal{O}(1+r_{\max})$.

Justification for Assump. B We remark that it is common in previous literature to consider agents exhibiting no-regret behaviors (Deng et al., 2019; Zhang et al., 2024; Brown et al., 2024). Most of these works assume no-external regret (i.e., fixing $a=1$ and $b=T$ in Eq. (2)), which is weaker than our no-adaptive-regret assumption. However, similarly strong assumptions, such as no-dynamic-regret learners, have also been considered in some studies (Ge et al., 2024). Moreover, our no-adaptive-regret assumption is standard when interpreted through the lens of online linear optimization (Hazan and Seshadhri, 2007; Hazan, 2023), where in each iteration each agent picks a density from the convex set $\Psi_{M^{*}}$ and receives potentially adversarial feedback $R^{t}(\bar{\mu}^{t})$. Assump. B then aligns with the standard no-adaptive-regret guarantees in the online linear optimization setting, and there are very simple algorithms (e.g., Online Gradient Descent) achieving $\operatorname{AdaReg}=\tilde{O}(\sqrt{T})$. We defer a more detailed discussion to Appx. D.2.

Under Assump. B, we have the following property, which shows that the population as a whole also exhibits no-adaptive-regret behavior. We will leverage this property in our algorithm design.

Proposition 3.1 (No-Adaptive-Regret Population Behavior).

Under Assump. B, we have:

$$\max_{1\le a<b\le T,\ \mu\in\Psi_{M^{*}}}\ \sum_{t=a}^{b}\langle r^{*}(\bar{\mu}^{t})+R^{t}(\bar{\mu}^{t}),\,\mu-\bar{\mu}^{t}\rangle\ \le\ \operatorname{AdaReg}(T)=(r_{\max}+R_{\max})\cdot o(T).$$
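One way to see why Proposition 3.1 follows from Assump. B is the averaging argument sketched below. It is a reconstruction from the definitions above, not necessarily the authors' proof, and the shorthand $g^{t}:=r^{*}(\bar{\mu}^{t})+R^{t}(\bar{\mu}^{t})$ is introduced here only for brevity. The maxima range over $1\le a<b\le T$ and $\mu\in\Psi_{M^{*}}$.

```latex
\begin{align*}
\max_{a<b,\,\mu}\sum_{t=a}^{b}\langle g^{t},\mu-\bar{\mu}^{t}\rangle
 &= \max_{a<b,\,\mu}\frac{1}{N}\sum_{n=1}^{N}\sum_{t=a}^{b}
    \langle g^{t},\mu-\mu^{\pi^{n,t}}\rangle
 && \text{(linearity, } \bar{\mu}^{t}=\tfrac{1}{N}\textstyle\sum_{n}\mu^{\pi^{n,t}}\text{)}\\
 &\le \frac{1}{N}\sum_{n=1}^{N}\max_{a<b,\,\mu}\sum_{t=a}^{b}
    \langle g^{t},\mu-\mu^{\pi^{n,t}}\rangle
 && \text{(max of an average} \le \text{average of maxima)}\\
 &\le \operatorname{AdaReg}(T)
 && \text{(Assump.~B applied to each agent } n\text{)}.
\end{align*}
```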

3.3 Performance Metrics

Inspired by previous works (Zhang et al., 2024; Huang et al., 2024b), we evaluate a steering algorithm from two aspects: the steering gap and the steering cost. We give their concrete definitions in our MFG setup as follows.

The Steering Gap Intuitively, the steering gap measures the difference between the desired outcomes and the agents' behavior under the mediator's guidance. In this paper, we assume the mediator is given a utility function $U:\Delta_{\mathcal{S}\times\mathcal{A}}^{H}\to\mathbb{R}$ that assigns a utility value to each population density. The only assumption we make on $U$ is Lipschitz continuity:

Assumption C (Lipschitz Utility Function).

$$\forall\mu,\mu^{\prime}\in\Delta^{H}_{\mathcal{S}\times\mathcal{A}},\quad |U(\mu)-U(\mu^{\prime})|\le L_{U}\|\mu-\mu^{\prime}\|_{1}.$$

The steering gap up to step $T$ is defined by:

$$\Delta_{T}(\{\bar{\mu}^{t}\}_{t=1}^{T}):=\max_{\pi^{*}\in\Pi}\sum_{t=1}^{T}\bigl(U(\mu^{\pi^{*}})-U(\bar{\mu}^{t})\bigr).$$

Here $U(\bar{\mu}^{t})$ represents the utility received by the mediator at iteration $t\in[T]$, induced by the population density $\bar{\mu}^{t}$. Note that the comparator is the utility-maximizing density among those induced by a single policy. This can be interpreted as the best population density when all agents are restricted to follow the same policy, and finding the best shared policy is a standard objective in previous MFG literature.

The Steering Cost The motivation for introducing a steering cost is that the agents will not accept the mediator's guidance for free. A common measure of the cost is the expected total return that the agents receive from the steering reward. Formally, suppose at iteration $t$ the mediator computes a steering reward function $R^{t}$ and the $N$ agents select policies $\pi^{1,t},\pi^{2,t},\dots,\pi^{N,t}\in\Pi$, which induce the population density $\bar{\mu}^{t}:=\frac{1}{N}\sum_{n=1}^{N}\mu^{\pi^{n,t}}$. The steering cost is then defined as the average payment to the agents:

$$C(\bar{\mu}^{t},R^{t}):=\langle R^{t}(\bar{\mu}^{t}),\bar{\mu}^{t}\rangle=\frac{1}{N}\sum_{n=1}^{N}\langle R^{t}(\bar{\mu}^{t}),\mu^{\pi^{n,t}}\rangle.$$

We will use

$$C_{T}(\{\bar{\mu}^{t},R^{t}\}_{t=1}^{T}):=\sum_{t=1}^{T}C(\bar{\mu}^{t},R^{t})$$

to denote the cumulative steering cost. Because the steering rewards are non-negative, the steering cost effectively reflects the strength of the steering signal.
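The two metrics can be evaluated on a recorded run as sketched below. This is only an illustration: the function names, the toy sequence of densities, the utility, and the reward are hypothetical, and the comparator value $\max_{\pi^{*}\in\Pi}U(\mu^{\pi^{*}})$ is passed in as a known number rather than computed.

```python
import numpy as np

def steering_cost(mu_bars, reward_fns):
    """Cumulative steering cost: sum_t <R^t(mu_bar^t), mu_bar^t>."""
    return sum(float(np.sum(R(mu) * mu)) for mu, R in zip(mu_bars, reward_fns))

def steering_gap(mu_bars, utility, best_utility):
    """Cumulative steering gap, with max_pi U(mu^pi) given as best_utility."""
    return sum(best_utility - utility(mu) for mu in mu_bars)

# Hypothetical toy run with H = S = 1 and A = 2: the population drifts
# towards action 0, which the (hypothetical) utility prefers.
T = 1000
mu_bars = [np.array([[[1 - 0.5 / t, 0.5 / t]]]) for t in range(1, T + 1)]
utility = lambda mu: float(mu[0, 0, 0])                     # maximal value is 1
reward = lambda mu: np.array([[[2 * mu[0, 0, 1], 0.0]]])    # non-negative, shrinks over time

gap = steering_gap(mu_bars, utility, best_utility=1.0)
cost = steering_cost(mu_bars, [reward] * T)
print(gap / T, cost / T)  # both averages are small for this toy run
```

In this toy run both cumulative quantities grow only logarithmically in $T$, which is one instance of the sub-linear behavior the objectives below ask for.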

3.4 Two Steering Scenarios and Objectives

In this paper, we consider the case where the mediator does not know the true transition and reward functions of $M^{*}$. However, to make our algorithm design and technical contributions easier to follow, we start with a special case where the agents have no intrinsic rewards, i.e., $r^{*}=0$.

Scenario 1: No Intrinsic Reward The goal in this setting is to find an incentive design algorithm producing a sequence of $R^{t}$ such that both the steering gap and the steering cost are sub-linear:

$$\Delta_{T}(\{\bar{\mu}^{t}\}_{t=1}^{T})=o(T),\qquad C_{T}(\{\bar{\mu}^{t},R^{t}\}_{t=1}^{T})=o(T).$$

The motivation for the sub-linear guarantees is that they imply the average utility converges to the maximum and the average steering cost vanishes. In other words, an incentive design strategy fulfilling these guarantees eventually pays off as a long-term investment. In Sec. 5, we analyze this case and provide algorithms achieving our objective.

Scenario 2: Non-Zero Intrinsic Reward In Sec. 6, we study the complete setting where the agents' original reward is non-zero and unknown. In this case, the mediator additionally has to estimate the reward function from observed noisy samples and steer the agents based on that estimate. As before, we expect a sub-linear steering gap $\Delta_{T}(\{\bar{\mu}^{t}\}_{t=1}^{T})=o(T)$, while for the steering cost we manually choose the “sandboxing reward” as the comparator:

$$C_{T}(\{\bar{\mu}^{t},R^{t}-\underbrace{(r_{\max}\cdot\mathbf{1}-r^{*})}_{\text{sandboxing reward}}\}_{t=1}^{T})=o(T).$$

Here we use $\mathbf{1}$ to denote the all-ones vector. Intuitively, because the intrinsic rewards are non-zero, if the desired behavior $\mu^{\pi^{*}}$ is not an equilibrium induced by $r^{*}$, the mediator has to maintain a non-zero steering reward to prevent the agents from deviating from $\pi^{*}$, so we cannot expect the average steering cost to vanish as in Scenario 1. Therefore, we consider the sandboxing reward as a baseline comparator: it cancels the differences in the intrinsic rewards, so that $\pi^{*}$ remains a “stable equilibrium” even if the additional steering reward $R^{t}-(r_{\max}\cdot\mathbf{1}-r^{*}(\bar{\mu}^{t}))$ vanishes. We acknowledge that other choices of sandboxing terms may result in a lower steering cost, and that one could instead optimize utility and steering cost jointly. We leave these interesting directions to future work.
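The stabilizing effect of the sandboxing reward can be made explicit with a one-line computation. The sketch below assumes that every density $\mu\in\Delta^{H}_{\mathcal{S}\times\mathcal{A}}$ has each of its $H$ layers summing to one, and that $r^{*}\le r_{\max}$ entrywise so that the sandboxing reward is non-negative; both are assumptions read off the notation, not statements taken from the text.

```latex
\begin{align*}
\big\langle\, r^{*}(\bar{\mu}^{t})
   + \big(r_{\max}\cdot\mathbf{1}-r^{*}(\bar{\mu}^{t})\big),\ \mu \,\big\rangle
 \;=\; r_{\max}\,\langle \mathbf{1},\mu\rangle
 \;=\; r_{\max}\,H
 \qquad \text{for every } \mu\in\Delta^{H}_{\mathcal{S}\times\mathcal{A}} .
\end{align*}
```

Under the sandboxing reward alone, every policy therefore earns the same total return, so no agent gains by deviating from $\pi^{*}$, and only the remaining part $R^{t}-(r_{\max}\cdot\mathbf{1}-r^{*}(\bar{\mu}^{t}))$ has to do the steering.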

4 STEERING TOWARDS A FIXED TARGET

In this section, we focus on how to design rewards that guide the agents to a target population density or policy, which serves as a preparation step for the following sections. For convenience, we assume the agents' intrinsic rewards are zero and ignore them.

4.1 Warm-Up: Steering in a Known Model

We start with the case where the MFG model is known. In this case, we can compute $\Psi_{M^{*}}$ and find the best $\mu^{*}=\operatorname*{arg\,max}_{\mu\in\Psi_{M^{*}}}U(\mu)$. If we want to steer the population to $\mu^{*}$, one choice of steering reward is $R(\mu)=\mathbf{1}\|\mu^{*}-\mu\|_{\infty}+\mu^{*}-\mu$, where the first term is a shift that ensures non-negativity. The key motivation for this choice is that $\langle R(\mu),\mu^{*}-\mu\rangle=\|\mu-\mu^{*}\|_{2}^{2}$: the shift term contributes nothing to the inner product, because $\mu$ and $\mu^{*}$ carry the same total mass and hence $\langle\mathbf{1},\mu^{*}-\mu\rangle=0$. As a result, for the cumulative performance, we have the following theorem.
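The two properties of this steering reward, non-negativity and the inner-product identity, can be checked numerically. The sketch below is only a sanity check under stated assumptions: the function name is hypothetical, and the densities are random points of $\Delta^{H}_{\mathcal{S}\times\mathcal{A}}$ (each layer sums to one) rather than elements of $\Psi_{M^{*}}$, which is all the identity needs.

```python
import numpy as np

def target_steering_reward(mu, mu_star):
    """R(mu) = (mu_star - mu) + 1 * ||mu_star - mu||_inf."""
    diff = mu_star - mu
    return diff + np.max(np.abs(diff))

rng = np.random.default_rng(0)
H, S, A = 3, 4, 2
# Random densities: each of the H layers is a distribution over S x A.
mu = rng.dirichlet(np.ones(S * A), size=H).reshape(H, S, A)
mu_star = rng.dirichlet(np.ones(S * A), size=H).reshape(H, S, A)

R = target_steering_reward(mu, mu_star)
lhs = np.sum(R * (mu_star - mu))        # <R(mu), mu_star - mu>
rhs = np.sum((mu - mu_star) ** 2)       # ||mu - mu_star||_2^2
print(R.min(), lhs, rhs)
```

The shift adds the same constant to every entry, and a constant vector is orthogonal to the difference of two densities, which is why `lhs` and `rhs` agree up to floating-point error.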

Theorem 4.1.

If $M^{*}$ is known and $r^{*}$ is zero everywhere, then under Assump. B and C, by choosing the steering reward $R^{t}(\mu)=\mu^{*}-\mu+\mathbf{1}\|\mu^{*}-\mu\|_{\infty}$ for all $t\in[T]$ and any $\mu\in\Delta_{\mathcal{S}\times\mathcal{A}}^{H}$, we have:

$$\Delta_{T}(\{\bar{\mu}^{t}_{M^{*}}\}_{t=1}^{T})\le L_{U}\sqrt{HSAT\operatorname{AdaReg}(T)}=o(T),$$
$$C_{T}(\{\bar{\mu}^{t}_{M^{*}},R^{t}\}_{t=1}^{T})\le 2H\sqrt{T\operatorname{AdaReg}(T)}=o(T).$$

The bounds above for the known-model setting, although possibly not tight, serve as a benchmark for the more challenging unknown-model settings. We will see that the bounds in Theorems 5.1 and 6.1 (for the unknown-model setting) are not much worse than those of Theorem 4.1.
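A short derivation that reproduces the constants of Theorem 4.1 is sketched below. It is a reconstruction from Proposition 3.1 and the identity $\langle R(\mu),\mu^{*}-\mu\rangle=\|\mu-\mu^{*}\|_{2}^{2}$, not necessarily the authors' proof. The shorthand $\delta^{t}:=\mu^{*}-\bar{\mu}^{t}$ is introduced here, and the sketch assumes that $\Psi_{M^{*}}$ is exactly the set of densities induced by policies in $\Pi$, so that $\max_{\pi^{*}\in\Pi}U(\mu^{\pi^{*}})=U(\mu^{*})$.

```latex
\begin{align*}
\sum_{t=1}^{T}\|\delta^{t}\|_{2}^{2}
 &= \sum_{t=1}^{T}\langle R^{t}(\bar{\mu}^{t}),\,\mu^{*}-\bar{\mu}^{t}\rangle
 \;\le\; \operatorname{AdaReg}(T)
 && \text{(Prop.~3.1 with } r^{*}=0,\ a=1,\ b=T,\ \mu=\mu^{*}\text{)}\\[2pt]
\Delta_{T}
 &\le L_{U}\sum_{t=1}^{T}\|\delta^{t}\|_{1}
 \;\le\; L_{U}\sqrt{HSA}\sum_{t=1}^{T}\|\delta^{t}\|_{2}
 && \text{(Assump.~C, then } \|x\|_{1}\le\sqrt{HSA}\,\|x\|_{2}\text{)}\\
 &\le L_{U}\sqrt{HSA\cdot T\sum_{t=1}^{T}\|\delta^{t}\|_{2}^{2}}
 \;\le\; L_{U}\sqrt{HSAT\operatorname{AdaReg}(T)}
 && \text{(Cauchy--Schwarz)}\\[2pt]
C(\bar{\mu}^{t},R^{t})
 &= \langle\delta^{t},\bar{\mu}^{t}\rangle
    + \|\delta^{t}\|_{\infty}\langle\mathbf{1},\bar{\mu}^{t}\rangle
 \;\le\; 2H\|\delta^{t}\|_{\infty}
 \;\le\; 2H\|\delta^{t}\|_{2}
 && \text{(H\"older, } \|\bar{\mu}^{t}\|_{1}=H\text{)}\\
C_{T}
 &\le 2H\sum_{t=1}^{T}\|\delta^{t}\|_{2}
 \;\le\; 2H\sqrt{T\operatorname{AdaReg}(T)}
 && \text{(Cauchy--Schwarz)}.
\end{align*}
```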

4.2 Steering towards a Target Policy in an Unknown Model

Without knowledge of the transition function $\mathbb{P}^{*}$, steering becomes challenging, because we can no longer compute $\Psi_{M^{*}}$ or determine whether a given density (e.g., $\operatorname*{arg\,max}_{\mu\in\Delta_{\mathcal{S}\times\mathcal{A}}^{H}}U(\mu)$) can actually be achieved by the agents. Therefore, we shift our focus to the policy space. Interestingly, it is possible to steer the agents to any target policy $\pi\in\Pi$, even without knowledge of $\mathbb{P}^{*}$ or $\Psi_{M^{*}}$. Our key observation is the following lemma, which upper bounds the difference between the population density and the density induced by the target policy.

Lemma 4.2.

Given any $M$ and target $\pi\in\Pi$, suppose the agents induce the population density $\bar{\mu}_{M}$ in $M$. Then:

$$\|\bar{\mu}_{M}-\mu^{\pi}_{M}\|_{1}\le H\sum_{h,s}\bar{\mu}_{M,h}(s)\,\|\bar{\pi}_{h}(\cdot|s)-\pi_{h}(\cdot|s)\|_{1},\quad\text{with }\bar{\mu}_{M,h}(s):=\sum_{a}\bar{\mu}_{M,h}(s,a). \tag{3}$$
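Lemma 4.2 can be sanity-checked on a random tabular instance, as in the sketch below. Several ingredients are assumptions made for illustration: the array layouts, the fixed initial distribution `mu1`, and in particular the reading of $\bar{\pi}$ as the policy obtained by normalizing the population density, $\bar{\pi}_{h}(a|s)=\bar{\mu}_{M,h}(s,a)/\bar{\mu}_{M,h}(s)$.

```python
import numpy as np

def occupancy(P, mu1, pi):
    """State-action density mu[h, s, a] of pi under transitions P[h, s, a, s']."""
    H, S, A = pi.shape
    mu = np.zeros((H, S, A))
    state_dist = mu1.copy()
    for h in range(H):
        mu[h] = state_dist[:, None] * pi[h]
        state_dist = np.einsum("sa,sat->t", mu[h], P[h])
    return mu

rng = np.random.default_rng(1)
H, S, A, N = 4, 5, 3, 6
P = rng.dirichlet(np.ones(S), size=(H, S, A))
mu1 = rng.dirichlet(np.ones(S))
target = rng.dirichlet(np.ones(A), size=(H, S))               # target policy pi
agents = [rng.dirichlet(np.ones(A), size=(H, S)) for _ in range(N)]

mu_bar = np.mean([occupancy(P, mu1, pi) for pi in agents], axis=0)
state_marginal = mu_bar.sum(axis=2)                           # mu_bar_h(s)
pi_bar = mu_bar / state_marginal[:, :, None]                  # assumed definition of pi_bar

lhs = np.abs(mu_bar - occupancy(P, mu1, target)).sum()
rhs = H * np.sum(state_marginal * np.abs(pi_bar - target).sum(axis=2))
print(lhs, rhs)
```

The right-hand side only involves the observed population density and the target policy, not the transition model, which is what makes it usable as a penalty when the model is unknown.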

This motivates us to design a steering reward function that penalizes the RHS of Eq. (3), which is doable without knowledge of the model. Given a policy $\pi$, we define the matrix $W^{\pi}\in\mathbb{R}^{SAH\times SAH}$ to be the block-diagonal matrix with blocks $W^{\pi}_{h,s_{h}}$ for all $h\in[H]$ and $s_{h}\in\mathcal{S}$, where

$$W^{\pi}_{h,s}:=\begin{bmatrix}\pi_{h}(a_{1}|s)&\ldots&\pi_{h}(a_{1}|s)\\ \vdots&&\vdots\\ \pi_{h}(a_{A}|s)&\ldots&\pi_{h}(a_{A}|s)\end{bmatrix}\in\mathbb{R}^{A\times A}.\tag{4}$$

Now, consider the steering reward function:

$$\forall\mu\in\Delta^{H}_{\mathcal{S}\times\mathcal{A}},\quad R_{\pi}(\mu):=-\mu^{\top}(W^{\pi}-I)^{\top}(W^{\pi}-I),\tag{5}$$

where $I\in\mathbb{R}^{SAH\times SAH}$ is the identity matrix. We can verify that, for any possible population density $\bar{\mu}^{t}$ occurring at step $t$, we have

$$\begin{aligned}\langle R_{\pi}(\bar{\mu}^{t}),\mu^{\pi}-\bar{\mu}^{t}\rangle&=\|(W^{\pi}-I)\bar{\mu}^{t}\|_{2}^{2}\\&=\sum_{h,s,a}(\bar{\mu}_{h}^{t}(s))^{2}\,|\pi_{h}(a|s)-\bar{\pi}_{h}^{t}(a|s)|^{2}.\end{aligned}\tag{6}$$

Recall that $\bar{\pi}_{h}^{t}$ denotes the policy induced by the population density (see definition in Sec. 2). In the first equality, we use the fact that, for any $\mu$, $\langle R_{\pi}(\mu),\mu^{\pi}\rangle=0$ since $(W^{\pi}-I)\mu^{\pi}=0$. Eq. (6) is important in that it connects the one-step regret (LHS) with the gap between the population density and the target density (RHS, through Lemma 4.2).
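The identity in Eq. (6) and the bound in Lemma 4.2 can be checked numerically on a small random tabular model. The following is a minimal sketch, not the authors' code: the sizes `H, S, A`, the helper names `random_policy`, `density` and `build_W`, and the random transitions `P` are all assumptions made for illustration, and the population is represented by a single induced policy `pi_bar`.

```python
import numpy as np

rng = np.random.default_rng(0)
H, S, A = 3, 4, 2  # hypothetical horizon, state and action counts

def random_policy():
    # Shape (H, S, A); each pi[h, s] is a distribution over actions.
    return rng.dirichlet(np.ones(A), size=(H, S))

def density(policy, P, mu1):
    """State-action density of `policy` via forward recursion.
    P[h, s, a] is a distribution over next states, mu1 the initial state law."""
    mu = np.zeros((H, S, A))
    state = mu1
    for h in range(H):
        mu[h] = state[:, None] * policy[h]
        state = np.einsum("sa,sap->p", mu[h], P[h])
    return mu

def build_W(policy):
    """Block-diagonal W^pi of Eq. (4): every column of block (h, s) is pi_h(.|s)."""
    W = np.zeros((H * S * A, H * S * A))
    for h in range(H):
        for s in range(S):
            i = (h * S + s) * A  # matches the C-order flattening of (H, S, A)
            W[i:i + A, i:i + A] = np.tile(policy[h, s][:, None], (1, A))
    return W

P = rng.dirichlet(np.ones(S), size=(H, S, A))   # density-independent transitions
mu1 = rng.dirichlet(np.ones(S))
pi, pi_bar = random_policy(), random_policy()   # target and population policies
mu_pi, mu_bar = density(pi, P, mu1), density(pi_bar, P, mu1)

D = build_W(pi) - np.eye(H * S * A)
R = -mu_bar.ravel() @ D.T @ D                   # steering reward R_pi(mu_bar), Eq. (5)

# Three expressions appearing in Eq. (6).
lhs = R @ (mu_pi.ravel() - mu_bar.ravel())
mid = np.sum((D @ mu_bar.ravel()) ** 2)
state_mass = mu_bar.sum(axis=2)                 # \bar{mu}_h(s)
rhs = np.sum(state_mass[:, :, None] ** 2 * (pi - pi_bar) ** 2)

# Both sides of Lemma 4.2, Eq. (3).
lemma_lhs = np.abs(mu_bar - mu_pi).sum()
lemma_rhs = H * np.sum(state_mass * np.abs(pi_bar - pi).sum(axis=2))
```

Note that computing `R` only needs the target policy and the observed density, not `P`; the transitions are used here solely to generate densities for the check.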

Combining with Prop. 3.1, if all the agents are no-regret learners and we steer them with the same steering reward $R_{\pi}$ for $T$ steps, we should expect $\bar{\mu}^{t}$ to converge to $\mu^{\pi}$, which we summarize in the following theorem. This result provides important insights for our incentive design algorithm in Section 5.

Theorem 4.3.

Let $\pi^{*}=\operatorname*{arg\,max}_{\pi}U(\mu^{\pi})$ and $R^{t}(\mu)=R_{\pi^{*}}(\mu)+\|R_{\pi^{*}}(\mu)\|_{\infty}\bm{1}$ for all $t$. Under Assump. B,

$$\begin{aligned}\Delta_{T}(\{\bar{\mu}^{t}_{M^{*}}\}_{t=1}^{T})&\leq L_{U}\sqrt{H^{3}SAT\operatorname*{AdaReg}(T)}=o(T),\\ C_{T}(\{\bar{\mu}^{t}_{M^{*}},R^{t}\}_{t=1}^{T})&\leq 4H\sqrt{T\operatorname*{AdaReg}(T)}=o(T).\end{aligned}$$

5 STEERING WITH NO INTRINSIC REWARD

In this section, we study Scenario 1 introduced in Sec. 3.4, where the transition function $\mathbb{P}^{*}$ is unknown and the original reward $r^{*}$ is zero, so the steering rewards are the only incentives for the agents. The main challenge in this setting is that, without knowledge of $\mathbb{P}^{*}$, we cannot determine the feasible density set $\Psi_{M^{*}}$ or the maximizer of the utility function. Therefore, we have to design a steering strategy that incentivizes the agents to explore so that the mediator can estimate $\mathbb{P}^{*}$, while balancing the exploration-exploitation trade-off to ensure a sub-linear steering gap and cost.

Our main contribution is an optimism-based exploration algorithm, Alg. 2, which provably addresses the above challenges and achieves our objectives. The algorithm builds on the techniques we developed in Sec. 4.2, which allow us to steer the agents to any target policy without knowledge of the model. Next, we introduce the key components of the algorithm design.

Algorithm 2 Steering reward design for Scenario 1
1: Initialize $\mathcal{P}^{1}:=$ set of all possible transition functions, $\pi_{*}^{1}$ (arbitrarily), $k=1$, $T_{0}=0$.
2: for $t=1,\ldots,T$ do
3:     $\triangleright$ Recall $R_{\pi_{*}^{k}}$ as defined in Eq. (5)
4:     Compute steering reward function $R^{t}_{\text{z}}(\cdot)\leftarrow R_{\pi_{*}^{k}}(\cdot)+\|R_{\pi_{*}^{k}}(\cdot)\|_{\infty}\bm{1}$.
5:     Agents play the $t$-th game.
6:     Obtain trajectory $((s^{t}_{h},a^{t}_{h}))_{h=1}^{H}$.
7:     if $\exists(h,s,a)$ s.t. $n_{k}(h,s,a)\geq N_{k}(h,s,a)$ then
8:         Update $\mathcal{P}^{k+1}$ as in Eq. (7).
9:         $T_{k}\leftarrow t$; $k\leftarrow k+1$.
10:         $\pi_{*}^{k},\hat{M}^{k}\leftarrow\operatorname*{arg\,max}_{\pi\in\Pi,\,\hat{M}:\hat{\mathbb{P}}_{\hat{M}}\in\mathcal{P}^{k}}U(\mu^{\pi}_{\hat{M}})$.
11:     end if
12: end for

Low Policy Switching Optimistic Exploration Strategy. For efficient exploration, we maintain a confidence set for $\mathbb{P}^{*}$ denoted by $\mathcal{P}$:

$$\begin{aligned}\bar{\mathbb{P}}^{k+1}_{h}(s'|s,a)&:=\sum_{t=1}^{T_{k}}\frac{\mathbb{I}\{s^{t}_{h}=s,a^{t}_{h}=a,s^{t}_{h+1}=s'\}}{\max\{1,N_{k+1}(h,s,a)\}},\\ \mathcal{P}^{k+1}&:=\Big\{\hat{\mathbb{P}}:\forall h,s,a,\ \|\hat{\mathbb{P}}_{h}(\cdot|s,a)-\bar{\mathbb{P}}^{k+1}_{h}(\cdot|s,a)\|_{1}\leq\varepsilon_{k+1}(h,s,a)\Big\},\end{aligned}\tag{7}$$

where $\varepsilon_{k+1}(h,s,a):=\sqrt{\frac{2S\ln(THSA/\delta)}{\max\{1,N_{k+1}(h,s,a)\}}}$. We highlight that we only update $\mathcal{P}$ and switch the target policy at low frequency. We use the index $1\leq k\leq K$ to count the policy switching episodes, to distinguish them from the steering steps $t\in[T]$. We use $k(t)$ to denote the index of the episode at iteration $t$ and $T_{k}$ to denote the iteration number at the end of the $k$-th policy switching episode.
We define $n_{k}(h,s,a)$ to be the number of samples equal to $(s,a)$ at time $h$ in episode $k$, and $N_{k}(h,s,a)=\sum_{k'<k}n_{k'}(h,s,a)$. A new episode begins as soon as, for some $(h,s,a)$, we have as many samples in this episode as in all the previous ones, i.e., $n_{k}(h,s,a)\geq N_{k}(h,s,a)$. The main motivation for this technique is to guard against the agents' potentially adversarial behaviors. As we will see later in the proof sketch, $K$ appears in the steering gap upper bound.
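The bookkeeping behind the confidence set in Eq. (7) and the doubling-style switching rule can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's implementation: the function name `run_switching` and the trajectory format (a list of `(s, a, s_next)` tuples, one per time step `h`) are hypothetical, the trajectories are drawn uniformly at random rather than produced by learning agents, and the trigger uses `max(1, N)` on the right-hand side (a guard that is common in doubling schemes and is an assumption here) so that unvisited triples do not trigger a switch. The optimistic planning step (line 10 of Alg. 2) is omitted.

```python
import math
import numpy as np

def run_switching(trajectories, H, S, A, delta=0.1):
    """Track counts, switch episodes by the doubling rule, and return the
    empirical model and L1 radii computed at the last switch."""
    T = len(trajectories)
    N = np.zeros((H, S, A))         # N_k: samples from all previous episodes
    n = np.zeros((H, S, A))         # n_k: samples in the current episode
    trans = np.zeros((H, S, A, S))  # transition counts over all steps so far
    switch_times = []               # T_1, T_2, ...
    P_bar = eps = None
    for t, traj in enumerate(trajectories, start=1):
        for h, (s, a, s_next) in enumerate(traj):
            n[h, s, a] += 1
            trans[h, s, a, s_next] += 1
        # Assumed guard: compare against max(1, N) instead of N.
        if np.any(n >= np.maximum(1, N)):
            N += n                  # N_{k+1} = N_k + n_k
            n[:] = 0
            denom = np.maximum(1, N)
            P_bar = trans / denom[..., None]  # empirical transitions, Eq. (7)
            eps = np.sqrt(2 * S * math.log(T * H * S * A / delta) / denom)
            switch_times.append(t)
    return switch_times, P_bar, eps

# Usage on synthetic uniform data (hypothetical sizes).
rng = np.random.default_rng(0)
H, S, A, T = 2, 3, 2, 500
trajs = [[(rng.integers(S), rng.integers(A), rng.integers(S)) for _ in range(H)]
         for _ in range(T)]
switch_times, P_bar, eps = run_switching(trajs, H, S, A)
```

With this guard, each triple $(h,s,a)$ at least doubles its count whenever it triggers a switch, so the number of switches in the sketch is at most $HSA(1+\log_{2}T)$, which matches the logarithmic growth of $K$ discussed in the text up to the additive term.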

For exploration, we select the optimistic policy $\pi_{*}^{(\cdot)}$ and model $\hat{M}^{(\cdot)}$ (line 10) such that the induced density maximizes the utility. Then, we choose the steering reward $R_{\text{z}}$ to guide the agents towards $\pi_{*}^{(\cdot)}$ and collect data samples to update the model confidence set. Intuitively, either $\pi_{*}^{(\cdot)}$ indeed maximizes the utility, implying a low steering gap, or the exploration helps to reduce the uncertainty.

Managing the steering gap and cost. We have the following guarantees for Alg. 2.

Theorem 5.1.

Suppose the intrinsic reward $r^{*}=0$. Under Assump. B and C, if we run Alg. 2 with $\delta\in(0,1)$, then with probability at least $1-2\delta$, $K\leq HSA\log_{2}T$, and

$$\begin{aligned}\Delta_{T}(\{\bar{\mu}^{t}_{M^{*}}\}_{t=1}^{T})&\leq L_{U}\sqrt{H^{3}SATK\operatorname*{AdaReg}(T)}+36L_{U}H^{3}S\sqrt{\ln(THSA/\delta)AT}=o(T),\\ C_{T}(\{\bar{\mu}^{t}_{M^{*}},R_{\text{z}}^{t}\}_{t=1}^{T})&\leq 4H\sqrt{TK\operatorname*{AdaReg}(T)}=o(T).\end{aligned}$$

As a concrete example, agents following Online Gradient Descent with step size $O(1/\sqrt{t})$ (Hazan, 2023) achieve $\operatorname*{AdaReg}(T)=\tilde{\mathcal{O}}(\sqrt{T})$ (ignoring $H$, $S$ and $A$), which implies an $\tilde{\mathcal{O}}(T^{3/4})$ steering gap. Moreover, if all the agents are capable enough that, for any $t\in[T]$, $\pi^{1,t},\ldots,\pi^{N,t}$ are equilibria w.r.t. $r^{*}+R_{\text{z}}^{t}$, then $\operatorname*{AdaReg}$ would be constant-level, resulting in an $\tilde{\mathcal{O}}(\sqrt{T})$ bound.
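To see where these rates come from, one can substitute the regret rates into the bounds of Thm. 5.1. The following is a rough calculation that hides $L_{U}$, $H$, $S$, $A$ and logarithmic factors inside $\tilde{\mathcal{O}}$ and uses $K\leq HSA\log_{2}T$:

```latex
\begin{align*}
\text{OGD agents, } \operatorname{AdaReg}(T)=\tilde{\mathcal{O}}(\sqrt{T}):\quad
  &\sqrt{T\cdot K\cdot\operatorname{AdaReg}(T)}
   =\tilde{\mathcal{O}}\big(\sqrt{T\cdot\log T\cdot\sqrt{T}}\big)
   =\tilde{\mathcal{O}}(T^{3/4}),\\
\text{equilibrium agents, } \operatorname{AdaReg}(T)=\mathcal{O}(1):\quad
  &\sqrt{T\cdot K\cdot\operatorname{AdaReg}(T)}
   =\tilde{\mathcal{O}}\big(\sqrt{T\log T}\big)
   =\tilde{\mathcal{O}}(\sqrt{T}),\\
\text{model estimation term:}\quad
  &\sqrt{\ln(THSA/\delta)\,AT}=\tilde{\mathcal{O}}(\sqrt{T}).
\end{align*}
```

In the first case the agents' regret term dominates the estimation term, while in the second case both terms are of the same order in $T$.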

Proof Sketch. We first analyze the steering gap. Intuitively, Alg. 2 can be interpreted as a "$K$-stage" version of what we did in Sec. 4.2. In each stage, we pick a target policy and steer the agents towards it for exploration. Following this intuition, and thanks to the Lipschitz condition (Assump. C) and the optimism in planning, we can decompose the steering gap as follows:

$$\begin{aligned}\Delta_{T}(\{\bar{\mu}^{t}_{M^{*}}\}_{t=1}^{T})\leq{}&L_{U}(2H+1)\underbrace{\sum_{t=1}^{T}\|\mu^{\bar{\pi}^{t}}_{\hat{M}^{k(t)}}-\bar{\mu}^{t}_{M^{*}}\|_{1}}_{\Delta_{\text{est}}}\\&+L_{U}H\underbrace{\sum_{t=1}^{T}\sum_{h,s}\bar{\mu}_{M^{*},h}^{t}(s)\,\|\pi^{k(t)}_{*,h}(\cdot|s)-\bar{\pi}^{t}_{h}(\cdot|s)\|_{1}}_{\Delta_{\text{pop}}}.\end{aligned}\tag{8}$$

We refer to the first term $\Delta_{\text{est}}$ as the model estimation error, which measures the gap between the population density $\bar{\mu}^{t}$ and the density induced by the population average policy $\bar{\pi}^{t}$ (see definition in Sec. 2) in the estimated model $\hat{M}^{k}$. As we collect more data, $\hat{M}^{k}$ gets closer to $M^{*}$, and we can show that $\Delta_{\text{est}}$ only grows sub-linearly. The second term $\Delta_{\text{pop}}$ can be interpreted as the population convergence error, which is determined by how fast the agents converge to the target policy we steer them to. Following techniques similar to those in the proof of Thm. 4.3, $\Delta_{\text{pop}}$ can be upper bounded by:

$$\sqrt{HSAT\underbrace{\sum_{t=1}^{T}\langle R_{\pi_{*}^{k(t)}}(\bar{\mu}^{t}_{M^{*}}),\mu_{M^{*}}^{\pi_{*}^{k(t)}}-\bar{\mu}^{t}_{M^{*}}\rangle}_{\texttt{AgentReg}}}.\tag{9}$$

Here we use AgentReg to refer to the summation term, which can be interpreted as the agents' dynamic regret when choosing $\mu_{M^{*}}^{\pi_{*}^{k(t)}}$ as the comparators. Thanks to the low policy switching, AgentReg can be controlled by $O(K \operatorname{AdaReg}(T))$, and the only remaining step is to control $K$. Note that we only switch policy when the number of visits to some state-action pair has doubled; therefore, $K$ grows only as $O(\log T)$ in $T$.
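The logarithmic growth of $K$ under a doubling rule can be checked numerically. The following is a minimal sketch, not the paper's Alg. 2: it assumes one visited pair per round, a flat index over state-action pairs, and a hypothetical helper name `count_doubling_switches`. Each pair can trigger at most $1 + \log_2 T$ switches, so the total is at most `num_pairs * (1 + log2(T))`.

```python
import math
import random


def count_doubling_switches(visit_stream, num_pairs):
    """Count policy switches under a doubling rule (illustrative only).

    A switch is triggered when the visit count of some pair reaches twice
    the count recorded at the previous switch (or reaches 1 for a pair that
    was unvisited at the previous switch).
    """
    counts = [0] * num_pairs            # running visit counts
    counts_at_switch = [0] * num_pairs  # snapshot taken at the last switch
    switches = 0
    for pair in visit_stream:
        counts[pair] += 1
        if counts[pair] >= max(1, 2 * counts_at_switch[pair]):
            switches += 1
            counts_at_switch = list(counts)
    return switches


if __name__ == "__main__":
    random.seed(0)
    T, num_pairs = 10_000, 6  # hypothetical sizes
    stream = [random.randrange(num_pairs) for _ in range(T)]
    k = count_doubling_switches(stream, num_pairs)
    print(k, "switches; upper bound:", num_pairs * (1 + math.log2(T)))
```

With a single pair visited 8 times, the rule fires at counts 1, 2, 4 and 8, that is, 4 switches.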

For the steering cost, we can calculate that

$$C(\bar{\mu}^{t}_{M^{*}}, R^{t}_{\text{z}}) \leq 2H \left\| R_{\pi_{*}^{k(t)}}(\bar{\mu}^{t}_{M^{*}}) \right\|_{\infty},$$

and for any $\pi, \mu$, $\|R_{\pi}(\mu)\|_{\infty} \leq 2\|(W^{\pi} - I)\mu\|_{2}$, which, by Eq. (6), is equal to $2\sqrt{\langle R_{\pi}(\mu), \mu - \mu^{\pi} \rangle}$. Using Jensen's inequality and Assump. B, we derive the final bound.
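To make the use of Jensen's inequality concrete, the per-round bounds above can be summed as in the sketch below. Here $x_t$ is a shorthand introduced only for this illustration: it denotes the inner-product term under the square root at round $t$. The part of the argument that relies on Assump. B is omitted.

```latex
\begin{align*}
\sum_{t=1}^{T} C(\bar{\mu}^{t}_{M^{*}}, R^{t}_{\text{z}})
  &\leq 2H \sum_{t=1}^{T} \left\| R_{\pi_{*}^{k(t)}}(\bar{\mu}^{t}_{M^{*}}) \right\|_{\infty} \\
  &\leq 4H \sum_{t=1}^{T} \sqrt{x_t} \\
  &\leq 4H \sqrt{T \sum_{t=1}^{T} x_t}
  && \text{(Jensen: } \tfrac{1}{T}\textstyle\sum_t \sqrt{x_t} \leq \sqrt{\tfrac{1}{T}\sum_t x_t}\text{)}.
\end{align*}
```

The factor $4H\sqrt{T(\cdot)}$ has the same form as the leading term of the steering cost bound in Thm. 6.1.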

6 STEERING WITH NON-ZERO INTRINSIC REWARD

Next, we turn to Scenario 2 in Sec. 3.4, the complete setting where the agents' pre-existing reward function $r^{*} \in [0, r_{\max}]$ is both non-zero and unknown. The non-zero intrinsic reward introduces non-trivial additional challenges. First, it changes the steering landscape and introduces a prior bias into our steering reward design. Second, since it is unknown, we must account for its interference with the steering dynamics and undertake strategic exploration to estimate $r^{*}$. In the following, we explain how we overcome these challenges with a pessimism-based reward estimation strategy.

Confidence set for $r^{*}$ We recall our setup in Sec. 3.1: the mediator can observe the population density $\bar{\mu}^{t}$ and the noisy reward $r^{t} = (r^{*}_{h}(s_{h}^{t}, a_{h}^{t}, \bar{\mu}^{t}_{M^{*},h}) + \xi_{h})_{h \in [H]}$, perturbed by i.i.d. zero-mean $\sigma$-sub-Gaussian noise $\xi$. We use this information to estimate the original reward. At each iteration $t$, we maintain a confidence set $\hat{\mathcal{R}}^{t}$ for $r^{*}$, defined by:

$$\begin{aligned}
\hat{\mathcal{R}}^{t} &:= \left\{ \hat{r} \in \mathcal{R} : \|\hat{r} - \bar{r}^{t}\|_{2,E_{t}} \leq \sqrt{\beta_{t}} \right\}, \\
\bar{r}^{t} &:= \operatorname*{arg\,min}_{\hat{r} \in \mathcal{R}} \sum_{i=1}^{t-1} \sum_{h=1}^{H} \left( \hat{r}_{h}(s^{i}_{h}, a^{i}_{h}, \bar{\mu}^{i}_{M^{*},h}) - r_{h}^{i} \right)^{2},
\end{aligned} \tag{10}$$

where $\|g\|_{2,E_{t}}^{2} := \sum_{i=1}^{t-1} \sum_{h=1}^{H} (g_{h}(s^{i}_{h}, a^{i}_{h}, \bar{\mu}^{i}_{M^{*},h}))^{2}$ is a shorthand defined for any function $g$. The parameter $\beta_{t}$ sets the radius of the confidence set and is chosen so that $r^{*}$ is contained in the confidence set at all times with high probability. We defer the detailed choice of $\beta_{t}$ to Lem. I.1. Informally, $\beta_{t} = O(\sigma^{2} \log N(\mathcal{R}, \frac{1}{T}))$ grows as $\log T$, where $N(\mathcal{R}, \varepsilon)$ is the $\varepsilon$-covering number of $\mathcal{R}$.
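For intuition, Eq. (10) can be instantiated for a finite candidate class. The sketch below is illustrative only: the finite class of constant rewards, the value `beta=2.0`, and the helper names `least_squares_confidence_set` and `width` are assumptions made for this example, not choices from the paper. Each candidate is represented by its predictions on the visited tuples, so the empirical norm $\|\cdot\|_{2,E_t}$ becomes a plain sum of squares.

```python
import numpy as np


def least_squares_confidence_set(preds, observed, beta):
    """Finite-class version of Eq. (10).

    preds[j, n] is candidate j's predicted reward at the n-th visited tuple;
    observed[n] is the noisy reward seen there. Returns the index of the
    least-squares estimate and a boolean mask of the confidence set.
    """
    losses = ((preds - observed[None, :]) ** 2).sum(axis=1)
    best = int(np.argmin(losses))
    # Squared empirical distance of every candidate to the estimate.
    dist_sq = ((preds - preds[best][None, :]) ** 2).sum(axis=1)
    return best, dist_sq <= beta


def width(query_preds, in_set):
    """Largest disagreement between set members at each query point."""
    members = query_preds[in_set]
    return members.max(axis=0) - members.min(axis=0)


# Hypothetical toy data: constant candidate rewards, true reward 0.5.
rng = np.random.default_rng(0)
levels = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
n, sigma = 20, 0.1
preds = np.repeat(levels[:, None], n, axis=1)
observed = 0.5 + sigma * rng.normal(size=n)

best, in_set = least_squares_confidence_set(preds, observed, beta=2.0)
w = width(levels[:, None], in_set)
print(best, in_set, w)
```

In this toy setting the set shrinks as `n` grows, because the squared empirical distance of a wrong constant scales linearly with the number of samples while `beta` stays fixed.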

Steering Reward Design with Pessimism We consider the following steering reward design:

$$\forall \mu \in \Delta_{\mathcal{S} \times \mathcal{A}}^{H}, \quad R^{t}_{\text{nz}}(\mu) := R_{\pi_{*}^{k(t)}}(\mu) - \left( \bar{r}^{t}(\mu) - w_{\hat{\mathcal{R}}^{t}}(\mu) \right) + \left( r_{\max} + \|R_{\pi_{*}^{k(t)}}(\mu)\|_{\infty} \right) \bm{1}. \tag{11}$$

Here $\pi_{*}^{k(t)}$ is computed in the same way as in Alg. 2; $\bar{r}^{t} \in \hat{\mathcal{R}}^{t}$ (defined in Eq. (10)) is the reward estimate achieving the minimal empirical loss; $w_{\hat{\mathcal{R}}^{t}}(\mu)$ is a vector with elements $(w_{\hat{\mathcal{R}}^{t}}(\mu))_{h,s,a} := \sup_{r, \tilde{r} \in \hat{\mathcal{R}}^{t}} \left| r_{h}(s,a,\mu) - \tilde{r}_{h}(s,a,\mu) \right|$, which quantifies the estimation uncertainty for each state-action pair; and the last constant shift term ensures non-negativity.

As we can see, the main difference from the steering reward $R^{t}_{\text{z}}$ in Alg. 2 is an additional reward estimation term that offsets the effect of the non-zero original reward $r^{*}$. In this way, the agents will follow the guidance of $R_{\pi_{*}^{k(t)}}$ and explore as we want. Note that we use a pessimistic reward estimate, such that $\bar{r}^{t} - w_{\hat{\mathcal{R}}^{t}} \leq r^{*}$, for a technical reason that we explain later.
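As a small numeric illustration of Eq. (11), the sketch below evaluates the steering reward on flattened $(h, s, a)$ vectors for one fixed density. The function name `steering_reward_nz` and all input values are hypothetical. Non-negativity of the output relies on the assumptions that the estimate satisfies $\bar{r}^{t} \leq r_{\max}$ entrywise and that the width is non-negative.

```python
import numpy as np


def steering_reward_nz(R_pi, r_bar, width, r_max):
    """Evaluate Eq. (11) entrywise for one fixed density mu.

    R_pi  : steering reward R_{pi_*^{k(t)}}(mu), flattened over (h, s, a)
    r_bar : least-squares reward estimate at mu (assumed <= r_max)
    width : confidence width w(mu) (assumed >= 0)
    """
    pessimistic_r = r_bar - width          # lower confidence bound on r*
    shift = r_max + np.max(np.abs(R_pi))   # same constant for every entry
    return R_pi - pessimistic_r + shift


# Hypothetical inputs with three (h, s, a) entries.
R_pi = np.array([-0.3, 0.1, 0.2])
r_bar = np.array([0.5, 0.2, 0.9])
width = np.array([0.1, 0.1, 0.1])

R_nz = steering_reward_nz(R_pi, r_bar, width, r_max=1.0)
print(R_nz)  # approximately [0.6, 1.3, 0.7]
```

Because the shift adds the same constant to every entry, it does not change which densities the agents prefer; it only keeps the payment non-negative.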

Steering Algorithm Design The algorithm for the non-zero intrinsic reward setting differs from Alg. 2 only in the additional update of $\hat{\mathcal{R}}^{t}$ as in Eq. (10) and in choosing Eq. (11) as the steering reward $R_{\text{nz}}^{t}$. For completeness, we defer the detailed algorithm to Alg. 4 in Appx. I.1. We have the following guarantees for the steering gap and the steering cost.

Theorem 6.1.

Under Assump. A, B and C, if we run Alg. 4 with $0 < \delta < 1$, then with probability at least $1 - 6\delta$, $K \leq HSA \log_{2} T$, and

$$\begin{aligned}
\Delta_{T}(\{\bar{\mu}^{t}_{M^{*}}\}_{t=1}^{T}) \leq\; & L_{U} \sqrt{H^{3} SAT (K \operatorname{AdaReg}(T) + D)} + 36 L_{U} H^{3} S \sqrt{AT \ln(THSA/\delta)}, \\
C_{T}(\{\bar{\mu}^{t}_{M^{*}}, R_{\text{nz}}^{t} - (r_{\max} \cdot \bm{1} - r^{*})\}_{t=1}^{T}) =\; & 4H \sqrt{T (K \operatorname{AdaReg}(T) + D)} + D,
\end{aligned}$$

where $D = \tilde{O}\left(\sqrt{\beta_{T} H \dim_{E}(\mathcal{R}, T^{-1})\, T}\right)$.

Comparing with Theorem 5.1, both the steering gap and the steering cost differ only in the additional term $D$, which results from the estimation error of $r^{*}$. The term $D$ depends on the Eluder dimension of $\mathcal{R}$ and on $\beta_{T}$. In Appx. H.1, we show several common function classes with $\dim_{E}(\mathcal{R}, T^{-1}) \in \tilde{\mathcal{O}}(1)$ for which, by choosing $\beta_{T}$ appropriately, we have $D \in \tilde{\mathcal{O}}(\sqrt{T})$. As a result, both the steering gap and cost upper bounds in Thm. 6.1 are sub-linear in $T$.
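The resulting rate can be read off with a short calculation. The sketch below treats $H$, $S$, $A$, $L_U$ and $\delta$ as constants and assumes, purely for illustration, that the agents' adaptive regret satisfies $\operatorname{AdaReg}(T) = \tilde{\mathcal{O}}(\sqrt{T})$; this rate is an assumption of the example, not a statement taken from the theorem.

```latex
\begin{align*}
K \operatorname{AdaReg}(T) + D
  &= \tilde{\mathcal{O}}(\log T \cdot \sqrt{T}) + \tilde{\mathcal{O}}(\sqrt{T})
   = \tilde{\mathcal{O}}(\sqrt{T}), \\
\sqrt{T \left( K \operatorname{AdaReg}(T) + D \right)}
  &= \tilde{\mathcal{O}}\big(\sqrt{T \cdot \sqrt{T}}\big)
   = \tilde{\mathcal{O}}(T^{3/4}), \\
\text{steering gap}
  &= \tilde{\mathcal{O}}(T^{3/4}) + \tilde{\mathcal{O}}(\sqrt{T})
   = \tilde{\mathcal{O}}(T^{3/4}), \\
\text{steering cost}
  &= \tilde{\mathcal{O}}(T^{3/4}) + \tilde{\mathcal{O}}(\sqrt{T})
   = \tilde{\mathcal{O}}(T^{3/4}).
\end{align*}
```

Under these assumptions both quantities grow as $\tilde{\mathcal{O}}(T^{3/4})$, hence sub-linearly in $T$.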

Proof Sketch Similar to the proof of Thm. 5.1, we can decompose the steering gap as in Eq. (8) and upper bound the model estimation error term $\Delta_{\text{est}}$ in the same way. The proof diverges when we upper bound AgentReg in Eq. (9), because the agents' no-regret behavior holds for $r^{*} + R_{\text{nz}}^{t}$ in this setting. We can write

$$\begin{aligned}
\texttt{AgentReg} = \sum_{t=1}^{T} \Big\langle & R_{\text{nz}}^{t}(\bar{\mu}^{t}_{M^{*}}) + r^{*}(\bar{\mu}^{t}_{M^{*}}) - r^{*}(\bar{\mu}^{t}_{M^{*}}) \\
& + \bar{r}^{t}(\bar{\mu}^{t}_{M^{*}}) - w_{\hat{\mathcal{R}}^{t}}(\bar{\mu}^{t}_{M^{*}}),\; \mu_{M^{*}}^{\pi_{*}^{k(t)}} - \bar{\mu}^{t}_{M^{*}} \Big\rangle.
\end{aligned}$$

Using pessimism, i.e., $r^{*} \geq \bar{r}^{t} - w_{\hat{\mathcal{R}}^{t}}$, we can bound this by

$$\begin{aligned}
& \sum_{t=1}^{T} \left\langle R_{\text{nz}}^{t}(\bar{\mu}^{t}_{M^{*}}) + r^{*}(\bar{\mu}^{t}_{M^{*}}),\; \mu_{M^{*}}^{\pi_{*}^{k(t)}} - \bar{\mu}^{t}_{M^{*}} \right\rangle \\
& \quad + \sum_{t=1}^{T} \left\langle r^{*}(\bar{\mu}^{t}_{M^{*}}) - \bar{r}^{t}(\bar{\mu}^{t}_{M^{*}}) + w_{\hat{\mathcal{R}}^{t}}(\bar{\mu}^{t}_{M^{*}}),\; \bar{\mu}^{t}_{M^{*}} \right\rangle.
\end{aligned}$$

Clearly, the first term above is just the agents' dynamic regret with respect to the total reward they receive, and it can again be bounded by $K \operatorname{AdaReg}(T)$. The second term can be further controlled by $\mathcal{O}(\sum_{t} \langle w_{\hat{\mathcal{R}}^{t}}(\bar{\mu}^{t}), \bar{\mu}^{t} \rangle)$, which is essentially the cumulative confidence interval length for reward estimation; its growth can be controlled by the Eluder dimension (Lem. H.5) and is only sub-linear in $T$.
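The pessimism step can be spelled out for a single round as follows. This is a reader's sketch that uses only Eq. (11), the non-negativity of densities, and the fact that every element of $\Delta_{\mathcal{S} \times \mathcal{A}}^{H}$ has total mass $H$. The shorthands $\mu^{\pi}$, $\bar{\mu}$, $c$ and $g^{t}$ are introduced here for brevity and are not the paper's notation.

```latex
% Shorthand: \mu^{\pi} := \mu_{M^*}^{\pi_*^{k(t)}}, \bar\mu := \bar\mu^t_{M^*},
% c := r_{\max} + \|R_{\pi_*^{k(t)}}(\bar\mu)\|_\infty,
% g^t := r^* - \bar r^t + w_{\hat{\mathcal R}^t} \ge 0 (pessimism), all evaluated at \bar\mu.
\begin{align*}
\langle c\,\bm{1},\, \mu^{\pi} - \bar{\mu} \rangle
  &= c\,(H - H) = 0
  && \text{(the constant shift drops out)} \\
\langle R_{\pi_{*}^{k(t)}},\, \mu^{\pi} - \bar{\mu} \rangle
  &= \langle R_{\text{nz}}^{t} + \bar{r}^{t} - w_{\hat{\mathcal{R}}^{t}},\, \mu^{\pi} - \bar{\mu} \rangle
  && \text{(Eq. (11))} \\
  &= \langle R_{\text{nz}}^{t} + r^{*},\, \mu^{\pi} - \bar{\mu} \rangle
     - \langle g^{t},\, \mu^{\pi} \rangle + \langle g^{t},\, \bar{\mu} \rangle \\
  &\leq \langle R_{\text{nz}}^{t} + r^{*},\, \mu^{\pi} - \bar{\mu} \rangle
     + \langle g^{t},\, \bar{\mu} \rangle
  && (g^{t} \geq 0,\ \mu^{\pi} \geq 0) \\
\langle g^{t},\, \bar{\mu} \rangle
  &\leq 2\, \langle w_{\hat{\mathcal{R}}^{t}},\, \bar{\mu} \rangle
  && (r^{*}, \bar{r}^{t} \in \hat{\mathcal{R}}^{t} \Rightarrow |r^{*} - \bar{r}^{t}| \leq w_{\hat{\mathcal{R}}^{t}}).
\end{align*}
```

The last line holds on the high-probability event that $r^{*}$ lies in the confidence set, and it explains why the second term is of the order of the cumulative confidence width.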

For the steering cost, we can provide an upper bound involving AgentReg and reward estimation error that we analyzed before. To save space, we do not repeat it here and refer the reader to Appx.ย I for the full proof.

Remark 6.1.

Our strategy for dealing with the intrinsic reward $r^{*}$ is to try to "cancel" it with our steering reward. This approach is justified by the fact that we keep $r^{*}$ and $U$ very general, which means that the target density maximizing $U$ may not coincide with an equilibrium associated with the original reward $r^{*}$. Therefore, to ensure the target density is still a stationary point for no-regret learners, we treat $r^{*}$ as a competing force to offset. We acknowledge that there might be other ways to counteract the impact of $r^{*}$ with lower steering costs, and we leave further investigation to future work.

Remark 6.2 (Generalization to Unknown Utility Setting).

Although this paper focuses on the case where $U$ is revealed to the mediator, it is possible to generalize our results to the case where the utility function $U$ is unknown but lies in a known function class $\mathcal{U}$ with bounded Eluder dimension. In Appx. J, we formalize this setting and present a solution based on a simple modification of the current methods. Our established regret bounds for the steering gap and the steering cost grow at a rate of $\tilde{\mathcal{O}}(T^{5/6})$. Although this is worse than the rate of $\tilde{\mathcal{O}}(T^{3/4})$ in Thm. 6.1, due to the challenges in exploring the utility function, the bounds are still sub-linear in $T$.

7 CONCLUSION

We study a novel problem setting for incentive design in unknown mean-field games with no-regret agents. Our optimistic algorithm introduces newly developed steering reward designs, achieving sub-linear utility regret and steering costs when the intrinsic reward is zero. Extending to the setting with a non-zero and unknown intrinsic reward function, we adapt our algorithm to handle this new challenge, maintaining sub-linear utility regret and vanishing steering costs when competing with a baseline strategy. Future work could explore the more challenging case where the transition function also depends on the population density. Another interesting direction is to identify better or even optimal steering reward designs that stabilize the target policy, and to design an algorithm with sub-linear guarantees compared with that benchmark.

Acknowledgements

This work is supported by Swiss National Science Foundation (SNSF) Project Funding No. 200021-207343 and SNSF Starting Grant.

References

  • Achdou, Y. and Lasry, J.-M. (2019). Mean Field Games for Modeling Crowd Motion. In Chetverushkin, B. N., Fitzgibbon, W., Kuznetsov, Y., Neittaanmäki, P., Periaux, J., and Pironneau, O., editors, Contributions to Partial Differential Equations and Applications, pages 17–42. Springer International Publishing, Cham.
  • Baumann, T., Graepel, T., and Shawe-Taylor, J. (2020). Adaptive mechanism design: Learning to promote cooperation. In 2020 International Joint Conference on Neural Networks (IJCNN), pages 1–7. IEEE.
  • Brown, W., Schneider, J., and Vodrahalli, K. (2024). Is learning in games good for the learners? Advances in Neural Information Processing Systems, 36.
  • Cabannes, T., Lauriere, M., Perolat, J., Marinier, R., Girgin, S., Perrin, S., Pietquin, O., Bayen, A. M., Goubault, E., and Elie, R. (2021). Solving N-player dynamic routing games with congestion: a mean field approach. arXiv:2110.11943 [cs, eess, math].
  • Camara, M., Hartline, J., and Johnsen, A. (2020). Mechanisms for a No-Regret Agent: Beyond the Common Prior. arXiv:2009.05518 [cs, econ].
  • Canyakmaz, I., Sakos, I., Lin, W., Varvitsiotis, A., and Piliouras, G. (2024). Steering game dynamics towards desired outcomes. arXiv:2404.01066 [cs, eess].
  • Carmona, R. and Wang, P. (2021). Finite-state contract theory with a principal and a field of agents. Management Science, 67(8):4725–4741.
  • Castiglioni, M., Marchesi, A., and Gatti, N. (2023). Multi-agent contract design: How to commission multiple agents with individual outcomes. In Proceedings of the 24th ACM Conference on Economics and Computation, pages 412–448.
  • Chen, B. and Cheng, H. H. (2010). A review of the applications of agent technology in traffic and transportation systems. IEEE Transactions on Intelligent Transportation Systems, 11(2):485–497.
  • Curry, M., Thoma, V., Chakrabarti, D., McAleer, S., Kroer, C., Sandholm, T., He, N., and Seuken, S. (2024). Automated design of affine maximizer mechanisms in dynamic settings. Proceedings of the AAAI Conference on Artificial Intelligence, 38(9):9626–9635.
  • DellaVigna, S. and Malmendier, U. (2004). Contract design and self-control: Theory and evidence. The Quarterly Journal of Economics, 119(2):353–402.
  • Deng, Y., Schneider, J., and Sivan, B. (2019). Strategizing against No-regret Learners. arXiv:1909.13861 [cs].
  • Dinneweth, J., Boubezoul, A., Mandiau, R., and Espié, S. (2022). Multi-agent reinforcement learning for autonomous vehicles: a survey. Autonomous Intelligent Systems, 2(1):27.
  • Dütting, P., Ezra, T., Feldman, M., and Kesselheim, T. (2023). Multi-agent contracts. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing, pages 1311–1324.
  • Ehtamo, H., Kitti, M., and Hämäläinen, R. P. (2002). Recent studies on incentive design problems in game theory and management science. In Optimal Control and Differential Games: Essays in Honor of Steffen Jørgensen, pages 121–134. Springer.
  • Elie, R., Mastrolia, T., and Possamaï, D. (2019). A tale of a principal and many, many agents. Mathematics of Operations Research, 44(2):440–467.
  • Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139.
  • Fu, G. and Horst, U. (2018). Mean-field leader-follower games with terminal state constraint.
  • Ge, J., Wang, Y., Li, W., and Jin, C. (2024). Towards principled superhuman AI for multiplayer symmetric games.
  • Gomes, D. A., Velho, R. M., and Wolfram, M.-T. (2014). Socio-economic applications of finite state mean field games. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 372(2028):20130405. arXiv:1403.4217 [math].
  • Guo, X., Hu, A., Xu, R., and Zhang, J. (2021). Learning Mean-Field Games. arXiv:1901.09585 [math].
  • Guo, X., Li, L., Nabi, S., Salhab, R., and Zhang, J. (2023a). MESOB: Balancing Equilibria & Social Optimality. arXiv:2307.07911 [cs, math].
  • Guo, X., Li, L., Nabi, S., Salhab, R., and Zhang, J. (2023b). MESOB: Balancing equilibria & social optimality.
  • Hazan, E. (2023). Introduction to Online Convex Optimization. arXiv:1909.05207 [cs, math, stat].
  • Hazan, E. and Seshadhri, C. (2007). Adaptive algorithms for online decision problems. Electronic Colloquium on Computational Complexity (ECCC), 14.
  • Ho, C.-J., Slivkins, A., and Vaughan, J. W. (2014). Adaptive contract design for crowdsourcing markets: Bandit algorithms for repeated principal-agent problems. In Proceedings of the Fifteenth ACM Conference on Economics and Computation, pages 359–376.
  • Holmström, B. (1979). Moral hazard and observability. The Bell Journal of Economics, pages 74–91.
  • Hu, A. and Zhang, J. (2024). MF-OML: Online Mean-Field Reinforcement Learning with Occupation Measures for Large Population Games. arXiv:2405.00282 [cs, math].
  • Huang, J., He, N., and Krause, A. (2024a). Model-Based RL for Mean-Field Games is not Statistically Harder than Single-Agent RL. arXiv:2402.05724 [cs, stat].
  • Huang, J., Thoma, V., Shen, Z., Nax, H. H., and He, N. (2024b). Learning to Steer Markovian Agents under Model Uncertainty. arXiv:2407.10207 [cs, stat].
  • Huang, J., Yardim, B., and He, N. (2023). On the Statistical Efficiency of Mean Field Reinforcement Learning with General Function Approximation. arXiv:2305.11283 [cs, stat].
  • Huang, M., Malhamé, R. P., and Caines, P. E. (2006). Large population stochastic dynamic games: closed-loop McKean-Vlasov systems and the Nash certainty equivalence principle.
  • Innes, R. D. (1990). Limited liability and incentive contracting with ex-ante action choices. Journal of Economic Theory, 52(1):45–67.
  • Iyer, K., Johari, R., and Sundararajan, M. (2014). Mean field equilibria of dynamic auctions with learning. Management Science, 60(12):2949–2970.
  • Jaksch, T., Ortner, R., and Auer, P. (2010). Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(51):1563–1600.
  • Lasry, J.-M. and Lions, P.-L. (2007). Mean field games. Japanese Journal of Mathematics, 2(1):229–260.
  • Laurière, M., Perrin, S., Pérolat, J., Girgin, S., Muller, P., Élie, R., Geist, M., and Pietquin, O. (2024). Learning in Mean Field Games: A Survey. arXiv:2205.12944 [cs, math].
  • Liu, B., Li, J., Yang, Z., Wai, H.-T., Hong, M., Nie, Y. M., and Wang, Z. (2022). Inducing Equilibria via Incentives: Simultaneous Design-and-Play Ensures Global Convergence. arXiv:2110.01212 [cs].
  • Luo, Z.-Q., Pang, J.-S., and Ralph, D. (1996). Mathematical Programs with Equilibrium Constraints. Cambridge University Press.
  • Osband, I. and Roy, B. V. (2014). Model-based reinforcement learning and the eluder dimension.
  • Perolat, J., Perrin, S., Elie, R., Laurière, M., Piliouras, G., Geist, M., Tuyls, K., and Pietquin, O. (2021). Scaling up Mean Field Games with Online Mirror Descent. arXiv:2103.00623 [cs].
  • Ratliff, L. J., Dong, R., Sekar, S., and Fiez, T. (2019). A perspective on incentive design: Challenges and opportunities. Annual Review of Control, Robotics, and Autonomous Systems, 2(1):305–338.
  • Rosenberg, A. and Mansour, Y. (2019). Online convex optimization in adversarial Markov decision processes.
  • Roughgarden, T. and Tardos, É. (2007). Introduction to the inefficiency of equilibria. In Nisan, N., Roughgarden, T., Tardos, E., and Vazirani, V. V., editors, Algorithmic Game Theory, pages 443–460. Cambridge University Press, Cambridge.
  • Russo, D. and Van Roy, B. (2013). Eluder dimension and the sample complexity of optimistic exploration. In Burges, C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K., editors, Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc.
  • Sanjari, S., Bose, S., and Başar, T. (2024). Incentive Designs for Stackelberg Games with a Large Number of Followers and their Mean-Field Limits. arXiv:2207.10611 [cs].
  • Scheid, A., Tiapkin, D., Boursier, E., Capitaine, A., Mhamdi, E. M. E., Moulines, É., Jordan, M. I., and Durmus, A. (2024). Incentivized learning in principal-agent bandit games. arXiv preprint arXiv:2403.03811.
  • Steinbacher, M., Raddant, M., Karimi, F., Camacho Cuena, E., Alfarano, S., Iori, G., and Lux, T. (2021). Advances in the agent-based modeling of economic and social behavior. SN Business & Economics, 1(7):99.
  • Subramanian, S. G., Taylor, M. E., Crowley, M., and Poupart, P. (2022). Decentralized mean field games. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 9439–9447.
  • Wang, K., Xu, L., Perrault, A., Reiter, M. K., and Tambe, M. (2022). Coordinating followers to reach better equilibria: End-to-end gradient descent for Stackelberg games. Proceedings of the AAAI Conference on Artificial Intelligence, 36(5):5219–5227.
  • Weissman, T., Ordentlich, E., Seroussi, G., Verdú, S., and Weinberger, M. J. (2003). Inequalities for the l1 deviation of the empirical distribution.
  • Yang, J., Wang, E., Trivedi, R., Zhao, T., and Zha, H. (2022). Adaptive incentive design with multi-agent meta-gradient reinforcement learning. In Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems, AAMAS ’22, pages 1436–1445, Richland, SC. International Foundation for Autonomous Agents and Multiagent Systems.
  • Yardim, B., Cayci, S., Geist, M., and He, N. (2022). Policy Mirror Ascent for Efficient and Independent Learning in Mean Field Games.
  • Zhang, B. H., Farina, G., Anagnostides, I., Cacciamani, F., McAleer, S. M., Haupt, A. A., Celli, A., Gatti, N., Conitzer, V., and Sandholm, T. (2024). Steering No-Regret Learners to a Desired Equilibrium. arXiv:2306.05221 [cs].
  • Zhu, B., Bates, S., Yang, Z., Wang, Y., Jiao, J., and Jordan, M. I. (2022). The sample complexity of online contract design. arXiv preprint arXiv:2211.05732.

Checklist

  1. For all models and algorithms presented, check if you include:

    (a) A clear description of the mathematical setting, assumptions, algorithm, and/or model. [Yes]

    (b) An analysis of the properties and complexity (time, space, sample size) of any algorithm. [Yes]

    (c) (Optional) Anonymized source code, with specification of all dependencies, including external libraries. [Not Applicable]

  2. For any theoretical claim, check if you include:

    (a) Statements of the full set of assumptions of all theoretical results. [Yes]

    (b) Complete proofs of all theoretical results. [Yes]

    (c) Clear explanations of any assumptions. [Yes]

  3. For all figures and tables that present empirical results, check if you include:

    (a) The code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL). [Not Applicable]

    (b) All the training details (e.g., data splits, hyperparameters, how they were chosen). [Not Applicable]

    (c) A clear definition of the specific measure or statistics and error bars (e.g., with respect to the random seed after running experiments multiple times). [Not Applicable]

    (d) A description of the computing infrastructure used. (e.g., type of GPUs, internal cluster, or cloud provider). [Not Applicable]

  4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets, check if you include:

    (a) Citations of the creator If your work uses existing assets. [Not Applicable]

    (b) The license information of the assets, if applicable. [Not Applicable]

    (c) New assets either in the supplemental material or as a URL, if applicable. [Not Applicable]

    (d) Information about consent from data providers/curators. [Not Applicable]

    (e) Discussion of sensible content if applicable, e.g., personally identifiable information or offensive content. [Not Applicable]

  5. If you used crowdsourcing or conducted research with human subjects, check if you include:

    (a) The full text of instructions given to participants and screenshots. [Not Applicable]

    (b) Descriptions of potential participant risks, with links to Institutional Review Board (IRB) approvals if applicable. [Not Applicable]

    (c) The estimated hourly wage paid to participants and the total amount spent on participant compensation. [Not Applicable]

Appendix A TABLE OF FREQUENTLY USED NOTATIONS

| Notation | Description |
| --- | --- |
| $[n]$ | $\{1, 2, \dots, n\}$ for any $n \in \mathbb{N}$ |
| $\Delta_{\mathcal{X}}$ | Set of probability distributions over a finite set $\mathcal{X}$ |
| $\mathbb{I}\{\mathcal{E}\}$ | Indicator function for the event $\mathcal{E}$ |
| $\bm{1}$ | All-one vector |
| $\mathbf{e}_i$ | The $i$-th standard-basis vector |
| $M = (N, \mathcal{S}, \mathcal{A}, H, \mathbb{P}_M, r_M, \mu_1)$ | The model / game |
| $N$ | Number of agents |
| $\mathcal{S}, \mathcal{A}$ | State and action space |
| $H$ | Horizon length of the game |
| $\mu_1$ | Initial state distribution |
| $\{\mathbb{P}_{M,h} : \mathcal{S} \times \mathcal{A} \to \Delta_{\mathcal{S}}\}_{h \in [H]}$ | Transition function |
| $\{r_{M,h} : \mathcal{S} \times \mathcal{A} \times \Delta_{\mathcal{S} \times \mathcal{A}} \to [0, r_{\max}]\}_{h \in [H]}$ | Reward function |
| $\{R_h : \mathcal{S} \times \mathcal{A} \times \Delta_{\mathcal{S} \times \mathcal{A}} \to \mathbb{R}\}_{h \in [H]}$ | Steering reward function (capitalized) |
| $r : \Delta_{\mathcal{S} \times \mathcal{A}}^{H} \to \mathbb{R}^{HSA}$ | Vectorized reward function, $(r(\mu))_{h,s,a} = r_h(s, a, \mu_h)$ |
| $\{\pi_h : \mathcal{S} \to \Delta_{\mathcal{A}}\}_{h \in [H]}$ | Markov policy |
| $\Pi$ | Set of all policies |
| $\mu_M^{\pi}$ | State-action density of policy $\pi$ in model $M$ |
| $\Psi_M$ | Set of possible state-action densities in model $M$ |
| $\operatorname{AdaReg}(T)$ | Adaptive regret bound after $T$ games |
| $U : \Delta_{\mathcal{S} \times \mathcal{A}}^{H} \to \mathbb{R}$ | Utility function |
| $C(\bar{\mu}^t, R^t) = \langle R^t(\bar{\mu}^t), \bar{\mu}^t \rangle$ | Steering cost function |
| $R_{\pi}$ | Reward function which incentivizes policy $\pi$ |
| $M^*, r^*, \mathbb{P}^*$ | True model, intrinsic reward, transition function |
| $R_{\text{z}}$ | Steering reward for the setting where $r^* = 0$; the subscript “z” is short for “zero” |
| $R_{\text{nz}}$ | Steering reward for the setting where $r^* \in \mathcal{R}$; the subscript “nz” is short for “non-zero” |
| $\dim_E(\mathcal{F}, \varepsilon)$ | Eluder dimension of function class $\mathcal{F}$ |
| $\bar{\mu}$ | Population density $\bar{\mu} := \frac{1}{N} \sum_{n} \mu^{\pi^n}$ |
| $\bar{\pi}$ | Population average policy induced by $\bar{\mu}$ |
| $\mathcal{O}, \tilde{\mathcal{O}}$ | Standard big-O notations |

Appendix B SUMMARY OF MAIN RESULTS

In the following, we summarize the main theorems of this paper under Assump. A, B and C. We study the steering gaps and costs in four settings. The settings are categorized by whether $M^*$ (or $\pi^* := \operatorname*{arg\,max}_{\pi \in \Pi} U(\mu_{M^*}^{\pi})$) is known, and by whether the intrinsic reward function $r^*$ is zero or is non-zero and unknown.

| Setting | $r^* = 0$? | Steering Gap | Steering Cost | Thm. |
| --- | --- | --- | --- | --- |
| Known $M^*$ | ✓ | $\mathcal{O}\big(L_U \sqrt{HSAT \operatorname{AdaReg}(T)}\big)$ | $\mathcal{O}\big(H \sqrt{T \operatorname{AdaReg}(T)}\big)$ | 4.1 |
| Unknown $M^*$ (known $\pi^*$) | ✓ | $\mathcal{O}\big(L_U \sqrt{H^3 SAT \operatorname{AdaReg}(T)}\big)$ | $\mathcal{O}\big(H \sqrt{T \operatorname{AdaReg}(T)}\big)$ | 4.3 |
| Unknown $M^*$ | ✓ | $\mathcal{O}\big(L_U \sqrt{H^3 SATK \operatorname{AdaReg}(T)} + L_U H^3 S \sqrt{AT \ln(THSA/\delta)}\big)$ | $\mathcal{O}\big(H \sqrt{TK \operatorname{AdaReg}(T)}\big)$ | 5.1 |
| Unknown $M^*$ | ✗ | $\mathcal{O}\big(L_U \sqrt{H^3 SAT (K \operatorname{AdaReg}(T) + D)} + L_U H^3 S \sqrt{AT \ln(THSA/\delta)}\big)$ | $\mathcal{O}\big(H \sqrt{T (K \operatorname{AdaReg}(T) + D)}\big) + D + C_T(\{\bar{\mu}^t, r_{\max} \cdot \mathbf{1} - r^*\}_{t=1}^{T})$ | 6.1 |

Here $K = \mathcal{O}(HSA \log T)$ and $D = \tilde{\mathcal{O}}\big(\sqrt{\beta_T H \dim_E(\mathcal{R}, T^{-1})\, T}\big)$, where $\dim_E$ is the Eluder dimension of the reward function class $\mathcal{R}$, and $\beta_T = \tilde{\mathcal{O}}(1)$.
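As a worked illustration of how these bounds scale in $T$, suppose the agents' adaptive regret satisfies $\operatorname{AdaReg}(T) = \mathcal{O}(\sqrt{T})$. This rate is an assumption made here purely for illustration, and logarithmic factors are suppressed. Substituting it into the first and third rows of the table, together with $K = \mathcal{O}(HSA \log T)$, gives:

```latex
% Illustrative assumption (not a statement from the paper): AdaReg(T) = O(sqrt(T)).
\begin{align*}
\text{Known } M^* \text{ (gap):}\quad
  & L_U \sqrt{HSAT \cdot \sqrt{T}} = L_U \sqrt{HSA}\; T^{3/4}, \\
\text{Known } M^* \text{ (cost):}\quad
  & H \sqrt{T \cdot \sqrt{T}} = H\, T^{3/4}, \\
\text{Unknown } M^*,\ r^* = 0 \text{ (gap):}\quad
  & L_U \sqrt{H^3 SAT \cdot HSA \log T \cdot \sqrt{T}}
    + L_U H^3 S \sqrt{AT \ln(THSA/\delta)} \\
  & = L_U H^2 S A \sqrt{\log T}\; T^{3/4}
    + L_U H^3 S \sqrt{A \ln(THSA/\delta)}\; T^{1/2}.
\end{align*}
```

Under this assumption the $T^{3/4}$ term dominates the $T^{1/2}$ term as $T$ grows, so every quantity above is sub-linear in $T$.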

Appendix C OTHER RELATED WORKS

More Elaboration on Comparison between the Steering Setting and Contract Design Setting

The steering setup differs from the previous incentive design literature in two aspects: (1) it deals with “learning agents” that continuously update their policies, and (2) it cares about the steering gap towards a target policy and the cumulative steering cost. One of the most related and representative existing problem setups is contract design (a.k.a. the principal-agent problem), a classical problem dating back to the seminal work of Holmström (1979). As discussed in Sec. 1.1, it considers a similar mediator-agent interaction procedure. In the following, we elaborate on the comparison between the two settings to motivate our steering setting.

  • (1) Contract design assumes that the agents respond optimally to the mediator/principal (e.g., by maximizing the total return including the mediator’s incentives). This is a rather strong assumption, and it “simplifies” the problem by making the agents’ behaviors predictable.

    In contrast, the steering framework treats the agents’ behavior as a dynamic process. For example, Zhang et al. (2024) and our work consider no-regret behaviors, while Huang et al. (2024b) and Canyakmaz et al. (2024) assume Markovian learning dynamics. Such non-stationarity is more reasonable in practice and introduces additional challenges in achieving a low steering gap and cost.

  • (2) Contract design considers a more challenging objective: it aims to find the optimal incentive design that maximizes the mediator’s gain minus the incentive cost. It usually also assumes that the agents’ behaviors are unobservable. Due to these challenges, most of the contract design literature focuses on the single-agent setting and assumes knowledge of the model.

    On the other hand, the steering setting considers steering the agents towards target policies that maximize some utility function, which makes the framework more general. Besides, we do not pursue optimality in the steering cost; sub-linearity is enough. This is reasonable because in many scenarios we only face budget constraints and do not need to attain the optimum. Such a relaxation also makes the problem more tractable.
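To illustrate what a “no-regret” learner in item (1) can look like, here is a minimal sketch of the Hedge algorithm (Freund and Schapire, 1997) facing rewards that change from round to round. It is only one standard example of a no-regret rule and is not claimed to be the learning dynamics of the agents in this paper. It also controls only external regret over the whole horizon, which is weaker than the adaptive regret considered in this work. The reward sequence, the step size, and all names below are illustrative assumptions.

```python
import numpy as np

def hedge(rewards, eta):
    """Run Hedge on a (T, K) array of rewards in [0, 1].

    Returns a (T, K) array whose row t is the mixed strategy played in round t,
    chosen before the rewards of round t are revealed.
    """
    T, K = rewards.shape
    cumulative = np.zeros(K)
    plays = np.zeros((T, K))
    for t in range(T):
        # Exponential weights on past cumulative rewards (shifted for stability).
        weights = np.exp(eta * (cumulative - cumulative.max()))
        plays[t] = weights / weights.sum()
        cumulative += rewards[t]
    return plays

def external_regret(rewards, plays):
    """Reward of the best fixed action in hindsight minus the expected reward obtained."""
    return rewards.sum(axis=0).max() - np.sum(plays * rewards)

# Hypothetical toy run: arbitrary (here random) time-varying rewards.
rng = np.random.default_rng(0)
T, K = 2000, 5
rewards = rng.uniform(size=(T, K))
eta = np.sqrt(8 * np.log(K) / T)      # standard step size for a known horizon T
plays = hedge(rewards, eta)
regret = external_regret(rewards, plays)
bound = np.sqrt(T * np.log(K) / 2)    # classical Hedge guarantee for this eta
```

For rewards in $[0, 1]$ and this step size, the classical analysis guarantees an external regret of at most $\sqrt{T \ln K / 2}$ for any reward sequence, so the average regret per round vanishes as $T$ grows even though the learner never best-responds exactly.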

Mean-field game

The mean-field game (MFG) is an important framework for modeling systems with a large number of symmetric agents (Laurière et al., 2024). Most works on MFGs focus on learning equilibrium policies. The pioneering works of Lasry and Lions (2007) and Huang et al. (2006) reveal that learning a Nash Equilibrium (NE) is computationally efficient under monotonicity conditions if the model is known in advance. Without knowledge of the true model, many previous works contribute sample-efficient model-free (Guo et al., 2021; Yardim et al., 2022; Perolat et al., 2021) and model-based (Huang et al., 2023; Huang et al., 2024a) methods to compute the NE. Our mean-field game definition is similar to the general MFG setting (Guo et al., 2021), but unlike that setting, we assume transitions are density-independent and allow agents’ policies to be independent of one another. The density-independent transition assumption has been frequently considered in previous works (Lasry and Lions, 2007; Huang et al., 2006; Hu and Zhang, 2024; Perolat et al., 2021). To our knowledge, we are the first to investigate steering agents’ behaviors in the context of mean-field games.

Mathematical Programming with Equilibrium Constraints (MPEC) and Mechanism Design

MPEC considers a bilevel optimization formulation, where the upper level can be a utility maximization problem and the lower level involves equilibrium constraints (Luo et al., 1996). A line of research (Liu et al., 2022; Wang et al., 2022; Yang et al., 2022) considers gradient-based approaches to solving MPEC problems. These approaches usually require strong assumptions for computing hyper-gradients, which may fail to hold in most games. In contrast, we do not rely on those assumptions, nor do we restrict the target policies to be equilibria. We only assume that the agents are no-regret learners and do not require them to solve for the equilibria induced by the modified reward functions.

Another related field within game theory is Mechanism Design, which focuses on designing rules or systems (mechanisms) to achieve a specific objective, especially when participants (agents) have private information and act according to their own interests. Most recent works consider mechanism design on Markov Games (Curry et al., 2024; Baumann et al., 2020).

Guo et al. (2023b) consider a bi-level optimization framework and a bi-objective variant, where the goal of the social planner is to find an equilibrium policy that maximizes some social welfare function. They do not consider the use of steering rewards to intervene in agents’ behaviors, and they focus on the optimization side without considering model uncertainty. In contrast, we study the incentive design problem and focus on how to explore and design appropriate steering rewards to guide agents’ behaviors without knowledge of the model.

Appendix D REGARDING NO-ADAPTIVE REGRET ASSUMPTION

D.1 Proof Of Proposition 3.1

See Proposition 3.1.

Proof.

We have

$$
\begin{aligned}
&\sup_{1\leq a<b\leq T}\max_{\mu\in\Psi_{M^{*}}}\sum_{t=a}^{b}\langle r^{*}(\bar{\mu}^{t})+R^{t}(\bar{\mu}^{t}),\,\mu-\bar{\mu}^{t}\rangle
=\sup_{1\leq a<b\leq T}\max_{\mu\in\Psi_{M^{*}}}\sum_{t=a}^{b}\Big\langle r^{*}(\bar{\mu}^{t})+R^{t}(\bar{\mu}^{t}),\,\mu-\frac{1}{N}\sum_{n=1}^{N}\mu^{\pi^{n,t}}\Big\rangle \\
&\quad\leq\frac{1}{N}\sum_{n=1}^{N}\sup_{1\leq a<b\leq T}\max_{\mu\in\Psi_{M^{*}}}\sum_{t=a}^{b}\langle r^{*}(\bar{\mu}^{t})+R^{t}(\bar{\mu}^{t}),\,\mu-\mu^{\pi^{n,t}}\rangle
\leq\frac{1}{N}\sum_{n=1}^{N}\operatorname{AdaReg}(T)=\operatorname{AdaReg}(T),
\end{aligned}
$$

where we used Assumption B in the third step. ∎
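The second step (the first inequality) can be spelled out as follows. The shorthand $g^{t}$ and $F_{n}$ below is introduced here only for illustration and does not appear elsewhere in the paper. The step uses linearity of the inner product and the fact that the supremum of an average is at most the average of the suprema:

```latex
% Shorthand (illustrative only):
%   g^t := r^*(\bar{\mu}^t) + R^t(\bar{\mu}^t),
%   F_n(a,b,\mu) := \sum_{t=a}^{b} \langle g^t, \mu - \mu^{\pi^{n,t}} \rangle.
\begin{align*}
\sum_{t=a}^{b}\Big\langle g^{t},\,\mu-\frac{1}{N}\sum_{n=1}^{N}\mu^{\pi^{n,t}}\Big\rangle
  &= \frac{1}{N}\sum_{n=1}^{N} F_{n}(a,b,\mu), \\
\sup_{1\leq a<b\leq T}\max_{\mu\in\Psi_{M^{*}}}\frac{1}{N}\sum_{n=1}^{N}F_{n}(a,b,\mu)
  &\leq \frac{1}{N}\sum_{n=1}^{N}\sup_{1\leq a<b\leq T}\max_{\mu\in\Psi_{M^{*}}}F_{n}(a,b,\mu).
\end{align*}
```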

D.2 Concrete Examples Satisfying No-Adaptive Regret Assumption

In this section, we provide some concrete examples of agents' learning dynamics to support our argument for the practicality of Assump. B.

Example 1: Colluded Agents with Full Observation of $R^{t}$

If the agents are able to observe the mediator's steering strategy $R^{t}$ and $R^{t}$ is Lipschitz in the density (which is indeed satisfied by our proposed algorithms), the agents can collude and play an (approximate) Nash Equilibrium policy induced by the reward function $r^{*}+R^{t}$, which is guaranteed to exist given the Lipschitz condition (Huang et al., 2023). By the definition of Nash Equilibrium, each agent then has non-positive adaptive regret, which satisfies Assump. B.

Note that in the contract design literature, it is usually assumed that the agent best-responds (Ho et al., 2014; Zhu et al., 2022) to the principal's (mediator's) strategy if there is only one agent, or that the agents play equilibrium policies in the many-agent setting (Carmona and Wang, 2021; Elie et al., 2019). Based on the discussion above, those assumptions are strictly stronger than, and imply, our no-adaptive-regret assumption.

Example 2: Independent Agents Conducting Online Convex Learning

In this second example, we consider less powerful agents who cannot observe the entire $R^{t}$ or coordinate with the other agents. Note that from an agent's perspective, the interaction protocol in Procedure 1 can be interpreted as an online linear optimization task, as in Procedure 3.

Procedure 3 Agent-adversary interaction
for $t=1,\dots,T$ do
    Agent chooses $x_{t}\in\mathcal{X}$, where $\mathcal{X}\subseteq\mathbb{R}^{d}$ is a convex set in Euclidean space.
    Adversary chooses a reward vector $r_{t}\in\mathbb{R}^{d}$, possibly based on the history and $x_{t}$.
    Agent observes $r_{t}$ and obtains reward $\langle r_{t},x_{t}\rangle$.
end for

In our setting, in each iteration $t\in[T]$, the agents pick a density (by picking a policy) from the convex set $\Psi_{M^{*}}$ and receive potentially adversarial feedback $R^{t}(\bar{\mu}^{t})$ (or $\langle R^{t}(\bar{\mu}^{t}),\mu^{\pi^{n,t}}\rangle$ in the bandit feedback setting). Then, Assump. B coincides with the standard no-adaptive-regret guarantee in the online convex optimization setting. Therefore, Assump. B can be realized if each agent independently adopts any no-adaptive-regret online learning algorithm (Hazan and Seshadhri, 2007; Hazan, 2023).

As a concrete algorithm choice, online gradient descent (OGD) achieves an external regret bound of $\frac{3}{2}GD\sqrt{T}$ (Hazan, 2023), where $D\leq 2H$ is the diameter of $\mathcal{X}=\Psi_{M^{*}}$ and $G$ is an upper bound on $\|r_{t}\|_{2}\leq\sqrt{d}\|r_{t}\|_{\infty}$. In our case, we can bound $G\leq\sqrt{HSA}(r_{\max}+R_{\max})$. A bound for $r_{\max}+R_{\max}$ is discussed in Appendix D.4. Moreover, in the full feedback setting (the agents know the model $M$ and are able to observe $R^{t}(\bar{\mu}^{t})$), the no-adaptive-regret assumption is not much stronger than no-external-regret, as demonstrated by the following proposition.

Proposition D.1 (Theorem 1.3 of Hazan and Seshadhri (2007)).

Let $(r_{t})_{t=1}^{T}$ be reward vectors in $[0,C]^{d}$. Any algorithm following Procedure 3 with external regret $\operatorname{Reg}(T)$ can be used to build an algorithm with adaptive regret at most $\operatorname{Reg}(T)+\mathcal{O}(C\sqrt{T\log T})$.

Thus, Assump. B can be satisfied with an adaptive regret bound of $\tilde{\mathcal{O}}(\sqrt{T})$ if all the agents follow OGD, modified as in Prop. D.1.
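To make the quantities in this example concrete, the following is a minimal sketch under simplifying assumptions: a stateless problem in which the decision set is the probability simplex over actions (rather than a general $\Psi_{M^{*}}$), full feedback, and plain projected OGD with a fixed step size. It does not implement the modification from Prop. D.1, and the function names `project_to_simplex`, `ogd`, and `regrets` are hypothetical. The sketch runs OGD against a given reward sequence and computes both the external regret and the adaptive regret (the maximum over all intervals $a<b$) by brute force.

```python
import math


def project_to_simplex(v):
    """Euclidean projection of the vector v onto the probability simplex."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for i, u_i in enumerate(u, start=1):
        cumulative += u_i
        candidate = (cumulative - 1.0) / i
        if u_i - candidate > 0:  # holds exactly for the active coordinates
            theta = candidate
    return [max(x - theta, 0.0) for x in v]


def ogd(rewards, step):
    """Projected online gradient ascent on linear rewards <r_t, x_t>."""
    d = len(rewards[0])
    x = [1.0 / d] * d  # start from the uniform distribution
    plays = []
    for r in rewards:
        plays.append(x)  # x_t is chosen before r_t is revealed
        x = project_to_simplex([xi + step * ri for xi, ri in zip(x, r)])
    return plays


def regrets(rewards, plays):
    """Return (external regret, adaptive regret) for linear rewards.

    A linear reward is maximized at a vertex of the simplex, so comparing
    against the best fixed action is enough.
    """
    T, d = len(rewards), len(rewards[0])
    prefix = [[0.0] * d]  # prefix[t][i]: reward of action i over rounds 0..t-1
    earned = [0.0]        # earned[t]: the agent's reward over rounds 0..t-1
    for r, x in zip(rewards, plays):
        prefix.append([p + ri for p, ri in zip(prefix[-1], r)])
        earned.append(earned[-1] + sum(ri * xi for ri, xi in zip(r, x)))

    def on_interval(a, b):
        best = max(prefix[b + 1][i] - prefix[a][i] for i in range(d))
        return best - (earned[b + 1] - earned[a])

    external = on_interval(0, T - 1)
    adaptive = max(on_interval(a, b) for a in range(T) for b in range(a + 1, T))
    return external, adaptive


if __name__ == "__main__":
    T = 200
    # Illustrative reward sequence: action 1 pays first, then action 2.
    rewards = [[1.0, 0.0]] * (T // 2) + [[0.0, 1.0]] * (T // 2)
    plays = ogd(rewards, step=1.0 / math.sqrt(T))
    print(regrets(rewards, plays))
```

Since the full horizon is itself one of the intervals, the adaptive regret returned here is never smaller than the external regret.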

D.3 Motivating Adaptive Regret

Here, we give a small example to motivate why we need the no-adaptive-regret assumption instead of no-external-regret. External regret is one of the most common notions of regret; it coincides with the adaptive regret in Assumption B, except that $a=1$ and $b=T$ are fixed. If we want to steer the agents in different directions over time, the no-external-regret assumption might not be enough, as the following example shows.

Consider the stateless setting with $|\mathcal{A}|=2$, where the incentive designer deploys $R(\mu)=\mathbf{e}_{1}$ for the first $T/2$ iterations and $R(\mu)=\mathbf{e}_{2}$ for the remaining $T/2$ iterations.

Suppose all the agents perform the Hedge algorithm, where

$$
\bar{\mu}^{t}(a)=\bar{\pi}^{t}(a)=\frac{1}{Z_{t}}\exp\left(1+\eta\sum_{s=1}^{t-1}\langle R^{s}(\bar{\mu}^{s}),\mathbf{e}_{a}\rangle\right),
$$

and $Z_{t}$ is the normalizing constant. This algorithm is known to have sublinear external regret (Freund and Schapire, 1997). The population density at iteration $t\geq T/2$ is

$$
\bar{\mu}^{t}=\frac{1}{Z_{t}}\begin{pmatrix}\exp(1+\eta T/2)\\ \exp(1+\eta(t-T/2))\end{pmatrix},
$$

while the optimal action is $\mathbf{e}_{2}$. Thus, over the interval $[T/2+1,T]$, the agents accumulate expected regret

$$
\sum_{t=T/2+1}^{T}(1-\bar{\mu}^{t}(2))=\sum_{t=T/2+1}^{T}\underbrace{\frac{\exp(1+\eta T/2)}{\exp(1+\eta T/2)+\exp(1+\eta(t-T/2))}}_{\geq 1/2}\geq T/4.
$$

So, although this algorithm has no external regret, we still might have to wait $\Omega(T)$ many rounds to let the agents converge to a different density. One can easily observe that with the no-adaptive-regret assumption, this is not an issue.
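The example can be checked numerically. The sketch below simulates the two-action Hedge dynamics described above, with the steering reward switching from $\mathbf{e}_{1}$ to $\mathbf{e}_{2}$ halfway through. The horizon `T = 1000` and learning rate `eta = 0.1` in the usage line, and the function names, are illustrative assumptions rather than values from the paper.

```python
import math


def hedge_density(cumulative, eta):
    """Hedge weights proportional to exp(1 + eta * cumulative reward)."""
    # The constant 1 in the exponent cancels in the normalizer Z_t, so we
    # subtract the largest exponent instead, purely for numerical stability.
    m = max(cumulative)
    weights = [math.exp(eta * (c - m)) for c in cumulative]
    z = sum(weights)
    return [w / z for w in weights]


def simulate(T, eta):
    """Return (external regret on [1, T], regret on [T/2 + 1, T])."""
    half = T // 2
    cumulative = [0.0, 0.0]  # cumulative steering reward of each action
    earned_total = 0.0
    second_half_regret = 0.0
    for t in range(1, T + 1):
        mu = hedge_density(cumulative, eta)
        rewarded = 0 if t <= half else 1  # R = e_1 first, then R = e_2
        earned_total += mu[rewarded]
        if t > half:
            # On the second half the best action is e_2 with reward 1.
            second_half_regret += 1.0 - mu[1]
        cumulative[rewarded] += 1.0
    # Over the full horizon, either fixed action earns exactly T/2.
    external_regret = half - earned_total
    return external_regret, second_half_regret


if __name__ == "__main__":
    print(simulate(T=1000, eta=0.1))
```

In this simulation the external regret stays bounded by a constant while the regret restricted to the second half grows linearly in $T$, which is the gap that the no-adaptive-regret assumption rules out.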

D.4 Boundedness Of Steering Rewards

As we see in Assumption B, the adaptive regret bound $\operatorname{AdaReg}(T)$ depends on $r_{\max}+R_{\max}$. In this section, we show that $R_{\max}=\mathcal{O}(1+r_{\max})$ for both of our steering rewards $R_{\text{z}}$ and $R_{\text{nz}}$.

Proposition D.2.

For any $\pi\in\Pi$ and $\mu\in\Psi$, $\|R_{\pi}(\mu)\|_{\infty}\leq 2$, where $R_{\pi}$ is defined as in Eq. (5).

Proof.

As one can observe using the definition of $R_{\pi}$ in Eq. (5) and $W^{\pi}$ in Eq. (4), we have for any $h,s,a$,

$$
\begin{aligned}
\left|(R_{\pi}(\mu))_{h,s,a}\right|
&=\left|(\mu^{\top}(W^{\pi}-I)^{\top}(W^{\pi}-I))_{h,s,a}\right|
=\left|(W^{\pi}\mu-\mu)^{\top}(W^{\pi}-I)_{(h,s,a)}\right| \\
&=\left|\sum_{a'}\big(\pi_{h}(a'|s)\mu_{h}(s)-\mu_{h}(s,a')\big)\big(\pi_{h}(a'|s)-\mathbb{I}\{a'=a\}\big)\right|
\leq\sum_{a'}\underbrace{\left|\pi_{h}(a'|s)\mu_{h}(s)-\mu_{h}(s,a')\right|}_{\leq 1}\cdot\left|\pi_{h}(a'|s)-\mathbb{I}\{a'=a\}\right| \\
&\leq\sum_{a'\neq a}\pi_{h}(a'|s)+\left|\pi_{h}(a|s)-1\right|\leq 2,
\end{aligned}
$$

where $(W^{\pi}-I)_{(h,s,a)}$ is the $(h,s,a)$-th column of $W^{\pi}-I$. ∎
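As a sanity check on this bound, the sketch below evaluates the entrywise expression from the proof, $\sum_{a'}(\pi_{h}(a'|s)\mu_{h}(s)-\mu_{h}(s,a'))(\pi_{h}(a'|s)-\mathbb{I}\{a'=a\})$, for a single fixed pair $(h,s)$ on randomly drawn inputs. It relies only on that displayed sum, not on the full definitions in Eq. (4) and Eq. (5); the function names and the sampling scheme are hypothetical choices made for illustration.

```python
import random


def steering_entry(pi, mu_sa, a):
    """Entry (h, s, a) of R_pi(mu), via the sum in the proof of Prop. D.2.

    pi:    list with pi_h(a'|s) for each action a' (a probability vector)
    mu_sa: list with mu_h(s, a') for each action a' (nonnegative, sum <= 1)
    """
    mu_s = sum(mu_sa)  # state marginal mu_h(s)
    return sum(
        (pi[b] * mu_s - mu_sa[b]) * (pi[b] - (1.0 if b == a else 0.0))
        for b in range(len(pi))
    )


def random_distribution(n, rng):
    """Draw a random probability vector of length n."""
    w = [rng.random() for _ in range(n)]
    total = sum(w)
    return [x / total for x in w]


def max_abs_entry(num_actions, trials, seed=0):
    """Largest |R_pi(mu)_{h,s,a}| observed over random (pi, mu) pairs."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        pi = random_distribution(num_actions, rng)
        mass = rng.random()  # total mass mu_h(s) placed on this state
        mu_sa = [mass * p for p in random_distribution(num_actions, rng)]
        for a in range(num_actions):
            worst = max(worst, abs(steering_entry(pi, mu_sa, a)))
    return worst


if __name__ == "__main__":
    print(max_abs_entry(num_actions=4, trials=2000))
```

Note that the entry is exactly zero whenever $\mu_{h}(s,a')=\pi_{h}(a'|s)\mu_{h}(s)$ for all $a'$, that is, when the density is already consistent with $\pi$ at $(h,s)$, since every first factor in the sum vanishes.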

Proposition D.3.

We have for $R_{\text{z}}$, as defined in Alg. 2, and for $R_{\text{nz}}$, as defined in Eq. (6), that for all iterations $t\in[T]$,

$$
\|R_{\text{z}}^{t}(\mu)\|_{\infty}\leq 4\quad\text{and}\quad\|R_{\text{nz}}^{t}(\mu)\|_{\infty}\leq 2r_{\max}+4\quad\forall\mu\in\Delta_{\mathcal{S}\times\mathcal{A}}^{H}.
$$
Proof.

We have $\|R_{\text{z}}^{t}(\mu)\|_{\infty}=\big\|R_{\pi_{*}^{k(t)}}(\mu)+\|R_{\pi_{*}^{k(t)}}(\mu)\|_{\infty}\bm{1}\big\|_{\infty}\leq 2\|R_{\pi_{*}^{k(t)}}(\mu)\|_{\infty}$, which is at most $4$, by Prop. D.2.

By the definition of $w_{\hat{\mathcal{R}}^{t}}$, we know that the elements of $w_{\hat{\mathcal{R}}^{t}}(\mu)$ are bounded in $[0,r_{\max}]$ for any $\mu$. Therefore, and by Prop. D.2,

$$
\begin{aligned}
\left\|R_{\text{nz}}^{t}(\mu)\right\|_{\infty}
&=\left\|R_{\pi_{*}^{k(t)}}(\mu)-\big(\bar{r}^{t}(\mu)-w_{\hat{\mathcal{R}}^{t}}(\mu)\big)+\big(r_{\max}+\|R_{\pi_{*}^{k(t)}}(\mu)\|_{\infty}\big)\bm{1}\right\|_{\infty} \\
&\leq\left\|R_{\pi_{*}^{k(t)}}(\mu)+\|R_{\pi_{*}^{k(t)}}(\mu)\|_{\infty}\bm{1}\right\|_{\infty}+\left\|\bar{r}^{t}(\mu)-w_{\hat{\mathcal{R}}^{t}}(\mu)\right\|_{\infty}+\left\|r_{\max}\bm{1}\right\|_{\infty} \\
&\leq 4+2r_{\max}.
\end{aligned}
$$

∎

Appendix E STATE-ACTION DENSITY

E.1 $\Psi_{M}$ Is Convex

Lemma E.1.

$$
\Psi_{M}=\Big\{\mu:\ \mu\geq 0,\ \sum_{a'}\mu_{h+1}(s,a')=\sum_{s',a'}\mathbb{P}_{M,h}(s|s',a')\mu_{h}(s',a')\ \ \forall h,s,\ \ \sum_{a'}\mu_{1}(s,a')=\mu_{1}(s)\Big\}
$$
Proof.

We abbreviate $\mathbb{P}=\mathbb{P}_{M}$ and $\mu=\mu_{M}$, since the model is fixed throughout. For $\mu\in\Psi_{M}$, it is easy to see that the conditions on the right-hand side are fulfilled. The other direction is more involved. Suppose $\tilde{\mu}$ fulfills $\tilde{\mu}\geq 0$ and, for all $s,h$, $\sum_{a'}\tilde{\mu}_{h+1}(s,a')=\sum_{s',a'}\mathbb{P}_{h}(s|s',a')\,\tilde{\mu}_{h}(s',a')$ as well as $\sum_{a'}\tilde{\mu}_{1}(s,a')=\mu_{1}(s)$. Now, define $\pi$ such that for all $s,a,h$,

\[
\pi_{h}(a|s)=\begin{cases}\dfrac{\tilde{\mu}_{h}(s,a)}{\sum_{a'}\tilde{\mu}_{h}(s,a')},&\text{if }\sum_{a'}\tilde{\mu}_{h}(s,a')\neq 0\\[4pt] 1/A,&\text{else}\end{cases}.
\]

Clearly, $\pi_{h}(a|s)\geq 0$ and $\sum_{a}\pi_{h}(a|s)=1$, which means $\pi\in\Pi$. First of all,

\[
\mu_{1}^{\pi}(s,a)=\pi_{1}(a|s)\,\mu_{1}(s)=\frac{\tilde{\mu}_{1}(s,a)}{\sum_{a'}\tilde{\mu}_{1}(s,a')}\,\mu_{1}(s)=\tilde{\mu}_{1}(s,a).
\]

By induction on $h$: assume $\mu_{h}^{\pi}=\tilde{\mu}_{h}$ for some $h\geq 1$. If $\sum_{a'}\tilde{\mu}_{h+1}(s,a')\neq 0$, then

\begin{align*}
\mu_{h+1}^{\pi}(s,a)&=\sum_{s',a'}\pi_{h+1}(a|s)\,\mathbb{P}_{h}(s|s',a')\,\mu_{h}^{\pi}(s',a')\\
&=\pi_{h+1}(a|s)\sum_{s',a'}\mathbb{P}_{h}(s|s',a')\,\tilde{\mu}_{h}(s',a')\\
&=\frac{\tilde{\mu}_{h+1}(s,a)}{\sum_{a'}\tilde{\mu}_{h+1}(s,a')}\sum_{a'}\tilde{\mu}_{h+1}(s,a')=\tilde{\mu}_{h+1}(s,a),
\end{align*}

and in case $\sum_{a'}\tilde{\mu}_{h+1}(s,a')=0$, we know that $\tilde{\mu}_{h+1}(s,a)=0$ (since $\tilde{\mu}\geq 0$) and therefore

\begin{align*}
\mu_{h+1}^{\pi}(s,a)&=\sum_{s',a'}\pi_{h+1}(a|s)\,\mathbb{P}_{h}(s|s',a')\,\mu_{h}^{\pi}(s',a')\\
&=\pi_{h+1}(a|s)\sum_{s',a'}\mathbb{P}_{h}(s|s',a')\,\tilde{\mu}_{h}(s',a')\\
&=1/A\sum_{a'}\tilde{\mu}_{h+1}(s,a')=0=\tilde{\mu}_{h+1}(s,a).
\end{align*}

We can conclude that $\tilde{\mu}=\mu^{\pi}$ and thus $\tilde{\mu}\in\Psi_{M}$. ∎
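As an informal numerical illustration of the construction in this proof (not part of the original argument), the following Python sketch computes an occupancy measure by the forward recursion, recovers a policy from it by normalising over actions (uniform where the state mass is zero), and checks that the recovered policy induces the same occupancy measure. The array layout `P[h, s, a, s2]`, the helper names `occupancy` and `policy_from_occupancy`, and the random test instance are assumptions made purely for illustration.

```python
import numpy as np

def occupancy(P, pi, mu1):
    """Forward recursion for the state-action occupancy measure.

    P[h, s, a, s2]: probability of moving from (s, a) at step h to s2,
    pi[h, s, a]:    policy, mu1[s]: initial state distribution.
    """
    H, S, A = pi.shape
    mu = np.zeros((H, S, A))
    mu[0] = mu1[:, None] * pi[0]
    for h in range(H - 1):
        next_state = np.einsum("sa,sat->t", mu[h], P[h])  # state mass at h+1
        mu[h + 1] = next_state[:, None] * pi[h + 1]
    return mu

def policy_from_occupancy(mu):
    """Normalise mu over actions; uniform policy where the state mass is zero."""
    H, S, A = mu.shape
    mass = mu.sum(axis=2, keepdims=True)
    safe = np.where(mass > 0, mass, 1.0)  # avoid division by zero
    return np.where(mass > 0, mu / safe, 1.0 / A)

rng = np.random.default_rng(0)
H, S, A = 4, 3, 2
P = rng.dirichlet(np.ones(S), size=(H - 1, S, A))  # shape (H-1, S, A, S)
pi = rng.dirichlet(np.ones(A), size=(H, S))        # shape (H, S, A)
mu1 = np.array([0.5, 0.5, 0.0])                    # one state has zero mass

mu_tilde = occupancy(P, pi, mu1)           # a point of Psi_M
pi_rec = policy_from_occupancy(mu_tilde)   # policy defined as in the proof
mu_rec = occupancy(P, pi_rec, mu1)         # occupancy of the recovered policy
print(np.allclose(mu_rec, mu_tilde))
```

The state with zero initial mass exercises the `1/A` branch of the definition: the recovered policy differs from the original one there, yet the induced occupancy measure is unchanged.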

Lemma E.2.
\[
\Psi_{M}=\{\mu:\ \mu\geq 0,\ B\mu=b\},
\]

where

\[
B=\begin{pmatrix}
D&&&&\\
-\mathbb{P}_{M,1}^{\top}&D&&&\\
&-\mathbb{P}_{M,2}^{\top}&D&&\\
&&\ddots&\ddots&\\
&&&-\mathbb{P}_{M,H-1}^{\top}&D
\end{pmatrix},\quad
b=\begin{pmatrix}\mu_{1}\\ 0\\ \vdots\\ 0\end{pmatrix},
\]

$D:=I_{S}\otimes\mathbf{1}_{A}^{\top}$ ($\otimes$ is the tensor product) and $\mathbb{P}_{M}$ is viewed as a matrix such that $(\mathbb{P}_{M,h})_{(s,a),s'}=\mathbb{P}_{M,h}(s'|s,a)$. An immediate consequence of this formulation is that $\Psi_{M}$ is convex.

Proof.

This result is simply a reformulation of Lemma E.1. We can rewrite the condition $\sum_{a'}\mu_{h+1}(s,a')=\sum_{s',a'}\mathbb{P}_{M,h}(s|s',a')\,\mu_{h}(s',a')$ as $D\mu_{h+1}=\mathbb{P}_{M,h}^{\top}\mu_{h}$. The condition $\sum_{a'}\mu_{1}(s,a')=\mu_{1}(s)$ can be written as $D\mu_{1}(\cdot,\cdot)=\mu_{1}$. ∎
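To make the matrix formulation concrete, here is a small NumPy sketch that assembles $B$ and $b$ and checks $B\mu=b$ for policy-induced occupancy measures, as well as for a convex combination of two of them. The flattening order (step, then state, then action), the helper `occupancy`, and the random instance are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def occupancy(P, pi, mu1):
    """State-action occupancy measure of policy pi; P[h, s, a, s2]."""
    H, S, A = pi.shape
    mu = np.zeros((H, S, A))
    mu[0] = mu1[:, None] * pi[0]
    for h in range(H - 1):
        mu[h + 1] = np.einsum("sa,sat->t", mu[h], P[h])[:, None] * pi[h + 1]
    return mu

rng = np.random.default_rng(1)
H, S, A = 3, 4, 2
P = rng.dirichlet(np.ones(S), size=(H - 1, S, A))  # shape (H-1, S, A, S)
mu1 = rng.dirichlet(np.ones(S))

# D = I_S (x) 1_A^T sums a vector indexed by (s, a) over the actions.
D = np.kron(np.eye(S), np.ones((1, A)))            # shape (S, S*A)

# Block lower-bidiagonal B: D on the diagonal, -P_h^T below it.
B = np.zeros((H * S, H * S * A))
for h in range(H):
    B[h * S:(h + 1) * S, h * S * A:(h + 1) * S * A] = D
    if h > 0:
        B[h * S:(h + 1) * S, (h - 1) * S * A:h * S * A] = -P[h - 1].reshape(S * A, S).T
b = np.concatenate([mu1, np.zeros((H - 1) * S)])

mu_a = occupancy(P, rng.dirichlet(np.ones(A), size=(H, S)), mu1)
mu_b = occupancy(P, rng.dirichlet(np.ones(A), size=(H, S)), mu1)
mix = 0.3 * mu_a + 0.7 * mu_b                      # convex combination

print(np.allclose(B @ mu_a.reshape(-1), b), np.allclose(B @ mix.reshape(-1), b))
```

Because the constraints are linear in $\mu$, the mixture satisfies them whenever both endpoints do, which is the convexity observation made in the lemma.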

E.2 Inequalities

Lemma E.3.

For any model $M$ and any $\pi,\tilde{\pi}\in\Pi$,

\[
\|\mu_{M}^{\pi}-\mu_{M}^{\tilde{\pi}}\|_{1}\leq H\sum_{h,s}\mu_{M,h}^{\pi}(s)\,\|\pi_{h}(\cdot|s)-\tilde{\pi}_{h}(\cdot|s)\|_{1},
\]

where $\mu_{M,h}^{\pi}(s)=\sum_{a}\mu_{M,h}^{\pi}(s,a)$.

Proof.

Since the model $M$ is fixed throughout, we abbreviate $\mu=\mu_{M}$ and $\mathbb{P}=\mathbb{P}_{M}$. First of all, $\|\mu_{1}^{\pi}-\mu_{1}^{\tilde{\pi}}\|_{1}=\sum_{s,a}\mu_{1}(s)\,|\pi_{1}(a|s)-\tilde{\pi}_{1}(a|s)|$. Furthermore, for any $h$,

\begin{align*}
\|\mu_{h+1}^{\pi}-\mu_{h+1}^{\tilde{\pi}}\|_{1}&=\sum_{s,a}|\mu_{h+1}^{\pi}(s,a)-\mu_{h+1}^{\tilde{\pi}}(s,a)|\\
&=\sum_{s,a}\left|\pi_{h+1}(a|s)\sum_{s',a'}\mu_{h}^{\pi}(s',a')\,\mathbb{P}_{h}(s|s',a')-\tilde{\pi}_{h+1}(a|s)\sum_{s',a'}\mu_{h}^{\tilde{\pi}}(s',a')\,\mathbb{P}_{h}(s|s',a')\right|\\
&\leq\sum_{s,a}\left|\pi_{h+1}(a|s)-\tilde{\pi}_{h+1}(a|s)\right|\sum_{s',a'}\mu_{h}^{\pi}(s',a')\,\mathbb{P}_{h}(s|s',a')\\
&\quad+\sum_{s,a}\tilde{\pi}_{h+1}(a|s)\sum_{s',a'}\left|\mu_{h}^{\pi}(s',a')-\mu_{h}^{\tilde{\pi}}(s',a')\right|\mathbb{P}_{h}(s|s',a')\\
&=\sum_{s,a}\mu_{h+1}^{\pi}(s)\left|\pi_{h+1}(a|s)-\tilde{\pi}_{h+1}(a|s)\right|+\sum_{s',a'}\left|\mu_{h}^{\pi}(s',a')-\mu_{h}^{\tilde{\pi}}(s',a')\right|\sum_{s,a}\tilde{\pi}_{h+1}(a|s)\,\mathbb{P}_{h}(s|s',a')\\
&=\sum_{s}\mu_{h+1}^{\pi}(s)\,\|\pi_{h+1}(\cdot|s)-\tilde{\pi}_{h+1}(\cdot|s)\|_{1}+\|\mu_{h}^{\pi}-\mu_{h}^{\tilde{\pi}}\|_{1}.
\end{align*}

By induction,

\[
\|\mu_{h}^{\pi}-\mu_{h}^{\tilde{\pi}}\|_{1}\leq\sum_{h'=1}^{h}\sum_{s}\mu_{h'}^{\pi}(s)\,\|\pi_{h'}(\cdot|s)-\tilde{\pi}_{h'}(\cdot|s)\|_{1}.
\]

Finally,

\begin{align*}
\|\mu^{\pi}-\mu^{\tilde{\pi}}\|_{1}&\leq\sum_{h=1}^{H}\sum_{h'=1}^{h}\sum_{s}\mu_{h'}^{\pi}(s)\,\|\pi_{h'}(\cdot|s)-\tilde{\pi}_{h'}(\cdot|s)\|_{1}\\
&\leq H\sum_{h=1}^{H}\sum_{s}\mu_{h}^{\pi}(s)\,\|\pi_{h}(\cdot|s)-\tilde{\pi}_{h}(\cdot|s)\|_{1}.
\end{align*}

โˆŽ
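
The final inequality of the preceding proof exchanges the order of the two sums over time steps. As a short worked version of that step, write $a_{h'}$ for the inner sum over states; this shorthand is introduced here only for illustration and does not appear elsewhere in the paper.

\begin{align*}
\sum_{h=1}^{H}\sum_{h'=1}^{h} a_{h'}
= \sum_{h'=1}^{H}(H-h'+1)\,a_{h'}
\leq H\sum_{h'=1}^{H} a_{h'},
\qquad
a_{h'} := \sum_{s}\mu_{h'}^{\pi}(s)\,\|\pi_{h'}(\cdot|s)-\tilde{\pi}_{h'}(\cdot|s)\|_{1} \geq 0.
\end{align*}

Each term $a_{h'}$ is counted once for every $h \geq h'$, that is, $H-h'+1 \leq H$ times, and non-negativity of $a_{h'}$ gives the bound.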

See 4.2

Proof.

Note that $\Psi_{M}$ is a convex set and $\bar{\mu}_{M}\in\Psi_{M}$. By definition, we have $\bar{\mu}_{M}=\mu_{M}^{\bar{\pi}}$. Applying Lem. E.3 to the policies $\bar{\pi}$ and $\pi$ in model $M$ completes the proof. ∎

Lemma E.4.

Consider any $\pi\in\Pi$ and models $M,\tilde{M}$ that are identical except for their transition functions $\mathbb{P}$ and $\tilde{\mathbb{P}}$, respectively. Then,

$$\|\mu_{\tilde{M}}^{\pi}-\mu_{M}^{\pi}\|_{1}\leq H\sum_{h=1}^{H-1}\sum_{s,a}\mu_{M,h}^{\pi}(s,a)\,\|\tilde{\mathbb{P}}_{h}(\cdot|s,a)-\mathbb{P}_{h}(\cdot|s,a)\|_{1}.$$
Proof.

Recall the definition of the state-action density function:

$$\mu_{M,h+1}^{\pi}(s,a)=\sum_{s',a'}\pi_{h+1}(a|s)\,\mathbb{P}_{M,h}(s|s',a')\,\mu_{M,h}^{\pi}(s',a')\quad\text{and}\quad\mu_{1}^{\pi}(s,a)=\pi_{1}(a|s)\,\mu_{1}(s).$$

We abbreviate $\tilde{\mu}=\mu_{\tilde{M}}^{\pi}$ and $\mu=\mu_{M}^{\pi}$. Since $\mu_{1}$ is the same for $M$ and $\tilde{M}$, $\sum_{s,a}|\tilde{\mu}_{1}(s,a)-\mu_{1}(s,a)|=0$. Furthermore, for all $h$,

\begin{align*}
\|\tilde{\mu}_{h+1}-\mu_{h+1}\|_{1}
&= \sum_{s,a}\left|\tilde{\mu}_{h+1}(s,a)-\mu_{h+1}(s,a)\right| \\
&\leq \sum_{s,a}\sum_{s',a'}\pi_{h+1}(a|s)\left|\tilde{\mathbb{P}}_{h}(s|s',a')\tilde{\mu}_{h}(s',a')-\mathbb{P}_{h}(s|s',a')\mu_{h}(s',a')\right| \\
&= \sum_{s,a,s'}\left|\tilde{\mathbb{P}}_{h}(s'|s,a)\tilde{\mu}_{h}(s,a)-\mathbb{P}_{h}(s'|s,a)\mu_{h}(s,a)\right| \\
&\leq \sum_{s,a,s'}\Big(\tilde{\mathbb{P}}_{h}(s'|s,a)\left|\tilde{\mu}_{h}(s,a)-\mu_{h}(s,a)\right|+\mu_{h}(s,a)\left|\tilde{\mathbb{P}}_{h}(s'|s,a)-\mathbb{P}_{h}(s'|s,a)\right|\Big) \\
&= \sum_{s,a}\left|\tilde{\mu}_{h}(s,a)-\mu_{h}(s,a)\right|+\sum_{s,a}\mu_{h}(s,a)\sum_{s'}\left|\tilde{\mathbb{P}}_{h}(s'|s,a)-\mathbb{P}_{h}(s'|s,a)\right| \\
&= \|\tilde{\mu}_{h}-\mu_{h}\|_{1}+\sum_{s,a}\mu_{h}(s,a)\,\|\tilde{\mathbb{P}}_{h}(\cdot|s,a)-\mathbb{P}_{h}(\cdot|s,a)\|_{1}.
\end{align*}

Using induction on $h$, we obtain $\|\tilde{\mu}_{h}-\mu_{h}\|_{1}\leq\sum_{h'=1}^{h-1}\sum_{s,a}\mu_{h'}(s,a)\,\|\tilde{\mathbb{P}}_{h'}(\cdot|s,a)-\mathbb{P}_{h'}(\cdot|s,a)\|_{1}$. Thus,

\begin{align*}
\|\tilde{\mu}-\mu\|_{1}
&\leq \sum_{h=1}^{H}\sum_{h'=1}^{h-1}\sum_{s,a}\mu_{h'}(s,a)\,\|\tilde{\mathbb{P}}_{h'}(\cdot|s,a)-\mathbb{P}_{h'}(\cdot|s,a)\|_{1} \\
&\leq H\sum_{h=1}^{H-1}\sum_{s,a}\mu_{h}(s,a)\,\|\tilde{\mathbb{P}}_{h}(\cdot|s,a)-\mathbb{P}_{h}(\cdot|s,a)\|_{1}.
\end{align*}

∎
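
The bound of Lemma E.4 can be sanity-checked numerically. The following is a minimal sketch, not part of the paper: it assumes small tabular instances with randomly drawn initial distribution, policy, and transition kernels, and the helper names (`random_dist`, `random_kernel`, `occupancy`, `check_bound`) are hypothetical. It computes the state-action densities by the forward recursion recalled in the proof and compares both sides of the inequality.

```python
import random


def random_dist(n, rng):
    """Return a random probability vector of length n."""
    w = [rng.random() + 1e-3 for _ in range(n)]
    z = sum(w)
    return [x / z for x in w]


def random_kernel(H, S, A, rng):
    """P[h][s][a] is a distribution over next states, for steps h = 0..H-2."""
    return [[[random_dist(S, rng) for _ in range(A)] for _ in range(S)]
            for _ in range(H - 1)]


def occupancy(mu1, pi, P):
    """State-action densities mu[h][s][a] computed by the forward recursion."""
    H, S, A = len(pi), len(mu1), len(pi[0][0])
    mu = [[[pi[0][s][a] * mu1[s] for a in range(A)] for s in range(S)]]
    for h in range(H - 1):
        # State density at step h+1: push mu[h] through the kernel P[h].
        nxt = [sum(P[h][sp][ap][s] * mu[h][sp][ap]
                   for sp in range(S) for ap in range(A)) for s in range(S)]
        mu.append([[pi[h + 1][s][a] * nxt[s] for a in range(A)]
                   for s in range(S)])
    return mu


def l1(u, v):
    """L1 distance between two vectors of equal length."""
    return sum(abs(x - y) for x, y in zip(u, v))


def check_bound(H=4, S=3, A=2, seed=0):
    """Return (lhs, rhs) of the Lemma E.4 inequality on a random instance."""
    rng = random.Random(seed)
    mu1 = random_dist(S, rng)
    pi = [[random_dist(A, rng) for _ in range(S)] for _ in range(H)]
    P = random_kernel(H, S, A, rng)
    P_tilde = random_kernel(H, S, A, rng)
    mu = occupancy(mu1, pi, P)
    mu_tilde = occupancy(mu1, pi, P_tilde)
    lhs = sum(abs(mu_tilde[h][s][a] - mu[h][s][a])
              for h in range(H) for s in range(S) for a in range(A))
    rhs = H * sum(mu[h][s][a] * l1(P_tilde[h][s][a], P[h][s][a])
                  for h in range(H - 1) for s in range(S) for a in range(A))
    return lhs, rhs


if __name__ == "__main__":
    for seed in range(5):
        lhs, rhs = check_bound(seed=seed)
        print(f"seed={seed}: lhs={lhs:.4f} <= rhs={rhs:.4f}: {lhs <= rhs + 1e-12}")
```

Such a check cannot replace the proof, but it is a quick way to catch indexing mistakes (for example, summing the right-hand side over $h=1,\dots,H$ instead of $h=1,\dots,H-1$) when implementing the densities.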

Appendix F PROOFS OF RESULTS IN SECTION 4

See 4.1

Proof.

We abbreviate $\bar{\mu}^{t}=\bar{\mu}^{t}_{M^{*}}$. By Assumption B, $\sum_{t=1}^{T}\|\bar{\mu}^{t}-\mu^{*}\|_{2}^{2}=\sum_{t=1}^{T}\langle R^{t}(\bar{\mu}^{t}),\mu^{*}-\bar{\mu}^{t}\rangle\leq\operatorname{AdaReg}(T)$. Thus,

$$\sum_{t=1}^{T}\|\bar{\mu}^{t}-\mu^{*}\|_{1}^{2}\leq HSA\sum_{t=1}^{T}\|\bar{\mu}^{t}-\mu^{*}\|_{2}^{2}\leq HSA\operatorname{AdaReg}(T).$$

By Assumption C and Jensen's inequality,

$$\max_{\mu}\sum_{t=1}^{T}\big(U(\mu)-U(\bar{\mu}^{t})\big)\leq L_{U}\cdot\sum_{t=1}^{T}\|\mu^{*}-\bar{\mu}^{t}\|_{1}\leq L_{U}\cdot\sqrt{HSAT\operatorname{AdaReg}(T)}.$$

The steering cost can be bounded similarly.

\begin{align*}
\sum_{t=1}^{T}\langle R^{t}(\bar{\mu}^{t}),\bar{\mu}^{t}\rangle
&= \sum_{t=1}^{T}\big(H\|\mu^{*}-\bar{\mu}^{t}\|_{\infty}+\langle\mu^{*}-\bar{\mu}^{t},\bar{\mu}^{t}\rangle\big)\leq 2H\sum_{t=1}^{T}\|\mu^{*}-\bar{\mu}^{t}\|_{\infty} \\
&\leq 2H\sum_{t=1}^{T}\|\mu^{*}-\bar{\mu}^{t}\|_{2}\leq 2H\sqrt{T\operatorname{AdaReg}(T)}.
\end{align*}

Given that $\operatorname{AdaReg}$ is sub-linear in $T$, we finish the proof. ∎
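
The Jensen step used in the preceding proof can be unpacked as follows. This is a sketch of the standard argument, using only the concavity of the square root and the bound on $\sum_{t}\|\bar{\mu}^{t}-\mu^{*}\|_{1}^{2}$ established in the proof:

\begin{align*}
\sum_{t=1}^{T}\|\mu^{*}-\bar{\mu}^{t}\|_{1}
= T\cdot\frac{1}{T}\sum_{t=1}^{T}\sqrt{\|\mu^{*}-\bar{\mu}^{t}\|_{1}^{2}}
\leq T\sqrt{\frac{1}{T}\sum_{t=1}^{T}\|\mu^{*}-\bar{\mu}^{t}\|_{1}^{2}}
= \sqrt{T\sum_{t=1}^{T}\|\mu^{*}-\bar{\mu}^{t}\|_{1}^{2}}
\leq \sqrt{HSAT\operatorname{AdaReg}(T)}.
\end{align*}

As an illustration under an assumed rate (not claimed in the paper), if $\operatorname{AdaReg}(T)=O(T^{\alpha})$ for some $\alpha<1$, then the right-hand side is $O(T^{(1+\alpha)/2})$, which is sub-linear in $T$. The same argument applied to $\|\mu^{*}-\bar{\mu}^{t}\|_{2}$ gives the $2H\sqrt{T\operatorname{AdaReg}(T)}$ bound on the steering cost.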

See 4.3

Proof.

The bound for the steering gap can be shown by first using the $L_{U}$-Lipschitzness of $U$ and then applying Lem. G.1 under Assump. B, where $\pi_{*}^{t}=\pi$ for all $t$. The calculation of the steering cost is the same as in the proof of Theorem 5.1 with $K=1$. ∎

Appendix G PROOF OF THEOREM 5.1

Lemma G.1.

Let $(\pi_{*}^{t})_{t=1}^{T}$ be a sequence of policies. We abbreviate $\bar{\mu}^{t}=\bar{\mu}^{t}_{M^{*}}$ and $\mu^{\pi_{*}^{t}}=\mu^{\pi_{*}^{t}}_{M^{*}}$. Then,

\begin{align*}
\frac{1}{H}\sum_{t=1}^{T}\|\bar{\mu}^{t}-\mu^{\pi_{*}^{t}}\|_{1}
&\leq \sum_{t=1}^{T}\sum_{h=1}^{H}\sum_{s\in\mathcal{S}}\bar{\mu}^{t}_{h}(s)\,\|\bar{\pi}^{t}_{h}(\cdot|s)-\pi_{*,h}^{t}(\cdot|s)\|_{1} \\
&\leq \sqrt{HSAT\sum_{t=1}^{T}\left\langle R_{\pi_{*}^{t}}(\bar{\mu}^{t}),\mu^{\pi_{*}^{t}}-\bar{\mu}^{t}\right\rangle},
\end{align*}

where $\bar{\pi}^{t}$ is the (population) policy which induces $\bar{\mu}^{t}=\mu^{\bar{\pi}^{t}}$.

Proof.

The first inequality follows from Lemma E.3. We can write

\begin{align*}
\sum_{t=1}^{T}\left\langle R_{\pi_{*}^{t}}(\bar{\mu}^{t}),\mu^{\pi_{*}^{t}}-\bar{\mu}^{t}\right\rangle
&= \sum_{t=1}^{T}\left\|(W^{\pi_{*}^{t}}-I)\bar{\mu}^{t}\right\|_{2}^{2} \\
&= \sum_{t=1}^{T}\sum_{h,s,a}\left(\pi_{*,h}^{t}(a|s)\sum_{a'}\bar{\mu}^{t}_{h}(s,a')-\bar{\mu}^{t}_{h}(s,a)\right)^{2} \\
&= \sum_{t=1}^{T}\sum_{h,s,a}(\bar{\mu}^{t}_{h}(s))^{2}\left(\pi_{*,h}^{t}(a|s)-\bar{\pi}^{t}_{h}(a|s)\right)^{2} \\
&= \sum_{t=1}^{T}\sum_{h,s}(\bar{\mu}^{t}_{h}(s))^{2}\left\|\pi_{*,h}^{t}(\cdot|s)-\bar{\pi}^{t}_{h}(\cdot|s)\right\|_{2}^{2}.
\end{align*}

Furthermore, by Jensenโ€™s inequality,

\begin{align*}
\sum_{t=1}^{T}\sum_{h,s}\bar{\mu}^{t}_{h}(s)\|\bar{\pi}^{t}_{h}(\cdot|s)-\pi^{t}_{*,h}(\cdot|s)\|_{1}
&\leq\sqrt{A}\sum_{t=1}^{T}\sum_{h,s}\bar{\mu}^{t}_{h}(s)\|\bar{\pi}^{t}_{h}(\cdot|s)-\pi^{t}_{*,h}(\cdot|s)\|_{2}\\
&\leq\sqrt{HSAT\sum_{t=1}^{T}\sum_{h,s}(\bar{\mu}^{t}_{h}(s))^{2}\|\bar{\pi}^{t}_{h}(\cdot|s)-\pi^{t}_{*,h}(\cdot|s)\|_{2}^{2}}\\
&=\sqrt{HSAT\sum_{t=1}^{T}\left\langle R_{\pi^{t}_{*}}(\bar{\mu}^{t}),\mu^{\pi^{t}_{*}}-\bar{\mu}^{t}\right\rangle}.
\end{align*}

∎
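The first two inequalities in the last display are norm comparisons that can be checked numerically. The following is a minimal sketch, assuming randomly drawn densities and policies; the helper names `random_simplex` and `norm_chain` are hypothetical and not part of the paper.

```python
import math
import random

def random_simplex(n, rng):
    """Draw a random probability vector of length n."""
    w = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(w)
    return [x / total for x in w]

def norm_chain(T, H, S, A, seed=0):
    """Return the three quantities compared in the chain of inequalities.

    lhs   = sum_{t,h,s} mu(s) * ||pi_bar - pi_star||_1
    mid   = sqrt(A) * sum_{t,h,s} mu(s) * ||pi_bar - pi_star||_2
    upper = sqrt(H*S*A*T * sum_{t,h,s} mu(s)^2 * ||pi_bar - pi_star||_2^2)
    """
    rng = random.Random(seed)
    lhs = 0.0
    weighted_l2 = 0.0
    weighted_sq = 0.0
    for _ in range(T):
        for _ in range(H):
            mu = random_simplex(S, rng)  # state density at this (t, h)
            for s in range(S):
                pi_bar = random_simplex(A, rng)
                pi_star = random_simplex(A, rng)
                diff = [p - q for p, q in zip(pi_bar, pi_star)]
                norm1 = sum(abs(d) for d in diff)
                norm2 = math.sqrt(sum(d * d for d in diff))
                lhs += mu[s] * norm1
                weighted_l2 += mu[s] * norm2
                weighted_sq += (mu[s] * norm2) ** 2
    mid = math.sqrt(A) * weighted_l2
    upper = math.sqrt(H * S * A * T * weighted_sq)
    return lhs, mid, upper

if __name__ == "__main__":
    print(norm_chain(T=5, H=3, S=4, A=3, seed=1))
```

The first comparison uses $\|x\|_1\leq\sqrt{A}\,\|x\|_2$ for vectors of length $A$, and the second bounds a sum of $THS$ nonnegative terms by $\sqrt{THS}$ times the root of the sum of their squares.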

Lemma G.2.

For any $0<\delta<1$, with probability at least $1-\delta$,

\[
\|\mathbb{P}_{\hat{M}^{k},h}(\cdot|s,a)-\mathbb{P}^{*}_{h}(\cdot|s,a)\|_{1}\leq 2\varepsilon_{k}(h,s,a)
\]

for all $k,h,s,a$, where $\varepsilon_{k}(h,s,a):=\sqrt{\frac{2S\ln(THSA/\delta)}{\max\{1,N_{k}(h,s,a)\}}}$.

Proof.

Since $\mathbb{P}_{\hat{M}^{k}}\in\mathcal{P}^{k}$, we have $\|\mathbb{P}_{\hat{M}^{k},h}(\cdot|s,a)-\bar{\mathbb{P}}^{k}_{h}(\cdot|s,a)\|_{1}\leq\varepsilon_{k}(h,s,a)$ for all $k,h,s,a$. By (5) and Theorem 2.1 of Weissman et al. (2003),

\[
\Pr\left[\|\bar{\mathbb{P}}^{k}_{h}(\cdot|s,a)-\mathbb{P}^{*}_{h}(\cdot|s,a)\|_{1}>\varepsilon\right]\leq(2^{S}-2)e^{-N_{k}(h,s,a)\varepsilon^{2}/2}.
\]

Plugging in $\varepsilon_{k}(h,s,a)$ for $\varepsilon$ bounds this probability by $\delta/(THSA)$. The triangle inequality and a union bound over all $k,h,s,a$ imply the result. ∎
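The substitution in the proof of Lemma G.2 can be illustrated numerically. The sketch below evaluates the confidence radius $\varepsilon_k(h,s,a)$ and the tail bound quoted above for a given visit count; the function names `confidence_radius` and `weissman_tail` and the sample parameter values are assumptions made for illustration, and only visit counts of at least 1 are considered.

```python
import math

def confidence_radius(n_visits, S, T, H, A, delta):
    """epsilon_k(h,s,a) = sqrt(2 S ln(THSA/delta) / max{1, N_k(h,s,a)})."""
    return math.sqrt(2 * S * math.log(T * H * S * A / delta) / max(1, n_visits))

def weissman_tail(n_visits, S, eps):
    """Tail bound (2^S - 2) * exp(-N * eps^2 / 2) as quoted in the proof."""
    return (2 ** S - 2) * math.exp(-n_visits * eps ** 2 / 2)

if __name__ == "__main__":
    S, T, H, A, delta = 3, 100, 5, 2, 0.1  # illustrative values only
    for n in (1, 10, 100):
        eps = confidence_radius(n, S, T, H, A, delta)
        print(n, eps, weissman_tail(n, S, eps), delta / (T * H * S * A))
```

For $N_k(h,s,a)\geq 1$ the exponent equals $-S\ln(THSA/\delta)$, so the tail is $(2^{S}-2)(\delta/(THSA))^{S}$, which is at most $\delta/(THSA)$ whenever $S\geq 2$ and $THSA/\delta\geq 4$.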

Lemma G.3.

For any $0<\delta<1$ and respective $\varepsilon_{k}(h,s,a)$,

\[
\sum_{t=1}^{T}\sum_{h=1}^{H-1}\varepsilon_{k(t)}(h,s^{t}_{h},a^{t}_{h})\leq 3HS\sqrt{2\ln(THSA/\delta)AT}.
\]
Proof.

We define $n_{k}(h,s,a):=\sum_{t=T_{k-1}+1}^{T_{k}}\mathbb{I}\{s^{t}_{h}=s,a^{t}_{h}=a\}$. Clearly, $N_{k}(h,s,a)=\sum_{k^{\prime}<k}n_{k^{\prime}}(h,s,a)$. The condition in line 5 of the algorithm ensures that $n_{k}(h,s,a)\leq N_{k}(h,s,a)$ for all $k,h,s,a$. Thus, we can use Lemma 19 in Jaksch et al. (2010) and Jensen's inequality,

\begin{align*}
\sum_{t=1}^{T}\sum_{h=1}^{H-1}\frac{1}{\sqrt{\max\{1,N_{k(t)}(h,s^{t}_{h},a^{t}_{h})\}}}
&=\sum_{k=1}^{K}\sum_{t=T_{k-1}+1}^{T_{k}}\sum_{h=1}^{H-1}\frac{1}{\sqrt{\max\{1,N_{k}(h,s^{t}_{h},a^{t}_{h})\}}}\\
&=\sum_{k=1}^{K}\sum_{h=1}^{H-1}\sum_{s,a}\sum_{t=T_{k-1}+1}^{T_{k}}\frac{\mathbb{I}\{s^{t}_{h}=s,a^{t}_{h}=a\}}{\sqrt{\max\{1,N_{k}(h,s,a)\}}}
=\sum_{k=1}^{K}\sum_{h=1}^{H-1}\sum_{s,a}\frac{n_{k}(h,s,a)}{\sqrt{\max\{1,N_{k}(h,s,a)\}}}\\
&\leq 3\sum_{h=1}^{H-1}\sum_{s,a}\sqrt{N_{K}(h,s,a)+n_{K}(h,s,a)}
\leq 3\sqrt{HSA\sum_{h=1}^{H-1}\sum_{s,a}(N_{K}(h,s,a)+n_{K}(h,s,a))}\\
&=3\sqrt{HSA\cdot HT}=3H\sqrt{SAT}.
\end{align*}

Now, using the definition of $\varepsilon_{k}(h,s,a)$,

\begin{align*}
\sum_{t=1}^{T}\sum_{h=1}^{H-1}\varepsilon_{k(t)}(h,s^{t}_{h},a^{t}_{h})
&=\sum_{t=1}^{T}\sum_{h=1}^{H-1}\sqrt{\frac{2S\ln(THSA/\delta)}{\max\{1,N_{k(t)}(h,s^{t}_{h},a^{t}_{h})\}}}\\
&\leq\sqrt{2S\ln(THSA/\delta)}\cdot 3H\sqrt{SAT}=3HS\sqrt{2\ln(THSA/\delta)AT}.
\end{align*}

∎
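The counting step in the proof of Lemma G.3 relies on a summation inequality for sequences whose next term never exceeds the running total. As we recall it, Lemma 19 of Jaksch et al. (2010) states that if $0\leq z_k\leq Z_{k-1}:=\max\{1,\sum_{i<k}z_i\}$, then $\sum_{k=1}^{n}z_k/\sqrt{Z_{k-1}}\leq(\sqrt{2}+1)\sqrt{Z_n}$; the proof above uses the looser constant 3. The following sketch checks this on sample sequences. The function names are hypothetical, and the exact statement of the cited lemma is an assumption here.

```python
import math
import random

def lemma19_sides(z):
    """Return (lhs, rhs) of sum_k z_k / sqrt(Z_{k-1}) <= (sqrt(2)+1) sqrt(Z_n)."""
    running_total = 0  # sum of z_1, ..., z_{k-1}
    lhs = 0.0
    for zk in z:
        z_prev = max(1, running_total)  # Z_{k-1}
        if not 0 <= zk <= z_prev:
            raise ValueError("sequence violates 0 <= z_k <= Z_{k-1}")
        lhs += zk / math.sqrt(z_prev)
        running_total += zk
    rhs = (math.sqrt(2) + 1) * math.sqrt(max(1, running_total))
    return lhs, rhs

def random_admissible_sequence(n, seed=0):
    """Draw integers z_k uniformly from {0, ..., Z_{k-1}}."""
    rng = random.Random(seed)
    z, running_total = [], 0
    for _ in range(n):
        zk = rng.randint(0, max(1, running_total))
        z.append(zk)
        running_total += zk
    return z

if __name__ == "__main__":
    print(lemma19_sides([1, 1, 2, 4, 8]))  # doubling epochs
    print(lemma19_sides(random_admissible_sequence(30, seed=3)))
```

In the proof, $z_k$ plays the role of the within-episode count $n_k(h,s,a)$ and $Z_{k-1}$ that of $\max\{1,N_k(h,s,a)\}$, applied separately for each triple $(h,s,a)$.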

Lemma G.4.

Let $(\bar{\pi}^{t})_{t=1}^{T}$ be the policy sequence of the population and $(\hat{M}^{k})_{k=1}^{K}$ the sequence of the corresponding model estimates. We abbreviate $\bar{\mu}^{t}=\bar{\mu}^{t}_{M^{*}}$ and $\hat{\mu}^{t}=\mu_{\hat{M}^{k(t)}}^{\bar{\pi}^{t}}$. With probability at least $1-2\delta$,

\[
\sum_{t=1}^{T}\left\|\hat{\mu}^{t}-\bar{\mu}^{t}\right\|_{1}\leq 12H^{2}S\sqrt{\ln(THSA/\delta)AT}.
\]
Proof.

The proof is based on Rosenberg and Mansour (2019). Let $(s^{t}_{h},a^{t}_{h})_{h=1}^{H}$ be the trajectory sampled in the $t$-th game. We define $\xi_{k}(h,s,a):=\|\mathbb{P}_{\hat{M}^{k},h}(\cdot|s,a)-\mathbb{P}^{*}_{h}(\cdot|s,a)\|_{1}$. By Lemma E.4,

\begin{align*}
\sum_{t=1}^{T}\left\|\hat{\mu}^{t}-\bar{\mu}^{t}\right\|_{1}
&\leq H\sum_{t=1}^{T}\sum_{h=1}^{H-1}\sum_{s,a}\bar{\mu}_{h}^{t}(s,a)\xi_{k(t)}(h,s,a)\\
&=H\sum_{t=1}^{T}\sum_{h=1}^{H-1}\xi_{k(t)}(h,s^{t}_{h},a^{t}_{h})\\
&\quad+H\sum_{t=1}^{T}\sum_{h=1}^{H-1}\underbrace{\left(\sum_{s,a}\bar{\mu}^{t}_{h}(s,a)\xi_{k(t)}(h,s,a)-\sum_{s,a}\mathbb{I}\{s^{t}_{h}=s,a^{t}_{h}=a\}\xi_{k(t)}(h,s,a)\right)}_{=:Y_{t}(h)},
\end{align*}

where $(Y_{t}(h))_{t}$ is a martingale difference sequence w.r.t. the trajectories sampled and with $|Y_{t}(h)|\leq\max_{s,a}\xi_{k(t)}(h,s,a)\leq 2$. In the following, we bound the first and second term above with high probability.

The first term can be bounded using Lemmas G.2 and G.3, such that we have, with probability at least $1-\delta$,

\[
H\sum_{t=1}^{T}\sum_{h=1}^{H-1}\xi_{k(t)}(h,s^{t}_{h},a^{t}_{h})\leq 2H\sum_{t=1}^{T}\sum_{h=1}^{H-1}\varepsilon_{k(t)}(h,s^{t}_{h},a^{t}_{h})\leq 2H\cdot 3H\sqrt{2S\ln(THSA/\delta)\cdot SAT}.
\]

By the Hoeffding-Azuma inequality, we have for a fixed $h$ that with probability at least $1-\delta/H$,

\[
\sum_{t=1}^{T}Y_{t}(h)\leq 2\sqrt{2T\ln(H/\delta)}.
\]

Thus, by the union bound over all $h$, the second term is at most $2H^{2}\sqrt{2T\ln(H/\delta)}$ with probability at least $1-\delta$.
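The constants in the bound on the second term can be traced as follows. This is a sketch assuming the standard one-sided Hoeffding-Azuma inequality for a martingale difference sequence with increments bounded by $c$; the symbols $c$ and $\lambda$ are introduced here only for illustration.

```latex
\begin{align*}
\Pr\Big[\sum_{t=1}^{T}Y_{t}(h)\geq\lambda\Big]
  &\leq\exp\Big(-\frac{\lambda^{2}}{2Tc^{2}}\Big),
  \qquad c=2,\\
\exp\Big(-\frac{\lambda^{2}}{8T}\Big)=\frac{\delta}{H}
  &\iff\lambda=2\sqrt{2T\ln(H/\delta)},\\
H\sum_{h=1}^{H-1}\sum_{t=1}^{T}Y_{t}(h)
  &\leq H\cdot(H-1)\cdot 2\sqrt{2T\ln(H/\delta)}
  \leq 2H^{2}\sqrt{2T\ln(H/\delta)}.
\end{align*}
```

The last line holds on the intersection of the at most $H$ per-step events, whose total failure probability is at most $H\cdot\delta/H=\delta$.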

Finally, by a union bound over the events used to bound the first and second term, we have with probability at least $1-2\delta$ that

\begin{align*}
\sum_{t=1}^{T}\left\|\hat{\mu}^{t}-\bar{\mu}^{t}\right\|_{1}
&\leq 2H^{2}\sqrt{2T\ln(H/\delta)}+6H^{2}\sqrt{2S\ln(THSA/\delta)\cdot SAT}\\
&\leq 12H^{2}S\sqrt{\ln(THSA/\delta)AT}.
\end{align*}

∎
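The final inequality in the proof of Lemma G.4 merges two terms into one constant. A worked version is sketched below; it uses only $\ln(H/\delta)\leq\ln(THSA/\delta)$ and $T\leq S^{2}AT$, which hold for $T,S,A\geq 1$.

```latex
\begin{align*}
2H^{2}\sqrt{2T\ln(H/\delta)}
  &\leq 2\sqrt{2}\,H^{2}S\sqrt{\ln(THSA/\delta)AT},\\
6H^{2}\sqrt{2S\ln(THSA/\delta)\cdot SAT}
  &=6\sqrt{2}\,H^{2}S\sqrt{\ln(THSA/\delta)AT},\\
\text{so the sum is at most}\quad
8\sqrt{2}\,H^{2}S\sqrt{\ln(THSA/\delta)AT}
  &\leq 12H^{2}S\sqrt{\ln(THSA/\delta)AT},
\end{align*}
```

since $8\sqrt{2}\approx 11.32\leq 12$.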

See 5.1

Proof.

We first establish the upper bound on the steering gap and then investigate the steering cost.

Proof for Steering Gap

We denote with $k(t)$ the episode index at the $t$-th game and denote $\pi^{*}=\operatorname*{arg\,max}_{\pi}U(\mu_{M^{*}}^{\pi})$. Furthermore, we abbreviate $\bar{\mu}^{t}=\bar{\mu}^{t}_{M^{*}}$, $\mu_{*}^{k}=\mu_{M^{*}}^{\pi_{*}^{k}}$, $\hat{\mu}^{t}=\mu_{\hat{M}^{k(t)}}^{\bar{\pi}^{t}}$ and $\hat{\mu}^{k}_{*}=\mu_{\hat{M}^{k}}^{\pi_{*}^{k}}$. Consider a fixed $t$ and $k=k(t)$. We can decompose the steering gap term of round $t$ as follows:

$$U(\mu^{\pi^{*}})-U(\bar{\mu}^{t})=\left(U(\mu^{\pi^{*}})-U(\hat{\mu}^{k}_{*})\right)+\left(U(\hat{\mu}^{k}_{*})-U(\bar{\mu}^{t})\right)$$

The first term can be bounded by 0 using the optimism of the algorithm. We use the $L_{U}$-Lipschitzness of $U$ and the triangle inequality to further decompose the second term:

$$U(\hat{\mu}^{k}_{*})-U(\bar{\mu}^{t})\leq L_{U}\|\hat{\mu}^{k}_{*}-\bar{\mu}^{t}\|_{1}\leq L_{U}\|\hat{\mu}^{k}_{*}-\hat{\mu}^{t}\|_{1}+L_{U}\|\hat{\mu}^{t}-\bar{\mu}^{t}\|_{1}.$$

Applying Lemma E.3, we get

$$\begin{aligned}
\|\hat{\mu}^{k}_{*}-\hat{\mu}^{t}\|_{1}&\leq H\sum_{h,s}\hat{\mu}_{h}^{t}(s)\,\|\pi^{k}_{*,h}(\cdot|s)-\bar{\pi}^{t}_{h}(\cdot|s)\|_{1}\\
&\leq H\sum_{h,s}\bar{\mu}_{h}^{t}(s)\cdot\|\pi^{k}_{*,h}(\cdot|s)-\bar{\pi}^{t}_{h}(\cdot|s)\|_{1}+H\underbrace{\sum_{h,s}|\hat{\mu}_{h}^{t}(s)-\bar{\mu}_{h}^{t}(s)|\cdot\|\pi^{k}_{*,h}(\cdot|s)-\bar{\pi}^{t}_{h}(\cdot|s)\|_{1}}_{(*)},
\end{aligned}$$

where, since the $\ell_1$-distance between two probability distributions is at most 2, the second term can be bounded by

$$(*)\leq 2\sum_{h,s}|\hat{\mu}_{h}^{t}(s)-\bar{\mu}_{h}^{t}(s)|\leq 2\sum_{h,s}\left|\sum_{a}\hat{\mu}_{h}^{t}(s,a)-\sum_{a}\bar{\mu}_{h}^{t}(s,a)\right|\leq 2\|\hat{\mu}^{t}-\bar{\mu}^{t}\|_{1}.$$

Putting everything together, we arrive at

$$U(\mu^{\pi^{*}})-U(\bar{\mu}^{t})\leq L_{U}H\sum_{h,s}\bar{\mu}^{t}_{h}(s)\,\|\pi^{k}_{*,h}(\cdot|s)-\bar{\pi}^{t}_{h}(\cdot|s)\|_{1}+L_{U}(2H+1)\|\hat{\mu}^{t}-\bar{\mu}^{t}\|_{1}.$$

Summing over $t$ gives

$$\sum_{t=1}^{T}U(\mu^{\pi^{*}})-U(\bar{\mu}^{t})\leq L_{U}H\underbrace{\sum_{t=1}^{T}\sum_{h,s}\bar{\mu}^{t}_{h}(s)\,\|\pi^{k(t)}_{*,h}(\cdot|s)-\bar{\pi}^{t}_{h}(\cdot|s)\|_{1}}_{\Delta_{\text{pop}}}+L_{U}(2H+1)\underbrace{\sum_{t=1}^{T}\|\hat{\mu}^{t}-\bar{\mu}^{t}\|_{1}}_{\Delta_{\text{est}}}.$$

Using Lemma G.4, the estimation error term $\Delta_{\text{est}}$ can be bounded by $12H^{2}S\sqrt{\ln(THSA/\delta)AT}$ with probability at least $1-2\delta$.

To bound the population convergence term $\Delta_{\text{pop}}$, we can use Lemma G.1:

$$\sum_{t=1}^{T}\sum_{h,s}\bar{\mu}^{t}_{h}(s)\,\|\pi^{k(t)}_{*,h}(\cdot|s)-\bar{\pi}^{t}_{h}(\cdot|s)\|_{1}\leq\sqrt{HSAT\underbrace{\sum_{t=1}^{T}\langle R_{\pi_{*}^{k(t)}}(\bar{\mu}^{t}),\mu_{*}^{k(t)}-\bar{\mu}^{t}\rangle}_{\texttt{AgentReg}}}$$

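Furthermore, AgentReg can be rewritten in terms of the steering reward $R_{\text{z}}^{t}$ and split by episode. Since $R_{\text{z}}^{t}(\bar{\mu}^{t})$ differs from $R_{\pi_{*}^{k(t)}}(\bar{\mu}^{t})$ only by the constant shift $\|R_{\pi_{*}^{k(t)}}(\bar{\mu}^{t})\|_{\infty}\bm{1}$, and $\langle\bm{1},\mu\rangle=H$ for every occupancy measure $\mu$, the shift cancels in the inner product with $\mu_{*}^{k(t)}-\bar{\mu}^{t}$. Hence AgentReg equals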

$$\sum_{t=1}^{T}\langle R_{\text{z}}^{t}(\bar{\mu}^{t}),\mu_{*}^{k(t)}-\bar{\mu}^{t}\rangle=\sum_{k=1}^{K}\underbrace{\sum_{t=T_{k-1}+1}^{T_{k}}\langle R_{\text{z}}^{t}(\bar{\mu}^{t}),\mu_{*}^{k}-\bar{\mu}^{t}\rangle}_{\leq\operatorname{AdaReg}(T)}\leq K\cdot\operatorname{AdaReg}(T).$$

Finally, to bound the number of episodes $K$, note that $K$ is also the number of times the condition in line 5 of the algorithm has been true. For each $(h,s,a)$, this condition can be true at most $\log_{2}T$ times. Thus, $K\leq HSA\log_{2}T$.
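The counting argument for $K$ can be illustrated with a small simulation. This is only a sketch under stated assumptions: the algorithm listing is not reproduced in this section, so the code assumes that the condition in line 5 is a visit-count doubling test, that each game contributes one visit per step $h$, and that visits are drawn uniformly at random. The function name `count_episode_switches` and the sampling scheme are hypothetical.

```python
import math
import random


def count_episode_switches(T, H, S, A, seed=0):
    """Count how often an assumed doubling condition starts a new episode."""
    rng = random.Random(seed)
    counts = {}        # visits of each (h, s, a) so far
    start_counts = {}  # visits of each (h, s, a) when the current episode began
    switches = 0
    for _ in range(T):
        triggered = False
        for h in range(H):
            # Assumption: one uniformly random (s, a) visit per step h and game.
            key = (h, rng.randrange(S), rng.randrange(A))
            counts[key] = counts.get(key, 0) + 1
            # Assumed line-5 test: the count doubled since the episode began.
            if counts[key] >= 2 * max(1, start_counts.get(key, 0)):
                triggered = True
        if triggered:
            switches += 1
            start_counts = dict(counts)
    return switches


T, H, S, A = 1000, 2, 3, 2
switches = count_episode_switches(T, H, S, A)
print(switches, "<=", H * S * A * math.log2(T))
```

Under this assumed condition, every switch is caused by some triple whose count has at least doubled, and no count can exceed $T$, which is why the number of switches stays below $HSA\log_{2}T$.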

Proof for Steering Costs

Note that for any reward function $R$,

$$\langle R(\mu)+\|R(\mu)\|_{\infty}\bm{1},\mu\rangle=H\|R(\mu)\|_{\infty}+\langle R(\mu),\mu\rangle\leq 2H\|R(\mu)\|_{\infty}.$$
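The inequality above can be checked numerically. The following sketch assumes that $\mu$ consists of $H$ layers, each a probability vector over state-action pairs (so that $\langle\bm{1},\mu\rangle=H$), and treats $R(\mu)$ as a fixed array of the same shape; the helper names `random_occupancy` and `shifted_payment` are hypothetical.

```python
import random


def random_occupancy(H, n, rng):
    """Hypothetical helper: H layers, each a probability vector over n (s, a) pairs."""
    mu = []
    for _ in range(H):
        weights = [rng.random() for _ in range(n)]
        total = sum(weights)
        mu.append([w / total for w in weights])
    return mu


def shifted_payment(R, mu):
    """Return (<R + ||R||_inf * 1, mu>, ||R||_inf) for layered R and mu."""
    r_inf = max(abs(r) for layer in R for r in layer)
    payment = sum((r + r_inf) * m
                  for r_layer, m_layer in zip(R, mu)
                  for r, m in zip(r_layer, m_layer))
    return payment, r_inf


rng = random.Random(0)
H, n = 4, 6
R = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(H)]
mu = random_occupancy(H, n, rng)
payment, r_inf = shifted_payment(R, mu)
# The shifted reward is nonnegative, so the payment lies in [0, 2 * H * ||R||_inf].
print(0.0 <= payment <= 2 * H * r_inf)
```

The shift by $\|R(\mu)\|_{\infty}\bm{1}$ makes every entry of the reward nonnegative, which is why the payment is also bounded below by zero.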

Let $\pi^{*}=\pi_{*}^{k}$ for some $k$. Recall that $R_{\pi^{*}}(\mu)=-((W^{\pi^{*}}-I)\mu)^{\top}(W^{\pi^{*}}-I)$. By looking at the definition of $W^{\pi^{*}}$ in (4), we see that

$$\left\|(W^{\pi^{*}}-I)^{\top}\right\|_{\infty}=\max_{h,s,a}\sum_{a^{\prime}\neq a}|\pi_{h}(a^{\prime}|s)|+|\pi_{h}(a|s)-1|\leq 2,$$

where the $\|\cdot\|_{\infty}$-matrix norm is defined as $\|M\|_{\infty}=\max_{i}\sum_{j}|M_{ij}|$. Using this, we can bound

$$\begin{aligned}
\left\|R_{\pi^{*}}(\mu)\right\|_{\infty}&=\left\|((W^{\pi^{*}}-I)\mu)^{\top}(W^{\pi^{*}}-I)\right\|_{\infty}=\left\|(W^{\pi^{*}}-I)^{\top}(W^{\pi^{*}}-I)\mu\right\|_{\infty}\\
&\leq\left\|(W^{\pi^{*}}-I)^{\top}\right\|_{\infty}\cdot\left\|(W^{\pi^{*}}-I)\mu\right\|_{\infty}\leq 2\left\|(W^{\pi^{*}}-I)\mu\right\|_{2}.
\end{aligned}$$
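The two norm facts used in this bound can be illustrated on a single $(h,s)$ block. Definition (4) is not reproduced in this section, so the block structure below is an assumption read off the row sum displayed above: row $a$ has entries $\pi(a'|s)$ for $a'\neq a$ and $\pi(a|s)-1$ on the diagonal. The helper names `policy_block`, `inf_norm` and `matvec` are hypothetical.

```python
import math


def policy_block(pi):
    """Assumed block: entry [a][b] = pi[b] - 1 if a == b, else pi[b]."""
    n = len(pi)
    return [[pi[b] - (1.0 if a == b else 0.0) for b in range(n)] for a in range(n)]


def inf_norm(M):
    """Matrix infinity norm: maximum absolute row sum."""
    return max(sum(abs(x) for x in row) for row in M)


def matvec(M, v):
    """Plain matrix-vector product."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


pi = [0.5, 0.3, 0.2]   # an arbitrary policy at one (h, s)
B = policy_block(pi)
v = [0.4, -1.0, 0.7]   # an arbitrary vector standing in for (W - I) mu

lhs = max(abs(x) for x in matvec(B, v))        # ||B v||_inf
mid = inf_norm(B) * max(abs(x) for x in v)     # ||B||_inf * ||v||_inf
rhs = 2 * math.sqrt(sum(x * x for x in v))     # 2 * ||v||_2
print(inf_norm(B) <= 2, lhs <= mid <= rhs)
```

Each row sum equals $2(1-\pi(a|s))$, so the norm is at most 2, and the last step of the chain uses $\|v\|_{\infty}\leq\|v\|_{2}$.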

Finally, using Jensen's inequality and the fact that the agent regret is bounded by $K\operatorname{AdaReg}(T)$, our steering cost can be bounded by

$$\begin{aligned}
\sum_{t=1}^{T}\left\langle R_{\text{z}}^{t}(\bar{\mu}^{t}),\bar{\mu}^{t}\right\rangle&=\sum_{t=1}^{T}\left\langle R_{\pi_{*}^{k(t)}}(\bar{\mu}^{t})+\|R_{\pi_{*}^{k(t)}}(\bar{\mu}^{t})\|_{\infty}\bm{1},\bar{\mu}^{t}\right\rangle\\
&\leq 4H\sum_{t=1}^{T}\left\|(W^{\pi^{k(t)}_{*}}-I)\bar{\mu}^{t}\right\|_{2}\leq 4H\sqrt{T\sum_{t=1}^{T}\left\|(W^{\pi^{k(t)}_{*}}-I)\bar{\mu}^{t}\right\|_{2}^{2}}\\
&\leq 4H\sqrt{T\sum_{t=1}^{T}\langle R_{\text{z}}^{t}(\bar{\mu}^{t}),\mu_{*}^{k(t)}-\bar{\mu}^{t}\rangle}\leq 4H\sqrt{TK\operatorname{AdaReg}(T)}.
\end{aligned}$$

∎
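The Jensen step in the steering-cost bound, $\sum_{t}x_{t}\leq\sqrt{T\sum_{t}x_{t}^{2}}$ for nonnegative $x_{t}$, can be checked with a short sketch. The sequence below is arbitrary illustrative data, not taken from the paper, and `jensen_sides` is a hypothetical helper name.

```python
import math
import random


def jensen_sides(xs):
    """Return (sum of x_t, sqrt(T * sum of x_t^2)) for a nonnegative sequence."""
    T = len(xs)
    return sum(xs), math.sqrt(T * sum(x * x for x in xs))


rng = random.Random(0)
xs = [rng.random() / math.sqrt(t) for t in range(1, 1001)]  # a decaying sequence
lhs, rhs = jensen_sides(xs)
print(lhs <= rhs)

# Equality holds for a constant sequence.
const_lhs, const_rhs = jensen_sides([0.25] * 100)
print(abs(const_lhs - const_rhs) < 1e-9)
```

This is why a sublinear bound on the sum of squared terms is enough for a sublinear total: if $\sum_{t}x_{t}^{2}=o(T)$, then $\sqrt{T\sum_{t}x_{t}^{2}}=o(T)$.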

Appendix H ELUDER DIMENSION

H.1 Example Function Classes

Here, we list bounds on the eluder dimension for several commonly considered function classes. In all these cases, the eluder dimension is bounded logarithmically in $T$ if $\varepsilon=T^{-1}$.

Proposition H.1 (Linear functions, Russo and Van Roy (2013)).

Let $\mathcal{F}=\{f\mid f(x)=\theta^{\top}\phi(x),\ \theta\in\mathbb{R}^{d},\ \|\theta\|_{2}\leq C_{\theta},\ \|\phi(x)\|_{2}\leq C_{\phi}\}$.

$$\dim_{E}(\mathcal{F},\varepsilon)\leq 3d\frac{e}{e-1}\ln\left(3+3\left(\frac{2C_{\theta}}{\varepsilon}\right)^{2}\right)+1.$$

Proposition H.2 (Quadratic functions, Osband and Roy (2014)).

Let $\mathcal{F}=\{f\mid f(x)=\phi(x)^{\top}\theta\phi(x),\ \theta\in\mathbb{R}^{p\times p},\ \phi\in\mathbb{R}^{p},\ \|\theta\|_{2}\leq C_{\theta},\ \|\phi\|_{2}\leq C_{\phi}\}$.

dimE(โ„ฑ,ฮต)โ‰คpโข(4โขpโˆ’1)โขeeโˆ’1โขlogโก((1+(2โขpโขCฯ•2โขCฮธฮต)2)โข(4โขpโˆ’1))+1.subscriptdimension๐ธโ„ฑ๐œ€๐‘4๐‘1๐‘’๐‘’11superscript2๐‘superscriptsubscript๐ถitalic-ฯ•2subscript๐ถ๐œƒ๐œ€24๐‘11\displaystyle\dim_{E}(\mathcal{F},\varepsilon)\leq p(4p-1)\frac{e}{e-1}\log% \left(\left(1+\left(\frac{2pC_{\phi}^{2}C_{\theta}}{\varepsilon}\right)^{2}% \right)(4p-1)\right)+1.roman_dim start_POSTSUBSCRIPT italic_E end_POSTSUBSCRIPT ( caligraphic_F , italic_ฮต ) โ‰ค italic_p ( 4 italic_p - 1 ) divide start_ARG italic_e end_ARG start_ARG italic_e - 1 end_ARG roman_log ( ( 1 + ( divide start_ARG 2 italic_p italic_C start_POSTSUBSCRIPT italic_ฯ• end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_C start_POSTSUBSCRIPT italic_ฮธ end_POSTSUBSCRIPT end_ARG start_ARG italic_ฮต end_ARG ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) ( 4 italic_p - 1 ) ) + 1 .
Proposition H.3 (Generalized linear functions, Russo and Van Roy (2013)).

Let $g$ be strictly increasing, differentiable and have derivatives bounded in $[\underline{h},\overline{h}]$ with $\overline{h}>\underline{h}>0$. Let $r=\overline{h}/\underline{h}$ and $\mathcal{F}=\{f \mid f(x)=g(\theta^{\top}\phi(x)),\ \theta\in\mathbb{R}^{d},\ \|\theta\|_{2}\leq C_{\theta},\ \|\phi\|_{2}\leq C_{\phi}\}$. Then

\[
\dim_{E}(\mathcal{F},\varepsilon)\leq 3dr^{2}\frac{e}{e-1}\log\left(3r^{2}+3r^{2}\left(\frac{2C_{\theta}\overline{h}}{\varepsilon}\right)^{2}\right)+1.
\]
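To get a feel for the magnitudes involved, the two bounds above can be evaluated numerically. The following Python sketch only transcribes the right-hand sides of the quadratic and generalized linear bounds; the function and argument names are ours and purely illustrative, and no tightness is claimed.

```python
import math

# Common factor e / (e - 1) that appears in both bounds.
E_RATIO = math.e / (math.e - 1.0)


def eluder_bound_quadratic(p, c_phi, c_theta, eps):
    """Right-hand side of the bound for f(x) = phi(x)^T theta phi(x)."""
    inner = (1.0 + (2.0 * p * c_phi ** 2 * c_theta / eps) ** 2) * (4 * p - 1)
    return p * (4 * p - 1) * E_RATIO * math.log(inner) + 1.0


def eluder_bound_glm(d, h_lo, h_hi, c_theta, eps):
    """Right-hand side of the bound for f(x) = g(theta^T phi(x)).

    h_lo and h_hi are the lower and upper bounds on the derivative of g.
    """
    r = h_hi / h_lo
    inner = 3.0 * r ** 2 + 3.0 * r ** 2 * (2.0 * c_theta * h_hi / eps) ** 2
    return 3.0 * d * r ** 2 * E_RATIO * math.log(inner) + 1.0
```

Both expressions depend on $\varepsilon$ only through a logarithm, so shrinking $\varepsilon$ by a constant factor adds a term of order $p^{2}$ (respectively $dr^{2}$) to the bound.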
Remark H.3 (Bounding $\beta_{T}$).

If we assume that the functions in $\mathcal{R}$ are parametrized by parameters in some set $\Theta\subset\mathbb{R}^{d}$ with constant diameter and that the functions are $L$-Lipschitz in that parameter, we have $N(\mathcal{R},\alpha,\|\cdot\|_{\infty})\leq N(\Theta,\alpha/L,\|\cdot\|_{\infty})\leq\left(1+\mathcal{O}(L/\alpha)\right)^{d}$. Then, we might choose $\alpha=T^{-1}$ such that

\[
\beta_{T}=8\sigma^{2}\log\left(N(\mathcal{R},\alpha,\|\cdot\|_{\infty})/\delta\right)+2\alpha T\left(8r_{\max}+\sqrt{8\sigma^{2}\ln(4T^{2}/\delta)}\right)
\]

can also be bounded logarithmically in $T$.
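The logarithmic growth can be checked numerically. The sketch below plugs $\alpha=T^{-1}$ and the covering-number bound $(1+cL/\alpha)^{d}$ into the expression for $\beta_{T}$; the constant `c` stands in for the unspecified constant hidden in the $\mathcal{O}(\cdot)$ and is an assumption of ours, as are all function and argument names.

```python
import math


def beta_T_upper(T, d, L, sigma, r_max, delta, c=1.0):
    """Upper bound on beta_T for alpha = 1/T.

    Uses log N(R, alpha) <= d * log(1 + c * L / alpha), where c is a
    hypothetical placeholder for the constant inside the O(.) term.
    """
    alpha = 1.0 / T
    log_cover = d * math.log1p(c * L / alpha)
    first = 8.0 * sigma ** 2 * (log_cover + math.log(1.0 / delta))
    second = 2.0 * alpha * T * (
        8.0 * r_max + math.sqrt(8.0 * sigma ** 2 * math.log(4.0 * T ** 2 / delta))
    )
    return first + second
```

With $\alpha T=1$ the second term grows only like $\sqrt{\log T}$, so the first term, of order $d\log(LT)$, dominates; increasing $T$ by a factor of $1000$ should therefore far less than double the value.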

H.2 Bounding The Width Of The Confidence Set

Notations and Definitions

Here, we introduce some notation used in this section. We define the width function $w_{\mathcal{F}}(x)=\sup_{\underline{f},\overline{f}\in\mathcal{F}}|\underline{f}(x)-\overline{f}(x)|$. Throughout this section, we use the notation $x_{H_{t}+h}$ with $H_{t}=(t-1)H$ to describe elements of a sequence $x_{1},\dots,x_{HT}$. The idea is that we can later define $x_{H_{t}+h}=(h,s_{h}^{t},a_{h}^{t},\bar{\mu}^{t}_{M^{*},h})$ and apply the results of this section to our setting. Furthermore, for any function $g$ we write $\|g\|_{2,E_{t}}^{2}=\sum_{i=1}^{t-1}\sum_{h=1}^{H}g^{2}(x_{H_{i}+h})$.
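As a small illustration of this indexing, the following Python sketch maps an episode and step pair $(t,h)$ to its position in the flattened sequence and evaluates the squared empirical norm. We read the norm as a sum over all steps of episodes $1,\dots,t-1$, as it is used in the proof of Lemma H.4; the helper names and the 0-based list storage are our own choices.

```python
def flat_index(t, h, H):
    """1-based position H_t + h of step h in episode t, with H_t = (t - 1) * H."""
    return (t - 1) * H + h


def sq_norm_E(g, xs, t, H):
    """Squared norm ||g||_{2,E_t}^2 over the first (t - 1) * H elements.

    xs is a Python list holding x_1, ..., x_{HT} (so x_tau is xs[tau - 1]).
    """
    return sum(g(x) ** 2 for x in xs[: (t - 1) * H])
```

For $t=1$ the sum is empty, so the norm is zero and the confidence set $\mathcal{F}_{1}$ is all of $\mathcal{F}$.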

Lemma H.4 (Proposition 3 of Russo and Van Roy (2013)).

If $(\beta_{t})_{t\in\mathbb{N}}$ is a positive non-decreasing sequence, $(\hat{f}_{t})_{t}$ some function sequence and $\mathcal{F}_{t}:=\{f\in\mathcal{F}:\|f-\hat{f}_{t}\|_{2,E_{t}}\leq\sqrt{\beta_{t}}\}$, then with probability 1,

\[
\sum_{t=1}^{T}\sum_{h=1}^{H}\mathbb{I}\{w_{\mathcal{F}_{t}}(x_{H_{t}+h})>\varepsilon\}\leq\left(\frac{4\beta_{T}}{\varepsilon^{2}}+H\right)\dim_{E}(\mathcal{F},\varepsilon)
\]

for all $T\in\mathbb{N}$ and $\varepsilon>0$.

Proof.

First we show that for any $\tau=H_{t}+h<TH$, if $w_{\mathcal{F}_{t}}(x_{\tau})>\varepsilon$ then $x_{\tau}$ is $(\mathcal{F},\varepsilon)$-dependent on fewer than $4\beta_{T}/\varepsilon^{2}$ disjoint subsequences of $(x_{1},\dots,x_{H_{t}})$. Suppose $w_{\mathcal{F}_{t}}(x_{\tau})>\varepsilon$. Then, there are $f,\tilde{f}\in\mathcal{F}_{t}$ such that $|f(x_{\tau})-\tilde{f}(x_{\tau})|>\varepsilon$. Furthermore, let $(x_{i_{1}},\dots,x_{i_{k}})$ be a subsequence of $(x_{1},\dots,x_{H_{t}})$ on which $x_{\tau}$ is $(\mathcal{F},\varepsilon)$-dependent. This implies, by definition, that $\sum_{j=1}^{k}(f(x_{i_{j}})-\tilde{f}(x_{i_{j}}))^{2}>\varepsilon^{2}$. If $x_{\tau}$ is $(\mathcal{F},\varepsilon)$-dependent on $K$ disjoint subsequences of $(x_{1},\dots,x_{H_{t}})$ then we must have

\[
\|f-\tilde{f}\|_{2,E_{t}}^{2}=\sum_{i=1}^{t-1}\sum_{h=1}^{H}\left(f(x_{H_{i}+h})-\tilde{f}(x_{H_{i}+h})\right)^{2}\geq\sum_{l=1}^{K}\sum_{j=1}^{k_{l}}\left(f(x_{i^{l}_{j}})-\tilde{f}(x_{i^{l}_{j}})\right)^{2}>K\varepsilon^{2}.
\]

By the triangle inequality, $\|f-\tilde{f}\|_{2,E_{t}}\leq\|f-\hat{f}_{t}\|_{2,E_{t}}+\|\tilde{f}-\hat{f}_{t}\|_{2,E_{t}}\leq 2\sqrt{\beta_{t}}\leq 2\sqrt{\beta_{T}}$. Combining these two inequalities, we get $K<4\beta_{T}/\varepsilon^{2}$.

Next, we show that in any sequence $(y_{1},\dots,y_{l})$ there is an element $y_{j}$ which is $(\mathcal{F},\varepsilon)$-dependent on at least $l/d-1$ disjoint subsequences of $(y_{1},\dots,y_{j-1})$, where $d=\dim_{E}(\mathcal{F},\varepsilon)$. Let $K$ be an integer with $Kd+1\leq l\leq Kd+d$. We will construct $K$ disjoint subsequences $B_{1},\dots,B_{K}$. First, $B_{i}=(y_{i})$ for all $i\in[K]$. If $y_{K+1}$ is already $(\mathcal{F},\varepsilon)$-dependent on each of $B_{1},\dots,B_{K}$, we are done. Otherwise, select a $B_{i}$ of which $y_{K+1}$ is $(\mathcal{F},\varepsilon)$-independent and append $y_{K+1}$ to $B_{i}$. We repeat this for $y_{K+2},y_{K+3},\dots$ until we find $y_{j}$ that is $(\mathcal{F},\varepsilon)$-dependent on each subsequence or until we have reached $y_{l}$. In the latter case, each element of a subsequence $B_{i}$ is independent of its predecessors, so $|B_{i}|\leq d$; since the $l-1\geq Kd$ elements $y_{1},\dots,y_{l-1}$ have all been placed, in fact $|B_{i}|=d$ for every $i$. Then, $y_{l}$ must be $(\mathcal{F},\varepsilon)$-dependent on each subsequence, by definition of the eluder dimension. In both cases we find an element in $(y_{1},\dots,y_{l})$ that is $(\mathcal{F},\varepsilon)$-dependent on $K\geq l/d-1$ disjoint subsequences.

Finally, let $(y_{1},\dots,y_{l})=(x_{i_{1}},\dots,x_{i_{l}})$ be the subsequence of $(x_{1},\dots,x_{TH})$ consisting of all elements $x_{H_{t}+h}$ for which $w_{\mathcal{F}_{t}}(x_{H_{t}+h})>\varepsilon$. From before, we know there is some $y_{j}$ that is $(\mathcal{F},\varepsilon)$-dependent on at least $l/d-1$ disjoint subsequences of $(y_{1},\dots,y_{j-1})$. Let $t,h$ be such that $y_{j}=x_{H_{t}+h}$. Note that in $(y_{1},\dots,y_{j-1})$ there are at most $H-1$ elements $y_{i}=x_{H_{t}+h'}$ for some $h'<h$. It follows that $y_{j}=x_{H_{t}+h}$ is $(\mathcal{F},\varepsilon)$-dependent on at least $l/d-1-(H-1)=l/d-H$ disjoint subsequences of $(y_{1},\dots,y_{j-H})\subseteq(x_{1},\dots,x_{H_{t}})$. Now, as we have also shown, $x_{H_{t}+h}$ is $(\mathcal{F},\varepsilon)$-dependent on fewer than $4\beta_{T}/\varepsilon^{2}$ disjoint subsequences of $(x_{1},\dots,x_{H_{t}})$. Combining these two bounds, we get $l/d-H\leq 4\beta_{T}/\varepsilon^{2}$, and therefore $l\leq(4\beta_{T}/\varepsilon^{2}+H)d$. ∎
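The greedy bucket construction from the second step of the proof can be made concrete on a toy example. The sketch below assumes a finite function class whose members are stored as tuples of values over a finite domain $\{0,\dots,n-1\}$, so that $(\mathcal{F},\varepsilon)$-dependence can be checked by enumerating pairs of functions. This brute-force check and all names are ours; it is meant only to illustrate the argument, not as a practical way to compute eluder dimensions.

```python
import itertools
import math


def is_dependent(F, x, seq, eps):
    """True if x is (F, eps)-dependent on seq.

    That is, every pair f, g in F that is eps-close on seq (in l2 norm)
    also satisfies |f(x) - g(x)| <= eps. Functions are tuples of values.
    """
    for f, g in itertools.combinations(F, 2):
        dist_on_seq = math.sqrt(sum((f[s] - g[s]) ** 2 for s in seq))
        if dist_on_seq <= eps and abs(f[x] - g[x]) > eps:
            return False
    return True


def find_dependent_element(F, ys, K, eps):
    """Greedy construction of K disjoint subsequences (buckets).

    Returns (j, buckets), where j is the 0-based index of the first element
    that is dependent on every bucket, or (None, buckets) if none is found.
    """
    buckets = [[y] for y in ys[:K]]
    for j in range(K, len(ys)):
        candidates = [B for B in buckets if not is_dependent(F, ys[j], B, eps)]
        if not candidates:
            return j, buckets
        # ys[j] is independent of this bucket, so appending keeps every
        # bucket element independent of its predecessors.
        candidates[0].append(ys[j])
    return None, buckets


if __name__ == "__main__":
    # Toy class: the three indicator functions on a three-point domain.
    F = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
    print(find_dependent_element(F, [0, 1, 2, 0, 1], K=2, eps=0.5))
```

In this toy run the first four elements are distributed over the two buckets, and the fifth element is dependent on both, in line with the case $l=Kd+1$ of the argument (for this class, sequences of mutually independent elements have length at most two at $\varepsilon=0.5$).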

Lemma H.5 (Variant of Lemma 2 in Russo and Van Roy (2013)).

Let $(\beta_{t})_{t\in\mathbb{N}}$ be a positive non-decreasing sequence, $(\hat{f}_{t})_{t}$ some function sequence and $\mathcal{F}_{t}:=\{f\in\mathcal{F}:\|f-\hat{f}_{t}\|_{2,E_{t}}\leq\sqrt{\beta_{t}}\}$. Let $w_{\mathcal{F}}(x)\leq C$ for all $x$. Then, for all $T\in\mathbb{N}$ and $\varepsilon>0$,

\[
\sum_{t=1}^{T}\sum_{h=1}^{H}w_{\mathcal{F}_{t}}(x_{H_{t}+h})\leq\varepsilon HT+CH\dim_{E}(\mathcal{F},\varepsilon)+4\sqrt{\beta_{T}H\dim_{E}(\mathcal{F},\varepsilon)T}.
\]
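To see how the three terms of this bound trade off, it can help to evaluate the right-hand side directly. The Python sketch below is a plain transcription of the bound with $\beta_{T}$ and $\dim_{E}(\mathcal{F},\varepsilon)$ passed in as numbers; the function and argument names are illustrative assumptions of ours.

```python
import math


def width_sum_bound(T, H, C, beta_T, dim_E, eps):
    """Right-hand side of the bound on the cumulative confidence-set width."""
    return (
        eps * H * T
        + C * H * dim_E
        + 4.0 * math.sqrt(beta_T * H * dim_E * T)
    )
```

For instance, with $\varepsilon=1/T$ the first term equals $H$, so for fixed $\beta_{T}$ and eluder dimension the bound grows like $\sqrt{T}$ and the average width per step vanishes as $T$ grows.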
Proof.

We abbreviate $w_{H_{t}+h}=w_{\mathcal{F}_{t}}(x_{H_{t}+h})$ and $d=\dim_{E}(\mathcal{F},\varepsilon)$. Let $w_{i_{1}}\geq\dots\geq w_{i_{HT}}$ be the widths sorted in non-increasing order. With this ordering, $w_{i_{k}}>\varepsilon$ implies that $\sum_{j=1}^{HT}\mathbb{I}\{w_{j}>\varepsilon\}\geq k$. By Lemma H.4, this means $k\leq(4\beta_{T}/\varepsilon^{2}+H)d$ or, equivalently for $k>Hd$, $\varepsilon\leq\sqrt{4\beta_{T}d/(k-Hd)}$. Now, since $w_{i_{k}}>\varepsilon$ implies $\varepsilon\leq\sqrt{4\beta_{T}d/(k-Hd)}$, this means that $w_{i_{k}}\leq\sqrt{4\beta_{T}d/(k-Hd)}$.

In the following, we bound the first and largest widths wi1,โ€ฆ,wiHโขdsubscript๐‘คsubscript๐‘–1โ€ฆsubscript๐‘คsubscript๐‘–๐ป๐‘‘w_{i_{1}},...,w_{i_{Hd}}italic_w start_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT , โ€ฆ , italic_w start_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT italic_H italic_d end_POSTSUBSCRIPT end_POSTSUBSCRIPT by C๐ถCitalic_C and the remaining widths (larger than ฮต๐œ€\varepsilonitalic_ฮต) by the previously established bound.

$$
\begin{aligned}
\sum_{t=1}^{T}\sum_{h=1}^{H} w_{H_t+h}
&= \sum_{k=1}^{HT}\mathbb{I}\{w_k \leq \varepsilon\}\, w_k + \sum_{k=1}^{HT}\mathbb{I}\{w_k > \varepsilon\}\, w_k
\leq \varepsilon HT + \sum_{k=1}^{HT}\mathbb{I}\{w_k > \varepsilon\}\, w_k \\
&\leq \varepsilon HT + HdC + \sum_{k=Hd+1}^{HT}\mathbb{I}\{w_{i_k} > \varepsilon\}\, w_{i_k} \\
&\leq \varepsilon HT + HdC + \sum_{k=Hd+1}^{HT}\sqrt{4\beta_T d/(k-Hd)} \\
&\leq \varepsilon HT + HdC + \sqrt{4d\beta_T}\int_{0}^{HT}\frac{1}{\sqrt{x}}\,dx \\
&= \varepsilon HT + HdC + 4\sqrt{d\beta_T HT}
\end{aligned}
$$

∎
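The final steps of the proof above replace the tail sum $\sum_{k=Hd+1}^{HT}\sqrt{4\beta_T d/(k-Hd)}$ by an integral and arrive at $4\sqrt{d\beta_T HT}$. The following Python snippet is a small numerical sanity check of that step. The values chosen for $H$, $T$, $d$ and $\beta_T$ are arbitrary illustrations and do not come from the paper, and the function names are hypothetical.

```python
import math

def tail_width_sum(H, T, d, beta_T):
    """Sum of sqrt(4*beta_T*d/(k - H*d)) for k = H*d+1, ..., H*T."""
    return sum(math.sqrt(4.0 * beta_T * d / (k - H * d))
               for k in range(H * d + 1, H * T + 1))

def closed_form_bound(H, T, d, beta_T):
    """The bound 4*sqrt(d*beta_T*H*T) from the last line of the display."""
    return 4.0 * math.sqrt(d * beta_T * H * T)

# Illustrative values only (assumptions, not taken from the paper).
H, T, d, beta_T = 5, 200, 3, 2.0
lhs = tail_width_sum(H, T, d, beta_T)
rhs = closed_form_bound(H, T, d, beta_T)
print(f"tail sum = {lhs:.2f}, closed-form bound = {rhs:.2f}, holds: {lhs <= rhs}")
```

The check relies on $\sum_{j=1}^{n} j^{-1/2} \leq 2\sqrt{n}$, which is the sum-to-integral comparison used in the display.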

Appendix I PROOF OF THEOREM 6.1

I.1 Algorithm Details

We present our full algorithm for the unknown reward setting in Alg. 4.

Algorithm 4 Steering reward design for Scenario 2
1: Initialize $\mathcal{P}^1 :=$ set of all possible transition functions, $\pi_*^1$ (arbitrarily), $k = 1$, $T_0 = 0$.
2: for $t = 1, \dots, T$ do
3:     Update $\hat{\mathcal{R}}^t$ as in (10).
4:     Choose $R_{\text{nz}}^t$ as in (6).
5:     Agents play $t$-th game with $r^* + R_{\text{nz}}^t$.
6:     Obtain trajectory $((s_h^t, a_h^t, r_h^t))_{h=1}^{H}$.
7:     if $\exists (h,s,a)$ s.t. $n_k(h,s,a) \geq N_k(h,s,a)$ then
8:         Update $\mathcal{P}^{k+1}$ as in (5).
9:         $T_k \leftarrow t$; $k \leftarrow k+1$.
10:        $\pi_*^k, \hat{M}^k \leftarrow \operatorname*{arg\,max}_{\pi \in \Pi,\, \hat{M} : \mathbb{P}_{\hat{M}} \in \mathcal{P}^k} U(\mu^{\pi}_{\hat{M}})$.
11:    end if
12: end for
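To make the control flow of Alg. 4 concrete, here is a minimal Python skeleton. It is only a sketch: every callable passed in (`update_reward_set`, `choose_steering_reward`, `play_game`, `update_transition_set`, `plan`) is a hypothetical stand-in for the constructions in (10), (6), (5) and the arg max step, none of which are implemented here. The interpretation of $n_k$ as within-epoch visit counts and $N_k$ as counts accumulated before epoch $k$ (with a floor of 1) is likewise an assumption made for illustration.

```python
from collections import defaultdict

def run_steering(T, update_reward_set, choose_steering_reward,
                 play_game, update_transition_set, plan):
    """Control-flow skeleton of Alg. 4; all callables are hypothetical stand-ins.

    update_reward_set(history)        -> reward confidence set, as in (10)
    choose_steering_reward(R_hat, pi) -> steering reward, as in (6)
    play_game(R_nz)                   -> list of (h, s, a, r) for one episode
    update_transition_set(history)    -> transition confidence set, as in (5)
    plan(P)                           -> (target policy, optimistic model)
    """
    history = []
    P = None                       # placeholder for "all transition functions"
    pi_star = None                 # arbitrary initial target policy
    k, epoch_ends = 1, [0]         # epoch index k and T_0 = 0
    N = defaultdict(int)           # assumed: visit counts before epoch k
    n = defaultdict(int)           # assumed: visit counts within epoch k
    for t in range(1, T + 1):
        R_hat = update_reward_set(history)
        R_nz = choose_steering_reward(R_hat, pi_star)
        trajectory = play_game(R_nz)
        history.append(trajectory)
        for (h, s, a, _r) in trajectory:
            n[(h, s, a)] += 1
        # Assumed trigger: some triple was visited in this epoch at least as
        # often as in all earlier epochs (floor of 1 for unseen triples).
        if any(n[key] >= max(1, N[key]) for key in n):
            P = update_transition_set(history)
            epoch_ends.append(t)   # T_k <- t
            k += 1
            for key, count in n.items():
                N[key] += count
            n = defaultdict(int)
            pi_star, _model = plan(P)
    return pi_star, epoch_ends
```

Under the assumed counting rule, an environment that always visits the same $(h,s,a)$ triple triggers updates at doubling intervals, so the number of epochs grows only logarithmically in $T$.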

I.2 Missing Proofs

Lemma I.1 (Proposition 2 in Russo and Van Roy (2013)).

Let $N(\mathcal{R},\alpha,\|\cdot\|_{\infty})$ be the $\alpha$-covering number of $\mathcal{R}$ w.r.t. the $\|\cdot\|_{\infty}$-norm. Let $\delta > 0$, $\alpha > 0$, and for each $t$, $\beta_t = 8\sigma^2\log(N(\mathcal{R},\alpha,\|\cdot\|_{\infty})/\delta) + 2\alpha t\left(8r_{\max} + \sqrt{8\sigma^2\ln(4t^2/\delta)}\right)$. With probability at least $1-2\delta$, $r^* \in \bigcap_{t=1}^{\infty}\hat{\mathcal{R}}^t$.
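As an illustration of how the confidence radius $\beta_t$ in Lemma I.1 behaves, the following Python sketch evaluates the formula directly. The function name `beta` and the numerical inputs are assumptions chosen for illustration; in particular, the covering number is passed in as a plain number rather than computed.

```python
import math

def beta(t, covering_number, alpha, delta, sigma, r_max):
    """Confidence radius beta_t as stated in Lemma I.1."""
    log_term = 8.0 * sigma**2 * math.log(covering_number / delta)
    disc_term = 2.0 * alpha * t * (
        8.0 * r_max + math.sqrt(8.0 * sigma**2 * math.log(4.0 * t**2 / delta)))
    return log_term + disc_term

# Illustrative inputs only (not taken from the paper).
for t in (1, 10, 100):
    value = beta(t, covering_number=100, alpha=0.01, delta=0.05,
                 sigma=1.0, r_max=1.0)
    print(f"t = {t:3d}: beta_t = {value:.3f}")
```

The first term depends on $t$ only through the covering number and $\delta$, while the second term grows with $\alpha t$, which is why $\alpha$ is typically taken small relative to the horizon.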

Lemma I.2.

We abbreviate $\bar{\mu}^t = \bar{\mu}^t_{M^*}$. With probability at least $1-\delta$,

$$\sum_{t=1}^{T}\langle w_{\hat{\mathcal{R}}^t}(\bar{\mu}^t), \bar{\mu}^t\rangle \leq 3\sum_{t=1}^{T}\sum_{h=1}^{H} w_{\hat{\mathcal{R}}^t}(h,s_h^t,a_h^t,\bar{\mu}^t) + r_{\max}H\ln(1/\delta).$$
Proof.

Note that $\langle w_{\hat{\mathcal{R}}^t}(\bar{\mu}^t), \bar{\mu}^t\rangle = \mathbb{E}_{(s_h,a_h)_{h=1}^{H}\sim\bar{\mu}^t}\left[\sum_{h=1}^{H} w_{\hat{\mathcal{R}}^t}(h,s_h,a_h,\bar{\mu}^t)\right] =: Y_t$.
Recall that $(s_h^t,a_h^t)_h \sim \bar{\mu}^t$ are the trajectories we gather from the population at step $t$. Therefore, we can define $X_t := \sum_{h=1}^{H} w_{\hat{\mathcal{R}}^t}(h,s_h^t,a_h^t,\bar{\mu}^t)$ with $\mathbb{E}[X_t \mid \bar{\mu}^t] = Y_t$.
By the assumption that $r^*$ is bounded in $[0,r_{\max}]$, we have that $w_{\hat{\mathcal{R}}}(h,s,a,\mu) \leq r_{\max}$ for any $\mu,h,s,a$ and $\hat{\mathcal{R}} \subseteq \mathcal{R}$. Therefore, $0 \leq X_t \leq r_{\max}H$. A direct application of Lemma D.4 from Huang et al. (2023) shows that with probability at least $1-\delta$,

$$\sum_{t=1}^{T} Y_t \leq 3\sum_{t=1}^{T} X_t + r_{\max}H\ln\frac{1}{\delta}.$$

∎
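The concentration step in this proof can be illustrated with a small simulation. The sketch below is a simplification: it draws i.i.d. variables $X_t \in \{0, B\}$ (with $B$ playing the role of $r_{\max}H$), whereas in the lemma the $X_t$ are adapted to the history. The function name `violation_rate` and all parameter values are hypothetical.

```python
import math
import random

def violation_rate(T, B, p, delta, runs, seed=0):
    """Fraction of runs where sum Y_t > 3 * sum X_t + B * ln(1/delta)."""
    rng = random.Random(seed)
    violations = 0
    for _ in range(runs):
        # X_t = B with probability p, else 0, so Y_t = E[X_t] = B * p.
        sum_x = sum(B if rng.random() < p else 0.0 for _ in range(T))
        sum_y = T * B * p
        if sum_y > 3.0 * sum_x + B * math.log(1.0 / delta):
            violations += 1
    return violations / runs

# Illustrative parameters only (not taken from the paper).
rate = violation_rate(T=200, B=5.0, p=0.1, delta=0.05, runs=1000)
print(f"empirical violation rate: {rate:.3f} (lemma allows at most 0.05)")
```

Because of the multiplicative factor 3, the inequality is far from tight in this simple setting, so the empirical violation rate should sit well below $\delta$.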

Lemma I.3.

We abbreviate $\mu_*^k = \mu_{M^*}^{\pi_*^k}$ and $\bar{\mu}^t = \bar{\mu}^t_{M^*}$. If the true $r^*$ is contained in all $\hat{\mathcal{R}}^t$, then, with probability at least $1-\delta$,

$$\sum_{t=1}^{T}\langle R_{\pi_*^{k(t)}}(\bar{\mu}^t), \mu_*^{k(t)} - \bar{\mu}^t\rangle \leq \sum_{t=1}^{T}\langle r^*(\bar{\mu}^t) + R_{\text{nz}}^t(\bar{\mu}^t), \mu_*^{k(t)} - \bar{\mu}^t\rangle + 6\sum_{t=1}^{T}\sum_{h=1}^{H} w_{\hat{\mathcal{R}}^t}(h,s_h^t,a_h^t,\bar{\mu}^t) + 2r_{\max}H\ln\frac{1}{\delta}.$$
Proof.

Let $t \in [T]$ and $k = k(t)$. By Eq. (6),

$$\langle R_{\pi_*^k}(\bar{\mu}^t), \mu_*^k - \bar{\mu}^t\rangle = \langle r^*(\bar{\mu}^t) + R_{\text{nz}}^t(\bar{\mu}^t), \mu_*^k - \bar{\mu}^t\rangle + \langle \bar{r}^t(\bar{\mu}^t) - r^*(\bar{\mu}^t) - w_{\hat{\mathcal{R}}^t}(\bar{\mu}^t), \mu_*^k - \bar{\mu}^t\rangle.$$

With that, we have already separated out the first term (agent regret). Using the assumption that $r^* \in \hat{\mathcal{R}}^t$ for all $t$, we can bound the second term as follows.

$$
\begin{aligned}
&\langle \bar{r}^t(\bar{\mu}^t) - r^*(\bar{\mu}^t) - w_{\hat{\mathcal{R}}^t}(\bar{\mu}^t), \mu_*^k - \bar{\mu}^t\rangle \\
&= \langle r^*(\bar{\mu}^t) - \bar{r}^t(\bar{\mu}^t) + w_{\hat{\mathcal{R}}^t}(\bar{\mu}^t), \bar{\mu}^t\rangle + \langle \underbrace{\bar{r}^t(\bar{\mu}^t) - r^*(\bar{\mu}^t)}_{\leq\, w_{\hat{\mathcal{R}}^t}(\bar{\mu}^t)} - w_{\hat{\mathcal{R}}^t}(\bar{\mu}^t), \mu_*^k\rangle \\
&\leq \langle \underbrace{r^*(\bar{\mu}^t) - \bar{r}^t(\bar{\mu}^t)}_{\leq\, w_{\hat{\mathcal{R}}^t}(\bar{\mu}^t)} + w_{\hat{\mathcal{R}}^t}(\bar{\mu}^t), \bar{\mu}^t\rangle \leq 2\langle w_{\hat{\mathcal{R}}^t}(\bar{\mu}^t), \bar{\mu}^t\rangle
\end{aligned}
$$

Finally, we can bound $\sum_{t=1}^{T}\langle w_{\hat{\mathcal{R}}^t}(\bar{\mu}^t), \bar{\mu}^t\rangle$ using Lemma I.2, which implies the result. ∎
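The per-round inequality $\langle \bar{r}^t - r^* - w, \mu_*^k - \bar{\mu}^t\rangle \leq 2\langle w, \bar{\mu}^t\rangle$ used in this proof only needs two facts: any two members of the confidence set differ pointwise by at most the width $w$, and both occupancy measures are nonnegative. The Python sketch below checks this on random finite instances. It assumes the width is the pointwise maximum gap between members of the set; the finite reward set, the vector dimension and all function names are hypothetical illustrations.

```python
import random

def width(reward_set):
    """Pointwise width: max over pairs r, r' in the set of r(x) - r'(x)."""
    dim = len(reward_set[0])
    return [max(r[x] for r in reward_set) - min(r[x] for r in reward_set)
            for x in range(dim)]

def inner(u, v):
    """Plain inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def check_once(rng, dim=6, set_size=4):
    # A finite stand-in for the confidence set; r_star and r_bar both lie in it.
    reward_set = [[rng.random() for _ in range(dim)] for _ in range(set_size)]
    r_star, r_bar = rng.choice(reward_set), rng.choice(reward_set)
    w = width(reward_set)
    # Nonnegative vectors standing in for the two occupancy measures.
    mu_star = [rng.random() for _ in range(dim)]
    mu_bar = [rng.random() for _ in range(dim)]
    diff = [rb - rs - wx for rb, rs, wx in zip(r_bar, r_star, w)]
    lhs = inner(diff, [ms - mb for ms, mb in zip(mu_star, mu_bar)])
    rhs = 2.0 * inner(w, mu_bar)
    return lhs <= rhs + 1e-12  # small tolerance for floating-point error

rng = random.Random(0)
all_hold = all(check_once(rng) for _ in range(1000))
print("inequality held in every trial:", all_hold)
```

The check mirrors the two steps of the derivation: the term paired with $\mu_*^k$ is nonpositive and is dropped, and the term paired with $\bar{\mu}^t$ is at most $2w$ pointwise.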

See Theorem 6.1.

Proof.

We abbreviate ฮผโˆ—k=ฮผMโˆ—ฯ€โˆ—k,ฮผยฏt=ฮผยฏMโˆ—tformulae-sequencesuperscriptsubscript๐œ‡๐‘˜superscriptsubscript๐œ‡superscript๐‘€superscriptsubscript๐œ‹๐‘˜superscriptยฏ๐œ‡๐‘กsubscriptsuperscriptยฏ๐œ‡๐‘กsuperscript๐‘€\mu_{*}^{k}=\mu_{M^{*}}^{\pi_{*}^{k}},\bar{\mu}^{t}=\bar{\mu}^{t}_{M^{*}}italic_ฮผ start_POSTSUBSCRIPT โˆ— end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT = italic_ฮผ start_POSTSUBSCRIPT italic_M start_POSTSUPERSCRIPT โˆ— end_POSTSUPERSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_ฯ€ start_POSTSUBSCRIPT โˆ— end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT , overยฏ start_ARG italic_ฮผ end_ARG start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT = overยฏ start_ARG italic_ฮผ end_ARG start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_M start_POSTSUPERSCRIPT โˆ— end_POSTSUPERSCRIPT end_POSTSUBSCRIPT. We can use the exact same arguments as in the proof of Theoremย 5.1, up until the point where we have to bound

\[
\texttt{AgentReg}=\sum_{t=1}^{T}\langle R_{\pi_{*}^{k(t)}}(\bar{\mu}^{t}),\mu_{*}^{k(t)}-\bar{\mu}^{t}\rangle.
\]

Combining Lemma I.1 and Lemma I.3, we have with probability at least $1-3\delta$ that

\[
\sum_{t=1}^{T}\langle R_{\pi_{*}^{k(t)}}(\bar{\mu}^{t}),\mu_{*}^{k(t)}-\bar{\mu}^{t}\rangle\leq\underbrace{\sum_{t=1}^{T}\langle r^{*}(\bar{\mu}^{t})+R_{\text{nz}}^{t}(\bar{\mu}^{t}),\mu_{*}^{k(t)}-\bar{\mu}^{t}\rangle}_{\texttt{NewAgentReg}}+6\sum_{t=1}^{T}\sum_{h=1}^{H}w_{\hat{\mathcal{R}}^{t}}(h,s_{h}^{t},a_{h}^{t},\bar{\mu}^{t})+2r_{\max}H\ln\frac{1}{\delta}.
\]

Now, we collect the inputs into $x_{H_{t}+h}=(h,s_{h}^{t},a_{h}^{t},\bar{\mu}^{t}_{h})$, where $H_{t}=H(t-1)$, and with slight abuse of notation, we rewrite $\hat{r}(x_{H_{t}+h})=\hat{r}_{h}(s_{h}^{t},a_{h}^{t},\bar{\mu}^{t}_{h})$. With this notation, we can apply Lemma H.5 with $\varepsilon=T^{-1}$ to show that

\[
\sum_{t=1}^{T}\sum_{h=1}^{H}w_{\hat{\mathcal{R}}^{t}}(h,s_{h}^{t},a_{h}^{t},\bar{\mu}^{t})\leq H+r_{\max}H\dim_{E}(\mathcal{R},T^{-1})+4\sqrt{\beta_{T}H\dim_{E}(\mathcal{R},T^{-1})T}.
\]

Combining this with the previous results, we get that with probability at least $1-3\delta$,

\begin{align*}
\sum_{t=1}^{T}\langle R_{\pi_{*}^{k(t)}}(\bar{\mu}^{t}),\mu_{*}^{k(t)}-\bar{\mu}^{t}\rangle &\leq \texttt{NewAgentReg} \\
&\quad+\underbrace{6\left(H+r_{\max}H\dim_{E}(\mathcal{R},T^{-1})+4\sqrt{\beta_{T}H\dim_{E}(\mathcal{R},T^{-1})T}\right)+2r_{\max}H\ln\frac{1}{\delta}}_{=:D}.
\end{align*}

The new agent regret term NewAgentReg can be bounded in the same way as in the proof of Theorem 5.1:

\[
\sum_{t=1}^{T}\langle r^{*}(\bar{\mu}^{t})+R_{\text{nz}}^{t}(\bar{\mu}^{t}),\mu_{*}^{k(t)}-\bar{\mu}^{t}\rangle\leq\sum_{k=1}^{K}\sum_{t=T_{k-1}+1}^{T_{k}}\langle r^{*}(\bar{\mu}^{t})+R_{\text{nz}}^{t}(\bar{\mu}^{t}),\mu_{*}^{k}-\bar{\mu}^{t}\rangle\leq K\operatorname*{AdaReg}(T).
\]

For the steering cost, we have for any $\mu\in\Psi_{M}$,

\begin{align*}
C(\mu,R_{\text{nz}}^{t})-C(\mu,r_{\max}\bm{1}-r^{*}) &=\langle r^{*}(\mu)-\bar{r}^{t}(\mu)+w_{\hat{\mathcal{R}}^{t}}(\mu)+R_{\pi_{*}^{k(t)}}(\mu)+\|R_{\pi_{*}^{k(t)}}(\mu)\|_{\infty}\bm{1},\mu\rangle \\
&\leq 2\langle w_{\hat{\mathcal{R}}^{t}}(\mu),\mu\rangle+\langle R_{\pi_{*}^{k(t)}}(\mu)+\|R_{\pi_{*}^{k(t)}}(\mu)\|_{\infty}\bm{1},\mu\rangle.
\end{align*}

Then, summing over $t=1,\dots,T$,

\[
C_{T}(\{\bar{\mu}^{t},R_{\text{nz}}^{t}-(r_{\max}\bm{1}-r^{*})\}_{t=1}^{T})\leq 2\sum_{t=1}^{T}\langle w_{\hat{\mathcal{R}}^{t}}(\bar{\mu}^{t}),\bar{\mu}^{t}\rangle+\sum_{t=1}^{T}\langle R_{\pi_{*}^{k(t)}}(\bar{\mu}^{t})+\|R_{\pi_{*}^{k(t)}}(\bar{\mu}^{t})\|_{\infty}\bm{1},\bar{\mu}^{t}\rangle.
\]

Using Lemma I.2, we can bound the first term by $2(3\sum_{t=1}^{T}\sum_{h=1}^{H}w_{\hat{\mathcal{R}}^{t}}(x_{H_{t}+h})+r_{\max}H\ln(1/\delta))$ with probability at least $1-\delta$. Using Lemma H.5 with $\varepsilon=T^{-1}$, we can further bound this by $D=2r_{\max}H\ln(1/\delta)+6(H+r_{\max}H\dim_{E}(\mathcal{R},T^{-1})+4\sqrt{\beta_{T}H\dim_{E}(\mathcal{R},T^{-1})T})$. From the steering cost bound in Theorem 5.1, it follows that the second term is bounded by

\[
4H\sqrt{T\sum_{t=1}^{T}\langle R_{\pi_{*}^{k(t)}}(\bar{\mu}^{t}),\mu_{*}^{k(t)}-\bar{\mu}^{t}\rangle},
\]

which is at most $4H\sqrt{T(K\operatorname*{AdaReg}(T)+D)}$, as we have already shown in this proof.

Lastly, we discuss the asymptotic bound for $D$. The term $D$ depends on the Eluder dimension of $\mathcal{R}$ and on $\beta_{T}$. In Appendix H.1, we show several common function classes with $\dim_{E}(\mathcal{R},T^{-1})\in\tilde{\mathcal{O}}(1)$. Furthermore, if we assume that the functions in $\mathcal{R}$ are parametrized by parameters in some set $\Theta\subset\mathbb{R}^{d}$ with constant diameter and are $L$-Lipschitz in that parameter, we have $N(\mathcal{R},\alpha,\|\cdot\|_{\infty})\leq N(\Theta,\alpha/L,\|\cdot\|_{\infty})\leq\left(1+\mathcal{O}(L/\alpha)\right)^{d}$. Then, we may choose $\alpha=T^{-1}$ such that $\beta_{T}$ can also be bounded logarithmically in $T$. In the cases where the Eluder dimension and $\beta_{T}$ are in $\tilde{\mathcal{O}}(1)$, we have $D\in\tilde{\mathcal{O}}(\sqrt{T})$ (ignoring other factors). ∎
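To make the scaling of $D$ concrete, the following minimal Python sketch evaluates the expression for $D$ from the proof above and prints the ratio $D/\sqrt{T}$. The function name `d_term` and all numeric constants are hypothetical and chosen only for illustration; in particular, the Eluder dimension and $\beta_{T}$ are held fixed here, whereas in the paper they may carry logarithmic factors in $T$.

```python
import math


def d_term(T, H, r_max, delta, dim_e, beta_T):
    """Evaluate D = 6(H + r_max*H*dim_E + 4*sqrt(beta_T*H*dim_E*T)) + 2*r_max*H*ln(1/delta).

    dim_e stands for dim_E(R, 1/T) and beta_T for the confidence radius;
    both are passed in as plain numbers (an assumption of this sketch).
    """
    width_sum = H + r_max * H * dim_e + 4.0 * math.sqrt(beta_T * H * dim_e * T)
    return 6.0 * width_sum + 2.0 * r_max * H * math.log(1.0 / delta)


# Hypothetical constants, for illustration only.
H, r_max, delta, dim_e, beta_T = 5, 1.0, 0.05, 3.0, 2.0
for T in (10**2, 10**4, 10**6):
    D = d_term(T, H, r_max, delta, dim_e, beta_T)
    # With dim_e and beta_T fixed, D / sqrt(T) approaches 24*sqrt(beta_T*H*dim_e).
    print(T, D, D / math.sqrt(T))
```

With the Eluder dimension and $\beta_{T}$ held constant, the ratio $D/\sqrt{T}$ settles near $24\sqrt{\beta_{T}H\dim_{E}}$, which matches the $\tilde{\mathcal{O}}(\sqrt{T})$ rate stated above.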

Appendix J EXTENSION TO UNKNOWN UTILITY FUNCTION

In this section, we generalize our previous results to the setting where the mediator does not have prior knowledge of the utility function $U$. We consider the non-zero intrinsic reward setting, i.e., Scenario 2 described in Section 3.4. Note that the results for Scenario 1 can be directly derived by setting $r^{*}=0$.

Motivation for Unknown Utility Setting

This setting makes sense, especially when $U$ partially depends on the agents' intrinsic rewards $r^{*}$. As a motivating example, in financial markets, the government (the mediator) benefits (utility $U$) not only from the societal impact of the desired behaviors of the companies (the agents), but also from the taxes they pay, which are directly related to the rewards $r^{*}$ received by the agents.$^{4}$ Because $r^{*}$ is unknown, $U$ is only partially revealed to the mediator, which restricts the applicability of our methods in this setting. However, if $U$ is unknown, we might infer it, for example, by estimating the true reward functions $r^{*}$ through online interaction with the agents. We can also generalize this setting as follows.

$^{4}$ Another way to interpret this scenario is that the mediator's utility $U=\alpha U_{\text{mediator}}+(1-\alpha)U_{\text{agents};r^{*}}$ can be decomposed into a known function $U_{\text{mediator}}$ representing its intrinsic utility, and an unknown part $U_{\text{agents};r^{*}}$, which reflects the agents' interests and depends on $r^{*}$. Here $\alpha$ serves as a parameter that trades off the interests of the two parties.

We consider a general setting, where the mediator does not have prior knowledge of $U$, but can observe samples from $U$ perturbed by $\sigma_{U}$-sub-Gaussian noise, and has access to a function class $\mathcal{U}$ which contains $U$ and whose functions are bounded in $[0,U_{\max}]$.

J.1 Algorithm

We can use the standard technique described in Russo and Van Roy (2013) to handle this case. We define

\begin{align}
\bar{U}^{k} &=\operatorname*{arg\,min}_{\hat{U}\in\mathcal{U}}\sum_{t=1}^{T_{k}}(\hat{U}(\bar{\mu}^{t})-U(\bar{\mu}^{t}))^{2}, \tag{12} \\
\hat{\mathcal{U}}^{k} &=\left\{\hat{U}\in\mathcal{U}:\|\hat{U}-\bar{U}^{k}\|_{2,E_{T_{k}}}^{2}\leq\beta_{k}^{U}\right\}, \tag{13}
\end{align}

where $\beta_{k}^{U}:=8\sigma_{U}^{2}\log(N(\mathcal{U},\alpha,\|\cdot\|_{\infty})/\delta)+2\alpha k(8U_{\max}+\sqrt{8\sigma_{U}^{2}\ln(4k^{2}/\delta)})$ and, e.g., $\alpha=T^{-1}$.
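As an illustration of (12) and (13), the sketch below builds the least-squares estimate and the confidence set for a small finite class of utilities over a scalar stand-in for the density. Everything named here is hypothetical: `LinearUtility`, the grid of parameters, the data, and the constants are illustrative assumptions. The sketch also assumes that the squared empirical norm $\|\cdot\|_{2,E_{T_k}}^{2}$ is the sum of squared differences over the observed densities, and it fits to noisy observations in place of the exact values $U(\bar{\mu}^{t})$.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearUtility:
    """Hypothetical utility U_theta(mu) = theta * mu on a scalar stand-in for the density."""
    theta: float

    def __call__(self, mu):
        return self.theta * mu


def beta_u(k, sigma_u, u_max, log_cover, delta, alpha):
    """Radius beta_k^U from the text; log_cover stands for log N(U, alpha, ||.||_inf)."""
    first = 8.0 * sigma_u**2 * (log_cover + math.log(1.0 / delta))
    second = 2.0 * alpha * k * (
        8.0 * u_max + math.sqrt(8.0 * sigma_u**2 * math.log(4.0 * k**2 / delta))
    )
    return first + second


def least_squares_fit(candidates, mus, ys):
    """Analogue of (12): candidate with the smallest squared error on the observations ys."""
    return min(candidates, key=lambda f: sum((f(m) - y) ** 2 for m, y in zip(mus, ys)))


def confidence_set(candidates, mus, ys, beta):
    """Analogue of (13): candidates within squared empirical distance beta of the fit."""
    u_bar = least_squares_fit(candidates, mus, ys)
    return [f for f in candidates if sum((f(m) - u_bar(m)) ** 2 for m in mus) <= beta]


# Hypothetical data: true theta is 0.5, observations carry small fixed perturbations.
candidates = [LinearUtility(th) for th in (0.0, 0.25, 0.5, 0.75, 1.0)]
mus = [0.2, 0.4, 0.6, 0.8, 1.0]
ys = [0.5 * m + e for m, e in zip(mus, [0.01, -0.02, 0.015, -0.005, 0.0])]
beta = beta_u(k=5, sigma_u=0.1, u_max=1.0, log_cover=math.log(len(candidates)),
              delta=0.05, alpha=1e-3)
print([f.theta for f in confidence_set(candidates, mus, ys, beta)])
```

For a finite class the covering number can be taken as the class size, which is why `log_cover` is set to the log of the number of candidates in this toy example.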

Algorithm 5 Steering reward design for Scenario 2 and unknown utility
1: Initialize $\mathcal{P}^{1}:=$ set of all possible transition functions, $\pi_{*}^{1}$ (arbitrarily), $k=1$, $T_{0}=0$.
2: for $t=1,\dots,T$ do
3:     Update $\hat{\mathcal{R}}^{t}$ as in (10).
4:     Choose $R_{\text{nz}}^{t}$ as in (6).
5:     Agents play $t$-th game with $r^{*}+R_{\text{nz}}^{t}$.
6:     Obtain trajectory $((s^{t}_{h},a^{t}_{h},r^{t}_{h}))_{h=1}^{H}$.
7:     if $\exists(h,s,a)$ s.t. $n_{k}(h,s,a)\geq N_{k}(h,s,a)$ or $t-T_{k-1}\geq T_{\mathit{epoch}}$ then
8:         Update $\mathcal{P}^{k+1}$ as in (5).
9:         $T_{k}\leftarrow t$; $k\leftarrow k+1$.
10:         Compute $\hat{\mathcal{U}}^{k}$ as in (13).
11:         $\hat{U}^{k},\pi_{*}^{k},\hat{M}^{k}\leftarrow\operatorname*{arg\,max}_{\hat{U}\in\hat{\mathcal{U}}^{k},\,\pi\in\Pi,\,\hat{M}:\mathbb{P}_{\hat{M}}\in\mathcal{P}^{k}}\hat{U}(\mu^{\pi}_{\hat{M}})$.
12:     end if
13: end for

Algorithm 5 differs from Algorithm 4 in the if-condition in line 7 as well as in lines 10 and 11.

The if-condition in line 7 now includes the case $t-T_{k-1}\geq T_{\mathit{epoch}}$, where $T_{\mathit{epoch}}$ will be chosen later. We need this to guarantee $T_{k}-T_{k-1}\leq T_{\mathit{epoch}}$ for all $k$ and thereby bound the estimation error of the utility function estimate. Intuitively, we need to keep the estimates of $U$ somewhat up to date to be able to bound the estimation error. Meanwhile, we cannot update the estimate in each round (or too often) since then we would also have to change $\pi_{*}^{k}$ in each round, which would lead to $K=T$.
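The effect of the extra length condition can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `run_epoch_schedule` and the callback `count_trigger` (a stand-in for the visit-count test $n_{k}(h,s,a)\geq N_{k}(h,s,a)$) are hypothetical, and the sketch only tracks when epochs end, not the model or utility updates.

```python
def run_epoch_schedule(T, T_epoch, count_trigger=lambda t: False):
    """Return the epoch end times T_1 < T_2 < ... produced by the line-7 rule.

    count_trigger(t) is a hypothetical stand-in for the visit-count
    condition; the second condition is the epoch-length cap t - T_{k-1} >= T_epoch.
    """
    boundaries = []
    T_prev = 0  # T_0 = 0
    for t in range(1, T + 1):
        if count_trigger(t) or (t - T_prev) >= T_epoch:
            boundaries.append(t)  # T_k <- t; k <- k + 1
            T_prev = t
    return boundaries


T = 4096
T_epoch = round(T ** (5 / 6))  # 4096 = 2**12, so T**(5/6) = 2**10 = 1024

# Length cap only: every epoch has length exactly T_epoch.
length_only = run_epoch_schedule(T, T_epoch)

# Two extra count-triggered switches (rounds chosen arbitrarily for illustration).
with_counts = run_epoch_schedule(T, T_epoch, count_trigger=lambda t: t in {10, 100})
```

In both runs no epoch is longer than `T_epoch`, and the number of completed epochs is at most `T / T_epoch` plus the number of count-triggered switches, which mirrors the counting argument used for $K$ in the analysis.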

Since we also need to estimate the utility function, we changed line 11 to also compute an optimistic estimate of the utility using the definition in (13).

J.2 Analysis

Theorem J.1.

Under Assump. A, B and C, if we run Alg. 5 with $0<\delta<1$, then with probability at least $1-8\delta$, $K\leq T^{1/6}+HSA\log_{2}T$, and

\begin{align*}
\Delta_{T}(\{\bar{\mu}^{t}_{M^{*}}\}_{t=1}^{T}) &\leq L_{U}\sqrt{H^{3}SAT(K\operatorname*{AdaReg}(T)+D)}+36L_{U}H^{3}S\sqrt{AT\ln(THSA/\delta)}\\
&\quad+\mathcal{O}\left(T^{5/6}U_{\max}\dim_{E}(\mathcal{U},T^{-1})+\sqrt{\beta_{K}^{U}\dim_{E}(\mathcal{U},T^{-1})T}\right),\\
C_{T}(\{\bar{\mu}^{t}_{M^{*}},R_{\text{nz}}^{t}-(r_{\max}\cdot\mathbf{1}-r^{*})\}_{t=1}^{T}) &= 4H\sqrt{T(K\operatorname*{AdaReg}(T)+D)}+D,
\end{align*}

where $D=\tilde{O}\left(\sqrt{\beta_{T}H\dim_{E}(\mathcal{R},T^{-1})T}\right)$.

Comparing with Theorem 6.1, we see that the steering gap has an additional term originating from the estimation of $U$. Furthermore, the bound on the number of epochs $K$ has an additional $T^{1/6}$. Similar to the discussion about $\mathcal{R}$ in Section H.1, we can also bound $\beta^{U}_{K}$ and $\dim_{E}(\mathcal{U},T^{-1})$ under suitable assumptions about $\mathcal{U}$. If $\beta^{U}_{K},\dim_{E}(\mathcal{U},T^{-1})\in\tilde{\mathcal{O}}(1)$ and $\operatorname*{AdaReg}(T)=\tilde{\mathcal{O}}(\sqrt{T})$, both the steering gap and the steering cost are in $\tilde{\mathcal{O}}(T^{5/6})$ (ignoring all other constants).
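The $\tilde{\mathcal{O}}(T^{5/6})$ rate can be checked by tracking exponents of $T$ only. The following sketch additionally assumes $\beta_{T},\dim_{E}(\mathcal{R},T^{-1})\in\tilde{\mathcal{O}}(1)$, so that $D=\tilde{\mathcal{O}}(\sqrt{T})$, and treats $H$, $S$, $A$, $L_{U}$ and $U_{\max}$ as constants; these simplifications are ours and are made only for illustration.

```latex
\begin{align*}
K\operatorname*{AdaReg}(T)+D
  &= \tilde{\mathcal{O}}\bigl(T^{1/6}\cdot T^{1/2}\bigr)+\tilde{\mathcal{O}}\bigl(T^{1/2}\bigr)
   = \tilde{\mathcal{O}}\bigl(T^{2/3}\bigr),\\
\sqrt{T\,(K\operatorname*{AdaReg}(T)+D)}
  &= \tilde{\mathcal{O}}\bigl(\sqrt{T^{5/3}}\bigr)
   = \tilde{\mathcal{O}}\bigl(T^{5/6}\bigr),\\
\sqrt{AT\ln(THSA/\delta)}
  &= \tilde{\mathcal{O}}\bigl(T^{1/2}\bigr),\\
T^{5/6}U_{\max}\dim_{E}(\mathcal{U},T^{-1})+\sqrt{\beta_{K}^{U}\dim_{E}(\mathcal{U},T^{-1})\,T}
  &= \tilde{\mathcal{O}}\bigl(T^{5/6}\bigr)+\tilde{\mathcal{O}}\bigl(T^{1/2}\bigr)
   = \tilde{\mathcal{O}}\bigl(T^{5/6}\bigr).
\end{align*}
```

The second line controls both the leading term of the steering gap and the steering cost, so under these assumptions the dominant exponent is $5/6$ in both bounds.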

Proof.

We can adapt the proof of Theorem 6.1 by choosing the following regret decomposition.

\begin{align*}
U(\mu^{\pi^{*}})-U(\bar{\mu}^{t})=\left(U(\mu^{\pi^{*}})-\hat{U}^{k}(\hat{\mu}^{k}_{*})\right)+\left(\hat{U}^{k}(\hat{\mu}^{k}_{*})-\hat{U}^{k}(\bar{\mu}^{t})\right)+\left(\hat{U}^{k}(\bar{\mu}^{t})-U(\bar{\mu}^{t})\right)
\end{align*}

Using Lemma I.1 (and replacing $\mathcal{R}$ by $\mathcal{U}$ in the Lemma), we have $U\in\bigcap_{k=1}^{K}\hat{\mathcal{U}}^{k}$ with probability at least $1-2\delta$. Thus, with probability at least $1-2\delta$, the first term can be bounded by 0 using optimism. The second term can be bounded in the same way as in the proof of Theorem 6.1. Summing over all $t$, the last term accumulates to

\begin{align*}
\sum_{t=1}^{T}\hat{U}^{k}(\bar{\mu}^{t})-U(\bar{\mu}^{t})=\sum_{k=1}^{K}\sum_{t=T_{k-1}+1}^{T_{k}}\hat{U}^{k}(\bar{\mu}^{t})-U(\bar{\mu}^{t})\leq\sum_{k=1}^{K}\sum_{t=T_{k-1}+1}^{T_{k}}w_{\hat{\mathcal{U}}^{k}}(\bar{\mu}^{t}).
\end{align*}

Using the fact that $T_{k}-T_{k-1}\leq T_{\mathit{epoch}}$ and Lemma H.5 with $\varepsilon=T^{-1}$, the sum above is at most

\begin{align*}
\frac{KT_{\mathit{epoch}}}{T}+U_{\max}T_{\mathit{epoch}}\dim_{E}(\mathcal{U},T^{-1})+4\sqrt{\beta_{T}^{U}KT_{\mathit{epoch}}\dim_{E}(\mathcal{U},T^{-1})}.
\end{align*}

We also have to find a new bound for $K$. As before, we can enter the if block at most $HSA\log_{2}T$ times because of the first condition. In addition, we can enter the if block at most $T/T_{\mathit{epoch}}$ times due to the condition $t-T_{k-1}\geq T_{\mathit{epoch}}$, since each such entry closes an epoch of length at least $T_{\mathit{epoch}}$. Therefore, $K\leq T/T_{\mathit{epoch}}+HSA\log_{2}T$ and $KT_{\mathit{epoch}}\leq T+HSAT_{\mathit{epoch}}\log_{2}T$.

Now, we set $T_{\mathit{epoch}}=T^{5/6}$. Then, $K\leq T^{1/6}+HSA\log_{2}T$ and

\begin{align*}
\sum_{t=1}^{T}\hat{U}^{k}(\bar{\mu}^{t})-U(\bar{\mu}^{t})\leq\mathcal{O}\left(T^{5/6}U_{\max}\dim_{E}(\mathcal{U},T^{-1})+\sqrt{\beta_{T}^{U}\dim_{E}(\mathcal{U},T^{-1})T}\right).
\end{align*}

Finally, the steering gap is the previous bound of Theorem 6.1 plus the above term.
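For readers who want the substitution $T_{\mathit{epoch}}=T^{5/6}$ spelled out, the three terms of the bound obtained from Lemma H.5 can be handled one by one using $KT_{\mathit{epoch}}\leq T+HSAT_{\mathit{epoch}}\log_{2}T$. The final simplification below assumes $T$ is large enough that $HSA\log_{2}T\leq T^{1/6}$; this condition is our assumption for the sketch and is not stated in the source.

```latex
\begin{align*}
\frac{KT_{\mathit{epoch}}}{T}
  &\leq 1+HSA\,T^{-1/6}\log_{2}T \;\leq\; 2,\\
U_{\max}T_{\mathit{epoch}}\dim_{E}(\mathcal{U},T^{-1})
  &= T^{5/6}U_{\max}\dim_{E}(\mathcal{U},T^{-1}),\\
4\sqrt{\beta_{T}^{U}KT_{\mathit{epoch}}\dim_{E}(\mathcal{U},T^{-1})}
  &\leq 4\sqrt{\beta_{T}^{U}\dim_{E}(\mathcal{U},T^{-1})\bigl(T+HSA\,T^{5/6}\log_{2}T\bigr)}
   \;\leq\; 4\sqrt{2\,\beta_{T}^{U}\dim_{E}(\mathcal{U},T^{-1})\,T}.
\end{align*}
```

The first term is a constant, and the remaining two terms match the two summands inside the $\mathcal{O}(\cdot)$ bound on the accumulated estimation error.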

With regard to the steering cost, the only change is the bound of $K$.

∎