Open Ad Hoc Teamwork with Cooperative Game Theory

Jianhong Wang    Yang Li    Yuan Zhang    Wei Pan    Samuel Kaski
Abstract

Ad hoc teamwork poses a challenging problem, requiring the design of an agent to collaborate with teammates without prior coordination or joint training. Open ad hoc teamwork further complicates this challenge by considering environments with a changing number of teammates, referred to as open teams. One promising solution to this problem is graph-based policy learning (GPL), which leverages the generalizability of graph neural networks to handle an unrestricted number of agents and thereby address open teams. However, its joint Q-value representation over a coordination graph lacks convincing explanations. In this paper, we establish a new theory to understand the joint Q-value representation from the perspective of cooperative game theory, and validate its learning paradigm in open team settings. Building on our theory, we propose a novel algorithm named CIAO, compatible with the GPL framework, with additional provable implementation tricks that can facilitate learning. A demo of the experiments is available at https://sites.google.com/view/ciao2024, and the code of the experiments is published at https://github.com/hsvgbkhgbv/CIAO.

Machine Learning, ICML

1 Introduction

Multi-agent reinforcement learning (MARL) has achieved partial success on multiple tasks including playing strategy games (Rashid et al., 2020), power system operation (Wang et al., 2021), and dynamic algorithm configuration (Xue et al., 2022). These tasks fit the training paradigm of MARL, which requires all agents to be controllable and to be coordinated during training. However, this paradigm struggles with many real-world tasks where not all agents are controllable and even prior coordination may not be possible. For example, in search and rescue, a robot must collaborate with humans or with other robots it has not seen before (e.g., manufactured by various companies without a common coordination protocol) to rescue survivors (Barrett & Stone, 2015). Similar situations occur in AI that helps trading markets (Albrecht & Ramamoorthy, 2013), as well as in the human-machine and machine-machine collaboration emerging from the prevailing embodied AI settings (Smith & Gasser, 2005; Duan et al., 2022) and large language models (Brown et al., 2020; Zhao et al., 2023).

To tackle the ad hoc teamwork problem, we explore a scenario where one agent, referred to as the learner, operates under our control and seeks to collaborate without prior coordination with teammates that have unknown types and policies (Stone et al., 2010). When teams have dynamic sizes, commonly termed open teams, the problem is referred to as open ad hoc teamwork (OAHT) (Mirsky et al., 2022); this is the problem addressed in this paper. One promising solution for OAHT is graph-based policy learning (GPL) (Rahman et al., 2021). GPL presents an empirical three-fold framework, encompassing a type inference model, a joint action value model, and an agent model, to tackle this problem. Although GPL performs well empirically, its weakness is that the representation of the joint Q-value over a coordination graph in OAHT lacks convincing explanations. This restricts its applicability to real-world problems requiring trustworthy algorithms (Bhat & Alqahtani, 2021; Wang et al., 2021).

We propose to describe OAHT using a game model from cooperative game theory, namely the coalitional affinity game (CAG) (Brânzei & Larson, 2009). Specifically, we extend the CAG by incorporating Bayesian games (Harsanyi, 1967) to depict uncertain agent types and stochastic games (Shapley, 1953) to represent the long-horizon goal. The resulting game is termed the open stochastic Bayesian coalitional affinity game (OSB-CAG). In this game, the learner aims to influence other teammates (via its actions) to collaborate in achieving a shared goal. To formalize this, we extend the standard cooperative game theory notion of strict core to a novel solution concept which we call dynamic variational strict core (DVSC). The DVSC transforms collaboration in a temporary team into the task of forming a stable temporary team, where no agent has incentives to leave. We model the OAHT process under the learner’s influence as a dynamic affinity graph (equivalent to a coordination graph), generalizing the classical static CAG. Based on the dynamic affinity graph, we further conceptualize an agent’s preference for a temporary team to measure whether they prefer to stay in the team under the learner’s influence. GPL’s joint action value model is proven to be the sum of any temporary agents’ preferences over a long horizon.

The main contributions of this paper can be summarized as follows: (1) We conceptualize OAHT as a dynamic coalitional affinity game, OSB-CAG. In this model, the learner seeks to influence teammates through its actions, without prior coordination, to establish a stable temporary team. (2) The theoretical model of OSB-CAG gives an understanding of GPL’s joint action value model. It ensures collaboration within any temporary team under open team settings. (3) Building on the OSB-CAG theory, we derive a constraint for representing the joint action value to facilitate learning, and an additional regularization term depending on the graph structure to rationalize solving DVSC as an RL problem. The novel algorithm, named CIAO (Cooperative game theory Inspired Ad hoc teamwork in Open teams), is implemented based on GPL and incorporates the above novel and provable tricks. (4) We validate the learning paradigm of GPL in open team settings. (5) We conduct experiments, primarily comparing two instances of CIAO (CIAO-S and CIAO-C, implemented with star and complete graph structures, respectively) based on the GPL framework in two environments: Level-based Foraging (LBF) and Wolfpack under open team settings (Rahman et al., 2021). Finally, we provide a comprehensive review and discussion of related work on both theoretical and algorithmic aspects of AHT and explore its relationship to MARL in Appendix A.

2 Background

Let $\Delta(\Omega)$ indicate the set of probability distributions over a random variable on a sample space $\Omega$ and let $\mathbb{P}(\mathcal{X})$ denote the power set of an arbitrary set $\mathcal{X}$. To simplify the notation, let $i$ exclusively denote the learner and $-i$ denote the set of all temporary teammates at any timestep. $P(\mathcal{X})$ indicates the generic probability distribution over a random variable $\mathcal{X}$ and $|\mathcal{X}|$ indicates the cardinality of an arbitrary set $\mathcal{X}$.

2.1 Coalitional Affinity Game

As a subclass of non-transferable utility games, a hedonic game (Chalkiadakis et al., 2022) is defined as a tuple $\langle \mathcal{N}, \succeq \rangle$, where $\mathcal{N}$ is the set of all agents and $\succeq = (\succeq_1, \ldots, \succeq_n)$ is a sequence of agents’ preferences over the subsets of $\mathcal{N}$, called coalitions. $\mathcal{C} \succeq_j \mathcal{C}'$ implies that coalition $\mathcal{C}$ is no less preferred by agent $j$ than coalition $\mathcal{C}'$. For each agent $j \in \mathcal{N}$, $\succeq_j$ describes a complete and transitive preference relation over the collection of all feasible coalitions $\mathcal{N}(j) = \{\mathcal{C} \subseteq \mathcal{N} \mid j \in \mathcal{C}\}$. The outcome of a hedonic game is a coalition structure $\mathcal{CS}$, i.e., a partition of $\mathcal{N}$ into disjoint coalitions. We denote by $\mathcal{CS}(j)$ the coalition including agent $j$. The ordinal preferences can be represented in cardinal form with preference values (Sliwinski & Zick, 2017).
More specifically, an agent $j$ has a preference value function $v_j : \mathcal{N}(j) \rightarrow \mathbb{R}_{\geq 0}$. $v_j(\mathcal{C}) \geq v_j(\mathcal{C}')$ if $\mathcal{C} \succeq_j \mathcal{C}'$, which implies that agent $j$ weakly prefers $\mathcal{C}$ to $\mathcal{C}'$; $v_j(\mathcal{C}) > v_j(\mathcal{C}')$ if $\mathcal{C} \succ_j \mathcal{C}'$, which implies that agent $j$ strictly prefers $\mathcal{C}$ to $\mathcal{C}'$.

To concisely represent the preference value, a hedonic game is equipped with an affinity graph $G = \langle \mathcal{N}, \mathcal{E} \rangle$, where each edge $(j,k) \in \mathcal{E}$ describes an affinity relation between agents $j$ and $k$. Each edge $(j,k)$ defines an affinity weight $w(j,k) \in \mathbb{R}$ indicating the value that agent $j$ can receive from agent $k$, while $w(j,k) = 0$ if $(j,k) \notin \mathcal{E}$. For any coalition $\mathcal{C} \subseteq \mathcal{N}_j$, the preference value of agent $j$ is specified as $v_j(\mathcal{C}) = \sum_{(j,k) \in \mathcal{E},\, k \in \mathcal{C}} w(j,k)$ if $\mathcal{C} \neq \{j\}$; otherwise, $v_j(\{j\}) = b_j \in \mathbb{R}_{\geq 0}$. (Footnote 1: In the original CAG setting (Sliwinski & Zick, 2017), $v_j(\{j\})$ is conventionally set to zero. Herein, we extend it to non-negative values for generality (see Appendix E).)
An affinity graph is symmetric if $w(j,k) = w(k,j)$ for all $(j,k), (k,j) \in \mathcal{E}$. A hedonic game with an affinity graph is called a coalitional affinity game (CAG) (Brânzei & Larson, 2009). Strict core stability is a principal solution concept of CAG (see Definition 1).
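The preference value defined above can be illustrated with a minimal Python sketch. It assumes that the affinity graph is stored as a dictionary from directed edges to weights; the names `preference_value`, `is_symmetric`, and the toy weights in `W` are hypothetical and serve only as an illustration.

```python
from typing import Dict, FrozenSet, Tuple

Agent = str
Weights = Dict[Tuple[Agent, Agent], float]  # w(j, k) for each edge (j, k) in E


def preference_value(j: Agent, coalition: FrozenSet[Agent], weights: Weights,
                     singleton_value: float = 0.0) -> float:
    """v_j(C): b_j if C == {j}, else the sum of w(j, k) over edges (j, k) with k in C."""
    if j not in coalition:
        raise ValueError("agent j must belong to the coalition")
    if coalition == frozenset({j}):
        return singleton_value  # b_j >= 0
    # Missing edges have w(j, k) = 0, so only stored edges need to be summed.
    return sum(w for (a, k), w in weights.items() if a == j and k in coalition)


def is_symmetric(weights: Weights) -> bool:
    """True if w(j, k) == w(k, j) whenever both (j, k) and (k, j) are edges."""
    return all(weights[(k, j)] == w
               for (j, k), w in weights.items() if (k, j) in weights)


# Toy symmetric affinity graph over agents a, b, c (weights are made up).
W: Weights = {("a", "b"): 2.0, ("b", "a"): 2.0, ("a", "c"): 0.5, ("c", "a"): 0.5}
```

With these toy weights, agent `a` values the coalition `{a, b, c}` at `w(a, b) + w(a, c)`, whereas agent `c` receives nothing from `b` because the edge `(c, b)` is absent.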

Definition 1.

We say that a coalition $\mathcal{C}$ weakly blocks a coalition structure $\mathcal{CS}$ if every agent $j \in \mathcal{C}$ weakly prefers $\mathcal{C}$ to $\mathcal{CS}(j)$ and there exists at least one agent $k \in \mathcal{C}$ who strictly prefers $\mathcal{C}$ to $\mathcal{CS}(k)$. A coalition structure admitting no weakly blocking coalition $\mathcal{C} \subseteq \mathcal{N}$ is called strict core stable.
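As a small illustration of Definition 1, the sketch below checks strict core stability by brute force, enumerating every non-empty coalition. The helper names and the toy game are assumptions made for this example; the enumeration is exponential in the number of agents and is only practical for tiny games.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable, List

Coalition = FrozenSet[str]
ValueFn = Callable[[str, Coalition], float]  # v_j(C)


def coalition_of(agent: str, structure: List[Coalition]) -> Coalition:
    """CS(j): the coalition of the partition that contains the agent."""
    return next(c for c in structure if agent in c)


def weakly_blocks(coalition: Coalition, structure: List[Coalition],
                  value: ValueFn) -> bool:
    """Every member weakly prefers the coalition and at least one strictly does."""
    gains = [value(j, coalition) - value(j, coalition_of(j, structure))
             for j in coalition]
    return all(g >= 0 for g in gains) and any(g > 0 for g in gains)


def is_strict_core_stable(agents: Iterable[str], structure: List[Coalition],
                          value: ValueFn) -> bool:
    """True if no non-empty coalition weakly blocks the coalition structure."""
    members = sorted(agents)
    for size in range(1, len(members) + 1):
        for subset in combinations(members, size):
            if weakly_blocks(frozenset(subset), structure, value):
                return False
    return True


# Toy symmetric CAG with positive weights and b_j = 0 (illustrative numbers).
TOY_W = {frozenset({"a", "b"}): 2.0, frozenset({"a", "c"}): 1.0,
         frozenset({"b", "c"}): 1.0}


def toy_value(j: str, coalition: Coalition) -> float:
    if coalition == frozenset({j}):
        return 0.0
    return sum(TOY_W[frozenset({j, k})] for k in coalition if k != j)
```

In this toy game all weights are positive, so the grand coalition is strict core stable, while the partition into singletons is blocked by, for example, `{a, b}`.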

2.2 Graph-Based Policy Learning

We now briefly review GPL’s empirical framework (Rahman et al., 2021) to solve OAHT (see Appendix C.1 for more details). GPL consists of the following modules: the type inference model, the joint action value model and the agent model. To align with our motivation, we transform the framework to be adaptable to any coordination graph structure, as opposed to being restricted to only the complete graph as in GPL.

Type Inference Model. This is modelled as an LSTM (Hochreiter & Schmidhuber, 1997) that infers the agent types of a team at timestep $t$ given the teammates’ agent-types and the state at timestep $t-1$. The agent-type is modelled as a fixed-length hidden-state vector of the LSTM, referred to as the agent-type embedding. To address the issue of variable team size, the embeddings of agents who leave the team are removed at each timestep, while the type embeddings of newly added agents are set to a zero vector.
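The bookkeeping for a variable team size can be sketched as follows. This is a simplified illustration that omits the LSTM update itself and assumes embeddings are kept in a dictionary keyed by agent identifier; `refresh_type_embeddings` is a hypothetical helper, not part of GPL's code.

```python
from typing import Dict, Iterable, List


def refresh_type_embeddings(previous: Dict[str, List[float]],
                            current_agents: Iterable[str],
                            dim: int) -> Dict[str, List[float]]:
    """Keep embeddings of staying agents, drop leavers, zero-initialize newcomers."""
    refreshed: Dict[str, List[float]] = {}
    for agent in current_agents:
        if agent in previous:
            refreshed[agent] = list(previous[agent])  # agent stays in the team
        else:
            refreshed[agent] = [0.0] * dim            # agent has just joined
    # Agents present in `previous` but not in `current_agents` are dropped.
    return refreshed
```

For instance, if agent `b` leaves and agent `c` joins, the result keeps the embedding of `a`, discards that of `b`, and assigns `c` a zero vector of length `dim`.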

Joint Action Value Model. The joint Q-value $\hat{Q}^{\pi^i}(s_t, a_t)$ is approximated as the sum of the individual utilities $\hat{Q}_j^{\pi^i}(a_t^j \mid s_t)$ and the pairwise utilities $\hat{Q}_{jk}^{\pi^i}(a_t^j, a_t^k \mid s_t)$:

$$\hat{Q}^{\pi^i}(s_t, a_t) = \sum_{(j,k) \in \mathcal{E}_t} \hat{Q}_{jk}^{\pi^i}(a_t^j, a_t^k \mid s_t) + \sum_{j \in \mathcal{N}_t} \hat{Q}_j^{\pi^i}(a_t^j \mid s_t), \tag{1}$$

where the superscript $\pi^i$ implies that the above terms can only be optimized by the learner’s policy $\pi^i$.
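Equation (1) can be made concrete with a minimal sketch in which the individual and pairwise utilities for a fixed state are given as lookup tables rather than neural networks. The function name `joint_q` and the toy tables are assumptions for illustration only.

```python
from typing import Dict, Hashable, Iterable, Tuple

Action = Hashable
IndividualQ = Dict[str, Dict[Action, float]]                           # Q_j(a^j | s)
PairwiseQ = Dict[Tuple[str, str], Dict[Tuple[Action, Action], float]]  # Q_jk(a^j, a^k | s)


def joint_q(agents: Iterable[str], edges: Iterable[Tuple[str, str]],
            actions: Dict[str, Action],
            individual_q: IndividualQ, pairwise_q: PairwiseQ) -> float:
    """Joint Q-value of Eq. (1) for one state: pairwise terms plus individual terms."""
    pair_term = sum(pairwise_q[(j, k)][(actions[j], actions[k])] for (j, k) in edges)
    individual_term = sum(individual_q[j][actions[j]] for j in agents)
    return pair_term + individual_term


# Toy utilities for a learner "i" and one teammate "j" (numbers are made up).
INDIVIDUAL = {"i": {0: 1.0, 1: 0.5}, "j": {0: 0.0, 1: 2.0}}
PAIRWISE = {("i", "j"): {(0, 0): 0.5, (0, 1): -1.0, (1, 0): 0.0, (1, 1): 1.5}}
```

Because the sum runs over whatever agents and edges are present at timestep $t$, the same function applies unchanged when teammates join or leave.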

Agent Model. To address the open team setting, a GNN is applied to process the joint agent-type embedding $\theta_t$ produced by the type inference model, where each agent is represented as a node and the coordination graph is consistent with that of the joint action value model. The resulting node representation $\bar{n}_t$ is used as input to infer the estimated teammates’ joint policy, denoted as $\hat{\pi}^{-i}(a_t^{-i} \mid s_t)$.

Learner’s Decision Making. The learner’s approximate action value function $\hat{Q}^{\pi^i}(s_t, a_t^i)$ is defined as follows:

$$\hat{Q}^{\pi^i}(s_t, a_t^i) = \mathbb{E}_{a_t^{-i} \sim \pi_t^{-i}} \left[ \hat{Q}^{\pi^i}(s_t, a_t^i, a_t^{-i}) \right], \tag{2}$$

where $s_t$ is a state at timestep $t$, $a_t^{-i}$ is a joint action of teammates $-i$ at timestep $t$, and $a_t^i$ is the learner $i$’s action at timestep $t$. The learner’s decision making is conducted by selecting the action that maximizes $\hat{Q}^{\pi^i}(s_t, a_t^i)$.
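The expectation in Eq. (2) and the greedy action choice can be sketched by exact enumeration. The sketch assumes, purely for illustration, that the teammates' joint policy factorizes into independent per-teammate action distributions and that the joint Q-value is available as a callable; `learner_q`, `greedy_action`, and `match_q` are hypothetical names.

```python
from itertools import product
from typing import Callable, Dict, Hashable, Sequence

Action = Hashable
Policy = Dict[Action, float]  # action -> probability for one teammate
JointQ = Callable[[Action, Dict[str, Action]], float]


def learner_q(learner_action: Action, teammate_policies: Dict[str, Policy],
              joint_q_fn: JointQ) -> float:
    """Expected joint Q-value over the teammates' joint actions, as in Eq. (2)."""
    teammates = sorted(teammate_policies)
    total = 0.0
    for combo in product(*(teammate_policies[j].items() for j in teammates)):
        prob = 1.0
        joint = {}
        for j, (action, p) in zip(teammates, combo):
            joint[j] = action
            prob *= p  # independence across teammates is an assumption here
        total += prob * joint_q_fn(learner_action, joint)
    return total


def greedy_action(learner_actions: Sequence[Action],
                  teammate_policies: Dict[str, Policy],
                  joint_q_fn: JointQ) -> Action:
    """Select the learner action that maximizes the expected joint Q-value."""
    return max(learner_actions,
               key=lambda a: learner_q(a, teammate_policies, joint_q_fn))


def match_q(learner_action: Action, teammate_actions: Dict[str, Action]) -> float:
    """Toy joint Q-value: number of teammates choosing the learner's action."""
    return float(sum(a == learner_action for a in teammate_actions.values()))
```

The enumeration grows exponentially with the number of teammates, so it is meant only to clarify the definition, not to suggest how the expectation is evaluated at scale.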

3 A New Game Model to Formalize OAHT

In this section, we generalize the coalitional affinity game framework to formalize OAHT, by integrating a graph to represent relationships among agents. For the sake of brevity, this work focuses exclusively on fully observable scenarios.

3.1 Problem Formulation

In an environment, the learner $i$ interacts with other uncontrollable temporary teammates $-i$ to achieve a shared goal. To model this process, we introduce the Open Stochastic Bayesian Coalitional Affinity Game (OSB-CAG), defined as a tuple $\langle \mathcal{N}, \mathcal{S}, (\mathcal{A}_j)_{j \in \mathcal{N}}, \Theta, (R_j)_{j \in \mathcal{N}}, P_T, P_I, P_A, \mathcal{E}, \gamma \rangle$. Here, $\mathcal{N}$ represents the set of all possible agents, $\mathcal{S}$ is the set of states, $\mathcal{A}_j$ is the action set for agent $j$, and $\Theta$ denotes the set of all possible agent-types. Let the joint action set under a variable agent set $\mathcal{N}_t \subseteq \mathcal{N}$ be defined as $\mathcal{A}_{\mathcal{N}_t} = \times_{j \in \mathcal{N}_t} \mathcal{A}_j$.
Therefore, the joint action space under a variable number of agents is defined as $\mathcal{A}_{\mathcal{N}} = \bigcup_{\mathcal{N}_t \in \mathbb{P}(\mathcal{N})} \{a \mid a \in \mathcal{A}_{\mathcal{N}_t}\}$, while the joint agent-type space under a variable number of agents is defined as $\Theta_{\mathcal{N}} = \bigcup_{\mathcal{N}_t \in \mathbb{P}(\mathcal{N})} \{\theta \mid \theta \in \Theta^{|\mathcal{N}_t|}\}$. A dynamic affinity graph, denoted as $G_t = \langle \mathcal{N}_t, \mathcal{E}_t \rangle$, is introduced to describe the relationships among agents.
Here, $\mathcal{E}_t = \{(j,k) \mid j,k \in \mathcal{N}_t\} \subseteq \mathcal{E}$, and $\mathcal{E}$ is a set of possible edges represented by pairs $(j,k)$. This graph is referred to as the coordination graph in GPL.
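The joint action set for a variable agent set and the restriction of the edge set to the agents currently present can be sketched as below. The helpers `complete_edges` and `star_edges` are hypothetical ways of building $\mathcal{E}$ for the complete and star structures mentioned in the introduction; listing each edge once and centring the star on the learner are assumptions of this sketch.

```python
from itertools import combinations, product
from typing import Dict, Hashable, Iterable, List, Sequence, Set, Tuple

Action = Hashable
Edge = Tuple[str, str]


def joint_action_set(agent_set: Iterable[str],
                     action_sets: Dict[str, Sequence[Action]]) -> List[Dict[str, Action]]:
    """A_{N_t}: Cartesian product of the action sets of the agents in N_t."""
    agents = sorted(agent_set)
    return [dict(zip(agents, combo))
            for combo in product(*(action_sets[j] for j in agents))]


def restrict_edges(all_edges: Iterable[Edge], agent_set: Iterable[str]) -> Set[Edge]:
    """E_t: the edges of E whose endpoints are both in the current agent set N_t."""
    present = set(agent_set)
    return {(j, k) for (j, k) in all_edges if j in present and k in present}


def complete_edges(agents: Iterable[str]) -> Set[Edge]:
    """Every pair of distinct agents, each listed once (assumed convention)."""
    return set(combinations(sorted(agents), 2))


def star_edges(agents: Iterable[str], learner: str) -> Set[Edge]:
    """Edges from the learner to every other agent (assumed star centre)."""
    return {(learner, k) for k in agents if k != learner}
```

When a teammate leaves, calling `restrict_edges` with the new agent set removes exactly the edges incident to that teammate, which is the sense in which the affinity graph is dynamic.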

Transition Function. We now introduce three primitive probability distributions denoted as $P_T : \mathbb{P}(\mathcal{N}) \times \mathcal{S} \times \mathcal{A}_{\mathcal{N}} \rightarrow \Delta(\mathbb{P}(\mathcal{N}) \times \mathcal{S})$, $P_I : \mathbb{P}(\mathcal{N}) \times \mathcal{S} \rightarrow [0,1]$, and $P_A : \mathcal{N} \times \mathcal{S} \rightarrow \Delta(\Theta)$. These probability functions characterize the dynamics of the environment in the following procedure: (1) At the initial timestep $0$, $P_I(\mathcal{N}_0, s_0)$ generates an initial set of agents $\mathcal{N}_0$ and an initial state $s_0$. (2) $P_A(\theta_t^j \mid \{j\}, s_t)$ represents a type assignment function that randomly assigns agent-types to the generated agent set.
(3) $P_T(\mathcal{N}_t, s_t \mid \mathcal{N}_{t-1}, s_{t-1}, a_{t-1})$ generates the agent set $\mathcal{N}_t$ and state $s_t$ for the next timestep $t$. (4) Stages 2 and 3 above are repeated. To succinctly represent the aforementioned process, we derive a composite transition function $T(\mathcal{N}_t, s_t, \theta_t \mid s_{t-1}, a_{t-1}, \theta_{t-1})$ (see Proposition 1) in place of stages 2 and 3 for timesteps $t \geq 1$. This function can be factorized, clarifying GPL’s framework, as follows:

$$T(\mathcal{N}_t, s_t, \theta_t \mid s_{t-1}, a_{t-1}, \theta_{t-1}) = P_E(\theta_t \mid \mathcal{N}_t, s_t)\, P_O(\mathcal{N}_t, s_t \mid s_{t-1}, a_{t-1}, \theta_{t-1}). \tag{3}$$

Herein, $P_O(\mathcal{N}_t, s_t \mid s_{t-1}, a_{t-1}, \theta_{t-1})$ is a probability distribution composed of $P_T$, $P_I$ and $P_A$ (see the proof sketch of Proposition 1) that generates a variable agent set $\mathcal{N}_t$ and a state $s_t$, both observable by the learner.
In contrast, the joint agent-type $\theta_t$ generated from $P_E(\theta_t \mid \mathcal{N}_t, s_t) = \prod_{j=1}^{|\mathcal{N}_t|} P_A(\theta_t^j \mid \{j\}, s_t)$ is unobservable by the learner. However, it plays a crucial role in the agent model for the learner’s decision making in the empirical framework of GPL, which motivates estimating this term in practice, as done by the type inference model (see Section 2.2). To distinguish the observations generated from $P_O$ from the agent-types generated from $P_E$ during the decision process, both functions are used together to describe the composite transition function $T$ in the subsequent sections. To simplify the notation, we use $P_O$ in place of $P_I$ for $t = 0$ in the following sections.
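As an illustration of the factorization in Eq. (3), the following minimal sketch draws one step of the composite transition by first sampling the observable pair from $P_O$ and then sampling each agent's type independently from $P_A$. The function names (`sample_transition`, `toy_P_O`, `toy_P_A`) and the toy distributions are assumptions made for this example; they are not part of the paper's implementation.

```python
import random

def sample_transition(prev_state, prev_action, prev_types, sample_P_O, sample_P_A, rng):
    """Draw (N_t, s_t, theta_t) following T = P_E * P_O from Eq. (3)."""
    # Observable part: the agent set N_t and the state s_t come from P_O.
    agents, state = sample_P_O(prev_state, prev_action, prev_types, rng)
    # Unobservable part: P_E factorizes into one independent P_A draw per agent.
    types = {j: sample_P_A(j, state, rng) for j in agents}
    return agents, state, types

# Hypothetical stand-ins for P_O and P_A, used only to make the sketch runnable.
def toy_P_O(prev_state, prev_action, prev_types, rng):
    team_size = rng.randint(1, 3)           # open team: size varies per step
    agents = tuple(range(team_size))
    state = (prev_state + prev_action) % 5  # toy deterministic state update
    return agents, state

def toy_P_A(j, state, rng):
    return rng.choice(["greedy", "helper"])

agents, state, types = sample_transition(2, 4, {}, toy_P_O, toy_P_A, random.Random(0))
```

Only `agents` and `state` would be visible to the learner in this sketch; `types` plays the role of the hidden joint agent-type that a type inference model would have to estimate.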

Assumption 1.

The following conditional independencies are assumed to hold in any distribution $P$ over the set of variables in an OSB-CAG: (1) $(\theta_t \perp\!\!\!\perp \theta_{t-1}, s_{t-1}, a_{t-1} \mid \mathcal{N}_t, s_t)$; (2) $(\mathcal{N}_t, s_t \perp\!\!\!\perp \theta_{t-1} \mid \mathcal{N}_{t-1}, s_{t-1}, a_{t-1})$; (3) $(\mathcal{N}_t \perp\!\!\!\perp a_t \mid s_t, \theta_t)$; (4) $(\theta_t^j \perp\!\!\!\perp -j, \theta_t^{-j} \mid \{j\}, s_t)$.

Proposition 1.

$T(\mathcal{N}_t, s_t, \theta_t \mid s_{t-1}, a_{t-1}, \theta_{t-1})$ for $t \geq 1$ can be expressed in terms of the following well-defined probability distributions: $P_I(\mathcal{N}_0, s_0)$, $P_T(\mathcal{N}_t, s_t \mid \mathcal{N}_{t-1}, s_{t-1}, a_{t-1})$ for $t \geq 1$, and $P_A(\theta_t^j \mid \{j\}, s_t)$ for $t \geq 0$.

Proof.

We give a proof sketch here. The following derivation follows from Assumption 1. For the validity of the conditions in Assumption 1, see Appendix D; for the complete proof, see Appendix G.1.

$$T(\mathcal{N}_t, s_t, \theta_t \mid s_{t-1}, a_{t-1}, \theta_{t-1}) = P_E(\theta_t \mid \mathcal{N}_t, s_t)\, P_O(\mathcal{N}_t, s_t \mid s_{t-1}, a_{t-1}, \theta_{t-1}),$$

where $P_E(\theta_t \mid \mathcal{N}_t, s_t) = \prod_{j=1}^{|\mathcal{N}_t|} P_A(\theta_t^j \mid \{j\}, s_t)$ and

$$P_O(\mathcal{N}_t, s_t \mid s_{t-1}, a_{t-1}, \theta_{t-1}) = \sum_{\mathcal{N}_{t-1}} P_T(\mathcal{N}_t, s_t \mid \mathcal{N}_{t-1}, s_{t-1}, a_{t-1})\, P(\mathcal{N}_{t-1} \mid s_{t-1}, \theta_{t-1}).$$

We have

$$P(\mathcal{N}_t \mid s_t, \theta_t) = \frac{\sum_{s_t} P_E(\theta_t \mid \mathcal{N}_t, s_t)\, P(\mathcal{N}_t, s_t)}{\sum_{\mathcal{N}_t} \sum_{s_t} P_E(\theta_t \mid \mathcal{N}_t, s_t)\, P(\mathcal{N}_t, s_t)}.$$

Also, we have $P(\mathcal{N}_0, s_0) = P_I(\mathcal{N}_0, s_0)$ and, when $t \geq 1$,

$$P(\mathcal{N}_t, s_t) = \sum_{\mathcal{N}_{t-1}} \sum_{s_{t-1}} \sum_{a_{t-1}} P(\mathcal{N}_t, s_t, \mathcal{N}_{t-1}, s_{t-1}, a_{t-1}),$$

where

$$P(\mathcal{N}_t, s_t, \mathcal{N}_{t-1}, s_{t-1}, a_{t-1}) = P_T(\mathcal{N}_t, s_t \mid \mathcal{N}_{t-1}, s_{t-1}, a_{t-1})\, P(\mathcal{N}_{t-1}, s_{t-1})\, \pi_{t-1}(a_{t-1} \mid s_{t-1}).$$

The sketch of proof is completed. ∎
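The recursion at the end of the proof sketch can be made concrete for small finite spaces. The sketch below assumes tabular distributions stored as dictionaries; it computes $P(\mathcal{N}_t, s_t)$ from $P(\mathcal{N}_{t-1}, s_{t-1})$ by marginalizing the joint $P_T \cdot P(\mathcal{N}_{t-1}, s_{t-1}) \cdot \pi_{t-1}$ over the previous-step agent set, state and action. The data layout and the toy numbers are assumptions made for illustration only.

```python
from collections import defaultdict

def next_marginal(prev_marginal, P_T, policy):
    """One step of the forward recursion for P(N_t, s_t).

    prev_marginal: {(N_prev, s_prev): prob}       -- P(N_{t-1}, s_{t-1})
    P_T: {(N_prev, s_prev, a_prev): {(N, s): prob}} -- transition over (agent set, state)
    policy: {s_prev: {a_prev: prob}}              -- pi_{t-1}(a_{t-1} | s_{t-1})
    """
    marginal = defaultdict(float)
    for (n_prev, s_prev), p_prev in prev_marginal.items():
        for a_prev, p_a in policy[s_prev].items():
            for (n, s), p_t in P_T[(n_prev, s_prev, a_prev)].items():
                # Accumulate the joint term P_T * P(N_{t-1}, s_{t-1}) * pi_{t-1}.
                marginal[(n, s)] += p_t * p_prev * p_a
    return dict(marginal)

# Toy example: one state (0), teams of one or two agents, two actions.
SOLO, PAIR = (0,), (0, 1)
P_I = {(SOLO, 0): 1.0}  # initial distribution P_I(N_0, s_0)
policy = {0: {"stay": 0.5, "call": 0.5}}
P_T = {
    (SOLO, 0, "stay"): {(SOLO, 0): 1.0},
    (SOLO, 0, "call"): {(SOLO, 0): 0.25, (PAIR, 0): 0.75},
    (PAIR, 0, "stay"): {(PAIR, 0): 1.0},
    (PAIR, 0, "call"): {(PAIR, 0): 1.0},
}
P_1 = next_marginal(P_I, P_T, policy)
```

Starting from $P_I$ and applying `next_marginal` repeatedly yields the marginal at any later timestep, which is the sense in which $T$ is expressible through $P_I$, $P_T$ and $P_A$.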

Preference Reward. The function $R_j : \mathcal{A}_{\mathcal{N}} \times \mathcal{S} \rightarrow \mathbb{R}_{\geq 0}$ extends agent $j$’s preference value in the original stateless CAG to agent $j$’s preference reward $R_j$, which depends on the state and action. For example, $R_j(a_t \mid s_t)$ indicates agent $j$’s preference reward for a temporary team $\mathcal{N}_t \subseteq \mathcal{N}$ with the corresponding joint action $a_t = \times_{j \in \mathcal{N}_t} a_t^j$, whereas $R_j(a_t^j \mid s_t)$ indicates agent $j$’s preference reward for a coalition including only itself.
To capture the relationship between agents $j$ and $k$ in terms of both the current state and the actions taken, the affinity weight is generalized accordingly as $w_{jk} : \mathcal{A}_j \times \mathcal{A}_k \times \mathcal{S} \rightarrow \mathbb{R}$. Following the specification of preference values through affinity weights, the preference reward of any agent $j$ for a coalition $\mathcal{N}_t$ can be represented as $R_j(a_t \mid s_t) = \sum_{(j,k) \in \mathcal{E}_t,\, k \in \mathcal{N}_t} w_{jk}(a_t^j, a_t^k \mid s_t)$.
This summation aggregates the affinity weights over all pairs of agents $(j,k)$ in the coalition, where $k$ is a member of $\mathcal{N}_t$. The learner’s reward function $R(s_t, a_t)$ for any $\mathcal{N}_t$ is specified in terms of $R_j(a_t \mid s_t)$ and is introduced in Section 3.4.
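The edge-wise sum that defines the preference reward can be sketched as follows. This is a minimal illustration under stated assumptions: edges are stored as directed pairs `(j, k)`, the joint action is a dictionary keyed by agent, and `same_action_affinity` is a made-up affinity weight; none of these choices come from the paper.

```python
def preference_reward(j, actions, state, edges, affinity):
    """R_j(a_t | s_t): sum of w_jk(a_t^j, a_t^k | s_t) over edges (j, k) with k in N_t."""
    team = set(actions)  # N_t is taken to be the set of agents that have an action
    return sum(
        affinity(j, k, actions[j], actions[k], state)
        for (u, k) in edges
        if u == j and k in team
    )

def social_welfare(actions, state, edges, affinity):
    """Sum of the preference rewards of all agents in the temporary team."""
    return sum(preference_reward(j, actions, state, edges, affinity) for j in actions)

# Hypothetical symmetric affinity weight: agents like neighbours that act the same way.
def same_action_affinity(j, k, a_j, a_k, state):
    return 1.0 if a_j == a_k else 0.0

edges = [(0, 1), (1, 0), (1, 2), (2, 1)]  # a path graph 0 - 1 - 2, both directions
actions = {0: "a", 1: "a", 2: "b"}
reward_of_agent_1 = preference_reward(1, actions, None, edges, same_action_affinity)
welfare = social_welfare(actions, None, edges, same_action_affinity)
```

In this toy team, agent 1 agrees with agent 0 but not with agent 2, so only one of its two edges contributes to its preference reward.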

3.2 Dynamic Variational Strict Core

We now extend the game-theoretic concept of the strict core from CAG to OSB-CAG as a criterion for evaluating the extent of collaboration among the agents in a temporary team (a coalition $\mathcal{N}_t$ at each timestep $t$); we name it the dynamic variational strict core (DVSC). Unlike the strict core defined in CAG, which evaluates coalition formation based on given preference values, DVSC evaluates whether the learner $i$’s policy can influence temporary teammates’ decisions (measured by preference rewards) so that they intend to collaborate (hence “variational”). This is analogous to forming a temporary team as a desired coalition. Next, we derive a result on strict core stability to motivate a result on DVSC. The following two statements are equivalent when the affinity graph is symmetric: the team maximizes social welfare, and the team reaches strict core stability (see Lemma 1 in Appendix F). This inspires using the objective of maximizing social welfare as a surrogate criterion for evaluating strict core stability, and this criterion can be further generalized to dynamic scenarios to derive the DVSC (see Definition 2).

Definition 2.

If a dynamic affinity graph is symmetric, then maximizing the long-horizon social welfare is equivalent to reaching strict core stability under the variable teammates of uncertain agent-types generated by $P_E$ and the uncertain states generated by $P_O$.

Following the inspiration shown in Definition 2, DVSC can be equivalently expressed in the form shown in Eq. (4). The detailed derivation of DVSC is left in Appendix F.

$$\texttt{DVSC} := \Big\{\, \pi^{i,*} \ \Big|\ \mathbb{E}_{\pi^{i,*}}\Big[\sum_{t=0}^{\infty} \gamma^t \sum_{j \in \mathcal{N}_t} R_j(a_t \mid s_t)\Big] \geq \mathbb{E}_{\pi^{i}}\Big[\sum_{t=0}^{\infty} \gamma^t \sum_{j \in \mathcal{N}_t} R_j(a_t \mid s_t)\Big],\ \forall s_0 \in \mathcal{S},\ \forall \pi^i \,\Big\}, \tag{4}$$

where $a_t^i \sim \pi^i$ and $a_t^{-i} \sim \pi_t^{-i}$; $\mathbb{E}_{\pi^i}[\cdot]$ denotes the expectation, which also implicitly depends on $\theta_t \sim P_E$ and $\mathcal{N}_t, s_t \sim P_O$; and $\pi^{i,*}$ indicates the solution to DVSC.

3.3 Is Stability of any Temporary Team a Reasonable Metric for Describing Ad Hoc Collaboration?

Recall that all agents in AHT have a shared goal, which implies that they intrinsically aim to collaborate on solving a shared task (Mirsky et al., 2022), but their preferences for collaborating with each other are not necessarily compatible. This compatibility can be interpreted as the stability of a temporary team, determined by the preferences of ad hoc agents for collaborating with each other. If those ad hoc agents are incompatible with each other, the temporary team becomes unstable, yet it may still be able to collaborate as a team to solve the shared task. Therefore, the learner’s aim is to adjust the compatibility of a temporary team through its actions by influencing the temporary teammates’ preferences, which is equivalent to maintaining the stability of the temporary team across timesteps.

3.4 Solving DVSC by Reinforcement Learning

We proceed to define the learner’s reward function, left unspecified in Section 3.1, and convert DVSC from Eq. (4) into an RL problem. Since the learner’s objective is to execute actions that influence any temporary teammates to collaboratively solve a shared task, we naturally interpret the learner’s reward function as $R(s_t, a_t) = \sum_{j \in \mathcal{N}_t} R_j(a_t \mid s_t)$. The reward function represents the social welfare of preference rewards for a temporary team $\mathcal{N}_t$, serving as a measure of agents’ preferences to collaborate on a shared task. (Footnote 2: In practical scenarios, $R(s_t, a_t)$ only needs to implicitly encode the shared goal that multiple agents are required to achieve.) Substituting $R(s_t, a_t)$ into Eq. (4), we derive an RL problem equivalent to solving DVSC:

$$\max_{\pi^{i}} \mathbb{E}_{\mathcal{N}_t, s_t \sim P_O,\ \theta_t \sim P_E,\ a_t^{-i} \sim \pi_t^{-i},\ a_t^{i} \sim \pi^{i}}\Big[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)\Big]. \tag{5}$$
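The objective in Eq. (5) is an expected discounted sum of the learner's reward, so it can be approximated by averaging discounted returns over sampled episodes. The following sketch assumes a hypothetical `rollout` callable that runs the learner's policy in the environment and returns the list of rewards $R(s_t, a_t)$ for one finite-length episode; truncating the infinite horizon is an approximation introduced here, not something stated in the paper.

```python
def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t * rewards[t] by backward accumulation."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total

def estimate_objective(rollout, num_episodes, gamma):
    """Monte Carlo estimate of the expected discounted return under a fixed policy."""
    returns = [discounted_return(rollout(), gamma) for _ in range(num_episodes)]
    return sum(returns) / len(returns)

# Toy usage with a deterministic stand-in for an environment rollout.
value = estimate_objective(lambda: [1.0, 1.0, 1.0], num_episodes=4, gamma=0.5)
```

Comparing such estimates across candidate learner policies gives a rough empirical reading of the inequality in Eq. (4), subject to sampling noise and horizon truncation.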

In the following section, we will explore how the optimization problem in Eq. (5) can be solved by a novel algorithm.

4 A Novel Algorithm Building on OSB-CAG

In this section, we derive a novel graph-based RL algorithm to solve OAHT based on the OSB-CAG, with DVSC as the solution concept. We first derive a representation of the joint Q-value that narrows down its hypothesis space while still including the solution of DVSC. The representation aligns with, and gives an interpretation of, GPL’s heuristic joint action value model. Thanks to our theory, we also obtain a condition that further confines the joint Q-value’s hypothesis space (see Section 4.1). With the estimated type inference model and agent model, the optimal learner’s policy obtained from GPL’s optimization problem approximately reaches DVSC (see Section 4.2). Finally, we derive a novel practical algorithm, named CIAO (see Section 4.3).

4.1 Representation of Joint Q-Value

Figure 1: Illustration of the relationship between the conditions for our preference reward function, ensuring the existence of DVSC under its confined hypothesis space, and its alignment to a task-specific reward $R(s_t, a_t)$ in Eq. (5).

Given the joint actions generated under the influence of the optimal learner’s policy $\pi^{i,*}$, Theorem 1 provides a sufficient condition, used as an inductive bias, that narrows down the hypothesis space of any preference reward function to one meeting DVSC. By solving the RL problem in Eq. (5) under this condition to specify $\pi^{i,*}$, the preference reward function is aligned with a task-specific reward $R(s_t, a_t)$. The relationship between the above conditions for generating our preference reward function is shown in Fig. 1.

Theorem 1.

In an OSB-CAG, for any dynamic affinity graph $G_t = \langle \mathcal{N}_t, \mathcal{E}_t \rangle$ at any timestep $t$, if there exists a joint action $a_t \in \mathcal{A}_{\mathcal{N}_t}$, for any agent $j \in \mathcal{N}_t$, satisfying $R_j(a_t | s_t) \geq R_j(a_t^j | s_t)$ for any $s_t \in \mathcal{S}$, then DVSC always exists.

To meet the condition that $R_j(a_t | s_t) \geq R_j(a_t^j | s_t)$ as shown in Theorem 1, we derive a representation of $w_{jk}(a_t^j, a_t^k | s_t)$ in Proposition 2. Recall that an agent $j$’s preference reward function for a temporary team $\mathcal{N}_t$ at timestep $t$ is defined as $R_j(a_t | s_t) = \sum_{(j,k) \in \mathcal{E}_t} w_{jk}(a_t^j, a_t^k | s_t)$ (see Section 3.1).

Proposition 2.

In a dynamic affinity graph $G_t = \langle \mathcal{N}_t, \mathcal{E}_t \rangle$, for any state $s_t \in \mathcal{S}$ and any joint action $a_t \in \mathcal{A}_{\mathcal{N}_t}$, if for all $(j,k) \in \mathcal{E}_t$, $w_{jk}(a_t^j, a_t^k | s_t) = \alpha_{jk}(a_t^j, a_t^k | s_t) + \beta_{jk}(a_t^j | s_t)$ with the conditions that $\alpha_{jk}(a_t^j, a_t^k | s_t) \geq 0$ and $R_j(a_t^j | s_t) = \sum_{(j,k) \in \mathcal{E}_t} \beta_{jk}(a_t^j | s_t)$, then $R_j(a_t | s_t) \geq R_j(a_t^j | s_t)$ for any agent $j \in \mathcal{N}_t$.

Proof.

This result can be directly obtained by the definition that $R_j(a_t | s_t) = \sum_{(j,k) \in \mathcal{E}_t, k \in \mathcal{N}_t} w_{jk}(a_t^j, a_t^k | s_t)$. ∎
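Spelled out, the argument substitutes the decomposition assumed in Proposition 2 into the definition of the preference reward. The following is a sketch of the intermediate steps, using only the two stated conditions (non-negativity of $\alpha_{jk}$ and the sum of the $\beta_{jk}$ terms equalling the individual reward):

```latex
\begin{aligned}
R_j(a_t | s_t)
  &= \sum_{(j,k) \in \mathcal{E}_t} w_{jk}(a_t^j, a_t^k | s_t) \\
  &= \sum_{(j,k) \in \mathcal{E}_t} \alpha_{jk}(a_t^j, a_t^k | s_t)
     + \sum_{(j,k) \in \mathcal{E}_t} \beta_{jk}(a_t^j | s_t) \\
  &= \sum_{(j,k) \in \mathcal{E}_t} \alpha_{jk}(a_t^j, a_t^k | s_t)
     + R_j(a_t^j | s_t) \\
  &\geq R_j(a_t^j | s_t).
\end{aligned}
```

The last step holds because every $\alpha_{jk}$ term in the remaining sum is non-negative.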

Plugging in the expression of $w_{jk}(a_t^j, a_t^k | s_t)$, we can obtain the representation of an arbitrary agent $j$’s preference Q-value under the learner’s optimal policy $\pi^{i,*}$, $Q_j^{\pi^{i,*}}(a_t | s_t) = \sum_{(j,k) \in \mathcal{E}_t} Q_{jk}^{\pi^{i,*}}(a_t^j, a_t^k | s_t) + Q_j^{\pi^{i,*}}(a_t^j | s_t)$, and the joint Q-value under the learner’s optimal policy $\pi^{i,*}$, $Q^{\pi^{i,*}}(s_t, a_t) = \sum_{j \in \mathcal{N}_t} Q_j^{\pi^{i,*}}(a_t | s_t)$, outlined in Theorem 2.

Assumption 2.

Suppose that $\alpha_{jk}(a_t^j, a_t^k | s_t) = 0$ for $t \geq T$, where $T$ is the timestep when agent $j$ or $k$ leaves the environment, and $R_j(a_t^j | s_t) = 0$ for $t \geq T'$, where $T'$ is the timestep when agent $j$ leaves the environment.

Theorem 2.

Under Assumption 2, if $w_{jk}(s_\tau, a_\tau^j, a_\tau^k) = \alpha_{jk}(s_\tau, a_\tau^j, a_\tau^k) + \beta_{jk}(s_\tau, a_\tau^j)$, then the joint Q-value of the learner’s policy $\pi^i$ can be expressed as follows:

$$
\begin{aligned}
Q^{\pi^i}(s_t, a_t) &= \sum_{(j,k) \in \mathcal{E}_t} Q_{jk}^{\pi^i}(a_t^j, a_t^k | s_t) + \sum_{j \in \mathcal{N}_t} Q_j^{\pi^i}(a_t^j | s_t) \\
&= \sum_{j \in \mathcal{N}_t} \Big\{ \sum_{(j,k) \in \mathcal{E}_t} Q_{jk}^{\pi^i}(a_t^j, a_t^k | s_t) + Q_j^{\pi^i}(a_t^j | s_t) \Big\} \\
&:= \sum_{j \in \mathcal{N}_t} Q_j^{\pi^i}(a_t | s_t),
\end{aligned}
$$

where $Q_{jk}^{\pi^i}(a_t^j, a_t^k | s_t) = \mathbb{E}_{\pi^i}\big[\sum_{\tau=t}^{\infty} \gamma^{\tau-t} \alpha_{jk}(a_\tau^j, a_\tau^k | s_\tau)\big]$ and $Q_j^{\pi^i}(a_t^j | s_t) = \mathbb{E}_{\pi^i}\big[\sum_{\tau=t}^{\infty} \gamma^{\tau-t} R_j(a_\tau^j | s_\tau)\big]$.
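The decomposition in Theorem 2 can be illustrated with a small sketch that aggregates already-evaluated pairwise and individual terms into the joint Q-value. This is an illustration under stated assumptions, not the CIAO implementation: the function names `preference_q` and `joint_q`, the dictionary-based inputs, and the numeric values are all hypothetical, and edges are assumed to be stored as directed pairs $(j,k)$.

```python
from typing import Dict, Hashable, Iterable, List, Tuple

Agent = Hashable
Edge = Tuple[Agent, Agent]  # directed edge (j, k) of the affinity graph


def preference_q(agent: Agent, edges: Iterable[Edge],
                 pair_q: Dict[Edge, float], indiv_q: Dict[Agent, float]) -> float:
    """Q_j(a_t | s_t): pairwise terms on edges leaving `agent` plus its individual term."""
    pairwise = sum(pair_q[(j, k)] for (j, k) in edges if j == agent)
    return pairwise + indiv_q[agent]


def joint_q(agents: Iterable[Agent], edges: List[Edge],
            pair_q: Dict[Edge, float], indiv_q: Dict[Agent, float]) -> float:
    """Joint Q-value as the sum of per-agent preference Q-values."""
    return sum(preference_q(j, edges, pair_q, indiv_q) for j in agents)


# Hypothetical values for a learner "i" with two teammates on a star graph.
agents = ["i", "a", "b"]
edges = [("i", "a"), ("a", "i"), ("i", "b"), ("b", "i")]
pair_q = {("i", "a"): 0.5, ("a", "i"): 0.5, ("i", "b"): 1.0, ("b", "i"): 1.0}
indiv_q = {"i": 0.75, "a": 0.25, "b": 0.5}
total = joint_q(agents, edges, pair_q, indiv_q)
```

Because the functions take the current agent set and edge list as arguments, the same code applies unchanged when the team composition changes between timesteps, which is the property the open-team setting relies on.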

Remark 1.

The result of Theorem 2 verifies that the optimal joint Q-value representation derived from our theory is consistent with the GPL’s joint action value model, as shown in Eq. (1), but additionally with $\hat{Q}_{jk}^{\pi^i}(a_t^j, a_t^k | s_t) \geq 0$, following our theory, which is requisite for satisfying $\alpha_{jk}(a_t^j, a_t^k | s_t) \geq 0$, as shown in Proposition 2.

Recall that the condition for solving DVSC as an RL problem is the symmetry of a dynamic affinity graph (see Definition 2). To meet this condition, Proposition 3 outlines the constraints that must be fulfilled when the dynamic affinity graph is a star graph (see Remark 2 for its validity in OAHT). Similarly, Proposition 4 gives the relevant constraints when the dynamic affinity graph is a complete graph (as applied to GPL). The implementation of the constraints for these two cases is shown in Remark 3.

Definition 3.

In this paper, we introduce a novel dynamic affinity graph structured as a star graph, with the learner serving as the internal node and temporary teammates as the leaf nodes.

Remark 2.

We introduce a novel architecture for the dynamic affinity graph in the context of OAHT, assuming teammates lack prior coordination (Mirsky et al., 2022). Given an additional assumption that teammates cannot adapt their policies or types in response to other agents, it is reasonable to presume the absence of relationships among any temporary teammates. (Footnote 3: For simplicity in presenting our theory in this paper, we tentatively disregard scenarios where temporary teammates can adapt to other agents, e.g., establishing an affinity model.) Besides, this is also in line with the assumption in AHT that the learner’s temporary teammates might not be familiar with one another before the interaction (Stone et al., 2010; Mirsky et al., 2022). In particular, this implies that no edges between any two teammates are necessary to form a dynamic affinity graph. However, the learner’s goal is to establish collaboration with a variable number of temporary teammates at each timestep, necessitating the existence of edges between the learner and each teammate. To meet all these requirements, we design the learner’s dynamic affinity graph as a star graph, as detailed in Definition 3. Consequently, the preference reward of any teammate $j$ for a temporary team $\mathcal{N}_t$ is determined as $R_j(s_t, a_t) = w_{ji}(s_t, a_t^j, a_t^i)$, while the learner $i$’s preference reward for the temporary team $\mathcal{N}_t$ is expressed as $R_i(s_t, a_t) = \sum_{j \in -i} w_{ij}(s_t, a_t^i, a_t^j)$.
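As a concrete illustration of the star structure described in this remark, the following sketch builds the edge set for a learner and a variable set of teammates, then evaluates the two preference rewards. The names `star_edges` and `preference_rewards`, and the convention that `w` maps a directed edge to a weight already evaluated at the current state and actions, are assumptions made for this example only.

```python
from typing import Dict, Hashable, List, Sequence, Tuple

Agent = Hashable
Edge = Tuple[Agent, Agent]


def star_edges(learner: Agent, teammates: Sequence[Agent]) -> List[Edge]:
    """Directed edges of a star graph: learner <-> each teammate, none among teammates."""
    edges: List[Edge] = []
    for j in teammates:
        edges.append((learner, j))
        edges.append((j, learner))
    return edges


def preference_rewards(learner: Agent, teammates: Sequence[Agent],
                       w: Dict[Edge, float]) -> Dict[Agent, float]:
    """Teammate j receives w_ji; the learner i receives the sum of w_ij over teammates."""
    rewards = {j: w[(j, learner)] for j in teammates}
    rewards[learner] = sum(w[(learner, j)] for j in teammates)
    return rewards
```

With an empty teammate list the learner's reward is an empty sum (zero), so the same functions cover timesteps at which all teammates have left.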

Proposition 3.

For the learner $i$ and any teammate $j$ or $k$, the constraints $R_i(a_t^i | s_t) = \sum_{j \in -i} R_j(a_t^j | s_t)$ and $\alpha_{jk}(a_t^j, a_t^k | s_t) = \alpha_{kj}(a_t^k, a_t^j | s_t)$, for any $a_t \in \mathcal{A}_{\mathcal{N}_t}$ and $s_t \in \mathcal{S}$, are necessary for a star dynamic affinity graph to be symmetric.

Proposition 4.

For any two agents $j$ and $k$, the constraints $R_j(a_t^j | s_t) = R_k(a_t^k | s_t)$ and $\alpha_{jk}(a_t^j, a_t^k | s_t) = \alpha_{kj}(a_t^k, a_t^j | s_t)$, for any $a_t \in \mathcal{A}_{\mathcal{N}_t}$ and $s_t \in \mathcal{S}$, are necessary for the complete dynamic affinity graph to be symmetric.

Remark 3.

The following implementation is necessary to satisfy the symmetry of a dynamic affinity graph: (1) meeting $Q_{jk}^{\pi^{i}}(a_{t}^{j},a_{t}^{k}|s_{t})=Q_{kj}^{\pi^{i}}(a_{t}^{k},a_{t}^{j}|s_{t})\geq 0$ in constructing preference Q-values; (2) if the dynamic affinity graph is a star graph with the learner as the internal node, $Q_{i}^{\pi^{i}}(a_{t}^{i}|s_{t})=\sum_{j\in-i}Q_{j}^{\pi^{i}}(a_{t}^{j}|s_{t})$ is implemented as a regularizer. If the dynamic affinity graph is a complete graph, $Q_{i}^{\pi^{i}}(a_{t}^{i}|s_{t})=Q_{j}^{\pi^{i}}(a_{t}^{j}|s_{t})$ is implemented as a regularizer.
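Condition (1) of Remark 3 can be illustrated with a minimal sketch. The construction below (averaging a raw score with its mirrored counterpart, then applying a softplus) is an assumption made for illustration only; the function name, array layout and sizes are hypothetical, and this is not necessarily how the constraint is enforced in the released code.

```python
import numpy as np

def symmetric_nonneg_pairwise(raw):
    """Map unconstrained scores raw[j, k, a_j, a_k] to pairwise values with
    Q[j, k, a_j, a_k] == Q[k, j, a_k, a_j] >= 0 (hypothetical construction)."""
    # Swap the two agent axes and the two action axes to obtain the mirrored score.
    mirrored = raw.transpose(1, 0, 3, 2)
    sym = 0.5 * (raw + mirrored)
    # Softplus log(1 + exp(x)) is elementwise, so symmetry is preserved,
    # and its output is always non-negative.
    return np.logaddexp(0.0, sym)

rng = np.random.default_rng(0)
raw = rng.normal(size=(3, 3, 4, 4))  # 3 agents, 4 actions each (made-up sizes)
q_pair = symmetric_nonneg_pairwise(raw)
```

Any other non-negative elementwise map (for example an absolute value) would satisfy the same two properties; the softplus is chosen here only because it is smooth.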

4.2 Bellman Optimality Equation for OSB-CAG

We now define the Bellman optimality equation for OSB-CAG to evaluate the learner $i$'s optimal policy $\pi^{i,*}$ as a solution of the DVSC following Theorem 3, such that

$$Q^{\pi^{i,*}}(s_{t},a_{t})=R(s_{t},a_{t})+\gamma\,\mathbb{E}_{\mathcal{N}_{t+1},\,s_{t+1}\sim P_{O}}\Big[\max_{a^{i}}\mathbb{E}_{\theta_{t+1}\sim P_{E},\ a_{t+1}^{-i}\sim\pi_{t+1}^{-i}}\big[Q^{\pi^{i,*}}(s_{t+1},a_{t+1}^{-i},a^{i})\big]\Big]. \tag{6}$$

The regularity condition of Eq. (6) is that $\mathcal{N}_{t+1}\subseteq\mathcal{N}_{t}$, since it is pathological to consider an agent $j\in\mathcal{N}_{t+1}$ with $j\notin\mathcal{N}_{t}$ at timestep $t$ when expanding $Q^{\pi^{i,*}}(s_{t},a_{t})$ across timesteps. Fig. 2 gives an illustrative example.

Figure 2: Illustration of the expansion of the Bellman optimality equation for OSB-CAG. The thin green arrow indicates the time axis, while the thick black arrow indicates the expansion direction of the Bellman optimality equation. In the theory of OSB-CAG, we have $Q^{\pi^{i,*}}(s_{t},a_{t})=\sum_{j\in\mathcal{N}_{t}}Q_{j}^{\pi^{i,*}}(a_{t}|s_{t})$, where $Q_{j}^{\pi^{i,*}}$ is denoted by $\mathbf{Q}_{j}$ in the figure. $R_{j}$ indicates the preference reward of agent $j$ to the team, measuring the agent's preference to stay in the team to solve the shared task. At each timestep, the preference Q-value $\mathbf{Q}_{j}$ of an agent $j$ is shown in a red box if the agent joins the game, in a blue box if it remains in the game, and in a grey box with a dashed outline if it leaves the game. At timestep 1, agents 1, 2 and 3 have just joined the game, so all preference Q-values are zero. At timestep 2, agent 3 leaves the game, and the expansion works as usual, since agent 3 has had influence on the team. At timestep 3, agent 4 joins the game, but it has not yet had any influence on the team. For this reason, it is unnecessary, at timestep 3 only, to consider the expansion for agent 4's preference Q-value, since it would be trivially zero. To accommodate more generic representations of the joint Q-value with respect to preference Q-values (other than the linear decomposition described in our theory), we rule out the transition samples with $\mathcal{N}_{t}\subset\mathcal{N}_{t+1}$. At timestep 4, the expansion considers all existing agents as usual.

Theorem 3.

Under Assumption 2 and an arbitrary learner's deterministic stationary policy $\pi^{i}$, the Bellman equation for the OSB-CAG with DVSC as a solution concept is expressed as follows: $Q^{\pi^{i}}(s_{t},a_{t})=R(s_{t},a_{t})+\gamma\,\mathbb{E}_{\mathcal{N}_{t+1},\,s_{t+1}\sim P_{O}}\Big[\mathbb{E}_{\theta_{t+1}\sim P_{E},\ a_{t+1}\sim\pi_{t+1}}\big[Q^{\pi^{i}}(s_{t+1},a_{t+1})\big]\Big]$.

To solve Eq. (6), we further propose an operator with the same regularity condition, $\Gamma:Q\mapsto\Gamma Q$, specified as follows:

$$\Gamma Q^{\pi^{i}}\left(s_{t+1},a_{t+1}^{-i},a^{i}\right):=R(s_{t},a_{t})+\gamma\,\mathbb{E}_{\mathcal{N}_{t+1},\,s_{t+1}\sim P_{O}}\Big[\max_{a^{i}}\mathbb{E}_{\theta_{t+1}\sim P_{E},\ a_{t+1}^{-i}\sim\pi_{t+1}^{-i}}\big[Q^{\pi^{i}}(s_{t+1},a_{t+1}^{-i},a^{i})\big]\Big]. \tag{7}$$

Eq. (7) is a standard form of the Bellman operator. Therefore, applying Eq. (7) recursively converges to the solution of the Bellman optimality equation in Eq. (6), following the well-known value iteration algorithm (Sutton & Barto, 2018, Ch. 4).
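The convergence argument can be seen on a toy example. The sketch below runs value iteration on a small, randomly generated tabular MDP with a fixed agent set; it is an assumption-laden simplification that drops the expectations over $\mathcal{N}_{t+1}$, $\theta_{t+1}$ and the teammates' actions appearing in Eq. (7), and all sizes and names are hypothetical.

```python
import numpy as np

def bellman_optimality_operator(q, reward, transition, gamma):
    """(Gamma q)(s, a) = R(s, a) + gamma * sum_s' P(s'|s, a) * max_a' q(s', a')."""
    greedy_next = q.max(axis=1)                 # best value in each next state
    return reward + gamma * transition @ greedy_next

rng = np.random.default_rng(0)
num_states, num_actions, gamma = 4, 3, 0.9      # made-up toy sizes
reward = rng.uniform(0.0, 1.0, size=(num_states, num_actions))
transition = rng.uniform(size=(num_states, num_actions, num_states))
transition /= transition.sum(axis=-1, keepdims=True)  # rows become probabilities

q = np.zeros((num_states, num_actions))
for _ in range(500):                            # repeated application of the operator
    q = bellman_optimality_operator(q, reward, transition, gamma)

# At a fixed point, applying the operator once more leaves q (almost) unchanged.
residual = np.abs(bellman_optimality_operator(q, reward, transition, gamma) - q).max()
```

Because the operator is a $\gamma$-contraction in the max norm, the residual shrinks geometrically with the number of iterations.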

Remark 4.

In implementation, the effect of $\mathcal{N}_{t}\subset\mathcal{N}_{t+1}$ can be omitted, owing to the low proportion of such transitions during the process. Therefore, solving the GPL optimization problem of fitted Q-learning (Ernst et al., 2005), which omits the effect of $\mathcal{N}_{t}\subset\mathcal{N}_{t+1}$, is a reasonable approximation of the Bellman operator in Eq. (7), and it avoids the computational cost of filtering out the transition samples with $\mathcal{N}_{t}\subset\mathcal{N}_{t+1}$ in practice. The GPL optimization problem is as follows:

$$\min_{\beta}L(\beta)=\mathbb{E}\Big[\frac{1}{2}\Big(R(s_{t},a_{t})+\gamma\max_{a^{i}}\mathbb{E}_{\theta_{t+1}\sim P_{E},\ a_{t+1}^{-i}\sim\pi_{t+1}^{-i}}\big[\hat{Q}^{\pi^{i}}(s_{t+1},a_{t+1}^{-i},a^{i};\beta^{-})\big]-\hat{Q}^{\pi^{i}}(s_{t},a_{t};\beta)\Big)^{2}\Big], \tag{8}$$

where $\hat{Q}^{\pi^{i}}(\cdot\ ;\beta^{-})$ is the approximate target optimal joint Q-value parameterised by $\beta^{-}$ and $\hat{Q}^{\pi^{i}}(\cdot\ ;\beta)$ is the approximate optimal joint Q-value parameterised by $\beta$.
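For a single transition, the quantity inside the expectation of Eq. (8) can be sketched as follows. This is a toy illustration under stated assumptions: the teammates' joint action is enumerated explicitly, the expectation over $\theta_{t+1}$ is assumed to be already folded into the teammate action probabilities, and the function names and numbers are hypothetical.

```python
import numpy as np

def td_target(reward, gamma, q_next_target, teammate_probs):
    """reward + gamma * max_{a^i} E_{a^{-i}}[Q_target(s', a^{-i}, a^i)].

    q_next_target[a_i, a_minus_i]: target-network values at the next state.
    teammate_probs[a_minus_i]: probability of each joint teammate action.
    """
    # Marginalise the teammates' joint action for every learner action.
    expected_per_learner_action = q_next_target @ teammate_probs
    return reward + gamma * expected_per_learner_action.max()

def td_loss(q_pred, target):
    """Half squared TD error for one transition."""
    return 0.5 * (target - q_pred) ** 2

q_next_target = np.array([[1.0, 3.0],    # learner action 0
                          [2.0, 2.5]])   # learner action 1
teammate_probs = np.array([0.5, 0.5])
target = td_target(reward=1.0, gamma=0.9,
                   q_next_target=q_next_target, teammate_probs=teammate_probs)
loss = td_loss(q_pred=2.0, target=target)
```

Here the expected values are 2.0 and 2.25 for the two learner actions, so the target is $1 + 0.9 \times 2.25 = 3.025$ and the loss is $0.5 \times 1.025^{2} = 0.5253125$. Note that the maximisation is taken after the expectation, matching the order of operators in Eq. (8).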

4.3 Practical Implementation

Based on our theory, we introduce a novel algorithm, CIAO, representing the algorithm for Cooperative game theory Inspired Ad hoc teamwork in Open teams. We implement CIAO with the dynamic affinity graph as a star graph (refer to Remark 2 for more insights into this topology) and as a complete graph, denoted as CIAO-S and CIAO-C, respectively, where “S” signifies Star graph and “C” signifies Complete graph. In addition to the joint Q-value representation model (derived from Theorem 2) and the training losses for estimating the unknown type inference model and the unknown agent model (as detailed in Section 2.2), we introduce novel Q losses tailored to the different dynamic affinity graphs based on our theory. These losses incorporate regularization terms with multipliers $\lambda>0$.

CIAO-S. If the dynamic affinity graph is a star graph, the training loss with the regularizer is as follows:

$$L_{s}(\beta)=L(\beta)+\lambda\,\mathbb{E}_{s_{t},a_{t}}\Big[\frac{1}{2}\big(\sum_{j\in-i}\hat{Q}_{j}^{\pi^{i}}(a_{t}^{j}|s_{t})-\hat{Q}_{i}^{\pi^{i}}(a_{t}^{i}|s_{t};\beta)\big)^{2}\Big].$$

CIAO-C. If the dynamic affinity graph is a complete graph, the training loss with the regularizer is as follows:

$$L_{c}(\beta)=L(\beta)+\lambda\,\mathbb{E}_{s_{t},a_{t}}\Big[\sum_{j\in-i}\frac{1}{2}\big(\hat{Q}_{i}^{\pi^{i}}(a_{t}^{i}|s_{t})-\hat{Q}_{j}^{\pi^{i}}(a_{t}^{j}|s_{t};\beta)\big)^{2}\Big].$$

Note that it is also necessary to enforce $\hat{Q}_{jk}^{\pi^{i}}(a_{t}^{j},a_{t}^{k}|s_{t})=\hat{Q}_{kj}^{\pi^{i}}(a_{t}^{k},a_{t}^{j}|s_{t})\geq 0$, by Remark 3. Following our theoretical model, the learner's reward $R(s_{t},a_{t})$ ought to be non-negative, while the designated reward of an environment could be negative. However, this can be adjusted by adding the maximum difference between these two rewards among states and joint actions, denoted by $\Delta R(s_{t},a_{t})$, without changing the original goal. In practice, Eq. (8) is solved by DQN (Mnih et al., 2013). The learner's actions are decided by Eq. (2), employing the estimated teammates' agent models $\hat{\pi}^{-i}$ (see Section 2.2) to marginalize $a_{t}^{-i}$ of $\hat{Q}^{\pi^{i}}(s_{t},a_{t};\beta)$, as implemented in GPL. Further implementation details are left to Appendix C.
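The regularization terms of the CIAO-S and CIAO-C losses can be evaluated for a single sample as in the sketch below. It only computes the values of the two penalty terms (before multiplying by $\lambda$ and averaging over samples); the function names and the numbers are hypothetical, and the sketch does not address how gradients flow through each term.

```python
import numpy as np

def star_regularizer(q_learner, q_teammates):
    """CIAO-S term: 0.5 * (sum_j Q_j - Q_i)^2 for one sample."""
    return 0.5 * (np.sum(q_teammates) - q_learner) ** 2

def complete_regularizer(q_learner, q_teammates):
    """CIAO-C term: sum_j 0.5 * (Q_i - Q_j)^2 for one sample."""
    return np.sum(0.5 * (q_learner - np.asarray(q_teammates)) ** 2)

q_learner = 3.0                        # learner's individual preference Q-value (made up)
q_teammates = np.array([1.0, 2.5])     # teammates' individual preference Q-values (made up)
reg_star = star_regularizer(q_learner, q_teammates)
reg_complete = complete_regularizer(q_learner, q_teammates)
```

With these numbers the star term is $0.5 \times (3.5 - 3.0)^{2} = 0.125$ and the complete term is $0.5 \times 2.0^{2} + 0.5 \times 0.5^{2} = 2.125$. The star term pulls the learner's value towards the sum of its teammates' values, whereas the complete term pulls every teammate's value towards the learner's.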

5 Experiments

We assess the effectiveness of the proposed algorithms CIAO-S and CIAO-C in two established environments, LBF and Wolfpack, featuring open team settings (Rahman et al., 2021). In these settings, teammates are randomly selected to enter the environment and remain for a certain number of time steps. During experiments, the learner is trained in an environment with a maximum of 3 agents at each timestep. Subsequently, testing is conducted in environments with a maximum of 5 and 9 agents at each timestep, showcasing the model’s ability to handle both unseen compositions and varied team sizes. All experiments are conducted with five random seeds, and the results are presented as the mean performance with a 95% confidence interval. Our experimental design aims to answer the following questions: (1) Does the joint Q-value representation outlined in our theory effectively facilitate collaboration between the learner and temporary teammates? (2) Is it necessary to generalize the preference reward function from zero, as in CAG, to a non-negative range in our theory (see Appendix E)? (3) Is the claim in Remark 4 valid in practice? (4) Is CIAO able to deal with generalization in agent-type sets?
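As a reading aid for the reported statistics, the sketch below shows one common way to obtain a mean and a 95% confidence interval from five seeds, using the Student-t critical value for four degrees of freedom. The exact procedure behind the reported intervals is not stated here, so this is an assumption; the function name and the return values per seed are made up.

```python
import math
import statistics

T_CRIT_DF4 = 2.776  # two-sided 95% Student-t critical value, 4 degrees of freedom

def mean_with_ci95(returns_per_seed):
    """Mean and 95% confidence half-width across exactly five seeds."""
    if len(returns_per_seed) != 5:
        raise ValueError("the hard-coded critical value assumes five seeds")
    mean = statistics.mean(returns_per_seed)
    # Standard error uses the sample standard deviation (n - 1 in the denominator).
    std_err = statistics.stdev(returns_per_seed) / math.sqrt(len(returns_per_seed))
    return mean, T_CRIT_DF4 * std_err

mean, half_width = mean_with_ci95([10.0, 12.0, 11.0, 13.0, 9.0])  # made-up returns
```

The interval is then `mean ± half_width`; with only five seeds the t-based interval is noticeably wider than a normal-approximation interval would be.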

Baselines and Ablation Variants. The state-of-the-art baseline we use in this experiment is GPL-Q (shortened as GPL) (Rahman et al., 2021). The ablation variants of the proposed CIAO are as follows: CIAO-X-FI, CIAO-X-ZI and CIAO-X-NI are variants that remove enforcement of individual utility, enforce individual utility as zero and enforce individual utility as negative values, respectively. CIAO-X-NP is a variant that enforces negative pairwise utility. “X” above indicates either “S” or “C”. Further details on experimental settings can be found in Appendix H.

5.1 Main Results

Figure 3: Comparison between CIAO and GPL in Wolfpack and LBF with a maximum of 5 and 9 agents. Panels: (a) Wolfpack, max. of 5 agents; (b) Wolfpack, max. of 9 agents; (c) LBF, max. of 5 agents; (d) LBF, max. of 9 agents.

We first address Question 1 through experiments conducted on the original versions of Wolfpack and LBF, as depicted in Fig. 3. CIAO-C outperforms GPL in the majority of scenarios with varying maximum numbers of agents. This not only verifies the correctness and effectiveness of our theory, irrespective of dynamic affinity graph structures, but also demonstrates its capability in facilitating collaboration between the learner and temporary teammates in the open ad hoc teamwork problem. Comparing CIAO-C and CIAO-S, the star graph may be more effective in scenarios with fewer agents, whereas the complete graph is more effective in scenarios with more agents. This observation aligns with the intuition that the direct influence from the learner to each teammate may not suffice as the number of agents increases. Instead, indirect influence, where a teammate is influenced by the learner and subsequently influences another teammate, becomes crucial.

5.2 Ablation Study

Figure 4: Comparison between CIAO-C and its ablations in Wolfpack and LBF with a maximum of 5 and 9 agents. Panels: (a) Wolfpack, max. of 5 agents; (b) Wolfpack, max. of 9 agents; (c) LBF, max. of 5 agents; (d) LBF, max. of 9 agents.

We present experimental results comparing CIAO-S and its ablations, as well as CIAO-C and its ablations. As illustrated in Figs. 4 and 5, both CIAO-C-NP and CIAO-S-NP exhibit notably inferior performance compared to CIAO-C or CIAO-S. This observation demonstrates the validity of DVSC and confirms the accuracy of the joint Q-value representation based on our theory. This outcome provides an additional perspective in addressing Question 1.

Figure 5: Comparison between CIAO-S and its ablations in Wolfpack and LBF with a maximum of 5 and 9 agents. Panels: (a) Wolfpack, max. of 5 agents; (b) Wolfpack, max. of 9 agents; (c) LBF, max. of 5 agents; (d) LBF, max. of 9 agents.

By convention in CAG, individual utility is set to zero. In our theory, we extend its range to include non-zero values, enhancing its adaptability across diverse scenarios. This adaptability is demonstrated in the comparison between CIAO-C or CIAO-S and CIAO-C-ZI or CIAO-S-ZI in Figs. 4 and 5. Although our theory does not inherently provide specific insights into the range of individual utility, we propose a hypothesis aligned with other definitions in CAG, asserting that individual utility is non-negative. This hypothesis ensures self-consistency in our generalization, as detailed in Definition 4 in Appendix E. The superior performance of CIAO-C or CIAO-S over their ablations affirms the acceptability of our hypothesis.

5.3 Validity of Remark 4

Figure 6: Comparison of training losses for CIAO between the implementations that omit the effect of $\mathcal{N}_{t}\subset\mathcal{N}_{t+1}$ and those that do not (denoted as “-Va”). Panels: (a) Wolfpack, max. of 5 agents; (b) LBF, max. of 5 agents.

We now validate our claim in Remark 4 that minimizing the GPL training loss (omitting the effect of $\mathcal{N}_{t}\subset\mathcal{N}_{t+1}$) is an approximation of Eq. (7). Based on the GPL training loss, we implement a variant that filters out the transition samples with $\mathcal{N}_{t}\subset\mathcal{N}_{t+1}$, following the suggestion from Remark 4, referred to as CIAO-C-Va and CIAO-S-Va. As shown in Fig. 6, in both LBF and Wolfpack with a maximum of 5 agents, CIAO-C and CIAO-S trained with the GPL training loss achieve performance close to that obtained with the variant training loss, which accounts for the effect of $\mathcal{N}_{t}\subset\mathcal{N}_{t+1}$.
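The filtering step used by the “-Va” variants can be sketched as follows. This is a minimal illustration, assuming each stored transition records the agent sets at consecutive timesteps under the hypothetical keys `agents_t` and `agents_next`; the actual replay-buffer layout in the released code may differ.

```python
def drop_join_only_transitions(transitions):
    """Keep a transition unless N_t is a proper subset of N_{t+1}."""
    # For Python sets, `a < b` is True exactly when a is a proper subset of b.
    return [tr for tr in transitions if not (tr["agents_t"] < tr["agents_next"])]

transitions = [
    {"agents_t": {1, 2, 3}, "agents_next": {1, 2}},     # agent 3 leaves: kept
    {"agents_t": {1, 2},    "agents_next": {1, 2, 4}},  # agent 4 joins: dropped
    {"agents_t": {1, 2, 4}, "agents_next": {1, 2, 4}},  # unchanged team: kept
]
kept = drop_join_only_transitions(transitions)
```

The three toy transitions mirror the example of Fig. 2. A stricter filter that keeps only transitions with `agents_next <= agents_t` would enforce the regularity condition of Eq. (6) directly and would additionally drop transitions in which one agent joins while another leaves.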

5.4 Generalization in Agent-Type Sets

We now evaluate the generalizability of CIAO in agent-type sets through two scenarios: (1) the agent-type set for training shares exactly one agent-type with that for testing; (2) the agent-type sets for training and testing are mutually exclusive. As seen from Figs. 7 and 8, the dynamic affinity graph as the star graph is more generalizable than the complete graph. One hypothesis for this phenomenon is that although the complete graph may be able to capture broader relationships among agents, it could be unnecessary for open ad hoc teamwork (as explained in Remark 2). The underlying principles behind this result deserve investigation in future research.

(a) Intersecting training and testing agent-type sets.
(b) Mutually exclusive training and testing agent-type sets.
Figure 7: Comparison between CIAO and GPL on LBF with intersecting and mutually exclusive agent-type sets in training and testing, respectively. The maximum temporary team size is 5.
(a) LBF: max. of 5 agents.
(b) LBF: max. of 9 agents.
Figure 8: Comparison between CIAO and GPL on Wolfpack with intersecting and mutually exclusive agent-type sets in training and testing, respectively. The maximum temporary team size is 5.

6 Conclusion

Discussion. In this work, we address the challenging problem of open ad hoc teamwork, aiming to design an agent capable of collaborating with teammates without prior coordination under dynamically changing team compositions. We draw on cooperative game theory to develop a new theory that gives an interpretation of the joint Q-value representation used in the state-of-the-art algorithm, GPL. Building upon the empirical foundation of GPL, we introduce a novel algorithm, CIAO, which adds a regularizer and a representational constraint derived from our theory. Consequently, CIAO can be seen as a subclass of GPL that uses the extra information from our theory to narrow down the joint Q-value’s hypothesis space, which facilitates learning. In addition, incorporating dynamic affinity graphs into OSB-CAG opens up a new avenue for designing graphs that describe agent relationships aligned with game objectives. Experimental results validate the effectiveness of our theory and demonstrate the superior performance of CIAO.

Limitation and Future Work. This work is the first to establish both a theory and a practical algorithm rooted in cooperative game theory to address ad hoc teamwork. It opens up several promising future directions. First, to broaden the scope and applicability of our theory, a logical next step is to explore the adaptivity of teammates with time-varying agent-types, a factor currently omitted from our theory for simplicity. Another compelling direction is to investigate the design of understandable joint Q-value representations for open ad hoc teamwork other than the linear decomposition into pairwise relationships and individual values justified in this work. This thread can advance the potential deployment of ad hoc teamwork in safety-critical environments requiring trustworthy and cost-saving solutions, with fewer trial-and-error interactions.

Impact Statement

The outcomes of this paper could significantly enhance the progress of autonomous vehicles, smart grids, and various decision-making scenarios involving multiple independently controlled agents under uncertainties. However, it is crucial to acknowledge potential drawbacks. Like many machine learning algorithms, our work may encounter challenges related to human value alignment when the interaction partners in potential applications are humans. Addressing this concern is part of our ongoing research, building upon findings from related fields that emphasize alignment issues.

Acknowledgement

This work is partially supported by UKRI Turing AI World-Leading Researcher Fellowship, EP/W002973/1. The computational resources are supported by CSC – IT Center for Science LTD., Finland. Yuan Zhang receives funding from the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement No. 953348 (ELO-X).

References

  • Agmon & Stone (2012) Agmon, N. and Stone, P. Leading ad hoc agents in joint action settings with multiple teammates. In AAMAS, pp.  341–348, 2012.
  • Agmon et al. (2014) Agmon, N., Barrett, S., and Stone, P. Modeling uncertainty in leading ad hoc teams. In Proceedings of the 2014 international conference on Autonomous agents and multi-agent systems, pp.  397–404, 2014.
  • Albrecht & Ramamoorthy (2013) Albrecht, S. V. and Ramamoorthy, S. A game-theoretic model and best-response learning method for ad hoc coordination in multiagent systems. In Proceedings of the 2013 international conference on Autonomous agents and multi-agent systems, pp.  1155–1156, 2013.
  • Bahdanau et al. (2014) Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
  • Barrett & Stone (2015) Barrett, S. and Stone, P. Cooperating with unknown teammates in complex domains: A robot soccer case study of ad hoc teamwork. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015.
  • Barrett et al. (2017) Barrett, S., Rosenfeld, A., Kraus, S., and Stone, P. Making friends on the fly: Cooperating with new teammates. Artificial Intelligence, 242:132–171, 2017.
  • Bhat & Alqahtani (2021) Bhat, J. R. and Alqahtani, S. A. 6g ecosystem: Current status and future perspective. IEEE Access, 9:43134–43167, 2021.
  • Brafman & Tennenholtz (1996) Brafman, R. I. and Tennenholtz, M. On partially controlled multi-agent systems. Journal of Artificial Intelligence Research, 4:477–507, 1996.
  • Brânzei & Larson (2009) Brânzei, S. and Larson, K. Coalitional affinity games. In Proceedings of The 8th International Conference on Autonomous Agents and Multiagent Systems-Volume 2, pp.  1319–1320, 2009.
  • Brown et al. (2020) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • Chalkiadakis et al. (2022) Chalkiadakis, G., Elkind, E., and Wooldridge, M. Computational aspects of cooperative game theory. Springer Nature, 2022.
  • Chen et al. (2020) Chen, S., Andrejczuk, E., Cao, Z., and Zhang, J. AATEAM: achieving the ad hoc teamwork by employing the attention mechanism. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pp.  7095–7102. AAAI Press, 2020.
  • De Peuter & Kaski (2023) De Peuter, S. and Kaski, S. Zero-shot assistance in sequential decision problems. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp.  11551–11559, 2023.
  • Du et al. (2019) Du, Y., Han, L., Fang, M., Liu, J., Dai, T., and Tao, D. Liir: Learning individual intrinsic reward in multi-agent reinforcement learning. Advances in Neural Information Processing Systems, 32, 2019.
  • Duan et al. (2022) Duan, J., Yu, S., Tan, H. L., Zhu, H., and Tan, C. A survey of embodied ai: From simulators to research tasks. IEEE Transactions on Emerging Topics in Computational Intelligence, 6(2):230–244, 2022.
  • Ernst et al. (2005) Ernst, D., Geurts, P., and Wehenkel, L. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6, 2005.
  • Foerster et al. (2016) Foerster, J., Assael, I. A., De Freitas, N., and Whiteson, S. Learning to communicate with deep multi-agent reinforcement learning. Advances in neural information processing systems, 29, 2016.
  • Foerster et al. (2018) Foerster, J. N., Farquhar, G., Afouras, T., Nardelli, N., and Whiteson, S. Counterfactual multi-agent policy gradients. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • Gu et al. (2022) Gu, P., Zhao, M., Hao, J., and An, B. Online ad hoc teamwork under partial observability. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022.
  • Harsanyi (1967) Harsanyi, J. C. Games with incomplete information played by “bayesian” players, i–iii part i. the basic model. Management science, 14(3):159–182, 1967.
  • Hochreiter & Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • Jiang & Lu (2018) Jiang, J. and Lu, Z. Learning attentional communication for multi-agent cooperation. Advances in neural information processing systems, 31, 2018.
  • Kalyanakrishnan et al. (2007) Kalyanakrishnan, S., Liu, Y., and Stone, P. Half field offense in robocup soccer: A multiagent reinforcement learning case study. In RoboCup 2006: Robot Soccer World Cup X 10, pp.  72–85. Springer, 2007.
  • Kim et al. (2019) Kim, D., Moon, S., Hostallero, D., Kang, W. J., Lee, T., Son, K., and Yi, Y. Learning to schedule communication in multi-agent reinforcement learning. arXiv preprint arXiv:1902.01554, 2019.
  • Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Koller & Friedman (2009) Koller, D. and Friedman, N. Probabilistic graphical models: principles and techniques. MIT press, 2009.
  • Mguni et al. (2022) Mguni, D. H., Jafferjee, T., Wang, J., Nieves, N. P., Slumbers, O., Tong, F., Li, Y., Zhu, J., Yang, Y., and Wang, J. LIGS: learnable intrinsic-reward generation selection for multi-agent learning. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022.
  • Mirsky et al. (2022) Mirsky, R., Carlucho, I., Rahman, A., Fosong, E., Macke, W., Sridharan, M., Stone, P., and Albrecht, S. V. A survey of ad hoc teamwork research. In Multi-Agent Systems: 19th European Conference, EUMAS 2022, Düsseldorf, Germany, September 14–16, 2022, Proceedings, pp.  275–293. Springer, 2022.
  • Mnih et al. (2013) Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
  • Paszke et al. (2019) Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
  • Rahman et al. (2021) Rahman, M. A., Hopner, N., Christianos, F., and Albrecht, S. V. Towards open ad hoc teamwork using graph-based policy learning. In International Conference on Machine Learning, pp.  8776–8786. PMLR, 2021.
  • Rashid et al. (2018) Rashid, T., Samvelyan, M., de Witt, C. S., Farquhar, G., Foerster, J. N., and Whiteson, S. QMIX: monotonic value function factorisation for deep multi-agent reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pp.  4292–4301. PMLR, 2018.
  • Rashid et al. (2020) Rashid, T., Farquhar, G., Peng, B., and Whiteson, S. Weighted qmix: Expanding monotonic value function factorisation for deep multi-agent reinforcement learning. Advances in Neural Information Processing Systems, 33, 2020.
  • Shapley (1953) Shapley, L. S. Stochastic games. Proceedings of the national academy of sciences, 39(10):1095–1100, 1953.
  • Shneiderman (2020) Shneiderman, B. Human-centered artificial intelligence: Three fresh ideas. AIS Transactions on Human-Computer Interaction, 12(3):109–124, 2020.
  • Sliwinski & Zick (2017) Sliwinski, J. and Zick, Y. Learning hedonic games. In IJCAI, pp.  2730–2736, 2017.
  • Smith & Gasser (2005) Smith, L. and Gasser, M. The development of embodied cognition: Six lessons from babies. Artificial life, 11(1-2):13–29, 2005.
  • Stone & Kraus (2010) Stone, P. and Kraus, S. To teach or not to teach? decision making under uncertainty in ad hoc teams. In Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems: volume 1-Volume 1, pp.  117–124, 2010.
  • Stone et al. (2009) Stone, P., Kaminka, G. A., and Rosenschein, J. S. Leading a best-response teammate in an ad hoc team. In International Workshop on Agent-Mediated Electronic Commerce, pp.  132–146. Springer, 2009.
  • Stone et al. (2010) Stone, P., Kaminka, G. A., Kraus, S., and Rosenschein, J. S. Ad hoc autonomous agent teams: Collaboration without pre-coordination. In Fox, M. and Poole, D. (eds.), Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2010, Atlanta, Georgia, USA, July 11-15, 2010. AAAI Press, 2010.
  • Sukhbaatar et al. (2016) Sukhbaatar, S., Fergus, R., et al. Learning multiagent communication with backpropagation. Advances in neural information processing systems, 29, 2016.
  • Sunehag et al. (2018) Sunehag, P., Lever, G., Gruslys, A., Czarnecki, W. M., Zambaldi, V. F., Jaderberg, M., Lanctot, M., Sonnerat, N., Leibo, J. Z., Tuyls, K., and Graepel, T. Value-decomposition networks for cooperative multi-agent learning based on team reward. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS 2018, Stockholm, Sweden, July 10-15, 2018, pp.  2085–2087. International Foundation for Autonomous Agents and Multiagent Systems Richland, SC, USA / ACM, 2018.
  • Sutton & Barto (2018) Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT press, 2018.
  • Tacchetti et al. (2019) Tacchetti, A., Song, H. F., Mediano, P. A. M., Zambaldi, V. F., Kramár, J., Rabinowitz, N. C., Graepel, T., Botvinick, M. M., and Battaglia, P. W. Relational forward models for multi-agent learning. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019.
  • Wang et al. (2020) Wang, J., Zhang, Y., Kim, T.-K., and Gu, Y. Shapley q-value: A local reward approach to solve global reward games. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):7285–7292, Apr 2020.
  • Wang et al. (2021) Wang, J., Xu, W., Gu, Y., Song, W., and Green, T. C. Multi-agent reinforcement learning for active voltage control on power distribution networks. Advances in Neural Information Processing Systems, 34:3271–3284, 2021.
  • Wang et al. (2022) Wang, J., Zhang, Y., Gu, Y., and Kim, T.-K. Shaq: Incorporating shapley value theory into multi-agent q-learning. Advances in Neural Information Processing Systems, 35:5941–5954, 2022.
  • Wu et al. (2011) Wu, F., Zilberstein, S., and Chen, X. Online planning for ad hoc autonomous agent teams. In Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI), pp.  439–445, 2011.
  • Xie et al. (2021) Xie, A., Losey, D., Tolsma, R., Finn, C., and Sadigh, D. Learning latent representations to influence multi-agent interaction. In Conference on robot learning, pp.  575–588. PMLR, 2021.
  • Xue et al. (2022) Xue, K., Xu, J., Yuan, L., Li, M., Qian, C., Zhang, Z., and Yu, Y. Multi-agent dynamic algorithm configuration. Advances in Neural Information Processing Systems, 35:20147–20161, 2022.
  • Zhao et al. (2023) Zhao, W. X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z., et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023.
  • Zintgraf et al. (2021) Zintgraf, L. M., Devlin, S., Ciosek, K., Whiteson, S., and Hofmann, K. Deep interactive bayesian reinforcement learning via meta-learning. In Dignum, F., Lomuscio, A., Endriss, U., and Nowé, A. (eds.), AAMAS ’21: 20th International Conference on Autonomous Agents and Multiagent Systems, Virtual Event, United Kingdom, May 3-7, 2021, pp.  1712–1714. ACM, 2021. doi: 10.5555/3463952.3464210.

Appendix A Related Works

Theoretical Models for Ad Hoc Teamwork. In our review of theoretical models for describing ad hoc teamwork (AHT), we begin by discussing foundational works. Brafman & Tennenholtz (1996) pioneered the study of ad hoc teamwork by investigating the repeated matrix game with a single teammate. Agmon & Stone (2012) extended this line of inquiry to scenarios involving multiple teammates. Agmon et al. (2014) further relaxed assumptions by allowing teammates’ policies to be drawn from a known set. Stone & Kraus (2010) proposed collaborative multi-armed bandits, initially formalizing AHT but with notable assumptions, such as knowing teammates’ policies and environments. Albrecht & Ramamoorthy (2013) introduced the stochastic Bayesian game (SBG) as the first complete theoretical model for addressing dynamic environments and unknown teammates in AHT. Building upon the SBG, Rahman et al. (2021) proposed the open stochastic Bayesian game (OSBG) to address open ad hoc teamwork (OAHT). Zintgraf et al. (2021) modelled AHT as interactive Bayesian reinforcement learning (IBRL) in Markov games, focusing on solving non-stationary teammates’ policies within episodes. In contrast, Xie et al. (2021) introduced a hidden parameter Markov decision process (HiP-MDP) to address scenarios where teammates’ policies vary across episodes but remain stationary within each episode. In this paper, we contribute to the theoretical landscape of AHT by extending the coalitional affinity game (CAG) from the perspective of cooperative game theory, under assumptions similar to those of SBG and OSBG. In more detail, we introduce a novel theoretical model, referred to as the Open Stochastic Bayesian Coalitional Affinity Game (OSB-CAG), shedding light on the interactive process between the learner and temporary teammates.
This theoretical model can be seen as an extension of OSBG (see Appendix B), where the relationship between agents is conceptualized as a dynamic affinity graph in theory, moving beyond treating the graph solely as an implementation tool. (Footnote 4: If the dynamic affinity graph has no edges, the OSB-CAG degrades to a plain OSBG.) Our proposed solution concept, DVSC, provides a fresh perspective on how the learner can find optimal policies to attract temporary teammates for effective collaboration. Furthermore, we introduce a more specific transition function under our theoretical model in place of the one proposed by Rahman et al. (2021). The main benefit of our proposed transition function is that it is closely tied to the underlying assumptions and explicitly subsumes the concrete interactive process described by Rahman et al. (2021).

Algorithms for Ad Hoc Teamwork. We now review AHT from an algorithmic standpoint. The best response algorithm (Stone et al., 2009), initially proposed under the assumptions of a matrix game and known teammates’ policies, laid the foundation for algorithmic solutions in this domain. Extending this work, REACT (Agmon et al., 2014) emerged as a solution effective for matrix games where teammates’ policies are drawn from a known set. Wu et al. (2011) introduced a novel approach using biased adaptive play to estimate teammates’ actions based on their historical actions. They combined this with Monte Carlo tree search to plan the ad hoc agent’s actions. HBA (Albrecht & Ramamoorthy, 2013) expanded the scope beyond matrix games, maintaining a probability distribution of predetermined agent types and maximizing long-term payoffs through an extended Bellman operator. PLASTIC-Policy (Barrett et al., 2017) addressed more realistic scenarios, such as RoboCup (Kalyanakrishnan et al., 2007), by training teammates’ policies through behavior cloning and the ad hoc agent’s policy through FQI (Ernst et al., 2005). AATEAM (Chen et al., 2020) extended PLASTIC-Policy, incorporating an attention network (Bahdanau et al., 2014) to enhance the estimation of unseen agent types. Rahman et al. (2021) integrated modern deep learning techniques, including GNNs and RL algorithms, into HBA to address open ad hoc teamwork (OAHT) and introduced GPL. ODITS (Gu et al., 2022) was proposed to handle teammates with rapidly changing behaviors under partial observability. In this paper, we introduce CIAO, a novel algorithm based on our proposed theory (OSB-CAG with DVSC as a solution concept). Specifically, CIAO extends the joint Q-value representation and training loss of GPL. Additionally, CIAO generalizes the implementation of training losses to various structures of the dynamic affinity graph, known as the coordination graph in GPL, with theoretical guarantees.
This provides a design paradigm for training losses that facilitates the investigation of diverse dynamic affinity graph structures. This paradigm can not only cater for various application scenarios but also facilitate realizing ideas inspired by other fields. Furthermore, we prove in theory and demonstrate in experiments that the existing GPL training loss is a viable approximation of the exact learning paradigm under our theory.

Relationship to Cooperative Multi-Agent Reinforcement Learning. Cooperative multi-agent reinforcement learning (MARL) primarily aims at training and controlling agents together to optimally achieve a shared goal. The key research topics are credit assignment (also known as value decomposition in some literature) (Foerster et al., 2018; Sunehag et al., 2018; Rashid et al., 2018), reward shaping (Du et al., 2019; Mguni et al., 2022), and communication (Foerster et al., 2016; Sukhbaatar et al., 2016; Jiang & Lu, 2018; Kim et al., 2019). In this paper, we shift the focus to AHT, where only one agent (referred to as the learner) is controllable and trained to collaborate with an unknown set of uncontrollable agents to achieve a shared goal. Although the teammates’ behaviours in AHT can be influenced by the learner’s action (under the assumption that they are capable of reacting to the learner’s action) (Mirsky et al., 2022), the joint policy may still be sub-optimal owing to either the reactivity of teammates or the effectiveness of attracting teammates in implementation. On the other hand, the convex game, a transferable utility game from cooperative game theory, was introduced to employ the Shapley value as a credit assignment scheme with theoretical guarantees and interpretation (Wang et al., 2020, 2022). In this paper, we introduce CAG, which belongs to non-transferable utility games (a broader class including transferable utility games), to establish a graph-based joint Q-value representation with theoretical guarantees and understandings to address OAHT.

Appendix B Open Stochastic Bayesian Game

We now review the open stochastic Bayesian game (OSBG) that describes the open ad hoc teamwork for establishing GPL (Rahman et al., 2021). It is defined as a tuple $\langle\mathcal{N},\mathcal{S},(\mathcal{A}_{j})_{j\in\mathcal{N}},\Theta,R,T,\gamma\rangle$. $\mathcal{N}$ is a set of all possible agents; $\mathcal{S}$ is a set of states; $\mathcal{A}_{j}$ is agent $j$’s action set; $\Theta$ is a set of all possible agent types. Let the joint action set under a variable agent set $\mathcal{N}_{t}\subseteq\mathcal{N}$ be defined as $\mathcal{A}_{\mathcal{N}_{t}}=\times_{j\in\mathcal{N}_{t}}\mathcal{A}_{j}$. Therefore, the joint action space under the variable number of agents is defined as $\mathcal{A}_{\mathcal{N}}=\bigcup_{\mathcal{N}_{t}\in\mathbb{P}(\mathcal{N})}\{a\mid a\in\mathcal{A}_{\mathcal{N}_{t}}\}$, while the joint agent-type space under the variable number of agents is defined as $\Theta_{\mathcal{N}}=\bigcup_{\mathcal{N}_{t}\in\mathbb{P}(\mathcal{N})}\{\theta\mid\theta\in\Theta^{|\mathcal{N}_{t}|}\}$. $R:\mathcal{S}\times\mathcal{A}_{\mathcal{N}}\rightarrow\mathbb{R}$ is the learner’s reward. $T:\mathcal{S}\times\Theta_{\mathcal{N}}\times\mathcal{A}_{\mathcal{N}}\rightarrow\mathcal{S}\times\Theta_{\mathcal{N}}$ is a transition function to describe the evolution of states and agents of variable types. The learner’s action value function $Q^{\pi^{i}}(s_{t},a_{t}^{i})$ is defined as follows:

$$Q^{\pi^{i}}(s_{t},a_{t}^{i})=\mathbb{E}_{a_{t}^{-i}\sim\pi_{t}^{-i}}\left[Q^{\pi^{i}}(s_{t},a_{t}^{-i},a_{t}^{i})\right]=\mathbb{E}_{\begin{subarray}{c}s_{t},\theta_{t}\sim T,\,a_{t}^{-i}\sim\pi_{t}^{-i},\,a_{t}^{i}\sim\pi^{i}\end{subarray}}\Big[\sum_{t=0}^{\infty}\gamma^{t}R(s_{t},a_{t})\Big],$$

where $\gamma\in[0,1)$ is a discount factor; $s_{t}$ is a state at timestep $t$, $a_{t}^{-i}$ is a joint action of teammates $-i$ at timestep $t$, and $a_{t}^{i}$ is the learner $i$’s action at timestep $t$; $\pi^{i}$ is the learner’s stationary policy and $\pi_{t}^{-i}$ is a joint policy of teammates $-i$; $Q^{\pi^{i}}(s_{t},a_{t}^{-i},a_{t}^{i})$ is a joint Q-value. The learner’s policy $\pi^{i,*}$ is optimal if and only if $Q^{\pi^{i,*}}(s_{t},a_{t}^{i})\geq Q^{\pi^{i}}(s_{t},a_{t}^{i})$ for all $\pi^{i},s_{t},a_{t}^{i}$. The teammates’ joint policy is represented as $\pi_{t}^{-i}:\mathcal{S}\times\Theta_{\mathcal{N}}\rightarrow\Delta(\mathcal{A}_{\mathcal{N}})$. The learner is unable to observe the teammates’ types and their policies, which can only be inferred from the history of states and actions. The learner makes decisions by selecting the actions that maximize $Q^{\pi^{i}}(s_{t},a_{t}^{i})$.

Appendix C Further Details of Implementation

Given the learner’s lack of knowledge about $P_E$ and $\pi_t^{-i}$, it is essential to discuss strategies for estimating these terms to achieve the convergence of Eq. (7). In the GPL framework, these two terms are implemented as the type inference model and the agent model, respectively. The implementation details are presented below.

C.1 GPL Framework

We now review GPL’s empirical framework (Rahman et al., 2021). The framework consists of three modules: the type inference model, the joint action value model, and the agent model. We summarize only the model specifications. Note that while the original GPL framework is oriented towards a fixed coordination graph, specifically a complete graph, we relax this constraint to accommodate any graph structure as needed.

Type Inference Model. This is modelled as an LSTM (Hochreiter & Schmidhuber, 1997) that infers the agent-types of a team at timestep $t$ given those of the team at timestep $t-1$. The agent-type is modelled as a fixed-length hidden-state vector of the LSTM, referred to as the agent-type embedding. At each timestep $t$, the state information of an emergent team $\mathcal{N}_t$ is reorganized into a batch of agents’ information $B_t = [\langle u_t, x_{t,1} \rangle, \ldots, \langle u_t, x_{t,|\mathcal{N}_t|} \rangle]^{\top}$, where each agent is assigned a vector composed of $u_t$ and $x_{t,i}$, which are the observations and the agent-specific information extracted from state $s_t$, respectively. Along with additional information such as the agent-type embedding of $\mathcal{N}_{t-1}$ and the cell state, the LSTM estimates the agent-type embedding of $\mathcal{N}_t$.
To handle changing team sizes, at each timestep the agent-type embeddings of agents who leave the team are removed, while the agent-type embeddings of newly added agents are set to zero vectors.
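
The embedding bookkeeping for open teams described above can be illustrated with a minimal Python sketch. The function name `carry_over_embeddings`, the dictionary layout, and the agent identifiers are hypothetical and are not taken from the GPL or CIAO code; the sketch covers only the removal and zero-initialisation step that precedes the LSTM update, and it ignores the cell state.

```python
from typing import Dict, List


def carry_over_embeddings(
    prev: Dict[str, List[float]],
    current_agents: List[str],
    dim: int,
) -> Dict[str, List[float]]:
    """Align the agent-type embeddings of timestep t-1 with the team at timestep t."""
    aligned: Dict[str, List[float]] = {}
    for agent in current_agents:
        if agent in prev:
            # The agent stayed in the team: keep its previous embedding.
            aligned[agent] = list(prev[agent])
        else:
            # The agent just joined: start from a zero vector.
            aligned[agent] = [0.0] * dim
    # Agents present in `prev` but absent from `current_agents` have left
    # the team, so their embeddings are simply not copied over.
    return aligned


# Hypothetical team: "a" leaves and "c" joins between two timesteps.
prev_embeddings = {"learner": [0.2, -0.1], "a": [0.5, 0.3], "b": [-0.4, 0.9]}
team_t = ["learner", "b", "c"]
embeddings_t = carry_over_embeddings(prev_embeddings, team_t, dim=2)
```

In this sketch the aligned embeddings would then be passed, together with the batch $B_t$, to the LSTM that produces the agent-type embedding of $\mathcal{N}_t$.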

Joint Action Value Model. The joint Q-value, denoted as $\hat{Q}^{\pi^{i}}(s_t, a_t)$, is approximated as the sum of the corresponding individual utilities, $\hat{Q}_{j}^{\pi^{i}}(a_t^{j} | s_t)$, and pairwise utilities, $\hat{Q}_{jk}^{\pi^{i}}(a_t^{j}, a_t^{k} | s_t)$, based on the coordination graph structure. The approximation is expressed as follows:

$$\hat{Q}^{\pi^{i}}(s_t, a_t) = \sum_{j \in \mathcal{N}_t} \hat{Q}_{j}^{\pi^{i}}(a_t^{j} | s_t) + \sum_{(j,k) \in \mathcal{E}_t} \hat{Q}_{jk}^{\pi^{i}}(a_t^{j}, a_t^{k} | s_t).$$
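
As an illustration of the decomposition above, the following Python sketch evaluates the joint Q-value for one joint action on a small coordination graph. The utility tables, the integer agent indices and the function name `joint_q` are hypothetical; in the actual framework the utilities are produced by neural networks rather than stored as tables.

```python
from typing import Dict, List, Tuple


def joint_q(
    individual: Dict[int, List[float]],
    pairwise: Dict[Tuple[int, int], List[List[float]]],
    joint_action: Dict[int, int],
) -> float:
    """Sum individual utilities over nodes and pairwise utilities over edges."""
    # First sum: one individual utility per agent j in the team.
    total = sum(individual[j][joint_action[j]] for j in individual)
    # Second sum: one pairwise utility per edge (j, k) of the coordination graph.
    for (j, k), table in pairwise.items():
        total += table[joint_action[j]][joint_action[k]]
    return total


# Hypothetical three-agent team on a line graph 0 - 1 - 2, two actions each.
individual = {0: [1.0, 0.0], 1: [0.5, 2.0], 2: [0.0, 1.0]}
pairwise = {
    (0, 1): [[0.0, 1.0], [2.0, 0.0]],
    (1, 2): [[0.5, 0.0], [0.0, -1.0]],
}
q_value = joint_q(individual, pairwise, {0: 1, 1: 0, 2: 1})
```

Because the sums range over the current nodes and edges only, the same function applies unchanged when agents join or leave and the graph is rebuilt.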

Both $\hat{Q}_{j}^{\pi^{i}}(a_t^{j} | s_t)$ and $\hat{Q}_{jk}^{\pi^{i}}(a_t^{j}, a_t^{k} | s_t)$ are implemented as multilayer perceptrons (MLPs) parameterised by $\beta$ and $\delta$, denoted as $\text{MLP}_{\beta}$ and $\text{MLP}_{\delta}$. The input of $\text{MLP}_{\beta}$ is the concatenation of the learner’s agent-type embedding $\theta_t^{i}$ and the teammate $j$’s agent-type embedding $\theta_t^{j}$.
Its output is a vector of length $|\mathcal{A}_j|$ estimating $Q_{j}^{\pi^{i}}(a_t^{j} | s_t)$. The detailed expression is shown as follows:

$$\hat{Q}_{j}^{\pi^{i}}(a_t^{j} | s_t) = \text{MLP}_{\beta}(\theta_t^{j}, \theta_t^{i})(a_t^{j}).$$

The pairwise utility $\hat{Q}_{jk}^{\pi^{i}}(a_t^{j}, a_t^{k} | s_t)$ is approximated by low-rank factorization, as follows:

$$\hat{Q}_{jk}^{\pi^{i}}(a_t^{j}, a_t^{k} | s_t) = \big( \text{MLP}_{\delta}(\theta_t^{j}, \theta_t^{i})^{\top} \text{MLP}_{\delta}(\theta_t^{k}, \theta_t^{i}) \big)(a_t^{j}, a_t^{k}),$$

where the input of $\text{MLP}_{\delta}$ is the same as that of $\text{MLP}_{\beta}$; the output of $\text{MLP}_{\delta}(\theta_t^{j}, \theta_t^{i})$ is a matrix with the shape $K \times |\mathcal{A}_j|$ and $K \ll |\mathcal{A}_j|$.
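
To make the shapes in the low-rank factorization concrete, the sketch below forms a pairwise utility table from two factor matrices of shape $K \times |\mathcal{A}|$. The factor values stand in for MLP outputs and are hypothetical, as is the function name `low_rank_pairwise`; plain Python lists are used instead of a tensor library to keep the example self-contained.

```python
from typing import List

Matrix = List[List[float]]


def low_rank_pairwise(factor_j: Matrix, factor_k: Matrix) -> Matrix:
    """Return factor_j^T factor_k, a table indexed by (a_j, a_k)."""
    rank = len(factor_j)
    # Both factors must share the same rank K (number of rows).
    assert len(factor_k) == rank
    n_j, n_k = len(factor_j[0]), len(factor_k[0])
    return [
        [
            sum(factor_j[r][a] * factor_k[r][b] for r in range(rank))
            for b in range(n_k)
        ]
        for a in range(n_j)
    ]


# Hypothetical factors with K = 1 and three actions per agent.
factor_j = [[1.0, 2.0, 3.0]]
factor_k = [[0.5, -1.0, 2.0]]
table = low_rank_pairwise(factor_j, factor_k)
pairwise_utility = table[1][2]  # utility of a_j = 1 and a_k = 2
```

With rank $K$, the two factors hold $K(|\mathcal{A}_j| + |\mathcal{A}_k|)$ entries instead of the $|\mathcal{A}_j| |\mathcal{A}_k|$ entries of a full table, which is where the saving for $K \ll |\mathcal{A}_j|$ comes from.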

Agent Model. It is assumed that all other connected agents, as described by a coordination graph, influence an agent’s actions. To model this situation, a GNN is applied to process the agent-type embedding of a temporary team, denoted as $\theta_t$, where each team member is represented as a node. More specifically, a GNN model called the relational forward model (RFM) (Tacchetti et al., 2019), parameterised by $\eta$, is applied to transform $\theta_t$ (as the initial node representation) to $\bar{n}_t$ (as the new node representation), taking other agents’ effects into account. Then, $\bar{n}_t$ is employed to infer $q_{\zeta,\eta}(a_t^{-i} | s_t)$, as the approximation of the teammates’ joint policy, $\pi_t^{-i}(a_t^{-i} | s_t, \theta_t^{-i})$. The detailed expression is as follows:

$$\begin{split} q_{\zeta,\eta}(a_t^{-i} | s_t) &= \prod_{j \in -i} q_{\zeta,\eta}(a_t^{j} | s_t), \\ q_{\zeta,\eta}(a_t^{j} | s_t) &= \text{Softmax}(\text{MLP}_{\eta}(\bar{n}_t^{j}))(a_t^{j}). \end{split}$$
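
The factorised form of the agent model can be sketched as follows. The per-teammate logits are hypothetical placeholders for the MLP outputs computed from the node representations, and the names `softmax` and `joint_teammate_prob` are introduced here for illustration only.

```python
import math
from typing import Dict, List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one teammate's action logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def joint_teammate_prob(
    logits_per_teammate: Dict[str, List[float]],
    actions: Dict[str, int],
) -> float:
    """Probability of a joint teammate action as a product of per-agent terms."""
    prob = 1.0
    for j, logits in logits_per_teammate.items():
        prob *= softmax(logits)[actions[j]]
    return prob


# Hypothetical logits: uniform predictions for two teammates with
# action spaces of different sizes.
logits = {"a": [0.0, 0.0], "b": [1.0, 1.0, 1.0]}
p_joint = joint_teammate_prob(logits, {"a": 1, "b": 2})
```

Since each factor is a proper distribution over one teammate's actions, the product sums to one over all joint teammate actions regardless of the team size.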

Learner’s Action Value Model. Substituting the agent model and the joint action value model defined above into Eq. (2), the learner’s Q-value for its own decision making is approximated as follows:

$$\begin{split} \hat{Q}^{\pi^{i}}(s_t, a_t^{i}) &= \hat{Q}_{i}^{\pi^{i}}(a_t^{i} | s_t) + \sum_{\substack{a_t^{j} \in \mathcal{A}_j, \\ (j,i) \in \mathcal{E}_t}} \Big( \hat{Q}_{j}^{\pi^{i}}(a_t^{j} | s_t) + \hat{Q}_{ij}^{\pi^{i}}(a_t^{i}, a_t^{j} | s_t) \Big) q_{\zeta,\eta}(a_t^{j} | s_t) \\ &\quad + \sum_{\substack{a_t^{j} \in \mathcal{A}_j, a_t^{k} \in \mathcal{A}_k, \\ (j,k) \in \mathcal{E}_t}} \hat{Q}_{jk}^{\pi^{i}}(a_t^{j}, a_t^{k} | s_t) \, q_{\zeta,\eta}(a_t^{j} | s_t) \, q_{\zeta,\eta}(a_t^{k} | s_t). \end{split}$$
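
The marginalisation above can be checked numerically on a toy example. The sketch below computes the learner's Q-value in closed form and compares it with a brute-force expectation of the joint Q-value over all joint teammate actions. It makes two assumptions that are not stated in the text: every teammate is connected to the learner (a complete graph), and the last sum ranges over edges between teammates only. All tables, probabilities and function names are hypothetical.

```python
import itertools
from typing import Dict, List, Tuple

Table = List[List[float]]


def learner_q(a_i: int, q_i: List[float], q_ind: Dict[int, List[float]],
              q_learner_pair: Dict[int, Table],
              q_team_pair: Dict[Tuple[int, int], Table],
              probs: Dict[int, List[float]]) -> float:
    """Closed-form learner Q-value for action a_i."""
    value = q_i[a_i]
    # Edges (j, i): teammate j's individual utility plus its pairwise utility
    # with the learner, weighted by the predicted probability of a_j.
    for j, table in q_learner_pair.items():
        value += sum((q_ind[j][a_j] + table[a_i][a_j]) * p
                     for a_j, p in enumerate(probs[j]))
    # Edges (j, k) between teammates: expectation over both of their actions.
    for (j, k), table in q_team_pair.items():
        value += sum(table[a_j][a_k] * p_j * p_k
                     for a_j, p_j in enumerate(probs[j])
                     for a_k, p_k in enumerate(probs[k]))
    return value


def brute_force_q(a_i, q_i, q_ind, q_learner_pair, q_team_pair, probs) -> float:
    """Expected joint Q-value, enumerating every joint teammate action."""
    teammates = sorted(probs)
    total = 0.0
    for joint in itertools.product(*(range(len(probs[j])) for j in teammates)):
        act = dict(zip(teammates, joint))
        weight = 1.0
        for j in teammates:
            weight *= probs[j][act[j]]
        joint_q = q_i[a_i]
        joint_q += sum(q_ind[j][act[j]] for j in teammates)
        joint_q += sum(t[a_i][act[j]] for j, t in q_learner_pair.items())
        joint_q += sum(t[act[j]][act[k]] for (j, k), t in q_team_pair.items())
        total += weight * joint_q
    return total


# Hypothetical utilities for a learner and teammates 1 and 2 (two actions each).
q_i = [0.0, 1.0]
q_ind = {1: [1.0, 0.0], 2: [0.0, 2.0]}
q_learner_pair = {1: [[0.0, 1.0], [2.0, 0.0]], 2: [[1.0, 0.0], [0.0, 3.0]]}
q_team_pair = {(1, 2): [[1.0, 0.0], [0.0, 1.0]]}
probs = {1: [0.5, 0.5], 2: [0.25, 0.75]}

args = (q_i, q_ind, q_learner_pair, q_team_pair, probs)
values = [learner_q(a, *args) for a in range(len(q_i))]
greedy_action = max(range(len(values)), key=values.__getitem__)
```

The closed form avoids enumerating joint teammate actions, whose number grows exponentially with the team size, which is the practical reason for marginalising term by term.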

C.2 Overall Training Procedure of CIAO

We now summarize the overall training procedure of CIAO in Algorithm 1. Note that in the GPL framework, the type inference model is absorbed into the joint Q-value model and the agent model, each with its own LSTM. This construction aims to prevent the gradients of these two models from interfering with each other during training (Rahman et al., 2021).

Algorithm 1 Overall training procedure of CIAO
  Input: dynamic affinity graph structure $G$, number of training episodes $e$, length of an episode $T$, replay buffer $\mathcal{B}$
  repeat
     Clear the replay buffer $\mathcal{B}$.
     Reset the environment and receive the initial observations.
     for $\text{timestep} = 1$ to $T$ do
        Execute the learner’s action following the $\epsilon$-greedy policy.
        Store the observations (including teammates’ actions) for the current timestep in the replay buffer $\mathcal{B}$.
     end for
     Generate the joint Q-value and the agent model as per the GPL framework, based on the dynamic affinity graph $G$.
     Update the parameters of the pairwise utilities and individual utilities using the loss function proposed in Section 4.3.
     Update the parameters of the agent model using the following loss function:
$$L(\zeta,\eta) = -\frac{1}{T} \sum_{t=1}^{T} \log q_{\zeta,\eta}(a_t^{-i} | s_t).$$
  until the number of training episodes $e$ is reached
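
The agent-model loss in Algorithm 1 is a negative log-likelihood of the observed teammates' actions, averaged over the episode. A minimal Python sketch is given below, assuming that the predicted probabilities of the actions actually taken have already been gathered per timestep; the function name `agent_model_loss` and the episode data are hypothetical, and gradient computation is left to whichever learning framework is used.

```python
import math
from typing import List


def agent_model_loss(step_probs: List[List[float]]) -> float:
    """Average negative log-likelihood of observed teammate actions.

    step_probs[t] holds, for each teammate present at timestep t, the
    probability the agent model assigned to the action that teammate took.
    """
    horizon = len(step_probs)
    total_log = 0.0
    for probs in step_probs:
        # The joint probability factorises over teammates, so its logarithm
        # is the sum of the per-teammate log-probabilities.
        total_log += sum(math.log(p) for p in probs)
    return -total_log / horizon


# Hypothetical episode of three timesteps with a changing number of teammates.
episode = [[0.5, 0.5], [1.0], [0.25, 1.0, 1.0]]
loss = agent_model_loss(episode)
```

Because the inner sum runs over whoever is present at each timestep, the loss is well defined for open teams without padding.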

Appendix D Assumptions

Assumption 1.

The following conditional independencies are assumed to hold in any distribution $P$ over the set of variables in an OSB-CAG: (1) $(\theta_t \perp\!\!\!\perp \theta_{t-1}, s_{t-1}, a_{t-1} \ | \ \mathcal{N}_t, s_t)$; (2) $(\mathcal{N}_t, s_t \perp\!\!\!\perp \theta_{t-1} \ | \ \mathcal{N}_{t-1}, s_{t-1}, a_{t-1})$; (3) $(\mathcal{N}_t \perp\!\!\!\perp a_t \ | \ s_t, \theta_t)$; (4) $(\theta_t^{j} \perp\!\!\!\perp -j, \theta_t^{-j} \ | \ \{j\}, s_t)$.

Assumption 1 encodes relationships among random variables that are entailed by any probability distribution describing the open ad hoc teamwork process. These relationships are referred to as conditional independencies (Koller & Friedman, 2009, Ch. 2).

As for conditional independence (1), it implies that the agent-types $\theta_t$ for the current timestep are conditionally independent of the related variables $\theta_{t-1}, s_{t-1}, a_{t-1}$ for the preceding timestep, given the agent set $\mathcal{N}_t$ and the state $s_t$ for the current timestep. This is reflected by $P_E(\theta_t | \mathcal{N}_t, s_t) = P(\theta_t | \mathcal{N}_t, s_t, s_{t-1}, a_{t-1}, \theta_{t-1})$.

As for conditional independence (2), it implies that the agent set $\mathcal{N}_t$ and the state $s_t$ for the current timestep are conditionally independent of the agent-types $\theta_{t-1}$ for the preceding timestep, given the variables $\mathcal{N}_{t-1}, s_{t-1}, a_{t-1}$ for the preceding timestep. This is reflected by $P_T(\mathcal{N}_t, s_t | \mathcal{N}_{t-1}, s_{t-1}, a_{t-1}) = P(\mathcal{N}_t, s_t | \mathcal{N}_{t-1}, s_{t-1}, a_{t-1}, \theta_{t-1})$.

As for conditional independence (3), it implies that the agent set $\mathcal{N}_t$ is conditionally independent of the joint action $a_t$, given the state $s_t$ and the agent-type set $\theta_t$ for the same timestep. This is reflected by $P(\mathcal{N}_t | s_t, \theta_t) = P(\mathcal{N}_t | s_t, a_t, \theta_t)$. Note that this condition coincides with the scenarios encoded by Assumption 2, where agent $j$’s policy may vary across timesteps and is correlated only with its agent-type. In turn, this implies that an agent’s mind may change across timesteps, which is evidence that open ad hoc teamwork is also suitable for modelling human-AI cooperation (Shneiderman, 2020; De Peuter & Kaski, 2023). However, for clarity and simplicity in introducing our theory, we assume in this paper that the policy is fixed (time-invariant, or stationary) across timesteps, as shown in Assumption 5.

As for conditional independence (4), it implies that an agent $j$'s agent-type $\theta_t^j$ for some timestep is conditionally independent of the other agents $-j$ and their agent-types $\theta_t^{-j}$, given the agent itself, denoted as $j$, and the state $s_t$ for that timestep. This is reflected by $\prod_{j=1}^{|\mathcal{N}_t|} P_A(\theta_t^j | \{j\}, s_t) = P_E(\theta_t | \mathcal{N}_t, s_t)$.
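The factorization in conditional independence (4) can be illustrated with a minimal sketch that builds the joint agent-type distribution $P_E$ as a product of per-agent distributions $P_A$ for a fixed state. The agent identifiers, type names, and probability values below are hypothetical and serve only as an illustration.

```python
from itertools import product

# Hypothetical per-agent type distributions P_A(theta^j | {j}, s_t) for one fixed state s_t.
P_A = {
    "agent_1": {"greedy": 0.7, "helper": 0.3},
    "agent_2": {"greedy": 0.4, "helper": 0.6},
}


def joint_type_distribution(p_a):
    """Return P_E(theta_t | N_t, s_t) as the product of the per-agent P_A terms."""
    agents = sorted(p_a)
    joint = {}
    # Enumerate every joint agent-type assignment, one type per agent.
    for types in product(*(p_a[j].keys() for j in agents)):
        prob = 1.0
        for j, theta_j in zip(agents, types):
            prob *= p_a[j][theta_j]
        joint[types] = prob
    return joint


P_E = joint_type_distribution(P_A)
```

Because each factor is a normalized distribution, the resulting joint distribution also sums to one over all joint agent-type assignments.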

Assumption 2.

Suppose that $\alpha_{jk}(a_t^j, a_t^k | s_t) = 0$ for $t \geq T$, where $T$ is the timestep when agent $j$ or $k$ leaves the environment, and $R_j(a_t^j | s_t) = 0$ for $t \geq T'$, where $T'$ is the timestep when agent $j$ leaves the environment.

Assumption 2 introduces a metric to quantify the impact of agents leaving the environment. Essentially, it posits that an agent that has departed from the environment no longer exerts any influence on the remaining agents within the environment.
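The effect of this assumption can be sketched as a mask over departed agents. The example below assumes, purely for illustration, a team reward that sums individual terms $R_j$ and pairwise terms $\alpha_{jk}$; the function name `team_reward` and all numeric values are hypothetical.

```python
def team_reward(individual, pairwise, present):
    """Sum individual and pairwise terms, zeroing every term that involves a departed agent."""
    total = 0.0
    for j, r_j in individual.items():
        if j in present:  # R_j = 0 once agent j has left
            total += r_j
    for (j, k), alpha_jk in pairwise.items():
        if j in present and k in present:  # alpha_jk = 0 once j or k has left
            total += alpha_jk
    return total


# Hypothetical values for three agents at a single timestep.
individual = {1: 1.0, 2: 0.5, 3: 2.0}
pairwise = {(1, 2): 0.3, (1, 3): -0.2, (2, 3): 0.4}

before = team_reward(individual, pairwise, present={1, 2, 3})
after = team_reward(individual, pairwise, present={1, 2})  # agent 3 has left
```

After agent 3 leaves, only the terms involving agents 1 and 2 remain, so the departed agent no longer influences the value seen by the remaining agents.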

Assumption 2.

There exists an underlying agent-type set to generate ad hoc teammates in an environment which is unknown to the learner.

Assumption 2 provides a natural framework for describing the agent types of teammates. In scenarios where the agent type set is sufficiently large, traversing all possible agent types or compositions becomes impractical. Therefore, this assumption ensures that the generalizability of open ad hoc teamwork is not compromised.

Assumption 3.

Teammates can be influenced by the learner through its decision making.

Assumption 3 constitutes a fundamental and commonly assumed property essential for rationalizing the ad hoc teamwork problem. Often referred to as the reactivity of teammates (Barrett et al., 2017), this assumption posits that teammates must be capable of reacting to or being influenced by the learner. Without this interaction, the problem would regress to a scenario akin to a single-agent problem, where teammates merely function as moving ‘obstacles.’ To avert such a pathological situation, maintaining this assumption serves as a crucial boundary for ad hoc teamwork.

Assumption 4.

The agents stay in the environment at least for a period of timesteps.

Assumption 4 is a prerequisite ensuring the feasibility of completing arbitrary tasks. Without this condition, wherein an agent joining at a given timestep remains in the environment for a non-instantaneous duration, there would be minimal opportunity for teams of agents to react to and influence each other effectively.

Assumption 5.

Each teammate of an arbitrary agent type is equipped with a fixed policy.

Assumption 5 serves as a simplified condition for analyzing the learner’s convergence to the optimal policy. By assuming fixed policies for teammates, the Markov process becomes stationary from the learner’s perspective, facilitating a more tractable analysis of convergence dynamics. However, this can be further relaxed to cater for more realistic situations.

Appendix E Generalization of Preference Values for Coalitional Affinity Game

At the beginning, it is worth noting that in the original work on CAG (Brânzei & Larson, 2009), the preference value of an arbitrary agent $j$ is defined as follows:

$$\bar{v}_j(\mathcal{C}) = \begin{cases} 0 & \text{if } \mathcal{C} = \{j\}, \\ \sum_{(j,k) \in \mathcal{E}, k \in \mathcal{C}} \bar{w}(j,k) & \text{otherwise}. \end{cases} \tag{9}$$

Requiring each agent's preference value of the coalition including only itself to equal zero is convenient and straightforward for analysis, but it limits the representational capacity for various problems. To address this issue, we generalize the definition of the preference value function in Eq. (9) as follows:

$$v_j(\mathcal{C}) = \begin{cases} b_j \geq 0 & \text{if } \mathcal{C} = \{j\}, \\ \sum_{(j,k) \in \mathcal{E}, k \in \mathcal{C}} w(j,k) & \text{otherwise}. \end{cases} \tag{10}$$

The main difference between the definitions in Eq. (9) and Eq. (10) is that the preference value of a coalition including only a single agent is not forced to be zero in Eq. (10). Although the results in the original work on CAG (Brânzei & Larson, 2009) are based on each agent's original preference value function in Eq. (9), we can still generalize and leverage them by translating each agent's preference value function by its preference value of the coalition including only itself, so that it aligns with the condition on $\bar{v}(\mathcal{C})$ in Eq. (9). In more detail, we can transform the newly defined preference value function in Eq. (10) as follows:

$$\hat{v}_j(\mathcal{C}) = v_j(\mathcal{C}) - v_j(\{j\}) = \begin{cases} 0 & \text{if } \mathcal{C} = \{j\}, \\ \sum_{(j,k) \in \mathcal{E}, k \in \mathcal{C}} w(j,k) - v_j(\{j\}) & \text{otherwise}. \end{cases} \tag{11}$$

Therefore, we can directly leverage the results from the previous work (Brânzei & Larson, 2009) by replacing $\bar{v}_j(\mathcal{C})$ with $\hat{v}_j(\mathcal{C})$, and generalize the results to the newly defined preference value function in Eq. (10) by applying the change of variables in Eq. (11) to those results.
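The translation in Eq. (11) can be sketched on a small affinity graph. The edge weights `w`, the singleton values `b`, and the function names below are hypothetical, and the graph is assumed to contain no self-loops.

```python
def preference_value(j, coalition, w, b):
    """Generalised preference value v_j(C) in the style of Eq. (10)."""
    if coalition == frozenset({j}):
        return b[j]  # singleton value b_j >= 0
    # Sum the weights of edges (j, k) whose endpoint k lies in the coalition.
    return sum(w[(j, k)] for k in coalition if (j, k) in w)


def translated_value(j, coalition, w, b):
    """Translated value v_hat_j(C) = v_j(C) - v_j({j}) in the style of Eq. (11)."""
    return preference_value(j, coalition, w, b) - preference_value(j, frozenset({j}), w, b)


# Hypothetical symmetric weights on directed edge keys, and singleton values.
w = {(1, 2): 2.0, (2, 1): 2.0, (1, 3): -1.0, (3, 1): -1.0}
b = {1: 0.5, 2: 0.0, 3: 1.0}

grand = frozenset({1, 2, 3})
v_1 = preference_value(1, grand, w, b)
v_hat_1 = translated_value(1, grand, w, b)
v_hat_singleton = translated_value(1, frozenset({1}), w, b)
```

The translated singleton value is zero by construction, which is the condition that the original definition in Eq. (9) imposes.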

The generalised preference value $v_j(\mathcal{C})$ plays an important role in proving the results of OSB-CAG in the following sections.

Definition 4.

In a CAG with the generalised preference value function, for any agent $j$, its preference value of the coalition including only itself is defined such that $v_j(\{j\}) \geq 0$.

In the conventional definition of a coalition value function in cooperative game theory, the value of the empty set (empty coalition) is defined as zero (Chalkiadakis et al., 2022). (The preference value function of an agent can be seen as a coalition value function specifically defined for that agent.) In the context of a CAG, we can formally extend the domain of an agent $j$'s preference value function by considering the empty set such that $v_j(\emptyset) = 0$. This extension can be interpreted as the agent imagining a scenario where it is not included (with no incentive to join). If $v_j(\{j\}) < 0$, it may lead to the paradox that agent $j$ would choose to disappear from the environment (e.g. suicide) to escape independence, which is evidently opposed to morality and ethics. To avoid this paradox, it is reasonable to generalise an agent $j$'s preference value of the coalition including only itself to the non-negative range such that $v_j(\{j\}) \geq 0$.

Appendix F Derivation Details of Definition 2

Definition 5.

We say that a blocking coalition $\mathcal{C}$ weakly blocks a coalition structure $\mathcal{CS}$ if every agent $j \in \mathcal{C}$ weakly prefers $\mathcal{C}$ to $\mathcal{CS}(j)$ and there exists at least one agent $k \in \mathcal{C}$ who strictly prefers $\mathcal{C}$ to $\mathcal{CS}(k)$. A coalition structure $\mathcal{CS} = \{\mathcal{C}_1, ..., \mathcal{C}_m\}$ admitting no weakly blocking coalition $\mathcal{C} \subset \mathcal{C}_k$, for some $1 \leq k \leq m$, is called inner stable.
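The weak blocking condition can be checked mechanically. The following minimal sketch assumes the zero singleton value of Eq. (9), so that an agent's preference value is the sum of its edge weights to the other members of a coalition; the weights and function names are hypothetical.

```python
def value(j, coalition, w):
    """Preference value of agent j for a coalition, with missing edges counted as zero."""
    return sum(w.get((j, k), 0.0) for k in coalition if k != j)


def weakly_blocks(coalition, structure, w):
    """True if every member weakly prefers `coalition` to its current coalition
    and at least one member strictly prefers it."""
    current = {j: c for c in structure for j in c}  # CS(j) for every agent j
    gains = [value(j, coalition, w) - value(j, current[j], w) for j in coalition]
    return all(g >= 0 for g in gains) and any(g > 0 for g in gains)


# Hypothetical symmetric weights: agents 1 and 2 like each other, both dislike agent 3.
w = {(1, 2): 1.0, (2, 1): 1.0, (1, 3): -2.0, (3, 1): -2.0, (2, 3): -2.0, (3, 2): -2.0}
grand = [frozenset({1, 2, 3})]

blocks_12 = weakly_blocks(frozenset({1, 2}), grand, w)
blocks_13 = weakly_blocks(frozenset({1, 3}), grand, w)
```

Here the sub-coalition of agents 1 and 2 weakly blocks the grand coalition, so the grand coalition is not inner stable under these hypothetical weights.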

Theorem 4 (Brânzei & Larson (2009)).

If a CAG is symmetric, then the social-welfare maximizing partition exhibits inner stability.

Theorem 4 directly holds for the newly defined $v_j(\mathcal{C})$ in this paper, since it is irrelevant to the detailed representation (feasible domain) of a preference value function (see Theorems 2 and 5 in the previous work (Brânzei & Larson, 2009)).
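Theorem 4 can be checked by brute force on a small instance. The sketch below enumerates all partitions of four agents, picks the social-welfare maximizing one, and tests inner stability. It assumes the zero singleton value of Eq. (9); the symmetric weights are hypothetical and stored once per unordered pair.

```python
from itertools import combinations


def partitions(items):
    """Yield every partition of `items` as a list of frozensets."""
    items = list(items)
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for size in range(len(rest) + 1):
        for others in combinations(rest, size):
            block = frozenset((first,) + others)
            remaining = [x for x in rest if x not in others]
            for tail in partitions(remaining):
                yield [block] + tail


def value(j, coalition, w):
    """Preference value of agent j with a zero singleton value, as in Eq. (9)."""
    return sum(w[frozenset((j, k))] for k in coalition if k != j)


def welfare(structure, w):
    """Social welfare: the sum of all agents' preference values."""
    return sum(value(j, c, w) for c in structure for j in c)


def is_inner_stable(structure, w):
    """Check that no proper sub-coalition of any block weakly blocks the structure."""
    for block in structure:
        members = sorted(block)
        for size in range(1, len(members)):
            for sub in combinations(members, size):
                gains = [value(j, sub, w) - value(j, block, w) for j in sub]
                if all(g >= 0 for g in gains) and any(g > 0 for g in gains):
                    return False
    return True


# Hypothetical symmetric weights, one entry per unordered pair of agents.
w = {
    frozenset((1, 2)): 3, frozenset((1, 3)): -2, frozenset((1, 4)): 1,
    frozenset((2, 3)): -1, frozenset((2, 4)): -4, frozenset((3, 4)): 2,
}
best = max(partitions([1, 2, 3, 4]), key=lambda cs: welfare(cs, w))
```

For these weights the welfare-maximizing partition pairs agents 1 and 2 and agents 3 and 4, and it is inner stable, whereas the grand coalition is not.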

Lemma 1.

If a CAG is symmetric, then maximizing the social welfare under a grand coalition results in strict core stability.

Proof.

Following Definition 5, it is not difficult to observe that a grand coalition exhibiting strict core stability is equivalent to a grand coalition exhibiting inner stability. Therefore, we can directly obtain the result by Theorem 4. ∎

F.1 Derivation of Dynamic Variational Strict Core

In an OSB-CAG, at any timestep $t$, under an arbitrary state $s_t \in \mathcal{S}$ along with a temporary team (including the learner $i$), denoted as $\mathcal{N}_t \subseteq \mathcal{N}$, and the temporary team's joint action $a_t \in \mathcal{A}_{\mathcal{N}_t}$, the coalition reward can be equivalently expressed as a preference value of an agent belonging to the temporary team $\mathcal{N}_t$ such that $R_j(s_t, a_t) = v_j(\mathcal{N}_t)$. The temporary team $\mathcal{N}_t$ can be interpreted as the grand coalition at any timestep $t$. Thereby, reaching strict core stability at any timestep $t$ is equivalent to maximizing the social welfare at that timestep.
In the previous work (Brânzei & Larson, 2009), the preference values are predetermined and the coalition structure is the decision variable used to reach the strict core. In this paper, by contrast, we predetermine a temporary team as the target coalition structure, and the learner $i$'s action serves as an extended decision variable that changes the preference values (coalition rewards) in order to reach the variational strict core (VSC). The VSC is defined with the same criterion as the strict core, but with different target variables as the elements of the solution set. The learner $i$'s action is generated by its policy $\pi^i$. By Assumption 3, the learner's action is able to influence teammates' actions. Therefore, the teammates' coalition rewards, as an evaluation of their policies, vary accordingly. This explains why the learner's action can be seen as a decision variable that is able to change teammates' coalition rewards.
By Lemma 1, if a dynamic affinity graph at timestep $t$ is symmetric, we can express the VSC for any timestep $t$ (under an arbitrary $s_t \in \mathcal{S}$ along with a temporary team $\mathcal{N}_t \subseteq \mathcal{N}$, a joint agent type $\theta_t \in \Theta^{|\mathcal{N}_t|}$, and the teammates' policies $\pi_t^{-i}(a_t^{-i} | s_t, \theta_t^{-i})$ with respect to their agent types and the state) to find the learner's optimal action (rather than a coalition structure, as in the previous work) as follows:

$$\texttt{VSC} := \Big\{ a^{i,*} \ \Big| \ \sum_{j \in \mathcal{N}_t} R_j(s_t, a_t^{i,*}, a_t^{-i}) \geq \sum_{j \in \mathcal{N}_t} R_j(s_t, a_t^i, a_t^{-i}), \ \forall a_t^i \in \mathcal{A}_i \Big\}. \tag{12}$$
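For a finite action set, the VSC in Eq. (12) is the set of learner actions that maximize the summed coalition rewards while the state and the teammates' actions are held fixed. A minimal sketch follows; the two reward functions and the action names are hypothetical, and the fixed state is left implicit in them.

```python
def variational_strict_core(learner_actions, teammate_actions, rewards):
    """Return the learner actions that maximise the summed coalition rewards, as in Eq. (12)."""
    welfare = {
        a_i: sum(r(a_i, teammate_actions) for r in rewards)
        for a_i in learner_actions
    }
    best = max(welfare.values())
    # Ties are kept, since the VSC is a set of actions rather than a single action.
    return {a_i for a_i, total in welfare.items() if total == best}


# Hypothetical coalition rewards for the learner and one teammate at a fixed state.
def learner_reward(a_i, a_minus_i):
    return 1 if a_i == a_minus_i[0] else 0


def teammate_reward(a_i, a_minus_i):
    return 2 if a_i == a_minus_i[0] else -1


vsc = variational_strict_core(
    ["left", "right", "wait"], ("left",), [learner_reward, teammate_reward]
)
```

In this toy case only the action matching the teammate's action maximizes the social welfare, so the VSC contains that single action.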

Note that the VSC defined in Eq. (12) implicitly assumes that the teammates' reaction is instantaneous (happening at the same timestep). Recall that our aim is to find the learner's optimal stationary policy $\pi^{i,*}$ that generates actions across timesteps (over a long horizon), in order to influence the temporary teammates occurring at any timestep to collaborate (meeting strict core stability). We now generalize the VSC defined in Eq. (12) by considering the process of generating states, teammates, agent types and teammates' actions, and name the result the dynamic variational strict core (DVSC). The DVSC is defined as follows:

$$\texttt{DVSC} := \Big\{ \pi^{i,*} \ \Big| \ \mathbb{E}_{\pi^{i,*}}\big[\sum_{t=0}^{\infty} \gamma^t \sum_{j \in \mathcal{N}_t} R_j(s_t, a_t)\big] \geq \mathbb{E}_{\pi^i}\big[\sum_{t=0}^{\infty} \gamma^t \sum_{j \in \mathcal{N}_t} R_j(s_t, a_t)\big], \ \forall s_0 \in \mathcal{S}, \ \forall \pi^i \Big\}, \tag{13}$$

where $a_t^i \sim \pi^i$ and $a_t^{-i} \sim \pi_t^{-i}$; $\mathbb{E}_{\pi^i}[\cdot]$ denotes the expectation that also implicitly depends on $\theta_t \sim P_E$ and $\mathcal{N}_t, s_t \sim P_O$.

Note that the DVSC defined in Eq. (13) weakens the implicit assumption of the VSC defined in Eq. (12). In more detail, it allows the teammates to react at successor timesteps instead of requiring an instantaneous reaction at the same timestep. Nevertheless, this requires that the learner has the potential to adapt to the teammates (through interaction with them for a period). By Assumptions 4 and 5, the learner's adaptation to the temporary teammates is possible.
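The quantity compared inside Eq. (13) is the expected discounted social welfare. As a rough sketch, the discounted sum for a single sampled trajectory with a changing team can be computed as below. The trajectory is hypothetical and truncated to a finite horizon, whereas Eq. (13) takes an expectation over an infinite horizon.

```python
def discounted_social_welfare(trajectory, gamma):
    """Sum gamma^t times the summed coalition rewards over a finite trajectory.

    Each trajectory entry maps the agents present at that timestep to their coalition rewards,
    so the team composition may change from one timestep to the next.
    """
    return sum(
        (gamma ** t) * sum(step_rewards.values())
        for t, step_rewards in enumerate(trajectory)
    )


# Hypothetical trajectory: a teammate joins at t = 1 and another leaves at t = 2.
trajectory = [
    {"learner": 1.0, "mate_a": 0.5},
    {"learner": 1.0, "mate_a": 0.5, "mate_b": 2.0},
    {"learner": 0.0, "mate_b": 1.0},
]
ret = discounted_social_welfare(trajectory, gamma=0.5)
```

Averaging this quantity over many sampled trajectories would give a Monte Carlo estimate of the expectation that a candidate policy $\pi^i$ must maximize to belong to the DVSC.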

Appendix G Mathematical Proofs

G.1 The Proof of Proposition 1

Proposition 1.

$T(\mathcal{N}_t, s_t, \theta_t | s_{t-1}, a_{t-1}, \theta_{t-1})$ for $t \geq 1$ can be expressed in terms of the following well-defined probability distributions: $P_I(\mathcal{N}_0, s_0)$, $P_T(\mathcal{N}_t, s_t | \mathcal{N}_{t-1}, s_{t-1}, a_{t-1})$ for $t \geq 1$, and $P_A(\theta_t^j | \{j\}, s_t)$ for $t \geq 0$.

Proof.

To ease the proof, we assume without loss of generality that $s_t$ and $a_t$ are discrete variables. We prove that $T(\mathcal{N}_t, s_t, \theta_t | s_{t-1}, a_{t-1}, \theta_{t-1})$ can be expressed in terms of the probability distributions we have defined, as follows:

$$\begin{split}
&\quad T(\mathcal{N}_t, s_t, \theta_t | s_{t-1}, a_{t-1}, \theta_{t-1}) \\
&= P(\theta_t | \mathcal{N}_t, s_t, s_{t-1}, a_{t-1}, \theta_{t-1}) P_O(\mathcal{N}_t, s_t | s_{t-1}, a_{t-1}, \theta_{t-1}) \\
&= P_E(\theta_t | \mathcal{N}_t, s_t) P_O(\mathcal{N}_t, s_t | s_{t-1}, a_{t-1}, \theta_{t-1}) \quad \text{(By conditional independence (1) in Assumption 1)} \\
&= P_E(\theta_t | \mathcal{N}_t, s_t) \sum_{\mathcal{N}_{t-1}} P(\mathcal{N}_t, s_t, \mathcal{N}_{t-1} | s_{t-1}, a_{t-1}, \theta_{t-1}) \\
&= P_E(\theta_t | \mathcal{N}_t, s_t) \sum_{\mathcal{N}_{t-1}} P(\mathcal{N}_t, s_t | \mathcal{N}_{t-1}, s_{t-1}, a_{t-1}, \theta_{t-1}) P(\mathcal{N}_{t-1} | s_{t-1}, a_{t-1}, \theta_{t-1}) \\
&= P_E(\theta_t | \mathcal{N}_t, s_t) \sum_{\mathcal{N}_{t-1}} P_T(\mathcal{N}_t, s_t | \mathcal{N}_{t-1}, s_{t-1}, a_{t-1}) P(\mathcal{N}_{t-1} | s_{t-1}, \theta_{t-1}) \\
&\quad \text{(By conditional independence (2) and (3) in Assumption 1)} \\
&= \prod_{j=1}^{|\mathcal{N}_t|} P_A(\theta_t^j | \{j\}, s_t) \sum_{\mathcal{N}_{t-1}} P_T(\mathcal{N}_t, s_t | \mathcal{N}_{t-1}, s_{t-1}, a_{t-1}) P(\mathcal{N}_{t-1} | s_{t-1}, \theta_{t-1}). \\
&\quad \text{(By conditional independence (4) in Assumption 1)}
\end{split}$$

To complete the above proof, we need to further show the expression of $P(\mathcal{N}_{t}|s_{t},\theta_{t})$ as follows:

$$P(\mathcal{N}_{t}|s_{t},\theta_{t})=\frac{\sum_{s_{t}}P_{E}(\theta_{t}|\mathcal{N}_{t},s_{t})P(\mathcal{N}_{t},s_{t})}{\sum_{\scriptscriptstyle{\mathcal{N}}_{t}}\sum_{s_{t}}P_{E}(\theta_{t}|\mathcal{N}_{t},s_{t})P(\mathcal{N}_{t},s_{t})}.$$

It therefore remains to prove that $P(\mathcal{N}_{t},s_{t})$ admits factorization into the probability distributions we have defined. We prove this by mathematical induction as follows:

Base case: As per the definition, $P_{I}(\mathcal{N}_{0},s_{0})$ is a predefined probability distribution to express $P(\mathcal{N}_{0},s_{0})$ for $t=0$.

Induction case: Assume the induction hypothesis that $P(\mathcal{N}_{t},s_{t})$ admits factorization into the probability distributions we have defined, for some $t\geq 0$.

Next, we aim to prove that $P(\mathcal{N}_{t+1},s_{t+1})$ admits factorization into the probability distributions we have defined, based on the induction hypothesis, such that

$$P(\mathcal{N}_{t+1},s_{t+1})=\sum_{\scriptscriptstyle{\mathcal{N}}_{t}}\sum_{s_{t}}\sum_{a_{t}}P(\mathcal{N}_{t+1},s_{t+1},\mathcal{N}_{t},s_{t},a_{t}),$$

where

$$P(\mathcal{N}_{t+1},s_{t+1},\mathcal{N}_{t},s_{t},a_{t})=P_{T}(\mathcal{N}_{t+1},s_{t+1}|\mathcal{N}_{t},s_{t},a_{t})P(\mathcal{N}_{t},s_{t})\pi_{t}(a_{t}|s_{t}).$$

Conclusion: $P(\mathcal{N}_{t},s_{t})$ is proved to admit factorization into the probability distributions we have defined for any $t\geq 0$. ∎
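The induction step can be checked numerically on a toy model. The following is a minimal sketch, not the authors' implementation: the table sizes, the random distributions, and the helper `step_marginal` are hypothetical, and the policy is taken to be stationary for simplicity, whereas the proof allows a time-dependent $\pi_{t}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes: candidate teams, states, and joint actions.
n_teams, n_states, n_actions = 3, 4, 2

# P_I(N_0, s_0): initial joint distribution, indexed as p_init[N, s].
p_init = rng.random((n_teams, n_states))
p_init /= p_init.sum()

# P_T(N', s' | N, s, a), indexed as p_trans[N, s, a, N', s'].
p_trans = rng.random((n_teams, n_states, n_actions, n_teams, n_states))
p_trans /= p_trans.sum(axis=(3, 4), keepdims=True)

# pi(a | s), indexed as policy[s, a] (assumed stationary in this sketch).
policy = rng.random((n_states, n_actions))
policy /= policy.sum(axis=1, keepdims=True)


def step_marginal(p_joint, p_trans, policy):
    """One induction step: sum P_T(N',s'|N,s,a) P(N,s) pi(a|s) over N, s, a."""
    return np.einsum("nsamz,ns,sa->mz", p_trans, p_joint, policy)


# Start from the base case and apply the induction step a few times.
p_joint = p_init
for _ in range(5):
    p_joint = step_marginal(p_joint, p_trans, policy)
```

Each step maps a valid joint distribution over $(\mathcal{N}_{t},s_{t})$ to a valid joint distribution over $(\mathcal{N}_{t+1},s_{t+1})$, using only $P_{I}$, $P_{T}$ and the policy, which is the content of the induction.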

G.2 The Proof of Theorem 1

Theorem 5 (Brânzei & Larson (2009)).

In a CAG with an affinity graph $G=\langle\mathcal{N},\mathcal{E}\rangle$, if for all $(j,k)\in\mathcal{E}$, $\bar{w}(j,k)\geq 0$, then the grand coalition is in the strict core.
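Theorem 5 can be sanity-checked by brute force on a small graph. The sketch below is illustrative only and rests on assumptions not spelled out in this section: weights are stored in a directed dictionary `w`, an agent's value for a coalition is the sum of its weights to the other members (so its value alone is zero), and "strict core" is read in the usual hedonic-game sense that no coalition weakly blocks the grand coalition. All function names are hypothetical.

```python
from itertools import combinations


def value(j, coalition, w):
    """v_j(C): sum of affinity weights from j to the other members of C."""
    return sum(w.get((j, k), 0.0) for k in coalition if k != j)


def weakly_blocks(coalition, agents, w):
    """True if every member weakly prefers `coalition` to the grand coalition
    and at least one member strictly prefers it."""
    gains = [value(j, coalition, w) - value(j, agents, w) for j in coalition]
    return all(g >= 0 for g in gains) and any(g > 0 for g in gains)


def grand_coalition_in_strict_core(agents, w):
    """Enumerate every non-empty coalition and look for a weak block."""
    for size in range(1, len(agents) + 1):
        for coalition in combinations(agents, size):
            if weakly_blocks(coalition, agents, w):
                return False
    return True


agents = (0, 1, 2)
# All weights non-negative: the premise of Theorem 5 holds.
w_nonneg = {(0, 1): 1.0, (1, 0): 0.5, (1, 2): 2.0,
            (2, 1): 0.0, (0, 2): 0.3, (2, 0): 1.2}
# One negative weight: agent 0 now prefers to leave.
w_mixed = dict(w_nonneg)
w_mixed[(0, 2)] = -2.0

stable_nonneg = grand_coalition_in_strict_core(agents, w_nonneg)
stable_mixed = grand_coalition_in_strict_core(agents, w_mixed)
```

With non-negative weights no agent can gain by dropping teammates, so no coalition blocks. In the second graph agent 0 is better off alone, which breaks stability.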

Lemma 2.

In a CAG with an affinity graph $G=\langle\mathcal{N},\mathcal{E}\rangle$ and the generalised preference value function $v_{j}(\mathcal{C})$, if the following conditions are satisfied:

$$\begin{split}w(j,k)\geq z_{jk}(\{j\}),\\ v_{j}(\{j\})=\sum_{(j,k)\in\mathcal{E},k\in\mathcal{N}}z_{jk}(\{j\}),\\ \forall(j,k)\in\mathcal{E},\end{split}\tag{14}$$

then the grand coalition is in the strict core.

Proof.

Recall that we have generalised the preference value function in this paper (see Appendix E). Theorem 5 only holds for the case where the preference value function is defined as $\bar{v}_{j}(\mathcal{C})$ in Eq. (9). As a result, we first investigate the conditions that make Theorem 5 still hold for the generalised preference value function $v_{j}(\mathcal{C})$ in Eq. (10). As discussed before, we can transform the generalised preference value function $v_{j}(\mathcal{C})$ to the feasible domain of the original preference value function $\bar{v}_{j}(\mathcal{C})$ by translation such that

$$\hat{v}_{j}(\mathcal{C})=v_{j}(\mathcal{C})-v_{j}(\{j\})=\begin{cases}0&\text{if $\mathcal{C}=\{j\}$},\\ \sum_{(j,k)\in\mathcal{E},k\in\mathcal{C}}w(j,k)-v_{j}(\{j\})&\text{otherwise}.\end{cases}$$

It is apparent that the domain of $\hat{v}_{j}(\mathcal{C})$ is aligned with that of $\bar{v}_{j}(\mathcal{C})$. Therefore, we can substitute $\hat{v}_{j}(\mathcal{C})$ for $\bar{v}_{j}(\mathcal{C})$. Since Theorem 5 only considers the grand coalition, we can temporarily ignore the case where $\mathcal{C}=\{j\}$. For any $\mathcal{C}\neq\{j\}$ of $\hat{v}_{j}(\mathcal{C})$, we can rewrite the expression $\sum_{(j,k)\in\mathcal{E},k\in\mathcal{C}}w(j,k)-v_{j}(\{j\})$ as follows:

$$\begin{split}\sum_{(j,k)\in\mathcal{E},k\in\mathcal{C}}w(j,k)-v_{j}(\{j\})&=\sum_{(j,k)\in\mathcal{E},k\in\mathcal{C}}\left\{w(j,k)-z_{jk}(\{j\})\right\}\\ &:=\sum_{(j,k)\in\mathcal{E},k\in\mathcal{C}}\hat{w}(j,k),\end{split}$$

where

$$\begin{split}\hat{w}(j,k):=w(j,k)-z_{jk}(\{j\}),\\ v_{j}(\{j\})=\sum_{(j,k)\in\mathcal{E},k\in\mathcal{C}}z_{jk}(\{j\}).\end{split}$$

By the condition that $\hat{w}(j,k)\geq 0$ for $(j,k)\in\mathcal{E}$, Theorem 5 directly gives the conditions under which the grand coalition $\mathcal{N}$ is in the strict core:

$$\begin{split}w(j,k)\geq z_{jk}(\{j\}),\\ v_{j}(\{j\})=\sum_{(j,k)\in\mathcal{E},k\in\mathcal{N}}z_{jk}(\{j\}),\\ \forall(j,k)\in\mathcal{E}.\end{split}\tag{15}$$
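As a small numerical illustration of the conditions in Eq. (15), consider an agent $j$ with two neighbours in the grand coalition. The numbers are hypothetical and chosen only for this example: $w(j,1)=3$, $w(j,2)=2$ and $v_{j}(\{j\})=4$. One admissible split of $v_{j}(\{j\})$ is

$$\begin{split}z_{j1}(\{j\})=2.5,\quad z_{j2}(\{j\})=1.5,\quad z_{j1}(\{j\})+z_{j2}(\{j\})=4=v_{j}(\{j\}),\\ \hat{w}(j,1)=3-2.5=0.5\geq 0,\quad\hat{w}(j,2)=2-1.5=0.5\geq 0,\\ \hat{v}_{j}(\mathcal{N})=3+2-4=1=\hat{w}(j,1)+\hat{w}(j,2).\end{split}$$

Both translated weights are non-negative, so the premise of Theorem 5 is met for this agent.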

Theorem 1.

In an OSB-CAG, for any dynamic affinity graph $G_{t}=\langle\mathcal{N}_{t},\mathcal{E}_{t}\rangle$ at any timestep $t$, if there exists a joint action $a_{t}\in\mathcal{A}_{\scriptscriptstyle\mathcal{N}_{t}}$ satisfying $R_{j}(a_{t}|s_{t})\geq R_{j}(a_{t}^{j}|s_{t})$ for any agent $j\in\mathcal{N}_{t}$ and any $s_{t}\in\mathcal{S}$, then the DVSC always exists.

Proof.

Without loss of generality, we consider an arbitrary dynamic affinity graph $G_{t}=\langle\mathcal{N}_{t},\mathcal{E}_{t}\rangle$ for a temporary team $\mathcal{N}_{t}\subseteq\mathcal{N}$ at an arbitrary timestep $t$. For any state $s_{t}\in\mathcal{S}$ and any joint action $a_{t}\in\mathcal{A}_{\scriptscriptstyle\mathcal{N}_{t}}$, the affinity weight $w_{jk}(a_{t}^{j},a_{t}^{k}|s_{t})$ of any $(j,k)\in\mathcal{E}_{t}$ can be represented as a corresponding $w(j,k)$ such that $w(j,k)=w_{jk}(a_{t}^{j},a_{t}^{k}|s_{t})$. Similarly, each agent $j$'s preference reward for the coalition including only itself, $R_{j}(a_{t}^{j}|s_{t})$, can also be represented as a corresponding $v_{j}(\{j\})$ such that $v_{j}(\{j\})=R_{j}(a_{t}^{j}|s_{t})$. Thereafter, we can apply Lemma 2 to the situation here at a single timestep $t$. Substituting the above variables into Eq. (14) in Lemma 2, it is not difficult to observe that if, for any state $s_{t}\in\mathcal{S}$, there exists a joint action $a_{t}\in\mathcal{A}_{\scriptscriptstyle\mathcal{N}_{t}}$ such that $\sum_{(j,k)\in\mathcal{E}_{t},k\in\mathcal{N}_{t}}w_{jk}(a_{t}^{j},a_{t}^{k}|s_{t})\geq R_{j}(a_{t}^{j}|s_{t})$, then there always exists a decomposition $R_{j}(a_{t}^{j}|s_{t})=\sum_{(j,k)\in\mathcal{E}_{t},k\in\mathcal{N}_{t}}\beta_{jk}(a_{t}^{j}|s_{t})$ satisfying the condition that $w_{jk}(a_{t}^{j},a_{t}^{k}|s_{t})\geq\beta_{jk}(a_{t}^{j}|s_{t})$ for all $(j,k)\in\mathcal{E}_{t}$ and any state $s_{t}\in\mathcal{S}$. Analogously, we can obtain the same results for all timesteps, which achieves the long-horizon objective as defined in the DVSC. Therefore, we can conclude that for any dynamic affinity graph $G_{t}=\langle\mathcal{N}_{t},\mathcal{E}_{t}\rangle$ at any timestep $t$, if there exists a joint action $a_{t}\in\mathcal{A}_{\scriptscriptstyle\mathcal{N}_{t}}$ satisfying $R_{j}(a_{t}|s_{t})=\sum_{(j,k)\in\mathcal{E}_{t},k\in\mathcal{N}_{t}}w_{jk}(a_{t}^{j},a_{t}^{k}|s_{t})\geq R_{j}(a_{t}^{j}|s_{t})$ for any agent $j\in\mathcal{N}_{t}$ and any $s_{t}\in\mathcal{S}$, then the DVSC defined in Eq. (4) always exists. ∎
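The existence claim in the proof (a split of $R_{j}(a_{t}^{j}|s_{t})$ into per-edge terms $\beta_{jk}$ with $w_{jk}\geq\beta_{jk}$) can be made constructive. The sketch below shows one possible choice, which removes an equal share of the slack from every edge. It is an illustration, not the construction used by the authors, and the function name and inputs are hypothetical.

```python
def split_singleton_reward(weights, r_single):
    """Split r_single into per-edge shares beta_k with beta_k <= weights[k].

    weights:  affinity weights w_jk on agent j's edges (hypothetical input).
    r_single: agent j's reward when acting alone, R_j(a_t^j | s_t).
    """
    if not weights:
        raise ValueError("agent j needs at least one edge")
    slack = sum(weights) - r_single
    if slack < 0:
        raise ValueError("condition sum_k w_jk >= R_j is violated")
    # Subtract an equal share of the slack from every edge weight, so the
    # shares sum to r_single and none exceeds its edge weight.
    return [w - slack / len(weights) for w in weights]


weights = [3.0, 2.0, 1.0]
betas = split_singleton_reward(weights, 4.5)
```

Because the slack is non-negative exactly when $\sum_{k}w_{jk}\geq R_{j}(a_{t}^{j}|s_{t})$, the split exists under the condition stated in the theorem. Any other way of distributing the slack with non-negative shares would work equally well.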

G.3 The Proof of Theorem 2

Lemma 3.

Under Assumption 2, it is valid to have the expressions that $Q_{jk}^{\pi^{i}}(a_{t}^{j},a_{t}^{k}|s_{t})=\mathbb{E}_{\pi^{i}}[\sum_{\tau=t}^{\infty}\gamma^{\tau-t}\alpha_{jk}(a_{\tau}^{j},a_{\tau}^{k}|s_{\tau})]$ and $Q_{j}^{\pi^{i}}(a_{t}^{j}|s_{t})=\mathbb{E}_{\pi^{i}}[\sum_{\tau=t}^{\infty}\gamma^{\tau-t}R_{j}(a_{\tau}^{j}|s_{\tau})]$, with the learner $i$'s policy $\pi^{i}$.
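To see why an infinite-horizon sum can describe agents that are present only for a finite time, the sketch below compares a discounted sum over the active steps with the same sum padded by zeros. It assumes that the pairwise reward is zero on every step after either agent has left, and it uses one deterministic reward sequence in place of the expectation. The reward values and the helper name are hypothetical.

```python
def discounted_sum(rewards, gamma):
    """Return the sum of gamma**i * rewards[i] over a finite reward list."""
    return sum(gamma ** i * r for i, r in enumerate(rewards))


gamma = 0.9
active = [1.0, 0.5, 2.0]        # rewards while both agents are present
padded = active + [0.0] * 50    # zero reward on every step after departure

q_finite = discounted_sum(active, gamma)
q_padded = discounted_sum(padded, gamma)
```

The two values coincide, so extending the horizon beyond the departure step does not change the return under this assumption.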

Proof.

Suppose that agent $j$ or $k$ leaves the environment at timestep $T$. Assumption 2 then gives $\alpha_{jk}(a_{\tau}^{j},a_{\tau}^{k}|s_{\tau})=0$ for $\tau\geq T$, so we obtain $Q_{jk}^{\pi^{i}}(a_{t}^{j},a_{t}^{k}|s_{t})=\mathbb{E}_{\pi^{i}}\big[\sum_{\tau=t}^{\infty}\gamma^{\tau-t}\alpha_{jk}(a_{\tau}^{j},a_{\tau}^{k}|s_{\tau})\big]$ as follows:

\begin{equation*}
\begin{split}
Q_{jk}^{\pi^{i}}(a_{t}^{j},a_{t}^{k}|s_{t}) &= \mathbb{E}_{\pi^{i}}\Big[\sum_{\tau=t}^{T}\gamma^{\tau-t}\alpha_{jk}(a_{\tau}^{j},a_{\tau}^{k}|s_{\tau})\Big] \\
&= \mathbb{E}_{\pi^{i}}\Big[\sum_{\tau=t}^{T}\gamma^{\tau-t}\alpha_{jk}(a_{\tau}^{j},a_{\tau}^{k}|s_{\tau}) + \underbrace{\sum_{\tau=T}^{\infty}\gamma^{\tau-t}\alpha_{jk}(a_{\tau}^{j},a_{\tau}^{k}|s_{\tau})}_{=0 \text{ by Assumption 2}}\Big] \\
&= \mathbb{E}_{\pi^{i}}\Big[\sum_{\tau=t}^{\infty}\gamma^{\tau-t}\alpha_{jk}(a_{\tau}^{j},a_{\tau}^{k}|s_{\tau})\Big].
\end{split}
\end{equation*}

Similarly, by the condition in Assumption 2 that $R_{j}(a_{\tau}^{j}|s_{\tau})=0$ for $\tau\geq T^{\prime}$ if agent $j$ leaves the environment at timestep $T^{\prime}$, we can derive the result that $Q_{j}^{\pi^{i}}(a_{t}^{j}|s_{t})=\mathbb{E}_{\pi^{i}}\big[\sum_{\tau=t}^{\infty}\gamma^{\tau-t}R_{j}(a_{\tau}^{j}|s_{\tau})\big]$. ∎
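The argument of Lemma 3 can be checked numerically on a toy reward sequence. The following is a minimal sketch, not the authors' code: the discount factor, the departure step `T`, and the values in `alpha_before` are hypothetical, the expectation is replaced by a single deterministic trajectory, and a long zero-padded tail stands in for the infinite horizon.

```python
def discounted_sum(rewards, gamma):
    """Return sum_k gamma**k * rewards[k] for a finite list of rewards."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


gamma = 0.9  # hypothetical discount factor
T = 4        # hypothetical departure step, counted from t = 0

# Hypothetical pairwise contributions alpha_jk on the steps before departure.
alpha_before = [1.0, 0.5, -0.2, 0.8]
assert len(alpha_before) == T

# Assumption 2: alpha_jk is zero from step T onward.
# The long zero tail is a stand-in for the infinite sum.
alpha_padded = alpha_before + [0.0] * 1000

truncated = discounted_sum(alpha_before, gamma)
extended = discounted_sum(alpha_padded, gamma)
print(truncated, extended)
```

Because every term after the departure step is zero, the truncated and extended sums coincide, which is the identity the lemma relies on.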

Theorem 2.

Under Assumption 2, if $w_{jk}(a_{\tau}^{j},a_{\tau}^{k}|s_{\tau})=\alpha_{jk}(a_{\tau}^{j},a_{\tau}^{k}|s_{\tau})+\beta_{jk}(a_{\tau}^{j}|s_{\tau})$, then the joint Q-value of the learner's policy $\pi^{i}$ can be expressed as follows:

\begin{equation*}
\begin{split}
Q^{\pi^{i}}(s_{t},a_{t}) &= \sum_{(j,k)\in\mathcal{E}_{t}}Q_{jk}^{\pi^{i}}(a_{t}^{j},a_{t}^{k}|s_{t})+\sum_{j\in\mathcal{N}_{t}}Q_{j}^{\pi^{i}}(a_{t}^{j}|s_{t}) \\
&= \sum_{j\in\mathcal{N}_{t}}\Big\{\sum_{(j,k)\in\mathcal{E}_{t}}Q_{jk}^{\pi^{i}}(a_{t}^{j},a_{t}^{k}|s_{t})+Q_{j}^{\pi^{i}}(a_{t}^{j}|s_{t})\Big\} \\
&:= \sum_{j\in\mathcal{N}_{t}}Q_{j}^{\pi^{i}}(a_{t}|s_{t}),
\end{split}
\end{equation*}

where $Q_{jk}^{\pi^{i}}(a_{t}^{j},a_{t}^{k}|s_{t})=\mathbb{E}_{\pi^{i}}\big[\sum_{\tau=t}^{\infty}\gamma^{\tau-t}\alpha_{jk}(a_{\tau}^{j},a_{\tau}^{k}|s_{\tau})\big]$ and $Q_{j}^{\pi^{i}}(a_{t}^{j}|s_{t})=\mathbb{E}_{\pi^{i}}\big[\sum_{\tau=t}^{\infty}\gamma^{\tau-t}R_{j}(a_{\tau}^{j}|s_{\tau})\big]$.
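As a small illustration of the statement, consider a hypothetical team of two agents, $\mathcal{N}_{t}=\{1,2\}$, and assume that edges are treated as ordered pairs, $\mathcal{E}_{t}=\{(1,2),(2,1)\}$. Both choices are assumptions made for this sketch only. Under them, the decomposition would read:

```latex
\begin{aligned}
Q^{\pi^{i}}(s_{t},a_{t})
&= Q_{12}^{\pi^{i}}(a_{t}^{1},a_{t}^{2}|s_{t}) + Q_{21}^{\pi^{i}}(a_{t}^{2},a_{t}^{1}|s_{t})
 + Q_{1}^{\pi^{i}}(a_{t}^{1}|s_{t}) + Q_{2}^{\pi^{i}}(a_{t}^{2}|s_{t}) \\
&= \underbrace{Q_{12}^{\pi^{i}}(a_{t}^{1},a_{t}^{2}|s_{t}) + Q_{1}^{\pi^{i}}(a_{t}^{1}|s_{t})}_{Q_{1}^{\pi^{i}}(a_{t}|s_{t})}
 + \underbrace{Q_{21}^{\pi^{i}}(a_{t}^{2},a_{t}^{1}|s_{t}) + Q_{2}^{\pi^{i}}(a_{t}^{2}|s_{t})}_{Q_{2}^{\pi^{i}}(a_{t}|s_{t})}.
\end{aligned}
```

Each agent's preference Q-value thus collects its own individual term together with the pairwise terms on the edges that start at that agent.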

Proof.

By Assumption 2 and Lemma 3, for any state $s_{t}\in\mathcal{S}$ and any joint action $a_{t}\in\mathcal{A}_{\mathcal{N}_{t}}$, the joint Q-value under the learner $i$'s policy $\pi^{i}$, $Q^{\pi^{i}}(s_{t},a_{t})$, can be expressed as follows:

\begin{equation}
\begin{split}
Q^{\pi^{i}}(s_{t},a_{t}) &= \mathbb{E}_{\pi^{i}}\Big[\sum_{\tau=t}^{\infty}\gamma^{\tau-t}R(s_{\tau},a_{\tau})\Big] \\
&= \mathbb{E}_{\pi^{i}}\Big[\sum_{\tau=t}^{\infty}\gamma^{\tau-t}\sum_{j\in\mathcal{N}_{\tau}}R_{j}(a_{\tau}|s_{\tau})\Big] \\
&= \mathbb{E}_{\pi^{i}}\Big[\sum_{\tau=t}^{\infty}\gamma^{\tau-t}\sum_{j\in\mathcal{N}_{\tau}}\Big(\sum_{(j,k)\in\mathcal{E}_{\tau}}\alpha_{jk}(a_{\tau}^{j},a_{\tau}^{k}|s_{\tau})+R_{j}(a_{\tau}^{j}|s_{\tau})\Big)\Big] \\
&= \mathbb{E}_{\pi^{i}}\bigg[\sum_{\tau=t}^{\infty}\gamma^{\tau-t}\bigg(\sum_{j\in\mathcal{N}_{\tau}}\Big(\sum_{(j,k)\in\mathcal{E}_{\tau}}\alpha_{jk}(a_{\tau}^{j},a_{\tau}^{k}|s_{\tau})+R_{j}(a_{\tau}^{j}|s_{\tau})\Big) \\
&\qquad\qquad + \underbrace{\sum_{j\in\mathcal{N}_{t}\backslash\mathcal{N}_{\tau}}\Big(\sum_{(j,k)\in\mathcal{E}_{t}\backslash\mathcal{E}_{\tau}}\alpha_{jk}(a_{\tau}^{j},a_{\tau}^{k}|s_{\tau})+R_{j}(a_{\tau}^{j}|s_{\tau})\Big)}_{=0 \text{ by Assumption 2}}\bigg)\bigg] \\
&= \mathbb{E}_{\pi^{i}}\Big[\sum_{\tau=t}^{\infty}\gamma^{\tau-t}\sum_{j\in\mathcal{N}_{t}}\Big(\sum_{(j,k)\in\mathcal{E}_{t}}\alpha_{jk}(a_{\tau}^{j},a_{\tau}^{k}|s_{\tau})+R_{j}(a_{\tau}^{j}|s_{\tau})\Big)\Big] \\
&= \sum_{j\in\mathcal{N}_{t}}\bigg\{\sum_{(j,k)\in\mathcal{E}_{t}}\underbrace{\mathbb{E}_{\pi^{i}}\Big[\sum_{\tau=t}^{\infty}\gamma^{\tau-t}\alpha_{jk}(a_{\tau}^{j},a_{\tau}^{k}|s_{\tau})\Big]}_{=Q_{jk}^{\pi^{i}}(a_{t}^{j},a_{t}^{k}|s_{t}) \text{ by Lemma 3}}+\underbrace{\mathbb{E}_{\pi^{i}}\Big[\sum_{\tau=t}^{\infty}\gamma^{\tau-t}R_{j}(a_{\tau}^{j}|s_{\tau})\Big]}_{=Q_{j}^{\pi^{i}}(a_{t}^{j}|s_{t}) \text{ by Lemma 3}}\bigg\} \\
&= \sum_{j\in\mathcal{N}_{t}}\bigg\{\sum_{(j,k)\in\mathcal{E}_{t}}Q_{jk}^{\pi^{i}}(a_{t}^{j},a_{t}^{k}|s_{t})+Q_{j}^{\pi^{i}}(a_{t}^{j}|s_{t})\bigg\} \\
&= \sum_{j\in\mathcal{N}_{t}}\sum_{(j,k)\in\mathcal{E}_{t}}Q_{jk}^{\pi^{i}}(a_{t}^{j},a_{t}^{k}|s_{t})+\sum_{j\in\mathcal{N}_{t}}Q_{j}^{\pi^{i}}(a_{t}^{j}|s_{t}) \\
&= \sum_{(j,k)\in\mathcal{E}_{t}}Q_{jk}^{\pi^{i}}(a_{t}^{j},a_{t}^{k}|s_{t})+\sum_{j\in\mathcal{N}_{t}}Q_{j}^{\pi^{i}}(a_{t}^{j}|s_{t}).
\end{split}
\tag{16}
\end{equation}
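The derivation of Eq. (16) can be sanity-checked on a toy instance. The sketch below is illustrative only and is not taken from the authors' implementation: the three-agent team, the departure step of agent 2, the random per-step values, and the finite horizon `H` are all hypothetical. Edges are assumed to be ordered pairs, and the expectation is replaced by one deterministic trajectory. The left-hand side sums rewards over the agents and edges present at each step, and the right-hand side uses the fixed graph at the starting step with values zeroed after departure (Assumption 2).

```python
import itertools
import random

random.seed(0)
gamma = 0.95
H = 6                 # finite horizon standing in for the infinite sum
agents = [0, 1, 2]    # N_t: agents present at the starting step
leave_step = {2: 3}   # hypothetical: agent 2 leaves at step 3


def present(j, step):
    """True if agent j is still in the environment at the given step."""
    return step < leave_step.get(j, H)


# E_t: ordered pairs over the initial team (an assumption of this sketch).
edges = list(itertools.permutations(agents, 2))

# Hypothetical per-step values, set to zero once an agent has left.
alpha = {
    (j, k): [random.uniform(-1, 1) if present(j, s) and present(k, s) else 0.0
             for s in range(H)]
    for (j, k) in edges
}
R = {j: [random.uniform(0, 1) if present(j, s) else 0.0 for s in range(H)]
     for j in agents}


def disc(xs):
    """Discounted sum of a finite sequence."""
    return sum(gamma ** s * x for s, x in enumerate(xs))


# Left-hand side: at each step, sum only over agents and edges still present.
lhs = 0.0
for s in range(H):
    N_s = [j for j in agents if present(j, s)]
    step_reward = 0.0
    for j in N_s:
        step_reward += sum(alpha[(j, k)][s] for k in N_s if k != j) + R[j][s]
    lhs += gamma ** s * step_reward

# Right-hand side: pairwise plus individual Q-values over the initial graph.
rhs = sum(disc(alpha[e]) for e in edges) + sum(disc(R[j]) for j in agents)
print(lhs, rhs)
```

The two quantities agree up to floating-point rounding, since the terms associated with departed agents contribute zero on both sides.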

Following the form of the Bellman optimality equation, for any state $s_{t}\in\mathcal{S}$ and any joint action $a_{t}\in\mathcal{A}_{\mathcal{N}_{t}}$, we can write each agent $j$'s preference Q-value under the learner $i$'s policy $\pi^{i}$, denoted $Q_{j}^{\pi^{i}}(a_{t}|s_{t})$, as follows:

\begin{equation}
\begin{split}
Q_{j}^{\pi^{i}}(s_{t},a_{t})
&=\mathbb{E}_{\pi^{i}}\Big[\sum_{\tau=t}^{\infty}\gamma^{\tau-t}R_{j}(a_{\tau}|s_{\tau})\Big]\\
&=\mathbb{E}_{\pi^{i}}\Big[\sum_{\tau=t}^{\infty}\gamma^{\tau-t}\Big(\sum_{(j,k)\in\mathcal{E}_{\tau}}\alpha_{jk}(a_{\tau}^{j},a_{\tau}^{k}|s_{\tau})+R_{j}(a_{\tau}^{j}|s_{\tau})\Big)\Big]\\
&=\mathbb{E}_{\pi^{i}}\Big[\sum_{\tau=t}^{\infty}\gamma^{\tau-t}\Big(\sum_{(j,k)\in\mathcal{E}_{\tau}}\alpha_{jk}(a_{\tau}^{j},a_{\tau}^{k}|s_{\tau})\Big)\Big]+\mathbb{E}_{\pi^{i}}\Big[\sum_{\tau=t}^{\infty}\gamma^{\tau-t}R_{j}(a_{\tau}^{j}|s_{\tau})\Big]\\
&=\mathbb{E}_{\pi^{i}}\Big[\sum_{\tau=t}^{\infty}\gamma^{\tau-t}\Big(\sum_{(j,k)\in\mathcal{E}_{\tau}}\alpha_{jk}(a_{\tau}^{j},a_{\tau}^{k}|s_{\tau})+\underbrace{\sum_{(j,k)\in\mathcal{E}_{t}\backslash\mathcal{E}_{\tau}}\alpha_{jk}(a_{\tau}^{j},a_{\tau}^{k}|s_{\tau})}_{=0\text{ by Assumption 2}}\Big)\Big]+\mathbb{E}_{\pi^{i}}\Big[\sum_{\tau=t}^{\infty}\gamma^{\tau-t}R_{j}(a_{\tau}^{j}|s_{\tau})\Big]\\
&=\mathbb{E}_{\pi^{i}}\Big[\sum_{\tau=t}^{\infty}\gamma^{\tau-t}\Big(\sum_{(j,k)\in\mathcal{E}_{t}}\alpha_{jk}(a_{\tau}^{j},a_{\tau}^{k}|s_{\tau})\Big)\Big]+\mathbb{E}_{\pi^{i}}\Big[\sum_{\tau=t}^{\infty}\gamma^{\tau-t}R_{j}(a_{\tau}^{j}|s_{\tau})\Big]\\
&=\sum_{(j,k)\in\mathcal{E}_{t}}\underbrace{\mathbb{E}_{\pi^{i}}\Big[\sum_{\tau=t}^{\infty}\gamma^{\tau-t}\alpha_{jk}(a_{\tau}^{j},a_{\tau}^{k}|s_{\tau})\Big]}_{=Q_{jk}^{\pi^{i}}(a_{t}^{j},a_{t}^{k}|s_{t})\text{ by Lemma 3}}+\underbrace{\mathbb{E}_{\pi^{i}}\Big[\sum_{\tau=t}^{\infty}\gamma^{\tau-t}R_{j}(a_{\tau}^{j}|s_{\tau})\Big]}_{=Q_{j}^{\pi^{i}}(a_{t}^{j}|s_{t})\text{ by Lemma 3}}\\
&=\sum_{(j,k)\in\mathcal{E}_{t}}Q_{jk}^{\pi^{i}}(a_{t}^{j},a_{t}^{k}|s_{t})+Q_{j}^{\pi^{i}}(a_{t}^{j}|s_{t}).
\end{split}
\tag{17}
\end{equation}

By substituting the expression of $Q_{j}^{\pi^{i}}(s_{t},a_{t})$ derived in Eq. (17) into Eq. (16), we can get the following relation:

\begin{equation}
Q^{\pi^{i}}(s_{t},a_{t})=\sum_{j\in\mathcal{N}_{t}}Q_{j}^{\pi^{i}}(a_{t}|s_{t}).
\tag{18}
\end{equation}
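The decomposition in Eqs. (17) and (18) can be checked numerically for a single fixed state and joint action. The following is a minimal sketch, not the authors' implementation: it assumes edges are stored as directed pairs $(j,k)$, so that the sum over $(j,k)\in\mathcal{E}_{t}$ for a fixed agent $j$ ranges over edges leaving $j$. The helper names (`per_agent_q`, `joint_q`), the small star graph, and the random scalar Q-values are all hypothetical.

```python
import random


def per_agent_q(j, edges, q_pair, q_ind):
    """Eq. (17): pairwise Q-values on edges leaving agent j, plus j's individual Q-value."""
    return sum(q_pair[(a, b)] for (a, b) in edges if a == j) + q_ind[j]


def joint_q(nodes, edges, q_pair, q_ind):
    """Eq. (18): the joint Q-value as the sum of per-agent preference Q-values."""
    return sum(per_agent_q(j, edges, q_pair, q_ind) for j in nodes)


# Hypothetical team: learner 0 and teammates 1, 2, connected as a star
# (assumption: each undirected link is stored as two directed edges).
nodes = [0, 1, 2]
edges = [(0, 1), (1, 0), (0, 2), (2, 0)]

# Hypothetical scalar Q-values for one fixed (s_t, a_t).
rng = random.Random(0)
q_pair = {e: rng.uniform(-1.0, 1.0) for e in edges}
q_ind = {j: rng.uniform(-1.0, 1.0) for j in nodes}

lhs = joint_q(nodes, edges, q_pair, q_ind)          # sum over agents of Eq. (17)
rhs = sum(q_pair.values()) + sum(q_ind.values())    # final form of Eq. (16)
assert abs(lhs - rhs) < 1e-12
```

Under these assumptions, summing the per-agent values over all agents visits every directed edge exactly once, which is why the double sum in Eq. (16) collapses to a single sum over $\mathcal{E}_{t}$.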

G.4 The Proof of the Conditions of Symmetry for Various Dynamic Affinity Graphs

Proposition 3.

For the learner $i$ and any teammate $j$ or $k$, the constraints $R_{i}(a_{t}^{i}|s_{t})=\sum_{j\in-i}R_{j}(a_{t}^{j}|s_{t})$ and $\alpha_{jk}(a_{t}^{j},a_{t}^{k}|s_{t})=\alpha_{kj}(a_{t}^{k},a_{t}^{j}|s_{t})$, for any $a_{t}\in\mathcal{A}_{\mathcal{N}_{t}}$ and $s_{t}\in\mathcal{S}$, are necessary for a star dynamic affinity graph to be symmetric.

Proof.

Recall that a symmetric dynamic affinity graph $G_{t}=\langle\mathcal{N}_{t},\mathcal{E}_{t}\rangle$ must satisfy $w_{jk}(a_{t}^{j},a_{t}^{k}|s_{t})=w_{kj}(a_{t}^{k},a_{t}^{j}|s_{t})$ for all $(j,k)\in\mathcal{E}_{t}$, any state $s_{t}\in\mathcal{S}$, and any joint action $a_{t}\in\mathcal{A}_{\mathcal{N}_{t}}$. When the dynamic affinity graph is a star graph, the affinity weights of any $(i,j)\in\mathcal{E}_{t}$ or $(j,i)\in\mathcal{E}_{t}$ can be represented as follows:

\begin{equation*}
\begin{split}
w_{ij}(a_{t}^{i},a_{t}^{j}|s_{t})&=\alpha_{ij}(a_{t}^{i},a_{t}^{j}|s_{t})+\beta_{ij}(a_{t}^{i}|s_{t}),\quad\text{where }R_{i}(a_{t}^{i}|s_{t})=\sum_{j\in-i}\beta_{ij}(a_{t}^{i}|s_{t}),\\
w_{ji}(a_{t}^{j},a_{t}^{i}|s_{t})&=\alpha_{ji}(a_{t}^{j},a_{t}^{i}|s_{t})+\beta_{ji}(a_{t}^{j}|s_{t}),\quad\text{where }R_{j}(a_{t}^{j}|s_{t})=\beta_{ji}(a_{t}^{j}|s_{t}).
\end{split}
\end{equation*}

It is not difficult to observe that, for all $s_{t}\in\mathcal{S}$ and $a_{t}\in\mathcal{A}_{\mathcal{N}_{t}}$, the conditions

\begin{equation*}
\begin{split}
\alpha_{ij}(a_{t}^{i},a_{t}^{j}|s_{t})&=\alpha_{ji}(a_{t}^{j},a_{t}^{i}|s_{t}),\\
R_{i}(a_{t}^{i}|s_{t})&=\sum_{j\in-i}R_{j}(a_{t}^{j}|s_{t}),
\end{split}
\end{equation*}

are necessary for the star dynamic affinity graph to be symmetric. In more detail, $R_{i}(a_{t}^{i}|s_{t})=\sum_{j\in-i}R_{j}(a_{t}^{j}|s_{t})$ is a necessary condition for the existence of the one-to-one correspondence $\beta_{ij}(a_{t}^{i}|s_{t})=\beta_{ji}(a_{t}^{j}|s_{t})=R_{j}(a_{t}^{j}|s_{t})$: if this correspondence holds, summing it over $j\in-i$ and using $R_{i}(a_{t}^{i}|s_{t})=\sum_{j\in-i}\beta_{ij}(a_{t}^{i}|s_{t})$ yields the constraint. ∎
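The construction used in the proof of Proposition 3 can be illustrated with scalar values for one fixed state and joint action. This is a hedged sketch under stated assumptions rather than part of the original proof: the function name `star_weights`, the team of three teammates, and all numeric values are hypothetical, and the one-to-one correspondence $\beta_{ij}=\beta_{ji}=R_{j}$ is imposed directly.

```python
def star_weights(learner, teammates, alpha, R):
    """Affinity weights of a star graph with beta_ij = beta_ji = R_j on every edge."""
    w = {}
    for j in teammates:
        beta_ij = R[j]  # learner's share on edge (i, j), matched to teammate j's reward
        beta_ji = R[j]  # teammate j has a single edge, so beta_ji = R_j
        w[(learner, j)] = alpha[(learner, j)] + beta_ij
        w[(j, learner)] = alpha[(j, learner)] + beta_ji
    return w


# Hypothetical values for one fixed (s_t, a_t).
learner, teammates = 0, [1, 2, 3]
R = {1: 0.5, 2: -0.25, 3: 1.0}

# Constraint R_i = sum_{j in -i} R_j from Proposition 3.
R[learner] = sum(R[j] for j in teammates)

# Constraint alpha_ij = alpha_ji from Proposition 3.
alpha = {}
for j, value in zip(teammates, [0.3, -0.7, 0.1]):
    alpha[(learner, j)] = value
    alpha[(j, learner)] = value

w = star_weights(learner, teammates, alpha, R)

# Every edge of the star graph is symmetric under the two constraints.
assert all(w[(learner, j)] == w[(j, learner)] for j in teammates)
```

In this sketch the learner's shares $\beta_{ij}$ sum to $R_{i}$ only because $R_{i}$ was set to the sum of the teammates' rewards, which mirrors the role the constraint plays in the proposition.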

Proposition 4.

For any two agents $j$ and $k$, the constraints $R_{j}(a_{t}^{j}|s_{t})=R_{k}(a_{t}^{k}|s_{t})$ and $\alpha_{jk}(a_{t}^{j},a_{t}^{k}|s_{t})=\alpha_{kj}(a_{t}^{k},a_{t}^{j}|s_{t})$, for any $a_{t}\in\mathcal{A}_{\mathcal{N}_{t}}$ and $s_{t}\in\mathcal{S}$, are necessary for the complete dynamic affinity graph to be symmetric.

Proof.

Recall that a symmetric dynamic affinity graph $G_{t}=\langle\mathcal{N}_{t},\mathcal{E}_{t}\rangle$ must satisfy $w_{jk}(a_{t}^{j},a_{t}^{k}|s_{t})=w_{kj}(a_{t}^{k},a_{t}^{j}|s_{t})$ for all $(j,k)\in\mathcal{E}_{t}$, any state $s_{t}\in\mathcal{S}$, and any joint action $a_{t}\in\mathcal{A}_{\mathcal{N}_{t}}$. When the dynamic affinity graph is a complete graph, the affinity weights of any $(j,k)\in\mathcal{E}_{t}$ can be represented as follows:

wjk(atj,atk|st)=αjk(atj,atk|st)+βjk(atj|st),where Rj(atj|st)=kjβjk(atj|st).formulae-sequencesubscript𝑤𝑗𝑘superscriptsubscript𝑎𝑡𝑗conditionalsuperscriptsubscript𝑎𝑡𝑘subscript𝑠𝑡subscript𝛼𝑗𝑘superscriptsubscript𝑎𝑡𝑗conditionalsuperscriptsubscript𝑎𝑡𝑘subscript𝑠𝑡subscript𝛽𝑗𝑘conditionalsuperscriptsubscript𝑎𝑡𝑗subscript𝑠𝑡where subscript𝑅𝑗conditionalsuperscriptsubscript𝑎𝑡𝑗subscript𝑠𝑡subscript𝑘𝑗subscript𝛽𝑗𝑘conditionalsuperscriptsubscript𝑎𝑡𝑗subscript𝑠𝑡w_{jk}(a_{t}^{j},a_{t}^{k}|s_{t})=\alpha_{jk}(a_{t}^{j},a_{t}^{k}|s_{t})+\beta% _{jk}(a_{t}^{j}|s_{t}),\text{where }R_{j}(a_{t}^{j}|s_{t})=\sum_{k\in-j}\beta_% {jk}(a_{t}^{j}|s_{t}).italic_w start_POSTSUBSCRIPT italic_j italic_k end_POSTSUBSCRIPT ( italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT , italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT | italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) = italic_α start_POSTSUBSCRIPT italic_j italic_k end_POSTSUBSCRIPT ( italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT , italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT | italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) + italic_β start_POSTSUBSCRIPT italic_j italic_k end_POSTSUBSCRIPT ( italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT | italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) , where italic_R start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT | italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) = ∑ start_POSTSUBSCRIPT italic_k ∈ - italic_j end_POSTSUBSCRIPT italic_β start_POSTSUBSCRIPT italic_j italic_k end_POSTSUBSCRIPT ( italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j 
end_POSTSUPERSCRIPT | italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) .

It is not difficult to observe that, for all $s_t \in \mathcal{S}$ and $a_t \in \mathcal{A}_{\mathcal{N}_t}$, the conditions

$$
\begin{split}
\alpha_{jk}(a_t^j, a_t^k | s_t) &= \alpha_{kj}(a_t^k, a_t^j | s_t), \\
R_j(a_t^j | s_t) &= R_k(a_t^k | s_t),
\end{split}
$$

are necessary for the complete dynamic affinity graph to be symmetric. In more detail, $R_j(a_t^j | s_t) = \sum_{k \in -j} \beta_{jk}(a_t^j | s_t) = \sum_{j \in -k} \beta_{kj}(a_t^k | s_t) = R_k(a_t^k | s_t)$ is a necessary condition for the existence of the one-to-one correspondence $\beta_{jk}(a_t^j | s_t) = \beta_{kj}(a_t^k | s_t)$. ∎
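
As a purely illustrative numerical check of the decomposition $w_{jk} = \alpha_{jk} + \beta_{jk}$ (not part of the proof), the following Python sketch builds the affinity weights for one fixed state and joint action and tests whether they are symmetric. The uniform split $\beta_{jk} = R_j / (n-1)$, which satisfies $R_j = \sum_{k \in -j} \beta_{jk}$, as well as the function names and the numbers, are assumptions made only for this example.

```python
from itertools import permutations


def affinity_weights(alpha, reward):
    """Return w[(j, k)] = alpha[j][k] + beta_jk for a complete graph.

    Assumption for illustration: beta_jk = reward[j] / (n - 1), i.e. each
    agent's individual reward R_j is split uniformly over its n - 1 edges.
    """
    n = len(reward)
    return {
        (j, k): alpha[j][k] + reward[j] / (n - 1)
        for j, k in permutations(range(n), 2)
    }


def is_symmetric(weights, tol=1e-9):
    """Check w_jk == w_kj for every ordered pair of distinct agents."""
    return all(abs(weights[(j, k)] - weights[(k, j)]) <= tol for (j, k) in weights)


# Hypothetical symmetric pairwise terms for three agents.
ALPHA = [
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 3.0],
    [2.0, 3.0, 0.0],
]

equal_rewards = affinity_weights(ALPHA, [0.6, 0.6, 0.6])
unequal_rewards = affinity_weights(ALPHA, [0.6, 0.2, 0.6])
```

Under this uniform split, $w_{jk} - w_{kj} = (R_j - R_k)/(n-1)$ whenever $\alpha$ is symmetric, so the graph built from equal rewards is symmetric while the one built from unequal rewards is not.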

G.5 The Proof of Theorem 3

Theorem 3.

Under Assumption 2 and an arbitrary learner’s deterministic stationary policy $\pi^i$, the Bellman equation for the OSB-CAG with DVSC as a solution concept is expressed as follows:

$$Q^{\pi^i}(s_t, a_t) = R(s_t, a_t) + \gamma \mathbb{E}_{\mathcal{N}_{t+1}, s_{t+1} \sim P_O}\Big[\mathbb{E}_{\theta_{t+1} \sim P_E,\ a_{t+1} \sim \pi_{t+1}}\big[Q^{\pi^i}(s_{t+1}, a_{t+1})\big]\Big].$$

Proof.

We derive Eq. (6) as follows.

By the result of Theorem 2, we can represent the joint Q-value under an arbitrary learner’s deterministic stationary policy $\pi^i$, referred to as $Q^{\pi^i}(s_t, a_t)$, as follows:

$$Q^{\pi^i}(s_t, a_t) = \sum_{j \in \mathcal{N}_t} Q^{\pi^i}_j(a_t | s_t). \tag{19}$$

Next, we can expand the preference Q-value of each agent $j \in \mathcal{N}_t$ following the fashion of the Bellman equation such that

$$Q^{\pi^i}_j(a_t | s_t) = R_j(a_t | s_t) + \gamma \mathbb{E}_{\mathcal{N}_{t+1}, s_{t+1} \sim P_O}\Big[\mathbb{E}_{\substack{\theta_{t+1} \sim P_E, \\ a_{t+1} \sim \pi_{t+1}}}\big[Q^{\pi^i}_j(a_{t+1} | s_{t+1})\big]\Big]. \tag{20}$$

Then, we can sum up Eq. (20) over all agents belonging to the temporary team $\mathcal{N}_t$ and obtain an equation that evaluates the influence of the learner’s policy $\pi^i$ on a temporary team $\mathcal{N}_t$ such that

$$
\begin{split}
Q^{\pi^i}(s_t, a_t) &= \sum_{j \in \mathcal{N}_t} Q^{\pi^i}_j(a_t | s_t) \\
&= \sum_{j \in \mathcal{N}_t} R_j(a_t | s_t) + \sum_{j \in \mathcal{N}_t} \gamma \mathbb{E}_{\mathcal{N}_{t+1}, s_{t+1} \sim P_O}\Big[\mathbb{E}_{\substack{\theta_{t+1} \sim P_E, \\ a_{t+1} \sim \pi_{t+1}}}\big[Q^{\pi^i}_j(a_{t+1} | s_{t+1})\big]\Big] \\
&= R(s_t, a_t) + \underbrace{\sum_{j \in \mathcal{N}_{t+1}} \gamma \mathbb{E}_{\mathcal{N}_{t+1}, s_{t+1} \sim P_O}\Big[\mathbb{E}_{\substack{\theta_{t+1} \sim P_E, \\ a_{t+1} \sim \pi_{t+1}}}\big[Q^{\pi^i}_j(a_{t+1} | s_{t+1})\big]\Big]}_{\text{Since $Q^{\pi^i}_j(a_{t+1} | s_{t+1}) = 0$ for agent $j \in \mathcal{N}_t \backslash \mathcal{N}_{t+1}$ by Assumption 2.}} \\
&= R(s_t, a_t) + \gamma \mathbb{E}_{\mathcal{N}_{t+1}, s_{t+1} \sim P_O}\Big[\mathbb{E}_{\substack{\theta_{t+1} \sim P_E, \\ a_{t+1} \sim \pi_{t+1}}}\big[\sum_{j \in \mathcal{N}_{t+1}} Q^{\pi^i}_j(a_{t+1} | s_{t+1})\big]\Big] \\
&= R(s_t, a_t) + \gamma \mathbb{E}_{\mathcal{N}_{t+1}, s_{t+1} \sim P_O}\Big[\mathbb{E}_{\substack{\theta_{t+1} \sim P_E, \\ a_{t+1} \sim \pi_{t+1}}}\big[Q^{\pi^i}(s_{t+1}, a_{t+1})\big]\Big].
\end{split} \tag{21}
$$

Note that Eq. (21) does not hold if $\mathcal{N}_t \subset \mathcal{N}_{t+1}$, since it is problematic to expand the preference Q-value of an agent $k \in \mathcal{N}_{t+1}$ but $k \notin \mathcal{N}_t$ at timestep $t$, which can be seen as a singularity of this equation. More specifically, $0 = Q^{\pi^i}_k(a_t | s_t) = R_k(a_t | s_t) + \gamma \mathbb{E}_{\mathcal{N}_{t+1}, s_{t+1} \sim P_O}\Big[\mathbb{E}_{\substack{\theta_{t+1} \sim P_E, \\ a_{t+1} \sim \pi_{t+1}}}\big[Q^{\pi^i}_k(a_{t+1} | s_{t+1})\big]\Big] > 0$ is impossible, given that at least $R_k(a_{t'} | s_{t'}) > 0$, implying agent $k$’s preference for collaborating with other agents, at a timestep $t' \geq t$. ∎
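
The summation argument behind Eq. (21), and the way it breaks when an agent joins the team, can be checked on a toy example. The Python sketch below is only an illustration under strong assumptions: the team sequence, rewards and transitions are deterministic (so every expectation collapses to a single term), the horizon is finite, and all names (`per_agent_q`, `joint_bellman_residuals`) are hypothetical rather than taken from the CIAO code.

```python
GAMMA = 0.9


def per_agent_q(teams, rewards, gamma=GAMMA):
    """Backward recursion for preference Q-values in a deterministic toy.

    teams[t] is the set of agents present at timestep t and rewards[t][j]
    is R_j at t. An agent outside teams[t] has Q_j = 0 at t.
    """
    horizon = len(teams)
    agents = set().union(*teams)
    q = [{j: 0.0 for j in agents} for _ in range(horizon + 1)]
    for t in reversed(range(horizon)):
        for j in teams[t]:
            q[t][j] = rewards[t][j] + gamma * q[t + 1][j]
    return q


def joint_bellman_residuals(teams, rewards, gamma=GAMMA):
    """Residual of Q(t) = R(t) + gamma * Q(t+1) with Q(t) = sum_{j in N_t} Q_j(t)."""
    q = per_agent_q(teams, rewards, gamma)
    horizon = len(teams)
    residuals = []
    for t in range(horizon):
        joint_q = sum(q[t][j] for j in teams[t])
        joint_r = sum(rewards[t][j] for j in teams[t])
        next_team = teams[t + 1] if t + 1 < horizon else set()
        joint_q_next = sum(q[t + 1][j] for j in next_team)
        residuals.append(joint_q - (joint_r + gamma * joint_q_next))
    return residuals


def unit_rewards(teams):
    """Hypothetical rewards: every present agent receives 1.0 per timestep."""
    return [{j: 1.0 for j in team} for team in teams]


# Agents only leave: the joint Bellman equation holds at every timestep.
shrinking = [{0, 1, 2}, {0, 1}, {0}]
# Agent 1 joins at t = 1: the joint equation fails at t = 0.
growing = [{0}, {0, 1}]

shrinking_residuals = joint_bellman_residuals(shrinking, unit_rewards(shrinking))
growing_residuals = joint_bellman_residuals(growing, unit_rewards(growing))
```

In the shrinking case every residual vanishes, mirroring the step in Eq. (21) that uses Assumption 2. In the growing case the residual at $t = 0$ equals $-\gamma Q_1$ at $t = 1$, which is the positive preference Q-value of the newcomer that the sum over $\mathcal{N}_t$ cannot account for.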

Appendix H Experimental Settings

We evaluate our proposed CIAO in two existing environments, LBF and Wolfpack, both configured with open team settings (Rahman et al., 2021). In these settings, teammates are randomly selected to enter the environment and remain for a specified number of timesteps. If a teammate surpasses its allocated lifetime, it is removed from the environment and placed in a re-entry queue with a randomly assigned waiting time. The randomized re-entry queue results in varied compositions of teammates in a temporary team. When the number of agents in the environment is below its maximum, agents in the re-entry queue are introduced to the environment. Specifically, in the Wolfpack environment, the active duration is sampled uniformly between 25 and 35 timesteps, while the dead duration is sampled uniformly between 15 and 25 timesteps. The durations for LBF are somewhat shorter, with the active duration uniformly sampled between 15 and 25 timesteps, and the dead duration between 10 and 20 timesteps.
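
The open-team process described above can be sketched as a small simulation. This is a minimal illustration, not the environment code used in the experiments: the function name, the pool of teammates, the treatment of the duration bounds as inclusive integers, and the rule that every teammate starts ready to enter are all assumptions. The learner, which is always present, is not modeled.

```python
import random

# Duration ranges (in timesteps) quoted in the text; inclusive bounds are assumed.
DURATIONS = {
    "wolfpack": {"active": (25, 35), "dead": (15, 25)},
    "lbf": {"active": (15, 25), "dead": (10, 20)},
}


def simulate_team_sizes(env, pool_size, max_teammates, num_steps, seed=0):
    """Return the number of active teammates at each timestep."""
    rng = random.Random(seed)
    active_lo, active_hi = DURATIONS[env]["active"]
    dead_lo, dead_hi = DURATIONS[env]["dead"]
    active = {}  # teammate id -> remaining active timesteps
    waiting = {i: 0 for i in range(pool_size)}  # teammate id -> remaining wait
    sizes = []
    for _ in range(num_steps):
        # Teammates whose lifetime has expired move to the re-entry queue.
        expired = [i for i, left in active.items() if left <= 0]
        for i in expired:
            del active[i]
            waiting[i] = rng.randint(dead_lo, dead_hi)
        # Free slots are filled, in random order, by teammates that finished waiting.
        ready = [i for i, left in waiting.items() if left <= 0]
        rng.shuffle(ready)
        for i in ready:
            if len(active) >= max_teammates:
                break
            del waiting[i]
            active[i] = rng.randint(active_lo, active_hi)
        sizes.append(len(active))
        # Advance time by one step.
        active = {i: left - 1 for i, left in active.items()}
        waiting = {i: max(left - 1, 0) for i, left in waiting.items()}
    return sizes


lbf_sizes = simulate_team_sizes("lbf", pool_size=6, max_teammates=4, num_steps=200)
```

Because lifetimes and waiting times are drawn independently for each teammate, the composition of the temporary team varies over an episode even though the pool of teammates is fixed.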

The teammate policies adhere to the experimental settings used for testing GPL (Rahman et al., 2021), which encompass a range of heuristic policies and pre-trained policies. Specifically, for Wolfpack, the teammate set includes the following agents: random agent, greedy agent, greedy probabilistic agent, teammate-aware agents, GNN-Based teammate-aware agents, graph DQN agents, greedy waiting agents, greedy probabilistic waiting agents, and greedy team-aware waiting agents. In the case of LBF, a combination of heuristics and A2C agents is employed as the teammate policy set. For more detailed information about teammate policies, we recommend referring to Appendix B.4 of GPL’s paper.

In our investigation of different agent-type sets within LBF experiments (see Appendix I.2), we deliberately exclude the A2C agent from the original agent-type set, thereby establishing a distinct agent-type subset. Note that the A2C agent provided by GPL is designed for scenarios with a maximum of 5 agents. For scenarios involving a greater number of agents, specifically up to 9, we therefore train an additional A2C agent.

In our experiments on the generalizability of CIAO, we construct separate agent-type sets for training and testing, for both Wolfpack and LBF. The details are shown in Tab. 1.

Table 1: Agent-type sets for training and testing in the experiments evaluating the generalizability of CIAO. The shorthand “Int” stands for the scenarios where the agent-type sets for training intersect with those for testing. The shorthand “Exc” stands for the scenarios where the agent-type sets for training and testing are mutually exclusive.

| Scenario Name | Training | Testing |
| --- | --- | --- |
| Wolfpack-Int | GreedyPredatorAgent, GreedyProbabilisticAgent, TeammateAwarePredator, DistilledCoopAStarAgent, GraphDQNAgent | GraphDQNAgent, RandomAgent, GreedyWaitingAgent, GreedyProbabilisticWaitingAgent, TeammateAwareWaitingAgent |
| Wolfpack-Exc | GreedyPredatorAgent, GreedyProbabilisticAgent, TeammateAwarePredator, DistilledCoopAStarAgent, GraphDQNAgent | RandomAgent, GreedyWaitingAgent, GreedyProbabilisticWaitingAgent, TeammateAwareWaitingAgent |
| LBF-Int | H8, H7, H6, H5, A2C0 | A2C0, H1, H2, H3, H4 |
| LBF-Exc | H8, H7, H6, H5, A2C0 | H1, H2, H3, H4 |
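
The difference between the “Int” and “Exc” scenarios in Tab. 1 can be verified mechanically with set operations. The Python sketch below copies the agent-type names from the table; the dictionary layout and the helper name `shared_agent_types` are assumptions made for illustration.

```python
WOLFPACK_TRAIN = {
    "GreedyPredatorAgent", "GreedyProbabilisticAgent", "TeammateAwarePredator",
    "DistilledCoopAStarAgent", "GraphDQNAgent",
}
WOLFPACK_TEST_EXC = {
    "RandomAgent", "GreedyWaitingAgent", "GreedyProbabilisticWaitingAgent",
    "TeammateAwareWaitingAgent",
}
LBF_TRAIN = {"H8", "H7", "H6", "H5", "A2C0"}
LBF_TEST_EXC = {"H1", "H2", "H3", "H4"}

# Agent-type sets as listed in Tab. 1.
AGENT_TYPE_SETS = {
    "Wolfpack-Int": {"train": WOLFPACK_TRAIN, "test": WOLFPACK_TEST_EXC | {"GraphDQNAgent"}},
    "Wolfpack-Exc": {"train": WOLFPACK_TRAIN, "test": WOLFPACK_TEST_EXC},
    "LBF-Int": {"train": LBF_TRAIN, "test": LBF_TEST_EXC | {"A2C0"}},
    "LBF-Exc": {"train": LBF_TRAIN, "test": LBF_TEST_EXC},
}


def shared_agent_types(scenario):
    """Agent types that appear in both the training and the testing set."""
    sets = AGENT_TYPE_SETS[scenario]
    return sets["train"] & sets["test"]


for name in AGENT_TYPE_SETS:
    print(name, sorted(shared_agent_types(name)))
```

In both “Int” scenarios the training and testing sets share exactly one agent type, while the “Exc” scenarios share none.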

H.1 Detailed Hyperparameters and Computing Resources

We summarize the values of the common hyperparameters of the algorithms used in our experiments in Tabs. 2 and 3. The optimizer used during training is Adam (Kingma & Ba, 2014), with the default hyperparameters except for the learning rate. All algorithms in the experiments are implemented in PyTorch (Paszke et al., 2019).

Table 2: Shared hyperparameters for LBF. Note that the arguments intersection_generalization, exclusion_generalization and exclude_A2Cagent cannot be set to True simultaneously.

| Hyperparameter | Value |
| --- | --- |
| lr | 0.00025 |
| gamma | 0.99 |
| max_num_steps | 1000000 |
| eps_length | 200 |
| update_frequency | 4 |
| saving_frequency | 50 |
| pair_comp | bmm |
| num_envs | 16 |
| tau | 0.001 |
| eval_eps | 5 |
| weight_predict | 1.0 |
| num_players_train | 3 |
| num_players_test | 5 for a maximum of 5 agents; 9 for a maximum of 9 agents |
| exclude_A2Cagent | True for the agent-type set excluding the A2C agent; False for the default agent-type sets |
| intersection_generalization | True when the agent-type sets for training and testing intersect; False for the default agent-type sets |
| exclusion_generalization | True when the agent-type sets for training and testing are mutually exclusive; False for the default agent-type sets |
| seed | 0 |
| eval_init_seed | 2500 |

Table 3: Shared hyperparameters for Wolfpack. Note that the arguments intersection_generalization and exclusion_generalization cannot be set to True simultaneously.

| Hyperparameter | Value |
| --- | --- |
| lr | 0.00025 |
| gamma | 0.99 |
| num_episodes | 4000 |
| update_frequency | 4 |
| saving_frequency | 50 |
| pair_comp | bmm |
| num_envs | 16 |
| tau | 0.001 |
| eval_eps | 5 |
| weight_predict | 1.0 |
| num_players_train | 3 |
| num_players_test | 5 for a maximum of 5 agents; 9 for a maximum of 9 agents |
| intersection_generalization | True when the agent-type sets for training and testing intersect; False for the default agent-type sets |
| exclusion_generalization | True when the agent-type sets for training and testing are mutually exclusive; False for the default agent-type sets |
| seed | 0 |
| eval_init_seed | 2500 |
| close_penalty | 0.5 |
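
The notes in the captions of Tabs. 2 and 3 describe flags that exclude one another. A minimal Python sketch of such a check is given below; it reads the notes as “at most one of these flags may be True”, which is an interpretation, and the function name and return values are hypothetical and not taken from the released code.

```python
def select_agent_type_setting(
    intersection_generalization=False,
    exclusion_generalization=False,
    exclude_A2Cagent=False,
):
    """Return the name of the single enabled setting, or "default" if none is set.

    Assumption: the three flags are treated as mutually exclusive.
    For Wolfpack, exclude_A2Cagent is simply left at False.
    """
    flags = {
        "intersection_generalization": intersection_generalization,
        "exclusion_generalization": exclusion_generalization,
        "exclude_A2Cagent": exclude_A2Cagent,
    }
    enabled = [name for name, value in flags.items() if value]
    if len(enabled) > 1:
        raise ValueError(f"flags cannot be True simultaneously: {enabled}")
    return enabled[0] if enabled else "default"
```

For instance, calling the function with only `exclusion_generalization=True` selects the mutually exclusive agent-type sets, while enabling two flags at once raises a `ValueError`.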

Then, we list the exclusive hyperparameters of all algorithms implemented in this work, as shown in Tab. 4.

Table 4: Exclusive hyperparameters of all algorithms implemented in this paper.
| Algorithm | weight_regularizer | graph | pair_range | indiv_range |
| --- | --- | --- | --- | --- |
| GPL | 0.0 | complete | free | free |
| CIAO-S | 0.5 | star | pos | pos |
| CIAO-S-NP | 0.5 | star | neg | pos |
| CIAO-S-FI | 0.5 | star | pos | free |
| CIAO-S-ZI | 0.5 | star | pos | zero |
| CIAO-S-NI | 0.5 | star | pos | neg |
| CIAO-C | 0.5 | complete | pos | pos |
| CIAO-C-NP | 0.5 | complete | neg | pos |
| CIAO-C-FI | 0.5 | complete | pos | free |
| CIAO-C-ZI | 0.5 | complete | pos | zero |
| CIAO-C-NI | 0.5 | complete | pos | neg |
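
To make the `pair_range` and `indiv_range` columns of Tab. 4 concrete, the sketch below shows one plausible way to map the four range labels to transforms of a raw scalar output. The absolute-value transform, the function names and the scalar interface are assumptions chosen for illustration; the released implementation may enforce the ranges differently. Only three rows of the table are copied here.

```python
def constrain(value, mode):
    """Map a raw scalar to the range named by mode (assumed transforms)."""
    if mode == "pos":
        return abs(value)    # non-negative output
    if mode == "neg":
        return -abs(value)   # non-positive output
    if mode == "zero":
        return 0.0           # term switched off
    if mode == "free":
        return value         # unconstrained, as in GPL
    raise ValueError(f"unknown range mode: {mode}")


# (pair_range, indiv_range) for a few rows of Tab. 4.
RANGES = {
    "GPL": ("free", "free"),
    "CIAO-S": ("pos", "pos"),
    "CIAO-S-NP": ("neg", "pos"),
    "CIAO-C-ZI": ("pos", "zero"),
}


def constrained_outputs(algorithm, pair_value, indiv_value):
    """Apply the range constraints of one algorithm variant to raw outputs."""
    pair_mode, indiv_mode = RANGES[algorithm]
    return constrain(pair_value, pair_mode), constrain(indiv_value, indiv_mode)
```

Under these assumptions, the raw pair `(-1.5, -0.2)` is left untouched by GPL, becomes non-negative in both terms for CIAO-S, and has its individual term removed for CIAO-C-ZI.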

All experiments have been run on Xeon Gold 6230 with 30 CPU cores and 30 GB primary memory. An experiment conducted on Wolfpack requires approximately 11 hours, whereas an experiment on LBF typically takes around 12 hours.

Appendix I Additional Experimental Results

I.1 Additional Evaluation on Small Number of Agents

Figure 9: Comparison between CIAO and GPL in evaluation, across different scenarios with a maximum of 3 agents. Panels: (a) LBF including A2C agent; (b) LBF excluding A2C agent; (c) Wolfpack.

We present a performance comparison between CIAO and GPL across scenarios with a maximum of 3 agents, as illustrated in Fig. 9. The results indicate comparable performance on LBF, while CIAO-S significantly outperforms the other algorithms on Wolfpack. This observation suggests that the star graph structure is better suited to Wolfpack. The rationale is that, with a small number of agents in Wolfpack, conveying the learner’s “instructions” through one teammate to another is less effective. This contrasts with the scenario depicted in Fig. 3(b), where a larger number of agents necessitates transmitting the learner’s instructions through an intermediary teammate. These findings reinforce the argument for the star graph structure’s superiority in Wolfpack scenarios.
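To make the structural difference between the two graphs concrete, the sketch below enumerates the edges of a star graph and a complete graph for the team sizes considered in the experiments. It assumes the learner is agent 0 and sits at the center of the star; the function names are hypothetical and are not taken from the released code.

```python
from itertools import combinations


def star_edges(num_agents: int, center: int = 0) -> list:
    """Edges linking the center (assumed to be the learner) to every teammate."""
    return [(center, j) for j in range(num_agents) if j != center]


def complete_edges(num_agents: int) -> list:
    """Edges linking every unordered pair of agents."""
    return list(combinations(range(num_agents), 2))


for n in (3, 5, 9):
    star = star_edges(n)
    complete = complete_edges(n)
    # Teammate-teammate edges exist only in the complete graph.
    teammate_only = [edge for edge in complete if 0 not in edge]
    print(n, len(star), len(complete), len(teammate_only))
# prints:
# 3 2 3 1
# 5 4 10 6
# 9 8 36 28
```

With 3 agents the two graphs differ by a single teammate-teammate edge, whereas with 9 agents the complete graph adds 28 such edges, so the two structures diverge much more as the team grows.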

I.2 LBF with Agent-Type Sets Excluding A2C Agent

Figure 10: Comparison between CIAO and GPL on LBF with the agent-type set excluding the agent-type generated by RL algorithms, in scenarios with a maximum of 5 and 9 agents. Panels: (a) LBF: max. of 5 agents; (b) LBF: max. of 9 agents.

We extend our evaluation of CIAO to LBF considering the agent-type set without the agent-type trained by RL (A2C agent), as depicted in Fig. 10. A comparison between Fig. 10 and Fig. 3 leads to the conclusion that CIAO-S exhibits comparatively robust performance across different agent-type sets, whereas CIAO-C demonstrates robustness primarily in scenarios with a larger number of agents. The underlying reasons for CIAO-C’s limited robustness in situations with a small number of agents remain a topic for future investigation. Additionally, exploring the correlation between the performance of these algorithms in testing and RL-based agent-types is a valuable topic for further research.

I.3 Additional Ablation Study on LBF with Agent-Type Sets Excluding A2C Agent

Figure 11: Comparison between CIAO-C and its ablations in evaluation, on LBF excluding A2C agent, across scenarios with a maximum of 3, 5, and 9 agents, respectively. Panels: (a) Maximum of 3 agents; (b) Maximum of 5 agents; (c) Maximum of 9 agents.
Figure 12: Comparison between CIAO-S and its ablations in evaluation, on LBF excluding A2C agent, across scenarios of various maximum numbers of agents. Panels: (a) Maximum of 3 agents; (b) Maximum of 5 agents; (c) Maximum of 9 agents.

We present a comprehensive performance comparison among CIAO-C, CIAO-S, and their respective ablation variants on LBF, excluding the A2C agent. Figs. 11 and 12 illustrate the results for CIAO-C and CIAO-S, respectively. In most cases, our hypothesis regarding the non-negative individual utility range is validated. However, we note that the unregularized individual utility performs satisfactorily but is prone to instability. Additionally, our theoretical expectation of a non-negative pairwise utility range is violated for CIAO-C in scenarios with a maximum of 3 and 5 agents. The root cause of this deviation requires further investigation, and it suggests a potential avenue for future research into dynamic affinity graph structures.

I.4 Additional Ablation Study on CIAO with No Regularizers

Figure 13: Comparison between CIAO and its ablation variant with no consideration of regularizers, denoted as CIAO-X-NR (“X” is either “C” or “S”), on the regularization loss during training, across different scenarios where the training is conducted with a maximum of 3 agents. Panels: (a) LBF including A2C agent; (b) LBF excluding A2C agent; (c) Wolfpack.
Figure 14: Comparison between CIAO and its ablation variant with no consideration of regularizers, denoted as CIAO-X-NR (“X” is either “C” or “S”), across different scenarios where the evaluation is conducted with a maximum of 5 agents. Panels: (a) LBF including A2C agent; (b) LBF excluding A2C agent; (c) Wolfpack.
Figure 15: Comparison between CIAO and its ablation variant with no consideration of regularizers, denoted as CIAO-X-NR (“X” is either “C” or “S”), across different scenarios where the evaluation is conducted with a maximum of 9 agents. Panels: (a) LBF including A2C agent; (b) LBF excluding A2C agent; (c) Wolfpack.

We compare the performance of CIAO with that of its ablation variant that omits the regularizers. Fig. 13 depicts the regularization losses during training, affirming the importance of incorporating regularizers. Notably, the regularizers are not consistently effective on LBF, as shown in Figs. 14 and 15. Two potential explanations arise: (1) unique properties of the LBF environment may diminish the impact of regularizers, and (2) the regularization, which is derived from a sufficient condition for addressing DVSC as an RL problem, may not account for other eligible conditions. The exploration of these possibilities is deferred to future research.