Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
\UseRawInputEncoding

Diffusion Models for Offline Multi-agent Reinforcement Learning with Safety Constraints*

Jianuo Huang1 *This work was not supported by any organization1School of Computing and Data Science, Xiamen University Malaysia Sepang 43900, Malaysia SWE2109555@xmu.edu.my
Abstract

In recent advancements in Multi-agent Reinforcement Learning (MARL), its application has extended to various safety-critical scenarios. However, most methods focus on online learning, which presents substantial risks when deployed in real-world settings. Addressing this challenge, we introduce an innovative framework integrating diffusion models within the MARL paradigm. This approach notably enhances the safety of actions taken by multiple agents through risk mitigation while modeling coordinated action. Our framework is grounded in the Centralized Training with Decentralized Execution (CTDE) architecture, augmented by a Diffusion Model for prediction trajectory generation. Additionally, we incorporate a specialized algorithm to further ensure operational safety. We evaluate our model against baselines on the DSRL benchmark. Experiment results demonstrate that our model not only adheres to stringent safety constraints but also achieves superior performance compared to existing methodologies. This underscores the potential of our approach in advancing the safety and efficacy of MARL in real-world applications.

I INTRODUCTION

Safe reinforcement learning (RL) and multi-agent RL (MARL) are critical in navigating complex scenarios where multiple agents interact dynamically, such as in autonomous driving, robotics, and healthcare. This paper integrates control barrier functions (CBFs) into multi-agent diffusion models to ensure agents learn policies that optimize rewards while adhering to stringent safety constraints. By embedding CBFs, the research aims to enhance the safety and stability of learning processes, fostering safer interactions among agents in real-world applications. This approach not only advances RL theory but also holds promise for practical implementations where safety is paramount.

II RELATED WORK

II-A Safe Reinforcement Learning

Safe reinforcement learning (RL) in constrained Markov decision processes (CMDPs) aims to maximize cumulative rewards while ensuring safety constraints. Garcıa and Fernández (2015) categorize safe RL methods into Reward Shaping, Policy Constraints, Model-Based Approaches, Lyapunov-Based Methods, and Barrier Functions. Reward Shaping modifies rewards to penalize unsafe actions, while Policy Constraints, like Achiam et al.’s (2017) Constrained Policy Optimization (CPO), explicitly incorporate safety constraints into policy optimization. Model-Based Approaches, such as Fisac et al. (2018), combine model predictive control with RL for safety guarantees. Lyapunov-Based Methods use Lyapunov functions to maintain stability, as proposed by Chow et al. (2018). Barrier Functions, like Control Barrier Functions (CBFs) used by Ames et al. (2017), ensure the state remains within a safe set. Extending safe RL to multi-agent settings introduces additional challenges due to the need for coordination and communication among agents. Stooke et al. (2020) introduce PID Lagrangian Methods, which dynamically enforce safety constraints using PID controllers. Yang et al. (2022) propose a constrained update projection approach to maintain safety in multi-agent settings, even with communication delays. Zhang et al. (2021) suggest decentralized safety mechanisms that rely on local observations and communication without a central coordinator. Li et al. (2021) introduce graph-based methods for safe multi-agent RL, modeling agent interactions as a graph to ensure scalable and efficient coordination while satisfying safety constraints.

II-B Multi-Agent Safe Reinforcement Learning

Extending safe RL to multi-agent settings introduces additional challenges due to the need for coordination and communication among agents. To address these challenges, several approaches have been proposed. Stooke et al. (2020) introduce PID Lagrangian methods for responsive safety in multi-agent RL, utilizing proportional-integral-derivative (PID) controllers to dynamically enforce safety constraints, which is effective in scenarios requiring quick responses to changing safety conditions. Yang et al. (2022) propose a constrained update projection approach for safe policy optimization in multi-agent settings, where policy updates are projected onto a feasible set that satisfies safety constraints, proving effective in scenarios with communication delays and failures. Recognizing the infeasibility of fully connected communication networks in many real-world applications, Zhang et al. (2021) propose a decentralized safe RL framework that leverages local observations and communication to ensure safety without relying on a central coordinator. Additionally, Li et al. (2021) introduce graph-based methods for safe multi-agent RL, modeling agent interactions as a graph, which allows for scalable and efficient coordination among agents while ensuring safety constraints are satisfied. These diverse approaches collectively address the complexities of multi-agent coordination and communication in safety-critical environments.

II-C Diffusion Models in Reinforcement Learning

Diffusion models have recently gained attention for their ability to generate realistic data samples, enhancing decision-making in reinforcement learning (RL) for trajectory prediction and planning. Ajay et al. (2023) demonstrated the effectiveness of state trajectory diffusion in single-agent RL, improving performance by modeling complex environmental dynamics. Song et al. (2021) introduced a score-based generative model using diffusion processes to create high-quality samples, aiding in trajectory prediction and decision-making. Chen et al. (2022) combined diffusion models with model predictive control (MPC) for trajectory optimization, enhancing safety and efficiency in autonomous navigation. Extending diffusion models to multi-agent RL remains challenging but promising, enabling agents to predict future states, coordinate actions, and ensure safety. Overall, diffusion models offer significant advantages in RL by improving trajectory prediction and optimization, making RL systems more robust and efficient in dynamic environments.

II-D Integrated Approaches for Safe Reinforcement Learning Using Diffusion Models

Integrating safe reinforcement learning (RL) methods with diffusion models presents a promising direction for enhancing the safety and performance of multi-agent systems. One notable approach involves combining control barrier functions (CBFs) with diffusion models to create a robust framework that enforces safety constraints dynamically while optimizing policies for multi-agent systems. This integration leverages the strengths of CBFs in maintaining safety by ensuring that the state remains within a safe set and the predictive power of diffusion models to anticipate future states and actions. Fisac et al. (2018) exemplify this by integrating model-based RL with safety guarantees provided by CBFs, ensuring that the agent’s actions remain within safe bounds while optimizing performance. Additionally, Lyapunov-based methods, such as those proposed by Chow et al. (2018), have been extended to diffusion models to provide stability and safety guarantees in RL. This integration generates safe state trajectories that inform the agent’s decision-making process, thereby enhancing the robustness and reliability of the RL system in dynamic and uncertain environments. These integrated approaches illustrate the potential of combining theoretical safety frameworks with advanced generative models to develop more reliable and efficient multi-agent RL systems.

III MATH

III-A Preliminaries

III-A1 Control Barrier Functions with Diffusion Models

For a nonlinear affine control system:

s˙(t)=f(s(t),a(t)),˙𝑠𝑡𝑓𝑠𝑡𝑎𝑡\dot{s}(t)=f(s(t),a(t)),over˙ start_ARG italic_s end_ARG ( italic_t ) = italic_f ( italic_s ( italic_t ) , italic_a ( italic_t ) ) , (1)

where sSn𝑠𝑆superscript𝑛s\in S\subseteq\mathbb{R}^{n}italic_s ∈ italic_S ⊆ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT is the system state, and aAm𝑎𝐴superscript𝑚a\in A\subseteq\mathbb{R}^{m}italic_a ∈ italic_A ⊆ blackboard_R start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT is the admissible control input.

Definition 1 A set Cn𝐶superscript𝑛C\subseteq\mathbb{R}^{n}italic_C ⊆ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT is forward invariant for system (1) if the solutions for some aA𝑎𝐴a\in Aitalic_a ∈ italic_A beginning at any s0Csubscript𝑠0𝐶s_{0}\in Citalic_s start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∈ italic_C meet xtCsubscript𝑥𝑡𝐶x_{t}\in Citalic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∈ italic_C, t0for-all𝑡0\forall t\geq 0∀ italic_t ≥ 0.

Definition 2 A function hhitalic_h is a control barrier function (CBF) if there exists an extended class 𝒦subscript𝒦\mathcal{K}_{\infty}caligraphic_K start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT function α𝛼\alphaitalic_α, i.e., α𝛼\alphaitalic_α is strictly increasing and satisfies α(0)=0𝛼00\alpha(0)=0italic_α ( 0 ) = 0, such that for the control system (1):

supaA[sh(s)f(s,a)]α(h(s)),subscriptsupremum𝑎𝐴delimited-[]subscript𝑠𝑠𝑓𝑠𝑎𝛼𝑠\sup_{a\in A}[\nabla_{s}h(s)\cdot f(s,a)]\geq-\alpha(h(s)),roman_sup start_POSTSUBSCRIPT italic_a ∈ italic_A end_POSTSUBSCRIPT [ ∇ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT italic_h ( italic_s ) ⋅ italic_f ( italic_s , italic_a ) ] ≥ - italic_α ( italic_h ( italic_s ) ) , (2)

for all sC𝑠𝐶s\in Citalic_s ∈ italic_C.

Theorem 1.(Ames et al. (2017)) Given a CBF h(s)𝑠h(s)italic_h ( italic_s ) from Definition 2, if s0𝒞subscript𝑠0𝒞s_{0}\in\mathcal{C}italic_s start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∈ caligraphic_C, then any a𝑎aitalic_a generated by a Lipschitz continuous controller that satisfies the constraint in (2), t0for-all𝑡0\forall t\geq 0∀ italic_t ≥ 0 renders 𝒞𝒞\mathcal{C}caligraphic_C forward invariant for system (1).

In single-agent reinforcement learning, for a control policy π:SA:𝜋𝑆𝐴\pi:S\rightarrow Aitalic_π : italic_S → italic_A, CBF hhitalic_h, state space sS𝑠𝑆s\in Sitalic_s ∈ italic_S, action space aA𝑎𝐴a\in Aitalic_a ∈ italic_A and let SdSsubscript𝑆𝑑𝑆{S}_{d}\subseteq{S}italic_S start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ⊆ italic_S be the dangerous set, Ss=SSdsubscript𝑆𝑠𝑆subscript𝑆𝑑{S}_{s}={S}\setminus{S}_{d}italic_S start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT = italic_S ∖ italic_S start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT be the safe set, which contains the set of initial conditions S0Sssubscript𝑆0subscript𝑆𝑠{S}_{0}\subseteq{S}_{s}italic_S start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ⊆ italic_S start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT. It is proved in (Ames et al., 2014) that if these three conditions:

(sS0,h(s)0)(sSd,h(s)<0)formulae-sequencefor-all𝑠subscript𝑆0𝑠0formulae-sequencefor-all𝑠subscript𝑆𝑑𝑠0\displaystyle(\forall s\in S_{0},h(s)\geq 0)\land(\forall s\in S_{d},h(s)<0)( ∀ italic_s ∈ italic_S start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_h ( italic_s ) ≥ 0 ) ∧ ( ∀ italic_s ∈ italic_S start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT , italic_h ( italic_s ) < 0 ) (3)
(s{sh(s)0},shf(s,a)+α(h)0),formulae-sequencefor-all𝑠conditional-set𝑠𝑠0subscript𝑠𝑓𝑠𝑎𝛼0\displaystyle\land(\forall s\in\{s\mid h(s)\geq 0\},\nabla_{s}h\cdot f(s,a)+% \alpha(h)\geq 0),∧ ( ∀ italic_s ∈ { italic_s ∣ italic_h ( italic_s ) ≥ 0 } , ∇ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT italic_h ⋅ italic_f ( italic_s , italic_a ) + italic_α ( italic_h ) ≥ 0 ) ,

are satisfied with a=π(s)𝑎𝜋𝑠a=\pi(s)italic_a = italic_π ( italic_s ), then s(t){s|h(s)0}𝑠𝑡conditional-set𝑠𝑠0s(t)\in\{s|h(s)\geq 0\}italic_s ( italic_t ) ∈ { italic_s | italic_h ( italic_s ) ≥ 0 } for t[0,)for-all𝑡0\forall t\in[0,\infty)∀ italic_t ∈ [ 0 , ∞ ), which means the state is forward invariant according to Theorem 1, and it would never enter the dangerous set Sdsubscript𝑆𝑑S_{d}italic_S start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT under π𝜋\piitalic_π.

Diffusion Models generate data from the dataset D:={xi}0iMassign𝐷subscriptsubscript𝑥𝑖0𝑖𝑀D:=\{x_{i}\}_{0\leq i\leq M}italic_D := { italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT 0 ≤ italic_i ≤ italic_M end_POSTSUBSCRIPT. The forward diffusion process is defined as: q(xk+1|xk):=𝒩(xk+1;αkxk,(1αk)I)assign𝑞conditionalsubscript𝑥𝑘1subscript𝑥𝑘𝒩subscript𝑥𝑘1subscript𝛼𝑘subscript𝑥𝑘1subscript𝛼𝑘𝐼q(x_{k+1}|x_{k}):=\mathcal{N}(x_{k+1};\sqrt{\alpha_{k}}x_{k},(1-\alpha_{k})I)italic_q ( italic_x start_POSTSUBSCRIPT italic_k + 1 end_POSTSUBSCRIPT | italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) := caligraphic_N ( italic_x start_POSTSUBSCRIPT italic_k + 1 end_POSTSUBSCRIPT ; square-root start_ARG italic_α start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_ARG italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , ( 1 - italic_α start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) italic_I ) and the reverse diffusion process is pθ(xk1|xk):=𝒩(xk1|μθ(xk,k),Σθ(xk,k))assignsubscript𝑝𝜃conditionalsubscript𝑥𝑘1subscript𝑥𝑘𝒩conditionalsubscript𝑥𝑘1subscript𝜇𝜃subscript𝑥𝑘𝑘subscriptΣ𝜃subscript𝑥𝑘𝑘p_{\theta}(x_{k-1}|x_{k}):=\mathcal{N}(x_{k-1}|\mu_{\theta}(x_{k},k),\Sigma_{% \theta}(x_{k},k))italic_p start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_k - 1 end_POSTSUBSCRIPT | italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) := caligraphic_N ( italic_x start_POSTSUBSCRIPT italic_k - 1 end_POSTSUBSCRIPT | italic_μ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_k ) , roman_Σ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_k ) ), where 𝒩(μ,Σ)𝒩𝜇Σ\mathcal{N}(\mu,\Sigma)caligraphic_N ( italic_μ , roman_Σ ) is a Gaussian distribution with mean μ𝜇\muitalic_μ and variance ΣΣ\Sigmaroman_Σ. Here, α𝛼\alphaitalic_α is known as the “diffusion rate” and is precalculated using a “variance scheduler.” The term I𝐼Iitalic_I is an identity matrix. By predicting the parameters for the reverse diffusion process at each time step with a neural network, new samples that closely match the underlying data distribution are generated. The reverse diffusion process can be estimated by the loss function as follows (Ho et al., 2020):

(θ)=𝔼k[1,K],x0q,ϵ𝒩(0,I)[ϵϵθ(xk,k)2].𝜃subscript𝔼formulae-sequencesimilar-to𝑘1𝐾formulae-sequencesimilar-tosubscript𝑥0𝑞similar-toitalic-ϵ𝒩0𝐼delimited-[]superscriptnormitalic-ϵsubscriptitalic-ϵ𝜃subscript𝑥𝑘𝑘2\mathcal{L}(\theta)=\mathbb{E}_{k\sim[1,K],x_{0}\sim q,\epsilon\sim\mathcal{N}% (0,I)}\left[\left\|\epsilon-\epsilon_{\theta}(x_{k},k)\right\|^{2}\right].caligraphic_L ( italic_θ ) = blackboard_E start_POSTSUBSCRIPT italic_k ∼ [ 1 , italic_K ] , italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∼ italic_q , italic_ϵ ∼ caligraphic_N ( 0 , italic_I ) end_POSTSUBSCRIPT [ ∥ italic_ϵ - italic_ϵ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_k ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] . (4)

The predicted noise ϵθ(xk,k)subscriptitalic-ϵ𝜃subscript𝑥𝑘𝑘\epsilon_{\theta}(x_{k},k)italic_ϵ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_k ) is to estimate the noise ϵ𝒩(0,I)similar-toitalic-ϵ𝒩0𝐼\epsilon\sim\mathcal{N}(0,I)italic_ϵ ∼ caligraphic_N ( 0 , italic_I ).

In the forward process, we use classifier-free guidance, which requires an additional condition y𝑦yitalic_y to generate target synthetic data. In this work, we incorporated CBF into the diffusion model for the finite-time forward invariance and reward for the optimal policy. Classifier-free guidance modifies the original training setup to learn both a conditional ϵθ(xk,y,k)subscriptitalic-ϵ𝜃subscript𝑥𝑘𝑦𝑘\epsilon_{\theta}(x_{k},y,k)italic_ϵ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_y , italic_k ) and an unconditional conditional noise ϵθ(xk,,k)subscriptitalic-ϵ𝜃subscript𝑥𝑘𝑘\epsilon_{\theta}(x_{k},\emptyset,k)italic_ϵ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , ∅ , italic_k ) where a dummy value \emptyset takes the place of y𝑦yitalic_y. The perturbed noise ϵθ(xk,,k)+ω(ϵθ(xk,y,k)ϵθ(xk,,k))subscriptitalic-ϵ𝜃subscript𝑥𝑘𝑘𝜔subscriptitalic-ϵ𝜃subscript𝑥𝑘𝑦𝑘subscriptitalic-ϵ𝜃subscript𝑥𝑘𝑘\epsilon_{\theta}(x_{k},\emptyset,k)+\omega(\epsilon_{\theta}(x_{k},y,k)-% \epsilon_{\theta}(x_{k},\emptyset,k))italic_ϵ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , ∅ , italic_k ) + italic_ω ( italic_ϵ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_y , italic_k ) - italic_ϵ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , ∅ , italic_k ) ) is used to later generate samples.

About diffusion decision-making in single-agent settings, diffusing over state trajectories only (Ajay et al., 2023), is claimed to be easier to model and can obtain better performance due to the less smooth nature of action sequences:

τ^:=[s^t,s^t+1,,s^t+H1],assign^𝜏subscript^𝑠𝑡subscript^𝑠𝑡1subscript^𝑠𝑡𝐻1\hat{\tau}:=[\hat{s}_{t},\hat{s}_{t+1},\ldots,\hat{s}_{t+H-1}],over^ start_ARG italic_τ end_ARG := [ over^ start_ARG italic_s end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , over^ start_ARG italic_s end_ARG start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT , … , over^ start_ARG italic_s end_ARG start_POSTSUBSCRIPT italic_t + italic_H - 1 end_POSTSUBSCRIPT ] , (5)

where H𝐻Hitalic_H is the trajectory length that the diffusion model generates and t𝑡titalic_t is the time a state was visited in trajectory τ𝜏\tauitalic_τ. However, sampling states from the diffusion model cannot get the corresponding action. To infer the policy, we could use the inverse dynamics model to generate the action by two consecutive states in the trajectory:

a^t=Iϕ(st,s^t+1).subscript^𝑎𝑡subscript𝐼italic-ϕsubscript𝑠𝑡subscript^𝑠𝑡1\hat{a}_{t}=I_{\phi}(s_{t},\hat{s}_{t+1}).over^ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = italic_I start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , over^ start_ARG italic_s end_ARG start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ) . (6)

III-A2 Multi-agent Offline Reinforcement Learning with Safety Constraints

The safe MARL problem is normally formulated as a Constrained Markov Decision Process (CMDP) {N,S,O,A,p,ρ0,γ,R,h}𝑁𝑆𝑂𝐴𝑝superscript𝜌0𝛾𝑅\{N,S,O,A,p,\rho^{0},\gamma,R,h\}{ italic_N , italic_S , italic_O , italic_A , italic_p , italic_ρ start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT , italic_γ , italic_R , italic_h }. Here, N={1,,n}𝑁1𝑛N=\{1,\ldots,n\}italic_N = { 1 , … , italic_n } is the set of agents, the joint state space is S={S1,S2,,Sn}𝑆subscript𝑆1subscript𝑆2subscript𝑆𝑛S=\{S_{1},S_{2},\ldots,S_{n}\}italic_S = { italic_S start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_S start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_S start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT } where stiSisubscriptsuperscript𝑠𝑖𝑡subscript𝑆𝑖s^{i}_{t}\in S_{i}italic_s start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∈ italic_S start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT denotes the state of agent i𝑖iitalic_i at time step t𝑡titalic_t, O𝑂Oitalic_O is the local observation, A=i=1nAi𝐴superscriptsubscriptproduct𝑖1𝑛subscript𝐴𝑖A=\prod_{i=1}^{n}A_{i}italic_A = ∏ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is the joint action space, p:S×AS:𝑝𝑆𝐴𝑆p:S\times A\rightarrow Sitalic_p : italic_S × italic_A → italic_S is the probabilistic transition function, ρ0superscript𝜌0\rho^{0}italic_ρ start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT is the initial state distribution, γ[0,1]𝛾01\gamma\in[0,1]italic_γ ∈ [ 0 , 1 ] is the discount factor, R:S×A×S:𝑅𝑆𝐴𝑆R:S\times A\times S\rightarrow\mathbb{R}italic_R : italic_S × italic_A × italic_S → blackboard_R is the joint reward function, h:S:𝑆h:S\rightarrow\mathbb{R}italic_h : italic_S → blackboard_R is the constraint function; in this paper, we use CBF as the constraint function. At time step t𝑡titalic_t, the joint state at time t𝑡titalic_t is denoted by st={st1,,stn}subscript𝑠𝑡subscriptsuperscript𝑠1𝑡subscriptsuperscript𝑠𝑛𝑡s_{t}=\{s^{1}_{t},\ldots,s^{n}_{t}\}italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = { italic_s start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , … , italic_s start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT }, and every agent i𝑖iitalic_i takes action atisubscriptsuperscript𝑎𝑖𝑡a^{i}_{t}italic_a start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT according to its policy πi(ati|st)superscript𝜋𝑖conditionalsubscriptsuperscript𝑎𝑖𝑡subscript𝑠𝑡\pi^{i}(a^{i}_{t}|s_{t})italic_π start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ( italic_a start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ). Together with other agents’ actions, it gives a joint action at=(at1,,atn)subscript𝑎𝑡subscriptsuperscript𝑎1𝑡subscriptsuperscript𝑎𝑛𝑡a_{t}=(a^{1}_{t},\ldots,a^{n}_{t})italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = ( italic_a start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , … , italic_a start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) and the joint policy π(at|st)=i=1nπi(ati|st)𝜋conditionalsubscript𝑎𝑡subscript𝑠𝑡superscriptsubscriptproduct𝑖1𝑛superscript𝜋𝑖conditionalsubscriptsuperscript𝑎𝑖𝑡subscript𝑠𝑡\pi(a_{t}|s_{t})=\prod_{i=1}^{n}\pi^{i}(a^{i}_{t}|s_{t})italic_π ( italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) = ∏ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_π start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ( italic_a start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ). In offline settings, instead of collecting online data in environments, we only have access to a static dataset D𝐷Ditalic_D to learn the policies. The dataset D𝐷Ditalic_D generally comprises trajectories τ𝜏\tauitalic_τ, i.e., observation-action sequences.

For each agent i𝑖iitalic_i, we define Ntisubscriptsuperscript𝑁𝑖𝑡{N}^{i}_{t}italic_N start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT as the set of its neighborhood agents at time t𝑡titalic_t. Let otiOisubscriptsuperscript𝑜𝑖𝑡subscript𝑂𝑖o^{i}_{t}\in O_{i}italic_o start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∈ italic_O start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT be the local observation of agent i𝑖iitalic_i, which is the states of Ntisubscriptsuperscript𝑁𝑖𝑡{N}^{i}_{t}italic_N start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT neighborhood agents. Notice that the dimension of oisubscript𝑜𝑖o_{i}italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is not fixed and depends on the quantity of neighboring agents.

We assume the safety of agent i𝑖iitalic_i is jointly determined by sisubscript𝑠𝑖s_{i}italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and oisubscript𝑜𝑖o_{i}italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. Let Oisubscript𝑂𝑖{O}_{i}italic_O start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT be the set of all observations and Xi:=Si×Oiassignsubscript𝑋𝑖subscript𝑆𝑖subscript𝑂𝑖X_{i}:={S}_{i}\times{O}_{i}italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT := italic_S start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT × italic_O start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT be the state-observation space that contains the safe set Xi,ssubscript𝑋𝑖𝑠X_{i,s}italic_X start_POSTSUBSCRIPT italic_i , italic_s end_POSTSUBSCRIPT, dangerous set Xi,dsubscript𝑋𝑖𝑑X_{i,d}italic_X start_POSTSUBSCRIPT italic_i , italic_d end_POSTSUBSCRIPT, and initial conditions Xi,0Xi,ssubscript𝑋𝑖0subscript𝑋𝑖𝑠X_{i,0}\subseteq X_{i,s}italic_X start_POSTSUBSCRIPT italic_i , 0 end_POSTSUBSCRIPT ⊆ italic_X start_POSTSUBSCRIPT italic_i , italic_s end_POSTSUBSCRIPT. Let d𝑑ditalic_d describe the minimum distance from agent i𝑖iitalic_i to other agents, relative speed V𝑉Vitalic_V, deceleration b𝑏bitalic_b, and minimal stopped gap κssubscript𝜅𝑠\kappa_{s}italic_κ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT, then d(si,oi)<V22b+κs𝑑subscript𝑠𝑖subscript𝑜𝑖superscript𝑉22𝑏subscript𝜅𝑠d(s_{i},o_{i})<\frac{V^{2}}{2\cdot b}+\kappa_{s}italic_d ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) < divide start_ARG italic_V start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 2 ⋅ italic_b end_ARG + italic_κ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT implies a collision. Then Xi,s={(si,oi)|d(si,oi)V22b+κs}subscript𝑋𝑖𝑠conditional-setsubscript𝑠𝑖subscript𝑜𝑖𝑑subscript𝑠𝑖subscript𝑜𝑖superscript𝑉22𝑏subscript𝜅𝑠X_{i,s}=\{(s_{i},o_{i})|d(s_{i},o_{i})\geq\frac{V^{2}}{2\cdot b}+\kappa_{s}\}italic_X start_POSTSUBSCRIPT italic_i , italic_s end_POSTSUBSCRIPT = { ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) | italic_d ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ≥ divide start_ARG italic_V start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 2 ⋅ italic_b end_ARG + italic_κ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT } and Xi,d={(si,oi)|d(si,oi)<V22b+κs}subscript𝑋𝑖𝑑conditional-setsubscript𝑠𝑖subscript𝑜𝑖𝑑subscript𝑠𝑖subscript𝑜𝑖superscript𝑉22𝑏subscript𝜅𝑠X_{i,d}=\{(s_{i},o_{i})|d(s_{i},o_{i})<\frac{V^{2}}{2\cdot b}+\kappa_{s}\}italic_X start_POSTSUBSCRIPT italic_i , italic_d end_POSTSUBSCRIPT = { ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) | italic_d ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) < divide start_ARG italic_V start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 2 ⋅ italic_b end_ARG + italic_κ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT }. Since there is a surjection from S𝑆{S}italic_S to Xisubscript𝑋𝑖X_{i}italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, we may define d¯isubscript¯𝑑𝑖\bar{d}_{i}over¯ start_ARG italic_d end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT as the lifting of d𝑑ditalic_d from Xisubscript𝑋𝑖X_{i}italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT to S𝑆{S}italic_S. Ss:={sS|i=1,,N,d¯i(s)V22b+κs}assignsubscript𝑆𝑠conditional-set𝑠𝑆formulae-sequencefor-all𝑖1𝑁subscript¯𝑑𝑖𝑠superscript𝑉22𝑏subscript𝜅𝑠{S}_{s}:=\{s\in{S}|\forall i=1,\ldots,N,\bar{d}_{i}(s)\geq\frac{V^{2}}{2\cdot b% }+\kappa_{s}\}italic_S start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT := { italic_s ∈ italic_S | ∀ italic_i = 1 , … , italic_N , over¯ start_ARG italic_d end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_s ) ≥ divide start_ARG italic_V start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 2 ⋅ italic_b end_ARG + italic_κ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT } is then defined. Formally speaking, a multi-agent system’s safety can be described as follows:

Definition 3 If the minimum distance satisfies d(si,oi)V22b+κs𝑑subscript𝑠𝑖subscript𝑜𝑖superscript𝑉22𝑏subscript𝜅𝑠d(s_{i},o_{i})\geq\frac{V^{2}}{2\cdot b}+\kappa_{s}italic_d ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ≥ divide start_ARG italic_V start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 2 ⋅ italic_b end_ARG + italic_κ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT for agent i𝑖iitalic_i and t𝑡titalic_t, then agent i𝑖iitalic_i is safe at time t𝑡titalic_t. If for ifor-all𝑖\forall i∀ italic_i, agent i𝑖iitalic_i is safe at time t𝑡titalic_t, then the multi-agent system is safe at time t𝑡titalic_t, and s𝒮s𝑠subscript𝒮𝑠s\in\mathcal{S}_{s}italic_s ∈ caligraphic_S start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT.

III-B Methodology

III-B1 Framework for Control Barrier Function in Multi-agent Reinforcement Learning

A simple CBF for a multi-agent dynamic system is a centralized function that accounts for the joint states of all agents. However, it may cause an exponential explosion in the state space; it is also difficult to define a safety constraint for the entire system while ensuring that the security of individual agents will not be violated.

By Definition 3, we consider decentralized CBF to guarantee the multi-agent system’s safety. From equation (3), we propose the following CBF:

((si,oi)Xi,0,hi(si,oi)0)formulae-sequencefor-allsubscript𝑠𝑖subscript𝑜𝑖subscript𝑋𝑖0subscript𝑖subscript𝑠𝑖subscript𝑜𝑖0\displaystyle\left(\forall(s_{i},o_{i})\in X_{i,0},h_{i}(s_{i},o_{i})\geq 0\right)( ∀ ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ∈ italic_X start_POSTSUBSCRIPT italic_i , 0 end_POSTSUBSCRIPT , italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ≥ 0 ) (7)
((si,oi)Xi,d,hi(si,oi)<0)formulae-sequencefor-allsubscript𝑠𝑖subscript𝑜𝑖subscript𝑋𝑖𝑑subscript𝑖subscript𝑠𝑖subscript𝑜𝑖0\displaystyle\land\left(\forall(s_{i},o_{i})\in X_{i,d},h_{i}(s_{i},o_{i})<0\right)∧ ( ∀ ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ∈ italic_X start_POSTSUBSCRIPT italic_i , italic_d end_POSTSUBSCRIPT , italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) < 0 )
((si,oi){(si,oi)hi(si,oi)0},\displaystyle\land\left(\forall(s_{i},o_{i})\in\left\{(s_{i},o_{i})\mid h_{i}(% s_{i},o_{i})\geq 0\right\},\right.∧ ( ∀ ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ∈ { ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ∣ italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ≥ 0 } ,
sihifi(si,ai)+oihio˙i(t)+α(hi)0)\displaystyle\left.\qquad\nabla_{s_{i}}h_{i}\cdot f_{i}(s_{i},a_{i})+\nabla_{o% _{i}}h_{i}\cdot\dot{o}_{i}(t)+\alpha(h_{i})\geq 0\right)∇ start_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ⋅ italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) + ∇ start_POSTSUBSCRIPT italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ⋅ over˙ start_ARG italic_o end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_t ) + italic_α ( italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ≥ 0 )

where o˙i(t)subscript˙𝑜𝑖𝑡\dot{o}_{i}(t)over˙ start_ARG italic_o end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_t ) denotes the time derivative of observation, which depends on other agents’ actions. It can be assessed and included in the training process without an explicit expression. Here, the state sisubscript𝑠𝑖s_{i}italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and oisubscript𝑜𝑖o_{i}italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT are the local state and observation of the corresponding agent i𝑖iitalic_i. We refer to conditions (3) as the decentralized CBF for agent i𝑖iitalic_i.

Proposition 1 If the decentralized CBF conditions in (7) are satisfied, then tfor-all𝑡\forall t∀ italic_t and ifor-all𝑖\forall i∀ italic_i, (sti,oti){(si,oi)hi(si,oi)0}subscriptsuperscript𝑠𝑖𝑡subscriptsuperscript𝑜𝑖𝑡conditional-setsubscript𝑠𝑖subscript𝑜𝑖subscript𝑖subscript𝑠𝑖subscript𝑜𝑖0(s^{i}_{t},o^{i}_{t})\in\{(s_{i},o_{i})\mid h_{i}(s_{i},o_{i})\geq 0\}( italic_s start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_o start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ∈ { ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ∣ italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ≥ 0 }, which implies the state would never enter Xi,dsubscript𝑋𝑖𝑑X_{i,d}italic_X start_POSTSUBSCRIPT italic_i , italic_d end_POSTSUBSCRIPT for any agent i𝑖iitalic_i. Thus, the multi-agent system is safe by Definition (3).

According to Proposition 1, CBF can be a decentralized paradigm for every agent in the entire multi-agent system. Since state-observation satisfying hi(si,oi)0subscript𝑖subscript𝑠𝑖subscript𝑜𝑖0h_{i}(s_{i},o_{i})\geq 0italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ≥ 0 is forward invariant, agent i𝑖iitalic_i never gets closer than V22b+κssuperscript𝑉22𝑏subscript𝜅𝑠\frac{V^{2}}{2\cdot b}+\kappa_{s}divide start_ARG italic_V start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 2 ⋅ italic_b end_ARG + italic_κ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT to all of its neighboring agents. According to the definition of hisubscript𝑖h_{i}italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, hi(si,oi)>0d¯i(s)V22b+κssubscript𝑖subscript𝑠𝑖subscript𝑜𝑖0subscript¯𝑑𝑖𝑠superscript𝑉22𝑏subscript𝜅𝑠h_{i}(s_{i},o_{i})>0\Rightarrow\bar{d}_{i}(s)\geq\frac{V^{2}}{2\cdot b}+\kappa% _{s}italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) > 0 ⇒ over¯ start_ARG italic_d end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_s ) ≥ divide start_ARG italic_V start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 2 ⋅ italic_b end_ARG + italic_κ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT. The multi-agent system is safe according to Definition 3 since i,hi(si,oi)0for-all𝑖subscript𝑖subscript𝑠𝑖subscript𝑜𝑖0\forall i,h_{i}(s_{i},o_{i})\geq 0∀ italic_i , italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ≥ 0 implies that i,d¯i(s)V22b+κsfor-all𝑖subscript¯𝑑𝑖𝑠superscript𝑉22𝑏subscript𝜅𝑠\forall i,\bar{d}_{i}(s)\geq\frac{V^{2}}{2\cdot b}+\kappa_{s}∀ italic_i , over¯ start_ARG italic_d end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_s ) ≥ divide start_ARG italic_V start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 2 ⋅ italic_b end_ARG + italic_κ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT.

Next, we need to formulate the control barrier function hi(si,oi)subscript𝑖subscript𝑠𝑖subscript𝑜𝑖h_{i}(s_{i},o_{i})italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) to get a safe set from dataset D𝐷Ditalic_D. Let τi={si,oi}subscript𝜏𝑖subscript𝑠𝑖subscript𝑜𝑖\tau_{i}=\{s_{i},o_{i}\}italic_τ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = { italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } be a trajectory of the state and observation of agent i𝑖iitalic_i. Let 𝒯isubscript𝒯𝑖\mathcal{T}_{i}caligraphic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT be the set of all possible trajectories of agent i𝑖iitalic_i. Let isubscript𝑖\mathcal{H}_{i}caligraphic_H start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and 𝒱isubscript𝒱𝑖\mathcal{V}_{i}caligraphic_V start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT be the function classes of hisubscript𝑖h_{i}italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and policy πisubscript𝜋𝑖\pi_{i}italic_π start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. Define the function yisubscript𝑦𝑖y_{i}italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT: 𝒯i×i×𝒱isubscript𝒯𝑖subscript𝑖subscript𝒱𝑖\mathcal{T}_{i}\times\mathcal{H}_{i}\times\mathcal{V}_{i}\rightarrow\mathbb{R}caligraphic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT × caligraphic_H start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT × caligraphic_V start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT → blackboard_R as:

yi(τi,hi,πi)subscript𝑦𝑖subscript𝜏𝑖subscript𝑖subscript𝜋𝑖\displaystyle y_{i}(\tau_{i},h_{i},\pi_{i})italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_τ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_π start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) :=min{infXi,0𝒯ihi(si,oi),infXi,d𝒯ihi(si,oi),\displaystyle:=\min\Bigl{\{}\inf_{X_{i,0}\cap\mathcal{T}_{i}}h_{i}(s_{i},o_{i}% ),\inf_{X_{i,d}\cap\mathcal{T}_{i}}-h_{i}(s_{i},o_{i}),:= roman_min { roman_inf start_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT italic_i , 0 end_POSTSUBSCRIPT ∩ caligraphic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) , roman_inf start_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT italic_i , italic_d end_POSTSUBSCRIPT ∩ caligraphic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT - italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) , (8)
infXi,s𝒯i(h˙i+α(hi))}.\displaystyle\quad\inf_{X_{i,s}\cap\mathcal{T}_{i}}(\dot{h}_{i}+\alpha(h_{i}))% \Bigr{\}}.roman_inf start_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT italic_i , italic_s end_POSTSUBSCRIPT ∩ caligraphic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( over˙ start_ARG italic_h end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT + italic_α ( italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ) } .

Notice that the third item on the right side of Equation (8) depends on both the policy and CBF, since h˙i=sihifi(si,ui)+oihio˙i(t),ui=πi(si,oi)formulae-sequencesubscript˙𝑖subscriptsubscript𝑠𝑖subscript𝑖subscript𝑓𝑖subscript𝑠𝑖subscript𝑢𝑖subscriptsubscript𝑜𝑖subscript𝑖subscript˙𝑜𝑖𝑡subscript𝑢𝑖subscript𝜋𝑖subscript𝑠𝑖subscript𝑜𝑖\dot{h}_{i}=\nabla_{s_{i}}h_{i}\cdot f_{i}(s_{i},u_{i})+\nabla_{o_{i}}h_{i}% \cdot\dot{o}_{i}(t),u_{i}=\pi_{i}(s_{i},o_{i})over˙ start_ARG italic_h end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = ∇ start_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ⋅ italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) + ∇ start_POSTSUBSCRIPT italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ⋅ over˙ start_ARG italic_o end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_t ) , italic_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_π start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ). It is clear that if we can find hisubscript𝑖h_{i}italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and πi(si,oi)subscript𝜋𝑖subscript𝑠𝑖subscript𝑜𝑖\pi_{i}(s_{i},o_{i})italic_π start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) such that yi(τi,hi,πi)>0subscript𝑦𝑖subscript𝜏𝑖subscript𝑖subscript𝜋𝑖0y_{i}(\tau_{i},h_{i},\pi_{i})>0italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_τ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_π start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) > 0 for τi𝒯ifor-allsubscript𝜏𝑖subscript𝒯𝑖\forall\tau_{i}\in\mathcal{T}_{i}∀ italic_τ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ caligraphic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and ifor-all𝑖\forall i∀ italic_i, then the conditions in (7) are satisfied. We solve the objective:

For all i𝑖iitalic_i, find hiisubscript𝑖subscript𝑖h_{i}\in\mathcal{H}_{i}italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ caligraphic_H start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and πi𝒱isubscript𝜋𝑖subscript𝒱𝑖\pi_{i}\in\mathcal{V}_{i}italic_π start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ caligraphic_V start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, such that yi(τi,hi,πi)γsubscript𝑦𝑖subscript𝜏𝑖subscript𝑖subscript𝜋𝑖𝛾y_{i}(\tau_{i},h_{i},\pi_{i})\geq\gammaitalic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_τ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_π start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ≥ italic_γ, where γ>0𝛾0\gamma>0italic_γ > 0 is a margin for the satisfaction of the CBF condition in (7).

III-B2 Diffusion Model with Guidance

We formulate the diffusion model as follows:

maxθ𝔼τ𝒟[logpθ(τ|y())],subscript𝜃subscript𝔼similar-to𝜏𝒟delimited-[]subscript𝑝𝜃conditional𝜏𝑦\max_{\theta}\mathbb{E}_{\tau\sim\mathcal{D}}[\log p_{\theta}(\tau|y(\cdot))],roman_max start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT blackboard_E start_POSTSUBSCRIPT italic_τ ∼ caligraphic_D end_POSTSUBSCRIPT [ roman_log italic_p start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_τ | italic_y ( ⋅ ) ) ] , (9)

Our goal is to estimate τ𝜏\tauitalic_τ conditioned on y()𝑦{y}(\cdot)italic_y ( ⋅ ) with pθsubscript𝑝𝜃p_{\theta}italic_p start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT. In this paper, 𝒚(τ)𝒚𝜏\boldsymbol{y}(\tau)bold_italic_y ( italic_τ ) includes the CBF and the reward under the trajectory.

Given an offline dataset D𝐷Ditalic_D that consists of all agents’ trajectories data, our diffusion model also takes a decentralized manner to make it consistent with the decentralized CBF. The model is parameterized through the unified noise model ϵθsubscriptitalic-ϵ𝜃\epsilon_{\theta}italic_ϵ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT and the inverse dynamics model Iϕsubscript𝐼italic-ϕI_{\phi}italic_I start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT of each agent i𝑖iitalic_i with the reverse diffusion loss and the inverse dynamics loss:

(θ,ϕ):=assign𝜃italic-ϕabsent\displaystyle\mathcal{L}(\theta,\phi):=caligraphic_L ( italic_θ , italic_ϕ ) := 𝔼τ0𝒟,βBern(p)[ϵϵθi(τ^ki,(1β)yi(τ0)\displaystyle\ \mathbb{E}_{\tau_{0}\in\mathcal{D},\beta\sim\text{Bern}(p)}\big% {[}\big{\|}\epsilon-\epsilon^{i}_{\theta}\big{(}\hat{\tau}^{i}_{k},(1-\beta)y^% {i}(\tau_{0})blackboard_E start_POSTSUBSCRIPT italic_τ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∈ caligraphic_D , italic_β ∼ Bern ( italic_p ) end_POSTSUBSCRIPT [ ∥ italic_ϵ - italic_ϵ start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( over^ start_ARG italic_τ end_ARG start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , ( 1 - italic_β ) italic_y start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ( italic_τ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) (10)
+β,k)2]\displaystyle+\beta\emptyset,k\big{)}\big{\|}^{2}\big{]}+ italic_β ∅ , italic_k ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ]
+ti𝔼(si,oi,ai)𝒟[aiIϕi((sti,oti),\displaystyle+\sum_{t}\sum_{i}\mathbb{E}_{(s_{i},o_{i},a_{i})\in\mathcal{D}}% \big{[}\big{\|}a_{i}-I^{i}_{\phi}\big{(}(s^{i}_{t},o^{i}_{t}),+ ∑ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT blackboard_E start_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ∈ caligraphic_D end_POSTSUBSCRIPT [ ∥ italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_I start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT ( ( italic_s start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_o start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ,
(st+1i,ot+1i))2].\displaystyle\hskip 30.00005pt(s^{i}_{t+1},o^{i}_{t+1})\big{)}\big{\|}^{2}\big% {]}.( italic_s start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT , italic_o start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ) ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] .

III-B3 Implementation Details

However, there are some other gaps between methodology and practical implementation. First, equation (8) does not provide an exact way of designing loss functions. Second, the CBF and π𝜋\piitalic_π are coupled, where minor approximation errors can bootstrap across them and lead to severe instability; furthermore, h˙isubscript˙𝑖\dot{h}_{i}over˙ start_ARG italic_h end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT has term o˙isubscript˙𝑜𝑖\dot{o}_{i}over˙ start_ARG italic_o end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. Third, we do not have the loss function considering reward maximization.

Based on equation (8), we formulate the loss function: c=iicsuperscript𝑐subscript𝑖subscriptsuperscript𝑐𝑖\mathcal{L}^{c}=\sum_{i}\mathcal{L}^{c}_{i}caligraphic_L start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT = ∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT caligraphic_L start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, where icsubscriptsuperscript𝑐𝑖\mathcal{L}^{c}_{i}caligraphic_L start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is the loss function for agent i𝑖iitalic_i:

ic(θi)=subscriptsuperscript𝑐𝑖subscript𝜃𝑖absent\displaystyle\mathcal{L}^{c}_{i}(\theta_{i})=caligraphic_L start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = siχi,0max(0,γhiθi(si,oi))subscriptsubscript𝑠𝑖subscript𝜒𝑖00𝛾superscriptsubscript𝑖subscript𝜃𝑖subscript𝑠𝑖subscript𝑜𝑖\displaystyle\sum_{s_{i}\in\chi_{i,0}}\max\left(0,\gamma-h_{i}^{\theta_{i}}(s_% {i},o_{i})\right)∑ start_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ italic_χ start_POSTSUBSCRIPT italic_i , 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT roman_max ( 0 , italic_γ - italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_θ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ) (11)
+siχi,dmax(0,γ+hiθi(si,oi))subscriptsubscript𝑠𝑖subscript𝜒𝑖𝑑0𝛾superscriptsubscript𝑖subscript𝜃𝑖subscript𝑠𝑖subscript𝑜𝑖\displaystyle+\sum_{s_{i}\in\chi_{i,d}}\max\left(0,\gamma+h_{i}^{\theta_{i}}(s% _{i},o_{i})\right)+ ∑ start_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ italic_χ start_POSTSUBSCRIPT italic_i , italic_d end_POSTSUBSCRIPT end_POSTSUBSCRIPT roman_max ( 0 , italic_γ + italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_θ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) )
+(si,ai)χi,hmax(0,γsihiθifi(si,ai)\displaystyle+\sum_{(s_{i},a_{i})\in\chi_{i,h}}\max\Bigg{(}0,\gamma-\nabla_{s_% {i}}h_{i}^{\theta_{i}}\cdot f_{i}(s_{i},a_{i})+ ∑ start_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ∈ italic_χ start_POSTSUBSCRIPT italic_i , italic_h end_POSTSUBSCRIPT end_POSTSUBSCRIPT roman_max ( 0 , italic_γ - ∇ start_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_θ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ⋅ italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT )
oihiθio˙iα(hiθi)),\displaystyle\hskip 30.00005pt-\nabla_{o_{i}}h_{i}^{\theta_{i}}\cdot\dot{o}_{i% }-\alpha(h_{i}^{\theta_{i}})\Bigg{)},- ∇ start_POSTSUBSCRIPT italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_θ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ⋅ over˙ start_ARG italic_o end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_α ( italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_θ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) ) ,

where γ𝛾\gammaitalic_γ is the margin of satisfaction of CBF. We need to evaluate o˙isubscript˙𝑜𝑖\dot{o}_{i}over˙ start_ARG italic_o end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, which is the time derivative of the observation. Instead, we approximate h˙i(si,oi)=sihiθifi(si,ai)+oihiθio˙isubscript˙𝑖subscript𝑠𝑖subscript𝑜𝑖subscriptsubscript𝑠𝑖superscriptsubscript𝑖subscript𝜃𝑖subscript𝑓𝑖subscript𝑠𝑖subscript𝑎𝑖subscriptsubscript𝑜𝑖superscriptsubscript𝑖subscript𝜃𝑖subscript˙𝑜𝑖\dot{h}_{i}(s_{i},o_{i})={\nabla_{s_{i}}h_{i}^{\theta_{i}}\cdot f_{i}(s_{i},a_% {i})+\nabla_{o_{i}}h_{i}^{\theta_{i}}\cdot\dot{o}_{i}}over˙ start_ARG italic_h end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = ∇ start_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_θ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ⋅ italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) + ∇ start_POSTSUBSCRIPT italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_θ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ⋅ over˙ start_ARG italic_o end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT with the forward difference method h˙i(si,oi)=hi[(si(t+Δt),oi(t+Δt))hi(si(t),oi(t))]/Δt.subscript˙𝑖subscript𝑠𝑖subscript𝑜𝑖subscript𝑖delimited-[]subscript𝑠𝑖𝑡Δ𝑡subscript𝑜𝑖𝑡Δ𝑡subscript𝑖subscript𝑠𝑖𝑡subscript𝑜𝑖𝑡Δ𝑡\dot{h}_{i}(s_{i},o_{i})=h_{i}[(s_{i}(t+\Delta t),o_{i}(t+\Delta t))-h_{i}(s_{% i}(t),o_{i}(t))]/\Delta t.over˙ start_ARG italic_h end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT [ ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_t + roman_Δ italic_t ) , italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_t + roman_Δ italic_t ) ) - italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_t ) , italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_t ) ) ] / roman_Δ italic_t . So, we only need sisubscript𝑠𝑖s_{i}italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and oisubscript𝑜𝑖o_{i}italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT from the dataset, then the loss function becomes:

ic(θi)=subscriptsuperscript𝑐𝑖subscript𝜃𝑖absent\displaystyle\mathcal{L}^{c}_{i}(\theta_{i})=caligraphic_L start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = siXi,0max(0,γhiθi(si,oi))subscriptsubscript𝑠𝑖subscript𝑋𝑖00𝛾superscriptsubscript𝑖subscript𝜃𝑖subscript𝑠𝑖subscript𝑜𝑖\displaystyle\sum_{s_{i}\in X_{i,0}}\max\left(0,\gamma-h_{i}^{\theta_{i}}(s_{i% },o_{i})\right)∑ start_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ italic_X start_POSTSUBSCRIPT italic_i , 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT roman_max ( 0 , italic_γ - italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_θ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ) (12)
+siXi,dmax(0,γ+hiθi(si,oi))subscriptsubscript𝑠𝑖subscript𝑋𝑖𝑑0𝛾superscriptsubscript𝑖subscript𝜃𝑖subscript𝑠𝑖subscript𝑜𝑖\displaystyle+\sum_{s_{i}\in X_{i,d}}\max\left(0,\gamma+h_{i}^{\theta_{i}}(s_{% i},o_{i})\right)+ ∑ start_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ italic_X start_POSTSUBSCRIPT italic_i , italic_d end_POSTSUBSCRIPT end_POSTSUBSCRIPT roman_max ( 0 , italic_γ + italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_θ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) )
+(si,ai)Xi,hmax(0,γΔhiΔtα(hiθi(si,oi))),subscriptsubscript𝑠𝑖subscript𝑎𝑖subscript𝑋𝑖0𝛾Δsubscript𝑖Δ𝑡𝛼superscriptsubscript𝑖subscript𝜃𝑖subscript𝑠𝑖subscript𝑜𝑖\displaystyle+\sum_{(s_{i},a_{i})\in X_{i,h}}\max\left(0,\gamma-\frac{\Delta h% _{i}}{\Delta t}-\alpha\big{(}h_{i}^{\theta_{i}}(s_{i},o_{i})\big{)}\right),+ ∑ start_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ∈ italic_X start_POSTSUBSCRIPT italic_i , italic_h end_POSTSUBSCRIPT end_POSTSUBSCRIPT roman_max ( 0 , italic_γ - divide start_ARG roman_Δ italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG start_ARG roman_Δ italic_t end_ARG - italic_α ( italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_θ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ) ) ,

where

Δhi=hi(si(t+Δt),oi(t+Δt))hi(si(t),oi(t)).Δsubscript𝑖subscript𝑖subscript𝑠𝑖𝑡Δ𝑡subscript𝑜𝑖𝑡Δ𝑡subscript𝑖subscript𝑠𝑖𝑡subscript𝑜𝑖𝑡\Delta h_{i}=h_{i}(s_{i}(t+\Delta t),o_{i}(t+\Delta t))-h_{i}(s_{i}(t),o_{i}(t% )).roman_Δ italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_t + roman_Δ italic_t ) , italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_t + roman_Δ italic_t ) ) - italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_t ) , italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_t ) ) .

For the class-𝒦𝒦\mathcal{K}caligraphic_K function α()𝛼\alpha(\cdot)italic_α ( ⋅ ), we simply choose a linear function. Note here that icsubscriptsuperscript𝑐𝑖\mathcal{L}^{c}_{i}caligraphic_L start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT only proposes safety constraints. We incorporate the safety reward into its reward and denote this safe version. The safety reward is rpsubscript𝑟𝑝r_{p}italic_r start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT when the agent enters the dangerous set.

Ri=𝔼[t=1Hγt(ri[t]rp)]superscript𝑅𝑖𝔼delimited-[]superscriptsubscript𝑡1𝐻superscript𝛾𝑡subscript𝑟𝑖delimited-[]𝑡subscript𝑟𝑝R^{i}=\mathbb{E}\left[\sum_{t=1}^{H}\gamma^{t}\left(r_{i}[t]-r_{p}\right)\right]italic_R start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT = blackboard_E [ ∑ start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_H end_POSTSUPERSCRIPT italic_γ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ( italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT [ italic_t ] - italic_r start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ) ] (13)

We propose the objective function:

maxπ𝔼s[Vrπ(s)𝕀sXi,h],subscript𝜋subscript𝔼𝑠delimited-[]subscriptsuperscript𝑉𝜋𝑟𝑠subscript𝕀𝑠subscript𝑋𝑖\displaystyle\max_{\pi}\mathbb{E}_{s}\left[V^{\pi}_{r}(s)\cdot\mathbb{I}_{s\in X% _{i,h}}\right],roman_max start_POSTSUBSCRIPT italic_π end_POSTSUBSCRIPT blackboard_E start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT [ italic_V start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT ( italic_s ) ⋅ blackboard_I start_POSTSUBSCRIPT italic_s ∈ italic_X start_POSTSUBSCRIPT italic_i , italic_h end_POSTSUBSCRIPT end_POSTSUBSCRIPT ] , (14)

Inspired by IQL, we do not explicitly learn the policy by a separate value function that approximates an expectile only concerning the action distribution:

iVr=𝔼(si,ai)Xi,h[LT(Qir(sti,ati)Vir(sti))],superscriptsubscript𝑖subscript𝑉𝑟subscript𝔼subscript𝑠𝑖subscript𝑎𝑖subscript𝑋𝑖delimited-[]superscript𝐿𝑇superscriptsubscript𝑄𝑖𝑟subscriptsuperscript𝑠𝑖𝑡subscriptsuperscript𝑎𝑖𝑡superscriptsubscript𝑉𝑖𝑟subscriptsuperscript𝑠𝑖𝑡\mathcal{L}_{i}^{V_{r}}=\mathbb{E}_{{(s_{i},a_{i})\in X_{i,h}}}\left[L^{T}% \left(Q_{i}^{r}(s^{i}_{t},a^{i}_{t})-V_{i}^{r}(s^{i}_{t})\right)\right],caligraphic_L start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_V start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT end_POSTSUPERSCRIPT = blackboard_E start_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ∈ italic_X start_POSTSUBSCRIPT italic_i , italic_h end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ italic_L start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ( italic_Q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_r end_POSTSUPERSCRIPT ( italic_s start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_a start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) - italic_V start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_r end_POSTSUPERSCRIPT ( italic_s start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ) ] , (15)
iQr=superscriptsubscript𝑖subscript𝑄𝑟absent\displaystyle\mathcal{L}_{i}^{Q_{r}}=caligraphic_L start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_Q start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT end_POSTSUPERSCRIPT = 𝔼(si,ai,ri)Xi,h[(rti+γVir(st+1i)\displaystyle\mathbb{E}_{{(s_{i},a_{i},r_{i})\in X_{i,h}}}\left[(r^{i}_{t}+% \gamma V_{i}^{r}(s^{i}_{t+1})\right.blackboard_E start_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ∈ italic_X start_POSTSUBSCRIPT italic_i , italic_h end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ ( italic_r start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT + italic_γ italic_V start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_r end_POSTSUPERSCRIPT ( italic_s start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ) (16)
Qir(sti,ati))2].\displaystyle\left.-Q_{i}^{r}(s^{i}_{t},a^{i}_{t}))^{2}\right].- italic_Q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_r end_POSTSUPERSCRIPT ( italic_s start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_a start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] .

IV EXPERIMENTS

IV-A Background and Objectives

Multi-agent reinforcement learning (MARL) is crucial for safety-critical applications like autonomous driving, robotics, and healthcare. However, most current methods emphasize online learning, posing significant risks in real-world deployments. The interaction between multiple agents in these environments necessitates strict safety constraints to prevent accidents and ensure efficient operations. This research proposes an innovative framework that combines diffusion models with Control Barrier Functions (CBFs) to enhance the safety and efficiency of multi-agent actions in offline reinforcement learning settings. We aim to validate this framework on various benchmark datasets and compare its performance with existing methodologies.

IV-B Datasets and Experimental Environment

For this research, we will use the DSRL (Distributed Safe Reinforcement Learning) benchmark dataset, specifically designed for safe offline multi-agent reinforcement learning (MARL), along with additional datasets to cover various safety-critical scenarios such as autonomous driving and robotics. The datasets will be split into training (70%) and validation (30%) sets, ensuring the distribution of safety-critical scenarios is maintained. Model training will involve the proposed model and baseline algorithms (PID Lagrangian Methods and Constrained Policy Optimization), with hyperparameter tuning for optimal performance. Periodic validation will monitor training progress and adjust hyperparameters to prevent overfitting. The testing procedure includes evaluating the final performance on a separate test set containing unseen scenarios to assess generalization. Multiple runs (e.g., 10 runs) will be conducted for each model to ensure statistical significance and robustness of the results. Our experimental environment will consist of simulated settings that closely mimic real-world safety-critical situations, including multiple agents with dynamic interactions and potential hazards, providing a robust platform to evaluate the safety and efficiency of the proposed framework.

IV-C Methodology

Algorithm Unseen Env 1 Unseen Env 2 Unseen Env 3 Avg Generalization Score
Proposed Model 800 810 820 810
PID Lagrangian Methods 760 770 780 770
Constrained Policy Optimization 740 750 760 750
Independent Q-Learning (IQL) 700 710 720 710
MADDPG 720 730 740 730
TABLE I: Generalization Metrics Across Unseen Environments

IV-C1 Model Architecture

We propose a novel framework for safe multi-agent reinforcement learning (MARL) using the following components:

  • Centralized Training with Decentralized Execution (CTDE): This architecture allows for centralized policy learning while enabling each agent to execute actions based on local observations.

  • Diffusion Model: Utilized for trajectory prediction, the diffusion model generates potential future states of the agents in the environment.

  • Control Barrier Functions (CBFs): These functions enforce safety constraints, ensuring that agents operate within safe bounds at all times.

IV-C2 Training Procedure

The training procedure involves the following steps:

Algorithm Hyperparameter Value Validation Reward Validation Safety %
Proposed Model Learning Rate 0.001 850 95%
PID Lagrangian Methods PID Coefficients [0.1, 0.01, 0.001] 800 92%
Constrained Policy Optimization Constraint Weight 1.0 780 90%
Independent Q-Learning (IQL) Learning Rate 0.01 750 85%
MADDPG Actor Learning Rate 0.001 770 88%
TABLE II: Hyperparameter Tuning Results
  1. 1.

    Pre-training Diffusion Models: Initially, diffusion models are pre-trained on offline datasets to learn the underlying data distribution. Let 𝒟={xi}i=1M𝒟superscriptsubscriptsubscript𝑥𝑖𝑖1𝑀\mathcal{D}=\{x_{i}\}_{i=1}^{M}caligraphic_D = { italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT represent the dataset. The forward diffusion process is defined as:

    q(xk+1|xk)=𝒩(αkxk,(1αk)𝐈),𝑞conditionalsubscript𝑥𝑘1subscript𝑥𝑘𝒩subscript𝛼𝑘subscript𝑥𝑘1subscript𝛼𝑘𝐈q(x_{k+1}|x_{k})=\mathcal{N}(\sqrt{\alpha_{k}}x_{k},(1-\alpha_{k})\mathbf{I}),italic_q ( italic_x start_POSTSUBSCRIPT italic_k + 1 end_POSTSUBSCRIPT | italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) = caligraphic_N ( square-root start_ARG italic_α start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_ARG italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , ( 1 - italic_α start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) bold_I ) , (17)

    where αksubscript𝛼𝑘\alpha_{k}italic_α start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT is the diffusion rate and 𝐈𝐈\mathbf{I}bold_I is the identity matrix. The reverse diffusion process is modeled as:

    pθ(xk1|xk)=𝒩(μθ(xk,k),Σθ(xk,k)).subscript𝑝𝜃conditionalsubscript𝑥𝑘1subscript𝑥𝑘𝒩subscript𝜇𝜃subscript𝑥𝑘𝑘subscriptΣ𝜃subscript𝑥𝑘𝑘p_{\theta}(x_{k-1}|x_{k})=\mathcal{N}(\mu_{\theta}(x_{k},k),\Sigma_{\theta}(x_% {k},k)).italic_p start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_k - 1 end_POSTSUBSCRIPT | italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) = caligraphic_N ( italic_μ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_k ) , roman_Σ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_k ) ) . (18)

    The parameters θ𝜃\thetaitalic_θ are learned by minimizing the loss function:

    (θ)=𝔼k,x0,ϵ[ϵϵθ(xk,k)2],𝜃subscript𝔼𝑘subscript𝑥0italic-ϵdelimited-[]superscriptnormitalic-ϵsubscriptitalic-ϵ𝜃subscript𝑥𝑘𝑘2\mathcal{L}(\theta)=\mathbb{E}_{k,x_{0},\epsilon}\left[\|\epsilon-\epsilon_{% \theta}(x_{k},k)\|^{2}\right],caligraphic_L ( italic_θ ) = blackboard_E start_POSTSUBSCRIPT italic_k , italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_ϵ end_POSTSUBSCRIPT [ ∥ italic_ϵ - italic_ϵ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_k ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] , (19)

    where ϵ𝒩(0,𝐈)similar-toitalic-ϵ𝒩0𝐈\epsilon\sim\mathcal{N}(0,\mathbf{I})italic_ϵ ∼ caligraphic_N ( 0 , bold_I ).

  2. 2.

    Integration of CBFs: Control Barrier Functions are integrated into the diffusion model to guide the learning process towards safe trajectories. For a nonlinear affine control system:

    s˙(t)=f(s(t),a(t)),˙𝑠𝑡𝑓𝑠𝑡𝑎𝑡\dot{s}(t)=f(s(t),a(t)),over˙ start_ARG italic_s end_ARG ( italic_t ) = italic_f ( italic_s ( italic_t ) , italic_a ( italic_t ) ) , (20)

    where sSn𝑠𝑆superscript𝑛s\in S\subseteq\mathbb{R}^{n}italic_s ∈ italic_S ⊆ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT is the system state and aAm𝑎𝐴superscript𝑚a\in A\subseteq\mathbb{R}^{m}italic_a ∈ italic_A ⊆ blackboard_R start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT is the control input, a CBF h(s)𝑠h(s)italic_h ( italic_s ) satisfies:

    supaA[sh(s)f(s,a)]α(h(s)).subscriptsupremum𝑎𝐴delimited-[]subscript𝑠𝑠𝑓𝑠𝑎𝛼𝑠\sup_{a\in A}[\nabla_{s}h(s)\cdot f(s,a)]\geq-\alpha(h(s)).roman_sup start_POSTSUBSCRIPT italic_a ∈ italic_A end_POSTSUBSCRIPT [ ∇ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT italic_h ( italic_s ) ⋅ italic_f ( italic_s , italic_a ) ] ≥ - italic_α ( italic_h ( italic_s ) ) . (21)
  3. 3.

    Multi-objective Loss Function: The combined model is trained using a loss function that includes both safety constraints and reward optimization. Let τ={(st,at,rt)}t=1T𝜏superscriptsubscriptsubscript𝑠𝑡subscript𝑎𝑡subscript𝑟𝑡𝑡1𝑇\tau=\{(s_{t},a_{t},r_{t})\}_{t=1}^{T}italic_τ = { ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) } start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT denote a trajectory, the objective is to maximize:

    maxθ,ϕ𝔼τ𝒟[t=1Tγtrt]+λi=1nCBF(hi),subscript𝜃italic-ϕsubscript𝔼similar-to𝜏𝒟delimited-[]superscriptsubscript𝑡1𝑇superscript𝛾𝑡subscript𝑟𝑡𝜆superscriptsubscript𝑖1𝑛subscriptCBFsubscript𝑖\max_{\theta,\phi}\mathbb{E}_{\tau\sim\mathcal{D}}\left[\sum_{t=1}^{T}\gamma^{% t}r_{t}\right]+\lambda\sum_{i=1}^{n}\mathcal{L}_{\text{CBF}}(h_{i}),roman_max start_POSTSUBSCRIPT italic_θ , italic_ϕ end_POSTSUBSCRIPT blackboard_E start_POSTSUBSCRIPT italic_τ ∼ caligraphic_D end_POSTSUBSCRIPT [ ∑ start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_γ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ] + italic_λ ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT caligraphic_L start_POSTSUBSCRIPT CBF end_POSTSUBSCRIPT ( italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) , (22)

    where γ𝛾\gammaitalic_γ is the discount factor, λ𝜆\lambdaitalic_λ is a weighting factor, and CBF(hi)subscriptCBFsubscript𝑖\mathcal{L}_{\text{CBF}}(h_{i})caligraphic_L start_POSTSUBSCRIPT CBF end_POSTSUBSCRIPT ( italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) is the loss associated with the control barrier function for agent i𝑖iitalic_i:

    CBF(hi)=subscriptCBFsubscript𝑖absent\displaystyle\mathcal{L}_{\text{CBF}}(h_{i})=caligraphic_L start_POSTSUBSCRIPT CBF end_POSTSUBSCRIPT ( italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = siXi,0max(0,γhi(si))subscriptsubscript𝑠𝑖subscript𝑋𝑖00𝛾subscript𝑖subscript𝑠𝑖\displaystyle\sum_{s_{i}\in X_{i,0}}\max(0,\gamma-h_{i}(s_{i}))∑ start_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ italic_X start_POSTSUBSCRIPT italic_i , 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT roman_max ( 0 , italic_γ - italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ) (23)
    +siXi,dmax(0,γ+hi(si)).subscriptsubscript𝑠𝑖subscript𝑋𝑖𝑑0𝛾subscript𝑖subscript𝑠𝑖\displaystyle+\sum_{s_{i}\in X_{i,d}}\max(0,\gamma+h_{i}(s_{i})).+ ∑ start_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ italic_X start_POSTSUBSCRIPT italic_i , italic_d end_POSTSUBSCRIPT end_POSTSUBSCRIPT roman_max ( 0 , italic_γ + italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ) .

V Loss Function Design

The loss function design incorporates both safety constraints and reward optimization:

V-A Control Barrier Function Loss

To ensure safety, the Control Barrier Function (CBF) loss includes:

  1. 1.

    Forward Invariance: Maintains the safety set’s invariance over time.

  2. 2.

    Penalty for Constraint Violations: Applies penalties for actions violating safety constraints.

  3. 3.

    Time Derivative Approximation: Uses a forward difference method for approximating time derivatives in the loss calculation.

The CBF loss is formulated as:

CBF(hi)=subscriptCBFsubscript𝑖absent\displaystyle\mathcal{L}_{\text{CBF}}(h_{i})=caligraphic_L start_POSTSUBSCRIPT CBF end_POSTSUBSCRIPT ( italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = siXi,0max(0,γhi(si))subscriptsubscript𝑠𝑖subscript𝑋𝑖00𝛾subscript𝑖subscript𝑠𝑖\displaystyle\sum_{s_{i}\in X_{i,0}}\max(0,\gamma-h_{i}(s_{i}))∑ start_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ italic_X start_POSTSUBSCRIPT italic_i , 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT roman_max ( 0 , italic_γ - italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ) (24)
+siXi,dmax(0,γ+hi(si)).subscriptsubscript𝑠𝑖subscript𝑋𝑖𝑑0𝛾subscript𝑖subscript𝑠𝑖\displaystyle+\sum_{s_{i}\in X_{i,d}}\max(0,\gamma+h_{i}(s_{i})).+ ∑ start_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ italic_X start_POSTSUBSCRIPT italic_i , italic_d end_POSTSUBSCRIPT end_POSTSUBSCRIPT roman_max ( 0 , italic_γ + italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ) .

V-B Reward Maximization Loss

The reward maximization loss aims to optimize expected rewards while adhering to safety constraints. It utilizes an inverse dynamics model to generate actions from predicted state trajectories. The expected reward is given by:

𝔼[t=1Tγtrt],𝔼delimited-[]superscriptsubscript𝑡1𝑇superscript𝛾𝑡subscript𝑟𝑡\mathbb{E}\left[\sum_{t=1}^{T}\gamma^{t}r_{t}\right],blackboard_E [ ∑ start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_γ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ] , (25)

where γ𝛾\gammaitalic_γ is the discount factor.

V-C Combined Loss

The total loss combines the CBF loss and the reward maximization loss, including a margin of satisfaction to ensure robust adherence to safety constraints.

The combined loss function is:

=CBF+λ𝔼[t=1Tγtrt],subscriptCBF𝜆𝔼delimited-[]superscriptsubscript𝑡1𝑇superscript𝛾𝑡subscript𝑟𝑡\mathcal{L}=\mathcal{L}_{\text{CBF}}+\lambda\mathbb{E}\left[\sum_{t=1}^{T}% \gamma^{t}r_{t}\right],caligraphic_L = caligraphic_L start_POSTSUBSCRIPT CBF end_POSTSUBSCRIPT + italic_λ blackboard_E [ ∑ start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_γ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ] , (26)

where λ𝜆\lambdaitalic_λ is a weighting factor that balances safety and reward optimization.

Algorithm Avg Cumulative Reward Std Dev Safety State % Violation Freq Violation Severity
Proposed Model 850 30 95% 5 0.1
PID Lagrangian Methods 800 40 92% 8 0.2
Constrained Policy Optimization 780 35 90% 10 0.3
Independent Q-Learning (IQL) 750 50 85% 15 0.5
MADDPG 770 45 88% 12 0.4
TABLE III: Performance Metrics of Different Algorithms

V-D Comparative Analysis and Results

We compare the proposed model against several baseline algorithms: (1) PID Lagrangian Methods (Stooke et al., 2020), which use Proportional-Integral-Derivative (PID) methods to enforce safety constraints in reinforcement learning; (2) Constrained Policy Optimization (Yang et al., 2022), which focuses on optimizing policies under safety constraints using projection methods; (3) Independent Q-Learning (IQL), where each agent learns its Q-function independently; and (4) Multi-Agent Deep Deterministic Policy Gradient (MADDPG), which extends DDPG to multi-agent settings, allowing for coordination among agents.

V-D1 Performance Metrics Table

The table III below compares the performance metrics of the proposed model with baseline algorithms, including average cumulative reward, standard deviation, safety state percentage, violation frequency, and violation severity.

Analysis:

  • The proposed model achieves the highest average cumulative reward (850) and the lowest standard deviation (30), indicating both high performance and consistency.

  • It maintains the highest safety state percentage (95%) with the lowest violation frequency (5) and severity (0.1), demonstrating superior adherence to safety constraints compared to the baseline algorithms.

V-D2 Generalization Performance Across Unseen Environments

This table I presents the performance of the algorithms across three unseen environments, along with the average generalization score. The evaluation of generalization performance is crucial to understand how well the model can adapt to new, unseen scenarios, which is a key aspect in real-world applications where agents encounter dynamic and unpredictable environments.

Analysis:

  • The proposed model consistently performs well across all unseen environments, achieving the highest average generalization score (𝔾=810𝔾810\mathbb{G}=810blackboard_G = 810), indicating robust generalization capabilities. This high score reflects the model’s ability to maintain its performance even when faced with scenarios it was not explicitly trained on. Such robustness is essential for deploying reinforcement learning models in practical settings, where variability and unforeseen circumstances are the norms. Formally, let trainsubscripttrain\mathcal{E}_{\text{train}}caligraphic_E start_POSTSUBSCRIPT train end_POSTSUBSCRIPT and testsubscripttest\mathcal{E}_{\text{test}}caligraphic_E start_POSTSUBSCRIPT test end_POSTSUBSCRIPT be the training and test environments respectively. The model’s performance is given by 𝔼test[R(π)]subscript𝔼subscripttestdelimited-[]𝑅superscript𝜋\mathbb{E}_{\mathcal{E}_{\text{test}}}[R(\pi^{*})]blackboard_E start_POSTSUBSCRIPT caligraphic_E start_POSTSUBSCRIPT test end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ italic_R ( italic_π start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) ] where πsuperscript𝜋\pi^{*}italic_π start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT is the optimal policy derived from trainsubscripttrain\mathcal{E}_{\text{train}}caligraphic_E start_POSTSUBSCRIPT train end_POSTSUBSCRIPT.

  • Additionally, the consistent performance across multiple unseen environments demonstrates the effectiveness of the diffusion model and control barrier functions (CBFs) in providing a stable and adaptive learning framework. This suggests that the proposed approach not only learns optimal policies but also retains flexibility and resilience, which are critical for long-term deployment and operational safety. Mathematically, let h(s,a)𝑠𝑎h(s,a)italic_h ( italic_s , italic_a ) denote the safety function and R(s,a)𝑅𝑠𝑎R(s,a)italic_R ( italic_s , italic_a ) the reward function. The model optimizes 𝔼[t=0TγtR(st,at)]𝔼delimited-[]superscriptsubscript𝑡0𝑇superscript𝛾𝑡𝑅subscript𝑠𝑡subscript𝑎𝑡\mathbb{E}\left[\sum_{t=0}^{T}\gamma^{t}R(s_{t},a_{t})\right]blackboard_E [ ∑ start_POSTSUBSCRIPT italic_t = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_γ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT italic_R ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ] subject to h(st,at)0t[0,T]subscript𝑠𝑡subscript𝑎𝑡0for-all𝑡0𝑇h(s_{t},a_{t})\geq 0\ \forall t\in[0,T]italic_h ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ≥ 0 ∀ italic_t ∈ [ 0 , italic_T ].

  • The strong generalization performance of the proposed model can also be attributed to its decentralized architecture, which allows each agent to operate based on local observations while still benefiting from centralized training. This balance between centralized policy learning and decentralized execution enables the model to effectively coordinate multi-agent actions without compromising on adaptability and responsiveness to local changes in the environment. Formally, let πi(si,oi)subscript𝜋𝑖subscript𝑠𝑖subscript𝑜𝑖\pi_{i}(s_{i},o_{i})italic_π start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) be the policy of agent i𝑖iitalic_i based on its state sisubscript𝑠𝑖s_{i}italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and local observation oisubscript𝑜𝑖o_{i}italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. The joint policy π(s)=i=1Nπi(si,oi)𝜋𝑠superscriptsubscriptproduct𝑖1𝑁subscript𝜋𝑖subscript𝑠𝑖subscript𝑜𝑖\pi(s)=\prod_{i=1}^{N}\pi_{i}(s_{i},o_{i})italic_π ( italic_s ) = ∏ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_π start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) is optimized in a decentralized manner, ensuring that [hi(si,oi)0]βdelimited-[]subscript𝑖subscript𝑠𝑖subscript𝑜𝑖0𝛽\mathbb{P}[h_{i}(s_{i},o_{i})\geq 0]\geq\betablackboard_P [ italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ≥ 0 ] ≥ italic_β for each agent i𝑖iitalic_i, where β𝛽\betaitalic_β is a predefined safety threshold.

V-D3 Hyperparameter Tuning Results

The following table II shows the hyperparameter tuning results for each algorithm, including hyperparameter values, validation rewards, and validation safety percentages. Hyperparameter tuning is an essential process in training reinforcement learning models, as it involves finding the optimal set of parameters that maximize performance while ensuring safety constraints are met.

Refer to caption
Figure 1: Sensitivity analysis of hyperparameter β𝛽\betaitalic_β showing different outcomes over episodes.

V-E Average Speed Evaluation

Refer to caption
Figure 2: Average speed comparison between our method and the baseline without LLM.

Analysis:

  • The proposed model achieves the highest validation reward (𝔼[R]=850𝔼delimited-[]𝑅850\mathbb{E}[R]=850blackboard_E [ italic_R ] = 850) and safety percentage ([safe]=0.95delimited-[]safe0.95\mathbb{P}[\text{safe}]=0.95blackboard_P [ safe ] = 0.95) with optimal hyperparameter settings. This demonstrates the hyperparameter tuning process’s efficacy, as the chosen parameters effectively balance the trade-offs between reward maximization and adherence to safety constraints. This balance is crucial for real-world applications where both performance and safety are paramount.

  • The high validation reward indicates the proposed model’s proficiency in learning policies that yield substantial returns. This is particularly significant in applications where maximizing rewards leads to superior outcomes, such as enhanced efficiency in autonomous driving (η𝜂\etaitalic_η) or improved performance in robotic tasks (ρ𝜌\rhoitalic_ρ). Formally, if R(t)𝑅𝑡R(t)italic_R ( italic_t ) denotes the reward at time t𝑡titalic_t, the model maximizes 𝔼[t=0TγtR(t)]𝔼delimited-[]superscriptsubscript𝑡0𝑇superscript𝛾𝑡𝑅𝑡\mathbb{E}[\sum_{t=0}^{T}\gamma^{t}R(t)]blackboard_E [ ∑ start_POSTSUBSCRIPT italic_t = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_γ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT italic_R ( italic_t ) ], where γ𝛾\gammaitalic_γ is the discount factor.

  • Furthermore, the 95% safety percentage underscores the model’s consistent adherence to safety constraints, thereby minimizing the risk of unsafe actions. This high safety adherence is critical for deploying reinforcement learning models in safety-critical environments, ensuring that the agents’ actions do not compromise operational integrity or lead to catastrophic failures. Mathematically, if h(s)𝑠h(s)italic_h ( italic_s ) represents the safety constraint function, the model ensures [h(st)0t[0,T]]=0.95delimited-[]subscript𝑠𝑡0for-all𝑡0𝑇0.95\mathbb{P}[h(s_{t})\geq 0\ \forall t\in[0,T]]=0.95blackboard_P [ italic_h ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ≥ 0 ∀ italic_t ∈ [ 0 , italic_T ] ] = 0.95.

  • The effectiveness of the hyperparameter tuning process also reflects the robustness of the proposed framework. By systematically exploring the parameter space ΘΘ\Thetaroman_Θ and optimizing the model settings, the framework ensures that the agents are well-prepared to handle a variety of scenarios while maintaining high performance and safety standards. Let θsuperscript𝜃\theta^{*}italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT be the optimal set of hyperparameters, then θ=argmaxθΘ𝔼[R(θ)]superscript𝜃subscript𝜃Θ𝔼delimited-[]𝑅𝜃\theta^{*}=\arg\max_{\theta\in\Theta}\mathbb{E}[R(\theta)]italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT = roman_arg roman_max start_POSTSUBSCRIPT italic_θ ∈ roman_Θ end_POSTSUBSCRIPT blackboard_E [ italic_R ( italic_θ ) ] subject to [h(s;θ)0]0.95delimited-[]𝑠𝜃00.95\mathbb{P}[h(s;\theta)\geq 0]\geq 0.95blackboard_P [ italic_h ( italic_s ; italic_θ ) ≥ 0 ] ≥ 0.95.

  • Overall, the successful hyperparameter tuning and resulting high performance metrics reinforce the potential of the proposed model as a reliable and efficient solution for multi-agent reinforcement learning in complex, real-world environments. This process not only fine-tunes the model but also enhances its generalization capabilities and operational safety, making it a strong candidate for deployment in diverse applications. Formally, the objective can be expressed as a multi-objective optimization problem: maxπ,θ𝔼[R(π,θ)] s.t. [h(st)0]0.95tsubscript𝜋𝜃𝔼delimited-[]𝑅𝜋𝜃 s.t. delimited-[]subscript𝑠𝑡00.95for-all𝑡\max_{\pi,\theta}\mathbb{E}[R(\pi,\theta)]\text{ s.t. }\mathbb{P}[h(s_{t})\geq 0% ]\geq 0.95\ \forall troman_max start_POSTSUBSCRIPT italic_π , italic_θ end_POSTSUBSCRIPT blackboard_E [ italic_R ( italic_π , italic_θ ) ] s.t. blackboard_P [ italic_h ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ≥ 0 ] ≥ 0.95 ∀ italic_t, where π𝜋\piitalic_π denotes the policy and θ𝜃\thetaitalic_θ represents the hyperparameters.

VI Visualization Results

VI-A Episode Reward Progression

Figure  3 shows the progression of rewards over episodes, highlighting the learning process of the model. The dashed line represents the 100-episode average reward, providing a smoothed view of the agent’s performance trends over time. The red dots indicate saved checkpoints, signifying significant improvements or milestones in the training process. Initially, there are fluctuations in the reward, but as training progresses, the reward trend generally increases, demonstrating the agent’s learning and improvement over time.

Refer to caption
Figure 3: Reward progression over episodes with significant checkpoints marked.

VI-B Reward Distribution in Aggressive Model

Figure  4 illustrates the distribution of various rewards for an aggressive model across different actions. The plot includes collision rewards, right lane rewards, high-speed rewards, and road rewards, each represented by different line styles and colors. This visualization helps to understand how the aggressive model balances different reward components when taking actions. The high fluctuations in collision and high-speed rewards suggest that the aggressive model frequently encounters risk-reward trade-offs.

Refer to caption
Figure 4: Distribution of rewards for the aggressive model across actions.

VI-C Hyperparameter Sensitivity Analysis

Figure 3 presents the sensitivity analysis of the hyperparameter β𝛽\betaitalic_β. The left and right subplots show the rates of different outcomes (Leader go first, Crash, Slow, Follower go first) over episodes for different values of β𝛽\betaitalic_β. The plots illustrate how the choice of β𝛽\betaitalic_β affects the agent’s performance and safety outcomes. Higher β𝛽\betaitalic_β values tend to lead to safer behaviors with fewer crashes, while lower values might result in more aggressive behaviors but higher risks. Figure  2 compares the average speed over evaluation epochs between our method and a baseline method without using a large language model (LLM). The solid red line represents our method, while the dashed blue line represents the baseline method. The shaded areas indicate the variability of speeds during the evaluation. Our method consistently outperforms the baseline, achieving higher average speeds with less variability, indicating more stable and efficient performance.

VII CONCLUSION

This paper presents a novel framework integrating diffusion models and Control Barrier Functions (CBFs) for offline multi-agent reinforcement learning (MARL) with safety constraints. Our approach addresses the challenges of ensuring safety in dynamic and uncertain environments, crucial for applications such as autonomous driving, robotics, and healthcare. Leveraging diffusion models for trajectory prediction and planning, our model allows agents to anticipate future states and coordinate actions effectively. The incorporation of CBFs dynamically enforces safety constraints, ensuring agents operate within safe bounds at all times. Extensive experiments on the DSRL benchmark and additional safety-critical datasets show that our model consistently outperforms baseline algorithms in cumulative rewards and adherence to safety constraints. Hyperparameter tuning results further validate the robustness and efficiency of our approach. The strong generalization capabilities of our model, demonstrated by superior performance across unseen environments, highlight its potential for real-world deployment. This adaptability ensures the framework remains effective even in scenarios not encountered during training. In conclusion, our integration of diffusion models with CBFs offers a promising direction for developing safe and efficient MARL systems. Future work will extend this framework to more complex environments and refine the integration of safety constraints to enhance the reliability and performance of MARL systems in real-world applications.

VIII Limitation

While our framework integrating diffusion models and Control Barrier Functions (CBFs) shows significant advancements in ensuring safety and performance in multi-agent reinforcement learning (MARL), several limitations must be acknowledged. The computational complexity can be substantial, especially in high-dimensional environments with many agents, leading to increased training times and resource requirements. Approximation errors in CBF constraints may affect stability, particularly in dynamic environments. The current implementation assumes minimal communication delays, which may not hold in real-world applications like autonomous driving. The framework also relies heavily on the quality of offline datasets, risking poor generalization in unobserved situations. Finally, our evaluation is limited to benchmark scenarios, and real-world environments may present unforeseen challenges. Addressing these limitations through optimized computational methods, robust communication protocols, and diverse datasets will be critical for practical applicability.

References

  • [1] Garcıa, J., & Fernández, F. (2015). A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1), 1437-1480.
  • [2] Achiam, J., Held, D., Tamar, A., & Abbeel, P. (2017). Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning-Volume 70 (pp. 22-31).
  • [3] Fisac, J. F., Akametalu, A. K., Zeilinger, M. N., Kaynama, S., Gillula, J. H., & Tomlin, C. J. (2018). A general safety framework for learning-based control in uncertain robotic systems. IEEE Transactions on Automatic Control, 64(7), 2737-2752.
  • [4] Chow, Y., Nachum, O., Duenez-Guzman, E., & Ghavamzadeh, M. (2018). A Lyapunov-based approach to safe reinforcement learning. In Advances in Neural Information Processing Systems (pp. 8092-8101).
  • [5] Ames, A. D., Xu, X., Grizzle, J. W., & Tabuada, P. (2017). Control barrier function based quadratic programs for safety critical systems. IEEE Transactions on Automatic Control, 62(8), 3861-3876.
  • [6] Stooke, A., Achiam, J., & Abbeel, P. (2020). Responsive safety in reinforcement learning by PID Lagrangian methods. In Proceedings of the 37th International Conference on Machine Learning (pp. 9133-9143).
  • [7] Yang, L., Ji, J., Dai, J., Zhang, L., Zhou, B., Li, P., Yang, Y., & Pan, G. (2022). Constrained update projection approach to safe policy optimization. ArXiv, abs/2209.07089.
  • [8] Zhang, K., Yang, Z., & Başar, T. (2021). Multi-agent reinforcement learning: A selective overview of theories and algorithms. Handbook of Reinforcement Learning and Control, 321-384.
  • [9] Li, S., Gao, Y., Meng, Z., & Zheng, Z. (2021). Graph-based approaches for multi-agent reinforcement learning: A survey. IEEE Transactions on Neural Networks and Learning Systems, 32(10), 4331-4352.
  • [10] Ajay, A., Song, J., Eysenbach, B., Zhou, S., Finn, C., & Levine, S. (2023). Diffusing policies for goal-conditioned exploration. In Proceedings of the 40th International Conference on Machine Learning.
  • [11] Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., & Poole, B. (2021). Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations.
  • [12] Chen, T., Zhang, R., Zhang, W., Sun, M., & Liu, W. (2022). Model predictive control with trajectory optimization for autonomous navigation using diffusion models. In IEEE/RSJ International Conference on Intelligent Robots and Systems.
  • [13] Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.
  • [14] Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., & Riedmiller, M. (2014). Deterministic policy gradient algorithms. In International Conference on Machine Learning (pp. 387-395).
  • [15] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., … & Wierstra, D. (2016). Continuous control with deep reinforcement learning. In International Conference on Learning Representations.
  • [16] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing atari with deep reinforcement learning. In Proceedings of the 26th International Conference on Neural Information Processing Systems (pp. 2642-2650).
  • [17] Levine, S., Pastor, P., Krizhevsky, A., Ibarz, J., & Quillen, D. (2018). Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. International Journal of Robotics Research, 37(4-5), 421-436.
  • [18] Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning (pp. 1861-1870).
  • [19] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
  • [20] Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., … & Silver, D. (2019). Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782), 350-354.
  • [21] Fujimoto, S., Hoof, H. V., & Meger, D. (2018). Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning (pp. 1587-1596).
  • [22] Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, P., & Mordatch, I. (2017). Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems (pp. 6379-6390).
  • [23] Mordatch, I., & Abbeel, P. (2018). Emergence of grounded compositional language in multi-agent populations. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 32, No. 1).
  • [24] OpenAI, Berner, C., Brockman, G., Chan, B., Cheung, V., Debiak, P., … & Zhokhov, P. (2019). Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680.
  • [25] Shalev-Shwartz, S., Shammah, S., & Shashua, A. (2016). Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295.
  • [26] Papoudakis, G., Christianos, F., Schäfer, L., & Albrecht, S. V. (2021). Benchmarking multi-agent deep reinforcement learning algorithms in cooperative tasks. In Advances in Neural Information Processing Systems (pp. 4671-4684).
  • [27] Stone, P., & Veloso, M. (2000). Multiagent systems: A survey from a machine learning perspective. Autonomous Robots, 8(3), 345-383.
  • [28] Rashid, T., Samvelyan, M., De Witt, C. S., Farquhar, G., Foerster, J., & Whiteson, S. (2018). QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. In International Conference on Machine Learning (pp. 4295-4304).
  • [29] Foerster, J., Farquhar, G., Afouras, T., Nardelli, N., & Whiteson, S. (2018). Counterfactual multi-agent policy gradients. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 32, No. 1).
  • [30] Hernandez-Leal, P., Kartal, B., & Taylor, M. E. (2019). A survey and critique of multiagent deep reinforcement learning. Autonomous Agents and Multi-Agent Systems, 33, 750-797.
  • [31] Mousavi, S. S., Schukat, M., & Howley, E. (2016). Deep reinforcement learning: An overview. In Proceedings of SAI Intelligent Systems Conference (pp. 426-440).
  • [32] Oliehoek, F. A., & Amato, C. (2016). A concise introduction to decentralized POMDPs. Springer.
  • [33] Matignon, L., Laurent, G. J., & Le Fort-Piat, N. (2012). Independent reinforcement learners in cooperative Markov games: a survey regarding coordination problems. The Knowledge Engineering Review, 27(1), 1-31.
  • [34] Nguyen, T. T., Nguyen, N. D., & Nahavandi, S. (2020). Deep reinforcement learning for multi-agent systems: A review of challenges, solutions, and applications. IEEE Transactions on Cybernetics, 50(6), 3826-3839.
  • [35] Yang, Y., Luo, R., Li, M., Zhou, M., Zhang, W., & Wang, J. (2018). Mean field multi-agent reinforcement learning. In International Conference on Machine Learning (pp. 5571-5580).
  • [36] Wang, Y., Yuan, Y., Zhang, T., & Zhang, C. (2020). Multi-agent reinforcement learning with emergent roles. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, No. 04, pp. 7281-7288).
  • [37] Zhou, M., Zhang, W., & Tang, Y. (2020). Factorized Q-learning for large-scale multi-agent systems. In International Conference on Learning Representations.
  • [38] Ye, D., Zhang, M., & Yang, Y. (2015). A multi-agent framework for packet routing in wireless sensor networks. Sensors, 15(5), 10026-10047.
  • [39] Busoniu, L., Babuska, R., & De Schutter, B. (2008). A comprehensive survey of multi-agent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 38(2), 156-172.
  • [40] Tampuu, A., Matiisen, T., Kodelja, D., Kuzovkin, I., Korjus, K., Aru, J., … & Vicente, R. (2017). Multiagent cooperation and competition with deep reinforcement learning. PloS one, 12(4), e0172395.
  • [41] Iqbal, S., & Sha, F. (2019). Actor-attention-critic for multi-agent reinforcement learning. In International Conference on Machine Learning (pp. 2961-2970).
  • [42] Jin, Y., Zhang, L., & Gao, S. (2019). Dual-attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 6299-6307).
  • [43] Tan, M. (1993). Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the tenth international conference on machine learning (pp. 330-337).
  • [44] Meng, F., Ling, Y., Wu, Y., Song, Q., & Wang, Z. (2021). Curriculum-based multi-agent reinforcement learning. In Advances in Neural Information Processing Systems (pp. 19721-19733).
  • [45] Suarez, J., Su, X., Xia, Y.,& Kaelbling, L. (2021). Neural programming architectures for deep reinforcement learning. In International Conference on Learning Representations.