
Meta SAC-Lag: Towards Deployable Safe Reinforcement Learning via MetaGradient-based Hyperparameter Tuning

Homayoun Honari¹, Amir M. Soufi Enayati¹, Mehran Ghafarian Tamizi², Homayoun Najjaran¹,²

*This work was supported by Apera AI and Mathematics of Information Technology and Complex Systems (MITACS) under IT16412 Mitacs Accelerate and Natural Sciences and Engineering Research Council (NSERC) Canada under the Grant RTI-2023-00418, and partial equipment support was received from Kinova® Inc.

¹Department of Mechanical Engineering, University of Victoria, Victoria, BC, Canada {hmnhonari,amsoufi,najjaran}@uvic.ca

²Department of Electrical and Computer Engineering, University of Victoria, Victoria, BC, Canada mehranght@uvic.ca

Abstract

Safe Reinforcement Learning (Safe RL) is a widely studied class of trial-and-error-based methods intended for deployment on real-world systems. In safe RL, the goal is to maximize reward performance while minimizing constraint violations, often achieved by setting bounds on constraint functions and utilizing the Lagrangian method. However, deploying Lagrangian-based safe RL in real-world scenarios is challenging due to the necessity of threshold fine-tuning, as imprecise adjustments may lead to suboptimal policy convergence. To mitigate this challenge, we propose a unified Lagrangian-based model-free architecture called Meta Soft Actor-Critic Lagrangian (Meta SAC-Lag). Meta SAC-Lag uses meta-gradient optimization to automatically update the safety-related hyperparameters. The proposed method is designed to address safe exploration and threshold adjustment with minimal hyperparameter tuning. In our pipeline, the inner parameters are updated through the conventional formulation, and the hyperparameters are adjusted using meta-objectives that are defined based on the updated parameters. Our results show that the agent can reliably adjust the safety performance due to the relatively fast convergence rate of the safety threshold. We evaluate the performance of Meta SAC-Lag in five simulated environments against Lagrangian baselines, and the results demonstrate its capability to create synergy between parameters, yielding better or competitive results. Furthermore, we conduct a real-world experiment involving a robotic arm tasked with pouring coffee into a cup without spillage. Meta SAC-Lag is successfully trained to execute the task while minimizing effort constraints. The success of Meta SAC-Lag in performing the experiment is intended to be a step toward practical deployment of safe RL algorithms to learn the control process of safety-critical real-world systems without explicit engineering.

I Introduction

Reinforcement Learning (RL) is one of the most important paradigms for learning to control physical systems. However, a major shortcoming of RL is its need for exploration and extensive trial and error. For that reason, while we observe its wide success in various domains such as energy systems [1], video games [2], and robotics [3], the real-world deployment of these algorithms to learn the control process is challenging [4], since exploration might lead the system to states that damage it and incur heavy costs to the user. Safe RL methods aim to address this issue by optimizing the policy such that it complies with the constraints, which are defined to prevent the system from exceeding its physical limitations.

The most common approach in safe RL is the Lagrangian method. Specified under the Constrained Markov Decision Process (CMDP) framework, the multi-objective optimization problem is converted to a constraint satisfaction problem by defining thresholds for the constraints, and is then solved by casting it as an unconstrained problem using the Lagrangian method. While the approach has been extensively studied in the literature [5, 6], without precise tuning and engineering of the constraint thresholds, Lagrangian methods suffer from convergence to suboptimal policies. For that reason, the iterative process of hyperparameter tuning makes the real-world use of these algorithms challenging.

Figure 1: Safety-critical environments used to deploy Meta SAC-Lag. The top two rows represent simulated environments with four general safety topics: locomotion (a), obstacle avoidance (b,c), robotic manipulation (d), dexterous manipulation (e). The bottom row represents Pour Coffee environment (f,g) used to study the deployability of the algorithm in a real-world setup.

To address these challenges, our algorithm builds on the Soft Actor-Critic (SAC) architecture [7] and targets two fundamental problems: safe exploration and tuning-free constraint adjustment. Previous attempts to automate the tuning of the exploration-related hyperparameters of SAC have mostly focused on optimizing the performance of the system [8, 9], whereas work on the safety compliance of SAC has been limited. To this end, we propose a threshold-free safety-aware exploration optimization pipeline. In addition, our approach optimizes the safety threshold according to the overall performance of the policy. We update the aforementioned hyperparameters using the metagradients w.r.t. the meta-objectives, which are computed based on the gradients of the internal learnable parameters. Finally, to assess Meta SAC-Lag, as depicted in Fig. 1, we compare it against several baseline algorithms in five simulated robotic tasks with four different application themes. We observe that our method attains better or comparable results in terms of safety or reward performance while automatically tuning the safety-related hyperparameters. Furthermore, we present a safety benchmark test case, called Pour Coffee, in which the robot relocates a coffee-filled mug and pours its contents into another cup. A constraint violation occurs in case of collision or coffee spillage. We deploy and train Meta SAC-Lag in a real-world setup using a Kinova Gen3 robot. Our implementation shows that Meta SAC-Lag is not only capable of safe deployment without the iterative process of hyperparameter tuning, but that the learned policy also executes the task smoothly and without jerk, with minimum effort imposed on the system.

Our main contributions can be summarized as:

  • We propose a Lagrangian-based safe RL method able to automatically adjust the constraint bounds.

  • Meta SAC-Lag addresses safe exploration through an unconstrained metagradient-based optimization pipeline.

  • We validate the applicability of Meta SAC-Lag in five simulated robotic environments against baseline algorithms.

  • A test environment, called Pour Coffee, is presented, and, with minimal prior safety-related hyperparameter tuning, Meta SAC-Lag is trained on a real-world Kinova Gen3 setup. The algorithm successfully achieves the task objective with minimized effort exerted on the robot.

II Related Work

The constrained Markov decision process (CMDP) is the theoretical building block of safe RL. CMDPs have been widely studied in the RL paradigm [10, 11] and are solved using Lagrangian methods [12]. In this regard, Shen et al. [13] devised the risk-sensitive policy optimization (RSPO) algorithm, which sequentially decreases the Lagrangian multiplier to zero. Furthermore, Stooke et al. [14] update the multiplier using PID control. Additionally, reward-constrained policy optimization (RCPO) employs dual gradient descent optimization for the policy and Lagrange multiplier [15].

In another aspect, metagradient optimization has been explored thoroughly in RL hyperparameter tuning. Initially, model-agnostic meta-learning (MAML) [16] introduced meta-optimization of initial weights to enable fast task adaptation within a few gradient descent steps. In a different approach, Meta-Gradient RL [17] extended the concept to learn the hyperparameters of return functions online. This paradigm offered a general approach, applicable to other RL hyperparameters. Subsequently, similar techniques were applied for auto-tuning other RL hyperparameters, such as exploration thresholds [18], entropy temperature in SAC [9], auxiliary tasks and sub-goals [19], and differentiable hyperparameters of loss functions [20]. Despite these advancements, metagradient methods have not been extensively explored in constrained RL paradigms [5], with few applications focusing on ensuring safety in sensitive learning tasks, as seen in the work by Calian et al. [21]. The authors utilized meta-gradients to update the Lagrange multiplier learning rate in an off-policy RL framework.

Lagrangian methods are not the sole approach to solving safe RL. Thananjeyan et al. [22] trained a recovery policy in parallel to the task policy and used it whenever the task policy chose actions deemed too risky. More prominently, model-based RL and safety guarantees for risk aversion during training have been proposed. Koppejan et al. [23] used a neuroevolutionary approach that exploits domain expertise to learn safe models for model-based RL. Thomas et al. [24], in a different approach, used near-future imagination to plan safe trajectories ahead of time. Moldovan et al. [25] focused on risk aversion in MDPs using near-optimal Chernoff bounds. Lyapunov functions have also been used to guarantee safety during training [26, 27], though constructing Lyapunov functions remains a challenge due to their typically hand-crafted nature and the absence of clear principles for agent safety and performance optimization.

In summary, RL agent evaluation relies on human-provided rewards, but the risk of misstated human objectives is often overlooked. Also, despite significant progress in safe RL methodologies, there remains a big gap in readily deployable agents in industrial contexts. Moreover, a critical balance exists between reward and cost in RL, as each action can impact both aspects, creating a multi-dimensional problem. These challenges hinder robust and dependable RL algorithms suitable for real-world implementation. Our motivation for this work is to use metagradient optimization for self-tuning of the safety threshold. This will minimize the need for safety-related hyperparameter tuning in safe RL while improving performance. Ultimately, the self-tuning safety threshold will enable us to deploy the agent directly in the real world.

III Background

In this section, we investigate the background by exploring essential preliminary concepts that serve as the foundation for this paper. We start with the discussion of the CMDP framework. Furthermore, we delve into the formulation of the safety critic and SAC.

III-A Constrained Markov Decision Process (CMDP)

CMDP comprises the tuple $\langle\mathcal{S},\mathcal{A},\mathcal{P},r,c,\gamma_{r},\gamma_{c},\rho_{0}\rangle$, where $\mathcal{S}$ denotes the state space, $\mathcal{A}$ represents the action space, and $r$ denotes the reward function: $r:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\mapsto\mathbb{R}$. The transition function $\mathcal{P}:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\mapsto[0,1]$ defines the likelihood $\mathcal{P}(s'|s,a)$ of moving from state $s$ to $s'$ by executing action $a$. The probability distribution function $\rho_{0}:\mathcal{S}\mapsto[0,1]$ denotes the initial state distribution of the framework. Furthermore, $c(s)$ is the constraint indicator function which determines whether state $s$ violates the constraint functions specified by $C$: $c(s)=\mathbb{1}[C(s)=1]$. Parameters $\gamma_{r}\in[0,1)$ and $\gamma_{c}\in[0,1)$ serve as discount factors for reward and safety critics, respectively.
Ultimately, the solution to CMDP is represented by the policy $\pi:\mathcal{S}\times\mathcal{A}\mapsto[0,1]$ which is the probability distribution over actions. The value function associated with policy $\pi$ for a specific state-action pair $(s,a)$ and the corresponding recursive equation, known as the Bellman equation, can be formulated as follows:

$$
\begin{aligned}
Q^{\pi}_{r}(s,a) &= \mathbb{E}_{s_{t}\sim\mathcal{P},\,a_{t}\sim\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t},a_{t})\,\Big|\,s_{0}=s,\,a_{0}=a\Big] \\
&= \mathbb{E}^{\pi}_{s'\sim\mathcal{P}}\big[r(s,a)+\gamma V^{\pi}_{r}(s')\big]
\end{aligned}
\tag{1}
$$

Additionally, the primary function of the safety critic is to estimate the probability of a policy failure occurring in the future, determined by the expected cumulative discounted probability of failure.

$$
\begin{aligned}
Q_{c}^{\pi}(s,a) &= \mathbb{E}_{s_{t}\sim\mathcal{P},\,a_{t}\sim\pi}\Big[c(s)+(1-c(s))\sum_{t=1}^{\infty}\big[\gamma_{c}^{t}c(s_{t})\big]\Big] \\
&= \Pr[c(s)=1]+\gamma_{c}\,\mathbb{E}_{s'\sim\mathcal{P}}\big[(1-c(s))V_{c}^{\pi}(s')\big]
\end{aligned}
\tag{2}
$$
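As a rough illustration of the first line of Eq. (2), the sketch below computes a single-trajectory Monte Carlo sample of the safety critic's value from a sequence of constraint indicators. The function name `failure_estimate` and the list-based trajectory format are assumptions made for this example and are not taken from the paper.

```python
def failure_estimate(costs, gamma_c):
    """One-trajectory sample of the quantity inside the expectation of Eq. (2).

    costs[t] is the 0/1 constraint indicator c(s_t), with costs[0] = c(s).
    Hypothetical helper: the name and input format are illustrative only.
    """
    c0 = costs[0]
    # Discounted indicators of future violations, for t = 1, 2, ...
    tail = sum(gamma_c ** t * c for t, c in enumerate(costs) if t >= 1)
    # If the current state already violates the constraint, the sample is 1.
    return c0 + (1 - c0) * tail


# A violation two steps ahead is discounted by gamma_c ** 2.
print(failure_estimate([0, 0, 1], gamma_c=0.5))
```

Averaging such samples over many trajectories started from the same state-action pair would approximate the expectation in Eq. (2).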

Finally, the main objective of an RL algorithm in a CMDP framework is to find a policy to maximize expected return while satisfying the constraints starting from the initial state s0subscript𝑠0s_{0}italic_s start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT:

$$
\begin{aligned}
\pi^{*} &= \underset{\pi\in\Pi}{\text{argmax}}~\mathcal{J}^{\pi}_{r}=\underset{\pi\in\Pi}{\text{argmax}}~\mathbb{E}^{\pi}_{s_{0}\sim\rho_{0}}\Big[\sum_{t=0}^{\infty}\gamma^{t}r_{t}\Big] \\
&\quad\mathrm{s.t.}~~\mathcal{J}_{c}^{\pi}=\mathbb{E}^{\pi}_{s_{0}\sim\rho_{0}}\Big[\sum_{t=0}^{\infty}\gamma_{c}^{t}c_{t}\Big]\leq\varepsilon
\end{aligned}
\tag{3}
$$
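To make the constrained objective in Eq. (3) concrete, the following sketch estimates the reward objective and the cost objective from a handful of sampled episodes and checks the estimate against the threshold. The helper names and the `(rewards, costs)` episode format are hypothetical and chosen only for illustration.

```python
def discounted_sum(values, gamma):
    """Return sum_t gamma**t * values[t] for one episode."""
    return sum(gamma ** t * v for t, v in enumerate(values))


def estimate_objectives(episodes, gamma_r, gamma_c):
    """Monte Carlo estimates of J_r and J_c from (rewards, costs) episodes."""
    n = len(episodes)
    j_r = sum(discounted_sum(rewards, gamma_r) for rewards, _ in episodes) / n
    j_c = sum(discounted_sum(costs, gamma_c) for _, costs in episodes) / n
    return j_r, j_c


# Two toy episodes: the first ends in a constraint violation.
episodes = [([1.0, 1.0, 1.0], [0, 0, 1]), ([1.0, 1.0], [0, 0])]
j_r, j_c = estimate_objectives(episodes, gamma_r=0.5, gamma_c=0.5)
epsilon = 0.2  # assumed threshold, for illustration only
print(j_r, j_c, j_c <= epsilon)
```

A policy is feasible under Eq. (3) only when the estimated cost objective stays at or below the threshold, which is why the choice of the threshold matters so much in practice.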

III-B Soft Actor Critic (SAC)

SAC [7] optimizes a stochastic policy in an off-policy manner, utilizing two neural networks: one for estimating the $Q$-function (critic) and another for policy updates (actor). A key feature of SAC is entropy regularization, where the policy aims to strike a balance between maximizing expected return and maximizing entropy. This balance mirrors the exploration-exploitation trade-off; higher entropy encourages greater exploration, potentially accelerating learning and preventing convergence to suboptimal solutions.

Considering $\omega_{r}$ and $\phi$ as parameters representing the critic and actor networks, respectively, training these networks involves sampling a batch of samples from the replay buffer. $\omega_{r}$ is updated by taking the gradient through the mean squared error (MSE) loss between the critic output and the target value:

$$
\mathcal{J}_{r}^{Q_{\omega_{r}}}=\mathbb{E}_{(s,a,r)\sim\mathcal{D}}\Big[\frac{1}{2}\big(Q_{\omega_{r}}(s,a)-Q_{r}^{\text{tar}}(s,a)\big)^{2}\Big],
\tag{4}
$$

where $Q^{\mathrm{tar}}_{r}$ is calculated as:

$$
Q_{r}^{\text{tar}}(s,a)=\mathbb{E}_{\substack{s'\sim\mathcal{P}(s,a)\\ a'\sim\pi_{\phi}}}\Big[r(s,a)+\gamma_{r}\big(Q_{r}(s',a')-\alpha\log(\pi_{\phi}(a'|s'))\big)\Big]
\tag{5}
$$
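Eqs. (4) and (5) can be sketched on plain Python lists as follows. This is a minimal illustration under simplifying assumptions: critic outputs and log-probabilities are given as precomputed numbers, and practical details such as target networks and terminal-state handling are omitted. The function names are hypothetical.

```python
def soft_target(r, q_next, logp_next, gamma_r, alpha):
    """Single-sample version of Eq. (5): reward plus discounted soft value."""
    return r + gamma_r * (q_next - alpha * logp_next)


def critic_loss(q_pred, q_target):
    """Batch estimate of Eq. (4): mean of half squared errors."""
    n = len(q_pred)
    return sum(0.5 * (q - y) ** 2 for q, y in zip(q_pred, q_target)) / n


# Toy batch of two transitions (all numbers are made up for illustration).
rewards = [1.0, 0.0]
q_next = [2.0, 1.0]
logp_next = [-1.0, -0.5]
targets = [soft_target(r, q, lp, gamma_r=0.5, alpha=0.2)
           for r, q, lp in zip(rewards, q_next, logp_next)]
print(targets, critic_loss([2.0, 0.5], targets))
```

Because the log-probability enters with a negative sign, a lower-probability next action raises the target, which is how the entropy bonus reaches the critic.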

Furthermore, the policy $\pi_{\phi}$ is optimized by taking the gradient through the critic and the expected entropy of the policy:

$$
\mathcal{J}_{r}^{\pi_{\phi}}=\mathbb{E}_{\substack{s\sim\mathcal{D}\\ a\sim\pi_{\phi}}}\big[\alpha\log(\pi_{\phi}(a|s))-Q_{\omega_{r}}(s,a)\big]
\tag{6}
$$

Finally, it is important to note that the safety critic ($Q_{\omega_{c}}$) defined in Section III-A is trained using the same loss formulation in Eq. 4, without the entropy term.
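One plausible reading of this remark, combined with the recursion in Eq. (2), is a bootstrapped target for the safety critic that contains no entropy term and stops bootstrapping once a violation has occurred. The sketch below follows that reading; the function name and the single-sample form are assumptions made for this example, not the paper's implementation.

```python
def safety_target(c, q_c_next, gamma_c):
    """Single-sample safety-critic target following the recursion in Eq. (2).

    c is the 0/1 constraint indicator of the current state and q_c_next is
    the safety critic's estimate at the next state-action pair.
    """
    # No entropy term; the (1 - c) factor removes the bootstrap after a violation.
    return c + gamma_c * (1 - c) * q_c_next


print(safety_target(0, 0.4, gamma_c=0.9))  # safe state: discounted future risk
print(safety_target(1, 0.4, gamma_c=0.9))  # violating state: target is 1
```

The resulting targets would then replace the reward target in the squared-error loss of Eq. 4.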

Figure 2: The Computational Graph of the Meta SAC-Lag.

IV Method

In this section, the process of metagradient optimization of the safety threshold $\varepsilon$ and entropy temperature $\alpha$ is discussed.

IV-A Metagradient Optimization

Metagradient optimization is the process with which we can optimize the hyperparameters that are not a part of the main loss function. Fundamentally, these meta-parameters dictate the dynamics of the system and direct it toward a certain behavior. (In this paper, we use the terms hyperparameters and meta-parameters interchangeably.) In the context of metagradient reinforcement learning [17], in abstract terms, the learnable system variables are parameterized as $\theta$. These parameters are updated to $\theta'$ by following the rule:

$$
\theta'=\theta+f(\mathcal{J},\theta,\eta,\mathcal{B})
\tag{7}
$$

where $\eta$ is the list of hyperparameters, $\mathcal{B}$ a mini-batch of experience, and $f$ the gradient of the objective function $\mathcal{J}$ w.r.t. $\theta$. Furthermore, the optimization process of the meta-parameters $\eta$ can be formulated based on the updated parameter $\theta'$:

$$
\eta'=\eta+\beta_{\eta}\frac{\partial\mathcal{J}'(\theta',\eta,\mathcal{B}')}{\partial\eta}=\eta+\beta_{\eta}\frac{\partial\mathcal{J}'(\theta',\eta,\mathcal{B}')}{\partial\theta'}\frac{\mathrm{d}\theta'}{\mathrm{d}\eta}
\tag{8}
$$

where $\mathcal{J}'$ is the meta-objective used for the optimization of the meta-parameters, $\beta_{\eta}$ the learning rate associated with $\eta$, and $\mathcal{B}'$ a resampled mini-batch validation set similar to the cross-validation method in the meta-optimization literature [28]. Finally, $\frac{\mathrm{d}\theta'}{\mathrm{d}\eta}$ can be calculated as:

$$
\frac{\mathrm{d}\theta'}{\mathrm{d}\eta}=\biggl(I+\frac{\partial f(\mathcal{J},\theta,\eta,\mathcal{B})}{\partial\theta}\biggr)\frac{\mathrm{d}\theta}{\mathrm{d}\eta}+\frac{\partial f(\mathcal{J},\theta,\eta,\mathcal{B})}{\partial\eta}
\tag{9}
$$
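The interplay of Eqs. (7) to (9) can be traced on a scalar toy problem where every derivative is available in closed form. The quadratic inner objective, the quadratic meta-objective, the step sizes, and the name `meta_step` are all assumptions made for this sketch; there are no mini-batches here, so the resampling of the validation batch is not represented.

```python
def meta_step(theta, eta, z, lr, beta, target):
    """One inner update (Eq. 7), trace update (Eq. 9), and meta update (Eq. 8).

    Toy setting, assumed purely for illustration:
      inner objective  J(theta, eta) = -0.5 * (theta - eta) ** 2
      meta-objective   J'(theta')    = -0.5 * (theta' - target) ** 2
    z tracks the running derivative d(theta)/d(eta).
    """
    # Eq. 7 with f = lr * dJ/dtheta = lr * (eta - theta)
    theta_new = theta + lr * (eta - theta)
    # Eq. 9 with df/dtheta = -lr and df/deta = lr
    z_new = (1.0 - lr) * z + lr
    # Eq. 8 with dJ'/dtheta' = target - theta'
    eta_new = eta + beta * (target - theta_new) * z_new
    return theta_new, eta_new, z_new


theta, eta, z = 0.0, 0.0, 0.0
for _ in range(200):
    theta, eta, z = meta_step(theta, eta, z, lr=0.5, beta=0.1, target=1.0)
print(theta, eta, z)
```

In this toy setting the inner parameter chases the hyperparameter while the hyperparameter is steered by the meta-objective, so both settle near the meta-objective's optimum even though the hyperparameter never appears in the meta-objective directly.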

IV-B SAC-Lagrangian

In the Lagrangian version of the SAC we aim to optimize the policy based on its reward objective such that it is compliant with the safety objective:

$$
\begin{aligned}
\pi^{*}_{\phi} &= \max_{\pi_{\phi}\in\Pi}~\mathcal{J}_{r_{\pi_{\phi}}}^{\pi_{\phi}}=\mathbb{E}_{\substack{s\sim\mathcal{D}\\ a\sim\pi_{\phi}}}\big[Q_{\omega_{r}}(s,a)-\alpha\log\pi_{\phi}(s,a)\big] \\
&\quad\mathrm{s.t.}~~\mathcal{J}_{c}^{\pi_{\phi}}=\mathbb{E}_{\substack{s\sim\mathcal{D}\\ a\sim\pi_{\phi}}}\big[Q_{\omega_{c}}(s,a)\big]\leq\varepsilon
\end{aligned}
\tag{10}
$$

Naturally, multiple constraints can be defined for the policy to consider all of them. However, in this paper, in order to keep the formulation simple and general, we consider a single constraint signal that is the result of the superposition of all the constraint functions. In this paper, in contrast to [7], we refrain from considering $\alpha$ as an additional constraint and aim to optimize it through metagradient optimization.

Furthermore, the policy optimization in Eq. 10 is carried out by casting it as a Lagrangian loss and backpropagating through that loss:

$$
\begin{aligned}
\min_{\nu\geq 0}~\max_{\pi_{\phi}\in\Pi}~\mathcal{L}(\pi_{\phi},\nu,\varepsilon,\alpha) &= \mathcal{J}_{r_{\pi_{\phi}}}^{\pi_{\phi}}-\nu\left(\mathcal{J}_{c}^{\pi_{\phi}}-\varepsilon\right) \\
&= \mathbb{E}_{\substack{s\sim\mathcal{D}\\ a\sim\pi_{\phi}}}\left[Q_{\omega_{r}}(s,a)-\alpha\log{\pi_{\phi}(s,a)}-\nu\left(Q_{\omega_{c}}(s,a)-\varepsilon\right)\right]
\end{aligned}
\tag{11}
$$

where $\nu$ is the Lagrange multiplier.
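As a minimal sketch of Eq. 11, assume the critic outputs and log-probabilities for a batch are already available as plain Python lists. The function names, and the clamp that keeps $\nu$ non-negative, are illustrative assumptions rather than details taken from the paper. Under those assumptions, the batch estimate of the Lagrangian and one descent step on $\nu$ could look like this:

```python
def lagrangian_estimate(q_r, q_c, log_pi, nu, eps, alpha):
    """Batch (Monte Carlo) estimate of the Lagrangian in Eq. 11.

    q_r, q_c, log_pi: per-sample values of Q_r(s,a), Q_c(s,a), log pi(s,a).
    """
    n = len(q_r)
    return sum(
        qr - alpha * lp - nu * (qc - eps)
        for qr, qc, lp in zip(q_r, q_c, log_pi)
    ) / n


def dual_step(q_c, nu, eps, lr_nu):
    """One gradient-descent step on nu for the Lagrangian in Eq. 11.

    The derivative of the Lagrangian w.r.t. nu is -(mean(Q_c) - eps), so
    descending on nu raises it when the estimated cost exceeds eps.
    The clamp to nu >= 0 is an assumed way to respect the min over nu >= 0.
    """
    grad_nu = -(sum(q_c) / len(q_c) - eps)
    return max(0.0, nu - lr_nu * grad_nu)
```

With this sign convention, a batch whose mean safety value is above $\varepsilon$ increases $\nu$, which in turn weights the cost term more heavily in the next policy step.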

IV-C Meta SAC-Lag

Following the conventional notation in the context of gradient-based hyperparameter optimization [29], we split the parameters into inner and outer parameters. Rather than a one-shot optimization as in Eq. 7, we propose a sequential updating approach. We define and update the inner parameters as:

$$
\theta^{\prime}_{\mathrm{inner}}=\begin{bmatrix}\nu^{\prime}\\ \phi^{\prime}\end{bmatrix}=\begin{bmatrix}\nu\\ \phi\end{bmatrix}+\begin{bmatrix}-\nabla_{\nu}\mathcal{L}(\pi_{\phi},\nu,\varepsilon,\alpha)\\ \nabla_{\phi}\mathcal{L}(\pi_{\phi},\nu^{\prime},\varepsilon,\alpha)\end{bmatrix}
\tag{12}
$$

Furthermore, in the same sequential manner, we first update $\varepsilon$ and then $\alpha$:

$$
\theta^{\prime}_{\mathrm{outer}}=\begin{bmatrix}\varepsilon^{\prime}\\ \alpha^{\prime}\end{bmatrix}=\begin{bmatrix}\varepsilon\\ \alpha\end{bmatrix}+\begin{bmatrix}\nabla_{\varepsilon}\mathcal{J}_{\varepsilon}(\pi_{\phi^{\prime}})\\ \nabla_{\alpha}\mathcal{J}_{\alpha}(\pi_{\phi^{\prime}},\nu^{\prime},\varepsilon^{\prime})\end{bmatrix}
\tag{13}
$$

where $\mathcal{J}_{\varepsilon}$ and $\mathcal{J}_{\alpha}$ are the objective functions of $\varepsilon$ and $\alpha$, respectively. We design the objective function for $\varepsilon$ solely based on the performance of the resultant policy. The intuition behind this design is that the threshold should be adjusted such that it improves the performance of the agent as a whole. For that purpose, the $\varepsilon$ objective function is proposed as:

$$
\mathcal{J}_{\varepsilon}(\pi_{\phi^{\prime}})=\mathbb{E}_{\substack{s\sim\mathcal{D}\\ a\sim\pi_{\phi^{\prime}}}}\left[\nu^{\prime}_{\mathrm{copy}}Q_{\omega_{c}}(s,a)-Q_{\omega_{r}}(s,a)\right]
\tag{14}
$$
Algorithm 1 Meta SAC-Lag
Initialize policy network $\phi^{0}$, exploration rate $\alpha^{0}$,
    critic networks $\omega^{0}_{r_{1}},\omega^{0}_{r_{2}}$, safety critic networks $\omega^{0}_{c_{1}},\omega^{0}_{c_{2}}$,
    Lagrangian values $\varepsilon^{0},\nu^{0}$,
    learning rates $\beta_{\phi},\beta_{\varepsilon},\beta_{\nu},\beta_{\alpha}$
Create transition buffer $\mathcal{D}$, safety buffer $\mathcal{D}_{\mathrm{s}}$, and initial state buffer $\mathcal{D}_{0}$
Randomly sample initial states $s_{0}\sim\rho_{0}$ and fill $\mathcal{D}_{0}$
for $e=1,\ldots$ do
    Reset environment $s_{0}\sim\rho_{0}=env.reset()$
    for $t=0,\ldots,T-1$ do
        Sample action $a_{t}\sim\pi_{\phi}$
        $s_{t+1},r_{t},c_{t}\leftarrow env.step(a_{t})$
        if $c_{t}==1$ then
            $\mathcal{D}_{\mathrm{s}}\leftarrow\mathcal{D}_{\mathrm{s}}\cup(s_{t},a_{t},c_{t},s_{t+1})$
        else
            $\mathcal{D}\leftarrow\mathcal{D}\cup(s_{t},a_{t},r_{t},c_{t},s_{t+1})$
        Train $\omega_{c_{1}},\omega_{c_{2}}$ on $\mathcal{D}\cup\mathcal{D}_{\mathrm{s}}$ (Eq. 2)
        Sample a batch of transitions $\mathcal{B}=\{(s,a,r,c,s^{\prime})\}\in\mathcal{D}$
        Train $\omega_{r_{1}},\omega_{r_{2}}$ using $\mathcal{B}$ (Eq. 4)
        $\nu^{\prime}\leftarrow\nu-\beta_{\nu}\nabla_{\nu}\mathcal{L}(\pi_{\phi},\nu,\varepsilon,\alpha)$ using $\mathcal{B}$ (Eq. 11)
        $\phi^{\prime}\leftarrow\phi+\beta_{\phi}\nabla_{\phi}\mathcal{L}(\pi_{\phi},\nu^{\prime},\varepsilon,\alpha)$ using $\mathcal{B}$ (Eq. 11)
        Resample $\mathcal{B}^{\prime}=\{s\in\mathcal{D}\}$
        $\varepsilon^{\prime}\leftarrow\varepsilon+\beta_{\varepsilon}\nabla_{\varepsilon}\mathcal{J}_{\varepsilon}(\pi_{\phi^{\prime}})$ using $\mathcal{B}^{\prime}$ (Eq. 14)
        $\alpha^{\prime}\leftarrow\alpha+\beta_{\alpha}\nabla_{\alpha}\mathcal{J}_{\alpha}(\pi_{\phi^{\prime}},\nu^{\prime},\varepsilon^{\prime})$ using $\mathcal{D}_{0}$ (Eq. 15)
        $\nu\leftarrow\nu^{\prime},\ \phi\leftarrow\phi^{\prime},\ \varepsilon\leftarrow\varepsilon^{\prime},\ \alpha\leftarrow\alpha^{\prime}$
        if $c_{t}==1$ then break
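The buffer routing in Algorithm 1 can be sketched with plain Python lists. The `Buffers` class and its method names are hypothetical, and uniform sampling is an assumption; this is an illustration of the routing logic only, not the authors' implementation.

```python
import random


class Buffers:
    """Hypothetical container for the three buffers named in Algorithm 1."""

    def __init__(self):
        self.main = []      # D: transitions without a constraint violation
        self.safety = []    # D_s: transitions that ended in a violation
        self.initial = []   # D_0: sampled initial states

    def store(self, s, a, r, c, s_next):
        # Algorithm 1 routes each transition by its cost signal c_t.
        if c == 1:
            self.safety.append((s, a, c, s_next))
        else:
            self.main.append((s, a, r, c, s_next))

    def safety_critic_data(self):
        # The safety critics are trained on D ∪ D_s; only (s, a, c, s') is kept.
        from_main = [(s, a, c, s2) for (s, a, _r, c, s2) in self.main]
        return from_main + self.safety

    def sample_main(self, batch_size, rng=random):
        # Batch B (or the resampled B') drawn from D; uniform sampling is assumed.
        return rng.sample(self.main, min(batch_size, len(self.main)))
```

Calling `sample_main` twice per step gives the two independent batches used for the inner and outer updates.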

The objective function $\mathcal{J}_{\varepsilon}$ is consistent with the objective function of the policy. This is evident by comparing Eq. 14 with the gradient w.r.t. the policy parameters $\phi$ in Eq. 11. The objective function for $\varepsilon$ is designed to minimize the policy objective. This design stems from the idea that $\varepsilon$ aims to capture the worst-case performance of the policy $\pi_{\phi^{\prime}}$. Hence, by being optimized in this way, the safety region of the policy can be correctly adjusted to reflect that. It is important to note that we use $\nu^{\prime}_{\mathrm{copy}}$ to indicate that we merely use the value of $\nu^{\prime}$ in the objective function and do not include its gradient w.r.t. $\varepsilon$. We observed better performance with this gradient detachment in our early experiments, possibly because the detachment keeps the bias in $\nu^{\prime}$ from being injected into the optimization process. Furthermore, to optimize the exploration value $\alpha$, [9] used $Q_{\omega_{r}}$ as the objective function to change the value based on the performance of the policy. Therefore, in order to make the exploration rate of Meta SAC-Lag safety compliant, we propose the objective function of $\alpha$ as:

$$
\mathcal{J}_{\alpha}(\pi_{\phi^{\prime}},\nu^{\prime},\varepsilon^{\prime})=\max_{0<\alpha\leq 1}\mathbb{E}_{\substack{s_{0}\sim\rho_{0}\\ a\sim\pi_{\phi^{\prime}}^{\mathrm{det}}}}\left[Q_{\omega_{r}}(s_{0},a)-\nu^{\prime}\left(Q_{\omega_{c}}(s_{0},a)-\varepsilon^{\prime}\right)\right]
\tag{15}
$$

where $\pi_{\phi^{\prime}}^{\mathrm{det}}$ indicates the deterministic action output by the policy. In other words, we use the expectation of the Lagrangian formulation evaluated at the initial states encountered by the agent. To give a better understanding of the gradient relations, an illustration of the optimization process of Meta SAC-Lag is depicted in Fig. 2.
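To make the gradient path concrete, note that $\varepsilon$ does not appear explicitly in Eq. 14. One reading of Eqs. 12 to 14 is therefore that its metagradient flows through the updated parameters: $\nu^{\prime}$ depends on $\varepsilon$, and $\phi^{\prime}$ depends on $\nu^{\prime}$. The toy sketch below checks this chain rule numerically. It assumes a scalar deterministic policy $a=\phi$, hand-picked quadratic critics, no entropy term, and no clamp on $\nu$. None of these choices come from the paper, and all function names are hypothetical.

```python
# Toy critics (assumptions for illustration only).
def q_r(a):
    return -(a - 2.0) ** 2      # reward critic, maximal at a = 2

def dq_r(a):
    return -2.0 * (a - 2.0)

def q_c(a):
    return a ** 2               # safety critic, cost grows with |a|

def dq_c(a):
    return 2.0 * a


def inner_update(phi, nu, eps, lr_nu, lr_phi):
    """Sequential inner step in the spirit of Eq. 12: nu first, then phi with nu'."""
    nu_new = nu + lr_nu * (q_c(phi) - eps)                    # descent on L w.r.t. nu
    phi_new = phi + lr_phi * (dq_r(phi) - nu_new * dq_c(phi)) # ascent on L w.r.t. phi
    return nu_new, phi_new


def j_eps(phi_new, nu_copy):
    """Toy version of Eq. 14; nu_copy is treated as a constant (detached)."""
    return nu_copy * q_c(phi_new) - q_r(phi_new)


def meta_grad_eps(phi, nu, eps, lr_nu, lr_phi):
    """Analytic dJ_eps/d eps via the chain eps -> nu' -> phi' -> J_eps."""
    nu_new, phi_new = inner_update(phi, nu, eps, lr_nu, lr_phi)
    dnu_deps = -lr_nu                                 # d nu' / d eps
    dphi_deps = lr_phi * (-dq_c(phi)) * dnu_deps      # d phi' / d eps
    dj_dphi = nu_new * dq_c(phi_new) - dq_r(phi_new)  # nu' held fixed (copy)
    return dj_dphi * dphi_deps


def finite_diff_eps(phi, nu, eps, lr_nu, lr_phi, h=1e-4):
    """Central-difference check with the detached nu' frozen at its value for eps."""
    nu_copy, _ = inner_update(phi, nu, eps, lr_nu, lr_phi)
    f = lambda e: j_eps(inner_update(phi, nu, e, lr_nu, lr_phi)[1], nu_copy)
    return (f(eps + h) - f(eps - h)) / (2.0 * h)
```

In this toy setting the analytic and finite-difference gradients agree, and the threshold step of Eq. 13 would then be `eps + lr_eps * meta_grad_eps(...)` for some assumed learning rate `lr_eps`.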

IV-D Implementation Details

The learning process of Meta SAC-Lag is presented in Algorithm 1. The proposed algorithm utilizes three replay buffers for training. The main replay buffer $\mathcal{D}$ stores all the transitions that occurred while interacting with the environment, the safety replay buffer $\mathcal{D}_{\mathrm{s}}$ stores all the transitions that have led to a constraint violation, and $\mathcal{D}_{0}$ builds an approximation of $\rho_{0}$ by generating samples from the distribution. We use a sampled batch $\mathcal{B}\subset\mathcal{D}$ to train the critic networks and the inner parameters $\nu$ and $\phi$. Following that, as discussed in Section IV-A, we use the resampled $\mathcal{B}^{\prime}\subset\mathcal{D}$ and $\mathcal{D}_{0}$ to train the meta-parameters $\varepsilon$ and $\alpha$, respectively. The resampling process is analogous to the meta-testing process and is used to reduce bias in the training of the outer parameters [28, 29]. Moreover, following the original architecture [7], Meta SAC-Lag uses two critic networks and two safety critic networks to prevent the overestimation of the value functions. To this end, the target values in Eq. 5 are calculated as $\min\{Q_{\bar{\omega}_{r_{1}}},Q_{\bar{\omega}_{r_{2}}}\}$ and $\max\{Q_{\bar{\omega}_{c_{1}}},Q_{\bar{\omega}_{c_{2}}}\}$, respectively. The $\bar{\omega}$ notation indicates the target networks, which are copies of the main networks updated with a time delay. Proposed in [30], the target networks aim to increase the stability of the training process and are updated using Polyak averaging: $\bar{\omega}=\tau\omega+(1-\tau)\bar{\omega}$. The hyperparameter $\tau\in(0,1)$ typically has a value near zero.
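The two mechanisms described above can be sketched with plain Python floats and lists standing in for network outputs and weights. The helper names are hypothetical, and the full target of Eq. 5 is not reproduced here; only the min/max combination of the twin target critics and the Polyak step are shown.

```python
def reward_target_value(q_r1_bar, q_r2_bar):
    """Pessimistic reward estimate: the smaller of the two target critics."""
    return min(q_r1_bar, q_r2_bar)


def safety_target_value(q_c1_bar, q_c2_bar):
    """Conservative cost estimate: the larger of the two target safety critics."""
    return max(q_c1_bar, q_c2_bar)


def polyak_update(target_weights, main_weights, tau):
    """Return tau * w + (1 - tau) * w_bar for each (target, main) weight pair."""
    return [
        tau * w + (1.0 - tau) * w_bar
        for w_bar, w in zip(target_weights, main_weights)
    ]
```

Taking the minimum for the reward critics and the maximum for the safety critics errs in the cautious direction for both signals: returns are not overstated and costs are not understated.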

Finally, in contrast to the original SAC, we use RMSProp [31] instead of Adam when computing the higher-order gradients of the parameters $\nu$, $\phi$, and $\varepsilon$, since backpropagating through RMSProp appears to be more numerically stable [9].

V Experiments

Figure 3: Performance of Meta SAC-Lag compared with the baseline algorithms. (Top row): Reward performance during the learning process (higher values are better). (Middle row): The value of the exploration hyperparameter ($\alpha$). (Bottom row): Episodic policy safety performance of the algorithms during the learning process (lower values are better). The dashed lines illustrate the constraint threshold value ($\varepsilon$).

In this section, we evaluate the performance of Meta SAC-Lag. Specifically, our aim is to study two questions:

  • How much does the added autonomy affect the performance of the algorithm compared to the baseline methods?

  • How capable is Meta SAC-Lag of learning optimal performance in a real-world setup while avoiding actions that might catastrophically damage the system?

V-A Test Benchmarks and Baselines

In order to study how the proposed algorithm performs in safety-critical robotic scenarios, we use five simulated robotic environments grouped into three themes:

  • Locomotion: In this theme, the purpose of control is to move the robotic system in the forward direction. The safety constraints are violated whenever the controller’s actions make the system exceed its limits, e.g., the velocity is higher than a certain threshold or the robot falls to the ground. For that purpose, we use the MuJoCo-based [32] Humanoid-Velocity environment from the Safety Gymnasium codebase [33]. Note that the safety-related reward shaping of this environment is removed to give a clearer picture of the safety performance of the algorithms.

  • Obstacle Avoidance: In many real-world robotic applications, there are mobile robots with manipulation capabilities. An important constraint of these systems is achieving their goal while avoiding certain regions in their surroundings. We adopt the Isaac Gym-based FreightFrankaCloseDrawer task [34]. In this setup, the robot attempts to get near a drawer and close it while avoiding a red region. In addition, we use the Car-Circle2 task [33], where the objective is to steer a car in a circular motion while avoiding collision with two walls.

  • Manipulation: Another important area of safety-concerned robotic applications is manipulation. For that purpose, we use two embodied scenarios. For the robotic manipulation task, we use the Push Topple [35, 36] environment, where the robotic arm must relocate a box without toppling it. Furthermore, in the dexterous manipulation scenario, we adopt the Egg Manipulate task, where the agent must rotate an egg to a specific orientation without dropping it or exerting a force of more than 20 N. For both tasks, we use the Gymnasium Robotics codebase [37].

It is important to note that in the training process, we treat the constraints as hard constraints and terminate the episode whenever a violation occurs in the system. Furthermore, three baseline algorithms and one variant of Meta SAC-Lag are chosen to compare and study the performance of Meta SAC-Lag:

  • SACv2-Lag: The basic form of Meta SAC-Lag which uses Eq. 11 to optimize the policy and the Lagrangian multiplier with a fixed safety threshold.

  • Reward Constrained Policy Optimization (RCPO-SACv2): Optimizes the policy using the $Q$-function formulated as $\mathbb{E}_{\pi}[\hat{Q}(s,a)=Q_{r}(s,a)-\nu Q_{c}(s,a)]$. The dual variable $\nu$ is also updated using Eq. 11.

  • RCPO-MetaSAC: To show the effectiveness of our safe exploration technique, we use the $\hat{Q}(s,a)$ formulation in RCPO and optimize $\alpha$ using the approach proposed in [9].

  • Meta SAC-Lag $\mathcal{J}_{nl}$: Inspired by [38], we experiment with a nonlinear objective function for $\varepsilon$ specified as:

    $$
    \mathcal{J}_{\varepsilon}^{nl}(\pi_{\phi^{\prime}})=\mathbb{E}_{\substack{s\sim\mathcal{D}\\ a\sim\pi_{\phi^{\prime}}}}\left[\begin{cases}Q_{\omega_{r}}(s,a)\,Q_{\omega_{c}}(s,a)&\mathrm{if}~Q_{\omega_{r}}(s,a)<0\\ Q_{\omega_{r}}(s,a)\,(1-Q_{\omega_{c}}(s,a))&\mathrm{otherwise}\end{cases}\right]
    \tag{16}
    $$

    Essentially, $\mathcal{J}_{\varepsilon}^{nl}$ can have the advantage of no reliance on external parameter values, as opposed to Eq. 14 which uses $\nu^{\prime}$ in the objective formulation.
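For a single sample, the piecewise term inside the expectation of Eq. 16 can be written as below. This is an illustrative sketch with scalar inputs; the function names are hypothetical, and a plain average over a list stands in for the expectation over the batch.

```python
def j_nl_sample(q_r, q_c):
    """Per-sample term of the nonlinear threshold objective (Eq. 16)."""
    if q_r < 0:
        # Negative return estimate: scale it by the safety estimate.
        return q_r * q_c
    # Non-negative return estimate: discount it by the safety estimate.
    return q_r * (1.0 - q_c)


def j_nl_batch(q_r_values, q_c_values):
    """Batch average of the per-sample terms."""
    terms = [j_nl_sample(qr, qc) for qr, qc in zip(q_r_values, q_c_values)]
    return sum(terms) / len(terms)
```

Unlike Eq. 14, neither function takes $\nu^{\prime}$ as an argument, which is the property the text highlights.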

To have a fair comparison, we tune the values of $\varepsilon$ and $\nu$ for SACv2-Lag and RCPO-SACv2. Also, we use the values of RCPO-SACv2 for RCPO-MetaSAC. The values are outlined in Table I. For the value of $\alpha$, SACv2 constrains the policy entropy as $\mathbb{E}_{\substack{s\sim\mathcal{D}\\ a\sim\pi}}[-\log(\pi(s_{t},a_{t}))]\geq\mathcal{H}$ and defines the $\alpha$ loss as $\min_{\alpha>0}\mathcal{L}(\alpha)=\mathbb{E}_{\substack{s\sim\mathcal{D}\\ a\sim\pi}}[\alpha(\log(\pi(s_{t},a_{t}))+\mathcal{H})]$. The authors propose $\mathcal{H}=-\mathrm{dim}(\mathcal{A})$ as their target entropy. Furthermore, since two important hyperparameters of Meta SAC-Lag are tuned automatically, we simply set $\varepsilon=1$ and $\alpha=1$ as their initial values. Moreover, due to their similar training pipelines, we use the initial values of $\nu$ for SACv2-Lag in Table I for Meta SAC-Lag. We also set $\gamma_{r}=0.99$ and $\gamma_{c}=0.6$ for all the tasks. The results indicate the mean and variance of the performance of the algorithms across multiple independent runs.

TABLE I: Hyperparameter values ($\varepsilon$ and $\nu$) of the comparison methods

Environment | $\varepsilon$ | $\nu$ (Meta SAC-Lag, SACv2-Lag) | $\nu$ (RCPO-SACv2, RCPO-MetaSAC)
Humanoid-Velocity | 0.4 | 10 | 10
Franka DrawerClose | 0.6 | 10 | 10
Car-Circle2 | 0.5 | 100 | 1
Fetch PushTopple | 0.5 | 1000 | 10
Egg Manipulate | 0.5 | 100 | 1

V-B Simulation Results

The simulation results are depicted in Fig. 3. We also report the return and the policy's episodic violation rate in Table II. The violation rate is calculated as the average number of failures over a specific window of episodes. The results indicate not only that Meta SAC-Lag provides automated tuning of the safety-related hyperparameters, but also that the convergence process of the policy incurs lower constraint violations and yields higher or comparable returns. Furthermore, the update profile of $\alpha$ shows that as training goes on, in most cases, Meta SAC-Lag updates $\alpha$ to values lower than SACv2. This indicates that as the policy converges to a near-safe optimal solution, $\alpha$ is rapidly decreased to favor exploitation and prevent further constraint violations. Moreover, we can observe similar $\alpha$ profiles in Meta SAC-Lag and RCPO-SACv2, which can be attributed to $\alpha$ being optimized using similar objective functions. In addition, the optimization process of $\varepsilon$ shows a generally fast convergence. The fast convergence of $\varepsilon$ provides the advantage of stable optimization, as the other values can be updated based on the value to which $\varepsilon$ has converged. Finally, regarding the comparison between Eq. 14 and Eq. 16, we observe better performance of Eq. 14 in both return and safety in four of the five tasks of Table II, the exception being Car-Circle2, where Eq. 16 is marginally ahead. In summary, the optimization outcomes of Meta SAC-Lag demonstrate that the algorithm performs well across a range of embodied control tasks, learning effective solutions while demanding minimal hyperparameter tuning.
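The violation rate described above can be sketched as a running average over a sliding window. This is only an illustration: it assumes each episode is summarized by a single failure flag, and the function name and the window length are hypothetical choices, not values taken from the paper.

```python
from collections import deque


def windowed_violation_rate(episode_failed, window):
    """Running fraction of failed episodes over the last `window` episodes.

    episode_failed: iterable of booleans, True if the episode violated a constraint
                    (assumed per-episode failure flag).
    window: number of most recent episodes to average over (hypothetical choice).
    """
    if window <= 0:
        raise ValueError("window must be positive")
    recent = deque(maxlen=window)  # the oldest flag is dropped once the window is full
    rates = []
    for failed in episode_failed:
        recent.append(1 if failed else 0)
        rates.append(sum(recent) / len(recent))
    return rates


# Example: six episodes, averaged over the three most recent ones.
example_rates = windowed_violation_rate([True, True, False, False, False, True], window=3)
```

Until the window is full, the sketch averages over the episodes seen so far; other conventions for the warm-up period are equally plausible.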

TABLE II: Relative Performance of the Algorithms During the Learning Process
(The best performance is shown in bold)
Task | SACv2-Lag $J_r$ | SACv2-Lag $J_c$ | RCPO-SACv2 $J_r$ | RCPO-SACv2 $J_c$ | RCPO-MetaSAC $J_r$ | RCPO-MetaSAC $J_c$ | MetaSAC-Lag $J_r$ | MetaSAC-Lag $J_c$ | MetaSAC-Lag $\mathcal{J}_{\mathrm{nl}}$ $J_r$ | MetaSAC-Lag $\mathcal{J}_{\mathrm{nl}}$ $J_c$
Humanoid-Velocity | 956.11 | 0.42 | 690.66 | 0.57 | 435.93 | 0.91 | 812.90 | 0.49 | 712.72 | 0.67
Franka DrawerClose | 2.19 | 0.69 | -67.20 | 0.89 | -29.85 | 0.85 | 63.13 | 0.20 | 54.02 | 0.30
Car-Circle2 | 11.65 | 0.10 | 12.67 | 0.32 | 12.36 | 0.24 | 13.79 | 0.31 | 13.95 | 0.29
Fetch-Topple | -6.6 | 0.66 | 6.55 | 0.06 | 7.6 | 0.007 | 7.49 | 0.004 | 4.98 | 0.17
Egg Manipulate | 128.66 | 0.21 | 121.53 | 0.23 | 124.99 | 0.17 | 115.44 | 0.11 | 109.95 | 0.14

V-C Real-World Deployment

Deployability can be regarded as one of the most important obstacles in using RL for learning to control real-world systems [39]. Unsafe actions, if chosen repeatedly, might lead the system to states that damage it catastrophically. Conventional safe RL algorithms are themselves difficult to deploy, since they require intensive hyperparameter tuning. In line with our purpose of assessing the deployability of a safe RL method, we propose a simple, yet important, safe RL testbench. This task, which we call Pour Coffee, consists of moving a coffee-filled mug from a home position to a specific location and pouring the coffee into another cup. The task is executed using a Kinova Gen3 robot, and its digital twin is created in the PyBullet simulation environment [40]. We define the state space $\mathcal{S}=\{X_{cup}\cup O_{cup}\cup\dot{X}_{cup}\cup X_{goal}\cup O_{goal}\}$, where $X=\{x,y,z\}$ and $O=\{\psi,\theta,\phi\}$ refer to the Cartesian position and the Euler angles in the Tait-Bryan ZYX intrinsic convention, respectively.
Furthermore, the action of the agent maps to the velocity of the end-effector: $\mathcal{A}=\{\dot{x}_{cup},\dot{y}_{cup},\dot{z}_{cup},\dot{\phi}_{cup}\}$. Moreover, we hierarchically define the reward function for reaching and pouring the coffee based on the Euclidean distance between the cup and the goal, $d=\|X_{cup}-X_{goal}\|_2$:

$$r(s,a,s')=\begin{cases}r_1\cdot d+r_2\cdot\|\ddot{X}_{cup}\|+r_3\cdot\mathds{1}[\mathrm{spillage}] & \mathrm{if}~d>d_{\mathrm{thresh}}\\ -|\phi_{cup}-\phi_{goal}|+10 & \mathrm{otherwise}\end{cases}\tag{17}$$

where $r_1=-2$, $r_2=-0.05$, $r_3=-1$, and $d_{\mathrm{thresh}}=5~\mathrm{cm}$. Furthermore, the system violates the safety constraints whenever self-collision or collision with the environment objects occurs. Additionally, we can define another constraint as spilling the coffee. As will be shown, this constraint forces the policy to be less jerky and aims to minimize the acceleration. The advantage of this approach, in contrast to similar environments [41], is that it eliminates the need to engineer the reward function to minimize the jerk and acceleration of the robot.
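The piecewise reward of Eq. 17 can be sketched in a few lines. The coefficients and the 5 cm threshold are the ones quoted in the text; the function and argument names are illustrative, and the sketch assumes that positions are expressed in metres and that the spillage flag is supplied by the environment.

```python
import math

# Coefficients and threshold quoted in the text for Eq. 17.
R1, R2, R3 = -2.0, -0.05, -1.0
D_THRESH = 0.05  # 5 cm, assuming positions are expressed in metres


def pour_coffee_reward(x_cup, x_goal, accel_cup, phi_cup, phi_goal, spilled):
    """Piecewise reward of Eq. 17 (argument names are illustrative).

    x_cup, x_goal: Cartesian positions (x, y, z) of the cup and the goal.
    accel_cup: Cartesian acceleration of the cup.
    phi_cup, phi_goal: current and target value of the angle phi.
    spilled: True if coffee was spilled at this step (assumed environment signal).
    """
    d = math.dist(x_cup, x_goal)  # Euclidean distance between cup and goal
    if d > D_THRESH:
        # Reaching phase: penalise distance, acceleration magnitude and spillage.
        accel_norm = math.hypot(*accel_cup)
        return R1 * d + R2 * accel_norm + R3 * (1.0 if spilled else 0.0)
    # Pouring phase: penalise the angle error, offset by the constant bonus of 10.
    return -abs(phi_cup - phi_goal) + 10.0
```

The constant offset in the second branch makes every pouring-phase step more rewarding than any reaching-phase step as long as the angle error stays below 10, which is one way to read the "hierarchical" structure of the reward.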

TABLE III: Pour Coffee Reward-Constraint Settings

Experiment Setting | Reward: Distance ($r_1$) | Reward: Acceleration ($r_2$) | Reward: Penalty ($r_3$) | Violation: Collision | Violation: Spillage
Simulation #1
Simulation #2
Simulation #3
Simulation #4
Real

We conduct experiments with Meta SAC-Lag in four reward and constraint settings. The experiments aim to study whether formulating the problem sub-objectives can be more practical by defining them as constraints rather than shaping the reward explicitly. In the presented task, coffee spillage provides an implicit sub-objective that can be explicitly modeled as the sub-objective of minimizing the jerk and acceleration of the end-effector during the execution of the task. As shown in Table III, three experiments (Simulation #2, #3, #4) utilize different reward shaping schemes along with various constraint definitions. Moreover, we trained Meta SAC-Lag without engineered reward shaping (Simulation #1) both in the simulation environment and the real-world Kinova Gen3 setup. In order to make comparisons and evaluate the Sim2Real capability, the simulation-trained models were deployed on the robot using the checkpoints saved during the learning process.

The evaluation results are depicted in Fig. 4. The results illustrate that, by providing a denser reward signal, explicit reward shaping can increase the success rate. However, using the spillage constraint helps the algorithm be even more effort-compliant, resulting in lower jerk and comparable acceleration. In other words, while succeeding in the task is the most important metric, in a real-world scenario it can be reasonable to sacrifice some performance to lower the effort of the system and satisfy other safety concerns. In addition, regarding the comparison between Sim2Real and Real deployment of the proposed algorithm, we can observe that while both setups behave similarly, the real-world deployment is slightly hindered by the system's physical limitations, such as sensor noise, control saturation, and system fatigue. Despite this, the algorithm trained on the real-world setup without an engineered reward function achieves results comparable to the models trained in simulation.

Figure 4: Deployment results of Meta SAC-Lag on the real-world setup. (a) and (b) represent the jerk and acceleration of the end effector during the training process. (c) shows the final success rate of the algorithms.

VI Conclusions

The paper focused on the problem of automatic hyperparameter tuning in Lagrangian safe RL methods. A novel model-free architecture called Meta SAC-Lag was proposed, which addresses two inherent problems: safe exploration and constraint bound tuning. Through the use of metagradient optimization, the algorithm is capable of adjusting the safety-related hyperparameters with minimal initial tuning. Furthermore, we studied the performance of our algorithm in five simulated embodied applications with the themes of locomotion, obstacle avoidance, robotic manipulation, and dexterous manipulation. We observed that the synergy created between the parameters and the hyperparameters results in comparable or better performance of the policy in terms of reward or safety. Additionally, we conducted an experiment in a real-world setup involving a practical coffee-pouring robotic environment without any explicit safety-related reward shaping. We deployed the algorithm on the Kinova Gen3 robot and showed that the proposed algorithm can be helpful for real-world safety-sensitive applications by reducing the reliance on heuristic implementation of safety. We also observed that formulating safety solely as a constraint violation, rather than engineering the reward function, results in lower levels of effort at the cost of a diminished success rate. This trade-off can be especially favorable in real-world setups where safety violations are costly. Specifically, in our experiment the proposed algorithm learned a policy in the real-world setup while adhering to the collision constraints and minimizing the effort imposed on the robot.

References

  • [1] A. Perera and P. Kamalaruban, “Applications of reinforcement learning in energy systems,” Renewable and Sustainable Energy Reviews, vol. 137, p. 110618, 2021. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1364032120309023
  • [2] K. Shao, Z. Tang, Y. Zhu, N. Li, and D. Zhao, “A survey of deep reinforcement learning in video games,” arXiv preprint arXiv:1912.10944, 2019.
  • [3] O. M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, et al., “Learning dexterous in-hand manipulation,” The International Journal of Robotics Research, vol. 39, no. 1, pp. 3–20, 2020.
  • [4] G. Dulac-Arnold, D. Mankowitz, and T. Hester, “Challenges of real-world reinforcement learning,” arXiv preprint arXiv:1904.12901, 2019.
  • [5] S. Gu, L. Yang, Y. Du, G. Chen, F. Walter, J. Wang, Y. Yang, and A. Knoll, “A review of safe reinforcement learning: Methods, theory and applications,” arXiv preprint arXiv:2205.10330, 2022.
  • [6] L. Brunke, M. Greeff, A. W. Hall, Z. Yuan, S. Zhou, J. Panerati, and A. P. Schoellig, “Safe learning in robotics: From learning-based control to safe reinforcement learning,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 5, pp. 411–444, 2022.
  • [7] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” arXiv preprint arXiv:1801.01290, 2018.
  • [8] T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel, and S. Levine, “Soft actor-critic algorithms and applications,” arXiv preprint arXiv:1812.05905, 2019.
  • [9] Y. Wang and T. Ni, “Meta-sac: Auto-tune the entropy temperature of soft actor-critic via metagradient,” arXiv preprint arXiv:2007.01932, 2020.
  • [10] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction.   MIT press, 2018.
  • [11] E. Altman, Constrained Markov decision processes.   Routledge, 2021.
  • [12] D. Bertsekas, Dynamic programming and optimal control: Volume I.   Athena scientific, 2012, vol. 4.
  • [13] Y. Shen, M. J. Tobia, T. Sommer, and K. Obermayer, “Risk-sensitive reinforcement learning,” Neural Computation, vol. 26, no. 7, pp. 1298–1328, 2014.
  • [14] A. Stooke, J. Achiam, and P. Abbeel, “Responsive safety in reinforcement learning by pid lagrangian methods,” in International Conference on Machine Learning.   PMLR, 2020, pp. 9133–9143.
  • [15] C. Tessler, D. J. Mankowitz, and S. Mannor, “Reward constrained policy optimization,” arXiv preprint arXiv:1805.11074, 2018.
  • [16] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” in International conference on machine learning.   PMLR, 2017, pp. 1126–1135.
  • [17] Z. Xu, H. P. van Hasselt, and D. Silver, “Meta-gradient reinforcement learning,” Advances in neural information processing systems, vol. 31, 2018.
  • [18] T. Haarnoja, S. Ha, A. Zhou, J. Tan, G. Tucker, and S. Levine, “Learning to walk via deep reinforcement learning,” arXiv preprint arXiv:1812.11103, 2018.
  • [19] V. Veeriah, T. Zahavy, M. Hessel, Z. Xu, J. Oh, I. Kemaev, H. P. van Hasselt, D. Silver, and S. Singh, “Discovery of options via meta-learned subgoals,” Advances in Neural Information Processing Systems, vol. 34, pp. 29 861–29 873, 2021.
  • [20] T. Zahavy, Z. Xu, V. Veeriah, M. Hessel, J. Oh, H. P. van Hasselt, D. Silver, and S. Singh, “A self-tuning actor-critic algorithm,” Advances in neural information processing systems, vol. 33, pp. 20 913–20 924, 2020.
  • [21] D. A. Calian, D. J. Mankowitz, T. Zahavy, Z. Xu, J. Oh, N. Levine, and T. Mann, “Balancing constraints and rewards with meta-gradient d4pg,” arXiv preprint arXiv:2010.06324, 2020.
  • [22] B. Thananjeyan, A. Balakrishna, S. Nair, M. Luo, K. Srinivasan, M. Hwang, J. E. Gonzalez, J. Ibarz, C. Finn, and K. Goldberg, “Recovery rl: Safe reinforcement learning with learned recovery zones,” IEEE Robotics and Automation Letters, vol. 6, no. 3, pp. 4915–4922, 2021.
  • [23] R. Koppejan and S. Whiteson, “Neuroevolutionary reinforcement learning for generalized control of simulated helicopters,” Evolutionary intelligence, vol. 4, pp. 219–241, 2011.
  • [24] G. Thomas, Y. Luo, and T. Ma, “Safe reinforcement learning by imagining the near future,” Advances in Neural Information Processing Systems, vol. 34, pp. 13 859–13 869, 2021.
  • [25] T. M. Moldovan and P. Abbeel, “Safe exploration in markov decision processes,” arXiv preprint arXiv:1205.4810, 2012.
  • [26] F. Berkenkamp, M. Turchetta, A. Schoellig, and A. Krause, “Safe model-based reinforcement learning with stability guarantees,” Advances in neural information processing systems, vol. 30, 2017.
  • [27] Y. Chow, O. Nachum, E. Duenez-Guzman, and M. Ghavamzadeh, “A lyapunov-based approach to safe reinforcement learning,” Advances in neural information processing systems, vol. 31, 2018.
  • [28] A. Beirami, M. Razaviyayn, S. Shahrampour, and V. Tarokh, “On optimal generalizability in parametric learning,” Advances in Neural Information Processing Systems, vol. 30, 2017.
  • [29] L. Franceschi, P. Frasconi, S. Salzo, R. Grazzi, and M. Pontil, “Bilevel programming for hyperparameter optimization and meta-learning,” in International conference on machine learning.   PMLR, 2018, pp. 1568–1577.
  • [30] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
  • [31] S. Ruder, “An overview of gradient descent optimization algorithms,” arXiv preprint arXiv:1609.04747, 2016.
  • [32] E. Todorov, T. Erez, and Y. Tassa, “Mujoco: A physics engine for model-based control,” in 2012 IEEE/RSJ international conference on intelligent robots and systems.   IEEE, 2012, pp. 5026–5033.
  • [33] J. Ji, B. Zhang, J. Zhou, X. Pan, W. Huang, R. Sun, Y. Geng, Y. Zhong, J. Dai, and Y. Yang, “Safety gymnasium: A unified safe reinforcement learning benchmark,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [34] J. Liang, V. Makoviychuk, A. Handa, N. Chentanez, M. Macklin, and D. Fox, “Gpu-accelerated robotic simulation for distributed reinforcement learning,” in Conference on Robot Learning.   PMLR, 2018, pp. 270–282.
  • [35] H. Bharadhwaj, A. Kumar, N. Rhinehart, S. Levine, F. Shkurti, and A. Garg, “Conservative safety critics for exploration,” arXiv preprint arXiv:2010.14497, 2020.
  • [36] H.-L. Hsu, Q. Huang, and S. Ha, “Improving safety in deep reinforcement learning using unsupervised action planning,” in 2022 International Conference on Robotics and Automation (ICRA).   IEEE, 2022, pp. 5567–5573.
  • [37] R. de Lazcano, K. Andreas, J. J. Tai, S. R. Lee, and J. Terry, “Gymnasium robotics,” 2023. [Online]. Available: http://github.com/Farama-Foundation/Gymnasium-Robotics
  • [38] H. Honari, M. G. Tamizi, and H. Najjaran, “Safety optimized reinforcement learning via multi-objective policy optimization,” arXiv preprint arXiv:2402.15197, 2024.
  • [39] A. M. S. Enayati, R. Dershan, Z. Zhang, D. Richert, and H. Najjaran, “Facilitating sim-to-real by intrinsic stochasticity of real-time simulation in reinforcement learning for robot manipulation,” IEEE Transactions on Artificial Intelligence, pp. 1–15, 2023.
  • [40] E. Coumans and Y. Bai, “Pybullet, a python module for physics simulation for games, robotics and machine learning,” http://pybullet.org, 2016–2021.
  • [41] Y. Zhu, J. Wong, A. Mandlekar, R. Martín-Martín, A. Joshi, S. Nasiriany, and Y. Zhu, “robosuite: A modular simulation framework and benchmark for robot learning,” arXiv preprint arXiv:2009.12293, 2020.

Appendix A Theoretical Analysis

In this section, the gradient of each component of the algorithm is derived. While automatic differentiation tools for deep learning such as PyTorch and TensorFlow compute the gradients of this algorithm automatically, the gradient analysis of Meta SAC-Lag may provide useful insight.

Step 1: At the beginning of optimization, the gradient of the Lagrangian multiplier $\nu$ is calculated using the inner loss (for ease of presentation, the expected values and the source of $s$ are dropped in this analysis):

$$\begin{aligned}
\nabla_{\nu}\mathcal{J}_{\nu} &= \nabla_{\nu}\mathcal{L}(\pi_{\phi},\nu,\varepsilon,\alpha)\\
&= \nabla_{\nu}\left[Q_{\omega_r}(s,\pi_{\phi}(s))-\nu(Q_{\omega_c}(s,\pi_{\phi}(s))-\varepsilon)-\alpha\log\pi_{\phi}(a|s)\right]\\
&= -[Q_{\omega_c}(s,\pi_{\phi}(s))-\varepsilon]
\end{aligned}\tag{18}$$

Hence, $\nu$ is updated as:

$$\nu'\leftarrow\nu-\beta_{\nu}\nabla_{\nu}\mathcal{J}_{\nu}=\nu+\beta_{\nu}\left[Q_{\omega_c}(s,\pi_{\phi}(s))-\varepsilon\right]\tag{19}$$
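The multiplier update of Eq. 19 can be sketched as a single scalar step. The derivation drops the expectation, so the sketch below replaces it with a batch mean of cost-critic outputs; that estimator, the function name, and the numbers in the example are assumptions made for illustration only.

```python
def update_multiplier(nu, q_c_values, epsilon, beta_nu):
    """One gradient step on the Lagrangian multiplier, following Eqs. 18 and 19.

    nu: current multiplier value.
    q_c_values: cost-critic outputs Q_c(s, pi(s)) for a sampled batch
                (batch-mean estimate of the dropped expectation, an assumption).
    epsilon: current safety threshold.
    beta_nu: learning rate of the multiplier.
    """
    mean_q_c = sum(q_c_values) / len(q_c_values)
    grad = -(mean_q_c - epsilon)   # Eq. 18: gradient of the inner loss w.r.t. nu
    return nu - beta_nu * grad     # Eq. 19: nu' = nu + beta_nu * (Q_c - epsilon)


# The multiplier grows when the estimated cost exceeds the threshold ...
nu_up = update_multiplier(10.0, [0.8, 0.6], epsilon=0.5, beta_nu=0.1)
# ... and shrinks when the estimated cost is below it.
nu_down = update_multiplier(10.0, [0.2, 0.4], epsilon=0.5, beta_nu=0.1)
```

The sign of the step depends only on whether the estimated cost is above or below $\varepsilon$, which is why the threshold value has such a direct influence on how strongly the constraint is weighted in the actor loss.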

Step 2: Following the Lagrangian multiplier update, the gradient of the Lagrangian loss w.r.t. the actor parameters is calculated as:

$$\begin{aligned}
\nabla_{\phi}\mathcal{J}_{\phi} &= \nabla_{\phi}\mathcal{L}(\pi_{\phi},\nu',\varepsilon,\alpha)\\
&= \nabla_{\phi}\left[Q_{\omega_r}(s,\pi_{\phi}(s))-\nu'(Q_{\omega_c}(s,\pi_{\phi}(s))-\varepsilon)-\alpha\log\pi_{\phi}(a|s)\right]\\
&= \nabla_{\phi}\left[Q_{\omega_r}(s,\pi_{\phi}(s))-\bigl(\nu+\beta_{\nu}(Q_{\omega_c}(s,\pi_{\phi}(s))-\varepsilon)\bigr)(Q_{\omega_c}(s,\pi_{\phi}(s))-\varepsilon)-\alpha\log\pi_{\phi}(a|s)\right]\\
&= \nabla_{\phi}Q_{\omega_r}(s,\pi_{\phi}(s))-\nabla_{\phi}Q_{\omega_c}(s,\pi_{\phi}(s))\left[2\beta_{\nu}Q_{\omega_c}(s,\pi_{\phi}(s))-2\beta_{\nu}\varepsilon+\nu\right]-\alpha\nabla_{\phi}\log\pi_{\phi}(a|s)
\end{aligned}\tag{20}$$

The actor parameters are then updated as:

$$\begin{aligned}
\phi' &\leftarrow \phi+\beta_{\phi}\nabla_{\phi}\mathcal{J}_{\phi}\\
&= \phi+\beta_{\phi}\big[\nabla_{\phi}Q_{\omega_r}(s,\pi_{\phi}(s))-\nabla_{\phi}Q_{\omega_c}(s,\pi_{\phi}(s))\left[2\beta_{\nu}Q_{\omega_c}(s,\pi_{\phi}(s))-2\beta_{\nu}\varepsilon+\nu\right]-\alpha\nabla_{\phi}\log\pi_{\phi}(a|s)\big]
\end{aligned}\tag{21}$$
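As a sanity check on the closed-form actor gradient (the last line of Eq. 20), the sketch below compares it with a central finite difference on a scalar toy problem. Everything about the toy problem is an assumption made for illustration: the scalar parameter, the stand-in functions for the two critics and the log-probability, and the constants are not taken from the paper.

```python
import math

# Arbitrary illustrative constants (not values from the paper).
NU, EPS, ALPHA, BETA_NU = 2.0, 0.5, 0.2, 0.1


def q_r(phi):
    """Toy stand-in for the reward critic Q_r(s, pi_phi(s))."""
    return math.sin(phi)


def q_c(phi):
    """Toy stand-in for the cost critic Q_c(s, pi_phi(s))."""
    return phi ** 2


def log_pi(phi):
    """Toy stand-in for log pi_phi(a|s)."""
    return -0.5 * phi


def lagrangian(phi):
    """Actor objective with the updated multiplier nu' of Eq. 19 substituted in."""
    nu_prime = NU + BETA_NU * (q_c(phi) - EPS)
    return q_r(phi) - nu_prime * (q_c(phi) - EPS) - ALPHA * log_pi(phi)


def analytic_grad(phi):
    """Closed form of the last line of Eq. 20, specialised to the toy functions."""
    dq_r, dq_c, dlog_pi = math.cos(phi), 2.0 * phi, -0.5
    bracket = 2.0 * BETA_NU * q_c(phi) - 2.0 * BETA_NU * EPS + NU
    return dq_r - dq_c * bracket - ALPHA * dlog_pi


def numeric_grad(phi, h=1e-6):
    """Central finite difference of the objective."""
    return (lagrangian(phi + h) - lagrangian(phi - h)) / (2.0 * h)


def actor_step(phi, beta_phi):
    """Eq. 21: ascent step on the actor parameter."""
    return phi + beta_phi * analytic_grad(phi)
```

The factor of 2 inside the bracket appears because $\nu'$ itself depends on $Q_{\omega_c}$, so the product $\nu'(Q_{\omega_c}-\varepsilon)$ is quadratic in $Q_{\omega_c}-\varepsilon$; the finite-difference comparison only agrees with the closed form when that dependence is kept.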

Step 3: The first meta-parameter used in the training pipeline is the safety threshold $\varepsilon$. Assuming the meta-objective formulation of Equation 14, the gradient of $\varepsilon$ can be derived as:

$$\begin{aligned}
\nabla_{\varepsilon}\mathcal{J}_{\varepsilon} &= \nabla_{\varepsilon}\left[\nu'_{\text{copy}}Q_{\omega_c}(s,\pi_{\phi'}(s))-Q_{\omega_r}(s,\pi_{\phi'}(s))\right]\\
&= \nabla_{\varepsilon}{\phi'}^{\mathbf{T}}\nabla_{\phi'}\mathcal{J}_{\varepsilon}+\nabla_{\varepsilon}\nu'\left[\nabla_{\nu'}{\phi'}^{\mathbf{T}}\nabla_{\phi'}\mathcal{J}_{\varepsilon}+\cancelto{0}{\nabla_{\nu'}\mathcal{J}_{\varepsilon}}\right]+\cancelto{0}{\nabla_{\varepsilon}\mathcal{J}_{\varepsilon}}
\end{aligned}\tag{22}$$

Note that the chain of gradients must be causal, i.e., it must follow the order in which the quantities are computed. For that reason, gradients such as $\nabla_{\phi'}\nu'$ would be meaningless and have been omitted. Each component of Equation 22 can be computed as:

$$\nabla_{\varepsilon}\phi' = 2\beta_{\phi}\beta_{\nu}\nabla_{\phi}Q_{\omega_c}(s,\pi_{\phi}(s)) \tag{23}$$

$$\nabla_{\phi'}\mathcal{J}_{\varepsilon} = \nu'\nabla_{\phi'}Q_{\omega_c}(s,\pi_{\phi'}(s)) - \nabla_{\phi'}Q_{\omega_r}(s,\pi_{\phi'}(s)) \tag{24}$$

$$\nabla_{\varepsilon}\nu' = -\beta_{\nu} \tag{25}$$

$$\nabla_{\nu'}\phi' = -\beta_{\phi}\nabla_{\phi}Q_{\omega_c}(s,\pi_{\phi}(s)) \tag{26}$$

Therefore, the final gradient is calculated as:

$$\nabla_{\varepsilon}\mathcal{J}_{\varepsilon} = \left[-3\beta_{\nu}\beta_{\phi}\nabla_{\phi}Q_{\omega_c}(s,\pi_{\phi}(s))\right]^{\mathbf{T}}\left[\nabla_{\phi'}Q_{\omega_r}(s,\pi_{\phi'}(s)) - \nu'\nabla_{\phi'}Q_{\omega_c}(s,\pi_{\phi'}(s))\right] \tag{27}$$
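The step from Equation 22 to Equation 27 can be written out by substituting Equations 23, 25 and 26 into the two surviving terms of Equation 22. The following is a worked restatement of the equations above, where $g_c$ is shorthand introduced here for $\nabla_{\phi}Q_{\omega_c}(s,\pi_{\phi}(s))$:

$$
\begin{aligned}
\nabla_{\varepsilon}\mathcal{J}_{\varepsilon} &= \left[2\beta_{\phi}\beta_{\nu}\,g_c\right]^{\mathbf{T}}\nabla_{\phi'}\mathcal{J}_{\varepsilon} + (-\beta_{\nu})\left[-\beta_{\phi}\,g_c\right]^{\mathbf{T}}\nabla_{\phi'}\mathcal{J}_{\varepsilon} \\
&= \left[3\beta_{\nu}\beta_{\phi}\,g_c\right]^{\mathbf{T}}\nabla_{\phi'}\mathcal{J}_{\varepsilon} \\
&= \left[-3\beta_{\nu}\beta_{\phi}\,g_c\right]^{\mathbf{T}}\left[-\nabla_{\phi'}\mathcal{J}_{\varepsilon}\right]
\end{aligned}
$$

Inserting Equation 24 for $\nabla_{\phi'}\mathcal{J}_{\varepsilon}$ turns the last bracket into $\nabla_{\phi'}Q_{\omega_r}(s,\pi_{\phi'}(s)) - \nu'\nabla_{\phi'}Q_{\omega_c}(s,\pi_{\phi'}(s))$, which is the form stated in Equation 27.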

Hence, the value of $\varepsilon$ is updated as:

$$
\begin{aligned}
\varepsilon' &\leftarrow \varepsilon + \beta_{\varepsilon}\nabla_{\varepsilon}\mathcal{J}_{\varepsilon} \\
&= \varepsilon + \beta_{\varepsilon}\left[-3\beta_{\nu}\beta_{\phi}\nabla_{\phi}Q_{\omega_c}(s,\pi_{\phi}(s))\right]^{\mathbf{T}}\left[\nabla_{\phi'}Q_{\omega_r}(s,\pi_{\phi'}(s)) - \nu'\nabla_{\phi'}Q_{\omega_c}(s,\pi_{\phi'}(s))\right]
\end{aligned}
\tag{28}
$$
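As a minimal sketch of how Equations 27 and 28 could be evaluated once the required gradient vectors are available, the snippet below treats each gradient as a plain list of floats. The function names, argument names and toy numbers are assumptions made for illustration and are not taken from the authors' implementation; in practice the gradients would come from an automatic differentiation library.

```python
def dot(u, v):
    """Inner product of two equal-length vectors given as lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def epsilon_meta_gradient(grad_qc, grad_qr_new, grad_qc_new,
                          nu_new, beta_nu, beta_phi):
    """Equation 27: meta-gradient of J_eps with respect to the threshold eps.

    grad_qc     : gradient of Q_c w.r.t. the old actor parameters phi
    grad_qr_new : gradient of Q_r w.r.t. the updated actor parameters phi'
    grad_qc_new : gradient of Q_c w.r.t. the updated actor parameters phi'
    nu_new      : updated Lagrange multiplier nu'
    """
    # Left factor: scaled safety-critic gradient at the old parameters.
    left = [-3.0 * beta_nu * beta_phi * g for g in grad_qc]
    # Right factor: reward/safety trade-off at the updated parameters.
    right = [gr - nu_new * gc for gr, gc in zip(grad_qr_new, grad_qc_new)]
    return dot(left, right)


def update_epsilon(eps, beta_eps, meta_grad):
    """Equation 28: eps' <- eps + beta_eps * grad_eps J_eps."""
    return eps + beta_eps * meta_grad


# Toy numbers, chosen only to exercise the formulas (hypothetical values).
meta_grad = epsilon_meta_gradient(
    grad_qc=[1.0, 2.0],
    grad_qr_new=[0.5, -1.0],
    grad_qc_new=[1.0, 1.0],
    nu_new=0.5, beta_nu=0.1, beta_phi=0.01,
)
eps_new = update_epsilon(eps=0.2, beta_eps=1.0, meta_grad=meta_grad)
```

With these toy inputs the right factor is `[0.0, -1.5]`, so only the second coordinate of the safety-critic gradient contributes to the step on $\varepsilon$.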

The meta-gradient of $\varepsilon$ in Equation 27 is a product of two factors. The right-hand factor evaluates the performance of the new actor parameters $\phi'$ by measuring the trade-off between safety and optimality, while the left-hand factor is the main driver of the gradient's direction. The left-hand factor captures the policy's degree of unsafety with respect to the safety critic: the gradient of the safety critic with respect to the policy parameters indicates how changes in those parameters affect the measure of safety. The meta-gradient always tries to force the policy to become safer by moving in the negative direction of this gradient. The idea of minimizing the policy objective discussed in Section IV-C is also visible in the gradients of $\varepsilon$, specifically in Equation 24 and the left-hand factor of Equation 27. Without the negative sign, the algorithm would try to match $\varepsilon$ to the unsafety performance of the policy rather than forcing the policy to become safer.

Step 4: The final step of the optimization is to compute the meta-gradient of the temperature $\alpha$, which is calculated as:

$$
\begin{aligned}
\nabla_{\alpha}\mathcal{J}_{\alpha} &= \nabla_{\alpha}\left[Q_{\omega_r}(s,\pi^{det}_{\phi'}(s)) - \nu'\left(Q_{\omega_c}(s,\pi^{det}_{\phi'}(s)) - \varepsilon'\right)\right] \\
&= \cancelto{0}{\nabla_{\alpha}\nu'}\left[\nabla_{\nu'}\mathcal{J}_{\alpha} + \nabla_{\nu'}{\phi'}^{\mathbf{T}}\nabla_{\phi'}\mathcal{J}_{\alpha} + \nabla_{\nu'}\varepsilon'\,\nabla_{\varepsilon'}\mathcal{J}_{\alpha}\right] \\
&\quad + \nabla_{\alpha}{\phi'}^{\mathbf{T}}\left[\nabla_{\phi'}\mathcal{J}_{\alpha} + \nabla_{\phi'}\varepsilon'\,\nabla_{\varepsilon'}\mathcal{J}_{\alpha}\right] + \cancelto{0}{\nabla_{\alpha}\varepsilon'}\,\nabla_{\varepsilon'}\mathcal{J}_{\alpha}
\end{aligned}
\tag{29}
$$

The components of the gradient can then be calculated as:

$$\nabla_{\varepsilon'}\mathcal{J}_{\alpha} = \nu' \tag{30}$$

$$\nabla_{\alpha}\phi' = -\beta_{\phi}\nabla_{\phi}\log\pi_{\phi}(a|s) \tag{31}$$

$$\nabla_{\phi'}\mathcal{J}_{\alpha} = \nabla_{\phi'}Q_{\omega_r}(s,\pi^{det}_{\phi'}(s)) - \nu'\nabla_{\phi'}Q_{\omega_c}(s,\pi^{det}_{\phi'}(s)) \tag{32}$$

$$\nabla_{\phi'}\varepsilon' = \beta_{\varepsilon}\left[-3\beta_{\nu}\beta_{\phi}\nabla_{\phi}Q_{\omega_c}(s,\pi_{\phi}(s))\right]^{\mathbf{T}}\left[H_{\phi'}Q_{\omega_r}(s,\pi_{\phi'}(s)) - \nu' H_{\phi'}Q_{\omega_c}(s,\pi_{\phi'}(s))\right] \tag{33}$$

Therefore, the final gradient of $\alpha$ is obtained as:

$$
\begin{aligned}
\nabla_{\alpha}\mathcal{J}_{\alpha} &= -\beta_{\phi}\nabla_{\phi}\log\pi_{\phi}(a|s)^{\mathbf{T}}\Big[\nabla_{\phi'}Q_{\omega_r}(s,\pi^{det}_{\phi'}(s)) - \nu'\nabla_{\phi'}Q_{\omega_c}(s,\pi^{det}_{\phi'}(s)) \\
&\qquad + \nu'\beta_{\varepsilon}\left[-3\beta_{\nu}\beta_{\phi}\nabla_{\phi}Q_{\omega_c}(s,\pi_{\phi}(s))\right]^{\mathbf{T}}\left[H_{\phi'}Q_{\omega_r}(s,\pi_{\phi'}(s)) - \nu' H_{\phi'}Q_{\omega_c}(s,\pi_{\phi'}(s))\right]\Big] \\
&= -\beta_{\phi}\nabla_{\phi}\log\pi_{\phi}(a|s)^{\mathbf{T}}\left[\nabla_{\phi'}Q_{\omega_r}(s,\pi^{det}_{\phi'}(s)) - \nu'\nabla_{\phi'}Q_{\omega_c}(s,\pi^{det}_{\phi'}(s))\right]
\end{aligned}
\tag{34}
$$

The second term of the gradient, $\nabla_{\phi'}\varepsilon'\,\nabla_{\varepsilon'}\mathcal{J}_{\alpha}$, was dropped because the neural networks used to implement Meta SAC-Lag use the ReLU activation function. An important property of ReLU is that its second-order derivative is zero almost everywhere. Hence, the Hessian matrices in Equation 33 would be equal to zero.
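The ReLU property invoked here can be illustrated with a small numerical check. The sketch below builds a one-hidden-layer ReLU network with hypothetical, hand-picked weights and estimates its second derivative with respect to its scalar input by central finite differences. It only illustrates the piecewise-linear behaviour of ReLU networks in their input; it is not a reproduction of the Hessians $H_{\phi'}$ in Equation 33, and all names and numbers are assumptions.

```python
def relu(x):
    """Rectified linear unit."""
    return max(0.0, x)


def tiny_relu_net(x):
    """One-hidden-layer ReLU network with fixed, hypothetical weights.

    The hidden units switch on or off at x = 1/3, x = 0.5 and x = -4.
    """
    hidden = [relu(1.5 * x - 0.5), relu(-2.0 * x + 1.0), relu(0.5 * x + 2.0)]
    return 2.0 * hidden[0] - 1.0 * hidden[1] + 0.25 * hidden[2]


def second_difference(f, x, h=1e-3):
    """Central finite-difference estimate of the second derivative of f at x."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)


# Away from the kinks the network is locally linear, so the estimate is ~0.
curvature_inside = second_difference(tiny_relu_net, 2.0)

# Exactly at a kink (x = 0.5) the estimate is large in magnitude: the set of
# such points has measure zero, which is what "almost everywhere" refers to.
curvature_at_kink = second_difference(tiny_relu_net, 0.5)
```

The first estimate vanishes up to floating-point rounding, while the second does not, which is the sense in which the second-order derivative is zero almost everywhere rather than everywhere.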

Finally, the update rule for $\alpha$ follows:

$$
\begin{aligned}
\alpha' &\leftarrow \alpha + \beta_{\alpha}\nabla_{\alpha}\mathcal{J}_{\alpha} \\
&= \alpha - \beta_{\alpha}\beta_{\phi}\nabla_{\phi}\log\pi_{\phi}(a|s)^{\mathbf{T}}\left[\nabla_{\phi'}Q_{\omega_r}(s,\pi^{det}_{\phi'}(s)) - \nu'\nabla_{\phi'}Q_{\omega_c}(s,\pi^{det}_{\phi'}(s))\right]
\end{aligned}
\tag{35}
$$
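The simplified temperature update in Equations 34 and 35 can be sketched in the same spirit, again with gradients passed in as plain lists of floats. The function names, argument names and toy values are illustrative assumptions rather than the authors' code, and the Hessian term is omitted exactly as in the last line of Equation 34.

```python
def alpha_meta_gradient(grad_logpi, grad_qr_new, grad_qc_new, nu_new, beta_phi):
    """Equation 34 (simplified form): meta-gradient of J_alpha w.r.t. alpha.

    grad_logpi  : gradient of log pi_phi(a|s) w.r.t. the old parameters phi
    grad_qr_new : gradient of Q_r(s, pi_det(s)) w.r.t. the updated parameters phi'
    grad_qc_new : gradient of Q_c(s, pi_det(s)) w.r.t. the updated parameters phi'
    nu_new      : updated Lagrange multiplier nu'
    """
    # Penalized critic gradient: grad Q_r - nu' * grad Q_c at phi'.
    right = [gr - nu_new * gc for gr, gc in zip(grad_qr_new, grad_qc_new)]
    inner = sum(g * r for g, r in zip(grad_logpi, right))
    return -beta_phi * inner


def update_alpha(alpha, beta_alpha, meta_grad):
    """Equation 35: alpha' <- alpha + beta_alpha * grad_alpha J_alpha."""
    return alpha + beta_alpha * meta_grad


# Toy numbers, chosen only to exercise the formulas (hypothetical values).
alpha_grad = alpha_meta_gradient(
    grad_logpi=[2.0, -1.0],
    grad_qr_new=[0.5, -1.0],
    grad_qc_new=[1.0, 1.0],
    nu_new=0.5, beta_phi=0.01,
)
alpha_new = update_alpha(alpha=0.2, beta_alpha=1.0, meta_grad=alpha_grad)
```

Here the inner product equals 1.5, so the temperature decreases by $\beta_{\alpha}\beta_{\phi}\cdot 1.5$ in this toy case.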

Observing the right-hand factor of $\nabla_{\alpha}\mathcal{J}_{\alpha}$, one can see that it is equivalent to $\nabla_{\phi'}\left[Q_{\omega_r}(s,\pi^{det}_{\phi'}(s)) - \nu' Q_{\omega_c}(s,\pi^{det}_{\phi'}(s))\right]$, which is similar to the critic formulation of RCPO, $\hat{Q}^{\pi}(s,a) = Q^{\pi}(s,a) - \nu Q^{\pi}_{c}(s,a)$. Hence, in essence, the meta-gradient calculations of $\alpha$ in Meta SAC-Lag and RCPO-MetaSAC are the same.
This observation is supported by the similarity of the update profiles of $\alpha$ in the simulation results in Figure 3. The profile is also similar to that of Meta SAC-Lag $\mathcal{J}_{nl}$: because the use of ReLU makes the Hessian matrix zero, the effect of $\mathcal{J}_{\varepsilon}$ on $\alpha$ is eliminated and the update procedure is the same for both.