
Continual Diffuser (CoD): Mastering Continual Offline Reinforcement Learning with Experience Rehearsal

Jifeng Hu1†      Li Shen2†      Sili Huang3†      Zhejian Yang4      Hechang Chen5      Lichao Sun6      Yi Chang7      Dacheng Tao8
† Equal contribution. Correspondence: chenhc@jlu.edu.cn and yichang@jlu.edu.cn.
1,3,4,5,7School of Artificial Intelligence, Jilin University, Changchun, China
2JD Explore Academy, Beijing, China
6Lehigh University, Bethlehem, Pennsylvania, USA
8College of Computing and Data Science, NTU, Singapore
https://github.com/JF-Hu/Continual_Diffuser
Abstract

Artificial neural networks, especially recent diffusion-based models, have shown remarkable superiority in gaming, control, and QA systems, where the training tasks' datasets are usually static. However, in real-world applications, such as robotic control with reinforcement learning (RL), the tasks change, and new tasks arrive in sequential order. This situation poses the new challenge of the plasticity-stability trade-off: training an agent that can adapt to task changes while retaining acquired knowledge. In view of this, we propose a rehearsal-based continual diffusion model, called Continual Diffuser (CoD), to endow the diffuser with the capabilities of quick adaptation (plasticity) and lasting retention (stability). Specifically, we first construct an offline benchmark that contains 90 tasks from multiple domains. Then, we train CoD on each task with sequential modeling and conditional generation for making decisions. Next, we preserve a small portion of previous datasets as the rehearsal buffer and replay it to retain the acquired knowledge. Extensive experiments on a series of tasks show that CoD can achieve a promising plasticity-stability trade-off and outperform existing diffusion-based methods and other representative baselines on most tasks. Source code is available at https://github.com/JF-Hu/Continual_Diffuser.

1 Introduction

Artificial neural networks, such as diffusion models, have achieved impressive successes in decision-making scenarios, e.g., game playing Mnih et al. (2015), robotics manipulation Kaufmann et al. (2023), and autonomous driving Almalioglu et al. (2022). However, in most situations, adapting to changing data becomes difficult when we adopt the general strategy of learning during the training phase and evaluating with fixed neural network weights Dohare et al. (2024). Changes are prevalent in real-world applications when performing learning in games, logistics, and control systems. A crucial step towards achieving Artificial General Intelligence (AGI) is mastering the human-like ability to continuously learn and quickly adapt to new scenarios over a lifetime Berariu et al. (2021). Unfortunately, simply continuing to train current methods on new scenarios as new datasets arrive is usually ineffective. These methods face a dilemma between storing historical knowledge (stability) and adapting to environmental changes (plasticity) Zeng et al. (2019).

Recently, diffusion probabilistic models (DPMs) have emerged as an expressive structure for tackling complex decision-making tasks such as robotics manipulation by formulating deep reinforcement learning (RL) as a sequential modeling problem He et al. (2023); Wang et al. (2022a); Kang et al. (2023). Although recent DPMs have shown impressive performance in robotics manipulation, they usually focus on a narrow setting, where the environment is well-defined and remains static all the time Ajay et al. (2022); Yang et al. (2023), as noted above. In contrast, in real-world applications, the environment changes dynamically in chronological order, forming a continuous stream of data encompassing various tasks. In this situation, it is challenging for agents to retain historical knowledge (stability) and to adapt quickly to environmental changes (plasticity) based on already acquired knowledge Anand and Precup (2023); Yue et al. (2024). Thus, a natural question arises:

Can we exploit the high expressiveness of DPMs and concurrently endow them with better plasticity and stability in continual offline RL?

Facing the long-standing challenge of the plasticity-stability dilemma in continual RL, current studies of continual learning can be roughly classified into three categories. Structure-based methods Wang et al. (2022b); Zhang et al. (2023a); Smith et al. (2023a); Mendez and Eaton (2022); Mallya and Lazebnik (2018) propose the use of a base model for pretraining and sub-modules for each task so as to store separated knowledge and reduce catastrophic forgetting. Regularization-based methods Zhang et al. (2023b, 2022); Kirkpatrick et al. (2017); Kessler et al. (2020); Nguyen et al. (2017) propose using an auxiliary regularization loss such as an $L_2$ penalty, KL divergence, or weight importance to constrain policy optimization and avoid catastrophic forgetting during training. Rehearsal-based methods Smith et al. (2023b); Peng et al. (2023); Huang et al. (2024); Rolnick et al. (2019); Chaudhry et al. (2018) are considered simple yet effective in alleviating catastrophic forgetting, as rehearsal mimics the memory consolidation mechanism of hippocampal replay in biological systems. There are many strategies to perform rehearsal. For instance, a typical method is gradient projection Chaudhry et al. (2018), which constrains the gradients of the new-data loss to stay as close as possible to those of previous tasks, limiting the performance decrease on those tasks.
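As a rough illustration of gradient projection, the sketch below follows the A-GEM rule of Chaudhry et al. (2018): when the new-task gradient conflicts with the gradient computed on rehearsed samples (negative inner product), the conflicting component is removed. The helper names and the toy two-dimensional gradients are hypothetical and chosen only for illustration; this is not the implementation used in our experiments.

```python
from typing import List


def dot(u: List[float], v: List[float]) -> float:
    """Inner product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def agem_project(grad: List[float], ref_grad: List[float]) -> List[float]:
    """Project `grad` so that it no longer conflicts with `ref_grad`.

    If the inner product is non-negative the gradient is returned unchanged;
    otherwise the component along `ref_grad` is subtracted.
    """
    inner = dot(grad, ref_grad)
    if inner >= 0.0:
        return list(grad)
    scale = inner / dot(ref_grad, ref_grad)
    return [g - scale * r for g, r in zip(grad, ref_grad)]


# Toy example (hypothetical values): the new-task gradient points against
# the rehearsal gradient along the second coordinate.
new_task_grad = [1.0, -2.0]
rehearsal_grad = [0.0, 1.0]
projected = agem_project(new_task_grad, rehearsal_grad)
```

After projection, the update is orthogonal to the rehearsal gradient, so to first order it does not increase the loss on the rehearsed samples.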

Although these methods are effective for continual learning, they yield limited improvement in continual offline RL because of extra challenges such as distribution shift and uncertain value estimation. Recently, diffusion-based methods, such as DD and Diffuser Ajay et al. (2022); Kang et al. (2023); Wang et al. (2022a); Janner et al. (2022), propose to resolve these two extra challenges through sequential modeling and have shown impressive results in many offline RL tasks. However, they concentrate solely on training a diffuser that can only solve one task, thus showing limitations in real-world applications where training datasets or tasks usually arrive sequentially. Though recent works, such as MTDIFF He et al. (2023), consider diffusers as planners or data generators for multi-task RL, the problem setting of their work is orthogonal to ours.

In this view, we take one step forward to investigate diffusers with sequentially arriving datasets and find that recent state-of-the-art diffusion-based models suffer from catastrophic forgetting when new tasks arrive sequentially (see Section 3.1 for more details). To address this issue, we propose "Continual Diffuser" (CoD), which endows the diffuser with the capabilities of quickly adapting to new tasks (plasticity) while retaining historical knowledge (stability) through experience rehearsal. First of all, to take advantage of the potential of diffusion models, we construct an offline RL benchmark that consists of 90 tasks from multiple domains, such as Continual World (CW) and Gym-MuJoCo. We will release these continual datasets to all researchers, and we will actively maintain the benchmark and progressively incorporate more datasets into it. Based on the benchmark, we train our method on each task with sequential modeling of trajectories and make decisions with conditional generation in evaluation. Then, a small portion of each previous task dataset is reserved as the rehearsal buffer and replayed periodically to our model. Finally, extensive experiments on a series of tasks show that CoD can achieve a promising plasticity-stability trade-off and outperform existing diffusion-based models and other representative continual RL methods on most tasks. In summary, our contributions are threefold:

  • We construct a continual offline RL benchmark that contains 90 tasks in the current stage, and we will actively incorporate more datasets for all researchers.

  • We investigate the possibility of integrating experience rehearsal and diffuser, then propose the Continual Diffuser (CoD) to balance plasticity and stability.

  • Extensive experiments on a series of tasks show that CoD can achieve a promising plasticity-stability trade-off and outperform existing baselines on most tasks.

Refer to caption
Figure 1: The framework of CoD. Unfolding the training process with time, our model slides on the sample chain that is constructed by sampling from the current and rehearsal buffers. For each task $i$, CoD replays a small portion of samples from previous tasks to reduce catastrophic forgetting and generate a solution that can solve all previous tasks. The detailed structure of CoD is shown in the lower right corner.

2 Results

In this section, we introduce the environmental settings and evaluation metrics in Sections 2.1 and 2.2. Then, in Sections 2.3 and 2.4, we introduce a novel continual offline RL benchmark, including the task description and the corresponding dataset statistics, and introduce various baselines. Finally, in Sections 2.5 and 2.6, we report the comparison results, ablation study, and parameter sensitivity analysis.

2.1 Environmental Settings

Following the same setting as prior works Zhang et al. (2023c); Yang et al. (2023), we conduct thorough experiments on Continual World and Gym-MuJoCo benchmarks. In Continual World, we adopt the task setting of CW10 and CW20 where CW20 means two concatenated CW10. All CW tasks are version v1. Besides, we also select Ant-dir for evaluation, which includes 40 tasks, and we arbitrarily select four tasks (tasks-10-15-19-25) for training and evaluation. See Appendix 5.5 for more details.

2.2 Evaluation Metrics

To compare performance on a series of tasks, we follow previous studies Wołczyk et al. (2021); Anand and Precup (2023) and adopt the total average success rate $P(\rho)$ (higher is better), forward transfer $FT$ (higher is better), forgetting $F$ (lower is better), and the total performance $P+FT-F$ (higher is better) as evaluation metrics. Let $p_i(\rho)\in[0,1]$ denote the average success rate on task $i$ at gradient update step $\rho$, and suppose each task is trained for $\Delta$ gradient steps. The total average success rate is then $P(\rho)=\frac{1}{I}\sum_{i=1}^{I}p_i(\rho)$. The forward transfer $FT$ denotes the normalized area between the training curve and the reference curve. Note that $FT_i<1$ and it might also be negative.
Mathematically, $FT=\frac{1}{I}\sum_{i}FT_i=\frac{1}{I}\sum_{i}\frac{AUC_i-AUC_{ref,i}}{1-AUC_{ref,i}}$, where we set $AUC_{ref,i}=0.5$ and $AUC_i=(p_i(i\cdot\Delta)+p_i((i+1)\cdot\Delta))/2$ for simplicity.
The forgetting $F_i$ is defined as the performance decrease on task $i$ between the end of its own training and the end of the whole task sequence, $F_i=p_i((i+1)\cdot\Delta)-p_i(I\cdot\Delta)$, thus $F=\frac{1}{I}\sum_{i}^{I}F_i$.
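The metrics above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not our evaluation code: the function name and the three input lists are hypothetical, $P$ is taken as the mean over tasks (as the name "average success rate" suggests), and forgetting is read as the drop in task $i$'s own success rate between the end of its training and the end of the sequence. The success rates in the example are made-up numbers.

```python
from typing import Dict, List


def continual_metrics(
    p_start: List[float],   # p_i(i * Delta): success on task i when its training starts
    p_end: List[float],     # p_i((i + 1) * Delta): success on task i when its training ends
    p_final: List[float],   # p_i(I * Delta): success on task i after the whole sequence
    auc_ref: float = 0.5,   # reference AUC, set to 0.5 as in the text
) -> Dict[str, float]:
    """Compute P, FT, F and the total score P + FT - F."""
    n_tasks = len(p_final)
    # Average success rate over all tasks at the end of training.
    avg_success = sum(p_final) / n_tasks
    # Forward transfer: AUC approximated by the mean of the two curve endpoints.
    ft_per_task = [
        ((s + e) / 2.0 - auc_ref) / (1.0 - auc_ref)
        for s, e in zip(p_start, p_end)
    ]
    forward_transfer = sum(ft_per_task) / n_tasks
    # Forgetting: performance drop on each task after later tasks are learned.
    forgetting = sum(e - f for e, f in zip(p_end, p_final)) / n_tasks
    return {
        "P": avg_success,
        "FT": forward_transfer,
        "F": forgetting,
        "total": avg_success + forward_transfer - forgetting,
    }


# Made-up success rates for a three-task sequence.
metrics = continual_metrics(
    p_start=[0.0, 0.2, 0.4],
    p_end=[1.0, 1.0, 0.8],
    p_final=[0.8, 1.0, 0.8],
)
```

In this toy sequence only the first task loses performance after later tasks are learned, so the forgetting term is small compared with the average success rate.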

Refer to caption
Figure 2: The comparison of CoD and other diffusion-based models under the continual offline RL setting where “w/o” denotes “without”, Multitask CoD is a multitask variant of CoD, CoD-LoRA uses low-rank adaptation during training, and CoD-RCR denotes that we train CoD with return condition. IL-rehearsal denotes imitation learning with rehearsal. We train these methods on four arbitrarily selected tasks (tasks 10-15-19-25). The results show that previous diffusion-based methods (“DD-w/o rehearsal”, “Diffuser-w/o rehearsal”, and “MTDIFF”) exhibit severe forgetting when the datasets arrive sequentially.

2.3 Novel Benchmark for Continual Offline RL

To take advantage of the potential of diffusion models, we propose a benchmark for continual offline RL (CORL), comprising datasets from 90 tasks, including 88 Continual World tasks and 2 Gym-MuJoCo tasks Wołczyk et al. (2021); Todorov et al. (2012). For the Gym-MuJoCo domain, there are 42 environmental variants, which are constructed by altering the agent goals. In order to collect the offline datasets, we trained Soft Actor-Critic (SAC) on each task for approximately 1M time steps Haarnoja et al. (2018).

Continual World Wołczyk et al. (2021) is a popular testbed that is built on Meta-World Yu et al. (2020) and consists of realistic robotic manipulation tasks such as Pushing, Reaching, and Door Opening. CW is convenient for training and evaluating the abilities of forward transfer and forgetting because the state and action spaces are the same across all tasks. We first define class-incremental CORL (CICORL), task-incremental CORL (TICORL), and domain-incremental CORL (DICORL) Van de Ven et al. (2022). In RL, we call the CL setting CICORL when the CL tasks are constructed in the same environment with different goals, such as different directions or velocities. We call the CL setting TICORL when the CL tasks are different environments that share the same purpose. For instance, CL settings with the purpose of pushing blocks (e.g., the "push wall" and "push mug" tasks in Continual World) in different robotic control tasks form a TICORL setup. Finally, we can use tasks with different purposes, such as pushing, pulling, turning, and pressing blocks, to construct DICORL. For example, CW10 and CW20 form mixed TICORL and DICORL setups because the task sequence contains multiple purposes. Additionally, Gym-MuJoCo's 42 environmental variants facilitate constructing a CICORL setup. Researchers can use these datasets in any sequence or length for CL tasks to test the plasticity-stability trade-off of their methods. We also provide datasets of multiple qualities, such as 'medium' and 'expert', in our benchmark. We list the statistics of our benchmark in Tables 11 and 12 and Figures 8 and 9, where the episodic time limit is set to 200, and the evaluation time step is set to 1M and 0.4M for datasets of different qualities.

Ant-dir is an 8-joint ant environment. The different tasks are defined according to the target direction, where the agent should maximize its return by moving at maximal speed in the pre-defined direction. As shown in Table 13, there are 40 tasks (distinguished by "task id") with different uniformly sampled goal directions in Ant-dir. For each task, the dataset contains approximately 200k transitions, where the observation and action dimensions are 27 and 8, respectively. The Ant-dir datasets have been used by many researchers Xu et al. (2022); Li et al. (2020); Rakelly et al. (2019), so we incorporate them into our benchmark. Moreover, we report the mean return of each sub-task in Table 13 and Figure 10. As for Cheetah-dir, it only contains two tasks that represent forward and backward goal directions. Compared with Ant-dir, Cheetah-dir has lower-dimensional observation and action spaces.

2.4 Baselines

We compare our method (CoD) with various representative baselines, encompassing structure-based, regularization-based, and rehearsal-based methods. In structure-based methods, we select LoRA Li et al. (2023), PackNet Mallya and Lazebnik (2018), and Multitask. For regularization-based methods, we select L2, EWC Kirkpatrick et al. (2017), MAS Aljundi et al. (2018), and VCL Nguyen et al. (2017) for evaluation. Rehearsal-based baselines include t-DGR Yue et al. (2024), DGR Shin et al. (2017), CRIL Gao et al. (2021), A-GEM Chaudhry et al. (2018), and IL Ho and Ermon (2016). Besides, we also include several diffusion-based methods Janner et al. (2022); Ajay et al. (2022) and Multitask methods, such as MTDIFF He et al. (2023) for the evaluation.

Table 1: The performance comparison on the Continual World and Ant-dir datasets. We compare our method (CoD) with baselines trained with the offline pattern as well as the online pattern. We report the average success rate (P), forward transfer (FT), and backward forgetting (F) of our method and several representative baselines in Continual World tasks (shown in parts (a) and (b)). Moreover, we conduct experiments on CW4 ("hammer-v1", "push-wall-v1", "faucet-close-v1", "push-back-v1") with mixed-quality datasets and show the results in part (c). For Ant-dir datasets shown in part (d), we report the comparison results with diffusion-based, non-diffusion-based, and multitask methods.

Continual World 10 (CW10) and Continual World 20 (CW20)
| Train mode | Model | CW10 P ↑ | CW10 FT ↑ | CW10 F ↓ | CW10 P+FT-F ↑ | CW20 P ↑ | CW20 FT ↑ | CW20 F ↓ | CW20 P+FT-F ↑ |
|---|---|---|---|---|---|---|---|---|---|
| (a) offline baselines | EWC | 0.20±0.16 | 0.30±0.21 | 0.80±0.16 | -0.30 | 0.30±0.21 | 0.30±0.21 | 0.70±0.21 | -0.10 |
| | Finetune | 0.20±0.16 | 0.10±0.09 | 0.80±0.16 | -0.50 | 0.10±0.09 | 0.10±0.09 | 0.80±0.16 | -0.60 |
| | DGR | 0.30±0.21 | 0.90±0.09 | 0.70±0.21 | 0.50 | 0.50±0.25 | 0.90±0.09 | 0.50±0.25 | 0.90 |
| | t-DGR | 0.40±0.24 | 0.70±0.21 | 0.60±0.24 | 0.50 | 0.50±0.25 | 0.90±0.09 | 0.50±0.25 | 0.90 |
| | CRIL | 0.70±0.21 | 0.80±0.16 | 0.20±0.16 | 1.30 | 0.70±0.21 | 0.90±0.09 | 0.00±0.00 | 1.60 |
| | Multitask | 1.00±0.00 | 0.90±0.09 | 0.00±0.00 | 1.90 | 1.00±0.00 | 0.90±0.09 | 0.00±0.00 | 1.90 |
| | CoD | 0.98±0.01 | 0.89±0.09 | -0.01±0.001 | 1.88 | 0.98±0.01 | 0.89±0.09 | 0.00±0.00 | 1.87 |
| (b) online baselines | A-GEM | 0.02±0.01 | -0.76±0.02 | 0.22±0.02 | -0.96 | 0.17±0.10 | 0.17±0.11 | 0.64±0.12 | -0.30 |
| | PackNet | 0.05±0.01 | -0.60±0.01 | 0.35±0.02 | -0.90 | 0.14±0.09 | -0.34±0.19 | 0.53±0.20 | -0.73 |
| | VCL | 0.10±0.03 | -0.81±0.05 | -0.02±0.008 | -0.69 | 0.18±0.13 | -0.62±0.14 | 0.02±0.06 | -0.46 |
| | MAS | 0.23±0.05 | -0.63±0.08 | -0.05±0.03 | -0.35 | 0.41±0.15 | -0.12±0.18 | -0.01±0.01 | 0.30 |
| | EWC | 0.30±0.03 | -0.36±0.05 | 0.02±0.02 | -0.08 | 0.56±0.20 | 0.13±0.28 | 0.01±0.02 | 0.68 |
| | L2 | 0.21±0.03 | -0.58±0.07 | 0.02±0.02 | -0.39 | 0.51±0.09 | 0.12±0.19 | 0.10±0.03 | 0.53 |
| | Multitask | 1.00±0.00 | 0.90±0.09 | 0.00±0.00 | 1.90 | 1.00±0.00 | 0.90±0.09 | 0.00±0.00 | 1.90 |

Continual World 4
| Train mode | Model | P ↑ | FT ↑ | F ↓ | P+FT-F ↑ |
|---|---|---|---|---|---|
| (c) offline baselines | IL-rehearsal | 0.57±0.19 | 0.12±0.54 | 0.18±0.09 | 0.51 |
| | CoD | 0.85±0.02 | 0.60±0.13 | 0.05±0.01 | 1.40 |

Ant-dir
| Train mode | Model | Mean return |
|---|---|---|
| (d) offline baselines | CoD | 478.19±15.84 |
| | Multitask CoD | 485.15±5.86 |
| | IL-rehearsal | 402.53±17.67 |
| | CoD-LoRA | 296.03±11.95 |
| | Diffuser-w/o rehearsal | 270.44±5.54 |
| | CoD-RCR | 140.44±32.11 |
| | MTDIFF | 84.01±41.10 |
| | DD-w/o rehearsal | -11.15±45.27 |
Refer to caption
Figure 3: The comparison of our method CoD and other baselines on CW20 where these baselines are trained with online and offline datasets and are trained with 500k gradient steps on each task. In the above figure, we use the dash-dotted lines to indicate the task changes. Part (a) shows the comparison where the baselines are trained in online mode, while in part (b), the baselines are trained with offline datasets.

2.5 Main Results

Ant-dir Results.    To show the effectiveness of our method in reducing catastrophic forgetting, we compare our method with other diffusion-based methods on the Ant-dir tasks ordered by 10-15-19-25. As shown in Table 1 (d) and Figure 2, the results illustrate: 1) Directly applying previous diffusion-based methods to continual offline RL leads to severe catastrophic forgetting, as the scores of Diffuser-w/o rehearsal and DD-w/o rehearsal are far behind CoD. 2) Extending the technique of LoRA to the diffusion model may not always work. The reason is that the model's parameter count is small, which inspires us to construct diffuser foundation models in future work. 3) Rehearsal brings significant improvements to the diffuser, as CoD approaches the score of Multitask CoD.

Online Continual World Results.    Offline datasets prohibit further exploration in the environments, which may hinder the capability of some baselines that are designed for online training. We therefore conduct CW10 and CW20 experiments with these methods under the online continual RL setting. Similarly, we limit the interaction to 500k time steps for each task and report the comparison results in Figure 3 (a) and Table 1 (b). The results show that our method (CoD) surpasses other baselines by a large margin, which illustrates its superior performance in balancing plasticity and stability. Besides, comparing the performances in Figure 3 (a) and (b) confirms that some methods, such as EWC, are indeed more suitable for online training. Additionally, we also report the comparison under the mixed-quality datasets CL setting in Table 1 (c). Please refer to Appendix 5.4 for the comparison of model plasticity and generation acceleration details.

Offline Continual World Results.    This section presents the comparison between CoD and six representative continual RL methods on the CW10 and CW20 benchmarks. In order to show the capabilities of plasticity (quick adaptation to unseen tasks) and stability (lasting retention of previous knowledge), we keep the size of training samples, number of gradient updates, and computation constant. Figure 3 (b) and Table 1 (a) summarize the results of the CW10 and CW20 tasks. We observe that our method can quickly master these manipulation tasks and remember the acquired knowledge when new tasks arrive, while the baselines (except for Multitask) struggle between plasticity and stability, as their performance fluctuates among tasks. Moreover, after 5M gradient steps, our method still remembers how to solve the tasks it has learned, which indicates small forgetting. The table also shows that, though some baselines exhibit high forward transfer, their average success rate is lower than that of our method, and they forget knowledge quickly.

Refer to caption
Figure 4: The parameter sensitivity analysis of rehearsal frequency $\upsilon$ and rehearsal sample diversity $\xi$ on CW20.

2.6 Ablation Study

To show the effectiveness of experience rehearsal, we conduct an ablation study of CoD in CW and Ant-dir tasks. We compare our method with and without experience rehearsal and find that experience rehearsal indeed brings a significant performance gain. For example, CoD achieves a 76.82% performance gain compared with CoD-w/o rehearsal. On CW20, incorporating experience rehearsal raises the mean success rate of CoD from 20% to 98%. Refer to Table 5 for more results.

Sensitivity of Key Hyperparameters.    In the experiments, we introduce two key hyperparameters: the rehearsal frequency ($\upsilon$) and the rehearsal sample diversity ($\xi$). A larger $\upsilon$ aggravates catastrophic forgetting because the model only accesses previous samples after a longer stretch of training. A large value of $\xi$ improves the performance but increases the storage burden, while a small value is more cost-efficient for longer CL task sequences but makes it harder to maintain the performance. We analyze the sensitivity to these hyperparameters on the CW and Ant-dir environments, and the results are shown in Figure 4 and Figure 5. According to the results, our method still reaches good performance as $\upsilon$ and $\xi$ vary.
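To make the roles of the two hyperparameters concrete, the following sketch shows one possible way to wire them into a training loop. It is an illustrative assumption rather than the exact procedure of CoD: here $\xi$ is read as the fraction of each previous dataset kept in the rehearsal buffer, $\upsilon$ as the interval (in gradient steps) between rehearsal batches, and the function names and toy dataset are hypothetical.

```python
import random
from typing import List, Tuple

Transition = Tuple[str, int]  # placeholder for a stored sample (task name, index)


def build_rehearsal_buffer(
    dataset: List[Transition], xi: float, rng: random.Random
) -> List[Transition]:
    """Keep a random fraction `xi` of a finished task's dataset (at least one sample)."""
    keep = max(1, round(len(dataset) * xi))
    return rng.sample(dataset, keep)


def batch_source(step: int, upsilon: int) -> str:
    """Assumed schedule: every `upsilon`-th gradient step replays rehearsal data."""
    return "rehearsal" if step % upsilon == 0 else "current"


rng = random.Random(0)
# Toy stand-in for the dataset of an already learned task.
previous_task_data = [("task0", k) for k in range(100)]
buffer = build_rehearsal_buffer(previous_task_data, xi=0.10, rng=rng)

# With upsilon = 2, rehearsal and current-task batches alternate.
schedule = [batch_source(step, upsilon=2) for step in range(1, 7)]
```

Under this reading, increasing `upsilon` spaces rehearsal batches further apart, and increasing `xi` enlarges the stored buffer, which matches the trade-offs described above.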

3 Discussion

3.1 Catastrophic Forgetting of Diffuser

Previous diffusion-based methods Ajay et al. (2022); Janner et al. (2022); Yue et al. (2024), such as DD and Diffuser, are usually proposed to solve a single task, which is not in line with real-world situations where the task changes dynamically. Thus, it is meaningful but challenging to train a diffuser that can adapt to new tasks (plasticity) while retaining historical knowledge (stability). When we directly extend the original diffusion-based methods to continual offline RL, we can expect severe catastrophic forgetting because there is no mechanism to retain preceding knowledge. To show the catastrophic forgetting, we compare our method and the representative diffusion-based methods on Ant-dir in Figure 2, where we arbitrarily select four tasks, task-10, task-15, task-19, and task-25, to form the CL setting. Diffuser-w/o rehearsal and DD-w/o rehearsal represent the original Diffuser and DD methods, respectively. Multitask CoD and MTDIFF are the multitask baselines, which can access all training datasets at any time step, and CoD-RCR denotes that we use the return condition for decision generation during the training stage. CoD-LoRA denotes that we train CoD with the technique of low-rank adaptation. IL-rehearsal is imitation learning with rehearsal. The results show that previous diffusion-based methods exhibit severe catastrophic forgetting when the datasets arrive sequentially, and at the same time, the good performance of CoD illustrates that experience rehearsal is effective in reducing catastrophic forgetting.

3.2 Reducing Catastrophic Forgetting with Experience Rehearsal

In Section 2.5, we illustrate the effectiveness of experience rehearsal through experiments on our proposed offline CL benchmark, which contains 90 tasks for evaluation. From the perspective of the number of CL tasks, we evaluate various settings, such as 4 tasks for Ant-dir, 4 tasks for CW4, 10 tasks for CW10, and 20 tasks for CW20. From the perspective of the classification of traditional CL settings, our experimental settings contain CICORL, TICORL, and DICORL. In the Ant-dir environment, we select the 10-15-19-25 task sequence as the CL setting and compare against other diffusion-based methods. From the results shown in Figure 2, we can see distinct catastrophic forgetting in the recent diffusion-based methods, though they show strong performance in other offline RL tasks Kang et al. (2023); He et al. (2023). To borrow the strong expressiveness of diffusion models in offline RL and equip them with the ability to reduce catastrophic forgetting, we propose to use experience rehearsal to master CORL. The detailed architecture is shown in Figure 1, and we defer the method description to Section 4.3.

Refer to caption
Figure 5: The parameters sensitivity of Ant-dir.

Apart from the Ant-dir environment, we also report the performance on more complex CL tasks, i.e., CW10 and CW20, in Table 1. Considering that most baselines are trained in online mode in their original papers, we first select the online baselines and compare their mean success rate with our method. The results (Table 1 and Figure 3) show that our method (CoD) surpasses other baselines by a large margin, which illustrates its superior performance in balancing plasticity and stability. Besides, we also compare our method with these baselines trained with offline datasets, where the results show that our method can quickly master these manipulation tasks and remember the acquired knowledge when new tasks arrive, while the baselines (except for Multitask) struggle between plasticity and stability, as their performance fluctuates among tasks. When the previous tasks appear once again after 5M training steps, the baselines show different levels of catastrophic forgetting, as their performance on these tasks has decreased. In contrast, our method still remembers how to solve the tasks it learned before, which indicates small forgetting. Moreover, we also conduct mixed-quality dataset experiments to show our method's capability of learning from sub-optimal offline datasets. For more details, please refer to Appendix 5.4.

To investigate the influence of key hyperparameters, we report the performance for different values of the rehearsal frequency ($\upsilon$) and rehearsal sample diversity ($\xi$) in Figure 4 and Figure 5, where a larger $\upsilon$ corresponds to aggravated catastrophic forgetting and a larger value of $\xi$ improves the performance but increases the storage burden. In practice, we find that $\upsilon=2$ and $\xi=10\%$ usually give good performance while adding little computation and memory burden (see Appendix 5.2 for memory and efficiency analysis).

4 Methods

4.1 Continual Offline RL

In this paper, we focus on the task-incremental setting of task-aware continual learning in offline RL, where different tasks arrive successively for training Zhang et al. (2023c); Wang et al. (2023b); Smith et al. (2023a); Schwarz et al. (2018); Abel et al. (2023); Wang et al. (2023c). Each task is defined as a Markov Decision Process (MDP) $\mathcal{M}=\langle\mathcal{S},\mathcal{A},\mathcal{P},\mathcal{R},\gamma\rangle$, where $\mathcal{S}$ and $\mathcal{A}$ represent the state and action spaces, respectively, $\mathcal{P}:\mathcal{S}\times\mathcal{A}\rightarrow\Delta(\mathcal{S})$ denotes the Markovian transition probability, $\mathcal{R}:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow\mathbb{R}$ is the reward function, and $\gamma\in[0,1)$ is the discount factor. To distinguish different tasks, we use the subscript $i$ for task $i$, such as $\mathcal{M}_{i}$, $\mathcal{S}_{i},\mathcal{A}_{i},\mathcal{P}_{i},\mathcal{R}_{i}$, and $\gamma_{i}$.
At each time step $t$ in task $i$, the agent receives a state $s_{i,t}$ from the environment and produces an action $a_{i,t}$ with a stochastic or deterministic policy $\pi$. Then a reward $r_{i,t}=r(s_{i,t},a_{i,t})$ from the environment serves as feedback on the executed action. Continual offline RL aims to find an optimal policy that maximizes the discounted return $\sum_{i}^{I}\mathbb{E}_{\pi}[\sum_{t=0}^{\infty}\gamma^{t}r(s_{i,t},a_{i,t})]$ Yang et al. (2023); Sun et al. (2023); Wei et al. (2023) on all tasks with the previously collected datasets $\{D_{i}\}_{i\in I}$.
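The objective above can be made concrete with a small sketch. It is only an illustration under stated assumptions: the expectation is replaced by a single finite reward sequence per task, the infinite sum is truncated at the episode length, one shared discount factor is used, and the function names and reward values are hypothetical.

```python
from typing import List, Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return sum_t gamma^t * r_t for one finite reward sequence."""
    total = 0.0
    # Accumulate from the last reward backwards: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def continual_objective(rewards_per_task: List[Sequence[float]], gamma: float) -> float:
    """Sum of per-task discounted returns (one sampled episode per task)."""
    return sum(discounted_return(r, gamma) for r in rewards_per_task)


# Hypothetical reward sequences for two tasks.
rewards_per_task = [[1.0, 1.0, 1.0], [0.0, 2.0]]
total = continual_objective(rewards_per_task, gamma=0.5)
```

With these toy rewards the first task contributes 1 + 0.5 + 0.25 and the second contributes 0 + 0.5 · 2, so the sum is 2.75.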

4.2 Conditional Diffusion Probabilistic Models

In this paper, diffusion-based models are used to model the distribution of trajectories $\tau$, where each trajectory is regarded as a data point. We can then use diffusion models to learn the trajectory distribution $q(\tau)=\int q(\tau^{0:K})d\tau^{1:K}$ with a predefined forward diffusion process $q(\tau^{k}|\tau^{k-1})=\mathcal{N}(\tau^{k};\sqrt{\alpha_{k}}\tau^{k-1},(1-\alpha_{k})\bm{I})$ and the trainable reverse process $p_{\theta}(\tau^{k-1}|\tau^{k})=\mathcal{N}(\tau^{k-1};\mu_{\theta}(\tau^{k},k),\Sigma_{k})$, where $k\in[1,K]$ is the diffusion step, $\sqrt{\alpha_{k}}$ and $\sqrt{1-\alpha_{k}}$ control the drift and diffusion coefficients, $\mu_{\theta}(\tau^{k},k)=\frac{1}{\sqrt{\alpha_{k}}}(\tau^{k}-\frac{\beta_{k}}{\sqrt{1-\bar{\alpha}_{k}}}\epsilon_{\theta}(\tau^{k},k))$, $\Sigma_{k}=\frac{1-\bar{\alpha}_{k-1}}{1-\bar{\alpha}_{k}}\beta_{k}\bm{I}$, $\alpha_{k}+\beta_{k}=1$, and $\bar{\alpha}_{k}=\prod_{j=1}^{k}\alpha_{j}$ as in Ho et al. (2020).
$\epsilon_{\theta}(\tau^{k},k)$ represents the noise prediction model (Sohl-Dickstein et al., 2015). Following Ho et al. (2020), we can train $\epsilon_{\theta}(\tau^{k},k)$ with the simplified objective below

$$\mathcal{L}(\theta)=\mathbb{E}_{k\sim U(1,2,...,K),\epsilon\sim\mathcal{N}(0,\bm{I}),\tau^{0}\sim D}[||\epsilon-\epsilon_{\theta}(\tau^{k},k)||_{2}^{2}],$$

where $k$ is the diffusion time step, $U$ is the uniform distribution, $\epsilon$ is multivariate Gaussian noise, $\tau^{0}=\tau$ is sampled from the replay buffer $D$, and $\theta$ denotes the parameters of the model $\epsilon_{\theta}$.
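As a rough illustration of how the simplified objective could be estimated, the NumPy sketch below draws $k$ uniformly, noises a clean sequence, and compares the true noise with a model prediction. Several pieces are assumptions rather than details from this paper: the linear $\beta$ schedule, the tensor shapes, the placeholder `zero_model` standing in for $\epsilon_{\theta}$, and the use of the standard closed-form marginal $\tau^{k}=\sqrt{\bar{\alpha}_{k}}\tau^{0}+\sqrt{1-\bar{\alpha}_{k}}\epsilon$ from Ho et al. (2020) to obtain $\tau^{k}$ in one shot.

```python
import numpy as np


def make_schedule(K, beta_start=1e-4, beta_end=2e-2):
    """Linear beta schedule (an assumption); returns alpha_k and alpha_bar_k."""
    betas = np.linspace(beta_start, beta_end, K)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)  # alpha_bar_k = prod_{j<=k} alpha_j
    return alphas, alpha_bars


def noise_trajectory(tau0, k, alpha_bars, rng):
    """Sample tau^k from q(tau^k | tau^0) in closed form; k is 1-indexed."""
    eps = rng.standard_normal(tau0.shape)
    a_bar = alpha_bars[k - 1]
    tau_k = np.sqrt(a_bar) * tau0 + np.sqrt(1.0 - a_bar) * eps
    return tau_k, eps


def simplified_loss(eps_model, tau0_batch, alpha_bars, rng):
    """Monte Carlo estimate of E[ ||eps - eps_theta(tau^k, k)||_2^2 ]."""
    K = len(alpha_bars)
    losses = []
    for tau0 in tau0_batch:
        k = int(rng.integers(1, K + 1))  # k ~ U{1, ..., K}; upper bound is exclusive
        tau_k, eps = noise_trajectory(tau0, k, alpha_bars, rng)
        losses.append(np.sum((eps - eps_model(tau_k, k)) ** 2))
    return float(np.mean(losses))


rng = np.random.default_rng(0)
alphas, alpha_bars = make_schedule(K=100)
batch = rng.standard_normal((8, 4, 6))  # 8 hypothetical sequences, 4 steps, 6 dims


def zero_model(tau_k, k):
    """Placeholder for eps_theta: always predicts zero noise."""
    return np.zeros_like(tau_k)


loss = simplified_loss(zero_model, batch, alpha_bars, rng)
```

In an actual training loop the placeholder would be a neural network and the loss would be differentiated with respect to $\theta$; the sketch only shows how the sampled quantities fit together.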

Conditions play a vital role in conditional generation because they make the outputs of diffusion models controllable. Two conditioning methods, classifier-guided and classifier-free, can be used to train diffusion models $p_{\theta}(\tau^{k-1}|\tau^{k},\mathcal{C})$ (Liu et al., 2023). The classifier-guided method trains the unconditional diffusion model and the conditional guide separately and then combines them, i.e., $p_{\theta,\phi}(\tau^{k-1}|\tau^{k},\mathcal{C})\propto p_{\theta}(\tau^{k-1}|\tau^{k})p_{\phi}(\mathcal{C}|\tau^{k})$.
The corresponding sampling process is $p(\tau^{k-1}|\tau^{k},\mathcal{C})=\mathcal{N}(\mu_{\theta}+\Sigma_{k}\cdot\nabla\log p_{\phi}(\mathcal{C}|\tau),\Sigma_{k})$. In contrast, the classifier-free method implicitly builds the correlation between trajectories and conditions in the training phase by learning the unconditional and conditional noise $\epsilon_{\theta}(\tau^{k},\emptyset,k)$ and $\epsilon_{\theta}(\tau^{k},\mathcal{C},k)$, where $\emptyset$ is usually the zero vector (Ajay et al., 2022).
Then the perturbed noise at each diffusion time step is calculated by $\epsilon_{\theta}(\tau^{k},\emptyset,k)+\omega(\epsilon_{\theta}(\tau^{k},\mathcal{C},k)-\epsilon_{\theta}(\tau^{k},\emptyset,k))$, where $\omega$ is the guidance scale. In this paper, we adopt classifier-free guidance due to its simplicity, controllability, and higher performance Ajay et al. (2022).
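The classifier-free combination of the two noise predictions can be sketched in a few lines. This is a minimal illustration, assuming the empty condition $\emptyset$ is a zero vector of the same shape as the condition; `toy_eps_model` is a hypothetical stand-in for the trained network $\epsilon_{\theta}$, and the shapes are arbitrary.

```python
import numpy as np


def guided_noise(eps_model, tau_k, condition, k, omega):
    """Classifier-free guidance: eps_uncond + omega * (eps_cond - eps_uncond)."""
    null_condition = np.zeros_like(condition)  # the empty condition as a zero vector
    eps_uncond = eps_model(tau_k, null_condition, k)
    eps_cond = eps_model(tau_k, condition, k)
    return eps_uncond + omega * (eps_cond - eps_uncond)


def toy_eps_model(tau_k, condition, k):
    """Hypothetical stand-in for eps_theta; not a trained model."""
    return 0.1 * tau_k + condition.sum()


tau_k = np.ones((4, 5))
condition = np.array([1.0, 2.0])
eps_w0 = guided_noise(toy_eps_model, tau_k, condition, k=10, omega=0.0)
eps_w1 = guided_noise(toy_eps_model, tau_k, condition, k=10, omega=1.0)
eps_w2 = guided_noise(toy_eps_model, tau_k, condition, k=10, omega=2.0)
```

Setting $\omega=0$ recovers the unconditional prediction, $\omega=1$ recovers the conditional one, and $\omega>1$ extrapolates past the conditional prediction, which strengthens the influence of the condition.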

4.3 Continual Diffuser

In this section, we introduce the Continual Diffuser (CoD), as shown in Figure 1, which consists of classifier-free task-conditional training, experience rehearsal, and conditional generation for decision-making.

Data Organization.     In RL, we leverage the ability of diffusion models to capture joint distributions in high-dimensional continuous spaces by reformulating the training data from single-step transitions into multi-step sequences. Specifically, we have $I$ tasks, and each task $\mathcal{M}_{i}$ consists of $N$ trajectories $\{\tau_{i}\}_{1}^{N}$, where each $\tau_{i,n}=\{s_{i,t,n},a_{i,t,n}\}$ is split into equal-length sequences of $T_{e}$ time steps, since trajectory lengths may differ across tasks. In the following, we slightly abuse notation and use $\tau_{i}$ to represent a sequence of length $T_{e}$ sampled from task $i$'s dataset $D_{i}$ and $\hat{\tau}_{i}$ to denote a generated sequence.
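One possible way to organize the data as described is a sliding window over each trajectory. The sketch below is an assumption-laden illustration: the text does not state whether windows overlap, so the `stride` parameter, the choice to drop tails shorter than $T_{e}$, and the array layout (time along the first axis, state followed by action along the second) are all hypothetical.

```python
import numpy as np


def split_trajectory(states, actions, T_e, stride=1):
    """Cut one trajectory into windows of exactly T_e steps.

    states: (T, state_dim), actions: (T, action_dim).
    Tails shorter than T_e are dropped (an assumption).
    """
    assert len(states) == len(actions)
    steps = np.concatenate([states, actions], axis=-1)  # (T, state_dim + action_dim)
    last_start = len(steps) - T_e
    return [steps[t:t + T_e] for t in range(0, last_start + 1, stride)]


# Hypothetical trajectory with 10 steps, 3 state dims, and 2 action dims.
states = np.zeros((10, 3))
actions = np.ones((10, 2))
windows = split_trajectory(states, actions, T_e=4, stride=2)
```

With 10 steps, $T_{e}=4$, and a stride of 2, the windows start at steps 0, 2, 4, and 6, which yields four sequences of shape (4, 5). A trajectory shorter than $T_{e}$ yields no sequences.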

Task Condition.    To distinguish different tasks, we propose to use environment-related information as the task condition. For example, in the Ant-dir environment, the agent's goal is to maximize its speed in a pre-defined direction, which is given as the goal of the specific task. We therefore use this information as the condition $\mathcal{C}_{task}$ to train our model. At each diffusion step $k$, the task condition $\mathcal{C}_{task}$ passes through a task embedding function to obtain the task embedding, which is fed into the diffusion model jointly with the diffusion time step embedding. Apart from the task conditions that are used implicitly in training, we also need explicit observation conditions. We use the first state $s_{i,t,n}$ of the sampled length-$T_{e}$ sequence $\tau_{i,n}=\{s_{i,t,n},a_{i,t,n},s_{i,t+1,n},a_{i,t+1,n},...,s_{i,t+T_{e}-1,n},a_{i,t+T_{e}-1,n}\}$ as the condition. Then at each diffusion generation step, after we obtain the generated sequence $\{\hat{s}_{i,t,n},\hat{a}_{i,t,n},...,\hat{s}_{i,t+T_{e}-1,n},\hat{a}_{i,t+T_{e}-1,n}\}^{k}$, the first observation $\hat{s}_{i,t,n}$ is directly replaced by $s_{i,t,n}$, i.e., $\hat{\tau}_{i,n}^{k}=\{s_{i,t,n},\hat{a}_{i,t,n},...,\hat{s}_{i,t+T_{e}-1,n},\hat{a}_{i,t+T_{e}-1,n}\}^{k}$.
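The observation conditioning described above amounts to overwriting one slice of the generated sequence after every generation step. A minimal sketch follows, assuming the sequence is stored as an array of shape $(T_{e}, \text{state\_dim}+\text{action\_dim})$ with the state entries first; this layout, the sizes, and the function name are hypothetical.

```python
import numpy as np


def apply_observation_condition(tau_hat, first_state):
    """Overwrite the state part of the first step with the observed state."""
    conditioned = tau_hat.copy()  # leave the input sample untouched
    state_dim = first_state.shape[-1]
    conditioned[0, :state_dim] = first_state
    return conditioned


rng = np.random.default_rng(0)
state_dim, action_dim, T_e = 3, 2, 4  # hypothetical sizes
observed_state = np.array([0.5, -1.0, 2.0])
tau_hat = rng.standard_normal((T_e, state_dim + action_dim))  # stand-in for a generated sample
tau_hat_cond = apply_observation_condition(tau_hat, observed_state)
```

Only the first state is replaced; the first action and all later steps keep their generated values, so the model still has to produce actions consistent with the fixed starting observation.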

Training Objective.     Following previous studies of diffusion models Ho et al. (2020); He et al. (2023), the training and generation for each task $i$ are defined as

$$\mathcal{L}_{i}(\theta)=\mathbb{E}_{k\sim U(1,K),\epsilon\sim\mathcal{N}(0,\bm{I}),\tau^{0}_{i}\sim D_{i}}[||\epsilon-\epsilon_{\theta}(\tau_{i}^{k},\mathcal{C}_{task~i},k)||_{2}^{2}], \tag{1}$$
$$\tau_{i}^{k-1}=\frac{\sqrt{\bar{\alpha}_{k-1}}\beta_{k}}{1-\bar{\alpha}_{k}}\cdot\bar{\tau}_{i}+\frac{\sqrt{\alpha_{k}}(1-\bar{\alpha}_{k-1})}{1-\bar{\alpha}_{k}}\tau_{i}^{k}+|\Sigma_{k}|\bm{z}, \tag{2}$$

where $\bm{z}\sim\mathcal{N}(\bm{0},\bm{I})$, $\bar{\tau}_{i}=\frac{\tau_{i}^{k}-\sqrt{1-\bar{\alpha}_{k}}\bar{\epsilon}}{\sqrt{\bar{\alpha}_{k}}}$, $|\Sigma_{k}|=\frac{1-\bar{\alpha}_{k-1}}{1-\bar{\alpha}_{k}}\beta_{k}$, and $\bar{\epsilon}=\epsilon_{\theta}(\tau_{i}^{k},\emptyset,k)+\omega(\epsilon_{\theta}(\tau_{i}^{k},\mathcal{C}_{task},k)-\epsilon_{\theta}(\tau_{i}^{k},\emptyset,k))$.
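A single generation step of Eq. (2) can be sketched as follows. The code follows the equation as printed, including scaling $\bm{z}$ by $|\Sigma_{k}|$ (many DDPM implementations scale the noise by the square root of this quantity instead). The linear $\beta$ schedule, the convention $\bar{\alpha}_{0}=1$, the shapes, and the use of the true noise in place of the guided prediction $\bar{\epsilon}$ are assumptions made only for illustration.

```python
import numpy as np


def predict_clean(tau_k, eps_bar, a_bar_k):
    """tau_bar: the clean sequence implied by tau^k and the guided noise."""
    return (tau_k - np.sqrt(1.0 - a_bar_k) * eps_bar) / np.sqrt(a_bar_k)


def reverse_step(tau_k, eps_bar, k, alphas, alpha_bars, rng):
    """One step tau^k -> tau^{k-1} following Eq. (2); k is 1-indexed."""
    a_k, a_bar_k = alphas[k - 1], alpha_bars[k - 1]
    a_bar_prev = alpha_bars[k - 2] if k > 1 else 1.0  # assumed convention: alpha_bar_0 = 1
    beta_k = 1.0 - a_k
    tau_bar = predict_clean(tau_k, eps_bar, a_bar_k)
    coef_clean = np.sqrt(a_bar_prev) * beta_k / (1.0 - a_bar_k)
    coef_noisy = np.sqrt(a_k) * (1.0 - a_bar_prev) / (1.0 - a_bar_k)
    sigma_k = (1.0 - a_bar_prev) / (1.0 - a_bar_k) * beta_k  # |Sigma_k| as in Eq. (2)
    z = rng.standard_normal(tau_k.shape)
    return coef_clean * tau_bar + coef_noisy * tau_k + sigma_k * z


rng = np.random.default_rng(0)
alphas = 1.0 - np.linspace(1e-4, 2e-2, 50)  # assumed linear beta schedule, K = 50
alpha_bars = np.cumprod(alphas)
tau0 = rng.standard_normal((4, 5))  # hypothetical clean sequence
eps = rng.standard_normal((4, 5))
k = 30
tau_k = np.sqrt(alpha_bars[k - 1]) * tau0 + np.sqrt(1.0 - alpha_bars[k - 1]) * eps
# Using the true noise as eps_bar mimics a perfect noise predictor.
tau_prev = reverse_step(tau_k, eps, k, alphas, alpha_bars, rng)
```

Two properties are easy to check: when the noise prediction is exact, `predict_clean` returns the original clean sequence, and at $k=1$ the step reduces to $\bar{\tau}_{i}$ because both the $\tau_{i}^{k}$ coefficient and the noise scale vanish.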

Experience Rehearsal.    In this paper, we propose periodic rehearsal to strengthen the knowledge of previous tasks, which mimics the memory consolidation mechanism of hippocampal replay in biological systems. When a new dataset $D_{i}$ of task $i$ arrives, we preserve a small portion $\xi$ of the entire dataset, denoted as $\mathscr{D}_{i}$. Because the retained dataset $\mathscr{D}_{i}$ is small, most rehearsal-based methods easily overfit it. Fortunately, as suggested by distributionally robust optimization, increasing the hardness of the samples hinders memory overfitting. The discrete diffusion process $\tau^{k}=\sqrt{\alpha_{k}}\tau^{k-1}+\sqrt{1-\alpha_{k}}\epsilon$ can be reformulated as the corresponding continuous forward process $d\tau=-\frac{1}{2}\beta(t)\tau dt+\sqrt{\beta(t)}dW$, where $W$ is the standard Wiener process (a.k.a. Brownian motion). This process gradually injects directional noise (i.e., increases the hardness) to transform the trajectory distribution into a Gaussian distribution.
Rehearsal-based diffusers therefore naturally possess the capability of reducing memory overfitting, and the total objective function is

$$\min_{\forall\theta\in\Theta}[\mathbb{E}_{\tau_{j}\in D_{j}}\mathcal{L}_{j}(\theta,\tau_{j},\mathcal{C}_{task~j})+\mathbb{E}_{\tau_{i}\in\mathscr{D}_{i},i<j}\mathcal{L}_{i}(\theta,\tau_{i},\mathcal{C}_{task~i})] \tag{3}$$

In practice, we usually set the rehearsal frequency $\upsilon$ to 2 gradient steps and the portion $\xi$ to 10%.
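The bookkeeping behind experience rehearsal can be sketched without any model code. The example below is a guess at one reasonable reading, not the authors' implementation: it assumes that the retained subset is drawn uniformly at random, that "rehearsal frequency $\upsilon$" means every $\upsilon$-th gradient step draws its batch from the buffers of earlier tasks, and that sequences can be represented by integer ids. All function and variable names are hypothetical.

```python
import random


def build_rehearsal_buffer(dataset, xi, rng):
    """Keep a random fraction xi of a task's sequences (at least one)."""
    n_keep = max(1, int(round(xi * len(dataset))))
    return rng.sample(list(dataset), n_keep)


def batch_source(step, upsilon, has_buffer):
    """Assumed schedule: every upsilon-th gradient step rehearses old tasks."""
    if has_buffer and step % upsilon == 0:
        return "rehearsal"
    return "current"


rng = random.Random(0)
# Hypothetical datasets: two tasks, each with 100 sequence ids.
datasets = {0: list(range(100)), 1: list(range(100, 200))}
buffers = {}   # task index -> retained subset of that task's dataset
schedule = []  # (task, gradient step, batch source) for inspection

for j, D_j in datasets.items():
    for step in range(1, 5):
        source = batch_source(step, upsilon=2, has_buffer=bool(buffers))
        schedule.append((j, step, source))
        # A real loop would sample a batch from D_j or from the buffers here
        # and take one gradient step on the corresponding loss.
    # After finishing task j, retain a portion xi of its data for later rehearsal.
    buffers[j] = build_rehearsal_buffer(D_j, xi=0.10, rng=rng)
```

Under this reading, the first task never rehearses because no buffer exists yet, while the second task alternates between current-task and rehearsal batches. A larger $\upsilon$ makes rehearsal steps rarer, which matches the observation that larger $\upsilon$ aggravates forgetting.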

Architecture.    In this paper, we adopt temporal Unet with one-dimensional convolution blocks as the diffusion model to predict noises. Specifically, temporal Unet contains several down-sampling blocks, a middle block, several up-sampling blocks, a time embedding block, and a task embedding block. We train the time embedding block and task embedding block to generate time and task embeddings that are added to the observation-action sequence

$$\tau_{i,t:t+T_{e}-1,n}=\begin{pmatrix}s_{i,t,n}&s_{i,t+1,n}&...&s_{i,t+T_{e}-1,n}\\ a_{i,t,n}&a_{i,t+1,n}&...&a_{i,t+T_{e}-1,n}\end{pmatrix}.$$

In the return-conditional diffusion models, we replace the task embedding block with a return embedding block. Also, following the implementation of low-rank adaptation in Natural Language Processing Hu et al. (2021); Mangrulkar et al. (2022), we add LoRA modules to the down-sampling, middle, and up-sampling blocks to construct the LoRA variant CoD-LoRA.
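To illustrate the low-rank adaptation idea behind CoD-LoRA, the sketch below wraps a frozen weight matrix with a trainable low-rank update $\frac{\alpha}{r}BA$, following the formulation in Hu et al. (2021). It is not the authors' implementation: the paper applies LoRA inside the convolutional blocks of the temporal Unet, whereas this example uses a dense layer for brevity, and the rank, scaling factor, and layer sizes are hypothetical.

```python
import numpy as np


class LoRALinear:
    """Dense layer y = x W^T plus a scaled low-rank update x (B A)^T."""

    def __init__(self, weight, rank, alpha, rng):
        out_dim, in_dim = weight.shape
        self.weight = weight                                  # frozen pretrained weight
        self.A = 0.01 * rng.standard_normal((rank, in_dim))   # trainable, small random init
        self.B = np.zeros((out_dim, rank))                    # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        base = x @ self.weight.T
        update = (x @ self.A.T) @ self.B.T
        return base + self.scale * update

    def num_trainable(self):
        """Number of parameters in the low-rank factors A and B."""
        return self.A.size + self.B.size


rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))  # hypothetical frozen layer, 32 -> 64
layer = LoRALinear(W, rank=4, alpha=8.0, rng=rng)
x = rng.standard_normal((10, 32))
y = layer(x)
```

Because $B$ starts at zero, the wrapped layer initially reproduces the frozen layer exactly, and only $r(d_{in}+d_{out})$ parameters (384 here, versus 2048 in the full matrix) would be updated when adapting to a new task.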

4.4 Conclusion

First of all, to facilitate the development of the continual offline RL community, a continual offline benchmark that contains 90 tasks is constructed based on Continual World and Gym-MuJoCo. Based on our benchmark, we propose Continual Diffuser (CoD), an effective continual offline RL method that possesses the capabilities of plasticity and stability with experience rehearsal. Finally, extensive experiments illustrate the superior plasticity-stability trade-off when compared with representative continual RL baselines.

CODE AND DATA AVAILABILITY

The code and data are available on GitHub at https://github.com/JF-Hu/Continual_Diffuser.

ACKNOWLEDGEMENT

We would like to thank Lijun Bian for her contributions to the figures and tables of this manuscript. We also thank Runliang Niu for his help with computing resources.

References

  • Abel et al. (2023) David Abel, André Barreto, Benjamin Van Roy, Doina Precup, Hado van Hasselt, and Satinder Singh. A definition of continual reinforcement learning. arXiv preprint arXiv:2307.11046, 2023.
  • Ajay et al. (2022) Anurag Ajay, Yilun Du, Abhi Gupta, Joshua Tenenbaum, Tommi Jaakkola, and Pulkit Agrawal. Is conditional generative modeling all you need for decision-making? arXiv preprint arXiv:2211.15657, 2022.
  • Aljundi et al. (2018) Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European conference on computer vision (ECCV), pages 139–154, 2018.
  • Almalioglu et al. (2022) Yasin Almalioglu, Mehmet Turan, Niki Trigoni, and Andrew Markham. Deep learning-based robust positioning for all-weather autonomous driving. Nature machine intelligence, 4(9):749–760, 2022.
  • Anand and Precup (2023) Nishanth Anand and Doina Precup. Prediction and control in continual reinforcement learning. arXiv preprint arXiv:2312.11669, 2023.
  • Atkinson et al. (2021) Craig Atkinson, Brendan McCane, Lech Szymanski, and Anthony Robins. Pseudo-rehearsal: Achieving deep reinforcement learning without catastrophic forgetting. Neurocomputing, 428:291–307, 2021.
  • Beeson and Montana (2023) Alex Beeson and Giovanni Montana. Balancing policy constraint and ensemble size in uncertainty-based offline reinforcement learning. arXiv preprint arXiv:2303.14716, 2023.
  • Berariu et al. (2021) Tudor Berariu, Wojciech Czarnecki, Soham De, Jorg Bornschein, Samuel Smith, Razvan Pascanu, and Claudia Clopath. A study on the plasticity of neural networks. arXiv preprint arXiv:2106.00042, 2021.
  • Chaudhry et al. (2018) Arslan Chaudhry, Marc’Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient lifelong learning with a-gem. arXiv preprint arXiv:1812.00420, 2018.
  • Chen et al. (2021) Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling. Advances in neural information processing systems, 34:15084–15097, 2021.
  • Chi et al. (2023) Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. arXiv preprint arXiv:2303.04137, 2023.
  • Dohare et al. (2024) Shibhansh Dohare, J Fernando Hernandez-Garcia, Qingfeng Lan, Parash Rahman, A Rupam Mahmood, and Richard S Sutton. Loss of plasticity in deep continual learning. Nature, 632(8026):768–774, 2024.
  • Fontanesi et al. (2019) Laura Fontanesi, Sebastian Gluth, Mikhail S Spektor, and Jörg Rieskamp. A reinforcement learning diffusion decision model for value-based decisions. Psychonomic bulletin & review, 26(4):1099–1121, 2019.
  • Foret et al. (2020) Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. arXiv preprint arXiv:2010.01412, 2020.
  • Gao et al. (2021) Chongkai Gao, Haichuan Gao, Shangqi Guo, Tianren Zhang, and Feng Chen. Cril: Continual robot imitation learning via generative and prediction model. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 6747–6754. IEEE, 2021.
  • Ghosh et al. (2022) Dibya Ghosh, Anurag Ajay, Pulkit Agrawal, and Sergey Levine. Offline rl policies should be trained to be adaptive. In International Conference on Machine Learning, pages 7513–7530. PMLR, 2022.
  • Haarnoja et al. (2018) Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pages 1861–1870. PMLR, 2018.
  • He et al. (2023) Haoran He, Chenjia Bai, Kang Xu, Zhuoran Yang, Weinan Zhang, Dong Wang, Bin Zhao, and Xuelong Li. Diffusion model is an effective planner and data synthesizer for multi-task reinforcement learning. arXiv preprint arXiv:2305.18459, 2023.
  • Ho and Ermon (2016) Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. Advances in neural information processing systems, 29, 2016.
  • Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
  • Hong et al. (2023) Joey Hong, Anca Dragan, and Sergey Levine. Offline rl with observation histories: Analyzing and improving sample complexity. arXiv preprint arXiv:2310.20663, 2023.
  • Hu et al. (2021) Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
  • Huang et al. (2024) Kaixin Huang, Li Shen, Chen Zhao, Chun Yuan, and Dacheng Tao. Solving continual offline reinforcement learning with decision transformer. arXiv preprint arXiv:2401.08478, 2024.
  • Janner et al. (2021) Michael Janner, Qiyang Li, and Sergey Levine. Offline reinforcement learning as one big sequence modeling problem. Advances in neural information processing systems, 34:1273–1286, 2021.
  • Janner et al. (2022) Michael Janner, Yilun Du, Joshua B Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis. arXiv preprint arXiv:2205.09991, 2022.
  • Kang et al. (2023) Bingyi Kang, Xiao Ma, Chao Du, Tianyu Pang, and Shuicheng Yan. Efficient diffusion policies for offline reinforcement learning. arXiv preprint arXiv:2305.20081, 2023.
  • Kaplanis et al. (2019) Christos Kaplanis, Murray Shanahan, and Claudia Clopath. Policy consolidation for continual reinforcement learning. arXiv preprint arXiv:1902.00255, 2019.
  • Kaufmann et al. (2023) Elia Kaufmann, Leonard Bauersfeld, Antonio Loquercio, Matthias Müller, Vladlen Koltun, and Davide Scaramuzza. Champion-level drone racing using deep reinforcement learning. Nature, 620(7976):982–987, 2023.
  • Kessler et al. (2020) Samuel Kessler, Jack Parker-Holder, Philip Ball, Stefan Zohren, and Stephen J Roberts. Unclear: A straightforward method for continual reinforcement learning. In Proceedings of the 37th International Conference on Machine Learning, 2020.
  • Kidambi et al. (2020) Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, and Thorsten Joachims. Morel: Model-based offline reinforcement learning. Advances in neural information processing systems, 33:21810–21823, 2020.
  • Kirkpatrick et al. (2017) James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017.
  • Korycki and Krawczyk (2021) Lukasz Korycki and Bartosz Krawczyk. Class-incremental experience replay for continual learning under concept drift. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3649–3658, 2021.
  • Kostrikov et al. (2021) Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit q-learning. arXiv preprint arXiv:2110.06169, 2021.
  • Kumar et al. (2020) Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:1179–1191, 2020.
  • Laskin et al. (2020) Misha Laskin, Kimin Lee, Adam Stooke, Lerrel Pinto, Pieter Abbeel, and Aravind Srinivas. Reinforcement learning with augmented data. Advances in neural information processing systems, 33:19884–19895, 2020.
  • Lee et al. (2024) Hojoon Lee, Hanseul Cho, Hyunseung Kim, Daehoon Gwak, Joonkee Kim, Jaegul Choo, Se-Young Yun, and Chulhee Yun. Plastic: Improving input and label plasticity for sample efficient reinforcement learning. Advances in Neural Information Processing Systems, 36, 2024.
  • Levine et al. (2020) Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
  • Li et al. (2020) Jiachen Li, Quan Vuong, Shuang Liu, Minghua Liu, Kamil Ciosek, Henrik Christensen, and Hao Su. Multi-task batch reinforcement learning with metric learning. Advances in Neural Information Processing Systems, 33:6197–6210, 2020.
  • Li et al. (2023) Yixiao Li, Yifan Yu, Chen Liang, Pengcheng He, Nikos Karampatziakis, Weizhu Chen, and Tuo Zhao. Loftq: Lora-fine-tuning-aware quantization for large language models. arXiv preprint arXiv:2310.08659, 2023.
  • Liu et al. (2023) Xihui Liu, Dong Huk Park, Samaneh Azadi, Gong Zhang, Arman Chopikyan, Yuxiao Hu, Humphrey Shi, Anna Rohrbach, and Trevor Darrell. More control for free! image synthesis with semantic diffusion guidance. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 289–299, 2023.
  • Mallya and Lazebnik (2018) Arun Mallya and Svetlana Lazebnik. Packnet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 7765–7773, 2018.
  • Mangrulkar et al. (2022) Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, Sayak Paul, and Benjamin Bossan. Peft: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft, 2022.
  • Mendez and Eaton (2022) Jorge A Mendez and Eric Eaton. How to reuse and compose knowledge for a lifetime of tasks: A survey on continual learning and functional composition. arXiv preprint arXiv:2207.07730, 2022.
  • Meyer et al. (2023) Edan Meyer, Adam White, and Marlos C Machado. Harnessing discrete representations for continual reinforcement learning. arXiv preprint arXiv:2312.01203, 2023.
  • Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. nature, 518(7540):529–533, 2015.
  • Nguyen et al. (2017) Cuong V Nguyen, Yingzhen Li, Thang D Bui, and Richard E Turner. Variational continual learning. arXiv preprint arXiv:1710.10628, 2017.
  • Nguyen-Tang and Arora (2024) Thanh Nguyen-Tang and Raman Arora. On sample-efficient offline reinforcement learning: Data diversity, posterior sampling, and beyond. arXiv preprint arXiv:2401.03301, 2024.
  • Ni et al. (2023) Fei Ni, Jianye Hao, Yao Mu, Yifu Yuan, Yan Zheng, Bin Wang, and Zhixuan Liang. Metadiffuser: Diffusion model as conditional planner for offline meta-rl. arXiv preprint arXiv:2305.19923, 2023.
  • Nichol and Dhariwal (2021) Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162–8171. PMLR, 2021.
  • Peng et al. (2023) Liangzu Peng, Paris Giampouras, and René Vidal. The ideal continual learner: An agent that never forgets. In International Conference on Machine Learning, pages 27585–27610. PMLR, 2023.
  • Rafailov et al. (2021) Rafael Rafailov, Tianhe Yu, Aravind Rajeswaran, and Chelsea Finn. Offline reinforcement learning from images with latent space models. In Learning for Dynamics and Control, pages 1154–1168. PMLR, 2021.
  • Rakelly et al. (2019) Kate Rakelly, Aurick Zhou, Chelsea Finn, Sergey Levine, and Deirdre Quillen. Efficient off-policy meta-reinforcement learning via probabilistic context variables. In International conference on machine learning, pages 5331–5340. PMLR, 2019.
  • Rigter et al. (2022) Marc Rigter, Bruno Lacerda, and Nick Hawes. Rambo-rl: Robust adversarial model-based offline reinforcement learning. arXiv preprint arXiv:2204.12581, 2022.
  • Rolnick et al. (2019) David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, and Gregory Wayne. Experience replay for continual learning. Advances in Neural Information Processing Systems, 32, 2019.
  • Rombach et al. (2022) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
  • Saharia et al. (2022) Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022.
  • Schwarz et al. (2018) Jonathan Schwarz, Wojciech Czarnecki, Jelena Luketina, Agnieszka Grabska-Barwinska, Yee Whye Teh, Razvan Pascanu, and Raia Hadsell. Progress & compress: A scalable framework for continual learning. In International conference on machine learning, pages 4528–4537. PMLR, 2018.
  • Shin et al. (2017) Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay. Advances in neural information processing systems, 30, 2017.
  • Smith et al. (2023a) James Seale Smith, Yen-Chang Hsu, Lingyu Zhang, Ting Hua, Zsolt Kira, Yilin Shen, and Hongxia Jin. Continual diffusion: Continual customization of text-to-image diffusion with c-lora. arXiv preprint arXiv:2304.06027, 2023a.
  • Smith et al. (2023b) James Seale Smith, Junjiao Tian, Shaunak Halbe, Yen-Chang Hsu, and Zsolt Kira. A closer look at rehearsal-free continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2409–2419, 2023b.
  • Sohl-Dickstein et al. (2015) Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
  • Song et al. (2020) Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
  • Sun et al. (2023) Yanchao Sun, Shuang Ma, Ratnesh Madaan, Rogerio Bonatti, Furong Huang, and Ashish Kapoor. Smart: Self-supervised multi-task pretraining with control transformers. arXiv preprint arXiv:2301.09816, 2023.
  • Todorov et al. (2012) Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ international conference on intelligent robots and systems, pages 5026–5033. IEEE, 2012.
  • Van de Ven et al. (2022) Gido M Van de Ven, Tinne Tuytelaars, and Andreas S Tolias. Three types of incremental learning. Nature Machine Intelligence, 4(12):1185–1197, 2022.
  • Wang et al. (2023a) Yuanfu Wang, Chao Yang, Ying Wen, Yu Liu, and Yu Qiao. Critic-guided decision transformer for offline reinforcement learning. arXiv preprint arXiv:2312.13716, 2023a.
  • Wang et al. (2022a) Zhendong Wang, Jonathan J Hunt, and Mingyuan Zhou. Diffusion policies as an expressive policy class for offline reinforcement learning. arXiv preprint arXiv:2208.06193, 2022a.
  • Wang et al. (2023b) Zhenyi Wang, Li Shen, Tiehang Duan, Qiuling Suo, Le Fang, Wei Liu, and Mingchen Gao. Distributionally robust memory evolution with generalized divergence for continual learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023b.
  • Wang et al. (2023c) Zhenyi Wang, Enneng Yang, Li Shen, and Heng Huang. A comprehensive survey of forgetting in deep learning beyond continual learning. arXiv preprint arXiv:2307.09218, 2023c.
  • Wang et al. (2022b) Zhi Wang, Chunlin Chen, and Daoyi Dong. A dirichlet process mixture of robust task models for scalable lifelong reinforcement learning. IEEE Transactions on Cybernetics, 2022b.
  • Wei et al. (2023) Yao Wei, Yanchao Sun, Ruijie Zheng, Sai Vemprala, Rogerio Bonatti, Shuhang Chen, Ratnesh Madaan, Zhongjie Ba, Ashish Kapoor, and Shuang Ma. Is imitation all you need? generalized decision-making with dual-phase training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16221–16231, 2023.
  • Wołczyk et al. (2021) Maciej Wołczyk, Michał Zajac, Razvan Pascanu, Łukasz Kucinski, and Piotr Miłoś. Continual world: A robotic benchmark for continual reinforcement learning. Advances in Neural Information Processing Systems, 34:28496–28510, 2021.
  • Wu et al. (2019) Yifan Wu, George Tucker, and Ofir Nachum. Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361, 2019.
  • Xu et al. (2022) Mengdi Xu, Yikang Shen, Shun Zhang, Yuchen Lu, Ding Zhao, Joshua Tenenbaum, and Chuang Gan. Prompting decision transformer for few-shot policy generalization. In international conference on machine learning, pages 24631–24645. PMLR, 2022.
  • Yang et al. (2023) Yijun Yang, Tianyi Zhou, Jing Jiang, Guodong Long, and Yuhui Shi. Continual task allocation in meta-policy network via sparse prompting. arXiv preprint arXiv:2305.18444, 2023.
  • Yu et al. (2020) Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on robot learning, pages 1094–1100. PMLR, 2020.
  • Yue et al. (2024) William Yue, Bo Liu, and Peter Stone. t-dgr: A trajectory-based deep generative replay method for continual learning in decision making. arXiv preprint arXiv:2401.02576, 2024.
  • Zeng et al. (2019) Guanxiong Zeng, Yang Chen, Bo Cui, and Shan Yu. Continual learning of context-dependent processing in neural networks. Nature Machine Intelligence, 1(8):364–372, 2019.
  • Zhang et al. (2023a) Qizhe Zhang, Bocheng Zou, Ruichuan An, Jiaming Liu, and Shanghang Zhang. Split & merge: Unlocking the potential of visual adapters via sparse training. arXiv preprint arXiv:2312.02923, 2023a.
  • Zhang et al. (2022) Tiantian Zhang, Xueqian Wang, Bin Liang, and Bo Yuan. Catastrophic interference in reinforcement learning: A solution based on context division and knowledge distillation. IEEE Transactions on Neural Networks and Learning Systems, 2022.
  • Zhang et al. (2023b) Tiantian Zhang, Zichuan Lin, Yuxing Wang, Deheng Ye, Qiang Fu, Wei Yang, Xueqian Wang, Bin Liang, Bo Yuan, and Xiu Li. Dynamics-adaptive continual reinforcement learning via progressive contextualization. IEEE Transactions on Neural Networks and Learning Systems, 2023b.
  • Zhang et al. (2023c) Tiantian Zhang, Kevin Zehua Shen, Zichuan Lin, Bo Yuan, Xueqian Wang, Xiu Li, and Deheng Ye. Replay-enhanced continual reinforcement learning. arXiv preprint arXiv:2311.11557, 2023c.
  • Zhu et al. (2023) Zhengbang Zhu, Hanye Zhao, Haoran He, Yichao Zhong, Shenyu Zhang, Yong Yu, and Weinan Zhang. Diffusion models for reinforcement learning: A survey. arXiv preprint arXiv:2311.01223, 2023.

5 Supplementary Material

5.1 Pseudocode of Continual Diffuser

The pseudocode for CoD training is shown in Algorithm 1. Before training, we process the datasets of the $I$ tasks by splitting the trajectories into equal-length sequences and normalizing the sequences to facilitate learning. As shown in lines 9–24, for each task $i$, we check the task index in the whole task sequence and draw samples from the corresponding buffer. For example, for task $i$ with $i>0$, we perform experience rehearsal every $\upsilon$ training steps by sampling data from $\mathscr{D}_j, j\in\{0,\dots,i-1\}$, where $j$ is sampled from $U(0,i-1)$. Then, the networks $\epsilon_\theta$, $f_{task}(\phi)$, and $f_{time}(\varphi)$ are updated according to Equation (1) and Equation (3). After training on task $i$, we preserve a small portion ($\xi$) of the dataset in buffer $D_i$ as task $i$'s rehearsal buffer. During the evaluation of multiple tasks (shown in Algorithm 2), we successively generate decisions with CoD and calculate the evaluation metrics.
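
The preprocessing and rehearsal-buffer steps described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the released implementation: the function names `split_into_sequences`, `gaussian_normalize`, and `build_rehearsal_buffer` are hypothetical, the sequences are assumed to be non-overlapping windows, and the rehearsal subset is assumed to be drawn uniformly at random, since the text specifies neither choice.

```python
import random
from statistics import fmean, pstdev


def split_into_sequences(trajectory, seq_len):
    """Cut one trajectory (a list of per-step feature vectors) into
    non-overlapping sequences of exactly seq_len steps.
    Assumption: a tail shorter than seq_len is dropped."""
    return [trajectory[s:s + seq_len]
            for s in range(0, len(trajectory) - seq_len + 1, seq_len)]


def gaussian_normalize(sequences):
    """Standardize every feature dimension to zero mean and unit variance."""
    dim = len(sequences[0][0])
    columns = [[step[d] for seq in sequences for step in seq] for d in range(dim)]
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # guard constant features
    return [[[(step[d] - means[d]) / stds[d] for d in range(dim)]
             for step in seq]
            for seq in sequences]


def build_rehearsal_buffer(sequences, xi, rng):
    """Keep a random fraction xi of a task's sequences (at least one).
    Assumption: uniform sampling without replacement."""
    keep = max(1, int(round(xi * len(sequences))))
    return rng.sample(sequences, keep)


# Toy usage: one 100-step trajectory with 2 features per step.
rng = random.Random(0)
trajectory = [[float(t), float(2 * t)] for t in range(100)]
sequences = split_into_sequences(trajectory, seq_len=48)
normalized = gaussian_normalize(sequences)
rehearsal = build_rehearsal_buffer(normalized, xi=0.1, rng=rng)
```

With $T_e = 48$, the toy trajectory yields two full sequences, and $\xi = 0.1$ keeps a single sequence because the sketch never lets the rehearsal buffer become empty.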

5.2 Implementation Details

Compute.    Experiments are carried out on NVIDIA GeForce RTX 3090 GPUs and NVIDIA A10 GPUs, with Intel(R) Xeon(R) Gold 6230 CPUs @ 2.10GHz. Each run takes about 24-72 hours, depending on the algorithm and the length of the task sequence.

Hyperparameters.     We set the maximum number of diffusion steps to 200 and use Unet as the default backbone. To speed up generation during evaluation, we adopt the DDIM sampling technique Song et al. [2020], which yields a 19.043x acceleration compared to the original generation method. The sequence length is set to 48 in all experiments; a larger sequence length can capture a more sophisticated distribution of trajectories but may also increase the computational burden. We set the LoRA dimension to 64 for each module of the down-sampling, middle, and up-sampling blocks, and the percentage of LoRA parameters is approximately 12% in our experiments.
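
To make the effect of strided sampling concrete, the sketch below enumerates the diffusion steps that a strided sampler would visit. It is a minimal illustration, assuming that steps are indexed from $K$ down to 0 and that the sampler jumps by a fixed stride; the helper name `strided_schedule` is hypothetical. With $K = 200$ and the stride of 10 listed in Table 2, the denoising network is called 20 times instead of 200, a nominal 20x reduction that is of the same order as the reported 19.043x (this correspondence is our reading and is not stated in the text).

```python
def strided_schedule(max_step, stride):
    """Return (k, k_prev) pairs visited by a strided sampler, from max_step down to 0.
    Assumption: step indices run from max_step to 0 with a fixed stride."""
    steps = range(max_step, 0, -stride)
    return [(k, max(k - stride, 0)) for k in steps]


schedule = strided_schedule(max_step=200, stride=10)
print(len(schedule), schedule[0], schedule[-1])  # 20 (200, 190) (10, 0)
```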

Table 2: The hyperparameters of CoD.

| Group | Hyperparameter | Value |
|---|---|---|
| Architecture | network backbone | Unet |
| Architecture | hidden dimension | 128 |
| Architecture | down-sampling blocks | 3 |
| Architecture | middle blocks | 2 |
| Architecture | up-sampling blocks | 2 |
| Architecture | convolution multiply | (1, 4, 8) |
| Architecture | normalizer | Gaussian normalizer |
| Architecture | sampling type of diffusion | DDIM |
| Training | condition guidance $\omega$ | 1.2 |
| Training | max diffusion step $K$ | 200 |
| Training | sequence length $T_e$ | 48 |
| Training | loss function | MSE |
| Training | learning rate | $3\cdot 10^{-4}$ |
| Training | batch size | 32 |
| Training | optimizer | Adam |
| Training | discount factor $\gamma$ | 0.99 |
| Training | LoRA dimension | 64 |
| Training | condition dropout | 0.25 |
| Training | sampling speed-up stride | 10 |
| Training | rehearsal frequency $\upsilon$ | 2 |
| Training | rehearsal sample diversity $\xi$ | 0.1 |


Input: Noise prediction model $\epsilon_\theta$, task MLP $f_{task}(\phi)$, time MLP $f_{time}(\varphi)$, tasks set $\mathcal{M}_i, i\in\{1,...,I\}$, max diffusion step $K$, sequence length $T_e$, state dimension $d_s$, action dimension $d_a$, replay buffer $D_i, i\in\{1,...,I\}$, rehearsal frequency ($\upsilon$), rehearsal sample diversity ($\xi$), noise schedule $\alpha_{0:K}$ and $\beta_{0:K}$
Output: $\epsilon_\theta$, $f_{task}(\phi)$, $f_{time}(\varphi)$
1 Initialization: $\theta$, $\phi$, $\varphi$
2 // Prepare for Training
3 Separate the state-action trajectories of $D_i, i\in\{1,...,I\}$ into state-action sequences with length $T_e$
4 Normalize state-action sequences to obey Gaussian distribution
5 // Training
6 for each task $i$ do
7     for each train epoch do
8         for each train step $m$ do
9             if $i>0$ and $m$ % $\upsilon$ == 0 then
10                 Sample $j$ from $\{0,...,i-1\}$
11                 Sample $b$ sequences $\tau_j^0=\{s_{j,t:t+T_e},a_{j,t:t+T_e}\}\in\mathbb{R}^{b\times T_e\times(d_s+d_a)}$ from task $j$'s rehearsal buffer $\mathscr{D}_j, j<i$
13             else
14                 Sample $b$ sequences $\tau_i^0=\{s_{i,t:t+T_e},a_{i,t:t+T_e}\}\in\mathbb{R}^{b\times T_e\times(d_s+d_a)}$ from task $i$'s buffer $D_i$
16             end if
17             Obtain the corresponding task conditions $\mathcal{C}_{task}$
18             Sample diffusion time step $k\sim\text{Uniform}(K)$
19             Obtain $\tau_i^k$ or $\tau_j^k$ by adding noise to $\tau_i^0$ or $\tau_j^0$
20             Sample Gaussian noise $\epsilon\sim\mathcal{N}(0,\bm{I}),\ \epsilon\in\mathbb{R}^{b\times T_e\times(d_s+d_a)}$
21             Train $\epsilon_\theta$, $f_{task}(\phi)$, and $f_{time}(\varphi)$ according to Equation (1) and Equation (3)
23         end for
24         Save model periodically
26     end for
27     Preserve a small portion ($\xi$) of the dataset of buffer $D_i$ as task $i$'s rehearsal buffer $\mathscr{D}_i$
29 end for
Algorithm 1 Training of Continual Diffuser (CoD)
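
The data-selection rule in lines 9–16 of Algorithm 1 and the noising step in lines 18–20 can be illustrated with a short Python sketch. It is a hedged illustration rather than the authors' code: `choose_source` and `add_noise` are hypothetical names, training steps are assumed to be counted from 0, and the noising uses the standard DDPM closed form $\tau^k = \sqrt{\bar{\alpha}_k}\,\tau^0 + \sqrt{1-\bar{\alpha}_k}\,\epsilon$ from Ho et al. [2020], which we assume matches the forward process used here (Equations (1) and (3) are defined outside this section).

```python
import math
import random


def choose_source(task_index, step, rehearsal_freq, rng):
    """Decide which buffer a training step samples from.
    Returns ("rehearsal", j) on rehearsal steps, else ("current", task_index)."""
    if task_index > 0 and step % rehearsal_freq == 0:
        # j is drawn uniformly from the previous tasks {0, ..., i-1}.
        return "rehearsal", rng.randrange(task_index)
    return "current", task_index


def add_noise(x0, alpha_bar_k, rng):
    """Standard DDPM forward noising of a flat vector x0 (assumed form):
    x_k = sqrt(alpha_bar_k) * x0 + sqrt(1 - alpha_bar_k) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xk = [math.sqrt(alpha_bar_k) * x + math.sqrt(1.0 - alpha_bar_k) * e
          for x, e in zip(x0, eps)]
    return xk, eps


# Toy usage: with rehearsal frequency 2, every second step of task 2 replays
# data from task 0 or task 1, while task 0 never rehearses.
rng = random.Random(0)
for m in range(4):
    print(m, choose_source(task_index=2, step=m, rehearsal_freq=2, rng=rng))
noisy, target_noise = add_noise([0.3, -1.2, 0.7], alpha_bar_k=0.5, rng=rng)
```

Under these assumptions, a rehearsal frequency of $\upsilon = 2$ means that half of the gradient steps on every task after the first are spent on previously seen tasks.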

Input: Well-trained noise prediction model $\epsilon_\theta$, task MLP $f_{task}(\phi)$, time MLP $f_{time}(\varphi)$, tasks set $\mathcal{M}_i, i\in\{1,...,I\}$, max diffusion step $K$, noise schedule $\alpha_{0:K}$ and $\beta_{0:K}$
1 // Prepare for Evaluation
2 Normalize state-action sequences to obey Gaussian distribution
3 // Evaluation
4 for each evaluation task $i$ do
5     for each evaluation step do
6         Receive state $s_{i,t}$ and task identity from the task $i$
7         Obtain the corresponding task conditions $\mathcal{C}_{task}$
8         Let $k=K$
9         Sample $\hat{\tau}^k_i\in\mathbb{R}^{1\times T_e\times(d_s+d_a)}$ from normal distribution $\mathcal{N}(0,\bm{I})$
10         Replace the first state of $\hat{\tau}^k_i$ with $s_{i,t}$
11         for each generation step $k$ do
12             Generate sequences $\hat{\tau}^{k-1}_i$ with $\epsilon_\theta$, $f_{task}(\phi)$, and $f_{time}(\varphi)$ according to Equation (2)
13             Replace the first state of $\hat{\tau}^{k-1}_i$ with $s_{i,t}$
15         end for
16         Perform the first action of $\hat{\tau}^{k-1}_i$ in the task $i$
17         Observe reward $r$ from the task $i$
19     end for
20     Record the success rate on task $i$
22 end for
23 Calculate the total mean success rate
Algorithm 2 Evaluation of Continual Diffuser (CoD)
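
The inner generation loop of Algorithm 2 (lines 8–16) follows a simple pattern: start from noise, denoise step by step, and re-impose the observed state after every step. The sketch below shows this pattern only. It assumes that each sequence step stores the state followed by the action, and it replaces the trained networks with a hypothetical placeholder `toy_denoise_step`; the names `generate_plan` and `toy_denoise_step` are ours, and the real update would follow Equation (2).

```python
import random


def generate_plan(state, seq_len, action_dim, denoise_step, schedule, rng):
    """Denoise a (seq_len, state_dim + action_dim) plan while keeping the
    first state fixed to the observed state; return the first action."""
    state_dim = len(state)
    width = state_dim + action_dim
    # Start from pure Gaussian noise.
    plan = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(seq_len)]
    plan[0][:state_dim] = state
    for k, k_prev in schedule:
        plan = denoise_step(plan, k, k_prev)
        plan[0][:state_dim] = state  # re-impose the observed state
    first_action = plan[0][state_dim:]
    return first_action, plan


def toy_denoise_step(plan, k, k_prev):
    """Placeholder for one reverse step of the trained model (assumption):
    it merely shrinks every entry and carries no learned behavior."""
    return [[0.5 * value for value in row] for row in plan]


# Toy usage: 3-dimensional state, 4-dimensional action, 20 strided steps.
rng = random.Random(0)
schedule = [(k, max(k - 10, 0)) for k in range(200, 0, -10)]
action, plan = generate_plan([1.0, 2.0, 3.0], seq_len=48, action_dim=4,
                             denoise_step=toy_denoise_step,
                             schedule=schedule, rng=rng)
```

Only the first action of the generated plan is executed before the next state is observed and a new plan is generated, which matches the receding-horizon use of the plan in lines 6–17.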

5.3 Related Work

Diffusion-Based Models for RL.    Diffusion models have made substantial progress in many fields, such as image synthesis and text generation Ho et al. [2020], Saharia et al. [2022], Nichol and Dhariwal [2021], Beeson and Montana [2023], Sohl-Dickstein et al. [2015], Rombach et al. [2022]. Recently, a series of works have demonstrated the potential of diffusion-based models in offline RL tasks such as goal-based planning, composable constraint combination, scalable trajectory generation, and complex skill synthesis Janner et al. [2022], Fontanesi et al. [2019], Ajay et al. [2022], Chi et al. [2023]. For example, Janner et al. [2022] propose using the value function as guidance during trajectory generation, which effectively reduces the effects of out-of-distribution actions and reaches remarkable performance in offline RL tasks. Besides, diffusion models can also be used as policies to model the multimodal distribution from states to actions and as planners to perform long-horizon planning Kang et al. [2023], Wang et al. [2022a], He et al. [2023], Ni et al. [2023]. For instance, Kang et al. [2023] use diffusion models as policies to model the distribution from states to actions, while He et al. [2023] endow diffusion models with the ability to perform planning and data augmentation with different task-specific prompts.

Continual Learning in RL.    Continual learning (CL) aims to solve multiple tasks that arrive sequentially with explicit boundaries (task-aware CL) or implicit boundaries (task-free CL) while avoiding catastrophic forgetting and achieving good task transfer, i.e., resolving the plasticity-stability dilemma Zhang et al. [2023c], Meyer et al. [2023], Wang et al. [2023b]. Multitask learning methods He et al. [2023], Laskin et al. [2020] are usually regarded as the upper bound of continual learning. Existing studies on continual RL can be roughly classified into three categories. Structure-based methods focus on novel model structures such as sub-networks, mixture-of-experts, hypernetworks, and low-rank adaptation Wang et al. [2022b], Zhang et al. [2023a], Smith et al. [2023a], Mallya and Lazebnik [2018]. Regularization-based methods use an auxiliary regularization loss to constrain policy optimization and avoid catastrophic forgetting during training Zhang et al. [2023b, 2022], Kessler et al. [2020], Kaplanis et al. [2019]. Rehearsal-based methods preserve experiences of previous tasks or train generative models that produce pseudo-samples to maintain knowledge of past tasks Korycki and Krawczyk [2021], Smith et al. [2023b], Atkinson et al. [2021], Peng et al. [2023]. Besides, recent plasticity-preserving studies Lee et al. [2024], Foret et al. [2020] reveal that the plasticity of models can be enhanced by weight re-initialization and noisification when a model overfits to early interactions within a single task.

Offline RL.    Offline RL focuses on training optimal policies from large, previously collected datasets without expensive and risky data collection Wang et al. [2023a], Levine et al. [2020], Kostrikov et al. [2021], Kumar et al. [2020], Ghosh et al. [2022]. Training remains challenging, however, because of the distribution shift between the learned policy and the data-collecting policy and the overestimation of out-of-distribution (OOD) actions Hong et al. [2023], Kostrikov et al. [2021]. To address these issues, previous studies on offline RL generally rely on methods from constrained optimization, safe learning, imitation learning, and amendatory estimation Kostrikov et al. [2021], Kumar et al. [2020], Wu et al. [2019], Ghosh et al. [2022]. Besides, planning and optimizing in a world model with limited interactions is also a promising way to train satisfactory policies Rigter et al. [2022], Kidambi et al. [2020], Rafailov et al. [2021]. Recently, sequential modeling has been proposed to fit the joint state-action distribution over trajectories with transformer-based and diffusion-based models Janner et al. [2021], Nguyen-Tang and Arora [2024], Chen et al. [2021], Ajay et al. [2022], Zhu et al. [2023], Janner et al. [2022].

5.4 Additional Experiments

Figure 6: Comparison of our method CoD and other baselines on CW10, where the baselines are trained on offline datasets for 500k gradient steps on each task.
Figure 7: Comparison of our method CoD and other baselines on CW10, where the baselines are trained in online environments for 500k interaction steps on each task.

Offline Continual World Results on CW10.     We report the performance on CW10 in Figure 6 when the baselines are trained with offline datasets. The results show that our method (CoD) learns much faster than the other baselines under the same number of gradient updates. Besides, generative methods outperform non-generative methods, which shows the expressiveness of generative models in modeling complex environments and generating high-fidelity pseudo-samples.

Online Continual World Results on CW10.     Apart from the offline comparison, we also modify the original baselines and conduct experiments on CW10 with several new online baselines. The results in Figure 7 likewise show that our method (CoD) surpasses the baselines by a large margin. We do not include the offline baselines trained with generative models in the online comparison because their generative process takes much more time per interaction, which makes online training prohibitively slow. These baselines are better suited to training on offline datasets.

Mixed Dataset Training Analysis.     We classify training under sub-optimal demonstrations into two situations. (Return to Section 2.5 to continue reading the main body.)

The first is learning from noisy datasets. To simulate training under sub-optimal demonstrations, we inject noise into the observations of the current dataset, i.e., $\bar{o} = o + \mathrm{clip}(\eta \cdot \mathcal{N}(0, I), \rho)$, where $\eta$ is the noise level and $\rho$ is the clipping bound. Larger noise corresponds to lower-quality datasets. We report the results in Table 3. Performance decreases as the noise increases, which motivates additional techniques to reduce the influence of noise on samples, such as adding an extra denoising module before diffuser training.

Table 3: The experiments of CoD when training with noise datasets on the Ant-dir tasks.
| Noise level $\eta$ | 0 | 0.1 | 0.5 |
| --- | --- | --- | --- |
| Bound $\rho$ | - | (-0.5, 0.5) | (-1.0, 1.0) |
| Score | 478.19 ± 15.84 | 247.41 ± 5.48 | 163.00 ± 5.16 |
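
The observation corruption used for the noisy datasets can be sketched in a few lines of NumPy. This is a minimal illustration of $\bar{o} = o + \mathrm{clip}(\eta \cdot \mathcal{N}(0, I), \rho)$, assuming that $\rho$ is a (low, high) pair as listed in Table 3. The function name `add_observation_noise` and the all-zero observations are placeholders, not part of the released code.

```python
import numpy as np


def add_observation_noise(obs, eta, rho, rng):
    """Return obs + clip(eta * N(0, I), rho[0], rho[1])."""
    noise = eta * rng.standard_normal(obs.shape)
    return obs + np.clip(noise, rho[0], rho[1])


rng = np.random.default_rng(0)
obs = np.zeros((1000, 27))  # 27 is the Ant-dir observation dimension
noisy = add_observation_noise(obs, eta=0.5, rho=(-1.0, 1.0), rng=rng)
clean = add_observation_noise(obs, eta=0.0, rho=(-1.0, 1.0), rng=rng)
```

With `eta=0.0` the observations are returned unchanged, which corresponds to the first column of Table 3.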

The second is learning from datasets sampled with mixed-quality policies. We construct ‘medium’ datasets on several Continual World tasks (CW4) to evaluate performance on mixed-quality data, where the trajectories come from a series of behavior policies produced during the training stage. As training proceeds, the policy network is updated many times, and we regard the policy after each gradient update as a new behavior policy whose performance gradually improves. We then use the behavior policies whose performance ranges from medium to well-trained to collect the ‘medium’ datasets, so these datasets contain both unsuccessful and successful trajectories (refer to Table 11 for more statistics). On the mixed-quality CW4 datasets, we adopt IL as the baseline and compare our method with it. The corresponding experimental results are shown in Table 1 (c). Our method (CoD) achieves better performance than the baseline in the ‘medium’ dataset quality setting, which shows its effectiveness.

Plasticity Comparison.     To compare the plasticity of our method with representative plasticity-preserving methods Lee et al. [2024], Foret et al. [2020], we conduct experiments on the Ant-dir environment with the task setting ‘10-15-19-25’, which is the same as the setting in the main body. The results are reported in Table 4, where the final performance is the evaluation on all tasks after training on all tasks, and the task-level performance gain of plasticity is calculated as mean(P(train15, test15) - P(train10, test15) + P(train19, test19) - P(train15, test19) + P(train25, test25) - P(train19, test25)). The results show that our model reaches better final performance than PLASTIC and SAM and also obtains a higher task-level plasticity score. Although PLASTIC and SAM do not perform well here, it is worth noting that they are not designed for continual learning under changing tasks but for addressing overfitting to early interactions within a single task. The granularity of plasticity considered in CoD is therefore coarser than that in PLASTIC and SAM. (Return to Section 2.5 to continue reading the main body.)

Table 4: The comparison of our method and plasticity-preserving methods PLASTIC and SAM on the Ant-dir environment. We report the performance of PLASTIC and SAM with online and offline training under the continual learning setting.
| Model | CoD | Diffuser-w/o rehearsal | PLASTIC (online) | SAM (online) | PLASTIC (offline) | SAM (offline) |
| --- | --- | --- | --- | --- | --- | --- |
| Final performance | 478.19 ± 15.84 | 270.44 ± 5.54 | 201.45 ± 0.56 | 202.17 ± 0.46 | 186.71 ± 4.55 | 187.19 ± 4.53 |
| Performance gain of plasticity (task-level) | 407.84 | 348.15 | 8.70 | 10.94 | 4.60 | 55.77 |
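
The task-level plasticity gain can be made concrete with a short sketch. It assumes that P(train a, test b) denotes the performance on task b after training up to task a, and that the mean is taken over the three task transitions of the ‘10-15-19-25’ sequence. The numbers in `perf` are invented for illustration and are not results from the paper.

```python
def plasticity_gain(perf, task_order):
    """Average gain on each new task from training on it.

    perf[(a, b)] is the performance on task b after training up to task a
    (hypothetical data layout).
    """
    gains = []
    for prev, cur in zip(task_order[:-1], task_order[1:]):
        gains.append(perf[(cur, cur)] - perf[(prev, cur)])
    return sum(gains) / len(gains)


# Illustrative values only.
perf = {
    (15, 15): 500.0, (10, 15): 100.0,
    (19, 19): 360.0, (15, 19): 60.0,
    (25, 25): 480.0, (19, 25): 80.0,
}
gain = plasticity_gain(perf, task_order=[10, 15, 19, 25])
```

Each term measures how much training on a new task improves performance on that task relative to the model trained only up to the previous task, so a larger value indicates faster adaptation.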

Parameter Sensitivity on Ant-dir.     In Section 2.6, we conduct the parameter sensitivity analysis on CW to show the effects of rehearsal frequency $\upsilon$ and rehearsal diversity $\xi$. We also report the parameter sensitivity results on Ant-dir in Figure 5 and Table 6, where $\upsilon = \mathrm{inf}$ means that no rehearsal is performed during training. The results show that performance declines as $\upsilon$ increases, because the model cannot revisit previous datasets often enough to consolidate its memory.

Table 5: The ablation study of CoD (mean episode return).
| Method | Ant-dir | CW10 | CW20 |
| --- | --- | --- | --- |
| CoD-w/o rehearsal | 270.44 ± 5.54 | 0.20 ± 0.01 | 0.18 ± 0.01 |
| CoD (Ours) | 478.19 ± 15.84 | 0.98 ± 0.01 | 0.98 ± 0.01 |
Table 6: The absolute performance of parameters sensitivity of Ant-dir. Rows are values of $\upsilon$ and columns are values of $\xi$.
| $\upsilon$ | $\xi$ = 1% | $\xi$ = 5% | $\xi$ = 10% | $\xi$ = 20% |
| --- | --- | --- | --- | --- |
| 2 | 383.07 ± 8.71 | 465.82 ± 5.03 | 475.82 ± 8.14 | 493.83 ± 3.17 |
| 10 | 344.73 ± 0.00 | 381.27 ± 0.00 | 368.42 ± 0.00 | 369.94 ± 0.00 |
| 14 | 331.29 ± 0.00 | 351.79 ± 0.00 | 345.71 ± 0.00 | 352.32 ± 0.00 |
| inf | 271.79 ± 8.48 | 271.79 ± 8.48 | 271.79 ± 8.48 | 271.79 ± 8.48 |
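
To make the roles of the two rehearsal parameters concrete, the following sketch shows one plausible reading. It assumes that $\xi$ is the fraction of a finished task's trajectories kept in the rehearsal buffer and that a rehearsal batch is drawn every $\upsilon$-th gradient step. Both function names and the exact scheduling rule are assumptions for illustration; the released implementation may differ.

```python
import math
import random


def build_rehearsal_buffer(trajectories, xi, rng):
    """Keep a fraction `xi` of a finished task's trajectories (assumed rule)."""
    keep = max(1, int(round(len(trajectories) * xi)))
    return rng.sample(trajectories, keep)


def batch_sources(num_steps, upsilon):
    """Label each gradient step as using current-task or rehearsal data.

    A rehearsal batch is used every `upsilon`-th step; `math.inf`
    disables rehearsal entirely.
    """
    sources = []
    for step in range(1, num_steps + 1):
        rehearse = math.isfinite(upsilon) and step % upsilon == 0
        sources.append("rehearsal" if rehearse else "current")
    return sources


rng = random.Random(0)
old_task = [f"traj_{i}" for i in range(1000)]
buffer = build_rehearsal_buffer(old_task, xi=0.10, rng=rng)
frequent = batch_sources(28, upsilon=2)
sparse = batch_sources(28, upsilon=14)
never = batch_sources(28, upsilon=math.inf)
```

Under this reading, a larger $\upsilon$ means fewer rehearsal updates per task, which is consistent with the decline in Table 6 as $\upsilon$ grows.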

Efficiency Analysis of Generation Speed.     The generation process of diffusion models is computationally intensive because generating a sequence requires multiple denoising rounds. However, we can draw on previous studies Nichol and Dhariwal [2021], Song et al. [2020] in related domains to accelerate generation, for example by reducing the number of reverse diffusion steps from 200 to 10. To quantify the acceleration, we compare generation speed in Table 7, where the 200-step setting is the original version and the 10-step setting is our accelerated version. In the experiments of our manuscript, we adopt the 10-step setting, which speeds up sampling by 19.043× over the original version. Our acceleration technique also supports other numbers of diffusion steps, but we find that the 10-step setting offers a good balance between performance and generation efficiency.

Table 7: The comparison of generation speed with different generation steps. In the main body of our manuscript, we use the 10 diffusion steps setting for all experiments.
| Diffusion steps | 200 (original) | 100 | 50 | 25 | 10 |
| --- | --- | --- | --- | --- | --- |
| Time consumption per generation (s) | 3.085 ± 0.077 | 1.839 ± 0.116 | 0.850 ± 0.007 | 0.394 ± 0.006 | 0.159 ± 0.005 |
| Speed-up ratio | - | 1.678× | 3.629× | 7.830× | 19.043× |
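
The speed-up ratios are the original generation time divided by the accelerated one. The short check below recomputes them from the mean times in Table 7; the variable names are illustrative. The ratios for 100, 50, and 25 steps match the table to three decimals, while the 10-step ratio recomputed from the rounded times comes out near 19.4, slightly above the reported 19.043×. The difference may stem from rounding or from how the reported value was measured.

```python
# Mean time per generation in seconds, keyed by number of diffusion steps.
times = {200: 3.085, 100: 1.839, 50: 0.850, 25: 0.394, 10: 0.159}

baseline = times[200]
speedups = {steps: baseline / t for steps, t in times.items()}

for steps, ratio in speedups.items():
    print(f"{steps:>3} steps: {ratio:.3f}x")
```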

Ablation Study on Mixed Datasets.     In Table 8, we report the effects of rehearsal sample diversity $\xi$ on the ‘medium’ datasets. The results show that increasing the rehearsal sample diversity improves performance, which is in line with the experiments in the main body of our manuscript. Besides, the results also show that our method (CoD) reaches a better plasticity-stability trade-off than the baseline in the ‘medium’ dataset quality setting.

Table 8: The ablation study of CoD on ‘medium’ CW4 datasets. We select the continual tasks setting as CW4 (“hammer-v1”, “push-wall-v1”, “faucet-close-v1”, “push-back-v1”), where the ‘medium’ experiences come from the behavior policies from the middle training stage to the end training stage.
| Model | $\upsilon$ | $\xi$ | P ↑ | FT ↑ | F ↓ | P+FT-F ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| CoD | 2 | 1% | 0.85 ± 0.02 | 0.60 ± 0.13 | 0.05 ± 0.01 | 1.40 |
| CoD | 2 | 10% | 0.90 ± 0.02 | 0.60 ± 0.12 | -0.01 ± 0.01 | 1.51 |
| IL | 2 | 1% | 0.57 ± 0.19 | 0.12 ± 0.54 | 0.18 ± 0.09 | 0.51 |
| IL | 2 | 10% | 0.63 ± 0.17 | 0.28 ± 0.27 | 0.28 ± 0.18 | 0.63 |
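
The last column of Table 8 combines the three metrics into a single trade-off score, P + FT - F, so that higher P and FT and lower F all raise the score. The check below recomputes this column from the mean values in the table; the function name `tradeoff_score` is only illustrative.

```python
def tradeoff_score(p, ft, f):
    """Combine the three metrics of Table 8 as P + FT - F (higher is better)."""
    return p + ft - f


# Mean values (P, FT, F) copied from Table 8, keyed by (model, xi).
rows = {
    ("CoD", "1%"): (0.85, 0.60, 0.05),
    ("CoD", "10%"): (0.90, 0.60, -0.01),
    ("IL", "1%"): (0.57, 0.12, 0.18),
    ("IL", "10%"): (0.63, 0.28, 0.28),
}
scores = {key: round(tradeoff_score(*vals), 2) for key, vals in rows.items()}
```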

Computation Costs Analysis.     We report the computation costs during the training stage in Table 9, where the statistics are obtained with ‘wandb’. The results show that increasing the number of rehearsal samples does not significantly increase computation costs or training time.

Table 9: The comparison of computation costs and training time using our method with different hyperparameter settings. Experiments are carried out on NVIDIA GeForce RTX 3090 GPUs and NVIDIA A10 GPUs. Besides, the CPU type is Intel(R) Xeon(R) Gold 5220 CPU @ 2.20GHz. We report the results according to ‘wandb’.
| $\upsilon$ | 2 | 2 | 2 | 2 | 14 | 10 | 6 | inf |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| $\xi$ | 20 | 10 | 5 | 1 | 10 | 10 | 10 | - |
| Process memory in use (non-swap) (MB) | 19914.52 | 20058.77 | 20023.48 | 20045.32 | 20010.42 | 20026.34 | 20049.25 | 19983.4 |
| Train time (h) | 143.713 | 144.488 | 143.681 | 143.398 | 145.081 | 145.973 | 146.397 | 152.274 |

5.5 Statistics of Continual Offline RL Benchmarks

Table 10: The total statistics of our benchmark.
| Environment | Tasks number | Quality | Samples per task |
| --- | --- | --- | --- |
| Continual World | 88 | expert | 1M |
| Continual World | 4 | medium | 0.4M |
| Ant-dir | 40 | expert | 0.2M |
| Cheetah-dir | 2 | expert | 0.2M |

To take advantage of the potential of diffusion models, we first collect an offline benchmark that contains dozens of tasks from multiple domains, such as Continual World and Gym-MuJoCo Wołczyk et al. [2021], Todorov et al. [2012]. To collect the interaction data, we train Soft Actor-Critic on each task for approximately 1M time steps Haarnoja et al. [2018]. In total, the benchmark contains 90 tasks, of which 88 come from Continual World and 2 come from Gym-MuJoCo.

Specifically, CW Wołczyk et al. [2021] tasks are constructed based on Meta-World Yu et al. [2020]. CW consists of many realistic robotic manipulation tasks such as Pushing, Reaching, Door Opening, Pick, and Place. CW is convenient for training and evaluating forward transfer and forgetting because the state and action spaces are the same across all tasks. In our benchmark, we collect “expert” and “medium” datasets, where the episodic time limit is set to 200 and the evaluation time steps are set to 1M and 0.4M for the “expert” and “medium” datasets, respectively. Thus, we obtain 5000 and 2000 episodes for these two dataset qualities, as shown in Table 11 and Table 12, in which we also report the mean success rate of each dataset. Besides, we provide the return information of all datasets in Figure 8 and Figure 9. From the tasks defined in Meta-World, we usually select ten as the continual learning setting, i.e., CW10, and CW20 denotes the setting of two CW10.
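
The episode counts follow directly from the step budget and the episodic time limit, as the short check below shows. The variable names are illustrative.

```python
EPISODE_LENGTH = 200  # episodic time limit used for all Continual World datasets

# Total time steps collected per task for each dataset quality.
total_steps = {"expert": 1_000_000, "medium": 400_000}

# Number of fixed-length episodes that fit into each step budget.
episodes = {quality: steps // EPISODE_LENGTH for quality, steps in total_steps.items()}
```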

In line with the traditional definition of CL settings Van de Ven et al. [2022], this benchmark supports constructing task-incremental CORL (TICORL), domain-incremental CORL (DICORL), and class-incremental CORL (CICORL) settings. Researchers can use these datasets in any sequence or length for CL tasks to test the plasticity-stability trade-off of their proposed methods. Future expansions of this benchmark will add datasets of other qualities, such as ‘random’ and ‘full’ training datasets, to support training robust CL agents.

Ant-dir is an 8-joint ant environment. The tasks are defined by the target direction, and the agent should maximize its return by moving as fast as possible in the pre-defined direction. As shown in Table 13, there are 40 tasks (distinguished by “task id”) with different uniformly sampled goal directions in Ant-dir. For each task, the dataset contains approximately 200k transitions, where the observation and action dimensions are 27 and 8, respectively. The Ant-dir datasets have been used by many researchers Xu et al. [2022], Li et al. [2020], Rakelly et al. [2019], so we incorporate them into our benchmark. Moreover, we report the mean return of each sub-task in Table 13 and Figure 10. Cheetah-dir contains only two tasks, which represent forward and backward goal directions. Compared with Ant-dir, Cheetah-dir has lower-dimensional observation and action spaces.
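
As a rough illustration of how a goal direction defines a task, the sketch below samples directions uniformly and scores a planar velocity by its projection onto the goal direction. This is an assumed simplification: the actual Ant-dir reward in the underlying simulator typically includes further terms (for example control costs), and the function names are hypothetical.

```python
import numpy as np


def sample_goal_directions(num_tasks, rng):
    """Sample one goal angle per task uniformly from [0, 2*pi)."""
    return rng.uniform(0.0, 2.0 * np.pi, size=num_tasks)


def directional_reward(velocity_xy, theta):
    """Speed along the goal direction (simplified; extra reward terms omitted)."""
    direction = np.array([np.cos(theta), np.sin(theta)])
    return float(np.dot(velocity_xy, direction))


rng = np.random.default_rng(0)
thetas = sample_goal_directions(40, rng)  # 40 tasks, as in Table 13
aligned = directional_reward(np.array([1.0, 0.0]), theta=0.0)
opposed = directional_reward(np.array([1.0, 0.0]), theta=np.pi)
```

Moving along the goal direction yields a positive reward and moving against it a negative one, so tasks with different angles require different behaviors even though the robot and its dynamics are shared.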

Table 11: The information statistics of offline Continual World-v1 datasets.
Continual World dataset quality episode length episode number mean success observation dimension action dimension
assembly-v1 expert 200 5000 1.0 13 4
basketball-v1 expert 200 5000 1.0 13 4
button-press-topdown-v1 expert 200 5000 0.99 13 4
button-press-topdown-wall-v1 expert 200 5000 0.9864 13 4
button-press-v1 expert 200 5000 0.99 13 4
button-press-wall-v1 expert 200 5000 1.0 13 4
coffee-button-v1 expert 200 5000 0.9916 13 4
coffee-pull-v1 expert 200 5000 0.99 13 4
coffee-push-v1 expert 200 5000 0.99 13 4
dial-turn-v1 expert 200 5000 0.9902 13 4
disassemble-v1 expert 200 5000 0.0 13 4
door-close-v1 expert 200 5000 0.9902 13 4
door-open-v1 expert 200 5000 0.9898 13 4
drawer-close-v1 expert 200 5000 0.9894 13 4
drawer-open-v1 expert 200 5000 0.99 13 4
faucet-close-v1 expert 200 5000 0.9896 13 4
faucet-open-v1 expert 200 5000 0.9154 13 4
hammer-v1 expert 200 5000 0.99 13 4
handle-press-side-v1 expert 200 5000 0.9878 13 4
handle-press-v1 expert 200 5000 0.99 13 4
handle-pull-side-v1 expert 200 5000 0.9888 13 4
handle-pull-v1 expert 200 5000 0.99 13 4
lever-pull-v1 expert 200 5000 0.0 13 4
peg-insert-side-v1 expert 200 5000 0.9604 13 4
peg-unplug-side-v1 expert 200 5000 0.99 13 4
pick-out-of-hole-v1 expert 200 5000 0.0 13 4
pick-place-v1 expert 200 5000 1.0 13 4
pick-place-wall-v1 expert 200 5000 0.8196 13 4
plate-slide-back-side-v1 expert 200 5000 0.99 13 4
plate-slide-back-v1 expert 200 5000 0.9886 13 4
plate-slide-side-v1 expert 200 5000 0.7992 13 4
plate-slide-v1 expert 200 5000 0.5694 13 4
push-back-v1 expert 200 5000 0.9922 13 4
push-v1 expert 200 5000 0.9844 13 4
push-wall-v1 expert 200 5000 1.00 13 4
reach-wall-v1 expert 200 5000 0.99 13 4
shelf-place-v1 expert 200 5000 1.00 13 4
soccer-v1 expert 200 5000 0.0066 13 4
stick-pull-v1 expert 200 5000 0.93 13 4
stick-push-v1 expert 200 5000 0.4486 13 4
sweep-into-v1 expert 200 5000 0.9662 13 4
sweep-v1 expert 200 5000 0.0834 13 4
window-close-v1 expert 200 5000 0.99 13 4
window-open-v1 expert 200 5000 0.99 13 4
hammer-v1 medium 200 2000 0.7689 13 4
push-wall-v1 medium 200 2000 0.7465 13 4
faucet-close-v1 medium 200 2000 0.9364 13 4
push-back-v1 medium 200 2000 0.3168 13 4
Table 12: The information statistics of offline Continual World-v2 datasets.
Continual World dataset quality episode length episode number mean success observation dimension action dimension
basketball-v2 expert 200 5000 1.0 39 4
box-close-v2 expert 200 5000 1.0 39 4
button-press-topdown-v2 expert 200 5000 1.0 39 4
button-press-topdown-wall-v2 expert 200 5000 1.0 39 4
button-press-v2 expert 200 5000 1.0 39 4
button-press-wall-v2 expert 200 5000 1.0 39 4
coffee-button-v2 expert 200 5000 1.0 39 4
dial-turn-v2 expert 200 5000 1.0 39 4
door-close-v2 expert 200 5000 1.0 39 4
door-lock-v2 expert 200 5000 1.0 39 4
door-open-v2 expert 200 5000 1.0 39 4
door-unlock-v2 expert 200 5000 1.0 39 4
drawer-close-v2 expert 200 5000 1.0 39 4
drawer-open-v2 expert 200 5000 1.0 39 4
faucet-close-v2 expert 200 5000 1.0 39 4
faucet-open-v2 expert 200 5000 1.0 39 4
hammer-v2 expert 200 5000 1.0 39 4
hand-insert-v2 expert 200 5000 1.0 39 4
handle-press-side-v2 expert 200 5000 1.0 39 4
handle-press-v2 expert 200 5000 1.0 39 4
handle-pull-side-v2 expert 200 5000 1.0 39 4
handle-pull-v2 expert 200 5000 1.0 39 4
lever-pull-v2 expert 200 5000 1.0 39 4
peg-insert-side-v2 expert 200 5000 1.0 39 4
peg-unplug-side-v2 expert 200 5000 1.0 39 4
pick-out-of-hole-v2 expert 200 5000 1.0 39 4
pick-place-v2 expert 200 5000 1.0 39 4
plate-slide-back-side-v2 expert 200 5000 1.0 39 4
plate-slide-back-v2 expert 200 5000 1.0 39 4
plate-slide-side-v2 expert 200 5000 1.0 39 4
plate-slide-v2 expert 200 5000 1.0 39 4
push-back-v2 expert 200 2668 1.0 39 4
push-v2 expert 200 5000 1.0 39 4
push-wall-v2 expert 200 5000 1.0 39 4
reach-v2 expert 200 5000 1.0 39 4
reach-wall-v2 expert 200 5000 1.0 39 4
shelf-place-v2 expert 200 5000 1.0 39 4
stick-pull-v2 expert 200 5000 1.0 39 4
stick-push-v2 expert 200 5000 1.0 39 4
sweep-into-v2 expert 200 5000 1.0 39 4
sweep-v2 expert 200 5000 1.0 39 4
window-close-v2 expert 200 5000 1.0 39 4
window-open-v2 expert 200 5000 1.0 39 4
Table 13: The information statistics of offline Gym-MuJoCo datasets.
MuJoCo dataset task id quality episode length episode number mean return observation dimension action dimension
Ant-dir 4 expert 200 999 315.7402 27 8
Ant-dir 6 expert 200 1000 865.8379 27 8
Ant-dir 7 expert 200 1000 993.9981 27 8
Ant-dir 9 expert 200 999 390.8016 27 8
Ant-dir 10 expert 200 1000 744.8206 27 8
Ant-dir 13 expert 200 1000 922.9069 27 8
Ant-dir 15 expert 200 1000 522.9190 27 8
Ant-dir 16 expert 200 1000 835.9635 27 8
Ant-dir 17 expert 200 999 352.7341 27 8
Ant-dir 18 expert 200 1000 367.9050 27 8
Ant-dir 19 expert 200 999 369.9799 27 8
Ant-dir 21 expert 200 1000 868.7162 27 8
Ant-dir 22 expert 200 1000 577.2005 27 8
Ant-dir 23 expert 200 1000 386.7926 27 8
Ant-dir 24 expert 200 1000 547.0642 27 8
Ant-dir 25 expert 200 1000 501.6898 27 8
Ant-dir 26 expert 200 1000 357.3981 27 8
Ant-dir 27 expert 200 1000 439.8590 27 8
Ant-dir 28 expert 200 1000 484.8640 27 8
Ant-dir 29 expert 200 1000 439.0989 27 8
Ant-dir 30 expert 200 999 305.6620 27 8
Ant-dir 31 expert 200 999 478.8927 27 8
Ant-dir 32 expert 200 999 442.5488 27 8
Ant-dir 33 expert 200 1000 952.0699 27 8
Ant-dir 34 expert 200 1000 909.5234 27 8
Ant-dir 35 expert 200 999 352.6703 27 8
Ant-dir 36 expert 200 1000 593.1572 27 8
Ant-dir 37 expert 200 1000 374.4446 27 8
Ant-dir 38 expert 200 999 390.5748 27 8
Ant-dir 39 expert 200 999 307.2525 27 8
Ant-dir 40 expert 200 1000 524.2991 27 8
Ant-dir 41 expert 200 1000 360.6967 27 8
Ant-dir 42 expert 200 1000 454.5446 27 8
Ant-dir 43 expert 200 999 285.9895 27 8
Ant-dir 44 expert 200 1000 878.4141 27 8
Ant-dir 45 expert 200 1000 813.5594 27 8
Ant-dir 46 expert 200 1000 900.4641 27 8
Ant-dir 47 expert 200 1000 422.5884 27 8
Ant-dir 48 expert 200 1000 865.0776 27 8
Ant-dir 49 expert 200 1000 398.1321 27 8
Cheetah-dir 0 expert 200 999 666.5849 20 6
Cheetah-dir 1 expert 200 999 1134.3012 20 6
Figure 8: The return statistics of the Continual World-v1. We calculate the episode return of the Continual World datasets and report the corresponding histogram.
Figure 9: The return statistics of the Continual World-v2. We calculate the episode return of the Continual World datasets and report the corresponding histogram.
Figure 10: The return statistics of the Ant-dir. We calculate the episode return of the Ant-dir datasets and report the corresponding histogram.