Continual Diffuser (CoD): Mastering Continual Offline Reinforcement Learning with Experience Rehearsal

Jifeng Hu^1†^†^†footnotetext: ^†Equal contribution.^†^†footnotetext: ^∗Correspondence: chenhc@jlu.edu.cn and yichang@jlu.edu.cn. Li Shen^2† Sili Huang^3† Zhejian Yang⁴ Hechang Chen⁵^∗ Lichao Sun⁶ Yi Chang⁷^∗ Dacheng Tao⁸ ^1,3,4,5,7School of Artificial Intelligence, Jilin University, Changchun, China
²JD Explore Academy, Beijing, China
⁶Lehigh University, Bethlehem, Pennsylvania, USA
⁸College of Computing and Data Science, NTU, Singapore
https://github.com/JF-Hu/Continual_Diffuser

Abstract

Artificial neural networks, especially recent diffusion-based models, have shown remarkable superiority in gaming, control, and QA systems, where the training tasks’ datasets are usually static. However, in real-world applications, such as robotic control of reinforcement learning (RL), the tasks are changing, and new tasks arise in a sequential order. This situation poses the new challenge of plasticity-stability trade-off for training an agent who can adapt to task changes and retain acquired knowledge. In view of this, we propose a rehearsal-based continual diffusion model, called Continual Diffuser (CoD), to endow the diffuser with the capabilities of quick adaptation (plasticity) and lasting retention (stability). Specifically, we first construct an offline benchmark that contains $90$ tasks from multiple domains. Then, we train the CoD on each task with sequential modeling and conditional generation for making decisions. Next, we preserve a small portion of previous datasets as the rehearsal buffer and replay it to retain the acquired knowledge. Extensive experiments on a series of tasks show CoD can achieve a promising plasticity-stability trade-off and outperform existing diffusion-based methods and other representative baselines on most tasks. Source code is available at here.

1 Introduction

Artificial neural networks, such as diffusion models, have made impressive successes in decision-making scenarios, e.g., game playing Mnih et al. (2015), robotics manipulation Kaufmann et al. (2023), and autonomous driving Almalioglu et al. (2022). However, in most situations, a new challenge of difficult adaption to changing data arises when we adopt the general strategy of learning during the training phase and evaluating with fixed neural network weights Dohare et al. (2024). Changes are prevalent in real-world applications when performing learning in games, logistics, and control systems. A crucial step towards achieving Artificial General Intelligence (AGI) is mastering the human-like ability to continuously learn and quickly adapt to new scenarios over the duration of their lifetime Berariu et al. (2021). Unfortunately, it is usually ineffective for current methods to simply continue learning on new scenarios when new datasets arrive. They will show a dilemma between storing historical knowledge (stability) in their brains and adapting to environmental changes (plasticity) Zeng et al. (2019).

Recently, we have noticed that diffusion probabilistic models (DPMs) have emerged as an expressive structure for tackling complex decision-making tasks such as robotics manipulation by formulating deep reinforcement learning (RL) as a sequential modeling problem He et al. (2023); Wang et al. (2022a); Kang et al. (2023). Although recent DPMs have shown impressive performance in robotics manipulation, they, however, usually focus on a narrow setting, where the environment is well-defined and remains static all the time Ajay et al. (2022); Yang et al. (2023), just like we introduce above. In contrast, in real-world applications, the environment changes dynamically in chronological order, forming a continuous stream of data encompassing various tasks. In this situation, it is challenging for the agents to contain historical knowledge (stability) in their brains and adapt to environmental changes (plasticity) quickly based on already acquired knowledge Anand and Precup (2023); Yue et al. (2024). Thus, a natural question arises:

Can we incorporate DPMs’ merit of high expression and concurrently endow DPMs the ability towards better plasticity and stability in continual offline RL?

Facing the long-standing challenge of plasticity-stability dilemma in continual RL, current studies of continual learning can be roughly classified into three categories. Structure-based methods Wang et al. (2022b); Zhang et al. (2023a); Smith et al. (2023a); Mendez and Eaton (2022); Mallya and Lazebnik (2018) propose the use of a base model for pertaining and sub-modules for each task so as to store separated knowledge and reduce catastrophic forgetting. Regularization-based methods Zhang et al. (2023b, 2022); Kirkpatrick et al. (2017); Kessler et al. (2020); Nguyen et al. (2017) propose using auxiliary regularization loss such as $L_{2}$ penalty, KL divergence, and weight importance to contain policy optimization and avoid catastrophic forgetting during training. Rehearsal-based methods Smith et al. (2023b); Peng et al. (2023); Huang et al. (2024); Rolnick et al. (2019); Chaudhry et al. (2018) are considered simple yet effective in alleviating catastrophic forgetting as rehearsal mimics the memory consolidation mechanism of hippocampus replay inside biological systems. There are many strategies to perform rehearsal. For instance, a typical method is gradient projection Chaudhry et al. (2018), which contains the gradients from new data loss as close as to previous tasks, furthest preventing performance decrease.

Although these methods are effective for continual learning, they present limited improvement in continual offline RL because of extra challenges such as distribution shift and value uncertain estimation. Recently, diffusion-based methods, such as DD and Diffuser Ajay et al. (2022); Kang et al. (2023); Wang et al. (2022a); Janner et al. (2022), propose to resolve the above two extra challenges from sequential modeling and have shown impressive results in many offline RL tasks. However, they concentrate solely on training a diffuser that can only solve one task, thus showing limitations in real-world applications where training datasets or tasks usually arrive sequentially. Though recent works, such as MTDIFF He et al. (2023), consider diffusers as planner or data generators for multi-task RL, the problem setting of their work is orthogonal to ours.

In this view, we take one step forward to investigate diffusers with arriving datasets and find that recent state-of-the-art diffusion-based models suffer from catastrophic forgetting when new tasks arrive sequentially (See Section 3.1 for more details.). To address this issue, we propose “Continual Diffuser” (CoD), which endows the diffuser with the capabilities of quickly adapting to new tasks (plasticity) meanwhile retaining the historical knowledge (stability) with experience rehearsal. First of all, to take advantage of the potential of diffusion models, we construct an offline RL benchmark that consists of $90$ tasks from multiple domains, such as Continual World (CW) and Gym-MuJoCo. These continual datasets will be released to all researchers soon at the present stage, and we will actively maintain and progressively incorporate more datasets into our benchmark. Based on the benchmark, we train our method on each task with sequential modeling of trajectories and make decisions with conditional generation in evaluation. Then, a small portion of each previous task dataset is reserved as the rehearsal buffer to replay periodically to our model. Finally, extensive experiments on a series of tasks show that CoD can achieve a promising plasticity-stability trade-off and outperform existing diffusion-based models and other representative continual RL methods on most tasks. In summary, our contributions are threefold:

•

We construct a continual offline RL benchmark that contains 90 tasks in the current stage, and we will actively incorporate more datasets for all researchers.
•

We investigate the possibility of integrating experience rehearsal and diffuser, then propose the Continual Diffuser (CoD) to balance plasticity and stability.
•

Extensive experiments on a series of tasks show that CoD can achieve a promising plasticity-stability trade-off and outperform existing baselines on most tasks.

Refer to caption — Figure 1: The framework of CoD. Unfolding the training process with time, our model slides on the sample chain that is constructed by sampling from the current and rehearsal buffers. For each task $i$ , CoD replays small portion samples of previous tasks to reduce catastrophic forgetting and generate a solution that can solve all previous tasks. Detailed structure of CoD is shown in the low right corner.

2 Results

In this section, we will introduce environmental settings and evaluation metrics Section 2.1 and 2.2. Then, in Section 2.3 and 2.4, we first introduce a novel continual offline RL benchmark, including the task description and the corresponding dataset statistics, and introduce various baselines. Finally, in Section 2.5 and 2.6, we report the comparison results, ablation study, and parameters sensitivity analysis.

2.1 Environmental Settings

Following the same setting as prior works Zhang et al. (2023c); Yang et al. (2023), we conduct thorough experiments on Continual World and Gym-MuJoCo benchmarks. In Continual World, we adopt the task setting of CW10 and CW20 where CW20 means two concatenated CW10. All CW tasks are version v1. Besides, we also select Ant-dir for evaluation, which includes 40 tasks, and we arbitrarily select four tasks (tasks-10-15-19-25) for training and evaluation. See Appendix 5.5 for more details.

2.2 Evaluation Metrics

In order to compare the performance on a series of tasks, we follow previous studies Wołczyk et al. (2021); Anand and Precup (2023) and adopt the totally average success rate $P(\rho)$ (higher is better), forward transfer $FT$ (higher is better), forgetting $F$ (lower is better), and the total performance $P+FT-F$ (higher is better) as evaluation metrics. Suppose that we use $p_{i}(\rho)$ to represent the average success rate on task $i$ at gradient update step $\rho$ and each task train $\Delta$ gradient steps, then the total average success rate $P(\rho)=\sum_{i=1}^{I}p_{i}(\rho)$ , where $p_{i}(\rho)\in[0,1]$ . The forward transfer $FT$ denotes the normalized AUC area between the training curve and the reference curve. Note that $FT_{i}<1$ and it might also be negative. Mathematically, $FT=\frac{1}{I}\sum_{i}FT_{i}=\frac{1}{I}\sum_{i}\frac{AUC_{i}-AUC_{ref,i}}{1-% AUC_{ref,i}}$ , where we set $AUC_{ref,i}=0.5$ and $AUC_{i}=(p_{i}(i\cdot\Delta)+p_{i}((i+1)\cdot\Delta))/2$ for simplicity. The forgetting $F_{i}$ is defined as the performance decrease between $p_{i}((i+1)\cdot\Delta)$ and $p_{I-1}(I\cdot\Delta)$ , thus $F=\frac{1}{I}\sum_{i}^{I}F_{i}$ .

2.3 Novel Benchmark for Continual Offline RL

To take advantage of the potential of diffusion models, we propose a benchmark for continual offline RL (CORL), comprising datasets from 90 tasks, including 88 Continual World tasks and 2 Gym-MuJoCo tasks Wołczyk et al. (2021); Todorov et al. (2012). For the Gym-MuJoCo domain, there are 42 environmental variants, which are constructed by altering the agent goals. In order to collect the offline datasets, we trained Soft Actor-Critic (SAC) on each task for approximately 1M time steps Haarnoja et al. (2018).

Continual World Wołczyk et al. (2021) is a popular testbed that is constructed based on Meta-World Yu et al. (2020) and consists of realistic robotic manipulation such as Pushing, Reaching, and Door Opening. CW is convenient for training and evaluating the abilities of forward transfer and forgetting because the state and action space are the same across all tasks. Firstly, we will define the task-incremental CORL (TICORL), task-incremental CORL (TICORL), and task-incremental CORL (TICORL) Van de Ven et al. (2022). In RL, we call the CL setting CICORL, where the CL tasks are constructed in the same environment with different goals, such as different directions or velocities. We call the CL setting TICORL, where the CL tasks are indeed different environments but with the same purposes. For instance, the CL settings with the purpose of pushing blocks (e.g., “push wall” and “push mug” tasks in Continual World) in different robotic control tasks formulate the TICORL. Finally, we can use the tasks of different purposes, such as push, pull, turn, and press blocks, to construct the DICORL. For example, CW10 and CW20 form the mixed TICORL and DICORL setups because the task sequence contains multiple purposes. Additionally, Gym-MuJoCo’s 42 environmental variants facilitate constructing a CICORL setup. Researchers can use these datasets in any sequence or length for CL tasks to test the plasticity-stability trade-off of their methods. We also provide multiple quality datasets, such as ‘medium’ and ‘expert,’ in our benchmark. We list the information statistics of our benchmark in Table 11 and 12, and Figure 8 and 9, where the episodic time limit is set to 200, and the evaluation time step is set to 1M and 0.4M for different qualities datasets.

Ant-dir is an 8-joint ant environment. The different tasks are defined according to the target direction, where the agent should maximize its return with maximal speed in the pre-defined direction. As shown in Table 13, there are 40 tasks (distinguished with “task id”) with different uniformly sampled goal directions in Ant-dir. For each task, the dataset contains approximately 200k transitions, where the observation and action dimensions are 27 and 8, respectively. We found that the Ant-dir datasets have been used by many researchers Xu et al. (2022); Li et al. (2020); Rakelly et al. (2019), so we incorporate them into our benchmark. Moreover, we report the mean return information of each sub-task in Table 13 and Figure 10. As for Cheetah-dir, it only contains two tasks that represent forward and backward goal directions. Compared with Ant-dir, Cheetah-dir possesses lower observation and action space.

2.4 Baselines

We compare our method (CoD) with various representative baselines, encompassing structure-based, regularization-based, and rehearsal-based methods. In structure-based methods, we select LoRA Li et al. (2023), PackNet Mallya and Lazebnik (2018), and Multitask. For regularization-based methods, we select L2, EWC Kirkpatrick et al. (2017), MAS Aljundi et al. (2018), and VCL Nguyen et al. (2017) for evaluation. Rehearsal-based baselines include t-DGR Yue et al. (2024), DGR Shin et al. (2017), CRIL Gao et al. (2021), A-GEM Chaudhry et al. (2018), and IL Ho and Ermon (2016). Besides, we also include several diffusion-based methods Janner et al. (2022); Ajay et al. (2022) and Multitask methods, such as MTDIFF He et al. (2023) for the evaluation.

Table 1: The performance comparison on the Continual World and Ant-dir datasets. We compare our method (CoD) with baselines trained with the offline pattern as well as the online pattern. We report the average success rate, backward forgetting, and forward transfer of our method and several representative baselines in Continual World tasks (shown in parts (a) and (b)). Moreover, we conduct experiments on CW4 (“hammer-v1”, “push-wall-v1”, “faucet-close-v1”, “push-back-v1”) with mixed-quality datasets and show the results in part (c). For Ant-dir datasets shown in part (d), we report the comparison results with diffusion-based, non-diffusion-based, and multitask methods.

		Continual World 10				Continual World 20
train mode	Model	P $\uparrow$	FT $\uparrow$	F $\downarrow$	P+FT-F $\uparrow$	P $\uparrow$	FT $\uparrow$	F $\downarrow$	P+FT-F $\uparrow$
(a) offline baselines	EWC	0.20 $\pm$ 0.16	0.30 $\pm$ 0.21	0.80 $\pm$ 0.16	-0.30	0.30 $\pm$ 0.21	0.30 $\pm$ 0.21	0.70 $\pm$ 0.21	-0.10
	Finetune	0.20 $\pm$ 0.16	0.10 $\pm$ 0.09	0.80 $\pm$ 0.16	-0.50	0.10 $\pm$ 0.09	0.10 $\pm$ 0.09	0.80 $\pm$ 0.16	-0.60
	DGR	0.30 $\pm$ 0.21	0.90 $\pm$ 0.09	0.70 $\pm$ 0.21	0.50	0.50 $\pm$ 0.25	0.90 $\pm$ 0.09	0.50 $\pm$ 0.25	0.90
	t-DGR	0.40 $\pm$ 0.24	0.70 $\pm$ 0.21	0.60 $\pm$ 0.24	0.50	0.50 $\pm$ 0.25	0.90 $\pm$ 0.09	0.50 $\pm$ 0.25	0.90
	CRIL	0.70 $\pm$ 0.21	0.80 $\pm$ 0.16	0.20 $\pm$ 0.16	1.30	0.70 $\pm$ 0.21	0.90 $\pm$ 0.09	0.00 $\pm$ 0.00	1.60
	Multitask	1.00 $\pm$ 0.00	0.90 $\pm$ 0.09	0.00 $\pm$ 0.00	1.90	1.00 $\pm$ 0.00	0.90 $\pm$ 0.09	0.00 $\pm$ 0.00	1.90
	CoD	0.98 $\pm$ 0.01	0.89 $\pm$ 0.09	-0.01 $\pm$ 0.001	1.88	0.98 $\pm$ 0.01	0.89 $\pm$ 0.09	0.00 $\pm$ 0.00	1.87
(b) online baselines	A-GEM	0.02 $\pm$ 0.01	-0.76 $\pm$ 0.02	0.22 $\pm$ 0.02	-0.96	0.17 $\pm$ 0.10	0.17 $\pm$ 0.11	0.64 $\pm$ 0.12	-0.30
	PackNet	0.05 $\pm$ 0.01	-0.60 $\pm$ 0.01	0.35 $\pm$ 0.02	-0.90	0.14 $\pm$ 0.09	-0.34 $\pm$ 0.19	0.53 $\pm$ 0.20	-0.73
	VCL	0.10 $\pm$ 0.03	-0.81 $\pm$ 0.05	-0.02 $\pm$ 0.008	-0.69	0.18 $\pm$ 0.13	-0.62 $\pm$ 0.14	0.02 $\pm$ 0.06	-0.46
	MAS	0.23 $\pm$ 0.05	-0.63 $\pm$ 0.08	-0.05 $\pm$ 0.03	-0.35	0.41 $\pm$ 0.15	-0.12 $\pm$ 0.18	-0.01 $\pm$ 0.01	0.30
	EWC	0.30 $\pm$ 0.03	-0.36 $\pm$ 0.05	0.02 $\pm$ 0.02	-0.08	0.56 $\pm$ 0.20	0.13 $\pm$ 0.28	0.01 $\pm$ 0.02	0.68
	L2	0.21 $\pm$ 0.03	-0.58 $\pm$ 0.07	0.02 $\pm$ 0.02	-0.39	0.51 $\pm$ 0.09	0.12 $\pm$ 0.19	0.10 $\pm$ 0.03	0.53
	Multitask	1.00 $\pm$ 0.00	0.90 $\pm$ 0.09	0.00 $\pm$ 0.00	1.90	1.00 $\pm$ 0.00	0.90 $\pm$ 0.09	0.00 $\pm$ 0.00	1.90
	Continual World 4
(c) offline baselines	IL-rehearsal	0.57 $\pm$ 0.19	0.12 $\pm$ 0.54	0.18 $\pm$ 0.09	0.51
(c) offline baselines	CoD	0.85 $\pm$ 0.02	0.60 $\pm$ 0.13	0.05 $\pm$ 0.01	1.40
		Ant-dir
(d) offline baselines	Model	CoD	Multitask CoD	IL- rehearsal	CoD- LoRA	Diffuser-w/o rehearsal	CoD-RCR	MTDIFF	DD-w/o rehearsal
	Mean return	478.19 $\pm$ 15.84	485.15 $\pm$ 5.86	402.53 $\pm$ 17.67	296.03 $\pm$ 11.95	270.44 $\pm$ 5.54	140.44 $\pm$ 32.11	84.01 $\pm$ 41.10	-11.15 $\pm$ 45.27

2.5 Main Results

Ant-dir Results. To show the effectiveness of our method in reducing catastrophic forgetting, we compare our method with other diffusion-based methods on the Ant-dir tasks ordered by 10-15-19-25. As shown in Table 1 (d) and Figure 2, the results illustrate: 1) Directly applying previous diffusion-based methods into continual offline RL will lead to severe catastrophic forgetting because the scores of Diffuser-w/o rehearsal and DD-w/o rehearsal are far behind CoD. 2) Extending the technique of LoRA into the diffusion model may not always work. The reason lies in that the parameter quantity size is small, which inspires us to construct diffuser foundation models in future work. 3) Rehearsal can bring significant improvements on diffuser as CoD approaches the score of Multitask CoD.

Online Continual World Results. Considering that offline datasets prohibit further exploration in the environments, which may hinder the capability of some baselines that are designed for online training. We conduct CW10 and CW20 experiments of these methods under the online continual RL setting. Similarly, we constrain the interaction as 500k time steps for each task and report the comparison results in Figure 3 (a) and Table 1 (a). The results show that our method (CoD) surpasses other baselines by a large margin, which illustrates the superior performance over balancing plasticity and stability. Besides, it is indeed that some methods, such as EWC, are more suitable for online training by comparing the performances in Figure 3 (a) and (b). Additionally, we also report the comparison under mixed-quality datasets CL setting in Table 1 (c). Please refer to Appendix 5.4 for the comparison of model plasticity and generation acceleration details.

Offline Continual World Results. This section presents the comparison between CoD and six representative continual RL methods on CW10 and CW20 benchmarks. In order to show the capabilities of plasticity (quick adaptation to unseen tasks) and stability (lasting retention of previous knowledge), we keep the size of training samples, number of gradient updates, and computation constant. Figure 3 (b) and Table 1 (b) summarize the results of CW10 and CW20 tasks. We observe that our method can quickly master these manipulation tasks and remember the acquired knowledge when new tasks arrive, while the baselines (except for Multitask) struggle between plasticity and stability because the performance of these baselines fluctuates among tasks. Moreover, after 5M gradient steps, our method still remembers how to solve the same task it learns, which shows small forgetting. The results of the table also show that though some baselines exhibit high forward transfer, the average success rate is lower than our method, and they forget knowledge fleetly.

2.6 Ablation Study

To show the effectiveness of experience rehearsal, we conduct an ablation study of CoD in CW and Ant-dir tasks. We compare our method with and without experience rehearsal and find that experience rehearsal indeed brings significant performance gain. For example, CoD achieves 76.82% performance gain compared with CoD-w/o rehearsal. In CW 20 tasks, CoD reaches mean success rate from 20% to 98% when incorporating experience rehearsal. Refer to Table 5 for more results.

Sensitivity of Key Hyperparameters. In the experiments, we introduce the key hyper-parameters: the rehearsal frequency ( $\upsilon$ ) and rehearsal sample diversity ( $\xi$ ). The larger $\upsilon$ will aggravate the catastrophic forgetting because the model can access previous samples after a longer training process. A large value of $\xi$ will improve the performance and increase the storage burden, while a small value is more cost-efficient for longer CL tasks but is more challenging to hold the performance. We conduct the sensitivity of the hyperparameters on the CW and Ant-dir environments, and the results are shown in Figure 4 and Figure 5. According to the results, our method can still reach good performance with the variation of $\upsilon$ and $\xi$ .

3 Discussion

3.1 Catastrophic Forgetting of Diffuser

Previous diffusion-based methods Ajay et al. (2022); Janner et al. (2022); Yue et al. (2024), such as DD and Diffuser, are usually proposed to solve a single task, which is not in line with the real-world situation where the task will dynamically change. Thus, it is meaningful but challenging to train a diffuser that can adapt to new tasks (plasticity) while retaining historical knowledge. When we directly extend the original diffusion-based method in continual offline RL, we can imagine that severe catastrophic forgetting will arise in the performance because there are no mechanisms to retain preceding knowledge. As shown in Figure 2, in order to show the catastrophic forgetting, we compare our method and the representative diffusion-based methods on Ant-dir, where we arbitrarily select four tasks, task-10, task-15, task-19, and task-25, to form the CL setting. Diffuser-w/o rehearsal and DD-w/o rehearsal represent the original method Diffuser and DD, respectively. Multitask CoD and MTDIFF are the multitask baselines, which can access all training datasets in any time step, and CoD-RCR represents we use return condition for decision generation during the training stage. CoD-LoRA denotes that we train CoD with the technique of low-rank adaptation. IL-rehearsal is the imitation learning with rehearsal. The results show that previous diffusion-based methods exhibit severe catastrophic forgetting when the datasets arrive sequentially, and at the same time, the good performance of CoD illustrates experience rehearsal is effective in reducing catastrophic forgetting.

3.2 Reducing Catastrophic Forgetting with Experience Rehearsal

In Section 2.5, we illustrate the effectiveness of experience rehearsal through the experiments on our proposed offline CL benchmark, which contains 90 tasks for evaluation. From the perspective of the CL tasks quantity, we evaluate carious quantity settings, such as 4 tasks for Ant-dir, 4 tasks for CW4, 10 tasks for CW10, and 20 tasks for CW20. From the perspective of classification of traditional CL settings, our experimental settings contain CICORL, TICORL, and DICORL. In the Ant-dir environment, we select 10-15-19-25 task sequence as the CL setting and conduct the experiment compared with other diffusion-based methods. From the results shown in Figure 2, we can see distinct catastrophic forgetting on the recent diffusion-based method, though they show strong performance in other offline RL tasks Kang et al. (2023); He et al. (2023). To borrow the merits of diffusion models’ strong expression on offline RL and equip them with the ability to reduce catastrophic forgetting, we propose to use experience rehearsal to master the CORL. Detailed architecture is shown in Figure 1, and we postpone the method description in Section 4.3.

Apart from the Ant-dir environment, we also report the performance on more complex CL tasks, i.e., CW10 and CW20, in Table 1. Considering that most baselines are trained in online mode in their original papers, we first select the online baselines and compare their mean success rate with our method. The results (Table 1 and Figure 3) show that our method (CoD) surpasses other baselines by a large margin, which illustrates the superior performance over balancing plasticity and stability. Besides, we also compare our method with these baselines trained with offline datasets, where the results show that our method can quickly master these manipulation tasks and remember the acquired knowledge when new tasks arrive, while the baselines (except for Multitask) struggle between plasticity and stability because the performance of these baselines fluctuates among tasks. When the previous tasks appear once again after 5M training steps, the baselines show different levels of catastrophic forgetting because the performance decreases after 5M steps. However, our method still remembers how to solve the same task it learned before, which shows small forgetting. Moreover, we also conduct mixed-quality dataset experiments to show our method’s capability of learning from sub-optimal offline datasets. For more details, please refer to Appendix 5.4.

To investigate the influence of key hyperparameters, we report the performance of the rehearsal frequency ( $\upsilon$ ) and rehearsal sample diversity ( $\xi$ ) in Figure 4 and Figure 5, where larger $\upsilon$ corresponds to aggravated catastrophic forgetting and a larger value of $\xi$ will improve the performance and increase the storage burden. In practice, we find that usually $\upsilon=2$ and $\xi=10\%$ indicate good performance and pose small challenges for the computation and memory burden (see Appendix 5.2 for memory and efficiency analysis.).

4 Methods

4.1 Continual Offline RL

In this paper, we focus on the task-incremental setting of task-aware continual learning in the offline RL field where the different tasks come successively for training Zhang et al. (2023c); Wang et al. (2023b); Smith et al. (2023a); Schwarz et al. (2018); Abel et al. (2023); Wang et al. (2023c). Each task is defined as a corresponding Markov Decision Process (MDP) $\mathcal{M}=\langle\mathcal{S},\mathcal{A},\mathcal{P},\mathcal{R},\gamma\rangle$ , where $\mathcal{S}$ and $\mathcal{A}$ represent the state and action space, respectively, $\mathcal{P}:\mathcal{S}\times\mathcal{A}\rightarrow\Delta(\mathcal{S})$ denotes the Markovian transition probability, $\mathcal{R}:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow\mathbb{R}$ is the reward function, and $\gamma\in[0,1)$ is the discount factor. In order to distinguish different tasks, we use subscript $i$ for task $i$ , such as $\mathcal{M}_{i}$ , $\mathcal{S}_{i},\mathcal{A}_{i},\mathcal{P}_{i},\mathcal{R}_{i}$ , and $\gamma_{i}$ . At each time step $t$ in task $i$ , the agent receives a state $s_{i,t}$ from the environment and produces an action $a_{i,t}$ with a stochastic or deterministic policy $\pi$ . Then a reward $r_{i,t}=r(s_{i,t},a_{i,t})$ from the environment serves as the feedback to the executed action of the agent. Continual offline RL aims to find an optimal policy that can maximize the discounted return $\sum_{i}^{I}\mathbb{E}_{\pi}[\sum_{t=0}^{\infty}\gamma^{t}r(s_{i,t},a_{i,t})]$ Yang et al. (2023); Sun et al. (2023); Wei et al. (2023) on all tasks with previously collected dataset $\{D_{i}\}_{i\in I}$ .

4.2 Conditional Diffusion Probabilistic Models

In this paper, diffusion-based models are proposed to model the distribution of trajectory $\tau$ , where each trajectory can be regarded as a data point. Then we can use diffusion models to learn the trajectory distribution $q(\tau)=\int q(\tau^{0:K})d\tau^{1:K}$ with a predefined forward diffusion process $q(\tau^{k}|\tau^{k-1})=\mathcal{N}(\tau^{k};\sqrt{\alpha_{k}}\tau^{k-1},(1-% \alpha_{k})\bm{I})$ and the trainable reverse process $p_{\theta}(\tau^{k-1}|\tau^{k})=\mathcal{N}(\tau^{k-1};\mu_{\theta}(\tau^{k},k% ),\Sigma_{k})$ , where $k\in[1,K]$ is the diffusion step, $\sqrt{\alpha_{k}}$ and $\sqrt{1-\alpha_{k}}$ control the drift and diffusion coefficients, $\mu_{\theta}(\tau^{k})=\frac{1}{\sqrt{\alpha_{k}}}(\tau_{k}-\frac{\beta_{k}}{% \sqrt{1-\bar{\alpha}_{k}}}\epsilon_{\theta}(\tau^{k},k))$ , $\Sigma_{k}=\frac{1-\bar{\alpha}_{k-1}}{1-\bar{\alpha}_{k}}\beta_{k}\bm{I}$ , and $\alpha_{k}+\beta_{k}=1$ . $\epsilon_{\theta}(\tau^{k},k)$ represents the noising model (Sohl-Dickstein et al., 2015). According to Ho et al. (2020), we can train $\epsilon_{\theta}(\tau^{k},k)$ with the below simplified objective

\mathcal{L}(\theta)=\mathbb{E}_{k\sim U(1,2,...,K),\epsilon\sim\mathcal{N}(0,% \bm{I}),\tau^{0}\sim D}[||\epsilon-\epsilon_{\theta}(\tau^{k},k)||_{2}^{2}],

where $k$ is the diffusion time step, $U$ is uniform distribution, $\epsilon$ is multivariant Gaussian noise, $\tau^{0}=\tau$ is sampled from the replay buffer $D$ , and $\theta$ is the parameters of model $\epsilon_{\theta}$ .

Conditions play a vital role in conditional generation because this method makes the outputs of diffusion models controllable. We can also use two conditions methods, classifier-guided and classifier-free, to train diffusion models $p_{\theta}(\tau^{k-1}|\tau^{k},\mathcal{C})$ (Liu et al., 2023). The classifier-guided method separates the training of the unconditional diffusion model and conditional guide and then combines them together, i.e., $p_{\theta,\phi}(\tau^{k-1}|\tau^{k},\mathcal{C})\propto p_{\theta}(\tau^{k-1}|% \tau^{k})p_{\phi}(\mathcal{C}|\tau^{k})$ . The corresponding sampling process is $p(\tau^{k-1}|\tau^{k},\mathcal{C})=\mathcal{N}(\mu_{\theta}+\Sigma_{k}\cdot% \nabla log~{}p_{\phi}(\mathcal{C}|\tau),\Sigma_{k})$ . Compared with classifier-guided, the classifier-free method implicitly builds the correlation between the trajectories and conditions in the training phase by learning unconditional and conditional noise $\epsilon_{\theta}(\tau^{k},\emptyset,k)$ and $\epsilon_{\theta}(\tau^{k},\mathcal{C},k)$ , where $\emptyset$ is usually the zero vector (Ajay et al., 2022). Then the perturbed noise at each diffusion time step is calculated by $\epsilon_{\theta}(\tau^{k},\emptyset,k)+\omega(\epsilon_{\theta}(\tau^{k},% \mathcal{C},k)-\epsilon_{\theta}(\tau^{k},\emptyset,k))$ . In this paper, we adopt the classifier-free guidance due to its simplicity, controllability, and higher performance Ajay et al. (2022).

4.3 Continual Diffuser

In this section, we introduce the Continual Diffuser (CoD), as shown in Figure 1, which contains classifier-free task-conditional training, experience rehearsal, and conditional generation for decision.

Data Organization. In RL, we leverage the characteristic of the diffusion model that can capture joint distributions in high-dimensional continual space by formulating the training data from single-step transition to multi-step sequences. Specifically, we have $I$ tasks, and each task $\mathcal{M}_{i}$ consists of $N$ trajectories $\{\tau_{i}\}_{1}^{N}$ , where the $\tau_{i,n}=\{s_{i,t,n},a_{i,t,n}\}$ will be split into equaling sequences with $T_{e}$ time steps as the discrepancy of trajectories may occur across tasks. In the following parts, we slightly abuse this notation $\tau_{i}$ to represent the sequence data with length $T_{e}$ sampled from task $i$ ’s dataset $D_{i}$ and $\hat{\tau}_{i}$ to denote the generative sequence.

Task Condition. In order to distinguish different tasks, we propose to use environment-related information as the task condition. For example, in the Ant-dir environment, the agent’s goal is to maximize its speed in the pre-defined direction, which is given as the goal in the specific tasks. So, we propose to use this information as condition $\mathcal{C}_{task}$ to train our model. In each diffusion step $k$ , the task condition $\mathcal{C}_{task}$ will pass through a task embedding function to obtain task embedding, which will be fed into the diffusion model jointly with diffusion time step embedding. Apart from the task conditions that are used implicitly in the training, we also need explicit observation conditions. We use the first state $s_{i,t,n}$ of the $T_{e}$ length sampled sequence $\tau_{i,n}=\{s_{i,t,n},a_{i,t,n},s_{i,t+1,n},a_{i,t+1,n},...,s_{i,t+T_{e}-1,n}% ,a_{i,t+T_{e}-1,n}\}$ as the conditions. Then at each diffusion generation step, after we obtain the generated sequences $\{\hat{s}_{i,t,n},\hat{a}_{i,t,n},...,\hat{s}_{i,t+T_{e}-1,n},\hat{a}_{i,t+T_{% e}-1,n}\}^{k}$ , the first observation $\hat{s}_{i,t,n}$ is directly replaced by $s_{i,t,n}$ , i.e., $\hat{\tau}_{i,n}^{k}=\{s_{i,t,n},\hat{a}_{i,t,n},...,\hat{s}_{i,t+T_{e}-1,n},% \hat{a}_{i,t+T_{e}-1,n}\}^{k}$ .

Training Objective. Following the previous studies of the diffusion model Ho et al. (2020); He et al. (2023), the training and generation for each task $i$ are defined as

\mathcal{L}_{i}(\theta)=\mathbb{E}_{k\sim U(1,K),\epsilon\sim\mathcal{N}(0,\bm% {I}),\tau^{0}_{i}\sim D_{i}}[||\epsilon-\epsilon_{\theta}(\tau_{i}^{k},% \mathcal{C}_{task~{}i},k)||_{2}^{2}],

(1)

\tau_{i}^{k-1}=\frac{\sqrt{\bar{\alpha}_{k-1}}\beta_{k}}{1-\bar{\alpha}_{k}}% \cdot\bar{\tau}_{i}+\frac{\sqrt{\alpha}_{k}(1-\bar{\alpha}_{k-1})}{1-\bar{% \alpha}_{k}}\tau_{i}^{k}+|\Sigma_{k}|\bm{z},

(2)

where $z\sim\mathcal{N}(\bm{0},\bm{I})$ , $\bar{\tau}_{i}=\frac{\tau_{i}^{k}-\sqrt{1-\bar{\alpha}_{k}}\bar{\epsilon}}{% \sqrt{\bar{\alpha}_{k}}},|\Sigma_{k}|=\frac{1-\bar{\alpha}_{k-1}}{1-\bar{% \alpha}_{k}}\beta_{k}$ , and $\bar{\epsilon}=\epsilon_{\theta}(\tau_{i}^{k},\emptyset,k)+\omega(\epsilon_{% \theta}(\tau_{i}^{k},\mathcal{C}_{task},k)-\epsilon_{\theta}(\tau_{i}^{k},% \emptyset,k))$ .

Experience Rehearsal. In this paper, we propose periodic rehearsal to strengthen the knowledge of previous tasks, which mimics the memory consolidation mechanism of hippocampus replay inside biological systems. When a new dataset $D_{i}$ of task $i$ arrives, we preserve a small portion $\xi$ of the entire dataset, donated as $\mathscr{D}_{i}$ . For the small training dataset $\mathscr{D}_{i}$ , it is easy to overfit these data for most rehearsal-based methods. Fortunately, inspired by the distributional robust optimization, increasing the hardness of the samples will hinder memory overfitting. The discrete type of diffusion process $\tau^{k}=\sqrt{\alpha_{k}}\tau^{k-1}+\sqrt{1-\alpha_{k}}\epsilon$ can be reformulated as the corresponding continuous forward process $d\tau=-\frac{1}{2}\beta(t)\tau dt+\sqrt{\beta(t)}dW$ , where $W$ is the standard Wiener process (a.k.a. Brownian motion). This process gradually inserts directional noise (i.e., increasing the hardness) to induce transformation from trajectory distribution to Gaussian distribution. So rehearsal-based diffusers naturally possess the capability of reducing memory overfitting, and the total objective function is

\small\min_{\forall\theta\in\Theta}[\mathbb{E}_{\tau_{j}\in D_{j}}\mathcal{L}_% {j}(\theta,\tau_{j},\mathcal{C}_{task~{}j})+\mathbb{E}_{\tau_{i}\in\mathscr{D}% _{i},i<j}\mathcal{L}_{i}(\theta,\tau_{i},\mathcal{C}_{task~{}i})]

(3)

In practice, we usually set the rehearsal frequency $\upsilon$ as 2 gradient steps and the portion $\xi$ as 10%.

Architecture. In this paper, we adopt temporal Unet with one-dimensional convolution blocks as the diffusion model to predict noises. Specifically, temporal Unet contains several down-sampling blocks, a middle block, several up-sampling blocks, a time embedding block, and a task embedding block. We train the time embedding block and task embedding block to generate time and task embeddings that are added to the observation-action sequence

\tau_{i,t:t+T_{e}-1,n}=\begin{pmatrix}s_{i,t,n}&s_{i,t+1,n}&...&s_{i,t+T_{e}-1% ,n}\\ a_{i,t,n}&a_{i,t+1,n}&...&a_{i,t+T_{e}-1,n}\\ \end{pmatrix}.

In the return conditional diffusion models, we replace the task embedding block with the return embedding block. Also, following the implementation of low-rank adaptation in Natural Language Processing Hu et al. (2021); Mangrulkar et al. (2022), we increase the LoRA module in down-sampling, middle, and up-sampling blocks to construct the LoRA variant CoD-LoRA.

4.4 Conclusion

First of all, to facilitate the development of the continual offline RL community, a continual offline benchmark that contains 90 tasks is constructed based on Continual World and Gym-MuJoCo. Based on our benchmark, we propose Continual Diffuser (CoD), an effective continual offline RL method that possesses the capabilities of plasticity and stability with experience rehearsal. Finally, extensive experiments illustrate the superior plasticity-stability trade-off when compared with representative continual RL baselines.

CODE AND DATA AVAILABILITY

The code and data are available in GitHub at https://github.com/JF-Hu/Continual_Diffuser.

ACKNOWLEDGEMENT

We would like to thank Lijun Bian for her contributions to the figures and tables of this manuscript. We thank Runliang Niu for his contributions to providing help on the computing resource.

References

Abel et al. (2023) David Abel, André Barreto, Benjamin Van Roy, Doina Precup, Hado van Hasselt, and Satinder Singh. A definition of continual reinforcement learning. arXiv preprint arXiv:2307.11046, 2023.
Ajay et al. (2022) Anurag Ajay, Yilun Du, Abhi Gupta, Joshua Tenenbaum, Tommi Jaakkola, and Pulkit Agrawal. Is conditional generative modeling all you need for decision-making? arXiv preprint arXiv:2211.15657, 2022.
Aljundi et al. (2018) Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European conference on computer vision (ECCV), pages 139–154, 2018.
Almalioglu et al. (2022) Yasin Almalioglu, Mehmet Turan, Niki Trigoni, and Andrew Markham. Deep learning-based robust positioning for all-weather autonomous driving. Nature machine intelligence, 4(9):749–760, 2022.
Anand and Precup (2023) Nishanth Anand and Doina Precup. Prediction and control in continual reinforcement learning. arXiv preprint arXiv:2312.11669, 2023.
Atkinson et al. (2021) Craig Atkinson, Brendan McCane, Lech Szymanski, and Anthony Robins. Pseudo-rehearsal: Achieving deep reinforcement learning without catastrophic forgetting. Neurocomputing, 428:291–307, 2021.
Beeson and Montana (2023) Alex Beeson and Giovanni Montana. Balancing policy constraint and ensemble size in uncertainty-based offline reinforcement learning. arXiv preprint arXiv:2303.14716, 2023.
Berariu et al. (2021) Tudor Berariu, Wojciech Czarnecki, Soham De, Jorg Bornschein, Samuel Smith, Razvan Pascanu, and Claudia Clopath. A study on the plasticity of neural networks. arXiv preprint arXiv:2106.00042, 2021.
Chaudhry et al. (2018) Arslan Chaudhry, Marc’Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient lifelong learning with a-gem. arXiv preprint arXiv:1812.00420, 2018.
Chen et al. (2021) Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling. Advances in neural information processing systems, 34:15084–15097, 2021.
Chi et al. (2023) Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. arXiv preprint arXiv:2303.04137, 2023.
Dohare et al. (2024) Shibhansh Dohare, J Fernando Hernandez-Garcia, Qingfeng Lan, Parash Rahman, A Rupam Mahmood, and Richard S Sutton. Loss of plasticity in deep continual learning. Nature, 632(8026):768–774, 2024.
Fontanesi et al. (2019) Laura Fontanesi, Sebastian Gluth, Mikhail S Spektor, and Jörg Rieskamp. A reinforcement learning diffusion decision model for value-based decisions. Psychonomic bulletin & review, 26(4):1099–1121, 2019.
Foret et al. (2020) Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. arXiv preprint arXiv:2010.01412, 2020.
Gao et al. (2021) Chongkai Gao, Haichuan Gao, Shangqi Guo, Tianren Zhang, and Feng Chen. Cril: Continual robot imitation learning via generative and prediction model. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 6747–5754. IEEE, 2021.
Ghosh et al. (2022) Dibya Ghosh, Anurag Ajay, Pulkit Agrawal, and Sergey Levine. Offline rl policies should be trained to be adaptive. In International Conference on Machine Learning, pages 7513–7530. PMLR, 2022.
Haarnoja et al. (2018) Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pages 1861–1870. PMLR, 2018.
He et al. (2023) Haoran He, Chenjia Bai, Kang Xu, Zhuoran Yang, Weinan Zhang, Dong Wang, Bin Zhao, and Xuelong Li. Diffusion model is an effective planner and data synthesizer for multi-task reinforcement learning. arXiv preprint arXiv:2305.18459, 2023.
Ho and Ermon (2016) Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. Advances in neural information processing systems, 29, 2016.
Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
Hong et al. (2023) Joey Hong, Anca Dragan, and Sergey Levine. Offline rl with observation histories: Analyzing and improving sample complexity. arXiv preprint arXiv:2310.20663, 2023.
Hu et al. (2021) Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
Huang et al. (2024) Kaixin Huang, Li Shen, Chen Zhao, Chun Yuan, and Dacheng Tao. Solving continual offline reinforcement learning with decision transformer. arXiv preprint arXiv:2401.08478, 2024.
Janner et al. (2021) Michael Janner, Qiyang Li, and Sergey Levine. Offline reinforcement learning as one big sequence modeling problem. Advances in neural information processing systems, 34:1273–1286, 2021.
Janner et al. (2022) Michael Janner, Yilun Du, Joshua B Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis. arXiv preprint arXiv:2205.09991, 2022.
Kang et al. (2023) Bingyi Kang, Xiao Ma, Chao Du, Tianyu Pang, and Shuicheng Yan. Efficient diffusion policies for offline reinforcement learning. arXiv preprint arXiv:2305.20081, 2023.
Kaplanis et al. (2019) Christos Kaplanis, Murray Shanahan, and Claudia Clopath. Policy consolidation for continual reinforcement learning. arXiv preprint arXiv:1902.00255, 2019.
Kaufmann et al. (2023) Elia Kaufmann, Leonard Bauersfeld, Antonio Loquercio, Matthias Müller, Vladlen Koltun, and Davide Scaramuzza. Champion-level drone racing using deep reinforcement learning. Nature, 620(7976):982–987, 2023.
Kessler et al. (2020) Samuel Kessler, Jack Parker-Holder, Philip Ball, Stefan Zohren, and Stephen J Roberts. Unclear: A straightforward method for continual reinforcement learning. In Proceedings of the 37th International Conference on Machine Learning, 2020.
Kidambi et al. (2020) Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, and Thorsten Joachims. Morel: Model-based offline reinforcement learning. Advances in neural information processing systems, 33:21810–21823, 2020.
Kirkpatrick et al. (2017) James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017.
Korycki and Krawczyk (2021) Lukasz Korycki and Bartosz Krawczyk. Class-incremental experience replay for continual learning under concept drift. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3649–3658, 2021.
Kostrikov et al. (2021) Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit q-learning. arXiv preprint arXiv:2110.06169, 2021.
Kumar et al. (2020) Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:1179–1191, 2020.
Laskin et al. (2020) Misha Laskin, Kimin Lee, Adam Stooke, Lerrel Pinto, Pieter Abbeel, and Aravind Srinivas. Reinforcement learning with augmented data. Advances in neural information processing systems, 33:19884–19895, 2020.
Lee et al. (2024) Hojoon Lee, Hanseul Cho, Hyunseung Kim, Daehoon Gwak, Joonkee Kim, Jaegul Choo, Se-Young Yun, and Chulhee Yun. Plastic: Improving input and label plasticity for sample efficient reinforcement learning. Advances in Neural Information Processing Systems, 36, 2024.
Levine et al. (2020) Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
Li et al. (2020) Jiachen Li, Quan Vuong, Shuang Liu, Minghua Liu, Kamil Ciosek, Henrik Christensen, and Hao Su. Multi-task batch reinforcement learning with metric learning. Advances in Neural Information Processing Systems, 33:6197–6210, 2020.
Li et al. (2023) Yixiao Li, Yifan Yu, Chen Liang, Pengcheng He, Nikos Karampatziakis, Weizhu Chen, and Tuo Zhao. Loftq: Lora-fine-tuning-aware quantization for large language models. arXiv preprint arXiv:2310.08659, 2023.
Liu et al. (2023) Xihui Liu, Dong Huk Park, Samaneh Azadi, Gong Zhang, Arman Chopikyan, Yuxiao Hu, Humphrey Shi, Anna Rohrbach, and Trevor Darrell. More control for free! image synthesis with semantic diffusion guidance. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 289–299, 2023.
Mallya and Lazebnik (2018) Arun Mallya and Svetlana Lazebnik. Packnet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 7765–7773, 2018.
Mangrulkar et al. (2022) Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, Sayak Paul, and Benjamin Bossan. Peft: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft, 2022.
Mendez and Eaton (2022) Jorge A Mendez and Eric Eaton. How to reuse and compose knowledge for a lifetime of tasks: A survey on continual learning and functional composition. arXiv preprint arXiv:2207.07730, 2022.
Meyer et al. (2023) Edan Meyer, Adam White, and Marlos C Machado. Harnessing discrete representations for continual reinforcement learning. arXiv preprint arXiv:2312.01203, 2023.
Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. nature, 518(7540):529–533, 2015.
Nguyen et al. (2017) Cuong V Nguyen, Yingzhen Li, Thang D Bui, and Richard E Turner. Variational continual learning. arXiv preprint arXiv:1710.10628, 2017.
Nguyen-Tang and Arora (2024) Thanh Nguyen-Tang and Raman Arora. On sample-efficient offline reinforcement learning: Data diversity, posterior sampling, and beyond. arXiv preprint arXiv:2401.03301, 2024.
Ni et al. (2023) Fei Ni, Jianye Hao, Yao Mu, Yifu Yuan, Yan Zheng, Bin Wang, and Zhixuan Liang. Metadiffuser: Diffusion model as conditional planner for offline meta-rl. arXiv preprint arXiv:2305.19923, 2023.
Nichol and Dhariwal (2021) Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162–8171. PMLR, 2021.
Peng et al. (2023) Liangzu Peng, Paris Giampouras, and René Vidal. The ideal continual learner: An agent that never forgets. In International Conference on Machine Learning, pages 27585–27610. PMLR, 2023.
Rafailov et al. (2021) Rafael Rafailov, Tianhe Yu, Aravind Rajeswaran, and Chelsea Finn. Offline reinforcement learning from images with latent space models. In Learning for Dynamics and Control, pages 1154–1168. PMLR, 2021.
Rakelly et al. (2019) Kate Rakelly, Aurick Zhou, Chelsea Finn, Sergey Levine, and Deirdre Quillen. Efficient off-policy meta-reinforcement learning via probabilistic context variables. In International conference on machine learning, pages 5331–5340. PMLR, 2019.
Rigter et al. (2022) Marc Rigter, Bruno Lacerda, and Nick Hawes. Rambo-rl: Robust adversarial model-based offline reinforcement learning. arXiv preprint arXiv:2204.12581, 2022.
Rolnick et al. (2019) David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, and Gregory Wayne. Experience replay for continual learning. Advances in Neural Information Processing Systems, 32, 2019.
Rombach et al. (2022) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
(56) Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding, 2022. URL https://arxiv. org/abs/2205.11487, 4.
Schwarz et al. (2018) Jonathan Schwarz, Wojciech Czarnecki, Jelena Luketina, Agnieszka Grabska-Barwinska, Yee Whye Teh, Razvan Pascanu, and Raia Hadsell. Progress & compress: A scalable framework for continual learning. In International conference on machine learning, pages 4528–4537. PMLR, 2018.
Shin et al. (2017) Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay. Advances in neural information processing systems, 30, 2017.
Smith et al. (2023a) James Seale Smith, Yen-Chang Hsu, Lingyu Zhang, Ting Hua, Zsolt Kira, Yilin Shen, and Hongxia Jin. Continual diffusion: Continual customization of text-to-image diffusion with c-lora. arXiv preprint arXiv:2304.06027, 2023a.
Smith et al. (2023b) James Seale Smith, Junjiao Tian, Shaunak Halbe, Yen-Chang Hsu, and Zsolt Kira. A closer look at rehearsal-free continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2409–2419, 2023b.
Sohl-Dickstein et al. (2015) Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
Song et al. (2020) Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
Sun et al. (2023) Yanchao Sun, Shuang Ma, Ratnesh Madaan, Rogerio Bonatti, Furong Huang, and Ashish Kapoor. Smart: Self-supervised multi-task pretraining with control transformers. arXiv preprint arXiv:2301.09816, 2023.
Todorov et al. (2012) Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ international conference on intelligent robots and systems, pages 5026–5033. IEEE, 2012.
Van de Ven et al. (2022) Gido M Van de Ven, Tinne Tuytelaars, and Andreas S Tolias. Three types of incremental learning. Nature Machine Intelligence, 4(12):1185–1197, 2022.
Wang et al. (2023a) Yuanfu Wang, Chao Yang, Ying Wen, Yu Liu, and Yu Qiao. Critic-guided decision transformer for offline reinforcement learning. arXiv preprint arXiv:2312.13716, 2023a.
Wang et al. (2022a) Zhendong Wang, Jonathan J Hunt, and Mingyuan Zhou. Diffusion policies as an expressive policy class for offline reinforcement learning. arXiv preprint arXiv:2208.06193, 2022a.
Wang et al. (2023b) Zhenyi Wang, Li Shen, Tiehang Duan, Qiuling Suo, Le Fang, Wei Liu, and Mingchen Gao. Distributionally robust memory evolution with generalized divergence for continual learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023b.
Wang et al. (2023c) Zhenyi Wang, Enneng Yang, Li Shen, and Heng Huang. A comprehensive survey of forgetting in deep learning beyond continual learning. arXiv preprint arXiv:2307.09218, 2023c.
Wang et al. (2022b) Zhi Wang, Chunlin Chen, and Daoyi Dong. A dirichlet process mixture of robust task models for scalable lifelong reinforcement learning. IEEE Transactions on Cybernetics, 2022b.
Wei et al. (2023) Yao Wei, Yanchao Sun, Ruijie Zheng, Sai Vemprala, Rogerio Bonatti, Shuhang Chen, Ratnesh Madaan, Zhongjie Ba, Ashish Kapoor, and Shuang Ma. Is imitation all you need? generalized decision-making with dual-phase training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16221–16231, 2023.
Wołczyk et al. (2021) Maciej Wołczyk, Michał Zajac, Razvan Pascanu, Łukasz Kucinski, and Piotr Miłoś. Continual world: A robotic benchmark for continual reinforcement learning. Advances in Neural Information Processing Systems, 34:28496–28510, 2021.
Wu et al. (2019) Yifan Wu, George Tucker, and Ofir Nachum. Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361, 2019.
Xu et al. (2022) Mengdi Xu, Yikang Shen, Shun Zhang, Yuchen Lu, Ding Zhao, Joshua Tenenbaum, and Chuang Gan. Prompting decision transformer for few-shot policy generalization. In international conference on machine learning, pages 24631–24645. PMLR, 2022.
Yang et al. (2023) Yijun Yang, Tianyi Zhou, Jing Jiang, Guodong Long, and Yuhui Shi. Continual task allocation in meta-policy network via sparse prompting. arXiv preprint arXiv:2305.18444, 2023.
Yu et al. (2020) Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on robot learning, pages 1094–1100. PMLR, 2020.
Yue et al. (2024) William Yue, Bo Liu, and Peter Stone. t-dgr: A trajectory-based deep generative replay method for continual learning in decision making. arXiv preprint arXiv:2401.02576, 2024.
Zeng et al. (2019) Guanxiong Zeng, Yang Chen, Bo Cui, and Shan Yu. Continual learning of context-dependent processing in neural networks. Nature Machine Intelligence, 1(8):364–372, 2019.
Zhang et al. (2023a) Qizhe Zhang, Bocheng Zou, Ruichuan An, Jiaming Liu, and Shanghang Zhang. Split & merge: Unlocking the potential of visual adapters via sparse training. arXiv preprint arXiv:2312.02923, 2023a.
Zhang et al. (2022) Tiantian Zhang, Xueqian Wang, Bin Liang, and Bo Yuan. Catastrophic interference in reinforcement learning: A solution based on context division and knowledge distillation. IEEE Transactions on Neural Networks and Learning Systems, 2022.
Zhang et al. (2023b) Tiantian Zhang, Zichuan Lin, Yuxing Wang, Deheng Ye, Qiang Fu, Wei Yang, Xueqian Wang, Bin Liang, Bo Yuan, and Xiu Li. Dynamics-adaptive continual reinforcement learning via progressive contextualization. IEEE Transactions on Neural Networks and Learning Systems, 2023b.
Zhang et al. (2023c) Tiantian Zhang, Kevin Zehua Shen, Zichuan Lin, Bo Yuan, Xueqian Wang, Xiu Li, and Deheng Ye. Replay-enhanced continual reinforcement learning. arXiv preprint arXiv:2311.11557, 2023c.
Zhu et al. (2023) Zhengbang Zhu, Hanye Zhao, Haoran He, Yichao Zhong, Shenyu Zhang, Yong Yu, and Weinan Zhang. Diffusion models for reinforcement learning: A survey. arXiv preprint arXiv:2311.01223, 2023.

5 Supplementary Material

5.1 Pseudocode of Continual Diffuser

The pseudocode for CoD training is shown in Algorithm 1. First of all, we process the datasets of $I$ tasks before training, including splitting the trajectories into equal sequences and normalizing the sequences to facilitate learning. As shown in lines $9-24$ , for each task $i$ , we check the task index in the whole task sequence and sample different samples from the different buffers. For example, for task $i,i>0$ , we will perform experience rehearsal every $\upsilon$ train steps by sampling data from $\mathscr{D}_{j},j\in{0,...,i-1}$ , where $j$ is sampled from $U(0,i-1)$ . Then, the networks $\epsilon_{\theta}$ , $f_{task}(\phi)$ , and $f_{time}(\varphi)$ are updated according to Equation (1) and Equation (3). After training on task $i$ , we preserve a small portion ( $\xi$ ) of the dataset of buffer $D_{i}$ as task $i$ ’s rehearsal buffer. During the evaluation of multiple tasks (shown in Algorithm 2), we successively generate decisions with CoD and calculate the evaluation metrics.

5.2 Implement Details

Compute. Experiments are carried out on NVIDIA GeForce RTX 3090 GPUs and NVIDIA A10 GPUs. Besides, the CPU type is Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz. Each run of the experiments spanned about 24-72 hours, depending on the algorithm and the length of task sequences.

Hyperparameters. In the implementation, we select the maximum diffusion steps as 200, and the default structure is Unet. Then, in order to speed up the generation efficiency during evaluation, we consider the speed-up technique of DDIM Song et al. [2020] and realize it in our method, thus accomplishing 19.043x acceleration compared to the original generation method. The sequence length is set to 48 in all experiments, where a larger sequence length can capture a more sophisticated distribution of trajectories and may also increase the computation burden. We set the LoRA dimension as 64 for each module of down-sampling, middle, and up-sampling blocks, and the percent of LoRA parameters is approximately 12% in our experiments.

Table 2: The hyperparameters of CoD.

	Hyperparameter	Value
Architecture	network backbone	Unet
	hidden dimension	128
	down-sampling blocks	3
	middle blocks	2
	up-sampling blocks	2
	convolution multiply	(1, 4, 8)
	normalizer	Gaussian normalizer
	sampling type of diffusion	DDIM
Training	condition guidance $\omega$	1.2
	max diffusion step $K$	200
	sequence length $T_{e}$	48
	loss function	MSE
	learning rate	$3\cdot 10^{-4}$
	batch size	32
	optimizer	Adam
	discount factor $\gamma$	0.99
	LoRA dimension	64
	condition dropout	0.25
	sampling speed-up stride	10
	rehearsal frequency $\upsilon$	2
	rehearsal sample diversity $\xi$	0.1

Input: Noise prediction model

\epsilon_{\theta}

, task MLP

f_{task}(\phi)

, time MLP

f_{time}(\varphi)

, tasks set

\mathcal{M}_{i},i\in\{1,...,I\}

, max diffusion step

K

, sequence length

T_{e}

, state dimension

d_{s}

, action dimension

d_{a}

, reply buffer

D_{i},i\in\{1,...,I\}

, rehearsal frequency (

\upsilon

), rehearsal sample diversity (

\xi

), noise schedule

\alpha_{0:K}

and

\beta_{0:K}

Output:

\epsilon_{\theta}

f_{task}(\phi)

f_{time}(\varphi)

1 Initialization:

\theta

\phi

\varphi

2 // Prepare for Training

3 Separate the state-action trajectories of

D_{i},i\in\{1,...,I\}

into state-action sequences with length

T_{e}

4 Normalize state-action sequences to obey Gaussian distribution

5 // Training

6 for each task $i$ do

7 for each train epoch do

8 for each train step $m$ do

9 if $i>0$ and $m$ % $\upsilon$ == 0 then

10 Sample j from

\{0,...,i-1\}

11 Sample

b

sequences

\tau_{j}^{0}=\{s_{j,t:t+T_{e}},a_{j,t:t+T_{e}}\}\in\mathbb{R}^{b\times T_{e}% \times(d_{s}+d_{a})}

from task

j

’s rehearsal buffer

\mathscr{D}_{j},j<i

13 else

14 Sample

b

sequences

\tau_{i}^{0}=\{s_{i,t:t+T_{e}},a_{i,t:t+T_{e}}\}\in\mathbb{R}^{b\times T_{e}% \times(d_{s}+d_{a})}

from task

i

’s buffer

D_{i}

16 end if

17 Obtain the corresponding task conditions

\mathcal{C}_{task}

18 Sample diffusion time step

k\sim\text{Uniform}(K)

19 Obtain

\tau_{i}^{k}

\tau_{j}^{k}

by adding noise to

\tau_{i}^{0}

\tau_{j}^{0}

20 Sample Gaussian noise

\epsilon\sim\mathcal{N}(0,\bm{I}),\epsilon\in\mathbb{R}^{b\times T_{e}\times(d% _{s}+d_{a})}

21 Train

\epsilon_{\theta}

f_{task}(\phi)

, and

f_{time}(\varphi)

according to Equation (1) and Equation (3)

23 end for

24 Save model periodically

26 end for

27 Preserve a small portion (

\xi

) of the dataset of buffer

D_{i}

as task

i

’s rehearsal buffer

\mathscr{D}_{i}

29 end for

Algorithm 1 Training of Continual Diffuser (CoD)

Input: Well trained noise prediction model

\epsilon_{\theta}

, task MLP

f_{task}(\phi)

, time MLP

f_{time}(\varphi)

, tasks set

\mathcal{M}_{i},i\in\{1,...,I\}

, max diffusion step

K

, noise schedule

\alpha_{0:K}

and

\beta_{0:K}

1 // Prepare for Evaluation

2 Normalize state-action sequences to obey Gaussian distribution

3 // Evaluation

4 for each evaluation task $i$ do

5 for each evaluation step do

6 Receive state

s_{i,t}

and task identify from the task

i

7 Obtain the corresponding task conditions

\mathcal{C}_{task}

8 Let

k=K

9 Sample

\hat{\tau}^{k}_{i}\in\mathbb{R}^{1\times T_{e}\times(d_{s}+d_{a})}

from normal distribution

\mathcal{N}(0,\bm{I})

10 Replace the first state of

\hat{\tau}^{k}_{i}

with

s_{i,t}

11 for each generation step $k$ do

12 Generate sequences

\hat{\tau}^{k-1}_{i}

with

\epsilon_{\theta}

f_{task}(\phi)

, and

f_{time}(\varphi)

according to Equation (2)

13 Replace the first state of

\hat{\tau}^{k-1}_{i}

with

s_{i,t}

15 end for

16 Perform the first action of

\hat{\tau}^{k-1}_{i}

in the task

i

17 Observe reward

r

from the task

i

19 end for

20 Record the success rate on task

i

22 end for

23Calculate the total mean success rate

Algorithm 2 Evaluation of Continual Diffuser (CoD)

5.3 Related Work

Diffusion-Based Models for RL. Diffusion models have made big progress in many fields, such as image synthesis and text generation Ho et al. [2020], Saharia et al. , Nichol and Dhariwal [2021], Beeson and Montana [2023], Sohl-Dickstein et al. [2015], Rombach et al. [2022]. Recently, a series of works have demonstrated the tremendous potential of diffusion-based models in offline RL tasks such as goal-based planning, composable constraint combination, scalable trajectory generation, and complex skill synthesis Janner et al. [2022], Fontanesi et al. [2019], Ajay et al. [2022], Chi et al. [2023]. For example, Janner et al. [2022] propose to use the value function as the guide during trajectory generation, effectively reducing the effects of out-of-distribution actions and reaching remarkable performance in offline RL tasks. Besides, diffusion models can also be used as policies to model the multimodal distribution from states to actions and as planners to perform long-horizon planning Kang et al. [2023], Wang et al. [2022a], He et al. [2023], Ni et al. [2023]. For instance, Kang et al. [2023] use diffusion models as policies to model the distribution from states to actions, while He et al. [2023] endow diffusion models with the ability to perform planning and data augmentation with different task-specific prompts.

Continual Learning in RL. Continual learning (CL) aims to solve multi-tasks that come sequentially with explicit boundaries (task-aware CL) or implicit boundaries (task-free CL) and achieve no catastrophic forgetting and good task transferring (i.e., plasticity-stability dilemma) at the same time Zhang et al. [2023c], Meyer et al. [2023], Wang et al. [2023b]. Multitask learning methods He et al. [2023], Laskin et al. [2020] are usually regarded as the upper bound of continual learning. Existing studies for continual RL can be roughly classified into three categories: Structure-based methods focus on novel model structures such as sub-networks, mixture-of-experts, hypernetworks, and low-rank adaptation Wang et al. [2022b], Zhang et al. [2023a], Smith et al. [2023a], Mallya and Lazebnik [2018]. Regularization-based methods propose using auxiliary regularization loss to constrain the policy optimization and avoid catastrophic forgetting during training Zhang et al. [2023b, 2022], Kessler et al. [2020], Kaplanis et al. [2019]. Rehearsal-based methods preserve experiences of previous tasks or train generative models that can produce pseudo-samples to maintain knowledge of past tasks Korycki and Krawczyk [2021], Smith et al. [2023b], Atkinson et al. [2021], Peng et al. [2023]. Besides, recent plasticity-preserving studies Lee et al. [2024], Foret et al. [2020] reveal that the plasticity of models can be enhanced by weight re-initialization and noisification when facing the early interactions overfitting within a single task.

Offline RL. Offline RL mainly focuses on how to train optimal policies with previously collected large datasets without expensive and risky data collection processes Wang et al. [2023a], Levine et al. [2020], Kostrikov et al. [2021], Kumar et al. [2020], Ghosh et al. [2022]. It, however, remains a huge challenge for training when facing the distribution shift between the learned policy and the data-collected policy and the overestimation of out-of-distribution (OOD) actions Hong et al. [2023], Kostrikov et al. [2021]. To solve these issues, previous studies on offline-RL tasks generally rely on methods from constrained optimization, safe learning, imitation learning, and amendatory estimation Kostrikov et al. [2021], Kumar et al. [2020], Wu et al. [2019], Ghosh et al. [2022]. Besides, planning and optimizing in the world model with limited interactions also serves as a promising way to train satisfactory policies Rigter et al. [2022], Kidambi et al. [2020], Rafailov et al. [2021]. Recently, sequential modeling has been proposed to fit the joint state-action distribution over the trajectories with transformer-based models and diffusion-based models Janner et al. [2021], Nguyen-Tang and Arora [2024], Chen et al. [2021], Ajay et al. [2022], Zhu et al. [2023], Janner et al. [2022].

5.4 Additional Experiments

Offline Continual World Results on CW10. We report the performance on CW10 in Figure 6 when the baselines are trained with offline datasets. The results show that the learning speed of our method (CoD) is much more efficient than other baselines when executing the same gradient updates. Besides, we can observe that the performance of generative methods is more effective than non-generative methods, which shows the powerful expressiveness of generative models in modeling complex environments and generating pseudo-samples with high fidelity.

Online Continual World Results on CW10. Apart from the offline comparison, we also modified the original baselines and conducted experiments on CW10, where several new online baselines were introduced. Similarly, the results in Figure 7 also show that our method (CoD) surpasses the baselines by a large margin, illustrating the superiority of CoD. We do not incorporate several offline baselines trained with generative models into online comparison because the generative process consumes much more time for interaction, which exceeds the tolerable range of training. These baselines trained with generative models are more suitable for training on offline datasets.

Mixed Dataset Training Analysis. We can classify the training under the sub-optimal demonstrations into two situations. You can click here to return to Section 2.5 quickly for continual reading of the main body.

The first is learning from noise datasets. In order to simulate the training under the sub-optimal demonstrations, we insert noise into the observations of the current dataset to obtain sub-optimal demonstrations, i.e., $\bar{o}=o+clip(\eta*\mathcal{N}(0,I),\rho)$ . The larger noise denotes datasets with lower quality. We report the results in Table 3. The results illustrate that the performance decreases with the noise increasing, which inspires us to find additional techniques to reduce the influence of the noise on samples, such as adding an extra denoising module before diffuser training.

Table 3: The experiments of CoD when training with noise datasets on the Ant-dir tasks.

Noise level $\eta$	0	0.1	0.5
Bound $\rho$	-	(-0.5, 0.5)	(-1.0, 1.0)
Score	478.19 $\pm$ 15.84	247.41 $\pm$ 5.48	163.00 $\pm$ 5.16

The second is learning from datasets sampled with mixed-quality policies. We construct the ‘medium’ datasets on several Continual World tasks (CW4) to show the performance on the mixed-quality datasets, where the trajectories come from a series of behavior policies during the training stage. With the training stage going, we update the policy network many times, and each gradient update step will be regarded as generating a new behavior policy. Then, the performance of the policy will be improved. Next, we use the behavior policies whose performance ranges from medium to well-trained performance to collect ‘medium’ datasets, i.e., the ‘medium’ datasets contain unsuccessful trajectories and successful trajectories simultaneously (Refer to Table 11 for more statistics.). Based on the mixed-quality CW4 datasets, we adopt IL as the baseline and compare our method with IL. The corresponding experimental results are shown in Table 1 (c). The results show that our method (CoD) can achieve better performance than the baseline in the ‘medium’ dataset quality setting, which shows its effectiveness.

Plasticity Comparison. In order to compare the plasticity of our method and representative plasticity-preserving methods Lee et al. [2024], Foret et al. [2020], we conduct the experiments on the Ant-dir environment with task setting as ‘10-15-19-25’, which is the same as the setting in the main body. The results are reported in Table 4, where the final performance means evaluation on all tasks after the whole training on all tasks and the performance gain of plasticity (task-level) is calculated according to mean(P(train15test15) - P(train10test15) + P(train19test19) - P(train15test19) + P(train25test25) - P(train19test25)). The results illustrate that our model reaches better final performance than PLASTIC and SAM. Besides, in the task-level plasticity performance comparison, our method also obtains a higher score. Although PLASTIC and SAM do not perform well here, it’s worth noting that PLASTIC and SAM are not designed to resolve continual learning under changing tasks but to address early interactions overfitting within a single task. The granularity of plasticity referred to in CoD is larger than that in PLASTIC and SAM. Click here to return to Section 2.5 quickly for continual reading of the main body.

Table 4: The comparison of our method and plasticity-preserving methods PLASTIC and SAM on the Ant-dir environment. We report the performance of PLASTIC and SAM with online and offline training under the continual learning setting.

Model	CoD	Diffuser-w/o rehearsal	PLASTIC (online)	SAM (online)	PLASTIC (offline)	SAM (offline)
Final performance	478.19 $\pm$ 15.84	270.44 $\pm$ 5.54	201.45 $\pm$ 0.56	202.17 $\pm$ 0.46	186.71 $\pm$ 4.55	187.19 $\pm$ 4.53
Performance gain of plasticity (task-level)	407.84	348.15	8.70	10.94	4.60	55.77

Parameters Sensitivity on Ant-dir. In Section 2.6, we conduct the parameter sensitivity analysis on CW to show the effects of rehearsal frequency $\upsilon$ and rehearsal diversity $\xi$ . We also report the results of parameters sensitivity on Ant-dir in Figure 5 and Table 6, where $\upsilon=inf$ means we do not perform rehearsal during training. The results show that with the increase of $\upsilon$ , the performance declines because the model can not use previous datasets to strengthen its memory in time.

Table 5: The ablation study of CoD.

Method	Mean episode return
Task	Ant-dir	CW10	CW20
CoD-w/o rehearsal	270.44 $\pm$ 5.54	0.20 $\pm$ 0.01	0.18 $\pm$ 0.01
CoD (Ours)	478.19 $\pm$ 15.84	0.98 $\pm$ 0.01	0.98 $\pm$ 0.01

Table 6: The absolute performance of parameters sensitivity of Ant-dir.

		$\xi$
		1%	5%	10%	20%
$\upsilon$	2	383.07 $\pm$ 8.71	465.82 $\pm$ 5.03	475.82 $\pm$ 8.14	493.83 $\pm$ 3.17
	10	344.73 $\pm$ 0.00	381.27 $\pm$ 0.00	368.42 $\pm$ 0.00	369.94 $\pm$ 0.00
	14	331.29 $\pm$ 0.00	351.79 $\pm$ 0.00	345.71 $\pm$ 0.00	352.32 $\pm$ 0.00
	inf	271.79 $\pm$ 8.48	271.79 $\pm$ 8.48	271.79 $\pm$ 8.48	271.79 $\pm$ 8.48

Efficiency Analysis of Generation Speed. The generation process of diffusion models is indeed computationally intensive because the mechanism of generation requires multiple rounds to generate a sequence. However, we can draw inspiration from previous studies Nichol and Dhariwal [2021], Song et al. [2020] in related domains and accelerate the generation process. For example, we can reduce the reverse diffusion step from 200 to 10. To show the efficiency of accelerating during the generation process, we conduct a comparison of generation speed. We report the results in Table 7, where the 200 diffusion steps setting is the original version, and the 10 diffusion steps setting is our accelerated version. In the experiments of our manuscript, we adopt the 10 diffusion steps setting, which improves the sampling speed (19.043×) with a larger margin than the original sampling version. It’s worth noting that our implemented accelerate technique can also use other diffusion steps settings, but we find that 10 diffusion steps setting performs well on performance and generation efficiency.

Table 7: The comparison of generation speed with different generation steps. In the main body of our manuscript, we use the 10 diffusion steps setting for all experiments.

Diffusion steps	200 (original)	100	50	25	10
Time consumption of per generation (s)	3.085 $\pm$ 0.077	1.839 $\pm$ 0.116	0.850 $\pm$ 0.007	0.394 $\pm$ 0.006	0.159 $\pm$ 0.005
Speed-up ratio	1×	1.678×	3.629×	7.830×	19.043×

Ablation Study on Mixed Datasets. In Table 8, we report the effects of rehearsal sample diversity $\xi$ on the ‘medium’ datasets. From the results, we can see that increasing the rehearsal sample diversity is beneficial to the performance, which is in line with the experiments in the main body of our manuscript. Besides, the results also show that our method (CoD) can reach a better plasticity-stability trade-off than the baseline in the ‘medium’ dataset quality setting.

Table 8: The ablation study of CoD on ‘medium’ CW4 datasets. We select the continual tasks setting as CW4 (“hammer-v1”, “push-wall-v1”, “faucet-close-v1”, “push-back-v1”), where the ‘medium’ experiences come from the behavior policies from the middle training stage to the end training stage.

Model	$\upsilon$	$\xi$	P $\uparrow$	FT $\uparrow$	F $\downarrow$	P+FT-F $\uparrow$
CoD	2	1%	0.85 $\pm$ 0.02	0.60 $\pm$ 0.13	0.05 $\pm$ 0.01	1.40
CoD	2	10%	0.90 $\pm$ 0.02	0.60 $\pm$ 0.12	-0.01 $\pm$ 0.01	1.51
IL	2	1%	0.57 $\pm$ 0.19	0.12 $\pm$ 0.54	0.18 $\pm$ 0.09	0.51
IL	2	10%	0.63 $\pm$ 0.17	0.28 $\pm$ 0.27	0.28 $\pm$ 0.18	0.63

Computation Costs Analysis. In order to show the consumption of computational costs, we report the comparison of computation costs during the training stage in Table 9, where we obtain the statistical data with ‘wandb.’ The results show that increasing the rehearsal samples does not significantly increase computation costs and training time.

Table 9: The comparison of computation costs and training time using our method with different hyperparameter settings. Experiments are carried out on NVIDIA GeForce RTX 3090 GPUs and NVIDIA A10 GPUs. Besides, the CPU type is Intel(R) Xeon(R) Gold 5220 CPU @ 2.20GHz. We report the results according to ‘wand.’

$\upsilon$	2	2	2	2	14	10	6	inf
$\xi$	20	10	5	1	10	10	10	-
Process memory in use (non-swap) (MB)	19914.52	20058.77	20023.48	20045.32	20010.42	20026.34	20049.25	19983.4
Train time (h)	143.713	144.488	143.681	143.398	145.081	145.973	146.397	152.274

5.5 Statistics of Continual Offline RL Benchmarks

Table 10: The total statistics of our benchmark.

Environment	Tasks number	Quality	Samples per task
Continual World	88	expert	1M
Continual World	4	medium	0.4M
Ant-dir	40	expert	0.2M
Cheetah-dir	2	expert	0.2M

To take advantage of the potential of diffusion models, we first collect an offline benchmark that contains dozens of tasks from multiple domains, such as Continual World and Gym-MuJoCo Wołczyk et al. [2021], Todorov et al. [2012]. In order to collect the interaction data, we trained Soft Actor-Critic on each task for approximately 1M time steps Haarnoja et al. [2018]. Totally, the benchmark contains 90 tasks, where 88 tasks come from Continual World, 2 tasks come from Gym-MuJoCo.

Specifically, CW Wołczyk et al. [2021] tasks are constructed based on Meta-World Yu et al. [2020]. CW consists of many realistic robotic manipulation tasks such as Pushing, Reaching, Door Opening, Pick, and Place. CW is convenient for training and evaluating the abilities of forward transfer and forgetting because the state and action space are the same across all the tasks. In our benchmark, we collect “expert” and “medium” datasets, where the episodic time limit is set to 200, and the evaluation time step is set to 1M and 0.4M for “expert" and “medium” datasets, respectively. Thus, we obtain 5000 and 2000 episodes for these two quality tasks, as shown in Table 11 and Table 12, in which we also report the mean success rate of these two qualities dataset. Besides, we also provide the return information of all datasets in Figure 8 and Figure 9. Out of these tasks defined in Meta-World, we usually select ten tasks from them as the setting of continual learning, i.e., CW10, and CW20 denotes the setting of two CW10.

Aligning with the traditional definition of various CL settings Van de Ven et al. [2022], this benchmark supports constructing task-incremental CORL (TICORL), domain-incremental CORL (DICORL), and class-incremental CORL (CICORL) settings. Researchers can use these datasets in any sequence or length for CL tasks to test the plasticity-stability trade-off of their proposed methods. The future expansion plan of this benchmark will gather datasets such as ‘random’ and ‘full’ training qualities datasets to bolster training robust CL agents.

Ant-dir is an 8-joint ant environment. The different tasks are defined according to the target direction, where the agent should maximize its return with maximal speed in the pre-defined direction. As shown in Table 13, there are 40 tasks (distinguished with “task id”) with different uniformly sampled goal directions in Ant-dir. For each task, the dataset contains approximately 200k transitions, where the observation and action dimensions are 27 and 8, respectively. We found that the Ant-dir datasets have been used by many researchers Xu et al. [2022], Li et al. [2020], Rakelly et al. [2019], so we incorporate them into our benchmark. Moreover, we report the mean return information of each sub-task in Table 13 and Figure 10. As for Cheetah-dir, it only contains two tasks that represent forward and backward goal directions. Compared with Ant-dir, Cheetah-dir possesses lower observation and action space.

Table 11: The information statistics of offline Continual World-v1 datasets.

Continual World dataset	quality	episode length	episode number	mean success	observation dimension	action dimension
assembly-v1	expert	200	5000	1.0	13	4
basketball-v1	expert	200	5000	1.0	13	4
button-press-topdown-v1	expert	200	5000	0.99	13	4
button-press-topdown-wall-v1	expert	200	5000	0.9864	13	4
button-press-v1	expert	200	5000	0.99	13	4
button-press-wall-v1	expert	200	5000	1.0	13	4
coffee-button-v1	expert	200	5000	0.9916	13	4
coffee-pull-v1	expert	200	5000	0.99	13	4
coffee-push-v1	expert	200	5000	0.99	13	4
dial-turn-v1	expert	200	5000	0.9902	13	4
disassemble-v1	expert	200	5000	0.0	13	4
door-close-v1	expert	200	5000	0.9902	13	4
door-open-v1	expert	200	5000	0.9898	13	4
drawer-close-v1	expert	200	5000	0.9894	13	4
drawer-open-v1	expert	200	5000	0.99	13	4
faucet-close-v1	expert	200	5000	0.9896	13	4
faucet-open-v1	expert	200	5000	0.9154	13	4
hammer-v1	expert	200	5000	0.99	13	4
handle-press-side-v1	expert	200	5000	0.9878	13	4
handle-press-v1	expert	200	5000	0.99	13	4
handle-pull-side-v1	expert	200	5000	0.9888	13	4
handle-pull-v1	expert	200	5000	0.99	13	4
lever-pull-v1	expert	200	5000	0.0	13	4
peg-insert-side-v1	expert	200	5000	0.9604	13	4
peg-unplug-side-v1	expert	200	5000	0.99	13	4
pick-out-of-hole-v1	expert	200	5000	0.0	13	4
pick-place-v1	expert	200	5000	1.0	13	4
pick-place-wall-v1	expert	200	5000	0.8196	13	4
plate-slide-back-side-v1	expert	200	5000	0.99	13	4
plate-slide-back-v1	expert	200	5000	0.9886	13	4
plate-slide-side-v1	expert	200	5000	0.7992	13	4
plate-slide-v1	expert	200	5000	0.5694	13	4
push-back-v1	expert	200	5000	0.9922	13	4
push-v1	expert	200	5000	0.9844	13	4
push-wall-v1	expert	200	5000	1.00	13	4
reach-wall-v1	expert	200	5000	0.99	13	4
shelf-place-v1	expert	200	5000	1.00	13	4
soccer-v1	expert	200	5000	0.0066	13	4
stick-pull-v1	expert	200	5000	0.93	13	4
stick-push-v1	expert	200	5000	0.4486	13	4
sweep-into-v1	expert	200	5000	0.9662	13	4
sweep-v1	expert	200	5000	0.0834	13	4
window-close-v1	expert	200	5000	0.99	13	4
window-open-v1	expert	200	5000	0.99	13	4
hammer-v1	medium	200	2000	0.7689	13	4
push-wall-v1	medium	200	2000	0.7465	13	4
faucet-close-v1	medium	200	2000	0.9364	13	4
push-back-v1	medium	200	2000	0.3168	13	4

Table 12: The information statistics of offline Continual World-v2 datasets.

Continual World dataset	quality	episode length	episode number	mean success	observation dimension	action dimension
basketball-v2	expert	200	5000	1.0	39	4
box-close-v2	expert	200	5000	1.0	39	4
button-press-topdown-v2	expert	200	5000	1.0	39	4
button-press-topdown-wall-v2	expert	200	5000	1.0	39	4
button-press-v2	expert	200	5000	1.0	39	4
button-press-wall-v2	expert	200	5000	1.0	39	4
coffee-button-v2	expert	200	5000	1.0	39	4
dial-turn-v2	expert	200	5000	1.0	39	4
door-close-v2	expert	200	5000	1.0	39	4
door-lock-v2	expert	200	5000	1.0	39	4
door-open-v2	expert	200	5000	1.0	39	4
door-unlock-v2	expert	200	5000	1.0	39	4
drawer-close-v2	expert	200	5000	1.0	39	4
drawer-open-v2	expert	200	5000	1.0	39	4
faucet-close-v2	expert	200	5000	1.0	39	4
faucet-open-v2	expert	200	5000	1.0	39	4
hammer-v2	expert	200	5000	1.0	39	4
hand-insert-v2	expert	200	5000	1.0	39	4
handle-press-side-v2	expert	200	5000	1.0	39	4
handle-press-v2	expert	200	5000	1.0	39	4
handle-pull-side-v2	expert	200	5000	1.0	39	4
handle-pull-v2	expert	200	5000	1.0	39	4
lever-pull-v2	expert	200	5000	1.0	39	4
peg-insert-side-v2	expert	200	5000	1.0	39	4
peg-unplug-side-v2	expert	200	5000	1.0	39	4
pick-out-of-hole-v2	expert	200	5000	1.0	39	4
pick-place-v2	expert	200	5000	1.0	39	4
plate-slide-back-side-v2	expert	200	5000	1.0	39	4
plate-slide-back-v2	expert	200	5000	1.0	39	4
plate-slide-side-v2	expert	200	5000	1.0	39	4
plate-slide-v2	expert	200	5000	1.0	39	4
push-back-v2	expert	200	2668	1.0	39	4
push-v2	expert	200	5000	1.0	39	4
push-wall-v2	expert	200	5000	1.0	39	4
reach-v2	expert	200	5000	1.0	39	4
reach-wall-v2	expert	200	5000	1.0	39	4
shelf-place-v2	expert	200	5000	1.0	39	4
stick-pull-v2	expert	200	5000	1.0	39	4
stick-push-v2	expert	200	5000	1.0	39	4
sweep-into-v2	expert	200	5000	1.0	39	4
sweep-v2	expert	200	5000	1.0	39	4
window-close-v2	expert	200	5000	1.0	39	4
window-open-v2	expert	200	5000	1.0	39	4

Table 13: The information statistics of offline Gym-MuJoCo datasets.

MuJoCo dataset	task id	quality	episode length	episode number	mean return	observation dimension	action dimension
Ant-dir	4	expert	200	999	315.7402	27	8
Ant-dir	6	expert	200	1000	865.8379	27	8
Ant-dir	7	expert	200	1000	993.9981	27	8
Ant-dir	9	expert	200	999	390.8016	27	8
Ant-dir	10	expert	200	1000	744.8206	27	8
Ant-dir	13	expert	200	1000	922.9069	27	8
Ant-dir	15	expert	200	1000	522.9190	27	8
Ant-dir	16	expert	200	1000	835.9635	27	8
Ant-dir	17	expert	200	999	352.7341	27	8
Ant-dir	18	expert	200	1000	367.9050	27	8
Ant-dir	19	expert	200	999	369.9799	27	8
Ant-dir	21	expert	200	1000	868.7162	27	8
Ant-dir	22	expert	200	1000	577.2005	27	8
Ant-dir	23	expert	200	1000	386.7926	27	8
Ant-dir	24	expert	200	1000	547.0642	27	8
Ant-dir	25	expert	200	1000	501.6898	27	8
Ant-dir	26	expert	200	1000	357.3981	27	8
Ant-dir	27	expert	200	1000	439.8590	27	8
Ant-dir	28	expert	200	1000	484.8640	27	8
Ant-dir	29	expert	200	1000	439.0989	27	8
Ant-dir	30	expert	200	999	305.6620	27	8
Ant-dir	31	expert	200	999	478.8927	27	8
Ant-dir	32	expert	200	999	442.5488	27	8
Ant-dir	33	expert	200	1000	952.0699	27	8
Ant-dir	34	expert	200	1000	909.5234	27	8
Ant-dir	35	expert	200	999	352.6703	27	8
Ant-dir	36	expert	200	1000	593.1572	27	8
Ant-dir	37	expert	200	1000	374.4446	27	8
Ant-dir	38	expert	200	999	390.5748	27	8
Ant-dir	39	expert	200	999	307.2525	27	8
Ant-dir	40	expert	200	1000	524.2991	27	8
Ant-dir	41	expert	200	1000	360.6967	27	8
Ant-dir	42	expert	200	1000	454.5446	27	8
Ant-dir	43	expert	200	999	285.9895	27	8
Ant-dir	44	expert	200	1000	878.4141	27	8
Ant-dir	45	expert	200	1000	813.5594	27	8
Ant-dir	46	expert	200	1000	900.4641	27	8
Ant-dir	47	expert	200	1000	422.5884	27	8
Ant-dir	48	expert	200	1000	865.0776	27	8
Ant-dir	49	expert	200	1000	398.1321	27	8
Cheetah-dir	0	expert	200	999	666.5849	20	6
Cheetah-dir	1	expert	200	999	1134.3012	20	6