1. Introduction
Reinforcement learning (RL) is an effective method for solving sequential decision-making tasks, where a learning agent interacts with the environment to improve its performance through trial and error [1]. RL has achieved exceptional success in challenging tasks, such as object manipulation [2,3,4,5], game playing [6,7,8,9], and autonomous driving [10,11,12,13]. Despite its remarkable advancement, RL still faces considerable difficulties caused by the need for a reward function [14,15]. For each task that the agent has to accomplish, a carefully designed reward function must be provided. However, designing a hand-crafted reward function may require too much time or expense, especially in complex tasks. This problem has motivated a number of research studies on imitation learning (IL), where expert-generated demonstration data are provided instead of a reward function in order to help the agent learn how to perform a task [16,17]. For this reason, IL has been growing in popularity and has achieved some successes in numerous tasks, including robotics control [18,19,20] and autonomous driving [21,22,23,24].
Despite these achievements, IL agents are designed to focus on accomplishing only a single, narrowly defined task. Therefore, when given a new task, the agent has to start the learning process again from the ground up, even if it has already learned a task that is related to and shares the same structure as the new one. Humans, on the other hand, possess an astonishing ability to leverage the knowledge learned from source tasks when learning a new task. For example, an infant can reuse and augment the motor skills obtained while learning to walk or to use its hands for more complex tasks later in life (e.g., riding a bike). Transfer learning (TL) is a technique based on this idea. TL enables the agent to reuse the knowledge learned from a source task in order to facilitate learning a new target task, resulting in a more generalized agent.
Recent studies have applied TL to RL/IL agents and achieved some success, especially in robot manipulation tasks, since these tasks usually share a common structure (i.e., a robot arm) [25,26,27]. Nevertheless, there is still an enormous gap between human ability and TL. Since TL is designed to leverage the learned knowledge to accelerate the acquisition of the new target task, the learning performance on the target task may be improved in exchange for a deterioration of the source task's performance. In other words, the agent forgets how to perform the previously learned task when learning a new one, which is known as the catastrophic forgetting problem [28,29]. Humans, in contrast, can perform well on both source and target tasks.
To address the aforementioned gap, this paper discusses a novel challenge of task adaptation in imitation learning, in which an agent trained on a source task faces a new target task and must optimize its overall performance on both tasks. In other words, the research objective is to help the agent achieve high learning performance on the target task while avoiding performance deterioration on the source task. This problem can serve as a step toward building a general-purpose agent. As one illustrative example, consider a household robot learning to assist its human owner. Initially, the human might want to teach the robot to load clothes into the washer by providing demonstrations of the task. At a later time, the user could teach the robot to fold clothes. These tasks are related to each other since they both involve manipulating clothes; hence, the robot is expected to perform well on both tasks and to leverage any relevant knowledge obtained from loading the washer while folding clothes. In order to achieve such a knowledge transfer ability, a task adaptation method for imitation learning is proposed in this paper. Inspired by the idea of repetition learning in neuroscience [30,31,32], the general idea of the proposed method is to make the agent repeatedly review the learned knowledge of the source task while learning the target task at the same time. Accordingly, the proposed method is two-fold. Firstly, to allow the agent to repeatedly review the learned knowledge of the source task, a task adaptation algorithm is proposed. In the adaptation process, the learned knowledge is expanded by adding the knowledge of the target task. Secondly, a novel IL agent, which is capable of finding an optimal policy using expert-generated demonstrations, is proposed. This agent encodes the learned knowledge of the source task into a high-dimensional vector, namely a task embedding, which then supports the knowledge expansion in the adaptation process. The evaluation results show that the proposed method has a better learning ability compared to existing transfer learning approaches.
The main contributions of this work are summarized as follows:
An imitation learning agent is proposed to learn an optimal policy using expert-generated demonstration data. The agent is capable of encoding its knowledge into a high-dimensional task embedding space in order to support the knowledge expansion in the later adaptation process.
Given a new target task, a task adaptation algorithm is proposed in order to enable the agent to broaden its knowledge without forgetting the previous source task by leveraging the idea of repetition learning in neuroscience. The resulting agent can provide a better generalization and consistently perform well on both source and target tasks.
A set of experiments is conducted over a number of simulated tasks in order to evaluate the performance of the proposed task adaptation method in terms of success rate, average cumulative reward, and computational cost. The evaluation results demonstrate the effectiveness of the proposed method in comparison with existing transfer learning methods.
The rest of the paper is organized as follows: Section 2 reviews existing studies on transfer learning and other work related to the proposed method. The formulation of the task adaptation problem in imitation learning is presented in Section 3. A detailed description of the proposed approach is provided in Section 4. Section 5 provides the details of the experimental settings and results. Section 6 discusses the potential of the proposed method in real-world problems. The conclusion is given in Section 7.
2. Related Work
Transfer learning (TL) aims to accelerate, adapt, and improve the agent's learning process on a new target task by transferring knowledge learned from a previous source task. Whereas TL has been intensively studied and has shown promising performance in supervised learning [33,34,35,36,37,38,39], it remains an open question in the reinforcement learning and imitation learning fields. Fine-tuning is the most explored approach to transfer learning in both RL and IL settings [40,41,42]. In fine-tuning, the RL/IL agent is pre-trained on a source task and then retrained on a new target task. Fine-tuning does not require strong assumptions about the target domain, making it an easily applicable approach. Other approaches to transfer learning have been proposed, such as reward shaping [43,44,45], inter-task mapping [46,47,48], and representation learning [49,50,51], among others. However, these methods were designed for RL agents; directly applying them to transfer an IL agent does not necessarily lead to successful results, since RL and IL differ in many respects. Moreover, the key challenge in transfer learning is catastrophic forgetting, in which the agent tends to unexpectedly lose the knowledge learned from the source task while transferring to the new target task. The reason is that the agent's network parameters related to the source task are overwritten to fulfill the target task's objectives [28]. Therefore, TL methods are not suitable for an agent that revisits the earlier task. In contrast, instead of transferring the knowledge learned from the source task to a new target task, the proposed adaptation method attempts to expand the agent's learned knowledge. The knowledge expansion allows the agent to learn a new target task while retaining the previously learned source task's knowledge, resulting in an agent that can perform well on both the source and target tasks after adaptation.
Besides transfer learning, the proposed adaptation method of learning to perform both source and target tasks also bears similarity to multi-task learning, where an agent is trained to perform multiple tasks simultaneously [52,53,54,55,56]. In multi-task learning, knowledge transfer is enabled by learning a shared representation among tasks. However, in this study, the proposed adaptation method focuses on learning the source and target tasks sequentially. In addition, the performance deterioration on the previously learned source task receives greater emphasis here than in both transfer learning and multi-task learning.
3. Problem Formulation
The task adaptation problem in IL can be formalized as a sequential Markov decision process (MDP). An MDP $\mathcal{M}_x$ for a task $x$ with finite time horizon $T$ [1] is represented as the following equation:

$$\mathcal{M}_x = \langle \mathcal{S}_x, \mathcal{A}_x, \mathcal{P}_x, \mathcal{R}_x, \gamma \rangle,$$

where $\mathcal{S}_x$ and $\mathcal{A}_x$ represent the continuous state and action spaces, respectively; $\mathcal{P}_x(s_{t+1} \mid s_t, a_t)$ denotes the transition probability function; $\mathcal{R}_x$ is the reward function; and $\gamma \in [0, 1]$ is the discount factor. In the IL setting, the reward function $\mathcal{R}_x$ is unknown. A stochastic policy $\pi_x(a \mid s)$ for $\mathcal{M}_x$ describes a mapping from each state to the probability of taking each action. The goal of an IL agent is to learn an optimal policy $\pi^{*}_x$ that imitates the expert policy $\pi^{E}_x$, given demonstrations from that expert. An expert demonstration for a task $x$ is defined as a sequence of state–action pairs $\tau_x = \{(s^x_t, a^x_t)\}_{t=0}^{T}$.
Let $\mathcal{M}_S$ denote a source task, which provides prior knowledge $\mathcal{K}_S$ that is accessible to the target task $\mathcal{M}_T$, such that by leveraging $\mathcal{K}_S$, the target agent learns better in the target task $\mathcal{M}_T$. The main objective in this study is to learn an optimal policy $\pi^{*}$ for both source and target tasks by leveraging $\mathcal{K}_S$ from $\mathcal{M}_S$ as well as $\mathcal{K}_T$ from $\mathcal{M}_T$.
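To make the formulation concrete, the following minimal Python sketch (not part of the original paper; the class and helper names are illustrative) shows one way to represent an expert demonstration as a sequence of state–action pairs and to sample pairs from it:

```python
# Illustrative sketch only: a simple container for expert demonstrations,
# assuming states and actions are fixed-size NumPy arrays.
from dataclasses import dataclass
from typing import List, Tuple

import numpy as np


@dataclass
class Demonstration:
    """One expert trajectory tau = [(s_0, a_0), ..., (s_T, a_T)] for a given task."""
    task_id: str                                # e.g., "source" or "target" (hypothetical labels)
    steps: List[Tuple[np.ndarray, np.ndarray]]  # list of (state, action) pairs


def sample_pair(demo: Demonstration, rng: np.random.Generator) -> Tuple[np.ndarray, np.ndarray]:
    """Draw a random state-action pair from a demonstration."""
    state, action = demo.steps[rng.integers(len(demo.steps))]
    return state, action
```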
4. The Proposed Agent and Adaptation Algorithm
The proposed method presented in this section involves two main processes: learning from a source task and adapting to a new target task. The main objective is to build an agent that can perform consistently well on both source and target tasks. In order to achieve this, the general idea of this novel approach is to allow the agent to repeatedly review the knowledge learned from the source task while learning the new knowledge of the target task. The idea is inspired by a human learning effect, namely repetition learning. Prior studies in neuroscience have shown that when humans learn by repetition, their memory performance can be enhanced and retained for a longer time [30,31,32], giving humans the unique ability to perform the most sophisticated tasks with ease. Therefore, this paper focuses on developing a similarly inspired method in order to achieve the main research objective and to tackle the task adaptation problem in imitation learning.
Accordingly, the proposed method is two-fold. Firstly, an adaptation algorithm is proposed to allow the agent to learn the new target task by expanding its knowledge. More concretely, on top of the knowledge that the agent has learned from a source task, the knowledge of a target task is added. In addition, the agent repeatedly uses this knowledge to learn the target task and to review the previously learned source task, ensuring that the learning performance on the target task is high while the deterioration of the learning performance on the source task is small. Secondly, to support the expansion of the to-be-learned knowledge, a novel imitation learning (IL) agent is proposed. This agent encodes the learned knowledge into a latent space, namely the task embedding space, in which the learned knowledge from task $x$ at time step $t$ can be represented by a high-dimensional vector $e^x_t$.
Figure 1 illustrates the task embedding space before and after applying the proposed task adaptation algorithm. The task embedding space allows the proposed adaptation algorithm to add the new knowledge of the target task while minimizing its impact on the source task's knowledge. In addition, since the source and target tasks are related to each other, there is some common knowledge shared between the two tasks. This shared knowledge can be captured by the task embedding, which helps accelerate the adaptation process. The details of the proposed method are provided in the following subsections.
4.1. The Proposed Agent
In this subsection, the proposed agent is described in detail. The proposed agent is an imitation learning method that finds an optimal policy for the source task using expert-generated demonstration data. The agent is capable of encoding the learned knowledge into a task embedding in order to support the later adaptation process. The architecture of the proposed agent is illustrated in Figure 2. The proposed agent is a combination of three deep feed-forward networks E, G, and D, which have different responsibilities.
4.1.1. Task-Embedding Network E
The task-embedding network E is designed to encode the learned knowledge into a high-dimensional task embedding space. Specifically, E maps a state $s^x_t$ of task $x$ at time step $t$ into a task embedding $e^x_t = E(s^x_t)$, $e^x_t \in \mathbb{R}^d$. Since $e^x_t$ contains the information of the task, it is expected that $e^x_t$ can capture the similarities and differences between source and target tasks. In order to achieve that, contrastive learning is introduced to train E. Contrastive learning aims to bring task embeddings of the same task close to each other in the task embedding space and to push dissimilar ones far apart. In other words, E is trained to minimize the distance $\mathcal{D}(e^x_t, e^y_t)$ when $x$ and $y$ are the same task and to maximize it when they are different tasks, where $\mathcal{D}$ is a negative cosine similarity function defined as

$$\mathcal{D}(e^x_t, e^y_t) = -\frac{e^x_t \cdot e^y_t}{\left\lVert e^x_t \right\rVert_2 \left\lVert e^y_t \right\rVert_2},$$

where $x$ and $y$ can be the same or different tasks.
The optimization function $\mathcal{L}_E$ used to train E is defined as follows:

$$\mathcal{L}_E = \mathbb{1}[x = y]\, \mathcal{D}(e^x_t, e^y_t) - \mathbb{1}[x \neq y]\, \mathcal{D}(e^x_t, e^y_t),$$

where $\mathbb{1}[\cdot]$ is an indicator function.
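The following PyTorch sketch illustrates how such an objective could look. It is an assumption-based example rather than the paper's implementation; the network width and embedding dimension are arbitrary choices:

```python
# Sketch of the task-embedding network E and a contrastive objective built on
# negative cosine similarity. Architecture sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TaskEmbeddingNet(nn.Module):
    """E: maps a state s_t to a task embedding e_t."""
    def __init__(self, state_dim: int, embed_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, embed_dim),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)


def neg_cosine(e_x: torch.Tensor, e_y: torch.Tensor) -> torch.Tensor:
    """D(e_x, e_y): negative cosine similarity, averaged over the batch."""
    return -F.cosine_similarity(e_x, e_y, dim=-1).mean()


def embedding_loss(e_x: torch.Tensor, e_y: torch.Tensor, same_task: bool) -> torch.Tensor:
    """Pull same-task embeddings together, push different-task embeddings apart."""
    d = neg_cosine(e_x, e_y)
    return d if same_task else -d
```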
4.1.2. Action Generator Network G and Discriminator Network D
The action generator network G aims to generate an optimal action $\hat{a}^x_t$ using the input task embedding $e^x_t$. The discriminator network D is designed to distinguish between the expert action $a^x_t$ and the training agent's action $\hat{a}^x_t$. The intuition behind this is that the expert actions are assumed to be optimal in the imitation learning setting; thus, G is trained to minimize the difference between $\hat{a}^x_t$ and $a^x_t$. In order to achieve that, the adversarial loss [57] is applied for both networks:

$$\mathcal{L}_{GAN} = \mathbb{E}_{(s^x_t, a^x_t)}\big[\log D(s^x_t, a^x_t)\big] + \mathbb{E}_{s^x_t}\big[\log\big(1 - D(s^x_t, \hat{a}^x_t)\big)\big], \quad \hat{a}^x_t = G\big(E(s^x_t)\big).$$
The optimal policy is achieved using an RL-based policy gradient method, which relies on the reward signal provided by the discriminator.
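A hedged sketch of the action generator G and discriminator D is given below; the network sizes, the state-conditioned discriminator input, and the binary cross-entropy form of the adversarial loss are assumptions, not details confirmed by the paper:

```python
# Sketch of G and D with a GAN-style objective. The discriminator's output can
# also be turned into a reward signal for the policy-gradient update.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Generator(nn.Module):
    """G: maps a task embedding e_t to an action."""
    def __init__(self, embed_dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),
        )

    def forward(self, embedding: torch.Tensor) -> torch.Tensor:
        return self.net(embedding)


class Discriminator(nn.Module):
    """D: scores state-action pairs to separate expert from generated actions."""
    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1))


def adversarial_losses(D: Discriminator, state, expert_action, fake_action):
    """Discriminator loss on expert vs. (detached) generated actions, plus generator loss."""
    real = D(state, expert_action)
    fake = D(state, fake_action.detach())
    d_loss = (F.binary_cross_entropy_with_logits(real, torch.ones_like(real))
              + F.binary_cross_entropy_with_logits(fake, torch.zeros_like(fake)))
    pred = D(state, fake_action)          # gradient flows back into G through this call
    g_loss = F.binary_cross_entropy_with_logits(pred, torch.ones_like(pred))
    return d_loss, g_loss
```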
4.1.3. Full Objective
During the source task's learning process, a set of expert-generated demonstrations $\mathcal{T}_S$ is provided, where each demonstration is a sequence of state–action pairs $\tau_S = \{(s^S_t, a^S_t)\}_{t=0}^{T}$. The task embedding for each demonstration state $s^S_t$ at time step $t$ can be computed as $e^S_t = E(s^S_t)$. It should be noted that the contrastive loss function $\mathcal{L}_E$ used to train E requires two inputs, $e^x_t$ and $e^y_t$, where $x$ and $y$ can be of the same or different tasks. In this source task learning process, the target task demonstrations are not provided yet; thus, the second task embedding input $e^y_t$ is generated by introducing Gaussian noise $\epsilon \sim \mathcal{N}(0, \sigma^2)$ to augment $e^S_t$ as follows:

$$e^y_t = e^S_t + \epsilon,$$

where $\sigma^2$ is the noise variance. In addition, since $e^y_t$ is an augmentation of $e^S_t$, it might not correspond to any state in the state space $\mathcal{S}_S$ of the source task. Thus, the resulting $e^y_t$ is not used as an input to G to generate an action; it is only used to help compute the loss $\mathcal{L}_E$. This means that $e^y_t$ can be treated as a constant. In other words, the gradient flowing back from $e^y_t$ is unnecessary in the backpropagation. This can be indicated using the stop-gradient operation $\mathrm{sg}(\cdot)$ as follows [58,59]:

$$e^y_t = \mathrm{sg}\big(e^S_t + \epsilon\big).$$
With the generated action $\hat{a}^S_t$, the full objective function used to train the proposed agent on the source task is

$$\mathcal{L} = \mathcal{L}_E + \mathcal{L}_{GAN}.$$
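As a small illustration (an assumption-based sketch, not the authors' code), the noise-augmented second embedding with its stop-gradient can be expressed in PyTorch with `detach()`:

```python
# sg(e + eps): add Gaussian noise to the source embedding and block gradients,
# so the augmented copy acts as a constant target in the contrastive loss.
import torch


def augmented_embedding(e_src: torch.Tensor, sigma: float = 0.1) -> torch.Tensor:
    """Return sg(e_src + eps) with eps ~ N(0, sigma^2); sigma is an assumed hyperparameter."""
    eps = torch.randn_like(e_src) * sigma
    return (e_src + eps).detach()
```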
The algorithm to train the proposed agent on the source task is outlined in Algorithm 1.
Algorithm 1 Training the proposed agent on the source task.
1: Input:
2: A set of expert demonstrations on the source task
3: Randomly initialize the task-embedding network E, the generator G, and the discriminator D
4: for k = 0, 1, 2, … do
5:     Sample an expert demonstration from the source demonstration set
6:     Sample state–action pairs $(s^S_t, a^S_t)$ from the demonstration
7:     Compute the task embedding $e^S_t = E(s^S_t)$
8:     Compute the augmented embedding $e^y_t = \mathrm{sg}(e^S_t + \epsilon)$
9:     Generate the action $\hat{a}^S_t = G(e^S_t)$
10:    Compute the loss $\mathcal{L} = \mathcal{L}_E + \mathcal{L}_{GAN}$
11:    Update the parameters of E, G, and D
12:    Update the policy with the reward signal provided by D
13: end for
14: Output:
15: Learned policy for the source task
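For concreteness, the loop below mirrors the structure of Algorithm 1 using the sketches introduced earlier (TaskEmbeddingNet, Generator, Discriminator, embedding_loss, adversarial_losses, augmented_embedding); `sample_batch` is a hypothetical helper, and the optimizer choice, update order, and the omitted policy-gradient step are simplifying assumptions:

```python
# Sketch of Algorithm 1: jointly train E, G, and D on source-task demonstrations.
import torch


def train_source_task(E, G, D, source_demos, iterations: int = 10_000, lr: float = 3e-4):
    opt_eg = torch.optim.Adam(list(E.parameters()) + list(G.parameters()), lr=lr)
    opt_d = torch.optim.Adam(D.parameters(), lr=lr)
    for _ in range(iterations):
        state, expert_action = sample_batch(source_demos)   # hypothetical sampling helper
        e_src = E(state)                                     # task embedding e_t
        fake_action = G(e_src)                               # generated action
        d_loss, g_loss = adversarial_losses(D, state, expert_action, fake_action)

        # Update E and G: adversarial term plus the contrastive embedding term,
        # using the stop-gradient noise augmentation as the second embedding input.
        e_aug = augmented_embedding(e_src)
        eg_loss = g_loss + embedding_loss(e_src, e_aug, same_task=True)
        opt_eg.zero_grad()
        eg_loss.backward()
        opt_eg.step()

        # Update the discriminator on expert vs. (detached) generated actions.
        opt_d.zero_grad()
        d_loss.backward()
        opt_d.step()
        # A policy-gradient update that uses the discriminator's score as the reward
        # signal would follow here; it is omitted from this sketch.
```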
4.2. The Proposed Task Adaptation Algorithm
Leveraging the task embedding space learned by the proposed agent, a novel adaptation algorithm is presented in order to adapt the agent to a new target task by adding the knowledge of the target task to the task-embedding space, as shown in Figure 2. In addition, to prevent losing the previously learned knowledge of how to perform the source task, a novel idea based on repetition learning is applied in the proposed adaptation algorithm. The idea is illustrated in Figure 3. The intuition behind this idea is that, during the adaptation process, the agent is allowed to repeatedly review how to perform the previously learned source task while learning the target task. Each time the agent switches to a different task, its performance drops, but then it recovers. This distinctive learning process allows the agent to continuously review its learned knowledge and to generalize to both source and target tasks, resulting in an agent that can perform well on both tasks. This is similar to humans: when humans repeatedly practice an action, it leads to better performance. In addition, the process enables the agent to surpass the performance of an agent that is adapted using transfer learning. As shown in Figure 3, using transfer learning, the adapted agent completes its adaptation process right after adapting from the source task to the target task. For this reason, when facing the source task again after adaptation, the performance of the agent deteriorates due to the catastrophic forgetting problem.
It is important to note that, theoretically, the more knowledge the agent gains, the higher the performance it can achieve on both source and target tasks. As shown in Figure 3, after facing the source task again, the performance of the agent on the source task increases. However, in practice, there is still some performance deterioration on the source task, since the agent is not able to fully utilize the learned knowledge. This observation is further discussed in the evaluation and discussion sections.
In this paper, a hyperparameter $\rho$ is introduced, which denotes the probability that the agent repeatedly reviews the source task's knowledge. With $\rho$, the balance between the performance on the target task and the performance deterioration on the source task can be controlled. For instance, the higher the value of $\rho$, the higher the probability that the agent reviews the previously learned source task, resulting in a smaller deterioration of the source task's performance in exchange for lower performance on the target task. It should be noted that if $\rho = 0$, the proposed task adaptation algorithm reduces to a transfer learning method that focuses only on improving the target task's performance. The task adaptation algorithm is outlined in Algorithm 2.
Algorithm 2 The proposed adaptation algorithm.
1: Input:
2: A set of expert demonstrations on the target task
3: A set of expert demonstrations on the source task
4: Randomly initialize the task-embedding network E, the generator G, and the discriminator D
5: for k = 0, 1, 2, … do
6:     Sample an expert demonstration on the target task
7:     Sample an expert demonstration on the source task
8:     Sample state–action pairs from the target demonstration and from the source demonstration
9:     n ← uniform random number between 0 and 1
10:    if n < $\rho$ then ▹ Review the source task's learned knowledge
11:        Compute the task embedding of the sampled source state
12:        Compute the second task embedding input for the contrastive loss
13:        Generate the action for the source task
14:        Compute the loss
15:    else ▹ Learn the target task
16:        Compute the task embedding of the sampled target state
17:        Compute the second task embedding input for the contrastive loss
18:        Generate the action for the target task
19:        Compute the loss
20:    end if
21:    Update the parameters of E, G, and D
22:    Update the policy with the reward signal provided by D
23: end for
24: Output:
25: Learned policy for both source and target tasks
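The per-iteration branching of Algorithm 2 can be summarized with the short sketch below; `sample_batch` and `update_step` are hypothetical helpers standing in for the sampling and the combined embedding/adversarial/policy update of the previous subsection, and rho = 0.3 is an arbitrary example value:

```python
# Sketch of the adaptation loop: with probability rho the agent reviews the
# source task, otherwise it learns the new target task.
import random


def adapt(E, G, D, source_demos, target_demos, rho: float = 0.3, iterations: int = 10_000):
    for _ in range(iterations):
        if random.random() < rho:
            batch = sample_batch(source_demos)   # review the source task's knowledge
        else:
            batch = sample_batch(target_demos)   # learn the target task
        update_step(E, G, D, batch)              # one embedding + adversarial + policy update
```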
6. Discussion
In this section, the effects of applying repetition learning on the performance of the proposed method and the important role of the task embedding network E are discussed in detail.
The experimental results assessed in the previous section have shown the potential of the proposed adaptation method in tackling the task adaptation problem in imitation learning. As shown in Table 3 and Table 4, the proposed method could provide consistent and high performance in terms of success rate and average cumulative reward on both source and target tasks with varying difficulty levels. This indicates that the proposed method can be applied to more challenging tasks with larger state and action spaces. Moreover, Table 5 shows that the performance deterioration on the source task was also reduced compared to the transfer learning baselines. This promising result demonstrates the effectiveness of the proposed adaptation method, in which the idea of repetition learning was leveraged in order to allow the agent to review the previously learned source task. Although the success rate and training time remain limited, the proposed method presents a plausible approach to tackle the task adaptation problem in imitation learning, and it can be further improved to provide better overall performance on practical imitation learning tasks.
In order to support the idea of repetition learning, an imitation learning agent was proposed, which is able to encode its learned knowledge into a task-embedding space. To provide an ablation study of the task embedding network E in the proposed agent, a small experiment was conducted, in which a number of task embeddings were collected by executing the adapted agent in the WindowOpen–WindowClose experiment on both the source task (i.e., WindowOpen) and the target task (i.e., WindowClose). The WindowOpen–WindowClose experiment was chosen because its source and target tasks are similar and have a large and equal state space size, which can provide a meaningful ablation result. In each task, the adapted agent was run in the simulation over 100 trials. After that, t-distributed stochastic neighbor embedding (t-SNE) was applied in order to project the collected high-dimensional task embeddings to a two-dimensional space for visualization, as shown in Figure 7. t-SNE captures the distance relations between task embeddings: if two embeddings are close in the task-embedding space, they stay close in the resulting visualization, and vice versa. Therefore, from Figure 7, it can be seen that the task embeddings of the source and target tasks are well separated. At the same time, Figure 7 also shows that some target task embeddings are mixed with the source task embeddings. This was expected, since the WindowOpen and WindowClose tasks share the same structure (i.e., a robot hand and a window); thus, these target task embeddings represent the shared knowledge between the source and target tasks. This result indicates that the proposed adaptation method not only successfully expands the task embedding space without forgetting the previously learned knowledge, but also leverages the source task's knowledge in order to accelerate adaptation to the new target task. This leads to the high performance on the target task shown in Table 4 and the low performance deterioration on the source task shown in Table 5.
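A minimal sketch of this visualization step is shown below, assuming the collected source and target task embeddings are already available as NumPy arrays; the t-SNE hyperparameters are illustrative:

```python
# Project collected task embeddings to 2-D with t-SNE and plot them.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE


def plot_task_embeddings(source_emb: np.ndarray, target_emb: np.ndarray) -> None:
    """source_emb, target_emb: arrays of shape (num_embeddings, embed_dim)."""
    all_emb = np.concatenate([source_emb, target_emb], axis=0)
    proj = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(all_emb)
    n = len(source_emb)
    plt.scatter(proj[:n, 0], proj[:n, 1], label="WindowOpen (source)", alpha=0.6)
    plt.scatter(proj[n:, 0], proj[n:, 1], label="WindowClose (target)", alpha=0.6)
    plt.xlabel("t-SNE dimension 1")
    plt.ylabel("t-SNE dimension 2")
    plt.legend()
    plt.show()
```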
Although the novel idea of applying repetition learning and encoding the task knowledge into a task embedding has significantly improved the adapted agent on both tasks, there is still one limitation. As shown in Figure 3, ideally, the adapted agent should be able to perform both source and target tasks better over time and eventually surpass its performance on the source task before being adapted. However, as shown in the experimental results, there was some deterioration in the source task's performance; thus, the proposed method is still limited compared to human learning ability. Overcoming this problem can serve as a key step toward building a continual learning agent, in which the agent can learn and adapt to not only one but multiple target tasks. In future work, this will be the main focus of the authors in order to provide a general-purpose agent that can become a better learner over time, i.e., learning new tasks better and faster, and performing better on previously learned tasks.