1. Introduction
Within the context of the considerable attention paid to transfer learning in the artificial intelligence (AI) field, reinforcement learning (RL) has come to the forefront given its potential to adapt previously acquired knowledge to new related tasks or environments [
1]. Significantly, RL necessitates that agents acquire optimal decision-making strategies through iterative interactions with the environment, raising the issue of sample efficiency [
2]. By leveraging the capabilities of transfer learning, RL agents can substantially diminish the temporal and computational resources needed for them to adapt to new circumstances, thereby enhancing the efficacy and resilience of their learning algorithms. Broadly speaking, the implementation of transfer learning in RL enables RL agents to decompose a singular task into several constituent subtasks and subsequently leverage the knowledge amassed from these subtasks to expedite the learning process for related tasks [
3,
4,
5]. Nonetheless, in instances where the subtasks are not correlated, transfer learning remains unattainable. Therefore, in this paper, we focus on distinct scenarios with different reward functions but uniform subtask dynamics.
As a crucial brain structure involved in memory and knowledge transfer, the hippocampus has been extensively studied in order to understand its role in facilitating learning across various cognitive domains [
6,
7,
8,
9,
10]. This process of knowledge transfer shares similarities with the concept of transfer learning in RL algorithms, and successor representation (SR) has emerged as an explainable algorithmic theory that may provide insights into the function of the hippocampus in transfer learning [
7,
11,
12,
13]. In alignment with the neural activity observed in the hippocampus, SR is a predictive model of the environment that emphasizes the relationships between current states and potential future states, with noteworthy implications for transfer learning in RL algorithms. By incorporating SR as an underlying mechanism, RL agents can potentially leverage the same principles that govern hippocampal function in knowledge transfer, resulting in more efficient and robust learning algorithms. This connection between neural correlates and RL provides researchers with the opportunity to explore the application of SR to RL further and investigate its potential to enhance the capability of transfer learning.
Through linear approximation, successor features (SFs) extend SR algorithms and enhance their applicability in transfer learning by addressing the reward dynamics and enabling them to learn from environmental structures independently [
14,
15,
16,
17,
18]. With the employment of a temporal difference (TD) learning algorithm, integrating an eligibility trace into SFs becomes a seamless extension [
2]. Indeed, investigations have revealed that SFs enhanced with an eligibility trace, that is, predecessor features (PFs), surpass the performance of standalone SFs [
19,
20]. PFs are specifically designed to assess the influence of an agent's past state on its current state, underscoring the relationship between historical experiences and present circumstances. In terms of TD learning rules, SFs adhere to a constrained variant, TD(0), while the more general form of the learning rule, TD(λ), applies to PFs. In spite of the extensive research on SFs within the realm of transfer learning, to the best of our knowledge, the application of PFs has been examined notably less.
In RL, the hyperparameters play a pivotal role in transfer learning tasks, determining the success or failure of knowledge transfer between tasks. For success, the appropriate hyperparameter configurations must be identified, as they directly influence a learning agent’s ability to adapt to and generalize across diverse settings. Inappropriate hyperparameter settings may hinder an agent’s ability to recognize and exploit commonalities between tasks or environments, leading to suboptimal performance and reduced learning efficiency [
3]. Conversely, well-tuned hyperparameters enable agents to effectively capitalize on shared structures and rapidly adapt to novel situations, thereby reducing the time and computational resources needed for high performance. However, most of the existing research has focused on developing novel transfer learning techniques and evaluating their performance in specific tasks and environments [
3,
4,
5]. Although hyperparameter optimization is evidently important in transfer learning, the current state of the research in this area remains limited.
Previous works have investigated the sensitivity of deep neural networks to noise and quality distortions such as blur and pixelation to observe their effect on deep learning models, particularly in image recognition tasks [
21,
22,
23]. These studies have shown that noise, such as that encountered by RL agents navigating complex environments, can significantly degrade the model performance. Equally, given the well-documented problem of working with small amounts of data in intelligent systems, recent research has emphasized the need for robust models that can operate under data sparsity and variability [
24]. Our approach addresses these issues by optimizing adaptive learning in noisy environments and limited data scenarios. Advanced approaches such as transformer-based few-shot learning have shown promise in dealing with noisy labels and varying conditions, demonstrating the potential of adaptive algorithms in complex environments [
25]. Furthermore, by combining a conditional variational autoencoder and generative adversarial networks (GANs), Li et al. [
26] developed a zero-shot detection method for unmanned aerial vehicle sensors. This approach represented an innovative strategy for identifying faults with minimal training data, showcasing the potential of advanced machine learning models to handle data complexity and sparsity. Overall, these studies highlight the importance of addressing noise in sensor data to improve the resilience and adaptability of RL agents in noisy environments.
In this investigation, we focus on the critical role of hyperparameter tuning in enhancing the effectiveness of transfer learning strategies, particularly concerning PF and SF algorithms. The real-world implications of RL underscore the need to assess the performance of these algorithms in noisy environments and their potential to augment transfer learning. We previously identified a complex relationship between noise levels and PF performance that was notably influenced by the λ parameter [27]. Motivated by these findings, we seek here to further explore how the hyperparameters can impact the efficiency of transfer learning, especially in noisy conditions, and to pave the way for more adaptive and robust RL applications. This study offers novel contributions by (1) evaluating the impact of the hyperparameters on adaptive behavior in noisy spatial learning environments using a T-maze; (2) introducing an SF- and PF-based framework for comparing and quantifying sensitivity to noise and adaptation; and (3) identifying the most robust hyperparameter configurations under noisy and variable conditions.
In this study, we evaluated the adaptive performance in a noisy T-maze using a Markov decision process (MDP) framework and a PF learning algorithm.
Section 2 discusses the noise-related challenges in RL, the benefits of enhancing its efficiency using transfer learning, and the crucial role of hyperparameter optimization in improving the performance and robustness of RL algorithms.
Section 3 outlines our methodology, including the T-maze setup, hyperparameters, and performance metrics.
Section 4 details the results of testing 25 combinations of hyperparameters and their reward acquisition, step lengths, and adaptability.
Section 5 discusses the impact of the key hyperparameters, the reward learning rate α_r and the eligibility trace decay rate λ, on adaptation from both algorithmic and neuroscientific perspectives. We conclude by outlining the limitations of our study and our directions for future research seeking to understand learning algorithms and optimize them further within AI and neuroscience contexts.
3. Materials and Methods
3.1. Markov Decision Processes
In this study, we operate under the assumption that RL agents engage with their environment through MDPs. The MDPs employed in this research are represented by the tuple (S, A, R, γ), which encompasses the following components: the set S denotes the states, encapsulating information pertaining to the environment; the set A signifies the range of actions that an agent is capable of taking; the function R(s) assigns the immediate reward obtained in a specific state s; and the discount factor, γ, is responsible for controlling the value attributed to future rewards.
The primary objective of the RL agent is to optimize the policy function π so that it maximizes the cumulative discounted reward, or return, G_t = Σ_{k=0}^{∞} γ^k r(s_{t+k}), where the immediate reward r(s_t) is a function of the state s_t. To tackle this challenge, dynamic programming techniques are employed to define and calculate a value function V^π(s), which depends on the chosen policy π. This value function is approximated by a function V_w(s), which is parameterized by a weight vector w. This weight vector is updated using TD learning as follows: w ← w + α δ_t ∇_w V_w(s_t), where δ_t = r_t + γ V_w(s_{t+1}) − V_w(s_t) denotes the one-step TD error and α the learning rate.
TD(0) refers to an algorithm that employs the one-step TD update rule mentioned above. In contrast, TD(λ) utilizes an eligibility trace to integrate past experiences into the update process. The update mechanism for TD(λ) is expressed as w ← w + α δ_t e_t, where δ_t represents the TD error and e_t denotes the eligibility trace, updated as e_t = γλ e_{t−1} + ∇_w V_w(s_t), with λ serving as the decay parameter for the trace.
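To make the update concrete, the following minimal Python sketch implements a tabular TD(λ) step with an accumulating eligibility trace; the function name td_lambda_update, the five-state chain, and all parameter values are illustrative assumptions rather than details of our implementation.

```python
import numpy as np

def td_lambda_update(V, e, s, s_next, r, alpha=0.1, gamma=0.95, lam=0.5):
    """Perform one TD(lambda) step on a tabular value function V with eligibility trace e."""
    delta = r + gamma * V[s_next] - V[s]   # TD error
    e[s] += 1.0                            # accumulate the trace for the visited state
    V += alpha * delta * e                 # credit all recently visited states
    e *= gamma * lam                       # decay the trace
    return V, e

# Hypothetical 5-state chain: the trace lets the terminal reward propagate
# back to earlier states within a single episode.
V, e = np.zeros(5), np.zeros(5)
for s, s_next, r in [(0, 1, 0.0), (1, 2, 0.0), (2, 3, 0.0), (3, 4, 1.0)]:
    V, e = td_lambda_update(V, e, s, s_next, r)
print(V)
```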
3.2. Predecessor Feature Learning
In this section, a brief introduction to the implementation of PFs in RL algorithms is provided. Given that this topic has been thoroughly discussed in previous papers [
19,
27], here, we present a concise overview of their main points and their relevance to the current study.
The core idea in SF and PF learning is that the value function V^π(s), which depends on a policy function π, can be decomposed into the expected visiting occupancy M(s, s′) and the reward of the successor state r(s′) as follows:

V^π(s) = Σ_{s′} M(s, s′) r(s′),        (1)

where the matrix M, which is referred to as the SR, symbolizes the discounted expectation of transitioning from a given state s to its successor s′.
To estimate M and r, we can employ a weight matrix W and a weight vector w_r to factorize them as M(s, s′) ≈ φ(s)ᵀ W φ(s′) and r(s) ≈ φ(s)ᵀ w_r. The state feature vector φ(s) can be depicted as a one-hot vector within the tabular environment, in which case W directly plays the role of the SR matrix M.
Through employing the TD(λ) learning rule, the matrix W can be updated as follows:

W ← W + α_w e_t [φ(s_t) + γ Wᵀφ(s_{t+1}) − Wᵀφ(s_t)]ᵀ,    e_t = γλ e_{t−1} + φ(s_t),        (2)

where e_t and α_w represent the eligibility trace of the TD(λ) learning rule and the feature weight learning rate, respectively. Instances with λ = 0 correspond to SFs, while those with λ > 0 pertain to PFs.
To update w_r so that it learns the factorized reward, the TD learning rule is employed as follows:

w_r ← w_r + α_r [r_t − φ(s_{t+1})ᵀ w_r] φ(s_{t+1}),        (3)

where α_r represents the learning rate for the reward vector.
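As a brief illustration of Equation (1), the NumPy sketch below builds the SR of a small deterministic chain in closed form, M = (I − γP)⁻¹, and recovers the state values from a factorized reward vector; the chain, the variable names, and the closed-form construction are illustrative and not part of the experimental pipeline.

```python
import numpy as np

gamma = 0.95
n_states = 4
phi = np.eye(n_states)                      # one-hot tabular feature vectors, phi[s]

# Transition matrix of a simple 4-state chain that ends in the last state
P = np.zeros((n_states, n_states))
P[0, 1] = P[1, 2] = P[2, 3] = 1.0

# Under a fixed policy, the SR has the closed form M = (I - gamma * P)^(-1)
M = np.linalg.inv(np.eye(n_states) - gamma * P)

w_r = np.array([0.0, 0.0, 0.0, 1.0])        # factorized reward: reward only in the last state

# Equation (1): V(s) = sum over s' of M(s, s') * r(s')
V = np.array([phi[s] @ M @ w_r for s in range(n_states)])
print(V)                                    # values shrink by gamma per step away from the reward
```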
3.3. Experimental Design for Transfer Learning
To evaluate the efficacy of the PF learning algorithm for transfer learning, we devised a T-maze environment with noisy observation vectors. The T-maze was implemented within a 9 × 9 grid world, with the agent beginning at the bottom center (
Figure 1A). An episode ends when the agent obtains a reward of 1, situated at the end of either the left or right arm. A minimum of 12 steps are required for the agent to acquire a reward, and if it reaches 500 steps, the episode ends with a reward of 0. To examine the effectiveness of transfer learning in this T-maze environment, we alternated the reward’s position to the opposing arm every 20 episodes (
Figure 1B). The agent’s action space comprised four possible actions: moving up, down, left, or right.
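For concreteness, the Python sketch below shows one way such a grid-world T-maze with a periodically switching reward arm could be set up; the class name TMaze, the coordinates, and the movement rules (corridor walls are omitted for brevity) are our own illustrative assumptions, not the exact environment code. With the start at the bottom center and the arm ends at the top corners, the shortest path to a reward is 12 steps, consistent with the description above.

```python
import random

class TMaze:
    """9 x 9 grid world with a T-shaped layout and a reward that switches arms."""
    def __init__(self, switch_every=20, max_steps=500):
        self.start = (8, 4)                               # bottom center (row, col)
        self.arm_ends = {"left": (0, 0), "right": (0, 8)}  # ends of the two arms
        self.switch_every = switch_every
        self.max_steps = max_steps
        self.episode = 0

    def reward_location(self):
        # Alternate the rewarded arm every `switch_every` episodes
        arm = "left" if (self.episode // self.switch_every) % 2 == 0 else "right"
        return self.arm_ends[arm]

    def run_episode(self, policy):
        """`policy` maps a state (row, col) to one of the four moves."""
        moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
        state, goal = self.start, self.reward_location()
        for step in range(1, self.max_steps + 1):
            dr, dc = moves[policy(state)]
            state = (min(max(state[0] + dr, 0), 8), min(max(state[1] + dc, 0), 8))
            if state == goal:
                self.episode += 1
                return 1.0, step              # reward of 1 at the end of an arm
        self.episode += 1
        return 0.0, self.max_steps            # episode truncated with reward 0

# Example: a uniformly random policy interacting with the maze for one episode
env, rng = TMaze(), random.Random(0)
reward, steps = env.run_episode(lambda state: rng.choice(["up", "down", "left", "right"]))
print(reward, steps)
```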
3.4. Gaussian Noise in State Observations
A one-hot state feature vector φ(s) representing the agent's current state was provided (Figure 2). Given the common occurrence of sensor noise in real-world contexts, to increase the complexity and realism of the T-maze environment, we introduced noise into φ(s). As previously described [27], we incorporated noise solely into the state observations while keeping the state transition dynamics of the environment unchanged. A Gaussian noise term was added to the observation vector o(s) received by the agent in each state as follows:

o(s) = φ(s) + ε,    ε ~ N(0, σ²I),        (4)

where ε denotes a Gaussian noise vector with a mean of zero and the covariance matrix σ²I, and I represents the identity matrix. Our prior research indicates that excessive levels of noise can obscure discernible differences in performance between agents [27]. Consequently, in this study, we set the noise parameter σ to 0.05, a level previously established as low, to ensure that the comparative analysis was clear.
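A minimal sketch of this observation model, with the function name and state indices chosen purely for illustration, is as follows.

```python
import numpy as np

def noisy_observation(state_index, n_states, sigma=0.05, rng=np.random.default_rng(0)):
    """Return the one-hot feature vector of a state corrupted by Gaussian noise (Equation (4))."""
    phi = np.zeros(n_states)
    phi[state_index] = 1.0
    return phi + rng.normal(loc=0.0, scale=sigma, size=n_states)

print(noisy_observation(state_index=3, n_states=9))
```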
The feature vectors φ(s) utilized in Equations (2) and (3) for learning the PFs were transformed into the observation vectors o(s), allowing the PF algorithm to be described in Algorithm 1.
Algorithm 1 Predecessor feature learning with noisy observations
 1: procedure PF(λ, α_w, α_r, γ)
 2:     initialize W, w_r
 3:     for episode in 1…n do
 4:         s ← initial state of episode
 5:         e ← 0 (eligibility trace reset)
 6:         for pair (s, s′) and reward r in episode do
 7:             o ← φ(s) + ε,  o′ ← φ(s′) + ε′ (noisy observations)
 8:             e ← γλe + o
 9:             δ_W ← o + γWᵀo′ − Wᵀo (feature prediction error)
10:             W ← W + α_w e δ_Wᵀ
11:             δ_r ← r − o′ᵀw_r (reward prediction error)
12:             w_r ← w_r + α_r δ_r o′
13:     return W, w_r
In line with our previous finding that random initialization improves the learning efficiency, the weight matrix W was initialized using random variables [42].
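For readers who prefer executable code, the sketch below is a compact NumPy rendition of Algorithm 1 under the assumptions stated above (one-hot features, an accumulating trace, and a randomly initialized W); the function name pf_learning and the episode interface are illustrative rather than the exact experimental code.

```python
import numpy as np

def pf_learning(episodes, n_states, lam=0.5, alpha_w=0.1, alpha_r=0.1,
                gamma=0.95, sigma=0.05, seed=0):
    """Predecessor feature learning with noisy one-hot observations (sketch of Algorithm 1).

    `episodes` is an iterable of episodes, each a list of (s, s_next, r) tuples.
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(n_states, n_states))     # random initialization of W
    w_r = np.zeros(n_states)

    def obs(s):
        phi = np.zeros(n_states)
        phi[s] = 1.0
        return phi + rng.normal(scale=sigma, size=n_states)  # Gaussian sensor noise, Eq. (4)

    for episode in episodes:
        e = np.zeros(n_states)                       # eligibility trace reset
        for s, s_next, r in episode:
            o, o_next = obs(s), obs(s_next)
            e = gamma * lam * e + o                  # TD(lambda) trace
            delta_w = o + gamma * W.T @ o_next - W.T @ o   # feature prediction error
            W += alpha_w * np.outer(e, delta_w)      # Equation (2)
            delta_r = r - o_next @ w_r               # reward prediction error
            w_r += alpha_r * delta_r * o_next        # Equation (3)
    return W, w_r
```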
3.5. Hyperparameters of the Algorithm
In order to assess the influence of the hyperparameters, specifically λ and α_r, on the agent's transfer learning capabilities in a noisy T-maze environment, a series of 100 trials was conducted, each encompassing 100 episodes. We selected the hyperparameters λ and α_r because of their crucial roles in shaping the adaptive behavior of RL agents, particularly in noisy environments. The eligibility trace decay rate λ controls how past states and actions influence the current learning through credit assignment over time. Preliminary experiments and our previous findings [27] demonstrated that varying λ significantly affects the performance of the agent in noisy conditions, enhancing its adaptability by appropriately balancing the influence of past experiences. The reward learning rate α_r dictates the pace at which the agent updates its reward expectations, balancing learning speed with stability. Effectively tuning α_r is vital in noisy environments: lower values stabilize learning by reducing the sensitivity to noise, while higher values enable quicker adaptation to changing reward patterns.
We examined five distinct values for each of the two hyperparameters, λ and α_r, and analyzed the transfer learning efficacy of the 25 unique agents obtained from the unique combinations of these two hyperparameters. Notably, λ = 0 represents the SF algorithm, with which the learning is not influenced by an eligibility trace and the immediate state–action–reward relationships are effectively isolated. This setting provides a baseline against which the impact of eligibility traces can be assessed [14]. Additionally, these parameter ranges facilitate the robust exploration of how varying the level of credit assignment (λ) and the reward learning rate (α_r) affects the agent's adaptability, particularly under noisy conditions. Studies have shown that by adjusting these hyperparameters, a trade-off between learning speed and stability can be achieved and complex dynamic environments can be managed, in which both the state transitions and reward structures vary significantly [15]. This structured approach, using a grid search on a linear rather than a logarithmic scale, ensures precise tuning, enhancing the overall model performance by allowing the effects of incremental parameter adjustments to be directly and systematically evaluated.
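Organizationally, the 5 × 5 sweep can be expressed as a simple grid search, as in the sketch below; note that the candidate value lists and the run_agent stub are placeholders, not the values or training code used in our experiments.

```python
from itertools import product

def run_agent(lam, alpha_r):
    """Stub standing in for a full training run (100 trials of 100 episodes);
    it would return the evaluation metrics described in Section 3.6."""
    return {"cumulative_reward": None, "adaptation_rate": None}

# Placeholder candidate values; five values per hyperparameter yields 25 agents.
lambda_values = [0.0, 0.25, 0.5, 0.75, 1.0]     # hypothetical grid for lambda
alpha_r_values = [0.01, 0.05, 0.1, 0.2, 0.4]    # hypothetical grid for alpha_r

results = {(lam, a_r): run_agent(lam, a_r)
           for lam, a_r in product(lambda_values, alpha_r_values)}
print(len(results))   # 25 hyperparameter combinations
```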
In our experiments, the ε-greedy policy was implemented as the agent's policy function. This policy alternates between exploring random actions with a probability of ε and exploiting the known action with the highest Q-value estimate with a probability of 1 − ε. To achieve a balance between exploration and exploitation, the probability ε decayed over the episodes as a function of the episode index k, while the feature weight learning rate α_w and the discount factor γ were set to 0.1 and 0.95, respectively.
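A minimal sketch of such an ε-greedy selection rule with an episode-dependent ε is given below; the 1/k decay schedule is an illustrative assumption, since several common schedules fit the description above.

```python
import numpy as np

def epsilon_greedy(q_values, episode_index, rng=np.random.default_rng(0)):
    """Pick an action from Q-value estimates with an epsilon that decays over episodes."""
    epsilon = 1.0 / episode_index                 # illustrative decay schedule, epsilon_k = 1/k
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore: random action
    return int(np.argmax(q_values))               # exploit: highest Q-value estimate

# Example: four actions (up, down, left, right) in episode k = 10
print(epsilon_greedy(np.array([0.1, 0.5, 0.2, 0.0]), episode_index=10))
```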
3.6. Evaluation Metrics for Transfer Learning
In this study, we employed four evaluation metrics to assess the performance of agents trained with different hyperparameters in transfer learning. These metrics were as follows:
Cumulative reward: The total reward accumulated by the agent over the course of the episodes, which serves as an indicator of the agent’s overall performance in the task.
Step length to the end of the episode: The number of steps it took the agent to reach the end of an episode, reflecting its ability to efficiently navigate the environment.
Adaptation rate: An indicator of the number of episodes an agent took to adapt after the reward location was switched. It was defined as the index of the episode, counted from the switch, at which the agent had reached the reward location in five consecutive episodes. For example, if an agent first achieved this level of performance by the 10th episode after the reward location shifted, its adaptation rate was 10; the optimal (minimum possible) adaptation rate is therefore 5. A lower adaptation rate implies swifter adaptation of the agent to the new reward location, whereas a higher adaptation rate denotes a more gradual process of adjustment (see the sketch following this list for how this metric and the adaptation step length are computed).
Adaptation step length: This measure evaluates the number of steps required by an agent to reach its defined adaptation rate. For instance, with the adaptation rate set at 10, this metric calculates the steps necessary to successfully navigate 10 episodes following a change in the reward’s location. Utilizing a calculation approach akin to that for the adaptation rate but focused on steps rather than episodes, this metric offers an alternative view of an agent’s ability to adjust to environmental changes.
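The two adaptation metrics can be computed from per-episode outcomes as in the following sketch (our own helper, not the original analysis code); success_flags[i] indicates whether the agent reached the reward in the i-th episode after a switch, and steps[i] is that episode's step length.

```python
def adaptation_metrics(success_flags, steps, streak=5):
    """Return (adaptation rate, adaptation step length) after a reward switch.

    The adaptation rate is the 1-based episode index at which the agent has
    reached the reward in `streak` consecutive episodes; the adaptation step
    length is the total number of steps taken up to and including that episode.
    """
    run = 0
    for i, success in enumerate(success_flags, start=1):
        run = run + 1 if success else 0
        if run == streak:
            return i, sum(steps[:i])
    return None, None   # the agent never adapted within the observed episodes

# Hypothetical outcomes for the first 10 episodes after the reward moved
flags = [0, 0, 1, 0, 1, 1, 1, 1, 1, 1]
steps = [500, 500, 40, 500, 30, 20, 16, 14, 12, 12]
print(adaptation_metrics(flags, steps))   # -> (9, 1632)
```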
In this study, we examined the relationships between each evaluation metric and the hyperparameters, particularly λ and α_r, using Spearman's correlation analysis from the SciPy library (version 1.7.3) [43] and the linear regression model from the statsmodels library (version 0.13.5) [44].
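In Python, this analysis corresponds roughly to the following calls, with illustrative placeholder arrays standing in for the per-agent metric values; scipy.stats.spearmanr and the statsmodels OLS model are the tools referred to above.

```python
import numpy as np
from scipy.stats import spearmanr
import statsmodels.api as sm

# Illustrative data: one row per agent (hyperparameter setting) with a placeholder metric
lam = np.repeat([0.0, 0.25, 0.5, 0.75, 1.0], 5)           # hypothetical lambda values
alpha_r = np.tile([0.01, 0.05, 0.1, 0.2, 0.4], 5)          # hypothetical alpha_r values
cumulative_reward = np.random.default_rng(0).random(25)    # placeholder metric values

# Spearman's rank correlation between a hyperparameter and a metric
rho, p_value = spearmanr(alpha_r, cumulative_reward)
print(rho, p_value)

# Linear regression of the metric on both hyperparameters
X = sm.add_constant(np.column_stack([lam, alpha_r]))
model = sm.OLS(cumulative_reward, X).fit()
print(model.summary())
```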
5. Conclusions and Future Work
The present study systematically investigated the effects of different hyperparameter values on the performance, efficiency, and adaptation of agents in the context of transfer learning in a T-maze environment. Our results revealed that higher α_r levels were consistently associated with improved cumulative rewards and adaptation metrics, while varying the λ values had a less consistent influence across different metrics.
In this section, we aim to (1) discuss the impact of the hyperparameters on agent adaptability and performance, delving into the role of α_r and λ in modulating the trade-off between learning efficiency and effective forgetting; (2) delve into the neuroscientific implications of our findings, drawing parallels between the mechanisms that underpin the performance of agents in RL and biological learning processes; (3) acknowledge the limitations of the current study and propose avenues for future research; and (4) summarize the key takeaways from this study.
5.1. Hyperparameters
As may be evident, our findings demonstrate that a higher reward learning rate α_r contributes to enhanced efficiency in learning transitions because it centers on learning the locations of the reward. On the other hand, the eligibility trace parameter, λ, which retrospectively updates the expected future occupancy of states, has a more nuanced impact in a simple T-maze environment. This is because even when the reward position switches to the other arm, the same state transition matrix applies until the agent reaches the bifurcation from which the arms lead. This observation highlights the importance of considering the specific characteristics of the learning environment when interpreting the influence of hyperparameters on agent performance. In more complex environments, the interplay between λ and α_r might differ, and therefore, the best combination of hyperparameters for maximizing the learning efficiency and transfer capabilities may also change [
2].
The first panel in Figure 6C indicates that higher λ values initially boost the learning efficiency before the reward's location changes. This could be interpreted as λ allowing the learned state values to be forgotten more slowly when the reward's location shifts, which may be disadvantageous in terms of adaptability in dynamic settings. Given that effective forgetting is crucial for flexible learning [45], it is plausible that elevated λ values might hinder adaptability by impeding the process of forgetting, a hypothesis supported by the observation that after multiple relocations of the reward, the optimal performance involves lower λ values.
Our findings position the role of λ in balancing adaptability against the retention of previously acquired information as a prime concern for the development of more efficient learning algorithms in the future. In this respect, research investigating the performance of PF and SF learning algorithms in more diverse and challenging environments could advance our understanding of how hyperparameter optimization can inform effective learning and adaptability.
Indeed, this study contributes significantly to our understanding in the context of T-maze environments by systematically examining the effects of varying the hyperparameter values on agent performance and adaptation. Our results corroborate the emphasis placed in previous studies on the importance of learning rates in RL tasks [2], where higher rates facilitate rapid updates to knowledge and adaptability to environmental changes. However, the impact of λ on agent performance appears to be less consistent, suggesting that eligibility traces affect the learning outcomes differently depending on the complexity of the task and the structure of the environment at hand.
5.2. Neuroscientific Implications
This investigation sought to elucidate the efficacy of transfer learning within PF and SF learning algorithms, particularly in terms of their alignment with the neural processes that have been uncovered within the mammalian brain, and contribute to the development of biologically inspired AI algorithms [
13,
46,
47].
Previous research has demonstrated that mammalian brains employ RL mechanisms for decision-making, learning, and adaptation [
48,
49,
50,
51]. In particular, the dopamine system plays a crucial role in indicating errors in reward predictions and modulating synaptic plasticity [
52,
53]. Similarities between the TD learning algorithm that underpins RL in artificial agents and the functioning of the dopaminergic system in the brain have also been identified [
54,
55,
56,
57,
58]. In this study, the TD learning algorithm involved a balance between PFs and SFs, and the relationship determined between the hyperparameters (λ and α_r) and agent performance may offer insights into the optimal balance of neural processes that facilitate efficient learning and adaptation in biological systems as well.
One study recently published by Bono et al. [59] provided a compelling explanation of how the TD(λ) algorithm is implemented biologically in the hippocampus through synaptic spike-timing-dependent plasticity (STDP). Their insights shed light on our study's findings, as they further elucidate the role of the λ hyperparameter in learning and adaptation. Bono et al. [59] discussed the relationship between λ and various biological parameters, including state dwell times, neuronal firing rates, and neuromodulation. Interestingly, in alignment with psychological studies but in contradiction to conventional RL theory, their study suggests that the discount factor decreases hyperbolically over time. When considered in conjunction with our findings, a potential relationship emerges between the λ hyperparameter's role in learning and adaptation and the time constant of STDP. In our study, the λ hyperparameter appeared to play a similar role in balancing the efficient retention and forgetting of information, with higher values promoting faster retrospective updates but potentially hindering effective forgetting. The neurobiological underpinnings of STDP could therefore be instrumental in balancing efficient learning against the effective erasure of obsolete information.
As regards the biological implementation of the TD(λ) rule, synaptic eligibility traces are well-known phenomena [60]. Higher λ values facilitate more extensive retrospective updates to the state, which could be interpreted as extensions of the eligibility traces over time. Additionally, if we assume that these eligible synapses are updated in response to the arrival of reward signals such as dopamine [61], this aligns with the findings for the α_r parameter in our study. Consequently, we postulate that plasticity updates, contingent upon eligibility tracing and the reward, are tied to the balance between λ and α_r. Future inquiries could delve deeper into these interconnections to foster the development of more biologically congruent AI algorithms and augment our comprehension of the brain's learning mechanisms.
In addition to the mechanisms of the dopaminergic system, our findings may also inform our understanding of the role of the hippocampus in spatial learning and memory. The hippocampus is a brain structure that has been implicated in forming and retrieving spatial memories, as well as in encoding contextual information [
6]. The PFs and SFs used in our study may be analogous to the neural representations of spatial context and future states that are formed in the hippocampus [
12,
13,
59,
62,
63]. Thus, the relationships observed between the hyperparameters and agent performance in our T-maze task could have implications for the neural processes that underlie the formation of and updates to spatial memories and context-dependent decision-making.
5.3. Limitations and Future Research Directions
Although this study uncovered valuable insights into the performance of PF and SF learning algorithms in a T-maze environment and the optimal hyperparameters for this context, it also had several limitations. Firstly, the T-maze environment used is relatively simplistic and may not reflect the complexity involved in real-world tasks, which limits the generalizability of our findings. Moreover, we concentrated on a noise level of 0.05, which may not represent the diverse levels of uncertainty inherent in different domains, thereby complicating the application of our results to other environments and noise levels [
27]. Additionally, given that our study explored a limited range of combinations of λ and α_r, other unexplored combinations might yield different insights or even superior agent performance. Lastly, not all aspects of agent adaptability and learning were covered, as we focused on four evaluation metrics. Including alternative or additional metrics, particularly those related to transfer learning and adaptation, may unveil further distinctions within the learning process.
We focused on evaluating the performance of SFs and PFs within a noisy T-maze environment in this study. In our previous work, SFs and PFs were analyzed and compared against traditional RL methods, such as Q-learning and Q(λ) learning, with SFs and PFs outperforming these methods in noisy environments [
27]. These findings highlight the advantages of SFs and PFs in scenarios where traditional algorithms struggle due to the added complexity of noise. While advanced deep learning methods, including transformers and GANs, have shown significant promise in handling noise [
25,
26], they are not directly applicable to our experimental setup due to the discrete, grid-based nature of the T-maze. They have typically been designed for high-dimensional, continuous-state spaces, which offer more variability and complexity than the structured, low-dimensional T-maze environment. Consequently, methods such as SFs and PFs were more suitable for our task. Despite the constraints of our current comparison, these limitations provide avenues for future research. Transformers and GANs could be integrated and evaluated by expanding our approach to more complex RL environments, uncovering the potential of these advanced deep learning algorithms to enhance adaptability and performance in more sophisticated settings and broadening the applicability of our findings.
This study’s findings serve as a springboard for future research seeking to ascertain the robustness of PF and SF learning algorithms within a wider variety of environments, from complex grid worlds to autonomous vehicles. Developing metrics using concepts such as path optimality or the development of latent cognitive maps could deepen our understanding of agent performance. Furthermore, extending the range of hyperparameters and utilizing optimization techniques such as grid search, random search, or Bayesian optimization could uncover combinations that promote superior learning and adaptability [
40,
41,
64].
High-fidelity neurobiological models could be employed to examine their influence on agent performance, whether applying spiking neural network models or incorporating plasticity rules, such as STDP, into the modeling process [
13,
59,
62,
63]. This approach could lead to a more profound comprehension of neurobiological learning processes, laying the groundwork for in-depth investigations into the role of neuromodulators like dopamine and serotonin in enhancing adaptability and flexibility [
65,
66,
67].
5.4. Conclusions
This study elucidated the role of hyperparameters in augmenting transfer learning and adaptation in a T-maze environment. By delving into the correlation between hyperparameters and evaluation metrics, we have enhanced our comprehension of the determinants that facilitate effective learning and adaptation in AI and machine learning.
Specifically, we observed a positive correlation between elevated α_r values and higher cumulative reward and adaptation metrics, while the effect of λ appeared to be negligible. Conversely, our findings suggest that as the repositioning of the reward progresses, high λ values hinder the acquisition of new knowledge. These observations may aid in the selection and optimization of the hyperparameters when fine-tuning reinforcement learning algorithms.
By bringing biological insights into transfer learning, memory, and efficient forgetting into play, our findings also extend beyond AI. They could pave the way for novel research directions on cognitive disorders such as obsessive–compulsive disorder by improving our algorithmic understanding of learning and memory and their connection to cognitive flexibility.