Adaptive behavior with stable synapses

Cristiano Capone Corresponding author: cristiano0capone@gmail.com Natl. Center for Radiation Protection and Computational Physics, Istituto Superiore di Sanità, 00161 Rome, Italy Luca Falorsi Natl. Center for Radiation Protection and Computational Physics, Istituto Superiore di Sanità, 00161 Rome, Italy PhD Program in Mathematics, Dept. of Mathematics, “Sapienza” University of Rome, 00185 Rome, Italy Maurizio Mattia Natl. Center for Radiation Protection and Computational Physics, Istituto Superiore di Sanità, 00161 Rome, Italy

Abstract

Behavioral changes in animals and humans, as a consequence of an error or a verbal instruction, can be extremely rapid. In machine learning and reinforcement learning, improvements in behavioral performance are usually associated with synaptic plasticity and, more generally, with changes and optimization of network parameters. However, such rapid changes are not consistent with the timescales of synaptic plasticity, suggesting that the responsible mechanism could be a dynamical network reconfiguration. In the last few years, similar capabilities have been observed in transformers, a foundational architecture in machine learning that is widely used in applications such as natural language and image processing. Transformers are capable of in-context learning, the ability to adapt and acquire new information dynamically within the context of the task or environment they are currently engaged in, without the need for significant changes to their underlying parameters. Building upon the notion that something unique within transformers enables the emergence of this property, we claim that it could be supported by gain modulation, a feature extensively observed in biological networks. We propose an architecture composed of gain-modulated recurrent networks that excels at in-context learning, showing abilities inaccessible to standard networks. We demonstrate that we can extend our approach to non-linear and temporal tasks and to reinforcement learning. Our framework contributes to understanding the principles underlying in-context learning and adaptive behavior in both natural and artificial intelligence.

1 Introduction

The study of behavioral adaptation, observed in both humans and animals, has been a longstanding subject of research. The rapidity with which behaviors can change in response to new cues challenges the conventional explanation of synaptic plasticity. This led researchers to explore mechanisms that underlie such flexible adaptations [1].

In-context learning as an emerging property in AI

In-context learning (ICL) is the capacity of a model to adapt its behavior, without weight updates, to solve tasks not encountered during training. Initially, ICL was observed in architectures tailored for few-shot learning [2] or even zero-shot learning [3]. However, the picture changed when it was observed that ICL emerges naturally in large-scale transformers [4, 5]. Their exceptional capability to adapt to contextual information allowed transformer-based architectures to achieve state-of-the-art performance in many domains, such as natural language processing and image analysis [6, 7, 8]. Despite some intriguing explanations [9, 10, 11], the emergence of this property is still not completely understood. In particular, it is not clear how to relate it to the in-context learning abilities observed in biological networks. The major contribution of this work is the definition of a constructive method to induce in-context learning in biologically plausible neural networks, in a broad variety of scenarios.

Limitations of transformer architectures

Despite their success, transformer architectures present many shortcomings, primarily memory requirements that scale quadratically with sequence length. This limits the scalability of transformers and hinders their applicability to tasks requiring the processing of long sequences, such as full-length document analysis or video understanding. To address these problems, several efficient linearized attention models have emerged, characterized by a forward pass executed in an RNN-like manner with constant inference memory costs. Recently, deep linear RNN architectures [12, 13] have yielded notable performance improvements over transformers, particularly in long-sequence tasks. Zucchet et al. [14] explore the efficacy of these deep linear gated RNNs, which incorporate element-wise multiplications, in approximating attention mechanisms and implementing in-context supervised learning.

Biological support for in-context learning

Our investigation extends this premise, aiming to unravel how in-context learning, as observed in deep artificial neural networks, might also manifest in biological recurrent neural networks. Are there unique features in transformers that are also present in biological networks? We claim that in-context learning can be supported by input segregation and dendritic amplification, features extensively observed in biological networks. We argue that these are biologically plausible ingredients capable of implementing a process similar to the attention mechanism present in transformers. Recent findings on dendritic computational properties [15] and on the complexity of pyramidal neuron dynamics [16] motivated the study of multi-compartment neuron models in the development of new biologically plausible learning rules [17, 18, 19, 20]. It has been proposed that segregation of dendritic input [18] (i.e., neurons receive sensory information and higher-order feedback in segregated compartments) and generation of high-frequency bursts of spikes [20] would support backpropagation in biological neurons. In [21] the authors suggest that this neuronal architecture naturally allows for orchestrating “hierarchical imitation learning”, enabling the decomposition of challenging long-horizon decision-making tasks into simpler subtasks. They show a possible implementation of this in a two-level network, where the high-level network produces the contextual signal for the low-level network. Here, we propose an architecture composed of gain-modulated recurrent networks that demonstrates remarkable in-context learning capabilities, which we refer to as “dynamical adaptation”. Specifically, we illustrate that our biologically plausible architecture can dynamically adapt its behavior in response to feedback from the environment without altering its synaptic weights.
We present results for supervised learning of temporal trajectories and reinforcement learning, involving non-trivial input-output temporal relations. This novel architecture aims to bridge the gap between biologically inspired in-context learning and the capabilities of artificial neural networks, offering a promising avenue for advancing our understanding of adaptive behaviour in both natural and artificial intelligence domains.

Our work generalizes and provides a biologically plausible implementation for the type of networks presented in [14]. Notably, our architecture has the same order of magnitude of trainable parameters and hidden units (see Appendix B). In our approach, in-context learning (ICL) can emerge simply by tuning the readout weights of the two involved networks. In contrast, previous implementations require biologically implausible mechanisms of temporal credit assignment for the network weights to converge correctly. We show that we can side-step this problem by dividing the architecture into two separate components and introducing an additional objective, which forces the first network to approximate the gradient of the second.

2 Methods

2.1 Dynamical adaptation replaces synaptic plasticity

Consider a generic learning task, in which the response $y$ to a state $x$ is influenced by a set of parameters $w$. The latter are adjusted as a function of an internal error $e$. This can be written in a generic formulation as follows:

\begin{cases}\tau_{e}\,\dot{e}&=E(e,y,f),\\ \tau_{y}\,\dot{y}&=Y(y,w,x),\\ \tau_{w}\,\dot{w}&=W(w,x,e).\end{cases}

(1)

where $f$ is feedback from the environment (e.g. the reward, the target behavior, …). As an alternative to the standard interpretation of $w$ as synaptic weights of a neuronal network, we consider them as the state variables of a dynamical system coupled to the dynamics of $y$, expressed by the activity of an auxiliary network. For this reason, we also refer to them as “virtual” weights.

2.2 Dynamical supervised learning for temporal trajectory

We consider the task of learning a target temporal trajectory $y^{targ}(t)$. We define $y(t)=\bm{w}\cdot\bm{x}(t)$ as the current estimate of $y^{targ}(t)$, where $\bm{w}$ and $\bm{x}(t)$ are the virtual weights and input vectors respectively, since we extended the formulation to a multidimensional dataset. In this case, there is no stationary projection of the training set; instead, the current value of the target signal itself is projected at every time step $t$. We do not separate the target estimation on a test set from that on a training set; as a consequence, $y(t)=y^{train}(t)$. In this case, learning can be formulated as follows:

\begin{cases}e&=(y^{targ}-y)\\ y&=\bm{x}\cdot\bm{w}\\ \tau_{w}\,\dot{\bm{w}}&=\bm{x}\,e\end{cases}

(2)

where we removed the dependence on time $t$ for simplicity. The operations required are nonlinear, and usually are naturally implemented by the plasticity rule and by the multiplication between presynaptic activity and synaptic weights. However, here $\bm{w}$ are not actual weights but rather dynamical variables.
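To make Eq. (2) concrete, the following is a minimal sketch that integrates the virtual-weight dynamics with a forward Euler scheme. The step size `dt`, the value of `tau_w`, the random inputs and the ground-truth vector `w_true` are illustrative assumptions and are not taken from the experiments in this paper.

```python
import numpy as np

def simulate_virtual_weights(x, y_targ, tau_w=10.0, dt=1.0):
    """Forward-Euler integration of Eq. (2).

    e = y_targ - y,  y = x . w,  tau_w * dw/dt = x * e
    tau_w and dt are illustrative values (assumptions).
    """
    n_steps, n_inputs = x.shape
    w = np.zeros(n_inputs)           # virtual weights, treated as state variables
    errors = np.empty(n_steps)
    for t in range(n_steps):
        y = x[t] @ w                 # current estimate of the target
        e = y_targ[t] - y            # instantaneous error
        w = w + (dt / tau_w) * x[t] * e
        errors[t] = e
    return w, errors

# Toy problem (assumption): the target is a fixed linear function of the input.
rng = np.random.default_rng(0)
x = rng.normal(size=(2000, 3))
w_true = np.array([0.5, -1.0, 2.0])
y_targ = x @ w_true

w_final, errors = simulate_virtual_weights(x, y_targ)
```

In this noise-free toy setting the virtual weights relax towards `w_true` while the error decays, even though no synaptic parameter is involved: `w` is only a state variable updated by the dynamics.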

We propose, as a possible implementation of this, that such non-linear functions are computed by two neural networks, $W_{\Theta^{w}}$ and $Y_{\Theta^{y}}$:

\begin{cases}e&=(y^{targ}-y)\\ y&={\color[rgb]{1,.5,0}Y_{\Theta^{y}}}(\bm{x},\bm{w})\simeq Y(\bm{x},\bm{w})=\bm{x}\cdot\bm{w}\\ \tau_{w}\,\dot{\bm{w}}&={\color[rgb]{0,0.66796875,0}W_{\Theta^{w}}}(\bm{x},e)\simeq W(\bm{x},e)=\bm{x}\,e\end{cases}

(3)

In particular, we used two RNNs (see the following Methods subsections for details) that receive $\bm{x},\bm{w},e$ as inputs and provide the proper output thanks to suitable training of their readout weights ($\Theta^{y}$ and $\Theta^{w}$, following the reservoir computing paradigm). In addition, we introduce the concept of a gain-modulated network. This concept is inspired by the remarkable ability observed in biological neurons, particularly L5 neurons as discussed in previous studies (e.g., [22, 21]). These neurons exhibit the capacity to non-linearly integrate segregated inputs, a process critical for various cognitive functions. More details are given in the section "Gain modulated reservoir computing (GM-RC)". We empirically demonstrate that this network architecture outperforms standard RNNs in approximating and generalising the required virtual update rules described above, suggesting that gain modulation might be an important requirement for the adaptive behaviour observed in biological agents. The formulation shown above can only be used to tackle linear problems, by learning linear relationships between $\bm{x}$ and $y$. However, if we consider that $\bm{x}$ is the activity of another RNN that operates as a reservoir computer extracting nonlinear features of the input sequence, a wider class of tasks can be tackled without changing the constraint of a linear readout $y=\bm{w}\cdot\bm{x}$.

2.3 Dynamical reinforcement learning

In the context of reinforcement learning, our framework outlines a systematic approach for modelling agents that can dynamically adapt their behaviour across diverse environments. We start by defining a policy network, denoted as $\bm{\pi}=\mathrm{softmax}(\bm{y})$, which implements a policy mapping the agent state encoded by the vector $\bm{x}$ to a probability distribution over actions. For the sake of simplicity, we assume a linear agent such that $\bm{y}=\bm{w}\cdot\bm{x}$. This assumption is made without loss of generality, since the agent could be easily extended by resorting to a reservoir computer as an intermediate layer, as described above. The policy depends on the virtual weights $\bm{w}$, determined by the activity of an additional RNN. This auxiliary network adjusts its internal activity based on the rewards received, effectively implementing policy gradient updates of the virtual weights and thereby modulating the agent behaviour in real time. This can be formalized in a formulation that is very similar to the one used above, by changing the definition of $e(t)$ as follows:

\begin{cases}\bm{e}&=r\left(\bm{\mathds{1}}_{a}-\bm{\pi}\right)\\ \bm{y}&=\bm{w}\cdot\bm{x}\\ \tau_{w}\,\dot{\bm{w}}&=\bm{e}\odot\bm{x}\end{cases}

(4)

This is the dynamics obtained by evaluating the policy gradient with respect to the virtual weights $\bm{w}$ . $a(t)$ is an integer value indicating the action at time $t$ among $\mathrm{D}$ possible ones. $\mathds{1}_{a(t)}$ represents the ’one-hot encoded’ action (as defined in [23, 24]) at time $t$ . It is a $\mathrm{D}$ -element vector where the $a(t)$ -th element is one, and all other elements are zero.

This formulation holds for a null discount factor [23, 24]; we refer to Appendix section C for the description of the general case. The above equation can be rewritten so that it refers only to two networks, one for estimating the gradients and one for the scalar product:

\begin{cases}\tau_{w}\,\dot{\bm{w}}&={\color[rgb]{0,0.66796875,0}W_{\Theta^{w}}}(r,a,\bm{\pi})\simeq W(r,a,\bm{\pi})=r\left(\bm{\mathds{1}}_{a}-\bm{\pi}(\bm{y})\right)\odot\bm{x}\\ \bm{y}&={\color[rgb]{1,.5,0}Y_{\Theta^{y}}}(\bm{x},\bm{w})\simeq Y(\bm{x},\bm{w})=\bm{x}\cdot\bm{w}\end{cases}

(5)
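For reference, the error term $r\left(\bm{\mathds{1}}_{a}-\bm{\pi}\right)$ appearing in Eqs. (4) and (5) can be recovered with a short calculation. The sketch below assumes the standard REINFORCE estimator with a single-step (zero-discount) return, and introduces index notation ($i$ over the $\mathrm{D}$ actions, $j$ over the components of $\bm{x}$) that is not used elsewhere in the text:

\begin{aligned}\log\pi_{a}&=y_{a}-\log\sum_{k=1}^{\mathrm{D}}e^{y_{k}}\\ \frac{\partial\log\pi_{a}}{\partial y_{i}}&=\delta_{ai}-\pi_{i}=\left(\bm{\mathds{1}}_{a}-\bm{\pi}\right)_{i}\\ \frac{\partial\log\pi_{a}}{\partial w_{ij}}&=\left(\delta_{ai}-\pi_{i}\right)x_{j}\\ \tau_{w}\,\dot{w}_{ij}&=r\,\frac{\partial\log\pi_{a}}{\partial w_{ij}}=e_{i}\,x_{j}\end{aligned}

Under this reading, each row of the virtual weights is moved along $\bm{x}$ by an amount set by the corresponding component of $\bm{e}$, which is one way to interpret the product between $\bm{e}$ and $\bm{x}$ in the update of $\bm{w}$.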

2.4 Algorithm Distillation

To demonstrate the ability of this architecture to learn in context, we train the networks $Y_{\Theta^{y}}$ and $W_{\Theta^{w}}$ using an Algorithm Distillation protocol, as defined in [25]. Namely, we consider a family of environments $\mathcal{A}$ with different reward distributions (a family of environments is a set of possible environments with the same action-state transitions but different reward distributions). We then train the aforementioned networks using $N_{s}$ learning histories obtained through policy gradient methods from a subset of these environments, designated as $\mathcal{A}_{\text{ID}}$. The network performance is then tested out of distribution, i.e., on the rest of the environments $\mathcal{A}_{\text{OOD}}:=\mathcal{A}\setminus\mathcal{A}_{\text{ID}}$.

2.5 Gain modulated reservoir computing (GM-RC)

Reservoir Computing

Reservoir Computing (RC) represents a paradigm in machine learning that provides a model for biologically plausible computation and learning, drawing inspiration from the information processing mechanisms observed in biological neural networks.

In this approach, a random RNN is employed to extract features from a time-dependent signal $\bm{x}(t)$ . The dynamics of the RNN describe the evolution of $N$ hidden units $\bm{z}(t)=(z_{1}(t),\cdots,z_{N}(t))$ , governed by the following differential equation:

\displaystyle\tau_{z}\dot{\bm{z}}=\phi\left(J\bm{z}+R\bm{x}\right)-\bm{z}

(6)

The value of each unit represents the activity of a population of neurons following the Wilson and Cowan formulation ([26]). Here, $J$ and $R$ represent fixed random matrices, representing the recurrent connections and the projection from inputs to hidden units, respectively. Subsequently, the features extracted by the network can be utilized to predict a target signal $\bm{y}^{targ}(t)$ by learning readout weights $\Theta$ , such that $|\bm{y}^{targ}(t)-\Theta\bm{z}(t)|^{2}_{2}$ is minimized. The reservoir is then implementing a map

\displaystyle\bm{y}(t)=\mathrm{RNN}_{\Theta}(\{\bm{x}(s)\}_{s\leq t})

(7)

where $\Theta$ represents trainable parameters. In the rest of the article, we will write $\bm{y}=\mathrm{RNN}_{\Theta}(\{\bm{x}\})$, dropping the temporal dependencies (for this mapping to be well-defined, we can assume the RNN to have the echo-state property [27]).

When addressing input-output mappings without time dependencies, we consider a network operating in the $\tau_{z},J_{ij}\to 0$ limit. In this scenario, the network computes an instantaneous function of the input, represented as $\mathrm{NN}_{\Theta}(x)=\Theta\phi\left(R\bm{x}\right)$ . Essentially, this is equivalent to considering a one-layer feed-forward network with random fixed input weights. This architecture is also referred to in the literature as the Extreme Learning Machine [28].
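The reservoir computing recipe described in this subsection can be sketched as follows: Eq. (6) is integrated with a forward Euler step and the readout $\Theta$ is fitted by ridge regression. The choice $\phi=\tanh$, the network size, the scaling of $J$, the ridge penalty, the washout length and the sinusoidal toy signal are all assumptions made for illustration and are not the settings used in our experiments.

```python
import numpy as np

def run_reservoir(x, J, R, tau_z=5.0, dt=1.0, phi=np.tanh):
    """Forward-Euler integration of Eq. (6): tau_z dz/dt = phi(J z + R x) - z."""
    n_steps = x.shape[0]
    z = np.zeros(J.shape[0])
    states = np.empty((n_steps, J.shape[0]))
    for t in range(n_steps):
        z = z + (dt / tau_z) * (phi(J @ z + R @ x[t]) - z)
        states[t] = z
    return states

def fit_readout(states, y_targ, ridge=1e-6):
    """Least-squares readout with a small ridge penalty (assumed regulariser)."""
    n_units = states.shape[1]
    gram = states.T @ states + ridge * np.eye(n_units)
    return np.linalg.solve(gram, states.T @ y_targ)

rng = np.random.default_rng(1)
N = 50                                                # assumed reservoir size
J = rng.normal(scale=0.9 / np.sqrt(N), size=(N, N))   # fixed random recurrence
R = rng.normal(size=(N, 1))                           # fixed random input projection

steps = np.arange(1000)
x = np.sin(2 * np.pi * steps / 50)[:, None]           # toy input signal
y_targ = np.sin(2 * np.pi * (steps + 1) / 50)         # next-step prediction target

states = run_reservoir(x, J, R)
washout = 100                                         # discard the initial transient
theta = fit_readout(states[washout:], y_targ[washout:])
mse = np.mean((states[washout:] @ theta - y_targ[washout:]) ** 2)
```

Only `theta` is fitted here; `J` and `R` stay fixed after random initialisation, which is the defining constraint of the reservoir computing paradigm.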

Gain modulated network architecture

Analysing Equations (3) and (5) we observe that a network must possess the capability to perform multiplications of its inputs to approximate a gradient descent update of virtual parameters. Building upon this insight, we introduce a gain-modulated reservoir network (GM-RC). This architecture draws inspiration from the morphology and function of pyramidal neurons in the cortex, which nonlinearly integrate inputs from basal and apical dendrites [22, 29]. Here, we consider an additional input source $\bm{x}^{\text{ap}}$ randomly projected into the apical dendrite of each neuron by the matrix $R^{\text{ap}}$ . Consistent with experimental observations in L5 pyramidal neurons, we allow the apical inputs to modulate the gain of the activation function, thereby altering its slope. Consequently, the resulting RNN equation is formulated as:

\displaystyle\tau_{z}\dot{\bm{z}}=\phi\left(\left(\alpha\bm{b}^{\text{ap}}+\gamma\cdot R^{\text{ap}}\bm{x}^{\text{ap}}\right)\odot(J\bm{z}+\beta\cdot R^{\text{ap}}\bm{x}^{\text{ap}}+R\bm{x})\right)-\bm{z}

(8)

where $\bm{b}^{\text{ap}}$ is a constant bias vector. The hyperparameters $\alpha,\beta,\gamma$ control the effect of the apical inputs on the network. Specifically, when $\gamma=0$, we obtain a network in which $\bm{x}^{\text{ap}}$ does not modulate the gain. As before, the expression $\bm{y}=\mathrm{GMRNN}_{\Theta}(\{\bm{x}\}|\{\gamma\bm{x}^{\text{ap}}\})$ will denote the input $\to$ output mapping implemented by a gain-modulated reservoir. Likewise, removing time dependencies, we obtain an instantaneous function $\bm{y}=\mathrm{GMNN}_{\Theta}(\bm{x}|\gamma\bm{x}^{\text{ap}})=\Theta\phi\left(\left(\alpha\bm{b}^{\text{ap}}+\gamma\cdot R^{\text{ap}}\bm{x}^{\text{ap}}\right)\odot(R\bm{x})\right)$ of the inputs $\bm{x}^{\text{ap}}$ and $\bm{x}$. We explicitly maintain the dependence on $\gamma$ because, in our experiments, we use this parameter to regulate the gain modulation effect of the apical inputs on the network. The name $\bm{x}^{\text{ap}}$ is inspired by the current received in the apical dendrites of L5 pyramidal neurons, which is believed to carry contextual/high-level information [22].
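The instantaneous gain-modulated map can be illustrated with a few lines of code. This is a sketch of the hidden-layer features only (the readout $\Theta$ is omitted), assuming $\phi=\tanh$, a unit bias vector and small random matrices; the helper name `gm_features` and all sizes are hypothetical.

```python
import numpy as np

def gm_features(x, x_ap, R, R_ap, b_ap, alpha=1.0, gamma=1.0, phi=np.tanh):
    """Hidden features phi((alpha*b_ap + gamma*R_ap x_ap) * (R x)).

    The apical input x_ap sets a per-unit gain on the basal drive R x.
    """
    gain = alpha * b_ap + gamma * (R_ap @ x_ap)
    return phi(gain * (R @ x))

rng = np.random.default_rng(2)
N, n_in, n_ap = 8, 3, 2                  # assumed sizes
R = rng.normal(size=(N, n_in))
R_ap = rng.normal(size=(N, n_ap))
b_ap = np.ones(N)                        # assumed constant bias
x = rng.normal(size=n_in)
ctx_a, ctx_b = rng.normal(size=n_ap), rng.normal(size=n_ap)

# gamma = 0: the apical input has no effect on the features.
no_gain_a = gm_features(x, ctx_a, R, R_ap, b_ap, gamma=0.0)
no_gain_b = gm_features(x, ctx_b, R, R_ap, b_ap, gamma=0.0)

# gamma = 1: the same basal input yields context-dependent features.
gain_a = gm_features(x, ctx_a, R, R_ap, b_ap, gamma=1.0)
gain_b = gm_features(x, ctx_b, R, R_ap, b_ap, gamma=1.0)

# Small-signal regime: tanh(u) ~ u, so each unit carries products x_ap_k * x_j.
x_small = 1e-4 * x
bilinear = (b_ap + R_ap @ ctx_a) * (R @ x_small)
small_signal = gm_features(x_small, ctx_a, R, R_ap, b_ap, gamma=1.0)
```

The last lines show why such units are a natural substrate for the products required by Eqs. (3) and (5): for small arguments the activation is approximately bilinear in the apical and basal inputs, so a linear readout has direct access to multiplicative terms.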

3 Results

Computing resources

The experiments in this paper were executed on a MacBook Pro with a 12-core M3 CPU and 36 GB of RAM, and on a MacBook Pro with a 2.9 GHz 6-core Intel Core i9 and 32 GB of RAM.

Figure 1: Dynamical adaptation of temporal trajectories. A. Overview of the network architecture employed. A recurrent network composed of $N=20$ units (described by the vector $\bm{z}(t)$, depicted in pink, which follows Eq. (6)) is utilized for learning a periodic trajectory, following the prescription of reservoir computing. One recurrent network is tasked with estimating the gradient of virtual weights, while another is dedicated to estimating the behavioral reconfiguration (illustrated in green and orange, respectively) resulting from these updated virtual weights. B. Illustration of our architecture dynamically adjusting (without synaptic alterations) to adhere to the desired dynamics. Errors (represented by the blue trajectory) are fed back to the first network to assess the necessary updates to the virtual weights $\delta w$, an $N$-dimensional vector (shown in the green trajectory). Subsequently, these updates are transmitted to the second network, modifying the decoding of reservoir dynamics (indicated by the orange lines). Initially, the reservoir receives the target trajectory as input (in open loop, before the red vertical line), which is later replaced by the estimated trajectory itself (in closed loop, after the red vertical line). C. Our networks were pre-trained on five target frequencies (marked by red vertical lines) and tested across a range of frequencies, evaluating the mean squared error (MSE) between the target and estimated trajectories in closed-loop scenarios (solid: median; dashed: 20th/80th percentiles; statistics evaluated over 10 realizations).

Dynamical adaptation for temporal trajectories

We consider the task of autonomously predicting a temporal trajectory $\{y^{targ}(t)\}_{t}$. Following the reservoir computing paradigm, we employ an RNN (Fig. 1A, pink), whose dynamics follows Equation (6), to extract temporal features. During training, the reservoir receives the target dynamics $y^{targ}(t)$ as input ($x=y^{targ}(t)$ in Eq. (6), while $R$ is a Gaussian random matrix with zero mean and variance $\sigma_{R}^{2}$) and is tasked with predicting the subsequent step of the trajectory via a linear readout of its activity (open loop, see Fig. 1B). This setup can be reformulated as a dynamical supervised learning problem, as described in the Methods section, with the inputs $\bm{x}$ replaced by the features $\bm{z}(t)$ extracted by the RNN. The readout virtual weights can then be dynamically adjusted, minimizing the error between the target and the current prediction $y(t)$. This results in the following dynamics:

\begin{cases}e&=(y^{targ}-y)\\ y&={\color[rgb]{1,.5,0}Y_{\Theta^{y}}}({\color[rgb]{1,0.58984375,0.70703125}\bm{z}},\bm{w})\simeq{\color[rgb]{1,0.58984375,0.70703125}\bm{z}}\cdot\bm{w}\\ \tau_{w}\,\dot{\bm{w}}&={\color[rgb]{0,0.66796875,0}W_{\Theta^{w}}}({\color[rgb]{1,0.58984375,0.70703125}\bm{z}},e)\simeq{\color[rgb]{1,0.58984375,0.70703125}\bm{z}}\,e\end{cases}

(9)

Here, the network $W_{\Theta^{w}}(\bm{z},e)=\textrm{GMNN}_{\Theta^{w}}(\bm{z},e|\gamma_{e}e)$ is dedicated to estimating the gradient of virtual weights, while $Y_{\Theta^{y}}(\bm{z},\bm{w})=\textrm{GMNN}_{\Theta^{y}}(\bm{z},\bm{w}|\gamma\bm{w})$ is tasked with estimating the predicted $y(t)$ as a function of the new virtual weights (Fig. 1A, green and orange respectively). The parameters $\gamma,\gamma_{e}$ define the strength of the gain modulation, as explained in the Methods section.

These gain-modulated architectures are pre-trained to replicate, respectively, gradient descent updates and scalar products obtained on a set of $N_{train}$ training sequences $\{\{y_{\alpha}^{targ}(t)\}_{t}:\ \alpha\in[N_{train}]\}$. More specifically, the target sequences are $5$ sinusoidal functions with different frequencies (see vertical red lines in Fig. 1C).

In the closed-loop phase, the features $\bm{z}$ are obtained by directly feeding the network estimate $y(t)$ as the input to the RNN ($x=y(t)$ in Eq. (6)). This results in an autonomous dynamical system that reproduces the target trajectory (see Fig. 1B). In Fig. 1B we report an example of successful dynamical adaptation of our model. Errors, represented by the blue trajectory, are fed back to the first network to assess the necessary updates to the virtual weights, shown in the green trajectory. These updates are then transmitted to the second network, modifying the decoding of reservoir dynamics indicated by the orange lines. Initially, the reservoir receives the target trajectory as input in an open-loop fashion (before the red vertical line); this input is later replaced by the estimated trajectory itself in a closed-loop configuration (after the red vertical line). The networks were pre-trained on five target frequencies, marked by red vertical lines in Fig. 1C, and then tested across a range of frequencies. Evaluation was performed by assessing the mean squared error (MSE) between the target and estimated trajectories in closed-loop scenarios (see Fig. 1C; solid and dashed lines represent, respectively, the median and the 20th/80th percentile range, with statistics evaluated over 10 realizations of the experiment).

3.1 Multi Armed Bandits

Dynamical adaptation is now investigated within the framework of reinforcement learning. We provide robust evidence supporting the hypothesis that gain modulation represents a fundamental component in implementing in-context learning in biological agents. We first explore stateless environments $\mathcal{A}^{bandits}$ , which are represented by Bernoulli K-armed bandits [30]. Within each environment $\alpha\in\mathcal{A}^{bandits}$ , there exists a subset $P_{\alpha}\subset[K]$ of arms that yield a reward with high probability ( $p=0.95$ ). During the training phase, the in-distribution environments give a high reward probability to even-numbered arms, whereas during testing, the out-of-distribution environments assign a high reward probability to odd-numbered arms.

In this simplified bandit scenario, where state information is absent, the virtual weights $\bm{w}$ directly parameterize the policy probabilities: $\bm{\pi}=\mathrm{softmax}(\bm{w})$. Consequently, our focus lies primarily on analyzing the behavior of the network $W_{\Theta^{w}}(\cdot)$ acting on the virtual parameters (Fig. 2A, green). This serves as an ideal test bed for evaluating the network’s ability to learn the policy gradient update rule and generalize it to out-of-distribution scenarios.

To parameterize $W_{\Theta^{w}}(\cdot)$ we employ a gain-modulated architecture $W_{\Theta^{w}}(\bm{\pi},a,r)=\mathrm{GMNN}_{\Theta^{w}}(\bm{\pi},a,r|\gamma r)$, where $r$ is the reward and $a$ is the action. We train this network to approximate the policy gradient update rule. As training data, we use policy gradient estimates over a single learning trajectory (1000 rounds, learning rate $lr=0.1$) in an in-distribution environment. We then test the distilled policy gradient networks with a higher learning rate ($lr=1.0$) in out-of-distribution (OOD) environments.
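As an illustration of how such training data could be assembled, the sketch below runs a plain policy gradient learner on a Bernoulli bandit and records, at every round, the triplet $(\bm{\pi},a,r)$ together with the corresponding update $r\left(\bm{\mathds{1}}_{a}-\bm{\pi}\right)$. The number of arms `K`, the low reward probability (0.05), the 0-based arm indexing and the way inputs are concatenated are assumptions made for this example; only the 1000 rounds, $lr=0.1$ and $p=0.95$ come from the text.

```python
import numpy as np

def softmax(w):
    z = np.exp(w - np.max(w))        # shift for numerical stability
    return z / z.sum()

def run_policy_gradient(p_reward, n_rounds=1000, lr=0.1, seed=0):
    """Policy gradient on a Bernoulli bandit with pi = softmax(w)."""
    rng = np.random.default_rng(seed)
    n_arms = len(p_reward)
    w = np.zeros(n_arms)             # virtual weights (logits)
    history = []
    for _ in range(n_rounds):
        pi = softmax(w)
        a = rng.choice(n_arms, p=pi)
        r = float(rng.random() < p_reward[a])
        one_hot = np.zeros(n_arms)
        one_hot[a] = 1.0
        dw = r * (one_hot - pi)      # policy gradient update direction
        history.append((pi, a, r, dw))
        w = w + lr * dw
    return w, history

K = 10                               # assumed number of arms
# In-distribution environment: even-indexed arms pay with p = 0.95.
# The value 0.05 for the remaining arms is an assumption.
p_id = np.where(np.arange(K) % 2 == 0, 0.95, 0.05)
w_final, history = run_policy_gradient(p_id)

# Regression pairs for the distilled network: (pi, one-hot a, r) -> dw.
inputs = np.array([np.concatenate([pi, np.eye(K)[a], [r]])
                   for pi, a, r, _ in history])
targets = np.array([dw for *_, dw in history])
good_mass = softmax(w_final)[p_id > 0.5].sum()
```

A readout fitted on `inputs` and `targets` would then play the role of $\Theta^{w}$; the out-of-distribution test corresponds to swapping the roles of even and odd arms in `p_id`.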

To systematically investigate the impact of gain modulation ( $\gamma=1$ ) on out-of-distribution performance, we compare it with networks where the reward does not modulate the network gain ( $\gamma=0$ ). For each case, we select the optimal hyperparameters through a grid-based search (additional details in the Appendix).

Our experimental findings are presented in Fig.2. First, comparing models of different sizes (number $N$ of hidden units), we find that models with gain modulation require significantly fewer neurons to achieve maximum scores in OOD environments (Fig.2C).

In Fig.2B, we report the regret per round distribution at the 100th round and compare it with the distribution obtained by policy gradient with the same learning rate and iterations. A gain-modulated network achieves a regret distribution comparable to the policy gradient in both ID and OOD environments, with a lower median. In contrast, networks without gain modulation show significantly higher regret. Analyzing the regret curves in Fig.2D, we observe that a model with gain modulation often learns a more data-efficient algorithm than its source, even in OOD environments. Conversely, a model without gain modulation fails to generalize to OOD environments and does not converge to zero regret.

In summary, gain modulation enables the network to consistently and efficiently distill the correct gradient update rule and generalize it to unseen environments, predicting the correct virtual weight update in regions of the input space far from the training data.

3.2 Reinforcement learning in a reaching task.

We examine a reaching task, known as the dark room task (a simple instance of the water maze task [31]), set in a 2D maze within the domain $(-1,1)\times(-1,1)$ with a grid size of $0.1$ . Within this grid-like environment, the agent navigates by selecting one of four actions: up, down, left, or right, thereby determining its subsequent position. The primary goal is to locate a concealed object within the maze, with the agent having sole awareness of its own position. Feedback is provided via rewards, where the agent receives a reward of $10$ if the distance to the object is $0.1$ , $15$ if the distance is $0$ , and $0$ otherwise. Through iterative exploration, the agent develops a strategy to efficiently traverse the maze and pinpoint the object despite the limited visibility. The position of the agent is encoded separately using 25 input units $\bm{x}$ each, employing Gaussian activation functions distributed on a 5x5 grid in the maze and with a width of $0.2$ .
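A minimal sketch of this environment may help fix ideas. Several details are assumptions made for the example and are not specified above: the Euclidean distance in the reward, clipping at the borders, the placement of the 25 Gaussian centres at $\{-0.8,-0.4,0,0.4,0.8\}^{2}$, the interpretation of the width as a standard deviation, and the action ordering.

```python
import numpy as np

GRID_STEP = 0.1
# Assumed action ordering: 0 = up, 1 = down, 2 = left, 3 = right.
ACTIONS = {0: (0.0, GRID_STEP), 1: (0.0, -GRID_STEP),
           2: (-GRID_STEP, 0.0), 3: (GRID_STEP, 0.0)}

def step(pos, action):
    """Move one grid cell; positions are rounded to the grid and clipped."""
    new_pos = np.round(np.asarray(pos, dtype=float) + ACTIONS[action], 1)
    return np.clip(new_pos, -1.0, 1.0)

def reward(pos, target):
    """15 on the target, 10 at distance 0.1, 0 otherwise (Euclidean, assumed)."""
    d = np.linalg.norm(np.asarray(pos, dtype=float) - np.asarray(target, dtype=float))
    if d < 1e-9:
        return 15.0
    if abs(d - GRID_STEP) < 1e-6:
        return 10.0
    return 0.0

# 5x5 grid of Gaussian centres (assumed placement).
AXIS = np.linspace(-0.8, 0.8, 5)
CENTERS = np.array([(cx, cy) for cx in AXIS for cy in AXIS])

def encode(pos, width=0.2):
    """25 Gaussian input units x encoding the agent position."""
    d2 = np.sum((CENTERS - np.asarray(pos, dtype=float)) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * width ** 2))

target = (0.2, 0.0)
features = encode((0.0, 0.0))
next_pos = step((0.2, 0.0), 3)
```

The vector returned by `encode` is what the policy network receives as $\bm{x}$, while `reward` provides the scalar $r$ entering the virtual weight update.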

To accomplish this task we consider an architecture composed of two networks. One GM-network, $W_{\Theta^{w}}(r,a,\bm{\pi})=\textrm{GMNN}_{\Theta^{w}}(r,a,\bm{\pi}|\gamma_{r}r)$, is responsible for estimating the necessary update of virtual weights through policy gradient (Fig. 3A, green network), while another GM-network, $Y_{\Theta^{y}}(\bm{x},\bm{w})=\textrm{GMNN}_{\Theta^{y}}(\bm{x},\bm{w}|\gamma\bm{w})$, estimates the policy reconfiguration (Fig. 3A, shown in orange) resulting from changes in virtual weights. Here, the parameters $\gamma,\gamma_{r}$ define the strength of the gain modulation, as explained in the Methods section. Their dynamics can be described by Eq. (5). We compare the cases with (Fig. 3A) and without (Fig. 3B) gain modulation. We refer to the Supporting Information for further details on the training procedure.

Firstly, we assessed the performance of the policy gradient algorithm using a set of 8 food locations (refer to Fig. 3C, indicated by red crosses), where the total reward averaged over 200 trials was observed against the number of trials (Fig. 3E, depicted by blue lines, thin lines for individual positions and thick lines for the average across all positions). After multiple trials, the agent successfully achieved precise targeting of the circles. Data collected from these experiments were utilized to train our networks to estimate gradients and scalar products, as defined in Eq. (5).

To validate that our trained model is capable of dynamically implementing policy gradient itself, we tested it on both the training set locations (ID, Fig. 3C, left panel) and new test positions (OOD, Fig. 3C, right panel). For each food position (coded with different colors), we illustrate a sample trajectory executed by the agent (in corresponding colors) to reach the target at the end of the training. The agent’s precision closely matches that of the plastic policy gradient learning rule. We present the reward plotted against the number of trials for both training (Fig. 3E, right panel, pink line) and test (Fig. 3C, right panel, pink line) food locations.

We compared this performance against an architecture lacking gain modulation ($\gamma=0$), observing worse performance (see Fig. 3B, D, F). Notably, while performance for ID food locations is acceptable (Fig. 3E, left panel, pink lines), that for OOD cases is extremely poor (Fig. 3E, right panel, pink lines). This observation, coupled with the results from the preceding section, suggests that gain modulation is a crucial component in facilitating the generalization of adaptive behavioral capabilities in recurrent networks.

We also demonstrate that our architecture is capable of performing the temporal computation required to learn delayed action-reward relations (see section C in the Appendix), which requires temporal credit assignment.

4 Discussion

It is believed that the remarkable capability of transformers to adapt to contextual information is the key ingredient that allowed transformer-based architectures [4, 32] to achieve state-of-the-art performance in many domains, such as the processing and generation of natural language and images [6, 7, 8]. Currently, mechanisms based on attention, associative memory [9], and induction-based copying by attention heads [10] are the predominant tentative explanations for the emergence of in-context learning in transformers. In particular, it has been shown theoretically that the attention mechanism is able to execute gradient descent updates [11]. However, the emergence of this property is still not fully understood, and it is not clear how to relate it to the in-context learning capabilities observed in humans and animals. In this work we proposed a constructive method to induce in-context learning in biologically plausible neural networks, in a broad variety of scenarios.

When tasked with learning temporal trajectories through error feedback and virtual weight updates, the network achieves successful dynamical adaptation without synaptic plasticity. Similarly, in reinforcement learning scenarios, the policy in response to the environment state is dynamically modulated by virtual weights that are updated as a function of the reward. We stress that virtual weights are not synaptic weights: their values and updates are evaluated and encoded by the activity of other RNNs that were pre-trained to perform gradient updates. Another pre-trained network receives those weights as an input, along with the current environment features, and computes an adapted in-context response. We find that networks with gain modulation exhibit improved performance and robustness to variations compared to those without gain modulation. Moreover, these gain-modulated networks implement more data-efficient learning algorithms, outperforming their counterparts in both in-distribution and out-of-distribution environments and showing their capability to face novel scenarios.

In conclusion, our approach provides an explicit framework to induce ICL in biologically plausible networks, possibly opening the route to a formal understanding of ICL in biological agents.

Limitations of this study

In this study, we focused on the dynamical adaptation of virtual readout weights; in other words, only the last layer is considered. We neglect the dynamics of the recurrent weights and of the weights of the hidden units, which, in general, strongly affect performance and stability. Generalizability to untested real-world conditions is uncertain, and we do not investigate scalability to larger architectures. We suggest that the mechanism we propose could be present in biological agents, allowing in-context learning. However, the theoretical basis for attention mechanisms executing gradient descent updates lacks comprehensive empirical validation in biological experiments.

Additionally, the study does not fully address how dynamic gradient structures could be learned or developed in biological networks.

5 Acknowledgments

This research has received financial support from the Italian National Recovery and Resilience Plan (PNRR), M4C2, funded by the European Union - NextGenerationEU (Project IR0000011, CUP B51E22000150006, ‘EBRAINS-Italy’) to MM. LF is supported by ICSC – Centro Nazionale di Ricerca in High Performance Computing, Big Data and Quantum Computing, funded by European Union – NextGenerationEU.

Source code availability

The source code is available under a CC-BY license in the public repository https://github.com/cristianocapone/ABSS.

References

[1] Inah Lee and Choong-Hee Lee. Contextual behavior and neural circuits. Frontiers in neural circuits, 7:84, 2013.
[2] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. Advances in neural information processing systems, 29, 2016.
[3] Bernardino Romera-Paredes and Philip Torr. An embarrassingly simple approach to zero-shot learning. In International conference on machine learning, pages 2152–2161. PMLR, 2015.
[4] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
[5] Aaditya Singh, Stephanie Chan, Ted Moskovitz, Erin Grant, Andrew Saxe, and Felix Hill. The transient nature of emergent in-context learning in transformers. Advances in Neural Information Processing Systems, 36, 2024.
[6] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
[7] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[8] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International conference on machine learning, pages 8821–8831. PMLR, 2021.
[9] Hubert Ramsauer, Bernhard Schäfl, Johannes Lehner, Philipp Seidl, Michael Widrich, Thomas Adler, Lukas Gruber, Markus Holzleitner, Milena Pavlović, Geir Kjetil Sandve, et al. Hopfield networks is all you need. arXiv preprint arXiv:2008.02217, 2020.
[10] Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads. arXiv preprint arXiv:2209.11895, 2022.
[11] Johannes Von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. Transformers learn in-context by gradient descent. In International Conference on Machine Learning, pages 35151–35174. PMLR, 2023.
[12] Albert Gu, Karan Goel, and Christopher Re. Efficiently modeling long sequences with structured state spaces. In International Conference on Learning Representations, 2021.
[13] Antonio Orvieto, Samuel L Smith, Albert Gu, Anushan Fernando, Caglar Gulcehre, Razvan Pascanu, and Soham De. Resurrecting recurrent neural networks for long sequences. In International Conference on Machine Learning, pages 26670–26698. PMLR, 2023.
[14] Nicolas Zucchet, Seijin Kobayashi, Yassir Akram, Johannes Von Oswald, Maxime Larcher, Angelika Steger, and Joao Sacramento. Gated recurrent neural networks discover attention. arXiv preprint arXiv:2309.01775, 2023.
[15] Panayiota Poirazi and Athanasia Papoutsi. Illuminating dendritic function with computational models. Nature Reviews Neuroscience, 21(6):303–321, 2020.
[16] Matthew Larkum. A cellular mechanism for cortical associations: an organizing principle for the cerebral cortex. Trends in neurosciences, 36(3):141–151, 2013.
[17] Robert Urbanczik and Walter Senn. Learning by the dendritic prediction of somatic spiking. Neuron, 81(3):521–528, 2014.
[18] Jordan Guerguiev, Timothy P Lillicrap, and Blake A Richards. Towards deep learning with segregated dendrites. eLife, 6:e22901, 2017.
[19] João Sacramento, Rui Ponte Costa, Yoshua Bengio, and Walter Senn. Dendritic cortical microcircuits approximate the backpropagation algorithm. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 8721–8732. Curran Associates, Inc., 2018.
[20] Alexandre Payeur, Jordan Guerguiev, Friedemann Zenke, Blake A Richards, and Richard Naud. Burst-dependent synaptic plasticity can coordinate learning in hierarchical circuits. Nature neuroscience, 24(7):1010–1019, 2021.
[21] Cristiano Capone, Cosimo Lupo, Paolo Muratore, and Pier Stanislao Paolucci. Beyond spiking networks: The computational advantages of dendritic amplification and input segregation. Proceedings of the National Academy of Sciences, 120(49):e2220743120, 2023.
[22] Matthew Larkum. A cellular mechanism for cortical associations: an organizing principle for the cerebral cortex. Trends in Neurosciences, 36(3):141 – 151, 2013.
[23] Guillaume Bellec, Franz Scherr, Anand Subramoney, Elias Hajek, Darjan Salaj, Robert Legenstein, and Wolfgang Maass. A solution to the learning dilemma for recurrent networks of spiking neurons. Nature communications, 11(1):1–15, 2020.
[24] Cristiano Capone and Pier Stanislao Paolucci. Towards biologically plausible dreaming and planning. arXiv preprint arXiv:2205.10044, 2022.
[25] Michael Laskin, Luyu Wang, Junhyuk Oh, Emilio Parisotto, Stephen Spencer, Richie Steigerwald, DJ Strouse, Steven Hansen, Angelos Filos, Ethan Brooks, et al. In-context reinforcement learning with algorithm distillation. arXiv preprint arXiv:2210.14215, 2022.
[26] Hugh R Wilson and Jack D Cowan. Excitatory and inhibitory interactions in localized populations of model neurons. Biophysical journal, 12(1):1–24, 1972.
[27] Izzet B Yildiz, Herbert Jaeger, and Stefan J Kiebel. Re-visiting the echo state property. Neural networks, 35:1–9, 2012.
[28] Guang-Bin Huang, Qin-Yu Zhu, and Chee-Kheong Siew. Extreme learning machine: theory and applications. Neurocomputing, 70(1-3):489–501, 2006.
[29] Adam S Shai, Costas A Anastassiou, Matthew E Larkum, and Christof Koch. Physiology of layer 5 pyramidal neurons in mouse primary visual cortex: coincidence detection through bursting. PLoS computational biology, 11(3):e1004090, 2015.
[30] Donald A Berry and Bert Fristedt. Bandit problems: sequential allocation of experiments (monographs on statistics and applied probability). London: Chapman and Hall, 5(71-87):7–7, 1985.
[31] Richard GM Morris. Spatial localization does not require the presence of local cues. Learning and motivation, 12(2):239–260, 1981.
[32] Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9):1–35, 2023.
[33] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.

Appendix A Relationship with previous work on product-based architectures implementing in-context learning

The relationship between attention-based transformer architectures and in-context learning was first noted in [11], where it was shown through a constructive proof, confirmed by experiments, that linear attention layers can implement a gradient descent update for linear regression in their forward pass. Building upon this insight, [14] showed that a linear gated RNN can implement the same mechanism, since a linear two-layer gated RNN can replicate a linear transformer. That work proposes an implementation that uses $O(d^{2})$ hidden units and has $O(d^{4})$ trainable weights, which can be reduced to $O(d^{3})$ by side gating.

Our gain-modulated architecture shares significant analogies with the gated RNNs analyzed in [14]. Consider linear gain-modulated network architectures (for simplicity, here we assume $\alpha,\beta=0,\ \gamma=1$ and a linear activation function $\phi=Id$ ) implementing the functions ${\color[rgb]{1,.5,0}Y_{\Theta^{y}}}(\bm{z},\bm{w})$ , ${\color[rgb]{0,0.66796875,0}W_{\Theta^{w}}}(\bm{z},y^{targ}-y)$ , dynamically performing an ICL linear regression task that requires mapping time-dependent features $\bm{z}(t)\in\mathbb{R}^{N_{in}}$ to an output $\bm{y}^{targ}\in\mathbb{R}^{N_{out}}$ :

\begin{cases}\bm{y}&={\color[rgb]{1,.5,0}Y_{\Theta^{y}}}(\bm{z},\bm{w})=\Theta^{y}\left(R^{ap}_{y}\bm{w}\right)\odot(R_{y}\bm{z}(t))\\ \tau_{w}\,\dot{\bm{w}}&={\color[rgb]{0,0.66796875,0}W_{\Theta^{w}}}(\bm{z},\bm{y}^{targ}-\bm{y})=\Theta^{w}\left(R^{ap}_{w}\left(\bm{y}^{targ}-\bm{y}\right)\right)\odot(R_{w}\bm{z}(t))\end{cases}

(10)

This architecture can be seen as a particular instance of the RNN with side gating (see Equation (15) in [14]). Interestingly, a similar mechanism is also present in the recently proposed Mamba layer [33].

Appendix B Approximating gradients with gain modulated architectures

In this appendix, we show that an architecture with gain modulation is more suited to approximate gradient terms involved in dynamical learning.

Scalar product

We first consider the task of approximating a scalar product between virtual weights $\bm{w}\in\mathbb{R}^{N_{in}}$ and features $\bm{x}\in\mathbb{R}^{N_{in}}$ . To achieve this, we train the readout weights $\Theta$ of a gain modulated network $\mathrm{GMNN}_{\Theta}(\{\bm{x}\}|\{\gamma\bm{w}\})$ with $N_{h}$ hidden features to approximate the function $dot(\bm{x},\bm{w})=\sum_{i=1}^{N_{in}}x_{i}w_{i}$ .

The training dataset is composed of $1000$ pairs $(\bm{x},\bm{w})$ uniformly sampled in the hypercube $[0,1]^{2N_{in}}$ . The test set consists of the same number of pairs sampled in the hypercube $[-1,1]^{2N_{in}}$ .

We compare an architecture with gain modulation ( $\gamma=1$ ) with an architecture without gain modulation ( $\gamma=0$ ). For each of the two settings, we select the best hyperparameters by fixing the dimensionality of the inputs ( $N_{in}=5$ ) and hidden units ( $N_{h}=101$ ) and varying the standard deviation of the projection matrices $R_{ij},R^{ap}_{ij}\sim\mathcal{N}(0,\sigma_{R}^{2}),\ \log_{10}(\sigma_{R})\in\{-2,-1.8,\cdots,0\}$ , the standard deviation of the bias $b_{i}\sim\mathcal{N}(0,\sigma_{b}^{2}),\ \sigma_{b}\in\{0,0.1,\cdots,1\}$ , and the nonlinearity type $\phi\in\{\mathrm{tanh},\mathrm{softplus}\}$ . The best hyperparameters found were then used to test how well the models approximate the dot product while varying the number of input features ( $N_{in}\in[1,10]\cap\mathbb{N}$ ) and hidden units ( $N_{h}\in[1,200]\cap\mathbb{N}$ ). Results are shown in Fig. A1 A, B, C. While both architectures require a number of hidden units that grows quadratically with $N_{in}$ to achieve low error on the test set, the gain-modulated network, once a threshold number of hidden units is reached, approximates the scalar product function almost perfectly, consistently achieving an RMSE of $\sim 10^{-7}$ . In contrast, the error of the network without gain modulation remains several orders of magnitude higher.

Scalar-vector product

We then tackle the task of approximating a scalar-vector product between features $\bm{x}\in\mathbb{R}^{N_{in}}$ and a scalar error $e\in\mathbb{R}$ . To achieve this, we train the readout weights $\Theta$ of a gain modulated network $\mathrm{GMNN}_{\Theta}(\{\bm{x}\}|\{\gamma e\})$ with $N_{h}$ hidden features to approximate the function $prod(\bm{x},e)=e\bm{x}\in\mathbb{R}^{N_{in}}$ .

The training dataset is composed of $1000$ pairs $(\bm{x},e)$ uniformly sampled in the hypercube $[0,1]^{N_{in}}\times[0,1]$ . The test set consists of the same number of pairs sampled in the hypercube $[-1,1]^{N_{in}}\times[-1,1]$ .

We compare an architecture with gain modulation ( $\gamma=1$ ) with an architecture without gain modulation ( $\gamma=0$ ). For each of the two settings, we select the best hyperparameters by fixing the dimensionality of the inputs ( $N_{in}=5$ ) and hidden units ( $N_{h}=21$ ) and varying the standard deviation of the projection matrices $R_{ij}\sim\mathcal{N}(0,\sigma_{R}^{2}/N_{in}),\ R^{ap}_{ij}\sim\mathcal{N}(0,\sigma_{R}^{2}),\ \log_{10}(\sigma_{R})\in\{-2,-1.8,\cdots,0\}$ , the standard deviation of the bias $b_{i}\sim\mathcal{N}(0,\sigma_{b}^{2}),\ \sigma_{b}\in\{0,0.1,\cdots,1\}$ , and the nonlinearity type $\phi\in\{\mathrm{tanh},\mathrm{softplus}\}$ . The best hyperparameters found were then used to test how well the models approximate the scalar-vector product while varying the number of input features ( $N_{in}\in[1,10]\cap\mathbb{N}$ ) and hidden units ( $N_{h}\in[1,40]\cap\mathbb{N}$ ). Results are shown in Fig. A1 D, E, F. The network without gain modulation requires a number of hidden units that grows quadratically with $N_{in}$ to achieve low error on the test set. As in the scalar product case, the gain-modulated network can nearly perfectly approximate the target function after a threshold number of hidden units is reached. Significantly, in this case, the threshold scales linearly with $N_{in}$ .

On the number of hidden units and trainable weights needed in our model

Consider the architecture in Eq. (10) with $N_{out}=1$ :

\begin{cases}y&={\color[rgb]{1,.5,0}Y_{\Theta^{y}}}(\bm{z},\bm{w})=\Theta^{y}\left(R^{ap}_{y}\bm{w}\right)\odot(R_{y}\bm{z}(t))\\ \tau_{w}\,\dot{\bm{w}}&={\color[rgb]{0,0.66796875,0}W_{\Theta^{w}}}(\bm{z},y^{targ}-y)=\Theta^{w}\left(R^{ap}_{w}\left(y^{targ}-y\right)\right)\odot(R_{w}\bm{z}(t))\end{cases}

(11)

where $R^{ap}_{y},R_{y},R^{ap}_{w},R_{w}$ are fixed random matrices of dimension $N_{y}\times N_{in}$ , $N_{y}\times N_{in}$ , $N_{w}\times 1$ , and $N_{w}\times N_{in}$ , respectively, and $\Theta^{y},\Theta^{w}$ are trainable readout matrices of sizes $1\times N_{y}$ and $N_{in}\times N_{w}$ . In this simplified linear setting, the features extracted by the network are linear combinations of products of the input entries. The readout weights must linearly combine these features into the target function, which in both cases is a specific linear combination of such products. It is then straightforward to observe that, to approximate the dot product with ${\color[rgb]{1,.5,0}Y_{\Theta^{y}}}(\bm{z},\bm{w})$ , we need $N_{y}\sim O(N_{in}^{2})$ hidden units and $O(N_{in}^{2})$ readout parameters, as confirmed in the experiment. This number can be further reduced to $O(N_{in})$ for both the hidden units and the trainable weights by assuming that $R^{ap},R_{w}$ are diagonal matrices, which reduces the problem to estimating $N_{in}$ products of size $1\times 1$ . As for the network ${\color[rgb]{0,0.66796875,0}W_{\Theta^{w}}}(\bm{z},y^{targ}-y)$ approximating vector-scalar products, similar considerations support the experimental observation that $N_{w}\sim O(N_{in})$ hidden units and $O(N_{in}^{2})$ readout parameters are needed in this case.
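The counting argument for the dot product can be written out explicitly. The following is a short sketch under the same linear assumptions ( $\phi=Id$ , $\gamma=1$ ); the index notation $R^{ap}_{y,ki}$ , $R_{y,kj}$ for the matrix entries is introduced here only for illustration.

```latex
\begin{aligned}
y &= \sum_{k=1}^{N_{y}} \Theta^{y}_{k}\,\big(R^{ap}_{y}\bm{w}\big)_{k}\,\big(R_{y}\bm{z}\big)_{k}
   = \sum_{i,j=1}^{N_{in}} \Big(\sum_{k=1}^{N_{y}} \Theta^{y}_{k}\,R^{ap}_{y,ki}\,R_{y,kj}\Big)\, w_{i}\, z_{j},\\
y &= \bm{w}\cdot\bm{z} \ \text{ for all } \bm{w},\bm{z}
   \iff \sum_{k=1}^{N_{y}} \Theta^{y}_{k}\,R^{ap}_{y,ki}\,R_{y,kj} = \delta_{ij}
   \quad \text{for all } i,j\in\{1,\dots,N_{in}\}.
\end{aligned}
```

These are $N_{in}^{2}$ linear constraints on the $N_{y}$ entries of $\Theta^{y}$ , so for generic random projections they can be satisfied once $N_{y}\geq N_{in}^{2}$ . If the projections are diagonal, only the $N_{in}$ constraints with $i=j$ involve nonzero terms, which is consistent with the reduction to $O(N_{in})$ mentioned above.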

For a multivariate $\mathbb{R}^{N_{in}}\to\mathbb{R}^{N_{out}}$ linear regression task, we can consider $N_{out}$ independent modules of the type described before. In this case, the number of trainable parameters is $O(N^{2}_{in}\cdot N_{out})$ with $O(N_{in}\cdot N_{out})$ hidden units ( $O(N_{in}^{2}\cdot N_{out})$ in the more biologically plausible case).

Appendix C Reinforcement Learning and the Discount Factor

Reinforcement Learning (RL) is a machine learning paradigm where an agent learns to make decisions by interacting with an environment to maximize cumulative rewards. The discount factor, usually denoted by $\gamma$ (where $0\leq\gamma\leq 1$ ), is crucial in RL as it determines the importance of future rewards. The cumulative reward $R_{t}$ at time step $t$ is given by:

R_{t}=r_{t}+\gamma r_{t+1}+\gamma^{2}r_{t+2}+\gamma^{3}r_{t+3}+\ldots

In our experiment, the policy gradient update rule in the presence of the discount factor could be approximated as (see [23]):

\begin{cases}\bm{e}&=\left(\bm{\mathds{1}}_{a}-\bm{\pi}\right)\odot\bm{x}\\ \bm{y}&=\bm{w}\cdot\bm{x}\\ \tau_{w}\,\dot{\bm{w}}&=r\bm{\hat{e}}\end{cases}

(12)

where $\bm{\hat{e}}$ is an exponential temporal filtering of $\bm{e}$ , with a timescale that is proportional to $-\frac{1}{\log(\gamma)}$ . The importance of the discount factor lies in its ability to balance immediate and future rewards. A higher $\gamma$ values future rewards more significantly, encouraging the agent to consider the long-term consequences of its actions. Conversely, a lower $\gamma$ makes the agent prioritize immediate rewards. This balance is essential for the stability of the learning process. An appropriately chosen $\gamma$ ensures stable learning and convergence; if $\gamma$ is too high, the agent might overvalue distant future rewards, leading to instability, while a very low $\gamma$ can result in short-sighted behavior. If the discount factor is zero, the agent’s policy would change only when it is very close to the moment the reward is received. In a reaching task, the agent will only change its policy when it is very close to the reward, completely ignoring the need for long-term planning and making it unlikely to ever reach the goal from distant starting points.
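A discrete-time reading of the discounted return and of Eq. (12) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the product in the first line of Eq. (12) is interpreted as an outer product so that $\bm{e}$ has the same shape as $\bm{w}$ , the filter is a one-step recursion with decay $\gamma$ , and `lr` (standing for $dt/\tau_{w}$ ) and all numerical values are hypothetical.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """R_t for t = 0: r_0 + gamma * r_1 + gamma^2 * r_2 + ..."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

def softmax(u):
    u = u - np.max(u)                      # subtract the max for numerical stability
    p = np.exp(u)
    return p / p.sum()

def policy_gradient_step(w, e_hat, x, action, reward, gamma, lr):
    """One discrete-time step of Eq. (12); lr plays the role of dt / tau_w."""
    pi = softmax(w @ x)                    # action probabilities from y = w . x
    one_hot = np.eye(len(pi))[action]
    e = np.outer(one_hot - pi, x)          # instantaneous eligibility, same shape as w
    e_hat = gamma * e_hat + e              # exponential temporal filtering of e
    w = w + lr * reward * e_hat            # reward-gated weight update
    return w, e_hat

gamma = 0.9
trace_timescale = -1.0 / np.log(gamma)     # in time steps, about 9.5 for gamma = 0.9

# Usage: a rewarded action becomes more likely in the same state
w = np.zeros((4, 3))
e_hat = np.zeros_like(w)
x = np.array([1.0, 0.0, 0.5])
w, e_hat = policy_gradient_step(w, e_hat, x, action=2, reward=1.0, gamma=gamma, lr=0.1)
print(softmax(w @ x))
```

Because the filtered trace decays by a factor $\gamma$ per step, a reward arriving $k$ steps after an action still updates the weights with weight $\gamma^{k}$ , which is how the update assigns credit to earlier actions.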

To demonstrate that our approach performs the temporal computation required to consider future rewards, we compare the performances of our network (see Fig. A2.A-B) to a network without recurrent connections, which therefore cannot perform temporal computations. Recurrent connections enable a network to maintain a memory of past states and actions, effectively allowing it to use information from previous time steps to inform current decisions. Without these connections, a network operates purely on the current input without any contextual information from prior steps, thus lacking the ability to perform temporal computations (see Fig. A2.D-E).

Performance is higher in the first case. This can be visualized by looking at the policy at the end of the dynamic reinforcement learning for a specific target location. When the recurrent weights are set to zero, the policy points towards the target position only when the agent is near the target itself (see Fig. A2F), resulting in failure when the agent randomly moves in the wrong direction at the beginning of the task (see Fig. A2F, black line).

On the other hand, in the presence of recurrent weights, the proper policy (pointing towards the target, see Fig. A2C) is known even when far from the target, allowing optimal long-term planning (see Fig. A2C, black line).

Appendix D Additional details on the "dynamical learning of a temporal trajectory" experiment

In the "dynamical learning of a temporal trajectory" experiment we test our architecture and non-synaptic learning approach on temporal tasks. The primary goal is to analyze the network’s ability to generalize beyond the target frequencies used to pre-train our networks.

Table 1: Simulation Parameters

Parameter	Symbol	Reservoir	Gradient Net	Scalar Net
Network Size	$N$	20	500	500
Input Dimension	$I$	1	100 + 10	100+100
Apical Input Dimension	$I^{ap}$	0	1	100
Output Dimension	$O$	1	100	1
Time Step	$dt$	0.005
Reservoir Time Constant	$\tau_{m_{f}}$	10 $dt$	1 $dt$	1 $dt$
Input weights var	$\sigma_{input}$	0.06	0.06	0.06
Apical Input weights var	$\sigma_{input}^{ap}$	0.0	0.1	0.1
Recurrent weights	$\sigma_{rec}$	0.99 / $\sqrt{N}$	0.5 / $\sqrt{N}$	0. / $\sqrt{N}$
Gain-modulation factor	$\gamma_{net}$	0.	1.	1.

Three networks are defined in this experiment: one uses a reservoir to compute temporal features and encode the target temporal trajectory, one predicts the required weight updates, and one predicts the scalar product, as discussed in the main text. The parameters used in the simulation are summarized in Table 1.

Inputs are projected to the network through Gaussian weights with zero mean and variance $\sigma_{in}^{2}$ to distribute the input information across multiple units in the reservoir.

Data collection and pre-training

The readout weights of the first reservoir network are denoted by $R_{j}$ , and the readout at time $t$ is given by:

y(t)=\sum_{j}R_{j}z_{j}(t),

where $z_{j}(t)$ are the reservoir states.

Training and evaluation involve presenting the networks with five target trajectories $y^{targ}(t)=0.8\sin(\omega_{targ}t)$ , with angular velocities $\omega_{targ}$ ranging from $0.04$ to $0.08$ . The reservoir readout parameters are trained using online gradient descent to minimize the error between the current output and the target output. The loss function is defined as:

L=\frac{1}{2}\left(y(t)-y^{\text{target}}(t)\right)^{2}.

The gradient used for online training is:

\Delta R_{j}=-\eta\frac{\partial L}{\partial R_{j}}=-\eta\left(y(t)-y^{\text{target}}(t)\right)z_{j}(t),

where $\eta$ is the learning rate.
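The online update can be sketched in a few lines of NumPy. In this sketch the reservoir states $z_{j}(t)$ are replaced by a fixed bank of sines and cosines, which is a stand-in chosen for illustration and not the reservoir of Table 1; the learning rate, the number of steps, and the use of one update per integer time step are also assumptions.

```python
import numpy as np

n_steps = 5000
eta = 0.05                                   # learning rate (hypothetical value)
omegas = np.array([0.04, 0.05, 0.06, 0.07, 0.08])
omega_targ = 0.06

def features(t):
    # stand-in for the reservoir states z_j(t): a fixed bank of sines and cosines
    return np.concatenate([np.sin(omegas * t), np.cos(omegas * t)])

R = np.zeros(2 * len(omegas))                # readout weights R_j
sq_err = np.empty(n_steps)
for t in range(n_steps):
    z = features(t)
    y = R @ z                                # y(t) = sum_j R_j z_j(t)
    y_targ = 0.8 * np.sin(omega_targ * t)
    R -= eta * (y - y_targ) * z              # Delta R_j = -eta * (y - y_targ) * z_j
    sq_err[t] = (y - y_targ) ** 2

# Error of an untrained readout (R = 0) versus the error late in training
baseline_mse = np.mean((0.8 * np.sin(omega_targ * np.arange(n_steps))) ** 2)
late_mse = sq_err[-1000:].mean()
print(f"baseline: {baseline_mse:.3f}, late: {late_mse:.2e}")
```

During data collection, the quantities one would record at each step are the state `z`, the error `y - y_targ`, and the applied update `-eta * (y - y_targ) * z`.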

During data collection, trajectories of network states, errors, and weight updates are recorded. The gradient and scalar-product networks are trained on this data to predict the required gradient given a specific temporal error, and to estimate the current trajectory given the current virtual weights.

The networks are pre-trained by estimating only their readout weights, in accordance with a reservoir computing prescription. They are trained using a pseudo-inverse approach.

Dynamical supervised online learning and evaluation

Once the architecture is pre-trained, it can be tested on learning new trajectories that were not observed during pre-training. The readout weights of the reservoir are no longer changed online through gradient descent; instead, the virtual weights are updated following the prescription of the gradient network.

The virtual weights are then used as an input to the scalar-product network to compute the current output $y(t)$ .

Performance is evaluated by measuring the MSE between the target trajectory and the predicted one, $y(t)$ , for different values of the trajectory angular velocity, equally spaced between $0.01$ and $0.1$ .
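The closed loop between the two pre-trained networks can be sketched as follows. This is a simplified illustration, not the released code: both networks are taken in the linear gain-modulated form of Appendix A, they are pre-trained on random samples rather than on recorded learning trajectories, the reservoir features are replaced by a small bank of sines and cosines, and all sizes and rates are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N_in, N_y, N_w = 4, 40, 12
omegas = np.array([0.03, 0.09])               # hypothetical feature frequencies

def z_of(t):
    # stand-in for the reservoir features z(t)
    return np.concatenate([np.sin(omegas * t), np.cos(omegas * t)])

# Fixed random projections of the two gain-modulated networks
R_y, Rap_y = rng.normal(size=(N_y, N_in)), rng.normal(size=(N_y, N_in))
R_w, rap_w = rng.normal(size=(N_w, N_in)), rng.normal(size=N_w)

def scalar_hidden(z, w):
    return (w @ Rap_y.T) * (z @ R_y.T)        # (R_ap w) * (R z)

def gradient_hidden(z, err):
    return (np.asarray(err)[..., None] * rap_w) * (z @ R_w.T)   # (r_ap err) * (R z)

# Pre-training: fit the readouts only, then keep them frozen
n = 2000
z = rng.uniform(-1, 1, (n, N_in))
w = rng.uniform(-1, 1, (n, N_in))
err = rng.uniform(-1, 1, n)
Theta_y, *_ = np.linalg.lstsq(scalar_hidden(z, w), np.sum(z * w, axis=1), rcond=None)
Theta_w, *_ = np.linalg.lstsq(gradient_hidden(z, err), err[:, None] * z, rcond=None)

# Dynamical learning: only the virtual weights w change, no readout is modified
eta, n_steps, omega_targ = 0.1, 4000, 0.09
w = np.zeros(N_in)
sq_err = np.empty(n_steps)
for t in range(n_steps):
    z = z_of(t)
    y = scalar_hidden(z, w) @ Theta_y                   # scalar-product network output
    err = 0.8 * np.sin(omega_targ * t) - y              # y_targ - y
    w = w + eta * (gradient_hidden(z, err) @ Theta_w)   # update from the gradient network
    sq_err[t] = err ** 2

late_mse = sq_err[-1000:].mean()
print(f"late MSE: {late_mse:.2e}")
```

In this sketch the target frequency is one of the feature frequencies, so the target is exactly representable; evaluating over a range of `omega_targ` values would follow the same loop.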

Appendix E Additional details on the bandit experiment

Family of tasks

We consider a family of tasks $\mathcal{A}$ , such that every element $\alpha\in\mathcal{A}$ represents a Bernoulli $K$ -armed bandit problem with rewards $\{R_{i}^{\alpha}\}_{i=1}^{K}$ . Each task $\alpha$ is specified by a set $P_{\alpha}\subset\{1,\dots,K\}$ of positive arms, such that

\displaystyle R_{i}^{\alpha}=\begin{cases}Ber({p})&i\in{P_{\alpha}}\\ Ber({1-p})&i\notin{P_{\alpha}}\end{cases}

(13)

In the main text experiment, we used $K=10$ , $P_{ID}:=\{i\in[K]\,|\,i\equiv 0\ (\mathrm{mod}\ 2)\}$ , and $P_{OOD}:=\{i\in[K]\,|\,i\equiv 1\ (\mathrm{mod}\ 2)\}$ .
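The task family can be written down in a few lines. The sketch below assumes that arms outside $P_{\alpha}$ are rewarded with probability $1-p$ , and uses a hypothetical value $p=0.9$ since the value of $p$ is not stated here; the function name `pull` is also introduced only for illustration.

```python
import numpy as np

K, p = 10, 0.9                                   # p = 0.9 is a hypothetical value
arms = range(1, K + 1)                           # arms are indexed 1..K as in the text
P_ID = {i for i in arms if i % 2 == 0}           # even arms are positive (in distribution)
P_OOD = {i for i in arms if i % 2 == 1}          # odd arms are positive (out of distribution)

def pull(arm, positive_arms, rng):
    """Bernoulli reward: success probability p for positive arms, 1 - p otherwise."""
    prob = p if arm in positive_arms else 1.0 - p
    return int(rng.random() < prob)

rng = np.random.default_rng(0)
mean_positive = np.mean([pull(2, P_ID, rng) for _ in range(2000)])   # arm 2 is positive in ID
mean_negative = np.mean([pull(1, P_ID, rng) for _ in range(2000)])   # arm 1 is not
print(mean_positive, mean_negative)
```

The two sets are disjoint, so an arm that is positive during pre-training is never positive in the out-of-distribution tasks.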

Network details and hyperparameter search

Regret per round

Given an agent that plays in an environment $\alpha$ receiving rewards $\{r_{t}\}_{t\in\mathbb{N}_{+}}$ , the regret per round $\rho(T)$ achieved by the agent at round $T$ is defined as:

\displaystyle\rho(T)=\mu^{\alpha}_{\star}-\frac{1}{T}\sum_{t=1}^{T}r_{t}

(14)

where $\mu^{\alpha}_{\star}:=\max\{\mathbb{E}[R_{k}^{\alpha}]\,|\,k\in[K]\}$ is the maximum expected reward per round, obtained by the optimal policy that selects one of the best arms with probability $1$ .
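Eq. (14) translates directly into a running average. The helper below is a small sketch; the function name and the example values are illustrative and not taken from the experiments.

```python
import numpy as np

def regret_per_round(rewards, mu_star):
    """rho(T) = mu_star - (1/T) * sum_{t=1..T} r_t, returned for every T = 1..len(rewards)."""
    rewards = np.asarray(rewards, dtype=float)
    T = np.arange(1, len(rewards) + 1)
    return mu_star - np.cumsum(rewards) / T

# Usage with a hypothetical reward sequence and mu_star = 0.9
rho = regret_per_round([1, 0, 1, 1], mu_star=0.9)
print(rho)
```

Note that $\rho(T)$ is computed from sampled rewards, so it can be negative for small $T$ when the agent happens to collect more reward than the optimal expected value.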

Appendix F Additional details on the darkroom experiment

This experiment investigates the capability of our dynamical reinforcement learning approach to improve an agent’s ability to navigate a 2D environment. The agent is trained to locate a randomly placed, unobservable food source, with its policy evolving over multiple episodes.

The environment is a 2D grid where an agent starts at the center, aiming to reach a randomly positioned food item. The food position changes every 600 episodes. The agent’s movements are restricted within the grid, with actions limited to moving left, right, up, or down.

To represent the agent’s position, a place cell encoder converts the agent’s $(x,y)$ coordinates into a higher-dimensional feature vector using Gaussian functions. This encoding helps in effectively capturing spatial information.

The agent’s policy is linear, with action probabilities derived from a softmax function applied to the encoded state. For this reason, only two networks are used for the task: the gradient network and the scalar-product network (see Table 2 for details on parameters).
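The encoding and the policy can be sketched as follows. The 5 x 5 arrangement of place-cell centers is an assumption chosen to give 25 features, consistent with the 4 $\times$ 25 entries of Table 2; the normalization of positions to the unit square, the place-field width `sigma`, and the function names are hypothetical.

```python
import numpy as np

grid_size = 5                                    # assumed 5 x 5 grid of centers (25 features)
sigma = 0.2                                      # hypothetical place-field width
centers_1d = np.linspace(0.0, 1.0, grid_size)
centers = np.array([(cx, cy) for cx in centers_1d for cy in centers_1d])

def place_cells(pos):
    """Gaussian place-cell activations for a position (x, y) in the unit square."""
    d2 = np.sum((centers - np.asarray(pos)) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def policy(w, pos):
    """Linear-softmax policy over the 4 actions (left, right, up, down)."""
    logits = w @ place_cells(pos)
    logits = logits - logits.max()               # numerical stability
    p = np.exp(logits)
    return p / p.sum()

w = np.zeros((4, grid_size ** 2))                # 4 x 25 policy weights
probs = policy(w, (0.5, 0.5))
print(probs)
```

With zero weights every action has probability 0.25, so an untrained agent performs a random walk; learning changes `w` (or, in the dynamical setting, the virtual weights that play its role).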

Data collection and pre-training

The policy is updated using the policy gradient method, adjusting weights based on received rewards to improve decision-making over time.

Learning data are collected for 8 different positions of the food, equally distributed on a circle.

Two types of networks are used: one to model the agent’s state dynamics and another to handle gradient updates necessary for learning. These networks influence the agent’s internal state and learning process, enabling policy refinement.

The readouts of the two networks are trained by linear regression on their readout weights to reproduce the proper weight updates and scalar products.

Dynamical reinforcement learning and evaluation

The agent is tested again on the 8 positions used for pre-training, and on 8 new positions. As in the previous experiments, the policy gradient rule is no longer applied to synaptic weights; it is replaced by virtual weight updates estimated by the gradient network, and the virtual weights are used by the scalar-product network to predict the agent's policy.

Performance is measured by the total reward accumulated over episodes. Visualizing the agent’s trajectories reveals its navigation efficiency and decision-making process.

Table 2: Simulation Parameters

Parameter	Symbol	Gradient Net	Scalar Net
Network Size	$N$	500	1000
Input Dimension	$I$	25+4+10	4 $\times$ 25+4 $\times$ 25
Apical Input Dimension	$I^{ap}$	1	4 $\times$ 25
Output Dimension	$O$	4 $\times$ 25	4
Time Step	$dt$	0.005
Reservoir Time Constant	$\tau_{m_{f}}$	1 $dt$	1 $dt$
Input weights var	$\sigma_{input}$	0.01	0.01
Apical Input weights var	$\sigma_{input}^{ap}$	0.1	0.1
Recurrent weights	$\sigma_{rec}$	0.5 / $\sqrt{N}$	0. / $\sqrt{N}$
Gain-modulation factor	$\gamma_{net}$	1.	1.